How to Lag Variables in R and Create a Pairwise Dataset with Dyadic and Group Data

Kareena del Rosario

2024-01-19

In this R tutorial, we’re going to go over how to create lagged variables and restructure an individual dataset (i.e., one row per person) into a pairwise dataset (i.e., each row has that person’s scores and their partner’s scores, structured like a checkerboard).

# Load libraries
pkgs <- c("tidyverse",
          "haven")
lapply(pkgs, library, character.only = TRUE)

Lagging Variables in R

Creating lagged variables is very easy to do in R. First, let’s load and inspect the data.

# We're going to load a dataset called "restructure_data.sav"
# Make sure your R script and data are saved in the same folder OR enter the full file path below.
df <- read_sav("restructure_data.sav")

# Let's check out our variables
colnames(df)
##  [1] "Dyad"          "id"            "partner"       "Condition"    
##  [5] "Actor_Cond"    "Observer_Cond" "gender"        "Age"          
##  [9] "BMI"           "Time"          "React_PEP"     "React_IBI"    
## [13] "genderR"

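To create lagged variables, we first group by any identifiers. Here, we’re grouping by participant ID (id). If participants were in multiple dyads/groups, you would also want to group by the group ID. Because we are lagging the PEP and IBI variables across time points, do not group by Time. Note that lag() shifts values by row position, so the rows need to be in time order within each participant.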

The code below can be modified to lag several variables at once. If you have multiple physiological measures whose names all start with “Mean”, you would specify that the variables to lag start with “Mean.” Furthermore, if you wish to lag your variables by more than 1 time point, change the “n =” argument in the lag function. Alternatively, if you need to shift a variable forward by a time point, replace “lag” with “lead.”

lag_df <- df %>% 
  group_by(id) %>% 
  mutate(across(starts_with(c("React_PEP", "React_IBI")), 
                ~ lag(.x, n = 1), .names = "lag_{col}")) %>% 
  ungroup()

# Preview new variable with non-lagged variable and time. 
# Now we can see that lag_React_PEP at Time 3 is the same as React_PEP at Time 2

lag_df %>% 
  select(id, Time, React_PEP, lag_React_PEP)
## # A tibble: 10,272 × 4
##       id Time                                 React_PEP lag_React_PEP
##    <dbl> <dbl+lbl>                                <dbl>         <dbl>
##  1  1001  1 [recall task prep 1 (30 seconds)]      0            NA   
##  2  1001  2 [recall task prep 2 (30 seconds)]     -9             0   
##  3  1001  3 [recall task prep 3 (30 seconds)]     -4            -9   
##  4  1001  4 [recall task prep 4 (30 seconds)]     -3.20         -4   
##  5  1001  5 [recall task 1 (30 seconds)]          -2            -3.20
##  6  1001  6 [recall task 2 (30 seconds)]          -8            -2   
##  7  1001  7 [recall task 3 (30 seconds)]          -3            -8   
##  8  1001  8 [recall task 4 (30 seconds)]          -2            -3   
##  9  1001  9 [recall task 5 (30 seconds)]          -1            -2   
## 10  1001 10 [recall task 6 (30 seconds)]          -3.20         -1   
## # ℹ 10,262 more rows
colnames(lag_df)
##  [1] "Dyad"          "id"            "partner"       "Condition"    
##  [5] "Actor_Cond"    "Observer_Cond" "gender"        "Age"          
##  [9] "BMI"           "Time"          "React_PEP"     "React_IBI"    
## [13] "genderR"       "lag_React_PEP" "lag_React_IBI"
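
The variations described above (matching on a shared name prefix, lagging by more than one time point, and leading instead of lagging) can be sketched on a small made-up dataset. The data frame `toy` and its `Mean_IBI` and `Mean_PEP` columns are hypothetical and are not part of restructure_data.sav. The rows are sorted by id and Time first, since lag() and lead() follow row order.

```r
library(dplyr)

# Hypothetical toy data: two participants, four time points each
toy <- data.frame(
  id       = rep(c(1, 2), each = 4),
  Time     = rep(1:4, times = 2),
  Mean_IBI = c(10, 20, 30, 40, 50, 60, 70, 80),
  Mean_PEP = c(1, 2, 3, 4, 5, 6, 7, 8)
)

toy_shifted <- toy %>%
  arrange(id, Time) %>%   # lag()/lead() follow row order, so sort first
  group_by(id) %>%        # shift within each participant only
  mutate(
    # value from two time points earlier
    across(starts_with("Mean"), ~ lag(.x, n = 2), .names = "lag2_{.col}"),
    # value from the next time point
    across(starts_with("Mean"), ~ lead(.x, n = 1), .names = "lead_{.col}")
  ) %>%
  ungroup()

toy_shifted
```

Within each participant, the first two rows of the lag2_ columns and the last row of the lead_ columns are NA, because there is no earlier or later observation to borrow from.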

Pairwise Dyadic Dataset

We have a dyadic dataset that is structured as an individual file, meaning that each person has their own row. To look at influence, we’ll need to reshape this to be pairwise. A pairwise file has both the person’s score AND their partner’s score on the same line.

Use the code below to restructure an individual file into a pairwise file. This code copies the dataframe into an “actor” dataframe and a “partner” dataframe. The variables of interest (reactivity scores) are given the suffix “_actor” or “_partner.” In the partner dataframe, we also rename the id to partner_id.

Then, we join these two dataframes, matching them by Dyad and Time. Because the join matches on these two variables only, each person is paired with their partner on one row, but also with their own scores (listed as their own partner) on another. To get rid of those rows, we filter out any row in which the “id” matches the “partner_id”. After filtering, each dyad has two rows per time point. For Dyad 100, that means one row where Participant 1 is the actor and another where Participant 1 is the partner.

Note that we are renaming each person’s own scores to include “_actor.” This step is for demonstration purposes to make the process super clear. You could skip it by deleting the rename_with() line when creating actor_df.

library(dplyr)
library(tidyr)

# Separating data for actor and partner
actor_df <- lag_df %>%
  select(Dyad, id, Time, matches("React_")) %>% 
  rename_with(~ paste0(., "_actor"), c(React_PEP, React_IBI, lag_React_PEP, lag_React_IBI))


partner_df <- lag_df %>%
  select(Dyad, id, Time, matches("React_")) %>%
  rename_with(~ paste0(., "_partner"), c(React_PEP, React_IBI, lag_React_PEP, lag_React_IBI)) %>% 
  rename(partner_id = id)

# Merging actor and partner data
pairwise_df <- actor_df %>%
  left_join(partner_df, by = c("Dyad", "Time")) %>%
  filter(id != partner_id)

# inspect the data
head(pairwise_df)
## # A tibble: 6 × 12
##    Dyad    id Time           React_PEP_actor React_IBI_actor lag_React_PEP_actor
##   <dbl> <dbl> <dbl+lbl>                <dbl>           <dbl>               <dbl>
## 1   100  1001 1 [recall tas…            0              -21.3               NA   
## 2   100  1001 2 [recall tas…           -9              -59.4                0   
## 3   100  1001 3 [recall tas…           -4               57.6               -9   
## 4   100  1001 4 [recall tas…           -3.20            17.4               -4   
## 5   100  1001 5 [recall tas…           -2              -65.2               -3.20
## 6   100  1001 6 [recall tas…           -8              -26.3               -2   
## # ℹ 6 more variables: lag_React_IBI_actor <dbl>, partner_id <dbl>,
## #   React_PEP_partner <dbl>, React_IBI_partner <dbl>,
## #   lag_React_PEP_partner <dbl>, lag_React_IBI_partner <dbl>
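
To make the self-pairing step concrete, here is a minimal sketch on made-up data with one dyad and two time points. The values and the second participant ID (1002) are hypothetical. The join is intentionally many-to-many, and recent dplyr versions warn about that, so the warning is suppressed here.

```r
library(dplyr)

# Hypothetical toy data: one dyad (100), two people, two time points
toy <- data.frame(
  Dyad      = 100,
  id        = c(1001, 1001, 1002, 1002),
  Time      = c(1, 2, 1, 2),
  React_IBI = c(-21, -59, 5, 12)
)

actor   <- toy %>% rename(React_IBI_actor = React_IBI)
partner <- toy %>% rename(partner_id = id, React_IBI_partner = React_IBI)

# Each person is matched to everyone in the dyad at that time point,
# themselves included, so the join returns more rows than we want.
joined <- suppressWarnings(left_join(actor, partner, by = c("Dyad", "Time")))
nrow(joined)        # 8 rows: 2 people x 2 matches x 2 time points

# Drop the rows where a person is listed as their own partner
toy_pairwise <- joined %>% filter(id != partner_id)
nrow(toy_pairwise)  # 4 rows: 2 people x 2 time points
toy_pairwise
```

A quick check after running the real code is to confirm that the pairwise file has the same number of rows as the individual file, which should hold when every dyad has exactly two members at every time point.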

Pairwise Group Dataset

Here, we have a dataset we’re calling “pairwise_demo.” There are 16 groups of 4 roles (S1: Lead Surgeon, S2: Junior Surgeon, SN: Scrub Nurse, A1: Anesthetist). Just as with the dyad file, we want to create a pairwise file. The challenge here is that there is more than one type of dyad. In other words, instead of Partner 1 and Partner 2 / Partner 2 and Partner 1, we have four group members and we want to represent every possible pairing in the file. Below are the dyads that will need to be represented (12 pairings total):

S1 and S2
S2 and S1
S1 and SN
SN and S1
S2 and SN
SN and S2
S1 and A1
A1 and S1 
A1 and S2
S2 and A1
A1 and SN
SN and A1

Given the complexity, we’ll demonstrate how to do this in R, which is the most flexible in terms of data manipulation. To start, we have an individual file, in which every group member is represented in their own row.

# Load the data
pairwise_demo <- read.csv("group_data.csv")

pairwise_demo <- pairwise_demo %>% 
  filter(event != "Arrival")

The first thing we’ll want to do is duplicate the individual file into two matching files that represent the Actor and Partner. These files are identical except that every variable in the Actor file is followed by the suffix “_a” and every variable in the Partner file has the suffix “_p.” Note that shared variables (e.g., time, case number) are not given a suffix.

# Define function to add suffix to dataframe columns
add_suffix <- function(df, suffix) {
  df %>% rename_with(~paste0(., suffix), -c("task_time", "time", "event", "case"))
}

# Create duplicate dataframes with unique suffixes for actor and partner
physio_actor <- pairwise_demo %>% add_suffix("_a")
physio_partner <- pairwise_demo %>% add_suffix("_p")
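
As a quick illustration of which columns receive a suffix, here is a sketch of the same idea on a made-up data frame. The function `add_suffix_demo`, its `shared` argument, and the toy values are assumptions for this example. It uses all_of() to name the shared columns rather than the tutorial's exact function.

```r
library(dplyr)

# Hypothetical variant of add_suffix(): shared columns are passed in as an argument
add_suffix_demo <- function(df, suffix,
                            shared = c("task_time", "time", "event", "case")) {
  df %>% rename_with(~ paste0(.x, suffix), -all_of(shared))
}

# Hypothetical toy data with the shared columns plus three person-level columns
toy <- data.frame(
  case      = 1,
  time      = 1:2,
  task_time = c(0, 30),
  event     = "Incision",
  pid       = 3,
  role      = "S1",
  MeanIBI   = c(777, 787)
)

toy_a <- add_suffix_demo(toy, "_a")
names(toy_a)
# "case" "time" "task_time" "event" "pid_a" "role_a" "MeanIBI_a"
```

The person-level columns (pid, role, MeanIBI) pick up the suffix, which is why later steps refer to role_a, role_p, pid_a, and pid_p.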

Next, we’ll create vectors that represent every possible pairing. At this point, role_a and role_p are identical because every role is both an actor and a partner, and case represents the groups in the study.

# Generate all dyad pairings
role_a <- c("S1", "S2", "SN", "A1") # actor
role_p <- role_a # partner (same as actor)
case <- c(1:12, 14:17) # cases (no case 13)

Now we generate every possible pairing of the vectors we created.

# Create all combinations and filter for valid dyads
naming_df <- expand.grid(role_a = role_a, role_p = role_p, case = case) %>%
  filter(role_a != role_p) %>%
  mutate(case = as.double(case)) %>%
  arrange(role_a, case)

head(naming_df)
##   role_a role_p case
## 1     S1     S2    1
## 2     S1     SN    1
## 3     S1     A1    1
## 4     S1     S2    2
## 5     S1     SN    2
## 6     S1     A1    2
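
A quick sanity check on the pairing step is to count rows. With 4 roles there are 4 x 3 = 12 ordered pairings per case, and with 16 cases that gives 192 rows. The sketch below rebuilds the grid in base R under hypothetical names (`roles`, `cases`, `pairs`) so it can be run on its own.

```r
# Rebuild the grid of ordered role pairings (names here are for illustration only)
roles <- c("S1", "S2", "SN", "A1")
cases <- c(1:12, 14:17)

pairs <- expand.grid(role_a = roles, role_p = roles, case = cases,
                     stringsAsFactors = FALSE)

# Remove rows where a role is paired with itself
pairs <- pairs[pairs$role_a != pairs$role_p, ]

nrow(pairs)        # 192 = 12 ordered pairings x 16 cases
table(pairs$case)  # each case should show 12
```

If the count per case is anything other than 12, a role label is probably misspelled or duplicated.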

At this point, we have the IDs of every dyad in our study. We join that ID dataframe with the physio data.

# Join with actor and partner data
naming_df1 <- naming_df %>%
  left_join(physio_actor, by = c("role_a", "case")) %>%
  left_join(physio_partner, by = c("role_p", "case")) %>%
  select(role_a, role_p, case, pid_a, pid_p) %>%
  drop_na() %>% 
  distinct()

Because we are creating dyads, we’ll need to assign unique Dyad IDs after the data has been restructured. Both orderings of a pair (e.g., S1 with S2 and S2 with S1) should share the same Dyad ID.

# Maximum participant ID across actors and partners (not used in the steps below)
max_pid <- max(naming_df1$pid_a, naming_df1$pid_p, na.rm = TRUE)

# Create a unique dyad ID
naming_id_df <- naming_df1 %>% mutate(
    ordered_pair = paste(pmin(pid_a, pid_p), pmax(pid_a, pid_p), sep = "-"),
    dyad_case = paste(ordered_pair, case, sep = "_")
  ) %>%
  mutate(
    dyad_id = as.numeric(as.factor(dyad_case))
  ) %>% 
  select(-c(ordered_pair, dyad_case))

# Inspect the dyad IDs: each dyad_id should appear twice within a case (once for each ordering of role_a and role_p)
naming_id_df %>% 
  arrange(case) %>% 
  head()
##   role_a role_p case pid_a pid_p dyad_id
## 1     S1     S2    1     3     4      43
## 2     S1     SN    1     3     1       5
## 3     S1     A1    1     3     5      45
## 4     S2     S1    1     4     3      43
## 5     S2     SN    1     4     1       7
## 6     S2     A1    1     4     5      56
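
The reason both orderings of a pair end up with the same dyad_id is that pmin() and pmax() always put the smaller participant ID first, so the label is identical whichever person is the actor. A minimal sketch with made-up IDs (the data frame `toy_ids` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data: two dyads in case 1, each listed in both orders
toy_ids <- data.frame(
  case  = 1,
  pid_a = c(3, 4, 3, 1),
  pid_p = c(4, 3, 1, 3)
)

toy_ids <- toy_ids %>%
  mutate(
    # Smaller ID first, larger ID second, regardless of who is the actor
    ordered_pair = paste(pmin(pid_a, pid_p), pmax(pid_a, pid_p), sep = "-"),
    # Turn each unique pair-by-case label into a number
    dyad_id = as.numeric(as.factor(paste(ordered_pair, case, sep = "_")))
  )

toy_ids
count(toy_ids, dyad_id)  # each dyad_id should appear exactly twice
```

The numbers themselves come from the sort order of the text labels, so they are arbitrary identifiers rather than a meaningful ranking, which is why the real output shows values such as 43, 5, and 45 within a single case.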

Finally, we are done setting up the IDs and data structure. The last thing we need to do is merge this new file with our existing data. Note that we are creating a new variable called “obs_id” which is a unique identifier that represents each pair of observations from each dyad at every time point.

# Join and finalize pairwise data
final_pairwise <- naming_id_df %>%
  right_join(physio_actor, by = c("role_a", "case" , "pid_a")) %>%
  group_by(case) %>%
  mutate(obs_id = time + ((max(time)) * (dyad_id - 1))) %>% # create obs_id
  ungroup() %>%
  right_join(physio_partner, by = c("role_p", "case", "pid_p", "time", "task_time", "event")) %>% 
  filter(!is.na(pid_a) & !is.na(pid_p)) # get rid of any non-dyads


# preview structure of data
final_pairwise %>% 
  filter((time == 58 | time == 59), case == 1) %>% 
  arrange(dyad_id, pid_a, time) %>% 
  select(case, dyad_id, pid_a, role_a, role_p, time, MeanIBI_a, lag_ibi_a, MeanIBI_p, lag_ibi_p) %>% 
  head()
## # A tibble: 6 × 10
##    case dyad_id pid_a role_a role_p  time MeanIBI_a lag_ibi_a MeanIBI_p
##   <dbl>   <dbl> <int> <chr>  <chr>  <int>     <dbl>     <dbl>     <dbl>
## 1     1       5     1 SN     S1        58      627.      630.      777.
## 2     1       5     1 SN     S1        59      633.      627.      787.
## 3     1       5     3 S1     SN        58      777.      824.      627.
## 4     1       5     3 S1     SN        59      787.      777.      633.
## 5     1       7     1 SN     S2        58      627.      630.      505.
## 6     1       7     1 SN     S2        59      633.      627.      505.
## # ℹ 1 more variable: lag_ibi_p <dbl>
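
The obs_id formula can be checked by hand. The sketch below wraps the same arithmetic in a small helper. The function name `obs_id_demo` and the value of 60 time points per case are assumptions for illustration, not values taken from the data.

```r
# Same arithmetic as in the mutate() call above, as a standalone helper
obs_id_demo <- function(time, dyad_id, max_time) {
  time + max_time * (dyad_id - 1)
}

# Hypothetical case with 60 time points
obs_id_demo(58, dyad_id = 5, max_time = 60)  # 58 + 60 * 4 = 298
obs_id_demo(59, dyad_id = 5, max_time = 60)  # 299
obs_id_demo(58, dyad_id = 7, max_time = 60)  # 58 + 60 * 6 = 418

# Within a case, each dyad gets its own block of max_time consecutive values
range(obs_id_demo(1:60, dyad_id = 5, max_time = 60))  # 241 300
range(obs_id_demo(1:60, dyad_id = 6, max_time = 60))  # 301 360
```

Both rows of a dyad at a given time point share one obs_id, which is the intended behavior. One thing worth checking in your own data: because max(time) is computed within each case, cases of different lengths could in principle produce overlapping values (for example, time 1 of dyad 2 in a 60-point case and time 1 of dyad 3 in a 30-point case both give 61). If that matters for your analysis, treating obs_id together with case, or using a single maximum across all cases, would avoid it.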