1 Getting acquainted with for loops

1.1 What is a loop?

For loop: Run the same code multiple times, for different values of a variable.

Loop = the whole thing, one loop of many iterations.

Iteration = one run through of the loop.


1.2 Structure of for loops in R

for ( index_variable in all_values ) {

  # CODE BLOCK goes here, within the curly braces { }
  # Write the code that you want to run in one iteration,
  # using whatever the current value of index_variable is.
  # index_variable takes on each value of all_values in turn.

  code code code
  more code
  code code

}


1.3 Example

for (i in 1:5) {
  # i is the index variable
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
  • The index variable “i” takes on the values 1 to 5 in turn (1, 2, 3, 4, and then 5), for five total iterations of the loop.
  • For each iteration, R executes the code inside the loop. The variable “i” is assigned a different value each iteration.

For the first iteration, i = 1:

i = 1 
print(1)
## [1] 1

Second iteration, i=2:

i=2
print(2)
## [1] 2

…until the fifth and last iteration, i=5:

i=5
print(5)
## [1] 5

1.4 Another example

  • instead of just numbers, the index variable can take on characters as well!
  • notice how this loop uses cat() instead of print()
all_my_favourite_things = c("bikes", "coffee", "brains")

for( one_thing in all_my_favourite_things ) {
  cat("\nI love", one_thing)
}
## 
## I love bikes
## I love coffee
## I love brains
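
To see why cat() is handy for messages like this, here is a small comparison. It is only a sketch, reusing the name one_thing from the loop above: print() displays one R object (with the [1] index and quotation marks), while cat() pastes its arguments together as plain text and only starts a new line where you write "\n".

```r
one_thing = "bikes"

# print() displays a single object, with the [1] index and quotation marks
print(one_thing)
## [1] "bikes"

# cat() joins its arguments with a space and prints them as plain text;
# it does not add a line break unless you include "\n" yourself
cat("I love", one_thing, "\n")
## I love bikes

# capture.output() lets us store exactly what each function prints
printed = capture.output(print(one_thing))
catted  = capture.output(cat("I love", one_thing, "\n"))
```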

1.5 …and even fancier

  • We can also get a bit fancier with conditional statements to have more flexibility
  • For example, if we want to print out something different on the last iteration of the loop (i.e. when one_thing == “brains”)…
for( one_thing in all_my_favourite_things ) {

  # the last element of the vector gets a different message
  if ( one_thing == all_my_favourite_things[ length(all_my_favourite_things) ] ) {
    cat("\nFinally, I also love", one_thing)
  } else {
    cat("\nI love", one_thing)
  }
}
## 
## I love bikes
## I love coffee
## Finally, I also love brains

1.6 Beyond just printing the output: example of saving each iteration’s output into another vector

  • Sometimes when you run the loop, you don’t want to just print the output of each iteration. You want to save it somehow, before it gets overwritten by the next iteration.
  • We can do this by appending the results of each iteration, to the end of a vector that stores all the outputs.
  • Then, when your loop ends, instead of having only the last iteration’s values stored in your environment, you will be able to save values from ALL iterations.
# we have to initialize an empty vector, so it exists in the environment. Otherwise, R will not know where to store the output of the iteration! 
parent_vector = NULL  

all_values = c(10, 12, 28, 34)
for (i in all_values){

  new_value = i*2   # run the operation on the current value of i 
  
  parent_vector = c(parent_vector, new_value)  # add the new value to the end of the parent_vector

  cat("\nThis is the parent vector:", parent_vector) # print something out to yourself 
}
## 
## This is the parent vector: 20
## This is the parent vector: 20 24
## This is the parent vector: 20 24 56
## This is the parent vector: 20 24 56 68

1.7 When to use a loop?

When you need to run the same code, on many different people/samples/conditions/etc., and you find yourself copy-pasting the same code over and over again.


1.8 When NOT to use a loop

Loops can take a long time to run, and many people caution against them. If you have a massive data set (1000’s of observations of 100’s of variables), you will want to find ways to do things using “vectorized” code (operations that work on a whole vector at once) or the apply family (sapply(), lapply(), etc.), which loops for you in more compact code.

However, I use loops as needed. They are human-readable and logical to me, and I’m okay waiting ~2 min if needed for my loop to run.

“If the loop isn’t the bottleneck, it’s almost always more readable that way to me, so I do it.” - Cory on Stack Overflow
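
To make the alternatives mentioned above concrete, here is a minimal sketch that doubles the same all_values vector from section 1.6 in three ways. The names doubled_loop, doubled_sapply, and doubled_vec are made up for this illustration.

```r
all_values = c(10, 12, 28, 34)

# 1) for loop: append each doubled value to a parent vector
doubled_loop = NULL
for (i in all_values) {
  doubled_loop = c(doubled_loop, i * 2)
}

# 2) sapply(): apply a function to each element and collect the results
doubled_sapply = sapply(all_values, function(x) x * 2)

# 3) vectorized arithmetic: R multiplies every element at once
doubled_vec = all_values * 2

# all three approaches give the same answer: 20 24 56 68
identical(doubled_loop, doubled_vec)
## [1] TRUE
identical(doubled_sapply, doubled_vec)
## [1] TRUE
```

For a vector of four numbers the speed difference is not noticeable, so pick whichever version you find easiest to read.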


2 Rules of Thumb with for loops

Before putting any code in the loop, make sure you are looping through your variable correctly. You can do this by first just printing out the value of each iteration, with no other code: i.e. for(i in 1:5) { print(i) }


When building, test the internal code on just one iteration, without running the whole loop. You can do this by assigning i=1 in your environment, and running all the code within the loop line-by-line on i=1. If this works, then you can let the loop go on all values of i (i.e. i=1:5)!
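
As a rough sketch of that workflow, using the all_values vector from section 1.6 (the loop body here is just a stand-in for your own code):

```r
all_values = c(10, 12, 28, 34)

# Step 1: pretend we are in the first iteration by setting i by hand
i = all_values[1]

# Step 2: run the body of the loop line by line and check the result
new_value = i * 2
print(new_value)
## [1] 20

# Step 3: once that works, wrap the same code in the loop
for (i in all_values) {
  new_value = i * 2
  print(new_value)
}
```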


Within the loop, have code that prints out messages to yourself. This is so you know where the loop is at while it’s going. If it hits an error, these messages can also help you notice which step is causing the error.


Add in some conditionals within the loop as error detection. For example, if you know something should have length 5, add an if statement that checks for length != 5 and then calls break (which stops the whole loop) or next (which moves on to the next iteration).
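
A minimal sketch of this kind of check. The list all_samples and its contents are invented for the example; the point is only the if / next pattern.

```r
# Hypothetical data: each list element should be a vector of length 5
all_samples = list(a = 1:5, b = 1:3, c = 6:10)

sample_means = NULL
for (sample_name in names(all_samples)) {
  current = all_samples[[sample_name]]

  # Error detection: skip anything that is not the expected length
  if (length(current) != 5) {
    cat("\nSkipping", sample_name, "- expected length 5, got", length(current))
    next   # jump to the next iteration (use break to stop the whole loop)
  }

  sample_means = c(sample_means, mean(current))
}

sample_means
## [1] 3 8
```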


Notice how when you run the loop, things will save to your environment… with each iteration it overwrites these values, so what you see in your environment after the loop has finished is only the LAST iteration.


How do you save things within the loop? CONCATENATE WITH A PARENT VECTOR/DATA FRAME so you can save the results from each iteration.
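
Section 1.6 showed this with a parent vector. The same idea with a parent data frame might look like the sketch below, where parent_df and iteration_df are names chosen just for this illustration.

```r
all_values = c(10, 12, 28, 34)

parent_df = NULL   # initialize an empty parent, as with parent_vector

for (i in all_values) {
  # one-row data frame holding this iteration's input and result
  iteration_df = data.frame(value = i, doubled = i * 2)

  # rbind() stacks the new row underneath the rows saved so far
  parent_df = rbind(parent_df, iteration_df)
}

parent_df   # four rows, one per iteration
```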


You can SAVE files/plots to a directory within each loop, using commands such as write.csv() or ggsave(). Remember, you’ll have to create a file_name variable so you can save it with an informative name for each iteration…
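
A small sketch of building a file_name inside the loop. The subject IDs, the stand-in data, and the "_hr.csv" suffix are all assumptions for the example, and the files go to a temporary folder so nothing on your computer is overwritten.

```r
# Assumption: write to a temporary folder; swap in your own directory
out_dir = tempdir()

all_subs = c("sub_1", "sub_2", "sub_3")   # hypothetical subject IDs

for (sub in all_subs) {
  # stand-in for the real per-subject result
  sub_df = data.frame(sub_num = sub, time = 1:3, hr = c(60, 62, 61))

  # build an informative file name that changes each iteration
  file_name = file.path(out_dir, paste0(sub, "_hr.csv"))

  write.csv(sub_df, file_name, row.names = FALSE)
  cat("\nSaved", file_name)
}
```

Because the subject ID is part of file_name, each iteration writes its own file instead of overwriting the previous one.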


3 Practice!


3.1 Practice 1:

Loop through the following character vector, calculate the number of characters in each word, and print out the following sentence: "There are __ characters in the word __."

all_words = c("bikes", "biology", "coffee", "serendipity")


3.2 Practice 2:

Loop through the following numeric vector. If it is an even number, store it in an “even number” parent vector. If it is an odd number, store it in an “odd number” parent vector. If it is NA, move on to the next iteration. ONLY print a message to yourself showing you what iteration you’re on, but nothing else.

all_nums = c(10,12,28,34,NA,NA,11,11)


4 Practical example: live data wrangling across many subjects with loops

  • In the file heart_rate_data, you have HR data for 10 subjects, each subject stored under their own directory (i.e. heart_rate_data/sub).
xfun::embed_dir('heart_rate_data', text = 'Download heart rate data')
Download heart rate data
  • The heart rate is recorded during a 4-minute long intravenous infusion of a drug.

  • Each subject has 6 total infusions: 2 trials each of 3 different doses of the drug (Saline, 0.5mcg, 2.0mcg).

  • Example HR curves for each dose:

  • The order of these doses is randomized across subjects. There is a reference sheet that tells you which trial 1-6 is which dose, for each subject.

  • Keep in mind when loading into R: these csvs have NO header (header=F). The columns are separated by whitespace, so use sep="". They have six columns; HR is column 2. The data were recorded at 40 Hz and we want to downsample to 1 Hz, so we will pick out every 40th row, giving one row for each of seconds 1-240.
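
To try these loading settings before touching the real data, here is a sketch that fakes one 40 Hz recording, writes it to a temporary file, and reads it back. The random numbers and the names fake, fake_file, and one_trial are assumptions for the illustration only; a real file would come from the subject's folder.

```r
# Assumption: fake one recording so the sketch runs on its own
n_rows = 240 * 40                       # 240 seconds at 40 Hz = 9600 rows
fake = matrix(round(runif(n_rows * 6, 50, 90), 1), ncol = 6)
fake_file = tempfile(fileext = ".csv")
write.table(fake, fake_file, row.names = FALSE, col.names = FALSE)

# No header, whitespace-separated (sep = ""), so columns are named V1..V6
one_trial = read.csv(fake_file, header = FALSE, sep = "")

# Keep every 40th row, starting at row 1: 40 Hz -> 1 Hz
keep_rows = seq(from = 1, to = nrow(one_trial), by = 40)
one_trial_1hz = data.frame(time = 1:240, hr = one_trial$V2[keep_rows])

dim(one_trial_1hz)
## [1] 240   2
```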

4.1 Your mission:

Wrangle HR data across all subjects into one R data frame with dose information. Save as an “RDS”. By the end of it you should have one data frame with the following columns: subject, trial 1-6, dose, time 1-240 seconds, and HR. 10 sub x 6 trials x 240 rows per trial = 14400 rows.

Here is the pseudocode for your mission:


1. Load the reference csv, that will tell you which trial is which dose.

xfun::embed_file('trial_dose_reference.csv', text = "Download reference csv")
Download reference csv
ref_df = read.csv("trial_dose_reference.csv")

2. Make a variable (vector) of all subject ID’s (you’ll need this to loop through).

all_subs = c("sub_1", "sub_2", "sub_3", "sub_4", "sub_5", "sub_6", "sub_7", "sub_8", "sub_9", "sub_10")

3. Loop through this vector. Within your loop, do the following for each subject.

  • Change to that subject’s directory (heart_rate_data/subject)

  • Load each subject’s data into R, for all trials 1-6, and clean as you go!

  • Match up which trial is which condition.

  • Save subject data into a big parent df.

  • Average across both trials for that subject, for that condition

  • Make AND SAVE a plot of this average for all three conditions for each subject

  • Save that subject into one big parent data frame.

  • After the loop finishes, save that parent data frame as an “RDS” for easy future use. An RDS file stores a single R object, such as a data frame (like a csv, but it takes up less space and keeps the column types).

Remember, test the loop out with one subject first! Assign sub="sub_1" in your environment, and test all commands. Then you can let the loop go for(sub in all_subs)


4. Make a group average plot showing the HR response for each dose, if you want.
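
One possible way to do the group average, sketched with base R only. The data frame hr_demo is a tiny made-up stand-in for the wrangled data (2 subjects, 3 seconds, invented HR values), and the pdf file name is hypothetical.

```r
# Assumption: a tiny stand-in for the wrangled data frame
hr_demo = expand.grid(sub_num = c("sub_1", "sub_2"),
                      dose = c("Saline", "0.5mcg", "2.0mcg"),
                      time = 1:3,
                      stringsAsFactors = FALSE)
hr_demo$hr = 60 + 5 * (hr_demo$dose == "2.0mcg") * hr_demo$time

# Average HR across subjects for every dose and time point
group_avg = aggregate(hr ~ dose + time, data = hr_demo, FUN = mean)

# Draw one line per dose and save the plot to a (temporary) pdf file
plot_file = file.path(tempdir(), "group_average_hr.pdf")
pdf(plot_file)
plot(group_avg$time, group_avg$hr, type = "n",
     xlab = "Time (s)", ylab = "Mean HR")
doses = unique(group_avg$dose)
for (d in seq_along(doses)) {
  one_dose = group_avg[group_avg$dose == doses[d], ]
  lines(one_dose$time, one_dose$hr, col = d)
}
legend("topleft", legend = doses, col = seq_along(doses), lty = 1)
dev.off()
```

With the real data you would replace hr_demo with your parent data frame; ggplot2 and ggsave() would work just as well for the plotting step.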


Example code:

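library(dplyr)  # provides bind_rows(), select(), filter(), left_join(), and %>%

hr.indexes = seq(from=1, to=57600, by=40)  # data is in 40Hz, we want to downsample to 1Hz (6 trials x 240 s x 40 rows per second = 57600 rows per subject)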

# Change to the heart_rate_data directory, so we can easily change into and out of each subject's folder
setwd("/Users/eadamic/Dropbox (LIBR)/My PC (EmilysYoga)/Desktop/BIOL7263_Seminar_DataReproducibleAnalyses/AdamicBIOL7263/MyLesson/heart_rate_data")

hr.df = NULL # initialize empty parent df

for( sub in all_subs ) {
  
  # Print a message to ourselves, so we know which subject we're on in our loop 
  cat("\nWorking on", sub)
  
  # Change to sub dir so R knows where to find the files
  setwd(sub)

  # Make a list of all files in that subject's directory 
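  all_files = list.files(pattern="\\.csv$")  # pattern is a regular expression: file names ending in .csv
  
  # Read all of them into R (list.files() returns names in alphabetical order, so trial numbers follow that order)
  sub.df = do.call(bind_rows, c( lapply(all_files, read.csv, header=FALSE, sep=""), .id="trial" ) )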
  
    # working inside out: 
    # lapply() will "apply" the function to all elements... so it will read.csv to every element in all_files, and put them all together into a list. 
    # do.call applies the function across all elements of that list made by lapply. So, it will apply bind_rows(), which puts all the data frame's (stored as list elements) together into one data frame.  
    # .id is an argument we want to give to bind_rows... but because it's part of do.call, you have to put the "c()" around the lapply() and .id.
    # .id = trial will create a new variable called "trial", that will be the number of the list element... i.e. 1 for the first csv we're binding, 2 for the second, etc. 
  
    # Multiple ways to skin the cat: I have done this in the past using ANOTHER for loop, looping through runs 1-6 and loading each run individually.
    # The lapply() version above is more compact and avoids growing a data frame one file at a time... so this is likely the better of the two options, especially if we have a lot of subjects. 
  
  sub.df$trial = as.numeric(sub.df$trial) # change trial to a numeric 
  sub.df = sub.df[hr.indexes,] # reduce to just 1Hz
  sub.df$time = rep(1:240,6) # make a time column for seconds 1-240, 6 times
  sub.df$sub_num = sub # make a subject column so we can bind to the parent 
  sub.df = sub.df %>% dplyr::select(sub_num, trial, time, hr = V2) # select cols we want, rename to hr 
  
  # Match up which trial is which dose. We'll get this from ref_df, keeping just the trial and dose columns for this subject so we can easily left_join. 
  dose_order = ref_df %>% filter(sub_num == sub) %>% dplyr::select(trial, dose)
  sub.df = left_join(sub.df, dose_order, by = c("trial"))
  
  # Now, we have wrangled and cleaned that subject. We can bind to the parent. 
  hr.df = bind_rows(hr.df, sub.df)
  
  # Change back to the heart_rate_data dir, to set up the next iteration (just one backwards from the subject). 
  setwd('..')
 
}
## 
## Working on sub_1
## Working on sub_2
## Working on sub_3
## Working on sub_4
## Working on sub_5
## Working on sub_6
## Working on sub_7
## Working on sub_8
## Working on sub_9
## Working on sub_10
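
The last step of the mission is to save the parent data frame as an RDS. A minimal sketch, using a small made-up data frame in place of hr.df and a hypothetical file name:

```r
# Assumption: a small stand-in for hr.df so this sketch runs on its own
hr_demo = data.frame(sub_num = "sub_1", trial = 1, time = 1:3,
                     hr = c(60, 62, 61), dose = "Saline")

# The file name is hypothetical; pick whatever is informative for you
rds_file = file.path(tempdir(), "hr_all_subjects.rds")

saveRDS(hr_demo, rds_file)        # write one R object to disk
hr_reloaded = readRDS(rds_file)   # read it back in a later session

all.equal(hr_demo, hr_reloaded)
## [1] TRUE
```

With the real data, checking nrow(hr.df) before saving is a quick sanity test: it should be 14400 (10 subjects x 6 trials x 240 rows).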

Yay!!!! R is awesome!