20 R Functions Excercise

20.1 Load Libraries

 library(tidyverse)
# library(tidylog)

20.2 Data set creation code

i <- 6
for (i in 1:10) {
fl <- data.frame(name=rep(paste0("name",i),26))
b <- data.frame(name = rep(NA, 26))
b$name <- paste0(fl$name,"_",letters)
b$trait <- rnorm(26)
write_tsv(b,paste0("data/dataset",i,".txt"))
}

20.3 Example

Here we have been sent three data sets in the files that contain the trait quantitative values for each person in the data set:

“dataset1.txt” “dataset2.txt” “dataset3.txt”

And we’ve been asked to make a table that gives, for each dataset, the sample size (N), the mean of the trait, the median, and the variance.

We could do this by reading in each data set, one by one, as follows:

results <- data.frame(dataset=rep(NA,3),N=NA, mean=NA, median=NA, var=NA)
fl1 <- read.table("data/dataset1.txt",sep="\t",header=TRUE)
results$dataset[1] <- "dataset1"
results$N <- nrow(fl1)
results$mean[1] <- mean(fl1$trait)
results$median[1] <- median(fl1$trait)
results$var[1] <- var(fl1$trait)
results

   dataset  N       mean    median       var
1 dataset1 26 0.09762111 0.2198957 0.5974116
2     <NA> 26         NA        NA        NA
3     <NA> 26         NA        NA        NA

fl2 <- read.table("data/dataset2.txt",sep="\t",header=TRUE)
results$dataset[2] <- "dataset2"
results$N <- nrow(fl2)
results$mean[2] <- mean(fl2$trait)
results$median[2] <- median(fl2$trait)
results$var[2] <- var(fl2$trait)
results

   dataset  N       mean    median       var
1 dataset1 26 0.09762111 0.2198957 0.5974116
2 dataset2 26 0.43486401 0.3558736 1.0936651
3     <NA> 26         NA        NA        NA

fl3 <- read.table("data/dataset3.txt",sep="\t",header=TRUE)
results$dataset[3] <- "dataset3"
results$N <- nrow(fl3)
results$mean[3] <- mean(fl3$trait)
results$median[3] <- median(fl3$trait)
results$var[3] <- var(fl3$trait)
results

   dataset  N       mean    median       var
1 dataset1 26 0.09762111 0.2198957 0.5974116
2 dataset2 26 0.43486401 0.3558736 1.0936651
3 dataset3 26 0.07508335 0.0445614 0.7950574

Your colleague initially sent you the three data sets above, but now your colleague has sent you three more data sets and asked you to update the ‘results’ table.

As you can see, the code above is very repetitive. So let’s automate this by writing a function that loops through a list of data set files named “dataset1.txt”, “dataset2.txt”, “dataset3.txt”, etc., building up the results table as above.

20.3.1 Question: How could we construct a list of file names?

Note

This Run code WebR chunk needs to be run first, before the later ones, as it downloads and reads in the required data files. The WebR chunks should be run in order, as you encounter them, from beginning to end.

We now have the files “dataset1.txt”, “dataset2.txt”, “dataset3.txt”, …, “dataset6.txt” in the ‘data’ directory.

Question: How could we construct a list of file names?

Hint: Use the list.files command

Expand to see solution

Hint: the list.files command provides a handy way to get a list of the input files:

fls <- list.files(path="data",pattern="dataset*")
fls

[1] "dataset1.txt" "dataset2.txt" "dataset3.txt" "dataset4.txt" "dataset5.txt"
[6] "dataset6.txt"

20.3.2 Question: Outline a possible algorithm

Outline a possible algorithm that loops through a list of input data set files named “dataset1.txt”, “dataset2.txt”, “dataset3.txt”, etc., building up the results table as above.

Expand to see solution

Read in the input file names into a list
Set up an empty results table
For each file in our file name list
- Read the file
- Compute the statistics
- Insert the information into the results table
- Return the filled-in results table

20.3.3 Question: Construct a more detailed step-by-step algorithm.

Construct a more detailed step-by-step algorithm.

Expand to see solution

Input the path to the folder containing the data files
Read in the input file names into a list fls
Count the number of input files N
Set up an empty results table with N rows
For each file in our file name list fls
- Read the file
- Compute the statistics
- Insert the information into the correct row of the results table
Return the filled-in results table

20.3.4 Task: Write a `read_data_file` function.

Write a read_data_file function to accomplish the required steps for a single input data file.

Make the number in the data file name an argument.

Expand to see solution

Here we make the number in the data file name an argument

results <- data.frame(dataset=rep(NA,6),N=NA, mean=NA, median=NA, var=NA)
read_data_file <- function(n=1, results) {
  fl1 <- read.table(paste0("data/dataset",n,".txt"),sep="\t",header=TRUE)
  results$dataset[n] <- paste0("dataset",n,".txt")
  results$N <- nrow(fl1)
  results$mean[n] <- mean(fl1$trait)
  results$median[n] <- median(fl1$trait)
  results$var[n] <- var(fl1$trait)
  results
}
(results <- read_data_file(n=1, results))

       dataset  N       mean    median       var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2         <NA> 26         NA        NA        NA
3         <NA> 26         NA        NA        NA
4         <NA> 26         NA        NA        NA
5         <NA> 26         NA        NA        NA
6         <NA> 26         NA        NA        NA

(results <- read_data_file(n=2, results))

       dataset  N       mean    median       var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 dataset2.txt 26 0.43486401 0.3558736 1.0936651
3         <NA> 26         NA        NA        NA
4         <NA> 26         NA        NA        NA
5         <NA> 26         NA        NA        NA
6         <NA> 26         NA        NA        NA

Make the path to the input file an argument to your read_data_file function.

Expand to see solution

Here we make the path to the input file an argument.

read_data_file_v2 <- function(n=1, flnm="dataset1.txt", results) {
  fl1 <- read.table(paste0("data/",flnm),sep="\t",header=TRUE)
  results$dataset[n] <- flnm
  results$N <- nrow(fl1)
  results$mean[n] <- mean(fl1$trait)
  results$median[n] <- median(fl1$trait)
  results$var[n] <- var(fl1$trait)
  results
}
(results <- read_data_file_v2(n=1, flnm = "dataset1.txt", results))

       dataset  N       mean    median       var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 dataset2.txt 26 0.43486401 0.3558736 1.0936651
3         <NA> 26         NA        NA        NA
4         <NA> 26         NA        NA        NA
5         <NA> 26         NA        NA        NA
6         <NA> 26         NA        NA        NA

(results <- read_data_file_v2(n=2, flnm = "dataset2.txt", results))

       dataset  N       mean    median       var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 dataset2.txt 26 0.43486401 0.3558736 1.0936651
3         <NA> 26         NA        NA        NA
4         <NA> 26         NA        NA        NA
5         <NA> 26         NA        NA        NA
6         <NA> 26         NA        NA        NA

20.3.5 Question: What does the above code assume?

What does the above code assume?

Expand to see solution

Assumes a file naming style of ‘dataset*.txt’ where the asterisk represents 1, 2, 3, …

Assumes the files are in the “data” folder.

20.3.6 Question: Extend your function to process all of the files

The above function read_data_file processes one file at a time. How would you write a function to loop this over to process all of our files?

Expand to see solution

fls <- list.files(path="data",pattern="dataset*")

loop_over_dataset <- function(fls) {
  # Input: the list of file names
  # Output: the 'results table
  # Count the number of data set file names in fls
  n_datasets <- length(fls)
  # Set up a results dataframe with n_datasets rows
  results <- data.frame(dataset=rep(NA,n_datasets),N=NA, mean=NA, median=NA, var=NA)
  for (n in 1:n_datasets) {
    results <- read_data_file(n=n, results=results)
  }
  return(results)
}

loop_over_dataset(fls = fls)

       dataset  N        mean      median       var
1 dataset1.txt 26  0.09762111  0.21989574 0.5974116
2 dataset2.txt 26  0.43486401  0.35587359 1.0936651
3 dataset3.txt 26  0.07508335  0.04456140 0.7950574
4 dataset4.txt 26  0.06259720  0.04813915 0.9186042
5 dataset5.txt 26 -0.09288522 -0.19155759 0.9978161
6 dataset6.txt 26 -0.20266667 -0.23845426 1.5605823

20.3.7 Bonus question: Can you find a subtle mistake in the `read_data_file` function?

results <- data.frame(dataset=rep(NA,6),N=NA, mean=NA, median=NA, var=NA)
read_data_file <- function(n=1, results) {
  fl1 <- read.table(paste0("data/dataset",n,".txt"),sep="\t",header=TRUE)
  results$dataset[n] <- paste0("dataset",n,".txt")
  results$N <- nrow(fl1)
  results$mean[n] <- mean(fl1$trait)
  results$median[n] <- median(fl1$trait)
  results$var[n] <- var(fl1$trait)
  invisible(results)
}

Expand to see solution

If N varies across the data sets, then this line will not do the right thing:

results$N <- nrow(fl1)

Instead this line should be

results$N[n] <- nrow(fl1)

results <- data.frame(dataset=rep(NA,6),N=NA, mean=NA, median=NA, var=NA)
read_data_file <- function(n=1, results) {
  fl1 <- read.table(paste0("data/dataset",n,".txt"),sep="\t",header=TRUE)
  results$dataset[n] <- paste0("dataset",n,".txt")
  results$N[n] <- nrow(fl1)
  results$mean[n] <- mean(fl1$trait)
  results$median[n] <- median(fl1$trait)
  results$var[n] <- var(fl1$trait)
  invisible(results)
}

20.3.8 Bonus question: Why does this end in an error?

read_data_file_v2("dataset1.txt",results)

Error in file(file, "rt"): invalid 'description' argument

Expand to see solution

The read_data_file_v2 function’s arguments are n, flnm, and results.

When we call it in this manner:

read_data_file_v2("dataset1.txt",results)

we are calling it using unamed arguments, so they are interpreted by position. That means it is assigning the string “dataset1.txt” to the n argument, and the results R object to the flnm argument, but this is not what was intended.

If we use named arguments, then this runs without any errors:

read_data_file_v2(flnm = "dataset1.txt",results = results)

       dataset  N       mean    median       var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2         <NA> 26         NA        NA        NA
3         <NA> 26         NA        NA        NA
4         <NA> 26         NA        NA        NA
5         <NA> 26         NA        NA        NA
6         <NA> 26         NA        NA        NA

In this case, note that n took on the default value of 1.

20.3.9 Bonus question: Write a more concise function

Instead of inserting item by item, write a more concise function by putting all the data in a one-row data frame, and then insert the one-row data frame into the appropriate row of the pre-allocated results data frame.

Expand to see solution

Here we set up a data frame containing a new row of data.

read_data_file_v3 <- function(n=1, results) {
  fl1 <- read.table(paste0("data/dataset",n,".txt"),sep="\t",header=TRUE)
  NewRow <- data.frame(dataset = paste0("dataset",n,".txt"),
  N = nrow(fl1),
  mean = mean(fl1$trait),
  median = median(fl1$trait),
  var = var(fl1$trait)
  )
  results[n,] <- NewRow
  results
}
read_data_file_v3(1, results)

       dataset  N       mean    median       var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2         <NA> NA         NA        NA        NA
3         <NA> NA         NA        NA        NA
4         <NA> NA         NA        NA        NA
5         <NA> NA         NA        NA        NA
6         <NA> NA         NA        NA        NA

20.1 Load Libraries

20.2 Data set creation code

20.3 Example

20.3.1 Question: How could we construct a list of file names?

20.3.2 Question: Outline a possible algorithm

20.3.3 Question: Construct a more detailed step-by-step algorithm.

20.3.4 Task: Write a read_data_file function.

20.3.5 Question: What does the above code assume?

20.3.6 Question: Extend your function to process all of the files

20.3.7 Bonus question: Can you find a subtle mistake in the read_data_file function?

20.3.8 Bonus question: Why does this end in an error?

20.3.9 Bonus question: Write a more concise function

20.3.4 Task: Write a `read_data_file` function.

20.3.7 Bonus question: Can you find a subtle mistake in the `read_data_file` function?