library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
# library(tidylog)
i <- 6
for (i in 1:10) {
  fl <- data.frame(name=rep(paste0("name",i),26))
  b <- data.frame(name = rep(NA, 26))
  b$name <- paste0(fl$name,"_",letters)
  b$trait <- rnorm(26)
  write_tsv(b,paste0("data/dataset",i,".txt"))
}
Here we have been sent three data sets, in files that contain the quantitative trait values for each person in the data set:
“dataset1.txt”, “dataset2.txt”, “dataset3.txt”
We have been asked to make a table that gives, for each data set, the sample size (N) and the mean, median, and variance of the trait.
We could do this by reading in each data set, one by one, as follows:
results <- data.frame(dataset=rep(NA,3),N=NA, mean=NA, median=NA, var=NA)
fl1 <- read.table("data/dataset1.txt",sep="\t",header=TRUE)
results$dataset[1] <- "dataset1"
results$N <- nrow(fl1)
results$mean[1] <- mean(fl1$trait)
results$median[1] <- median(fl1$trait)
results$var[1] <- var(fl1$trait)
results
dataset N mean median var
1 dataset1 26 0.09762111 0.2198957 0.5974116
2 <NA> 26 NA NA NA
3 <NA> 26 NA NA NA
fl2 <- read.table("data/dataset2.txt",sep="\t",header=TRUE)
results$dataset[2] <- "dataset2"
results$N <- nrow(fl2)
results$mean[2] <- mean(fl2$trait)
results$median[2] <- median(fl2$trait)
results$var[2] <- var(fl2$trait)
results
dataset N mean median var
1 dataset1 26 0.09762111 0.2198957 0.5974116
2 dataset2 26 0.43486401 0.3558736 1.0936651
3 <NA> 26 NA NA NA
fl3 <- read.table("data/dataset3.txt",sep="\t",header=TRUE)
results$dataset[3] <- "dataset3"
results$N <- nrow(fl3)
results$mean[3] <- mean(fl3$trait)
results$median[3] <- median(fl3$trait)
results$var[3] <- var(fl3$trait)
results
dataset N mean median var
1 dataset1 26 0.09762111 0.2198957 0.5974116
2 dataset2 26 0.43486401 0.3558736 1.0936651
3 dataset3 26 0.07508335 0.0445614 0.7950574
Your colleague initially sent you the three data sets above, but has now sent you three more and asked you to update the ‘results’ table.
As you can see, the code above is very repetitive. So let’s automate this by writing a function that loops through a list of data set files named “dataset1.txt”, “dataset2.txt”, “dataset3.txt”, etc., building up the results table as above.
This WebR chunk needs to be run first, before the later ones, as it downloads and reads in the required data files. The WebR chunks should be run in order, as you encounter them, from beginning to end.
We now have the files “dataset1.txt”, “dataset2.txt”, “dataset3.txt”, …, “dataset6.txt” in the ‘data’ directory.
Question: How could we construct a list of file names?
Hint: Use the list.files command.
The list.files command provides a handy way to get a list of the input files:
fls <- list.files(path="data",pattern="dataset*")
fls
[1] "dataset1.txt" "dataset2.txt" "dataset3.txt" "dataset4.txt" "dataset5.txt"
[6] "dataset6.txt"
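One detail worth knowing is that the pattern argument of list.files is a regular expression, not a shell-style wildcard, so "dataset*" matches any name containing "datase" followed by zero or more "t" characters. The following is a minimal sketch of a stricter pattern, assuming the files are named dataset<number>.txt. The temporary directory and the file names in it are made up for illustration.

```r
# Illustration only: create a scratch directory with a few empty files.
tmp <- file.path(tempdir(), "list_files_demo")
dir.create(tmp, showWarnings = FALSE)
file.create(file.path(tmp, c("dataset1.txt", "dataset2.txt",
                             "dataset10.txt", "dataset_notes.md")))

# The loose pattern also picks up the notes file.
loose <- list.files(path = tmp, pattern = "dataset*")

# An anchored regular expression keeps only dataset<number>.txt.
strict <- list.files(path = tmp, pattern = "^dataset[0-9]+\\.txt$")

# list.files() sorts names alphabetically, so dataset10.txt is listed
# before dataset2.txt; reorder by the embedded number if that matters.
file_number <- as.integer(gsub("[^0-9]", "", strict))
strict_sorted <- strict[order(file_number)]
strict_sorted
```

With only six files the alphabetical and numeric orders agree, so the original call gives the expected order here.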
Outline a possible algorithm that loops through a list of input data set files named “dataset1.txt”, “dataset2.txt”, “dataset3.txt”, etc., building up the results table as above.
Construct a more detailed step-by-step algorithm.
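Get the list of input file names, fls.
Count the number of files, N.
Set up a results data frame with N rows.
Loop over fls, processing each file with a read_data_file function.
Write a read_data_file function to accomplish the required steps for a single input data file.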
Here we make the number in the data file name an argument:
results <- data.frame(dataset=rep(NA,6),N=NA, mean=NA, median=NA, var=NA)
read_data_file <- function(n=1, results) {
  fl1 <- read.table(paste0("data/dataset",n,".txt"),sep="\t",header=TRUE)
  results$dataset[n] <- paste0("dataset",n,".txt")
  results$N <- nrow(fl1)
  results$mean[n] <- mean(fl1$trait)
  results$median[n] <- median(fl1$trait)
  results$var[n] <- var(fl1$trait)
  results
}
(results <- read_data_file(n=1, results))
dataset N mean median var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 <NA> 26 NA NA NA
3 <NA> 26 NA NA NA
4 <NA> 26 NA NA NA
5 <NA> 26 NA NA NA
6 <NA> 26 NA NA NA
(results <- read_data_file(n=2, results))
dataset N mean median var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 dataset2.txt 26 0.43486401 0.3558736 1.0936651
3 <NA> 26 NA NA NA
4 <NA> 26 NA NA NA
5 <NA> 26 NA NA NA
6 <NA> 26 NA NA NA
A second version of the read_data_file function. Here we make the name of the input file an argument.
read_data_file_v2 <- function(n=1, flnm="dataset1.txt", results) {
  fl1 <- read.table(paste0("data/",flnm),sep="\t",header=TRUE)
  results$dataset[n] <- flnm
  results$N <- nrow(fl1)
  results$mean[n] <- mean(fl1$trait)
  results$median[n] <- median(fl1$trait)
  results$var[n] <- var(fl1$trait)
  results
}
(results <- read_data_file_v2(n=1, flnm = "dataset1.txt", results))
dataset N mean median var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 dataset2.txt 26 0.43486401 0.3558736 1.0936651
3 <NA> 26 NA NA NA
4 <NA> 26 NA NA NA
5 <NA> 26 NA NA NA
6 <NA> 26 NA NA NA
(results <- read_data_file_v2(n=2, flnm = "dataset2.txt", results))
dataset N mean median var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 dataset2.txt 26 0.43486401 0.3558736 1.0936651
3 <NA> 26 NA NA NA
4 <NA> 26 NA NA NA
5 <NA> 26 NA NA NA
6 <NA> 26 NA NA NA
What does the above code assume?
Assumes a file naming style of ‘dataset*.txt’ where the asterisk represents 1, 2, 3, …
Assumes the files are in the “data” folder.
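The code also assumes that each file is tab-separated, has a header row, and contains a column named trait. One way to make such assumptions explicit is to check them before computing anything. The sketch below is only an illustration: check_data_file and its dir argument are hypothetical names, not part of the original material, and the demonstration file is made up.

```r
# Hypothetical helper: stop with a clear message if an assumption about
# the input file does not hold; otherwise return the data invisibly.
check_data_file <- function(flnm, dir = "data") {
  path <- file.path(dir, flnm)
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  fl <- read.table(path, sep = "\t", header = TRUE)
  if (!"trait" %in% names(fl)) {
    stop("No 'trait' column in ", path)
  }
  invisible(fl)
}

# Demonstration on a small made-up file in a temporary directory.
demo_dir <- file.path(tempdir(), "check_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(name = c("a", "b"), trait = c(0.1, 0.2)),
            file.path(demo_dir, "dataset1.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)

ok <- check_data_file("dataset1.txt", dir = demo_dir)

# A missing file produces an informative error instead of a cryptic one.
missing_msg <- tryCatch(check_data_file("dataset99.txt", dir = demo_dir),
                        error = function(e) conditionMessage(e))
missing_msg
```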
The read_data_file function above processes one file at a time. How would you write a function that loops over all of our files?
fls <- list.files(path="data",pattern="dataset*")

loop_over_dataset <- function(fls) {
  # Input: the list of file names
  # Output: the 'results' table
  # Count the number of data set file names in fls
  n_datasets <- length(fls)
  # Set up a results dataframe with n_datasets rows
  results <- data.frame(dataset=rep(NA,n_datasets),N=NA, mean=NA, median=NA, var=NA)
  for (n in 1:n_datasets) {
    results <- read_data_file(n=n, results=results)
  }
  return(results)
}
loop_over_dataset(fls = fls)
dataset N mean median var
1 dataset1.txt 26 0.09762111 0.21989574 0.5974116
2 dataset2.txt 26 0.43486401 0.35587359 1.0936651
3 dataset3.txt 26 0.07508335 0.04456140 0.7950574
4 dataset4.txt 26 0.06259720 0.04813915 0.9186042
5 dataset5.txt 26 -0.09288522 -0.19155759 0.9978161
6 dataset6.txt 26 -0.20266667 -0.23845426 1.5605823
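Note that loop_over_dataset uses fls only to count the files, and read_data_file rebuilds each file name from n. A possible variant reads each file by the name stored in fls, so that it does not depend on the files being numbered consecutively. This is a sketch under assumptions: loop_over_files and its dir argument are hypothetical names, the example data are simulated in a temporary directory, and N is stored per row.

```r
# Made-up example data in a temporary directory (illustration only).
demo_dir <- file.path(tempdir(), "loop_demo")
dir.create(demo_dir, showWarnings = FALSE)
set.seed(1)
for (i in 1:3) {
  write.table(data.frame(name = paste0("name", i, "_", letters),
                         trait = rnorm(26)),
              file.path(demo_dir, paste0("dataset", i, ".txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

# Hypothetical variant of loop_over_dataset that reads each file by name.
loop_over_files <- function(fls, dir = "data") {
  results <- data.frame(dataset = rep(NA, length(fls)),
                        N = NA, mean = NA, median = NA, var = NA)
  # Loop over positions in fls, so each file is read by its own name.
  for (n in seq_along(fls)) {
    fl <- read.table(file.path(dir, fls[n]), sep = "\t", header = TRUE)
    results$dataset[n] <- fls[n]
    results$N[n] <- nrow(fl)
    results$mean[n] <- mean(fl$trait)
    results$median[n] <- median(fl$trait)
    results$var[n] <- var(fl$trait)
  }
  results
}

fls_demo <- list.files(path = demo_dir, pattern = "^dataset[0-9]+\\.txt$")
demo_results <- loop_over_files(fls_demo, dir = demo_dir)
demo_results
```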
What could go wrong with this read_data_file function?
results <- data.frame(dataset=rep(NA,6),N=NA, mean=NA, median=NA, var=NA)
read_data_file <- function(n=1, results) {
  fl1 <- read.table(paste0("data/dataset",n,".txt"),sep="\t",header=TRUE)
  results$dataset[n] <- paste0("dataset",n,".txt")
  results$N <- nrow(fl1)
  results$mean[n] <- mean(fl1$trait)
  results$median[n] <- median(fl1$trait)
  results$var[n] <- var(fl1$trait)
  invisible(results)
}
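If N varies across the data sets, then this line will not do the right thing, because it assigns the current file's sample size to every row of results, overwriting the values stored for earlier files:
results$N <- nrow(fl1)
Instead, this line should be
results$N[n] <- nrow(fl1)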
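The difference between the two assignments can be seen on a small table. This is a toy sketch with made-up sample sizes (26 and 10), not the tutorial's data.

```r
# A small pre-allocated table, similar in shape to the one in the text.
results_demo <- data.frame(dataset = rep(NA, 3), N = NA)

# Assigning to the whole column recycles the single value into every row.
whole_column <- results_demo
whole_column$N <- 26
whole_column$N <- 10   # a later, smaller data set overwrites all rows

# Assigning to element n changes only row n.
one_row <- results_demo
one_row$N[1] <- 26
one_row$N[2] <- 10

whole_column$N   # 10 10 10
one_row$N        # 26 10 NA
```

The bug goes unnoticed in this tutorial only because every data set happens to contain 26 rows.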
results <- data.frame(dataset=rep(NA,6),N=NA, mean=NA, median=NA, var=NA)
read_data_file <- function(n=1, results) {
  fl1 <- read.table(paste0("data/dataset",n,".txt"),sep="\t",header=TRUE)
  results$dataset[n] <- paste0("dataset",n,".txt")
  results$N[n] <- nrow(fl1)
  results$mean[n] <- mean(fl1$trait)
  results$median[n] <- median(fl1$trait)
  results$var[n] <- var(fl1$trait)
  invisible(results)
}
read_data_file_v2("dataset1.txt",results)
Error in file(file, "rt"): invalid 'description' argument
The read_data_file_v2 function’s arguments are n, flnm, and results.
When we call it in this manner:
read_data_file_v2("dataset1.txt",results)
we are calling it using unnamed arguments, so they are matched by position. That means the string “dataset1.txt” is assigned to the n argument and the results R object to the flnm argument, which is not what was intended.
If we use named arguments, then this runs without any errors:
read_data_file_v2(flnm = "dataset1.txt",results = results)
dataset N mean median var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 <NA> 26 NA NA NA
3 <NA> 26 NA NA NA
4 <NA> 26 NA NA NA
5 <NA> 26 NA NA NA
6 <NA> 26 NA NA NA
In this case, note that n took on the default value of 1.
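Positional and named matching can be compared on a toy function. The sketch below uses a hypothetical show_args function with the same argument order as read_data_file_v2. It only reports what each argument received, so the positional call fails for a different reason than in the real function: here results is simply never supplied.

```r
# Hypothetical toy function with the same argument order as read_data_file_v2.
show_args <- function(n = 1, flnm = "dataset1.txt", results) {
  paste0("n=", n, "; flnm=", flnm, "; results=", results)
}

# Positional matching: the first value goes to n and the second to flnm,
# leaving results missing, so evaluating it raises an error.
positional <- tryCatch(
  show_args("dataset2.txt", "my_table"),
  error = function(e) "error: 'results' was never supplied"
)

# Named matching: order no longer matters, and n keeps its default of 1.
named <- show_args(results = "my_table", flnm = "dataset2.txt")

positional
named
```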
Instead of inserting item by item, write a more concise function by putting all the data in a one-row data frame, and then insert the one-row data frame into the appropriate row of the pre-allocated results data frame.
Here we set up a data frame containing a new row of data.
read_data_file_v3 <- function(n=1, results) {
  fl1 <- read.table(paste0("data/dataset",n,".txt"),sep="\t",header=TRUE)
  NewRow <- data.frame(dataset = paste0("dataset",n,".txt"),
                       N = nrow(fl1),
                       mean = mean(fl1$trait),
                       median = median(fl1$trait),
                       var = var(fl1$trait)
  )
  results[n,] <- NewRow
  results
}
read_data_file_v3(1, results)
dataset N mean median var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 <NA> NA NA NA NA
3 <NA> NA NA NA NA
4 <NA> NA NA NA NA
5 <NA> NA NA NA NA
6 <NA> NA NA NA NA
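Once each file is summarised as a one-row data frame, another option is to skip the pre-allocated table altogether and stack the rows at the end. This is a different technique from the one used above, shown only as a sketch. The summarise_file helper and its dir argument are hypothetical names, and the example data are simulated in a temporary directory.

```r
# Made-up example data in a temporary directory (illustration only).
demo_dir <- file.path(tempdir(), "rbind_demo")
dir.create(demo_dir, showWarnings = FALSE)
set.seed(2)
for (i in 1:3) {
  write.table(data.frame(name = paste0("name", i, "_", letters),
                         trait = rnorm(26)),
              file.path(demo_dir, paste0("dataset", i, ".txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

# Hypothetical helper: summarise one file as a one-row data frame.
summarise_file <- function(flnm, dir = "data") {
  fl <- read.table(file.path(dir, flnm), sep = "\t", header = TRUE)
  data.frame(dataset = flnm,
             N = nrow(fl),
             mean = mean(fl$trait),
             median = median(fl$trait),
             var = var(fl$trait))
}

# Apply the helper to every file, then stack the one-row data frames.
fls_demo <- list.files(path = demo_dir, pattern = "^dataset[0-9]+\\.txt$")
rows <- lapply(fls_demo, summarise_file, dir = demo_dir)
results_demo <- do.call(rbind, rows)
results_demo
```

With this approach the number of rows never has to be known in advance, and there is no row index n to get wrong.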