10 R Character Exercise
10.1 Load Libraries
library(tidyverse)
# library(tidylog)
library(knitr)
10.2 Useful RStudio cheatsheet
See the “String manipulation with stringr cheatsheet” at
10.3 Scenario 1
You are working with three different sets of collaborators: 1) the clinical group that did the field work and generated the anthropometric measurements; 2) the medical laboratory that measured blood pressure in a controlled environment; and 3) the molecular laboratory that generated the genotypes.
clin <- read.table(file = "data/clinical_data.txt", header = TRUE)
kable(clin)
ID | height |
---|---|
1 | 152 |
104 | 172 |
2112 | 180 |
2543 | 163 |
lab <- read.table(file = "data/lab_data.txt", header = TRUE)
kable(lab)
ID | SBP |
---|---|
SG0001 | 120 |
SG0104 | 111 |
SG2112 | 125 |
SG2543 | 119 |
geno <- read.table(file = "data/genotype_data.txt", header = TRUE)
kable(geno)
Sample | rs1212 |
---|---|
TaqMan-SG0001-190601 | G/C |
TaqMan-SG0104-190602 | G/G |
TaqMan-SG2112-190603 | C/C |
TaqMan-Sg2543-190603 | C/G |
10.4 Discussion Questions
10.4.1 Question 1
The clinical group, which measured height, used integer IDs, but the medical group, which measured the blood pressure, decided to prefix the integer IDs with the string ‘SG’ (so as to distinguish them from other studies that were also using integer IDs). So ID ‘1’ was mapped to ID ‘SG0001’.
ID | height |
---|---|
1 | 152 |
104 | 172 |
2112 | 180 |
2543 | 163 |
Discuss how, using R commands, you would reformat the integer IDs to be in the format “SGXXXX”. Write down your ideas in the next section, and, if you have time, try them out within an R chunk.
Hint: Use the formatC function.
10.4.1.1 Interactive WebR chunk
You can interactively run R within this WebR chunk by clicking the Run code tab. Note that this is a limited version of R which runs within your web browser.
This Run code WebR chunk needs to be run first, before the later ones, as it downloads and reads in the required data files. The WebR chunks should be run in order, as you encounter them, from beginning to end.
10.4.2 Answer 1
# Pad each ID with leading zeros to width 4, then prepend "SG"
clin$SUBJECT_ID <- paste0("SG", formatC(clin$ID, width = 4, flag = "0"))
kable(clin)
ID | height | SUBJECT_ID |
---|---|---|
1 | 152 | SG0001 |
104 | 172 | SG0104 |
2112 | 180 | SG2112 |
2543 | 163 | SG2543 |
# Or here's an alternative using the 'sub' command: pad to width 6 with zeros,
# then replace the first "00" with "SG" (valid for IDs below 10000)
sub("00", "SG", formatC(clin$ID, flag = "0", width = 6))
[1] "SG0001" "SG0104" "SG2112" "SG2543"
# Or can be done using a `case_when`:
case_when(
  clin$ID < 10 ~ paste0("SG000", clin$ID),
  clin$ID < 100 ~ paste0("SG00", clin$ID),
  clin$ID < 1000 ~ paste0("SG0", clin$ID),
  clin$ID < 10000 ~ paste0("SG", clin$ID)
)
[1] "SG0001" "SG0104" "SG2112" "SG2543"
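As a further alternative, base R's sprintf can pad and prefix in a single call. The sketch below rebuilds the clinical table inline (an assumption, so that it runs without data/clinical_data.txt) and assumes every ID is a whole number below 10000.

```r
# Clinical table rebuilt inline for illustration (values copied from the table above)
clin <- data.frame(ID = c(1L, 104L, 2112L, 2543L),
                   height = c(152L, 172L, 180L, 163L))

# Guard the assumption that every ID fits in four digits
stopifnot(all(clin$ID >= 0 & clin$ID < 10000))

# "%04d" left-pads the integer with zeros to width 4; "SG" is a literal prefix
clin$SUBJECT_ID <- sprintf("SG%04d", clin$ID)
clin$SUBJECT_ID
# [1] "SG0001" "SG0104" "SG2112" "SG2543"
```

The explicit check makes the four-digit assumption visible: a five-digit ID would not be truncated by the format, so it would silently produce a seven-character ID.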
10.4.3 Question 2
Discuss how, using R commands, you would reformat the “SGXXXX” IDs to be integer IDs. Write down your ideas in the next section, and, if you have time, try them out within an R chunk.
ID | SBP |
---|---|
SG0001 | 120 |
SG0104 | 111 |
SG2112 | 125 |
SG2543 | 119 |
Hint: Use either the gsub command or the str_replace_all command from the stringr package.
To read in and load the data within the WebR environment, be sure to run all of the WebR chunks in order. For example, to usefully run R code in this WebR chunk here, you first need to run the WebR chunk above in Question 1.
10.4.4 Answer 2
lab$ID2 <- as.numeric(gsub("SG", "", lab$ID))
kable(lab)
ID | SBP | ID2 |
---|---|---|
SG0001 | 120 | 1 |
SG0104 | 111 | 104 |
SG2112 | 125 | 2112 |
SG2543 | 119 | 2543 |
# Alternative using str_replace_all from the stringr package
lab$ID2 <- NA
lab$ID2 <- str_replace_all(lab$ID, pattern = "SG", replacement = "") %>% as.numeric()
kable(lab)
ID | SBP | ID2 |
---|---|---|
SG0001 | 120 | 1 |
SG0104 | 111 | 104 |
SG2112 | 125 | 2112 |
SG2543 | 119 | 2543 |
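Both answers remove "SG" wherever it occurs in the string. A slightly stricter sketch anchors the pattern to the start of the ID and checks the format first. The lab table is rebuilt inline here, as an assumption, so that the chunk runs without data/lab_data.txt.

```r
# Lab table rebuilt inline for illustration (values copied from the table above)
lab <- data.frame(ID = c("SG0001", "SG0104", "SG2112", "SG2543"),
                  SBP = c(120L, 111L, 125L, 119L))

# Check that every ID is "SG" followed by exactly four digits
ok <- grepl("^SG[0-9]{4}$", lab$ID)
stopifnot(all(ok))

# "^SG" matches the prefix only at the start; as.integer drops the leading zeros
lab$ID2 <- as.integer(sub("^SG", "", lab$ID))
lab$ID2
# [1]    1  104 2112 2543
```

Validating first means a malformed ID stops the script instead of being converted to NA with only a warning.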
10.4.5 Question 3
The genotype group used IDs in the style “TaqMan-SG0001-190601”, where the first string is “TaqMan” and the ending string is the date of the genotyping experiment.
Discuss how, using R commands, you would extract an “SGXXXX” style ID from the “TaqMan-SG0001-190601” style IDs. Write down your ideas in the next section, and, if you have time, try them out within an R chunk.
Note that one of the IDs has a lower case ‘g’ in it - how would you correct this, using R commands?
Sample | rs1212 |
---|---|
TaqMan-SG0001-190601 | G/C |
TaqMan-SG0104-190602 | G/G |
TaqMan-SG2112-190603 | C/C |
TaqMan-Sg2543-190603 | C/G |
Hint: Use either the str_split_fixed function from the stringr package or the separate function from the tidyr package.
10.4.6 Answer 3
a <- str_split_fixed(geno$Sample, pattern = "-", n = 3)
a
     [,1]     [,2]     [,3]
[1,] "TaqMan" "SG0001" "190601"
[2,] "TaqMan" "SG0104" "190602"
[3,] "TaqMan" "SG2112" "190603"
[4,] "TaqMan" "Sg2543" "190603"
geno$ID <- toupper(a[, 2])
kable(geno)
Sample | rs1212 | ID |
---|---|---|
TaqMan-SG0001-190601 | G/C | SG0001 |
TaqMan-SG0104-190602 | G/G | SG0104 |
TaqMan-SG2112-190603 | C/C | SG2112 |
TaqMan-Sg2543-190603 | C/G | SG2543 |
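Splitting on "-" relies on the ID always being the second field. As a hedged alternative, the ID can be matched directly with a regular expression in base R. The sample names are rebuilt inline (an assumption, so the chunk runs without data/genotype_data.txt), and the sketch assumes every name contains "SG" followed by four digits.

```r
# Genotype sample names rebuilt inline for illustration
sample_names <- c("TaqMan-SG0001-190601", "TaqMan-SG0104-190602",
                  "TaqMan-SG2112-190603", "TaqMan-Sg2543-190603")

# Locate "SG" plus four digits, ignoring case so that "Sg2543" also matches
hits <- regexpr("SG[0-9]{4}", sample_names, ignore.case = TRUE)

# regexpr returns -1 where nothing matched; this sketch assumes that never happens
stopifnot(all(hits > 0))

# Pull out the matched substrings and normalise them to upper case
ids <- toupper(regmatches(sample_names, hits))
ids
# [1] "SG0001" "SG0104" "SG2112" "SG2543"
```

The check on hits matters because regmatches silently drops non-matching elements, which would otherwise leave ids shorter than the input.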
The separate function from the tidyr package is also useful:
geno %>%
  separate(Sample, into = c("Tech", "ID2", "Suffix"), sep = "-") %>%
  mutate(ID2 = toupper(ID2))
Tech ID2 Suffix rs1212 ID
1 TaqMan SG0001 190601 G/C SG0001
2 TaqMan SG0104 190602 G/G SG0104
3 TaqMan SG2112 190603 C/C SG2112
4 TaqMan SG2543 190603 C/G SG2543
The separate function has been superseded in favor of separate_wider_delim and separate_wider_position. In this case, separate_wider_delim is applicable.
geno %>%
  separate_wider_delim(cols = Sample, delim = "-", names = c("Tech", "ID2", "Suffix")) %>%
  mutate(ID2 = toupper(ID2))
# A tibble: 4 × 5
Tech ID2 Suffix rs1212 ID
<chr> <chr> <chr> <chr> <chr>
1 TaqMan SG0001 190601 G/C SG0001
2 TaqMan SG0104 190602 G/G SG0104
3 TaqMan SG2112 190603 C/C SG2112
4 TaqMan SG2543 190603 C/G SG2543
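The scenario does not say what the harmonized IDs are used for. Assuming the aim is to combine the three collaborators' tables, the sketch below joins them on a shared "SGXXXX" key with base R's merge. All three tables are rebuilt inline for illustration, and naming the key column ID in every table is an assumption of this sketch.

```r
# The three tables rebuilt inline, each already carrying an "SGXXXX" key
clin <- data.frame(ID = sprintf("SG%04d", c(1L, 104L, 2112L, 2543L)),
                   height = c(152L, 172L, 180L, 163L))
lab <- data.frame(ID = c("SG0001", "SG0104", "SG2112", "SG2543"),
                  SBP = c(120L, 111L, 125L, 119L))
geno <- data.frame(ID = c("SG0001", "SG0104", "SG2112", "SG2543"),
                   rs1212 = c("G/C", "G/G", "C/C", "C/G"))

# Join on the shared key; by default merge() keeps only IDs present in both inputs
combined <- merge(merge(clin, lab, by = "ID"), geno, by = "ID")
combined
```

Because unmatched IDs are dropped by default, comparing nrow(combined) with the input sizes is a quick way to spot IDs that were not reformatted consistently.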
10.5 Scenario 2
A replication sample has been measured, and it uses IDs in the style “RP5XXX”.
joint <- read.table(file = "data/joint_data.txt", header = TRUE)
kable(joint)
ID | SBP |
---|---|
SG0001 | 120 |
SG0104 | 111 |
SG2112 | 125 |
SG2543 | 119 |
RP5002 | 121 |
RP5012 | 118 |
RP5113 | 112 |
RP5213 | 142 |
10.5.1 Question 4
Discuss how you would use R commands to split the ‘joint’ data frame into ‘SG’-specific and ‘RP’-specific pieces. Write down your ideas in the next section, and, if you have time, try them out within an R chunk.
ID | SBP |
---|---|
SG0001 | 120 |
SG0104 | 111 |
SG2112 | 125 |
SG2543 | 119 |
RP5002 | 121 |
RP5012 | 118 |
RP5113 | 112 |
RP5213 | 142 |
10.5.2 Answer 4
grep(pattern = "SG",joint$ID)
[1] 1 2 3 4
grep(pattern = "RP", joint$ID)
[1] 5 6 7 8
joint.SG <- joint[grep(pattern = "SG", joint$ID), ]
joint.RP <- joint[grep(pattern = "RP", joint$ID), ]
kable(joint.SG)
ID | SBP |
---|---|
SG0001 | 120 |
SG0104 | 111 |
SG2112 | 125 |
SG2543 | 119 |
kable(joint.RP)
|   | ID | SBP |
|---|---|---|
| 5 | RP5002 | 121 |
| 6 | RP5012 | 118 |
| 7 | RP5113 | 112 |
| 8 | RP5213 | 142 |
# Reset row names: the subset kept the original row names 5 to 8,
# which kable() prints as an extra column
rownames(joint.RP) <- NULL
kable(joint.RP)
ID | SBP |
---|---|
RP5002 | 121 |
RP5012 | 118 |
RP5113 | 112 |
RP5213 | 142 |
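An unanchored grep for "SG" or "RP" would also match those letters elsewhere in an ID. As a hedged alternative, the sketch below groups rows by the first two characters of the ID and lets base R's split produce both pieces at once. The joint table is rebuilt inline (an assumption, so the chunk runs without data/joint_data.txt).

```r
# Joint table rebuilt inline for illustration (values copied from the table above)
joint <- data.frame(ID = c("SG0001", "SG0104", "SG2112", "SG2543",
                           "RP5002", "RP5012", "RP5113", "RP5213"),
                    SBP = c(120L, 111L, 125L, 119L, 121L, 118L, 112L, 142L))

# The first two characters of each ID identify the study
prefix <- substr(joint$ID, 1, 2)

# split() returns a named list with one data frame per prefix
pieces <- split(joint, prefix)

# Drop the inherited row names so each piece is numbered from 1
pieces <- lapply(pieces, function(d) {
  rownames(d) <- NULL
  d
})

names(pieces)
# [1] "RP" "SG"
pieces$RP
```

This also scales to additional study prefixes without writing one grep call per study.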