library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
# library(tidylog)
Load the tidyverse
packages
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
# library(tidylog)
Letβs use the World Health Organization TB data set from the tidyr
package
<- tidyr::who
who dim(who)
[1] 7240 60
head(who[,1:6] %>% filter(!is.na(new_sp_m014)))
# A tibble: 6 Γ 6
country iso2 iso3 year new_sp_m014 new_sp_m1524
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan AF AFG 1997 0 10
2 Afghanistan AF AFG 1998 30 129
3 Afghanistan AF AFG 1999 8 55
4 Afghanistan AF AFG 2000 52 228
5 Afghanistan AF AFG 2001 129 379
6 Afghanistan AF AFG 2002 90 476
See the help page for who
for more information about this data set.
In particular, note this description:
βThe data uses the original codes given by the World Health Organization. The column names for columns five through 60 are made by combining new_ to a code for method of diagnosis (rel = relapse, sn = negative pulmonary smear, sp = positive pulmonary smear, ep = extrapulmonary) to a code for gender (f = female, m = male) to a code for age group (014 = 0-14 yrs of age, 1524 = 15-24 years of age, 2534 = 25 to 34 years of age, 3544 = 35 to 44 years of age, 4554 = 45 to 54 years of age, 5564 = 55 to 64 years of age, 65 = 65 years of age or older).β
So new_sp_m014
represents the counts of new TB cases detected by a positive pulmonary smear in males in the 0-14 age group.
Tidy data: Have each variable in a column.
Question: Are these data tidy?
No these data are not tidy because aspects of the data that should be variables are encoded in the name of the variables.
These aspects are
Question: How would we make these data tidy?
Consider this portion of the data:
head(who[,1:5] %>% filter(!is.na(new_sp_m014) & new_sp_m014>0), 1)
# A tibble: 1 Γ 5
country iso2 iso3 year new_sp_m014
<chr> <chr> <chr> <dbl> <dbl>
1 Afghanistan AF AFG 1998 30
We would replace the new_sp_m014
with the following four columns:
type sex age n
sp m 014 30
This would place each variable in its own column.
<- tibble(
stocks time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
)
head(stocks)
# A tibble: 6 Γ 4
time X Y Z
<date> <dbl> <dbl> <dbl>
1 2009-01-01 -0.817 0.595 0.126
2 2009-01-02 -0.399 -1.28 7.11
3 2009-01-03 -0.181 1.13 0.899
4 2009-01-04 1.00 0.393 -5.95
5 2009-01-05 0.102 1.05 1.96
6 2009-01-06 -1.00 -2.46 -0.275
%>% gather("stock", "price", -time) %>% head() stocks
# A tibble: 6 Γ 3
time stock price
<date> <chr> <dbl>
1 2009-01-01 X -0.817
2 2009-01-02 X -0.399
3 2009-01-03 X -0.181
4 2009-01-04 X 1.00
5 2009-01-05 X 0.102
6 2009-01-06 X -1.00
%>% pivot_longer(c(X,Y,Z), names_to= "stock", values_to = "price") %>%
stocks head()
# A tibble: 6 Γ 3
time stock price
<date> <chr> <dbl>
1 2009-01-01 X -0.817
2 2009-01-01 Y 0.595
3 2009-01-01 Z 0.126
4 2009-01-02 X -0.399
5 2009-01-02 Y -1.28
6 2009-01-02 Z 7.11
Question: How would we convert this to tidy form?
head(who[,1:6] %>% filter(!is.na(new_sp_m014)))
# A tibble: 6 Γ 6
country iso2 iso3 year new_sp_m014 new_sp_m1524
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan AF AFG 1997 0 10
2 Afghanistan AF AFG 1998 30 129
3 Afghanistan AF AFG 1999 8 55
4 Afghanistan AF AFG 2000 52 228
5 Afghanistan AF AFG 2001 129 379
6 Afghanistan AF AFG 2002 90 476
<- who %>% pivot_longer(starts_with("new"), names_to = "demo", values_to = "n") %>% filter(!is.na(n))
who.long head(who.long)
# A tibble: 6 Γ 6
country iso2 iso3 year demo n
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 Afghanistan AF AFG 1997 new_sp_m014 0
2 Afghanistan AF AFG 1997 new_sp_m1524 10
3 Afghanistan AF AFG 1997 new_sp_m2534 6
4 Afghanistan AF AFG 1997 new_sp_m3544 3
5 Afghanistan AF AFG 1997 new_sp_m4554 5
6 Afghanistan AF AFG 1997 new_sp_m5564 2
Question: How would we split demo
into variables?
head(who.long)
# A tibble: 6 Γ 6
country iso2 iso3 year demo n
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 Afghanistan AF AFG 1997 new_sp_m014 0
2 Afghanistan AF AFG 1997 new_sp_m1524 10
3 Afghanistan AF AFG 1997 new_sp_m2534 6
4 Afghanistan AF AFG 1997 new_sp_m3544 3
5 Afghanistan AF AFG 1997 new_sp_m4554 5
6 Afghanistan AF AFG 1997 new_sp_m5564 2
Look at the variable naming scheme:
names(who) %>% grep("m014",., value=TRUE)
[1] "new_sp_m014" "new_sn_m014" "new_ep_m014" "newrel_m014"
Question: How should we adjust the demo
strings so as to be able to easily split all of them into the desired variables?
<- who.long %>%
who.long mutate(demo = str_replace(demo, "newrel", "new_rel"))
grep("m014",who.long$demo, value=TRUE) %>% unique()
[1] "new_sp_m014" "new_sn_m014" "new_ep_m014" "new_rel_m014"
Question: After adjusting the demo
strings, how would we then separate them into the desired variables?
Hint: Use separate_wider_position()
and separate_wider_delim()
.
<- who.long %>%
who.long.v1 separate(demo, into = c("new", "type", "sexagerange"), sep="_") %>%
separate(sexagerange, into=c("sex","age_range"), sep=1) %>%
select(-new)
head(who.long.v1)
# A tibble: 6 Γ 8
country iso2 iso3 year type sex age_range n
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 Afghanistan AF AFG 1997 sp m 014 0
2 Afghanistan AF AFG 1997 sp m 1524 10
3 Afghanistan AF AFG 1997 sp m 2534 6
4 Afghanistan AF AFG 1997 sp m 3544 3
5 Afghanistan AF AFG 1997 sp m 4554 5
6 Afghanistan AF AFG 1997 sp m 5564 2
Note that separate()
has been superseded in favour of separate_wider_position()
and separate_wider_delim()
. So here we use those two functions instead of separate()
:
<- who.long %>%
who.long.v2 separate_wider_delim(demo, names = c("new", "type", "sexagerange"), delim="_") %>%
separate_wider_position(sexagerange, widths=c("sex"=1,"age_range"=4), too_few="align_start" ) %>%
select(-new)
head(who.long.v2)
# A tibble: 6 Γ 8
country iso2 iso3 year type sex age_range n
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 Afghanistan AF AFG 1997 sp m 014 0
2 Afghanistan AF AFG 1997 sp m 1524 10
3 Afghanistan AF AFG 1997 sp m 2534 6
4 Afghanistan AF AFG 1997 sp m 3544 3
5 Afghanistan AF AFG 1997 sp m 4554 5
6 Afghanistan AF AFG 1997 sp m 5564 2
Now our untidy data are tidy.
head(who.long)
# A tibble: 6 Γ 6
country iso2 iso3 year demo n
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 Afghanistan AF AFG 1997 new_sp_m014 0
2 Afghanistan AF AFG 1997 new_sp_m1524 10
3 Afghanistan AF AFG 1997 new_sp_m2534 6
4 Afghanistan AF AFG 1997 new_sp_m3544 3
5 Afghanistan AF AFG 1997 new_sp_m4554 5
6 Afghanistan AF AFG 1997 new_sp_m5564 2
This exercise was modeled, in part, on this exercise:
https://people.duke.edu/\~ccc14/cfar-data-workshop-2018/CFAR_R_Workshop_2018_Exercisees.html