suppressMessages(library(tidyverse))
library(dbGaPCheckup)
## Load DD.dict.I and DS.data.I
data(ExampleI)
26 Data Cleaning Exercise
26.1 Data cleaning principles
Data cleaning principles from Karl Broman’s slide set:
26.2 dbGaP quality control
As described in:
Tryka KA, Hao L, Sturcke A, Jin Y, Wang ZY, Ziyabari L, Lee M, Popova N, Sharopova N, Kimura M, Feolo M. NCBI’s Database of Genotypes and Phenotypes: dbGaP. Nucleic Acids Research. 2014 Jan 1;42(D1):D975–D979. PMID: 24297256 PMCID: PMC3965052 DOI: https://doi.org/10.1093/nar/gkt1211
“The Database of Genotypes and Phenotypes (dbGap, http://www.ncbi.nlm.nih.gov/gap) is a National Institutes of Health-sponsored repository charged to archive, curate and distribute information produced by studies investigating the interaction of genotype and phenotype.”
Under NIH data sharing guidelines, all properly consented large-scale genetic or ’omics studies must deposit their data in dbGaP. To so, one must closely follow the formatting requirements as described in the dbGaP Submission Guide:
https://www.ncbi.nlm.nih.gov/gap/docs/submissionguide/
This involves setting up a data dictionary that follows dbGaP specifications:
26.3 Minimum and Maximum Values Check
26.3.1 MIN, MAX check
In the data dictionary, for some variables, MIN
and MAX
values may be specified. For example, for age, it has a natural minimum of zero.
MIN The logical minimum value of the variable. If a separate code such as -1 is used for a missing field, this should not be considered as the MIN value.
MAX The logical maximum value for the variable. If a separate code such as 9999 is used for a missing field, this should not be considered as the MAX value.
Task: Design and implement a check that the specified MIN
and MAX
values observed in the data are consistent with the values as specified in the data dictionary.
Using DD.dict.I
and DS.data.I
, check the PERCEIVED_CONFLICT
variable to see if all the values fall within the stated MIN
and MAX
values.
%>%
DD.dict.I filter(VARNAME=="PERCEIVED_CONFLICT") %>%
select(VARNAME,MIN,MAX)
# A tibble: 1 × 3
VARNAME MIN MAX
<chr> <dbl> <dbl>
1 PERCEIVED_CONFLICT 1 15
26.3.2 Pseudo-code
First try to write out an algorithm for this Minimum and Maximum Values Check in pseudo-code, outlining each step.
Possible steps:
- Read the vector of
PERCEIVED_CONFLICT
fromDS.data.I
- Read the
MIN
andMAX
values forPERCEIVED_CONFLICT
fromDD.dict.I
- Count and list any
PERCEIVED_CONFLICT
values that lie outside of the range [MIN
, …,MAX
].
26.3.3 Implement MIN, MAX check in R
Implement your algorithm in code.
# Read the vector of `PERCEIVED_CONFLICT` from `DS.data.I`
<- DS.data.I$PERCEIVED_CONFLICT
trait # Read the `MIN` and `MAX` values for `PERCEIVED_CONFLICT` from `DD.dict.I`
<- DD.dict.I %>%
min.val filter(VARNAME=="PERCEIVED_CONFLICT") %>%
pull(MIN)
<- DD.dict.I %>%
max.val filter(VARNAME=="PERCEIVED_CONFLICT") %>%
pull(MAX)
# List any `PERCEIVED_CONFLICT` values that lie outside of the range [`MIN`, ..., `MAX`].
< min.val | trait > max.val] %>%
trait[trait unique() %>%
sort()
[1] 16 17 19 20 21 24 25 26 28 29 30
# Count any `PERCEIVED_CONFLICT` values that lie outside of the range [`MIN`, ..., `MAX`].
< min.val | trait > max.val] %>%
trait[trait length()
[1] 44
These results are consistent with those returned by the minmax_check
function from the dbGaPCheckup
R package:
<- minmax_check(DD.dict.I, DS.data.I, non.NA.missing.codes=c(-9999, -4444))$Information details
$Message
[1] "ERROR: some variables have values outside of the MIN to MAX range."
$Information
# A tibble: 1 × 5
Trait Check ListedMin ListedMax OutOfRangeValues
<chr> <lgl> <dbl> <dbl> <list>
1 PERCEIVED_CONFLICT FALSE 1 15 <tibble [11 × 1]>
<- details[[1]]$OutOfRangeValues[[1]]
df.tmp order(df.tmp$PERCEIVED_CONFLICT),] df.tmp[
# A tibble: 11 × 1
PERCEIVED_CONFLICT
<int>
1 16
2 17
3 19
4 20
5 21
6 24
7 25
8 26
9 28
10 29
11 30
26.3.4 Make your check more robust
After implementing your algorithm in R code, think about it a bit further - is it robust to the situation where only one of the MIN and MAX values is specified and the other is missing? Is it robust to the situation where both MIN and MAX are missing?
# List any `PERCEIVED_CONFLICT` values that lie outside of the range [`MIN`, ..., `MAX`].
< min.val | trait > max.val] %>%
trait[trait unique() %>%
sort()
[1] 16 17 19 20 21 24 25 26 28 29 30
The code proposed here is not robust to MIN or MAX being NA
because, for example, if MIN is NA
and MAX is 15
, in some situations the logical indexing into the trait
vector used of trait < min.val | trait > max.val
would return NA
instead of TRUE
or FALSE
as intended.
# MIN=NA, MAX=15
16 < NA | 16 > 15
[1] TRUE
# MIN=1, MAX=NA
16 < 1 | 16 > NA
[1] NA
# MIN=NA, MAX=NA
16 < NA | 16 > NA
[1] NA
This are possible steps toward writing a more robust check:
<- NA
vals.low <- NA
vals.high if (!is.na(min.val)) {
<- trait[trait < min.val]
vals.low
}if (!is.na(max.val)) {
<- trait[trait > max.val]
vals.high
}
<- c(vals.low,vals.high)
vals.OutOfRange %>%
vals.OutOfRange na.omit() %>%
unique() %>%
sort()
[1] 16 17 19 20 21 24 25 26 28 29 30
But what would the above code return if both MIN and MAX were NA
?
If we look at the minmax_check
code by typing minmax_check
at the R prompt, we see that it uses a which
when it tries to find the out-of-range values:
flagged <- dataset_na[which(dataset_na[, ind] <
range_dictionary[1] | dataset_na[, ind] >
range_dictionary[2]), , drop = FALSE]
Why is this robust to either one or both of MIN and MAX being missing?
26.3.5 Check the PREGNANT
variable
Now apply your MIN and MAX checking algorithm to the PREGNANT
variable.
%>%
DD.dict.I filter(VARNAME=="PREGNANT") %>%
select(VARNAME,MIN,MAX)
# A tibble: 1 × 3
VARNAME MIN MAX
<chr> <dbl> <dbl>
1 PREGNANT 0 1
# Read the vector of `PREGNANT` from `DS.data.I`
<- DS.data.I$PREGNANT
trait # Read the `MIN` and `MAX` values for `PREGNANT` from `DD.dict.I`
<- DD.dict.I %>%
min.val filter(VARNAME=="PREGNANT") %>%
pull(MIN)
<- DD.dict.I %>%
max.val filter(VARNAME=="PREGNANT") %>%
pull(MAX)
# List any `PREGNANT` values that lie outside of the range [`MIN`, ..., `MAX`].
< min.val | trait > max.val] %>%
trait[trait unique() %>%
sort()
[1] -9999 -4444
# Count any `PREGNANT` values that lie outside of the range [`MIN`, ..., `MAX`].
< min.val | trait > max.val] %>%
trait[trait length()
[1] 53
These out-of-range values of -9999
and -4444
look kind of strange and are unexpected given the first two entries of the VALUES
column of the data dictionary for this variable:
which(DD.dict.I=="PREGNANT"),c(1,17,18)] DD.dict.I[
# A tibble: 1 × 3
VARNAME VALUES ...18
<chr> <chr> <chr>
1 PREGNANT 0=no 1=yes
Based on this, we’d expect to see only 0
and 1
values in the PREGNANT
variable.
What’s going on?
Let’s examine some more columns of the data dictionary:
26.3.6 Handle missing values
If we further check the data dictionary for the PREGNANT
variable, we see that the out-of-range values we observed in our check above are actually missing value codes and so should not be flagged as being out of range.
which(DD.dict.I=="PREGNANT"),c(1,19,20)] DD.dict.I[
# A tibble: 1 × 3
VARNAME ...19 ...20
<chr> <chr> <chr>
1 PREGNANT -9999=missing value -4444=not applicable, participant assigned male …
Extend your algorithm to handle missing value codes. To do this first outline your approach in pseudo-code. Then implement it in R.
Possible steps:
- Read the vector of
PREGNANT
fromDS.data.I
- Read the
MIN
andMAX
values forPREGNANT
fromDD.dict.I
- Have the user provide a list of missing value codes
- Recode any
PREGNANT
value that matches one of the missing value codes to the standardNA
R missing value code. - Count and list any non-missing
PREGNANT
values that lie outside of the range [MIN
, …,MAX
].
This is essentially the approach used in the minmax_check
function of the dbGaPCheckup
R package.
# Without missing value codes specified
<- minmax_check(DD.dict.I, DS.data.I)$Information details
$Message
[1] "ERROR: some variables have values outside of the MIN to MAX range."
$Information
# A tibble: 2 × 5
Trait Check ListedMin ListedMax OutOfRangeValues
<chr> <lgl> <dbl> <dbl> <list>
1 PREGNANT FALSE 0 1 <int [2]>
2 PERCEIVED_CONFLICT FALSE 1 15 <int [11]>
1]]$OutOfRangeValues details[[
[[1]]
[1] -4444 -9999
[[2]]
[1] 25 24 16 28 17 21 30 19 26 20 29
# With missing value codes specified
# PREGNANT is no longer flagged as having out of range values.
<- minmax_check(DD.dict.I, DS.data.I, non.NA.missing.codes=c(-9999, -4444))$Information details
$Message
[1] "ERROR: some variables have values outside of the MIN to MAX range."
$Information
# A tibble: 1 × 5
Trait Check ListedMin ListedMax OutOfRangeValues
<chr> <lgl> <dbl> <dbl> <list>
1 PERCEIVED_CONFLICT FALSE 1 15 <tibble [11 × 1]>
If we examine the minmax_check
code by typing minmax_check
without parentheses at the R prompt, we see that this is how the missing value recoding step is done:
for (value in na.omit(non.NA.missing.codes)) {
dataset_na <- dataset_na %>% mutate(across(everything(),
~na_if(.x, value)))
}
26.4 References and Resources
Heinsberg LW, Weeks DE. dbGaPCheckup: pre-submission checks of dbGaP-formatted subject phenotype files. BMC Bioinformatics. 2023 Mar 3;24(1):77. PMID: 36869285 PMCID: PMC9985192 DOI: https://doi.org/10.1186/s12859-023-05200-8
Tryka KA, Hao L, Sturcke A, Jin Y, Wang ZY, Ziyabari L, Lee M, Popova N, Sharopova N, Kimura M, Feolo M. NCBI’s Database of Genotypes and Phenotypes: dbGaP. Nucleic Acids Research. 2014 Jan 1;42(D1):D975–D979. PMID: 24297256 PMCID: PMC3965052 DOI: https://doi.org/10.1093/nar/gkt1211
dbGaPCheckup: https://lwheinsberg.github.io/dbGaPCheckup/index.html
NCBI’s GaPTools: https://github.com/ncbi/gaptools