suppressMessages(library(tidyverse))
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
library(dbGaPCheckup)
## Load DD.dict.I and DS.data.I
data(ExampleI)
Data cleaning principles from Karl Broman’s slide set:
As described in:
Tryka KA, Hao L, Sturcke A, Jin Y, Wang ZY, Ziyabari L, Lee M, Popova N, Sharopova N, Kimura M, Feolo M. NCBI’s Database of Genotypes and Phenotypes: dbGaP. Nucleic Acids Research. 2014 Jan 1;42(D1):D975–D979. PMID: 24297256 PMCID: PMC3965052 DOI: https://doi.org/10.1093/nar/gkt1211
“The Database of Genotypes and Phenotypes (dbGap, http://www.ncbi.nlm.nih.gov/gap) is a National Institutes of Health-sponsored repository charged to archive, curate and distribute information produced by studies investigating the interaction of genotype and phenotype.”
Under NIH data sharing guidelines, all properly consented large-scale genetic or ’omics studies must deposit their data in dbGaP. To so, one must closely follow the formatting requirements as described in the dbGaP Submission Guide:
https://www.ncbi.nlm.nih.gov/gap/docs/submissionguide/
This involves setting up a data dictionary that follows dbGaP specifications:
In the data dictionary, for some variables, MIN
and MAX
values may be specified. For example, for age, it has a natural minimum of zero.
MIN The logical minimum value of the variable. If a separate code such as -1 is used for a missing field, this should not be considered as the MIN value.
MAX The logical maximum value for the variable. If a separate code such as 9999 is used for a missing field, this should not be considered as the MAX value.
Task: Design and implement a check that the specified MIN
and MAX
values observed in the data are consistent with the values as specified in the data dictionary.
suppressMessages(library(tidyverse))
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
library(dbGaPCheckup)
## Load DD.dict.I and DS.data.I
data(ExampleI)
Using DD.dict.I
and DS.data.I
, check the PERCEIVED_CONFLICT
variable to see if all the values fall within the stated MIN
and MAX
values.
%>%
DD.dict.I filter(VARNAME=="PERCEIVED_CONFLICT") %>%
select(VARNAME,MIN,MAX)
# A tibble: 1 × 3
VARNAME MIN MAX
<chr> <dbl> <dbl>
1 PERCEIVED_CONFLICT 1 15
First try to write out an algorithm for this Minimum and Maximum Values Check in pseudo-code, outlining each step.
Possible steps:
PERCEIVED_CONFLICT
from DS.data.I
MIN
and MAX
values for PERCEIVED_CONFLICT
from DD.dict.I
PERCEIVED_CONFLICT
values that lie outside of the range [MIN
, …, MAX
].Implement your algorithm in code.
# Read the vector of `PERCEIVED_CONFLICT` from `DS.data.I`
<- DS.data.I$PERCEIVED_CONFLICT
trait # Read the `MIN` and `MAX` values for `PERCEIVED_CONFLICT` from `DD.dict.I`
<- DD.dict.I %>%
min.val filter(VARNAME=="PERCEIVED_CONFLICT") %>%
pull(MIN)
<- DD.dict.I %>%
max.val filter(VARNAME=="PERCEIVED_CONFLICT") %>%
pull(MAX)
# List any `PERCEIVED_CONFLICT` values that lie outside of the range [`MIN`, ..., `MAX`].
< min.val | trait > max.val] %>%
trait[trait unique() %>%
sort()
[1] 16 17 19 20 21 24 25 26 28 29 30
# Count any `PERCEIVED_CONFLICT` values that lie outside of the range [`MIN`, ..., `MAX`].
< min.val | trait > max.val] %>%
trait[trait length()
[1] 44
These results are consistent with those returned by the minmax_check
function from the dbGaPCheckup
R package:
<- minmax_check(DD.dict.I, DS.data.I, non.NA.missing.codes=c(-9999, -4444))$Information details
$Message
[1] "ERROR: some variables have values outside of the MIN to MAX range."
$Information
# A tibble: 1 × 5
Trait Check ListedMin ListedMax OutOfRangeValues
<chr> <lgl> <dbl> <dbl> <list>
1 PERCEIVED_CONFLICT FALSE 1 15 <tibble [11 × 1]>
<- details[[1]]$OutOfRangeValues[[1]]
df.tmp order(df.tmp$PERCEIVED_CONFLICT),] df.tmp[
# A tibble: 11 × 1
PERCEIVED_CONFLICT
<int>
1 16
2 17
3 19
4 20
5 21
6 24
7 25
8 26
9 28
10 29
11 30
After implementing your algorithm in R code, think about it a bit further - is it robust to the situation where only one of the MIN and MAX values is specified and the other is missing? Is it robust to the situation where both MIN and MAX are missing?
# List any `PERCEIVED_CONFLICT` values that lie outside of the range [`MIN`, ..., `MAX`].
< min.val | trait > max.val] %>%
trait[trait unique() %>%
sort()
[1] 16 17 19 20 21 24 25 26 28 29 30
The code proposed here is not robust to MIN or MAX being NA
because, for example, if MIN is NA
and MAX is 15
, in some situations the logical indexing into the trait
vector used of trait < min.val | trait > max.val
would return NA
instead of TRUE
or FALSE
as intended.
# MIN=NA, MAX=15
16 < NA | 16 > 15
[1] TRUE
# MIN=1, MAX=NA
16 < 1 | 16 > NA
[1] NA
# MIN=NA, MAX=NA
16 < NA | 16 > NA
[1] NA
This are possible steps toward writing a more robust check:
<- NA
vals.low <- NA
vals.high if (!is.na(min.val)) {
<- trait[trait < min.val]
vals.low
}if (!is.na(max.val)) {
<- trait[trait > max.val]
vals.high
}
<- c(vals.low,vals.high)
vals.OutOfRange %>%
vals.OutOfRange na.omit() %>%
unique() %>%
sort()
[1] 16 17 19 20 21 24 25 26 28 29 30
But what would the above code return if both MIN and MAX were NA
?
If we look at the minmax_check
code by typing minmax_check
at the R prompt, we see that it uses a which
when it tries to find the out-of-range values:
flagged <- dataset_na[which(dataset_na[, ind] <
range_dictionary[1] | dataset_na[, ind] >
range_dictionary[2]), , drop = FALSE]
Why is this robust to either one or both of MIN and MAX being missing?
PREGNANT
variableNow apply your MIN and MAX checking algorithm to the PREGNANT
variable.
%>%
DD.dict.I filter(VARNAME=="PREGNANT") %>%
select(VARNAME,MIN,MAX)
# A tibble: 1 × 3
VARNAME MIN MAX
<chr> <dbl> <dbl>
1 PREGNANT 0 1
# Read the vector of `PREGNANT` from `DS.data.I`
<- DS.data.I$PREGNANT
trait # Read the `MIN` and `MAX` values for `PREGNANT` from `DD.dict.I`
<- DD.dict.I %>%
min.val filter(VARNAME=="PREGNANT") %>%
pull(MIN)
<- DD.dict.I %>%
max.val filter(VARNAME=="PREGNANT") %>%
pull(MAX)
# List any `PREGNANT` values that lie outside of the range [`MIN`, ..., `MAX`].
< min.val | trait > max.val] %>%
trait[trait unique() %>%
sort()
[1] -9999 -4444
# Count any `PREGNANT` values that lie outside of the range [`MIN`, ..., `MAX`].
< min.val | trait > max.val] %>%
trait[trait length()
[1] 53
These out-of-range values of -9999
and -4444
look kind of strange and are unexpected given the first two entries of the VALUES
column of the data dictionary for this variable:
which(DD.dict.I=="PREGNANT"),c(1,17,18)] DD.dict.I[
# A tibble: 1 × 3
VARNAME VALUES ...18
<chr> <chr> <chr>
1 PREGNANT 0=no 1=yes
Based on this, we’d expect to see only 0
and 1
values in the PREGNANT
variable.
What’s going on?
Let’s examine some more columns of the data dictionary:
If we further check the data dictionary for the PREGNANT
variable, we see that the out-of-range values we observed in our check above are actually missing value codes and so should not be flagged as being out of range.
which(DD.dict.I=="PREGNANT"),c(1,19,20)] DD.dict.I[
# A tibble: 1 × 3
VARNAME ...19 ...20
<chr> <chr> <chr>
1 PREGNANT -9999=missing value -4444=not applicable, participant assigned male …
Extend your algorithm to handle missing value codes. To do this first outline your approach in pseudo-code. Then implement it in R.
Possible steps:
PREGNANT
from DS.data.I
MIN
and MAX
values for PREGNANT
from DD.dict.I
PREGNANT
value that matches one of the missing value codes to the standard NA
R missing value code.PREGNANT
values that lie outside of the range [MIN
, …, MAX
].This is essentially the approach used in the minmax_check
function of the dbGaPCheckup
R package.
# Without missing value codes specified
<- minmax_check(DD.dict.I, DS.data.I)$Information details
$Message
[1] "ERROR: some variables have values outside of the MIN to MAX range."
$Information
# A tibble: 2 × 5
Trait Check ListedMin ListedMax OutOfRangeValues
<chr> <lgl> <dbl> <dbl> <list>
1 PREGNANT FALSE 0 1 <int [2]>
2 PERCEIVED_CONFLICT FALSE 1 15 <int [11]>
1]]$OutOfRangeValues details[[
[[1]]
[1] -4444 -9999
[[2]]
[1] 25 24 16 28 17 21 30 19 26 20 29
# With missing value codes specified
# PREGNANT is no longer flagged as having out of range values.
<- minmax_check(DD.dict.I, DS.data.I, non.NA.missing.codes=c(-9999, -4444))$Information details
$Message
[1] "ERROR: some variables have values outside of the MIN to MAX range."
$Information
# A tibble: 1 × 5
Trait Check ListedMin ListedMax OutOfRangeValues
<chr> <lgl> <dbl> <dbl> <list>
1 PERCEIVED_CONFLICT FALSE 1 15 <tibble [11 × 1]>
If we examine the minmax_check
code by typing minmax_check
without parentheses at the R prompt, we see that this is how the missing value recoding step is done:
for (value in na.omit(non.NA.missing.codes)) {
dataset_na <- dataset_na %>% mutate(across(everything(),
~na_if(.x, value)))
}
Heinsberg LW, Weeks DE. dbGaPCheckup: pre-submission checks of dbGaP-formatted subject phenotype files. BMC Bioinformatics. 2023 Mar 3;24(1):77. PMID: 36869285 PMCID: PMC9985192 DOI: https://doi.org/10.1186/s12859-023-05200-8
Tryka KA, Hao L, Sturcke A, Jin Y, Wang ZY, Ziyabari L, Lee M, Popova N, Sharopova N, Kimura M, Feolo M. NCBI’s Database of Genotypes and Phenotypes: dbGaP. Nucleic Acids Research. 2014 Jan 1;42(D1):D975–D979. PMID: 24297256 PMCID: PMC3965052 DOI: https://doi.org/10.1093/nar/gkt1211
dbGaPCheckup: https://lwheinsberg.github.io/dbGaPCheckup/index.html
NCBI’s GaPTools: https://github.com/ncbi/gaptools