4  Active Learning and Readings

4.1 Introduction and Overview

4.1.1 Learning Objectives

  • Review the syllabus
  • Describe bioinformatics and genetic/genomic data
  • Describe dbGaP, an important genomic data repository

4.1.2 Required Reading

Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, Hao L, Kiang A, Paschall J, Phan L, Popova N, Pretel S, Ziyabari L, Lee M, Shao Y, Wang ZY, Sirotkin K, Ward M, Kholodov M, Zbicz K, Beck J, Kimelman M, Shevelev S, Preuss D, Yaschenko E, Graeff A, Ostell J, Sherry ST. The NCBI dbGaP database of genotypes and phenotypes. Nat Genet. 2007 Oct;39(10):1181-6. doi: 10.1038/ng1007-1181. PMID: 17898773; PMCID: PMC2031016. https://pubmed.ncbi.nlm.nih.gov/17898773/

4.1.3 Suggested Readings

Barnes (2007) Chapter 1

Carey MA, Papin JA. Ten simple rules for biologists learning to program. PLoS Comput Biol. 2018;14(1):e1005871. https://doi.org/10.1371/journal.pcbi.1005871

Dudley JT, Butte AJ. A quick guide for developing effective bioinformatics programming skills. PLoS Comput Biol. 2009;5(12):e1000589. https://doi.org/10.1371/journal.pcbi.1000589

4.2 GitHub

4.2.1 Learning Objectives

  • To learn how to use GitHub
  • To learn how to use GitHub Classroom
  • To learn how to use GitHub within RStudio

4.2.2 Online Lecture

GitHub Introduction: https://danieleweeks.github.io/HuGen2071/gitIntro.html

4.2.3 Active Learning

Version Control with git and GitHub (Sections 4.1 - 4.4): https://learning.nceas.ucsb.edu/2020-11-RRCourse/session-4-version-control-with-git-and-github.html

4.2.4 Required Readings

GitHub Classroom Guide for Students

To set up GitHub Classroom, please follow the steps to set up RStudio, R, and git in this detailed guide: https://github.com/jfiksel/github-classroom-for-students

Choose your GitHub user name carefully, as later in your career you may end up using it in a professional context.

Be sure to generate an SSH key so you don’t need to enter your password every time you interact with GitHub.

Warning

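Do not clone your repository into a OneDrive or other cloud-synced folder, as git does not work reliably there. Cloud sync clients keep their own copies of files and may alter or lock files inside the repository's .git directory while git is using them, which can leave the repository in an inconsistent state.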

4.2.5 Suggested Readings

Happy Git and GitHub for the useR. https://happygitwithr.com/

Perez-Riverol Y, Gatto L, Wang R, et al. Ten Simple Rules for Taking Advantage of Git and GitHub. PLoS Comput Biol. 2016;12(7):e1004947. https://doi.org/10.1371/journal.pcbi.1004947

Version Control with Git: https://swcarpentry.github.io/git-novice/

Using Git from RStudio: https://ucsbcarpentry.github.io/2020-08-10-Summer-GitBash/24-supplemental-rstudio/index.html

4.3 R: Basics

4.3.1 Learning Objectives

  • To become familiar with the R language and concepts
  • To learn how to read and write data with R
  • To learn control flow: choices and loops
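As a preview of these objectives, here is a minimal sketch in base R that writes a small table to disk, reads it back, and then uses a loop with an if/else choice. The data frame, its column names, and the labels are hypothetical and chosen only for illustration.

```r
# Hypothetical example data: three samples with an allele count each.
samples <- data.frame(
  id = c("s1", "s2", "s3"),
  count = c(0L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Write the data to a temporary CSV file, then read it back in.
path <- tempfile(fileext = ".csv")
write.csv(samples, path, row.names = FALSE)
roundtrip <- read.csv(path, stringsAsFactors = FALSE)

# Control flow: a for loop with an if/else choice inside it.
labels <- character(nrow(roundtrip))
for (i in seq_len(nrow(roundtrip))) {
  if (roundtrip$count[i] == 0) {
    labels[i] <- "none"
  } else {
    labels[i] <- "some"
  }
}
roundtrip$label <- labels
```

Using tempfile() keeps the example from writing into your project folder; in real work you would supply your own file path.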

4.3.2 Online Lectures

R Basics: https://danieleweeks.github.io/HuGen2071/RBasicsLecture.html

4.3.3 Active Learning

https://datacarpentry.org/genomics-r-intro/01-r-basics.html

4.3.4 Suggested Readings

Buffalo (2015) Chapter 8 ‘R Language Basics’ (Available online through PittCat+)

Read the first four sections, up to the end of ‘Vectors, Vectorization, and Indexing’

https://pitt.primo.exlibrisgroup.com/permalink/01PITT_INST/i25aoe/cdi_askewsholts_vlebooks_9781449367510

https://datacarpentry.org/R-genomics/01-intro-to-R.html

Supplementary Reading: Spector (2008) Chapters 1 & 2 (Available online through PittCat+; link in syllabus)

4.4 R: Factors, Dates, Subscripting

4.4.1 Learning Objectives

  • To learn how to subset data with R
  • To learn how to handle factors and dates with R
  • To learn how to manipulate characters with R

4.4.2 Online Lecture

R: factors, subscripting: https://danieleweeks.github.io/HuGen2071/RFactors.html

4.4.3 Active Learning

Subsetting: https://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting.html. This uses the gapminder data from here.

Factors: https://swcarpentry.github.io/r-novice-inflammation/12-supp-factors.html. This uses data from this Zip file.

4.4.4 Suggested Readings

Buffalo (2015) Chapter 8 ‘R Language Basics’ (Available online through PittCat+)

Read the ‘Factors and classes in R’ subsection at the end of the ‘Vectors, Vectorization, and Indexing’ section.

Read the ‘Exploring Data Through Slicing and Dicing: Subsetting Dataframes’ section.

Read the ‘Working with Strings’ section.

https://pitt.primo.exlibrisgroup.com/permalink/01PITT_INST/i25aoe/cdi_askewsholts_vlebooks_9781449367510

https://datacarpentry.org/R-ecology-lesson/02-starting-with-data.html

Supplementary Readings: Spector (2008) Chapters 4, 5, 6

4.5 R: Character Manipulation

4.5.1 Learning Objectives

  • To learn how to handle character data in R
  • To learn how to use regular expressions in R
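To make these objectives concrete, here is a small sketch that uses base R's grepl() and sub() with regular expressions. The identifiers are invented, and the patterns are only one reasonable way to match them; the readings below cover the stringr equivalents.

```r
# Hypothetical variant identifiers mixing rsIDs and chr:position labels.
ids <- c("rs12345", "chr1:10583", "rs987", "chrX:2700157", "sample_7")

# grepl(): which elements look like an rsID ("rs" followed only by digits)?
is_rsid <- grepl("^rs[0-9]+$", ids)

# Which elements look like "chr<name>:<position>"?
pos_pattern <- "^chr([0-9XYM]+):([0-9]+)$"
is_pos <- grepl(pos_pattern, ids)

# sub() with capture groups: \\1 is the chromosome, \\2 is the position.
chrom <- sub(pos_pattern, "\\1", ids[is_pos])
pos <- as.integer(sub(pos_pattern, "\\2", ids[is_pos]))
```

Anchoring the pattern with ^ and $ forces the whole string to match, so "sample_7" is not mistaken for either kind of identifier.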

4.5.2 Active Learning

Regular expressions: https://csiro-data-school.github.io/regex/08-r-regexs/index.html

4.5.3 Required Readings

Read the chapter on “Strings” in “R for Data Science”: https://r4ds.hadley.nz/strings

4.5.4 Suggested Readings

See the “String manipulation with stringr cheatsheet” at https://rstudio.github.io/cheatsheets/html/strings.html

Buffalo (2015) Chapter 8 ‘R Language Basics’ (Available online through PittCat+)

Read the ‘Working with Strings’ section at the end of the “Working with and Visualizing Data in R” section.

https://pitt.primo.exlibrisgroup.com/permalink/01PITT_INST/i25aoe/cdi_askewsholts_vlebooks_9781449367510

Read the chapter on “Regular expressions” in “R for Data Science”: https://r4ds.hadley.nz/regexps

Supplementary Reading: Spector (2008) Chapter 7

4.6 R: Loops and Flow Control

4.6.1 Learning Objectives

  • To learn how to implement loops in R
  • To learn how to control flow in R
  • To learn how to vectorize operations
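The difference between looping and vectorizing can be seen in a short sketch. The genotype vector is hypothetical; the point is that the loop and the vectorized expression compute the same result.

```r
# Hypothetical input: allele counts (0, 1, or 2) for five individuals.
geno <- c(0, 1, 2, 1, 0)

# Loop version: preallocate the result, then fill it one element at a time.
dosage_loop <- numeric(length(geno))
for (i in seq_along(geno)) {
  dosage_loop[i] <- geno[i] / 2
}

# Vectorized version: the same arithmetic applied to the whole vector at once.
dosage_vec <- geno / 2

# A while loop with a stopping condition: find the first value equal to 2.
# If no element equals 2, i ends up as length(geno) + 1.
i <- 1
while (i <= length(geno) && geno[i] != 2) {
  i <- i + 1
}
first_two <- i
```

The vectorized form is usually shorter and faster in R, so loops are best kept for cases where each step depends on the previous one.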

4.6.2 Online Lectures

Loops in R: https://danieleweeks.github.io/HuGen2071/RLoops.html

4.6.3 Active Learning

Flow control and loops: https://swcarpentry.github.io/r-novice-gapminder/07-control-flow.html

Loops in R, Part I: https://danieleweeks.github.io/HuGen2071/loops.html

Vectorization: https://swcarpentry.github.io/r-novice-gapminder/09-vectorization.html

4.7 R: Functions and Packages, Debugging R

4.7.1 Learning Objectives

  • To learn how to write R functions and packages
  • To learn how to debug R code
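As an illustration of the first objective and of defensive coding that makes debugging easier, here is a minimal sketch. The function name maf and its input coding (0/1/2 allele counts) are assumptions made for this example, not course-provided code.

```r
# Hypothetical helper: minor allele frequency from 0/1/2 genotype counts.
maf <- function(geno) {
  # Fail early with a clear message instead of returning a misleading value.
  if (!is.numeric(geno)) {
    stop("geno must be numeric")
  }
  stopifnot(all(geno %in% c(0, 1, 2, NA)))
  p <- mean(geno, na.rm = TRUE) / 2   # frequency of the counted allele
  min(p, 1 - p)
}

# tryCatch() captures an error so a script can report it and continue.
result <- tryCatch(
  maf(c("A", "B")),
  error = function(e) conditionMessage(e)
)

# A valid call: missing genotypes are ignored when computing the mean.
example_maf <- maf(c(0, 1, 2, 1, NA))
```

When a function fails unexpectedly, base R's traceback(), debug(), and browser() can be used to find where the error arose and to step through the function body interactively.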

4.7.2 Active Learning

https://swcarpentry.github.io/r-novice-gapminder/10-functions.html

4.7.3 Suggested Readings

Functions Explained: https://swcarpentry.github.io/r-novice-gapminder/10-functions.html

Buffalo (2015) Chapter 8: Read the section ‘Digression: Debugging R Code’

4.8 R: Tidyverse

4.8.1 Learning Objectives

  • To learn how to use the pipe operator
  • To learn how to use Tidyverse functions

4.8.2 Active Learning

https://datacarpentry.org/genomics-r-intro/05-dplyr.html

The data file used in this is the combined_tidy_vcf.csv file that can be downloaded from here.

4.8.3 Suggested Readings

Introduction to the Tidyverse: Manipulating tibbles with dplyr https://uomresearchit.github.io/r-day-workshop/04-dplyr/

Supplementary Reading: Buffalo (2015) Chapter 8: section ‘Exploring Dataframes with dplyr’

4.9 R: Recoding and Reshaping Data

4.9.1 Learning Objectives

  • To learn how to reformat and reshape data in R
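As a rough preview, the sketch below reshapes a small table from wide to long format and recodes values. It uses base R's reshape(), ifelse(), and a named-vector lookup so that it runs without extra packages; the linked lessons may use different (tidyverse) functions for the same tasks. All column names, codes, and the cutoff of 130 are hypothetical.

```r
# Hypothetical wide data: one row per person, one column per visit.
wide <- data.frame(
  id = c("a", "b"),
  bp_v1 = c(120, 135),
  bp_v2 = c(118, 140),
  stringsAsFactors = FALSE
)

# Reshape wide -> long: one row per person per visit.
long <- reshape(
  wide,
  direction = "long",
  varying = c("bp_v1", "bp_v2"),  # columns to stack
  v.names = "bp",                 # name of the stacked value column
  timevar = "visit",              # name of the new visit column
  times = c(1, 2),                # visit label for each stacked column
  idvar = "id"
)

# Recoding a numeric variable into categories with ifelse().
long$bp_cat <- ifelse(long$bp >= 130, "high", "normal")

# Recoding numeric codes into labels with a named-vector lookup.
sex_code <- c(1, 2, 2)
sex_label <- unname(c("1" = "male", "2" = "female")[as.character(sex_code)])
```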

4.9.2 Active Learning

Reshaping data https://sscc.wisc.edu/sscc/pubs/dwr/reshape-tidy.html

Recoding data: Pay particular attention to the Recoding values and Creating new variables sections

https://librarycarpentry.org/lc-r/03-data-cleaning-and-transformation.html

4.9.3 Suggested Readings

Supplementary Reading: Spector (2008) Chapters 8 & 9

4.10 R: Merging Data

4.10.1 Learning Objectives

  • To learn how to use the R ‘merge’ command
  • To learn how to use the R Tidyverse join commands
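The base R merge() command can be sketched on two small, made-up tables that share an id column. The table and column names are hypothetical. In dplyr, the three calls below correspond to inner_join(), left_join(), and full_join().

```r
# Hypothetical phenotype and genotype tables with partly overlapping ids.
pheno <- data.frame(id = c("s1", "s2", "s3"), bmi = c(22.5, 27.1, 30.2),
                    stringsAsFactors = FALSE)
geno <- data.frame(id = c("s2", "s3", "s4"), rs1 = c(0, 1, 2),
                   stringsAsFactors = FALSE)

# Inner join: keep only ids present in both tables.
inner <- merge(pheno, geno, by = "id")

# Left join: keep every row of pheno; unmatched rs1 values become NA.
left <- merge(pheno, geno, by = "id", all.x = TRUE)

# Full join: keep every id from either table.
full <- merge(pheno, geno, by = "id", all = TRUE)
```

After any merge it is worth checking the number of rows in the result, since duplicated keys in either table multiply rows silently.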

4.10.2 Active Learning

https://mikoontz.github.io/data-carpentry-week/lesson_joins.html

continents.RDA data set used near the end of this Active Learning exercise: https://mikoontz.github.io/data-carpentry-week/data/continents.RDA

4.10.3 Required Reading

Tidy Animated Verbs https://www.garrickadenbuie.com/project/tidyexplain/

4.10.4 Suggested Readings

https://mikoontz.github.io/data-carpentry-week/lesson_joins.html#practice_with_joins_using_gapminder

Supplementary Reading: Buffalo (2015) Chapter 8 ‘Merging and Combining Data’. Spector (2008) Chapter 9.

4.11 R: Traditional Graphics & Advanced Graphics

4.11.1 Learning Objectives

  • To learn the basic graphics commands of R
  • To learn the R graphing package ggplot2

4.11.2 Active Learning

Data visualization with ggplot2: https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html

To create the required data for this “Data visualization with ggplot2” exercise, run this code:

library(tidyverse)
download.file(url = "https://ndownloader.figshare.com/files/2292169",
              destfile = "portal_data_joined.csv")
surveys <- read_csv("portal_data_joined.csv")
surveys_complete <- surveys %>%
  filter(!is.na(weight),           # remove missing weight
         !is.na(hindfoot_length),  # remove missing hindfoot_length
         !is.na(sex))              # remove missing sex

4.11.3 Suggested Readings

Plotting with ggplot2 https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html

Supplementary Reading: Wickham (2009) Chapters 2 & 3

4.12 R: Exploratory Data Analysis

4.12.1 Learning Objectives

  • To learn how to summarize data frames
  • To learn how to visualize missing data patterns
  • To learn how to visualize covariation
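Before any plotting, the same questions can be asked numerically. The sketch below, on an invented data frame, summarizes each column, counts missing values, and measures covariation between two variables with a correlation. The variable names and values are hypothetical.

```r
# Hypothetical data frame with a few missing values.
df <- data.frame(
  age = c(34, 51, NA, 45, 62),
  bmi = c(22.1, 27.4, 30.0, NA, 29.3),
  sbp = c(118, 131, 140, 125, 138)
)

# Summarize each column (min, quartiles, mean, max, number of NAs).
col_summary <- summary(df)

# Missing data pattern: missing values per column, and fully observed rows.
missing_per_column <- colSums(is.na(df))
complete_rows <- sum(complete.cases(df))

# Covariation: correlation using only rows where both values are observed.
r_bmi_sbp <- cor(df$bmi, df$sbp, use = "complete.obs")
```

The counts in missing_per_column are the numbers that a missing-data plot displays graphically.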

4.12.2 Active Learning

Exploratory analysis of RNAseq count data https://tavareshugo.github.io/data-carpentry-rnaseq/02_rnaseq_exploratory.html

4.12.3 Readings

Missing value visualization with tidyverse in R https://towardsdatascience.com/missing-value-visualization-with-tidyverse-in-r-a9b0fefd2246

Suggested Reading: Buffalo (2015) Chapter 8, sections:

  • Exploring Data Visually with ggplot2 I: Scatterplots and Densities
  • Exploring Data Visually with ggplot2 II: Smoothing
  • Binning Data with cut() and Bar Plots with ggplot2
  • Using ggplot2 Facets

4.13 R: Genomic Ranges; Interactive Graphics

4.13.1 Learning Objectives - Genomic Ranges

  • To learn about Genomic Ranges
  • To learn to use Genomic Ranges to annotate SNPs of interest
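The idea behind annotating SNPs with ranges can be sketched in base R before the Bioconductor tools are introduced. This is not the GenomicRanges API: it is a plain data-frame illustration of the overlap query that functions such as findOverlaps() perform far more efficiently. The gene and SNP tables are made up, and coordinates are assumed to be 1-based with inclusive ends.

```r
# Hypothetical gene ranges (1-based, inclusive start and end).
genes <- data.frame(
  gene = c("geneA", "geneB", "geneC"),
  chrom = c("chr1", "chr1", "chr2"),
  start = c(100, 500, 100),
  end = c(300, 900, 400),
  stringsAsFactors = FALSE
)

# Hypothetical SNPs of interest.
snps <- data.frame(
  snp = c("snp1", "snp2", "snp3"),
  chrom = c("chr1", "chr1", "chr2"),
  pos = c(150, 400, 100),
  stringsAsFactors = FALSE
)

# For one SNP, return the gene(s) whose range contains it, or NA if none.
annotate_snp <- function(chrom, pos) {
  hit <- genes$gene[genes$chrom == chrom & genes$start <= pos & pos <= genes$end]
  if (length(hit) == 0) NA_character_ else paste(hit, collapse = ";")
}

# Apply the lookup to every SNP.
snps$gene <- mapply(annotate_snp, snps$chrom, snps$pos, USE.NAMES = FALSE)
```

This approach compares every SNP with every gene, which is fine for a handful of SNPs but too slow genome-wide; that is the problem range-based data structures are designed to solve.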

4.13.2 Preparation - Genomic Ranges

Before class, install these Bioconductor packages: (1) TxDb.Hsapiens.UCSC.hg19.knownGene, and (2) org.Hs.eg.db

To install these, use these commands:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")

BiocManager::install("org.Hs.eg.db")

4.13.3 Required Reading - Genomic Ranges

An Introduction to Bioconductor’s Packages for Working with Range Data

https://github.com/vsbuffalo/genomicranges-intro/blob/master/notes.md

4.13.4 Active Learning - Genomic Ranges

Working with genomics ranges

https://carpentries-incubator.github.io/bioc-project/07-genomic-ranges.html

4.13.5 Suggested Readings - Genomic Ranges

In “Bioinformatics Data Skills”, see Chapter 9 “Working with Range Data”

Bioinformatics Data Skills
Editor: Vince Buffalo
Publisher: O’Reilly

Hello Ranges: An Introduction to Analyzing Genomic Ranges in R.

4.13.6 Learning Objectives - Interactive Graphics

  • To learn how to use interactive and dynamic graphics to explore your data more thoroughly
  • To learn to use plotly

4.13.7 Required Reading - Interactive Graphics

Create interactive ggplot2 graphs with plotly https://www.littlemissdata.com/blog/interactiveplots

4.13.8 Suggested Reading - Interactive Graphics

Wickham (2009) Chapters 2 & 3

4.15 Data Quality Checking and Filters

4.15.1 Learning Objectives

  • To learn the principles of data cleaning
  • To practice applying data cleaning principles
  • To learn how to check genotype data for quality
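Two of the most common genotype quality checks, call rate and minor allele frequency (MAF), can be sketched in a few lines of base R. The genotype matrix below is invented, genotypes are assumed to be coded as 0/1/2 allele counts with NA for missing calls, and the thresholds are illustrative only, not recommended values.

```r
# Hypothetical genotype matrix: rows are individuals, columns are SNPs.
geno <- matrix(
  c(0, 0, 2, NA,
    0, 0, 0, 0,
    1, 0, NA, 2,
    2, 0, 1, NA),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ind", 1:4), paste0("snp", 1:4))
)

# Call rate: proportion of non-missing genotypes per SNP and per sample.
snp_call_rate <- colMeans(!is.na(geno))
sample_call_rate <- rowMeans(!is.na(geno))

# Minor allele frequency from the mean allele count at each SNP.
counted_freq <- colMeans(geno, na.rm = TRUE) / 2
maf <- pmin(counted_freq, 1 - counted_freq)

# Filter SNPs using illustrative (not recommended) thresholds.
keep <- snp_call_rate >= 0.75 & maf >= 0.05
filtered <- geno[, keep, drop = FALSE]
```

Here snp2 is removed because it is monomorphic and snp4 because half of its genotypes are missing. The readings below describe how such thresholds are chosen in practice.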

4.15.2 Active Learning

To see an example of quality control for SNP genotyping using Illumina genotyping microarrays, please read through this example report:

https://khp-informatics.github.io/COPILOT/README_summary_report.html

For more details, see this Current Protocols paper, which is long and detailed, but you can get most of the main points by concentrating on the Figures:

Patel H, Lee S-H, Breen G, Menzel S, Ojewunmi O, Dobson RJB. The COPILOT Raw Illumina Genotyping QC Protocol. Current Protocols. 2022;2(4):e373. PMID: 35452565 DOI: https://doi.org/10.1002/cpz1.373

4.15.3 Suggested Readings

Kässens JC, Wienbrandt L, Ellinghaus D. BIGwas: Single-command quality control and association testing for multi-cohort and biobank-scale GWAS/PheWAS data. GigaScience. 2021 Jun 1;10(6):giab047. PMID: 34184051 PMCID: PMC8239664 DOI: https://doi.org/10.1093/gigascience/giab047

Brandenburg J-T, Clark L, Botha G, Panji S, Baichoo S, Fields C, Hazelhurst S. H3AGWAS: a portable workflow for genome wide association studies. BMC Bioinformatics. 2022 Nov 19;23(1):498. PMID: 36402955 PMCID: PMC9675212 DOI: https://doi.org/10.1186/s12859-022-05034-w

Anderson CA, Pettersson FH, Clarke GM, Cardon LR, Morris AP, Zondervan KT. Data quality control in genetic case-control association studies. Nat Protoc. 2010 Sep;5(9):1564–1573. DOI: https://doi.org/10.1038/nprot.2010.116

Laurie CC, Doheny KF, Mirel DB, Pugh EW, Bierut LJ, Bhangale T, Boehm F, Caporaso NE, Cornelis MC, Edenberg HJ, Gabriel SB, Harris EL, Hu FB, Jacobs KB, Kraft P, Landi MT, Lumley T, Manolio TA, McHugh C, Painter I, Paschall J, Rice JP, Rice KM, Zheng X, Weir BS, GENEVA Investigators. Quality control and quality assurance in genotypic data for genome-wide association studies. Genetic epidemiology. 2010 Sep;34(6):591–602. PMID: 20718045 DOI: https://doi.org/10.1002/gepi.20516

4.16 Unix: Basics

4.16.1 Learning Objectives

  • To learn basic Unix commands

4.16.2 Preparation

  • Watch the online lecture and do the Active Learning before class

  • Do the Unix setup homework assignment

4.16.3 Online Lecture

Unix Basics: https://danieleweeks.github.io/HuGen2071/unix_basics.html

4.16.4 Active Learning

Software Carpentry Unix Shell intro parts 1-3 https://swcarpentry.github.io/shell-novice/

4.16.5 Required Reading

See Active Learning.

4.16.6 Suggested Reading

Buffalo (2015) Chapter 2. Setting up and managing a bioinformatics project.

Buffalo (2015) Chapter 3. Remedial Unix Shell (beginning of chapter up to, but not including, “working with streams and redirection”)

Terminus, a web-based game for learning and practicing basic Unix commands https://web.mit.edu/mprat/Public/web/Terminus/Web/main.html

“Chapter 43: Redirecting Input and Output” in Unix Power Tools, 3rd Edition by Jerry Peek, Shelley Powers, Tim O’Reilly, Mike Loukides. Published by O’Reilly Media, Inc. https://pitt.primo.exlibrisgroup.com/permalink/01PITT_INST/e8h8hp/alma9998520758606236

4.17 Unix: Streams, Pipes, Scripts

4.17.1 Learning Objectives

  • To learn how streams operate in Unix
  • To learn how to pass streamed data from program to program in Unix
  • To learn how to interact with running processes
  • To learn how to write a script that can run in Unix
  • To learn about the cluster and how to submit jobs there

4.17.2 Preparation

  • Watch the online lecture and do the Active Learning before class

4.17.3 Online Lecture

Unix: Streams, Pipes, Scripts: https://danieleweeks.github.io/HuGen2071/unix_streams_pipes_scripts.html

4.17.4 Active Learning

Software Carpentry Unix Shell intro parts 4 and 6 https://swcarpentry.github.io/shell-novice/

4.17.5 Required Reading

See Active Learning.

4.17.6 Suggested Reading

Buffalo (2015) Chapter 3. Remedial Unix Shell (from “working with streams and redirection” up to, but not including, “command substitution”)

4.18 Genetic Data Structures

4.18.1 Learning Objectives

By the end of the learning activities on this topic, students will be able to:

  • Implement a method for storing relationship information about individuals.
  • Implement a method for distinguishing between social gender, biological sex, and sex chromosome complement.
  • Create a data set that can store basic pedigree information “by hand.”
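One possible way to meet these objectives is a data frame with one row per person, where parents are recorded by their ids and where gender, sex assigned at birth, and sex chromosome complement are kept in separate columns. The sketch below is only an illustration: the column names and the example family are hypothetical, and other layouts are equally valid.

```r
# Hypothetical nuclear family: two founders (1, 2) and two children (3, 4).
ped <- data.frame(
  famid = "F1",
  id = c("1", "2", "3", "4"),
  father = c(NA, NA, "1", "1"),        # NA marks an unknown/founder parent
  mother = c(NA, NA, "2", "2"),
  sex_assigned = c("male", "female", "female", "male"),
  gender = c("man", "woman", "woman", "nonbinary"),
  sex_chrom = c("XY", "XX", "XX", "XXY"),
  stringsAsFactors = FALSE
)

# Founders have no recorded parents.
founders <- ped$id[is.na(ped$father) & is.na(ped$mother)]

# Consistency check: every named parent must also appear as an individual.
parents_ok <- all(na.omit(c(ped$father, ped$mother)) %in% ped$id)

# Relationships are derived from the parent columns, e.g. full siblings.
full_sibs <- function(ped, who) {
  me <- ped[ped$id == who, ]
  if (is.na(me$father) || is.na(me$mother)) return(character(0))
  ped$id[ped$id != who & ped$father %in% me$father & ped$mother %in% me$mother]
}
sibs_of_3 <- full_sibs(ped, "3")
```

Storing the three sex-related attributes separately means none of them has to be inferred from another, which matches the distinction drawn in the readings below.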

4.18.2 Readings

Bennett RL, Steinhaus KA, Uhrich SB, O’Sullivan CK, Resta RG, Lochner-Doyle D, Markel DS, Vincent V, Hamanishi J. Recommendations for standardized human pedigree nomenclature. J Genet Couns. 1995 Dec;4(4):267-79. https://doi.org/10.1007/BF01408073. PMID: 24234481.

Bennett RL, French KS, Resta RG, Doyle DL. Standardized human pedigree nomenclature: update and assessment of the recommendations of the National Society of Genetic Counselors. J Genet Couns. 2008 Oct;17(5):424-33. https://doi.org/10.1007/s10897-008-9169-9. Epub 2008 Sep 16. PMID: 18792771.

Bennett RL, French KS, Resta RG, Austin J. Practice resource-focused revision: Standardized pedigree nomenclature update centered on sex and gender inclusivity: A practice resource of the National Society of Genetic Counselors. J Genet Couns. 2022 Sep 15. https://doi.org/10.1002/jgc4.1621. Epub ahead of print. PMID: 36106433.

Montañez A. Beyond XX and XY: The Extraordinary Complexity of Sex Determination. Sci Am. 2017 Sep;317(3):50. https://doi.org/10.1038/scientificamerican0917-50.

Montañez A. Visualizing Sex as a Spectrum. Sci Am blog. 2017 Aug 29. https://www.scientificamerican.com/blog/sa-visual/visualizing-sex-as-a-spectrum/.

4.22 Unix: Data Manipulation

4.22.1 Learning Objectives

  • To learn Unix tools like sed and awk that can be used to manipulate data

4.22.2 Preparation

  • Watch the online lecture and do the Active Learning before class

4.22.3 Online Lecture

Unix Data Manipulation: https://danieleweeks.github.io/HuGen2071/unix_data_manipulation.html

See Required Reading.

4.22.4 Active Learning

See Required Reading.

4.22.5 Required Reading

Buffalo (2015) Chapter 7. Unix Data Tools (beginning of chapter up to and including “Finding Unique Values in Uniq”)

4.22.6 Suggested Reading

None.

4.23 Unix: Miscellaneous

4.23.1 Learning Objectives

  • To learn to string programs together to process data
  • To learn how to parallelize functions in Unix

4.23.2 Preparation

  • Watch the online lecture and do the Active Learning before class

4.23.3 Online Lecture

Unix Miscellaneous: https://danieleweeks.github.io/HuGen2071/unix_miscellaneous.html

4.23.4 Active Learning

See Required Reading.

4.23.5 Required Reading

Buffalo (2015) Chapter 7. Unix Data Tools (“Join” through the end of the chapter)

4.23.6 Suggested Reading

None.

4.24 Unix: Scripting

4.24.1 Learning Objectives - Unix: Scripting

  • To learn how to use control structures in Unix scripting
  • To learn how to use variables in Unix

4.24.2 Preparation - Unix: Scripting

  • Do the Active Learning before class - the lecture will assume you have; otherwise you will have difficulty with the in-class exercises

4.24.3 Active Learning - Unix: Scripting

Software Carpentry Unix Shell intro parts 5 and 7 https://swcarpentry.github.io/shell-novice/

4.24.4 Required Reading - Unix: Scripting

See Active Learning.

4.24.5 Suggested Reading - Unix: Scripting

Buffalo (2015) Chapter 3. Remedial Unix Shell (“command substitution” through the end of the chapter)

Buffalo (2015) Chapter 12. Bioinformatics Shell Scripting (entire chapter)

4.25 VCF, bcftools, vcftools

4.25.1 Learning Objectives

  • To learn about VCF data format
  • To learn about bcftools and vcftools for manipulating VCF files
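To show how a VCF file is laid out, the sketch below parses a tiny, invented VCF in base R: "##" meta lines, one "#CHROM" header line, then tab-separated records with per-sample genotype columns. This is only an illustration of the format; it is not a substitute for bcftools or vcftools, and it assumes that FORMAT contains only the GT field.

```r
# Hypothetical minimal VCF content, one string per line.
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
  "1\t10583\trs1\tG\tA\t50\tPASS\tDP=20\tGT\t0/1\t1/1",
  "1\t10611\t.\tC\tG\t12\tLowQual\tDP=5\tGT\t0/0\t./."
)

# Separate meta lines ("##") from the header line and the data records.
is_meta <- grepl("^##", vcf_lines)
header <- strsplit(sub("^#", "", vcf_lines[!is_meta][1]), "\t")[[1]]
records <- vcf_lines[!is_meta][-1]

# Split each record on tabs and assemble a data frame.
vcf <- as.data.frame(do.call(rbind, strsplit(records, "\t")),
                     stringsAsFactors = FALSE)
names(vcf) <- header
vcf$POS <- as.integer(vcf$POS)

# Keep only records that passed the caller's filters.
passed <- vcf[vcf$FILTER == "PASS", ]

# Convert a GT string such as "0/1" into a count of non-reference alleles.
alt_count <- function(gt) {
  alleles <- strsplit(gt, "[/|]")[[1]]
  if (any(alleles == ".")) NA_integer_ else sum(alleles != "0")
}
vcf$S1_count <- vapply(vcf$S1, alt_count, integer(1), USE.NAMES = FALSE)
vcf$S2_count <- vapply(vcf$S2, alt_count, integer(1), USE.NAMES = FALSE)
```

Real VCF files are usually compressed, much larger, and carry several FORMAT fields per sample, which is why dedicated tools are used to filter and query them.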

4.26 SAM & samtools

4.26.1 Learning Objectives

  • To learn about SAM data format for sequence data
  • To learn about samtools to manipulate SAM data files

4.26.2 Readings

Buffalo (2015) Chapter 11 “Working with Alignment Data”

Data Wrangling and Processing for Genomics https://data-lessons.github.io/wrangling-genomics/

Relevant links: The Sequence Alignment/Map Format Specification http://samtools.github.io/hts-specs/

4.27 Genetic Data in R and GDS

4.27.1 Learning Objectives

  • To learn about data structures in R for storing genetic data
  • To learn about the GDS format

4.27.2 Preparation

  • Watch the online lecture and do the Active Learning before class

4.27.3 Online Lecture

Genetic Data in R and GDS: https://danieleweeks.github.io/HuGen2071/R_genetics_data_gds.html

4.27.4 Active Learning - Genetic Data in R, GDS

None. See Required Reading.

4.27.5 Required Reading - Genetic Data in R, GDS

Zheng X, Gogarten SM, Lawrence M, Stilp A, Conomos MP, Weir BS, Laurie C, Levine D. SeqArray-a storage-efficient high-performance data format for WGS variant calls. Bioinformatics. 2017 Aug 1;33(15):2251-2257. doi: 10.1093/bioinformatics/btx145. PMID: 28334390; PMCID: PMC5860110. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5860110/

4.27.6 Suggested Reading - Genetic Data in R, GDS

None.