Data analysis
Mayo-Illinois Computational Genomics Course
June 8, 2020
Dave Zhao
Department of Statistics
Carl R. Woese Institute for Genomic Biology
University of Illinois at Urbana-Champaign
Introduction
Objective
To learn to learn to visualize, analyze genomic data
genomic data
Collins et al. (2003). A vision for the future of genomics research. https://www.nature.com/articles/nature01626
visualize, analyze
Data analysis is the iterative process of advancing scientific theory using quantitative data
[Diagram: cycle connecting Theory, Question, Data, and Answer]
To learn to learn to
“I tell my students, ‘the language in which you’ll spend most of your working life hasn't been invented yet, so we can't teach it to you. Instead we have to give you the skills you need to learn new languages as they appear.’”
Brian Harvey
“Why Structure and Interpretation of Computer Programs matters”, Boston Globe, 2011
https://people.eecs.berkeley.edu/~bh/sicp.html
Learn key concepts rather than specific implementations.
statistical method/
software package use
tools
Overview
Genomic data analysis workflow
1. Experimental design
2. Quality control
3. Preprocessing
4. Analysis
5. Biological interpretation
Today’s topics
1. Experimental design
2. Quality control
3. Preprocessing
4. Analysis
• Statistics
• Framework
• Methods
• R
5. Biological interpretation
• Illustration
Pandey et al. (2018) “Comprehensive Identification and Spatial Mapping of Habenular Neuronal Types Using Single-Cell RNA-Seq”. Curr. Biol.
Statistics
Introduction
Statistics provides a mathematical framework and a set of methods for drawing inferences from quantitative data
[Diagram: cycle connecting Theory, Question, Data, and Answer]
Framework
[Diagram: the Theory, Question, Data, Answer cycle, annotated with the labels: Population distribution function; Sampling (randomness); Inference (uncertainty); Formulation; Interpretation]
Methods
What they do:
• Supervised learning
  • Testing
  • Estimation
  • Prediction
• Unsupervised learning
  • Clustering
  • Dimension reduction
  • Visualization
• Ex. Alzheimer’s disease GWAS
  • Supervised (AD status)
    • Is a SNP associated with AD?
    • How is a SNP associated?
    • Given SNPs, predict AD status.
  • Unsupervised
    • What are the subpopulations?
    • Reconstruct ancestry scores.
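The three supervised tasks in the GWAS example can be sketched with one logistic regression. This is a minimal illustration on simulated data, not real GWAS data: the sample size, allele frequency, effect size, and the variable names `snp`, `ad`, `fit`, and `pred` are all assumptions made for the example.

```r
set.seed(1)

# Hypothetical simulated data: genotype = minor-allele count (0, 1, 2),
# with the probability of AD increasing with genotype.
n   <- 500
snp <- rbinom(n, size = 2, prob = 0.3)
ad  <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * snp))

# Logistic regression of AD status on genotype
fit   <- glm(ad ~ snp, family = binomial)
coefs <- summary(fit)$coefficients

# Testing: is the SNP associated with AD?
coefs["snp", "Pr(>|z|)"]

# Estimation: how is it associated? (log odds ratio per allele)
coefs["snp", "Estimate"]

# Prediction: estimated probability of AD for each genotype
pred <- predict(fit, newdata = data.frame(snp = 0:2), type = "response")
pred
```

A real GWAS would typically repeat the test across many SNPs and adjust for covariates such as ancestry; the sketch only shows how one model answers all three questions.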
Methods
What they are:
• Classical
• Standard (introductory)
• Standard (advanced)
• New
Methods
https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
How to choose: match analysis task with data structure
Rosner (2015)
R
R language
• R is a programming language for data analysis and visualization
• R language = objects (data) and procedures (manipulate objects)
• R expression = valid combination of objects and procedures
• R script = sequence of R expressions (typically one per line; a long expression may span several lines)
• R package = community-contributed R procedures and scripts
Example script:
file <- "~/data/GSM2818521_larva_counts_matrix.txt"
pandey <- read.table(file, header = TRUE)
dim(pandey)
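To try the same pattern without the course data, the following sketch writes a small tab-delimited table to a temporary file and reads it back. The file contents and the names `toy_file` and `toy` are hypothetical; the layout of the real counts file may differ.

```r
# Hypothetical toy counts table: 3 genes (rows) by 2 cells (columns).
# The header has one fewer field than the data rows, so read.table()
# uses the first column as row names.
toy_file <- tempfile(fileext = ".txt")
writeLines(c("cell1\tcell2",
             "geneA\t0\t3",
             "geneB\t5\t1",
             "geneC\t2\t0"),
           toy_file)

toy <- read.table(toy_file, header = TRUE)

dim(toy)        # number of rows (genes) and columns (cells)
rownames(toy)   # gene names taken from the first column
```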
R interpreter
An (R) interpreter is a computer program that:
1. Reads an (R) expression
2. Evaluates the expression
3. Prints the result
4. Loops
Ignores lines starting with # (comments)
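The read, evaluate, print, loop cycle can be imitated in a few lines of R itself. This is only a sketch of the idea, not how the actual interpreter is implemented; `src_lines`, `env`, and `printed` are names invented for the example.

```r
# Hypothetical script lines; the second one is a comment.
src_lines <- c("x <- 2 + 3", "# comments produce no expression", "x * 10")

env     <- new.env()   # workspace in which expressions are evaluated
printed <- list()      # values that were printed, kept for inspection

for (line in src_lines) {                       # 4. loop over the input
  exprs <- parse(text = line)                   # 1. read text into expressions
  for (e in exprs) {                            # a comment yields no expressions
    res <- withVisible(eval(e, envir = env))    # 2. evaluate
    if (res$visible) {                          # assignments stay silent
      print(res$value)                          # 3. print
      printed[[length(printed) + 1]] <- res$value
    }
  }
}
```

The assignment on the first line stores `x` without printing, the comment line is skipped, and only the last line prints a value, which matches what happens when the same lines are typed at the console.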
RStudio
• RStudio is a development environment, i.e., a program that makes writing R scripts easier
• Panes
1. Source
2. Console
3. Environment
4. Output
RStudio customization
Illustration
Example data
Theory
“Characterize” cell types in larval habenula
Genomic data analysis workflow
1. Experimental design
2. Quality control
3. Preprocessing
4. Analysis
5. Biological interpretation
Prepare R
Analysis will be implemented using the R package Seurat v3.1.
library("Seurat")
## set random seed for reproducibility
set.seed(1)
s_obj <- CreateSeuratObject(pandey)
2. Quality control
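## percentage of each cell's counts that come from genes matching "^MT-"
s_obj <- PercentageFeatureSet(s_obj,
                              pattern = "^MT-",
                              col.name = "percent.mito")
## violin plots of the per-cell QC metrics
VlnPlot(s_obj,
        features = c("nCount_RNA",
                     "nFeature_RNA",
                     "percent.mito"))
## keep cells with at most 5% mitochondrial counts and at most 20,000 total counts
s_obj <- subset(s_obj, percent.mito <= 5 & nCount_RNA <= 2e4)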
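The QC metrics are simple column summaries of the count matrix. The sketch below recomputes them in base R on a made-up matrix, assuming genes are in rows and cells in columns; the gene names, counts, and variable names are hypothetical, and this is not Seurat's internal code.

```r
# Hypothetical toy count matrix: 4 genes (rows) by 3 cells (columns).
counts <- matrix(c(10,  0, 5, 85,    # cell1
                   20, 30, 0, 50,    # cell2
                    1,  1, 1, 97),   # cell3
                 nrow = 4,
                 dimnames = list(c("MT-CO1", "MT-ND1", "GENE1", "GENE2"),
                                 c("cell1", "cell2", "cell3")))

is_mito <- grepl("^MT-", rownames(counts))   # genes matching the pattern

n_count      <- colSums(counts)              # total counts per cell
n_feature    <- colSums(counts > 0)          # detected genes per cell
percent_mito <- 100 * colSums(counts[is_mito, , drop = FALSE]) / n_count

# Apply the same thresholds as the slide's subset() call
keep      <- percent_mito <= 5 & n_count <= 2e4
counts_qc <- counts[, keep, drop = FALSE]
colnames(counts_qc)
```

A high mitochondrial percentage is commonly read as a sign of a damaged cell, and a very high total count as a possible doublet, which is the usual motivation for thresholds like these.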
3. Preprocessing
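s_obj <- NormalizeData(s_obj)  # log-normalize counts within each cell
s_obj <- FindVariableFeatures(s_obj)  # select highly variable genes
s_obj <- ScaleData(s_obj, vars.to.regress = c("nCount_RNA"))  # scale genes, regressing out total counts
s_obj <- RunPCA(s_obj)  # principal component analysis on the variable genes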
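The four preprocessing steps can be approximated in base R to see what each one does to the data. This sketch runs on simulated counts and makes simplifying assumptions: variable genes are picked by plain variance rather than Seurat's variance-stabilizing method, and nothing is regressed out during scaling. All variable names are invented for the example.

```r
set.seed(1)

# Hypothetical toy counts: 50 genes (rows) by 30 cells (columns).
counts <- matrix(rpois(50 * 30, lambda = 5), nrow = 50,
                 dimnames = list(paste0("gene", 1:50),
                                 paste0("cell", 1:30)))

# Normalize: rescale each cell to 10,000 total counts, then log(1 + x)
norm <- log1p(sweep(counts, 2, colSums(counts), "/") * 1e4)

# Variable genes: keep the 20 genes with the largest variance
# (a stand-in for Seurat's selection method)
gene_var <- apply(norm, 1, var)
top      <- names(sort(gene_var, decreasing = TRUE))[1:20]

# Scale: center each gene at 0 with unit variance across cells
scaled <- t(scale(t(norm[top, ])))

# PCA with cells as observations and genes as variables
pca <- prcomp(t(scaled), center = FALSE, scale. = FALSE)
dim(pca$x)   # cells by principal components
```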
Iteration 1: Question
What are the different cell types in the larval habenula?
[Figure: two scatterplots of cells with axes Gene A and Gene B, labeled “Formulation”]
Iteration 1: Analysis
Identify cell types
• Task:
  • Unsupervised
  • Clustering
• Method:
  • SNN clustering
s_obj <- FindNeighbors(s_obj)
## resolution determines the number of clusters
s_obj <- FindClusters(s_obj,
                      resolution = 0.5)
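The idea behind shared nearest neighbor (SNN) clustering can be sketched in base R: two cells are similar when their sets of nearest neighbors overlap. The example below uses simulated data and, as a labeled substitution, groups cells with hierarchical clustering on the SNN similarity instead of the graph-based community detection that Seurat applies. The data, `k`, and all variable names are assumptions.

```r
set.seed(1)

# Hypothetical toy data: two well-separated groups of 10 cells in 2 dimensions.
x <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
           matrix(rnorm(20, mean = 10), ncol = 2))
k <- 5

# k nearest neighbors of each cell, excluding the cell itself
d <- as.matrix(dist(x))
diag(d) <- Inf
knn <- t(apply(d, 1, function(row) order(row)[1:k]))

# SNN similarity: Jaccard overlap between two cells' neighbor sets
n   <- nrow(x)
snn <- matrix(0, n, n)
for (i in 1:n) {
  for (j in 1:n) {
    shared    <- length(intersect(knn[i, ], knn[j, ]))
    snn[i, j] <- shared / (2 * k - shared)
  }
}

# Stand-in for graph clustering: hierarchical clustering on 1 - similarity
clusters <- cutree(hclust(as.dist(1 - snn), method = "average"), k = 2)
table(clusters, truth = rep(1:2, each = 10))
```

Here the number of clusters is fixed by `k = 2` in `cutree()`; in the Seurat call that role is played indirectly by `resolution`.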
Iteration 1: Analysis
Visualize cell types
• Task:
  • Unsupervised
  • Dimension reduction
  • Visualization
• Method:
  • UMAP projection
  • Scatterplot
s_obj <- RunUMAP(s_obj, dims = 1:20)
DimPlot(s_obj)
Iteration 1: Answer
Iteration 2: Question
Which genes are substantially differentially expressed between the cell types?
Formulation
Iteration 2: Analysis
Identify genes
• Task:
  • Supervised
  • Testing
  • Estimation
• Methods:
  • Wilcoxon, FDR
  • Mean
## “Substantial”: require a log fold change of at least 1.5
markers <- FindAllMarkers(s_obj,
                          logfc.threshold = 1.5)
## “Differential”: require an adjusted p-value of at most 0.05
markers <- markers[markers$p_val_adj <= 0.05, ]
head(markers)
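The two filters, "differential" (a test) and "substantial" (an effect size), can be sketched in base R on simulated data. The data, cluster labels, and variable names below are assumptions. The sketch uses the Benjamini-Hochberg FDR adjustment to follow the slide's "FDR" label; Seurat's documentation describes `p_val_adj` as a Bonferroni correction, so this does not reproduce `FindAllMarkers()` exactly.

```r
set.seed(1)

# Hypothetical toy data: 100 genes by 40 cells. The first 20 cells form
# cluster "A", the rest cluster "B". Genes 1 to 5 are shifted up in "A".
n_genes <- 100
cluster <- rep(c("A", "B"), each = 20)
expr <- matrix(rnorm(n_genes * 40), nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes), NULL))
expr[1:5, cluster == "A"] <- expr[1:5, cluster == "A"] + 3

# "Differential": Wilcoxon rank-sum test per gene, then FDR adjustment
p_val <- apply(expr, 1, function(g) {
  wilcox.test(g[cluster == "A"], g[cluster == "B"])$p.value
})
p_adj <- p.adjust(p_val, method = "BH")

# "Substantial": difference in mean expression between the clusters
mean_diff <- rowMeans(expr[, cluster == "A"]) -
             rowMeans(expr[, cluster == "B"])

markers_toy <- data.frame(gene = rownames(expr), mean_diff, p_val, p_adj)
markers_toy <- markers_toy[markers_toy$p_adj <= 0.05 &
                           abs(markers_toy$mean_diff) >= 1.5, ]
markers_toy$gene
```

Requiring both conditions matters because with many cells a tiny difference can be statistically significant without being biologically large.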
Iteration 2: Analysis
Visualize genes
• Task:
  • Visualization
• Methods:
  • Scatterplot
FeaturePlot(s_obj,
            features = c("G0S2", "TP53I11B", "FXYD1")) +
  patchwork::plot_layout(ncol = 3)
Iteration 2: Answer
Iteration 3: Question
In what biological processes are the differentially expressed genes involved?
Formulation
Iteration 3: Analysis
Identify processes
• Task:
  • Supervised
  • Testing
• Methods:
  • Fisher’s exact test
gene_names <- unique(markers$gene)
length(gene_names)
cat(gene_names, sep = "\n")
https://david.ncifcrf.gov/
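The enrichment question behind the gene list can be illustrated with a single 2 x 2 table and Fisher's exact test in base R. The counts below are invented for illustration and do not come from the Pandey et al. data; the table layout and variable names are assumptions, and an enrichment tool may compute its statistic somewhat differently.

```r
# Hypothetical counts: of 2000 background genes, 100 are annotated to a
# biological process; of the 50 marker genes, 10 carry that annotation.
tab <- matrix(c(10,   40,     # marker genes: annotated, not annotated
                90, 1860),    # other genes:  annotated, not annotated
              nrow = 2,
              dimnames = list(annotated = c("yes", "no"),
                              marker    = c("yes", "no")))

# One-sided test: are annotated genes over-represented among the markers?
res <- fisher.test(tab, alternative = "greater")

res$p.value    # evidence of enrichment
res$estimate   # estimated odds ratio
```

In practice this test is repeated for every annotation term, so the resulting p-values also need a multiple-testing adjustment.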
5. Biological interpretation?
Larval habenula cells are mostly distinguished by whether they are neurons or not (?).
Conclusions
Next steps
• https://astrobiomike.github.io/about/:
1. Fundamentals and concepts are important, not details.
2. Don’t let yourself become paralyzed by options.
3. Try to find a bioinformatics community to be a part of.
4. Good documentation is for science, you, and the community.
5. Be aware that you will often need to let some things go.
• Simple statistical methods can go a long way: “lateral thinking with withered technology” (枯れた技術の水平思考)
Thank you!