using single-cell transcriptome sequencing to infer ... · sensory neuronsandnon-neuronal support...
Post on 04-Oct-2019
2 Views
Preview:
TRANSCRIPT
-
Using Single-Cell Transcriptome Sequencing toInfer Olfactory Stem Cell Fate Trajectories
Sandrine DudoitDivision of Biostatistics and Department of Statistics
University of California, Berkeleywww.stat.berkeley.edu/~sandrine
Computational and Systems Immunology SeminarStanford University
June 28, 2016
Version: 26/06/2016, 22:58
1 / 105
http://www.stat.berkeley.edu/~sandrine
-
Acknowledgments
• Sandrine Dudoit, Division of Biostatistics and Department ofStatistics, UC Berkeley.
I Fanny Perraudeau, Graduate Group in Biostatistics, UCBerkeley.
I Davide Risso, Division of Biostatistics, UC Berkeley.I Kelly Street, Graduate Group in Biostatistics, UC Berkeley.
• John Ngai, Department of Molecular and Cell Biology, UCBerkeley – Principal investigator.
I Diya Das.I Russell Fletcher.I David Stafford.
• Elizabeth Purdom, Department of Statistics, UC Berkeley.• Jean-Philippe Vert, Mines ParisTech and Institut Curie, Paris,
France.I Svetlana Gribkova.
2 / 105
-
Acknowledgments
• Nir Yosef, Department of Electrical Engineering and ComputerSciences, UC Berkeley.
I Michael Cole.I Allon Wagner.
• Funded by BRAIN Initiative and California Institute forRegenerative Medicine (CIRM).
3 / 105
-
Outline
1 Olfactory Stem Cell Fate TrajectoriesOlfactory Stem Cells and Neural RegenerationOlfactory Epithelium p63 DatasetAnalysis Pipeline
2 Exploratory Data Analysis and Quality Assessment/Control
3 Normalization and Expression QuantitationMotivationMethodsSoftware: sconeZero-Inflated Negative Binomial Model
4 Resampling-Based Sequential Ensemble ClusteringMotivationMethodsSoftware: clusterExperiment
5 Cell Lineage and Pseudotime InferenceMotivation
4 / 105
-
Outline
MethodsSoftware: slingshot
5 / 105
-
Olfactory Stem Cells and Neural Regeneration
R. Fletcher, J. Ngai
6 / 105
-
Olfactory Stem Cells and Neural Regeneration
• The nature of stem cells giving rise to the nervous system isof particular interest in neurobiology, because neural stemcells remain active in certain brain regions for the entire life ofan individual.
• We focus on the mouse olfactory epithelium (OE), a site ofactive neurogenesis in the postnatal animal.
• Adult olfactory stem cells support the replacement of olfactorysensory neurons and non-neuronal support cells (e.g.,sustentacular) over postnatal life and can reconstitute theentire OE following injury.
• The OE is a convenient system to study, due to itsexperimental accessibility (in situ analysis) and its limitednumber of cell types:
I olfactory sensory neurons (OSN),I sustentacular cells (SUS),
7 / 105
-
Olfactory Stem Cells and Neural Regeneration
I cells of the Bowman gland,I microvillous cells (rare).
8 / 105
-
Olfactory Stem Cells and Neural Regeneration
Figure 1: Mouse olfactory epithelium.
9 / 105
-
Olfactory Stem Cells and Neural Regeneration
Sustentacular cells
Mature olfactory neurons
Immature olfactory neurons Globose basal cells Horizontal basal cells Olfactory ensheathing glia
Bowman’s gland
• HBCs: multipotent, quiescent – deep reserve adult tissue stem cell • GBCs: proliferative progenitor cells + transit-amplifying cells
The Horizontal Basal Cell Is an Adult Tissue Stem Cell
Figure 2: Olfactory epithelium cell types.
10 / 105
-
Olfactory Stem Cells and Neural Regeneration
Open questions.
• Determine the stage at which the neuronal and non-neuronallineages bifurcate/diverge.
• Characterize discrete intermediate stages of cell differentiation.• Identify the genetic networks and signaling pathways that
promote self-renewal and regulate the transition todifferentiation.
11 / 105
-
Olfactory Stem Cells and Neural Regeneration
p63 regulation of horizontal basal cells.
• The horizontal basal cell (HBC) is an adult tissue stem cell.• The p63 protein (tumor protein p63, TP63) promotes
self-renewal of HBC by blocking differentiation.
• When p63 is down-regulated, this “brake” is removed,allowing differentiation to proceed at the expense ofself-renewal. Thus, p63 can be viewed as a “molecularswitch” that decides between the alternate stem cell fates ofself-renewal vs. differentiation.
• We use single-cell RNA-Seq to analyze cell fate trajectoriesfrom olfactory stem cells (HBC) of p63 conditional knock-outmice.
12 / 105
-
Olfactory Epithelium p63 Dataset
X
Cre-ER Krt5
YFP
loxP-STOP-loxP
Rosa
p63
+
Experimental design
Russell Fletcher, Levi Gadye, Mike Sanchez
~ ~
0 24h 48h 72h 96h 7d 14d
tamoxifen YFP+;p63+/+: 98 cells
Resting HBCs
FACS-purify lineage-traced cells -> analyze by single cell RNA-Seq
Differentiating cells
YFP+;p63-/-: 88 cells 100 cells 128 cells 110 cells 98 cells
Figure 3: Experimental design. Single-cell RNA-Seq for 636 HBC: 102wild-type and 534 p63 knock-out cells.
13 / 105
-
Olfactory Epithelium p63 Dataset
P01
P02
P03
A
P03
B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
Number of cells per batch, colored by bio
num
ber
of c
ells
0
10
20
30
40
50
60KO_14DPTKO_24HPTKO_48HPTKO_7DPTKO_96HPTWT_72HPTWT_96HPT
Figure 4: Experimental design. Number of cells per batch, colored bybiological condition.
14 / 105
-
Olfactory Epithelium p63 Dataset
• Single-cell RNA-Seq for 636 HBC.I 102 wild-type (WT)/resting cells
534 p63 knock-out (KO) cells, at five timepoints followingtamoxifen treatment.
I Biological replicate: Cells from 1–3 mice.At least two replicates per biological condition.
I One FACS run and one C1 run per biological replicate=⇒ 14 batches.
I 8 HiSeq runs (96 cells/lane, single-end 50-base-pair reads).
• Some confounding between biological and technical effects.• Batch effects nested within biological effects.
15 / 105
-
Olfactory Epithelium p63 Dataset
• Marker genes. 94 marker genes, curated from literature andfrom prior microarray, sequencing, and in situ experiments,e.g., neuronal, progenitor cell markers.
• Housekeeping genes. 715 housekeeping (HK) genes, curatedfrom prior microarray experiments, expected to be constantlyand highly-expressed across cells of the OE.
• Gene-level read counts. TopHat2 alignment to RefSeq mm10genome and featureCounts(bioinf.wehi.edu.au/featureCounts) counting, withgenes defined as union of all isoforms.
16 / 105
http://bioinf.wehi.edu.au/featureCounts
-
Analysis Pipeline
• Input/Output (I/O).I Input. FASTQ files, gene-level annotation metadata,
sample-level annotation metadata.I Output. Cluster labels corresponding to cellular types, cell
lineages and pseudotimes, list of differentially expressed genescorresponding to markers/transcriptome signatures for thesetypes/lineages.
• Pre-processing. Read alignment, gene-level expressionquantitation, quality assessment/control.
• Exploratory data analysis (EDA) and qualityassessment/control (QA/QC). Numerical and graphicalsummaries of gene-level read counts, sample filtering (e.g.,based on QC measures), gene filtering (e.g., zero inflation).
17 / 105
-
Analysis Pipeline
• Normalization and expression quantitation. Adjust forunwanted technical effects (e.g., C1 run effects), account forzero inflation and over-dispersion.
• Cluster analysis. Dimensionality reduction (e.g., based onprincipal component analysis), distance measure, clusteringprocedure and tuning parameters, confidencemeasures/validation of clustering results.
• Cell lineage and pseudotime inference.• Differential expression (DE) analysis. Identification of marker
genes for cell types/lineages.
18 / 105
-
Computationally Reproducible Research
• Version control system for software development and dataanalysis: Git, GitHub.
• R data packages, with standard object structure (S4 classesExpressionSet, SummarizedExperiment) for gene-levelexpression measures and gene-level and sample-levelannotation metadata.
• Compendia, i.e., integrated and executable input documents,with text, data, code, and software, that allow the generationof dynamic and reproducible output documents, intermixingtext and code output (textual and graphical). Different viewsof the output document (e.g., PDF, HTML) can beautomatically generated and updated whenever the text, data,or code change.E.g. For R and LATEX/markdown, Sweave, knitr, RStudio.
19 / 105
-
Zero Inflation
Proportion of genes with zero count, pre gene filtering
0.0
0.2
0.4
0.6
0.8
0.0 0.2 0.4 0.6 0.8 1.0
01
23
45
6
Proportion of cells in which a gene is detected, pre gene filtering
N = 22054 Bandwidth = 0.02443
Den
sity
(a) Proportion of genes with zero count (b) Proportion of cells in which a gene is detected
Figure 5: Zero inflation. Pre gene filtering.
20 / 105
-
Zero Inflation
• Single-cell RNA-Seq data have many more genes with zeroread counts than bulk RNA-Seq data.
• This zero inflation could occur for biological reasons (i.e., thegene is simply not expressed) or technical reasons (e.g., lowcapture efficiency).
• Zero-count gene filtering is advisable for normalization anddownstream analyses.
• Most RNA-Seq normalization methods involve scaling andperform poorly when many genes have zero counts.
• In particular, the global-scaling method of Anders and Huber(2010), implemented in the Bioconductor R package DESeq,discards any gene having zero count in at least one sample. Inpractice, the scaling factors are therefore estimated based ononly a handful of genes, e.g., 5/22,054 genes for OE dataset.
21 / 105
-
Zero Inflation
• Full-quantile (FQ) normalization also doesn’t behave properlydue to ties from the large number of zeros.
• We apply the following zero-count gene filtering to the OEdataset: Retain only the genes with at least nr = 20 reads, inat least ns = 40 samples.This yields 9,133/22,054 genes.
22 / 105
-
Zero Inflation
Proportion of genes with zero count, post gene filtering
0.0
0.2
0.4
0.6
0.8
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.5
1.0
1.5
2.0
2.5
Proportion of cells in which a gene is detected, post gene filtering
N = 9133 Bandwidth = 0.03333
Den
sity
(a) Proportion of genes with zero count (b) Proportion of cells in which a gene is detected
Figure 6: Zero inflation. Post gene filtering.
23 / 105
-
Sample-Level QC
P01
P02
P03
AP
03B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
1e+06
2e+06
3e+06
QC measures by batch
NR
EA
DS
P01
P02
P03
AP
03B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
500000100000015000002000000250000030000003500000
QC measures by batch
NA
LIG
NE
D
P01
P02
P03
AP
03B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
90
92
94
96
QC measures by batch
RA
LIG
N
P01
P02
P03
AP
03B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
203040506070
QC measures by batch
TOTA
L_D
UP
Figure 7: Sample-level QC. Boxplots of QC measures, by batch.
24 / 105
-
Sample-Level QC
P01
P02
P03
AP
03B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
0.00.10.20.30.40.50.6
QC measures by batch
PC
T_I
NT
RO
NIC
_BA
SE
S
P01
P02
P03
AP
03B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
0.100.150.200.250.30
QC measures by batch
PC
T_I
NT
ER
GE
NIC
_BA
SE
S
P01
P02
P03
AP
03B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
0.20.30.40.50.60.70.8
QC measures by batch
PC
T_M
RN
A_B
AS
ES
P01
P02
P03
AP
03B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
0.60.81.01.21.41.61.82.0
QC measures by batch
ME
DIA
N_C
V_C
OV
ER
AG
E
Figure 8: Sample-level QC. Boxplots of QC measures, by batch.
25 / 105
-
Sample-Level QC
●
●
● ●
●●
●
●
●
●
●
●
●●● ●
●
●
●
●
●●
●
●
●
●
●
●
● ●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
● ●●
●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
● ●●
●
●
●
●
●●
●
●
●
●
●● ●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
● ●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
● ●
●
●●
●●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●
●●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●●
●
●
●
●
●
−5 0 5 10
−4
−2
02
46
QC PCA by batch
PC1
PC
2
P01
P02
P03
A
P03
B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
−4
−2
0
2
4
6
QC PCA by batch
PC
1
(a) QC PC2 vs. PC1, colored by batch (b) QC PC1, by batch
Figure 9: Sample-level QC. Principal component analysis (PCA) ofsample-level QC measures.
26 / 105
-
Sample-Level QC
●
●●
●
●
●
●
●
●
●●●
●
● ●
●●
●● ●
● ●
●
●
●
●
● ●●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●●
●
●
●
●●
●
●
●●●●
●
●
●
●
●●
●
●
●
●
●●
● ●
● ●
●
●
●
●
●●
● ●●
●
●
●●
●●
●
●
●
●●●
●
● ●
●
●●
●
●
●
●
●
● ●
●
●
●● ●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●●
●●
●● ●
●
●●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●●
●●
●●
●
●
●
●
●
●●
●●
●
● ●
●●
●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●●
●
●● ●
●●
●
●
●
●
●●
●
●
●●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●●
●
●●
●
●● ●
●●●
●
●
●
● ●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●●
●
●
●
●
● ●
●
●
●
●
●●●
● ●●●
●●
●
●
●
● ●●
●
●
●
●
●●●
●
●
●
●
●
● ●
●
● ●
●
●●
●
●
●
●
●
●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●●●
●
●
●
●
●
●●●
●
●
●
●
●
●
● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●●
●●
● ●
● ●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●
●●
●●
●
● ●
●
●●
−40 −20 0 20 40 60 80
−5
05
10QC PC1 vs. count PC1
Count PC1
QC
PC
1
Cor=−0.49
PC1 PC2 PC3 PC4 PC5
Absolute correlation of count PC and QC measures
0.0
0.2
0.4
0.6
0.8
1.0
NREADSNALIGNEDRALIGNTOTAL_DUPPRIMERPCT_RIBOSOMAL_BASESPCT_CODING_BASESPCT_UTR_BASESPCT_INTRONIC_BASESPCT_INTERGENIC_BASESPCT_MRNA_BASESMEDIAN_CV_COVERAGEMEDIAN_5PRIME_BIASMEDIAN_3PRIME_BIAS
(a) QC PC1 vs. count PC1 (b) Correlation of count PC and QC measures
Figure 10: Sample-level QC. Association of counts and sample-level QCmeasures.
27 / 105
-
Sample-Level QC: Summary
• The distribution of QC measures can vary substantially withinand between batches.
• Some QC measures clearly point to low-quality samples, e.g.,low percentage of mapped reads (RALIGN).
• There can be a strong association between QC measures andread counts (cf. PCA).
• Filtering samples based on QC measures is advisable, asnormalization procedures may not be able to adjust for QCand some samples simply have low quality.
• Normalization procedures based on QC measures (e.g.,regression on first few PC of QC measures) should also beconsidered.
28 / 105
-
Gene-Level Counts
0
2
4
6
8
10
Gene−level log−count
log(
coun
t+1)
−4
−2
0
2
4
6
8
Gene−level RLE
RLE
(a) Log-count (b) RLE
Figure 11: Gene-level counts. Gene-level log-count and relative logexpression (RLE = log-ratio of read count to median read count acrosssamples).
29 / 105
-
Gene-Level Counts: Summary
• After gene and sample filtering and before normalization,there are large differences in gene-level count distributionswithin and between batches (cf. RLE, housekeeping genes).
• The counts are still zero-inflated.• There can be substantial association of counts and
sample-level QC measures.
• Normalization is essential before any clustering or differentialexpression analysis, to ensure that observed differences inexpression measures between samples and/or genes are trulydue to differential expression and not technical artifacts.
30 / 105
-
Motivation
• In order to derive expression measures and compare thesemeasures between samples and/or genomic regions of interest(ROI), one first needs to normalize read counts to adjust fordifferences in sequencing depths (i.e., total read counts) andROI lengths, and possibly more complex technical effects, e.g.,C1 run/library preparation/flow-cell/lane effects andnucleotide composition (e.g., GC-content) effects –sequenceability.
• Normalization is essential to ensure that observed differencesin expression measures between samples and/or ROI are trulydue to differential expression and not technical artifacts.
31 / 105
-
Motivation
• Normalization is still an important issue for RNA-Seq, despiteinitial optimistic claims such as: “One particularly powerfuladvantage of RNA-Seq is that it can capture transcriptomedynamics across different tissues or conditions withoutsophisticated normalization of data sets” (Wang et al., 2009).
• This subject has received relatively little attention and fewnovel methods have been proposed in the RNA-Seq literature.
• It is still common to only scale ROI-level counts by ROI lengthand total sample read count to adjust, respectively, fordifferences in ROI lengths and sequencing depths (even fortechnical replicate lanes).E.g. Estimators based on Poisson model as in Marioni et al.(2008); RPKM, Reads Per Kilobase of exon model per Millionmapped reads, as in Mortazavi et al. (2008).
32 / 105
-
Motivation
• In many cases, however, one needs to account for morecomplex features of the genomic regions and experimentaldesign.
• Normalization seems even more important for single-cellRNA-Seq than bulk RNA-Seq.
• There are large effects related to C1 run (cells processed inbatches of 96 wells).
• There is also greater variation in gene-level counts acrosssingle cells than across bulk RNA-Seq samples.
• The very small amount of RNA present in a single cell maycause amplification bias (i.e., many sequenced copies of atranscript), hence high-magnitude outliers, and low captureefficiency (i.e., genes expressed in the cell, but not capturedby the sequencing process), hence zero inflation.
33 / 105
-
Motivation
• Widely-used bulk RNA-Seq methods are not well-suited forhandling the zero inflation and high level of technical noise ofscRNA-Seq data (Vallejos et al., 2016).
34 / 105
-
Between-Sample Normalization
Most normalization methods proposed to date are adaptations ofmethods for bulk RNA-Seq and microarrays.
• Global-scaling normalization. Gene-level counts are scaled bya single factor per sample.
I Total-count (TC), as in RPKM (Mortazavi et al., 2008), stillwidely used.
I Housekeeping gene count, as in qRT-PCR, e.g., per-samplePOLR2A count (Bullard et al., 2010).
I Single-quantile, e.g., per-sample upper-quartile (UQ) of countsfor genes with non-zero counts (Bullard et al., 2010).
• Pairwise global-scaling normalization. Gene-level counts arescaled by a single factor per sample computed with respect toa reference sample.
I Trimmed Mean of M values (TMM) (Robinson and Oshlack,2010): Select a reference sample and obtain a robust estimateof the overall fold-change between each sample and thereference. Implemented in edgeR.
35 / 105
-
Between-Sample Normalization
I Anders and Huber (2010): The global-scaling factor for agiven sample is defined as the median fold-change betweenthat sample and a synthetic reference sample whose counts aredefined as the geometric means of the counts across samples.Implemented in DESeq and edgeR (as “RLE”).
• Full-quantile normalization (FQ). All quantiles of the genecount distributions are matched between samples (Bullardet al., 2010). Specifically, for each sample, the distribution ofsorted read counts is matched to a reference distributiondefined in terms of a function of the sorted counts (e.g.,median) across samples. Inspired from the microarrayliterature (Irizarry et al., 2003).
36 / 105
-
Between-Sample Normalization
• Loess normalization. Loess fits are performed formean-difference plots (MD-plots) of log-counts for pairs ofsamples (all possible pairs (cyclic) or each sample paired witha synthetic reference obtained by averaging counts acrosssamples). This approach, also inspired from the microarrayliterature (Bolstad et al., 2003), was used in Lovén et al.(2012) and is implemented in the affy functionloess.normalize and limma functionnormalizeCyclicLoess.
• Regression-based normalization for known technical factors,e.g., C1 run, QC measures.Linear model (limma), negative binomial model (edgeR),empirical Bayes model (ComBat of Johnson et al. (2007),implemented in R package sva), supervised version of RUVbelow.
• Remove unwanted variation (RUV).37 / 105
-
Between-Sample Normalization
I Proposed initially for the normalization of microarray data(Gagnon-Bartsch and Speed, 2012; Gagnon-Bartsch et al.,2013; Jacob et al., 2013) and then adapted to sequencing data(Risso et al., 2014a,b), the main idea is to remove “unwantedvariation” by performing factor analysis for a suitably-chosensubset of control genes and/or samples.
I For n samples and J genes, consider the log-linear/generalizedlinear model (GLM)
log E[Y |X ,W ] = Wα + Xβ, (1)
where Y is the n × J matrix of gene-level read counts, X is ann ×M design matrix corresponding to the M covariates ofinterest/factors of “wanted variation” (e.g., treatment) and βits associated M × J matrix of parameters of interest, and Wis an n × K matrix corresponding to unknown factors of“unwanted variation” and α its associated K × J matrix ofnuisance parameters.
38 / 105
-
Between-Sample Normalization
I The unwanted factors W can be estimated as follows.
• RUVg: Based on control genes, with either constantexpression (β = 0) or known expression fold-changes betweensamples (e.g., housekeeping genes, ERCC spike-ins, “in-silico”empirical controls).
• RUVr: Based on residuals, such as deviance residuals from afirst-pass GLM regression of the counts on the covariates ofinterest, without RUVg normalization, but possibly after someother normalization procedure such as upper-quartile.
• RUVs: Based on replicate/negative control samples for whichthe covariates of interest are constant.
• Generalization of RUV. The RUV model can be readilyextended to include known sample-level covariates (e.g., C1run, QC measures) and known gene-level covariates (e.g.,GC-content) that should be adjusted for.
39 / 105
-
Between-Sample Normalization: RUV
= + α
logE[Y|W,X] W X β
Unwanted variation Variation of interest
J
n
K
n n
K
J
M
M J
Read counts
n samples x J genes
Figure 12: RUV model.
40 / 105
-
SCONE
D. Risso, M. Cole, N. Yosef
41 / 105
-
SCONE
SCONE: Single-Cell Overview of Normalized Expression. A generalframework for the normalization of scRNA-Seq data.
• Range of normalization methods.I Global-scaling, e.g., DESeq, TMM.I Full-quantile (FQ).I Unknown factors of unwanted variation: Remove unwanted
variation (RUV).I Known factors of unwanted variation: Regression-based
normalization on, e.g., QC PC, C1 run.
• Normalization performance metrics.• Numerical and graphical summaries of normalized read counts
and metrics.
• R package scone, to be released through the BioconductorProject: github.com/yoseflab/scone.
42 / 105
http://github.com/yoseflab/scone
-
SCONE
Figure 13: scone. Regression model.
43 / 105
-
SCONE
Performance metrics. (Green: Good when high; Red: Good whenlow.)
• BIO SIL: Average silhouette width by biological condition.• BATCH SIL: Average silhouette width by batch.• PAM SIL: Maximum average silhouette width for PAM
clusterings, for a range of user-supplied numbers of clusters.
• EXP QC COR: Maximum squared Spearman correlationbetween count PCs and QC measures.
• EXP UV COR: Maximum squared Spearman correlationbetween count PCs and factors of unwanted variation(preferably derived from other set of negative control genesthan used in RUV).
• EXP WV COR: Maximum squared Spearman correlationbetween count PCs and factors of wanted variation (derivedfrom positive control genes).
44 / 105
-
SCONE
• RLE MED: Mean squared median relative log expression(RLE).
• RLE IQR: Mean inter-quartile range (IQR) of RLE.
45 / 105
-
Software Package scone
Application to OE p63 dataset.
• Apply and evaluate 172 normalization procedures using mainscone function.
I scaling method: None, DESeq, TMM, FQ.I uv factors: None, RUVg k = 1, · · · , 5, QC PC k = 1, · · · , 5.I adjust biology: Yes/no.I adjust batch: Yes/no.
• Select a normalization procedure based on (function of) theperformance scores.Unweighted mean score =⇒ none,fq,qc k=4,bio,no batchWeighted mean score =⇒none,fq,qc k=2,no bio,no batch
46 / 105
-
Software Package scone
●
●
●
●
●
●
●
●
● ●
●
●● ●●
●
●
●●
●●
●●●
● ●
●●
●
● ●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
● ●
●
●●
●
●●●
●
●
●
●
●●
●
● ●
●
●
●
●
●
●●
●
●
● ●
●
●●
●
●
●
●●●
●
● ●
●
●●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
−0.15 −0.10 −0.05 0.00 0.05 0.10
−0.
15−
0.05
0.00
0.05
0.10
SCONE: Biplot of scores colored by mean score
PC1
PC
2
BIO_SIL
BATCH_SILPAM_SIL
EXP_QC_COR
EXP_UV_COREXP_WV_COR
RLE_MEDRLE_IQR
Figure 14: scone. Biplot of performance scores, colored by mean score(yellow high/good, blue low/bad).
47 / 105
-
Software Package scone
●
●
●
●
●
●
●
●
● ●
●
●●●
●
●
●
●
●
●
●
●
●●
● ●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●●●
●
●
●
●
●●
●
● ●
●
●
●
●
●
●●
●
●
●●
●
●●
●
●
●
●●●
●
● ●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
−0.4 −0.2 0.0 0.2
−0.
20−
0.10
0.00
0.10
SCONE: Score PCA colored by mean score
PC1
PC
2
Top meanTop weighted meanfq,ruv_k=1none
Figure 15: scone. PCA of performance scores, colored by mean score.
48 / 105
-
Software Package scone
−0.4 −0.2 0.0 0.2
−0.
20−
0.10
0.00
0.10
SCONE: Score PCA colored by method
PC1
PC
2●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●●
●
●
●
●● ●
●
●
●●●
●
●
●
●● ●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●●
●
●
●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●●●
●
●
●●
●
●
●
●
●
nonefqdeseqtmm
Figure 16: scone. PCA of performance scores, colored by method –scaling method.
49 / 105
-
Software Package scone
−0.4 −0.2 0.0 0.2
−0.
20−
0.10
0.00
0.10
SCONE: Score PCA colored by method
PC1
PC
2
●●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●●
●●
●
●
●
●
●
●
●●
●
●
●●●●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●●
● ●
●
●●●
● ●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●●
●
●
●●
●●
●
●
●
●
●●
● ●●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
no_bio,no_batchno_bio,batchbio,no_batchbio,batch
Figure 17: scone. PCA of performance scores, colored by method –adjust biology, adjust batch.
50 / 105
-
Software Package scone
FQ + RUVg(HK, k=1): W by batch
W
−0.
15−
0.05
0.00
0.05
0.10
●
●●
●
●
●
●
●
●
●● ●
●
●●
●●
●●●
● ●
●
●
●
●
● ●●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●●
●
●
●
●●
●
●
●●● ●
●
●
●
●
●●
●
●
●
●
●●
● ●
●●
●
●
●
●
●●
● ●●
●
●
●●
●●
●
●
●
● ●●
●
● ●
●
●●
●
●
●
●
●
● ●
●
●
●● ●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●●
●●
●●●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●●
●●
●●
●
●
●
●
●
● ●
●●
●
●●
●●
●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●●
●
●●●
●●
●
●
●
●
●●
●
●
●●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●●
●
●●
●
●● ●
●●●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●●
●
●
●
●
● ●
●
●
●
●
●●●
●●●●
●●
●
●
●
● ●●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
● ●
●
●●
●
●
●
●
●
●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●● ●
●
●
●
●
●
●●●
●
●
●
●
●
●
● ●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●● ●
●●
● ●
●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●
●●●●
●
●●
●
●●
−0.20 −0.15 −0.10 −0.05 0.00 0.05 0.10
−5
05
10
FQ + RUVg(HK, k=1): QC PC1 vs. W
W
QC
PC
1
Cor=−0.56
(a) Unwanted factor W (b) QC PC1 vs. W
Figure 18: scone. Association of RUVg unwanted factor W and QCmeasures for none,fq,ruv k=1,no bio,batch.
51 / 105
-
Software Package scone
−3
−2
−1
0
1
2
3
4
SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: RLE
RLE
−2
0
2
4
6
SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: HK RLE
RLE
(a) All genes (b) Housekeeping genes
Figure 19: scone. Gene-level relative log expression (RLE = log-ratio ofread count to median read count across samples) for method with topweighted mean score none,fq,qc k=2,no bio,no batch.
52 / 105
-
Software Package scone
●
●●
●
●
●
●
●
●
●●●
●
● ●
●●
●●●
● ●
●
●
●
●
● ●●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●
●●
●
●
● ●●●
●
●
●
●
● ●
●
●
●
●
●●
● ●
●●
●
●
●
●
●●
● ●●
●
●
●●
●●
●
●
●
●●●
●
● ●
●
●●
●
●
●
●
●
● ●
●
●
●● ●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●●
●●
●● ●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●●
●●
●●
●
●
●
●
●
●●
●●
●
● ●
●●
●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●●
●
●● ●
●●
●
●
●
●
●●
●
●
●● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●●
●
●●
●
●● ●
● ●●
●
●
●
● ●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●●●
● ●●●
●●
●
●
●
● ●●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
● ●
●
●●
●
●
●
●
●
●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●●●
●
●
●
●
●
●●●
●
●
●
●
●
●
● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●● ●
●●
● ●
● ●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●
●●
●●
●
● ●
●
●●
−20 0 20 40
−5
05
10
SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: QC PC1 vs. count PC1
count PC1
QC
PC
1
Cor=0
PC1 PC2 PC3 PC4 PC5
SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: Absolute correlation of count PC and QC measure
0.0
0.2
0.4
0.6
0.8
1.0
NREADSNALIGNEDRALIGNTOTAL_DUPPRIMERPCT_RIBOSOMAL_BASESPCT_CODING_BASESPCT_UTR_BASESPCT_INTRONIC_BASESPCT_INTERGENIC_BASESPCT_MRNA_BASESMEDIAN_CV_COVERAGEMEDIAN_5PRIME_BIASMEDIAN_3PRIME_BIAS
(a) QC PC1 vs. count PC1 (b) Correlation of count PC and QC measures
Figure 20: scone. Association of counts and sample-level QC measures,none,fq,qc k=2,no bio,no batch.
53 / 105
-
Software Package scone: Summary
• Unnormalized gene-level counts exhibit large differences indistributions within and between batches and association withsample-level QC measures.
• Different normalization methods vary in performanceaccording to SCONE metrics and lead to different distributionsof gene-level counts, hence clustering and DE results.
• Global-scaling normalization. Not aggressive enough to handlepotentially large batch effects and association of counts andQC measures. Biological effects are dominated by nuisancetechnical effects. Additionally, for DESeq, the scaling factorsare computed based on only a handful of genes with non-zerocounts in all cells (5/22,054).
54 / 105
-
Software Package scone: Summary
• Batch effect normalization. Adjusting for batch effectswithout properly accounting for the nesting of batch withinbiological effects (no bio,batch) in the regression model isproblematic, as this removes the biological effects of interest(e.g., empirical Bayes framework of ComBat).
• FQ followed by QC-based or RUVg normalization. Seemseffective: Similar RLE distributions between samples, lowerassociation of counts and QC measures. The first unwantedfactor of RUVg is correlated with the first QC PC.
• The remaining analyses are based onnone,fq,qc k=2,no bio,no batch, the best methodaccording to weighted mean score.
55 / 105
-
Software Package scone: Summary
• Interpretation of performance metrics. Some metrics tend tofavor certain methods over others, e.g., EXP UV COR(correlation between count PCs and factors of unwantedvariation) naturally favors RUVg, especially when the same setof negative controls are used for normalization and evaluation.Hence, a careful, global interpretation of the metrics isrecommended.
• Negative controls. The selection of proper, distinct sets ofnegative controls is important, as these are used for bothnormalization (RUVg) and assessment of normalization results(EXP UV COR).
• Ongoing efforts.I Zero-inflated negative binomial (ZINB) model.I User-supplied factors unwanted and wanted variation (UV and
WV, respectively).I Other methods (e.g., ComBat/sva).
56 / 105
-
Software Package scone: Summary
I Other performance metrics.I Visualization.I Shiny app for interactive web interface.
57 / 105
-
Resampling-Based Sequential Ensemble Clustering
D. Risso, E. Purdom
58 / 105
-
Motivation
• Robustness to choice of samples. Both hierarchical andpartitioning methods tend to be sensitive to the choice ofsamples to be clustered. Outlying samples/clusters (e.g., glia)are common in scRNA-Seq and mask interesting substructurein the data, often requiring the successive pruning out ofdominating clusters to get to the finer structure.
• Robustness to clustering algorithm and tuning parameters.Clustering results are sensitive to pre-processing steps such asnormalization and dimensionality reduction, as well as to thechoice of clustering algorithm and associated tuningparameters (e.g., distance function, number of clusters).
59 / 105
-
Motivation
• Not focusing on the number of clusters. A major tuningparameter of partitioning methods such as partitioning aroundmedoids (PAM) and k-means is the number of clusters k .Methods for selecting k (e.g., silhouette width) are sensitiveto the choice of samples, normalization, and other tuningparameters. They tend to be conservative (low k), i.e.,capture only the coarse clustering structure and maskinteresting substructure in the data. Additionally, the numberof clusters k is often not of primary interest.E.g. Silhouette width with PAM selects only k = 2 clusters forthe OE p63 dataset.
• Not forcing samples into clusters. Some samples may beoutliers, that do not really belong to any clusters. Leavingthem out can improve the quality and interpretability of theclustering as well as downstream analyses (e.g., identificationof cluster marker genes).
60 / 105
-
Motivation
• Cluster gene expression signatures. Common differentialexpression statistics are not well-suited for finding markergenes for the clusters, especially for finer structure in ahierarchy.
61 / 105
-
Motivation
2 3 4 5 6 7 8 9 10 11 12 13 14 15
SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−, PAM PC: Average silhouette width
k
0.0
0.1
0.2
0.3
0.4
SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−, PAM PC k=2
silh
ouet
te
0.0
0.1
0.2
0.3
0.4
0.5
0.6 KO_14DPT
KO_24HPTKO_48HPTKO_7DPTKO_96HPTWT_72HPTWT_96HPT
(a) Average silhouette width (b) Silhouette widths, k = 2
Figure 21: Partitioning around medoids. 20 PC, one-minus-correlationdistance.
62 / 105
-
Resampling-Based Sequential Ensemble Clustering
• We have developed a resampling-based sequential ensembleclustering approach, with the aim of obtaining stable andtight clusters.
• Ensemble clustering, i.e., aggregating multiple clusteringsobtained from different algorithms or applications of a givenalgorithm to resampled versions of the learning set, is ageneral approach for improving stability. This can be viewedas the unsupervised analog of ensemble methods in supervisedlearning, e.g., bagging, boosting, random forests.
• Our approach is related to bagged/consensus/tight clustering(Dudoit and Fridlyand, 2003; Leisch, 1999; Tseng and Wong,2005).
• R package clusterExperiment, released through theBioconductor Project.
63 / 105
-
Resampling-Based Sequential Ensemble Clustering
RSEC: Resampling-based Sequential Ensemble Clustering.
• Given a base clustering algorithm (e.g., PAM, k-means) andassociated tuning parameters (e.g., number of principalcomponents, number of clusters k , distance matrix), generatea single candidate clustering using
I resampling-based clustering to find robust and tight clusters;I sequential clustering to find stable clusters over a range of
numbers of clusters (Tseng and Wong, 2005).
• Generate a collection of candidate clusterings by repeating theabove procedure for different base clustering algorithms andtuning parameters.
• Identify a consensus over the different candidate clusterings.• Merge non-differential clusters.• Find cluster signatures by testing for differential expression
between selected subsets of clusters.
64 / 105
-
Resampling-Based Sequential Ensemble Clustering
• Visualization. Comparison of multiple clusterings of the samesamples, heatmaps of co-clustering matrices, heatmaps withhierarchical clustering of genes and/or samples.
65 / 105
-
Resampling-Based Clustering
• Consider a particular base partitioning clustering algorithm(e.g., PAM, k-means) and associated tuning parameters (e.g.,number of principal components, number of clusters k ,distance matrix).
• Generate B subsamples of the n observations to be clustered,e.g., bootstrap or random sample without replacement (of size0.7n, say).
• For each subsample, apply the base clustering algorithm (withk clusters) and assign all n original observations to clustersbased on their distances from cluster centroids/medoids.
• Generate an n × n co-clustering matrix C by recording foreach pair of observations the proportion of times they areclustered together (out of B subsamples).
66 / 105
-
Resampling-Based Clustering
• Identify tight clusters based on the co-clustering matrix. Forexample, given a constant α close to 0, cut branchestop-down from a hierarchical clustering based on theco-clustering matrix C , such that Ci ,j ≥ 1− α ∀i , j in theselected branch (or some other less stringent function thanmin). Retain all clusters above a certain size or the largestcertain number of clusters.
67 / 105
-
Sequential Clustering
Based on Tseng and Wong (2005).
• Starting with a user-supplied k = k0, generate a collection ofclusterings, {Ck1 , Ck2 , . . . , Ckmk}, by applying the aboveresampling-based procedure to the base partitioning clusteringalgorithm with k clusters.
• Given a constant β close to 1, increase k by one untils(Cki , C
k+1j ) ≥ β, for some i , j , where s is a measure of
similarity between clusters.
• Identify Ck+1j as a stable cluster.
• Remove Ck+1j and repeat the above steps with k ← k − 1,until k is below a certain threshold or a certain number ofclusters are identified.
68 / 105
-
Sequential Clustering
• Here, we evaluate a collection of clusters for stability based onthe pairwise similarity measure
s(Ci , Cj) ≡ |Ci ∩ Cj |/|Ci ∪ Cj |. (2)
69 / 105
-
Ensemble Clustering
• Generate a collection of candidate clusterings as above, for arange of base clustering algorithms and tuning parameters.
• Find a single consensus clustering based on the co-clusteringmatrix (as in resampling, only now across different algorithmsor tuning parameters).
• Merge non-differential clusters by creating a hierarchy ofclusters, working up the tree, testing for differential expressionbetween sister nodes, and collapsing nodes with insufficientDE.
70 / 105
-
Differential Expression
• Find cluster gene expression signatures, i.e., marker genes, bytesting for differential expression between selected subsets ofclusters.
• Standard F -statistic. Tests for any difference betweenclusters. Sensitive to outlying samples/clusters. Non-specific,i.e., not useful for interpreting differences between clusters.
• Standard solution in (generalized) linear models/ANOVA is toconsider contrasts between groups of clusters. By using themachinery of the (generalized) linear model, we use all of thesamples in testing these contrasts, rather than just thosesamples involved in the corresponding clusters.
I All pairwise. All pairwise comparisons between clusters.I One against all. Compare each cluster to union of remaining
clusters.
71 / 105
-
Differential Expression
I Dendrogram. Create a hierarchy of clusters, work up the tree,test for DE between sister nodes (as in approach used formerging clusters).
• For each contrast, test for DE using empirical Bayes linearmodeling approach of R package limma, with voom option toaccount for mean-variance relationship of log-counts (i.e.,over-dispersion).
72 / 105
-
Software Package clusterExperiment
Workflow.
• clusterMany. Generate a collection of candidate clusterings,for different base clustering algorithms and tuning parameters,with option to use resampling and sequential approaches.
• combineMany. Find consensus clustering across severalclusterings.
• Identify non-differential clusters that should be merged intolarger clusters.
I makeDendrogram. Hierarchical clustering of the clusters foundby combineMany.
I mergeClusters. Merge clusters of this hierarchy based on DEbetween nodes.
• RSEC. Wrapper function around the clusterExperimentworkflow.
73 / 105
-
Software Package clusterExperiment
• getBestFeatures. Find cluster signatures by testing fordifferential expression between selected subsets of clusters.
• Visualization.I plotClusters. Comparison of multiple clusterings of the
same samples. Based on ConsensusClusterPlus package.I plotHeatmap. Heatmaps of co-clustering matrices, heatmaps
with hierarchical clustering of genes and/or samples (interfaceto aheatmap from NMF package).
74 / 105
-
Software Package clusterExperiment
Application to OE p63 dataset.
• clusterMany: Generate 22 candidate clusterings.I Dimensionality reduction: 25, 50 PC.I Euclidean distance.I Base clustering method: PAM, k = 5, · · · , 15.I Resampling-based clustering: B = 100, proportion = 0.7,α = 0.3.
I Sequential clustering: k0 = 15, β = 0.9.I clusterFunction=c("hierarchical01").
• combineMany(ce, clusterFunction="hierarchical01",whichClusters="workflow", proportion=0.7,
propUnassigned=0.5, minSize=5).
• mergeClusters(ce, mergeMethod="adjP",cutoff=0.05).
75 / 105
-
Software Package clusterExperiment
Clusterings from clusterMany
nPCAFeatures=50,k0=15nPCAFeatures=25,k0=15nPCAFeatures=50,k0=14nPCAFeatures=25,k0=14nPCAFeatures=50,k0=13nPCAFeatures=25,k0=13nPCAFeatures=50,k0=12nPCAFeatures=25,k0=12nPCAFeatures=50,k0=11nPCAFeatures=25,k0=11nPCAFeatures=50,k0=10nPCAFeatures=25,k0=10nPCAFeatures=50,k0=9nPCAFeatures=25,k0=9nPCAFeatures=50,k0=8nPCAFeatures=25,k0=8nPCAFeatures=50,k0=7nPCAFeatures=25,k0=7nPCAFeatures=50,k0=6nPCAFeatures=25,k0=6nPCAFeatures=50,k0=5nPCAFeatures=25,k0=5
Figure 22: clusterExperiment. Comparison of 22 clusterManyclusterings using plotClusters.
76 / 105
-
Software Package clusterExperiment
combineMany−1c1c10c11c12c13c14c15c16c17c18c19c2c20c21c22c23c24c25c26c27c28c29c3
0
0.2
0.4
0.6
0.8
1
Co−clustering proportion matrix
combineMany
Figure 23: clusterExperiment. Heatmap of co-clustering matrix forclusterMany clusterings, used to create combineMany clustering(plotHeatmap).
77 / 105
-
Software Package clusterExperiment
c1
c2
c3
c4
c5
c6
c7c8c9
c10
c11c12
c13
c14
c15
c16
c17c18
c19c20
c21c22
c23
c24c25
c26c27
c28
c29c30c31
c32
c1
c2
c3
c4
c5
c6
c7c8c9
c10
c11c12
c13
c14
c15
c16
c17c18
c19c20
c21c22
c23
c24c25
c26c27
c28
c29c30c31
c32
0.38
0.058
0.15
0.028
0.066
0.088
0.035 0.014
0.014
0.039
0.039
0.026
0.023
0.017
0.016
0.0065
0.076
0.00790.011
0.0086
0.0110.0042
0.0077
0.0092
0.006
0.0065
0.0086
0.0056
0.0051
0.0038
0.0056
Figure 24: clusterExperiment. Dendrograms from combineMany andmergeClusters.
78 / 105
-
Software Package clusterExperiment
●
●
●●
●
●
●
●●
● ●
●
●
●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●● ●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●● ●
●●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●●
●●
●
●
●
●●●●
●●
●
●●
●
●
●
●
●●
●
●●
●
●
●
● ●●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●● ●
●
●●●
●
● ●
●
●
●●
●
●
● ●
●●
●
●
●
●
●●
●●
●
●
●
●
●● ●
●
●
● ●
●
●● ●
●
●
●
−50 0 50
−40
−20
020
4060
80
Count PCA by cluster
PC1
PC
2
●
●
●
●
●
●
●
●
●
−112345678
Figure 25: clusterExperiment. PCA of gene-level log-counts, colored bymergeClusters.
79 / 105
-
Software Package clusterExperiment
mergeClusters−1m1m2m3m4m5m6m7m8
Bio−1−2ContaminantsHBCHBC transitionINP/GBCINP2iOSNMicrovillousmOSNsSUSSUS precursor
−4
−2
0
2
4
6
8
10
Heatmap of DE genes, dendrogram contrasts
mergeClustersBio
Figure 26: clusterExperiment. Heatmap of log-counts forgetBestFeatures DE genes using dendrogram contrasts (269, top 50 ineach of the 6 nodes).
80 / 105
-
Software Package clusterExperiment
sirt5
hnrnpl
rmi2
gm17821
nup98
cox6c
oaz1
rpl27
rplp1
rpl13a
rps20
rps7
rpl23
rpl37a
rpl23a
rps29
rpl32
rplp2
rpl24
fau
rpl36
rpl37
rps23
rpl35
prrc2c
son
kcnq1ot1
nip7
eif1
cfl1
rpl10
lars2
rps25
rpl13
naca
rplp0
rpl7
rpl8
rps6
rps3a1
rpl21
rps11
rpl14
hmgb1
rpl3
trns1
rnr1
ppia
rps27
myh9
rpl9
rps9
rps19
rpl41
rps14
map1b
eif4g2
calm2
ywhae
gsk3b
calm1
rn18s−rs5
malat1
actb
gnb1
rnr2
hsp90ab1
eef1a1
h3f3b
canx
btg1
srsf2
csde1
fth1
sod1
set
ywhaz
ubb
tmsb4x
prdx1
akr1a1
sptbn1
ubc
mt1
dstn
rps5
rpl4
rps3
rpl18a
actr2
ddx5
hnrnpa2b1
skp1a
hspa8
ptges3
tpt1
actg1
ftl1
ptma
gm1821 mergeClusters−1m1m2m3m4m5m6m7m8
Bio−1−2ContaminantsHBCHBC transitionINP/GBCINP2iOSNMicrovillousmOSNsSUSSUS precursor
0
2
4
6
8
10
Heatmap of DE genes, F−test
mergeClustersBio
Figure 27: clusterExperiment. Heatmap of log-counts forgetBestFeatures DE genes using F -test (top 100).
81 / 105
-
Software Package clusterExperiment
sox11
creer
gstm2
cyp2f2
cbr2
trim66
gap43
ebf1
ebf2
ncam1
omp
cnga2
gng13
rgs11
lhx2
ebf3
nhlh1
gng8
arhgdig
cyp2g1
lgr5
cyp1a2
scarb1
coch
cd24a
cftr
ceacam1
prmt5
fabp5
neurod1
nhlh2
zfp423
ebf4
rrm1
ezh2
hes6
ccnd1
rbm24
kit
dtl
tead2
mcm7
ccna2
ect2
top2a
cdca8
pbk
il6
notch1
yap1
icam1
ppara
fgfr2
dkk3
igfbp4
itgb5
egfr
igfbp2
pax6
sall1
elf5
il33
sox2
wnt4
lifr
nrcam
trp63
krt14
krt5
sox9
kitl
sfrp1
f3
perp
ccnd2
hmgb2
tubb5
cdkn1b
egfp
igfbp5
krt8
hes1
notch2 mergeClusters−1m1m2m3m4m5m6m7m8
Bio−1−2ContaminantsHBCHBC transitionINP/GBCINP2iOSNMicrovillousmOSNsSUSSUS precursor
0
2
4
6
8
10
Heatmap of markers genes
mergeClustersBio
Figure 28: clusterExperiment. Heatmap of log-counts for “a priori”markers genes (83).
82 / 105
-
Software Package clusterExperiment: Summary
• Tuning parameters.I α controls tightness.I β controls stability.I k0, the initial number of clusters used in sequential clustering,
is the parameter with the greatest impact on the results.Larger k0 tend to lead to smaller and tighter clusters.
• Caveat. The DE analysis is exploratory and nominal p-valuesonly a rough summary of significance (reliance on models, tinyp-values even after adjustment for multiple testing, same dataused to define clusters and to perform DE analysis).
• Ongoing efforts.I Cluster confidence measures.I Potentially assign unclustered observations to clusters.I DE using ZINB model.I Greater modularity (e.g., distance functions, DE test).I Shiny app for interactive web interface.
83 / 105
-
Cell Lineage and Pseudotime Inference
K. Street, D. Risso, E. Purdom
84 / 105
-
Motivation
• Mapping transcriptional progression from stem cells tospecialized cell types is essential for properly understandingthe mechanisms regulating cell and tissue differentiation.
• There may not always be a clear distinction between states,but rather a smooth transition, with individual cells existingon a continuum between states.
• In such a case, cells may undergo gradual transcriptionalchanges, where the relationship between states can berepresented as a continuous lineage dependent upon anunderlying spatial or temporal variable. This representation,referred to as pseudotemporal ordering, can help usunderstand how cells differentiate and how cell fate decisionsare made (Bendall et al., 2014; Campbell et al., 2015; Ji andJi, 2016; Petropoulos et al., 2016; Shin et al., 2015; Trapnellet al., 2014).
85 / 105
-
Motivation
• We have developed Slingshot as a flexible and robustframework for inferring cell lineages and pseudotimes in thestudy of continuous differentiation processes.
86 / 105
-
Cell Lineage and Pseudotime Inference
• Input/Output.I Input. Normalized gene expression measures and cell
clustering.I Output. Cell lineages, i.e., subsets of ordered cell clusters.
Cell pseudotimes, i.e., for each lineage, ordered sequence ofcells and associated pseudotimes.
• Dimensionality reduction.I Principal component analysis (PCA) seems effective and
simple, in conjunction with steps detailed next.I Other approaches include related linear methods, e.g.,
independent component analysis (ICA) (Trapnell et al., 2014,Monocle), and non-linear methods, e.g., Laplacianeigenmaps/spectral embedding (Campbell et al., 2015,Embeddr), t-distributed stochastic neighbor embedding(t-SNE) (Bendall et al., 2014; Petropoulos et al., 2016,Wanderlust).
87 / 105
-
Cell Lineage and Pseudotime Inference
• Inferring cell lineages.I Minimum spanning tree (MST; ape package) over cell clusters,
with between-cluster distance based on Euclidean distancebetween cluster means scaled by within-cluster covariance.
I Outlying clusters. Identified using granularity parameter ω thatlimits maximum edge weight in the tree. Specifically, buildMST using an artificial cluster Ω, with distance ω from otherclusters (a fraction of maximum pairwise distance betweenclusters), and then remove Ω.
I Root and leaf nodes. May either be pre-specified orautomatically selected.Root node. If not pre-specified, selected based on parsimony(i.e., set of lineages with maximal number of clusters sharedbetween them).Leaf nodes. If pre-specified, constrained MST.
I A lineage is then defined as any unique path coming out of theroot node and ending in a leaf node.
88 / 105
-
Cell Lineage and Pseudotime Inference
I Constructing the MST on clusters (Ji and Ji, 2016; Shin et al.,2015, TSCAN,Waterfall) vs. cells (Trapnell et al., 2014,Monocle) offers greater stability and computational efficiency,less complex lineages, and easier determination of directionalityand branching.
• Inferring cell pseudotimes.I Iterative procedure inspired from the principal curve algorithm
of Hastie and Stuetzle (1989); principal.curve function inprincurve package.
I In the case of branching lineages, a shrinkage step is includedat each iteration, that forces a degree of similarity between thecurves in the neighborhood of shared clusters.
I Pseudotime values are derived by orthogonal projection ontothe curves.
I Cells belonging to clusters that are included in multiplelineages have multiple, similar pseudotime values.
89 / 105
-
Cell Lineage and Pseudotime Inference
I Previous approaches also use smooth curves to representlineages (Campbell et al., 2015; Petropoulos et al., 2016,Embeddr), while others use piecewise linear paths through theMST and extract orderings either by orthogonal projection (Jiand Ji, 2016; Shin et al., 2015, TSCAN,Waterfall) or PQ tree(Trapnell et al., 2014, Monocle).
I We find that smooth curves provide discerning power not foundin piecewise linear trajectories, while also adding stability overa range of dimensionality reduction and clustering methods.
• Differential expression. Regression of gene expressionmeasures on pseudotime, e.g., generalized additive models(GAM) (Ji and Ji, 2016, TSCAN).
• Visualization. Two- and three-dimensional plots of celllineages and pseudotimes, gene-level trajectories, heatmaps forDE genes.
90 / 105
-
Cell Lineage and Pseudotime Inference
• R package slingshot, to be released through the BioconductorProject: github.com/kstreet13/slingshot.
91 / 105
http://github.com/kstreet13/slingshot
-
Software Package slingshot
• Modularity.I Integrates easily with a range of normalization, clustering, and
dimensionality reduction methods.I get lineages: Given expression measures and cluster labels,
use MST to infer lineages.get lineages(X, clus.labels, start.clus = NULL,
end.clus = NULL, dist.fun = NULL, omega = Inf,
distout = FALSE).I get curves: Given lineages, infer pseudotimes.
get curves(X, clus.labels, lineages, thresh =
1e-04, maxit = 100, stretch = 2, shrink = TRUE).
• Flexibility. Can be used with varying levels of supervision.I Cluster-based approach allows for easy supervision when
researchers have prior knowledge of cell classes, while stillbeing able to detect novel branching events.
I User-supplied or data-driven selection of root and leaf nodes.
92 / 105
-
Software Package slingshot
• Visualization.I plot tree: MST in 2 and 3D.I plot curves: Lineage curves in 2 and 3D.
93 / 105
-
Software Package slingshot
Application to OE p63 dataset.
• Applied to the first three principal components, Slingshotidentifies two lineages: The first corresponds to theHBC-to-neurons transition, the second to theHBC-to-sustentacular cells transition.
• A first-pass DE analysis, based on a regression of log-count onpseudotime using GAM, suggests that many genes areinvolved in the differentiation process.
• Among the top 100 DE genes for each lineage, only 13 are DEin both, suggesting distinct processes in the neuronal vs.non-neuronal lineages.
top related