
Using Single-Cell Transcriptome Sequencing to Infer Olfactory Stem Cell Fate Trajectories

Sandrine Dudoit
Division of Biostatistics and Department of Statistics
University of California, Berkeley
www.stat.berkeley.edu/~sandrine

Computational and Systems Immunology Seminar
Stanford University

June 28, 2016

Version: 26/06/2016, 22:58


Acknowledgments

• Sandrine Dudoit, Division of Biostatistics and Department of Statistics, UC Berkeley.
  ▸ Fanny Perraudeau, Graduate Group in Biostatistics, UC Berkeley.
  ▸ Davide Risso, Division of Biostatistics, UC Berkeley.
  ▸ Kelly Street, Graduate Group in Biostatistics, UC Berkeley.
• John Ngai, Department of Molecular and Cell Biology, UC Berkeley – Principal investigator.
  ▸ Diya Das.
  ▸ Russell Fletcher.
  ▸ David Stafford.
• Elizabeth Purdom, Department of Statistics, UC Berkeley.
• Jean-Philippe Vert, Mines ParisTech and Institut Curie, Paris, France.
  ▸ Svetlana Gribkova.


• Nir Yosef, Department of Electrical Engineering and Computer Sciences, UC Berkeley.
  ▸ Michael Cole.
  ▸ Allon Wagner.
• Funded by BRAIN Initiative and California Institute for Regenerative Medicine (CIRM).


Outline

1 Olfactory Stem Cell Fate Trajectories
    Olfactory Stem Cells and Neural Regeneration
    Olfactory Epithelium p63 Dataset
    Analysis Pipeline

2 Exploratory Data Analysis and Quality Assessment/Control

3 Normalization and Expression Quantitation
    Motivation
    Methods
    Software: scone
    Zero-Inflated Negative Binomial Model

4 Resampling-Based Sequential Ensemble Clustering
    Motivation
    Methods
    Software: clusterExperiment

5 Cell Lineage and Pseudotime Inference
    Motivation
    Methods
    Software: slingshot

Olfactory Stem Cells and Neural Regeneration

R. Fletcher, J. Ngai



• The nature of stem cells giving rise to the nervous system is of particular interest in neurobiology, because neural stem cells remain active in certain brain regions for the entire life of an individual.
• We focus on the mouse olfactory epithelium (OE), a site of active neurogenesis in the postnatal animal.
• Adult olfactory stem cells support the replacement of olfactory sensory neurons and non-neuronal support cells (e.g., sustentacular) over postnatal life and can reconstitute the entire OE following injury.
• The OE is a convenient system to study, due to its experimental accessibility (in situ analysis) and its limited number of cell types:
  ▸ olfactory sensory neurons (OSN),
  ▸ sustentacular cells (SUS),
  ▸ cells of the Bowman gland,
  ▸ microvillous cells (rare).


Figure 1: Mouse olfactory epithelium.

[Figure: labeled cell types of the olfactory epithelium: sustentacular cells, mature olfactory neurons, immature olfactory neurons, globose basal cells (GBC), horizontal basal cells (HBC), olfactory ensheathing glia, Bowman’s gland.]

• HBCs: multipotent, quiescent – deep reserve adult tissue stem cell.
• GBCs: proliferative progenitor cells + transit-amplifying cells.

The Horizontal Basal Cell Is an Adult Tissue Stem Cell

Figure 2: Olfactory epithelium cell types.


Open questions.

• Determine the stage at which the neuronal and non-neuronal lineages bifurcate/diverge.
• Characterize discrete intermediate stages of cell differentiation.
• Identify the genetic networks and signaling pathways that promote self-renewal and regulate the transition to differentiation.


p63 regulation of horizontal basal cells.

• The horizontal basal cell (HBC) is an adult tissue stem cell.
• The p63 protein (tumor protein p63, TP63) promotes self-renewal of HBC by blocking differentiation.
• When p63 is down-regulated, this “brake” is removed, allowing differentiation to proceed at the expense of self-renewal. Thus, p63 can be viewed as a “molecular switch” that decides between the alternate stem cell fates of self-renewal vs. differentiation.
• We use single-cell RNA-Seq to analyze cell fate trajectories from olfactory stem cells (HBC) of p63 conditional knock-out mice.

Olfactory Epithelium p63 Dataset

Experimental design (Russell Fletcher, Levi Gadye, Mike Sanchez)

[Figure: Krt5 Cre-ER × Rosa loxP-STOP-loxP YFP lineage-tracing cross, with p63; tamoxifen timeline with marks at 0, 24h, 48h, 72h, 96h, 7d, 14d. Resting HBCs (YFP+;p63+/+): 98 cells. Differentiating cells (YFP+;p63-/-): 88 cells, 100 cells, 128 cells, 110 cells, 98 cells. FACS-purify lineage-traced cells -> analyze by single-cell RNA-Seq.]

Figure 3: Experimental design. Single-cell RNA-Seq for 636 HBC: 102 wild-type and 534 p63 knock-out cells.


[Figure: bar plot of the number of cells (0 to 60) per batch (P01, P02, P03A, P03B, P04, P05, P06, P10, P11, P12, P13, P14, Y01, Y04), colored by biological condition (KO_14DPT, KO_24HPT, KO_48HPT, KO_7DPT, KO_96HPT, WT_72HPT, WT_96HPT).]

Figure 4: Experimental design. Number of cells per batch, colored by biological condition.


• Single-cell RNA-Seq for 636 HBC.
  ▸ 102 wild-type (WT)/resting cells;
    534 p63 knock-out (KO) cells, at five timepoints following tamoxifen treatment.
  ▸ Biological replicate: Cells from 1–3 mice.
    At least two replicates per biological condition.
  ▸ One FACS run and one C1 run per biological replicate ⟹ 14 batches.
  ▸ 8 HiSeq runs (96 cells/lane, single-end 50-base-pair reads).
• Some confounding between biological and technical effects.
• Batch effects nested within biological effects.


• Marker genes. 94 marker genes, curated from literature and from prior microarray, sequencing, and in situ experiments, e.g., neuronal, progenitor cell markers.
• Housekeeping genes. 715 housekeeping (HK) genes, curated from prior microarray experiments, expected to be constantly and highly-expressed across cells of the OE.
• Gene-level read counts. TopHat2 alignment to RefSeq mm10 genome and featureCounts (bioinf.wehi.edu.au/featureCounts) counting, with genes defined as union of all isoforms.

Analysis Pipeline

• Input/Output (I/O).
  ▸ Input. FASTQ files, gene-level annotation metadata, sample-level annotation metadata.
  ▸ Output. Cluster labels corresponding to cellular types, cell lineages and pseudotimes, list of differentially expressed genes corresponding to markers/transcriptome signatures for these types/lineages.
• Pre-processing. Read alignment, gene-level expression quantitation, quality assessment/control.
• Exploratory data analysis (EDA) and quality assessment/control (QA/QC). Numerical and graphical summaries of gene-level read counts, sample filtering (e.g., based on QC measures), gene filtering (e.g., zero inflation).
• Normalization and expression quantitation. Adjust for unwanted technical effects (e.g., C1 run effects), account for zero inflation and over-dispersion.
• Cluster analysis. Dimensionality reduction (e.g., based on principal component analysis), distance measure, clustering procedure and tuning parameters, confidence measures/validation of clustering results.
• Cell lineage and pseudotime inference.
• Differential expression (DE) analysis. Identification of marker genes for cell types/lineages.

Computationally Reproducible Research

• Version control system for software development and data analysis: Git, GitHub.
• R data packages, with standard object structure (S4 classes ExpressionSet, SummarizedExperiment) for gene-level expression measures and gene-level and sample-level annotation metadata.
• Compendia, i.e., integrated and executable input documents, with text, data, code, and software, that allow the generation of dynamic and reproducible output documents, intermixing text and code output (textual and graphical). Different views of the output document (e.g., PDF, HTML) can be automatically generated and updated whenever the text, data, or code change.
  E.g., for R and LaTeX/markdown: Sweave, knitr, RStudio.

Zero Inflation

[Figure, two panels, pre gene filtering: (a) proportion of genes with zero count; (b) density of the proportion of cells in which a gene is detected (N = 22054, bandwidth = 0.02443).]

Figure 5: Zero inflation. Pre gene filtering.

• Single-cell RNA-Seq data have many more genes with zero read counts than bulk RNA-Seq data.
• This zero inflation could occur for biological reasons (i.e., the gene is simply not expressed) or technical reasons (e.g., low capture efficiency).
• Zero-count gene filtering is advisable for normalization and downstream analyses.
• Most RNA-Seq normalization methods involve scaling and perform poorly when many genes have zero counts.
• In particular, the global-scaling method of Anders and Huber (2010), implemented in the Bioconductor R package DESeq, discards any gene having zero count in at least one sample. In practice, the scaling factors are therefore estimated based on only a handful of genes, e.g., 5/22,054 genes for OE dataset.
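To make the last point concrete, here is a minimal sketch of median-of-ratios scaling factors in plain Python. It is an illustration of the idea described above, not the DESeq implementation; the function name and the toy counts are hypothetical. It shows why genes with a zero in any sample drop out: the geometric-mean reference is zero for them, so no ratio can be formed.

```python
import math
from statistics import median

def median_ratio_size_factors(counts):
    """counts: list of genes, each a list of per-sample counts (genes x samples).

    Returns (per-sample scaling factors, number of genes actually used).
    """
    n_samples = len(counts[0])
    # Only genes with strictly positive counts in every sample are usable:
    # otherwise the geometric mean across samples is zero.
    usable = [g for g in counts if all(c > 0 for c in g)]
    if not usable:
        raise ValueError("no gene has non-zero counts in all samples")
    # Log geometric mean per gene: the synthetic reference sample.
    log_ref = [sum(math.log(c) for c in g) / n_samples for g in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(g[j]) - r for g, r in zip(usable, log_ref)]
        factors.append(math.exp(median(log_ratios)))
    return factors, len(usable)

# Toy data (hypothetical): sample 2 is sequenced twice as deeply as sample 1,
# and two of the five genes contain a zero.
toy_counts = [[10, 20], [100, 200], [0, 5], [30, 60], [7, 0]]
factors, n_used = median_ratio_size_factors(toy_counts)
print(factors, n_used)
```

With many zero counts, `n_used` shrinks quickly as the number of samples grows, which is the situation reported for the OE dataset.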


• Full-quantile (FQ) normalization also doesn’t behave properly due to ties from the large number of zeros.
• We apply the following zero-count gene filtering to the OE dataset: Retain only the genes with at least n_r = 20 reads, in at least n_s = 40 samples.
  This yields 9,133/22,054 genes.
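The filtering rule above can be sketched in a few lines of plain Python. The function name and the toy matrix are hypothetical; the defaults mirror the thresholds stated on the slide (20 reads, 40 samples).

```python
def filter_genes(counts, min_reads=20, min_samples=40):
    """Return the indices of genes with at least `min_reads` reads
    in at least `min_samples` samples.

    counts: list of genes, each a list of per-sample read counts.
    """
    kept = []
    for i, gene in enumerate(counts):
        n_passing = sum(1 for c in gene if c >= min_reads)
        if n_passing >= min_samples:
            kept.append(i)
    return kept

# Toy example with smaller thresholds (4 samples only).
toy_counts = [
    [25, 30, 0, 40],   # 3 samples with >= 20 reads: kept
    [0, 0, 0, 100],    # 1 sample with >= 20 reads: dropped
    [20, 20, 20, 19],  # 3 samples with >= 20 reads: kept
]
kept = filter_genes(toy_counts, min_reads=20, min_samples=3)
print(kept)
```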


[Figure, two panels, post gene filtering: (a) proportion of genes with zero count; (b) density of the proportion of cells in which a gene is detected (N = 9133, bandwidth = 0.03333).]

Figure 6: Zero inflation. Post gene filtering.

Sample-Level QC

[Figure: "QC measures by batch". Boxplots by batch (P01, P02, P03A, P03B, P04, P05, P06, P10, P11, P12, P13, P14, Y01, Y04) of four QC measures: NREADS, NALIGNED, RALIGN, TOTAL_DUP.]

Figure 7: Sample-level QC. Boxplots of QC measures, by batch.

[Figure: boxplots by batch (P01, P02, P03A, P03B, P04, P05, P06, P10, P11, P12, P13, P14, Y01, Y04) of four QC measures: PCT_INTRONIC_BASES, PCT_INTERGENIC_BASES, PCT_MRNA_BASES, MEDIAN_CV_COVERAGE.]

Figure 8: Sample-level QC. Boxplots of QC measures, by batch.

[Figure: "QC PCA by batch". (a) QC PC2 vs. PC1, colored by batch; (b) QC PC1, by batch (P01, P02, P03A, P03B, P04, P05, P06, P10, P11, P12, P13, P14, Y01, Y04).]

Figure 9: Sample-level QC. Principal component analysis (PCA) of sample-level QC measures.

[Figure: (a) QC PC1 vs. count PC1 (Cor = −0.49); (b) absolute correlation of count PC1 to PC5 and QC measures (NREADS, NALIGNED, RALIGN, TOTAL_DUP, PRIMER, PCT_RIBOSOMAL_BASES, PCT_CODING_BASES, PCT_UTR_BASES, PCT_INTRONIC_BASES, PCT_INTERGENIC_BASES, PCT_MRNA_BASES, MEDIAN_CV_COVERAGE, MEDIAN_5PRIME_BIAS, MEDIAN_3PRIME_BIAS).]

Figure 10: Sample-level QC. Association of counts and sample-level QC measures.

Sample-Level QC: Summary

• The distribution of QC measures can vary substantially within and between batches.
• Some QC measures clearly point to low-quality samples, e.g., low percentage of mapped reads (RALIGN).
• There can be a strong association between QC measures and read counts (cf. PCA).
• Filtering samples based on QC measures is advisable, as normalization procedures may not be able to adjust for QC and some samples simply have low quality.
• Normalization procedures based on QC measures (e.g., regression on first few PC of QC measures) should also be considered.

Gene-Level Counts

[Figure: boxplots of (a) gene-level log-count, log(count+1), and (b) gene-level RLE.]

Figure 11: Gene-level counts. Gene-level log-count and relative log expression (RLE = log-ratio of read count to median read count across samples).

Gene-Level Counts: Summary

• After gene and sample filtering and before normalization, there are large differences in gene-level count distributions within and between batches (cf. RLE, housekeeping genes).
• The counts are still zero-inflated.
• There can be substantial association of counts and sample-level QC measures.
• Normalization is essential before any clustering or differential expression analysis, to ensure that observed differences in expression measures between samples and/or genes are truly due to differential expression and not technical artifacts.

Motivation

• In order to derive expression measures and compare these measures between samples and/or genomic regions of interest (ROI), one first needs to normalize read counts to adjust for differences in sequencing depths (i.e., total read counts) and ROI lengths, and possibly more complex technical effects, e.g., C1 run/library preparation/flow-cell/lane effects and nucleotide composition (e.g., GC-content) effects – sequenceability.
• Normalization is essential to ensure that observed differences in expression measures between samples and/or ROI are truly due to differential expression and not technical artifacts.

• Normalization is still an important issue for RNA-Seq, despite initial optimistic claims such as: “One particularly powerful advantage of RNA-Seq is that it can capture transcriptome dynamics across different tissues or conditions without sophisticated normalization of data sets” (Wang et al., 2009).
• This subject has received relatively little attention and few novel methods have been proposed in the RNA-Seq literature.
• It is still common to only scale ROI-level counts by ROI length and total sample read count to adjust, respectively, for differences in ROI lengths and sequencing depths (even for technical replicate lanes).
  E.g., estimators based on Poisson model as in Marioni et al. (2008); RPKM, Reads Per Kilobase of exon model per Million mapped reads, as in Mortazavi et al. (2008).
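As a small worked example of this length-and-depth scaling, RPKM divides a region's read count by its length in kilobases and by the sample's total mapped reads in millions. The sketch below uses hypothetical numbers and a hypothetical function name.

```python
def rpkm(count, length_bp, total_mapped_reads):
    """Reads Per Kilobase of exon model per Million mapped reads.

    count: reads mapped to the region of interest
    length_bp: region length in base pairs
    total_mapped_reads: total mapped reads in the sample
    """
    return count * 1e9 / (length_bp * total_mapped_reads)

# 500 reads on a 2,000 bp gene in a sample with 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)

# Doubling both the sequencing depth and the gene's count leaves RPKM unchanged,
# which is all this kind of scaling adjusts for.
value_deeper = rpkm(1000, 2000, 20_000_000)
print(value, value_deeper)
```

This adjusts only for region length and sequencing depth; the more complex technical effects listed under Motivation are left untouched.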

• In many cases, however, one needs to account for more complex features of the genomic regions and experimental design.
• Normalization seems even more important for single-cell RNA-Seq than bulk RNA-Seq.
• There are large effects related to C1 run (cells processed in batches of 96 wells).
• There is also greater variation in gene-level counts across single cells than across bulk RNA-Seq samples.
• The very small amount of RNA present in a single cell may cause amplification bias (i.e., many sequenced copies of a transcript), hence high-magnitude outliers, and low capture efficiency (i.e., genes expressed in the cell, but not captured by the sequencing process), hence zero inflation.
• Widely-used bulk RNA-Seq methods are not well-suited for handling the zero inflation and high level of technical noise of scRNA-Seq data (Vallejos et al., 2016).

Between-Sample Normalization

Most normalization methods proposed to date are adaptations of methods for bulk RNA-Seq and microarrays.

• Global-scaling normalization. Gene-level counts are scaled by a single factor per sample.
  ▸ Total-count (TC), as in RPKM (Mortazavi et al., 2008), still widely used.
  ▸ Housekeeping gene count, as in qRT-PCR, e.g., per-sample POLR2A count (Bullard et al., 2010).
  ▸ Single-quantile, e.g., per-sample upper-quartile (UQ) of counts for genes with non-zero counts (Bullard et al., 2010).
• Pairwise global-scaling normalization. Gene-level counts are scaled by a single factor per sample computed with respect to a reference sample.
  ▸ Trimmed Mean of M values (TMM) (Robinson and Oshlack, 2010): Select a reference sample and obtain a robust estimate of the overall fold-change between each sample and the reference. Implemented in edgeR.

  ▸ Anders and Huber (2010): The global-scaling factor for a given sample is defined as the median fold-change between that sample and a synthetic reference sample whose counts are defined as the geometric means of the counts across samples. Implemented in DESeq and edgeR (as “RLE”).
• Full-quantile normalization (FQ). All quantiles of the gene count distributions are matched between samples (Bullard et al., 2010). Specifically, for each sample, the distribution of sorted read counts is matched to a reference distribution defined in terms of a function of the sorted counts (e.g., median) across samples. Inspired from the microarray literature (Irizarry et al., 2003).
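The FQ description above can be sketched as follows, in plain Python, using the median of the sorted counts as the reference distribution. This is a minimal illustration with hypothetical names and toy data, not the code used for the slides. Ties are broken by position, which is a simplification; with many tied zeros this choice matters.

```python
from statistics import median

def full_quantile_normalize(samples):
    """samples: list of per-sample count lists, all of equal length (samples x genes).

    Each sample's r-th smallest count is replaced by the median, across samples,
    of the r-th smallest counts.
    """
    n_genes = len(samples[0])
    sorted_counts = [sorted(s) for s in samples]
    # Reference distribution: median across samples at each rank.
    reference = [median(col[r] for col in sorted_counts) for r in range(n_genes)]
    normalized = []
    for s in samples:
        # Gene indices ordered by increasing count (ties broken by position).
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, g in enumerate(order):
            out[g] = reference[rank]
        normalized.append(out)
    return normalized

toy_samples = [[5, 2, 3], [4, 1, 6], [3, 4, 8]]
fq = full_quantile_normalize(toy_samples)
print(fq)
```

After normalization every sample has exactly the same set of values (the reference distribution); only the assignment of values to genes differs between samples.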

• Loess normalization. Loess fits are performed for mean-difference plots (MD-plots) of log-counts for pairs of samples (all possible pairs (cyclic) or each sample paired with a synthetic reference obtained by averaging counts across samples). This approach, also inspired from the microarray literature (Bolstad et al., 2003), was used in Lovén et al. (2012) and is implemented in the affy function loess.normalize and limma function normalizeCyclicLoess.
• Regression-based normalization for known technical factors, e.g., C1 run, QC measures.
  Linear model (limma), negative binomial model (edgeR), empirical Bayes model (ComBat of Johnson et al. (2007), implemented in R package sva), supervised version of RUV below.
• Remove unwanted variation (RUV).
  ▸ Proposed initially for the normalization of microarray data (Gagnon-Bartsch and Speed, 2012; Gagnon-Bartsch et al., 2013; Jacob et al., 2013) and then adapted to sequencing data (Risso et al., 2014a,b), the main idea is to remove “unwanted variation” by performing factor analysis for a suitably-chosen subset of control genes and/or samples.
  ▸ For n samples and J genes, consider the log-linear/generalized linear model (GLM)

    $$\log E[Y \mid X, W] = W\alpha + X\beta, \tag{1}$$

    where Y is the n × J matrix of gene-level read counts, X is an n × M design matrix corresponding to the M covariates of interest/factors of “wanted variation” (e.g., treatment) and β its associated M × J matrix of parameters of interest, and W is an n × K matrix corresponding to unknown factors of “unwanted variation” and α its associated K × J matrix of nuisance parameters.


  ▸ The unwanted factors W can be estimated as follows.
    • RUVg: Based on control genes, with either constant expression (β = 0) or known expression fold-changes between samples (e.g., housekeeping genes, ERCC spike-ins, “in-silico” empirical controls).
    • RUVr: Based on residuals, such as deviance residuals from a first-pass GLM regression of the counts on the covariates of interest, without RUVg normalization, but possibly after some other normalization procedure such as upper-quartile.
    • RUVs: Based on replicate/negative control samples for which the covariates of interest are constant.
• Generalization of RUV. The RUV model can be readily extended to include known sample-level covariates (e.g., C1 run, QC measures) and known gene-level covariates (e.g., GC-content) that should be adjusted for.
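As a hedged sketch of how the RUVg case fits the GLM above (our reading of the approach in Risso et al., 2014, not a statement of the exact implementation): for the subset of J_c control genes, β_c = 0, so the model reduces to unwanted variation only, and W can be estimated by a truncated singular value decomposition. The symbols Z_c, U, Λ, V and the subscript c are notation introduced here for illustration.

```latex
% Control genes c: no variation of interest, so model (1) reduces to
\log E[Y_c \mid X, W] = W\alpha_c .
% Let Z_c be the n x J_c matrix of log counts of the control genes,
% centered gene by gene, with singular value decomposition
Z_c = U \Lambda V^{\top} .
% Keeping the K leading singular values and vectors gives the estimates
\widehat{W\alpha_c} = U \Lambda_K V^{\top}, \qquad \widehat{W} = U \Lambda_K ,
% and \widehat{W} is then substituted into model (1) for all J genes
% to estimate \alpha and \beta.
```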


Between-Sample Normalization: RUV

[Figure: schematic of the RUV model, log E[Y | W, X] = Wα + Xβ. Read counts Y: n samples × J genes. Unwanted variation: W (n × K) times α (K × J). Variation of interest: X (n × M) times β (M × J).]

Figure 12: RUV model.

SCONE

D. Risso, M. Cole, N. Yosef


SCONE: Single-Cell Overview of Normalized Expression. A general framework for the normalization of scRNA-Seq data.

• Range of normalization methods.
  ▸ Global-scaling, e.g., DESeq, TMM.
  ▸ Full-quantile (FQ).
  ▸ Unknown factors of unwanted variation: Remove unwanted variation (RUV).
  ▸ Known factors of unwanted variation: Regression-based normalization on, e.g., QC PC, C1 run.
• Normalization performance metrics.
• Numerical and graphical summaries of normalized read counts and metrics.
• R package scone, to be released through the Bioconductor Project: github.com/yoseflab/scone.


Figure 13: scone. Regression model.


Performance metrics. (Green: Good when high; Red: Good when low.)

• BIO_SIL: Average silhouette width by biological condition.
• BATCH_SIL: Average silhouette width by batch.
• PAM_SIL: Maximum average silhouette width for PAM clusterings, for a range of user-supplied numbers of clusters.
• EXP_QC_COR: Maximum squared Spearman correlation between count PCs and QC measures.
• EXP_UV_COR: Maximum squared Spearman correlation between count PCs and factors of unwanted variation (preferably derived from other set of negative control genes than used in RUV).
• EXP_WV_COR: Maximum squared Spearman correlation between count PCs and factors of wanted variation (derived from positive control genes).
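The silhouette-based metrics can be illustrated with a short, self-contained computation of the average silhouette width for a given labeling. This is a sketch under stated assumptions: Euclidean distance, at least two groups, and points that could be, for example, cell coordinates on the leading principal components (the slides do not specify these details). The same function applied to condition labels or to batch labels would give BIO_SIL-like and BATCH_SIL-like scores, respectively.

```python
import math

def average_silhouette_width(points, labels):
    """Average silhouette width of `points` grouped by `labels`.

    Assumes Euclidean distance, at least two distinct labels,
    and that not all points coincide.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    groups = set(labels)
    widths = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Mean distance from point i to each group, excluding the point itself.
        mean_dist = {}
        for g in groups:
            ds = [dist(p, q)
                  for j, (q, l) in enumerate(zip(points, labels))
                  if l == g and j != i]
            if ds:
                mean_dist[g] = sum(ds) / len(ds)
        if lab not in mean_dist:
            # Singleton group: silhouette width is conventionally 0.
            widths.append(0.0)
            continue
        a = mean_dist[lab]                                       # own group
        b = min(d for g, d in mean_dist.items() if g != lab)     # nearest other group
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Toy data (hypothetical): two well-separated pairs of cells.
cells = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
sil_matching = average_silhouette_width(cells, ["WT", "WT", "KO", "KO"])
sil_mixed = average_silhouette_width(cells, ["WT", "KO", "WT", "KO"])
print(sil_matching, sil_mixed)
```

A labeling that matches the geometry gives a width close to 1, while a labeling that cuts across it gives a negative width, which is why a high value is desirable for biological condition and a low value for batch.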


• RLE_MED: Mean squared median relative log expression (RLE).
• RLE_IQR: Mean inter-quartile range (IQR) of RLE.
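A minimal sketch of the two RLE-based metrics follows, using the definition of RLE given with Figure 11 (log-ratio of read count to median read count across samples). The pseudo-count of 1, the function names, and the quartile convention (the default of `statistics.quantiles`) are assumptions made for illustration.

```python
import math
from statistics import mean, median, quantiles

def rle_matrix(counts):
    """counts: genes x samples. Returns a genes x samples matrix of RLE values,
    RLE = log((count + 1) / (median count across samples + 1))."""
    rle = []
    for gene in counts:
        med = median(gene)
        rle.append([math.log((c + 1) / (med + 1)) for c in gene])
    return rle

def rle_metrics(counts):
    """Return (mean squared per-sample median RLE, mean per-sample IQR of RLE)."""
    rle = rle_matrix(counts)
    n_samples = len(counts[0])
    per_sample = [[row[j] for row in rle] for j in range(n_samples)]
    rle_med = mean(median(s) ** 2 for s in per_sample)
    iqrs = []
    for s in per_sample:
        q1, _, q3 = quantiles(s, n=4)
        iqrs.append(q3 - q1)
    return rle_med, mean(iqrs)

# Toy data (hypothetical): the third sample has roughly twice the depth.
toy_counts = [[9, 9, 19], [4, 4, 9], [99, 99, 199], [49, 49, 99]]
rle_med, rle_iqr = rle_metrics(toy_counts)
print(rle_med, rle_iqr)
```

For well-normalized data, per-sample RLE distributions should be centered at zero and similarly spread, so both quantities are good when low.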


Software Package scone

Application to OE p63 dataset.

• Apply and evaluate 172 normalization procedures using main scone function.
  ▸ scaling method: None, DESeq, TMM, FQ.
  ▸ uv factors: None, RUVg k = 1, …, 5, QC PC k = 1, …, 5.
  ▸ adjust biology: Yes/no.
  ▸ adjust batch: Yes/no.
• Select a normalization procedure based on (function of) the performance scores.
  Unweighted mean score ⟹ none,fq,qc_k=4,bio,no_batch
  Weighted mean score ⟹ none,fq,qc_k=2,no_bio,no_batch
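The grid of candidate procedures can be enumerated as a Cartesian product of the options listed above. This is only a sketch: the label format imitates the procedure names shown on the slides, the leading `none` field is copied from those names without its meaning being stated there, and `no_uv` is a hypothetical label for the "None" option. The full product has 4 × 11 × 2 × 2 = 176 combinations, slightly more than the 172 procedures reported, so a few combinations are presumably excluded; which ones is not stated here.

```python
from itertools import product

scaling = ["none", "deseq", "tmm", "fq"]
uv_factors = (["no_uv"]                                  # hypothetical label for "None"
              + [f"ruv_k={k}" for k in range(1, 6)]      # RUVg, k = 1..5
              + [f"qc_k={k}" for k in range(1, 6)])      # QC PCs, k = 1..5
adjust_biology = ["no_bio", "bio"]
adjust_batch = ["no_batch", "batch"]

# Leading "none" field kept as a fixed prefix, as in the names on the slides.
grid = [",".join(("none",) + combo)
        for combo in product(scaling, uv_factors, adjust_biology, adjust_batch)]

print(len(grid))
print(grid[:3])
```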

[Figure: "SCONE: Biplot of scores colored by mean score". PC2 vs. PC1 of the performance scores, one point per normalization procedure, with loadings for BIO_SIL, BATCH_SIL, PAM_SIL, EXP_QC_COR, EXP_UV_COR, EXP_WV_COR, RLE_MED, RLE_IQR.]

Figure 14: scone. Biplot of performance scores, colored by mean score (yellow high/good, blue low/bad).


[Figure: "SCONE: Score PCA colored by mean score". PC2 vs. PC1 of the performance scores; highlighted procedures: top mean, top weighted mean, fq,ruv_k=1, none.]

Figure 15: scone. PCA of performance scores, colored by mean score.


[Figure: "SCONE: Score PCA colored by method". PC2 vs. PC1 of the performance scores, colored by scaling method (none, fq, deseq, tmm).]

Figure 16: scone. PCA of performance scores, colored by method – scaling method.


[Figure: "SCONE: Score PCA colored by method". PC2 vs. PC1 of the performance scores, colored by adjustment (no_bio,no_batch; no_bio,batch; bio,no_batch; bio,batch).]

Figure 17: scone. PCA of performance scores, colored by method – adjust biology, adjust batch.


[Figure: (a) "FQ + RUVg(HK, k=1): W by batch", unwanted factor W by batch; (b) "FQ + RUVg(HK, k=1): QC PC1 vs. W" (Cor = −0.56).]

Figure 18: scone. Association of RUVg unwanted factor W and QC measures for none,fq,ruv_k=1,no_bio,batch.


[Figure: boxplots of RLE for the procedure with top SCONE weighted mean score (none,fq,qc_k=2,no_bio,no_batch): (a) all genes; (b) housekeeping genes.]

Figure 19: scone. Gene-level relative log expression (RLE = log-ratio of read count to median read count across samples) for method with top weighted mean score none,fq,qc_k=2,no_bio,no_batch.


[Figure, for the procedure with top SCONE weighted mean score (none,fq,qc_k=2,no_bio,no_batch): (a) QC PC1 vs. count PC1 (Cor = 0); (b) absolute correlation of count PC1 to PC5 and QC measures (NREADS, NALIGNED, RALIGN, TOTAL_DUP, PRIMER, PCT_RIBOSOMAL_BASES, PCT_CODING_BASES, PCT_UTR_BASES, PCT_INTRONIC_BASES, PCT_INTERGENIC_BASES, PCT_MRNA_BASES, MEDIAN_CV_COVERAGE, MEDIAN_5PRIME_BIAS, MEDIAN_3PRIME_BIAS).]

Figure 20: scone. Association of counts and sample-level QC measures, none,fq,qc_k=2,no_bio,no_batch.

Software Package scone: Summary

• Unnormalized gene-level counts exhibit large differences in distributions within and between batches and association with sample-level QC measures.

• Different normalization methods vary in performance according to SCONE metrics and lead to different distributions of gene-level counts, hence clustering and DE results.

• Global-scaling normalization. Not aggressive enough to handle potentially large batch effects and association of counts and QC measures. Biological effects are dominated by nuisance technical effects. Additionally, for DESeq, the scaling factors are computed based on only a handful of genes with non-zero counts in all cells (5/22,054).
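The remark about DESeq scaling factors can be illustrated with a minimal sketch of the median-of-ratios idea, in which only genes with non-zero counts in every cell can contribute. This is not the DESeq code; the function name and the toy count matrix are assumptions made for illustration.

```python
import math
from statistics import median

def median_of_ratios_size_factors(counts):
    """Return (size_factors, n_genes_used) for a genes x cells count matrix.

    Each gene's reference value is its geometric mean across cells; a gene with
    any zero count has geometric mean zero and therefore cannot be used.
    """
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has non-zero counts in all cells")
    log_geo_means = [sum(math.log(c) for c in row) / len(row) for row in usable]
    n_cells = len(counts[0])
    size_factors = []
    for j in range(n_cells):
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        size_factors.append(math.exp(median(log_ratios)))
    return size_factors, len(usable)

# Hypothetical toy data: 5 genes x 3 cells, with zeros typical of scRNA-Seq.
toy = [
    [4, 8, 16],  # used
    [0, 3, 5],   # dropped: zero in cell 1
    [2, 4, 8],   # used
    [7, 0, 0],   # dropped: zeros in cells 2 and 3
    [1, 2, 4],   # used
]
factors, n_used = median_of_ratios_size_factors(toy)
```

With many zeros per cell, the set of usable genes shrinks quickly, which is how a dataset can end up with only 5 of 22,054 genes driving the scaling factors.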


• Batch effect normalization. Adjusting for batch effects without properly accounting for the nesting of batch within biological effects (no_bio,batch) in the regression model is problematic, as this removes the biological effects of interest (e.g., empirical Bayes framework of ComBat).

• FQ followed by QC-based or RUVg normalization. Seems effective: Similar RLE distributions between samples, lower association of counts and QC measures. The first unwanted factor of RUVg is correlated with the first QC PC.

• The remaining analyses are based on none,fq,qc_k=2,no_bio,no_batch, the best method according to weighted mean score.


• Interpretation of performance metrics. Some metrics tend to favor certain methods over others, e.g., EXP_UV_COR (correlation between count PCs and factors of unwanted variation) naturally favors RUVg, especially when the same set of negative controls is used for normalization and evaluation. Hence, a careful, global interpretation of the metrics is recommended.

• Negative controls. The selection of proper, distinct sets of negative controls is important, as these are used for both normalization (RUVg) and assessment of normalization results (EXP_UV_COR).

• Ongoing efforts.
  - Zero-inflated negative binomial (ZINB) model.
  - User-supplied factors of unwanted and wanted variation (UV and WV, respectively).
  - Other methods (e.g., ComBat/sva).
  - Other performance metrics.
  - Visualization.
  - Shiny app for interactive web interface.

Resampling-Based Sequential Ensemble Clustering

D. Risso, E. Purdom


Motivation

• Robustness to choice of samples. Both hierarchical and partitioning methods tend to be sensitive to the choice of samples to be clustered. Outlying samples/clusters (e.g., glia) are common in scRNA-Seq and mask interesting substructure in the data, often requiring the successive pruning out of dominating clusters to get to the finer structure.

• Robustness to clustering algorithm and tuning parameters. Clustering results are sensitive to pre-processing steps such as normalization and dimensionality reduction, as well as to the choice of clustering algorithm and associated tuning parameters (e.g., distance function, number of clusters).

Motivation

• Not focusing on the number of clusters. A major tuning parameter of partitioning methods such as partitioning around medoids (PAM) and k-means is the number of clusters k. Methods for selecting k (e.g., silhouette width) are sensitive to the choice of samples, normalization, and other tuning parameters. They tend to be conservative (low k), i.e., capture only the coarse clustering structure and mask interesting substructure in the data. Additionally, the number of clusters k is often not of primary interest.
E.g., silhouette width with PAM selects only k = 2 clusters for the OE p63 dataset.

• Not forcing samples into clusters. Some samples may be outliers that do not really belong to any cluster. Leaving them out can improve the quality and interpretability of the clustering as well as downstream analyses (e.g., identification of cluster marker genes).

Motivation

• Cluster gene expression signatures. Common differential expression statistics are not well-suited for finding marker genes for the clusters, especially for finer structure in a hierarchy.

Motivation

Figure 21: Partitioning around medoids. 20 PC, one-minus-correlation distance, applied to none,fq,qc_k=2,no_bio,no_batch. Panels: (a) Average silhouette width for k = 2, ..., 15; (b) Silhouette widths, k = 2 (legend: KO_14DPT, KO_24HPT, KO_48HPT, KO_7DPT, KO_96HPT, WT_72HPT, WT_96HPT).
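For readers unfamiliar with the silhouette width used in Figure 21, the following is a minimal sketch of its standard definition, computed from a distance matrix and a vector of cluster labels. The function name and the one-dimensional toy data are assumptions made here; at least two clusters are required.

```python
def silhouette_widths(dist, labels):
    """Silhouette width s(i) = (b - a) / max(a, b) for each observation.

    dist: symmetric n x n distance matrix (list of lists).
    labels: cluster label of each observation (at least two distinct labels).
    a = mean distance to the other members of the observation's own cluster,
    b = smallest mean distance to the members of any other cluster.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    widths = []
    for i, lab_i in enumerate(labels):
        own = clusters[lab_i]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items() if lab != lab_i
        )
        denom = max(a, b)
        widths.append(0.0 if denom == 0 else (b - a) / denom)
    return widths

# Hypothetical toy data: four points on a line, two obvious groups.
points = [0.0, 1.0, 10.0, 11.0]
toy_dist = [[abs(p - q) for q in points] for p in points]

good = silhouette_widths(toy_dist, [0, 0, 1, 1])
bad = silhouette_widths(toy_dist, [0, 1, 0, 1])
avg_good = sum(good) / len(good)
avg_bad = sum(bad) / len(bad)
```

Averaging these widths over observations for each candidate k, and picking the k with the largest average, is the selection rule that returns k = 2 in panel (a).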


• We have developed a resampling-based sequential ensemble clustering approach, with the aim of obtaining stable and tight clusters.

• Ensemble clustering, i.e., aggregating multiple clusterings obtained from different algorithms or applications of a given algorithm to resampled versions of the learning set, is a general approach for improving stability. This can be viewed as the unsupervised analog of ensemble methods in supervised learning, e.g., bagging, boosting, random forests.

• Our approach is related to bagged/consensus/tight clustering (Dudoit and Fridlyand, 2003; Leisch, 1999; Tseng and Wong, 2005).

• R package clusterExperiment, released through the Bioconductor Project.


RSEC: Resampling-based Sequential Ensemble Clustering.

• Given a base clustering algorithm (e.g., PAM, k-means) and associated tuning parameters (e.g., number of principal components, number of clusters k, distance matrix), generate a single candidate clustering using
  - resampling-based clustering to find robust and tight clusters;
  - sequential clustering to find stable clusters over a range of numbers of clusters (Tseng and Wong, 2005).

• Generate a collection of candidate clusterings by repeating the above procedure for different base clustering algorithms and tuning parameters.

• Identify a consensus over the different candidate clusterings.

• Merge non-differential clusters.

• Find cluster signatures by testing for differential expression between selected subsets of clusters.

• Visualization. Comparison of multiple clusterings of the same samples, heatmaps of co-clustering matrices, heatmaps with hierarchical clustering of genes and/or samples.

Resampling-Based Clustering

• Consider a particular base partitioning clustering algorithm (e.g., PAM, k-means) and associated tuning parameters (e.g., number of principal components, number of clusters k, distance matrix).

• Generate B subsamples of the n observations to be clustered, e.g., bootstrap or random sample without replacement (of size 0.7n, say).

• For each subsample, apply the base clustering algorithm (with k clusters) and assign all n original observations to clusters based on their distances from cluster centroids/medoids.

• Generate an n × n co-clustering matrix C by recording, for each pair of observations, the proportion of times they are clustered together (out of B subsamples).
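The subsampling steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the clusterExperiment implementation: a tiny one-dimensional k-means (`kmeans_1d`, a hypothetical helper) stands in for the base algorithm, and the toy values and function names are invented for the example.

```python
import random

def kmeans_1d(values, k, n_iter=50):
    """Minimal 1-D k-means (Lloyd's algorithm), standing in for PAM or k-means."""
    ordered = sorted(values)
    # Deterministic start: evenly spaced order statistics.
    centers = [ordered[(2 * c + 1) * len(ordered) // (2 * k)] for c in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        new_centers = [sum(g) / len(g) if g else centers[c] for c, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def co_clustering_matrix(values, k, n_subsamples=100, proportion=0.7, seed=0):
    """Proportion of subsamples in which each pair of observations is co-clustered."""
    rng = random.Random(seed)
    n = len(values)
    size = max(k, int(round(proportion * n)))
    together = [[0] * n for _ in range(n)]
    for _ in range(n_subsamples):
        subsample = rng.sample(values, size)  # random sample without replacement
        centers = kmeans_1d(subsample, k)
        # Assign all n original observations to the nearest subsample centroid.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [[count / n_subsamples for count in row] for row in together]

# Hypothetical toy data: three well-separated groups on a line.
toy_values = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4, 10.0, 10.2, 10.4, 10.6]
C = co_clustering_matrix(toy_values, k=3, n_subsamples=50, proportion=0.7, seed=1)
```

Entries of C near 1 flag pairs that are clustered together regardless of which observations happen to be subsampled; entries near 0 flag pairs that are essentially never together.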

Resampling-Based Clustering

• Identify tight clusters based on the co-clustering matrix. For example, given a constant α close to 0, cut branches top-down from a hierarchical clustering based on the co-clustering matrix C, such that $C_{i,j} \geq 1 - \alpha$ for all $i, j$ in the selected branch (or some other, less stringent function than the minimum). Retain all clusters above a certain size or a given number of the largest clusters.
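One way to obtain groups that satisfy the tightness criterion is sketched below. Note the substitution: instead of cutting a hierarchical clustering top-down as described on the slide, this sketch merges bottom-up with complete linkage and stops once any further merge would push the minimum co-clustering value below 1 − α. The function name, the `min_size` argument, and the toy matrix are assumptions for illustration.

```python
def tight_clusters(C, alpha=0.1, min_size=2):
    """Group observations so that C[i][j] >= 1 - alpha for all pairs in a group.

    Returns one label per observation: 1, 2, ... for retained clusters
    (largest first) and -1 for observations left unassigned.
    """
    n = len(C)
    clusters = [[i] for i in range(n)]

    def min_co_clustering(a, b):
        return min(C[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best_pair, best_value = None, -1.0
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                value = min_co_clustering(clusters[x], clusters[y])
                if value > best_value:
                    best_pair, best_value = (x, y), value
        if best_value < 1 - alpha:
            break  # any further merge would violate the tightness criterion
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [-1] * n
    kept = sorted((c for c in clusters if len(c) >= min_size), key=len, reverse=True)
    for label, members in enumerate(kept, start=1):
        for i in members:
            labels[i] = label
    return labels

# Hypothetical co-clustering matrix for five observations.
C_toy = [
    [1.00, 0.95, 0.93, 0.10, 0.00],
    [0.95, 1.00, 0.92, 0.05, 0.00],
    [0.93, 0.92, 1.00, 0.10, 0.20],
    [0.10, 0.05, 0.10, 1.00, 0.50],
    [0.00, 0.00, 0.20, 0.50, 1.00],
]
toy_labels = tight_clusters(C_toy, alpha=0.1, min_size=2)
```

Here the first three observations form one tight cluster, while the last two co-cluster only half of the time and are left unassigned (label −1) rather than forced into a cluster.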

Sequential Clustering

Based on Tseng and Wong (2005).

• Starting with a user-supplied $k = k_0$, generate a collection of clusters, $\{C^k_1, C^k_2, \ldots, C^k_{m_k}\}$, by applying the above resampling-based procedure to the base partitioning clustering algorithm with $k$ clusters.

• Given a constant $\beta$ close to 1, increase $k$ by one until $s(C^k_i, C^{k+1}_j) \geq \beta$, for some $i, j$, where $s$ is a measure of similarity between clusters.

• Identify $C^{k+1}_j$ as a stable cluster.

• Remove $C^{k+1}_j$ and repeat the above steps with $k \leftarrow k - 1$, until $k$ is below a certain threshold or a certain number of clusters are identified.

Sequential Clustering

• Here, we evaluate a collection of clusters for stability based on the pairwise similarity measure

$$ s(C_i, C_j) \equiv \frac{|C_i \cap C_j|}{|C_i \cup C_j|}. \tag{2} $$
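The similarity measure in Equation (2) is the Jaccard index between two sets of observations. A minimal sketch of it, together with the stability check of the sequential step, is given below; the helper name `find_stable_cluster` and the toy clusters are assumptions made for illustration.

```python
def jaccard(cluster_a, cluster_b):
    """Similarity s(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj| between two clusters."""
    a, b = set(cluster_a), set(cluster_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def find_stable_cluster(clusters_k, clusters_k_plus_1, beta=0.9):
    """Return the first cluster obtained with k + 1 clusters whose similarity to
    some cluster obtained with k clusters is at least beta, or None."""
    for candidate in clusters_k_plus_1:
        for reference in clusters_k:
            if jaccard(reference, candidate) >= beta:
                return sorted(candidate)
    return None

# Hypothetical clusters of observation indices for k = 2 and k + 1 = 3.
clusters_k = [{1, 2, 3, 4}, {5, 6, 7, 8, 9, 10}]
clusters_k_plus_1 = [{5, 6, 7}, {1, 2, 3, 4}, {8, 9, 10}]

stable = find_stable_cluster(clusters_k, clusters_k_plus_1, beta=0.9)
half_overlap = jaccard({1, 2, 3}, {2, 3, 4})
```

In this toy case the cluster {1, 2, 3, 4} is unchanged when k increases, so it is flagged as stable, whereas the larger cluster splits and neither of its parts reaches the threshold β.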

Ensemble Clustering

• Generate a collection of candidate clusterings as above, for a range of base clustering algorithms and tuning parameters.

• Find a single consensus clustering based on the co-clustering matrix (as in resampling, only now across different algorithms or tuning parameters).

• Merge non-differential clusters by creating a hierarchy of clusters, working up the tree, testing for differential expression between sister nodes, and collapsing nodes with insufficient DE.

Differential Expression

• Find cluster gene expression signatures, i.e., marker genes, by testing for differential expression between selected subsets of clusters.

• Standard F-statistic. Tests for any difference between clusters. Sensitive to outlying samples/clusters. Non-specific, i.e., not useful for interpreting differences between clusters.

• Standard solution in (generalized) linear models/ANOVA is to consider contrasts between groups of clusters. By using the machinery of the (generalized) linear model, we use all of the samples in testing these contrasts, rather than just those samples involved in the corresponding clusters.
  - All pairwise. All pairwise comparisons between clusters.
  - One against all. Compare each cluster to union of remaining clusters.

Differential Expression

  - Dendrogram. Create a hierarchy of clusters, work up the tree, test for DE between sister nodes (as in the approach used for merging clusters).

• For each contrast, test for DE using the empirical Bayes linear modeling approach of R package limma, with voom option to account for mean-variance relationship of log-counts (i.e., over-dispersion).
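The all-pairwise and one-against-all contrasts can be written as coefficient vectors over the cluster means of a linear model with one mean parameter per cluster. The sketch below only builds these vectors; it does not fit a model and is not the getBestFeatures or limma code. One simplifying assumption is made and should be noted: the one-against-all contrast compares a cluster mean with the unweighted average of the other cluster means, rather than with the pooled samples of the remaining clusters. Function names and cluster names are illustrative.

```python
from itertools import combinations

def pairwise_contrasts(cluster_names):
    """One contrast per pair of clusters: +1 for the first, -1 for the second."""
    contrasts = {}
    for a, b in combinations(cluster_names, 2):
        contrasts[f"{a}-{b}"] = [
            1.0 if c == a else -1.0 if c == b else 0.0 for c in cluster_names
        ]
    return contrasts

def one_against_all_contrasts(cluster_names):
    """One contrast per cluster: its mean minus the average of the other means."""
    k = len(cluster_names)
    contrasts = {}
    for a in cluster_names:
        contrasts[f"{a}-rest"] = [
            1.0 if c == a else -1.0 / (k - 1) for c in cluster_names
        ]
    return contrasts

# Hypothetical cluster names.
names = ["m1", "m2", "m3", "m4"]
pairwise = pairwise_contrasts(names)
one_vs_all = one_against_all_contrasts(names)
```

Each vector sums to zero, as required of a contrast, and because every contrast is tested within a single model fitted to all samples, the variance estimates borrow information from clusters not directly involved in the comparison.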

Software Package clusterExperiment

Workflow.

• clusterMany. Generate a collection of candidate clusterings, for different base clustering algorithms and tuning parameters, with option to use resampling and sequential approaches.

• combineMany. Find consensus clustering across several clusterings.

• Identify non-differential clusters that should be merged into larger clusters.
  - makeDendrogram. Hierarchical clustering of the clusters found by combineMany.
  - mergeClusters. Merge clusters of this hierarchy based on DE between nodes.

• RSEC. Wrapper function around the clusterExperiment workflow.


• getBestFeatures. Find cluster signatures by testing for differential expression between selected subsets of clusters.

• Visualization.
  - plotClusters. Comparison of multiple clusterings of the same samples. Based on ConsensusClusterPlus package.
  - plotHeatmap. Heatmaps of co-clustering matrices, heatmaps with hierarchical clustering of genes and/or samples (interface to aheatmap from NMF package).



• clusterMany: Generate 22 candidate clusterings.
  - Dimensionality reduction: 25, 50 PC.
  - Euclidean distance.
  - Base clustering method: PAM, k = 5, ..., 15.
  - Resampling-based clustering: B = 100, proportion = 0.7, α = 0.3.
  - Sequential clustering: k0 = 15, β = 0.9.
  - clusterFunction=c("hierarchical01").

• combineMany(ce, clusterFunction="hierarchical01", whichClusters="workflow", proportion=0.7, propUnassigned=0.5, minSize=5).

• mergeClusters(ce, mergeMethod="adjP", cutoff=0.05).
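The count of 22 candidate clusterings follows from crossing the two numbers of principal components with the eleven values 5, ..., 15. The short sketch below simply enumerates that grid; it is an illustration of the arithmetic, not a call to clusterMany, and the label format is an assumption modeled on the row labels of the plotClusters figure.

```python
from itertools import product

n_pcs = [25, 50]
k0_values = list(range(5, 16))  # 5, ..., 15

# Cross every k0 with every number of PCs, largest values first.
grid = [
    {"nPCAFeatures": p, "k0": k0}
    for k0, p in product(reversed(k0_values), reversed(n_pcs))
]
labels = [f"nPCAFeatures={g['nPCAFeatures']},k0={g['k0']}" for g in grid]
n_candidates = len(grid)  # 2 x 11 = 22
```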


Figure 22: clusterExperiment. Comparison of 22 clusterMany clusterings using plotClusters. Plot title: Clusterings from clusterMany; one row per clustering, for nPCAFeatures = 25, 50 and k0 = 5, ..., 15.


Figure 23: clusterExperiment. Heatmap of co-clustering matrix for clusterMany clusterings, used to create combineMany clustering (plotHeatmap). Plot title: Co-clustering proportion matrix; color scale from 0 to 1; annotation: combineMany cluster labels (−1, c1, c2, ...).


Figure 24: clusterExperiment. Dendrograms from combineMany and mergeClusters. Leaves: combineMany clusters c1 to c32, shown in both dendrograms; internal nodes carry numeric annotations ranging from 0.0038 to 0.38.


Figure 25: clusterExperiment. PCA of gene-level log-counts, colored by mergeClusters. Plot title: Count PCA by cluster; axes: PC1 vs. PC2; legend: clusters −1, 1, ..., 8.


Figure 26: clusterExperiment. Heatmap of log-counts for getBestFeatures DE genes using dendrogram contrasts (269, top 50 in each of the 6 nodes). Plot title: Heatmap of DE genes, dendrogram contrasts. Sample annotations: mergeClusters (−1, m1, ..., m8) and Bio (−1, −2, Contaminants, HBC, HBC transition, INP/GBC, INP2, iOSN, Microvillous, mOSNs, SUS, SUS precursor).


Figure 27: clusterExperiment. Heatmap of log-counts for getBestFeatures DE genes using F-test (top 100). Plot title: Heatmap of DE genes, F-test. Sample annotations: mergeClusters (−1, m1, ..., m8) and Bio.

Row labels (genes): sirt5, hnrnpl, rmi2, gm17821, nup98, cox6c, oaz1, rpl27, rplp1, rpl13a, rps20, rps7, rpl23, rpl37a, rpl23a, rps29, rpl32, rplp2, rpl24, fau, rpl36, rpl37, rps23, rpl35, prrc2c, son, kcnq1ot1, nip7, eif1, cfl1, rpl10, lars2, rps25, rpl13, naca, rplp0, rpl7, rpl8, rps6, rps3a1, rpl21, rps11, rpl14, hmgb1, rpl3, trns1, rnr1, ppia, rps27, myh9, rpl9, rps9, rps19, rpl41, rps14, map1b, eif4g2, calm2, ywhae, gsk3b, calm1, rn18s-rs5, malat1, actb, gnb1, rnr2, hsp90ab1, eef1a1, h3f3b, canx, btg1, srsf2, csde1, fth1, sod1, set, ywhaz, ubb, tmsb4x, prdx1, akr1a1, sptbn1, ubc, mt1, dstn, rps5, rpl4, rps3, rpl18a, actr2, ddx5, hnrnpa2b1, skp1a, hspa8, ptges3, tpt1, actg1, ftl1, ptma, gm1821.


Figure 28: clusterExperiment. Heatmap of log-counts for "a priori" marker genes (83). Plot title: Heatmap of marker genes. Sample annotations: mergeClusters (−1, m1, ..., m8) and Bio.

Row labels (genes): sox11, creer, gstm2, cyp2f2, cbr2, trim66, gap43, ebf1, ebf2, ncam1, omp, cnga2, gng13, rgs11, lhx2, ebf3, nhlh1, gng8, arhgdig, cyp2g1, lgr5, cyp1a2, scarb1, coch, cd24a, cftr, ceacam1, prmt5, fabp5, neurod1, nhlh2, zfp423, ebf4, rrm1, ezh2, hes6, ccnd1, rbm24, kit, dtl, tead2, mcm7, ccna2, ect2, top2a, cdca8, pbk, il6, notch1, yap1, icam1, ppara, fgfr2, dkk3, igfbp4, itgb5, egfr, igfbp2, pax6, sall1, elf5, il33, sox2, wnt4, lifr, nrcam, trp63, krt14, krt5, sox9, kitl, sfrp1, f3, perp, ccnd2, hmgb2, tubb5, cdkn1b, egfp, igfbp5, krt8, hes1, notch2.

Software Package clusterExperiment: Summary

• Tuning parameters.
  - α controls tightness.
  - β controls stability.
  - k0, the initial number of clusters used in sequential clustering, is the parameter with the greatest impact on the results. Larger k0 tends to lead to smaller and tighter clusters.

• Caveat. The DE analysis is exploratory and nominal p-values are only a rough summary of significance (reliance on models, tiny p-values even after adjustment for multiple testing, same data used to define clusters and to perform DE analysis).

• Ongoing efforts.
  - Cluster confidence measures.
  - Potentially assign unclustered observations to clusters.
  - DE using ZINB model.
  - Greater modularity (e.g., distance functions, DE test).
  - Shiny app for interactive web interface.

Cell Lineage and Pseudotime Inference

K. Street, D. Risso, E. Purdom


Motivation

• Mapping transcriptional progression from stem cells to specialized cell types is essential for properly understanding the mechanisms regulating cell and tissue differentiation.

• There may not always be a clear distinction between states, but rather a smooth transition, with individual cells existing on a continuum between states.

• In such a case, cells may undergo gradual transcriptional changes, where the relationship between states can be represented as a continuous lineage dependent upon an underlying spatial or temporal variable. This representation, referred to as pseudotemporal ordering, can help us understand how cells differentiate and how cell fate decisions are made (Bendall et al., 2014; Campbell et al., 2015; Ji and Ji, 2016; Petropoulos et al., 2016; Shin et al., 2015; Trapnell et al., 2014).

Motivation

• We have developed Slingshot as a flexible and robust framework for inferring cell lineages and pseudotimes in the study of continuous differentiation processes.


• Input/Output.
  - Input. Normalized gene expression measures and cell clustering.
  - Output. Cell lineages, i.e., subsets of ordered cell clusters. Cell pseudotimes, i.e., for each lineage, ordered sequence of cells and associated pseudotimes.

• Dimensionality reduction.
  - Principal component analysis (PCA) seems effective and simple, in conjunction with steps detailed next.
  - Other approaches include related linear methods, e.g., independent component analysis (ICA) (Trapnell et al., 2014, Monocle), and non-linear methods, e.g., Laplacian eigenmaps/spectral embedding (Campbell et al., 2015, Embeddr), t-distributed stochastic neighbor embedding (t-SNE) (Bendall et al., 2014; Petropoulos et al., 2016, Wanderlust).


• Inferring cell lineages.
  - Minimum spanning tree (MST; ape package) over cell clusters, with between-cluster distance based on Euclidean distance between cluster means scaled by within-cluster covariance.
  - Outlying clusters. Identified using granularity parameter ω that limits maximum edge weight in the tree. Specifically, build MST using an artificial cluster Ω, with distance ω from other clusters (a fraction of maximum pairwise distance between clusters), and then remove Ω.
  - Root and leaf nodes. May either be pre-specified or automatically selected.
    Root node. If not pre-specified, selected based on parsimony (i.e., set of lineages with maximal number of clusters shared between them).
    Leaf nodes. If pre-specified, constrained MST.
  - A lineage is then defined as any unique path coming out of the root node and ending in a leaf node.
  - Constructing the MST on clusters (Ji and Ji, 2016; Shin et al., 2015, TSCAN, Waterfall) vs. cells (Trapnell et al., 2014, Monocle) offers greater stability and computational efficiency, less complex lineages, and easier determination of directionality and branching.
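The cluster-level MST and the definition of a lineage as a root-to-leaf path can be sketched as follows. This is a simplified illustration, not the slingshot implementation: it uses plain Euclidean distance between cluster means (no scaling by within-cluster covariance), omits the granularity parameter ω and the parsimony-based root selection, and takes the root as given. All function names, cluster names, and toy coordinates are assumptions.

```python
import math

def cluster_means(points, labels):
    """Mean coordinates of each cluster, e.g., in a reduced (PCA) space."""
    sums, counts = {}, {}
    for p, lab in zip(points, labels):
        acc = sums.setdefault(lab, [0.0] * len(p))
        for d, x in enumerate(p):
            acc[d] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def minimum_spanning_tree(centers):
    """Prim's algorithm on the complete graph of cluster means; returns adjacency sets."""
    names = sorted(centers)
    in_tree = {names[0]}
    adjacency = {name: set() for name in names}
    while len(in_tree) < len(names):
        _, a, b = min(
            (math.dist(centers[a], centers[b]), a, b)
            for a in in_tree for b in names if b not in in_tree
        )
        adjacency[a].add(b)
        adjacency[b].add(a)
        in_tree.add(b)
    return adjacency

def lineages_from_root(adjacency, root):
    """Every path in the tree that starts at the root and ends in a leaf."""
    paths, stack = [], [[root]]
    while stack:
        path = stack.pop()
        children = [nb for nb in sorted(adjacency[path[-1]]) if nb not in path]
        if not children:
            paths.append(path)
        for child in children:
            stack.append(path + [child])
    return sorted(paths)

# Hypothetical 2-D coordinates for cells in five clusters forming a branching shape.
toy_points = [(0, 0), (0.2, 0), (1, 0), (1.2, 0), (2, 1), (2.2, 1),
              (3, 2), (3.2, 2), (2, -1), (2.2, -1)]
toy_labels = ["root", "root", "mid", "mid", "n1", "n1", "n2", "n2", "s1", "s1"]

tree = minimum_spanning_tree(cluster_means(toy_points, toy_labels))
lineages = lineages_from_root(tree, root="root")
```

With five clusters the tree has only four edges, which illustrates why building the MST on clusters rather than on individual cells yields simpler lineages and a clear branching point.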

• Inferring cell pseudotimes.
  - Iterative procedure inspired by the principal curve algorithm of Hastie and Stuetzle (1989); principal.curve function in princurve package.
  - In the case of branching lineages, a shrinkage step is included at each iteration that forces a degree of similarity between the curves in the neighborhood of shared clusters.
  - Pseudotime values are derived by orthogonal projection onto the curves.
  - Cells belonging to clusters that are included in multiple lineages have multiple, similar pseudotime values.
  - Previous approaches also use smooth curves to represent lineages (Campbell et al., 2015; Petropoulos et al., 2016, Embeddr), while others use piecewise linear paths through the MST and extract orderings either by orthogonal projection (Ji and Ji, 2016; Shin et al., 2015, TSCAN, Waterfall) or PQ tree (Trapnell et al., 2014, Monocle).
  - We find that smooth curves provide discerning power not found in piecewise linear trajectories, while also adding stability over a range of dimensionality reduction and clustering methods.
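To make "pseudotime by orthogonal projection" concrete, the sketch below projects cells onto a piecewise linear path and reports the arc length from the start of the path to the projected point. This deliberately uses the simpler piecewise linear setting mentioned above as a stand-in; Slingshot itself fits smooth principal curves, which are not implemented here. The function name, the path, and the cell coordinates are assumptions.

```python
import math

def project_onto_path(point, path):
    """Return (pseudotime, distance) for the closest point on a piecewise linear path.

    pseudotime is the arc length along the path up to the projection of `point`;
    distance is how far the point lies from the path.
    """
    best_time, best_dist = 0.0, float("inf")
    travelled = 0.0
    for start, end in zip(path, path[1:]):
        seg = [e - s for s, e in zip(start, end)]
        seg_len_sq = sum(v * v for v in seg)
        if seg_len_sq == 0.0:
            continue  # skip degenerate segments
        seg_len = math.sqrt(seg_len_sq)
        # Position of the orthogonal projection along the segment, clamped to [0, 1].
        t = sum((p - s) * v for p, s, v in zip(point, start, seg)) / seg_len_sq
        t = min(1.0, max(0.0, t))
        closest = [s + t * v for s, v in zip(start, seg)]
        d = math.dist(point, closest)
        if d < best_dist:
            best_time, best_dist = travelled + t * seg_len, d
        travelled += seg_len
    return best_time, best_dist

# Hypothetical lineage path through three cluster centers, and two cells.
toy_path = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
early_cell = (1.0, 0.5)
late_cell = (3.5, 2.0)

t_early, d_early = project_onto_path(early_cell, toy_path)
t_late, d_late = project_onto_path(late_cell, toy_path)
```

Sorting cells by the returned pseudotime gives their ordering along the lineage; a cell in a cluster shared by two lineages would be projected onto each path and receive one pseudotime per lineage.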

• Differential expression. Regression of gene expression measures on pseudotime, e.g., generalized additive models (GAM) (Ji and Ji, 2016, TSCAN).

• Visualization. Two- and three-dimensional plots of cell lineages and pseudotimes, gene-level trajectories, heatmaps for DE genes.


• R package slingshot, to be released through the Bioconductor Project: http://github.com/kstreet13/slingshot.

Software Package slingshot

• Modularity.
  - Integrates easily with a range of normalization, clustering, and dimensionality reduction methods.
  - get_lineages: Given expression measures and cluster labels, use MST to infer lineages.
    get_lineages(X, clus.labels, start.clus = NULL, end.clus = NULL, dist.fun = NULL, omega = Inf, distout = FALSE).
  - get_curves: Given lineages, infer pseudotimes.
    get_curves(X, clus.labels, lineages, thresh = 1e-04, maxit = 100, stretch = 2, shrink = TRUE).

• Flexibility. Can be used with varying levels of supervision.
  - Cluster-based approach allows for easy supervision when researchers have prior knowledge of cell classes, while still being able to detect novel branching events.
  - User-supplied or data-driven selection of root and leaf nodes.


• Visualization.
  - plot_tree: MST in 2 and 3D.
  - plot_curves: Lineage curves in 2 and 3D.



• Applied to the first three principal components, Slingshot identifies two lineages: The first corresponds to the HBC-to-neurons transition, the second to the HBC-to-sustentacular cells transition.

• A first-pass DE analysis, based on a regression of log-count on pseudotime using GAM, suggests that many genes are involved in the differentiation process.

• Among the top 100 DE genes for each lineage, only 13 are DE in both, suggesting distinct processes in the neuronal vs. non-neuronal lineages.
