using single-cell transcriptome sequencing to infer ... · sensory neuronsandnon-neuronal support...

105
Using Single-Cell Transcriptome Sequencing to Infer Olfactory Stem Cell Fate Trajectories Sandrine Dudoit Division of Biostatistics and Department of Statistics University of California, Berkeley www.stat.berkeley.edu/~sandrine Computational and Systems Immunology Seminar Stanford University June 28, 2016 Version: 26/06/2016, 22:58 1 / 105

Upload: others

Post on 04-Oct-2019

2 views

Category:

Documents


0 download

TRANSCRIPT

  • Using Single-Cell Transcriptome Sequencing toInfer Olfactory Stem Cell Fate Trajectories

    Sandrine DudoitDivision of Biostatistics and Department of Statistics

    University of California, Berkeleywww.stat.berkeley.edu/~sandrine

    Computational and Systems Immunology SeminarStanford University

    June 28, 2016

    Version: 26/06/2016, 22:58

    1 / 105

    http://www.stat.berkeley.edu/~sandrine

  • Acknowledgments

    • Sandrine Dudoit, Division of Biostatistics and Department ofStatistics, UC Berkeley.

    I Fanny Perraudeau, Graduate Group in Biostatistics, UCBerkeley.

    I Davide Risso, Division of Biostatistics, UC Berkeley.I Kelly Street, Graduate Group in Biostatistics, UC Berkeley.

    • John Ngai, Department of Molecular and Cell Biology, UCBerkeley – Principal investigator.

    I Diya Das.I Russell Fletcher.I David Stafford.

    • Elizabeth Purdom, Department of Statistics, UC Berkeley.• Jean-Philippe Vert, Mines ParisTech and Institut Curie, Paris,

    France.I Svetlana Gribkova.

    2 / 105

  • Acknowledgments

    • Nir Yosef, Department of Electrical Engineering and ComputerSciences, UC Berkeley.

    I Michael Cole.I Allon Wagner.

    • Funded by BRAIN Initiative and California Institute forRegenerative Medicine (CIRM).

    3 / 105

  • Outline

    1 Olfactory Stem Cell Fate TrajectoriesOlfactory Stem Cells and Neural RegenerationOlfactory Epithelium p63 DatasetAnalysis Pipeline

    2 Exploratory Data Analysis and Quality Assessment/Control

    3 Normalization and Expression QuantitationMotivationMethodsSoftware: sconeZero-Inflated Negative Binomial Model

    4 Resampling-Based Sequential Ensemble ClusteringMotivationMethodsSoftware: clusterExperiment

    5 Cell Lineage and Pseudotime InferenceMotivation

    4 / 105

  • Outline

    MethodsSoftware: slingshot

    5 / 105

  • Olfactory Stem Cells and Neural Regeneration

    R. Fletcher, J. Ngai

    6 / 105

  • Olfactory Stem Cells and Neural Regeneration

    • The nature of stem cells giving rise to the nervous system isof particular interest in neurobiology, because neural stemcells remain active in certain brain regions for the entire life ofan individual.

    • We focus on the mouse olfactory epithelium (OE), a site ofactive neurogenesis in the postnatal animal.

    • Adult olfactory stem cells support the replacement of olfactorysensory neurons and non-neuronal support cells (e.g.,sustentacular) over postnatal life and can reconstitute theentire OE following injury.

    • The OE is a convenient system to study, due to itsexperimental accessibility (in situ analysis) and its limitednumber of cell types:

    I olfactory sensory neurons (OSN),I sustentacular cells (SUS),

    7 / 105

  • Olfactory Stem Cells and Neural Regeneration

    I cells of the Bowman gland,I microvillous cells (rare).

    8 / 105

  • Olfactory Stem Cells and Neural Regeneration

    Figure 1: Mouse olfactory epithelium.

    9 / 105

  • Olfactory Stem Cells and Neural Regeneration

    Sustentacular cells

    Mature olfactory neurons

    Immature olfactory neurons Globose basal cells Horizontal basal cells Olfactory ensheathing glia

    Bowman’s gland

    •  HBCs: multipotent, quiescent – deep reserve adult tissue stem cell •  GBCs: proliferative progenitor cells + transit-amplifying cells

    The Horizontal Basal Cell Is an Adult Tissue Stem Cell

    Figure 2: Olfactory epithelium cell types.

    10 / 105

  • Olfactory Stem Cells and Neural Regeneration

    Open questions.

    • Determine the stage at which the neuronal and non-neuronallineages bifurcate/diverge.

    • Characterize discrete intermediate stages of cell differentiation.• Identify the genetic networks and signaling pathways that

    promote self-renewal and regulate the transition todifferentiation.

    11 / 105

  • Olfactory Stem Cells and Neural Regeneration

    p63 regulation of horizontal basal cells.

    • The horizontal basal cell (HBC) is an adult tissue stem cell.• The p63 protein (tumor protein p63, TP63) promotes

    self-renewal of HBC by blocking differentiation.

    • When p63 is down-regulated, this “brake” is removed,allowing differentiation to proceed at the expense ofself-renewal. Thus, p63 can be viewed as a “molecularswitch” that decides between the alternate stem cell fates ofself-renewal vs. differentiation.

    • We use single-cell RNA-Seq to analyze cell fate trajectoriesfrom olfactory stem cells (HBC) of p63 conditional knock-outmice.

    12 / 105

  • Olfactory Epithelium p63 Dataset

    X

    Cre-ER Krt5

    YFP

    loxP-STOP-loxP

    Rosa

    p63

    +

    Experimental design

    Russell Fletcher, Levi Gadye, Mike Sanchez

    ~ ~

    0 24h 48h 72h 96h 7d 14d

    tamoxifen YFP+;p63+/+: 98 cells

    Resting HBCs

    FACS-purify lineage-traced cells -> analyze by single cell RNA-Seq

    Differentiating cells

    YFP+;p63-/-: 88 cells 100 cells 128 cells 110 cells 98 cells

    Figure 3: Experimental design. Single-cell RNA-Seq for 636 HBC: 102wild-type and 534 p63 knock-out cells.

    13 / 105

  • Olfactory Epithelium p63 Dataset

    P01

    P02

    P03

    A

    P03

    B

    P04

    P05

    P06

    P10

    P11

    P12

    P13

    P14

    Y01

    Y04

    Number of cells per batch, colored by bio

    num

    ber

    of c

    ells

    0

    10

    20

    30

    40

    50

    60KO_14DPTKO_24HPTKO_48HPTKO_7DPTKO_96HPTWT_72HPTWT_96HPT

    Figure 4: Experimental design. Number of cells per batch, colored bybiological condition.

    14 / 105

  • Olfactory Epithelium p63 Dataset

    • Single-cell RNA-Seq for 636 HBC.I 102 wild-type (WT)/resting cells

    534 p63 knock-out (KO) cells, at five timepoints followingtamoxifen treatment.

    I Biological replicate: Cells from 1–3 mice.At least two replicates per biological condition.

    I One FACS run and one C1 run per biological replicate=⇒ 14 batches.

    I 8 HiSeq runs (96 cells/lane, single-end 50-base-pair reads).

    • Some confounding between biological and technical effects.• Batch effects nested within biological effects.

    15 / 105

  • Olfactory Epithelium p63 Dataset

    • Marker genes. 94 marker genes, curated from literature andfrom prior microarray, sequencing, and in situ experiments,e.g., neuronal, progenitor cell markers.

    • Housekeeping genes. 715 housekeeping (HK) genes, curatedfrom prior microarray experiments, expected to be constantlyand highly-expressed across cells of the OE.

    • Gene-level read counts. TopHat2 alignment to RefSeq mm10genome and featureCounts(bioinf.wehi.edu.au/featureCounts) counting, withgenes defined as union of all isoforms.

    16 / 105

    http://bioinf.wehi.edu.au/featureCounts

  • Analysis Pipeline

    • Input/Output (I/O).I Input. FASTQ files, gene-level annotation metadata,

    sample-level annotation metadata.I Output. Cluster labels corresponding to cellular types, cell

    lineages and pseudotimes, list of differentially expressed genescorresponding to markers/transcriptome signatures for thesetypes/lineages.

    • Pre-processing. Read alignment, gene-level expressionquantitation, quality assessment/control.

    • Exploratory data analysis (EDA) and qualityassessment/control (QA/QC). Numerical and graphicalsummaries of gene-level read counts, sample filtering (e.g.,based on QC measures), gene filtering (e.g., zero inflation).

    17 / 105

  • Analysis Pipeline

    • Normalization and expression quantitation. Adjust forunwanted technical effects (e.g., C1 run effects), account forzero inflation and over-dispersion.

    • Cluster analysis. Dimensionality reduction (e.g., based onprincipal component analysis), distance measure, clusteringprocedure and tuning parameters, confidencemeasures/validation of clustering results.

    • Cell lineage and pseudotime inference.• Differential expression (DE) analysis. Identification of marker

    genes for cell types/lineages.

    18 / 105

  • Computationally Reproducible Research

    • Version control system for software development and dataanalysis: Git, GitHub.

    • R data packages, with standard object structure (S4 classesExpressionSet, SummarizedExperiment) for gene-levelexpression measures and gene-level and sample-levelannotation metadata.

    • Compendia, i.e., integrated and executable input documents,with text, data, code, and software, that allow the generationof dynamic and reproducible output documents, intermixingtext and code output (textual and graphical). Different viewsof the output document (e.g., PDF, HTML) can beautomatically generated and updated whenever the text, data,or code change.E.g. For R and LATEX/markdown, Sweave, knitr, RStudio.

    19 / 105

  • Zero Inflation

    Proportion of genes with zero count, pre gene filtering

    0.0

    0.2

    0.4

    0.6

    0.8

    0.0 0.2 0.4 0.6 0.8 1.0

    01

    23

    45

    6

    Proportion of cells in which a gene is detected, pre gene filtering

    N = 22054 Bandwidth = 0.02443

    Den

    sity

    (a) Proportion of genes with zero count (b) Proportion of cells in which a gene is detected

    Figure 5: Zero inflation. Pre gene filtering.

    20 / 105

  • Zero Inflation

    • Single-cell RNA-Seq data have many more genes with zeroread counts than bulk RNA-Seq data.

    • This zero inflation could occur for biological reasons (i.e., thegene is simply not expressed) or technical reasons (e.g., lowcapture efficiency).

    • Zero-count gene filtering is advisable for normalization anddownstream analyses.

    • Most RNA-Seq normalization methods involve scaling andperform poorly when many genes have zero counts.

    • In particular, the global-scaling method of Anders and Huber(2010), implemented in the Bioconductor R package DESeq,discards any gene having zero count in at least one sample. Inpractice, the scaling factors are therefore estimated based ononly a handful of genes, e.g., 5/22,054 genes for OE dataset.

    21 / 105

  • Zero Inflation

    • Full-quantile (FQ) normalization also doesn’t behave properlydue to ties from the large number of zeros.

    • We apply the following zero-count gene filtering to the OEdataset: Retain only the genes with at least nr = 20 reads, inat least ns = 40 samples.This yields 9,133/22,054 genes.

    22 / 105

  • Zero Inflation

    Proportion of genes with zero count, post gene filtering

    0.0

    0.2

    0.4

    0.6

    0.8

    0.0 0.2 0.4 0.6 0.8 1.0

    0.0

    0.5

    1.0

    1.5

    2.0

    2.5

    Proportion of cells in which a gene is detected, post gene filtering

    N = 9133 Bandwidth = 0.03333

    Den

    sity

    (a) Proportion of genes with zero count (b) Proportion of cells in which a gene is detected

    Figure 6: Zero inflation. Post gene filtering.

    23 / 105

  • Sample-Level QC

    P01

    P02

    P03

    AP

    03B

    P04

    P05

    P06

    P10

    P11

    P12

    P13

    P14

    Y01

    Y04

    1e+06

    2e+06

    3e+06

    QC measures by batch

    NR

    EA

    DS

    P01

    P02

    P03

    AP

    03B

    P04

    P05

    P06

    P10

    P11

    P12

    P13

    P14

    Y01

    Y04

    500000100000015000002000000250000030000003500000

    QC measures by batch

    NA

    LIG

    NE

    D

    P01

    P02

    P03

    AP

    03B

    P04

    P05

    P06

    P10

    P11

    P12

    P13

    P14

    Y01

    Y04

    90

    92

    94

    96

    QC measures by batch

    RA

    LIG

    N

    P01

    P02

    P03

    AP

    03B

    P04

    P05

    P06

    P10

    P11

    P12

    P13

    P14

    Y01

    Y04

    203040506070

    QC measures by batch

    TOTA

    L_D

    UP

    Figure 7: Sample-level QC. Boxplots of QC measures, by batch.

    24 / 105

  • Sample-Level QC

    P01

    P02

    P03

    AP

    03B

    P04

    P05

    P06

    P10

    P11

    P12

    P13

    P14

    Y01

    Y04

    0.00.10.20.30.40.50.6

    QC measures by batch

    PC

    T_I

    NT

    RO

    NIC

    _BA

    SE

    S

    P01

    P02

    P03

    AP

    03B

    P04

    P05

    P06

    P10

    P11

    P12

    P13

    P14

    Y01

    Y04

    0.100.150.200.250.30

    QC measures by batch

    PC

    T_I

    NT

    ER

    GE

    NIC

    _BA

    SE

    S

    P01

    P02

    P03

    AP

    03B

    P04

    P05

    P06

    P10

    P11

    P12

    P13

    P14

    Y01

    Y04

    0.20.30.40.50.60.70.8

    QC measures by batch

    PC

    T_M

    RN

    A_B

    AS

    ES

    P01

    P02

    P03

    AP

    03B

    P04

    P05

    P06

    P10

    P11

    P12

    P13

    P14

    Y01

    Y04

    0.60.81.01.21.41.61.82.0

    QC measures by batch

    ME

    DIA

    N_C

    V_C

    OV

    ER

    AG

    E

    Figure 8: Sample-level QC. Boxplots of QC measures, by batch.

    25 / 105

  • Sample-Level QC

    ● ●

    ●●

    ●●● ●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ● ●●

    ●●

    ●●

    ●●

    ●● ●

    ●●●

    ●●

    ●●

    ●●

    ●●

    ● ●●

    ●●

    ●● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●● ●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ● ●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●●

    −5 0 5 10

    −4

    −2

    02

    46

    QC PCA by batch

    PC1

    PC

    2

    P01

    P02

    P03

    A

    P03

    B

    P04

    P05

    P06

    P10

    P11

    P12

    P13

    P14

    Y01

    Y04

    −4

    −2

    0

    2

    4

    6

    QC PCA by batch

    PC

    1

    (a) QC PC2 vs. PC1, colored by batch (b) QC PC1, by batch

    Figure 9: Sample-level QC. Principal component analysis (PCA) ofsample-level QC measures.

    26 / 105

  • Sample-Level QC

    ●●

    ●●●

    ● ●

    ●●

    ●● ●

    ● ●

    ● ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●● ●

    ●●

    ●●

    ●●●●

    ●●

    ●●

    ● ●

    ● ●

    ●●

    ● ●●

    ●●

    ●●

    ●●●

    ● ●

    ●●

    ● ●

    ●● ●

    ●●

    ●●

    ● ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●● ●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ● ●●

    ●● ●

    ●●

    ●● ●

    ●●

    ●●

    ●●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●● ●

    ●●●

    ● ●

    ●●

    ● ●

    ●●

    ● ●

    ●●●

    ● ●●●

    ●●

    ● ●●

    ●●●

    ● ●

    ● ●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●●

    ●●●

    ● ●

    ● ●

    ●●

    ●●●

    ●●

    ● ●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    −40 −20 0 20 40 60 80

    −5

    05

    10QC PC1 vs. count PC1

    Count PC1

    QC

    PC

    1

    Cor=−0.49

    PC1 PC2 PC3 PC4 PC5

    Absolute correlation of count PC and QC measures

    0.0

    0.2

    0.4

    0.6

    0.8

    1.0

    NREADSNALIGNEDRALIGNTOTAL_DUPPRIMERPCT_RIBOSOMAL_BASESPCT_CODING_BASESPCT_UTR_BASESPCT_INTRONIC_BASESPCT_INTERGENIC_BASESPCT_MRNA_BASESMEDIAN_CV_COVERAGEMEDIAN_5PRIME_BIASMEDIAN_3PRIME_BIAS

    (a) QC PC1 vs. count PC1 (b) Correlation of count PC and QC measures

    Figure 10: Sample-level QC. Association of counts and sample-level QCmeasures.

    27 / 105

  • Sample-Level QC: Summary

    • The distribution of QC measures can vary substantially withinand between batches.

    • Some QC measures clearly point to low-quality samples, e.g.,low percentage of mapped reads (RALIGN).

    • There can be a strong association between QC measures andread counts (cf. PCA).

    • Filtering samples based on QC measures is advisable, asnormalization procedures may not be able to adjust for QCand some samples simply have low quality.

    • Normalization procedures based on QC measures (e.g.,regression on first few PC of QC measures) should also beconsidered.

    28 / 105

  • Gene-Level Counts

    0

    2

    4

    6

    8

    10

    Gene−level log−count

    log(

    coun

    t+1)

    −4

    −2

    0

    2

    4

    6

    8

    Gene−level RLE

    RLE

    (a) Log-count (b) RLE

    Figure 11: Gene-level counts. Gene-level log-count and relative logexpression (RLE = log-ratio of read count to median read count acrosssamples).

    29 / 105

  • Gene-Level Counts: Summary

    • After gene and sample filtering and before normalization,there are large differences in gene-level count distributionswithin and between batches (cf. RLE, housekeeping genes).

    • The counts are still zero-inflated.• There can be substantial association of counts and

    sample-level QC measures.

    • Normalization is essential before any clustering or differentialexpression analysis, to ensure that observed differences inexpression measures between samples and/or genes are trulydue to differential expression and not technical artifacts.

    30 / 105

  • Motivation

    • In order to derive expression measures and compare thesemeasures between samples and/or genomic regions of interest(ROI), one first needs to normalize read counts to adjust fordifferences in sequencing depths (i.e., total read counts) andROI lengths, and possibly more complex technical effects, e.g.,C1 run/library preparation/flow-cell/lane effects andnucleotide composition (e.g., GC-content) effects –sequenceability.

    • Normalization is essential to ensure that observed differencesin expression measures between samples and/or ROI are trulydue to differential expression and not technical artifacts.

    31 / 105

  • Motivation

    • Normalization is still an important issue for RNA-Seq, despiteinitial optimistic claims such as: “One particularly powerfuladvantage of RNA-Seq is that it can capture transcriptomedynamics across different tissues or conditions withoutsophisticated normalization of data sets” (Wang et al., 2009).

    • This subject has received relatively little attention and fewnovel methods have been proposed in the RNA-Seq literature.

    • It is still common to only scale ROI-level counts by ROI lengthand total sample read count to adjust, respectively, fordifferences in ROI lengths and sequencing depths (even fortechnical replicate lanes).E.g. Estimators based on Poisson model as in Marioni et al.(2008); RPKM, Reads Per Kilobase of exon model per Millionmapped reads, as in Mortazavi et al. (2008).

    32 / 105

  • Motivation

    • In many cases, however, one needs to account for morecomplex features of the genomic regions and experimentaldesign.

    • Normalization seems even more important for single-cellRNA-Seq than bulk RNA-Seq.

    • There are large effects related to C1 run (cells processed inbatches of 96 wells).

    • There is also greater variation in gene-level counts acrosssingle cells than across bulk RNA-Seq samples.

    • The very small amount of RNA present in a single cell maycause amplification bias (i.e., many sequenced copies of atranscript), hence high-magnitude outliers, and low captureefficiency (i.e., genes expressed in the cell, but not capturedby the sequencing process), hence zero inflation.

    33 / 105

  • Motivation

    • Widely-used bulk RNA-Seq methods are not well-suited forhandling the zero inflation and high level of technical noise ofscRNA-Seq data (Vallejos et al., 2016).

    34 / 105

  • Between-Sample Normalization

    Most normalization methods proposed to date are adaptations ofmethods for bulk RNA-Seq and microarrays.

    • Global-scaling normalization. Gene-level counts are scaled bya single factor per sample.

    I Total-count (TC), as in RPKM (Mortazavi et al., 2008), stillwidely used.

    I Housekeeping gene count, as in qRT-PCR, e.g., per-samplePOLR2A count (Bullard et al., 2010).

    I Single-quantile, e.g., per-sample upper-quartile (UQ) of countsfor genes with non-zero counts (Bullard et al., 2010).

    • Pairwise global-scaling normalization. Gene-level counts arescaled by a single factor per sample computed with respect toa reference sample.

    I Trimmed Mean of M values (TMM) (Robinson and Oshlack,2010): Select a reference sample and obtain a robust estimateof the overall fold-change between each sample and thereference. Implemented in edgeR.

    35 / 105

  • Between-Sample Normalization

    I Anders and Huber (2010): The global-scaling factor for agiven sample is defined as the median fold-change betweenthat sample and a synthetic reference sample whose counts aredefined as the geometric means of the counts across samples.Implemented in DESeq and edgeR (as “RLE”).

    • Full-quantile normalization (FQ). All quantiles of the genecount distributions are matched between samples (Bullardet al., 2010). Specifically, for each sample, the distribution ofsorted read counts is matched to a reference distributiondefined in terms of a function of the sorted counts (e.g.,median) across samples. Inspired from the microarrayliterature (Irizarry et al., 2003).

    36 / 105

  • Between-Sample Normalization

    • Loess normalization. Loess fits are performed formean-difference plots (MD-plots) of log-counts for pairs ofsamples (all possible pairs (cyclic) or each sample paired witha synthetic reference obtained by averaging counts acrosssamples). This approach, also inspired from the microarrayliterature (Bolstad et al., 2003), was used in Lovén et al.(2012) and is implemented in the affy functionloess.normalize and limma functionnormalizeCyclicLoess.

    • Regression-based normalization for known technical factors,e.g., C1 run, QC measures.Linear model (limma), negative binomial model (edgeR),empirical Bayes model (ComBat of Johnson et al. (2007),implemented in R package sva), supervised version of RUVbelow.

    • Remove unwanted variation (RUV).37 / 105

  • Between-Sample Normalization

    I Proposed initially for the normalization of microarray data(Gagnon-Bartsch and Speed, 2012; Gagnon-Bartsch et al.,2013; Jacob et al., 2013) and then adapted to sequencing data(Risso et al., 2014a,b), the main idea is to remove “unwantedvariation” by performing factor analysis for a suitably-chosensubset of control genes and/or samples.

    I For n samples and J genes, consider the log-linear/generalizedlinear model (GLM)

    log E[Y |X ,W ] = Wα + Xβ, (1)

    where Y is the n × J matrix of gene-level read counts, X is ann ×M design matrix corresponding to the M covariates ofinterest/factors of “wanted variation” (e.g., treatment) and βits associated M × J matrix of parameters of interest, and Wis an n × K matrix corresponding to unknown factors of“unwanted variation” and α its associated K × J matrix ofnuisance parameters.

    38 / 105

  • Between-Sample Normalization

    I The unwanted factors W can be estimated as follows.

    • RUVg: Based on control genes, with either constantexpression (β = 0) or known expression fold-changes betweensamples (e.g., housekeeping genes, ERCC spike-ins, “in-silico”empirical controls).

    • RUVr: Based on residuals, such as deviance residuals from afirst-pass GLM regression of the counts on the covariates ofinterest, without RUVg normalization, but possibly after someother normalization procedure such as upper-quartile.

    • RUVs: Based on replicate/negative control samples for whichthe covariates of interest are constant.

    • Generalization of RUV. The RUV model can be readilyextended to include known sample-level covariates (e.g., C1run, QC measures) and known gene-level covariates (e.g.,GC-content) that should be adjusted for.

    39 / 105

  • Between-Sample Normalization: RUV

    = + α

    logE[Y|W,X] W X β

    Unwanted variation Variation of interest

    J

    n

    K

    n n

    K

    J

    M

    M J

    Read counts

    n samples x J genes

    Figure 12: RUV model.

    40 / 105

  • SCONE

    D. Risso, M. Cole, N. Yosef

    41 / 105

  • SCONE

    SCONE: Single-Cell Overview of Normalized Expression. A generalframework for the normalization of scRNA-Seq data.

    • Range of normalization methods.I Global-scaling, e.g., DESeq, TMM.I Full-quantile (FQ).I Unknown factors of unwanted variation: Remove unwanted

    variation (RUV).I Known factors of unwanted variation: Regression-based

    normalization on, e.g., QC PC, C1 run.

    • Normalization performance metrics.• Numerical and graphical summaries of normalized read counts

    and metrics.

    • R package scone, to be released through the BioconductorProject: github.com/yoseflab/scone.

    42 / 105

    http://github.com/yoseflab/scone

  • SCONE

    Figure 13: scone. Regression model.

    43 / 105

  • SCONE

    Performance metrics. (Green: Good when high; Red: Good whenlow.)

    • BIO SIL: Average silhouette width by biological condition.• BATCH SIL: Average silhouette width by batch.• PAM SIL: Maximum average silhouette width for PAM

    clusterings, for a range of user-supplied numbers of clusters.

    • EXP QC COR: Maximum squared Spearman correlationbetween count PCs and QC measures.

    • EXP UV COR: Maximum squared Spearman correlationbetween count PCs and factors of unwanted variation(preferably derived from other set of negative control genesthan used in RUV).

    • EXP WV COR: Maximum squared Spearman correlationbetween count PCs and factors of wanted variation (derivedfrom positive control genes).

    44 / 105

  • SCONE

    • RLE MED: Mean squared median relative log expression(RLE).

    • RLE IQR: Mean inter-quartile range (IQR) of RLE.

    45 / 105

  • Software Package scone

    Application to OE p63 dataset.

    • Apply and evaluate 172 normalization procedures using mainscone function.

    I scaling method: None, DESeq, TMM, FQ.I uv factors: None, RUVg k = 1, · · · , 5, QC PC k = 1, · · · , 5.I adjust biology: Yes/no.I adjust batch: Yes/no.

    • Select a normalization procedure based on (function of) theperformance scores.Unweighted mean score =⇒ none,fq,qc k=4,bio,no batchWeighted mean score =⇒none,fq,qc k=2,no bio,no batch

    46 / 105

  • Software Package scone

    ● ●

    ●● ●●

    ●●

    ●●

    ●●●

    ● ●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●●

    ●●

    ● ●

    ●●

    ● ●

    ●●

    ●●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    −0.15 −0.10 −0.05 0.00 0.05 0.10

    −0.

    15−

    0.05

    0.00

    0.05

    0.10

    SCONE: Biplot of scores colored by mean score

    PC1

    PC

    2

    BIO_SIL

    BATCH_SILPAM_SIL

    EXP_QC_COR

    EXP_UV_COREXP_WV_COR

    RLE_MEDRLE_IQR

    Figure 14: scone. Biplot of performance scores, colored by mean score(yellow high/good, blue low/bad).

    47 / 105

  • Software Package scone

    ● ●

    ●●●

    ●●

    ● ●

    ●●

    ● ●

    ●●

    ● ●

    ●●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    −0.4 −0.2 0.0 0.2

    −0.

    20−

    0.10

    0.00

    0.10

    SCONE: Score PCA colored by mean score

    PC1

    PC

    2

    Top meanTop weighted meanfq,ruv_k=1none

    Figure 15: scone. PCA of performance scores, colored by mean score.

    48 / 105

  • Software Package scone

    −0.4 −0.2 0.0 0.2

    −0.

    20−

    0.10

    0.00

    0.10

    SCONE: Score PCA colored by method

    PC1

    PC

    2●

    ● ●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●●

    ●●

    ●● ●

    ●●●

    ●● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●●

    ●●

    nonefqdeseqtmm

    Figure 16: scone. PCA of performance scores, colored by method –scaling method.

    49 / 105

  • Software Package scone

    −0.4 −0.2 0.0 0.2

    −0.

    20−

    0.10

    0.00

    0.10

    SCONE: Score PCA colored by method

    PC1

    PC

    2

    ●●

    ●●●

    ●●

    ●●●

    ●●

    ●●

    ●●●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●●

    ●●

    ●●

    ●●

    no_bio,no_batchno_bio,batchbio,no_batchbio,batch

    Figure 17: scone. PCA of performance scores, colored by method –adjust biology, adjust batch.

    50 / 105

  • Software Package scone

    FQ + RUVg(HK, k=1): W by batch

    W

    −0.

    15−

    0.05

    0.00

    0.05

    0.10

    ●●

    ●● ●

    ●●

    ●●

    ●●●

    ● ●

    ● ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●● ●

    ●●

    ●●

    ●●● ●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ● ●●

    ●●

    ●●

    ● ●●

    ● ●

    ●●

    ● ●

    ●● ●

    ●●

    ●●

    ● ● ●

    ● ●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ● ●●

    ●● ●

    ●●

    ●●●

    ●●

    ●●

    ●●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●● ●

    ●●●

    ●●

    ●●

    ● ●

    ●●

    ● ●

    ●●●

    ●●●●

    ●●

    ● ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●● ●

    ●●●

    ● ●

    ●●

    ●●

    ●● ●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●●●

    ●●

    ●●

    −0.20 −0.15 −0.10 −0.05 0.00 0.05 0.10

    −5

    05

    10

    FQ + RUVg(HK, k=1): QC PC1 vs. W

    W

    QC

    PC

    1

    Cor=−0.56

    (a) Unwanted factor W (b) QC PC1 vs. W

    Figure 18: scone. Association of RUVg unwanted factor W and QCmeasures for none,fq,ruv k=1,no bio,batch.

    51 / 105

  • Software Package scone

    −3

    −2

    −1

    0

    1

    2

    3

    4

    SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: RLE

    RLE

    −2

    0

    2

    4

    6

    SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: HK RLE

    RLE

    (a) All genes (b) Housekeeping genes

    Figure 19: scone. Gene-level relative log expression (RLE = log-ratio ofread count to median read count across samples) for method with topweighted mean score none,fq,qc k=2,no bio,no batch.

    52 / 105

  • Software Package scone

    ●●

    ●●●

    ● ●

    ●●

    ●●●

    ● ●

    ● ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●●

    ●●

    ●●

    ● ●●●

    ● ●

    ●●

    ● ●

    ●●

    ●●

    ● ●●

    ●●

    ●●

    ●●●

    ● ●

    ●●

    ● ●

    ●● ●

    ●●

    ●●

    ● ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ● ●●

    ●● ●

    ●●

    ●● ●

    ●●

    ●●

    ●● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●● ●

    ● ●●

    ● ●

    ●●

    ● ●

    ●●

    ●●

    ●●●

    ● ●●●

    ●●

    ● ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●●

    ●●●

    ● ●

    ● ●

    ●●

    ●● ●

    ●●

    ● ●

    ● ●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    −20 0 20 40

    −5

    05

    10

    SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: QC PC1 vs. count PC1

    count PC1

    QC

    PC

    1

    Cor=0

    PC1 PC2 PC3 PC4 PC5

    SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: Absolute correlation of count PC and QC measure

    0.0

    0.2

    0.4

    0.6

    0.8

    1.0

    NREADSNALIGNEDRALIGNTOTAL_DUPPRIMERPCT_RIBOSOMAL_BASESPCT_CODING_BASESPCT_UTR_BASESPCT_INTRONIC_BASESPCT_INTERGENIC_BASESPCT_MRNA_BASESMEDIAN_CV_COVERAGEMEDIAN_5PRIME_BIASMEDIAN_3PRIME_BIAS

    (a) QC PC1 vs. count PC1 (b) Correlation of count PC and QC measures

    Figure 20: scone. Association of counts and sample-level QC measures,none,fq,qc k=2,no bio,no batch.

    53 / 105

  • Software Package scone: Summary

    • Unnormalized gene-level counts exhibit large differences indistributions within and between batches and association withsample-level QC measures.

    • Different normalization methods vary in performanceaccording to SCONE metrics and lead to different distributionsof gene-level counts, hence clustering and DE results.

    • Global-scaling normalization. Not aggressive enough to handlepotentially large batch effects and association of counts andQC measures. Biological effects are dominated by nuisancetechnical effects. Additionally, for DESeq, the scaling factorsare computed based on only a handful of genes with non-zerocounts in all cells (5/22,054).

    54 / 105

  • Software Package scone: Summary

    • Batch effect normalization. Adjusting for batch effectswithout properly accounting for the nesting of batch withinbiological effects (no bio,batch) in the regression model isproblematic, as this removes the biological effects of interest(e.g., empirical Bayes framework of ComBat).

    • FQ followed by QC-based or RUVg normalization. Seemseffective: Similar RLE distributions between samples, lowerassociation of counts and QC measures. The first unwantedfactor of RUVg is correlated with the first QC PC.

    • The remaining analyses are based onnone,fq,qc k=2,no bio,no batch, the best methodaccording to weighted mean score.

    55 / 105

  • Software Package scone: Summary

    • Interpretation of performance metrics. Some metrics tend tofavor certain methods over others, e.g., EXP UV COR(correlation between count PCs and factors of unwantedvariation) naturally favors RUVg, especially when the same setof negative controls are used for normalization and evaluation.Hence, a careful, global interpretation of the metrics isrecommended.

    • Negative controls. The selection of proper, distinct sets ofnegative controls is important, as these are used for bothnormalization (RUVg) and assessment of normalization results(EXP UV COR).

    • Ongoing efforts.I Zero-inflated negative binomial (ZINB) model.I User-supplied factors unwanted and wanted variation (UV and

    WV, respectively).I Other methods (e.g., ComBat/sva).

    56 / 105

  • Software Package scone: Summary

    I Other performance metrics.I Visualization.I Shiny app for interactive web interface.

    57 / 105

  • Resampling-Based Sequential Ensemble Clustering

    D. Risso, E. Purdom

    58 / 105

  • Motivation

    • Robustness to choice of samples. Both hierarchical andpartitioning methods tend to be sensitive to the choice ofsamples to be clustered. Outlying samples/clusters (e.g., glia)are common in scRNA-Seq and mask interesting substructurein the data, often requiring the successive pruning out ofdominating clusters to get to the finer structure.

    • Robustness to clustering algorithm and tuning parameters.Clustering results are sensitive to pre-processing steps such asnormalization and dimensionality reduction, as well as to thechoice of clustering algorithm and associated tuningparameters (e.g., distance function, number of clusters).

    59 / 105

  • Motivation

    • Not focusing on the number of clusters. A major tuningparameter of partitioning methods such as partitioning aroundmedoids (PAM) and k-means is the number of clusters k .Methods for selecting k (e.g., silhouette width) are sensitiveto the choice of samples, normalization, and other tuningparameters. They tend to be conservative (low k), i.e.,capture only the coarse clustering structure and maskinteresting substructure in the data. Additionally, the numberof clusters k is often not of primary interest.E.g. Silhouette width with PAM selects only k = 2 clusters forthe OE p63 dataset.

    • Not forcing samples into clusters. Some samples may beoutliers, that do not really belong to any clusters. Leavingthem out can improve the quality and interpretability of theclustering as well as downstream analyses (e.g., identificationof cluster marker genes).

    60 / 105

  • Motivation

    • Cluster gene expression signatures. Common differentialexpression statistics are not well-suited for finding markergenes for the clusters, especially for finer structure in ahierarchy.

    61 / 105

  • Motivation

    2 3 4 5 6 7 8 9 10 11 12 13 14 15

    SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−, PAM PC: Average silhouette width

    k

    0.0

    0.1

    0.2

    0.3

    0.4

    SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−, PAM PC k=2

    silh

    ouet

    te

    0.0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6 KO_14DPT

    KO_24HPTKO_48HPTKO_7DPTKO_96HPTWT_72HPTWT_96HPT

    (a) Average silhouette width (b) Silhouette widths, k = 2

    Figure 21: Partitioning around medoids. 20 PC, one-minus-correlationdistance.

    62 / 105

  • Resampling-Based Sequential Ensemble Clustering

    • We have developed a resampling-based sequential ensembleclustering approach, with the aim of obtaining stable andtight clusters.

    • Ensemble clustering, i.e., aggregating multiple clusteringsobtained from different algorithms or applications of a givenalgorithm to resampled versions of the learning set, is ageneral approach for improving stability. This can be viewedas the unsupervised analog of ensemble methods in supervisedlearning, e.g., bagging, boosting, random forests.

    • Our approach is related to bagged/consensus/tight clustering(Dudoit and Fridlyand, 2003; Leisch, 1999; Tseng and Wong,2005).

    • R package clusterExperiment, released through theBioconductor Project.

    63 / 105

  • Resampling-Based Sequential Ensemble Clustering

    RSEC: Resampling-based Sequential Ensemble Clustering.

    • Given a base clustering algorithm (e.g., PAM, k-means) andassociated tuning parameters (e.g., number of principalcomponents, number of clusters k , distance matrix), generatea single candidate clustering using

    I resampling-based clustering to find robust and tight clusters;I sequential clustering to find stable clusters over a range of

    numbers of clusters (Tseng and Wong, 2005).

    • Generate a collection of candidate clusterings by repeating theabove procedure for different base clustering algorithms andtuning parameters.

    • Identify a consensus over the different candidate clusterings.• Merge non-differential clusters.• Find cluster signatures by testing for differential expression

    between selected subsets of clusters.

    64 / 105

  • Resampling-Based Sequential Ensemble Clustering

    • Visualization. Comparison of multiple clusterings of the samesamples, heatmaps of co-clustering matrices, heatmaps withhierarchical clustering of genes and/or samples.

    65 / 105

  • Resampling-Based Clustering

    • Consider a particular base partitioning clustering algorithm(e.g., PAM, k-means) and associated tuning parameters (e.g.,number of principal components, number of clusters k ,distance matrix).

    • Generate B subsamples of the n observations to be clustered,e.g., bootstrap or random sample without replacement (of size0.7n, say).

    • For each subsample, apply the base clustering algorithm (withk clusters) and assign all n original observations to clustersbased on their distances from cluster centroids/medoids.

    • Generate an n × n co-clustering matrix C by recording foreach pair of observations the proportion of times they areclustered together (out of B subsamples).

    66 / 105

  • Resampling-Based Clustering

    • Identify tight clusters based on the co-clustering matrix. Forexample, given a constant α close to 0, cut branchestop-down from a hierarchical clustering based on theco-clustering matrix C , such that Ci ,j ≥ 1− α ∀i , j in theselected branch (or some other less stringent function thanmin). Retain all clusters above a certain size or the largestcertain number of clusters.

    67 / 105

  • Sequential Clustering

    Based on Tseng and Wong (2005).

    • Starting with a user-supplied k = k0, generate a collection ofclusterings, {Ck1 , Ck2 , . . . , Ckmk}, by applying the aboveresampling-based procedure to the base partitioning clusteringalgorithm with k clusters.

    • Given a constant β close to 1, increase k by one untils(Cki , C

    k+1j ) ≥ β, for some i , j , where s is a measure of

    similarity between clusters.

    • Identify Ck+1j as a stable cluster.

    • Remove Ck+1j and repeat the above steps with k ← k − 1,until k is below a certain threshold or a certain number ofclusters are identified.

    68 / 105

  • Sequential Clustering

    • Here, we evaluate a collection of clusters for stability based onthe pairwise similarity measure

    s(Ci , Cj) ≡ |Ci ∩ Cj |/|Ci ∪ Cj |. (2)

    69 / 105

  • Ensemble Clustering

    • Generate a collection of candidate clusterings as above, for arange of base clustering algorithms and tuning parameters.

    • Find a single consensus clustering based on the co-clusteringmatrix (as in resampling, only now across different algorithmsor tuning parameters).

    • Merge non-differential clusters by creating a hierarchy ofclusters, working up the tree, testing for differential expressionbetween sister nodes, and collapsing nodes with insufficientDE.

    70 / 105

  • Differential Expression

    • Find cluster gene expression signatures, i.e., marker genes, bytesting for differential expression between selected subsets ofclusters.

    • Standard F -statistic. Tests for any difference betweenclusters. Sensitive to outlying samples/clusters. Non-specific,i.e., not useful for interpreting differences between clusters.

    • Standard solution in (generalized) linear models/ANOVA is toconsider contrasts between groups of clusters. By using themachinery of the (generalized) linear model, we use all of thesamples in testing these contrasts, rather than just thosesamples involved in the corresponding clusters.

    I All pairwise. All pairwise comparisons between clusters.I One against all. Compare each cluster to union of remaining

    clusters.

    71 / 105

  • Differential Expression

    I Dendrogram. Create a hierarchy of clusters, work up the tree,test for DE between sister nodes (as in approach used formerging clusters).

    • For each contrast, test for DE using empirical Bayes linearmodeling approach of R package limma, with voom option toaccount for mean-variance relationship of log-counts (i.e.,over-dispersion).

    72 / 105

  • Software Package clusterExperiment

    Workflow.

    • clusterMany. Generate a collection of candidate clusterings,for different base clustering algorithms and tuning parameters,with option to use resampling and sequential approaches.

    • combineMany. Find consensus clustering across severalclusterings.

    • Identify non-differential clusters that should be merged intolarger clusters.

    I makeDendrogram. Hierarchical clustering of the clusters foundby combineMany.

    I mergeClusters. Merge clusters of this hierarchy based on DEbetween nodes.

    • RSEC. Wrapper function around the clusterExperimentworkflow.

    73 / 105

  • Software Package clusterExperiment

    • getBestFeatures. Find cluster signatures by testing fordifferential expression between selected subsets of clusters.

    • Visualization.I plotClusters. Comparison of multiple clusterings of the

    same samples. Based on ConsensusClusterPlus package.I plotHeatmap. Heatmaps of co-clustering matrices, heatmaps

    with hierarchical clustering of genes and/or samples (interfaceto aheatmap from NMF package).

    74 / 105

  • Software Package clusterExperiment

    Application to OE p63 dataset.

    • clusterMany: Generate 22 candidate clusterings.I Dimensionality reduction: 25, 50 PC.I Euclidean distance.I Base clustering method: PAM, k = 5, · · · , 15.I Resampling-based clustering: B = 100, proportion = 0.7,α = 0.3.

    I Sequential clustering: k0 = 15, β = 0.9.I clusterFunction=c("hierarchical01").

    • combineMany(ce, clusterFunction="hierarchical01",whichClusters="workflow", proportion=0.7,

    propUnassigned=0.5, minSize=5).

    • mergeClusters(ce, mergeMethod="adjP",cutoff=0.05).

    75 / 105

  • Software Package clusterExperiment

    Clusterings from clusterMany

    nPCAFeatures=50,k0=15nPCAFeatures=25,k0=15nPCAFeatures=50,k0=14nPCAFeatures=25,k0=14nPCAFeatures=50,k0=13nPCAFeatures=25,k0=13nPCAFeatures=50,k0=12nPCAFeatures=25,k0=12nPCAFeatures=50,k0=11nPCAFeatures=25,k0=11nPCAFeatures=50,k0=10nPCAFeatures=25,k0=10nPCAFeatures=50,k0=9nPCAFeatures=25,k0=9nPCAFeatures=50,k0=8nPCAFeatures=25,k0=8nPCAFeatures=50,k0=7nPCAFeatures=25,k0=7nPCAFeatures=50,k0=6nPCAFeatures=25,k0=6nPCAFeatures=50,k0=5nPCAFeatures=25,k0=5

    Figure 22: clusterExperiment. Comparison of 22 clusterManyclusterings using plotClusters.

    76 / 105

  • Software Package clusterExperiment

    combineMany−1c1c10c11c12c13c14c15c16c17c18c19c2c20c21c22c23c24c25c26c27c28c29c3

    0

    0.2

    0.4

    0.6

    0.8

    1

    Co−clustering proportion matrix

    combineMany

    Figure 23: clusterExperiment. Heatmap of co-clustering matrix forclusterMany clusterings, used to create combineMany clustering(plotHeatmap).

    77 / 105

  • Software Package clusterExperiment

    c1

    c2

    c3

    c4

    c5

    c6

    c7c8c9

    c10

    c11c12

    c13

    c14

    c15

    c16

    c17c18

    c19c20

    c21c22

    c23

    c24c25

    c26c27

    c28

    c29c30c31

    c32

    c1

    c2

    c3

    c4

    c5

    c6

    c7c8c9

    c10

    c11c12

    c13

    c14

    c15

    c16

    c17c18

    c19c20

    c21c22

    c23

    c24c25

    c26c27

    c28

    c29c30c31

    c32

    0.38

    0.058

    0.15

    0.028

    0.066

    0.088

    0.035 0.014

    0.014

    0.039

    0.039

    0.026

    0.023

    0.017

    0.016

    0.0065

    0.076

    0.00790.011

    0.0086

    0.0110.0042

    0.0077

    0.0092

    0.006

    0.0065

    0.0086

    0.0056

    0.0051

    0.0038

    0.0056

    Figure 24: clusterExperiment. Dendrograms from combineMany andmergeClusters.

    78 / 105

  • Software Package clusterExperiment

    ●●

    ●●

    ● ●

    ● ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●● ●

    ●●

    ●●

    ●● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●●●

    ●●

    ●●

    ●●

    ●●

    ● ●●

    ●●

    ●●

    ●●

    ●● ●

    ●●●

    ● ●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●● ●

    ● ●

    ●● ●

    −50 0 50

    −40

    −20

    020

    4060

    80

    Count PCA by cluster

    PC1

    PC

    2

    −112345678

    Figure 25: clusterExperiment. PCA of gene-level log-counts, colored bymergeClusters.

    79 / 105

  • Software Package clusterExperiment

    mergeClusters−1m1m2m3m4m5m6m7m8

    Bio−1−2ContaminantsHBCHBC transitionINP/GBCINP2iOSNMicrovillousmOSNsSUSSUS precursor

    −4

    −2

    0

    2

    4

    6

    8

    10

    Heatmap of DE genes, dendrogram contrasts

    mergeClustersBio

    Figure 26: clusterExperiment. Heatmap of log-counts forgetBestFeatures DE genes using dendrogram contrasts (269, top 50 ineach of the 6 nodes).

    80 / 105

  • Software Package clusterExperiment

    sirt5

    hnrnpl

    rmi2

    gm17821

    nup98

    cox6c

    oaz1

    rpl27

    rplp1

    rpl13a

    rps20

    rps7

    rpl23

    rpl37a

    rpl23a

    rps29

    rpl32

    rplp2

    rpl24

    fau

    rpl36

    rpl37

    rps23

    rpl35

    prrc2c

    son

    kcnq1ot1

    nip7

    eif1

    cfl1

    rpl10

    lars2

    rps25

    rpl13

    naca

    rplp0

    rpl7

    rpl8

    rps6

    rps3a1

    rpl21

    rps11

    rpl14

    hmgb1

    rpl3

    trns1

    rnr1

    ppia

    rps27

    myh9

    rpl9

    rps9

    rps19

    rpl41

    rps14

    map1b

    eif4g2

    calm2

    ywhae

    gsk3b

    calm1

    rn18s−rs5

    malat1

    actb

    gnb1

    rnr2

    hsp90ab1

    eef1a1

    h3f3b

    canx

    btg1

    srsf2

    csde1

    fth1

    sod1

    set

    ywhaz

    ubb

    tmsb4x

    prdx1

    akr1a1

    sptbn1

    ubc

    mt1

    dstn

    rps5

    rpl4

    rps3

    rpl18a

    actr2

    ddx5

    hnrnpa2b1

    skp1a

    hspa8

    ptges3

    tpt1

    actg1

    ftl1

    ptma

    gm1821 mergeClusters−1m1m2m3m4m5m6m7m8

    Bio−1−2ContaminantsHBCHBC transitionINP/GBCINP2iOSNMicrovillousmOSNsSUSSUS precursor

    0

    2

    4

    6

    8

    10

    Heatmap of DE genes, F−test

    mergeClustersBio

    Figure 27: clusterExperiment. Heatmap of log-counts forgetBestFeatures DE genes using F -test (top 100).

    81 / 105

  • Software Package clusterExperiment

    sox11

    creer

    gstm2

    cyp2f2

    cbr2

    trim66

    gap43

    ebf1

    ebf2

    ncam1

    omp

    cnga2

    gng13

    rgs11

    lhx2

    ebf3

    nhlh1

    gng8

    arhgdig

    cyp2g1

    lgr5

    cyp1a2

    scarb1

    coch

    cd24a

    cftr

    ceacam1

    prmt5

    fabp5

    neurod1

    nhlh2

    zfp423

    ebf4

    rrm1

    ezh2

    hes6

    ccnd1

    rbm24

    kit

    dtl

    tead2

    mcm7

    ccna2

    ect2

    top2a

    cdca8

    pbk

    il6

    notch1

    yap1

    icam1

    ppara

    fgfr2

    dkk3

    igfbp4

    itgb5

    egfr

    igfbp2

    pax6

    sall1

    elf5

    il33

    sox2

    wnt4

    lifr

    nrcam

    trp63

    krt14

    krt5

    sox9

    kitl

    sfrp1

    f3

    perp

    ccnd2

    hmgb2

    tubb5

    cdkn1b

    egfp

    igfbp5

    krt8

    hes1

    notch2 mergeClusters−1m1m2m3m4m5m6m7m8

    Bio−1−2ContaminantsHBCHBC transitionINP/GBCINP2iOSNMicrovillousmOSNsSUSSUS precursor

    0

    2

    4

    6

    8

    10

    Heatmap of markers genes

    mergeClustersBio

    Figure 28: clusterExperiment. Heatmap of log-counts for “a priori”markers genes (83).

    82 / 105

  • Software Package clusterExperiment: Summary

    • Tuning parameters.I α controls tightness.I β controls stability.I k0, the initial number of clusters used in sequential clustering,

    is the parameter with the greatest impact on the results.Larger k0 tend to lead to smaller and tighter clusters.

    • Caveat. The DE analysis is exploratory and nominal p-valuesonly a rough summary of significance (reliance on models, tinyp-values even after adjustment for multiple testing, same dataused to define clusters and to perform DE analysis).

    • Ongoing efforts.I Cluster confidence measures.I Potentially assign unclustered observations to clusters.I DE using ZINB model.I Greater modularity (e.g., distance functions, DE test).I Shiny app for interactive web interface.

    83 / 105

  • Cell Lineage and Pseudotime Inference

    K. Street, D. Risso, E. Purdom

    84 / 105

  • Motivation

    • Mapping transcriptional progression from stem cells tospecialized cell types is essential for properly understandingthe mechanisms regulating cell and tissue differentiation.

    • There may not always be a clear distinction between states,but rather a smooth transition, with individual cells existingon a continuum between states.

    • In such a case, cells may undergo gradual transcriptionalchanges, where the relationship between states can berepresented as a continuous lineage dependent upon anunderlying spatial or temporal variable. This representation,referred to as pseudotemporal ordering, can help usunderstand how cells differentiate and how cell fate decisionsare made (Bendall et al., 2014; Campbell et al., 2015; Ji andJi, 2016; Petropoulos et al., 2016; Shin et al., 2015; Trapnellet al., 2014).

    85 / 105

  • Motivation

    • We have developed Slingshot as a flexible and robustframework for inferring cell lineages and pseudotimes in thestudy of continuous differentiation processes.

    86 / 105

  • Cell Lineage and Pseudotime Inference

    • Input/Output.I Input. Normalized gene expression measures and cell

    clustering.I Output. Cell lineages, i.e., subsets of ordered cell clusters.

    Cell pseudotimes, i.e., for each lineage, ordered sequence ofcells and associated pseudotimes.

    • Dimensionality reduction.I Principal component analysis (PCA) seems effective and

    simple, in conjunction with steps detailed next.I Other approaches include related linear methods, e.g.,

    independent component analysis (ICA) (Trapnell et al., 2014,Monocle), and non-linear methods, e.g., Laplacianeigenmaps/spectral embedding (Campbell et al., 2015,Embeddr), t-distributed stochastic neighbor embedding(t-SNE) (Bendall et al., 2014; Petropoulos et al., 2016,Wanderlust).

    87 / 105

  • Cell Lineage and Pseudotime Inference

    • Inferring cell lineages.I Minimum spanning tree (MST; ape package) over cell clusters,

    with between-cluster distance based on Euclidean distancebetween cluster means scaled by within-cluster covariance.

    I Outlying clusters. Identified using granularity parameter ω thatlimits maximum edge weight in the tree. Specifically, buildMST using an artificial cluster Ω, with distance ω from otherclusters (a fraction of maximum pairwise distance betweenclusters), and then remove Ω.

    I Root and leaf nodes. May either be pre-specified orautomatically selected.Root node. If not pre-specified, selected based on parsimony(i.e., set of lineages with maximal number of clusters sharedbetween them).Leaf nodes. If pre-specified, constrained MST.

    I A lineage is then defined as any unique path coming out of theroot node and ending in a leaf node.

    88 / 105

  • Cell Lineage and Pseudotime Inference

    I Constructing the MST on clusters (Ji and Ji, 2016; Shin et al.,2015, TSCAN,Waterfall) vs. cells (Trapnell et al., 2014,Monocle) offers greater stability and computational efficiency,less complex lineages, and easier determination of directionalityand branching.

    • Inferring cell pseudotimes.I Iterative procedure inspired from the principal curve algorithm

    of Hastie and Stuetzle (1989); principal.curve function inprincurve package.

    I In the case of branching lineages, a shrinkage step is includedat each iteration, that forces a degree of similarity between thecurves in the neighborhood of shared clusters.

    I Pseudotime values are derived by orthogonal projection ontothe curves.

    I Cells belonging to clusters that are included in multiplelineages have multiple, similar pseudotime values.

    89 / 105

  • Cell Lineage and Pseudotime Inference

    I Previous approaches also use smooth curves to representlineages (Campbell et al., 2015; Petropoulos et al., 2016,Embeddr), while others use piecewise linear paths through theMST and extract orderings either by orthogonal projection (Jiand Ji, 2016; Shin et al., 2015, TSCAN,Waterfall) or PQ tree(Trapnell et al., 2014, Monocle).

    I We find that smooth curves provide discerning power not foundin piecewise linear trajectories, while also adding stability overa range of dimensionality reduction and clustering methods.

    • Differential expression. Regression of gene expressionmeasures on pseudotime, e.g., generalized additive models(GAM) (Ji and Ji, 2016, TSCAN).

    • Visualization. Two- and three-dimensional plots of celllineages and pseudotimes, gene-level trajectories, heatmaps forDE genes.

    90 / 105

  • Cell Lineage and Pseudotime Inference

    • R package slingshot, to be released through the BioconductorProject: github.com/kstreet13/slingshot.

    91 / 105

    http://github.com/kstreet13/slingshot

  • Software Package slingshot

    • Modularity.I Integrates easily with a range of normalization, clustering, and

    dimensionality reduction methods.I get lineages: Given expression measures and cluster labels,

    use MST to infer lineages.get lineages(X, clus.labels, start.clus = NULL,

    end.clus = NULL, dist.fun = NULL, omega = Inf,

    distout = FALSE).I get curves: Given lineages, infer pseudotimes.

    get curves(X, clus.labels, lineages, thresh =

    1e-04, maxit = 100, stretch = 2, shrink = TRUE).

    • Flexibility. Can be used with varying levels of supervision.I Cluster-based approach allows for easy supervision when

    researchers have prior knowledge of cell classes, while stillbeing able to detect novel branching events.

    I User-supplied or data-driven selection of root and leaf nodes.

    92 / 105

  • Software Package slingshot

    • Visualization.I plot tree: MST in 2 and 3D.I plot curves: Lineage curves in 2 and 3D.

    93 / 105

  • Software Package slingshot

    Application to OE p63 dataset.

    • Applied to the first three principal components, Slingshotidentifies two lineages: The first corresponds to theHBC-to-neurons transition, the second to theHBC-to-sustentacular cells transition.

    • A first-pass DE analysis, based on a regression of log-count onpseudotime using GAM, suggests that many genes areinvolved in the differentiation process.

    • Among the top 100 DE genes for each lineage, only 13 are DEin both, suggesting distinct processes in the neuronal vs.non-neuronal lineages.