Towards Whole-Transcriptome Deconvolution with Single-cell Data
James Lindsay (1), Ion Mandoiu (1), Craig Nelson (2)
University of Connecticut
(1) Department of Computer Science and Engineering
(2) Department of Molecular and Cell Biology
Mouse Embryo
[Figure: mouse embryo with labels: anterior / head, posterior / tail, somites, node, neural tube, primitive streak]
Unknown Mesoderm Progenitor
• What is the expression profile of the progenitor cell type?
NSB=node-streak border; PSM=presomitic mesoderm; S=somite; NT=neural tube/neurectoderm; EN=endoderm
Characterizing Cell-types
• Goal: whole-transcriptome expression profiles of individual cell types
• Measuring whole-transcriptome expression from single cells is technically challenging
• Approach: computational deconvolution of cell mixtures
  • Assisted by single-cell qPCR expression data for a small number of genes
Modeling Cell Mixtures
Mixtures (X) are a linear combination of the signature matrix (S) and the concentration matrix (C):

$$X_{m \times n} = S_{m \times k} \cdot C_{k \times n}$$

• X: genes × mixtures (m × n)
• S: genes × cell types (m × k)
• C: cell types × mixtures (k × n)
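To make the matrix shapes concrete, the following is a minimal sketch of the mixture model using NumPy. The numbers in `S` and `C` are made up for illustration and do not come from the talk.

```python
import numpy as np

# Hypothetical signature matrix S: 3 genes (rows) x 2 cell types (columns).
S = np.array([[10.0, 0.0],
              [2.0, 8.0],
              [0.0, 5.0]])

# Hypothetical concentration matrix C: 2 cell types (rows) x 2 mixtures (columns).
# Each column holds the proportions of one mixture and sums to 1.
C = np.array([[0.5, 0.2],
              [0.5, 0.8]])

# Mixture matrix X: 3 genes (rows) x 2 mixtures (columns).
X = S @ C
print(X)
```

Each column of `X` is the expression of one mixture, obtained by weighting the cell-type signatures with that mixture's proportions.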
Previous Work
1. Coupled Deconvolution
  • Given: X; Infer: S, C
  • NMF: Repsilber, BMC Bioinformatics, 2010
  • Minimum polytope: Schwartz, BMC Bioinformatics, 2010
2. Estimation of Mixing Proportions
  • Given: X, S; Infer: C
  • Quadratic programming: Gong, PLoS One, 2012
  • LDA: Qiao, PLoS Comp Bio, 2012
3. Estimation of Expression Signatures
  • Given: X, C; Infer: S
  • csSAM: Shen-Orr, Nature Brief Com, 2010
Single-cell Assisted Deconvolution
Given: X and single-cell qPCR data
Infer: S, C
Approach:
1. Identify cell types and estimate the reduced signature matrix Ŝ using single-cell qPCR data
  • Outlier removal
  • K-means clustering followed by averaging
2. Estimate mixing proportions C using Ŝ
  • Quadratic programming, 1 mixture at a time
3. Estimate the full expression signature matrix S using C
  • Quadratic programming, 1 gene at a time
Step 1: Outlier Removal + Clustering
[Figure: two panels, "unfiltered" and "filtered"]
Remove cells whose maximum Pearson correlation with any other cell is below 0.95
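Step 1 can be sketched as follows, assuming the single-cell profiles are stored as a cells × genes array. The toy profiles are invented, and the farthest-point initialization of k-means is a choice made here to keep the example deterministic; the slides do not say how the clustering was initialized.

```python
import numpy as np

def remove_outliers(cells, min_corr=0.95):
    """Keep cells whose best Pearson correlation with another cell is >= min_corr."""
    corr = np.corrcoef(cells)            # cell-by-cell correlation (rows are cells)
    np.fill_diagonal(corr, -np.inf)      # ignore each cell's correlation with itself
    keep = corr.max(axis=1) >= min_corr
    return cells[keep], keep

def kmeans(points, k, n_iter=100):
    """Plain Lloyd's k-means with farthest-point initialization (an assumption)."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical data: 3 cells of one type, 3 of another, and 1 outlier (5 genes each).
cells = np.array([
    [10, 1, 1, 8, 2], [11, 1, 2, 8, 2], [10, 2, 1, 9, 2],
    [1, 9, 7, 1, 3], [2, 9, 8, 1, 3], [1, 10, 7, 2, 3],
    [5, 5, 5, 5, 20],
], dtype=float)

filtered, keep = remove_outliers(cells)
labels, centroids = kmeans(filtered, k=2)

# Reduced signature matrix: genes x cell types, one averaged profile per cluster.
S_hat = centroids.T
print(keep, labels, S_hat, sep="\n")
```

Averaging within each cluster is what turns the clustering into a signature estimate: every column of `S_hat` is the mean profile of one putative cell type.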
Step 1: PCA of Clustering
Step 2: Estimate Mixture Proportions
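For a given mixture i:

$$\min_{c} \; \lVert \hat{S} c - x \rVert^2 \quad \text{s.t.} \quad \sum_{l=1}^{k} c_l = 1, \quad c_l \ge 0 \;\; \forall\, l = 1 \dots k$$

• c: c_l = C_{l,i} for all l = 1…k
• x: x_j = X_{j,i} for all j = 1…m
• Ŝ: reduced signature matrix (centroids of the k-means clusters)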
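The constrained least-squares problem for one mixture can be sketched with NumPy alone. This example swaps in projected gradient descent onto the probability simplex in place of a dedicated quadratic-programming solver, so it illustrates the same objective and constraints but is not necessarily how the authors solved it. The matrix `S_hat` and the proportions `c_true` are invented test values.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {c : c >= 0, sum(c) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def estimate_proportions(S_hat, x, n_iter=2000):
    """Minimize ||S_hat @ c - x||^2 subject to c >= 0 and sum(c) = 1."""
    k = S_hat.shape[1]
    c = np.full(k, 1.0 / k)                              # start from equal proportions
    step = 1.0 / (2.0 * np.linalg.norm(S_hat, 2) ** 2)   # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = 2.0 * S_hat.T @ (S_hat @ c - x)
        c = project_to_simplex(c - step * grad)
    return c

# Hypothetical reduced signature matrix: 4 genes x 3 cell types.
S_hat = np.array([[10.0, 0.0, 1.0],
                  [2.0, 8.0, 0.0],
                  [0.0, 5.0, 6.0],
                  [1.0, 1.0, 9.0]])
c_true = np.array([0.2, 0.5, 0.3])
x = S_hat @ c_true                # noise-free mixture of the three cell types

c_est = estimate_proportions(S_hat, x)
print(c_est)
```

To fill the whole concentration matrix, the same function would be called once per column of X, matching the "1 mixture at a time" description.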
Step 3: Estimating Full Expression Signatures
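Now solve, for each new gene:

$$\min_{s} \; \lVert s C - x \rVert^2$$

• s: signature of the new gene to estimate (one value per cell type)
• C: known from step 2
• x: observed signals from the new gene (one value per mixture)
[Figure: matrix diagram with axes labeled genes × mixtures, genes × cell types, and cell types × mixtures]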
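A minimal sketch of the per-gene problem follows. The slides do not state the constraints used in this quadratic program, so the non-negativity constraint on `s` is an assumption made here, and projected gradient descent stands in for a QP solver. `C`, `s_true`, and `x` are invented values.

```python
import numpy as np

def estimate_signature(C, x, n_iter=2000):
    """Minimize ||s @ C - x||^2 over s, assuming s >= 0 (assumption, see lead-in)."""
    k = C.shape[0]
    s = np.zeros(k)
    step = 1.0 / (2.0 * np.linalg.norm(C, 2) ** 2)   # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = 2.0 * (s @ C - x) @ C.T
        s = np.maximum(s - step * grad, 0.0)         # project onto s >= 0
    return s

# Hypothetical concentration matrix from step 2: 3 cell types x 5 mixtures.
C = np.array([[0.7, 0.1, 0.2, 0.3, 0.5],
              [0.2, 0.8, 0.1, 0.3, 0.1],
              [0.1, 0.1, 0.7, 0.4, 0.4]])

# Hypothetical true signature of one new gene across the 3 cell types.
s_true = np.array([4.0, 0.0, 7.0])
x = s_true @ C                    # observed signal of that gene in each mixture

s_est = estimate_signature(C, x)
print(s_est)
```

Repeating the call for every row of X yields the full signature matrix one gene at a time. The estimate is only well determined when there are at least as many mixtures as cell types.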
Experimental Design
Simulated Concentrations
• Sample uniformly at random from [0,1]
• Scale column sums to 1
Simulated Mixtures
• Choose single cells randomly with replacement from each cluster
• Sum to generate mixture
Single Cell Profiles
• 92 profiles
• 31 genes
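The simulation described above might look like the sketch below. The slides do not say how many cells are drawn from each cluster, so drawing a count proportional to the simulated concentration (via the hypothetical parameter `cells_per_mixture`) is an assumption. The profiles are random placeholders, not the 92 × 31 qPCR data set.

```python
import numpy as np

def simulate_concentrations(k, n, rng):
    """Sample a k x n matrix uniformly from [0, 1] and scale each column to sum to 1."""
    C = rng.uniform(0.0, 1.0, size=(k, n))
    return C / C.sum(axis=0, keepdims=True)

def simulate_mixtures(profiles, labels, C, cells_per_mixture, rng):
    """Sum single-cell profiles drawn with replacement from each cluster."""
    k, n = C.shape
    X = np.zeros((profiles.shape[1], n))
    for i in range(n):
        for l in range(k):
            pool = profiles[labels == l]
            # Assumption: number of cells drawn is proportional to the concentration.
            count = int(round(C[l, i] * cells_per_mixture))
            picks = rng.integers(0, len(pool), size=count)
            X[:, i] += pool[picks].sum(axis=0)
    return X

rng = np.random.default_rng(0)

# Placeholder single-cell data: 12 cells x 4 genes, 3 clusters of 4 cells each.
profiles = rng.uniform(0.0, 10.0, size=(12, 4))
labels = np.repeat([0, 1, 2], 4)

C_sim = simulate_concentrations(k=3, n=10, rng=rng)
X_sim = simulate_mixtures(profiles, labels, C_sim, cells_per_mixture=100, rng=rng)
print(C_sim.sum(axis=0))
print(X_sim.shape)
```

Because the counts are rounded, the realized proportions of each simulated mixture differ slightly from the corresponding column of `C_sim`.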
Data: RT-qPCR
• CT values are the cycle at which a gene was detected
• Relative normalization to housekeeping genes
• Housekeeping genes
  • gapdh, bactin1
  • geometric mean (Vandesompele, 2002)
• dCT(x) = geometric mean - CT(x)
• expression(x) = 2^dCT(x)
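The normalization above can be written out as a short worked example using only the standard library. The CT values and the target gene names `geneA` and `geneB` are hypothetical; only the housekeeping gene names come from the slide.

```python
from statistics import geometric_mean

def relative_expression(ct, housekeeping=("gapdh", "bactin1")):
    """Return 2^dCT for each non-housekeeping gene.

    dCT(x) = geometric mean of housekeeping CT values - CT(x)
    """
    reference = geometric_mean([ct[g] for g in housekeeping])
    return {
        gene: 2.0 ** (reference - value)
        for gene, value in ct.items()
        if gene not in housekeeping
    }

# Hypothetical CT values for one cell.
ct_values = {"gapdh": 16.0, "bactin1": 25.0, "geneA": 18.0, "geneB": 22.0}

expression = relative_expression(ct_values)
print(expression)
```

Here the reference is the geometric mean of 16 and 25, which is 20, so a gene detected two cycles earlier than the reference gets a relative expression of about 4 and one detected two cycles later gets about 0.25.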
Accuracy of Inferred Mixing Proportions
Concentration Matrix: Concordance
Concentration by # Genes: Random
Concentration by # Genes: Ranked
Leave-one-out: Concentration: 50 mixtures
[Figure: RMSE (2^dCT) by missing gene]
Leave-one-out: Signature: 10 mixtures
[Figure: RMSE (2^dCT) by missing gene]
Leave-one-out: Signature: 50 mixtures
[Figure: RMSE (2^dCT) by missing gene]
Future Work
• Bootstrapping to report a confidence interval for each estimated concentration and signature
  • Show correlation between large CI and poor accuracy
• Mixing of heterogeneous technologies
  • qPCR for single cells, RNA-seq for mixtures
  • Normalization (needs to be linear)
• Whole-genome scale
  • # genes to estimate 10,000+ signatures
  • Data!
Conclusion
Special Thanks to:
• Ion Mandoiu
• Craig Nelson
• Caroline Jakuba
• Mathew Gajdosik