
Katharina Jahn, ETH Zurich, Basel With Jack Kuipers and Niko Beerenwinkel

September 12, 2016

Reconstructing mutation histories from single-cell data


Intra-tumour heterogeneity

[Figure: heterogeneous tumour, clonal expansion over time, and the resulting mutation tree]

§  Bulk sequencing data
   §  Mixture of hundreds of thousands of cells
   §  Deconvolution of admixed mutation profiles
   §  Limited resolution: no low-frequency subclones, limited #subclones
   §  Most data is of this type

§  Single-cell sequencing data
   §  No deconvolution necessary
   §  Higher error rates
   §  Subsampling
   §  Few data sets available

Why single-cell data?

§  Infinite sites assumption: no recurrent mutations, no back mutations

Single-cell phylogenies

[Figure: three tree representations: cell lineage tree, mutation tree, mutation tree with samples attached]

§  Mutation matrix: binary character state matrix

§  True matrix E forms perfect phylogeny

§  Observed matrix D contains noise

§  FN rate between 10% and 45%


Single-cell mutation matrices

True matrix E (rows: mutations, columns: cells)

1 1 1 1 1 1 1
1 1 1 1 1 1 1
1 1 1 0 0 0 0
0 0 0 0 1 1 1
0 0 0 0 1 1 0

Observed matrix D (rows: mutations, columns: cells; "-" marks a missing entry)

1 1 - 1 1 - 1
1 1 1 1 0 1 1
1 1 1 0 0 0 0
1 0 0 0 1 1 1
0 0 - 0 0 1 0
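The claim that E forms a perfect phylogeny while D does not can be checked with the classic pairwise test: for any two mutations, the sets of cells carrying them must be nested or disjoint. The sketch below assumes rows are mutations and columns are cells, and that missing entries ("-", encoded here as `None`) count as absent. The function name is hypothetical and not part of SCITE.

```python
from itertools import combinations

def is_perfect_phylogeny(matrix):
    """Pairwise test: cell sets of any two mutations are nested or disjoint.

    matrix[i][j] == 1 means mutation i is present in cell j. Any other value
    (0, or None for a missing entry) is treated as absent, which is an
    assumption made for this sketch.
    """
    cell_sets = [{j for j, x in enumerate(row) if x == 1} for row in matrix]
    for a, b in combinations(cell_sets, 2):
        if a & b and not (a <= b or b <= a):
            return False  # the two mutations overlap without nesting
    return True

# Matrices from the slide (rows: mutations, columns: cells).
E = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 0],
]
D = [
    [1, 1, None, 1, 1, None, 1],
    [1, 1, 1, 1, 0, 1, 1],
    [1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 1, 1],
    [0, 0, None, 0, 0, 1, 0],
]

print(is_perfect_phylogeny(E))  # True
print(is_perfect_phylogeny(D))  # False
```

With missing data the test is only a heuristic, since a missing entry could hide either state.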


Error model

§  Model (T, σ, θ)
   §  Mutation tree T
   §  Attachment of samples σ
   §  Error rates θ = (α, β)

§  Basic assumptions
   §  Infinite sites
   §  Independence of observational errors

Model for Learning Mutation Histories

[Figure: example mutation tree with root R, mutations M1 to M3, and attached samples s1 to s7]
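Under the two basic assumptions (infinite sites, independent observational errors), an observed matrix D can be simulated from a true matrix E by flipping each entry independently. The following is a minimal sketch, assuming α is the false positive rate, β the false negative rate, and an extra `missing` rate for dropped entries. The function and parameter names are illustrative and not taken from SCITE.

```python
import random

def add_noise(E, alpha, beta, missing=0.0, seed=None):
    """Return a noisy copy of the true mutation matrix E.

    Each entry is perturbed independently:
      - with probability `missing` it becomes None (missing value),
      - a true 0 is observed as 1 with probability alpha (false positive),
      - a true 1 is observed as 0 with probability beta (false negative).
    """
    rng = random.Random(seed)
    D = []
    for row in E:
        noisy_row = []
        for true_state in row:
            if rng.random() < missing:
                noisy_row.append(None)
            elif true_state == 0:
                noisy_row.append(1 if rng.random() < alpha else 0)
            else:
                noisy_row.append(0 if rng.random() < beta else 1)
        D.append(noisy_row)
    return D
```

For example, `add_noise(E, alpha=1e-5, beta=0.1, missing=0.02, seed=1)` would produce one simulated data set in the error regime discussed in these slides.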

§  Given mutation matrix D for n mutations and m samples

§  Likelihood

§  Posterior


Model for Learning Mutation Histories
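The likelihood and posterior formulas of this slide did not survive extraction. As a hedged reconstruction in the notation used above (an assumption based on the standard form of this error model, with E_ij denoting the true state of mutation i in sample j implied by T and σ, and with α and β as the false positive and false negative rates), they can be written as:

```latex
\begin{align*}
P(D_{ij}=1 \mid E_{ij}=0) &= \alpha,   & P(D_{ij}=0 \mid E_{ij}=0) &= 1-\alpha,\\
P(D_{ij}=0 \mid E_{ij}=1) &= \beta,    & P(D_{ij}=1 \mid E_{ij}=1) &= 1-\beta,\\[4pt]
P(D \mid T,\sigma,\theta) &= \prod_{i=1}^{n}\prod_{j=1}^{m} P(D_{ij} \mid E_{ij}),\\
P(T,\sigma,\theta \mid D) &\propto P(D \mid T,\sigma,\theta)\,P(T,\sigma,\theta).
\end{align*}
```

The product form follows directly from the assumed independence of observational errors.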


Marginalization of sample attachment


O(mn)
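One way to reach the O(mn) cost is to score all attachment points of a sample in a single tree traversal: moving a sample from a node to one of its children changes the expected state of exactly one mutation. The sketch below assumes a uniform prior over the n + 1 attachment points (root plus n mutation nodes), error rates strictly between 0 and 1, and a tree stored as a parent vector. These choices and all names are illustrative, not SCITE's actual implementation.

```python
import math

def marginal_log_likelihood(parent, D, alpha, beta):
    """log P(D | T, theta) with the sample attachment summed out.

    parent[i] is the parent mutation of mutation i, or -1 if i hangs off the root.
    D[i][j] is the observed state (0, 1, or None if missing) of mutation i in sample j.
    """
    n = len(parent)
    m = len(D[0])

    def logp(obs, true_state):
        if obs is None:          # missing entries carry no information
            return 0.0
        if true_state == 0:
            return math.log(alpha if obs == 1 else 1.0 - alpha)
        return math.log(1.0 - beta if obs == 1 else beta)

    # Visit order in which every node comes after its parent.
    children = [[] for _ in range(n)]
    stack = []
    for v, p in enumerate(parent):
        if p < 0:
            stack.append(v)
        else:
            children[p].append(v)
    order = []
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])

    total = 0.0
    for j in range(m):
        # Attached to the root: the sample carries no mutation.
        root_score = sum(logp(D[i][j], 0) for i in range(n))
        score = [0.0] * n
        for v in order:
            base = root_score if parent[v] < 0 else score[parent[v]]
            # Moving down to v switches only mutation v from absent to present.
            score[v] = base + logp(D[v][j], 1) - logp(D[v][j], 0)
        all_scores = [root_score] + score
        top = max(all_scores)    # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(s - top) for s in all_scores))
        total -= math.log(n + 1)  # uniform attachment prior (assumption)
    return total
```

Each sample costs O(n), so scoring a tree costs O(mn) overall, and the sample attachments no longer appear in the search space.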

§  For n mutations and m samples

§  After marginalization

§  Independent of number of samples


Search Space Size
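The counts on this slide were lost in extraction. A plausible reconstruction, stated here as an assumption, uses Cayley's formula: there are (n+1)^(n-1) rooted trees on a fixed root plus n labelled mutations, and each of the m samples can attach to any of the n + 1 nodes. For n = 11 the first count is about 61.9 billion, which agrees with the "61 billion trees" quoted later in the slides.

```python
def num_mutation_trees(n):
    """Number of mutation trees on n labelled mutations plus a fixed root
    (Cayley's formula for labelled trees on n + 1 nodes)."""
    return (n + 1) ** (n - 1)

def num_trees_with_attachments(n, m):
    """Size of the joint space of mutation trees and sample attachments:
    each of the m samples attaches to one of the n + 1 nodes."""
    return num_mutation_trees(n) * (n + 1) ** m

print(num_mutation_trees(11))             # 61917364224
print(num_trees_with_attachments(11, 5))  # grows exponentially in m
```

After marginalizing the attachments only the first factor remains, which is why the search space no longer depends on the number of samples.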

§  Moves in joint (T,θ) space:

§  Transition probability:

§  Acceptance probability:

§  Ergodic mixture of moves

§  Markov chain converges to posterior distribution

MCMC Scheme
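The transition and acceptance formulas were lost in extraction. For reference, the standard Metropolis-Hastings acceptance probability for a move from (T, θ) to (T', θ') with proposal density q is given below. Whether the slide used exactly this notation is an assumption.

```latex
\rho = \min\left\{ 1,\;
  \frac{q(T,\theta \mid T',\theta')\; P(T',\theta' \mid D)}
       {q(T',\theta' \mid T,\theta)\; P(T,\theta \mid D)} \right\}
```

For symmetric proposals the q terms cancel and only the posterior ratio remains.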

§  Change (T, θ) component-wise
   §  Tree moves: e.g. prune & reattach
   §  θ-moves: Gaussian random walk

MCMC moves
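The two move types can be sketched as follows. This is an illustrative sketch, assuming the tree is stored as a parent vector (-1 marks the root) and that both proposals are symmetric, so the acceptance step only needs the posterior ratio. All function names are hypothetical and not SCITE's.

```python
import math
import random

def subtree(parent, v):
    """Return v together with all of its descendants."""
    children = [[] for _ in parent]
    for i, p in enumerate(parent):
        if p >= 0:
            children[p].append(i)
    nodes, stack = set(), [v]
    while stack:
        u = stack.pop()
        nodes.add(u)
        stack.extend(children[u])
    return nodes

def prune_and_reattach(parent, rng):
    """Cut the subtree rooted at a random node and reattach it to a random
    node outside that subtree (or to the root). The number of valid targets
    is the same before and after the move, so the proposal is symmetric."""
    n = len(parent)
    v = rng.randrange(n)
    cut = subtree(parent, v)
    targets = [-1] + [u for u in range(n) if u not in cut]
    proposal = list(parent)
    proposal[v] = rng.choice(targets)
    return proposal

def propose_error_rate(rate, sd, rng):
    """Gaussian random-walk proposal. Values outside (0, 1) should receive
    zero posterior (log posterior of -inf) and are then always rejected."""
    return rate + rng.gauss(0.0, sd)

def accept(log_post_old, log_post_new, rng):
    """Metropolis acceptance for a symmetric proposal."""
    return rng.random() < math.exp(min(0.0, log_post_new - log_post_old))
```

Mixing the move types with fixed probabilities, for example tree moves most of the time and θ-moves occasionally, is one simple way to obtain the ergodic mixture of moves mentioned above.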

§  Maximum a posteriori

§  Maximum likelihood


Point estimates
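The definitions of the two point estimates were lost in extraction. In generic form they are the maximizers of the posterior and of the likelihood. Which variables each maximum is taken over (here: tree and error rates for the MAP estimate, tree and attachments at fixed θ for the ML estimate) is an assumption of this sketch.

```latex
\begin{align*}
(T,\theta)_{\mathrm{MAP}} &= \operatorname*{arg\,max}_{(T,\theta)} \; P(T,\theta \mid D),\\
(T,\sigma)_{\mathrm{ML}}  &= \operatorname*{arg\,max}_{(T,\sigma)} \; P(D \mid T,\sigma,\theta).
\end{align*}
```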

§  Current datasets: often few samples, high error rates

§  Flat posterior, global optimum hard to find

§  #mutations > #samples

§  Idea: Use binary cell lineage trees

Alternative tree representation

#binary leaf-labeled trees with m leaves

#mutation trees with n mutations

§  For ML trees only
§  Tree scoring still in O(mn)

Alternative tree representation

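[Figure: one history for 11 mutations (M1 to M11) and 5 samples (s1 to s5) in both representations. Mutation tree: 61 billion trees in the search space. Binary cell lineage tree with mutations placed on edges: 180 trees in the search space.]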

§  Accuracy vs. #samples
   §  ML trees
   §  40 mutations
   §  False positive rate α = 10⁻⁵

§  Accuracy vs. missing data
   §  ML trees
   §  20 mutations
   §  False negative rate β = 0.1

Evaluation of SCITE on simulated data

[Figure: accuracy plots; x-axis: #samples]

§  20 mutations, ML trees

Effect of doublet samples

§  SCITE, Jahn, Kuipers et al., 2016
§  KS, Kim & Simon, 2014
§  BP, BitPhylogeny, Yuan et al., 2015
§  PW, PhyloWGS, Deshwar et al., 2015
§  AT, AncesTree, El-Kebir et al., 2015

Comparison to previous approaches

Δd = normalized consensus node-based shortest path distance (Yuan et al. 2015)

n=20

§  Wang et al., Nature 2014
   §  nuc-seq of 47 cells
   §  40 mutations
   §  1.4% missing data
   §  α = 1.24 × 10⁻⁶
   §  β = 0.097

ER+ breast tumor

[Figure: ML mutation tree for the ER+ breast tumor, showing the inferred order of the mutations (e.g. CASP3, PIK3CA, FGFR2, TRIB2) with samples s0 to s46 attached]

§  ML tree


ER+ breast tumor

§  Sampling from posterior: false negative rate

§  Mean β more than twice the rate estimated by Wang et al.
§  False negative rate > allelic drop-out rate

ER+ breast tumor

§  Sampling from posterior: branchiness


ER+ breast tumor

§  Hou et al., Cell 2012
   §  WES of 58 cancer cells
   §  18 selected mutations
   §  45% missing data
   §  α = 6.04 × 10⁻⁵, β = 0.43

Myeloproliferative neoplasm

§  MAP tree

§  Sampling from posterior distribution: branchiness

[Figure: MAP tree over the 18 selected mutations. Node labels as extracted: DNAJC17, ABCB5, SESN2, PDE4DIP, DLEC1, NTRK1, DMXL1, TOP1MT, ST13, ANAPC1, ARHGAP5, ASNS, MLL3, FAM115C, RETSAT, USP32, FRG1, PABPC1]

§  Sampling from posterior distribution: false negative rate

§  ML tree from larger set of mutations
§  78 mutations, 58 samples
§  Search performed in binary tree representation
§  Same overall structure
§  Order changes a bit
§  But determined by few samples

§  SCITE: Single-cell based inference of tumor evolution, https://github.com/cbg-ethz/SCITE
§  Genome Biology 2016 17:86 (Special Issue: Single-Cell Omics)
§  Robust against various types of noise
§  Posterior computation scales linearly with #samples
§  Search space size independent of #samples
§  Many mutations, few cells? SCITE on binary phylogenies
§  Observation: no branchings in upper part of tree

Conclusion

§  Testing infinite sites assumption (Jack’s talk)
§  Connect with spatial information (Mykola’s talk)
§  Modeling of doublets (Jack’s talk)
§  Joint use of bulk and single-cell data
§  Use of variant allele frequencies as data
§  Integration of copy number changes

Outlook

§  Jack Kuipers
§  Niko Beerenwinkel

Acknowledgements

Thank you for your attention!

Comparison to previous approaches

Δd = normalized consensus node-based shortest path distance (Yuan et al. 2015)

Modified from https://scientificbsides.files.wordpress.com/2015/02/comparingclonaltrees-idea2.png?w=1500&h=1163
