Katharina Jahn, ETH Zurich, Basel With Jack Kuipers and Niko Beerenwinkel September 12, 2016 Reconstructing mutation histories from single-cell data


  • Katharina Jahn, ETH Zurich, Basel With Jack Kuipers and Niko Beerenwinkel

    September 12, 2016

    Reconstructing mutation histories from single-cell data

  • Intra-tumour heterogeneity

    [Figure: a heterogeneous tumour, its clonal expansion over time, and the corresponding mutation tree]

  • Why single-cell data?

    §  Bulk sequencing data
        §  Mixture of hundreds of thousands of cells
        §  Deconvolution of admixed mutation profiles
        §  Limited resolution: no low-frequency subclones, limited #subclones
        §  Most data is of this type

    §  Single-cell sequencing data
        §  No deconvolution necessary
        §  Higher error rates
        §  Subsampling
        §  Few data sets available

  • Single-cell phylogenies

    §  Infinite sites assumption: no recurrent mutations, no back mutations

    [Figure: a cell lineage tree, the corresponding mutation tree, and the mutation tree with samples attached]

  • Single-cell mutation matrices

    §  Mutation matrix: binary character state matrix
    §  True matrix E forms a perfect phylogeny
    §  Observed matrix D contains noise
    §  FN rate between 10% and 45%

    True matrix E (axes labelled "cells" and "mutations"):

    1 1 1 1 1 1 1
    1 1 1 1 1 1 1
    1 1 1 0 0 0 0
    0 0 0 0 1 1 1
    0 0 0 0 1 1 0

    Observed matrix D ("-" marks a missing entry):

    1 1 - 1 1 - 1
    1 1 1 1 0 1 1
    1 1 1 0 0 0 0
    1 0 0 0 1 1 1
    0 0 - 0 0 1 0
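The statement that E forms a perfect phylogeny while D does not can be checked with a small sketch. It assumes that rows are mutations and columns are cells, and that the noise-free matrix on the slide is E and the one with missing entries is D. The function name and the use of `None` for a missing entry are illustrative choices, not part of SCITE.

```python
from itertools import combinations

def is_perfect_phylogeny(matrix):
    """Return True if every pair of mutation rows is nested or disjoint.

    Assumption: rows are mutations, columns are cells. Entries other
    than 1 (including None for missing data) count as "not observed".
    """
    # Each row becomes the set of cells (column indices) carrying that mutation.
    carriers = [{j for j, x in enumerate(row) if x == 1} for row in matrix]
    for a, b in combinations(carriers, 2):
        overlap = a & b
        # A conflict is a partial overlap: shared cells, but neither set contains the other.
        if overlap and overlap != a and overlap != b:
            return False
    return True

# Matrices copied from the slide; None stands for a missing entry ("-").
E = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 0],
]
D = [
    [1, 1, None, 1, 1, None, 1],
    [1, 1, 1, 1, 0, 1, 1],
    [1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 1, 1],
    [0, 0, None, 0, 0, 1, 0],
]

print(is_perfect_phylogeny(E))
print(is_perfect_phylogeny(D))
```

The check relies on the infinite sites assumption: without recurrent or back mutations, the cells carrying any two mutations are either nested or disjoint.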

  • Error model
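The content of this slide did not survive extraction. A plausible reading, assuming α is the false positive rate and β the false negative rate as used on the later slides, and writing D_ij and E_ij for the observed and true state of mutation i in sample j (the index notation is an assumption), is:

```latex
\begin{aligned}
P(D_{ij} = 1 \mid E_{ij} = 0) &= \alpha,     & P(D_{ij} = 0 \mid E_{ij} = 0) &= 1 - \alpha, \\
P(D_{ij} = 0 \mid E_{ij} = 1) &= \beta,      & P(D_{ij} = 1 \mid E_{ij} = 1) &= 1 - \beta.
\end{aligned}
```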

  • Model for Learning Mutation Histories

    §  Model (T, σ, θ)
        §  Mutation tree T
        §  Attachment of samples σ
        §  Error rates θ = (α, β)

    §  Basic assumptions
        §  Infinite sites
        §  Independence of observational errors

    [Figure: mutation tree with root R and mutations M1, M2, M3, with samples s1 to s7 attached]

  • Model for Learning Mutation Histories

    §  Given mutation matrix D for n mutations and m samples
    §  Likelihood
    §  Posterior
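The two formulas on this slide were lost in extraction. Under the stated assumption of independent observational errors, they would take roughly the following form, where E is the true matrix implied by the tree T and the attachment σ; the index notation and the generic prior term are assumptions:

```latex
P(D \mid T, \sigma, \theta) = \prod_{i=1}^{n} \prod_{j=1}^{m} P(D_{ij} \mid E_{ij}),
\qquad
P(T, \sigma, \theta \mid D) \propto P(D \mid T, \sigma, \theta)\, P(T, \sigma, \theta).
```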

  • Marginalization of sample attachment

    O(mn)
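One way to reach the O(mn) bound is sketched below; it is not SCITE's actual code. The assumptions are a parent-vector tree encoding (node n is the root), `None` for missing entries, a uniform prior over attachment points (dropped as a constant), and nonzero error rates. For each sample, the score of attaching to a node is derived from the score of its parent in constant time, so all n + 1 attachment points cost O(n) per sample.

```python
import math

def log_p(obs, true, alpha, beta):
    """Log-probability of one observed entry given the true state."""
    if obs is None:  # missing entry: contributes no information
        return 0.0
    if true == 0:
        return math.log(alpha) if obs == 1 else math.log(1.0 - alpha)
    return math.log(beta) if obs == 0 else math.log(1.0 - beta)

def top_down_order(parent):
    """Mutation nodes ordered so that every parent precedes its children."""
    n = len(parent)
    children = [[] for _ in range(n + 1)]
    for v, p in enumerate(parent):
        children[p].append(v)
    order, stack = [], [n]  # start at the root, node n
    while stack:
        u = stack.pop()
        for c in children[u]:
            order.append(c)
            stack.append(c)
    return order

def marginal_log_likelihood(D, parent, alpha, beta):
    """log P(D | T, theta) with each sample's attachment summed out.

    D[i][j] is the observation for mutation i in sample j.
    parent[i] is the parent of mutation i; the root has index n.
    """
    n, m = len(parent), len(D[0])
    order = top_down_order(parent)
    total = 0.0
    for j in range(m):
        score = [0.0] * (n + 1)
        # Attached to the root, the sample carries no mutation at all.
        score[n] = sum(log_p(D[i][j], 0, alpha, beta) for i in range(n))
        for v in order:
            # Moving one step down the tree switches exactly mutation v to "present".
            score[v] = (score[parent[v]]
                        + log_p(D[v][j], 1, alpha, beta)
                        - log_p(D[v][j], 0, alpha, beta))
        peak = max(score)  # log-sum-exp over all attachment points
        total += peak + math.log(sum(math.exp(s - peak) for s in score))
    return total
```

Because the attachment is summed out, the remaining search only has to move through trees and error rates.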

  • Search Space Size

    §  For n mutations and m samples
    §  After marginalization
    §  Independent of number of samples
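The formulas on this slide were not recovered. A sketch of the likely counts, assuming Cayley's formula for rooted trees on n labelled mutations plus a fixed root, and assuming each sample can attach to any of the n + 1 nodes (function names are illustrative):

```python
def mutation_tree_count(n):
    """Rooted trees on n labelled mutations plus a fixed root: (n+1)^(n-1)."""
    return (n + 1) ** (n - 1)

def joint_space_size(n, m):
    """Trees times attachments: each of m samples picks one of n+1 nodes."""
    return mutation_tree_count(n) * (n + 1) ** m

print(mutation_tree_count(11))   # 61917364224
print(joint_space_size(11, 5))
```

For 11 mutations the tree count is about 61.9 billion, which is consistent with the "61 billion trees" caption on the later slide about the alternative tree representation.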

  • MCMC Scheme

    §  Moves in joint (T, θ) space:
    §  Transition probability:
    §  Acceptance probability:
    §  Ergodic mixture of moves
    §  Markov chain converges to posterior distribution
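The transition and acceptance formulas are missing from the transcript. Assuming the scheme is standard Metropolis-Hastings with proposal density q (a symbol introduced here for illustration), the acceptance probability for a move from (T, θ) to (T', θ') would read:

```latex
\rho = \min\left\{ 1,\;
\frac{q(T, \theta \mid T', \theta')\, P(D \mid T', \theta')\, P(T', \theta')}
     {q(T', \theta' \mid T, \theta)\, P(D \mid T, \theta)\, P(T, \theta)} \right\}
```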

  • MCMC moves

    §  Change (T, θ) component-wise
    §  Tree moves: e.g. prune & reattach
    §  θ-moves: Gaussian random walk
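The two move types can be sketched as follows. This is a minimal illustration, not SCITE's implementation: the parent-vector encoding (node n is the root), the function names, and the handling of out-of-range proposals for β are all assumptions.

```python
import random

def descendants(parent, v):
    """All nodes in the subtree rooted at v, including v itself."""
    n = len(parent)
    children = [[] for _ in range(n + 1)]
    for c, p in enumerate(parent):
        children[p].append(c)
    seen, stack = {v}, [v]
    while stack:
        for c in children[stack.pop()]:
            seen.add(c)
            stack.append(c)
    return seen

def prune_and_reattach(parent, rng):
    """Move a random subtree below a random node outside that subtree."""
    n = len(parent)
    v = rng.randrange(n)                 # mutation whose subtree is pruned
    subtree = descendants(parent, v)
    # Any node outside the pruned subtree is a legal new parent; the root (n) always is.
    targets = [u for u in range(n + 1) if u not in subtree]
    proposal = list(parent)
    proposal[v] = rng.choice(targets)
    return proposal

def propose_beta(beta, sd, rng):
    """Gaussian random-walk step on beta.

    An out-of-range proposal is treated as rejected, so the current
    value is returned unchanged.
    """
    x = beta + rng.gauss(0.0, sd)
    return x if 0.0 < x < 1.0 else beta

rng = random.Random(0)
print(prune_and_reattach([3, 0, 0], rng))
print(propose_beta(0.1, 0.05, rng))
```

Restricting the new parent to nodes outside the pruned subtree is what keeps the proposal a valid tree.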

  • Point estimates

    §  Maximum a posteriori
    §  Maximum likelihood
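The definitions themselves are missing from the transcript. Assuming the usual meaning of the two terms, and leaving open whether θ is optimised or held fixed in the ML case (the slide does not say), they could be written as:

```latex
(T, \theta)_{\mathrm{MAP}} = \operatorname*{arg\,max}_{(T, \theta)} P(T, \theta \mid D),
\qquad
(T, \sigma)_{\mathrm{ML}} = \operatorname*{arg\,max}_{(T, \sigma)} P(D \mid T, \sigma, \theta).
```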

  • Alternative tree representation

    §  Current datasets: often few samples, high error rates
    §  Flat posterior, global optimum hard to find
    §  #mutations > #samples
    §  Idea: Use binary cell lineage trees

    [Formula not recovered; it relates #binary leaf-labeled trees with m leaves to #mutation trees with n mutations]

  • Alternative tree representation

    §  For ML trees only
    §  Tree scoring still in O(mn)

    [Figure: two trees over mutations M1 to M11 and samples s1 to s5, one a binary cell lineage tree and one a mutation tree with samples attached; captions: "61 billion trees" and "180 trees"]

  • Evaluation of SCITE on simulated data

    §  Accuracy vs. #samples
        §  ML trees
        §  40 mutations
        §  False positive rate α = 10⁻⁵

    §  Accuracy vs. missing data
        §  ML trees
        §  20 mutations
        §  False negative rate β = 0.1

    [Plots not recovered; axis label: #samples]

  • Effect of doublet samples

    §  20 mutations, ML trees

  • Comparison to previous approaches

    §  SCITE, Jahn, Kuipers et al., 2016
    §  KS, Kim & Simon, 2014
    §  BP, BitPhylogeny, Yuan et al., 2015
    §  PW, PhyloWGS, Deshwar et al., 2015
    §  AT, AncesTree, El-Kebir et al., 2015

    Δd = normalized consensus node-based shortest path distance (Yuan et al. 2015)

    n = 20

  • ER+ breast tumor

    §  Wang et al., Nature 2014
    §  nuc-seq of 47 cells
    §  40 mutations
    §  1.4% missing data
    §  α = 1.24 × 10⁻⁶
    §  β = 0.097

  • ER+ breast tumor

    §  ML tree

    [Figure: ML mutation tree with samples s0 to s46 attached; legible node labels include CASP3, PIK3CA, FBN2, PRDM9, TRIM58, FGFR2 and TRIB2, alongside nodes labelled dummy0 to dummy10]

  • ER+ breast tumor

    §  Sampling from posterior: false negative rate
    §  Mean β more than twice the rate estimated by Wang et al.
    §  False negative rate > allelic drop-out rate

  • ER+ breast tumor

    §  Sampling from posterior: branchiness

  • Myeloproliferative neoplasm

    §  Hou et al., Cell 2012
    §  WES of 58 cancer cells
    §  18 selected mutations
    §  45% missing data
    §  α = 6.04 × 10⁻⁵, β = 0.43

  • Myeloproliferative neoplasm

    §  MAP tree
    §  Sampling from posterior distribution: branchiness

    [Figure: MAP tree; node labels in extraction order: DNAJC17, ABCB5, SESN2, PDE4DIP, DLEC1, NTRK1, DMXL1, TOP1MT, ST13, ANAPC1, ARHGAP5, ASNS, MLL3, FAM115C, RETSAT, USP32, FRG1, PABPC1]

  • Myeloproliferative neoplasm

    §  Sampling from posterior distribution: false negative rate

    [Plot not recovered; marked value: MAP]

  • Myeloproliferative neoplasm

    §  ML tree from larger set of mutations
    §  78 mutations, 58 samples
    §  Search performed in binary tree representation
    §  Same overall structure
    §  Order changes a bit
    §  But determined by few samples

  • Conclusion

    §  SCITE: Single-cell based inference of tumor evolution https://github.com/cbg-ethz/SCITE
    §  Genome Biology 2016 17:86 (Special Issue: Single-Cell Omics)
    §  Robust against various types of noise
    §  Posterior computation scales linearly with #samples
    §  Search space size independent of #samples
    §  Many mutations, few cells? SCITE on binary phylogenies
    §  Observation: no branchings in upper part of tree

  • Outlook

    §  Testing infinite sites assumption (Jack’s talk)
    §  Connect with spatial information (Mykola’s talk)
    §  Modeling of doublets (Jack’s talk)
    §  Joint use of bulk and single-cell data
    §  Use of variant allele frequencies as data
    §  Integration of copy number changes

  • Acknowledgements

    §  Jack Kuipers
    §  Niko Beerenwinkel

    Thank you for your attention!

  • Comparison to previous approaches

    Δd = normalized consensus node-based shortest path distance (Yuan et al. 2015)

    Modified from https://scientificbsides.files.wordpress.com/2015/02/comparingclonaltrees-idea2.png?w=1500&h=1163