imputation-based local ancestry inference in admixed populations

Imputation-based local ancestry inference in admixed populations Ion Mandoiu Computer Science and Engineering Department University of Connecticut Joint work with J. Kennedy and B. Pasaniuc


Post on 21-Jan-2016



Page 1: Imputation-based local ancestry inference in admixed populations

Imputation-based local

ancestry inference in admixed

populations

Ion Mandoiu

Computer Science and Engineering Department

University of Connecticut

Joint work with J. Kennedy and B. Pasaniuc

Page 2: Imputation-based local ancestry inference in admixed populations

Outline

Motivation and problem definition

Factorial HMM model of genotype data

Algorithms for genotype imputation and ancestry inference

Preliminary experimental results

Summary and ongoing work

Page 3: Imputation-based local ancestry inference in admixed populations

Population admixture

http://www.garlandscience.co.uk/textbooks/0815341857.asp?type=resources

Page 4: Imputation-based local ancestry inference in admixed populations

Admixture mapping

Patterson et al, AJHG 74:979-1000, 2004

Page 5: Imputation-based local ancestry inference in admixed populations

Local ancestry inference problem

Given:
• Reference haplotypes for ancestral populations P1,…,Pn
• Whole-genome SNP genotype data for extant individual

Find: Allele ancestries at each locus

SNP genotypes (input):

rs11095710 T T
rs11117179 C T
rs11800791 G G
rs11578310 G G
rs1187611 G G
rs11804808 C C
rs17471518 A G
...

Reference haplotypes (input):

[Figure: panels of reference haplotypes, one binary allele string per row with ? marking a missing allele, e.g. 1110001?0100110010011001111101110111?1111110111000]

Inferred local ancestry (output):

rs11095710 P1 P1
rs11117179 P1 P1
rs11800791 P1 P1
rs11578310 P1 P2
rs1187611 P1 P2
rs11804808 P1 P2
rs17471518 P1 P2
...

Page 6: Imputation-based local ancestry inference in admixed populations

Previous work

• MANY methods: ancestry inference at different granularities, assuming different amounts of info about genetic makeup of ancestral populations
• Two main classes
  • HMM-based: SABER [Tang et al 06], SWITCH [Sankararaman et al 08a], HAPAA [Sundquist et al. 08], …
  • Window-based: LAMP [Sankararaman et al 08b], WINPOP [Pasaniuc et al. 09]
• Poor accuracy when ancestral populations are closely related (e.g. Japanese and Chinese)
• Methods based on unlinked SNPs outperform methods that model LD!

Page 7: Imputation-based local ancestry inference in admixed populations

Haplotype structure in panmictic populations

Page 8: Imputation-based local ancestry inference in admixed populations

HMM model of haplotype frequencies

Similar models proposed in [Schwartz 04, Rastas et al. 05, Kennedy et al. 07, Kimmel & Shamir 05, Scheet & Stephens 06, …]

Page 9: Imputation-based local ancestry inference in admixed populations

Random variables
• Fi = founder haplotype at locus i, between 1 and K
• Hi = observed allele at locus i

Model training
• Based on haplotypes using the Baum-Welch algorithm, or
• Based on genotypes using EM [Rastas et al. 05]

Given haplotype h, P(H=h|M) can be computed in O(nK^2) using a forward algorithm, where n = #SNPs, K = #founders

Graphical model representation

[Figure: chain of founder states F1 → F2 → … → Fn, each Fi emitting the observed allele Hi]
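The O(nK^2) forward computation of P(H=h|M) mentioned on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the parameter layout (`init`, `trans`, `emit`), the 0/1 allele encoding, and all numbers in the toy model are assumptions made for the example.

```python
def haplotype_likelihood(h, init, trans, emit):
    """Return P(H = h | M) by the forward algorithm in O(n K^2) time.

    Assumed layout (0-based loci, hypothetical names):
      init[k]        : probability that the first founder state is k
      trans[i][a][b] : P(founder b at locus i+1 | founder a at locus i)
      emit[i][k]     : P(allele 1 at locus i | founder k)
    """
    K = len(init)

    def e(i, k):
        # Probability that founder k emits the observed allele h[i].
        return emit[i][k] if h[i] == 1 else 1.0 - emit[i][k]

    alpha = [init[k] * e(0, k) for k in range(K)]
    for i in range(1, len(h)):
        # Each of the K new entries sums over K predecessors: K^2 work per locus.
        alpha = [
            e(i, b) * sum(alpha[a] * trans[i - 1][a][b] for a in range(K))
            for b in range(K)
        ]
    return sum(alpha)


# Toy model with K = 2 founders and n = 3 SNPs (made-up numbers).
init = [0.6, 0.4]
trans = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.4, 0.6]]]
emit = [[0.9, 0.1], [0.2, 0.7], [0.5, 0.95]]
print(haplotype_likelihood([1, 0, 1], init, trans, emit))
```

A quick sanity check for any such model is that the likelihoods of all 2^n haplotypes sum to 1.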

Page 10: Imputation-based local ancestry inference in admixed populations

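Factorial HMM for genotype data in a window with known local ancestry

[Figure: two founder chains F1 → F2 → … → Fn and F'1 → F'2 → … → F'n, emitting alleles H1,…,Hn and H'1,…,H'n; each genotype Gi depends on the allele pair (Hi, H'i)]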
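To make the genotype layer of the factorial HMM concrete, the sketch below computes the probability of a genotype given the two founder states at a locus. It assumes that a genotype is the count of 1-alleles on the two haplotypes and that there is no genotyping error; both are simplifying assumptions for illustration, and the function name is hypothetical.

```python
def genotype_emission(g, p, q):
    """P(G_i = g | f_i, f'_i) under an error-free additive genotype model.

    p = P(H_i = 1 | f_i) and q = P(H'_i = 1 | f'_i) are the allele emission
    probabilities of the two founder states (assumed given).
    g is 0, 1 or 2 (number of 1-alleles), or None for a missing genotype.
    """
    if g is None:
        return 1.0  # a missing genotype carries no information
    if g == 0:
        return (1 - p) * (1 - q)
    if g == 1:
        return p * (1 - q) + (1 - p) * q  # either haplotype carries the 1
    if g == 2:
        return p * q
    raise ValueError("genotype must be 0, 1, 2 or None")


print(genotype_emission(1, 0.9, 0.2))
```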

Page 11: Imputation-based local ancestry inference in admixed populations

HMM Based Genotype Imputation

Probability of missing genotype given the typed genotype data:

$$P(g_i = x \mid g, M) \propto P(g[g_i \leftarrow x] \mid M)$$

g_i is imputed as

$$\arg\max_{x \in \{0,1,2\}} P(g[g_i \leftarrow x] \mid M)$$
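The imputation rule on this slide amounts to scoring the three possible values of the missing genotype and normalizing. A minimal sketch follows; `impute` takes any function returning P(g|M), and `toy_likelihood` is a hypothetical stand-in (independent SNPs in Hardy-Weinberg proportions), not the factorial HMM likelihood used in the talk.

```python
def impute(g, i, likelihood):
    """Impute genotype i of the list g; likelihood(g) should return P(g | M)."""
    # Score each candidate value x by the likelihood of g with g_i set to x.
    scores = [likelihood(g[:i] + [x] + g[i + 1:]) for x in (0, 1, 2)]
    total = sum(scores)
    posterior = [s / total for s in scores]
    best = max((0, 1, 2), key=lambda x: posterior[x])
    return best, posterior


def toy_likelihood(g, freq=(0.9, 0.3)):
    # Hypothetical stand-in for P(g | M): independent SNPs with allele-1
    # frequencies `freq`, genotypes in Hardy-Weinberg proportions.
    p = 1.0
    for x, f in zip(g, freq):
        p *= [(1 - f) ** 2, 2 * f * (1 - f), f ** 2][x]
    return p


best, posterior = impute([None, 1], 0, toy_likelihood)
print(best, posterior)
```

With a real haplotype model the neighboring genotypes would shift the posterior; under the independent toy model they cancel out in the normalization.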

Page 12: Imputation-based local ancestry inference in admixed populations

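Forward-backward computation

[Figure: locus i of the factorial HMM, with founder states fi and f'i, alleles hi and h'i, and genotype gi]

$$P(g \mid M) = \sum_{f_i=1}^{K} \sum_{f'_i=1}^{K} \alpha^i_{f_i,f'_i} \, E^i_{f_i,f'_i}(g_i) \, \beta^i_{f_i,f'_i}$$

where α denotes the forward probability, E the genotype emission probability, and β the backward probability at locus i.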

Page 16: Imputation-based local ancestry inference in admixed populations

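Runtime

Direct recurrences for computing forward probabilities:

$$\alpha^1_{f_1,f'_1} = P(f_1)\,P(f'_1)$$

$$\alpha^i_{f_i,f'_i} = \sum_{f_{i-1}=1}^{K} \sum_{f'_{i-1}=1}^{K} \alpha^{i-1}_{f_{i-1},f'_{i-1}} \, E^{i-1}_{f_{i-1},f'_{i-1}}(g_{i-1}) \, P(f_i \mid f_{i-1}) \, P(f'_i \mid f'_{i-1})$$

Runtime reduced to O(nK^3) by reusing common terms:

$$\alpha^i_{f_i,f'_i} = \sum_{f'_{i-1}=1}^{K} \tilde{\alpha}^i_{f_i,f'_{i-1}} \, P(f'_i \mid f'_{i-1})$$

where

$$\tilde{\alpha}^i_{f_i,f'_{i-1}} = \sum_{f_{i-1}=1}^{K} \alpha^{i-1}_{f_{i-1},f'_{i-1}} \, E^{i-1}_{f_{i-1},f'_{i-1}}(g_{i-1}) \, P(f_i \mid f_{i-1})$$

Here α denotes the forward probability and E the genotype emission probability.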
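The speedup on this slide comes from summing over one founder chain at a time. The sketch below compares a direct forward pass, where each of the K^2 state pairs sums over K^2 predecessor pairs (O(nK^4) overall), with the factored O(nK^3) version. The parameter layout, the error-free genotype emission, and the toy numbers are assumptions for illustration, not the authors' implementation.

```python
def emission(g, p, q):
    # Error-free genotype emission (assumption): g counts the 1-alleles.
    return [(1 - p) * (1 - q), p * (1 - q) + (1 - p) * q, p * q][g]


def weighted(alpha, g_prev, emit_prev, K):
    # Forward value times the emission of the previous genotype.
    return [[alpha[a][b] * emission(g_prev, emit_prev[a], emit_prev[b])
             for b in range(K)] for a in range(K)]


def forward_direct(g, init, trans, emit):
    """Forward table at the last locus, O(n K^4)."""
    K = len(init)
    alpha = [[init[a] * init[b] for b in range(K)] for a in range(K)]
    for i in range(1, len(g)):
        w, t = weighted(alpha, g[i - 1], emit[i - 1], K), trans[i - 1]
        alpha = [[sum(w[a][b] * t[a][c] * t[b][d]
                      for a in range(K) for b in range(K))
                  for d in range(K)] for c in range(K)]
    return alpha


def forward_fast(g, init, trans, emit):
    """Same table in O(n K^3), summing over one chain at a time."""
    K = len(init)
    alpha = [[init[a] * init[b] for b in range(K)] for a in range(K)]
    for i in range(1, len(g)):
        w, t = weighted(alpha, g[i - 1], emit[i - 1], K), trans[i - 1]
        # mid[c][b]: first chain already moved to state c, second still at b.
        mid = [[sum(w[a][b] * t[a][c] for a in range(K))
                for b in range(K)] for c in range(K)]
        alpha = [[sum(mid[c][b] * t[b][d] for b in range(K))
                  for d in range(K)] for c in range(K)]
    return alpha


def genotype_likelihood(g, init, trans, emit, forward=forward_fast):
    """P(g | M): forward table at the last locus times its emission."""
    K, last = len(init), len(g) - 1
    alpha = forward(g, init, trans, emit)
    return sum(alpha[a][b] * emission(g[last], emit[last][a], emit[last][b])
               for a in range(K) for b in range(K))


# Toy model with K = 2 founders and n = 3 SNPs (made-up numbers).
init = [0.6, 0.4]
trans = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.4, 0.6]]]
emit = [[0.9, 0.1], [0.2, 0.7], [0.5, 0.95]]
print(genotype_likelihood([1, 0, 2], init, trans, emit))
```

Both versions use the same convention as the recurrences on this slide: the forward value at a locus excludes that locus's emission, which is applied when moving to the next locus or when summing up the final likelihood.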

Page 17: Imputation-based local ancestry inference in admixed populations

Imputation-based ancestry inference

• View local ancestry inference as a model selection problem
  • Each possible local ancestry defines a factorial HMM
  • Pick model that re-imputes SNPs most accurately around the locus of interest
• Fixed-window version: pick ancestry that maximizes the average posterior probability of true SNP genotypes within a fixed-size window centered at the locus
• Multi-window version: weighted voting over window sizes from 200 to 3000, with window weights proportional to average posterior probabilities
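The two selection rules above can be sketched as follows. The input is assumed to be, for each candidate local ancestry, the posterior probability that its factorial HMM assigns to the true genotype at each SNP when that SNP is re-imputed. The voting rule shown (each window size votes for its best ancestry, weighted by that ancestry's average posterior) is one reading of the slide, and all names and numbers are hypothetical.

```python
def window_average(post, locus, size):
    """Average posterior of the true genotypes in a window centered at locus."""
    half = size // 2
    lo, hi = max(0, locus - half), min(len(post), locus + half + 1)
    return sum(post[lo:hi]) / (hi - lo)


def fixed_window_ancestry(posteriors, locus, size):
    """Pick the ancestry whose model re-imputes the window best."""
    return max(posteriors,
               key=lambda a: window_average(posteriors[a], locus, size))


def multi_window_ancestry(posteriors, locus, sizes):
    """Weighted vote over window sizes (assumed voting scheme, see lead-in)."""
    votes = {a: 0.0 for a in posteriors}
    for size in sizes:
        best = fixed_window_ancestry(posteriors, locus, size)
        votes[best] += window_average(posteriors[best], locus, size)
    return max(votes, key=votes.get)


# Made-up posteriors for two candidate ancestries over five SNPs.
posteriors = {
    "P1/P1": [0.9, 0.9, 0.2, 0.2, 0.2],
    "P1/P2": [0.5, 0.5, 0.8, 0.8, 0.8],
}
print(fixed_window_ancestry(posteriors, 0, 3))
print(multi_window_ancestry(posteriors, 3, [3, 5]))
```

The window sizes here are tiny so the example stays readable; the slide reports sizes from 200 to 3000.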

Page 18: Imputation-based local ancestry inference in admixed populations

HMM imputation accuracy

Missing data rate and accuracy for imputed genotypes at different thresholds (WTCCC 58BC/Hapmap CEU)
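The trade-off this slide refers to can be computed as sketched below: a genotype is called only when its highest posterior reaches a threshold, so raising the threshold leaves more genotypes uncalled but makes the called ones more reliable. The function name and the toy data are hypothetical, and this is an illustration of the metric rather than the evaluation code behind the slide.

```python
def call_stats(posteriors, truth, threshold):
    """Return (missing rate, accuracy among called genotypes)."""
    called = correct = 0
    for post, t in zip(posteriors, truth):
        best = max(range(3), key=lambda x: post[x])
        if post[best] >= threshold:
            called += 1
            correct += (best == t)
    missing_rate = 1 - called / len(truth)
    accuracy = correct / called if called else float("nan")
    return missing_rate, accuracy


# Made-up posteriors over genotypes 0/1/2 for four SNPs, with true genotypes.
posteriors = [[0.98, 0.01, 0.01], [0.4, 0.5, 0.1],
              [0.05, 0.05, 0.9], [0.6, 0.3, 0.1]]
truth = [0, 0, 2, 0]
for threshold in (0.0, 0.9):
    print(threshold, call_stats(posteriors, truth, threshold))
```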

Page 19: Imputation-based local ancestry inference in admixed populations

Window size effect

[Figure: window size effect. Parameters shown: N=2,000, g=7, ?=0.2, n=38,864, r=10^-8]

Page 20: Imputation-based local ancestry inference in admixed populations

Number of founders effect

[Figure: number of founders effect, CEU-JPT. Parameters shown: N=2,000, g=7, ?=0.2, n=38,864, r=10^-8]

Page 21: Imputation-based local ancestry inference in admixed populations

Comparison with other methods

[Figure: comparison with other methods. Parameters shown: N=2,000, g=7, ?=0.2, n=38,864, r=10^-8]

Page 22: Imputation-based local ancestry inference in admixed populations

Summary and ongoing work

• Imputation-based local ancestry inference achieves significant improvement over previous methods for admixtures between close ancestral populations
• Code at http://dna.engr.uconn.edu/software/
• Ongoing work
  • Evaluating accuracy under more realistic admixture scenarios (multiple ancestral populations/gene flow/drift in ancestral populations)
  • Extension to pedigree data
  • Exploiting inferred local ancestry for more accurate untyped SNP imputation and phasing of admixed individuals
  • Extensions to sequencing data
  • Inference of ancestral haplotypes from extant admixed populations

Page 23: Imputation-based local ancestry inference in admixed populations

Untyped SNP imputation accuracy in admixed individuals

[Figure: untyped SNP imputation accuracy. Parameters shown: N=2,000, g=7, ?=0.5, n=38,864, r=10^-8]

Page 24: Imputation-based local ancestry inference in admixed populations

HMM-based phasing

Maximum likelihood genotype phasing: given g, find

$$(h_1, h_2) = \arg\max_{h_1 + h_2 = g} P(h_1 \mid M)\,P(h_2 \mid M)$$

[Figure: factorial HMM with founder chains F1,…,Fn and F'1,…,F'n, alleles H1,…,Hn and H'1,…,H'n, and genotypes G1,…,Gn]
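For very small inputs the maximum likelihood phasing objective can be evaluated exhaustively, which is useful for checking heuristics. The sketch below enumerates all assignments of the heterozygous sites, so it is exponential in their number and is not a practical method. The haplotype HMM layout and the toy numbers are assumptions made for the example.

```python
from itertools import product


def haplotype_likelihood(h, init, trans, emit):
    """P(h | M) by the forward algorithm (assumed HMM parameter layout)."""
    K = len(init)

    def e(i, k):
        return emit[i][k] if h[i] == 1 else 1.0 - emit[i][k]

    alpha = [init[k] * e(0, k) for k in range(K)]
    for i in range(1, len(h)):
        alpha = [e(i, b) * sum(alpha[a] * trans[i - 1][a][b] for a in range(K))
                 for b in range(K)]
    return sum(alpha)


def ml_phasing(g, init, trans, emit):
    """Exhaustive search for (h1, h2) with h1 + h2 = g maximizing P(h1)P(h2)."""
    het = [i for i, x in enumerate(g) if x == 1]
    best, best_p = None, -1.0
    for bits in product((0, 1), repeat=len(het)):
        h1 = [x // 2 for x in g]  # homozygous sites: 0 -> 0, 2 -> 1
        h2 = list(h1)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b  # split each heterozygous site
        p = (haplotype_likelihood(h1, init, trans, emit)
             * haplotype_likelihood(h2, init, trans, emit))
        if p > best_p:
            best, best_p = (h1, h2), p
    return best, best_p


# Toy model: founder 0 carries mostly 1-alleles, founder 1 mostly 0-alleles,
# and founder switches between adjacent SNPs are rare (made-up numbers).
init = [0.5, 0.5]
trans = [[[0.95, 0.05], [0.05, 0.95]]] * 2
emit = [[0.95, 0.05]] * 3
print(ml_phasing([1, 1, 1], init, trans, emit))
```

In this toy model the all-heterozygous genotype is best explained by one haplotype copied from each founder.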

Page 25: Imputation-based local ancestry inference in admixed populations

HMM-based phasing

• Bad news: Cannot approximate $\max_{h_1+h_2=g} P(h_1 \mid M)\,P(h_2 \mid M)$ within a factor of $O(n^{1/2-\varepsilon})$, unless ZPP=NP [KMP08]

• Good news: Viterbi-like heuristics yield phasing accuracy comparable to PHASE in practice [Rastas et al. 05]

Page 26: Imputation-based local ancestry inference in admixed populations

Factorial HMM model for sequencing data

[Figure: factorial HMM with founder chains F1,…,Fn and F'1,…,F'n, alleles H1,…,Hn and H'1,…,H'n, and genotypes G1,…,Gn, extended with read nodes Ri,1,…,Ri,ci at each locus i = 1,…,n]

Page 27: Imputation-based local ancestry inference in admixed populations

Acknowledgments

J. Kennedy and B. Pasaniuc

Work supported in part by NSF awards IIS-0546457 and DBI-0543365.