
Page 1: Summer Institute in Statistical Genetics 2019faculty.washington.edu/tathornt/SISG2019/lectures/SISG... · 2019. 7. 22. · Lecture 5: Population Structure Inference Population Structure

Lecture 5: Population Structure Inference


Instructors: Timothy Thornton and Michael Wu

Summer Institute in Statistical Genetics 2019


Background: Population Structure

- Humans originally spread across the world many thousands of years ago.
- Migration and genetic drift led to genetic diversity between isolated groups.

Image from https://science.education.nih.gov


Population Structure Inference

- Inference on genetic ancestry differences among individuals from different populations, or population structure, has been motivated by a variety of applications:
  - population genetics
  - genetic association studies
  - personalized medicine
  - forensics
- Advancements in array-based genotyping technologies have largely facilitated the investigation of genetic diversity at remarkably high levels of detail.
- A variety of methods have been proposed for the identification of genetic ancestry differences among individuals in a sample using high-density genome-screen data.


Inferring Population Structure with PCA

- Principal Components Analysis (PCA) is the most widely used approach for identifying and adjusting for ancestry differences among sample individuals.
- PCA applied to genotype data can be used to calculate principal components (PCs) that explain differences among the sample individuals in the genetic data.
- The top PCs are viewed as continuous axes of variation that reflect genetic variation due to ancestry in the sample.
- Individuals with "similar" values for a particular top principal component will have similar ancestry along that axis.


Standard Principal Components Analysis (sPCA)

- sPCA is an unsupervised learning tool for dimension reduction in multivariate analysis.
- It is widely used in the genetics community to infer population structure from genetic data.
- The underlying belief is that the top principal components (PCs) will reflect population structure in the sample.
- It is an orthogonal linear transformation to a new coordinate system:
  - it sequentially identifies linear combinations of genetic markers that explain the greatest proportion of variability in the data
  - these define the axes (PCs) of the new coordinate system
  - each individual has a value along each PC
- EIGENSOFT (Price et al. 2006) is a popular implementation of PCA.


Data Structure

- Sample of $n$ individuals, indexed by $i = 1, 2, \ldots, n$.
- Genome-screen data on $m$ autosomal genetic markers, indexed by $l = 1, 2, \ldots, m$.
- At each marker, for each individual, we have a genotype value, $x_{il}$.
- Here we consider SNP data, so $x_{il}$ takes values 0, 1, or 2, corresponding to the number of reference alleles.
- We center and standardize these genotype values:

$$z_{il} = \frac{x_{il} - 2\hat{p}_l}{\sqrt{2\hat{p}_l(1 - \hat{p}_l)}}$$

  where $\hat{p}_l$ is an estimate of the reference allele frequency for marker $l$.
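The standardization above can be sketched in a few lines of NumPy. This is a minimal illustration, not the EIGENSOFT implementation: it assumes complete genotype data (no missing calls), takes the sample allele frequency as the estimate $\hat{p}_l$, and drops monomorphic markers so the denominator is never zero. The function name `standardize_genotypes` and the toy matrix are hypothetical.

```python
import numpy as np

def standardize_genotypes(x):
    """Center and scale an (n, m) matrix of 0/1/2 reference-allele counts.

    Assumption: p_hat is the sample allele frequency (column mean / 2).
    Monomorphic markers (p_hat equal to 0 or 1) are dropped.
    """
    x = np.asarray(x, dtype=float)
    p_hat = x.mean(axis=0) / 2.0
    polymorphic = (p_hat > 0.0) & (p_hat < 1.0)
    x, p_hat = x[:, polymorphic], p_hat[polymorphic]
    z = (x - 2.0 * p_hat) / np.sqrt(2.0 * p_hat * (1.0 - p_hat))
    return z, p_hat

# Toy data: 4 individuals, 3 SNPs; the third SNP is monomorphic and is removed.
x = np.array([[0, 1, 2],
              [1, 1, 2],
              [2, 0, 2],
              [1, 2, 2]])
z, p_hat = standardize_genotypes(x)
print(z.shape)   # two polymorphic SNPs remain
print(p_hat)
```

Because each marker is centered at twice its sample allele frequency, every column of `z` has mean zero.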


Genetic Correlation Estimation

- Create an $n \times m$ matrix, $\mathbf{Z}$, of centered and standardized genotype values, and from this, a genetic correlation matrix (GRM):

$$\hat{\Psi} = \frac{1}{m}\mathbf{Z}\mathbf{Z}^T$$

- $\hat{\Psi}_{ij}$ is an estimate of the genome-wide average genetic correlation between individuals $i$ and $j$.
- sPCA relies on individuals from the same ancestral population being more genetically correlated than individuals from different ancestral populations.
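As a rough sketch of the GRM computation, the snippet below simulates genotypes for a single unstructured population and forms $\hat{\Psi} = \frac{1}{m}\mathbf{Z}\mathbf{Z}^T$. The simulated allele frequencies, sample sizes, and variable names are illustrative assumptions, and $\hat{p}_l$ is taken to be the sample allele frequency.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 500                       # hypothetical sample and marker counts

# Simulated genotypes from one population (illustrative data only).
p = rng.uniform(0.1, 0.9, size=m)
x = rng.binomial(2, p, size=(n, m)).astype(float)

# Standardize, dropping any marker that is monomorphic in the sample.
p_hat = x.mean(axis=0) / 2.0
keep = (p_hat > 0.0) & (p_hat < 1.0)
z = (x[:, keep] - 2.0 * p_hat[keep]) / np.sqrt(2.0 * p_hat[keep] * (1.0 - p_hat[keep]))

# GRM: entry (i, j) averages z_il * z_jl over the retained markers.
grm = z @ z.T / z.shape[1]

print(grm.shape)
print(np.round(np.diag(grm)[:5], 2))   # diagonal entries are typically close to 1
```

Since the columns of `z` are centered at the sample allele frequency, each row of `grm` sums to zero (up to rounding), which is why one eigenvalue of the GRM is always essentially zero.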


Standard Principal Components Analysis (sPCA)

- PCA is performed by obtaining the eigendecomposition of $\hat{\Psi}$.
- Orthogonal axes of variation, i.e. linear combinations of SNPs, that best explain the genotypic variability among the $n$ sample individuals are identified.
- The result is:
  - a set of $n$ eigenvectors of length $n$, $(V_1, V_2, \ldots, V_n)$, where $V_d$ is a column vector of the coordinates of each individual along axis $d$
  - and a corresponding set of $n$ eigenvalues, $(\lambda_1 > \lambda_2 > \ldots > \lambda_n)$, in decreasing order.
- The $d$th principal component (eigenvector) corresponds to eigenvalue $\lambda_d$, where $\lambda_d$ is proportional to the percentage of variability in the genome-screen data that is explained by $V_d$.
- These eigenvectors (PCs) are used as surrogates for population structure.
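The eigendecomposition step can be illustrated on simulated data. The sketch below assumes two hypothetical populations whose allele frequencies differ by 0.3 at every SNP (far more differentiation than is typical, chosen so the structure is unmistakable), builds the GRM, and extracts eigenvectors with `numpy.linalg.eigh`. It is not the EIGENSOFT code path, and all names and parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_pop, m = 20, 2000

# Hypothetical allele frequencies for two strongly differentiated populations.
p_anc = rng.uniform(0.3, 0.7, size=m)
shift = 0.15 * rng.choice([-1.0, 1.0], size=m)
freqs = np.vstack([p_anc + shift, p_anc - shift])     # values stay in [0.15, 0.85]

# First n_per_pop rows come from population 0, the rest from population 1.
x = np.vstack([rng.binomial(2, freqs[k], size=(n_per_pop, m))
               for k in range(2)]).astype(float)

# Standardize with sample allele frequencies and form the GRM.
p_hat = x.mean(axis=0) / 2.0
keep = (p_hat > 0.0) & (p_hat < 1.0)
z = (x[:, keep] - 2.0 * p_hat[keep]) / np.sqrt(2.0 * p_hat[keep] * (1.0 - p_hat[keep]))
grm = z @ z.T / z.shape[1]

# eigh returns eigenvalues in ascending order, so reorder to decreasing.
evals, evecs = np.linalg.eigh(grm)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

prop_var = evals / evals.sum()     # share of variability attributed to each PC
pc1 = evecs[:, 0]                  # coordinate of each individual along axis 1

print(np.round(prop_var[:3], 3))
print(np.round(pc1, 2))
```

In this simulation the two populations take opposite signs on `pc1`, which is the behavior the slide describes: the top PC acts as a surrogate for ancestry when ancestry is the only source of correlation.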


PCA of Europeans

- The top principal components are viewed as continuous axes of variation that reflect genetic variation due to ancestry in the sample.
- Individuals with "similar" values for a particular top principal component will have "similar" ancestry along that axis.
- An application of principal components to genetic data from European samples (Novembre et al., Nature 2008) showed that among Europeans for whom all four grandparents originated in the same country, the first two principal components computed using 200,000 SNPs could map their country of origin quite accurately in the plane.


PCA in Finland

- There can be population structure in all populations, even those that appear to be relatively "homogeneous".
- An application of principal components to genetic data from Finnish samples (Sabatti et al., 2009) identified population structure that corresponded very well to geographic regions in this country.


Relatedness Confounds sPCA

- Recall that each entry $\hat{\Psi}_{ij}$ of the GRM used by sPCA is an estimate of the genome-wide average genetic correlation between individuals $i$ and $j$.
- It can be shown:

$$\Psi_{ij} = 2\left[\phi_{ij} + (1 - \phi_{ij})A_{ij}\right]$$

  - $\phi_{ij}$: kinship coefficient, a measure of familial relatedness
  - $A_{ij}$: a measure of ancestral similarity
- PCA is an unsupervised method; in related samples we don't know which correlation structure each eigenvector is reflecting.
- If the only genetic correlation structure among individuals is due to ancestry, $\Psi$ and the top PCs will capture this.
- If there is relatedness in the sample, the top PCs may reflect this or some combination of ancestry and relatedness.
- Association studies have known or cryptic relatedness!
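A small numerical illustration of $\Psi_{ij} = 2[\phi_{ij} + (1 - \phi_{ij})A_{ij}]$ shows why relatedness and ancestry cannot be told apart from the GRM alone. The function name and the particular values of $A_{ij}$ below are hypothetical, chosen only to make the arithmetic easy to follow; the kinship coefficient of 0.25 for full siblings is the standard value.

```python
def expected_genetic_correlation(phi, a):
    """Psi_ij = 2 * [phi_ij + (1 - phi_ij) * A_ij], as given on the slide."""
    return 2.0 * (phi + (1.0 - phi) * a)

# Full siblings (phi = 0.25) with no extra ancestral similarity (A = 0).
psi_sibs = expected_genetic_correlation(phi=0.25, a=0.0)

# Unrelated individuals (phi = 0) with a hypothetical ancestral similarity A = 0.25.
psi_ancestry = expected_genetic_correlation(phi=0.0, a=0.25)

# An individual with themselves, outbred (phi = 0.5), A = 0.
psi_self = expected_genetic_correlation(phi=0.5, a=0.0)

print(psi_sibs, psi_ancestry, psi_self)
```

The sibling pair and the unrelated pair with shared ancestry give the same value of $\Psi_{ij}$, so an eigenvector of the GRM can pick up either kind of correlation.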


PCA for Related Samples

The PC-AiR method was developed for performing a Principal Components Analysis in Related samples. The algorithm has the following steps:


PC-AiR: PCA for Related Samples

- The PC-AiR Algorithm
  1. Estimate the relatedness of all pairs of individuals in the sample.
  2. Partition the sample into an unrelated set and a related set of individuals.
  3. Perform standard PCA on the set of unrelated individuals.
  4. Predict PC values for the set of related individuals based on genetic similarities with the unrelated set.
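The four steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the published PC-AiR implementation: step 1 is skipped by assuming a kinship matrix is already available, step 2 uses a plain greedy rule with a hypothetical threshold (the published method also accounts for ancestry when partitioning), and step 4 predicts PCs by projecting every individual onto SNP weights derived from the unrelated set. All function names, the threshold, and the toy kinship matrix are assumptions.

```python
import numpy as np

def greedy_unrelated_set(kinship, threshold):
    """Step 2 (simplified): drop the individual with the most relatives until none remain."""
    related = kinship > threshold
    np.fill_diagonal(related, False)
    keep = np.ones(kinship.shape[0], dtype=bool)
    while True:
        counts = (related & keep[None, :] & keep[:, None]).sum(axis=1)
        if counts.max() == 0:
            return keep
        keep[np.argmax(counts)] = False

def pcair_sketch(x, kinship, threshold=0.05, n_pcs=2):
    x = np.asarray(x, dtype=float)
    unrel = greedy_unrelated_set(kinship, threshold)
    # Allele frequencies estimated from the unrelated set only.
    p_hat = x[unrel].mean(axis=0) / 2.0
    ok = (p_hat > 0.0) & (p_hat < 1.0)
    z = (x[:, ok] - 2.0 * p_hat[ok]) / np.sqrt(2.0 * p_hat[ok] * (1.0 - p_hat[ok]))
    m = z.shape[1]
    z_u = z[unrel]
    # Step 3: standard PCA on the unrelated set.
    evals, evecs = np.linalg.eigh(z_u @ z_u.T / m)
    order = np.argsort(evals)[::-1][:n_pcs]
    evals, evecs = evals[order], evecs[:, order]
    # SNP weights chosen so that z_u @ w / m reproduces the unrelated-set eigenvectors.
    w = z_u.T @ evecs / evals
    # Step 4: project everyone, related individuals included, onto those weights.
    return z @ w / m, unrel

# Toy data: genotypes are simulated independently; the kinship matrix is a
# hypothetical placeholder in which individual 0 is related to individuals 1 and 2.
rng = np.random.default_rng(3)
n, m = 30, 1000
x = rng.binomial(2, rng.uniform(0.2, 0.8, size=m), size=(n, m))
kinship = np.zeros((n, n))
np.fill_diagonal(kinship, 0.5)
kinship[0, 1] = kinship[1, 0] = 0.25
kinship[0, 2] = kinship[2, 0] = 0.25

pcs, unrel = pcair_sketch(x, kinship)
print(pcs.shape, int(unrel.sum()))
```

In the toy example only individual 0 is moved to the related set, since removing that one person leaves no related pair. The point of the design is that the axes are defined by unrelated individuals alone, so family structure cannot generate its own principal component.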


Admixed Populations

- Several recent and ongoing genetic studies have focused on admixed populations: populations characterized by ancestry derived from two or more ancestral populations that were reproductively isolated.
- Admixed populations have arisen in the past several hundred years as a consequence of historical events such as the transatlantic slave trade, the colonization of the Americas, and other long-distance migrations.
- Examples of admixed populations include:
  - African Americans and Hispanic Americans in the U.S.
  - Latinos from throughout Latin America
  - Uyghur population of Central Asia
  - Cape Verdeans
  - South African "Coloured" population


Ancestry Admixture

[Figure: schematic labeled Ancestral Pop. A, Ancestral Pop. B, and Today]

- The chromosomes of an admixed individual represent a mosaic of chromosomal blocks from the ancestral populations.


Admixed Populations

- There can be substantial genetic heterogeneity among individuals in admixed populations.
- Admixed populations are ancestrally admixed and thus have population structure.
- Statistical methods for estimating admixture proportions using genetic data are available.


Supervised Learning for Ancestry Admixture

- Methods such as ADMIXTURE and FRAPPE have recently been developed for supervised learning of ancestry proportions for admixed individuals using high-density SNP data.
- Most use either a hidden Markov model (HMM) or an Expectation-Maximization (EM) algorithm to infer ancestry.
- Example: Suppose we are interested in identifying the ancestry proportions for an admixed individual.
  - Observed sequence on a chromosome for an admixed individual:
    ...TATACGTGCACCTGGATTACAGATTACAGATTACAGATTACATTGCATCGATCGAA...
  - Observed sequences on a chromosome for samples selected from a "homogeneous" reference population:
    ...TGATCCTGAACCTAGATTACAGATTACAGATTACAGATTACAATGCTTCGATGGAC...
    ...AGATCCTGAACCTAGATTACAGATTACAGATTACAGATACCAATGCTTCGATGGAC...
    ...CGATCCTGAACCTAGATTACAGATTACAGATTTGCGTATACAATGCTTCGATGGAC...
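To make the EM idea concrete, here is a minimal sketch for one admixed individual. It assumes the allele frequencies of each reference ("ancestral") population are known exactly, that SNPs are independent, and that each of the individual's two allele copies at a SNP is drawn from population $k$ with probability $q_k$. It is an illustration of the general approach only, not the algorithm implemented in ADMIXTURE or FRAPPE, and the function name and simulated data are hypothetical.

```python
import numpy as np

def em_ancestry_proportions(g, f, n_iter=200):
    """Estimate ancestry proportions q for one individual by EM.

    g: (m,) genotypes coded 0/1/2 (number of reference alleles).
    f: (K, m) reference-allele frequencies in K reference populations,
       assumed known and strictly between 0 and 1.
    """
    k = f.shape[0]
    q = np.full(k, 1.0 / k)                      # start from equal proportions
    for _ in range(n_iter):
        # E-step: posterior probability that an allele copy came from each population.
        ref = q[:, None] * f
        alt = q[:, None] * (1.0 - f)
        post_ref = ref / ref.sum(axis=0)
        post_alt = alt / alt.sum(axis=0)
        # M-step: average the posteriors over all 2m allele copies.
        q = (post_ref * g + post_alt * (2 - g)).sum(axis=1) / (2 * g.size)
    return q

# Simulated example with two reference populations (illustrative values only).
rng = np.random.default_rng(2)
m = 5000
f = rng.uniform(0.05, 0.95, size=(2, m))
q_true = np.array([0.7, 0.3])
g = rng.binomial(2, q_true @ f)

q_hat = em_ancestry_proportions(g, f)
print(np.round(q_hat, 3))
```

Each update keeps the proportions non-negative and summing to one, so no explicit constraint handling is needed. In practice the reference frequencies are themselves estimated from reference samples, which adds uncertainty that this sketch ignores.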


HapMap ASW and MXL Ancestry

- Genome-screen data on 150,872 autosomal SNPs were used to estimate ancestry.
- Genome-wide ancestry proportions of every individual were estimated using the ADMIXTURE (Alexander et al., 2009) software.
- A supervised analysis was conducted using genotype data from the following reference population samples for three "ancestral" populations:
  - HapMap YRI samples for West African ancestry
  - HapMap CEU samples for northern and western European ancestry
  - HGDP Native American samples for Native American ancestry


[Figure: estimated ancestry proportions (European, African, Native American) on a 0.0 to 1.0 scale for each individual in the HapMap ASW and MXL samples. Panel A: Supervised ADMIXTURE Estimated Ancestry for HapMap MXL + ASW. Panel B: Unsupervised ADMIXTURE Estimated Ancestry for HapMap MXL + ASW.]


Table: Average Estimated Ancestry Proportions for HapMap African Americans and Mexican Americans

Estimated Ancestry Proportions (SD)

Population   European        African        Native American
MXL          49.9% (14.8%)   6% (1.8%)      44.1% (14.8%)
ASW          20.5% (7.9%)    77.5% (8.4%)   1.9% (3.5%)


Figure: HapMap MXL + ASW Sample

[Scatter plots of principal components, six panels:
  A. Top 2 PCs with PC-AiR (PC-AiR PC1 vs. PC2); legend: Unrelated Set, Related Set; clusters labeled African, European, Native American
  B. Top 2 PCs with EIGENSOFT (PC1 vs. PC2); legend: MXL Extended Family 1, Other
  C. EIGENSOFT PCs 2 and 3; legend: MXL Extended Family 1, MXL Extended Family 2, Other
  D. EIGENSOFT PCs 4 and 5; legend: ASW Extended Family 1, ASW Extended Family 2, Other
  E. EIGENSOFT PCs 6 and 7; legend: ASW Extended Family 3, ASW Extended Family 4, Other
  F. EIGENSOFT PCs 8 and 9; legend: ASW Extended Family 4, ASW Extended Family 5, ASW Extended Family 6, Other]


HapMap MXL Cryptic Relatedness

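[Figure: pedigree diagram of HapMap MXL individuals NA19660, NA19661, NA19662, NA19663, NA19664, NA19665, NA19672, NA19684, NA19685, and NA19686, with two individuals marked "?", two relationships annotated as 2nd-degree, and the labels 2382, M008, M011, and M012.]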


- "Genetic diversity and association studies in US Hispanic/Latino populations: Applications in the Hispanic Community Health Study/Study of Latinos." (2016) American Journal of Human Genetics 98(1), 165-184.


PC-AiR: Hispanic Community Health Study

- Genetic differentiation among individuals is associated with the geography of their countries of grandparental origin.
- Individuals for whom all four grandparents were born in a specific country in Central or South America were used.


References

- Alexander, D.H., Novembre, J., Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655-1664.
- Conomos, M.P., Miller, M., Thornton, T. (2015). Robust Inference of Population Structure for Ancestry Prediction and Correction of Stratification in the Presence of Relatedness. Genetic Epidemiology 39, 276-293.
- Novembre, J., Johnson, T., Bryc, K., Kutalik, Z., Boyko, A.R., Auton, A., Indap, A., King, K.S., Bergmann, S., Nelson, M.R. (2008). Genes mirror geography within Europe. Nature 456, 98-101.
- Patterson, N., Price, A.L., Reich, D. (2006). Population structure and eigenanalysis. PLoS Genet. 2, e190.


- Sabatti, C., Service, S.K., Hartikainen, A.L., Pouta, A., Ripatti, S., Brodsky, J., et al. (2009). Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature Genetics 41, 35-46.
