Computational Human Genetics - Searching for Relations Between Genes, Diseases, and Populations

Eran Halperin, November 10, 2009


Environmental Factors

Genetic Factors

Complex disease

Multiple genes may affect the disease.

Therefore, the effect of every single gene may be negligible.

April 05’

The Human Chromosomes

………ACCAGGACGA……

………ACCAGGACGA……

Each chromosome ‘is’ a sequence over the alphabet {A,G,C,T} (base pairs)

Copy from mother

Copy from father

Facts about our genome

23 pairs of chromosomes.

X and Y are the sex chromosomes (XX for women, XY for men).

3,300,000,000 base pairs in the human genome.

The Human Genome Project

“What we are announcing today is that we have reached a milestone…that is, covering the genome in…a working draft of the human sequence.”

“But our work previously has shown… that having one genetic code is important, but it's not all that useful.”

“I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…”

Washington, DC, June 26, 2000

The Vision of Personalized Medicine

Genetic and epigenetic variants + measurable environmental/behavioral factors would be used for a personalized treatment and diagnosis.

Paradigm shifts in medicine

Example: Warfarin

An anticoagulant drug, useful in the prevention of thrombosis.

Warfarin was originally used as rat poison.

Optimal dose varies across the population.

Genetic variants (VKORC1 and CYP2C9) affect the variation of the personalized optimal dose.

Association Studies


Where should we look first?

person 1: ….AAGCTAAATTTG….

person 2: ….AAGCTAAGTTTG….

person 3: ….AAGCTAAGTTTG….

person 4: ….AAGCTAAATTTG….

person 5: ….AAGCTAAGTTTG….

SNP = Single Nucleotide Polymorphism

Each common SNP has only two possible letters (alleles).

Disease Association Studies

[Figure: aligned DNA sequences of the cases and of the controls. One marked position is an associated SNP with high relative risk; another is an associated SNP with lower relative risk.]

Preliminary Definitions

SNP: single nucleotide polymorphism. A genetic variant which may carry a different ‘value’ for different individuals.

Allele: the variant’s value: A, G, C, or T.

Most SNPs are bi-allelic: there are only two observed alleles in the population.

Risk allele: the allele which is more common in cases than in controls (denoted R).

Nonrisk allele: the allele which is more common in the controls (denoted N).

Relative Risk

Risk allele = G: chances of developing type II diabetes: 30%

Nonrisk allele = A: chances of developing type II diabetes: 20%

Relative Risk: Pr(D|R)/Pr(D|N) = 0.30/0.20 = 1.5
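The relative-risk arithmetic on this slide can be sketched in a few lines. The function name `relative_risk` and its argument names are illustrative assumptions, not part of the lecture; the inputs are the two conditional disease probabilities from the example.

```python
def relative_risk(pr_disease_given_risk, pr_disease_given_nonrisk):
    """Relative risk Pr(D|R) / Pr(D|N) for a risk allele R and a nonrisk allele N."""
    if pr_disease_given_nonrisk <= 0:
        raise ValueError("Pr(D|N) must be positive")
    return pr_disease_given_risk / pr_disease_given_nonrisk


# Slide example: 30% for carriers of the risk allele, 20% for the nonrisk allele.
rr = relative_risk(0.30, 0.20)
print(f"{rr:.2f}")  # prints 1.50
```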

Other Structural Variants

Inversion, deletion, copy number variant

Published Genome-Wide Associations through 6/2009: 439 published GWA at p < 5 x 10^-8

NHGRI GWA Catalog: www.genome.gov/GWAStudies


Public Genotype Data Growth

Data sets shown along a 2001-2006 timeline:

Data set            Published in             SNPs          Genotypes
Daly et al.         Nature Genetics          103           40,000
Gabriel et al.      Science                  3,000         400,000
TSC Data            Nucleic Acids Research   35,000        4,500,000
Perlegen Data       Science                  1,570,000     100,000,000
NCBI dbSNP          Genome Research          3,000,000     286,000,000
HapMap Phase 2                               5,000,000+    600,000,000+

Chance or Real Association?

[Figure: the aligned DNA sequences of the cases and of the controls, with the associated SNP of lower relative risk marked.]

How does it work?

For every SNP we can construct a contingency table:

            R    N    Total
Cases       a    b    N
Controls    c    d    N

Hypothesis testing

Null hypothesis: Pr(R|case) = Pr(R|control)

Alternative hypothesis: Pr(R|case) ≠ Pr(R|control)

The model assumes that all individuals are independent (unrelated), and therefore our sample is a random sample from a Binomial distribution.

Cases sampled from distribution X ~ B(n, Pr(R|cases))

Controls sampled from distribution Y ~ B(n, Pr(R|controls))

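Hypothesis testing, cont.

When n is large, B(n,p) is approximately N(np, np(1-p)). Under the null hypothesis:

$$X - Y \sim N\big(0,\ 2np(1-p)\big) \quad\Rightarrow\quad Z = \frac{X - Y}{\sqrt{2np(1-p)}} \sim N(0,1)$$

Set $p = \frac{X+Y}{2n}$. Then we get:

$$Z = \frac{\sqrt{2n}\,(X - Y)}{\sqrt{(X+Y)(2n - X - Y)}} \sim N\Big(0,\ 1 + \frac{1}{2n-1}\Big)$$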

P-value

Z is called a test statistic (a z-score in this case).

We can calculate Z* for our data, and then calculate (using the normal approximation): p-value = Pr(|Z| > |Z*|)

Often we take $T = Z^2$, which is distributed as $T \sim \chi^2_1$.
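As a rough illustration of the test above, the following sketch computes Z, T = Z^2 and the two-sided p-value from the contingency-table counts. The function name `allelic_z_test` and the example counts are assumptions made for illustration; x and y are the risk-allele counts in cases and controls (a and c in the table), and n is the number of alleles sampled in each group. The p-value uses the standard normal tail, ignoring the small 1/(2n-1) correction to the variance.

```python
import math


def allelic_z_test(x, y, n):
    """Return (Z, T, p_value) for risk-allele counts x (cases) and y (controls).

    n is the number of alleles sampled in each group, so x and y are in 0..n.
    """
    pooled = x + y
    if pooled == 0 or pooled == 2 * n:
        raise ValueError("both alleles must be observed in the pooled sample")
    # Z with p estimated by the pooled frequency (X + Y) / (2n).
    z = math.sqrt(2 * n) * (x - y) / math.sqrt(pooled * (2 * n - pooled))
    # Two-sided tail of N(0,1): Pr(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, z * z, p_value


# Hypothetical counts: 60 risk alleles out of 100 in cases, 40 out of 100 in controls.
z, t, p = allelic_z_test(60, 40, 100)
print(f"Z = {z:.3f}, T = {t:.3f}, p-value = {p:.5f}")
```

The statistic T can equivalently be compared against a chi-square distribution with one degree of freedom, which gives the same p-value as the two-sided normal tail.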

Results: Manhattan Plots

The curse of dimensionality – corrections of multiple testing

In a typical Genome-Wide Association Study (GWAS), we test millions of SNPs.

If we set the p-value threshold for each test to be 0.05, by chance we will “find” about 5% of the SNPs to be associated with the disease.

This needs to be corrected.

Bonferroni Correction

If the number of tests is n, we set the threshold to be 0.05/n.

A very conservative test.

If the tests are independent then it is reasonable to use it.

If the tests are correlated this could be bad. Example: if all SNPs are identical, then we lose a lot of power; the false positive rate reduces, but so does the power.
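A minimal sketch of the Bonferroni rule, assuming a list of per-SNP p-values is already available. The function names `bonferroni_threshold` and `bonferroni_significant` are hypothetical. With one million tests the threshold becomes 0.05 / 1,000,000 = 5 x 10^-8, the genome-wide significance level quoted earlier for the GWA catalog.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that passes the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(len(p_values), alpha)
    return [p < threshold for p in p_values]


# One million SNPs tested at an overall level of 0.05.
print(bonferroni_threshold(1_000_000))
```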

Population Substructure

Challenge 1

Imagine that all the cases are collected from Africa, and all the controls are from Europe.

Many association signals are going to be found.

The vast majority of them are false.

Why?

Different evolutionary forces: drift, selection, mutation, migration, population bottleneck.

Evolution Theory

Mutations add to genetic variation.

Natural selection controls the frequency of certain traits and alleles.

Genetic drift.

Mutations

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGA

AGAGCAGTCCACAGGTATAGCCTACATGAGATCGACATGAGA

Estimated probability of a mutation in a single generation is 10^-8

Other ‘mutations’ - recombination

Copy 1

Copy 2

child chromosome

Probability r_i (~10^-8) for recombination at position i.

Natural Selection

Example: being lactose tolerant is advantageous in northern Europe, hence there is positive selection in the LCT gene

different allele frequencies in LCT

Genetic Drift

Even without selection, the allele frequencies in the population are not fixed across time.

Consider the case where we assume Hardy-Weinberg Equilibrium (HWE), that is, individuals are mating randomly in the population.

Suppose that at the first generation the allele frequencies are p_0 (of a) and q_0 = 1 - p_0 (of A).

Under HWE, E[p_{k+1}] = p_k, but Var[p_{k+1}] > 0, so in general the next generation will have p_{k+1} ≠ p_k.

The rate of the drift

N: effective population size (if all individuals are entirely unrelated then N is the total population size).

Under an assumption of constant population size, if X_k counts the number of occurrences of a at generation k, then X_{k+1} ~ B(N, p_k).

E[p_{k+1}] = E[X_{k+1}]/N = p_k.

Var[p_{k+1}] = p_k(1-p_k)/N.

The effect of genetic drift depends on the time and the effective population size. A small population increases the effect.
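The drift model above can be simulated directly. This is a sketch under the slide's own assumptions (constant size N, X_{k+1} ~ B(N, p_k), no selection or mutation); the function name `simulate_drift` and the parameter values in the usage lines are illustrative, not from the lecture.

```python
import random


def simulate_drift(p0, pop_size, generations, seed=0):
    """Return the allele-frequency trajectory [p_0, p_1, ..., p_generations]."""
    rng = random.Random(seed)
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # X_{k+1} ~ B(N, p_k): each of the N copies is drawn independently.
        count = sum(1 for _ in range(pop_size) if rng.random() < p)
        p = count / pop_size
        trajectory.append(p)
    return trajectory


# Same starting frequency, two population sizes: since Var[p_{k+1}] = p_k(1-p_k)/N,
# the smaller population is expected to wander further from p_0.
small = simulate_drift(0.5, pop_size=20, generations=50, seed=1)
large = simulate_drift(0.5, pop_size=2000, generations=50, seed=1)
print(small[-1], large[-1])
```

Once the frequency reaches 0 or 1 it stays there, which is how drift alone can fix or eliminate an allele.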

Bottleneck effect

[Figure: effective population size over time.]

Genetic drift’s rate is higher.

Generation 1: allele frequency 1/9

The Wright-Fisher Model

Ancestral population


migration


Genetic drift

different allele frequencies


Imagine that all the cases are collected from Africa, and all the controls are from Europe.

Many association signals are going to be found.

The vast majority of them are false.

What can we do about it?

Jakobsson et al., Nature 451: 998-1003

Principal Component Analysis

Dimensionality reduction.

Based on linear algebra (Singular Value Decomposition).

Intuition: find the ‘most important’ features of the data, that is, project the data on the axis with the largest variance.

Plotting the data on a one-dimensional line for which the spread is maximized.

In our case, we want to look at two dimensions at a time.

The original data has many dimensions: each SNP corresponds to one dimension.
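To make the "axis with the largest variance" concrete, here is a small sketch that finds the first principal component of a genotype matrix (rows are individuals, columns are SNPs). To stay within the standard library it uses power iteration on the centered data rather than the Singular Value Decomposition mentioned on the slide; the two approaches recover the same leading axis. The function name `first_principal_component` and the toy genotype matrix are made up for illustration.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (axis, scores): unit axis of largest variance and each row's projection."""
    n, m = len(rows), len(rows[0])
    # Center every column (SNP) so variance is measured around the column mean.
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]

    def project(axis):
        return [sum(x * a for x, a in zip(row, axis)) for row in centered]

    axis = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        scores = project(axis)
        # Multiply by X^T X (via the scores) and renormalize to unit length.
        new_axis = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in new_axis))
        if norm == 0.0:
            break
        axis = [v / norm for v in new_axis]
    return axis, project(axis)


# Toy data: genotype counts (0, 1 or 2 copies of an allele) for two groups of three.
genotypes = [
    [2, 2, 0, 1], [2, 1, 0, 1], [2, 2, 0, 0],   # hypothetical population A
    [0, 0, 2, 1], [0, 1, 2, 1], [0, 0, 2, 2],   # hypothetical population B
]
axis, scores = first_principal_component(genotypes)
print([round(s, 2) for s in scores])
```

In this toy matrix the two groups land on opposite sides of zero along the first component, which is the kind of separation a population-structure plot shows. A second component could be obtained by removing the first projection from the data and repeating the procedure.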

To what extent can population structure be detected from SNP data?

What can we learn from these inferences?

Can we build the tree of life?

How do we analyze complex (mixed) populations?

Novembre et al., Nature, 2008

Ancestry Inference

Modeling Correlation

Challenge 2

A typical associated region

Linkage Disequilibrium

Haplotype Data in a Block

(Daly et al., 2001) Block 6 from Chromosome 5q31

Phasing - haplotype inference

Cost effective genotyping technology gives genotypes and not haplotypes.

Genotype: A {T/G} {C/A} C G {A/C}

Possible phases (mother chromosome, father chromosome):

ATCCGA  AGACGC

ATACGA  AGCCGC

AGACGA  ATCCGC

….
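The ambiguity on this slide can be enumerated mechanically: a genotype with k heterozygous sites has 2^(k-1) possible unordered haplotype pairs. The sketch below assumes a genotype is given as a list of sites, each a one-letter string (homozygous) or a two-letter string (heterozygous); that encoding and the name `possible_phases` are illustrative choices, not the lecture's notation.

```python
from itertools import product


def possible_phases(genotype):
    """List the unordered haplotype pairs compatible with a genotype."""
    het = [i for i, site in enumerate(genotype) if len(site) == 2]
    base = [site[0] for site in genotype]
    if not het:
        hap = "".join(base)
        return [(hap, hap)]
    phases = []
    # Fix the first heterozygous site so each unordered pair is listed once.
    for choice in product((0, 1), repeat=len(het) - 1):
        first, second = base[:], base[:]
        for i, c in zip(het, (0,) + choice):
            first[i] = genotype[i][c]
            second[i] = genotype[i][1 - c]
        phases.append(("".join(first), "".join(second)))
    return phases


# Six-site genotype with three heterozygous sites, giving four possible phases.
for pair in possible_phases(["A", "TG", "CA", "C", "G", "AC"]):
    print(pair)
```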

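Inferring Haplotypes From Trios

Assumption: No recombination

Genotypes (0 and 1 are homozygous sites, 2 is a heterozygous site):

Parent 1: 122112
Parent 2: 210022
Child:    120222

Haplotype pairs as the inference proceeds (? marks an unresolved site):

            Parent 1          Parent 2          Child
Start       1??11? 1??11?     ?100?? ?100??     1?0??? 1?0???
            1??11? 1??11?     1100?? 0100??     1?0??? 1?0???
            10?11? 11?11?     1100?? 0100??     100??? 110???
Resolved    10011? 11111?     11000? 01001?     10011? 11000?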
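The trio reasoning can be written as a per-site rule, sketched below under the slide's no-recombination assumption and the further assumption that the genotypes contain no Mendelian errors (the code does not check for them). Genotypes use the slide's encoding, with 0 and 1 for homozygous sites and 2 for heterozygous sites; the function name `phase_trio` is hypothetical. A site stays unresolved only when all three individuals are heterozygous.

```python
def phase_trio(parent1, parent2, child):
    """Phase a trio site by site.

    Returns three haplotype pairs: (parent 1 transmitted, untransmitted),
    (parent 2 transmitted, untransmitted), (child from parent 1, from parent 2).
    Unresolved sites are marked '?'.
    """
    flip = {"0": "1", "1": "0"}

    def transmitted(own, other, kid):
        if own != "2":          # homozygous parent: only one allele to pass on
            return own
        if kid != "2":          # homozygous child: both parents passed that allele
            return kid
        if other != "2":        # heterozygous child, other parent homozygous:
            return flip[other]  # this parent passed the complementary allele
        return "?"              # all three heterozygous: ambiguous

    def untransmitted(own, passed):
        if own != "2":
            return own
        return flip.get(passed, "?")

    p1_t, p1_u, p2_t, p2_u = [], [], [], []
    for a, b, c in zip(parent1, parent2, child):
        t1 = transmitted(a, b, c)
        t2 = transmitted(b, a, c)
        p1_t.append(t1)
        p1_u.append(untransmitted(a, t1))
        p2_t.append(t2)
        p2_u.append(untransmitted(b, t2))
    join = "".join
    return (
        (join(p1_t), join(p1_u)),
        (join(p2_t), join(p2_u)),
        (join(p1_t), join(p2_t)),
    )


# The trio from the slide.
for pair in phase_trio("122112", "210022", "120222"):
    print(pair)
```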

Maximum Likelihood

Until now we discussed the case of two hypotheses (null, and alternative).

In some cases we are interested in many hypotheses and we search for the best one.

Normally a hypothesis will be defined by a set of parameters θ.

The likelihood of θ is $L(\theta; D) = \Pr(D \mid \theta)$. We are interested in the hypothesis that maximizes the likelihood.

Soft assignment

Compute probabilities P={ph} for all possible haplotypes.

For each genotype g, we do not assign one pair of haplotypes, but a distribution of possible pairs.

The set of pairs of haplotypes compatible with g is denoted as C(g).

In soft assignment, a pair $(h,h') \in C(g)$ is explaining g with probability

$$\frac{p_h\, p_{h'}}{\sum_{(h_1,h_2) \in C(g)} p_{h_1}\, p_{h_2}}$$

Phasing via Maximum Likelihood

Soft decision:

$$\log L(P=\{p_h\}; D) = \sum_{g \in G} \log\Big(\sum_{(h,h') \in C(g)} p_h\, p_{h'}\Big)$$

Hard decision:

$$\log L(Z, P=\{p_h\}; D) = \sum_h n_h \log(p_h)$$

An iterative algorithm

Data (h marks a heterozygous site), with the compatible haplotype pairs of each genotype:

1 0 h h 1:   (10001, 10111)   (10011, 10101)
h 0 0 1 h:   (00010, 10011)   (00011, 10010)
1 h h 1 1:   (10011, 11111)   (10111, 11011)

At the start, each of the four haplotypes listed for a genotype gets weight ¼.

Haplotype frequencies:

Haplotype    Start    After one update    At convergence
00010        1/12     .125                1/6
00011        1/12     .042                0
10001        1/12     .067                0
10010        1/12     .042                0
10011        3/12     .325                1/2
10101        1/12     .1                  1/6
10111        2/12     .133                0
11011        1/12     .067                0
11111        1/12     .1                  1/6

Pair probabilities:

Genotype     Pair              From the start frequencies    At convergence
1 0 h h 1    (10001, 10111)    0.4                           0
             (10011, 10101)    0.6                           1
h 0 0 1 h    (00010, 10011)    0.75                          1
             (00011, 10010)    0.25                          0
1 h h 1 1    (10011, 11111)    0.6                           1
             (10111, 11011)    0.4                           0

Expectation Maximization (EM)

D: given data

θ: parameters that need to be estimated

Z: latent (missing) variables

1. E-step: Compute $Q(\theta \mid \theta^n) = E_{Z \mid D, \theta^n}[\log \Pr(D, Z \mid \theta)]$

2. M-step: Find a $\theta^{n+1}$ which maximizes $Q(\theta \mid \theta^n)$

MLE from Incomplete Data

Finding MLE parameters: nonlinear optimization problem.

Expectation Maximization (EM): use the “current point” to construct an alternative function (which is “nice”).

[Figure: the log-likelihood $\log L(\theta; D)$ plotted against θ, together with the alternative function $Q(\theta \mid \theta^0) = E_{Z \mid \theta^0}[\log L(\theta; D, Z)]$ constructed at the current point $\theta^0$.]

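EM for phasing

$$\log L(\theta=\{p_h\}; D) = \sum_{g \in G} \log\Big(\sum_{(h,h') \in C(g)} p_h\, p_{h'}\Big)$$

$$\begin{aligned}
Q(\theta \mid \theta^n) &= E_{Z \mid \theta^n}\big[\log L(Z, \theta=\{p_h\}; D)\big] = E_{Z \mid \theta^n}\Big[\sum_h n_h(Z) \log(p_h)\Big] \\
&= E_{Z \mid \theta^n}\Big[\sum_{g \in G} \sum_{(h_1,h_2) \in C(g)} I_{Z(g)=(h_1,h_2)} \big(\log(p_{h_1}) + \log(p_{h_2})\big)\Big] \\
&= \sum_{g} \sum_{(h_1,h_2)} \frac{p^n_{h_1}\, p^n_{h_2}}{\sum_{(h,h') \in C(g)} p^n_h\, p^n_{h'}} \big(\log(p_{h_1}) + \log(p_{h_2})\big)
\end{aligned}$$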

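$$Q(\theta \mid \theta^n) = \sum_{g} \sum_{(h_1,h_2)} \frac{p^n_{h_1}\, p^n_{h_2}}{\sum_{(h,h') \in C(g)} p^n_h\, p^n_{h'}} \big(\log(p_{h_1}) + \log(p_{h_2})\big)$$

This is maximized for:

$$p_h \propto \sum_{g} \sum_{(h,h') \in C(g)} \frac{p^n_h\, p^n_{h'}}{\sum_{(h_1,h_2) \in C(g)} p^n_{h_1}\, p^n_{h_2}}$$

Here the pairs in C(g) are treated as ordered, the inner sum runs over the pairs whose first haplotype is h, and the p_h are normalized to sum to 1.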
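The update above can be sketched as a short program and checked against the three-genotype example from the iterative-algorithm slides. The function names `compatible_pairs` and `em_phasing`, the string encoding of genotypes ('0'/'1' homozygous, 'h' heterozygous), and the choice to start from equal weights on every compatible pair are assumptions made for this illustration. Pairs are stored unordered here, so each pair adds its weight to both of its haplotypes and the counts are divided by 2|G|.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Enumerate C(g): the unordered haplotype pairs compatible with a genotype."""
    het = [i for i, s in enumerate(genotype) if s == "h"]
    if not het:
        return [(genotype, genotype)]
    pairs = []
    # Fix the first heterozygous site to '0' so each pair is listed once.
    for rest in product("01", repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for i, allele in zip(het, ("0",) + rest):
            h1[i] = allele
            h2[i] = "1" if allele == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


def em_phasing(genotypes, iterations=50):
    """Return (haplotype frequencies, per-genotype pair weights) after EM."""
    pairs = [compatible_pairs(g) for g in genotypes]
    weights = [[1.0 / len(ps)] * len(ps) for ps in pairs]
    freqs = {}
    for _ in range(iterations):
        # M-step: frequency = expected haplotype count / (2 * number of genotypes).
        counts = defaultdict(float)
        for ps, ws in zip(pairs, weights):
            for (h1, h2), w in zip(ps, ws):
                counts[h1] += w
                counts[h2] += w
        freqs = {h: c / (2.0 * len(genotypes)) for h, c in counts.items()}
        # E-step: pair weight proportional to p_h1 * p_h2 within each genotype.
        weights = []
        for ps in pairs:
            prods = [freqs[h1] * freqs[h2] for h1, h2 in ps]
            total = sum(prods)
            weights.append([x / total for x in prods])
    return freqs, weights


freqs, weights = em_phasing(["10hh1", "h001h", "1hh11"], iterations=1)
print(weights)
```

With `iterations=1` the returned frequencies are the starting table (for example 3/12 for 10011) and the pair weights are the first soft assignment; running more iterations moves the frequency of 10011 toward 1/2.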

Phasing summary

Expectation maximization is easy to implement and works reasonably well in practice.

We can use other models (tree models) to improve the accuracy of the phasing prediction.

Human Genetics – where to?

We can typically explain 5%-15% of the heritability of common diseases.

Where is the missing heritability?

Rare variants

Gene-gene interactions

Gene-environment interactions

Creative computational methods are key to the discovery of the missing heritability.

Course: Computational Human Genetics

Semester B

More background in human genetics, statistics, and machine learning.

Studying genetics of human disease

Privacy and forensics

Analysis of new technologies (sequencing)

Population genetics: detecting selection, mutation rate, recombination rates, etc.

Reconstructing human history
