cse280vineet bafna cse280a: algorithmic topics in bioinformatics vineet bafna

51
CSE280 Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

Upload: archibald-newman

Post on 03-Jan-2016

224 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

CSE280a: Algorithmic topics in bioinformatics

Vineet Bafna

Page 2: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

The scope/syllabus

• We will cover topics from the following areas:– Population genetics– Computational Mass Spectrometry– Biological networks (emphasis on comparative

analysis)– ncRNA

• The focus will be on the use of algorithms for analyzing data in these areas– Some background in algorithms (mathematical

maturity) is helpful.– Relevant biology will be discussed on a ‘need to know’ basis

Page 3: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Logisitics

• This class has little required homework.• All reading is optional, but recommended.• 1 Final Exam (20%), and 1 research project (70%)• Students help edit class notes in latex (10%)

– At most one topic per student– Groups of <= 2 per topic

• Project Goal: – Address research problems with minimum preparation

• Lectures will be given by instructors and by students.• Most communication is electronic.

– Check http://www.cse.ucsd.edu/classes/wi08/cse280a

Page 4: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

From an individual to a population

• Individual genomes vary by about 1 in 1000bp.• These small variations account for significant

phenotype differences.– Disease susceptibility.– Response to drugs

• How can we understand genetic variation in a population, and its consequences?

• It took a long time (10-15 yrs) to produce the draft sequence of the human genome.

• Soon (within 10-15 years), entire populations can have their DNA sequenced. Why do we care?

Page 5: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Population Genetics

• Individuals in a species (population) are phenotypically different.

• Often these differences are inherited (genetic). Understanding the genetic basis of these differences is a key challenge of biology!

• The analysis of these differences involves many interesting algorithmic questions.

• We will use these questions to illustrate algorithmic principles, and use algorithms to interpret genetic data.

Page 6: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Population genetics

• We are all similar, yet we are different. How substantial are the differences?– What are the sources of variation?– As mutations arise, they are either neutral and subject to

evolutionary drift, or they are (dis-)advantageous and under selective pressure. Can we tell?

– If you had DNA from many sub-populations, Asian, European, African, can you separate them?

– How can we detect recombination?– Why are some people more likely to get a disease then

others? How is disease gene mapping done?– Phasing of chromosomes

Page 7: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Computational mass spectrometry

• Mass Spectrometry is a key technology for measuring active proteins, their interactions, and their post-translational modifications

• Computation plays a key role in interpreting mass spectrometry data.

Page 8: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Back to the genome

• Recall that the genome contains protein-coding genes (1% of the genome)

• Another 5% might encode RNA, and regulatory sites• How do we find regions of functional interest?

Page 9: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Population genetics basics

Page 10: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Scope of genetics lectures

• Basic terminology• Key principles

– Sources of variation– HW equilibrium– Linkage– Coalescent theory– Recombination/Ancestral Recombination Graph– Haplotypes/Haplotype phasing– Population sub-structure– Structural polymorphisms– Medical genetics basis: Association mapping/pedigree

analysis

Page 11: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Terminology: allele

• Allele: A specific variant at a location– The notion of alleles predates the concept of

gene, and DNA.– Initially, alleles referred to variants that

described a measurable trait (round/wrinkled seed)

– Now, an allele might be a nucleotide on a chromosome, with no measurable phenotype.

– As we discuss source of variation, we will have different kinds of alleles.

Page 12: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Terminology

• Locus: The location of the allele– A nucleotide position.– A genetic marker– A gene– A chromosomal segment

Page 13: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Terminology

• Genotype: genetic makeup of (part of) an individual

• Phenotype: A measurable trait in an organism, often the consequence of a genetic variation

• Humans are diploid, they have 2 copies of each chromosome.– They may have heterozygosity/homozygosity

at a location– Other organisms (plants) have higher forms

of ploidy.– Additionally, some sites might have 2 allelic

forms, or even many allelic forms.• Haplotype: genetic makeup of (part of) a single

chromosome

Page 14: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

What causes variation in a population?

• Mutations (may lead to SNPs)• Recombinations• Other crossover events (gene

conversion)• Structural Polymorphisms

Page 15: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Single Nucleotide Polymorphisms

• Small mutations that are sustained in a population are called SNPs

• SNPs are the most common source of variation studied

• The data is a matrix (rows are individuals, columns are loci). Only the variant positions are kept.

A->G

Page 16: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Single Nucleotide Polymorphisms

000001010111000110100101000101010010000000110001111000000101100110

Infinite Sites Assumption:Each site mutates at most once

Page 17: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Short Tandem Repeats

GCTAGATCATCATCATCATTGCTAGGCTAGATCATCATCATTGCTAGTTAGCTAGATCATCATCATCATCATTGCGCTAGATCATCATCATTGCTAGTTAGCTAGATCATCATCATTGCTAGTTAGCTAGATCATCATCATCATCATTGC

435335

Page 18: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

STR can be used as a DNA fingerprint

• Consider a collection of regions with variable length repeats.

• Variable length repeats will lead to variable length DNA

• Vector of lengths is a finger-print

4 23 35 13 23 15 3

loci

indiv

idual

s

Page 19: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Structural polymorphisms

• Large scale structural changes (deletions/insertions/inversions) may occur in a population.

• Copy Number variation• Certain diseases (cancers) are marked by an

abundance of these events

Page 20: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Personalized genome sequencing

• These variants (of which 1,288,319 were novel) included

• 3,213,401 single nucleotide polymorphisms (SNPs),

• 53,823 block substitutions (2–206 bp), • 292,102 heterozygous insertion/deletion

events (indels)(1–571 bp), • 559,473 homozygous indels (1–82,711 bp), • 90 inversions, as well as numerous

segmental duplications and copy number variation regions.

• Non-SNP DNA variation accounts for 22% of all events identified in the donor, however they involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure.

• Moreover, 44% of genes were heterozygous for one or more variants.

PLoS Biology, 2007

Page 21: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Recombination

0000000011111111

00011111

• Not all DNA recombines!

Page 22: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Human DNA

http://upload.wikimedia.org/wikipedia/commons/b/b2/Karyotype.png

• Not all DNA recombines. • mtDNA is inherited from the

mother, and • y-chromosome from the father

Page 23: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Gene Conversion

• Gene Conversion versus single crossover– Hard to distinguish

in a population

Page 24: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Topic 1: Basic Principles

• In a ‘stable’ population, the distribution of alleles obeys certain laws– Not really, and the deviations are

interesting• HW Equilibrium

– (due to mixing in a population)• Linkage (dis)-equilibrium

– Due to recombination

Page 25: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Hardy Weinberg equilibrium

• Consider a locus with 2 alleles, A, a• p (respectively, q) is the frequency of A

(resp. a) in the population• 3 Genotypes: AA, Aa, aa• Q: What is the frequency of each genotype

If various assumptions are satisfied, (such as random mating, no natural selection), Then• PAA=p2

• PAa=2pq• Paa=q2

Page 26: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Hardy Weinberg: why?

• Assumptions:– Diploid– Sexual reproduction– Random mating– Bi-allelic sites– Large population size, …

• Why? Each individual randomly picks his two chromosomes. Therefore, Prob. (Aa) = pq+qp = 2pq, and so on.

Page 27: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Hardy Weinberg: Generalizations

• Multiple alleles with frequencies– By HW,

• Multiple loci?

θ1,θ2,L ,θH

Pr[homozygous genotype i] =θ i2

Pr[heterozygous genotype i, j] = 2θ iθ j

Page 28: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Hardy Weinberg: Implications

• The allele frequency does not change from generation to generation. Why?

• It is observed that 1 in 10,000 caucasians have the disease phenylketonuria. The disease mutation(s) are all recessive. What fraction of the population carries the mutation?

• Males are 100 times more likely to have the “red’ type of color blindness than females. Why?

• Conclusion: While the HW assumptions are rarely satisfied, the principle is still important as a baseline assumption, and significant deviations are interesting.

Page 29: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Recombination

0000000011111111

00011111

Page 30: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

What if there were no recombinations?

• Life would be simpler• Each individual sequence would have a

single parent (even for higher ploidy)• The genealogical relationship is

expressed as a tree. This principle is used to track ancestry of an individual

Page 31: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

The Infinite Sites Assumption

0 0 0 0 0 0 0 0

0 0 1 0 0 0 0 0

0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0

3

8 5

• The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa.

• Some phenotypes could be linked to the polymorphisms• Some of the linkage is “destroyed” by recombination

Page 32: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Infinite sites assumption and Perfect Phylogeny

• Each site is mutated at most once in the history.

• All descendants must carry the mutated value, and all others must carry the ancestral value

i

1 in position i0 in position i

Page 33: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Perfect Phylogeny

• Assume an evolutionary model in which no recombination takes place, only mutation.

• The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny.

Page 34: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Handling recombination

• A tree is not sufficient as a sequence may have 2 parents

• Recombination leads to loss of correlation between columns

Page 35: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Quiz 1

• Allele, locus, genotype, haplotype• Hardy Weinberg equilibrium?• Today: Linkage (dis)-equilibrium

Page 36: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Quiz 2

• Recall that a SNP data-set is a ‘binary’ matrix.– Rows are individual (chromosomes)– Columns are alleles at a specific locus

• Suppose you have 2 SNP datasets of a contiguous genomic region. – One from an African population, and one from

a European Population.– Can you tell which is which?– How long does the genomic region have to

be?

Page 37: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Recombination, and populations

• Think of a population of N individual chromosomes.

• The population remains stable from generation to generation.

• Without recombination, each individual has exactly one parent chromosome from the previous generation.

• With recombinations, each individual is derived from one or two parents.

• We will formalize this notion later in the context of coalescent theory.

Page 38: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Linkage (Dis)-equilibrium (LD)

• Consider sites A &B• Case 1: No

recombination• Each new individual

chromosome chooses a parent from the existing ‘haplotype’

A B0 10 10 00 01 01 01 01 0

1 0

Page 39: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Linkage (Dis)-equilibrium (LD)

• Consider sites A &B• Case 2: diploidy and

recombination• Each new individual

chooses a parent from the existing alleles

A B0 10 10 00 01 01 01 01 0

1 1

Page 40: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Linkage (Dis)-equilibrium (LD)

• Consider sites A &B• Case 1: No recombination• Each new individual chooses a

parent from the existing ‘haplotype’

– Pr[A,B=0,1] = 0.25• Linkage disequilibrium

• Case 2: Extensive recombination• Each new individual simply

chooses and allele from either site

– Pr[A,B=(0,1)]=0.125• Linkage equilibrium

A B0 10 10 00 01 01 01 01 0

Page 41: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

LD

• In the absence of recombination, – Correlation between columns– The joint probability Pr[A=a,B=b] is

different from P(a)P(b)• With extensive recombination

– Pr(a,b)=P(a)P(b)

Page 42: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Measures of LD

• Consider two bi-allelic sites with alleles marked with 0 and 1

• Define– P00 = Pr[Allele 0 in locus 1, and 0 in locus 2]

– P0* = Pr[Allele 0 in locus 1]

• Linkage equilibrium if P00 = P0* P*0

• D = abs(P00 - P0* P*0) = abs(P01 - P0* P*1) = …

Page 43: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

LD over time

• With random mating, and fixed recombination rate r between the sites, Linkage Disequilibrium will disappear– Let D(t) = LD at time t– P(t)

00 = (1-r) P(t-1)00 + r P(t-1)

0* P(t-1)*0

– D(t) = P(t)00 - P(t)

0* P(t)*0 = P(t)

00 - P(t-1)0* P(t-1)

*0

(Why?)– D(t) =(1-r) D(t-1) =(1-r)t D(0)

Page 44: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Other measures of LD

• D’ is obtained by dividing D by the largest possible value– Dmax = max {P1*P*1, P0*P*1, P1*P*0, P0*P*0 }

– Ex: D’ = abs(P11- P1* P*1)/ Dmax

= D/(P1* P0* P*1 P*0)1/2

• Let N be the number of individuals• Show that 2N is the 2 statistic between the two sites

Site 10

Site 2

1

10

P00N P0*N

Page 45: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

LD over distance

• Assumption– Recombination rate increases linearly with

distance– LD decays exponentially with distance.

• The assumption is reasonable, but recombination rates vary from region to region, adding to complexity

• This simple fact is the basis of disease association mapping.

Page 46: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

LD and disease mapping

• Consider a mutation that is causal for a disease. • The goal of disease gene mapping is to discover

which gene (locus) carries the mutation.• Consider every polymorphism, and check:

– There might be too many polymorphisms – Multiple mutations (even at a single locus) that lead to

the same disease

• Instead, consider a dense sample of polymorphisms that span the genome

Page 47: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

LD can be used to map disease genes

• LD decays with distance from the disease allele.

• By plotting LD, one can short list the region containing the disease gene.

011001

DNNDDN

LD

Page 48: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Page 49: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

• 269 individuals– 90 Yorubans– 90 Europeans (CEPH)– 44 Japanese– 45 Chinese

• ~1M SNPs

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Page 50: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

LD and disease gene mapping problems

• Marker density?• Complex diseases• Population sub-structure

Page 51: CSE280Vineet Bafna CSE280a: Algorithmic topics in bioinformatics Vineet Bafna

CSE280 Vineet Bafna

Topic 2: Simulating population data

• We described various population genetic concepts (HW, LD), and their applicability

• The values of these parameters depend critically upon the population assumptions.– What if we do not have infinite populations– No random mating (Ex: geographic isolation)– Sudden growth– Bottlenecks– Ad-mixture

• It would be nice to have a simulation of such a population to test various ideas. How would you do this simulation?