mw  12:50-2:05pm in beckman b302 profs: serafim batzoglou & gill bejerano

42
http://cs273a.stanford.edu [BejeranoFall13/14] 1 MW 12:50-2:05pm in Beckman B302 Profs: Serafim Batzoglou & Gill Bejeran TAs: Harendra Guturu & Panos Achlioptas CS273A Lecture 10: Comparative Genomics I

Upload: tassos

Post on 05-Jan-2016

26 views

Category:

Documents


0 download

DESCRIPTION

CS273A. Lecture 10: Comparative Genomics I. MW  12:50-2:05pm in Beckman B302 Profs: Serafim Batzoglou & Gill Bejerano TAs: Harendra Guturu & Panos Achlioptas. Announcements. HW2 is out Half way feedback end of this class. Please take 5 minutes to share your thoughts with us!. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

http://cs273a.stanford.edu [BejeranoFall13/14] 1

MW  12:50-2:05pm in Beckman B302

Profs: Serafim Batzoglou & Gill Bejerano

TAs: Harendra Guturu & Panos Achlioptas

CS273A

Lecture 10: Comparative Genomics I

Page 2: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

http://cs273a.stanford.edu [BejeranoFall13/14] 2

Announcements

• HW2 is out• Half way feedback end of this class. Please take 5 minutes to share your thoughts with us!

Page 3: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATAAAG

3

Page 4: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

http://cs273a.stanford.edu [BejeranoFall13/14] 4

human

mouse

rat

chimp

chicken

fugu

zfish

dog

tetra

human

mouse

rat

chimp

chicken

fugu

zfish

dog

tetra

opossum

cow

macaque

platypus

opossum

cow

macaque

platypus

Comparative Genomics

“Nothing in Biology Makes Sense Except in the Light of Evolution” Theodosius Dobzhansky

t

Page 5: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

http://cs273a.stanford.edu [BejeranoFall13/14] 5

human

mouse

rat

chimp

chicken

fugu

zfish

dog

tetra

human

mouse

rat

chimp

chicken

fugu

zfish

dog

tetra

opossum

cow

macaque

platypus

opossum

cow

macaque

platypus

Comparative Genomics

“Nothing in Evolution Makes Sense Except in the Light of Computation” Yours Truly

t

Page 6: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

http://cs273a.stanford.edu [BejeranoFall13/14] 6

Evolution = Mutation + Selection

Mistakes can happen during DNA replication. Mistakes are oblivious to DNA segment function. But then selection kicks in.

...ACGTACGACTGACTAGCATCGACTACGA...

chicken

egg...ACGTACGACTGACTAGCATCGACTACGA...

functionaljunk

TT CAT

“anything

goes”

many changes

are not tolerated

chicken

This has bad implications – disease, and good implications – adaptation.

Page 7: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

http://cs273a.stanford.edu [BejeranoFall13/14] 7

Mutation

Page 8: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Chromosomal (ie big) Mutations

• Five types exist:–Deletion–Inversion–Duplication–Translocation–Nondisjunction

Page 9: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Deletion

• Due to breakage• A piece of a

chromosome is lost

Page 10: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Inversion• Chromosome segment

breaks off• Segment flips around

backwards• Segment reattaches

Page 11: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Duplication

• Occurs when a genomic region is repeated

Page 12: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Whole Genome Duplication at the Base of the Vertebrate Tree

http://cs273a.stanford.edu [BejeranoFall13/14] 12

Xen.Laevis WGD

Page 13: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Translocation

• Involves two chromosomes that aren’t homologous

• Part of one chromosome is transferred to another chromosomes

Page 14: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Nondisjunction• Failure of chromosomes to

separate during meiosis• Causes gamete to have too many

or too few chromosomes• Disorders:

– Down Syndrome – three 21st chromosomes

– Turner Syndrome – single X chromosome– Klinefelter’s Syndrome – XXY

chromosomes

Page 15: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Genomic (ie small) Mutations

• Six types exist:–Substitution (eg GT)

–Deletion–Insertion–Inversion–Duplication–Translocation

Page 16: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Example: Human-Chimp Genomic Differences

http://cs273a.stanford.edu [BejeranoFall13/14] 16

Mutations kill functional elements.

Mutations give rise to new functional elements(by duplicating existing ones, or creating new ones)

Selection whittles this constant flow of genomic innovations.

Page 17: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

http://cs273a.stanford.edu [BejeranoFall13/14] 17

Evolution = Mutation + Selection

Neutral Drift Positive SelectionNegative Selection

Time

Page 18: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

The Species Tree

http://cs273a.stanford.edu [BejeranoFall13/14] 18

Sampled Genomes

S

S

S SpeciationTime

When we compare one individual from two species, most, but not all mutations we see are fixed differences between the two species.

Page 19: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

http://cs273a.stanford.edu [BejeranoFall13/14] 19

Inferring Genomic Histories

From Alignments of Genomes

Page 20: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

20

A Gene tree evolves with respect to a Species tree

Species tree

Gene tree

SpeciationDuplicationLoss

By “Gene” we meanany piece of DNA.

Page 21: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

http://cs273a.stanford.edu [BejeranoFall13/14] 21

Terminology

Orthologs : Genes related via speciation (e.g. C,M,H3)

Paralogs: Genes related through duplication (e.g. H1,H2,H3)

Homologs: Genes that share a common origin (e.g. C,M,H1,H2,H3)

Species tree

Gene tree

SpeciationDuplicationLoss

single

ancestral

gene

Page 22: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

http://cs273a.stanford.edu [BejeranoFall13/14] 22

Typical Molecular Distances

If they were evolving at a constant rate:• To which is H1 closer in sequence, H2 or H3?• To which H is M closest?• And C?

(Selection may skew distances)

Species tree

Gene tree

SpeciationDuplicationLoss

single

ancestral

gene

Page 23: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

http://cs273a.stanford.edu [BejeranoFall13/14] 23

Gene trees and even species trees are figments of our (scientific) imagination

Species trees and gene trees can be wrong.

All we really have are extant observations, and fossils.

Species tree

Gene tree

SpeciationDuplicationLoss

single

ancestral

gene

ObservedInferred

Page 24: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Gene Families

24

Page 25: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

• What?• Compare whole genomes

• Compare two genomes• Within (intra) species• Between (inter) species

• Compare genome to itself

• Compare functional element to a genome• Why?

• To learn about genome evolution (and phenotype evolution!)• Homologous functional regions often have similar functions• Modification of functional regions can reveal

• Neutral and functional regions• Disease susceptibility• Adaptation

• And more..• How?

http://cs273a.stanford.edu [BejeranoFall13/14] 25

Page 26: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Sequence Alignment

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

DefinitionGiven two strings x = x1x2...xM, y = y1y2…yN,

an alignment is an assignment of gaps to positions0,…, N in x, and 0,…, N in y, so as to line up each

letter in one sequence with either a letter, or a gapin the other sequence

AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCGGTCGATTTGCCCGAC

Page 27: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Scoring Function

• Sequence edits:AGGCCTC

Mutations AGGACTC

Insertions AGGGCCTC

Deletions AGG . CTC

Scoring Function:Match: +mMismatch: -sGap: -d

Score F = (# matches) m - (# mismatches) s – (#gaps) d

Alternative definition:

minimal edit distance

“Given two strings x, y,find minimum # of edits (insertions, deletions,

mutations) to transform one string to the other”

Cost of edit operationsneeds to be biologicallyinspired (eg DEL length).

Solve via Dynamic Programming

Page 28: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Are two sequences homologous?

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Given an (optimal) alignment between two genome regions,you can ask what is the probability that they are (not) related by homology?

Note that (when known) the answer is a function of the molecular distance between the two (eg, between two species)

AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCGGTCGATTTGCCCGAC

DP matrix:

Page 29: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Sequence Alignment

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Similarity is often measured using “%id”, or percent identity

%id = number of matching bases / number of alignment columns

Where

Every alignment column is a match / mismatch / indel base

Where indel = insertion or deletion (requires an outgroup to resolve)

AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCGGTCGATTTGCCCGAC

Page 30: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Note the pattern of sequence conservation / divergence

http://cs273a.stanford.edu [BejeranoFall13/14] 30

human

lizar

d

Objective: find local alignment blocks, that are likely homologous (share common origin)

O(mn) examine the full matrix using DP

O(m+n) heuristics based on seeding + extension trades sensitivity for speed

Page 31: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

31

“Raw” (B)lastz track (no longer displayed)

Protease Regulatory Subunit 3

Alignment = homologous regions

Page 32: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Chaining co-linear alignment blocks

http://cs273a.stanford.edu [BejeranoFall13/14] 32

human

lizar

d

Objective: find local alignment blocks, that are likely homologous (share common origin)

Chaining strings together co-linear blocks in the target genome to which we are comparing.

Double lines when there is unalignable sequence in the other species. Single lines when there isn’t.

Page 33: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Gap Types: Single vs Double sided

D E

Human Sequence

D E

Mouse Sequence

B’

In Human BrowserHumansequence

Mousehomology

D E

D E

In Mouse BrowserMousesequence

Humanhomology

D E

33

Page 34: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Did Mouse insert or Human delete?The Need for an Outgroup

D E

Human Sequence

D E

Mouse Sequence

B’

In Human BrowserHumansequence

Mousehomology

D E

D E

In Mouse BrowserMousesequence

Humanhomology

D E

34

Outgroup Sequence

D E

Page 35: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

http://cs273a.stanford.edu [BejeranoFall13/14] 35

Conservation Track Documentation

Page 36: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Dotplots

• Dotplots are a simple way of seeing alignments– We really like to see good

visual demonstrations, not just tables of numbers

• It’s a grid: put one sequence along the top and the other down the side, and put a dot wherever they match.

• You see the alignment as a diagonal

• Note that DNA dotplots are messier because the alphabet has only 4 letters…

Page 37: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

http://cs273a.stanford.edu [BejeranoFall13/14] 37

Chaining Alignments

Chaining highlights homologous regions between genomes, bridging the gulf between syntenic blocks and base-by-base alignments.

Local alignments tend to break at transposon insertions, inversions, duplications, etc.

Global alignments tend to force non-homologous bases to align.

Chaining is a rigorous way of joining together local alignments into larger structures.

Page 38: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

38

“Raw” (B)lastz track (no longer displayed)

Protease Regulatory Subunit 3

Alignment = homologous regions

Page 39: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Chains & Nets: How they’re built

• 1: Blastz one genome to another– Local alignment algorithm– Finds short blocks of similarity

Hg18: AAAAAACCCCCAAAAAMm8: AAAAAAGGGGG

Hg18.1-6 + AAAAAAMm8.1-6 + AAAAAA

Hg18.7-11 + CCCCCMm8.1-5 - CCCCC

Hg18.12-16 + AAAAAMm8.1-5 + AAAAA

39

Page 40: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Chains & Nets: How they’re built• 2: “Chain” alignment blocks together

– Links blocks that preserve order and orientation– Not single coverage in either species

Hg18: AAAAAACCCCCAAAAAMm8: AAAAAAGGGGGAAAAA

Hg18: AAAAAACCCCCAAAAA Mm8 chains

Mm8.1-6 +

Mm8.7-11 -

Mm8.12-16 +

Mm8.12-15 + Mm8.1-5 + 40

Page 41: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

Another Chain Example

A B CD E

Human SequenceA B CD E

Mouse Sequence

B’

In Human BrowserImplicitHumansequence

Mousechains B’

D E

D E

In Mouse BrowserImplicitMousesequence

Humanchains

… D E

41

Page 42: MW   12:50-2:05pm  in Beckman B302 Profs: Serafim  Batzoglou  & Gill  Bejerano

http://cs273a.stanford.edu [BejeranoFall13/14] 42

Chains join together related local alignments

Protease Regulatory Subunit 3

likely ortholog

likely paralogsshared domain?