CS273A. Lecture 10: Comparative Genomics I. MW 12:50-2:05pm in Beckman B302. Profs: Serafim Batzoglou & Gill Bejerano. TAs: Harendra Guturu & Panos Achlioptas. Announcements: HW2 is out. Halfway feedback at the end of this class; please take 5 minutes to share your thoughts with us!
http://cs273a.stanford.edu [BejeranoFall13/14] 1
MW 12:50-2:05pm in Beckman B302
Profs: Serafim Batzoglou & Gill Bejerano
TAs: Harendra Guturu & Panos Achlioptas
CS273A
Lecture 10: Comparative Genomics I
Announcements
• HW2 is out
• Halfway feedback at the end of this class. Please take 5 minutes to share your thoughts with us!
TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAA
TAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATAAAG
Comparative Genomics
[Figure: species labels human, mouse, rat, chimp, chicken, fugu, zfish, dog, tetra, opossum, cow, macaque, platypus; time axis t]
“Nothing in Biology Makes Sense Except in the Light of Evolution” (Theodosius Dobzhansky)
Comparative Genomics
[Figure: species labels human, mouse, rat, chimp, chicken, fugu, zfish, dog, tetra, opossum, cow, macaque, platypus; time axis t]
“Nothing in Evolution Makes Sense Except in the Light of Computation” (Yours Truly)
Evolution = Mutation + Selection
Mistakes can happen during DNA replication. Mistakes are oblivious to DNA segment function. But then selection kicks in.
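[Figure: chicken to egg to chicken. The sequence ...ACGTACGACTGACTAGCATCGACTACGA... is copied to ...ACGTACGACTGACTAGCATCGACTACGA..., with mutations (TT, CAT) marked. Segment labels: functional, junk. Junk: “anything goes”. Functional: many changes are not tolerated.]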
This has bad implications – disease, and good implications – adaptation.
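The "mutation is blind, selection is not" idea can be illustrated with a toy simulation. This is only a sketch under strong assumptions: the function name `evolve`, the choice of which half of the slide's example sequence counts as "functional", and the parameter `p_purge` (the chance that selection removes a mutation hitting a functional base) are all hypothetical, and real selection is far richer than a single rejection probability.

```python
import random

def evolve(seq, functional, steps, rng, p_purge=1.0):
    """Apply `steps` random substitutions to `seq`.

    functional: positions (e.g. a range) assumed to be under negative selection.
    p_purge:    assumed probability that a mutation in a functional position
                is removed by selection (hypothetical toy parameter).
    """
    bases = list(seq)
    for _ in range(steps):
        pos = rng.randrange(len(bases))  # mutation ignores function
        new_base = rng.choice([b for b in "ACGT" if b != bases[pos]])
        if pos in functional and rng.random() < p_purge:
            continue  # selection kicks in: the change is not tolerated
        bases[pos] = new_base  # junk: "anything goes"
    return "".join(bases)

if __name__ == "__main__":
    ancestor = "ACGTACGACTGACTAGCATCGACTACGA"  # sequence shown on the slide
    functional = range(14, 28)                 # assumed split: second half is functional
    print(ancestor)
    print(evolve(ancestor, functional, 200, random.Random(0)))
```

With `p_purge=1.0` the functional half stays identical to the ancestor while the junk half drifts freely. That contrast is the conservation signal comparative genomics looks for.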
Mutation
Chromosomal (i.e., big) Mutations
• Five types exist:
  – Deletion
  – Inversion
  – Duplication
  – Translocation
  – Nondisjunction
Deletion
• Due to breakage
• A piece of a chromosome is lost
Inversion
• Chromosome segment breaks off
• Segment flips around backwards
• Segment reattaches
Duplication
• Occurs when a genomic region is repeated
Whole Genome Duplication at the Base of the Vertebrate Tree
Xen.Laevis WGD
Translocation
• Involves two chromosomes that aren’t homologous
• Part of one chromosome is transferred to another chromosome
Nondisjunction
• Failure of chromosomes to separate during meiosis
• Causes gamete to have too many or too few chromosomes
• Disorders:
  – Down Syndrome: three copies of chromosome 21
  – Turner Syndrome: single X chromosome
  – Klinefelter’s Syndrome: XXY chromosomes
Genomic (i.e., small) Mutations
• Six types exist:
  – Substitution (e.g., G→T)
  – Deletion
  – Insertion
  – Inversion
  – Duplication
  – Translocation
Example: Human-Chimp Genomic Differences
Mutations kill functional elements.
Mutations give rise to new functional elements (by duplicating existing ones, or creating new ones).
Selection whittles down this constant flow of genomic innovations.
Evolution = Mutation + Selection
Neutral Drift
Positive Selection
Negative Selection
Time
The Species Tree
Sampled Genomes
[Figure: tree with speciation events (S) along a time axis; labels: Speciation, Time]
When we compare one individual from each of two species, most, but not all, of the mutations we see are fixed differences between the two species.
Inferring Genomic Histories from Alignments of Genomes
A Gene tree evolves with respect to a Species tree
[Figure: a gene tree drawn together with a species tree. Legend: Speciation, Duplication, Loss]
By “gene” we mean any piece of DNA.
Terminology
Orthologs: Genes related via speciation (e.g. C, M, H3)
Paralogs: Genes related through duplication (e.g. H1, H2, H3)
Homologs: Genes that share a common origin (e.g. C, M, H1, H2, H3)
[Figure: gene tree drawn together with a species tree, starting from a single ancestral gene. Legend: Speciation, Duplication, Loss]
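One way to make these definitions concrete: two genes are orthologs if their last common ancestor in the gene tree is a speciation node, and paralogs if it is a duplication node. The sketch below applies that rule to a small hand-built tree. The topology is an assumption chosen to agree with the slide's examples (C, M, H3 orthologous; H1, H2, H3 paralogous); the figure's actual tree is not recoverable from the transcript, and the tuple encoding and function names are hypothetical.

```python
# A node is either a leaf name (str) or a tuple (event, left, right),
# where event is "speciation" or "duplication".
# ASSUMED topology, for illustration only.
GENE_TREE = (
    "speciation", "C",
    ("speciation", "M",
     ("duplication", "H3",
      ("duplication", "H1", "H2"))),
)

def leaves(node):
    """Return the set of leaf names under a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)

def relation(tree, a, b):
    """Classify two distinct leaves by the event at their last common ancestor."""
    event, left, right = tree
    for child in (left, right):
        if not isinstance(child, str) and {a, b} <= leaves(child):
            return relation(child, a, b)  # both genes sit deeper in this subtree
    return "orthologs" if event == "speciation" else "paralogs"

if __name__ == "__main__":
    for pair in [("C", "M"), ("M", "H3"), ("H1", "H2"), ("H1", "H3")]:
        print(pair, relation(GENE_TREE, *pair))
```

Every pair here is homologous, since all five genes descend from the single ancestral gene at the root. Under this assumed topology M would also be orthologous to H1 and H2; a tree with earlier duplications and later losses would give a different answer.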
Typical Molecular Distances
If they were evolving at a constant rate:
• To which is H1 closer in sequence, H2 or H3?
• To which H is M closest?
• And C?
(Selection may skew distances)
[Figure: gene tree drawn together with a species tree, starting from a single ancestral gene. Legend: Speciation, Duplication, Loss]
Gene trees and even species trees are figments of our (scientific) imagination
Species trees and gene trees can be wrong.
All we really have are extant observations, and fossils.
[Figure: gene tree drawn together with a species tree, starting from a single ancestral gene. Legend: Speciation, Duplication, Loss; Observed, Inferred]
Gene Families
• What?
  • Compare whole genomes
    • Compare two genomes
      • Within (intra) species
      • Between (inter) species
    • Compare genome to itself
  • Compare functional element to a genome
• Why?
  • To learn about genome evolution (and phenotype evolution!)
  • Homologous functional regions often have similar functions
  • Modification of functional regions can reveal
    • Neutral and functional regions
    • Disease susceptibility
    • Adaptation
  • And more..
• How?
Sequence Alignment
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition
Given two strings x = x1x2...xM, y = y1y2...yN, an alignment is an assignment of gaps to positions 0,…,M in x, and 0,…,N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence.
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Scoring Function
• Sequence edits: AGGCCTC
  – Mutations: AGGACTC
  – Insertions: AGGGCCTC
  – Deletions: AGG.CTC
Scoring Function:
  Match: +m
  Mismatch: -s
  Gap: -d
Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d
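The score F of a given alignment can be computed by walking its columns once. A minimal sketch, assuming gaps are written as "-" (the slide uses "." in one place) and that the two rows have equal length; the function name `score_alignment` and the default weights m = s = d = 1 are illustrative choices, not values from the lecture.

```python
def score_alignment(row_x, row_y, m=1, s=1, d=1):
    """Score F = (#matches)*m - (#mismatches)*s - (#gaps)*d for two aligned rows."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same number of columns")
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            gaps += 1          # one gap character per column (linear gap cost)
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d

if __name__ == "__main__":
    # The alignment shown on the "Sequence Alignment" slide.
    x = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
    y = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"
    print(score_alignment(x, y))
    print(score_alignment(x, y, m=2, s=1, d=3))  # harsher gap penalty
```

Changing m, s and d changes which alignment of the same two sequences scores best, which is why the weights need biological justification.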
Alternative definition: minimal edit distance
“Given two strings x, y, find minimum # of edits (insertions, deletions, mutations) to transform one string to the other”
Cost of edit operations needs to be biologically inspired (e.g. DEL length).
Solve via Dynamic Programming
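A minimal sketch of the dynamic program, written as a standard Needleman-Wunsch-style global alignment with a linear gap penalty. That is a simplification: the slide's point about biologically inspired costs (for example, penalties that depend on deletion length) is not modeled here. Function names and default weights are illustrative. F[i][j] holds the best score for aligning the first i letters of x with the first j letters of y.

```python
def global_alignment_score(x, y, m=1, s=1, d=1):
    """Best global alignment score under match +m, mismatch -s, gap -d."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d                      # x prefix aligned to gaps only
    for j in range(1, N + 1):
        F[0][j] = -j * d                      # y prefix aligned to gaps only
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            diag = F[i - 1][j - 1] + (m if x[i - 1] == y[j - 1] else -s)
            up = F[i - 1][j] - d              # x[i-1] aligned to a gap
            left = F[i][j - 1] - d            # y[j-1] aligned to a gap
            F[i][j] = max(diag, up, left)
    return F[M][N]

def edit_distance(x, y):
    """Minimum number of unit-cost edits, via the same DP with m=0, s=1, d=1."""
    return -global_alignment_score(x, y, m=0, s=1, d=1)

if __name__ == "__main__":
    print(global_alignment_score("AGGCCTC", "AGGACTC"))
    print(edit_distance("AGGCCTC", "AGGGCCTC"))
```

The table has (M+1)(N+1) cells and each is filled in constant time, so the run time is O(MN). Keeping a traceback pointer per cell would recover the alignment itself and not just its score.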
Are two sequences homologous?
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Given an (optimal) alignment between two genome regions, you can ask: what is the probability that they are (not) related by homology?
Note that (when known) the answer is a function of the molecular distance between the two (e.g., between two species).
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
[Figure: DP matrix]
Sequence Alignment
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Similarity is often measured using “%id”, or percent identity:
%id = number of matching bases / number of alignment columns
where every alignment column is a match, a mismatch, or an indel base, and indel = insertion or deletion (requires an outgroup to resolve).
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
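Percent identity follows directly from the column counts. A short sketch, assuming "-" marks an indel column and reporting the ratio multiplied by 100; the function name `percent_identity` is illustrative.

```python
def percent_identity(row_x, row_y):
    """%id = 100 * (matching columns) / (all alignment columns)."""
    if len(row_x) != len(row_y) or not row_x:
        raise ValueError("need two non-empty aligned rows of equal length")
    # A column matches only if both rows carry the same base (not a gap).
    matches = sum(1 for a, b in zip(row_x, row_y) if a == b and a != "-")
    return 100.0 * matches / len(row_x)

if __name__ == "__main__":
    x = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
    y = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"
    print(f"{percent_identity(x, y):.1f}% identical")
```

Because indel columns count in the denominator, two alignments of the same pair of sequences can report different %id values. The number describes an alignment, not just the sequences.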
Note the pattern of sequence conservation / divergence
[Figure: human vs. lizard sequence comparison]
Objective: find local alignment blocks, that are likely homologous (share common origin)
O(mn): examine the full matrix using DP
O(m+n): heuristics based on seeding + extension, trading sensitivity for speed
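The seeding + extension idea can be sketched in a few lines: index every k-mer of one sequence, look up the k-mers of the other to get exact-match seeds, then extend each seed in both directions without gaps until the score drops too far below its best value. This is a generic illustration of the heuristic, not the Blastz algorithm. The names `find_seeds` and `extend_seed`, the seed length k, and the `x_drop` cutoff are all assumptions.

```python
from collections import defaultdict

def find_seeds(target, query, k):
    """Return (target_pos, query_pos) for every exact k-mer shared by both sequences."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    seeds = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            seeds.append((i, j))
    return seeds

def extend_seed(target, query, i, j, k, match=1, mismatch=1, x_drop=3):
    """Ungapped extension of a seed; returns (t_start, q_start, length, score)."""
    def walk(ti, qj, step):
        score = best = best_len = n = 0
        while 0 <= ti < len(target) and 0 <= qj < len(query):
            score += match if target[ti] == query[qj] else -mismatch
            n += 1
            if score > best:
                best, best_len = score, n   # remember the best-scoring extent
            elif best - score > x_drop:
                break                       # fell too far below the best: stop
            ti += step
            qj += step
        return best, best_len

    right_score, right_len = walk(i + k, j + k, +1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    score = k * match + left_score + right_score
    return (i - left_len, j - left_len, k + left_len + right_len, score)

if __name__ == "__main__":
    target = "TTTTACGTACGTCCCC"
    query = "GGGACGTACGTAAA"
    for i, j in find_seeds(target, query, k=4):
        print((i, j), extend_seed(target, query, i, j, k=4))
```

The speed comes from touching only the diagonals that contain a seed. The cost in sensitivity is that a diverged homologous region without a single exact k-mer match is never examined.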
“Raw” (B)lastz track (no longer displayed)
Protease Regulatory Subunit 3
Alignment = homologous regions
Chaining co-linear alignment blocks
[Figure: human vs. lizard sequence comparison]
Objective: find local alignment blocks, that are likely homologous (share common origin)
Chaining strings together co-linear blocks in the target genome to which we are comparing.
Double lines when there is unalignable sequence in the other species. Single lines when there isn’t.
Gap Types: Single vs Double sided
[Figure: Human Sequence and Mouse Sequence share blocks D and E; an extra block B’ is drawn with the mouse sequence. Panels: In Human Browser (human sequence, mouse homology) and In Mouse Browser (mouse sequence, human homology).]
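Did Mouse insert or Human delete? The Need for an Outgroup
[Figure: Human Sequence and Mouse Sequence share blocks D and E; an extra block B’ is drawn with the mouse sequence. Panels: In Human Browser (human sequence, mouse homology) and In Mouse Browser (mouse sequence, human homology). An Outgroup Sequence with blocks D and E is added.]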
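The outgroup argument is a parsimony rule: if a segment is present in one of the two species and absent in the other, the outgroup's state is taken as the ancestral state, and the species that differs from it is assumed to have changed. A minimal sketch, assuming a single event and a reliable outgroup alignment; the function name and the returned strings are hypothetical.

```python
def classify_indel(in_human: bool, in_mouse: bool, in_outgroup: bool) -> str:
    """Resolve a human/mouse presence-absence difference using an outgroup."""
    if in_human == in_mouse:
        return "no human/mouse difference"
    carrier, lacker = ("human", "mouse") if in_human else ("mouse", "human")
    if in_outgroup:
        # Outgroup has the segment: it was likely ancestral and later lost.
        return f"deletion in {lacker}"
    # Outgroup lacks the segment: it likely arose after the split.
    return f"insertion in {carrier}"

if __name__ == "__main__":
    # Segment present in mouse, absent in human, under both outgroup scenarios.
    print(classify_indel(in_human=False, in_mouse=True, in_outgroup=True))
    print(classify_indel(in_human=False, in_mouse=True, in_outgroup=False))
```

Without the third sequence the two scenarios look identical in a pairwise alignment, which is why such a column can only be called an "indel".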
Conservation Track Documentation
Dotplots
• Dotplots are a simple way of seeing alignments
  – We really like to see good visual demonstrations, not just tables of numbers
• It’s a grid: put one sequence along the top and the other down the side, and put a dot wherever they match.
• You see the alignment as a diagonal
• Note that DNA dotplots are messier because the alphabet has only 4 letters…
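The grid described above is easy to build as text. A small sketch, with one sequence along the top and the other down the side, and a `*` wherever the length-k words starting at the two positions are equal. The function name `dotplot` and the word-length parameter `k` are illustrative; raising k above 1 is one common way to cut the noise a 4-letter alphabet produces.

```python
def dotplot(top, side, k=1):
    """Return one string per row of the grid; '*' marks equal k-mers, '.' otherwise."""
    rows = []
    for i in range(len(side) - k + 1):
        row = "".join(
            "*" if side[i:i + k] == top[j:j + k] else "."
            for j in range(len(top) - k + 1)
        )
        rows.append(row)
    return rows

if __name__ == "__main__":
    top = "AGGCTATCACCTGACC"
    side = "TAGCTATCACGACC"
    print("single bases (noisy):")
    print("\n".join(dotplot(top, side, k=1)))
    print("words of length 3 (cleaner diagonals):")
    print("\n".join(dotplot(top, side, k=3)))
```

Runs of `*` along a diagonal are aligned stretches, and a shift from one diagonal to another marks an insertion or deletion.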
Chaining Alignments
Chaining highlights homologous regions between genomes, bridging the gulf between syntenic blocks and base-by-base alignments.
Local alignments tend to break at transposon insertions, inversions, duplications, etc.
Global alignments tend to force non-homologous bases to align.
Chaining is a rigorous way of joining together local alignments into larger structures.
“Raw” (B)lastz track (no longer displayed)
Protease Regulatory Subunit 3
Alignment = homologous regions
Chains & Nets: How they’re built
• 1: Blastz one genome to another
  – Local alignment algorithm
  – Finds short blocks of similarity
Hg18: AAAAAACCCCCAAAAA
Mm8: AAAAAAGGGGG
Hg18.1-6 + AAAAAA
Mm8.1-6 + AAAAAA
Hg18.7-11 + CCCCC
Mm8.1-5 - CCCCC
Hg18.12-16 + AAAAA
Mm8.1-5 + AAAAA
Chains & Nets: How they’re built
• 2: “Chain” alignment blocks together
  – Links blocks that preserve order and orientation
  – Not single coverage in either species
Hg18: AAAAAACCCCCAAAAA
Mm8: AAAAAAGGGGGAAAAA
Hg18: AAAAAACCCCCAAAAA
Mm8 chains:
  Mm8.1-6 +
  Mm8.7-11 -
  Mm8.12-16 +
  Mm8.12-15 +
  Mm8.1-5 +
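The rule "link blocks that preserve order and orientation" can be sketched as a small dynamic program: among blocks on the same strand, find the highest-scoring subset whose coordinates increase in both genomes. This is a simplified illustration and not the actual chaining program behind the browser tracks: it ignores gap costs, uses an O(n²) scan, and assumes minus-strand coordinates are already expressed so that co-linear blocks increase. The `Block` fields and the toy blocks, loosely modeled on the Hg18/Mm8 example, are assumptions.

```python
from typing import List, NamedTuple, Tuple

class Block(NamedTuple):
    t_start: int   # target (e.g. Hg18) coordinates, 1-based inclusive
    t_end: int
    q_start: int   # query (e.g. Mm8) coordinates, 1-based inclusive
    q_end: int
    strand: str    # "+" or "-"
    score: float

def best_chain(blocks: List[Block]) -> Tuple[float, List[Block]]:
    """Highest-scoring chain of same-strand, co-linear, non-overlapping blocks."""
    best_score, best_path = 0.0, []
    for strand in ("+", "-"):
        group = sorted((b for b in blocks if b.strand == strand),
                       key=lambda b: (b.t_start, b.q_start))
        n = len(group)
        score = [b.score for b in group]   # best chain score ending at each block
        prev = [-1] * n                    # back-pointers for traceback
        for j in range(n):
            for i in range(j):
                a, b = group[i], group[j]
                colinear = a.t_end < b.t_start and a.q_end < b.q_start
                if colinear and score[i] + b.score > score[j]:
                    score[j] = score[i] + b.score
                    prev[j] = i
        if n:
            j = max(range(n), key=score.__getitem__)
            if score[j] > best_score:
                best_score, path = score[j], []
                while j != -1:
                    path.append(group[j])
                    j = prev[j]
                best_path = path[::-1]
    return best_score, best_path

if __name__ == "__main__":
    blocks = [
        Block(1, 6, 1, 6, "+", 6),       # Hg18.1-6   ~ Mm8.1-6   +
        Block(7, 11, 7, 11, "-", 5),     # Hg18.7-11  ~ Mm8.7-11  -
        Block(12, 16, 12, 16, "+", 5),   # Hg18.12-16 ~ Mm8.12-16 +
        Block(12, 16, 1, 5, "+", 5),     # Hg18.12-16 ~ Mm8.1-5   + (out of order)
    ]
    print(best_chain(blocks))
```

The out-of-order block and the minus-strand block cannot join the main chain, so they would form separate chains. That is the sense in which chaining is not single coverage.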
Another Chain Example
[Figure: Human Sequence and Mouse Sequence with blocks A, B, C, D, E; an extra block B’ is drawn with the mouse sequence. Panels: In Human Browser (implicit human sequence, mouse chains, including B’) and In Mouse Browser (implicit mouse sequence, human chains).]
Chains join together related local alignments
Protease Regulatory Subunit 3
likely ortholog
likely paralogs
shared domain?