xgram and phylo-grammars: A brief intro

Upload: lorraine-daniel

Post on 20-Jan-2016



What is a phylo-grammar?

• Combination of:
  – Phylogenetic likelihood model
    • Tree with branch lengths, t
    • Rate matrix, R (continuous-time Markov chain)
    • Edge probabilities: exp(R*t)
  – Stochastic grammar
    • Grammar symbols (nonterminals and terminals)
    • Production rules (with probabilities)
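To make the edge-probability formula exp(R*t) concrete, here is a minimal pure-Python sketch. It is not how xgram computes matrix exponentials; the helper names (`mat_mul`, `mat_exp`), the Jukes-Cantor-style rate matrix and the branch length 0.1 are all assumptions chosen for illustration. The result is checked against the known closed form for that matrix.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(r, t, squarings=10, terms=12):
    """Approximate exp(R*t) with a truncated Taylor series plus scaling and squaring."""
    n = len(r)
    scale = t / (2 ** squarings)
    m = [[r[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds m^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical 4-state rate matrix: every off-diagonal rate is 1/3, rows sum to zero.
R = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
t = 0.1  # hypothetical branch length

P = mat_exp(R, t)  # P[i][j] = probability of state j at the child given state i at the parent

# Closed form for this particular matrix: P(i,i) = 1/4 + 3/4 * exp(-4t/3)
expected_diag = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
print(P[0][0], expected_diag)
```

Each row of P sums to 1, as it must for a matrix of conditional probabilities along one branch.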


Grammars and dependencies

• Nested dependencies (context-free; Chomsky), e.g. TIR

• Cross-serial dependencies (“mildly” context-sensitive; Joshi), e.g. LTR

• Adjacent dependencies (regular/HMMs), e.g. microsatellite: ATATATATATATATATATTAT

• Fuzzy duck, fuzzy duck, duckie fuzz, duckie fuzz, duckie fuzz…


PFOLD: Knudsen & Hein, 1999


S -> LS | L;
F -> dFd | LS;
L -> s | dFd;

dd:

s:
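As an illustration of what the three PFOLD rules generate, the sketch below samples a secondary-structure string from the grammar, writing each paired "d…d" as "<…>" and each single "s" as ".". The rule probabilities reuse the values from the pfold grammar file excerpt elsewhere in this deck; mapping them onto the three-rule grammar is an assumption. The function names are hypothetical, and this is not xgram code.

```python
import random

# Assumed rule probabilities (taken from the pfold grammar excerpt in this deck).
P_S_TO_L = 0.131488    # S -> L      (otherwise S -> L S)
P_F_TO_DFD = 0.787854  # F -> d F d  (otherwise F -> L S)
P_L_TO_S = 0.895025    # L -> s      (otherwise L -> d F d)


def sample_structure(rng):
    """Expand S left to right with an explicit stack; return a '<', '>', '.' string."""
    out = []
    stack = ["S"]
    while stack:
        sym = stack.pop()
        if sym == "S":
            if rng.random() < P_S_TO_L:
                stack.append("L")
            else:
                stack.extend(["S", "L"])       # L is popped (expanded) first
        elif sym == "F":
            if rng.random() < P_F_TO_DFD:
                stack.extend([">", "F", "<"])  # '<' is emitted first
            else:
                stack.extend(["S", "L"])
        elif sym == "L":
            if rng.random() < P_L_TO_S:
                out.append(".")
            else:
                stack.extend([">", "F", "<"])
        else:
            out.append(sym)                    # terminal '<' or '>'
    return "".join(out)


def is_balanced(structure):
    """True if every '<' has a matching later '>'."""
    depth = 0
    for ch in structure:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


print(sample_structure(random.Random(1)))
```

Because paired symbols are only ever produced together by a "dFd" rule, every sampled string is a properly nested structure, which is the context-free dependency this grammar is meant to capture.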


Codon evolution and exon models

Goldman & Kosiol, MS in prep


EvoGene (Pedersen & Hein)

cf. Exoniphy (Siepel & Haussler), whose “null model” is considerably more sophisticated: context-dependent substitutions are explicitly modeled, yielding higher-order dependence between noncoding bases


PHASTCONS phylo-HMM


xgram command line usage


xgram examples

• Load alignment from file “align.stk”; load grammar from file “grammar.eg”; estimate tree by neighbor-joining (if there isn’t a tree annotated to the alignment already); do CYK algorithm (or Viterbi, as appropriate); annotate alignment; print to standard output

xgram align.stk -g grammar.eg

• Load alignment from file “align2.stk”; load grammar from file “grammar2.eg”; optimize branch lengths of tree; do EM by iterating Inside-Outside algorithm (or Forward-Backward); save grammar to file “trained.eg”; print log messages to level 5

xgram align2.stk -g grammar2.eg -b -t trained.eg --noannotate -log 5


xgram grammar elements

• Alphabet
  – Tokens
  – Complementarity
  – Degeneracies
• Grammar
  – Chains
    • Pseudoterminals
    • Initial probabilities
    • Mutation rates
    • Update policy
  – Production rules (& nonterminals)
    • Emissions (paired emit-null states)
      – Annotation labels (optional)
      – Gap model (optional)
    • Transitions (aka “null” states)
    • Bifurcations
  – Parameters (optional)
    • Rate parameters
    • Probability parameters

(alphabet
 (name RNA)
 (token (a c g u))
 (complement (u g c a))
 (extend (to n) (from a) (from c) (from g) (from u))
 (extend (to x) (from a) (from c) (from g) (from u))
 (extend (to t) (from u))
 (extend (to r) (from a) (from g))
 (extend (to y) (from c) (from u))
 (extend (to m) (from a) (from c))
 (extend (to k) (from g) (from u))
 (extend (to s) (from c) (from g))
 (extend (to w) (from a) (from u))
 (extend (to h) (from a) (from c) (from u))
 (extend (to b) (from c) (from g) (from u))
 (extend (to v) (from a) (from c) (from g))
 (extend (to d) (from a) (from g) (from u))
 (wildcard *)
) ;; end alphabet RNA

(grammar
 (name pfold)
 (update-rates 1)
 (update-rules 1)

 (chain
  (update-policy rev)
  (terminal (LNUC RNUC))

  ;; initial probability distribution
  (initial (state (a a)) (prob 0.001167))
  (initial (state (c a)) (prob 0.001806))
  (initial (state (g a)) (prob 0.001058))
  (initial (state (u a)) (prob 0.177977))
  (initial (state (a c)) (prob 0.001806))
  (initial (state (c c)) (prob 0.000391))
  (initial (state (g c)) (prob 0.266974))
  (initial (state (u c)) (prob 0.000763))
  (initial (state (a g)) (prob 0.001058))
  …….


Types of production rule

;; state pfoldS (Null)
(transform (from (pfoldS)) (to (pfoldL)) (prob 0.131488))
(transform (from (pfoldS)) (to (pfoldB)) (prob 0.868742))

;; state pfoldF (Emit; pfoldF' is Null)
(transform (from (pfoldF)) (to (LNUC pfoldF' RNUC))
 (gaps-ok)
 (annotate (row PFOLD) (column LNUC) (label <))
 (annotate (row PFOLD) (column RNUC) (label >)))
(transform (from (pfoldF')) (to (pfoldF)) (prob 0.787854))
(transform (from (pfoldF')) (to (pfoldB)) (prob 0.212421))

;; state pfoldL (Null)
(transform (from (pfoldL)) (to (pfoldF)) (prob 0.105404))
(transform (from (pfoldL)) (to (pfoldU)) (prob 0.895025))

;; state pfoldB (Bifurcate)
(transform (from (pfoldB)) (to (pfoldL pfoldS)))

;; state pfoldU (Emit; pfoldU' is Null (end))
(transform (from (pfoldU)) (to (NUC pfoldU'))
 (gaps-ok))
(transform (from (pfoldU')) (to ()) (prob 1))


Types of rate matrix

• update-policy for a chain can be
  – rind
    • i.e. R(i,j) = R*pi(j)
    • Could also implement this with parametric (see below)
  – rev
    • i.e. reversible: pi(i) * R(i,j) = pi(j) * R(j,i)
  – irrev
    • i.e. irreversible (more general)
  – parametric
    • i.e. R(i,j) = f(a,b,c,d,e….)
    • (a,b,c,d,e….) are independent parameters
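The difference between these policies can be checked numerically. The sketch below builds a "rind"-style matrix, reading R(i,j) = R*pi(j) as a scalar rate times the equilibrium frequency of the destination state (that reading, the function names, and the example numbers are assumptions), and tests the detailed-balance condition that defines "rev". A cyclic three-state matrix serves as an example that only "irrev" could represent.

```python
def rind_matrix(pi, rate=1.0):
    """Off-diagonal R(i,j) = rate * pi(j); diagonal chosen so each row sums to zero."""
    n = len(pi)
    r = [[rate * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        r[i][i] = -sum(r[i])
    return r


def is_reversible(pi, r, tol=1e-12):
    """Detailed balance: pi(i) * R(i,j) == pi(j) * R(j,i) for all i, j."""
    n = len(pi)
    return all(abs(pi[i] * r[i][j] - pi[j] * r[j][i]) <= tol
               for i in range(n) for j in range(n))


# Hypothetical equilibrium frequencies for a 4-state chain.
pi = [0.1, 0.2, 0.3, 0.4]
R_rind = rind_matrix(pi)

# A cyclic chain (0 -> 1 -> 2 -> 0) has a uniform equilibrium but is not reversible.
pi_cyclic = [1.0 / 3.0] * 3
R_cyclic = [[-1.0, 1.0, 0.0],
            [0.0, -1.0, 1.0],
            [1.0, 0.0, -1.0]]

print(is_reversible(pi, R_rind))            # a rind matrix satisfies detailed balance
print(is_reversible(pi_cyclic, R_cyclic))   # the cyclic matrix does not
```

Under this reading every "rind" matrix is also reversible, so the policies nest from fewest to most free parameters: rind, then rev, then irrev.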


Annotation & supervised learning

• Use the “annotate” element to add annotation lines to the alignment

• Add these annotation lines yourself prior to training, to force a particular parse (supervised learning)

• Period characters are treated as wildcards (partially supervised learning)

(transform (from (pfoldF)) (to (LNUC pfoldF' RNUC))
 (gaps-ok)
 (annotate (row PFOLD) (column LNUC) (label <))
 (annotate (row PFOLD) (column RNUC) (label >)))

#=GC PFOLD ...................<<<<<<<<...<<<<<........>>>>>..<...>>>.<<<<<<<.......>>>>>>>..>>>>>>.................
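When writing such an annotation line by hand for supervised training, it can help to confirm that the "<" and ">" labels pair up before running xgram. The following is a small sketch of such a check; it is not part of xgram or the DART Perl modules, and the function name is hypothetical. Period characters are simply skipped, matching their wildcard role.

```python
def paired_columns(annotation, open_char="<", close_char=">"):
    """Return sorted (left, right) 0-based column pairs; raise ValueError if unbalanced."""
    stack = []
    pairs = []
    for col, ch in enumerate(annotation):
        if ch == open_char:
            stack.append(col)
        elif ch == close_char:
            if not stack:
                raise ValueError("unmatched %r at column %d" % (close_char, col))
            pairs.append((stack.pop(), col))
        # any other character (e.g. '.') is treated as unconstrained
    if stack:
        raise ValueError("unmatched %r at column %d" % (open_char, stack[-1]))
    return sorted(pairs)


example = "..<<..>>."          # toy annotation, not taken from a real alignment
pairs = paired_columns(example)
print(pairs)                   # [(2, 7), (3, 6)]
```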


Context-dependent substitutions

• Let P(n) be the probability of column n of an alignment

• Context-independent model says that
  P(1..N) = P(1)P(2)P(3)…P(N)

• More generally (context-dependence):
  P(1..N) = P(1)P(2|1)P(3|1,2)P(4|1,2,3)…

• Siepel & Haussler approximated this by
  P(1..N) ≈ P(1)P(2|1)P(3|2)…P(n|n-1)…

• Here P(n|n-1) is obtained from a dinucleotide model for {n-1,n}, using Bayes’ theorem:
  P(n|n-1) = P(n-1,n) / P(n-1)
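The first-order approximation can be written out in a few lines. In this sketch the per-column probabilities P(n) and the joint probabilities P(n-1,n) are assumed to be given as plain lists (in practice they would come from phylogenetic likelihood calculations under single-nucleotide and dinucleotide models); the function name and the toy numbers are hypothetical.

```python
import math


def log_prob_first_order(single, pair):
    """log of P(1) * prod_n P(n-1,n) / P(n-1).

    single[k] is P(column k); pair[k] is the joint P(column k, column k+1),
    both 0-based, so len(pair) == len(single) - 1.
    """
    if len(pair) != len(single) - 1:
        raise ValueError("need exactly one pair probability per adjacent column pair")
    total = math.log(single[0])
    for k in range(1, len(single)):
        # conditional P(k | k-1) = P(k-1, k) / P(k-1)
        total += math.log(pair[k - 1] / single[k - 1])
    return total


# Toy numbers: when adjacent columns are independent, P(n-1,n) = P(n-1)P(n),
# and the approximation reduces to the context-independent product.
single = [0.5, 0.2, 0.1]
pair_independent = [0.5 * 0.2, 0.2 * 0.1]
print(log_prob_first_order(single, pair_independent), math.log(0.5 * 0.2 * 0.1))
```

Working in log space avoids underflow when the product runs over many alignment columns.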


Context-dependent emit rules

(transform
 (from (PREVNUC S))
 (to (PREVNUC EMITNUC S')))

Here {PREVNUC,EMITNUC} are the pseudoterminals for a dinucleotide chain

NB: context-dependent substitution models are generally irreversible (e.g. CpG -> TpG)


Parametric models

• Any rate or probability in a grammar can be replaced by a parametric function

• This is useful to constrain models

• e.g. PHASTCONS phylo-HMM


How the “-length” argument works

• DP is an iteration over subsequences

• In long (e.g. genomic) alignments, you can save time by not considering all subseqs
  – e.g. you probably don’t expect bases over 1 Mb apart to be paired

• The -length command-line argument allows you to limit the maximum length of subseqs that will be iterated over
  – All suffix subseqs are always included, however, so there is always a valid global parse tree

• Care must be taken in the design of grammars, particularly if you want to find “local” features
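To see how much work a length limit saves, the sketch below enumerates the subsequences that would be visited: every subsequence no longer than the limit, plus every suffix of the alignment regardless of length. This only illustrates the rule described above and is not xgram's actual iteration code; the function name and the half-open (start, end) coordinates are assumptions.

```python
def subsequences(seq_len, max_len=None):
    """Yield half-open (start, end) column ranges, including empty ones.

    With max_len set, keep a range only if it is short enough or is a suffix
    (end == seq_len), so a full-length parse is always reachable.
    """
    for start in range(seq_len + 1):
        for end in range(start, seq_len + 1):
            if max_len is None or end - start <= max_len or end == seq_len:
                yield (start, end)


n_all = sum(1 for _ in subsequences(10))
n_limited = sum(1 for _ in subsequences(10, max_len=3))
print(n_all, n_limited)  # 66 45
```

Without a limit the count grows quadratically in the alignment length; with a fixed limit it grows linearly, which is where the time saving on genomic alignments comes from.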


Speed tips

• Use “-length”
  – Also “minlen” and “maxlen” in grammar

• Replace emit loops with bifurcations
  – E.g. instead of “S -> x S” use “S -> X S; X -> x” and limit “maxlen” for X to 1
  – NB this goes against the “standard” SCFG dogma of minimizing bifurcations; the reason is that with big trees, emissions become more expensive

• Turn off logging once the model is debugged


DART logging and dartlog.pl


Model debugging tips

• Minimal test cases to reproduce errors

• Use logging during model development
  – Examine source code for log messages
  – E.g. “-log CYK_MATRIX”

• Use Makefiles for reproducibility

• To protect against EM getting stuck in local minima, try training a low-dimensional model first (e.g. using “rind”), then move to models with more degrees of freedom (rev -> irrev)


Perl xgram modules

• In dart/perl
  – Stockholm.pm
    • Stockholm alignment class. Pretty basic
  – DartSexpr.pm
    • Class for working with S-expressions
  – PhyloGram.pm
    • Subclass of DartSexpr for phylo-grammars
    • Has accessors/helpers for various common tasks, e.g. creating & populating new chains, emit rules, etc.
    • Subclasses: DNA.pm and Protein.pm
    • Related class: Chain.pm
  – Not much documentation (but see first few lines of each file for examples, or bug me)
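For readers who want to inspect grammar files outside Perl, the idea behind an S-expression class can be sketched in a few lines of Python. This is not a port of DartSexpr.pm and does not reproduce its interface; the function names are hypothetical, and the comment handling (everything after ";" on a line is dropped) is an assumption based on the ";;" comments in the grammar excerpts in this deck.

```python
def tokenize(text):
    """Split S-expression text into '(' , ')' and atom tokens, dropping ';' comments."""
    tokens = []
    for line in text.splitlines():
        line = line.split(";", 1)[0]
        tokens.extend(line.replace("(", " ( ").replace(")", " ) ").split())
    return tokens


def parse(tokens):
    """Turn a token list into nested Python lists; atoms stay as strings."""
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


text = "(transform (from (pfoldS)) (to (pfoldL)) (prob 0.131488)) ;; a null transition"
rules = parse(tokenize(text))
print(rules)
```

Each top-level expression becomes one nested list, so a rule's probability can be looked up by scanning for the sub-list whose first element is "prob".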


Rudimentary indel models

• Highly experimental

• Attempt to deal with gaps more intelligently than just ignoring them

• Falls somewhat short of a full “statistical alignment” treatment

(transform (from (S)) (to (X S'))
 (gap-model
  (extend-prob 0.5)
  (insert-rate 0.01)
  (delete-rate 0.01)))