
xgram and phylo-grammars

A brief intro

What is a phylo-grammar?

• Combination of:
– Phylogenetic likelihood model
    • Tree with branch lengths, t
    • Rate matrix, R (continuous-time Markov chain)
    • Edge probabilities: exp(R*t)
– Stochastic grammar
    • Grammar symbols (nonterminals and terminals)
    • Production rules (with probabilities)
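As a rough illustration of the edge probabilities exp(R*t), the sketch below exponentiates a toy rate matrix in plain Python. The Jukes-Cantor rates, the function names and the truncated-series method are all assumptions made for this example; nothing here is taken from xgram itself.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(m, squarings=10, terms=12):
    """Approximate exp(m): truncated Taylor series plus scaling-and-squaring."""
    n = len(m)
    scale = 2.0 ** squarings
    small = [[x / scale for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term now holds small^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def jukes_cantor(rate=1.0):
    """Toy 4x4 rate matrix (assumed): equal off-diagonal rates, rows sum to 0."""
    off = rate / 3.0
    return [[-rate if i == j else off for j in range(4)] for i in range(4)]

def edge_probabilities(rate_matrix, t):
    """Substitution probabilities exp(R*t) along a branch of length t."""
    return mat_exp([[x * t for x in row] for row in rate_matrix])

def jc_closed_form(t, rate=1.0):
    """Known closed form for the Jukes-Cantor diagonal entry, used as a check."""
    return 0.25 + 0.75 * math.exp(-4.0 * rate * t / 3.0)

P = edge_probabilities(jukes_cantor(), 0.1)
print(P[0][0], jc_closed_form(0.1))
```

Each row of P is a probability distribution over the descendant state, so the rows should sum to one; comparing against the closed form is a cheap sanity check on the numerics.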

Grammars and dependencies

Nested dependencies

(context-free; Chomsky)

Cross-serial dependencies

(“mildly” context-sensitive; Joshi)

TIR

LTR

Adjacent dependencies

(regular/HMMs)

ATATATATATATATATATTAT

Microsatellite

Fuzzy duck, fuzzy duck, duckie fuzz, duckie fuzz, duckie fuzz…

PFOLD: Knudsen & Hein, 1999


S -> LS | L;
F -> dFd | LS;
L -> s | dFd;

[Figures not recovered: panels labelled “dd” and “s”]
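To make the three PFOLD rules concrete, here is a small sampler that draws structures from the grammar. It reads s as an unpaired column (written ".") and d…d as a paired column (written "<" and ">"), which is an interpretation assumed for this sketch; the rule probabilities are hypothetical values picked only so that sampling terminates.

```python
import random

# Hypothetical rule probabilities (not from PFOLD or xgram).
P_S_EXTEND = 0.6   # S -> L S    (otherwise S -> L)
P_F_EXTEND = 0.75  # F -> d F d  (otherwise F -> L S)
P_L_STEM = 0.2     # L -> d F d  (otherwise L -> s)

def sample_S(rng):
    """S -> L S | L, unrolled into a loop: one L, then more while extending."""
    out = sample_L(rng)
    while rng.random() < P_S_EXTEND:
        out += sample_L(rng)
    return out

def sample_F(rng):
    """F -> d F d | L S: stack some extra base pairs, then close with L S."""
    extra_pairs = 0
    while rng.random() < P_F_EXTEND:
        extra_pairs += 1
    inner = sample_L(rng) + sample_S(rng)
    return "<" * extra_pairs + inner + ">" * extra_pairs

def sample_L(rng):
    """L -> s | d F d: a single unpaired column, or the start of a stem."""
    if rng.random() < P_L_STEM:
        return "<" + sample_F(rng) + ">"
    return "."

def is_balanced(structure):
    """True if every '<' is closed by a later '>'."""
    depth = 0
    for ch in structure:
        depth += (ch == "<") - (ch == ">")
        if depth < 0:
            return False
    return depth == 0

rng = random.Random(1)
for _ in range(3):
    print(sample_S(rng))
```

Because F can only terminate through L S, every stem encloses at least two loop positions, which is one reason the grammar is written with the separate F nonterminal.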

Codon evolution and exon models

Goldman & Kosiol, MS in prep

EvoGene (Pedersen & Hein)

cf. Exoniphy (Siepel & Haussler), whose “null model” is considerably more sophisticated (context-dependent substitutions are explicitly modeled, yielding higher-order dependence between noncoding bases)

PHASTCONS phylo-HMM

xgram command line usage

xgram examples

• Load alignment from file “align.stk”; load grammar from file “grammar.eg”; estimate tree by neighbor-joining (if there isn’t a tree annotated to the alignment already); do CYK algorithm (or Viterbi, as appropriate); annotate alignment; print to standard output

xgram align.stk -g grammar.eg

• Load alignment from file “align2.stk”; load grammar from file “grammar2.eg”; optimize branch lengths of tree; do EM by iterating Inside-Outside algorithm (or Forward-Backward); save grammar to file “trained.eg”; print log messages to level 5

xgram align2.stk -g grammar2.eg -b -t trained.eg --noannotate -log 5

xgram grammar elements

• Alphabet
– Tokens
– Complementarity
– Degeneracies

• Grammar
– Chains
    • Pseudoterminals
    • Initial probabilities
    • Mutation rates
    • Update policy
– Production rules (& nonterminals)
    • Emissions (paired emit-null states)
        – Annotation labels (optional)
        – Gap model (optional)
    • Transitions (aka “null” states)
    • Bifurcations
– Parameters (optional)
    • Rate parameters
    • Probability parameters

(alphabet
 (name RNA)
 (token (a c g u))
 (complement (u g c a))
 (extend (to n) (from a) (from c) (from g) (from u))
 (extend (to x) (from a) (from c) (from g) (from u))
 (extend (to t) (from u))
 (extend (to r) (from a) (from g))
 (extend (to y) (from c) (from u))
 (extend (to m) (from a) (from c))
 (extend (to k) (from g) (from u))
 (extend (to s) (from c) (from g))
 (extend (to w) (from a) (from u))
 (extend (to h) (from a) (from c) (from u))
 (extend (to b) (from c) (from g) (from u))
 (extend (to v) (from a) (from c) (from g))
 (extend (to d) (from a) (from g) (from u))
 (wildcard *)
) ;; end alphabet RNA

(grammar
 (name pfold)
 (update-rates 1)
 (update-rules 1)

 (chain
  (update-policy rev)
  (terminal (LNUC RNUC))

  ;; initial probability distribution
  (initial (state (a a)) (prob 0.001167))
  (initial (state (c a)) (prob 0.001806))
  (initial (state (g a)) (prob 0.001058))
  (initial (state (u a)) (prob 0.177977))
  (initial (state (a c)) (prob 0.001806))
  (initial (state (c c)) (prob 0.000391))
  (initial (state (g c)) (prob 0.266974))
  (initial (state (u c)) (prob 0.000763))
  (initial (state (a g)) (prob 0.001058))
  …….

Types of production rule

;; state pfoldS
(transform (from (pfoldS)) (to (pfoldL)) (prob 0.131488))
(transform (from (pfoldS)) (to (pfoldB)) (prob 0.868742))

;; state pfoldF
(transform (from (pfoldF)) (to (LNUC pfoldF' RNUC))
 (gaps-ok)
 (annotate (row PFOLD) (column LNUC) (label <))
 (annotate (row PFOLD) (column RNUC) (label >)))
(transform (from (pfoldF')) (to (pfoldF)) (prob 0.787854))
(transform (from (pfoldF')) (to (pfoldB)) (prob 0.212421))

;; state pfoldL
(transform (from (pfoldL)) (to (pfoldF)) (prob 0.105404))
(transform (from (pfoldL)) (to (pfoldU)) (prob 0.895025))

;; state pfoldB
(transform (from (pfoldB)) (to (pfoldL pfoldS)))

;; state pfoldU
(transform (from (pfoldU)) (to (NUC pfoldU')) (gaps-ok))
(transform (from (pfoldU')) (to ()) (prob 1))

Rule types shown:
• Emit: pfoldF, pfoldU
• Null: pfoldS, pfoldF', pfoldL
• Null (end): pfoldU'
• Bifurcate: pfoldB
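The rule types can also be read off mechanically from the right-hand side of each transform. The sketch below is a minimal S-expression reader plus a classifier; the classification criterion (any terminal on the right means emit, two nonterminals means bifurcate, anything else is null) is inferred from the slide labels and is an assumption, and parse_sexpr and classify are hypothetical helper names, not part of DART.

```python
import re

def parse_sexpr(text):
    """Parse text into nested lists of atoms, skipping ';' comments."""
    text = re.sub(r";[^\n]*", "", text)
    tokens = re.findall(r"[()]|[^\s()]+", text)
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unexpected ')'")
            done = stack.pop()
            stack[-1].append(done)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("missing ')'")
    return stack[0]

def classify(rule, terminals):
    """Label a (transform ...) rule as emit, bifurcate or null (assumed criterion)."""
    fields = {item[0]: item[1:] for item in rule[1:] if item}
    rhs = fields["to"][0]  # list of symbols on the right-hand side
    if any(sym in terminals for sym in rhs):
        return "emit"
    if len(rhs) == 2:
        return "bifurcate"
    return "null"

RULES = """
;; state pfoldS
(transform (from (pfoldS)) (to (pfoldL)) (prob 0.131488))
;; state pfoldB
(transform (from (pfoldB)) (to (pfoldL pfoldS)))
;; state pfoldU
(transform (from (pfoldU)) (to (NUC pfoldU')) (gaps-ok))
(transform (from (pfoldU')) (to ()) (prob 1))
"""

TERMINALS = {"LNUC", "RNUC", "NUC"}
for rule in parse_sexpr(RULES):
    lhs = rule[1][1][0]  # first symbol inside (from (...))
    print(lhs, "->", classify(rule, TERMINALS))
```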

Types of rate matrix

• update-policy for a chain can be
– rind
    • i.e. R(i,j) = R*pi(j)
    • Could also implement this with parametric (see below)
– rev
    • i.e. reversible: pi(i) * R(i,j) = pi(j) * R(j,i)
– irrev
    • i.e. irreversible (more general)
– parametric
    • i.e. R(i,j) = f(a,b,c,d,e….)
    • (a,b,c,d,e….) are independent parameters
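The relationship between the first three policies can be checked numerically. This sketch builds a matrix of the "rind" form R(i,j) = R*pi(j) and tests the reversibility condition pi(i) * R(i,j) = pi(j) * R(j,i); the equilibrium frequencies and the cyclic counter-example are made-up values for illustration.

```python
def rind_matrix(pi, rate=1.0):
    """Rate matrix with R(i,j) = rate * pi(j) off the diagonal, rows summing to 0."""
    n = len(pi)
    R = [[rate * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        R[i][i] = -sum(R[i])
    return R

def is_reversible(R, pi, tol=1e-12):
    """Check detailed balance: pi(i) * R(i,j) == pi(j) * R(j,i) for all i, j."""
    n = len(pi)
    return all(abs(pi[i] * R[i][j] - pi[j] * R[j][i]) <= tol
               for i in range(n) for j in range(n))

# Assumed equilibrium frequencies for a 4-state alphabet.
pi = [0.1, 0.2, 0.3, 0.4]
R_rind = rind_matrix(pi)

# A cyclic chain 0 -> 1 -> 2 -> 0: a valid rate matrix that is not reversible.
uniform = [1.0 / 3.0] * 3
R_cycle = [[-1.0, 1.0, 0.0],
           [0.0, -1.0, 1.0],
           [1.0, 0.0, -1.0]]

print(is_reversible(R_rind, pi))        # rind matrices satisfy detailed balance
print(is_reversible(R_cycle, uniform))  # the cycle does not
```

So every rind matrix is also a rev matrix, and every rev matrix is a special case of irrev, which is why the policies can be used as a sequence of increasingly general models.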

Annotation & supervised learning

• Use the “annotate” element to add annotation lines to the alignment

• Add these annotation lines yourself prior to training, to force a particular parse (supervised learning)

• Period characters are treated as wildcards (partially supervised learning)

(transform (from (pfoldF)) (to (LNUC pfoldF' RNUC))
 (gaps-ok)
 (annotate (row PFOLD) (column LNUC) (label <))
 (annotate (row PFOLD) (column RNUC) (label >)))

#=GC PFOLD ...................<<<<<<<<...<<<<<........>>>>>..<...>>>.<<<<<<<.......>>>>>>>..>>>>>>.................
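When preparing such annotation lines by hand, it helps to confirm that the brackets nest properly before training. The following sketch turns a bracket string into a list of paired columns; paired_columns is a hypothetical helper written for this example, and the short test string is invented rather than taken from a real alignment.

```python
def paired_columns(annotation):
    """Return sorted (left, right) column pairs implied by '<' and '>' labels.

    Any other character (including the '.' wildcard) is treated as unconstrained.
    """
    stack, pairs = [], []
    for col, ch in enumerate(annotation):
        if ch == "<":
            stack.append(col)
        elif ch == ">":
            if not stack:
                raise ValueError("unmatched '>' at column %d" % col)
            pairs.append((stack.pop(), col))
    if stack:
        raise ValueError("unmatched '<' at column %d" % stack[-1])
    return sorted(pairs)

# Toy annotation string (made up for illustration).
print(paired_columns("<<..>>."))
```

A ValueError here means the supervised parse would be impossible under a grammar that labels base pairs with matched < and > characters.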

Context-dependent substitutions

• Let P(n) be the probability of column n of an alignment

• Context-independent model says that
P(1..N) = P(1)P(2)P(3)…P(N)

• More generally (context-dependence):
P(1..N) = P(1)P(2|1)P(3|1,2)P(4|1,2,3)…

• Siepel & Haussler approximated this by
P(1..N) ≈ P(1)P(2|1)P(3|2)…P(n|n-1)…

• Here P(n|n-1) is obtained from a dinucleotide model for {n-1,n}, using Bayes’ theorem:
P(n|n-1) = P(n-1,n) / P(n-1)
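The first-order approximation can be sketched in a few lines. To keep the arithmetic visible, this example treats each "column" as a single symbol from a two-letter alphabet and uses an invented joint table for adjacent pairs; in a real phylo-grammar each P(...) would be a phylogenetic likelihood for one or two alignment columns.

```python
# Hypothetical joint probabilities P(n-1, n) for adjacent symbols.
JOINT = {
    ("a", "a"): 0.4, ("a", "c"): 0.1,
    ("c", "a"): 0.2, ("c", "c"): 0.3,
}

def marginal(prev):
    """P(n-1): sum the joint table over the second symbol."""
    return sum(p for (first, _), p in JOINT.items() if first == prev)

def conditional(cur, prev):
    """P(n | n-1) = P(n-1, n) / P(n-1)."""
    return JOINT[(prev, cur)] / marginal(prev)

def first_order_prob(seq):
    """P(1..N) ~ P(1) P(2|1) P(3|2) ... P(N|N-1)."""
    prob = marginal(seq[0])
    for prev, cur in zip(seq, seq[1:]):
        prob *= conditional(cur, prev)
    return prob

print(first_order_prob("ac"))
print(first_order_prob("aca"))
```

For a two-symbol sequence the product collapses back to the joint table entry, which is a quick check that the conditional has been derived consistently.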

Context-dependent emit rules

(transform (from (PREVNUC S)) (to (PREVNUC EMITNUC S')))

Here {PREVNUC,EMITNUC} are the pseudoterminals for a dinucleotide chain

NB context-dependent substitution models are generally irreversible (e.g. CpG -> TpG)

Parametric models

• Any rate or probability in a grammar can be replaced by a parametric function

• This is useful to constrain models

• e.g. PHASTCONS phylo-HMM

How the “-length” argument works

• DP is an iteration over subsequences

• In long (e.g. genomic) alignments, you can save time by not considering all subseqs
– e.g. you probably don’t expect bases over 1 Mb apart to be paired

• The -length command-line argument allows you to limit the maximum length of subseqs that will be iterated over
– All suffix subseqs are always included, however, so there is always a valid global parse tree

• Care must be given to design of grammars, particularly if you want to find “local” features
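The effect of a length limit on the amount of DP work can be sketched by simply enumerating the subsequences that would be visited. This is an illustration of the idea only, under the assumption that subsequences are indexed by (start, length); it is not xgram's actual iteration code, and subseqs is a hypothetical name.

```python
def subseqs(seq_len, max_len=None):
    """Yield (start, length) for each subsequence kept under a length limit.

    Suffixes (subsequences ending at seq_len) are always kept, so the
    full-length subsequence (0, seq_len) is always reachable.
    """
    for start in range(seq_len, -1, -1):
        for length in range(0, seq_len - start + 1):
            is_suffix = (start + length == seq_len)
            if max_len is None or length <= max_len or is_suffix:
                yield (start, length)

full = list(subseqs(10))
limited = list(subseqs(10, max_len=3))
print(len(full), len(limited))
```

Without a limit the count grows quadratically in the alignment length, while with a fixed limit it grows linearly. Any feature longer than the limit can only be built from suffix subsequences, which is why grammar design matters when looking for local features.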

Speed tips

• Use “-length”
– Also “minlen” and “maxlen” in grammar

• Replace emit loops with bifurcations
– E.g. instead of “S -> x S” use “S -> X S; X -> x” and limit “maxlen” for X to 1
– NB this goes against the “standard” SCFG dogma of minimizing bifurcations; reason is that with big trees, emissions become more expensive

• Turn off logging once model debugged

DART logging and dartlog.pl

Model debugging tips

• Minimal test cases to reproduce errors
• Use logging during model development
– Examine source code for log messages
– E.g. “-log CYK_MATRIX”
• Use Makefiles for reproducibility
• To protect against EM getting stuck in local minima, try training a low-dimensional model first (e.g. using “rind”) then move to models with more degrees of freedom (rev -> irrev)

Perl xgram modules

• In dart/perl
– Stockholm.pm
    • Stockholm alignment class. Pretty basic
– DartSexpr.pm
    • Class for working with S-expressions
– PhyloGram.pm
    • Subclass of DartSexpr for phylo-grammars
    • Has accessors/helpers for various common tasks e.g. creating & populating new chains, emit rules, etc.
    • Subclasses: DNA.pm and Protein.pm
    • Related class: Chain.pm
– Not much documentation (but see first few lines of each file for examples (or bug me))

Rudimentary indel models

• Highly experimental

• Attempt to deal with gaps more intelligently than just ignoring them

• Falls somewhat short of a full “statistical alignment” treatment

(transform (from (S)) (to (X S')) (gap-model (extend-prob 0.5) (insert-rate 0.01) (delete-rate 0.01)))
