xgram and phylo-grammars
A brief intro
What is a phylo-grammar?
• Combination of:
  – Phylogenetic likelihood model
    • Tree with branch lengths, t
    • Rate matrix, R (continuous-time Markov chain)
    • Edge probabilities: exp(R*t)
  – Stochastic grammar
    • Grammar symbols (nonterminals and terminals)
    • Production rules (with probabilities)
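To make the "edge probabilities: exp(R*t)" bullet concrete, here is a minimal Python sketch. It is not xgram code: the function names, the scaling-and-squaring scheme and the Jukes-Cantor example matrix are all assumptions chosen for illustration.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(rate, t, squarings=10, terms=12):
    """Approximate exp(rate * t) by scaling-and-squaring a truncated Taylor series."""
    n = len(rate)
    scale = t / (2 ** squarings)
    small = [[rate[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes small^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def jukes_cantor_closed_form(mu, t):
    """Return (P_same, P_different) for a Jukes-Cantor chain with total rate mu."""
    decay = math.exp(-4.0 * mu * t / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


# Illustrative rate matrix (an assumption, not taken from the slides):
# every substitution has rate MU/3, so each row sums to zero.
MU = 1.0
JC = [[-MU if i == j else MU / 3.0 for j in range(4)] for i in range(4)]

P = mat_exp(JC, 0.1)  # substitution probabilities along a branch of length 0.1
```

Each row of P is a probability distribution over the descendant state given the ancestral state; for this simple matrix the result can be checked against the closed form.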
Grammars and dependencies
Nested dependencies
(context-free; Chomsky)
Cross-serial dependencies
(“mildly” context-sensitive; Joshi)
TIR
LTR
Adjacent dependencies
(regular/HMMs)
ATATATATATATATATATTAT
Microsatellite
Fuzzy duck, fuzzy duck, duckie fuzz, duckie fuzz, duckie fuzz…
PFOLD: Knudsen & Hein, 1999
S -> LS | L;
F -> dFd | LS;
L -> s | dFd;
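One way to get a feel for this grammar is to sample from it. The sketch below is Python, not xgram; the rule probabilities are illustrative assumptions, and the two halves of a paired column "d" are written "<" and ">" so the output reads as a secondary-structure string.

```python
import random

# Rule probabilities are illustrative assumptions, not PFOLD's trained values.
RULES = {
    "S": [(0.87, ["L", "S"]), (0.13, ["L"])],
    "F": [(0.79, ["d_left", "F", "d_right"]), (0.21, ["L", "S"])],
    "L": [(0.90, ["s"]), (0.10, ["d_left", "F", "d_right"])],
}

# Terminals: "s" is an unpaired column, "d ... d" is a pair of columns.
TERMINALS = {"s": ".", "d_left": "<", "d_right": ">"}


def sample_structure(rng, start="S"):
    """Sample a structure string by expanding nonterminals left to right."""
    output = []
    stack = [start]  # explicit stack avoids deep recursion on long samples
    while stack:
        symbol = stack.pop()
        if symbol in TERMINALS:
            output.append(TERMINALS[symbol])
            continue
        choices = RULES[symbol]
        rhs = rng.choices([r for _, r in choices],
                          weights=[p for p, _ in choices])[0]
        stack.extend(reversed(rhs))  # leftmost symbol is expanded first
    return "".join(output)


if __name__ == "__main__":
    print(sample_structure(random.Random(1)))
```

Because every "<" is produced together with its matching ">", any sampled string is a properly nested structure, which is the nested-dependency property the grammar is meant to capture.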
dd:
s:
Codon evolution and exon models
Goldman & Kosiol, MS in prep
EvoGene (Pedersen & Hein)
c.f. Exoniphy (Siepel & Haussler), whose “null model” is considerably more sophisticated (context-dependent substitutions are explicitly modeled, yielding higher-order dependence between noncoding bases)
PHASTCONS phylo-HMM
xgram command line usage
xgram examples
• Load alignment from file “align.stk”; load grammar from file “grammar.eg”; estimate tree by neighbor-joining (if there isn’t a tree annotated to the alignment already); do CYK algorithm (or Viterbi, as appropriate); annotate alignment; print to standard output
xgram align.stk -g grammar.eg
• Load alignment from file “align2.stk”; load grammar from file “grammar2.eg”; optimize branch lengths of tree; do EM by iterating Inside-Outside algorithm (or Forward-Backward); save grammar to file “trained.eg”; print log messages to level 5
xgram align2.stk -g grammar2.eg -b -t trained.eg --noannotate -log 5
xgram grammar elements
• Alphabet
  – Tokens
  – Complementarity
  – Degeneracies
• Grammar
  – Chains
    • Pseudoterminals
    • Initial probabilities
    • Mutation rates
    • Update policy
  – Production rules (& nonterminals)
    • Emissions (paired emit-null states)
      – Annotation labels (optional)
      – Gap model (optional)
    • Transitions (aka “null” states)
    • Bifurcations
  – Parameters (optional)
    • Rate parameters
    • Probability parameters
(alphabet
 (name RNA)
 (token (a c g u))
 (complement (u g c a))
 (extend (to n) (from a) (from c) (from g) (from u))
 (extend (to x) (from a) (from c) (from g) (from u))
 (extend (to t) (from u))
 (extend (to r) (from a) (from g))
 (extend (to y) (from c) (from u))
 (extend (to m) (from a) (from c))
 (extend (to k) (from g) (from u))
 (extend (to s) (from c) (from g))
 (extend (to w) (from a) (from u))
 (extend (to h) (from a) (from c) (from u))
 (extend (to b) (from c) (from g) (from u))
 (extend (to v) (from a) (from c) (from g))
 (extend (to d) (from a) (from g) (from u))
 (wildcard *)
) ;; end alphabet RNA
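The three alphabet elements (tokens, complementarity, degeneracies) can be mirrored in a few lines of Python. This is only a sketch of what the declaration expresses, not how xgram stores it; treating the wildcard as "any token" and folding case are assumptions.

```python
# Mirrors the RNA alphabet declaration; the data structures are hypothetical.
TOKENS = "acgu"
COMPLEMENT = dict(zip("acgu", "ugca"))  # (token ...) paired with (complement ...)
DEGENERATE = {                           # each (extend (to X) (from ...)) entry
    "n": "acgu", "x": "acgu", "t": "u",
    "r": "ag", "y": "cu", "m": "ac", "k": "gu", "s": "cg", "w": "au",
    "h": "acu", "b": "cgu", "v": "acg", "d": "agu",
}
WILDCARD = "*"


def expand(symbol):
    """Return the set of tokens that a (possibly degenerate) symbol stands for."""
    symbol = symbol.lower()  # assumption: symbols are case-insensitive
    if symbol in TOKENS:
        return {symbol}
    if symbol in DEGENERATE:
        return set(DEGENERATE[symbol])
    if symbol == WILDCARD:
        return set(TOKENS)  # assumption: the wildcard matches any token
    raise ValueError("unknown symbol: %r" % symbol)


def reverse_complement(sequence):
    """Reverse-complement a sequence made of non-degenerate tokens."""
    return "".join(COMPLEMENT[ch] for ch in reversed(sequence.lower()))
```

For example, `expand("r")` gives the purines `{"a", "g"}`, and `expand("t")` maps DNA-style input onto `{"u"}`.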
(grammar
 (name pfold)
 (update-rates 1)
 (update-rules 1)

 (chain
  (update-policy rev)
  (terminal (LNUC RNUC))

  ;; initial probability distribution
  (initial (state (a a)) (prob 0.001167))
  (initial (state (c a)) (prob 0.001806))
  (initial (state (g a)) (prob 0.001058))
  (initial (state (u a)) (prob 0.177977))
  (initial (state (a c)) (prob 0.001806))
  (initial (state (c c)) (prob 0.000391))
  (initial (state (g c)) (prob 0.266974))
  (initial (state (u c)) (prob 0.000763))
  (initial (state (a g)) (prob 0.001058))
  …….
Types of production rule

;; state pfoldS: null (transition) rules
(transform (from (pfoldS)) (to (pfoldL)) (prob 0.131488))
(transform (from (pfoldS)) (to (pfoldB)) (prob 0.868742))

;; state pfoldF: emit rule (paired emission), then null rules for pfoldF'
(transform (from (pfoldF)) (to (LNUC pfoldF' RNUC))
 (gaps-ok)
 (annotate (row PFOLD) (column LNUC) (label <))
 (annotate (row PFOLD) (column RNUC) (label >)))
(transform (from (pfoldF')) (to (pfoldF)) (prob 0.787854))
(transform (from (pfoldF')) (to (pfoldB)) (prob 0.212421))

;; state pfoldL: null rules
(transform (from (pfoldL)) (to (pfoldF)) (prob 0.105404))
(transform (from (pfoldL)) (to (pfoldU)) (prob 0.895025))

;; state pfoldB: bifurcate rule
(transform (from (pfoldB)) (to (pfoldL pfoldS)))

;; state pfoldU: emit rule, then null (end) rule for pfoldU'
(transform (from (pfoldU)) (to (NUC pfoldU')) (gaps-ok))
(transform (from (pfoldU')) (to ()) (prob 1))
Types of rate matrix
• update-policy for a chain can be
  – rind
    • i.e. R(i,j) = R*pi(j)
    • Could also implement this with parametric (see below)
  – rev
    • i.e. reversible: pi(i) * R(i,j) = pi(j) * R(j,i)
  – irrev
    • i.e. irreversible (more general)
  – parametric
    • i.e. R(i,j) = f(a,b,c,d,e….)
    • (a,b,c,d,e….) are independent parameters
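The relationship between "rind" and "rev" can be checked numerically. The Python sketch below builds a rate matrix of the form R(i,j) = R*pi(j) and tests the detailed-balance condition pi(i) * R(i,j) = pi(j) * R(j,i). The function names are hypothetical, and setting the diagonal so that each row sums to zero is the usual rate-matrix convention, assumed here.

```python
def rind_matrix(pi, rate=1.0):
    """Build a rate matrix with R(i,j) = rate * pi(j) for i != j."""
    n = len(pi)
    matrix = [[rate * pi[j] if i != j else 0.0 for j in range(n)]
              for i in range(n)]
    for i in range(n):
        matrix[i][i] = -sum(matrix[i])  # rows of a rate matrix sum to zero
    return matrix


def is_reversible(matrix, pi, tol=1e-12):
    """Check detailed balance: pi(i) * R(i,j) == pi(j) * R(j,i) for all i, j."""
    n = len(pi)
    return all(abs(pi[i] * matrix[i][j] - pi[j] * matrix[j][i]) <= tol
               for i in range(n) for j in range(n))


def cyclic_matrix(n):
    """An irreversible example: state i can only move to state (i + 1) mod n."""
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][(i + 1) % n] = 1.0
        matrix[i][i] = -1.0
    return matrix
```

Any matrix from `rind_matrix` passes `is_reversible`, so a rind model is a special case of rev; the cyclic matrix, with a uniform equilibrium distribution, fails the check and would need irrev.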
Annotation & supervised learning
• Use the “annotate” element to add annotation lines to the alignment
• Add these annotation lines yourself prior to training, to force a particular parse (supervised learning)
• Period characters are treated as wildcards (partially supervised learning)
(transform (from (pfoldF)) (to (LNUC pfoldF' RNUC)) (gaps-ok) (annotate (row PFOLD) (column LNUC) (label <)) (annotate (row PFOLD) (column RNUC) (label >)))
#=GC PFOLD ...................<<<<<<<<...<<<<<........>>>>>..<...>>>.<<<<<<<.......>>>>>>>..>>>>>>.................
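Before training on hand-annotated alignments, it can help to confirm that an annotation line like the one above is well nested. A minimal Python sketch (the function name is hypothetical; xgram does not provide it) that recovers the paired columns from the "<" and ">" labels and skips every other character:

```python
def paired_columns(annotation, open_char="<", close_char=">"):
    """Return sorted (left, right) column index pairs from a structure line."""
    stack = []
    pairs = []
    for column, label in enumerate(annotation):
        if label == open_char:
            stack.append(column)
        elif label == close_char:
            if not stack:
                raise ValueError("unmatched %r at column %d" % (close_char, column))
            pairs.append((stack.pop(), column))
        # any other character (e.g. the wildcard ".") is not part of a pair
    if stack:
        raise ValueError("unmatched %r at column %d" % (open_char, stack[-1]))
    return sorted(pairs)
```

For instance, `paired_columns("..<<..>>.")` pairs column 2 with 7 and column 3 with 6; a line with unbalanced labels raises `ValueError` instead of silently producing a bad parse constraint.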
Context-dependent substitutions
• Let P(n) be the probability of column n of an alignment
• Context-independent model says that
  P(1..N) = P(1)P(2)P(3)…P(N)
• More generally (context-dependence):
  P(1..N) = P(1)P(2|1)P(3|1,2)P(4|1,2,3)…
• Siepel & Haussler approximated this by
  P(1..N) ≈ P(1)P(2|1)P(3|2)…P(n|n-1)…
• Here P(n|n-1) is obtained from a dinucleotide model for {n-1,n}, using Bayes’ theorem:
  P(n|n-1) = P(n-1,n) / P(n-1)
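The first-order approximation can be illustrated with a toy calculation in Python. Here single symbols stand in for alignment columns and a small hand-written joint table stands in for the dinucleotide model; both are simplifying assumptions, since in a phylo-grammar each P would be a phylogenetic likelihood.

```python
def conditional_from_joint(joint):
    """Turn P(prev, cur) into P(cur | prev) = P(prev, cur) / P(prev)."""
    marginal = {}
    for (prev, _cur), prob in joint.items():
        marginal[prev] = marginal.get(prev, 0.0) + prob
    conditional = {(prev, cur): prob / marginal[prev]
                   for (prev, cur), prob in joint.items()}
    return conditional, marginal


def chain_probability(columns, joint):
    """P(1..N) ~= P(1) P(2|1) P(3|2) ... under the first-order approximation."""
    conditional, marginal = conditional_from_joint(joint)
    prob = marginal[columns[0]]
    for prev, cur in zip(columns, columns[1:]):
        prob *= conditional[(prev, cur)]
    return prob


# Toy joint distribution over adjacent symbols (made-up numbers).
TOY_JOINT = {
    ("a", "a"): 0.4, ("a", "b"): 0.1,
    ("b", "a"): 0.1, ("b", "b"): 0.4,
}
```

With this table, `chain_probability("aab", TOY_JOINT)` is 0.5 * 0.8 * 0.2, and summing over all sequences of a fixed length gives 1, as expected for a properly normalized chain.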
Context-dependent emit rules
(transform (from (PREVNUC S)) (to (PREVNUC EMITNUC S')))
Here {PREVNUC,EMITNUC} are the pseudoterminals for a dinucleotide chain
NB context-dependent substitution models are generally irreversible (e.g. CpG -> TpG)
Parametric models
• Any rate or probability in a grammar can be replaced by a parametric function
• This is useful to constrain models
• e.g. PHASTCONS phylo-HMM
How the “-length” argument works
• DP is an iteration over subsequences
• In long (e.g. genomic) alignments, you can save time by not considering all subseqs
  – e.g. you probably don’t expect bases over 1 Mb apart to be paired
• The -length command-line argument allows you to limit the maximum length of subseqs that will be iterated over
  – All suffix subseqs are always included, however, so there is always a valid global parse tree
• Care must be given to design of grammars, particularly if you want to find “local” features
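To see roughly how much work a length limit saves, the Python sketch below counts the subsequences a DP would visit with and without the limit, always keeping suffixes as described above. It is a counting illustration only; the function name is hypothetical and the exact bookkeeping inside xgram may differ.

```python
def count_subsequences(n, max_length=None):
    """Count subsequences (start, end) with 0 <= start <= end <= n that are kept.

    A subsequence is kept if it is no longer than max_length, or if it is a
    suffix (end == n), so a global parse of the whole alignment stays possible.
    """
    count = 0
    for start in range(n + 1):
        for end in range(start, n + 1):
            length = end - start
            is_suffix = end == n
            if max_length is None or length <= max_length or is_suffix:
                count += 1
    return count


if __name__ == "__main__":
    print(count_subsequences(1000))                  # quadratic in n
    print(count_subsequences(1000, max_length=50))   # roughly linear in n
```

Without a limit the count grows as n squared; with a fixed limit it grows roughly linearly in n, which is where the time saving on long alignments comes from.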
Speed tips
• Use “-length”
  – Also “minlen” and “maxlen” in grammar
• Replace emit loops with bifurcations
  – E.g. instead of “S -> x S” use “S -> X S; X -> x” and limit “maxlen” for X to 1
  – NB this goes against the “standard” SCFG dogma of minimizing bifurcations; reason is that with big trees, emissions become more expensive
• Turn off logging once model debugged
DART logging and dartlog.pl
Model debugging tips
• Minimal test cases to reproduce errors
• Use logging during model development
  – Examine source code for log messages
  – E.g. “-log CYK_MATRIX”
• Use Makefiles for reproducibility
• To protect against EM getting stuck in local minima, try training a low-dimensional model first (e.g. using “rind”) then move to models with more degrees of freedom (rev -> irrev)
Perl xgram modules
• In dart/perl
  – Stockholm.pm
    • Stockholm alignment class. Pretty basic
  – DartSexpr.pm
    • Class for working with S-expressions
  – PhyloGram.pm
    • Subclass of DartSexpr for phylo-grammars
    • Has accessors/helpers for various common tasks e.g. creating & populating new chains, emit rules, etc.
    • Subclasses: DNA.pm and Protein.pm
    • Related class: Chain.pm
  – Not much documentation (but see first few lines of each file for examples (or bug me))
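For readers who want to poke at grammar files without the Perl modules, the S-expression format is simple enough to parse in a few lines. The Python sketch below is not the DartSexpr.pm API; the function names are hypothetical, and treating ";" as starting a comment that runs to the end of the line is an assumption based on the ";;" comments in the grammar examples.

```python
def tokenize(text):
    """Split S-expression text into '(', ')' and atom tokens, dropping comments."""
    tokens = []
    for line in text.splitlines():
        line = line.split(";", 1)[0]  # assumption: ';' starts a line comment
        tokens.extend(line.replace("(", " ( ").replace(")", " ) ").split())
    return tokens


def parse(text):
    """Parse S-expression text into nested Python lists of strings."""
    root = []
    stack = [root]
    for token in tokenize(text):
        if token == "(":
            child = []
            stack[-1].append(child)
            stack.append(child)
        elif token == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            stack.pop()
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return root


def find_all(tree, tag):
    """Collect every sub-list whose first element is the given tag, depth first."""
    found = []
    for node in tree:
        if isinstance(node, list):
            if node and node[0] == tag:
                found.append(node)
            found.extend(find_all(node, tag))
    return found
```

For example, `find_all(parse(grammar_text), "transform")` returns every production rule as a nested list, which is enough for quick scripts that tabulate rule probabilities.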
Rudimentary indel models
• Highly experimental
• Attempt to deal with gaps more intelligently than just ignoring them
• Falls somewhat short of a full “statistical alignment” treatment
(transform (from (S)) (to (X S')) (gap-model (extend-prob 0.5) (insert-rate 0.01) (delete-rate 0.01)))