Graphical Modeling of Multiple Sequence Alignment Jinbo Xu Toyota Technological Institute at Chicago Computational Institute, The University of Chicago

Post on 20-Jan-2016

Page 1:

Graphical Modeling of Multiple Sequence Alignment

Jinbo Xu

Toyota Technological Institute at Chicago

Computational Institute, The University of Chicago

Page 2:

Two applications of MSA

• Predict inter-residue interaction network (i.e., protein contact map) from MSA using joint graphical lasso
  – An important subproblem of protein folding
• Align two MSAs through alignment of two Markov Random Fields (MRFs)
  – Homology detection and fold recognition
  – Merge two MSAs into a larger one

Page 3:

Modeling MSA by Markov Random Fields

The generating probability of a sequence:

Infer the model parameters by maximum likelihood. The pairwise parameters encode the residue correlation relationship.

A special case is the Gaussian Graphical Model.
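For reference, the generating probability of a pairwise MRF over an aligned sequence x = (x_1, …, x_L) is commonly written in the form below. The symbols h_i (single-column term), J_ij (pairwise term) and Z (normalizing constant) are assumptions chosen for illustration and are not necessarily the notation used on the slide.

```latex
P(x) = \frac{1}{Z} \exp\Big( \sum_{i=1}^{L} h_i(x_i) + \sum_{1 \le i < j \le L} J_{ij}(x_i, x_j) \Big)
```

In this form, maximum-likelihood inference estimates h and J from the sequences in the MSA, and a large J_ij suggests a strong coupling between columns i and j.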

Page 4:

Numeric Representation of MSA

… 0 0 … 1 0 …

21 elements for each column in MSA (20 amino acids plus gap)

Represent a sequence in MSA as an L×21 binary vector
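The encoding can be sketched in a few lines of Python. This is a minimal illustration, assuming a 21-letter alphabet of the 20 standard amino acids plus the gap symbol and assuming that unknown letters are treated as gaps; the function names and the toy MSA are hypothetical.

```python
import numpy as np

# 20 standard amino acids plus the gap symbol: 21 states per MSA column.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
STATE_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def encode_sequence(aligned_seq):
    """Return the flattened L*21 binary vector for one aligned sequence."""
    L = len(aligned_seq)
    onehot = np.zeros((L, len(ALPHABET)), dtype=np.int8)
    for pos, aa in enumerate(aligned_seq):
        # Assumption: unknown symbols (e.g. X) are treated as gaps.
        onehot[pos, STATE_INDEX.get(aa, STATE_INDEX["-"])] = 1
    return onehot.reshape(-1)


def encode_msa(msa):
    """Stack the encoded rows of an MSA into an N x (21*L) matrix."""
    return np.stack([encode_sequence(seq) for seq in msa])


msa = ["GRK-YSA", "GRK-YSA", "FLV-LYI", "KLV-LYI"]
X = encode_msa(msa)
print(X.shape)  # (4, 147)
```

Each row contains exactly L ones, one per MSA column.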

Page 5:

Gaussian Graphical Model (GGM)

• $X$: a multiple sequence alignment (MSA)
• Assume $X$ has a Gaussian distribution $N(\mu, \Sigma)$, where $\Sigma$ is the covariance matrix
• $\Omega$ (the inverse of $\Sigma$): the precision matrix, implying the residue interaction pattern among all MSA columns

Page 6:

Covariance and Precision Matrix

The precision matrix has dimension 21L×21L: an L×L grid of 21×21 submatrices, one per residue pair

Larger values indicate stronger interaction
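One way to turn a 21L×21L precision matrix into one interaction score per residue pair is to summarize each 21×21 submatrix by a norm. The sketch below assumes the Frobenius norm; the slides do not say which summary is used, and the function name and random test matrix are hypothetical.

```python
import numpy as np


def block_scores(precision, L, q=21):
    """Collapse a (q*L) x (q*L) precision matrix into an L x L score matrix."""
    # View the matrix as L x L blocks of size q x q (block i,j = positions i,j).
    blocks = precision.reshape(L, q, L, q).transpose(0, 2, 1, 3)
    # Assumption: the Frobenius norm summarizes each block.
    scores = np.sqrt((blocks ** 2).sum(axis=(2, 3)))
    np.fill_diagonal(scores, 0.0)  # a residue is not paired with itself
    return scores


L = 3
rng = np.random.default_rng(0)
omega = rng.normal(size=(21 * L, 21 * L))
omega = (omega + omega.T) / 2  # precision matrices are symmetric
S = block_scores(omega, L)
print(S.shape)  # (3, 3)
```

Ranking the off-diagonal entries of the score matrix then gives a list of candidate residue pairs, strongest first.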

Page 7:

Today’s talk

• Predict inter-residue interaction network (i.e., protein contact map) from MSA using joint graphical lasso
  – An important subproblem of protein folding
• Align two MSAs through alignment of two Markov Random Fields (MRFs)
  – Homology detection and fold recognition
  – Merge two MSAs into a larger one

Page 8:

Protein Contact Map (residue interaction network)

Figure: four residues (1-4) with pairwise distances labeled 6.0, 8.1, 5.9 and 3.8

Contact matrix:

    1 2 3 4
1   0 1 1 0
2   1 0 1 1
3   1 1 0 1
4   0 1 1 0

Two residues are in contact if their Cα or Cβ distance is < 8Å

Shorter distance → stronger interaction
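The 8Å rule can be written directly as a thresholded distance matrix. The sketch below is illustrative only: the coordinates are hypothetical (four residues on a straight line, 3.8Å apart) and the function name is an assumption.

```python
import numpy as np


def contact_map(coords, threshold=8.0):
    """Binary contact map from an (L, 3) array of Cα or Cβ coordinates in Å."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    contacts = (dist < threshold).astype(int)
    np.fill_diagonal(contacts, 0)  # no self-contacts
    return contacts


# Hypothetical coordinates for four residues on a line, 3.8 Å apart.
coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [7.6, 0.0, 0.0],
                   [11.4, 0.0, 0.0]])
cmap = contact_map(coords)
print(cmap)
```

With these toy coordinates only residues 1 and 4 (11.4Å apart) fall outside the threshold.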

Page 9:

Contact Matrix is Sparse

short range: 6-12 AAs apart along primary sequence
medium range: 12-24 AAs apart
long range: >24 AAs apart

#contacts is linear w.r.t. sequence length

Page 10:

Input:MEKVNFLKNGVLRLPPGFRFRPTDEELVVQYLKRKVFSFPLPASIIPEVEVYKSDPWDLPGDMEQEKYFFSTKEVKYPNGNRSNRATNSGYWKATGIDKQIILRGRQQQQQLIGLKKTLVFYRGKSPHGCRTNWIMHEYRLANLESNYHPIQGNWVICRIFLKKRGNTKNKEENMTTHDEVRNREIDKNSPVVSVKMSSRDSEALASANSELKKKASIIFYDFMGRNNSNGVAASTSSSGITDLTTTNEESDDHEESTSSFNNFTTFKRKIN

Protein Contact Prediction

Output:

With L/12 long-range native contacts, the fold of a protein can be roughly determined [Baker group]

Page 11:

Contact Prediction Methods

• Evolutionary coupling analysis (unsupervised learning)
  – Identify co-evolved residues from multiple sequence alignment
  – No solved protein structures used at all
  – High-throughput sequencing makes this method promising
  – e.g., mutual information, Evfold, PSICOV, plmDCA, GREMLIN
• Supervised machine learning
  – Input features: sequence (profile) similarity, chemical properties similarity, mutual information
  – (implicitly) learn information from solved structures
  – examples: NNcon, SVMcon, CMAPpro, PhyCMAP

Page 12:

Evolutionary Coupling (EC) Analysis

Observation: two residues in contact tend to co-evolve, i.e., two co-evolved residues are likely to form a contact

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028766

Page 13:

Evolutionary Coupling (EC) Analysis (Cont’d)

• Local statistical methods: examine the correlation between two residues independent of the others
  – Mutual information (MI): two residues in contact are likely to have large MI
  – Not all residue pairs with large MI are in contact due to indirect evolutionary coupling. If A~B and B~C, then likely A~C
• Global statistical methods: examine the correlation between two residues conditioned on the others
  – Need a large number of sequences
  – Maximum-Entropy: Evfold
  – Graphical lasso: PSICOV
  – Pseudo-likelihood: plmDCA, GREMLIN
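As a concrete instance of a local statistic, the mutual information between two MSA columns can be computed from plain frequency counts. This is a minimal sketch without sequence weighting or pseudocounts, which practical tools typically add; the function name and toy alignment are hypothetical.

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """Mutual information (in nats) between columns i and j of an MSA."""
    n = len(msa)
    pi = Counter(seq[i] for seq in msa)
    pj = Counter(seq[j] for seq in msa)
    pij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in pij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi


covarying = ["AC", "AC", "GT", "GT"]    # column 1 is determined by column 0
independent = ["AC", "AT", "GC", "GT"]  # columns vary independently
print(mutual_information(covarying, 0, 1))    # approximately ln 2
print(mutual_information(independent, 0, 1))  # 0.0
```

Because MI looks at one pair of columns at a time, it cannot tell a direct coupling from an indirect one, which is the limitation the global methods address.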

Page 14:

Single MSA-based Contact Prediction

• Given a protein sequence under prediction, run PSI-BLAST to detect its homologs and build an MSA
• Calculate the sample covariance matrix $\hat{\Sigma}$ from the MSA
• $\hat{\Sigma}$ is singular, so the precision matrix cannot be calculated as $\Omega = \hat{\Sigma}^{-1}$
• Calculate $\Omega$ by maximum likelihood, i.e., maximize the occurring probability of the observed sequences

Enforce sparse precision matrix. Why?
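The singularity is easy to check numerically. In the sketch below (toy MSA, hypothetical helper name), the sample covariance of the one-hot encoding is 147×147 but its rank is tiny, for two reasons that hold in general: there are far fewer sequences than 21L variables, and the 21 indicators of each column always sum to 1.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot(msa):
    """Encode an MSA as an N x (21*L) binary matrix."""
    X = np.zeros((len(msa), len(msa[0]) * 21))
    for n, seq in enumerate(msa):
        for pos, aa in enumerate(seq):
            X[n, pos * 21 + ALPHABET.index(aa)] = 1.0
    return X


msa = ["GRK-YSA", "GRK-YSA", "FLV-LYI", "KLV-LYI"]
X = one_hot(msa)
cov = np.cov(X, rowvar=False)      # sample covariance, 147 x 147
rank = np.linalg.matrix_rank(cov)
print(cov.shape, rank)             # the rank is far below 147
```

A rank-deficient covariance has no inverse, which is why the precision matrix has to be estimated by a regularized maximum-likelihood problem instead.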

Page 15:

Issues with Existing Methods

• Evolutionary coupling (EC) analysis works for proteins with a large number of sequence homologs
• Focus on how to improve the statistical methods instead of use of extra biological information/insight, e.g., relax the Gaussian assumption, consensus of a few EC methods
• Use information mostly in a single protein family
• Physical constraints other than sparsity not used

Page 16:

Our Work: contact prediction using multiple MSAs

• Jointly predict contacts for related families of similar folds. That is, predict contacts using multiple MSAs. These MSAs share inter-residue interaction network to some degree
• Integrate evolutionary coupling (EC) analysis with supervised learning
  – EC analysis makes use of residue co-evolution information
  – Supervised learning makes use of sequence (profile) similarity

Goal: focus on proteins without many sequence homologs
Strategy: increase statistical power by information aggregation

Page 17:

Red: shared; Blue: unique to PF00116; Green: unique to PF13473

Observation: different protein families share similar contact maps

Page 18:

Joint evolutionary coupling (EC) analysis

• Jointly predict contacts for a set of related protein families
• Predict contacts for a protein family using information in other related families
• Enforce contact map consistency among related families
• Do not lose family-specific information

Page 19:

Joint graphical lasso for joint evolutionary coupling analysis

1. Given a protein family and its MSA, find related families and corresponding MSAs

Let $\Omega^{1}, \ldots, \Omega^{K}$ be the precision matrices of the $K$ families

2. Estimate $\Omega^{1}, \ldots, \Omega^{K}$ by joint log-likelihood as follows

$$\max \sum_{k=1}^{K}\left(\log|\Omega^{k}| - \mathrm{tr}(\Omega^{k}\hat{\Sigma}^{k})\right) - \lambda_1\sum_{k=1}^{K}\|\Omega^{k}\|_1$$

where the last term enforces sparse precision matrices

How to enforce contact map consistency?

Page 20:

Residue Pair/Submatrix Grouping

In total ≤L(L-1)/2 groups where L is the seq length
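The grouping can be sketched as follows, assuming the alignment between families is given as a map from each family's columns to a shared reference column (a hypothetical representation; the slides do not specify the data structure). Residue pairs from different families that map to the same reference pair fall into one group. The `group_norm` helper shows one plausible L2 norm over the 21×21 precision submatrices of a group, as used in group-lasso style penalties; it is an assumption, not the author's stated definition.

```python
import math
from collections import defaultdict

import numpy as np


def group_pairs(column_maps):
    """Group residue pairs of several families that align to the same pair.

    column_maps[k] maps a column of family k to a shared reference column.
    Returns {reference pair: [(family k, column i, column j), ...]}.
    """
    groups = defaultdict(list)
    for k, cmap in enumerate(column_maps):
        cols = sorted(cmap)
        for a, i in enumerate(cols):
            for j in cols[a + 1:]:
                ref = tuple(sorted((cmap[i], cmap[j])))
                groups[ref].append((k, i, j))
    return dict(groups)


def group_norm(precisions, members, q=21):
    """L2 norm over all q x q precision submatrices in one group."""
    total = 0.0
    for k, i, j in members:
        block = precisions[k][i * q:(i + 1) * q, j * q:(j + 1) * q]
        total += float((block ** 2).sum())
    return math.sqrt(total)


# Pair (2, 4) of family 0 and pair (3, 5) of family 1 share reference pair (0, 1).
groups = group_pairs([{2: 0, 4: 1}, {3: 0, 5: 1}])
print(groups)  # {(0, 1): [(0, 2, 4), (1, 3, 5)]}
```

Since every group corresponds to one pair of reference columns, the number of groups cannot exceed L(L-1)/2.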

Page 21:

Enforce Contact Map Consistency by Group Penalty

$G$: the number of groups
$K$: the number of families

Using group lasso to model family consistency:

$$\sum_{g=1}^{G}\lambda_g\|\Omega_g\|_2$$

Group conservation level

is defined as

Page 22:

Supervised Machine Learning

• Input features: sequence profile, amino acid chemical properties, mutual information power series, context-specific statistical potential
• Mutual information power series:
  – Local info: mutual information matrix (MI)
  – Partially global info: MI^2, MI^3, …, MI^11
  – Can be calculated much faster than PSICOV
• Random Forests trained on 800-900 proteins
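A minimal sketch of the power series, assuming that MI^k denotes the k-th matrix power of the L×L mutual information matrix. The function name and the toy matrix are hypothetical.

```python
import numpy as np


def mi_power_series(mi, max_power=11):
    """Return [MI, MI^2, ..., MI^max_power] as matrix powers of MI."""
    powers = [mi]
    for _ in range(max_power - 1):
        powers.append(powers[-1] @ mi)
    return powers


# Toy 3 x 3 MI matrix: residues 0-1 and 1-2 are coupled, 0-2 are not.
mi = np.array([[0.0, 0.5, 0.0],
               [0.5, 0.0, 0.5],
               [0.0, 0.5, 0.0]])
series = mi_power_series(mi, max_power=3)
print(series[1][0, 2])  # 0.25: an indirect path 0 -> 1 -> 2
```

Under this reading, entry (i, j) of MI^k accumulates products of MI values along chains of k steps from i to j, which is why the higher powers carry partially global information while needing only matrix multiplications.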

Page 23:

Joint EC Analysis with Supervised Prediction as Prior

$$\max \sum_{k=1}^{K}\left(\log|\Omega^{k}| - \mathrm{tr}(\Omega^{k}\hat{\Sigma}^{k})\right) - \lambda_1\sum_{k=1}^{K}\|\Omega^{k}\|_1 - \sum_{g=1}^{G}\lambda_g\|\Omega_g\|_2 - \lambda_2\sum_{k=1}^{K}\sum_{ij}\frac{\|\Omega^{k}_{ij}\|_1}{\max(P^{k}_{ij},\,0.3)}$$

• First term: log-likelihood of K families
• Second term: sparsity
• Third term: contact map consistency among families
• Fourth term: similarity with supervised prediction

This optimization problem can be solved to a suboptimal solution by ADMM

Page 24:

Accuracy on 98 Pfam families

             Medium-range            Long-range
Method       L/10   L/5    L/2      L/10   L/5    L/2
CoinDCA      0.496  0.435  0.312    0.561  0.502  0.391
PSICOV       0.375  0.312  0.213    0.446  0.400  0.311
PSICOV_b     0.388  0.306  0.199    0.462  0.400  0.294
plmDCA       0.433  0.354  0.233    0.484  0.443  0.343
plmDCA_h     0.433  0.339  0.211    0.480  0.413  0.292
GREMLIN      0.401  0.332  0.225    0.447  0.423  0.329
GREMLIN_h    0.391  0.316  0.204    0.428  0.400  0.301
Merge_p      0.303  0.246  0.178    0.370  0.328  0.253
Merge_m      0.276  0.223  0.169    0.355  0.309  0.232
Voting       0.405  0.280  0.168    0.337  0.353  0.275

Page 25:

Accuracy vs. # Sequence Homologs

(A) Medium-range (B) Long-range

X-axis: ln of the number of non-redundant sequence homologs
Y-axis: L/10 accuracy

Page 26:

Accuracy on 123 CASP10 targets

             Medium-range            Long-range
Method       L/10   L/5    L/2      L/10   L/5    L/2
CoinDCA      0.500  0.440  0.340    0.412  0.351  0.279
Evfold       0.294  0.249  0.188    0.257  0.225  0.171
PSICOV       0.310  0.259  0.192    0.276  0.225  0.168
plmDCA       0.344  0.289  0.214    0.326  0.280  0.213
GREMLIN      0.343  0.280  0.229    0.320  0.278  0.159
NNcon        0.393  0.334  0.226    0.239  0.188  0.001
CMAPpro      0.414  0.363  0.276    0.336  0.297  0.227

Page 27:

Accuracy vs. # sequence homologs (CASP10)

X-axis: ln of # non-redundant sequence homologs
Y-axis: L/10 long-range prediction accuracy

Page 28:

Accuracy vs. Contact Conservation Level

(A) Medium-range; (B) Long-range
X-axis: conservation level; the larger, the more conserved

Page 29:

Today’s Talk

• Predict inter-residue interaction network (i.e., protein contact map) from MSA using joint graphical lasso
  – An important subproblem of protein folding
• Align two MSAs through alignment of two Markov Random Fields (MRFs)
  – Remote homology detection and fold recognition
  – Merge two MSAs into a larger one

Page 30:

Homology Detection & Fold Recognition

• Primary sequence comparison
  – Similar sequences -> very likely homologous
  – Sequence alignment method, e.g., BLAST, FASTA
  – Works only for close homologs
• Profile-based method
  – Compare two protein families instead of primary sequences, using evolutionary information in a family
  – Sequence-profile alignment & profile-profile alignment
  – Profile can be represented as a matrix (e.g., FFAS) or an HMM (e.g., HHpred, HMMER)
  – Sometimes works for remote homologs, but not sensitive enough

Page 31:

MSA to Sequence Profile

Two popular profile representations: (1) Position-specific scoring matrix (PSSM); (2) Hidden Markov Model (HMM)

Page 32:

Position-Specific Scoring Matrix (PSSM)

Taken from http://carrot.mcb.uconn.edu/~olgazh/bioinf2010/class10.html

Page 33:

Hidden Markov Model (HMM)

http://www.biopred.net/eddy.html

Page 34:

Our Work: Markov Random Fields (MRF) Representation

1) MRF encodes long-range residue interaction pattern while HMM does not;

2) Long-range interaction pattern encodes global information of a protein, so it can deal with proteins of similar folds but divergent sequences

Page 35:

Protein alignment by aligning two MRFs

G R K - Y S A
G R K - Y S A
F L V - L Y I
K L V - L Y I

P T A K F R E
P T A K F R S
P T V P G Y E
P T V P G R S

Figure labels: MRF1, MRF2; Family 1, Family 2

Page 36:

Scoring function for MRF alignment

The scoring function has two terms: a local alignment potential and a pairwise alignment potential

Figure (MRF1 aligned to MRF2): labels $\theta^{M}_{i,j}$, $\theta^{M}_{k,l}$, $Z^{M}_{i,j}=1$, $Z^{M}_{k,l}=1$

NP-hard due to
1) Gaps allowed
2) Pairwise potential

Page 37:

Alternating Direction Method of Multipliers (ADMM)

Make a copy of z to y

Add a penalty term to obtain an augmented problem
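The copy-and-penalize idea is easiest to see on a small convex problem. The sketch below applies ADMM to the lasso, which is a swapped-in textbook example and not the MRF alignment problem of the talk: the variable x is copied to z, the constraint x = z is enforced through a quadratic penalty with weight rho, and a scaled multiplier u is updated after each pass. All names and parameter values are assumptions.

```python
import numpy as np


def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def admm_lasso(A, b, lam, rho=1.0, iters=500):
    """Minimize 0.5*||A x - b||^2 + lam*||z||_1 subject to x = z."""
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    lhs = A.T @ A + rho * np.eye(n)
    Atb = A.T @ b
    for _ in range(iters):
        x = np.linalg.solve(lhs, Atb + rho * (z - u))  # smooth subproblem
        z = soft_threshold(x + u, lam / rho)           # L1 subproblem on the copy
        u = u + x - z                                  # scaled multiplier update
    return z


A = np.eye(3)
b = np.array([3.0, -0.5, 1.5])
solution = admm_lasso(A, b, lam=1.0)
print(solution)  # close to [2, 0, 0.5]
```

With A equal to the identity, the exact minimizer is the soft-thresholded vector b, which makes the result easy to verify by hand.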

Page 38:

ADMM (Cont’d)

Use a Lagrangian multiplier to relax the original problem and obtain an upper bound

Solve the above problem iteratively as follows:
Step 1: Solve the optimization problem for a fixed multiplier
Step 2: Update the multiplier by subgradient and repeat 1) until convergence

Page 39:

ADMM (Cont’d)

For a fixed multiplier, split the relaxation problem into two subproblems and solve them alternately

(SP1)

(SP2)

Both subproblems can be solved efficiently by dynamic programming!
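The slides do not show the recurrences for SP1 and SP2, so the sketch below only illustrates the kind of dynamic programming involved: the classic Needleman-Wunsch recurrence for a global alignment with local (per-position) scores and a linear gap penalty. The scoring values and the function name are assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score of the best global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned to gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned to gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align a[i-1], b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    return score[-1][-1]


print(global_alignment_score("GRKYSA", "GRKSA"))  # 4: five matches, one gap
```

The table has (len(a)+1) × (len(b)+1) cells and each cell is filled in constant time, which is what makes this class of subproblem efficient once the pairwise coupling has been relaxed.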

Page 40:

Superfamily & Fold Recognition Rate

Panels: superfamily level detection; fold level detection

Page 41:

Conclusion

• Joint evolutionary coupling analysis + supervised learning can significantly improve protein contact prediction by using information in multiple MSAs
• Long-range residue interaction encoded in an MSA is helpful for remote homolog detection

Page 42:

Acknowledgements

• RaptorX servers at http://raptorx.uchicago.edu
• Students: Jianzhu Ma, Zhiyong Wang, Sheng Wang
• Funding
  – NIH R01GM0897532
  – NSF CAREER award and NSF ABI
  – Alfred P. Sloan Research Fellowship
• Computational resources
  – University of Chicago Beagle team
  – TeraGrid and Open Science Grid

Page 43:

Input:MEKVNFLKNGVLRLPPGFRFRPTDEELVVQYLKRKVFSFPLPASIIPEVEVYKSDPWDLPGDMEQEKYFFSTKEVKYPNGNRSNRATNSGYWKATGIDKQIILRGRQQQQQLIGLKKTLVFYRGKSPHGCRTNWIMHEYRLANLESNYHPIQGNWVICRIFLKKRGNTKNKEENMTTHDEVRNREIDKNSPVVSVKMSSRDSEALASANSELKKKASIIFYDFMGRNNSNGVAASTSSSGITDLTTTNEESDDHEESTSSFNNFTTFKRKIN

Protein Structure Prediction

Output:

1. One of the most challenging problems in computational biology!
2. Improved due to better algorithms and large databases
3. Knowledge-based methods outperform physics-based methods
4. Big demand: our server processes >800 jobs/week, >12k users in 3 yrs

Page 44:

Performance in CASP9 (2010)

A blind test for protein structure prediction

Server ranking tested on the 50 hardest TBM targets

Adapted from http://predictioncenter.org/casp9/doc/presentations/CASP9_TBM.pdf

Page 45:

Performance in CASP10 (2012)

A blind test for protein structure prediction

The only server group among top 10

Adapted from http://predictioncenter.org/casp10/doc/presentations/CASP10_TBM_GM.pdf

The top 10 performing human/server groups on the hardest TBM targets

Page 46:

My Work

Analyze large-scale biological data and build predictive models
• Protein sequence and structure alignment
• Homology detection and fold recognition
• Protein structure prediction
• Protein function prediction (e.g., interaction and binding site prediction)
• Biological network construction and analysis

Study computational methods that have applications beyond bioinformatics
• Machine learning (e.g. probabilistic graphical model)
• Optimization (discrete, combinatorial and continuous)

Page 47:

Homology Detection & Fold Recognition

• Homology detection & fold recognition
  – Determine the relationship between two proteins
  – Given a query, search for all homologs in a database
• Homology search/fold recognition useful for
  – Study protein evolutionary relationship
  – Functional transfer
  – Homology modeling (i.e., template-based modeling)

Two proteins are homologous if they have shared ancestry.
Two proteins have the same fold if their 3D structures are similar.

Page 48:

Structure Prediction (Cont’d)

• Template-based modeling (TBM)
  – Using solved protein structures as template, e.g., homology modeling and protein threading
  – Most reliable, but fails when no good templates
• Template-free modeling (FM) or ab initio folding
  – Not using solved protein structures as template
  – Mostly works only on some small proteins
• Subproblems
  – Loop modeling
  – Inter-residue contact prediction

Page 49:

Residue Pair Grouping

Page 50:

Precision Submatrix Grouping

Figure labels: precision matrix for Family 1; precision matrix for Family 2

Suppose that residue pair (2,4) in Family 1 is aligned to pair (3,5) in Family 2

In total ≤L(L-1)/2 groups where L is the seq length

Page 51:

Performance on the 31 Pfam families with only distantly-related auxiliary families

             Medium-range            Long-range
Method       L/10   L/5    L/2      L/10   L/5    L/2
CoinFold     0.457  0.400  0.267    0.558  0.524  0.416
PSICOV       0.413  0.360  0.252    0.494  0.465  0.377
PSICOV_p     0.320  0.295  0.212    0.396  0.355  0.290
PSICOV_v     0.400  0.320  0.179    0.396  0.375  0.261

CoinFold: our work
PSICOV: single-family method
PSICOV_p: merge multiple families and apply single-family method
PSICOV_v: single-family method for each family and then consensus

Page 52:

Performance on the 13 Pfam families with closely-related auxiliary families

             Medium-range            Long-range
Method       L/10   L/5    L/2      L/10   L/5    L/2
CoinFold     0.501  0.395  0.251    0.462  0.413  0.293
PSICOV       0.433  0.351  0.231    0.398  0.331  0.234
PSICOV_p     0.335  0.220  0.175    0.322  0.276  0.194
PSICOV_v     0.423  0.320  0.188    0.386  0.384  0.301

CoinFold: our work
PSICOV: single-family method
PSICOV_p: merge multiple families and apply single-family method
PSICOV_v: single-family method for each family and then consensus

Page 53:

Our method vs. PSICOV

Page 54:

Our method vs. GREMLIN

L/10 top predicted long-range contacts are evaluated

Page 55:

Performance vs. family size

CoinFold: our work
PSICOV: single-family method
PSICOV_p: merge multiple families and apply single-family method
PSICOV_v: single-family method for each family and then consensus

Page 56:
Page 57:
Page 58:

Multiple Sequence Alignment (MSA) of One Protein Family

Page 59:

Top L/10 long-range prediction accuracy on 15 large Pfam families

PFAM ID   MEFF    CoinFold  PSICOV  PSICOV_p
PF00041   2981    0.767     0.767   0.667
PF00595   3026    0.556     0.444   0.233
PF03061   3334    0.375     0.250   0.500
PF01522   3519    0.615     0.462   0.077
PF00578   3733    0.308     0.308   0.231
PF00059   3744    0.455     0.455   0.182
PF07686   3801    0.917     0.583   0.667
PF00034   4060    0.600     0.500   0.200
PF00989   4596    0.583     0.250   0.500
PF00144   4684    0.272     0.212   0.242
PF00085   5075    0.636     0.545   0.182
PF00168   6735    0.667     0.667   0.556
PF00515   7230    0.500     0.500   0.250
PF00089   9045    0.783     0.783   0.739
PF00550   11476   0.857     0.857   0.714

Page 60:

Running Time

Axes: average protein sequence length; time (in seconds)

Page 61:

Performance: Alignment Accuracy

Page 62:

Performance: Homology Detection

Page 63:

Performance: Alignment Accuracy

TMalign, Matt and DeepAlign represent three different ground truths

Page 64:

Joint Graphical Lasso Formulation

Log-likelihood:
Regularization:

Rewrite the original problem as

• Unconstrained optimization problem
• Both functions are convex, so the objective is the difference of two convex functions
• Can be solved by the Convex-Concave Procedure [A. Yuille]

Page 65:

Alternating Direction Method of Multipliers (ADMM)

Make a copy of $\Omega$ to $Z$, without changing the solution space

s.t. $Z = \Omega$

Add a penalty term to obtain an augmented problem, which has the same solution but converges faster.

s.t. $Z = \Omega$

Page 66:

Lagrangian Relaxation

$$\min_{U}\max_{\Omega,Z}\ f(\Omega) - P(Z) - \sum_{k=1}^{K}(\cdots)$$

• Use a Lagrange multiplier for each constraint
• Obtain a dual problem to upper bound the augmented problem

Solve the dual problem iteratively by subgradient as follows.

Step 1) Fix $U$ and solve

Step 2) Update $U$ by subgradient and repeat 1) until convergence

Page 67:

ADMM (Cont’d)

For a fixed U, split the relaxation problem into two subproblems and solve them alternately

(SP1)

(SP2)