Probabilistic Ensembles for Improved Inference in Protein-Structure Determination

Ameet Soni* and Jude Shavlik
Dept. of Computer Sciences
Dept. of Biostatistics and Medical Informatics

Presented at the ACM International Conference on Bioinformatics and Computational Biology 2011

Protein Structure Determination

Proteins are essential to most cellular functions: structural support, catalysis/enzymatic activity, and cell signaling

Protein structures determine function

X-ray crystallography is the main technique for determining structures

Task Overview

Given: a protein sequence and an electron-density map (EDM) of the protein

Do: automatically produce a protein structure that contains all atoms and is physically feasible

SAVRVGLAIM...

Challenges & Related Work

[Figure: resolution scale at 1 Å, 2 Å, 3 Å, and 4 Å, comparing ARP/wARP, TEXTAL & RESOLVE, and our method, ACMI]

Resolution is a property of the protein

Higher resolution: better quality

Outline

Protein Structures
Prior Work on ACMI
Probabilistic Ensembles in ACMI (PEA)
Experiments and Results

Our Technique: ACMI

Phase 1: Perform Local Match. Output: prior probability of each amino acid's (AA's) location
Phase 2: Apply Global Constraints. Output: posterior probability of each AA's location
Phase 3: Sample Structure. Output: all-atom protein structures

[Figure: backbone positions bk-1, bk, bk+1, sampled 1…M times]

Results

[DiMaio, Kondrashov, Bitto, Soni, Bingman, Phillips, and Shavlik, Bioinformatics 2007]

ACMI Outline

[Figure: the three-phase ACMI pipeline (Phase 1, Phase 2, Phase 3)]

Phase 2 – Probabilistic Model

ACMI models the probability of all possible traces using a pairwise Markov Random Field (MRF)

[Figure: MRF with one node per amino acid: ALA1, GLY2, LYS3, LEU4, SER5]

Probabilistic Model

# nodes: ~1,000
# edges: ~1,000,000
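For reference, a pairwise MRF over backbone positions factorizes in the standard form sketched below. The symbols are generic notation assumed here, not taken from the slides: \psi_i is a per-amino-acid evidence potential, \psi_{ij} is a pairwise constraint potential, and Z is the normalizing constant.

```latex
P(b_1, \dots, b_N) = \frac{1}{Z} \prod_{i=1}^{N} \psi_i(b_i) \prod_{i<j} \psi_{ij}(b_i, b_j)
```

When every pair of amino acids carries a constraint, the number of pairwise terms grows as O(N^2), which is consistent with ~1,000 nodes giving on the order of ~1,000,000 edges.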

Approximate Inference

The best structure is intractable to calculate, i.e., we cannot infer the underlying structure analytically

Phase 2 uses Loopy Belief Propagation (BP) to approximate the solution: a local, message-passing scheme that distributes evidence between nodes
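To make the message-passing idea concrete, here is a minimal sketch of generic sum-product loopy BP on a tiny discrete pairwise MRF. It is an illustration under simplifying assumptions, not ACMI's implementation: the function name `loopy_bp`, the two-state variables, and the toy potentials are all hypothetical, whereas ACMI reasons over locations in a 3D map.

```python
def loopy_bp(unary, pairwise, n_iters=50):
    """Sum-product loopy BP on a discrete pairwise MRF (illustrative only).

    unary:    {node: [psi(x) for each state x]}
    pairwise: {(i, j): table} with table[xi][xj] the edge potential
    Returns approximate marginals {node: [p(x), ...]}.
    """
    # Store each edge in both directions so messages can flow both ways.
    pot = {}
    for (i, j), table in pairwise.items():
        pot[(i, j)] = table
        pot[(j, i)] = [list(col) for col in zip(*table)]  # transposed table
    neighbors = {n: [j for (i, j) in pot if i == n] for n in unary}

    # Start from uniform messages.
    msgs = {(i, j): [1.0 / len(unary[j])] * len(unary[j]) for (i, j) in pot}

    for _ in range(n_iters):
        new_msgs = {}
        for (i, j) in pot:
            out = []
            for xj in range(len(unary[j])):
                total = 0.0
                for xi in range(len(unary[i])):
                    # Evidence at i times all incoming messages except j's.
                    incoming = unary[i][xi]
                    for k in neighbors[i]:
                        if k != j:
                            incoming *= msgs[(k, i)][xi]
                    total += incoming * pot[(i, j)][xi][xj]
                out.append(total)
            z = sum(out)
            new_msgs[(i, j)] = [v / z for v in out]  # normalize for stability
        msgs = new_msgs

    # Belief at a node: local evidence times all incoming messages.
    beliefs = {}
    for n in unary:
        b = list(unary[n])
        for k in neighbors[n]:
            b = [bv * mv for bv, mv in zip(b, msgs[(k, n)])]
        z = sum(b)
        beliefs[n] = [v / z for v in b]
    return beliefs


# Toy example: three amino acids in a loop, two candidate states each.
unary = {"ALA1": [0.7, 0.3], "GLY2": [0.5, 0.5], "LYS3": [0.4, 0.6]}
agree = [[2.0, 1.0], [1.0, 2.0]]  # hypothetical potential favoring agreement
pairwise = {("ALA1", "GLY2"): agree,
            ("GLY2", "LYS3"): agree,
            ("ALA1", "LYS3"): agree}
marginals = loopy_bp(unary, pairwise)
```

Because the graph contains a cycle, the returned marginals are approximations rather than exact posteriors, which is the source of the "room for improvement" that motivates ensembles.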

Loopy Belief Propagation

[Figure: nodes LYS31 and LEU32 with marginals pLYS31 and pLEU32; message mLYS31→LEU32 passes from LYS31 to LEU32]

Loopy Belief Propagation

[Figure: nodes LYS31 and LEU32 with marginals pLYS31 and pLEU32; message mLEU32→LYS31 passes back from LEU32 to LYS31]

Shortcomings of Phase 2

Inference is very difficult:
~1,000,000 possible outputs for one amino acid
~250-1250 amino acids in one protein
Evidence is noisy
O(N^2) constraints

Solutions are approximate, leaving room for improvement



Ensemble Methods

Ensembles: the use of multiple models to improve predictive performance

Tend to outperform the best single model [Dietterich '00], e.g., the Netflix Prize

Phase 2: Standard ACMI

[Figure: a single protocol run on the MRF produces one marginal, P(bk)]

Phase 2: Ensemble ACMI

[Figure: Protocol 1, Protocol 2, …, Protocol C run on the same MRF produce marginals P1(bk), P2(bk), …, PC(bk)]

Probabilistic Ensembles in ACMI (PEA)

New ensemble framework (PEA):
Run inference multiple times, under different conditions
Output: multiple, diverse estimates of each amino acid's location

Phase 2 now has several probability distributions for each amino acid, so what?

ACMI Outline

[Figure: the three-phase ACMI pipeline (Phase 1, Phase 2, Phase 3)]





Backbone Step (Prior work)

Place next backbone atom:

(1) Sample candidates b'k for bk from the empirical Cα-Cα-Cα pseudoangle distribution
(2) Weight each sample by its Phase 2 computed marginal
(3) Select bk with probability proportional to sample weight

[Figure: candidate positions b'k extending from bk-2 and bk-1, with example weights 0.25, 0.20, 0.15, …]
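Steps (2) and (3) amount to weighted random selection, which can be sketched as below. This is a hedged illustration only: `select_next_position` and the candidate labels are hypothetical, the weights reuse the example values from the slide, and step (1), drawing candidates from the pseudoangle distribution, is assumed to have already happened.

```python
import random


def select_next_position(candidates, marginal, rng=random):
    """Pick one candidate with probability proportional to its weight.

    candidates: list of candidate positions b'k (any hashable labels here)
    marginal:   callable returning the Phase 2 marginal for a candidate
    """
    weights = [marginal(c) for c in candidates]
    if sum(weights) <= 0:
        raise ValueError("all candidate weights are zero")
    # random.choices normalizes the weights internally.
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical candidates with the example weights 0.25, 0.20, 0.15.
phase2_marginal = {"c1": 0.25, "c2": 0.20, "c3": 0.15}
rng = random.Random(0)  # seeded only to make the sketch repeatable
picks = [
    select_next_position(list(phase2_marginal), phase2_marginal.get, rng)
    for _ in range(1000)
]
```

Selecting in proportion to the weight, rather than always taking the highest-weighted candidate, keeps lower-probability positions reachable across the repeated samples.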

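Backbone Step for PEA

[Figure: candidate b'k, extending from bk-2 and bk-1, receives one marginal per ensemble component, P1(b'k), P2(b'k), …, PC(b'k) (example values 0.23, 0.15, 0.04); an aggregator combines them into a single weight w(b'k)]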

Backbone Step for PEA: Average

[Figure: AVG aggregator; component marginals 0.23, 0.15, 0.04 for b'k give weight 0.14]

Backbone Step for PEA: Maximum

[Figure: MAX aggregator; component marginals 0.23, 0.15, 0.04 for b'k give weight 0.23]

Backbone Step for PEA: Sample

[Figure: SAMP aggregator; component marginals 0.23, 0.15, 0.04 for b'k give weight 0.15 in this example]
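The three aggregators can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `aggregate` is hypothetical, and SAMP is assumed here to pick one component's marginal uniformly at random, which matches the slide's example output but is not spelled out in the slides.

```python
import random


def aggregate(marginals, how, rng=random):
    """Combine per-component marginals for one candidate into one weight.

    marginals: list of values [P1(b'k), P2(b'k), ..., PC(b'k)]
    how:       "AVG", "MAX", or "SAMP"
    """
    if not marginals:
        raise ValueError("need at least one ensemble component")
    if how == "AVG":
        return sum(marginals) / len(marginals)
    if how == "MAX":
        return max(marginals)
    if how == "SAMP":
        # Assumption: choose one component uniformly at random.
        return rng.choice(marginals)
    raise ValueError(f"unknown aggregator: {how}")


example = [0.23, 0.15, 0.04]  # example values from the slides
avg_weight = aggregate(example, "AVG")    # approximately 0.14
max_weight = aggregate(example, "MAX")    # 0.23
samp_weight = aggregate(example, "SAMP")  # one of the three values
```

The resulting weight w(b'k) then takes the place of the single Phase 2 marginal when candidates are weighted in the backbone step.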

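Review: Previous work on ACMI

[Figure: Phase 2 runs one protocol to produce P(bk); Phase 3 uses it to weight candidate positions extending from bk-2 and bk-1 (example weights 0.25, 0.20, 0.15, …)]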

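Review: PEA

[Figure: Phase 2 runs multiple protocols; an aggregator (AGG) combines their marginals; Phase 3 uses the aggregated weights for candidate positions extending from bk-2 and bk-1 (example weights 0.14, 0.26, 0.05, …)]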



Experimental Methodology

PEA (Probabilistic Ensembles in ACMI):
4 ensemble components
Aggregators: AVG, MAX, SAMP

ACMI:
ORIG: standard ACMI (prior work)
EXT: run inference 4 times as long
BEST: test best of 4 PEA components

Phase 2 Results

[Figure: Phase 2 results; * marks p-value < 0.01]

Protein Structure Results

[Figure: correctness and completeness of the resulting structures; * marks p-value < 0.05]

Protein Structure Results

Impact of Ensemble Size

Conclusions

ACMI is the state-of-the-art method for determining protein structures in poor-resolution images

Probabilistic Ensembles in ACMI (PEA) improves approximate inference and produces better protein structures

Future work:
General solution for inference
Larger ensemble size

Acknowledgements

Phillips Laboratory at UW-Madison
UW Center for Eukaryotic Structural Genomics (CESG)

NLM R01-LM008796
NLM Training Grant T15-LM007359
NIH Protein Structure Initiative Grant GM074901

Thank you!
