distributed approach for peptide identification

Distributed Approach for Peptide Identification

By Naga Venkata Krishna Abhinav Vedanbhatla

Outline

• Background

• About C-Ranker

• Problem Statement

• Proposed Solution

• Architecture

• Implementation

• Execution Environment

• Results

• Conclusions & Future Scope

Background

Protein

• Biomolecules consisting of one or more chains of amino acids

• Proteins differ from one another primarily in their sequence of amino acids

• A protein is characterized by the sequence of amino acids as they occur in the protein

Proteins (Cont.…)

• Proteins perform a vast array of functions within living organisms

• A protein contains at least one long polypeptide

• Proteins are involved in almost every biological process happening in an organism's body

• They are an important part of drug development that targets specific metabolic pathways

Peptide

• Short chain of amino acids

• Peptides are distinguished from proteins on the basis of size

• A protein is first digested into peptides and then each peptide is identified individually to infer the protein identity.

Finding Peptide and Protein Relationship?

• Peptide Identification using mass spectrometry

Determine the sequence of Peptides

• Peptide mass fingerprinting (PMF) in MS spectra

Peptide Identification Using MS/MS Spectra

• Sequence database searching (for the large-scale dataset)

• de novo sequencing (new protein discovery)

• Post sequence database searching (extension of sequence database search)

Peptide Identification (cont…)

• Mass Spectrometry (MS) strategy

• Sequence database searching

• Combination of both: dominant method for peptide identification

• Which results in more spectra from MS

Sequence Database searching algorithms.

• SEQUEST

• Mascot

Post Database Algorithms

• Machine learning algorithms have been proposed to identify the correct peptide-spectrum matches (PSMs)

Post Database Search Algorithms

• PeptideProphet: learns distribution of scores and properties

• Percolator: search scores considered reliable for high values and low values

• CRanker: fuzzy SVM and silhouette index

CRanker

C-Ranker

• Identifies the correct PSMs among those output by the peptide database search.

• Developed in Matlab and C

Why CRanker?

• According to research by Dr. Zhonghang Xia, it performs best among the post-database search algorithms considered.

• It is easy to parallelize, so it can run on a network of computers rather than a single computer and work at a larger scale

Why CRanker? (cont…)

Data    PeptideProphet   CRanker   Overlap
UPS1    582              576       509
PBMC    34035            34273     32243

The overlap of the total PSMs identified by PeptideProphet and CRanker is 88.4% on UPS1 and 94.8% on PBMC.

CRanker Execution: Step 1

• C-Ranker Read Stage: reads InputFileName.txt, loads raw PSM data into main memory, generates InputFileName.mat

• C-Ranker Solve Stage: reads InputFileName.txt, loads PSM records into main memory, creates InputFileName_score.mat

• C-Ranker Write Stage: reads InputFileName.txt and InputFileName_score.mat, loads PSM scores into main memory, creates OutputFile.txt

Problem Statement

• C-Ranker needs a computer with high computational power

• For a dataset of about 400,000 PSM records, it may take about 5 to 8 hours on a normal PC

• Poor Resource Management

• Need to address future big data sets

Can’t we change C-Ranker?

• Research is ongoing to optimize C-Ranker.

• Distributed approach of C-Ranker.

• Needs a complete rewrite of the code!

Brainstorming

• Can we divide the 400,000 PSM records across 4 machines to do the job (and who will do the dividing)?

• Increase computational power?
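The idea of dividing the PSM records across machines can be sketched as follows. This is only an illustration under assumptions: the function name `split_records`, the representation of records as a list of strings, and the choice of equal-sized contiguous chunks are hypothetical and are not taken from the C-Ranker code or from the framework presented here.

```python
from typing import List, Sequence


def split_records(records: Sequence[str], n_workers: int) -> List[List[str]]:
    """Split records into n_workers contiguous chunks whose sizes differ by at most one."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    base, extra = divmod(len(records), n_workers)
    chunks: List[List[str]] = []
    start = 0
    for i in range(n_workers):
        # The first `extra` chunks take one additional record each.
        size = base + (1 if i < extra else 0)
        chunks.append(list(records[start:start + size]))
        start += size
    return chunks


# Hypothetical stand-in for 400,000 PSM records divided across 4 machines.
records = [f"psm_{i}" for i in range(400_000)]
chunks = split_records(records, 4)
print([len(chunk) for chunk in chunks])  # [100000, 100000, 100000, 100000]
```

Contiguous chunks keep the original record order, so the pieces can later be put back together by simple concatenation.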

Constraints!

• Restrictions on changing the C-Ranker design and code (I am not experienced enough to do so)

• Should not change the execution flow of C-Ranker.

Shortlisted Approach

• Fundamental Distributed Approach

Why Distributed Framework?

• It can handle bigger datasets than it would be able to in a centralized setting.

• Requires less memory per computer and each computer can have commodity hardware.

• Cheaper to have multiple commodity hardware computers than having a single high-performance high-end system capable of achieving similar goals.

Job Execution in Distributed Approach

Proposed Solution

• A framework to execute C-Ranker on distributed nodes

• Designed so that it can work with other post-database search algorithms like C-Ranker with minimal changes

• Compare the time taken to generate the distributed output of C-Ranker with that of the actual output

• Make sure the C-Ranker algorithm executes correctly on the set of predefined nodes

Architecture

Implementation

Data Flow Details in the Original Single-Threaded C-Ranker

Data Flow Details in Distributed C-Ranker (Dividing)

Data Flow For a Single Worker Host

Data Flow Details in Distributed C-Ranker (Merging)
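A minimal sketch of what a merging step could look like, assuming each worker host returns its result as a list of text lines whose first line is a shared header. The function name `merge_worker_outputs` and this output layout are hypothetical; the slides do not specify the actual file format or merge logic.

```python
from typing import List


def merge_worker_outputs(outputs: List[List[str]]) -> List[str]:
    """Concatenate per-worker result lines in worker order, keeping a single header."""
    merged: List[str] = []
    for lines in outputs:
        if not lines:
            continue  # a worker that produced nothing contributes nothing
        header, rows = lines[0], lines[1:]
        if not merged:
            merged.append(header)
        elif header != merged[0]:
            raise ValueError("worker outputs have different headers")
        merged.extend(rows)
    return merged


# Hypothetical outputs from two worker hosts.
worker_1 = ["psm_id\tscore", "psm_0\t0.91", "psm_1\t0.12"]
worker_2 = ["psm_id\tscore", "psm_2\t0.77"]
merged = merge_worker_outputs([worker_1, worker_2])
print("\n".join(merged))
```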

Execution Environment

• JAVA

• MATLAB MCR environment

• Apache Tomcat web server

Input Data used

Hardware Used to Observe Results

Servers            Server_1    Server_2        Server_3    Server_4
Memory             8GB         4GB             4GB         4GB
Processor          i5          i5              i5          i5
Operating System   Windows 7   Windows Vista   Windows 7   Windows 7

Comparison of C-Ranker on distributed approach with C-Ranker on an Apache Hadoop Framework Cluster 1

Execution time in hrs:

PBMC data (KB)   C-Ranker (Cluster 1 Hadoop)   Distributed approach for C-Ranker
11221            6.5                           3.56
12816            9.9                           8.1
31422            10.2                          8.25
48486            15.2                          9.2
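The size of the improvement over Hadoop Cluster 1 can be read off as a ratio of the two execution times. The sketch below computes that ratio for each input size from the table values, reading the last Hadoop entry as 15.2 hrs; the variable names are illustrative only.

```python
# Execution times in hours, keyed by PBMC input size in KB (values from the table).
hadoop_cluster1_hrs = {11221: 6.5, 12816: 9.9, 31422: 10.2, 48486: 15.2}
distributed_hrs = {11221: 3.56, 12816: 8.1, 31422: 8.25, 48486: 9.2}

# Speedup = Hadoop time / distributed time; values above 1 favour the distributed approach.
speedups = {
    size: round(hadoop_cluster1_hrs[size] / distributed_hrs[size], 2)
    for size in hadoop_cluster1_hrs
}

for size, ratio in speedups.items():
    print(f"{size} KB: {ratio}x")
# 11221 KB: 1.83x
# 12816 KB: 1.22x
# 31422 KB: 1.24x
# 48486 KB: 1.65x
```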

Results

Results for testData.xls (409 KB)

Results for Pbmc_orbit_mips.xls (11221 KB)

Results for Pbmc_orbit_nomips.xls (12816 KB)

Results for Pbmc_velos_mips.xls (31422 KB)

Results for Pbmc_velos_nomips.xls (48486 KB)

Memory usage for testData.xls (409 KB)

Memory Usage for Pbmc_orbit_mips.xls (11221 KB)

Memory Usage for Pbmc_orbit_nomips.xls (12816 KB)

Memory Usage for Pbmc_velos_mips.xls (31422 KB)

Memory usage for Pbmc_velos_nomips.xls (48486 KB)

Memory Usage

Difference in Memory Usage

Upgraded hardware to compare with cluster2 Hadoop

Servers            Server_1    Server_2    Server_3    Server_4
Memory             12GB        8GB         12GB        8GB
Processor          i7          i5          i7          i7
Operating System   Windows 8   Windows 7   Windows 7   Windows 7

Comparison of C-Ranker on distributed approach with C-Ranker on an Apache Hadoop Framework Cluster 2

Execution time in hrs:

PBMC data (KB)   Distributed approach for C-Ranker (new results)   CRanker (Cluster 2 Hadoop)
11221            1.7                                               1.3
12816            3.82                                              1.58
31422            4.1                                               3.4
48486            5.93                                              4.5

Cost Calculation of Apache Hadoop Cluster 1 and Cluster 2

PBMC data size (KB)   Cost of Hadoop Cluster 1 ($)   Cost of Hadoop Cluster 2 ($)
11221                 3.4581                         0.6916
12816                 5.2668                         0.8410
31422                 5.4264                         1.8088
48486                 8.8064                         2.3941
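One way to relate these costs to the Hadoop execution times is to divide each cost by the corresponding time from the two comparison tables. The sketch below does this. For seven of the eight entries the result is about $0.532 per hour, which suggests the costs were computed from a fixed hourly rate. That rate is an inference from the numbers, not something the slides state, and the Cluster 1 entry for 48486 KB gives a different value.

```python
# Hadoop execution times in hours, as read from the two comparison tables.
hadoop_hours = {
    "cluster1": {11221: 6.5, 12816: 9.9, 31422: 10.2, 48486: 15.2},
    "cluster2": {11221: 1.3, 12816: 1.58, 31422: 3.4, 48486: 4.5},
}
# Costs in dollars from the cost table.
cost_usd = {
    "cluster1": {11221: 3.4581, 12816: 5.2668, 31422: 5.4264, 48486: 8.8064},
    "cluster2": {11221: 0.6916, 12816: 0.8410, 31422: 1.8088, 48486: 2.3941},
}


def implied_hourly_rates(hours, cost):
    """Return cost divided by hours for every cluster and input size, rounded to 3 decimals."""
    return {
        cluster: {size: round(cost[cluster][size] / hours[cluster][size], 3)
                  for size in hours[cluster]}
        for cluster in hours
    }


rates = implied_hourly_rates(hadoop_hours, cost_usd)
for cluster, by_size in rates.items():
    for size, rate in by_size.items():
        print(cluster, size, rate)
```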

Conclusion and Future Scope

Conclusion

• Reduces the execution time

• No additional hardware cost (no need for high-performance computing machines)

• No need to change the current structure of C-Ranker

Conclusion (Cont.…)

• Better Resource Management. For example: Memory

• No need to change the implementation of CRanker

Future Scope

• The same distributed approach can be used with Percolator and PeptideProphet to see how well they perform

• Additionally, one can use an ensemble method to combine the results of the three tools.

Questions??

Thank you
