Distributed Approach for Peptide Identification
By Naga Venkata Krishna Abhinav Vedanbhatla
Outline
• Background
• About C-Ranker
• Problem Statement
• Proposed Solution
• Architecture
• Implementation
• Execution Environment
• Results
• Conclusions & Future Scope
Background
Protein
• Biomolecules consisting of one or more chains of amino acids
• Proteins differ from one another primarily in their sequence of amino acids
• A protein is characterized by the sequence of amino acids as they occur in the protein
Proteins (Cont.…)
• Proteins perform a vast array of functions within living organisms.
• A protein contains at least one long polypeptide.
• Proteins are involved in almost every biological process happening in an organism’s body.
• They are an important part of drug development for targeting specific metabolic pathways.
Peptide
• Short chain of amino acids
• Peptides are distinguished from proteins on the basis of size
• A protein is first digested into peptides and then each peptide is identified individually to infer the protein identity.
Finding Peptide and Protein Relationship?
• Peptide Identification using mass spectrometry
Determine the sequence of Peptides
• Peptide mass fingerprinting (PMF) in MS spectra
Peptide Identification Using MS/MS Spectra
• Sequence database searching (for the large-scale dataset)
• de novo sequencing (new protein discovery)
• Post sequence database searching (extension of sequence database search)
Peptide Identification (cont…)
• Mass Spectrometry (MS) strategy
• Sequence database searching
• Combination of both: dominant method for peptide identification
• Which results in more spectra from MS
Sequence Database searching algorithms.
• SEQUEST
• Mascot
Post Database Algorithms
• Machine learning algorithms are proposed to identify the peptide spectrum matches (PSM)
Post Database Search Algorithms
• PeptideProphet: learns the distribution of scores and properties
• Percolator: search scores considered reliable for high values and low values
• CRanker: fuzzy SVM and silhouette index
CRanker
C-Ranker
• Identifies the correct PSMs among those output by the peptide database search.
• Developed in Matlab and C
Why CRanker?
• Based on research by Dr. Zhonghang Xia, it performs best among the post-database search tools compared.
• Easy to parallelize, so it can run on a network of computers rather than a single computer and work at a larger scale.
Why CRanker? (cont…)
Data   PeptideProphet   CRanker   Overlap
UPS1   582              576       509
pbmc   34035            34273     32243
The overlap between the PSMs identified by PeptideProphet and CRanker is 88.4% on UPS1 and 94.8% on PBMC.
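The slide does not say which count the overlap is divided by. As a rough check, the sketch below assumes the denominator is the smaller of the two result sets; that assumption, and the names `counts` and `overlap_percent`, are illustrative and not taken from the source.

```python
# PSM counts copied from the table above: (PeptideProphet, CRanker, overlap).
counts = {
    "UPS1": (582, 576, 509),
    "pbmc": (34035, 34273, 32243),
}

def overlap_percent(a, b, shared):
    """Overlap as a percentage of the smaller result set (assumed denominator)."""
    return round(100.0 * shared / min(a, b), 1)

overlap = {name: overlap_percent(*row) for name, row in counts.items()}
print(overlap)  # {'UPS1': 88.4, 'pbmc': 94.7}
```

Under this assumption the UPS1 figure matches the quoted 88.4%, and the PBMC figure comes out at 94.7%, within rounding distance of the quoted 94.8%.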
CRanker Execution: Step 1
C-Ranker Read Stage: reads InputFileName.txt, loads the raw PSM data into main memory, and generates InputFileName.mat.
CRanker Execution: Step 2
C-Ranker Solve Stage: reads InputFileName.txt, loads the PSM records into main memory, and creates InputFileName_score.mat.
CRanker Execution: Step 3
C-Ranker Write Stage: reads InputFileName.txt and InputFileName_score.mat, loads the PSM scores into main memory, and creates OutputFile.txt.
Problem Statement
• C-Ranker needs a computer with high computation power
• For a dataset of about 400,000 PSM records, it may take about 5 to 8 hours on a normal PC
• Poor Resource Management
• Need to address future big data sets
Can’t we change C-Ranker?
• Research is ongoing to optimize C-Ranker.
• A distributed version of C-Ranker is one option.
• But that would require rewriting the complete code!
Brainstorming
• Can we divide the 400,000 PSM records across 4 machines to do the job (and who will do the dividing)?
• Increase computational power?
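The division question above can be made concrete with a little arithmetic. This is a minimal sketch, assuming an even split by record index; the helper name `chunk_bounds` is hypothetical, and the slides do not say how the actual framework assigns records to machines.

```python
def chunk_bounds(total_records, num_machines):
    """Return (start, end) index pairs, end exclusive, that cover all records.

    Chunk sizes differ by at most one record when the total is not
    evenly divisible by the number of machines.
    """
    base, extra = divmod(total_records, num_machines)
    bounds = []
    start = 0
    for i in range(num_machines):
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

bounds = chunk_bounds(400_000, 4)
print(bounds)
# [(0, 100000), (100000, 200000), (200000, 300000), (300000, 400000)]
```

With 400,000 records and 4 machines, each machine receives 100,000 records.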
Constraints!
• Restrictions on changing the C-Ranker design and code (I am not experienced enough to do so.)
• Should not change the execution flow of C-Ranker.
Shortlisted Approach
• Fundamental Distributed Approach
Why Distributed Framework?
• Can handle bigger datasets than a centralized setting could.
• Requires less memory per computer, and each computer can use commodity hardware.
• Multiple commodity computers are cheaper than a single high-performance, high-end system capable of achieving similar goals.
Job Execution in Distributed Approach
Proposed Solution
• A framework to execute C-Ranker on distributed nodes.
• Designed so that it may work with other post-database searching algorithms like C-Ranker with minimal changes.
• Compare the time taken to generate the distributed output of C-Ranker with that of the actual output.
• Make sure the C-Ranker algorithm executes correctly on the set of predefined nodes.
Architecture
Implementation
Data Flow Details in the Original Single-Threaded C-Ranker
Data Flow Details in Distributed C-Ranker (Dividing)
Data Flow For a Single Worker Host
Data Flow Details in Distributed C-Ranker (Merging)
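The dividing, per-worker, and merging flows named in the slide titles above can be sketched in a few lines. This is only a local analogy under stated assumptions: threads stand in for the worker hosts, `run_on_worker` is a placeholder that copies the search score through rather than running C-Ranker, and all function names are hypothetical. The actual framework described in the slides uses Java, the MATLAB MCR, and Apache Tomcat.

```python
from concurrent.futures import ThreadPoolExecutor

def divide(records, num_workers):
    """Split records into contiguous chunks, at most one per worker."""
    size = -(-len(records) // num_workers)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]

def run_on_worker(chunk):
    """Stand-in for one worker host running unmodified C-Ranker on its chunk.

    Placeholder only: it passes each search score through unchanged and
    does NOT implement the CRanker model.
    """
    return [(psm_id, search_score) for psm_id, search_score in chunk]

def merge(partial_results):
    """Concatenate the per-worker outputs in chunk order."""
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged

# Hypothetical PSM records: (id, search score).
records = [(f"psm{i}", i / 10) for i in range(10)]
chunks = divide(records, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map returns results in the same order as the input chunks.
    partial_results = list(pool.map(run_on_worker, chunks))
merged = merge(partial_results)
print([len(c) for c in chunks], len(merged))  # [3, 3, 3, 1] 10
```

Because the tool is only called on each chunk and never modified, the same wrapper shape could in principle host another post-database search tool, which is the design goal stated in the proposed solution.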
Execution Environment
• Java
• MATLAB MCR environment
• Apache Tomcat web server
Input Data used
Hardware Used to Observe Results
Servers            Server_1    Server_2        Server_3    Server_4
Memory             8GB         4GB             4GB         4GB
Processor          i5          i5              i5          i5
Operating System   Windows 7   Windows Vista   Windows 7   Windows 7
Comparison of C-Ranker on the Distributed Approach with C-Ranker on an Apache Hadoop Framework (Cluster 1)
PBMC data (KB)   C-Ranker execution time on Hadoop Cluster 1 (hrs)   Distributed approach for C-Ranker execution time (hrs)
11221            6.5                                                  3.56
12816            9.9                                                  8.1
31422            10.2                                                 8.25
48486            15.2                                                 9.2
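To make the comparison easier to read, the ratio of the two execution times can be computed per dataset. The sketch below uses only the numbers in the table above; the variable names are illustrative.

```python
# Execution times in hours, copied from the table above:
# PBMC data size (KB) -> (C-Ranker on Hadoop Cluster 1, distributed approach).
times_hrs = {
    11221: (6.5, 3.56),
    12816: (9.9, 8.1),
    31422: (10.2, 8.25),
    48486: (15.2, 9.2),
}

# A ratio above 1 means the distributed approach finished faster.
speedup = {
    size: round(hadoop / distributed, 2)
    for size, (hadoop, distributed) in times_hrs.items()
}
print(speedup)  # {11221: 1.83, 12816: 1.22, 31422: 1.24, 48486: 1.65}
```

On these four datasets the distributed approach is between about 1.2 and 1.8 times faster than Hadoop Cluster 1.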
Results
Results for testData.xls (409 KB)
Results for Pbmc_orbit_mips.xls (11221 KB)
Results for Pbmc_orbit_nomips.xls (12816 KB)
Results for Pbmc_velos_mips.xls (31422 KB)
Results for Pbmc_velos_nomips.xls (48486 KB)
Memory usage for testData.xls (409 KB)
Memory Usage for Pbmc_orbit_mips.xls (11221 KB)
Memory Usage for Pbmc_orbit_nomips.xls (12816 KB)
Memory Usage for Pbmc_velos_mips.xls (31422 KB)
Memory usage for Pbmc_velos_nomips.xls (48486 KB)
Memory Usage
Difference in Memory Usage
Upgraded Hardware to Compare with Hadoop Cluster 2
Servers            Server_1    Server_2    Server_3    Server_4
Memory             12GB        8GB         12GB        8GB
Processor          i7          i5          i7          i7
Operating System   Windows 8   Windows 7   Windows 7   Windows 7
Comparison of C-Ranker on the Distributed Approach with C-Ranker on an Apache Hadoop Framework (Cluster 2)
PBMC data (KB)   Distributed approach for C-Ranker execution time (hrs, new results)   C-Ranker execution time on Hadoop Cluster 2 (hrs)
11221            1.7                                                                    1.3
12816            3.82                                                                   1.58
31422            4.1                                                                    3.4
48486            5.93                                                                   4.5
Cost Calculation of Apache Hadoop Cluster 1 and Cluster 2
PBMC data size (KB)   Cost of Hadoop Cluster 1 ($)   Cost of Hadoop Cluster 2 ($)
11221                 3.4581                         0.6916
12816                 5.2668                         0.8410
31422                 5.4264                         1.8088
48486                 8.8064                         2.3941
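The slides do not state how these costs were derived. One way to sanity-check them is to divide each cost by the corresponding Hadoop execution time from the two comparison tables. The sketch below does that; treating cost as hours times a flat hourly rate is an assumption, not something the source confirms.

```python
# PBMC data size (KB) -> (Hadoop execution time in hrs, cost in $),
# copied from the comparison tables and the cost table above.
cluster1 = {11221: (6.5, 3.4581), 12816: (9.9, 5.2668),
            31422: (10.2, 5.4264), 48486: (15.2, 8.8064)}
cluster2 = {11221: (1.3, 0.6916), 12816: (1.58, 0.8410),
            31422: (3.4, 1.8088), 48486: (4.5, 2.3941)}

def hourly_rate(rows):
    """Implied cost per hour for each dataset, rounded to 3 decimals."""
    return {size: round(cost / hours, 3) for size, (hours, cost) in rows.items()}

rate1 = hourly_rate(cluster1)
rate2 = hourly_rate(cluster2)
print(rate1)  # {11221: 0.532, 12816: 0.532, 31422: 0.532, 48486: 0.579}
print(rate2)  # {11221: 0.532, 12816: 0.532, 31422: 0.532, 48486: 0.532}
```

Seven of the eight rows imply a rate of about $0.532 per hour. The 48486 KB row for Cluster 1 implies about $0.579, which may point to a transcription slip in either the time or the cost for that row.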
Conclusion and Future Scope
Conclusion
• Reduces the execution time
• Absolutely cost free (no need for high-performance computing machines)
• No need to change the current structure of C-Ranker
Conclusion (Cont.…)
• Better Resource Management. For example: Memory
• No need to change the implementation of CRanker
Future Scope
• The same distributed approach can be used with Percolator and PeptideProphet to see how well they perform
• Additionally, one can use an ensemble method to combine the results of the three tools.
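The slides do not specify which ensemble method is meant. As one simple possibility, the sketch below uses majority voting over the PSMs each tool accepts; the function name, the PSM identifiers, and the choice of voting are all illustrative assumptions.

```python
from collections import Counter

def majority_vote(*accepted_sets, min_votes=2):
    """Keep the PSM ids accepted by at least min_votes of the tools."""
    votes = Counter()
    for accepted in accepted_sets:
        votes.update(set(accepted))
    return {psm_id for psm_id, n in votes.items() if n >= min_votes}

# Hypothetical PSM ids accepted by each tool (illustrative, not real results).
cranker = {"psm1", "psm2", "psm3"}
peptide_prophet = {"psm2", "psm3", "psm4"}
percolator = {"psm3", "psm5"}

consensus = majority_vote(cranker, peptide_prophet, percolator)
print(sorted(consensus))  # ['psm2', 'psm3']
```

Raising `min_votes` to 3 keeps only the PSMs all three tools agree on, trading coverage for stricter agreement.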
Questions??
Thank you