Distributed Approach for Peptide Identification By Naga Venkata Krishna Abhinav Vedanbhatla

Post on 22-Jan-2017

Page 1: Distributed approach for Peptide Identification

Distributed Approach for Peptide Identification

By Naga Venkata Krishna Abhinav Vedanbhatla

Page 2: Distributed approach for Peptide Identification

Outline

• Background

• About C-Ranker

• Problem Statement

• Proposed Solution

• Architecture

• Implementation

• Execution Environment

• Results

• Conclusions & Future Scope

Page 3: Distributed approach for Peptide Identification

Background

Page 4: Distributed approach for Peptide Identification

Protein

• Biomolecules consisting of one or more chains of amino acids

• Proteins differ from one another primarily in their sequence of amino acids

• A protein is characterized by the sequence of amino acids as they occur in the protein

Page 5: Distributed approach for Peptide Identification

Proteins (cont.)

• Proteins perform a vast array of functions within living organisms

• A protein contains at least one long polypeptide

• Proteins are involved in almost every biological process happening in an organism’s body

• Important part of drug development to target specific metabolic pathways

Page 6: Distributed approach for Peptide Identification

Peptide

• Short chain of amino acids

• Peptides are distinguished from proteins on the basis of size

• A protein is first digested into peptides and then each peptide is identified individually to infer the protein identity.

Page 7: Distributed approach for Peptide Identification

Finding Peptide and Protein Relationship?

• Peptide Identification using mass spectrometry

Page 8: Distributed approach for Peptide Identification

Determine the sequence of Peptides

• Peptide mass fingerprinting (PMF) in MS spectra

Page 9: Distributed approach for Peptide Identification

Peptide Identification Using MS/MS Spectra

• Sequence database searching (for the large-scale dataset)

• de novo sequencing (new protein discovery)

• Post sequence database searching (extension of sequence database search)

Page 10: Distributed approach for Peptide Identification

Peptide Identification (cont.)

• Mass Spectrometry (MS) strategy

• Sequence database searching

• Combination of both: dominant method for peptide identification

• Which results in more spectra from MS

Page 11: Distributed approach for Peptide Identification

Sequence Database Searching Algorithms

• SEQUEST

• Mascot

Page 12: Distributed approach for Peptide Identification

Post Database Algorithms

• Machine learning algorithms have been proposed to identify the correct peptide-spectrum matches (PSMs)

Page 13: Distributed approach for Peptide Identification

Post Database Search Algorithms

• PeptideProphet
    • Learns distribution of scores and properties

• Percolator
    • Search scores considered reliable for high values and low values

• CRanker
    • Fuzzy SVM and silhouette index

Page 14: Distributed approach for Peptide Identification

CRanker

Page 15: Distributed approach for Peptide Identification

C-Ranker

• Identifies the correct PSMs in the output of a peptide database search.

• Developed in Matlab and C

Page 16: Distributed approach for Peptide Identification

Why CRanker?

• Based on research by Dr. Zhonghang Xia, it performs best among the compared tools.

• Easy to parallelize, so it can run on a network of computers rather than on a single computer and work at a larger scale

Page 17: Distributed approach for Peptide Identification

Why CRanker? (cont.)

Data    PeptideProphet    CRanker    Overlap
UPS1    582               576        509
PBMC    34035             34273      32243

The PSMs identified by both PeptideProphet and CRanker overlap by 88.4% on UPS1 and 94.8% on PBMC.
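The slide does not say which count the overlap percentages are measured against. As a rough check, the sketch below (Python, chosen only for illustration; the function name `overlap_share` is hypothetical) computes the overlap as a share of each tool's PSM count, using the numbers from the table.

```python
# Counts copied from the comparison table on this slide.
counts = {
    "UPS1": {"PeptideProphet": 582, "CRanker": 576, "overlap": 509},
    "PBMC": {"PeptideProphet": 34035, "CRanker": 34273, "overlap": 32243},
}


def overlap_share(overlap, total):
    """Percentage of one tool's PSMs that the other tool also reported."""
    return 100.0 * overlap / total


for dataset, row in counts.items():
    for tool in ("PeptideProphet", "CRanker"):
        share = overlap_share(row["overlap"], row[tool])
        print(f"{dataset}: {share:.1f}% of {tool} PSMs are shared")
```

Which denominator the slide intended is an open question; printing both shares makes the comparison explicit.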

Page 18: Distributed approach for Peptide Identification

CRanker Execution: Step 1

• C-Ranker Read Stage reads InputFileName.txt

• Loads raw PSM data into main memory

• Generates InputFileName.mat

Page 19: Distributed approach for Peptide Identification

CRanker Execution: Step 2

• C-Ranker Solve Stage reads InputFileName.txt

• Loads PSM records into main memory

• Creates InputFileName_score.mat

Page 20: Distributed approach for Peptide Identification

CRanker Execution: Step 3

• C-Ranker Write Stage reads InputFileName.txt and InputFileName_score.mat

• Loads PSM scores into main memory

• Creates OutputFile.txt

Page 21: Distributed approach for Peptide Identification

Problem Statement

Page 22: Distributed approach for Peptide Identification

Problem Statement

• C-Ranker needs a computer with high computational power

• For a dataset of about 400,000 PSM records, it may cost about 5 to 8 on a normal PC

• Poor Resource Management

• Need to address future big data sets

Page 23: Distributed approach for Peptide Identification

Can’t we change C-Ranker?

• Research is under way to optimize C-Ranker.

• A distributed approach to C-Ranker.

• That would require rewriting the complete code!

Page 24: Distributed approach for Peptide Identification

Brainstorming

• Can we divide the 400,000 PSM records across 4 machines to do the job (and who will do the dividing)?

• Increase computational power?
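The idea of dividing the records across machines can be sketched as follows. This is only an illustration in Python, not the implementation from the talk: the function name `split_evenly` is hypothetical, and the slides do not say whether the records were split contiguously, randomly, or in some other way. The sketch assumes contiguous chunks of near-equal size.

```python
def split_evenly(records, n_workers):
    """Split a list into n_workers contiguous chunks whose sizes differ by at most one."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    base, extra = divmod(len(records), n_workers)
    chunks = []
    start = 0
    for i in range(n_workers):
        # The first `extra` chunks take one additional record each.
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


# Stand-in for 400,000 PSM records, divided across 4 machines.
psm_records = list(range(400_000))
chunks = split_evenly(psm_records, 4)
print([len(chunk) for chunk in chunks])  # [100000, 100000, 100000, 100000]
```

Keeping the chunks contiguous makes it easy to put the per-machine results back in the original order afterwards.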

Page 25: Distributed approach for Peptide Identification

Constraints!

• Restrictions on changing the C-Ranker design and code (I am not experienced enough to do so)

• Should not change the execution flow of C-Ranker.

Page 26: Distributed approach for Peptide Identification

Shortlisted Approach

• Fundamental Distributed Approach

Page 27: Distributed approach for Peptide Identification

Why Distributed Framework?

• It can handle bigger datasets than it would be able to in a centralized setting.

• Requires less memory per computer and each computer can have commodity hardware.

• Cheaper to run multiple commodity computers than a single high-end, high-performance system capable of achieving similar goals.

Page 28: Distributed approach for Peptide Identification

Job Execution in Distributed Approach

Page 29: Distributed approach for Peptide Identification

Proposed Solution

Page 30: Distributed approach for Peptide Identification

Proposed Solution

• A framework to execute C-Ranker on distributed nodes

• Designed so that it may work with other post-database-search algorithms like C-Ranker with minimal changes

• Compare the time taken to generate the distributed C-Ranker output with that of the actual output

• Make sure the C-Ranker algorithm executes correctly on the set of predefined nodes

Page 31: Distributed approach for Peptide Identification

Architecture

Page 32: Distributed approach for Peptide Identification

Implementation

Page 33: Distributed approach for Peptide Identification

Data Flow Details in the Original Single-Threaded C-Ranker

Page 34: Distributed approach for Peptide Identification

Data Flow Details in Distributed C-Ranker (Dividing)

Page 35: Distributed approach for Peptide Identification

Data Flow For a Single Worker Host

Page 36: Distributed approach for Peptide Identification

Data Flow Details in Distributed C-Ranker (Merging)
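The diagram for this slide is not preserved in the transcript. As a rough illustration of what a merge step could look like, the Python sketch below concatenates per-worker outputs in worker order. It assumes, hypothetically, that each worker returns its result as text lines with one header line first; neither that format nor the function name `merge_outputs` comes from the slides.

```python
def merge_outputs(parts):
    """Concatenate per-worker outputs, keeping the header line only once.

    `parts` is a list of line lists in worker order; each non-empty part
    is assumed to start with the same header line.
    """
    merged = []
    for lines in parts:
        if not lines:
            continue  # a worker that produced nothing
        if not merged:
            merged.append(lines[0])  # take the header from the first non-empty part
        merged.extend(lines[1:])
    return merged


# Hypothetical outputs from three workers.
worker_outputs = [
    ["id\tscore", "psm1\t0.91", "psm2\t0.40"],
    [],
    ["id\tscore", "psm3\t0.77"],
]
for line in merge_outputs(worker_outputs):
    print(line)
```

If the final ranking depends on scores across all workers, a sort or re-scoring pass would be needed after the concatenation; the slides do not say whether that is the case.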

Page 37: Distributed approach for Peptide Identification

Execution Environment

Page 38: Distributed approach for Peptide Identification

Execution Environment

• JAVA

• MATLAB MCR environment

• Apache Tomcat web server

Page 39: Distributed approach for Peptide Identification

Input Data used

Page 40: Distributed approach for Peptide Identification

Hardware Used to Observe Results

Servers             Server_1     Server_2         Server_3     Server_4
Memory              8GB          4GB              4GB          4GB
Processor           i5           i5               i5           i5
Operating System    Windows 7    Windows Vista    Windows 7    Windows 7

Page 41: Distributed approach for Peptide Identification

Comparison of C-Ranker on Distributed Approach with C-Ranker on an Apache Hadoop Framework Cluster 1

PBMC data (KB)    C-Ranker on Hadoop Cluster 1 (hrs)    Distributed approach for C-Ranker (hrs)
11221             6.5                                   3.56
12816             9.9                                   8.1
31422             10.2                                  8.25
48486             15.2                                  9.2
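To make the comparison easier to read, the ratio of the two execution times can be computed per dataset. The Python sketch below is illustrative only; the function name `speedup` is hypothetical, and the last Hadoop time, which appears as "15. 2" in the transcript, is read as 15.2.

```python
# (data size in KB, Hadoop Cluster 1 hours, distributed approach hours),
# copied from the table on this slide.
rows = [
    (11221, 6.5, 3.56),
    (12816, 9.9, 8.1),
    (31422, 10.2, 8.25),
    (48486, 15.2, 9.2),  # "15. 2" in the transcript, read as 15.2
]


def speedup(hadoop_hours, distributed_hours):
    """How many times faster the distributed approach ran than Hadoop Cluster 1."""
    return hadoop_hours / distributed_hours


for size_kb, hadoop_hours, distributed_hours in rows:
    ratio = speedup(hadoop_hours, distributed_hours)
    print(f"{size_kb} KB: {ratio:.2f}x")
```

A ratio above 1 means the distributed approach finished sooner than Hadoop Cluster 1 on that dataset.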

Page 42: Distributed approach for Peptide Identification

Results

Page 43: Distributed approach for Peptide Identification

Results for testData.xls (409 KB)

Page 44: Distributed approach for Peptide Identification

Results for Pbmc_orbit_mips.xls (11221 KB)

Page 45: Distributed approach for Peptide Identification

Results for Pbmc_orbit_nomips.xls (12816 KB)

Page 46: Distributed approach for Peptide Identification

Results for Pbmc_velos_mips.xls (31422 KB)

Page 47: Distributed approach for Peptide Identification

Results for Pbmc_velos_nomips.xls (48486 KB)

Page 48: Distributed approach for Peptide Identification

Memory Usage for testData.xls (409 KB)

Page 49: Distributed approach for Peptide Identification

Memory Usage for Pbmc_orbit_mips.xls (11221 KB)

Page 50: Distributed approach for Peptide Identification

Memory Usage for Pbmc_orbit_nomips.xls (12816 KB)

Page 51: Distributed approach for Peptide Identification

Memory Usage for Pbmc_velos_mips.xls (31422 KB)

Page 52: Distributed approach for Peptide Identification

Memory Usage for Pbmc_velos_nomips.xls (48486 KB)

Page 53: Distributed approach for Peptide Identification

Memory Usage

Page 54: Distributed approach for Peptide Identification

Difference in Memory Usage

Page 55: Distributed approach for Peptide Identification

Upgraded Hardware to Compare with Cluster 2 Hadoop

Servers             Server_1     Server_2     Server_3     Server_4
Memory              12GB         8GB          12GB         8GB
Processor           i7           i5           i7           i7
Operating System    Windows 8    Windows 7    Windows 7    Windows 7

Page 56: Distributed approach for Peptide Identification

Comparison of C-Ranker on Distributed Approach with C-Ranker on an Apache Hadoop Framework Cluster 2

PBMC data (KB)    Distributed approach for C-Ranker (hrs, new results)    C-Ranker on Hadoop Cluster 2 (hrs)
11221             1.7                                                     1.3
12816             3.82                                                    1.58
31422             4.1                                                     3.4
48486             5.93                                                    4.5

Page 57: Distributed approach for Peptide Identification

Cost Calculation of Apache Hadoop Cluster 1 and Cluster 2

PBMC data size (KB)    Cost of Hadoop Cluster 1 ($)    Cost of Hadoop Cluster 2 ($)
11221                  3.4581                          0.6916
12816                  5.2668                          0.8410
31422                  5.4264                          1.8088
48486                  8.8064                          2.3941
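The slides do not state how these costs were derived. One way to read them is to divide each cost by the Hadoop execution time reported for the same dataset on the comparison slides, which gives an implied hourly rate. The Python sketch below does that; the function name `implied_rate` is hypothetical and the per-hour interpretation is an assumption, not something the slides confirm.

```python
# (data size in KB, Hadoop hours, listed cost in $), copied from the slides.
CLUSTER_1 = [
    (11221, 6.5, 3.4581),
    (12816, 9.9, 5.2668),
    (31422, 10.2, 5.4264),
    (48486, 15.2, 8.8064),
]
CLUSTER_2 = [
    (11221, 1.3, 0.6916),
    (12816, 1.58, 0.8410),
    (31422, 3.4, 1.8088),
    (48486, 4.5, 2.3941),
]


def implied_rate(cost_usd, hours):
    """Dollars per hour implied by a listed cost and an execution time."""
    return cost_usd / hours


for label, rows in (("Cluster 1", CLUSTER_1), ("Cluster 2", CLUSTER_2)):
    for size_kb, hours, cost in rows:
        print(f"{label}, {size_kb} KB: {implied_rate(cost, hours):.3f} $/hr")
```

Under this reading, seven of the eight entries come out at roughly $0.53 per hour, while the 48486 KB entry for Cluster 1 comes out at about $0.58 per hour.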

Page 58: Distributed approach for Peptide Identification

Conclusion and Future Scope

Page 59: Distributed approach for Peptide Identification

Conclusion

• Reduces the execution time

• No additional hardware cost (no need for high-performance computing machines)

• No need to change the current structure of C-Ranker

Page 60: Distributed approach for Peptide Identification

Conclusion (cont.)

• Better Resource Management. For example: Memory

• No need to change the implementation of CRanker

Page 61: Distributed approach for Peptide Identification

Future Scope

• The same distributed approach can be used with Percolator and PeptideProphet to see how well they perform

• Additionally, one can use an ensemble method to combine the results of the three tools.

Page 62: Distributed approach for Peptide Identification

Questions?

Page 63: Distributed approach for Peptide Identification

Thank you