Valid data from large-scale proteomics studies




NATURE METHODS | VOL.2 NO.9 | SEPTEMBER 2005 | 647

NEWS AND VIEWS

Valid data from large-scale proteomics studies
Daniel Chamrad & Helmut E Meyer

How sure can we be that we have identified the right proteins in a large-scale proteomics study with our mass spectrometric instrumentation? Can we expect valid data from the search algorithm(s) employed? Can we believe what our computer is telling us? These are the right questions; what are the answers?

Mass spectrometry (MS) has become the method of choice for identification of proteins in proteome studies. Exponentially growing protein sequence databases, as well as improvements in the accuracy and speed of liquid chromatography (LC)-coupled MS analysis, have allowed protein identification by peptide fragmentation analysis (MS/MS) to become a high-throughput technology. Coupling capillary multidimensional LC with MS analysis allows even complex protein mixtures (for example, whole-cell lysates) to be analyzed directly in so-called shotgun approaches. In this method, a specific digestion of an intact protein mixture is followed by LC-MS/MS analysis without the need for a prior fractionation step.

One must, however, note that peptide masses determined by MS are generally not unique, and peptide identification (and in turn protein identification) is therefore inherently probability-based. The task of assigning peptides to MS/MS spectra is done with protein sequence database search algorithms. Particularly in the case of shotgun approaches, it is a challenge to combine all the MS/MS spectra into a single result summarizing protein candidates rather than peptides. Because different laboratories use various platforms to obtain and interpret MS/MS results, evaluating and comparing data between groups is exceedingly difficult.
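The peptide-to-protein summarization step mentioned above can be sketched as a simple grouping, assuming each peptide-spectrum match (PSM) has already been scored and accepted. This is a generic illustration, not the procedure of any specific search engine; the peptide sequences and protein accessions below are invented for the example:

```python
from collections import defaultdict

def group_peptides_by_protein(psms, min_peptides=2):
    """Collapse accepted peptide-spectrum matches into protein candidates.

    psms: iterable of (peptide_sequence, protein_accession) pairs,
    one per confidently matched MS/MS spectrum.
    Proteins supported by fewer than `min_peptides` distinct peptides
    are dropped, a common (if debated) filter against spurious hits.
    """
    peptides_per_protein = defaultdict(set)
    for peptide, protein in psms:
        peptides_per_protein[protein].add(peptide)
    return {
        protein: sorted(peps)
        for protein, peps in peptides_per_protein.items()
        if len(peps) >= min_peptides
    }

# Hypothetical PSMs from a shotgun run:
psms = [
    ("LVNELTEFAK", "ALBU_HUMAN"),
    ("YLYEIAR", "ALBU_HUMAN"),
    ("AVMDDFAAFVEK", "TRFE_HUMAN"),  # single-peptide hit, filtered out
]
print(group_peptides_by_protein(psms))
```

Even this toy version shows why platforms diverge: the choice of `min_peptides`, like many other filtering parameters, changes the reported protein list.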

In this light, the manuscript by Gygi and coworkers (ref. 1) is of great value, as it contains an up-to-date comparison and validation of current LC-MS/MS protein identification platforms, featuring two of the most popular mass spectrometers used for shotgun sequencing, large MS/MS datasets, two popular database search algorithms (Mascot and SEQUEST) and, finally, replicate analysis. This is the most comprehensive evaluation of these methods published to date, and it helps to answer some long-standing questions.

Insights obtained from this work are important because recent developments in proteomics have revealed high-quality interpretation of the acquired MS/MS data as a bottleneck. Because manual methods are inadequate for high-throughput result validation, evaluation of the results is left to MS data interpretation algorithms. Although these automated methods for MS-based complex protein mixture analysis have had a tremendous impact on proteomics, there is currently no generally accepted standard for publication and validation of MS-based protein identification results from complex mixtures. Part of the reason for this is the range of choices researchers have for proteomics experiments.

Proteome scientists have various options in the experimental design of proteome studies, such as the choice of protein or peptide separation based on two-dimensional gel electrophoresis or liquid chromatography, followed by analysis with different kinds of MS instruments and MS/MS data interpretation algorithms. This, together with the aforementioned absence of a standard for protein identification, leads to the current uncomfortable situation in which it is very hard to validate, compare or even replicate experimental results within the proteomics community.

Daniel Chamrad is at Protagen AG, Dortmund, Germany, and Helmut E. Meyer is at Ruhr University of Bochum, Bochum, Germany.
e-mail: [email protected] or [email protected]

[Figure 1 data] Total: 1,757 nonhomologous proteins. Identified 1 time: 1,109 (63%); 2 times: 251 (14%); 3 times: 171 (10%); 4 times: 110 (6%); 5 times: 64 (4%); 6 times: 52 (3%).

Figure 1 | Illustration of the difficulty in identifying the same proteins using different MS platforms. The graph summarizes the results of the 2005 contest of the 12th Martinsried meeting to identify proteins in a complex mixture. Six companies analyzed the protein extract from 10,000 human cells by LC-MS/MS after tryptic digestion. Only 52 proteins (3% of the data) were identified by all six participants (6 times), as represented by the smallest wedge. Successive wedges represent the number of proteins identified by five, four, three, two and only a single participant. About two-thirds of the proteins (1,109) were identified by only one participant. Application of the results in the manuscript by Gygi and colleagues may help researchers to overcome this problem.
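The per-wedge tallies in Figure 1 come from counting, for each protein, how many participants reported it. A minimal sketch of that bookkeeping (the participant names and protein identifiers below are invented; the real contest had six participants and 1,757 proteins):

```python
from collections import Counter

def overlap_histogram(reports):
    """reports: dict mapping participant -> set of identified proteins.

    Returns {k: number of proteins identified by exactly k participants},
    i.e. the wedge sizes of a Figure 1-style chart.
    """
    per_protein = Counter()  # protein -> number of participants reporting it
    for proteins in reports.values():
        per_protein.update(proteins)
    return dict(Counter(per_protein.values()))

# Toy example with three participants:
reports = {
    "vendor_a": {"P1", "P2", "P3"},
    "vendor_b": {"P2", "P3"},
    "vendor_c": {"P3", "P4"},
}
print(overlap_histogram(reports))
```

Here only P3 is found by all three participants, mirroring (in miniature) the contest result that just 52 of 1,757 proteins were found by all six.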

Katie Ris © 2005 Nature Publishing Group http://www.nature.com/naturemethods


One example is the outcome of this year’s Martinsried contest presented at the 12th “Micromethods in Protein Chemistry” meeting (http://www.arbeitstagung.de). Five mass spectrometry vendors and one proteome service company tried to identify as many proteins as possible from a cell extract of 10,000 human cells. Altogether, 1,757 nonhomologous proteins could be found, but only 52 proteins were identified by all participants. Almost two-thirds of the identified proteins (1,109) were identified by only a single company.

Thus, basic questions such as ‘What is the actual false positive rate of a given protein identification procedure?’, ‘How often must an LC-MS/MS experiment be repeated to obtain valid results?’ or ‘Are one or two identified peptides required for an unambiguous protein identification?’ are now under discussion, and these are exactly the kinds of questions addressed by Gygi and colleagues in this work. They demonstrate that, using a composite target-decoy database strategy, their selected scoring criteria yielded 1% estimated false positive identifications at maximum sensitivity for all acquired datasets, permitting reasonable comparisons between them. These comparisons indicate that Mascot and SEQUEST yield similar results for both MS platforms, and that the observed low reproducibility between replicate data acquisitions made on one or both instrument platforms can be exploited to increase sensitivity and confidence in large-scale protein identifications.
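The composite target-decoy idea can be illustrated in a few lines: spectra are searched against a database containing both real (target) and reversed (decoy) sequences, and because a random match is roughly equally likely to hit either half, the number of decoy hits surviving a score threshold estimates the number of false target hits at that threshold. This sketch shows the general principle only, not the exact scoring criteria of Elias et al.; the scores below are invented:

```python
def estimated_fdr(target_hits, decoy_hits):
    """Estimated fraction of false positives among accepted target hits.

    Decoy hits above a threshold approximate the number of incorrect
    target hits above the same threshold in a composite search.
    """
    if target_hits == 0:
        return 0.0
    return decoy_hits / target_hits

def choose_threshold(scored_hits, max_fdr=0.01):
    """scored_hits: list of (score, is_decoy) tuples, one per accepted match.

    Scan candidate thresholds from lowest to highest score and return the
    lowest (most sensitive) threshold whose estimated FDR is within max_fdr.
    """
    for threshold in sorted({score for score, _ in scored_hits}):
        targets = sum(1 for s, d in scored_hits if s >= threshold and not d)
        decoys = sum(1 for s, d in scored_hits if s >= threshold and d)
        if targets and estimated_fdr(targets, decoys) <= max_fdr:
            return threshold  # maximum sensitivity at acceptable error
    return None

# Invented search scores; True marks a decoy-database match:
hits = [(90, False), (85, False), (80, False), (70, True), (65, False)]
print(choose_threshold(hits))  # prints 80: the decoy hit at 70 is excluded
```

With toy data, a 1% FDR effectively means admitting no decoy hits; on the large datasets of the study, the same calculation yields a meaningful, tunable error rate.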

At present there are no basic rules on how to perform a proteomic study, and manuscripts can frequently be found that publish results from single LC-MS/MS experiments without any repetition, which can become problematic for further independent validation steps. Thus, search strategies and data evaluation methods in LC-MS/MS-based proteome studies must be improved, and the manuscript by Gygi and colleagues gives some very useful directions on how to do so.

1. Elias, J.E., Haas, W., Faherty, B.K. & Gygi, S.P. Nat. Methods 2, 667–675 (2005).

Szilard’s dream
Nathalie Q Balaban

With the advent of microfluidics technology and the development of a user-friendly device, studying high-density colonies of microorganisms in controlled chemostatic conditions now becomes a reality.

Many microorganisms can grow at impressive rates. Escherichia coli, the most studied bacterium, divides every 20 minutes and, if supplied with unlimited nutrients, would grow to a mass greater than that of the Earth in 48 hours. In conventional growth conditions, when bacteria are grown in a flask, the nutrients are rapidly exhausted, and the bacteria adapt by arresting growth. It is essential, however, to keep the bacteria in steady-state growth conditions to perform reproducible and quantitative experiments such as measuring mutation rates or developing new drugs. The miniaturized chambers described in the work of Groisman et al. (ref. 1) in this issue represent the ultimate way to do controlled experiments, even at high densities of cells. Groisman’s chemostat is a packaged monolithic chip that does everything for you: it traps the cells under the microscope, feeds them and keeps them at the required temperature.

Already more than 50 years ago, physicist Leo Szilard invented the chemostat (ref. 2) on his quest to deal with some of the deepest biological questions: mutations, evolution and aging. In a chemostat the growth of microorganisms is kept at steady state by controlling the input rate of a limiting nutrient and the dilution rate of the culture. In a paper using this invention, Szilard indicates that “by choosing suitable values,… we may vary over a wide range, independently of each other, the bacterial concentration and the [nutrient] concentration” (ref. 3). The invention of the chemostat, together with the mathematical formulation of its dynamics by Monod (ref. 4), allowed extraction of quantitative parameters on the growth and physiology of

Nathalie Q. Balaban is at The Racah Institute of Physics, The Hebrew University, Jerusalem 91904, Israel.
e-mail: [email protected]
