Automation of Web Applications and Iterative Searching for Post-Translational Modifications
Simon Chiang & Kirk C. Hansen
Biomolecular Structure Program, University of Colorado Denver
Introduction
The past several years have seen a proliferation of analytical tools for proteomics
data. Several major search engines exist and proteomics toolkits are now available
in many programming languages. The next challenge for the proteomics
community will be finding ways to use these tools more effectively.
Studies have shown, for instance, that searching MS/MS proteomics data using
multiple search engines increases the number and quality of peptide/protein
identifications[1]. Variations of this approach, including iterative searching[2], also
hold promise for improving search results but, as a practical matter, these
techniques require automation. A significant barrier to automation is simply
interfacing with the various analytical tools, the vast majority of which are online.
Web applications provide a fairly standard interface to humans, the web form, but
typically they do not provide a programmatic interface. Moreover, analytical
applications can be quite complex; most require a large number of configurations
where the allowable values are hard to predict. As a result, web applications can be
difficult to automate. Even with a program that emulates a web form, the work
required to generate configurations is prohibitive.
Tap-Mechanize facilitates the automation of web applications by providing a simple
and robust way to capture configurations directly from web forms. Once captured,
the configurations may be resubmitted programmatically, or used as a template to
run the application in a batched fashion. Using Tap-Mechanize, most web
applications are easy to incorporate into automated workflows.
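The idea of reusing a captured configuration as a template for batched runs can be sketched in a few lines of Ruby. This is a minimal illustration, not the Tap-Mechanize implementation: the configuration layout (`uri`, `request_method`, `params`), the parameter names, the example URL, and the helper `batch_configs` are all assumptions made for the example.

```ruby
require 'yaml'

# Hypothetical captured configuration. The keys and values below are
# illustrative assumptions, not the actual Tap-Mechanize file format.
captured = {
  'uri' => 'http://example.org/search',
  'request_method' => 'POST',
  'params' => {
    'database' => 'SwissProt',
    'enzyme'   => 'Trypsin',
    'file'     => 'sample.mgf'
  }
}

# Build one configuration per input file from the captured template.
# A deep copy keeps the template itself unchanged.
def batch_configs(template, files)
  files.map do |file|
    config = Marshal.load(Marshal.dump(template))
    config['params']['file'] = file
    config
  end
end

configs = batch_configs(captured, %w[run1.mgf run2.mgf run3.mgf])
puts configs.to_yaml
```

Each generated configuration differs from the template only in the input file, so every setting chosen in the web form is carried into each run of the batch.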
We have used Tap-Mechanize to automate several workflows related to data
preparation and processing, and to experiment with iterative searching. Iterative
searching can take many forms. The most basic type of iterative searching simply
partitions spectra by the strength of their identifications, then re-searches the weak
or unidentified spectra using additional techniques. This type of searching is
thought to be useful when analyzing proteins with numerous post-translational
modifications (PTMs).
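The partitioning step can be sketched as follows. This is a minimal sketch under stated assumptions: the `Hit` structure and the sample values are hypothetical, and the 0.05 cutoff is taken from the poster's "Partition Threshold: exp > 0.05" label, read here as an expectation value above which a spectrum counts as weakly identified.

```ruby
# Hypothetical record: a spectrum and the expectation value of its best
# peptide match (nil when the search returned no match at all).
Hit = Struct.new(:spectrum_id, :expect)

# Assumed cutoff, following the "exp > 0.05" partition threshold label.
THRESHOLD = 0.05

# Split hits into strongly identified spectra and weak or unidentified
# spectra; only the weak group would be passed to the secondary search.
def partition_spectra(hits, threshold = THRESHOLD)
  weak, strong = hits.partition { |h| h.expect.nil? || h.expect > threshold }
  { strong: strong, weak: weak }
end

hits = [
  Hit.new('s1', 0.001),
  Hit.new('s2', 0.2),
  Hit.new('s3', nil),
  Hit.new('s4', 0.05)
]

parts = partition_spectra(hits)
puts "strong: #{parts[:strong].map(&:spectrum_id).join(', ')}"
puts "weak:   #{parts[:weak].map(&:spectrum_id).join(', ')}"
```

Treating unmatched spectra as weak is a design choice of this sketch; it sends them to the secondary search along with the low-confidence matches.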
One such protein is collagen. Collagen consists primarily of a GXY repeat where X
and Y can be any amino acid. Normally X is proline and Y is hydroxyproline,
meaning collagen is modified at approximately every third residue. As a
consequence, hydroxylation of proline must be specified as a PTM to identify the
majority of collagen peptides. These peptides are physiologically very relevant;
hydroxyproline allows collagen molecules to wrap into tight triple helices and
ultimately stabilizes collagen fibrils. In the absence of hydroxylation, collagen
degrades easily and the disease scurvy results.
Using rat tail collagen as a sample, we explored the consequences of using iterative
searching to identify PTMs, in particular the effects of partitioning spectra on the
false discovery rate (FDR).
Abstract
Data-intensive fields like proteomics require researchers to interact with a wide
variety of software that is increasingly web-based. Web applications pose special
challenges to programmers seeking to automate their execution. Although web
applications provide a relatively standard interface to users, the programmatic
interfaces span numerous protocols and frequently do not exist at all.
We present Tap-Mechanize, a system to easily capture the output of web forms for
resubmission using a standard programmatic interface. By capturing web forms into
a standard format, Tap-Mechanize enables many web applications to be used in
automated workflows. Such workflows drastically reduce the time required to
analyze large datasets, facilitate reproducibility, and enable more complicated
techniques to be used during analysis.
We have used Tap-Mechanize to implement iterative searching of MS/MS
proteomics data. Iterative searching uses a quick, general search to filter spectra of
unmodified peptides, and then performs more time-consuming searches on the
remaining spectra. Using iterative searching we are able to identify peptides with
post-translational modifications (PTMs) that normally are missed. These peptides
are of particular interest because PTMs frequently regulate the function of proteins,
and are implicated in many disease states.
References
1. Searle, B.C., Turner, M. & Nesvizhskii, A.I. Improving sensitivity by probabilistically combining results from multiple MS/MS search methodologies. J Proteome Res 7, 245-53 (2008).
2. Nesvizhskii, A.I. et al. Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data: toward more efficient identification of post-translational modifications, sequence polymorphisms, and novel peptides. Mol Cell Proteomics 5, 652-70 (2006).
3. Tap Website
4. Mechanize
5. Ubiquity
Implementation
Tap-Mechanize is written in the programming language Ruby and utilizes two
distinct libraries, Tap and Mechanize. Tap (Task Application[3]) is a software
framework that we developed to standardize our interaction with diverse software
tools, and to facilitate the construction of automated workflows. Mechanize[4] is a
library that emulates human interactions with web forms and, although we use the
Ruby version, originates from the Perl open-source community.
Tap-Mechanize captures configurations by redirecting the HTTP output of a web
form to a local server that parses the request into a configuration file. The
redirection occurs via JavaScript that rewrites the action of the form upon
submission. Multiple page requests, requests across HTTPS, and page requests using
links may all be captured using this method.
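The parsing half of this capture step can be illustrated with Ruby's standard library. The sketch below is not the Tap-Mechanize code: it assumes a URL-encoded form body (real forms that upload files would use multipart encoding), and the helper `capture_config`, the output keys, the parameter names, and the example URL are hypothetical.

```ruby
require 'uri'
require 'yaml'

# Turn a redirected form submission into a configuration hash.
# action: the form's original target URL, before the redirect
# method: the HTTP method of the submission
# body:   the URL-encoded request body received by the local server
def capture_config(action, method, body)
  params = {}
  URI.decode_www_form(body).each do |key, value|
    if params.key?(key)
      # Repeated keys (e.g. multi-select fields) are collected in an array.
      params[key] = Array(params[key]) << value
    else
      params[key] = value
    end
  end
  { 'uri' => action, 'request_method' => method, 'params' => params }
end

body = 'database=SwissProt&enzyme=Trypsin' \
       '&mods=Oxidation+%28M%29&mods=Hydroxylation+%28P%29'
config = capture_config('http://example.org/search', 'POST', body)
puts config.to_yaml
```

Writing the result out as YAML gives a human-readable file that can be edited by hand before it is resubmitted.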
The redirection script is injected into the DOM using the Firefox plugin Ubiquity[5].
Redirection from other browsers is currently not supported.
Discussion
Our experiments using Tap-Mechanize to iteratively search a collagen sample for
hydroxylation of proline illustrate that the partitioning of search results can inflate
false discovery rates (FDRs). The effect is purely mathematical in nature. During the
non-PTM search, the modified peptides are unidentified and therefore absent from
the denominator of the FDR equation; as a result the decoy hits have a
disproportionately high effect and the FDR increases. During the subsequent PTM
search, the unmodified peptides are now absent from the denominator and again,
the FDR increases.
In this example, the lowest FDRs were observed when searching for the modified
and unmodified peptides together, without iterative searching. However, the total
number of identifications across the non-PTM and PTM searches was greater
than the total without iterative searching.
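The arithmetic behind these observations can be checked against the counts reported in the poster's results table, using its definition FDR = Decoy Hits / Peptide Hits. The sketch below only recomputes those reported figures; the helper name `fdr_percent` and the hash layout are choices made for this example.

```ruby
# FDR as defined on the poster: decoy hits divided by peptide hits,
# expressed as a percentage and rounded to two decimals.
def fdr_percent(decoy_hits, peptide_hits)
  (100.0 * decoy_hits / peptide_hits).round(2)
end

# Counts copied from the poster's results table.
searches = {
  'Primary (non-PTM)'    => { decoy: 1, peptides: 49 },
  'Secondary (PTM)'      => { decoy: 2, peptides: 326 },
  'Non-Iterative Search' => { decoy: 2, peptides: 373 }
}

rates = searches.transform_values { |s| fdr_percent(s[:decoy], s[:peptides]) }
rates.each { |name, fdr| puts format('%-22s %.2f%%', name, fdr) }

# Identifications summed over both iterative searches versus the single
# non-iterative search.
iterative_total = searches['Primary (non-PTM)'][:peptides] +
                  searches['Secondary (PTM)'][:peptides]
puts "Iterative total: #{iterative_total}, " \
     "non-iterative: #{searches['Non-Iterative Search'][:peptides]}"
```

With a single decoy hit, the primary search has the highest FDR because only 49 peptide hits sit in its denominator, while the two iterative searches together yield 375 identifications against 373 for the non-iterative search.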
Collagen has an unusually high rate of modification and the observed effect should
be less severe for most proteins. Moreover, this experiment does not prove or
disprove the utility of iterative searching. Mostly it illustrates that the partitioning
step used to select spectra for secondary searching must be executed carefully, and
that there is great utility in exploring search results under many conditions.
Without automation it is difficult to pursue studies such as this. Tap-Mechanize
helps researchers to automate web applications by capturing configurations from
web forms and resubmitting them within workflows. This technique preserves the
functionality built into the web interface. Moreover, it allows web applications to
be utilized as-is, without requiring developers to provide a separate programmatic
interface.
At the most basic level, automation allows researchers to be more productive.
More significantly, automation gives researchers an opportunity to examine how
their tools work. Analytical software is complex; each configuration is meaningful,
even though the exact consequences of a configuration are often unclear. The
same can be said of the many numeric results produced during analysis. It is, as
always, through trial and error that we enrich our appreciation of what our results
mean.
[Figure: Tap-Mechanize + Redirect]
Automated Analytical Workflow
0. Input data
1,2. Search using Mascot, export results
3,4. Search using GPM, export results
5. Generate intersection of results
6. Map results accession numbers using PIR
7. Generate graphic using GoGetter
[Figure: automated analytical workflow diagram, steps 0-7]
[Figure: GoGetter output, "Gene Ontology, GO Slims: Biological Process - Weighted (Dataset 1 name)"]
Biological process (go:0008150) (16.35%)
Cellular process (go:0009987) (16.18%)
Macromolecule metabolic process (go:0043170) (15.98%)
Metabolic process (go:0008152) (15.70%)
Nucleobase, nucleoside, nucleotide and nucleic aci..
Cell communication (go:0007154) (8.73%)
Regulation of biological process (go:0050789) (6.46%)
Transport (go:0006810) (4.14%)
Response to stimulus (go:0050896) (2.48%)
Multicellular organismal development (go:0007275) (2
Biosynthetic process (go:0009058) (0.67%)
Cell differentiation (go:0030154) (0.56%)
Cell death (go:0008219) (0.48%)
Electron transport (go:0006118) (0.48%)
Secretion (go:0046903) (0.33%)
Membrane fusion (go:0006944) (0.33%)
Iterative Search Workflow
0. Input data
1,2. Search without PTMs, export results
3. Partition spectra by identification
4,5. Search weak/unidentified spectra for PTMs
6. Collate results
Partition Threshold: exp > 0.05
[Figure: iterative search workflow diagram, steps 0-6, with branches labeled strong and weak]
[Figure: sequence coverage of CO1A2_RAT: 2% after primary search without PTMs; 52% after secondary search for PTMs]
Web applications are shown in green.
                      N Spectra  Peptide Hits  Decoy Hits  FDR (%)  Decoy Hits / Peptide Hits
Primary (non-PTM)     1293       49            1           2.04     1/49
Secondary (PTM)       1293       326           2           0.61     2/326
Non-Iterative Search  1244       373           2           0.54     2/373

FDR = Decoy Hits / Peptide Hits