Data Validation and Annotation: PRIDEViewer and PIKE
Bioinformatics analysis from proteomics data
ProteoRed Bioinformatics Workshop Salamanca
Alberto Medina-Aunon
March 15th, 2010
Main Topics
• Mass spectrometry and protein and peptide validation
  – PRIDEViewer: Description.
  – Examples: Use cases.
• Experiment context: Linking functional information to our proteins.
  – PIKE: Description.
  – Examples: Use cases.
• Starting from:
  – Mass spectrum/spectra
  – Tentative identification/Sequence
  – Search Engine
MS Validation. The easiest Way
Candidate: AFLLAMAARTGFRTR
How to do it
• By hand:
  – Only feasible for a few sequences/spectra.
  – We cannot read every file format (for instance, binary files).
• Semi-automatically:
  – Using PRIDE files as input: PRIDEViewer
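Checking a candidate by hand usually means comparing the peaks of the spectrum with the fragment masses expected for the sequence. As a minimal sketch, assuming singly charged b and y ions and unmodified residues (neither assumption is stated on the slides), the expected values for the candidate above could be listed like this:

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(sequence):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # b1 .. b(n-1): N-terminal prefixes
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1): C-terminal suffixes
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


candidate = "AFLLAMAARTGFRTR"  # candidate sequence from the slide
b_ions, y_ions = fragment_ions(candidate)
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i:<2} {b:10.4f}   y{i:<2} {y:10.4f}")
```

Doing this for more than a handful of spectra quickly becomes tedious, which is the motivation for the semi-automatic route.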
PRIDEViewer Experiment info
PRIDEViewer Sample and Instrument info
PRIDEViewer Spectra and identifications
PRIDEViewer Gel Separation
PRIDEViewer Mascot interface
One Example: Identification using 5 peptides
Example Mascot output
Another example: 350 input spectra
Validation study
• Starting from one public proteomics repository (EBI PRIDE):
  – Retrieve a set of available experiments.
  – Check the level of fulfillment of the experiments.
  – Repeat the protein and peptide identification.
VALIDATE THE EXPERIMENT……..
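The "level of fulfillment" step can be partly scripted as a first-pass completeness check. The sketch below counts spectra and peptide identifications per experiment in a PRIDE XML document. The element names (`Experiment`, `ExperimentAccession`, `spectrum`, `PeptideItem`) follow the PRIDE XML 2.1 layout as I understand it and should be checked against the schema; the embedded sample, its accession `0000`, and the `testable` rule are illustrative assumptions. It does not validate any identification.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a PRIDE XML file (hypothetical content).
SAMPLE = """<ExperimentCollection version="2.1">
  <Experiment>
    <ExperimentAccession>0000</ExperimentAccession>
    <mzData>
      <spectrumList count="2">
        <spectrum id="1"/>
        <spectrum id="2"/>
      </spectrumList>
    </mzData>
    <GelFreeIdentification>
      <Accession>EXAMPLE_1</Accession>
      <PeptideItem><Sequence>EIGPPQQQR</Sequence></PeptideItem>
    </GelFreeIdentification>
  </Experiment>
</ExperimentCollection>"""


def fulfillment(xml_text):
    """Summarise, per experiment, whether spectra and peptides are present."""
    root = ET.fromstring(xml_text)
    report = []
    for exp in root.iter("Experiment"):
        n_spectra = sum(1 for _ in exp.iter("spectrum"))
        n_peptides = sum(1 for _ in exp.iter("PeptideItem"))
        report.append({
            "accession": exp.findtext("ExperimentAccession"),
            "spectra": n_spectra,
            "peptides": n_peptides,
            # Assumed rule: both are needed to repeat the search and compare.
            "testable": n_spectra > 0 and n_peptides > 0,
        })
    return report


for entry in fulfillment(SAMPLE):
    print(entry)
```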
Validation using PRIDE http://www.ebi.ac.uk/pride/
PRIDE: Searching experiments: Biomart
Validation. First Round. Biomart
Validation – First Round: PRIDE Accession 1642
First View: Mascot Results
Validation – First Round: PRIDE Accession 1642
Protein Id    Database    Peptide Count   Identified
IPI00295598   IPI         2               No
Q15843        SwissProt   6               Yes
P62491        SwissProt   1               No
Why? If we explore the data, we’ll find…
Protein Id    First Peptide         PRIDE mass   Calculated mass
IPI00295598   VISEPGEAEVFMTPEDFVR   2184.0375    2152.0267
Q15843        EIGPPQQQR             1052.5697    1052.5483
P62491        DHADSNIVIMLVGNK       1657.8186    1625.8316
Delta mass around 32 Da
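The calculated masses in this table can be reproduced from the sequences. The sketch below assumes they are monoisotopic, singly protonated ([M+H]+) masses of the unmodified peptides. The slide does not say so, but with standard residue masses the results agree with the Calculated mass column to the third decimal.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def mh_plus(sequence):
    """Monoisotopic [M+H]+ mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + PROTON


# (protein id, first peptide, mass reported in PRIDE), taken from the slide.
rows = [
    ("IPI00295598", "VISEPGEAEVFMTPEDFVR", 2184.0375),
    ("Q15843", "EIGPPQQQR", 1052.5697),
    ("P62491", "DHADSNIVIMLVGNK", 1657.8186),
]

for protein, peptide, pride_mass in rows:
    calculated = mh_plus(peptide)
    delta = pride_mass - calculated
    print(f"{protein:<12} calculated={calculated:10.4f}  delta={delta:8.4f} Da")
```

The first and third peptides come out about 32 Da heavier than calculated, while the second differs by only a few hundredths of a dalton.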
Validation – First Round: PRIDE Accession 1642
• Hypothesis:
  – The first and third sequences show a mass variation of around 32 Da.
• Is there a modification at the C or N terminus? If so, the second sequence would show it as well.
• Is one residue, or more than one, modified?
• We’ll extract the amino acids common to both sequences: D, A, S, I, V, M and G.
• Compare them with the described modifications that have a mass variation of 32 Da.
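Extracting the common amino acids is a set intersection over the two shifted sequences. The sketch below does that, and also adds one narrowing step that is not on the slide: dropping residues that occur in the second, unshifted peptide. That extra step only makes sense under the assumption that the modification is always present on the affected residue.

```python
first = "VISEPGEAEVFMTPEDFVR"   # shifted by about 32 Da
second = "EIGPPQQQR"            # not shifted
third = "DHADSNIVIMLVGNK"       # shifted by about 32 Da

# Residues that occur in both shifted peptides.
common = set(first) & set(third)

# Assumed extra step: exclude residues also found in the unshifted peptide.
narrowed = common - set(second)

print("common to first and third:", sorted(common))
print("and absent from second:   ", sorted(narrowed))
```

The remaining residues are the ones to compare against a list of known modifications with a mass shift near 32 Da.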
Validation – First Round: PRIDE Accession 1642
Only this modification could explain a common property between both sequences.
So, we’ll select it in the next round
Validation – First Round: PRIDE Accession 1642
Validation – Second Round: Latest Experiments. Retrieved by hand
• PRIDE accession id: 10470 to 11257 (787 experiments).
  – None is suitable to check.
  – No information regarding the identification is available.
• PRIDE accession id: 10000 to 10074 (74 experiments).
  – One dataset could be checked: 10042 to 10060. (Dataset title: Low abundance proteome of human red blood cells captured by combinatorial peptide libraries)
Validation – Second Round: Latest experiments
PRIDE Accession 10053
Mascot output
PRIDE Accession 10060
Mascot output: No identification
Validation – Third Round: Recent Experiments. Retrieved by hand
• Experiment id: 9900 to 9999
• Two datasets are suitable to check:
  – 9900 to 9942: LC-MALDI experiments (Tannerella forsythia).
  – 9944 to 9949: Rattus norvegicus.
  – 9984: Zebrafish. No spectra.
  – 9985 to 9992: Homo sapiens. (No identifications).
  – 44 not available.
Validation – Third Round: Experiment 9900
Validation – Third Round: Experiment 9900
Validation – Third Round: Experiment 9900. Summary
Protein Id   Peptide Count   Identified   1st Peptide Mass   Theoretical Mass
TF2239       1               No           1228.5463          1228.6433
TF26612      13              Yes          --                 --
TF1259       1               No           1271.6478          1271.6783
TF2116       4               No           1139.5835          1139.6208
TF1741       16              No           1044.5144          1044.5473
TF0447       2               No           1092.4619          1092.5432
TF2663       7               Yes          --                 --
TF2592       2               No           1022.5306          1022.5782
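To judge whether the non-identified entries could simply fall outside a search tolerance, the difference between the two mass columns can be expressed in Da and in ppm. This is a sketch only: it assumes "1st Peptide Mass" is the value stored in PRIDE and "Theoretical Mass" the value computed from the sequence, and the slides do not state which tolerance the searches used.

```python
# (protein id, 1st peptide mass, theoretical mass) for the non-identified rows.
rows = [
    ("TF2239", 1228.5463, 1228.6433),
    ("TF1259", 1271.6478, 1271.6783),
    ("TF2116", 1139.5835, 1139.6208),
    ("TF1741", 1044.5144, 1044.5473),
    ("TF0447", 1092.4619, 1092.5432),
    ("TF2592", 1022.5306, 1022.5782),
]


def deviation(observed, theoretical):
    """Return the mass error as (Da, ppm), relative to the theoretical mass."""
    delta = observed - theoretical
    return delta, delta / theoretical * 1e6


for protein, observed, theoretical in rows:
    delta, ppm = deviation(observed, theoretical)
    print(f"{protein:<8} {delta:8.4f} Da  {ppm:8.1f} ppm")
```

All six stored masses are lower than the theoretical ones, by a few hundredths of a dalton.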
Study summary
• Around 1000 PRIDE experiments were downloaded from PRIDE central repository.
• Around 100 of them were suitable to test.
• Fewer than 50% of these were successfully validated.
In summary
• There is a lot of data in the repositories (PRIDE).
• There is a lot of missing information.
• It is not possible to check the data automatically.
• PRIDEViewer can save us a lot of time.
Protein Set
• In other cases, a mistake in an identification is less significant, provided we can still reach the goal of the experiment.
• For instance, proteins involved in a particular function or biological process.
DB id Protein Name
gi|12857455 Heat shock protein
gi|14017768 FKB9_HUMAN
gi|12836587 Tubulin alpha homo sapiens
gi|15010550 Ubiquitin specific protease
gi|15489190 vinculin isoform VCL Homo sapiens
gi|9963904 selenium binding protein 1 Homo sapiens
… …
PIKE http://proteo.cnb.csic.es/
PIKE: Protein Information and Knowledge extractor
PIKE: Information asked by user
PIKE output. CSV
PIKE output
First example: medium-complexity protein list (containing 57 proteins)
J Proteome Res. 2005 Nov-Dec;4(6):2435-41.
#   Entry name                                                            UniProt ID  Manual searching  PIKE output (keywords only)
6   Integrin alpha-5 precursor                                            P08648      1 TM              Transmembrane
7   Sodium/potassium-transporting ATPase alpha-1 chain precursor          P05023      10 TM             Transmembrane
8   Short transient receptor potential channel 4                          Q9UBN4      8 TM              Transmembrane
10  Band 3 anion transport protein                                        P02730      11 TM             Transmembrane
11  Transferrin receptor protein 1                                        P02786      1 TM              Transmembrane
17  calnexin precursor                                                    P27824      1 TM              Transmembrane
19  5'-nucleotidase precursor                                             P21589      1 TM; GPI         GPI-anchor
21  Alkaline phosphatase, placental type precursor                        P05187      GPI               Transmembrane; GPI-anchor
22  4F2 cell-surface antigen heavy chain                                  P08195      1 TM              Transmembrane
24  Solute carrier family 2, facilitated glucose transporter, member 1    P11166      12 TM             Transmembrane
29  chloride intracellular channel protein 5                              Q9NZA1                        Transmembrane
30  3beta-hydroxy-Delta5-steroid dehydrogenase multifunctional protein I  P14060      1 TM              Transmembrane
41  myristoylated alanine-rich C-kinase substrate                         P29966      Myristoylation    Myristate
42  Basigin precursor                                                     P35613      1 TM              Transmembrane
47  Brain acid soluble protein 1                                          P80723      Myristoylation    Transmembrane; Myristate
51  ADP-ribosylation factor 1                                             P84077                        Transmembrane; Myristate
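One way to read this comparison is to count the entries where the manual annotation and the PIKE keywords agree. The sketch below does so for the rows shown. The mapping from the manual shorthand (TM, GPI, Myristoylation) to keywords is an assumption made for illustration, and the number of transmembrane segments is ignored.

```python
# Assumed mapping from manual shorthand to the keywords reported by PIKE.
MANUAL_TO_KEYWORD = {
    "TM": "Transmembrane",
    "GPI": "GPI-anchor",
    "Myristoylation": "Myristate",
}
TM, GPI, MYR = "TM", "GPI", "Myristoylation"
T, G, M = "Transmembrane", "GPI-anchor", "Myristate"

# (entry number, UniProt ID, manual annotation, PIKE keywords) from the table.
rows = [
    (6, "P08648", {TM}, {T}), (7, "P05023", {TM}, {T}),
    (8, "Q9UBN4", {TM}, {T}), (10, "P02730", {TM}, {T}),
    (11, "P02786", {TM}, {T}), (17, "P27824", {TM}, {T}),
    (19, "P21589", {TM, GPI}, {G}), (21, "P05187", {GPI}, {T, G}),
    (22, "P08195", {TM}, {T}), (24, "P11166", {TM}, {T}),
    (29, "Q9NZA1", set(), {T}), (30, "P14060", {TM}, {T}),
    (41, "P29966", {MYR}, {M}), (42, "P35613", {TM}, {T}),
    (47, "P80723", {MYR}, {T, M}), (51, "P84077", set(), {T, M}),
]

agree, differ = [], []
for number, uniprot_id, manual, pike in rows:
    expected = {MANUAL_TO_KEYWORD[m] for m in manual}
    (agree if expected == pike else differ).append(number)

print(f"agreement: {len(agree)} of {len(rows)} entries")
print("entries that differ:", differ)
```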
Second example: Human Plasma Proteins from PRIDE (HPPP). PRIDE Accession 65
25 MOST FREQUENT PROTEINS
Serum albumin [Precursor] - Serum albumin - ALB  356
Complement C3 [Precursor]  273
IGHA1 protein  225
Calcium/calmodulin-dependent protein kinase kinase 2  100
Inter-alpha-trypsin inhibitor heavy chain H1-H4 [Precursor]  99
Putative uncharacterized protein  97
IGL@ protein  96
ARF GTPase-activating protein GIT2  90
Complement factor B [Precursor]  90
PRO2275  90
IGHM protein  78
IGKC protein  64
Alpha-1B-glycoprotein [Precursor]  62
cDNA FLJ14473 fis, clone MAMMA1001080.  58
CDNA FLJ25298 fis, clone STM07683.  58
Fibronectin [Precursor]  58
IGHD protein  56
Trypsin  55
Apolipoprotein-L1 [Precursor]  54
HP protein  53
Alpha-2-macroglobulin [Precursor]  52
SNC66 protein  52
Ig kappa chain V-III region HAH [Precursor]  50
PROTEIN COUNT 2226
REDUNDANCY RATIO (Protein count/non redundant entries) 89.04%
Third example: The Human Plasma Proteome: A non redundant list
Mol Cell Proteomics. 2004 Apr;3(4):311-26. Epub 2004 Jan 12.
>> We have merged four different views of the human plasma proteome, based on different methodologies, into a single nonredundant list of 1175 distinct gene products ….
Conclusion
• PIKE is a suitable and useful bioinformatics tool for small- or large-scale proteomics projects.
• PIKE’s main characteristic is its ability to systematically access and automatically retrieve comprehensive biological information contained in common databases.
• The resulting information is output in a wide range of standard formats that can be directly viewed, exported, or downloaded for additional analysis.
Questions?