data validation and annotation: prideviewer and pike bioinformatics analysis from proteomics data

Data Validation and Annotation: PRIDEViewer and PIKE

Bioinformatics analysis from proteomics data

ProteoRed Bioinformatics Workshop Salamanca

Alberto Medina-Aunon

March 15th, 2010

Main Topics

• Mass spectrometry and protein and peptide validation
– PRIDEViewer: Description.
– Examples: Use cases.

• Experiment context: Linking functional information to our proteins.
– PIKE: Description.
– Examples: Use cases.

• Starting from:
– Mass spectrum/spectra
– Tentative identification/Sequence
– Search Engine

MS Validation. The easiest Way

Candidate: AFLLAMAARTGFRTR

How to do it

• By hand:
– Only feasible for a few sequences/spectra.
– We cannot read every file format (for instance, binary files).

• Semi-automatically:
– Using PRIDE files as input: PRIDEViewer

PRIDEViewer Experiment info

PRIDEViewer Sample and Instrument info

PRIDEViewer Spectra and identifications

PRIDEViewer Gel Separation

PRIDEViewer Mascot interface

One Example: Identification using 5 peptides

Example Mascot output

Another example: 350 input spectra

Validation study

• Starting from one public proteomics repository (EBI PRIDE):
– Retrieve a set of available experiments.
– Check the level of completeness of the experiments.
– Repeat the protein and peptide identification.

VALIDATE THE EXPERIMENT…

Validation using PRIDE

http://www.ebi.ac.uk/pride/

PRIDE: Searching experiments: Biomart

Validation – First Round: Biomart

Validation – First Round: PRIDE Accession 1642

First View: Mascot Results

Validation – First Round: PRIDE Accession 1642

Protein Id | Database | Peptide Count | Identified
IPI00295598 | IPI | 2 | No
Q15843 | SwissProt | 6 | Yes
P62491 | SwissProt | 1 | No

Why? If we explore the data, we’ll find…

Protein Id | First Peptide | PRIDE mass | Calculated mass
IPI00295598 | VISEPGEAEVFMTPEDFVR | 2184.0375 | 2152.0267
Q15843 | EIGPPQQQR | 1052.5697 | 1052.5483
P62491 | DHADSNIVIMLVGNK | 1657.8186 | 1625.8316

Delta mass of around 32 Da for the first and third peptides
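The comparison in this table can be reproduced with a short script. The following is a minimal sketch, assuming the calculated mass is the monoisotopic [M+H]+ of the unmodified peptide (an assumption, although it reproduces the three calculated values in the table). The function name `mh_plus` and the variable names are illustrative and are not part of PRIDEViewer.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added for the peptide termini
PROTON = 1.00728   # charge carrier for the singly protonated ion


def mh_plus(sequence):
    """Monoisotopic [M+H]+ of an unmodified peptide (assumed convention)."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + PROTON


# (protein id, first peptide, mass stored in PRIDE), taken from the table
rows = [
    ("IPI00295598", "VISEPGEAEVFMTPEDFVR", 2184.0375),
    ("Q15843", "EIGPPQQQR", 1052.5697),
    ("P62491", "DHADSNIVIMLVGNK", 1657.8186),
]

for protein_id, peptide, pride_mass in rows:
    calculated = mh_plus(peptide)
    delta = pride_mass - calculated
    print(f"{protein_id:12s} {peptide:20s} {calculated:10.4f} {delta:+9.4f}")
```

Under these assumptions the first and third peptides differ from the stored mass by about 32 Da, while the second differs by about 0.02 Da.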

Validation – First Round: PRIDE Accession 1642

• Hypothesis:
– The first and third sequences show a mass difference of around 32 Da.
• Is there a modification at the C or N terminus? If so, the second sequence would show it as well.
• Is one residue, or more than one, modified?
• We extract the amino acids common to both sequences: D, A, S, I, V, M and G.
• We compare them with the described modifications that have a mass shift of 32 Da.
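The residue comparison in this hypothesis can be checked mechanically. A minimal sketch: take the residues shared by the two peptides that show the shift, then set aside those that also occur in the peptide whose masses agree. The second step assumes the modification would affect every occurrence of a residue, which the slides do not state. Variable names are illustrative.

```python
# Peptides whose PRIDE mass is about 32 Da above the calculated mass.
shifted = ["VISEPGEAEVFMTPEDFVR", "DHADSNIVIMLVGNK"]
# Peptide whose PRIDE mass and calculated mass agree.
unshifted = "EIGPPQQQR"

# Residues present in both shifted peptides.
common = set(shifted[0]) & set(shifted[1])

# Assumption: a residue that also occurs in the unshifted peptide is a weaker
# candidate, because that peptide would then be expected to shift as well.
candidates = common - set(unshifted)

print("common residues:", sorted(common))
print("absent from the unshifted peptide:", sorted(candidates))
```

The resulting residue set is what would then be compared against the described modifications with a mass shift near 32 Da.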

Validation – First Round: PRIDE Accession 1642

Only this modification could explain a property common to both sequences.

So, we’ll select it in the next round.

Validation – First Round: PRIDE Accession 1642

Validation – Second Round: Latest Experiments. Retrieved by hand

• PRIDE accession id: 10470 to 11257 (787 experiments).
– None is suitable to check.
– No information regarding the identification is available.

• PRIDE accession id: 10000 to 10074 (74 experiments).
– One dataset could be checked: 10042 to 10060 (dataset title: Low abundance proteome of human red blood cells captured by combinatorial peptide libraries).

Validation – Second Round: Latest experiments

PRIDE Accession 10053

Mascot output

PRIDE Accession 10060

Mascot output: No identification

Validation – Third Round: Recent Experiments. Retrieved by hand

• Experiment id: 9900 to 9999
• Two datasets are suitable to check:
– 9900 to 9942: LC-MALDI experiments (Tannerella forsythia).
– 9944 to 9949: Rattus norvegicus.
– 9984: Zebrafish. No spectra.
– 9985 to 9992: Homo sapiens. No identifications.
– 44 not available.

Validation – Third Round: Experiment 9900

Validation – Third Round: Experiment 9900

Validation – Third Round: Experiment 9900. Summary

Protein Id | Peptide Count | Identified | 1st Peptide Mass | Theoretical Mass
TF2239 | 1 | No | 1228.5463 | 1228.6433
TF26612 | 13 | Yes | -- | --
TF1259 | 1 | No | 1271.6478 | 1271.6783
TF2116 | 4 | No | 1139.5835 | 1139.6208
TF1741 | 16 | No | 1044.5144 | 1044.5473
TF0447 | 2 | No | 1092.4619 | 1092.5432
TF2663 | 7 | Yes | -- | --
TF2592 | 2 | No | 1022.5306 | 1022.5782
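One way to read this summary is to express the gap between the first peptide mass and the theoretical mass as an error in Da and in ppm. A minimal sketch follows; the ppm value is the usual relative error, and since the slides state no search tolerance, no threshold is applied here. The function name `mass_error` is illustrative.

```python
# (protein id, first peptide mass, theoretical mass) for the entries
# that were not identified in the summary table.
rows = [
    ("TF2239", 1228.5463, 1228.6433),
    ("TF1259", 1271.6478, 1271.6783),
    ("TF2116", 1139.5835, 1139.6208),
    ("TF1741", 1044.5144, 1044.5473),
    ("TF0447", 1092.4619, 1092.5432),
    ("TF2592", 1022.5306, 1022.5782),
]


def mass_error(observed, theoretical):
    """Return (error in Da, error in ppm) of observed vs theoretical mass."""
    delta = observed - theoretical
    return delta, delta / theoretical * 1e6


for protein_id, observed, theoretical in rows:
    delta_da, delta_ppm = mass_error(observed, theoretical)
    print(f"{protein_id:8s} {delta_da:+.4f} Da {delta_ppm:+8.1f} ppm")
```

For the six entries listed, the first peptide mass is lower than the theoretical mass in every case, by roughly 0.03 to 0.1 Da.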

Study summary

• Around 1000 PRIDE experiments were downloaded from the PRIDE central repository.

• Around 100 of them were suitable to test.

• Fewer than 50% of those were successfully validated.

In summary

• There is a lot of data within the repositories (PRIDE).

• There is a lot of missing information.

• It is not possible to check the data automatically.

• PRIDEViewer can help us save a lot of time.

Protein Set

• In other cases, a mistake in an identification is less significant, as long as we can still reach the goal of the experiment.

• For instance, proteins involved in a particular function or biological process.

DB id Protein Name

gi|12857455 Heat shock protein

gi|14017768 FKB9_HUMAN

gi|12836587 Tubulin alpha homo sapiens

gi|15010550 Ubiquitin specific protease

gi|15489190 vinculin isoform VCL Homo sapiens

gi|9963904 selenium binding protein 1 Homo sapiens

… …

PIKE http://proteo.cnb.csic.es/

PIKE: Protein Information and Knowledge extractor

http://proteo.cnb.uam.es/

PIKE http://proteo.cnb.csic.es/

Information asked by user


PIKE output. CSV

PIKE output

First example medium-complexity protein list (containing 57 proteins)

J Proteome Res. 2005 Nov-Dec;4(6):2435-41.

First example medium-complexity protein list (containing 57 proteins)

# | Entry name | Entry ID (UniProt ID) | Manual searching | PIKE output (keywords only)
6 | Integrin alpha-5 precursor | P08648 | 1 TM | Keyword: Transmembrane
7 | Sodium/potassium-transporting ATPase alpha-1 chain precursor | P05023 | 10 TM | Keyword: Transmembrane
8 | Short transient receptor potential channel 4 | Q9UBN4 | 8 TM | Keyword: Transmembrane
10 | Band 3 anion transport protein | P02730 | 11 TM | Keyword: Transmembrane
11 | Transferrin receptor protein 1 | P02786 | 1 TM | Keyword: Transmembrane
17 | calnexin precursor | P27824 | 1 TM | Keyword: Transmembrane
19 | 5'-nucleotidase precursor | P21589 | 1 TM; GPI | Keyword: GPI-anchor
21 | Alkaline phosphatase, placental type precursor | P05187 | GPI | Keywords: Transmembrane; GPI-anchor
22 | 4F2 cell-surface antigen heavy chain | P08195 | 1 TM | Keyword: Transmembrane
24 | Solute carrier family 2, facilitated glucose transporter, member 1 | P11166 | 12 TM | Keyword: Transmembrane
29 | chloride intracellular channel protein 5 | Q9NZA1 | | Keyword: Transmembrane
30 | 3beta-hydroxy-Delta5-steroid dehydrogenase multifunctional protein I | P14060 | 1 TM | Keyword: Transmembrane
41 | myristoylated alanine-rich C-kinase substrate | P29966 | Myristoylation | Keyword: Myristate
42 | Basigin precursor | P35613 | 1 TM | Keyword: Transmembrane
47 | Brain acid soluble protein 1 | P80723 | Myristoylation | Keywords: Transmembrane; Myristate
51 | ADP-ribosylation factor 1 | P84077 | | Keywords: Transmembrane; Myristate

Second example Human Plasma Proteins from PRIDE (HPPP). PRIDE Accession 65

25 MOST FREQUENT PROTEINS

Serum albumin [Precursor] - Serum albumin - ALB  356
Complement C3 [Precursor]  273
IGHA1 protein  225
Calcium/calmodulin-dependent protein kinase kinase 2  100
Inter-alpha-trypsin inhibitor heavy chain H1-H4 [Precursor]  99
Putative uncharacterized protein  97
IGL@ protein  96
ARF GTPase-activating protein GIT2  90
Complement factor B [Precursor]  90
PRO2275  90
IGHM protein  78
IGKC protein  64
Alpha-1B-glycoprotein [Precursor]  62
cDNA FLJ14473 fis, clone MAMMA1001080.  58
CDNA FLJ25298 fis, clone STM07683.  58
Fibronectin [Precursor]  58
IGHD protein  56
Trypsin  55
Apolipoprotein-L1 [Precursor]  54
HP protein  53
Alpha-2-macroglobulin [Precursor]  52
SNC66 protein  52
Ig kappa chain V-III region HAH [Precursor]  50

PROTEIN COUNT 2226

REDUNDANCY RATIO (Protein count/non redundant entries) 89.04%

Third example The Human Plasma Proteome: A non redundant list:

Mol Cell Proteomics. 2004 Apr;3(4):311-26. Epub 2004 Jan 12.

>> We have merged four different views of the human plasma proteome, based on different methodologies, into a single nonredundant list of 1175 distinct gene products ….

Third example The Human Plasma Proteome: A non redundant list:

Mol Cell Proteomics. 2004 Apr;3(4):311-26. Epub 2004 Jan 12.

Conclusion

• PIKE is a suitable and useful bioinformatics tool for small- or large-scale proteomics projects.

• PIKE's main characteristic is its ability to systematically access and automatically retrieve comprehensive biological information contained in common databases.

• The resulting information is output in a wide range of standard formats that can be directly viewed, exported, or downloaded for additional analysis.

Questions?
