Data Validation and Annotation: PRIDEViewer and PIKE Bioinformatics analysis from proteomics data ProteoRed Bioinformatics Workshop Salamanca Alberto Medina-Aunon March, 15th 2010

Page 1: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Data Validation and Annotation: PRIDEViewer and PIKE

Bioinformatics analysis from proteomics data

ProteoRed Bioinformatics Workshop Salamanca

Alberto Medina-Aunon

March 15th, 2010

Page 2: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Main Topics

• Mass spectrometry and protein and peptide validation
– PRIDEViewer: Description.
– Examples: Use cases.

• Experiment context: Linking functional information to our proteins.
– PIKE: Description.
– Examples: Use cases.

Page 3: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

MS Validation: The Easiest Way

• Starting from:
– Mass spectrum/spectra
– Tentative identification/sequence
– Search engine

Candidate: AFLLAMAARTGFRTR
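One common way to check a candidate sequence against a spectrum by hand is to compare the observed peaks with the theoretical fragment ions. The slide does not say which ion series were used, so the singly charged b/y ladder below is only a minimal sketch under that assumption; the function name `fragment_ladder` is illustrative and not part of PRIDEViewer.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once to every y ion
PROTON = 1.00728   # charge carrier for singly charged ions


def fragment_ladder(sequence):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    b_ions, y_ions = [], []
    total = 0.0
    for mass in masses[:-1]:            # b1 .. b(n-1), read from the N-terminus
        total += mass
        b_ions.append(total + PROTON)
    total = 0.0
    for mass in reversed(masses[1:]):   # y1 .. y(n-1), read from the C-terminus
        total += mass
        y_ions.append(total + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    # Candidate sequence taken from the slide.
    b_ions, y_ions = fragment_ladder("AFLLAMAARTGFRTR")
    for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
        print(f"b{i:<2} {b:10.4f}   y{i:<2} {y:10.4f}")
```

Peaks in the spectrum that fall close to these values support the candidate; how close is close enough depends on the instrument and is not specified here.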

Page 4: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

How to do it

• By hand:
– Only feasible for a few sequences/spectra.
– We cannot read every file format (for instance, binary files).

• Semi-automatically:
– Using PRIDE files as input: PRIDEViewer

Page 5: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

PRIDEViewer Experiment info

Page 6: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

PRIDEViewer Sample and Instrument info

Page 7: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

PRIDEViewer Spectra and identifications

Page 8: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

PRIDEViewer Gel Separation

Page 9: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

PRIDEViewer Mascot interface

Page 10: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

One Example: Identification using 5 peptides

Page 11: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Example Mascot output

Page 12: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Another example: 350 input spectra

Page 13: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Validation study

• Starting from one public proteomics repository (EBI PRIDE):
– Retrieve a set of available experiments.
– Check the level of completeness of the experiments.
– Repeat the protein and peptide identification.

VALIDATE THE EXPERIMENT…

Page 14: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Validation using PRIDE

http://www.ebi.ac.uk/pride/

Page 15: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

PRIDE: Searching experiments: Biomart

Page 16: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Validation. First Round. Biomart

Page 17: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Validation – First Round: PRIDE Accession 1642

Page 18: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

First View: Mascot Results

Page 19: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Validation – First Round: PRIDE Accession 1642

Protein Id    Database   Peptide Count  Identified
IPI00295598   IPI        2              No
Q15843        SwissProt  6              Yes
P62491        SwissProt  1              No

Why? If we explore the data, we’ll find…

Protein Id    First Peptide        PRIDE mass  Calculated mass
IPI00295598   VISEPGEAEVFMTPEDFVR  2184.0375   2152.0267
Q15843        EIGPPQQQR            1052.5697   1052.5483
P62491        DHADSNIVIMLVGNK      1657.8186   1625.8316

Delta mass around 32 Da
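The calculated masses in this table can be reproduced as singly protonated monoisotopic masses, [M+H]+, of the unmodified peptides. The sketch below assumes that convention (the slide does not state it, but the three values agree with it) and uses standard monoisotopic residue masses; the function name `mh_plus` is illustrative.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def mh_plus(sequence):
    """Monoisotopic [M+H]+ of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + PROTON


# (protein id, first peptide, mass reported in the PRIDE file), from the slide.
PEPTIDES = [
    ("IPI00295598", "VISEPGEAEVFMTPEDFVR", 2184.0375),
    ("Q15843", "EIGPPQQQR", 1052.5697),
    ("P62491", "DHADSNIVIMLVGNK", 1657.8186),
]

if __name__ == "__main__":
    for protein, peptide, pride_mass in PEPTIDES:
        calculated = mh_plus(peptide)
        delta = pride_mass - calculated
        print(f"{protein:12s} {calculated:10.4f}  delta = {delta:8.4f} Da")
```

The two unidentified proteins show a difference close to 32 Da, while the identified one differs by only a few hundredths of a dalton.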

Page 20: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Validation – First Round: PRIDE Accession 1642

• Hypothesis:
– The first and third sequences show a mass difference of around 32 Da.
• Is there a modification at the C- or N-terminus? If so, the second sequence would show it as well.
• Is any residue, or more than one, modified?
• We’ll extract the amino acids common to both sequences: D, A, S, I, V, M and G.
• Compare them with the described modifications that have a mass shift of 32 Da.
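The residue-extraction step can be sketched as a set intersection over the first and third peptide sequences from the mass table. This is a minimal illustration, not the procedure PRIDEViewer applies; the variable names are hypothetical.

```python
# First peptides of the two proteins that were not identified (from the slide).
FIRST_PEPTIDE = "VISEPGEAEVFMTPEDFVR"   # IPI00295598
THIRD_PEPTIDE = "DHADSNIVIMLVGNK"       # P62491


def shared_residues(seq_a, seq_b):
    """Return the amino acids present in both sequences, sorted alphabetically."""
    return sorted(set(seq_a) & set(seq_b))


if __name__ == "__main__":
    print(", ".join(shared_residues(FIRST_PEPTIDE, THIRD_PEPTIDE)))
```

Each residue in the result is a candidate site to look up among known modifications with a mass shift near 32 Da.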

Page 21: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Validation – First Round: PRIDE Accession 1642

Only this modification could explain a common property between both sequences.

So, we’ll select it in the next round

Page 22: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Validation – First Round: PRIDE Accession 1642

Page 23: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Validation – Second Round: Latest Experiments. Retrieved by hand

Page 24: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Validation – Second Round: Latest experiments

• PRIDE accession id: 10470 to 11257 (787 experiments).
– None is suitable to check.
– No information regarding the identification is available.

• PRIDE accession id: 10000 to 10074 (74 experiments).
– One dataset could be checked: 10042 to 10060. (Dataset title: Low abundance proteome of human red blood cells captured by combinatorial peptide libraries)

Page 25: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

PRIDE Accession 10053

Page 26: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Mascot output

Page 27: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

PRIDE Accession 10060

Page 28: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Mascot output: No identification

Page 29: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Validation – Third Round: Recent Experiments. Retrieved by hand

• Experiment id: 9900 to 9999
• Two datasets are suitable to check:
– 9900 to 9942: LC-MALDI experiments (Tannerella forsythia).
– 9944 to 9949: Rattus norvegicus.
– 9984: Zebrafish. No spectra.
– 9985 to 9992: Homo sapiens. (No identifications).
– 44 not available.
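Whether an experiment is suitable to check comes down to whether its file carries both spectra and identifications. The sketch below counts both in a PRIDE XML document. The element names (`spectrum`, `PeptideItem`, `GelFreeIdentification`) are written from memory of the PRIDE XML layout and should be verified against the schema; the embedded sample, its accession and its peptide are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a PRIDE XML file.
SAMPLE = """<ExperimentCollection version="2.1">
  <Experiment>
    <ExperimentAccession>0000</ExperimentAccession>
    <mzData>
      <spectrumList count="1"><spectrum id="1"/></spectrumList>
    </mzData>
    <GelFreeIdentification>
      <Accession>HYPOTHETICAL_1</Accession>
      <PeptideItem><Sequence>PEPTIDEK</Sequence></PeptideItem>
    </GelFreeIdentification>
  </Experiment>
</ExperimentCollection>"""


def summarize(xml_text):
    """Count spectra and peptide identifications; flag whether both are present."""
    root = ET.fromstring(xml_text)
    n_spectra = sum(1 for _ in root.iter("spectrum"))
    n_peptides = sum(1 for _ in root.iter("PeptideItem"))
    return {
        "spectra": n_spectra,
        "peptides": n_peptides,
        "checkable": n_spectra > 0 and n_peptides > 0,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

An experiment with no spectra or no identifications would come back with `checkable` set to `False`.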

Page 30: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Validation – Third Round: Experiment 9900

Page 31: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Validation – Third Round: Experiment 9900

Page 32: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Validation – Third Round: Experiment 9900. Summary

Protein Id  Peptide Count  Identified  1st Peptide Mass  Theoretical Mass
TF2239      1              No          1228.5463         1228.6433
TF26612     13             Yes         --                --
TF1259      1              No          1271.6478         1271.6783
TF2116      4              No          1139.5835         1139.6208
TF1741      16             No          1044.5144         1044.5473
TF0447      2              No          1092.4619         1092.5432
TF2663      7              Yes         --                --
TF2592      2              No          1022.5306         1022.5782
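For the proteins that were not identified, the gap between the first peptide mass and its theoretical mass can be expressed in daltons and in parts per million. The sketch below does that arithmetic for the rows in the summary table; the function name `mass_error` is illustrative, and no tolerance threshold is implied because the slides do not state the one used in the search.

```python
# (protein id, first peptide mass, theoretical mass) for the unidentified rows.
ROWS = [
    ("TF2239", 1228.5463, 1228.6433),
    ("TF1259", 1271.6478, 1271.6783),
    ("TF2116", 1139.5835, 1139.6208),
    ("TF1741", 1044.5144, 1044.5473),
    ("TF0447", 1092.4619, 1092.5432),
    ("TF2592", 1022.5306, 1022.5782),
]


def mass_error(observed, theoretical):
    """Return (delta in Da, delta in ppm) of observed relative to theoretical."""
    delta = observed - theoretical
    return delta, delta / theoretical * 1e6


if __name__ == "__main__":
    for protein, observed, theoretical in ROWS:
        delta, ppm = mass_error(observed, theoretical)
        print(f"{protein:8s} {delta:+.4f} Da  {ppm:+8.1f} ppm")
```

Unlike the first round, these differences are fractions of a dalton rather than a shift of about 32 Da.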

Page 33: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Study summary

• Around 1000 PRIDE experiments were downloaded from the PRIDE central repository.

• Around 100 of them were suitable to test.

• Fewer than 50% were successfully validated.

Page 34: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

In summary

• There is a lot of data within the repositories (PRIDE).

• There is a lot of missing information.

• It is not possible to check the data automatically.

• PRIDEViewer could help us save a lot of time.

Page 35: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Protein Set

• At other times, a mistake in the identification is not so significant if we can still reach the goal of the experiment.

• For instance, proteins involved in a particular function or biological process.

Page 36: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

DB id Protein Name

gi|12857455 Heat shock protein

gi|14017768 FKB9_HUMAN

gi|12836587 Tubulin alpha homo sapiens

gi|15010550 Ubiquitin specific protease

gi|15489190 vinculin isoform VCL Homo sapiens

gi|9963904 selenium binding protein 1 Homo sapiens

… …

PIKE http://proteo.cnb.csic.es/

PIKE: Protein Information and Knowledge extractor

Page 37: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

PIKE http://proteo.cnb.csic.es/

Page 38: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

PIKE http://proteo.cnb.csic.es/

Page 39: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

PIKE http://proteo.cnb.csic.es/

Information requested by the user

Page 40: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

PIKE http://proteo.cnb.csic.es/

Page 41: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

PIKE output. CSV

Page 42: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

PIKE output

Page 43: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

First example: medium-complexity protein list (containing 57 proteins)

J Proteome Res. 2005 Nov-Dec;4(6):2435-41.

Page 44: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

First example: medium-complexity protein list (containing 57 proteins)

#   Entry name                                                            Entry ID (UniProt ID)  Manual searching  PIKE output (keywords only)
6   Integrin alpha-5 precursor                                            P08648                 1 TM              Keyword: Transmembrane
7   Sodium/potassium-transporting ATPase alpha-1 chain precursor          P05023                 10 TM             Keyword: Transmembrane
8   Short transient receptor potential channel 4                          Q9UBN4                 8 TM              Keyword: Transmembrane
10  Band 3 anion transport protein                                        P02730                 11 TM             Keyword: Transmembrane
11  Transferrin receptor protein 1                                        P02786                 1 TM              Keyword: Transmembrane
17  calnexin precursor                                                    P27824                 1 TM              Keyword: Transmembrane
19  5'-nucleotidase precursor                                             P21589                 1 TM; GPI         Keyword: GPI-anchor
21  Alkaline phosphatase, placental type precursor                        P05187                 GPI               Keywords: Transmembrane; GPI-anchor
22  4F2 cell-surface antigen heavy chain                                  P08195                 1 TM              Keyword: Transmembrane
24  Solute carrier family 2, facilitated glucose transporter, member 1    P11166                 12 TM             Keyword: Transmembrane
29  chloride intracellular channel protein 5                              Q9NZA1                                   Keyword: Transmembrane
30  3beta-hydroxy-Delta5-steroid dehydrogenase multifunctional protein I  P14060                 1 TM              Keyword: Transmembrane
41  myristoylated alanine-rich C-kinase substrate                         P29966                 Myristoylation    Keyword: Myristate
42  Basigin precursor                                                     P35613                 1 TM              Keyword: Transmembrane
47  Brain acid soluble protein 1                                          P80723                 Myristoylation    Keywords: Transmembrane; Myristate
51  ADP-ribosylation factor 1                                             P84077                                   Keywords: Transmembrane; Myristate

Page 45: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Second example: Human Plasma Proteins from PRIDE (HPPP). PRIDE Accession 65

25 MOST FREQUENT PROTEINS

Protein                                                      Count
Serum albumin [Precursor] - Serum albumin - ALB              356
Complement C3 [Precursor]                                    273
IGHA1 protein                                                225
Calcium/calmodulin-dependent protein kinase kinase 2         100
Inter-alpha-trypsin inhibitor heavy chain H1-H4 [Precursor]  99
Putative uncharacterized protein                             97
IGL@ protein                                                 96
ARF GTPase-activating protein GIT2                           90
Complement factor B [Precursor]                              90
PRO2275                                                      90
IGHM protein                                                 78
IGKC protein                                                 64
Alpha-1B-glycoprotein [Precursor]                            62
cDNA FLJ14473 fis, clone MAMMA1001080.                       58
CDNA FLJ25298 fis, clone STM07683.                           58
Fibronectin [Precursor]                                      58
IGHD protein                                                 56
Trypsin                                                      55
Apolipoprotein-L1 [Precursor]                                54
HP protein                                                   53
Alpha-2-macroglobulin [Precursor]                            52
SNC66 protein                                                52
Ig kappa chain V-III region HAH [Precursor]                  50

PROTEIN COUNT: 2226
REDUNDANCY RATIO (Protein count/non redundant entries): 89.04%
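A frequency table and a redundancy ratio of this kind can be derived from a flat list of protein identifications. The sketch below follows the definition given on the slide (protein count divided by the number of non-redundant entries). The input list is hypothetical toy data, and this is not PIKE's own code.

```python
from collections import Counter


def most_frequent(names, top=25):
    """Return the `top` most frequent protein names with their counts."""
    return Counter(names).most_common(top)


def redundancy_ratio(names):
    """Protein count divided by the number of non-redundant entries."""
    return len(names) / len(set(names))


if __name__ == "__main__":
    # Hypothetical identifications, one entry per reported protein.
    identifications = [
        "protein_A", "protein_A", "protein_A",
        "protein_B", "protein_B",
        "protein_C",
    ]
    for name, count in most_frequent(identifications, top=2):
        print(f"{name}  {count}")
    print(f"redundancy ratio: {redundancy_ratio(identifications):.2f}")
```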

Page 46: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Third example: The Human Plasma Proteome: a nonredundant list

Mol Cell Proteomics. 2004 Apr;3(4):311-26. Epub 2004 Jan 12.

>> We have merged four different views of the human plasma proteome, based on different methodologies, into a single nonredundant list of 1175 distinct gene products ….

Page 47: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Third example: The Human Plasma Proteome: a nonredundant list

Mol Cell Proteomics. 2004 Apr;3(4):311-26. Epub 2004 Jan 12.

Page 48: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Conclusion

• PIKE represents a suitable and useful bioinformatics tool for small- or large-scale proteomics projects.

• PIKE's main characteristic is its ability to systematically access and automatically retrieve comprehensive biological information contained in common databases.

• The resulting information is output in a wide range of standard formats that can be directly viewed, exported, or downloaded for additional analysis.

Page 49: Data Validation and Annotation:  PRIDEViewer and PIKE Bioinformatics analysis from proteomics data

Questions?