biomedical informatics for proteomics


Post on 06-Jan-2016









Biomedical informatics for proteomics
Boguski, M. S. and M. W. McIntosh (2003). Nature 422(6928): 233-237.
Kun-Mao Chao

Outline
- Introduction
- Study design and sample quality
- Protein databases
- Protein identification by database searching
- Pattern matching without protein identification
- Conclusions and future challenges

Introduction
- The subtitle: "Genes Were Easy."
- We have transitioned rapidly from a large but finite and complete human genome to a seemingly infinite biological universe.
- Proteomics is often referred to as a post-genome science, but its antecedents actually predate the Human Genome Project by two to three decades.
- Although medical informatics has until recently been largely detached from bioinformatics, the emergence of clinical genomics and proteomics increasingly requires the integrated analysis of genetic, cellular, molecular and clinical information and the expertise of pathologists, epidemiologists and biostatisticians.
- Proteomics is the latest functional genomics technology to capture our imagination, and it is instructive to review some lessons learned during the earlier adoption of another functional genomics technology, namely gene expression analysis using microarrays and similar technologies.
- There are many implications of biomedical informatics for proteomics, including multiple platform technologies, laboratory information-management systems, medical records systems, and documentation of clinical trial results for regulatory agencies.
- The paper confines its discussion to mass spectrometry-based proteomics, and to study design and data resources, tools and analysis in a research setting.
- Proteomics depends upon careful study design, high-quality biological samples and advanced information technologies.
- Proteome analysis is at a much earlier stage of development than genomics and gene expression (microarray) studies.
- Fundamental issues involving biological variability, pre-analytic factors and analytical reproducibility remain to be resolved.

Study design and sample quality

Glossary
- Case-control and cohort studies: both are observational studies. In a case-control study, participants are selected by the presence or absence of the phenotype (cases and controls). In a cohort study, participants are selected by the presence or absence of a risk factor of interest and followed over time for development of an outcome.
- Confounder/confounding: a variable that distorts an apparent relationship between an exposure and a phenotype of interest.
- Plasma: the fluid, non-cellular portion of blood.
- Serum: the protein solution remaining after blood has coagulated.
- Pre-analytical variables: variables that are present before the laboratory test and data analysis.
- Randomized clinical trial: treatments are randomly assigned in order to prevent confounding.

Study design and sample quality
- Potter describes four study designs.

- However, the distinction between observational and experimental design is often not made, in proteomics studies as well.

- Observational studies of gene expression and proteomic analysis involving humans are subject to bias and confounding factors.

- Human plasma and serum proteomics are susceptible to observational biases, which may be confused with a specific characteristic of the disease process and so mislead.

- Each may induce a change in total protein concentrations by 10%.

- This highlights the nature of the human serum proteome, but confounding variables may complicate findings.

Study design and sample quality
- There may be no way to adjust for confounding afterwards; the only protection is careful design and specimen ascertainment.

- Sample quality and number

- Margolin has admonished that "Scientists ... need to avoid the tendency, often driven by the high price of some of the newer techniques, of running under-controlled experiments or experiments with fewer repeated conditions than would have been accepted with standard techniques."

- Proteomics discovery has no a priori enumeration of targets and lacks a well-described procedural structure.

Protein databases

Proteome

DNA -> mRNA -> Proteins
Genome -> Proteome

Proteome databases
- Collections of protein sequences date back to the 1960s.

Utilitarian goals of protein databases (1990s to today)
- Minimal redundancy
- Maximal annotation
- Integration with other databases

Protein databases
- Current molecular sequence databases classify their entries according to evolutionary history inferred from sequence homology.
- They are excellent tools for gene discovery, comparative genomics and molecular evolution.
- Much work remains to be done before they even minimally serve the needs of proteomics and integrative biological science.

Protein databases
- Today's principal protein databases emphasize molecular and cellular features.
- Their annotations are not well suited to representing physiology.

- A more ideal database for plasma proteome studies would classify proteins from a functional, rather than an evolutionary, viewpoint.

Data standards
- Multiple or specialized file formats have hindered accessibility, information exchange and integration.

eXtensible Markup Language (XML)
- An Internet standard for describing structured and semistructured data.
- Most of the main databases make their data available in XML, which makes it easy to publish and exchange the data.

Protein databases
- PDB (Protein Data Bank)
- GenBank
- SWISS-PROT
- EMBL
- HPRD (Human Protein Reference Database)

Protein identification by database searching

Purpose of protein identification by database searching
- The purpose is NOT:
  - the species or the remoteness of the relationship
  - to infer similarity of function from similarity of sequence
  - to study the evolution of protein families or domains
- The aims are different and therefore require different strategies and tools.

Analysis of human serum
- The interest is in identifying proteins that are not normally present.
- Matches between subsequences
- Weak similarities

Statistical significance
- Statistical significance is important, but not in the sense of the probability that two sequences are related by chance.
- The question is whether a measurement deviates significantly from a normal range of values.
- If this criterion is met, one is then interested in attempting to demonstrate a significant correlation.

Database
DNA -(1)-> mRNA -(2)-> protein

1. Transcription
2. Translation

Post-translational modification (proteolytic processing, glycosylation, methylation, phosphorylation, Met, acetylation, hydroxylation)

Post-translational modification
- Proteolytic processing
- Glycosylation: Asn, Ser, Thr

Post-translational modification
- Methylation: Lys
- Phosphorylation: -OH

Post-translational modification
- Met: N-terminal Met (AUG)
- Not encoded in the mRNA
- Acetylation
- Hydroxylation

- Peptide analysis
- Error tolerance
- Scoring methods

Peptide analysis - Experimental process
- Cut the protein into a mixture of short peptides
  - Specific: a sequence-specific protease (e.g. trypsin)
- Mass spectrometry: detect the m/z of the compounds
- Tandem mass spectrometry (MS/MS): fragments of a specific m/z
- Chromatography: separation before MS

[Figure: Tandem mass spectrometry (source: Dionex)]

Peptide analysis - Mass spectrometry
Several approaches:
- Peptide-mass fingerprint: used as a profile; compare with the predicted spectrum; match to a database
- De novo sequence interpretation: manual interpretation by an expert; time consumption is high
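The peptide-mass fingerprint idea (digest, predict m/z values, compare with the measured profile) can be sketched as follows. This is a minimal illustration, not the procedure from the slides: the cleavage rule is a simplified trypsin rule (cut after K or R unless followed by P), the function names are hypothetical, and the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # one water is added per free peptide
PROTON = 1.007276  # mass of the charge carrier


def digest(sequence):
    """Simplified trypsin rule (assumption): cut after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        last = i == len(sequence) - 1
        if last or (aa in "KR" and sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def mz(mass, charge):
    """m/z of a peptide carrying `charge` protons."""
    return (mass + charge * PROTON) / charge


def predicted_fingerprint(protein, charge=1):
    """Predicted (peptide, m/z) pairs to compare against a measured spectrum."""
    return [(p, mz(peptide_mass(p), charge)) for p in digest(protein)]
```

A database search would compute `predicted_fingerprint` for each database entry and rank entries by how many predicted values fall close to observed peaks.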

Consideration of error tolerance
- Enzyme non-specificity
- Precursor charge errors: an ion may acquire more than one charge during ionization; isotopes
- Mass measurement errors: related to the accuracy of the instrument
- Unsuspected modifications, e.g. post-translational modification
- Primary sequence variations: deletions, insertions, substitutions
- Reference: Error tolerant searching of uninterpreted tandem mass spectrometry data (2002)

Scoring methods description
- In general, each scoring algorithm designates a quantity related to the probability that the candidate peptide could have produced the observed spectrum by chance.
- Ranking is required for high-throughput automated analysis.
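Two of the error sources listed above, mass measurement error and unsuspected modifications, can be illustrated with a small matching routine. This is only a sketch under stated assumptions: the function name, the ppm tolerance and the short modification table are illustrative choices (the mass shifts are standard monoisotopic values), and it is not the algorithm of the 2002 reference.

```python
# Monoisotopic mass shifts in daltons for a few common modifications.
MOD_SHIFTS = {
    "none": 0.0,
    "phosphorylation": 79.96633,
    "acetylation": 42.01057,
    "methylation": 14.01565,
    "hydroxylation": 15.99491,
}


def error_tolerant_matches(observed_mass, candidates, tol_ppm=10.0, mods=MOD_SHIFTS):
    """Return (peptide, modification, ppm error) for every candidate that
    explains the observed mass within tol_ppm, best match first.

    candidates: dict mapping a peptide name to its unmodified mass.
    """
    hits = []
    for name, mass in candidates.items():
        for mod, shift in mods.items():
            theoretical = mass + shift
            ppm = (observed_mass - theoretical) / theoretical * 1e6
            if abs(ppm) <= tol_ppm:
                hits.append((name, mod, ppm))
    # Ranking: smallest absolute mass error first.
    return sorted(hits, key=lambda hit: abs(hit[2]))
```

Widening `tol_ppm` or adding entries to `mods` tolerates more errors but also enlarges the search space, which is why ranking and acceptance criteria matter.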

Example of peptide identification

- PB cannot be identified because of high variation.
- Solution: reduce the number of target peptides.

Another challenge
- Another challenge in automating proteomics: the best match of a scoring algorithm is simply not good enough.
- Establishing overall criteria for acceptance therefore becomes the main focus of automated proteomics.

Scoring threshold, P value
- It is generally assumed that higher-scoring assignments are more likely to be correct than lower-scoring assignments.
- Threshold: (i) sensitivity; (ii) specificity; (iii) mixture, sequence database
- P values: if the P value threshold is 0.05, 5% of all false tests will be misidentified as true.
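The trade-off behind a score threshold can be made concrete with a small sketch. It assumes a labelled validation set (scores with known correct or incorrect assignments), which the slides do not describe; the function names are hypothetical.

```python
def sensitivity_specificity(scores, is_correct, threshold):
    """Accept every assignment with score >= threshold and report
    (sensitivity, specificity) against the known labels.

    Assumes the labels contain at least one correct and one incorrect assignment.
    """
    tp = fp = tn = fn = 0
    for score, correct in zip(scores, is_correct):
        accepted = score >= threshold
        if accepted and correct:
            tp += 1
        elif accepted and not correct:
            fp += 1
        elif not accepted and correct:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def expected_false_positives(n_incorrect, alpha):
    """With a P value cut-off of alpha, a fraction alpha of the incorrect
    assignments is expected to be accepted as true."""
    return n_incorrect * alpha
```

For example, `expected_false_positives(1000, 0.05)` gives about 50: with 1000 incorrect candidate assignments and a 0.05 cut-off, roughly 50 are expected to pass.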

Probability
- P value-like quantities
- Keller et al. estimate the reference distributions of the correct and incorrect assignments within any experiment.
- Keller et al. describe an approach that may allow a scoring algorithm to be converted into P value-like quantities that can then be used to control error rates.

Pattern matching without protein identification
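The idea attributed to Keller et al. above, turning a raw score into a probability by modelling the score distributions of correct and incorrect assignments, can be sketched with Bayes' rule. This is a simplified illustration, not their implementation: both distributions are assumed to be Gaussian with made-up parameters, and the prior fraction of correct assignments is supplied by hand rather than estimated from the experiment.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_correct(score, prior_correct, correct=(4.0, 1.0), incorrect=(0.0, 1.0)):
    """P(assignment is correct | score) for a two-component mixture.

    correct / incorrect are (mean, sd) pairs; the defaults are illustrative only.
    """
    p_c = prior_correct * normal_pdf(score, *correct)
    p_i = (1.0 - prior_correct) * normal_pdf(score, *incorrect)
    return p_c / (p_c + p_i)


def expected_error_rate(scores, min_probability, prior_correct):
    """Expected fraction of wrong assignments among those accepted
    with posterior probability >= min_probability."""
    accepted = [p for p in (posterior_correct(s, prior_correct) for s in scores)
                if p >= min_probability]
    if not accepted:
        return 0.0
    return sum(1.0 - p for p in accepted) / len(accepted)
```

Choosing `min_probability` so that `expected_error_rate` stays below a target is one way such quantities could be used to control error rates.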

Conclusions and future challenges

Time-of-flight mass spectrometry (TOF)
- A method of mass spectrometry in which ions are accelerated by an electric field.
- The velocity of the ion depends on the mass-to-charge ratio.
- The time of flight is measured.
- Comparing the time with the known experimental parameters gives the mass-to-charge ratio of the ion.
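The time-of-flight relation can be worked through numerically. The sketch below assumes an idealized instrument: an ion of charge q falls through a potential U, gains kinetic energy qU = (1/2)mv^2, and then drifts a distance d at constant velocity, so t = d * sqrt(m / (2qU)). Function and parameter names are illustrative.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms


def flight_time(mz_da, voltage, distance):
    """Drift time in seconds for an ion with the given m/z (Da per elementary
    charge), accelerated through `voltage` volts over `distance` metres."""
    mass_per_charge = mz_da * DALTON / ELEMENTARY_CHARGE  # kg per coulomb
    return distance * math.sqrt(mass_per_charge / (2.0 * voltage))


def mz_from_time(time_s, voltage, distance):
    """Invert the relation: recover m/z (Da per elementary charge) from the
    measured flight time and the known instrument parameters."""
    return 2.0 * ELEMENTARY_CHARGE * voltage * time_s ** 2 / (DALTON * distance ** 2)
```

Because the time scales with the square root of m/z, an ion with twice the m/z arrives sqrt(2) times later under the same voltage and drift length.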

Time-of-flight mass spectrometry (TOFMS) is a method of mass spectrometry in which ions are accelerated by an electric field of known strength. This acceleration results in an ion having the same kinetic energy as any other ion that has the same charge. The velocity of the ion depends on the mass-to-charge ratio. The time that it subsequently takes for the particle to reach a detector at a known distance is measured. This time will depend on the mass-to-charge ratio of the particle (heavie

