

proteomicsmyriad

Bioinformatics Industrial Applications in High Throughput Proteomics

Alan F. James

Director of Software Development, Myriad Proteomics, Inc., Salt Lake City

What is Proteomics?

• Proteomics refers to the study of the protein constituents and protein activities of a cell, a tissue or an organism.
• Proteomics may be seen from several viewpoints:
– Protein Expression
– Protein Interaction (Interactome)
– …

Challenges in Proteomics

• Proteins are all different
– some degrade easily, some are sticky, many require accessory factors
• Proteins are more complex than DNA
– there are several protein forms per gene
– proteins are post-translationally modified
• There isn’t really ONE proteome in humans
• Proteins change:
– with cell type
– during differentiation
– during development
– in response to stimuli
– with cell cycles
• So which proteome do you study?

Methods of Analyzing Proteomes

– Expression, Abundance, Distribution
– Structural Genomics
– Protein-Protein Interaction Analysis
• Yeast two-hybrid system
• Mass spectrometry of protein complexes

[Figure labels: Normal cell, Cancer cell, PDIRP5, OS-9, MPO-XYZ, novel, novel, NCALD, CASP3]

Methods of Analyzing Proteomes by Comprehensive Surveys of Protein-Protein Interactions

Mass Spectrometry
• Allows identification of the proteins in a complex of many proteins (2-100) that carry out some cellular function.

Yeast two-hybrid (Y2H)
• Measures association between two proteins.
• Allows very high throughput.

Y2H Background Information: Gene Activity in Yeast

Yeast transcription factors are composed of a DNA Binding Domain and a Transcriptional Activation Domain.

[Figure: a transcription factor (Activation Domain plus DNA Binding Domain) at a yeast gene, with the gene shown as activated]

The DNA Binding Domain recruits the Activation Domain to the yeast gene, which allows the yeast gene to be active.

Principles of the Yeast Two-Hybrid System

1. The DNA Binding Domain is separated from the Transcriptional Activation Domain of a transcription factor.

2. Libraries of human proteins are fused to both domains to create “hybrid” proteins.

3. The recruitment of the Activation Domain to the yeast gene is now mediated by interactions of the human proteins.

[Figure: panels 1 to 3 showing the Activation Domain and the DNA Binding Domain, human proteins X and Y fused to them, and activation of the yeast gene]

Yeast Two-Hybrid Screens: Assay for Interactions

Bait: Human Protein X fused to the DNA Binding Domain
Prey: Human Protein Y or Z fused to the Activation Domain

Scenario A: Human Proteins X and Y do not interact
(No reporter gene activity)
Readout: No growth of yeast colonies

Scenario B: Human Proteins X and Z do interact
Readout: Yeast colonies grow

Directed vs. Random Approach

Directed: selecting specific proteins as baits for Y2H analysis.

Random: using individual baits picked at random from libraries of baits.

The random approach can be used to rapidly generate large amounts of interaction data.

Random Two-Hybrid (R2H) Process Overview

1. Library Construction: Produce DNA Binding Domain (BD) and Activation Domain (AD) libraries from cDNA synthesized from mRNA libraries using random primers.

2. Pick BD-Colonies: Put yeast colonies containing BD-hybrid proteins into 96-well culture plates.

3. Mating w/AD-Library: Add yeast containing the AD-hybrid proteins to the 96-well plates with the yeast colonies picked in (2.); allow yeast mating to occur.

4. Selection Plating: Plate yeast matings onto dishes containing selective medium that allows yeast to grow only if the human hybrid proteins interact.

5. Incubation: Allow several days for yeast that contain interacting human proteins to grow.

6. Pick Growing Yeast: Pick yeast colonies containing interacting human proteins (“Positives”) and put them into 96-well culture plates.

7. Amplify Human DNA: Amplify the human DNA that encodes the interacting proteins by PCR.

8. DNA Sequencing: Sequence the amplified DNA and identify the interacting proteins.

Mass Spectrometry

Vital tasks in cells are often performed by Multi-Protein Complexes (MPC).

Mass Spectrometry

[Figure: workflow with stages Gene Cloning, Cell Biology, Protein Purification, Mass Spectrometry. Labels: pENTR, pDEST1 to pDEST5, Protein “Baits”, Handles (Affinity Tags), Protein “Preys”]

Mass Spectrometry

[Figure: workflow with stages cDNA Cloning, Cell Biology, Protein Purification, Mass Spectrometry. Labels: pENTR, pDEST1 to pDEST5, Protein “Baits”, Handles, Protein “Preys”]

Pulldown Assay

[Figure: pulldown assay. Labels: Bait Protein with Purification Tag; Incubate with cell extract; Complex formation; Associated Proteins; Affinity Beads; Non-binding Proteins; Elute; Separate proteins; Identify by Mass Spectrometry]

Mass Spectrometry

[Figure: workflow with stages cDNA Cloning, Cell Biology, Protein Purification, Mass Spectrometry. Labels: pENTR, pDEST1 to pDEST5, Protein “Baits”, Handles, Protein “Preys”, MPC]

Mass Spectrometry Procedure

Purified protein complex → Protein separation → Protein digestion → Mass Spec. analysis → Mass spectrum → Database Searching (Peptide Mass Fingerprint Search) → Protein ID
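
The database-searching step of this procedure can be illustrated with a minimal peptide mass fingerprint sketch. It assumes tryptic digestion (cleavage after K or R, but not before P, with no missed cleavages), standard monoisotopic residue masses, neutral peptide masses rather than instrument m/z values, and a simple count of matched peaks as the score. The function names and the 0.2 Da tolerance are illustrative assumptions, not the search software used in the presentation.

```python
import re

# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(sequence):
    """Split after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def pmf_score(peak_list, sequence, tolerance=0.2):
    """Count observed peaks that match a theoretical peptide mass."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(
        any(abs(peak - mass) <= tolerance for mass in theoretical)
        for peak in peak_list
    )


def pmf_search(peak_list, database, tolerance=0.2):
    """Rank database entries (name -> sequence) by matched peaks, best first."""
    scores = {name: pmf_score(peak_list, seq, tolerance)
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Production search engines use probabilistic scoring and allow for missed cleavages and modifications; the peak count here is only meant to show the shape of the comparison.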

Summary of Protein-Protein Interaction Analysis Methods

[Figure: example association diagrams over proteins labeled A through L]

Mass Spectrometry: Yields sets of n-ary associations among proteins (that may represent protein complexes).

Random Yeast Two-Hybrid: Yields sets of binary associations between protein fragments (that may represent protein-protein interactions).
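
The difference between the two result types can be sketched as data structures. In this hypothetical example, Y2H results are stored as unordered pairs and mass spectrometry results as sets of co-purified proteins. The protein names and groupings are made up for illustration and are not read from the figure.

```python
from itertools import combinations

# Binary associations (Y2H): each interaction is an unordered pair.
y2h_pairs = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("G", "H")]}

# N-ary associations (mass spectrometry): each pulldown yields one set of
# co-purified proteins, with no information on which members touch directly.
ms_complexes = [frozenset({"A", "B", "C", "D"}), frozenset({"G", "H", "I"})]


def candidate_pairs(complex_members):
    """All pairs within a complex; each is only a candidate direct contact."""
    return {frozenset(pair) for pair in combinations(sorted(complex_members), 2)}


def supported_by_both(pairs, complexes):
    """Binary interactions whose two partners also co-purify in some complex."""
    return {pair for pair in pairs
            if any(pair <= members for members in complexes)}
```

A four-member complex expands to six candidate pairs, which is why n-ary data alone cannot say which proteins bind each other directly, while binary data alone cannot say which proteins assemble together.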

The Goal: Biological Relevance

[Figure: pathway diagram with interaction overlay. Labels: fibril formation, deposition; Amyloid Plaque, Neurofibrillary Tangle Formation; APOPTOSIS. Underlying Pathway Adopted from http://www.kegg.com]

Legend: New Protein-Protein Interaction; Known Protein-Protein Interaction; Transduction Pathway; Known Pathway Member; Identified Interactor; Novel Transcript; Traditional “Drugable” Enzyme; Other Enzymes

Role of Bioinformatics in Proteomics

Data Collection, Analysis, and Interpretation (Data → Information → Knowledge)

Phases: Data Collection; Automated Data Reduction; Automated Data Analysis; Manual/Experimental Data Analysis

Steps:
• LIMS
• Base Calling
• Mass Peak List Determination
• Blast/PMF Searches
• Identification of Loci/Domains/Proteins
• Identification of binary and n-ary interactions
• Identification of participation in protein complexes
• Identification of protein interaction networks
• Identification of participation in disease pathway
• Identification as potential drug target

Disciplines: Biology; Computational Biology; Software Development; Data Warehousing

Bioinformatics Techniques Used in Proteomics

• Robot programming
• Software engineering
• Database modeling and design
• Data warehouses and Data Marts
• Database federation
• Grid Computing
• Information Visualization
• Graph analysis, graph layout and display
• Hidden Markov Models
• Bayesian networks
• Statistical models
• Signal Processing
• Algorithm development
• …

Objectives of Bioinformatics in Proteomics

1. Automate and manage high-throughput laboratory processes.
2. Retrieve, collect, and store experimental interaction data.
3. Analyze, reduce, and extend experimental interaction data.
4. Mine and visualize interaction analysis results.

Automate and Manage Laboratory Processes

Laboratory Automation
• High-throughput proteomics is not possible without a high degree of laboratory automation.
• Instruments and robotics must interact directly and reliably with LIMS (Laboratory Information Management System).

Automate and Manage Laboratory Processes

Laboratory Information Management System (LIMS)
• High-throughput proteomics is not possible without a sophisticated LIMS.
• The LIMS provides the foundation for all automated data collection, reduction, and analysis.
• Multiple LIMS systems are required (e.g., Y2H, Sequencing, Gene Cloning, Protein Pull-down, Mass Spec., etc.).
• May collect very large amounts of data.
• Fast runtime performance of the LIMS is essential to deal with the high volume of transactions and possible near real-time interactions between the LIMS and robotics and instruments.
• High availability of the LIMS and supporting computer systems is required to support production laboratories and time-critical operations.
• May be one of the most (if not the most) labor intensive (programming, database management, and system management) and expensive software systems in the enterprise.

Automate and Manage Laboratory Processes

Functions of the Laboratory Information Management System (LIMS)
• Track samples consistently through a protocol so that each sample:
– Is identified.
– Is linked to the appropriate results.
– Is linked to the protocol used to process the sample.
– Is linked to any related samples, reagents, etc.
– Can be located physically.
• Manage and enforce the protocol used to process a sample.
• Capture laboratory quality control information and provide displays, reports, statistical analyses, etc. to allow management and quality control of the laboratory.
• Provide interfaces for laboratory personnel, robotics, and instruments to support high-throughput operations.
• Capture results directly from laboratory instruments.
• Provide experimental results in a format suitable for analytical programs.
• Provide the interface between analytical systems and instruments (such as Mass Spectrometers) that require real-time (or near real-time) analysis during operation.
• Manage laboratory personnel work lists, incident alerting, reporting and correction, etc.
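
The sample-tracking and protocol-enforcement functions can be sketched with a small in-memory record. This is a hypothetical illustration only: the class, field names, and step names are assumptions, and a production LIMS would keep this state in a database rather than in Python objects.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sample:
    sample_id: str                      # unique identifier
    protocol: List[str]                 # ordered step names the sample must follow
    location: str                       # physical location, e.g. plate and well
    parent_ids: List[str] = field(default_factory=list)   # related samples
    completed: List[str] = field(default_factory=list)    # steps done so far
    results: dict = field(default_factory=dict)           # step name -> result

    def next_step(self) -> Optional[str]:
        """Return the next protocol step, or None when the protocol is done."""
        if len(self.completed) < len(self.protocol):
            return self.protocol[len(self.completed)]
        return None

    def record(self, step: str, result, new_location: Optional[str] = None):
        """Record a result, enforcing that steps run in protocol order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(
                f"{self.sample_id}: expected {expected!r}, got {step!r}")
        self.completed.append(step)
        self.results[step] = result
        if new_location is not None:
            self.location = new_location
```

Rejecting out-of-order steps at the point of data entry is one way to "manage and enforce the protocol": a robot or workstation that reports a step the sample is not ready for gets an error instead of silently corrupting the record.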

Automate and Manage Laboratory Processes

LIMS Architecture

[Figure: architecture diagram. Components:
• LIMS Server (Java Socket Application)
• Lab Workstations (Java Application)
• Robots or Instruments
• Web-based Management Clients (Servlets, JSP, CGI Script)
• Web Application Server
• LIMS Database(s)
• LIMS Data Warehouse(s) (ODS)
• Analysis Databases
Connection labels: SQL Net, XML]

Collect, Store, and Retrieve Experimental Data

Yeast two-hybrid Data
• Electropherograms for sequence forward and reverse reads
• Sequences and sequence quality scores from base-calling
• Robot/Instrument Operational Parameters
• Quality control data
– Distributions of positive colonies within a search
– Distributions of sequencing reaction success/failure within a plate.

Yeast two-hybrid Data Collection Challenges
• Transmission of electropherograms from remote sequencing facility and associated error handling.
• Relating/correlating data received from remote sequencing facility with LIMS data.
• Archival of electropherograms.
• Retrieval of archived electropherograms.

Collect, Store, and Retrieve Experimental Data

Mass Spectrometry Data
• Spectrograms
– Multiple Instruments (MALDI-TOF, Electrospray/Ion Trap, etc.)
– Multiple spectrogram types (MS, MS/MS)
– Individual samples may be analyzed with multiple instruments, mass spectrogram types.
– False Positive/Contamination Control Sample Spectrograms
• Mass Peak Lists derived from spectrograms
• Mass Spectrometry Instrument Operational Parameters

Mass Spectrometry Data Collection Challenges
• Individual experiments will generate many spectrograms.
• Interfacing with instrument to retrieve spectrograms and mass peak lists.
• Archival of spectrograms and mass peak lists
• Retrieval of archived spectrograms and mass peak lists

Collect, Store, and Retrieve Experimental Data

External Data Sources
• NCBI LocusLink, RefSeq, GenBank, …
• SwissProt, PFAM, …
• Gene Ontology, …
• KEGG, …
• PubMed, Manually curated papers, …

External Data Sources Challenges
• Wide variety of data formats.
• Integrating or federating disparate data sources with internal databases.
• Sometimes questionable quality of data.
• Data sources frequently change/evolve
– Changes may invalidate previous analysis results.
– May require analysis databases to support versioning of results.

Analyze, Reduce, and Extend Experimental Data

• The goal of data analysis is to extract or discover biological relevance from the raw data.
• Raw data must be “cleaned”, filtered, and transformed
– Vector/adaptor identification & clipping
– Sequence assembly
– Consensus sequence identification
– Peptide mass fingerprint (PMF) searching
– False positive detection/filtering.
• Data representations must be modeled and developed.
– How to represent interaction data?
  • Sequences? Electropherograms? Mass Peak Lists?
  • Interactions? Pathways? Sequence Annotations?
  • Many other biological concepts / processes / functions?
– How to organize data structures to enable querying (analysis) involving
  • Many tables
  • >1 million rows in some tables
  • filtering, aggregation, and computation of data
• Analysis algorithms must be developed/adapted.
• Statistical models must be developed/validated.

Example: consequences of naïve data modeling

Example: Y2H Data Analysis Process Flow

Input: Y2H Laboratory

1. Send/Receive Lab Sequence: Track Sequence Submitted; Versioning
2. Perform Basecalling: Sequence String; Quality Score; Quality Matrix
3. Perform QC and Clean Lab Sequence: Failed Requeue; Vector Clipping; Repeat Masking; Low Quality Filter
4. Annotate/Identify Lab Sequences: BLAST, Parameters, Version; Homologous Seqs, Splice Variants; Domain Search
5. Construct Interaction Pair: Frequency of Interaction; Confidence Level; Collect False Positive, Self Activators
6. Construct Interaction Map: Visualization; Query; Compare Difference
7. Integrate External Evidence: Gene Expression; Pathway; Disease
8. Perform Downstream Analysis
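
The "Construct Interaction Pair" step of this flow can be sketched as follows, assuming each sequenced positive colony has already been reduced to a (bait, prey) identification. The hit format, the self-activator set, and the use of raw observation counts as the "frequency of interaction" are assumptions made for illustration.

```python
from collections import Counter


def build_interaction_pairs(hits, self_activators=frozenset()):
    """Count how often each bait-prey pair was observed.

    hits: iterable of (bait, prey) tuples, one per sequenced positive colony.
    self_activators: baits known to activate the reporter on their own;
    their hits are set aside for review rather than counted.
    """
    counts = Counter()
    set_aside = []
    for bait, prey in hits:
        if bait in self_activators:
            set_aside.append((bait, prey))
        else:
            counts[(bait, prey)] += 1
    return counts, set_aside
```

A pair observed in several independent colonies would typically earn more confidence than a singleton, but how counts map to a confidence level is a modeling decision this sketch leaves open.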

Dealing with False Positives

• False positives will always be generated.
– Y2H
  • “Self-activating” baits.
  • “Promiscuous” preys.
– Mass Spectrometry
  • Proteins that interact directly with affinity beads.
  • Proteins that interact directly with affinity tags.
  • Contaminants.
• False positives are very hard to detect and distinguish from real positives.
• False positives must be addressed both biologically and informatically:
– Known false positives can be “subtracted” from Y2H AD/BD libraries before experiments.
– Mass spectrometry control experiments with affinity beads, affinity tags, and background contaminants can be “subtracted” from results.
– Known false positives can be “subtracted” during analysis.
– Statistical tests can be developed to help identify possible false positives during analysis.
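
Two of the informatic measures above, subtracting known false positives and flagging suspiciously promiscuous preys, can be sketched as follows. The function names and the threshold of three distinct baits are arbitrary assumptions; the presentation does not state which statistical tests were used.

```python
from collections import defaultdict


def subtract_known(interactions, known_false_positives):
    """Drop (bait, prey) pairs whose bait or prey is a known false positive."""
    return [(b, p) for b, p in interactions
            if b not in known_false_positives
            and p not in known_false_positives]


def flag_promiscuous_preys(interactions, max_baits=3):
    """Return preys observed with more than max_baits distinct baits."""
    baits_per_prey = defaultdict(set)
    for bait, prey in interactions:
        baits_per_prey[prey].add(bait)
    return {prey for prey, baits in baits_per_prey.items()
            if len(baits) > max_baits}
```

A fixed cutoff is the crudest possible test. A flagged prey is a candidate for curation, not proof of a false positive, since genuine hub proteins also bind many partners.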

Mine and Visualize the Results of Analysis

• Proteomics-specific data mining tools are required to extract meaningful knowledge from massive amounts of data.
– Flexible searching capabilities.
– Flexible filters to reduce the amount of data.
– Multiple views of the data.
– Ad-hoc query tools for unanticipated data mining needs.
– Data warehouses and/or data marts are required to support data mining without impacting performance sensitive LIMS and analytic systems.
• Visualization tools are required to visually organize the data and reveal meaningful patterns.
– Quality control visualizations.
– Interaction network visualizations.
– Interaction network visualizations with experimental data overlays.
– Disease and metabolic pathway visualizations with interaction network overlays.

Quality Control Visualization (1)

[Figure: scatter plot. x-axis: SEARCHID (ticks from 24148 to 39413); y-axis: 0 to 160]

Quality Control Visualization (2)

[Figure: Plate-by-plate Sequencing Purity Monitor. One panel per plate (plate IDs RA0000055 through RB0000136); wells shown by column (1 to 24) and row (A to P)]

Quality Control Visualization (3)

Y2H Interaction Map with Curated Promiscuous Protein Annotation

[Figure: interaction map with preys on one axis. Interacting preys and interacting baits are highlighted with their pronet annotation. Legend: Prey Annotated, Bait Annotated]

Interaction Network Sub-Graph Visualization

Y2H Interaction Network Sub-Graph Visualization with Protein Pull-down Overlay

[Figure: sub-graph with nodes labeled loc21, loc22, loc23, loc24, loc25]

Pathway with Interaction Network Annotation

[Figure: pathway diagram with interaction network annotation. Labels: fibril formation, deposition; Amyloid Plaque, Neurofibrillary Tangle Formation; APOPTOSIS. Underlying Pathway Adopted from http://www.kegg.com]

Legend: New Protein-Protein Interaction; Known Protein-Protein Interaction; Transduction Pathway; Known Pathway Member; Identified Interactor; Novel Transcript; Traditional “Drugable” Enzyme; Other Enzymes

Acknowledgements