Bioinformatics @ WU Lab
SIG NewGrad
Department of Computer & Information Sciences
September 17, 2012
Cathy H. Wu, Ph.D.
Edward G. Jefferson Chair and Director, Center for Bioinformatics & Computational Biology (CBCB)
Director, Protein Information Resource (PIR)
Departments of Computer & Information Sciences and of Biological Sciences
[email protected] Biotechnology Institute, 15 Innovation Way, Suite 205
University of Delaware
What is Bioinformatics?
NIH Working Definition (2002): Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
computer (information) + mouse (biology) => Bioinformatics
An Emerging (and Expanding) Field Where Biological and Computational Disciplines Converge
Human Genome Project
The Human Genome Project has revolutionized how biologists view and practice biology.
• Discovery science introduced the possibility of global informational analyses.
• A genetics parts list of human genes and control region sequences emerged.
• The idea that biology is an informational science with three major types of information emerged: DNA/genes, proteins, and biological systems.
• Tools for high-throughput quantitative measurements of biological information were developed.
• Computer science, mathematics, and statistics were employed to store, analyze, integrate, model and disseminate biological information.
• Model organisms were used as Rosetta Stones for deciphering complex biological systems in humans.
Genomics/Proteomics & Systems Biology
• The Birth of Omics
– From Genes to Genomes
– From Proteins to Proteomes
– From Interactions to Interactomes
– From Metabolites to Metabolomes
– …
• Systems Biology
– From Molecules to Pathways and Networks
– From Cells to Tissues and Organisms
– From Individuals to Systems and Communities
The driving force in 20th century biology has been reductionism:
– From population to individual
– From individual to cell
– From cell to biomolecule
– From biomolecule to genome
– From genome to genome sequence
With the publication of genome sequences, reductionist biology has reached its endpoint.
The driving force for 21st century biology is integration:
– Integrating activity of genes and regulators into regulatory networks
– Interactions of amino acids into protein folding predictions
– Interactions of metabolites into metabolic networks
– Interactions of cells into organisms
– Interactions of individuals into ecosystems
Sequencing-Driven Biology
[Figure: sequencing platforms, Illumina HiSeq and PacBio RS]
• Next-Generation Sequencing (NGS)
– Sequencing is no longer a specialty tool of molecular biologists
– Permeates all fields of biology
– Affordability is encouraging investigators to use it
• Bioinformatics Bottleneck
– Researchers are often ill prepared for the flood of data
– Bioinformatics is often more costly than sequencing
– Need a pool of trained, bioinformatics-savvy researchers
CBCB
Promote, coordinate and support interdisciplinary activities in Bioinformatics & Computational Biology
http://bioinformatics.udel.edu/
• Research: Foster interdisciplinary, cross-campus and inter-institutional research collaborations synergistic to UD strategic areas
• Education: Establish graduate degree programs
– Fall 2010: Master's Program in Bioinformatics & Computational Biology
– Fall 2012: PhD program in Bioinformatics & Systems Biology
• Core: Provide scientific expertise and infrastructure support in Bioinformatics & Computational Biology for the Delaware research and education community
• > 60 affiliated faculty from five Colleges
– Engineering (CoE), Arts & Sciences (CAS), Agriculture & Natural Resources (CANR), Earth, Ocean & Environment (CEOE), Health Sciences (CHS)
NGS (Next-Gen Sequencing) Data Analysis
Organize, Analyze, Visualize
Short Reads Analysis Pipelines for
• RNA-Seq
• miRNA
• De novo Genome Assembly
• Reference Mapping
• Genomic Structural Variation: SNP/Indel/CNV
• Reduced Representation Library
• Amplicon Library (16S rRNA)
• Metagenome
• Metatranscriptome
Bioinformatics Core Infrastructure
• High Performance Computing: BioHen Cluster
– Hardware: 280 cores, several high RAM nodes
– Open resource with common bioinformatics tools
– User-supported model for BioHen compute nodes
– Storage: 20 TB shared storage
• Core Analysis Servers
– Service dedicated (e.g., NGS); high RAM configurations
• Software Support
– Commercial and open source software tools
• Data Center
– Database design and hosting; web portal hosting
• CAVE 3D Visualization Studio
– Interactive 3D Projections (7'x15' projection)
– Molecular modeling/visualization, bioimaging, virtual dissection/surgery
Protein Information Resource (PIR)
Integrated Bioinformatics Resource for Genomic, Proteomic & Systems Biology Research
http://ProteinInformationResource.org
• Research on protein structure-function, omics data integration/visualization, biomedical text mining, biomedical ontology, computational systems biology
• Bioinformatics framework for data management and analysis
• Broad public data dissemination and community standards development
– >10 million web hits per month from >100,000 unique sites worldwide
• National and international collaborative networks
– UniProt Consortium: Central resource of protein sequence and function
– Protein Ontology Consortium, BioCreative Consortium
UniProt
Central Resource of Protein Sequence and Function
http://www.uniprot.org
• International Consortium
– Protein Information Resource (PIR)
– European Bioinformatics Institute (EBI)
– Swiss Institute of Bioinformatics (SIB)
• Unifies PIR-PSD, Swiss-Prot, TrEMBL Protein Sequence Databases
iProClass Integrated Protein Database
• Data integration from >160 databases
• Underlying data warehouse for protein ID, gene/protein name & bibliography mapping
• Integration of protein family, function, structure for functional annotation
• Rich link (link + summary) for value-added reports of UniProt proteins
[Figure: iProClass, an Integrated Protein Knowledgebase, linked to source databases grouped by category]
– Protein Sequence: UniProt, UniRef, UniParc, RefSeq, GenPept, …
– Gene/Genome: GenBank/EMBL/DDBJ, LocusLink, UniGene, MGI, TIGR, …
– Structure: PDB, SCOP, CATH, PDBSum, MMDB, …
– Family: PIRSF, InterPro, Pfam, Prosite, COG, …
– Gene Expression: GEO, GXD, ArrayExpress, CleanEx, SOURCE, …
– Function/Pathway: EC-IUBMB, KEGG, BioCarta, EcoCyc, WIT, …
– Disease/Variation: OMIM, HapMap, …
– Protein Expression: Swiss-2DPAGE, PMG, …
– Ontology: GO
– Interaction: DIP, BIND, …
– Taxonomy: NCBI Taxon, NEWT, …
– Literature: PubMed
– Modification: RESID, PhosphoBase, …
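The ID-mapping role of the data warehouse can be illustrated with a minimal sketch. This is not the iProClass schema or API. The class name `CrossRefWarehouse`, its methods, and every identifier in the usage note are hypothetical placeholders. The sketch only shows the general idea of keying many external database IDs to one protein accession and reading them back as a per-protein report.

```python
from collections import defaultdict


class CrossRefWarehouse:
    """Toy cross-reference store (hypothetical, not the iProClass design)."""

    def __init__(self):
        # (database, external_id) -> protein accession
        self._to_accession = {}
        # protein accession -> {database: set of external ids}
        self._from_accession = defaultdict(dict)

    def add(self, accession, database, external_id):
        """Record that external_id in database refers to this protein."""
        self._to_accession[(database, external_id)] = accession
        self._from_accession[accession].setdefault(database, set()).add(external_id)

    def map_to_accession(self, database, external_id):
        """Return the protein accession for an external ID, or None."""
        return self._to_accession.get((database, external_id))

    def report(self, accession):
        """Return all cross-references for one protein, sorted for display."""
        refs = self._from_accession.get(accession, {})
        return {db: sorted(ids) for db, ids in sorted(refs.items())}
```

For example, after `w.add("ACC0001", "Structure", "STR-9")` and `w.add("ACC0001", "Family", "FAM-2")` (made-up IDs), `w.map_to_accession("Structure", "STR-9")` returns `"ACC0001"` and `w.report("ACC0001")` lists both links grouped by database.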
Annotation Extraction
• Literature survey and manual tagging for evidence attribution
• Training sets for different types of functional sites
• Natural language processing and annotation extraction
RLIMS-P: Rule-based Literature Mining System for Phosphorylation (http://www.proteininformationresource.org/pirwww/iprolink/rlimsp.shtml)
• Extract phosphorylation info (substrate, kinase, site) from PubMed abstracts
1. Summary table: top-ranking annotation
2. Report: Full annotation with evidence tagging
• Extensions: (i) RLIMS-P 2.0 full-text version; (ii) RLIMS-PTM for acetylation, methylation, glycosylation
An online literature mining tool for protein phosphorylation. Yuan X, Hu ZZ, Wu HT, Torii M, Narayanaswamy M, Ravikumar KE, Shanker VK, Wu CH. (2006) Bioinformatics 22, 1668-1669
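To make the idea of rule-based extraction concrete, here is a deliberately tiny sketch. It is not the RLIMS-P rule set, which the slides do not describe. The single pattern, the function name `extract_phosphorylation`, and the example sentence are assumptions chosen for illustration. It handles only sentences of the form "KINASE phosphorylates SUBSTRATE at/on SITE".

```python
import re

# One toy rule (an assumption, not an RLIMS-P rule):
#   "<kinase> phosphorylates <substrate> at|on <Ser|Thr|Tyr><position>"
PATTERN = re.compile(
    r"(?P<kinase>[A-Za-z0-9-]+)\s+phosphorylates\s+"
    r"(?P<substrate>[A-Za-z0-9-]+)\s+(?:at|on)\s+"
    r"(?P<site>(?:Ser|Thr|Tyr)-?\d+)"
)


def extract_phosphorylation(sentence):
    """Return a list of {kinase, substrate, site} dicts found in the sentence."""
    return [
        {
            "kinase": m.group("kinase"),
            "substrate": m.group("substrate"),
            "site": m.group("site"),
        }
        for m in PATTERN.finditer(sentence)
    ]
```

Calling `extract_phosphorylation("Akt phosphorylates BAD at Ser136 in vitro.")` yields one record with kinase `Akt`, substrate `BAD` and site `Ser136`. A sentence that does not fit the pattern yields an empty list, which is why a real system needs many rules plus linguistic preprocessing.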
eFIP: an integrated system for mining Functional Impact of Phosphorylation from literature
1. Gene/protein name recognition & document retrieval
2. Extraction of phosphorylation information (kinase, substrate, p-site)
3. Extraction of protein-protein interaction
4. Extraction of functional impact of phosphorylated protein and interaction
5. Document ranking and evidence tagging on abstracts
6. Interactive system with user interface for curator/user validation
The eFIP system for text mining of protein interaction networks of phosphorylated proteins. Tudor CO, Arighi CN, Wang Q, Wu CH, Shanker VK. (2012) Database (in press)
Discovery from Literature Mining
• Distinct phosphorylated forms of a protein may have different interacting proteins, leading to different subcellular locations, functions and pathways
• Literature mining connects the impact (cytosolic vs mitochondria; apoptosis vs cell survival) to different BAD forms, and, through kinases, links BAD to pathways
=> construction of phosphorylation networks: iPTMnet
Bioinformatics Framework
• Data Mining: iProClass database for molecular and omics data integration
• Text Mining: RLIMS-P/eFIP system for knowledge extraction from literature
• Ontology: PRO for knowledge representation of PTM forms
• Web portal linking data and analysis/visualization tools for scientific queries (http://proteininformationresource.org/iPTMnet)
Linking Text Mining and Data Mining for Biomedical Knowledge Discovery
• NIH/NLM: National digital resource linking text mining with data mining in the systems biology context to decipher knowledge from a plethora of information in the scientific literature and databases
– UD Team (CBCB, CIS-Shanker, Carterette, Decker, ANFS-Schmidt)
• NSF/DBI: BioCreative Workshop Series
– Challenge evaluations of text mining tools
– Batch tasks: Precision and recall
– Interactive annotation task: Utility and usability
– Standard discussion: tool integration and data exchange
– International BioCreative Consortium
• NSF/DBI: Integrative Bioinformatics for Knowledge Discovery of PTM Networks
– PTM network of enzyme-substrate relationships and protein-protein interactions from literature mining
– UD Team (CBCB, CIS-Shanker, PLSC-Lee)
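Precision and recall, the metrics named for the batch tasks, can be computed as in the sketch below. It assumes that predictions and gold-standard annotations are compared as exact-match items (for instance kinase, substrate, site tuples). The actual BioCreative scoring rules are not given in the slides, and the function name `precision_recall_f1` is an assumption.

```python
def precision_recall_f1(predicted, gold):
    """Score exact-match predictions against a gold standard.

    precision = TP / number of predictions
    recall    = TP / number of gold annotations
    F1        = harmonic mean of precision and recall
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

If a tool returns four annotations and two of them are the only two in the gold standard, precision is 0.5 and recall is 1.0: every true annotation was found, but half of the output is noise.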
Protein Ontology (PRO)
Ontology for semantic integration of heterogeneous biological data
• OBO Foundry
– Establishes rules and best practices to create a suite of orthogonal interoperable reference ontologies
• Protein Ontology
– Represents protein classes, protein forms (alternative splicing, cleavage, post-translational modifications, genetic variations) and protein complexes
– Provides precise annotation of specific protein forms/classes for accurate and consistent data mapping, integration and analysis
– Facilitates reasoning by grouping equivalent forms from different organisms to generate hypotheses about human biology
Why PRO
• Provides formalization to support precise annotation of specific protein classes/forms/complexes, allowing accurate data mapping, integration, analysis
• Allows specification of relationships between PRO and other ontologies, such as GO, SO (Sequence Ontology), PSI-MOD, ChEBI, CL (Cell Ontology)
• Provides stable unique identifiers to distinct protein types
• Provides a formal structure to support computer-based reasoning based on homology and shared protein attributes, including "ortho-isoform," "ortho-PTM"
Representation of protein forms & complexes in biological/network context
TGF-β Signaling Pathway
Hierarchical View: smad2 protein forms & complexes
Network View: smad2 protein forms & complexes
Connecting protein forms and complexes with annotation => Modeling biology
iPTMnet (http://proteininformationresource.org/iPTMnet)
iPTM Enzyme-Substrate Database
• Literature-curated kinase-substrate data from PhosphoSitePlus, Phospho.ELM, HPRD, UniProtKB, Protein Ontology (PRO)
• # of substrate: 14,000; site: 80,000; kinase: 850
• # of substrate/site-kinase pairs: 16,000
• # of curated phosphorylation papers: 15,000
RLIMS-P/eFIP Text Mining
• Full-scale processing of PubMed abstracts: 22 million
• Phosphorylation papers identified by RLIMS-P: 143,000 (0.65% of PubMed)
• Phosphorylation-PPI related papers identified by eFIP: 10,000 (7% of P-papers)
• Interactive system for curator/user to verify pre-computed text mining results
iPTMnet Website
• Searching iPTM database: entry report mapped to UniProt and PRO with enzyme-substrate data and text mining result from associated PMIDs
• Text Mining using RLIMS-P/eFIP: PMID or protein-based literature search
Kinase-substrate relationship & phosphorylated protein-protein interaction
=> Phosphorylation Network connecting different phosphorylated forms of given proteins with their kinases and interaction partners
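The two percentages quoted for the text mining results follow directly from the paper counts on the slide. The short check below recomputes them. The helper name `percent` is an assumption, and the counts are the rounded figures shown above.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


PUBMED_ABSTRACTS = 22_000_000       # abstracts processed
PHOSPHORYLATION_PAPERS = 143_000    # identified by RLIMS-P
PHOSPHO_PPI_PAPERS = 10_000         # identified by eFIP

# Share of PubMed that mentions phosphorylation (slide: 0.65%)
share_of_pubmed = percent(PHOSPHORYLATION_PAPERS, PUBMED_ABSTRACTS)
# Share of phosphorylation papers with PPI information (slide: 7%)
share_of_p_papers = percent(PHOSPHO_PPI_PAPERS, PHOSPHORYLATION_PAPERS)

print(f"{share_of_pubmed:.2f}% of PubMed; {share_of_p_papers:.0f}% of P-papers")
```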
iPTM Network
• Substrate-centric: What PTM forms of a protein and their modifying enzymes are known?
• Enzyme-centric: What substrates are known for a given PTM enzyme?
• Interaction: What interacting partners are known for each PTM form of a given protein?
• Pathway: What protein modifications and enzymes are known in a given signaling pathway?
Coupled with functional annotation and biological context (homology, disease, tissue/cell, ...)
=> Hypothesis generation and discovery
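The substrate-centric, enzyme-centric and interaction questions above map naturally onto lookups in a small graph-like structure. The sketch below is one way to organize such data and is not the iPTMnet implementation. The class `PTMNetwork`, its method names, and all protein names in the usage note are hypothetical.

```python
from collections import defaultdict


class PTMNetwork:
    """Toy enzyme-substrate network supporting three of the query types."""

    def __init__(self):
        self._by_substrate = defaultdict(set)  # substrate -> {(site, enzyme)}
        self._by_enzyme = defaultdict(set)     # enzyme -> {(substrate, site)}
        self._partners = defaultdict(set)      # (substrate, site) -> {partner}

    def add_modification(self, enzyme, substrate, site):
        """Record that enzyme modifies substrate at site."""
        self._by_substrate[substrate].add((site, enzyme))
        self._by_enzyme[enzyme].add((substrate, site))

    def add_interaction(self, substrate, site, partner):
        """Record a partner that binds the form of substrate modified at site."""
        self._partners[(substrate, site)].add(partner)

    def ptm_forms(self, substrate):
        """Substrate-centric: known (site, enzyme) pairs for a protein."""
        return sorted(self._by_substrate.get(substrate, ()))

    def substrates_of(self, enzyme):
        """Enzyme-centric: known (substrate, site) pairs for an enzyme."""
        return sorted(self._by_enzyme.get(enzyme, ()))

    def partners_of(self, substrate, site):
        """Interaction: partners known for one PTM form of a protein."""
        return sorted(self._partners.get((substrate, site), ()))
```

With made-up entries such as `net.add_modification("KinaseA", "ProteinX", "Ser10")` and `net.add_interaction("ProteinX", "Ser10", "PartnerP")`, the call `net.ptm_forms("ProteinX")` answers the substrate-centric question and `net.partners_of("ProteinX", "Ser10")` the interaction one. Pathway-level queries would need an extra mapping from proteins to pathways, which is omitted here.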
Omics Systems Biology Data
Genome, Transcriptome, Proteome, Metabolome
DNA microarray; Mass spectrometry; Antibody array (Western blot)
Making sense: Functional Interpretation
Omics Data Analysis
Omics-Based Molecular Target and Biomarker Identification. Hu, Huang, Wu, Jung, Dritschilo, et al. Methods Mol Biol (2011) 719, 547-571
Protein-Centric Data Integration for Functional Analysis of Comparative Proteomics Data. McGarvey, Zhang, Natale, Wu, Huang. Methods Mol Biol (2011) 694, 323-339
iProXpress: integrated Protein eXpression Analysis System
[Figure: iProXpress workflow]
Experiment Data: Gene Expression (Gene ID); IP/2D/MS Proteomic Data (Gene/Protein ID list, Peptide Sequence)
1. Protein Mapping: Gene ID, Protein ID or peptide sequence mapped to UniProtKB ID (UniProt, iProClass)
2. Functional Annotation: Protein Information Matrix (Function, Pathway, Family, …)
3. Expression Profiling: Categorization, Cross-dataset Comparison
Outputs: Function Categorization Chart, Pathway Map, Interaction Map, GO tree visualization, Two-Way Comparison Matrix
=> Knowledge
Breast Cancer Signaling Pathway
Proposed pathway map based on functional profiling, network/pathway analysis and literature mining
[Figure: Early signaling pathways underlying estrogen-induced breast cancer cell apoptosis. Compartments: cytoplasm, nucleus. Labeled processes: integrin signaling and cytoskeleton remodeling; G-protein coupled receptor signaling; cell mobility; cell growth; cell cycle (inhibition); apoptosis; transcription; transcription repression; histone modification; chromatin remodeling. Labeled nodes include E2, ERa, GPR30, NCoA3, Gas, Fak, Paxillin, Integrin, VEGF, CX3CR1, GNAO2, RGS3, Rap1GAP, Rap1a, MEK, ERK, mTOR, Sirt3, ZNF23, ELVAL2, CCND1, CDK1, BAD, IASPP, TLE3, RUNX3, NFkB, PRD15, Stat4, BRPF3, PRPF6, N-CoR-2, SWI/SNF, BRG1, BAF57.]
Skate Genome Project
Genome Sequencing & Annotation of Little Skate (Leucoraja erinacea)
• A model organism for vertebrate phylogeny, sharing characteristics with the human immune, circulatory and nervous systems
• One of 11 non-mammalian organisms strategically selected for sequencing by an NIH advisory panel
– Estimated 3.42 billion base pairs, 49 chromosomes
• Comparative genomics among jawed vertebrates
Skate Genome Project
North East Bioinformatics Collaborative
Collaborative use of specialized resources & expertise in an integrated process
• Little Skate (Leucoraja erinacea) Clones: MDIBL-Mount Desert Island Biological Lab (ME)
• Next-Generation Sequencing: UD DNA Sequencing & Genotyping Center (DE)
• Sequence Assembly: Vermont Genetics Network (VT) with ME, RI
• Sequence Analysis & Annotation: Bioinformatics pipeline at UD CBCB (DE), ME, RI, NH, VT
• Storage & Access of Sequence/Annotation data: Shared data center (DE, VT, ME)
• Public Dissemination: NCBI, SkateBase
• Scientific Discovery: NECC and broad research community
Systems Medicine
• Reductionism focuses on components, and can lose information about time, space, and context; systems medicine focuses on the interactions and dynamics
• Understanding the molecular basis of gene–environment interactions is key to dissecting complex disease processes, such as diabetes and cancer
Systems Biology and Energy & Environment
• Systems Biology
– Plant, microbial communities
– Metagenomics
– Genome scale modeling
• Energy and Environment
– Bioenergy
– Carbon cycle
– Environmental Remediation
PIR Team
– Protein Science Team: Cecilia Arighi, Darren Natale, Lai-Su Yeh, CR Vinayaka, Kati Laiho, Qinghua Wang, Thanemozhi Natarajan, Abhishek Kukreja
– Informatics Team: Hongzhan Huang, Peter McGarvey, Baris Suzek, Leslie Arminski, Natalia Roberts, Chuming Chen, Yongxing Chen, Jing Zhang, Yuqi Wang
– Students and Interns
• Consortium Collaborators
– UniProt: Apweiler, Xenarios and EBI/SIB Teams
– PRO: Smith (SUNY), Blake, Bult (MGI), D'Eustachio (Reactome)
– BioCreative: Hirschman (MITRE), Valencia, Krallinger (CNIO), Lu, Wilbur (NCBI), Cohen (U Colorado)
CBCB Team
– PIR Team at UD
– Center Scientists/Staff: Shawn Polson (Core Coordinator), Katie Lakofsky (Education Coordinator), Susan Phipps (Administrative), Manabu Torii, Oana Tudor
– Graduate Students: Alvaro Gonzalez, Yifan Peng, Luis Lopez, TianChuan Du, Ruoyao Du, Gang Li (CIS), Katie Bi, Sari Khallel, Modupe Adetunji (BINF)
Collaborators
– UD: Vijay Shanker, Keith Decker, Ben Carterette, Li Liao, Jingyi Yu, Hagit Shatkay (CIS), Ulhas Naik (BIO), Carl Schmidt, Larry Cogburn (ANFS), Karl Steiner (ECE), Terry Papoutsakis, Maciek Antoniewicz, Kelvin Lee (ChemE)
– DHSA (Delaware Health Sciences Alliance) [UD, Nemours, CCHC, TJU]
– NECC (North East Cyberinfrastructure Consortium)
– BiND (Bioinformatics Network of Delaware)
Funding Support
• NIH/NHGRI & NIGMS: UniProt: A Centralized Protein Sequence and Function Resource
• NIH/NIGMS: PRO: Protein Ontology in OBO Foundry for Integration of Biomedical Knowledge
• NIH/NLM: Linking Text Mining and Data Mining for Biomedical Knowledge Discovery
• NIH/NCRR: Delaware INBRE: North East Cyberinfrastructure Consortium (NECC)
• NSF/DBI: Integrative Bioinformatics for Knowledge Discovery of PTM Networks
• NSF/DBI: BioCreative Workshops: Linking Text Mining with Ontology and Systems Biology
• DOE: Experimental Systems-Biology Approaches for Clostridia-Based Bioenergy Production
• Delaware Health Sciences Alliance (DHSA): Linking Genotype to Phenotype
• Delaware Bioscience Center for Advanced Technology (CAT): Bioinformatics Optimization for Recombinant Protein Expression for Vaccines and Therapeutics
• UniDel Foundation: UD Center for Bioinformatics and Computational Biology
• NIH/NCRR & NIGMS: Delaware INBRE – Bioinformatics Core
• NSF: Delaware EPSCoR – Bioinformatics Core
NIH: National Institutes of Health; NSF: National Science Foundation; DOE: Department of Energy