Bioinformatics @ WU Lab

SIG NewGrad

Department of Computer & Information Sciences

September 17, 2012

Cathy H. Wu, Ph.D.

Edward G. Jefferson Chair and Director, Center for Bioinformatics & Computational Biology (CBCB)

Director, Protein Information Resource (PIR)

Departments of Computer & Information Sciences and of Biological Sciences

[email protected]

Biotechnology Institute, 15 Innovation Way, Suite 205

University of Delaware

What is Bioinformatics?

NIH Working Definition (2002): Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

computer (information) + mouse (biology) => Bioinformatics

An Emerging (and Expanding) Field Where Biological and Computational Disciplines Converge

Human Genome Project

The Human Genome Project has revolutionized how biologists view and practice biology.

• Discovery science introduced the possibility of global informational analyses.
• A genetics parts list of human genes and control region sequences emerged.
• The idea that biology is an informational science with three major types of information emerged: DNA/genes, proteins, and biological systems.
• Tools for high‐throughput quantitative measurements of biological information were developed.
• Computer science, mathematics, and statistics were employed to store, analyze, integrate, model and disseminate biological information.
• Model organisms were used as Rosetta Stones for deciphering complex biological systems in humans.

Genomics/Proteomics & Systems Biology

• The Birth of Omics
– From Genes to Genomes
– From Proteins to Proteomes
– From Interactions to Interactomes
– From Metabolites to Metabolomes
– …

• Systems Biology
– From Molecules to Pathways and Networks
– From Cells to Tissues and Organisms
– From Individuals to Systems and Communities

The driving force in 20th century biology has been reductionism:

• From population to individual
• From individual to cell
• From cell to biomolecule
• From biomolecule to genome
• From genome to genome sequence

With the publication of genome sequences, reductionist biology has reached its endpoint.

The driving force for 21st century biology is integration:

• Integrating activity of genes and regulators into regulatory networks
• Interactions of amino acids into protein folding predictions
• Interactions of metabolites into metabolic networks
• Interactions of cells into organisms
• Interactions of individuals into ecosystems

Sequencing-Driven Biology

[Images: Illumina HiSeq, PacBio RS]

• Next‐Generation Sequencing (NGS)
– Sequencing no longer specialty tool of Molecular Biologists
– Permeates all fields of biology
– Affordability encouraging investigators to utilize

• Bioinformatics Bottleneck
– Researchers often ill prepared for flood of data
– Bioinformatics often more costly than sequencing
– Need pool of trained bioinformatics‐savvy researchers

CBCB

Promote, coordinate and support interdisciplinary activities in Bioinformatics & Computational Biology

http://bioinformatics.udel.edu/

• Research: Foster interdisciplinary, cross‐campus and inter‐institutional research collaborations synergistic to UD strategic areas
• Education: Establish graduate degree programs
– Fall 2010: Master’s Program in Bioinformatics & Computational Biology
– Fall 2012: PhD program in Bioinformatics & Systems Biology
• Core: Provide scientific expertise and infrastructure support in Bioinformatics & Computational Biology for the Delaware research and education community
• > 60 affiliated faculty from five Colleges
– CoE (Engineering), CAS (Arts & Sciences), Agriculture & Natural Resources (CANR), Earth, Ocean & Environment (CEOE), Health Sciences (CHS)

NGS (Next-Gen Sequencing) Data Analysis

Organize, Analyze, Visualize

Short Reads Analysis Pipelines for
• RNA‐Seq
• miRNA
• De novo Genome Assembly
• Reference Mapping
• Genomic Structural Variation: SNP/Indel/CNV
• Reduced Representation Library
• Amplicon Library (16S rRNA)
• Metagenome
• Metatranscriptome

Bioinformatics Core Infrastructure

• High Performance Computing: BioHen Cluster
– Hardware: 280 cores, several high RAM nodes
– Open resource with common bioinformatics tools
– User‐supported model for BioHen compute nodes
– Storage: 20 TB shared storage
• Core Analysis Servers
– Service dedicated (e.g., NGS); high RAM configurations
• Software Support
– Commercial and open source software tools
• Data Center
– Database design and hosting; web portal hosting
• CAVE 3D Visualization Studio
– Interactive 3D Projections (7’x15’ projection)
– Molecular modeling/visualization, bioimaging, virtual dissection/surgery

Protein Information Resource (PIR)

Integrated Bioinformatics Resource for Genomic, Proteomic & Systems Biology Research

http://ProteinInformationResource.org

• Research on protein structure‐function, omics data integration/visualization, biomedical text mining, biomedical ontology, computational systems biology
• Bioinformatics framework for data management and analysis
• Broad public data dissemination and community standards development
– >10 million web hits per month from >100,000 unique sites worldwide
• National and international collaborative networks
– UniProt Consortium: Central resource of protein sequence and function
– Protein Ontology Consortium, BioCreative Consortium

UniProt

Central Resource of Protein Sequence and Function

http://www.uniprot.org

• International Consortium
– Protein Information Resource (PIR)
– European Bioinformatics Institute (EBI)
– Swiss Institute of Bioinformatics (SIB)
• Unifies PIR‐PSD, Swiss‐Prot, TrEMBL Protein Sequence Databases

iProClass Integrated Protein Database

• Data integration from >160 databases
• Underlying data warehouse for protein ID, gene/protein name & bibliography mapping
• Integration of protein family, function, structure for functional annotation
• Rich link (link + summary) for value‐added reports of UniProt proteins

[Figure: iProClass Integrated Protein Knowledgebase at the center, linked to source databases by category]
– Protein Sequence: UniProt, UniRef, UniParc, RefSeq, GenPept, …
– Gene/Genome: GenBank/EMBL/DDBJ, LocusLink, UniGene, MGI, TIGR, …
– Structure: PDB, SCOP, CATH, PDBSum, MMDB, …
– Family: PIRSF, InterPro, Pfam, Prosite, COG, …
– Gene Expression: GEO, GXD, ArrayExpress, CleanEx, SOURCE, …
– Function/Pathway: EC-IUBMB, KEGG, BioCarta, EcoCyc, WIT, …
– Disease/Variation: OMIM, HapMap, …
– Protein Expression: Swiss-2DPAGE, PMG, …
– Ontology: GO
– Interaction: DIP, BIND, …
– Taxonomy: NCBI Taxon, NEWT
– Literature: PubMed
– Modification: RESID, PhosphoBase, …

Annotation Extraction

• Literature survey and manual tagging for evidence attribution
• Training sets for different types of functional sites
• Natural language processing and annotation extraction

RLIMS‐P

RLIMS-P: Rule-based Literature Mining System for Phosphorylation (http://www.proteininformationresource.org/pirwww/iprolink/rlimsp.shtml)

• Extract phosphorylation info (substrate, kinase, site) from PubMed abstracts
1. Summary table: top‐ranking annotation
2. Report: Full annotation with evidence tagging
• Extensions: (i) RLIMS‐P 2.0 full‐text version; (ii) RLIMS‐PTM for acetylation, methylation, glycosylation

An online literature mining tool for protein phosphorylation. Yuan X, Hu ZZ, Wu HT, Torii M, Narayanaswamy M, Ravikumar KE, Shanker VK, Wu CH. (2006) Bioinformatics 22, 1668‐1669
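
To make the idea of rule-based extraction of (kinase, substrate, site) triples concrete, here is a minimal sketch. It is not the RLIMS-P rule set: the two sentence patterns, the function name `extract_phosphorylation`, and the protein names `KinaseA` and `ProteinX` are all assumptions made up for illustration.

```python
import re

# Toy site pattern: Ser/Thr/Tyr followed by a residue number (e.g. Ser-10, Tyr5).
SITE = r"(?P<site>(?:Ser|Thr|Tyr)-?\d+)"

# Two hypothetical sentence templates; a real system needs far richer rules.
PATTERNS = [
    re.compile(rf"(?P<kinase>\w+) phosphorylates (?P<substrate>\w+) (?:at|on) {SITE}"),
    re.compile(rf"(?P<substrate>\w+) is phosphorylated by (?P<kinase>\w+) (?:at|on) {SITE}"),
]


def extract_phosphorylation(sentence):
    """Return a dict with kinase, substrate and site, or None if no rule matches."""
    for pattern in PATTERNS:
        match = pattern.search(sentence)
        if match:
            return {
                "kinase": match.group("kinase"),
                "substrate": match.group("substrate"),
                "site": match.group("site"),
            }
    return None


# Hypothetical example sentences (placeholder protein names, not real findings).
print(extract_phosphorylation("KinaseA phosphorylates ProteinX at Ser-10."))
print(extract_phosphorylation("ProteinX is phosphorylated by KinaseB on Thr20."))
```

Both the active and passive templates yield the same three fields, which is what a summary table of top-ranking annotations would need.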

eFIP: an integrated system for mining Functional Impact of Phosphorylation from literature

1. Gene/protein name recognition & document retrieval
2. Extraction of phosphorylation information (kinase, substrate, p‐site)
3. Extraction of protein‐protein interaction
4. Extraction of functional impact of phosphorylated protein and interaction
5. Document ranking and evidence tagging on abstracts
6. Interactive system with user interface for curator/user validation

The eFIP system for text mining of protein interaction networks of phosphorylated proteins. Tudor CO, Arighi CN, Wang Q, Wu CH, Shanker VK. (2012) Database (in press)

Discovery from Literature Mining

• Distinct phosphorylated forms of a protein may have different interacting proteins, leading to different subcellular locations, functions and pathways
• Literature mining connects the impact (cytosolic vs mitochondria; apoptosis vs cell survival) to different BAD forms, and, through kinases, links BAD to pathways => construction of phosphorylation networks: iPTMnet

Bioinformatics Framework

• Data Mining: iProClass database for molecular and omics data integration
• Text Mining: RLIMS‐P/eFIP system for knowledge extraction from literature
• Ontology: PRO for knowledge representation of PTM forms
• Web portal linking data and analysis/visualization tools for scientific queries

(http://proteininformationresource.org/iPTMnet)

Linking Text Mining and Data Mining for Biomedical Knowledge Discovery

• NIH/NLM: National digital resource linking text mining with data mining in the systems biology context to decipher knowledge from a plethora of information in the scientific literature and databases
  UD Team (CBCB, CIS‐Shanker, Carterette, Decker, ANFS‐Schmidt)
• NSF/DBI: BioCreative Workshop Series
– Challenge Evaluations of text mining tools
– Batch tasks: Precision and recall
– Interactive Annotation task: Utility and usability
– Standard discussion: tool integration and data exchange
  International BioCreative Consortium
• NSF/DBI: Integrative Bioinformatics for Knowledge Discovery of PTM Networks
– PTM network of enzyme‐substrate relationships and protein‐protein interactions from literature mining
  UD Team (CBCB, CIS‐Shanker, PLSC‐Lee)
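
The batch tasks are scored by precision and recall. As a reminder of how those two numbers are computed, here is a short sketch that compares a tool's predicted annotations with a gold standard. The function name and the annotation IDs are hypothetical, and actual BioCreative scoring scripts may differ in detail.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted annotations with gold-standard annotations.

    precision = correct predictions / all predictions
    recall    = correct predictions / all gold annotations
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical annotation IDs: the tool returns 4, the curators listed 5, 3 overlap.
tool_output = {"a1", "a2", "a3", "a9"}
gold_standard = {"a1", "a2", "a3", "a4", "a5"}

p, r, f = precision_recall_f1(tool_output, gold_standard)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

In this toy case the tool is right three times out of four (precision 0.75) but finds only three of the five gold annotations (recall 0.60).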

Protein Ontology (PRO)

Ontology for semantic integration of heterogeneous biological data

• OBO Foundry
– Establishes rules and best practices to create a suite of orthogonal interoperable reference ontologies
• Protein Ontology
– Represents protein classes, protein forms (alternative splicing, cleavage, post‐translational modifications, genetic variations) and protein complexes
– Provides precise annotation of specific protein forms/classes for accurate and consistent data mapping, integration and analysis
– Facilitates reasoning by grouping equivalent forms from different organisms to generate hypotheses about human biology

Why PRO

• Provides formalization to support precise annotation of specific protein classes/forms/complexes, allowing accurate data mapping, integration, analysis
• Allows specification of relationships between PRO and other ontologies, such as GO, SO (Sequence Ontology), PSI‐MOD, ChEBI, CL (Cell Ontology)
• Provides stable unique identifiers to distinct protein types
• Provides a formal structure to support computer‐based reasoning based on homology and shared protein attributes, including “ortho‐isoform,” “ortho‐PTM”

Representation of protein forms & complexes in biological/network context

TGF‐β Signaling Pathway

Hierarchical View: smad2 protein forms & complexes

Network View: smad2 protein forms & complexes

Connecting protein forms and complexes with annotation => Modeling biology

iPTMnet (http://proteininformationresource.org/iPTMnet)

iPTM Enzyme‐Substrate Database
• Literature‐curated kinase‐substrate data from PhosphoSitePlus, Phospho.ELM, HPRD, UniProtKB, Protein Ontology (PRO)
• # of substrate: 14,000; site: 80,000; kinase: 850
• # of substrate/site‐kinase pairs: 16,000
• # of curated phosphorylation papers: 15,000

RLIMS‐P/eFIP Text Mining
• Full‐scale processing of PubMed abstracts: 22 million
• Phosphorylation papers identified by RLIMS‐P: 143,000 (0.65% of PubMed)
• Phosphorylation‐PPI related papers identified by eFIP: 10,000 (7% of P‐papers)
• Interactive system for curator/user to verify pre‐computed text mining results

iPTMnet Website
• Searching iPTM database: entry report mapped to UniProt and PRO with enzyme‐substrate data and text mining result from associated PMIDs
• Text Mining using RLIMS‐P/eFIP: PMID or protein‐based literature search

Kinase‐substrate relationship & phosphorylated protein‐protein interaction => Phosphorylation Network connecting different phosphorylated forms of given proteins with their kinases and interaction partners
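
The two percentages quoted for the text mining results follow directly from the counts on the slide. A quick arithmetic check, where the helper name `percent` is just an illustrative choice:

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Counts as stated on the slide.
pubmed_abstracts = 22_000_000
phospho_papers = 143_000       # identified by RLIMS-P
phospho_ppi_papers = 10_000    # identified by eFIP

share_of_pubmed = percent(phospho_papers, pubmed_abstracts)
share_of_phospho = percent(phospho_ppi_papers, phospho_papers)

print(f"{share_of_pubmed:.2f}% of PubMed abstracts")
print(f"{share_of_phospho:.0f}% of phosphorylation papers")
```

So roughly one abstract in 150 mentions phosphorylation, and about one in 14 of those also reports a related protein-protein interaction.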

iPTMNetwork

• Substrate‐centric: What PTM forms of a protein and their modifying enzymes are known?
• Enzyme‐centric: What substrates are known for a given PTM enzyme?
• Interaction: What interacting partners are known for each PTM form of a given protein?
• Pathway: What protein modifications and enzymes are known in a given signaling pathway?

Coupled with functional annotation and biological context (homology, disease, tissue/cell..)

=> Hypothesis generation and discovery
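
The substrate-centric and enzyme-centric questions above amount to indexing the same enzyme-substrate records in two directions. The sketch below shows that idea on a tiny in-memory list. The record layout, the function name `build_indexes`, and the placeholder names are assumptions for illustration, not the iPTMnet schema or real data.

```python
from collections import defaultdict

# Hypothetical (enzyme, substrate, site) records with placeholder names.
RECORDS = [
    ("KinaseA", "ProteinX", "Ser-10"),
    ("KinaseB", "ProteinX", "Thr-20"),
    ("KinaseA", "ProteinY", "Tyr-5"),
]


def build_indexes(records):
    """Index the same records by substrate and by enzyme."""
    by_substrate = defaultdict(list)  # substrate -> [(site, enzyme), ...]
    by_enzyme = defaultdict(list)     # enzyme -> [(substrate, site), ...]
    for enzyme, substrate, site in records:
        by_substrate[substrate].append((site, enzyme))
        by_enzyme[enzyme].append((substrate, site))
    return by_substrate, by_enzyme


by_substrate, by_enzyme = build_indexes(RECORDS)

# Substrate-centric: which sites of ProteinX are modified, and by which enzymes?
print(by_substrate["ProteinX"])

# Enzyme-centric: which substrates and sites are known for KinaseA?
print(by_enzyme["KinaseA"])
```

Interaction and pathway queries would add further record types (for example, PTM form to binding partner) indexed in the same way.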

Omics Systems Biology Data

Genome, Transcriptome, Proteome, Metabolome

• DNA microarray
• Mass spectrometry
• Antibody array (Western blot)

Making sense: Functional Interpretation

Omics Data Analysis

Omics‐Based Molecular Target and Biomarker Identification. Hu, Huang, Wu, Jung, Dritschilo, et al. Methods Mol Biol (2011) 719, 547‐571

Protein‐Centric Data Integration for Functional Analysis of Comparative Proteomics Data. McGarvey, Zhang, Natale, Wu, Huang. Methods Mol Biol (2011) 694, 323‐339

iProXpress: integrated Protein eXpression Analysis System

[Figure: iProXpress workflow, from Experiment Data to Information to Knowledge]

Experiment Data
– Gene Expression: Gene ID
– IP/2D/MS Proteomic Data: Gene/Protein ID list, Peptide Sequence

1. Protein Mapping: Gene ID, Protein ID, Peptide sequence => UniProtKB ID (UniProt, iProClass)
2. Functional Annotation: Protein Information Matrix (Function, Pathway, Family, …)
3. Expression Profiling: Categorization, Cross-dataset Comparison
– Function Categorization Chart, Pathway Map, Interaction Map, GO tree visualization
– Two-Way Comparison Matrix

Breast Cancer Signaling Pathway

Proposed pathway map based on functional profiling, network/pathway analysis and literature mining

Early signaling pathways underlying estrogen‐induced breast cancer cell apoptosis

[Figure: pathway map spanning cytoplasm and nucleus. Labeled processes: Integrin signaling, Cytoskeleton Remodeling, G-protein coupled receptor signaling, Cell Mobility, Cell growth, Cell cycle (inhibition), Apoptosis, Transcription, Transcription repression, Histone modification, Chr. Remodeling. Labeled nodes include E2, ERa, GPR30, Gas, Fak pY, Paxillin, Integrin, VEGF, CX3CR1, GNAO2 pY, RGS3, Rap1GAP, Rap1a, MEK, ERK, mTOR, Sirt3, ZNF23, ELVAL2, CCND1, CDK1 pY, BAD, NCoA3, IASPP, TLE3, RUNX3, NFkB, PRD15, Stat4, BRPF3, PRPF6, N-CoR-2, SWI/SNF, BRG1, BAF57.]

Skate Genome Project

Genome Sequencing & Annotation of Little Skate (Leucoraja erinacea)

• A model organism for vertebrate phylogeny, sharing characteristics with the human immune, circulatory and nervous systems
• One of 11 non‐mammalian organisms strategically selected for sequencing by an NIH advisory panel
• Estimated 3.42 billion base pairs, 49 chromosomes
• Comparative genomics among jawed vertebrates

Skate Genome Project

North East Bioinformatics Collaborative

Collaborative use of specialized resources & expertise in an integrated process

• Little Skate (Leucoraja erinacea) Clones: MDIBL‐Mount Desert Island Biological Lab (ME)
• Next‐Generation Sequencing: UD DNA Sequencing & Genotyping Center (DE)
• Sequence Assembly: Vermont Genetics Network (VT) with ME, RI
• Sequence Analysis & Annotation: Bioinformatics pipeline at UD CBCB (DE), ME, RI, NH, VT
• Storage & Access of Sequence/Annotation data: Shared data center (DE, VT, ME)
• Public Dissemination: NCBI, SkateBase
• Scientific Discovery: NECC and broad research community

Systems Medicine

• Reductionism focuses on components, and can lose information about time, space, and context; systems medicine focuses on the interactions and dynamics
• Understanding the molecular basis of gene–environment interactions is key to dissecting complex disease processes, such as diabetes and cancer

Systems Biology and Energy & Environment

• Systems Biology
– Plant, microbial communities
– Metagenomics
– Genome scale modeling

• Energy and Environment
– Bioenergy
– Carbon cycle
– Environmental Remediation

PIR Team

– Protein Science Team: Cecilia Arighi, Darren Natale, Lai‐Su Yeh, CR Vinayaka, Kati Laiho, Qinghua Wang, Thanemozhi Natarajan, Abhishek Kukreja
– Informatics Team: Hongzhan Huang, Peter McGarvey, Baris Suzek, Leslie Arminski, Natalia Roberts, Chuming Chen, Yongxing Chen, Jing Zhang, Yuqi Wang
– Students and Interns

• Consortium Collaborators
– UniProt: Apweiler, Xenarios and EBI/SIB Teams
– PRO: Smith (SUNY), Blake, Bult (MGI), D'Eustachio (Reactome)
– BioCreative: Hirschman (MITRE), Valencia, Krallinger (CNIO), Lu, Wilbur (NCBI), Cohen (U Colorado)

CBCB Team

– PIR Team at UD
– Center Scientists/Staff: Shawn Polson (Core Coordinator), Katie Lakofsky (Education Coordinator), Susan Phipps (Administrative), Manabu Torii, Oana Tudor
– Graduate Students: Alvaro Gonzalez, Yifan Peng, Luis Lopez, TianChuan Du, Ruoyao Du, Gang Li (CIS), Katie Bi, Sari Khallel, Modupe Adetunji (BINF)

Collaborators
– UD: Vijay Shanker, Keith Decker, Ben Carterette, Li Liao, Jingyi Yu, Hagit Shatkay (CIS), Ulhas Naik (BIO), Carl Schmidt, Larry Cogburn (ANFS), Karl Steiner (ECE), Terry Papoutsakis, Maciek Antoniewicz, Kelvin Lee (ChemE)
– DHSA (Delaware Health Sciences Alliance) [UD, Nemours, CCHC, TJU]
– NECC (North East Cyberinfrastructure Consortium)
– BiND (Bioinformatics Network of Delaware)

Funding Support

• NIH/NHGRI & NIGMS: UniProt: A Centralized Protein Sequence and Function Resource
• NIH/NIGMS: PRO: Protein Ontology in OBO Foundry for Integration of Biomedical Knowledge
• NIH/NLM: Linking Text Mining and Data Mining for Biomedical Knowledge Discovery
• NIH/NCRR: Delaware INBRE: North East Cyberinfrastructure Consortium (NECC)
• NSF/DBI: Integrative Bioinformatics for Knowledge Discovery of PTM Networks
• NSF/DBI: BioCreative Workshops: Linking Text Mining with Ontology and Systems Biology
• DOE: Experimental Systems‐Biology Approaches for Clostridia‐Based Bioenergy Production
• Delaware Health Sciences Alliance (DHSA): Linking Genotype to Phenotype
• Delaware Bioscience Center for Advanced Technology (CAT): Bioinformatics Optimization for Recombinant Protein Expression for Vaccines and Therapeutics
• UniDel Foundation: UD Center for Bioinformatics and Computational Biology
• NIH/NCRR & NIGMS: Delaware INBRE – Bioinformatics Core
• NSF: Delaware EPSCoR – Bioinformatics Core

NIH: National Institutes of Health; NSF: National Science Foundation; DOE: Department of Energy
