Mopping up the Flood of Data with Web Services

Gary Wiggins
Indiana University
School of Informatics
[email protected]

Overview of the Talk

Data Mining and Knowledge Discovery
DMKD in Bioinformatics
DMKD in Chemistry
Public Chemistry Databases for DMKD
Overview of Web Services
NIH-funded Projects Underway or Planned at Indiana University
Educational Opportunities at IU

Data Mining and Knowledge Discovery (DMKD)

Techniques began to be used around 1989
Rapid growth in the mid-1990s, with the DMKD field emerging around 1995
Built on DM tools such as Machine Learning

Data Mining

One of the steps in Knowledge Discovery
Concerned with the actual extraction of knowledge from data
Efficient and scalable methods for mining interesting patterns and knowledge and discovering hidden facts contained in large databases

Data Mining Techniques

Efficient classification methods
Clustering
Outlier analysis
Frequent, sequential, and structured pattern analysis
Visualization and spatial/temporal analysis tools

Knowledge Discovery (KD)

“KD is a nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns from large collections of data.”
--Fayyad et al., as quoted by Cios and Kurgan

The KD process involves:
  • Understanding and preparation of the data
  • Data Mining (DM)
  • Verification and application of the discovered knowledge

Framework for KD Process

Steps range from very few, e.g.,
  • Data collection and understanding
  • Data mining
  • Implementation
to multi-step models, e.g., Cios and Kurgan’s six-step DMKD process model

Cios and Kurgan’s Six-Step DMKD Process Model

Understanding the problem domain
Understanding the data
Preparation of the data
  • ~50% or more of effort spent on this step
Data mining
Evaluation of the discovered knowledge
Using the discovered knowledge

General Data Mining/Data Analysis Systems

SAS Enterprise Miner
SPSS
Insightful S-Plus
IBM DB2 Intelligent Miner
Microsoft SQL Server 2005
SGI MLC++ and MineSet Tree Visualizer
Inxight VizServer

Trends: Major Conferences

Knowledge Discovery and Data Mining (KDD) 2005
  http://www.informatik.uni-trier.de/~ley/db/conf/kdd/kdd2005.html
International Conference on Machine Learning (ICML) 2006
  http://www.icml2006.org/icml2006/technical/accepted.html
SIAM Conference on Data Mining 2006
  http://www.siam.org/meetings/sdm06/proceedings.htm

12th Annual SIGKDD International Conference on Knowledge Discovery and Data Mining

Philadelphia, August 20-23, 2006
Areas of Interest on the Research Track:
  • Applications of data mining (biomedicine, business, e-commerce, defense)
  • Data and result visualization
  • Data warehousing
  • Data mining for community generation, social network analysis and graph-structured data
  • Foundations of data mining
  • Interactive and online data mining
  • KDD framework and process
  • Mining data streams
  • Mining high-dimensional data
  • Mining sensor data
  • Mining text and semi-structured data
  • Mining multi-media data
  • Novel data mining algorithms
  • Privacy and data mining
  • Robust and scalable statistical methods
  • Pre-processing and post-processing for data mining
  • Security issues
  • Spatial and temporal data mining

Trends in DMKD

OLAP (On-Line Analytical Processing)
Data warehousing
Association rules
High Performance DMKD systems
Visualization techniques
Applications of DM
More recently:
  • Database products that incorporate DM tools
  • New developments in design and implementation of the DMKD process
  • Information visualization products as end-user queries
  • XML

XML: the Key to DM and KD?

Or simply a data exchange protocol?
Allows for the description and storage of structured or semi-structured data and their relationships
Can be used to exchange data in a platform-independent way
BUT: only one paper at the major conferences listed earlier dealt with XML

XML helps:

Standardize communication between diverse DM tools and databases (I/O procedures)
Build standard data repositories sharing data between different DM tools that work on different software platforms
Implement communication protocols between DM tools
Provide a framework for integration of and communication between different DMKD steps
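
The platform-independent exchange described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `record`/`field` layout, the function names, and the sample values are hypothetical and do not come from any DM tool or standard.

```python
import xml.etree.ElementTree as ET


def record_to_xml(record: dict) -> str:
    """Serialize a flat record into a small XML document (hypothetical layout)."""
    root = ET.Element("record")
    for key, value in record.items():
        field = ET.SubElement(root, "field", name=key)
        field.text = str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_record(xml_text: str) -> dict:
    """Parse the XML produced above back into a dictionary of strings."""
    root = ET.fromstring(xml_text)
    return {field.get("name"): field.text for field in root.findall("field")}


# A made-up record that one DM tool might hand to another.
sample = {"compound": "aspirin", "cluster": 3, "score": 0.87}
xml_text = record_to_xml(sample)
restored = xml_to_record(xml_text)
print(xml_text)
print(restored)
```

Values come back as strings because this sketch carries no type information; a shared schema would be one way for cooperating tools to agree on types.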

Predictive Model Markup Language (PMML) and Other Tools

In conjunction with XML, PMML enables the automation of sharing of discovered knowledge between different domains and tools
XML-RPC
SOAP (Simple Object Access Protocol)
UDDI
OLAP
OLE DB-DM
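
As a small illustration of one of the tools listed, Python's standard `xmlrpc.client` module can show how XML-RPC marshals a remote call into XML. The method name `cluster.assign` and its arguments are hypothetical, and no server is contacted.

```python
import xmlrpc.client

# Marshal a call to a hypothetical remote method "cluster.assign" into XML.
request_xml = xmlrpc.client.dumps(("aspirin", 3), methodname="cluster.assign")
print(request_xml)

# Unmarshal on the receiving side: recovers the parameters and the method name.
params, method = xmlrpc.client.loads(request_xml)
print(method, params)
```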

Discovery Informatics: Definition

"Discovery Informatics is the study and practice of employing the full spectrum of computing and analytical science and technology to the singular pursuit of discovering new information by identifying and validating patterns in data."
--William W. Agresti in 2003

Discovery Informatics

Discovery and Application of Information
Data Mining and Machine Learning are two aspects of Discovery Informatics.



Trends: Bioinformatics Conferences

International Conference on Intelligent Systems for Molecular Biology (ISMB) 2006
  http://ismb2006.cbi.cnptia.embrapa.br/papers.html
Research in Computational Molecular Biology (RECOMB) 2006
  http://www.informatik.uni-trier.de/~ley/db/conf/recomb/recomb2006.html
Pacific Symposium on Biocomputing (PSB) 2006
  http://helix-web.stanford.edu/psb06/

Main Areas of Research in Bioinformatics

Sequence alignment
Alternative splicing
Microarray analysis
Functional analysis
Analysis of single nucleotide polymorphisms (SNPs)
Natural language text analysis

DMKD Sessions at Major Bioinformatics Conferences

Databases and Data Integration
Text Mining and Information Extraction
Semantic Webs

Data Mining in Bioinformatics (Bajcsy)

Data cleaning, data preprocessing, and semantic integration of heterogeneous, distributed biomedical databases
Existing data mining tools for biodata analysis
Development of advanced, effective, and scalable data mining methods in biodata analysis

Preprocessing of Biodata

Integration of multiple microarray gene experiments must resolve inconsistent labels of genes to form a coherent data store.
Focus on quantitative quality metrics based on analytical and statistical data descriptors and on relationships among variables.

Semantic Integration of Heterogeneous Biomedical Databases

Combine multiple sources into a coherent data store
Find semantically equivalent real-world entities from several biomedical sources
Problems
  • Different labels for the same concept: gene_id vs. g_id
  • Time asynchronization: same gene analyzed at multiple development stages

Approaches for Semantic Integration of Biodata

Construction of integrated biodata warehouses or biodatabases
Construction of a federation of heterogeneous distributed biodatabases
  • Must build up mapping rules or semantic ambiguity resolution rules across multiple databases
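
A minimal sketch of what such mapping rules might look like in Python, using the gene_id vs. g_id example from the previous slide. The rule table, the other field names, and the sample records are invented for illustration; real integration systems need much richer ambiguity resolution.

```python
# Hypothetical mapping rules: source-specific label -> canonical label.
MAPPING_RULES = {
    "gene_id": "gene_id",
    "g_id": "gene_id",
    "expr": "expression",
    "expression_level": "expression",
}


def normalize(record: dict) -> dict:
    """Rename known labels to canonical ones; keep unknown labels unchanged."""
    return {MAPPING_RULES.get(key, key): value for key, value in record.items()}


def integrate(*sources: list) -> dict:
    """Merge records from several sources, grouped by canonical gene_id."""
    store: dict = {}
    for source in sources:
        for record in source:
            row = normalize(record)
            store.setdefault(row["gene_id"], []).append(row)
    return store


source_a = [{"gene_id": "BRCA1", "expr": 2.4}]
source_b = [{"g_id": "BRCA1", "expression_level": 1.9, "stage": "embryo"}]
merged = integrate(source_a, source_b)
print(merged)
```

Each gene keeps all of its records in this sketch, so measurements taken at different development stages are retained side by side, not overwritten.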

Existing Data Mining Tools for Biodata Analysis-I

Sequence Analysis, e.g.,
  • NCBI/BLAST, ClustalW, HMMER, PHYLIP, MEME, TRANSFAC, MDScan, Vector NTI, Sequencher, MacVector
Structure Prediction and Visualization, e.g.,
  • RasMol, Raster3D, Swiss-Model, Scope, MolScript, Cn3D

Existing Data Mining Tools for Biodata Analysis-II

Genome Analysis, e.g.,
  • CAP3, Paracel GenomeAssembler, GenomeScan, GeneMark, GenScan, X-Grail, ORF Finder, GeneBuilder
Pathway Analysis and Visualization, e.g.,
  • KEGG, EcoCyc/MetaCyc, GenMapp
Microarray Analysis, e.g.,
  • ScanAlyze/Cluster/TreeView, Scanalytics MicroArray Suite, Profiler, Silicon Genetics

Biospecific Data Analysis Software Systems

Agilent GeneSpring
Spotfire
Invitrogen Vector NTI

Text Mining in Bioinformatics

Techniques have progressed from simple recognition of terms to extraction of interaction relationships in complex sentences.
Search objectives have broadened to a range of problems, e.g.,
  • Improving homology search
  • Identifying cellular location
  • Deriving genetic network technologies

Current Work in Biomedical Text Mining (Cohen and Hersh)

Text mining operates at a finer level of granularity than information retrieval and text summarization.
TM examines relationships between specific kinds of information contained within and between documents.
Areas of active research:
  • Named entity recognition (genes, proteins, etc.)
  • Text classification
  • Synonym and abbreviation extraction
  • Relationship extraction
  • Hypothesis generation
  • Integrated frameworks

Systems Biology

Requires a shift in focus from genes and proteins to the system’s structure and dynamics
Four key properties:
  • System structures
  • System dynamics
  • Control method
  • Design method
Systems Biology Markup Language (SBML) and CellML

iSpecies.org

Data Mining in Chemistry

“Modern experimentation (whether “classical” or high-throughput) should be based on the productive interplay of statistical techniques (design-of-experiments), molecular modeling as well as cheminformatics.”
--Ulrich S. Schubert

Session on “Integration of Informatics and Knowledge Management Informatics”*

Integration of Informatics at the Systems Level and at the Data Level
  Chris L. Waller, Ph.D., Director, World Wide Chemistry Informatics, Pfizer Global Research & Development
Integrated Knowledge Management at Bayer HealthCare: Pharmacophore Informatics
  William J. Scott, Ph.D., Team Leader, Department for Chemistry Research, Bayer Pharmaceuticals Corporation
Building a Knowledge Enabled Organization
  Cory R. Brouwer, Ph.D., Associate Director, Knowledge Management Informatics, Pfizer Global Research & Development
Knowledge Management: Building a Knowledge Enabled Organization
  Victor Lobanov, Ph.D., Principal Scientist, MDI, Johnson & Johnson Pharmaceutical R&D

*10th Annual Cheminformatics Conference, May 23-16, 2006, Philadelphia

Impact of HTS and Combinatorial Chemistry Research

Most impact in:
  • the pharmaceutical industry
  • medical research
  • catalyst research
More recently:
  • polymer and materials research

Diversity of Data Mining in Chemistry

On 5/7/2006 there were 4072 references to either “datamining” or “data mining” in Chemical Abstracts.
3416 different index terms were assigned to those records.
  • 2772 used 1-5 times (81%)
  • 298 used 6-10 times (9%)
  • 103 used 11-15 times (3%)
  • 71 used 16-20 times (2%)
  • 38 used 21-25 times (1%)
  • 24 used 26-30 times (1%)
  • 110 used 31-480 times (3%)
Most frequent co-term: “bioinformatics” with 480 hits or 12% of the occurrences
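
The percentages on this slide can be rechecked from the counts with a short Python calculation. The counts are the ones reported above; the variable names are only illustrative.

```python
# Counts of index terms per usage-frequency bin, as reported on the slide.
bins = {
    "1-5": 2772,
    "6-10": 298,
    "11-15": 103,
    "16-20": 71,
    "21-25": 38,
    "26-30": 24,
    "31-480": 110,
}
total_terms = sum(bins.values())
shares = {label: round(100 * count / total_terms) for label, count in bins.items()}

for label, count in bins.items():
    print(f"{label:>7}: {count:5d} terms ({shares[label]}%)")

# Share of the 4072 references that carry the co-term "bioinformatics".
bioinformatics_share = round(100 * 480 / 4072)
print(total_terms, bioinformatics_share)
```

The bins sum to the 3416 index terms quoted, and the rounded shares agree with the percentages listed.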

[Figure: bar chart of the percentage of index terms (0% to 90%) by number of times used: 1-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-480]

SFS graph

Components of the Semantic Web for Chemistry

XML – eXtensible Markup Language
RDF – Resource Description Framework
RSS – Rich Site Summary
Dublin Core – allows metadata-based newsfeeds
OWL – for ontologies
BPEL4WS – for workflow and web services

Murray-Rust et al. Org. Biomol. Chem. 2004, 2, 3192-3203.

Chemical Markup Language (CML)

Much of the semantics in a chemical article can be supported by CML
  • Molecules
  • Structures
  • Reactions and reaction schemes
  • Spectra (including annotations)
  • Physicochemical data
XML dictionaries and lexicons provide linguistic and semantic support for markup
Will lead to quicker authoring and higher quality of embedded structures and data through machine validation
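
To give a feel for machine validation of embedded structures, the Python sketch below parses a small hand-written CML fragment and checks that every bond refers to a declared atom. The fragment is illustrative only, and the check is a simple stand-in chosen for this example, not the validation performed by any actual CML toolkit.

```python
import xml.etree.ElementTree as ET

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# A hand-written CML fragment for the heavy atoms of methanol (illustrative only).
CML_TEXT = """
<molecule xmlns="http://www.xml-cml.org/schema" id="m1">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
  </bondArray>
</molecule>
"""


def dangling_bond_refs(cml_text: str) -> list:
    """Return atom ids that are used in bonds but never declared as atoms."""
    root = ET.fromstring(cml_text.strip())
    declared = {
        atom.get("id") for atom in root.findall("cml:atomArray/cml:atom", CML_NS)
    }
    problems = []
    for bond in root.findall("cml:bondArray/cml:bond", CML_NS):
        for ref in bond.get("atomRefs2", "").split():
            if ref not in declared:
                problems.append(ref)
    return problems


valid_result = dangling_bond_refs(CML_TEXT)
broken = CML_TEXT.replace('atomRefs2="a1 a2"', 'atomRefs2="a1 a9"')
broken_result = dangling_bond_refs(broken)
print(valid_result, broken_result)
```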

Key Factors in the Success of the Chemical Semantic Web

Institutional Repositories: services deployed and supported at an institutional level to offer dissemination management, stewardship, and where appropriate, long-term preservation of both the intellectual work created by an institutional community and the records of the intellectual and cultural life of the institutional community
Open Access Movement

Knowledge-Driven Bioinformatics Enhanced with Chemistry

Text Mining (Banville)

“In the pharmaceutical field, it is ideally the marriage of biological and chemical information that needs to be the ultimate focus of text data mining applications.”
Problems:
  • Lack of universal publication standards for identifying each unique chemical entity
  • Selective indexing policies of A&I services
  • Need to understand how chemical structures link to biological processes

OSCAR3 Service

Open-source Java application under development by the Peter Murray-Rust group at Cambridge (not published yet)
Extracts chemical information from either a paragraph of experimental data or a full paper (e.g. melting points, infra-red and NMR data, and mass spectral information)
Produces an XML instance highlighting the chemical information with an Extensible Stylesheet Language (XSL) file
At IU, we are attaching a SOAP input/output engine for a web service based on OSCAR3.
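
The general idea of pulling experimental data out of a paragraph and emitting XML can be sketched in Python as below. This is not OSCAR3's method or output format: the regular expressions, the element names, and the sample sentence are all assumptions made for illustration, and real experimental sections vary far more than these patterns allow.

```python
import re
import xml.etree.ElementTree as ET

# Simplified patterns (assumptions); real experimental text is far more varied.
MELTING_POINT = re.compile(
    r"\bmp\s+(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)\s*°C"
)
IR_PEAKS = re.compile(r"\bIR\b[^:;]*:\s*([\d,\s]+)\s*cm-1")


def extract(paragraph: str) -> ET.Element:
    """Return an XML element (hypothetical schema) holding the extracted data."""
    root = ET.Element("experimental")
    mp = MELTING_POINT.search(paragraph)
    if mp:
        ET.SubElement(
            root, "meltingPoint", low=mp.group(1), high=mp.group(2), units="C"
        )
    ir = IR_PEAKS.search(paragraph)
    if ir:
        for peak in ir.group(1).replace(",", " ").split():
            ET.SubElement(root, "irPeak", wavenumber=peak, units="cm-1")
    return root


text = "White solid, mp 134-136 °C; IR (KBr): 3300, 1715, 1600 cm-1."
result = extract(text)
print(ET.tostring(result, encoding="unicode"))
```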

OSCAR at Work in the Future

Semantic Scholars’ Grid I

[Diagram]
Components: Gatherer, Analyzer, Indexer, Local Harvest Store, Local MD Store
External sources: Science.gov, PubMed, Google Scholar, DSpace, e-Prints, etc.
Labeled steps:
  • Query and get list
  • Fetch MD and documents
  • Run filter such as OSCAR2 on harvested MD and documents
  • Store new MD
  • Index all local MD

Semantic Scholars’ Grid II

[Diagram]
Components: Local MD Store, Updater, SSG Viewer, Foreign User Interface, Community Tools, Plug-in
External sources: CiteULike, Connotea, Del.icio.us, etc.; ACM, IEEE, Google Scholar, Wiley, etc.
Labeled steps:
  • Update and view foreign MD
  • Update local MD
  • Control foreign interactions
  • View all MD
  • Access Community Tools (Instant Citation Index, etc.)
  • Synchronize SSG and foreign MD

Chemical Datamining Software

SureChem
  • http://surechem.reeltwo.com/
CLiDE
  • Recognizes structures, reactions, and text
  • http://www.simbiosys.ca/clide/
OSCAR
  • “OSCAR1” to check experimental data
  • http://www.ch.cam.ac.uk/magnus/checker.html
  • http://www.rsc.org/Publishing/ReSourCe/AuthorGuidelines/AuthoringTools/ExperimentalDataChecker/
CSR (Chemical Structure Reconstruction)
  • http://www.scai.fraunhofer.de/uploads/media/MZ-ERCIM05_04.pdf
MDL DocSearch: combines MDL’s Isentris platform and EMC’s Documentum



ChemDB http://cdb.ics.uci.edu/CHEM/Web/

ChEBI, Chemical Entities of Biological Interest

Dictionary of molecular entities focused on small chemical compounds
Features an ontological classification, showing the relationships between molecular entities or classes of entities and their parents and/or children

Vioxx Entry in ChEBI

The IUPAC International Chemical Identifier (InChI)

Open source, non-proprietary, public-domain identifier for chemicals
String of characters that uniquely represents a molecular substance
Independent of the way the chemical structure is drawn
Enables reliable structure recognition and easy linking of diverse data compilations
Accepts as input MOLfiles (or SDfiles) and CML files
Download the program to your computer at: http://www.iupac.org/inchi/license.html
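
Because an InChI is a plain string of slash-separated layers, it is easy to take apart. The Python sketch below splits one into its version, formula, and prefixed layers; the function name is hypothetical, the example is the standard InChI for ethanol, and the simple parsing assumes a formula is present and ignores finer points of the format such as nested fixed-hydrogen layers.

```python
def split_inchi(inchi: str) -> dict:
    """Split an InChI string into its version, formula, and prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, formula, *layers = inchi[len("InChI="):].split("/")
    parsed = {"version": version, "formula": formula}
    for layer in layers:
        # Each later layer starts with a one-letter prefix, e.g. c or h.
        parsed[layer[0]] = layer[1:]
    return parsed


# Standard InChI for ethanol.
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3"
layers = split_inchi(ethanol)
print(layers)
```

Here the `c` layer carries the atom connectivity and the `h` layer the hydrogen atoms, which is why the identifier does not depend on how the structure was drawn.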

Generation of InChI for Vioxx with wInChI

Vioxx Entry in PubChem Compounds Found with InChI

Vioxx Bioassay Data in PubChem

Vioxx PubChem Link to External Sources of Information

PubChem Link to Elsevier MDL

DiscoveryGate (www.discoverygate.com) provides access to integrated scientific content from databases, journal articles, patent publications and reference works
Information providers include Elsevier, Thomson-Derwent, FIZ CHEMIE, the U.S. FDA, Prous Science and Thieme
MDL Compound Index (the master list of substances included in DiscoveryGate data sources) now exceeds 14 million unique chemical structures with the addition of 5 million chemical structures from the PubChem database.

The Elsevier MDL/NIH Link via PubChem and DiscoveryGate

Cross-indexes PubChem to the Compound Index hosted on Elsevier MDL’s DiscoveryGate platform
MDL added 5 million structures from PubChem to their index, resulting in over 14 million unique chemical structures
Links go both ways
  • Can move from biological data in PubChem to bioactivity, chemical sourcing, synthetic methodology, and EHS data in DiscoveryGate sources

Elsevier MDL’s xPharm

Comprehensive set of records linking:
  • Agents (compounds) (2300)
  • Targets (600)
  • Disorders (450)
  • Principles that govern their interactions (180)
Answers questions such as:
  • What targets are associated with control of blood pressure?
  • What adverse effects are associated with monoamine oxidase inhibitors?

Web Guide for Essential Cheminformatics Resources

http://www.chembiogrid.org
http://www.indiana.edu/~cheminfo/cicc/

ChemBioGrid Chemical Databases

Web Services Overview

• What are “Web Services”?
  • A distributed invocation system built on Grid computing
    • Independent of platform and programming language
    • Built on existing Web standards
  • A service-oriented architecture with
    • Interfaces based on Internet protocols
    • Messages in XML (except for binary data attachments)
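To make "messages in XML" concrete, here is a minimal sketch of how a request to a SOAP-style service might be wrapped and unwrapped using only the Python standard library. The service namespace, the operation name, and the parameter names are hypothetical and chosen purely for illustration; only the SOAP 1.1 envelope namespace is standard.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SVC_NS = "http://example.org/chem"  # hypothetical service namespace (assumption)


def build_request(operation, params):
    """Wrap an operation call and its parameters in a SOAP 1.1 envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SVC_NS}}}{operation}")
    for name, value in params.items():
        # Every parameter travels as text inside its own XML element.
        ET.SubElement(call, f"{{{SVC_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def read_request(message):
    """Recover the operation name and parameters from an envelope."""
    root = ET.fromstring(message)
    call = root.find(f"{{{SOAP_NS}}}Body")[0]  # first child of Body is the call
    operation = call.tag.split("}", 1)[1]      # strip the "{namespace}" prefix
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return operation, params
```

Because both sides only exchange XML text over an Internet protocol, the client and the service can be written in different languages and run on different platforms. Note that the numeric parameter comes back as a string: typing is the job of the service description (WSDL), not of the message itself.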

Web Services for Chemistry: Problems

• Performance and scalability
• Proprietary data
• Competition from high-performance desktop applications
-- Geoff Hutchison, it’s a puzzle blog, 2005-01-05

ALSO:
• Lack of a substantial body of trustworthy Open Access databases
• Non-standard chemical data formats (over 40 in regular use and requiring normalization to one another)

DM Internet Toolbox Architecture



Indiana University Planned Projects:

http://www.chembiogrid.org
• Design of a Grid-based distributed data architecture
• Development of tools for HTS data analysis and virtual screening
• Database for quantum mechanical simulation data
• Chemical prototype projects
  • Novel routes to enzymatic reaction mechanisms
  • Mechanism-based drug design
  • Data-inquiry-based development of new methods in natural product synthesis

Web Services for Chemistry at IU

Interaction Layer
  Purpose: Interactive software for creative access and exploitation of information by humans
  Technologies: Microsoft .NET Smart Clients, portlets, Java applets, email and browser clients, visualization technologies

Aggregation Layer
  Purpose: Workflows and data schemas customized for particular domains, applications and users
  Technologies: BPEL, Taverna and other workflow modeling tools, aggregate web services

Web Service Layer
  Purpose: Comprehensive data and computation provision including storage, calculation, semantics and meta-data exposed as web services
  Technologies: Apache web services, SOAP wrappers, WSDL, UDDI, XML, Microsoft .NET

NCI Developmental Therapeutics Program (DTP)

• Downloadable data:
  • In vitro 60 cell line results
  • In vitro anti-HIV results
  • Yeast assay
  • 200,000+ chemical structures
  • Molecular targets
  • Microarray data
• Or search the database at:
  • http://dtp.nci.nih.gov/docs/dtp_search.html

IU Database of NIH DTP Data

• Contains over 200,000 chemical structures tested in 60 cellular assays from different human tumor cell lines
• Also includes microarray assay profiles for the untreated cell lines (~14,000 datapoints)
• A local PostgreSQL database containing the data that is exposed as a web service
• Using workflows and complex SQL queries, we can do advanced data mining that exploits the chemical, biological and genomic information for particular audiences (chemists, biologists, etc.)
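As a rough illustration of the kind of SQL query such a service might run, the sketch below ranks compounds by mean growth-inhibition value across cell lines. Everything here is an assumption made for the example: the two-table schema, the column names, and the toy rows are invented, and an in-memory SQLite database stands in for the PostgreSQL database so the snippet runs without a server.

```python
import sqlite3


def build_demo_db():
    """Create a toy stand-in for the screening database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE compound (nsc INTEGER PRIMARY KEY, smiles TEXT);
        CREATE TABLE screen (nsc INTEGER, cell_line TEXT, log_gi50 REAL);
    """)
    # Invented rows, not real DTP data.
    conn.executemany("INSERT INTO compound VALUES (?, ?)",
                     [(1, "CCO"), (2, "c1ccccc1"), (3, "CC(=O)O")])
    conn.executemany("INSERT INTO screen VALUES (?, ?, ?)", [
        (1, "line_A", -4.0), (1, "line_B", -4.2),
        (2, "line_A", -7.5), (2, "line_B", -6.9),
        (3, "line_A", -5.0), (3, "line_B", -5.4),
    ])
    return conn


def most_potent(conn, limit):
    """Rank compounds by mean log GI50 over all cell lines (lower = more potent)."""
    return conn.execute("""
        SELECT c.nsc, c.smiles, AVG(s.log_gi50) AS mean_log_gi50
        FROM compound AS c
        JOIN screen AS s ON s.nsc = c.nsc
        GROUP BY c.nsc, c.smiles
        ORDER BY mean_log_gi50
        LIMIT ?
    """, (limit,)).fetchall()
```

Wrapping a parameterized query like `most_potent` behind a web service method is one plausible way to give chemists and biologists different views of the same tables without exposing the database directly.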

Mining the NIH DTP database

[Figure: diagram labeled "~200,000 compounds", "60 cell lines", and "~14,000 gene expression values"]

• Cell lines can be clustered based on gene expression similarity
• Compounds can be clustered based on similarity of profile across cell lines, or by chemical structure fingerprint similarity
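To show what "fingerprint similarity" can mean in practice, here is a small sketch that compares fingerprints with the Tanimoto coefficient and groups compounds with a simple greedy leader clustering. This is a swapped-in illustrative technique, not the clustering method used in the project; the fingerprints are represented as sets of "on" bit positions, and the compound names and threshold are made up.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return len(a & b) / len(a | b)


def leader_cluster(fingerprints, threshold):
    """Greedy leader clustering.

    Each compound joins the first cluster whose leader (its first member)
    is at least `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []  # list of (leader_fingerprint, member_names)
    for name, fp in fingerprints.items():
        for leader_fp, members in clusters:
            if tanimoto(fp, leader_fp) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((fp, [name]))
    return [members for _, members in clusters]
```

The result of leader clustering depends on the input order and on the threshold, which is why hierarchical or k-means style methods are often preferred for large collections; the sketch is only meant to make the similarity measure tangible.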

Use of Taverna at IU

• A protein implicated in tumor growth is supplied to the docking program (in this case HSP90 taken from the PDB 1Y4 complex)
• A 2D structure is supplied for input into the similarity search (in this case, the extracted bound ligand from the PDB 1Y4 complex)
• The workflow employs our local NIH DTP database service to search 200,000 compounds tested in human tumor cellular assays for similar structures to the ligand
• Client portlets are used to browse these structures
• Similar structures are filtered for druggability, and are automatically passed to the OpenEye FRED docking program for docking into the target protein
• Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the JMOL applet
• Correlation of docking results and “biological fingerprints” across the human tumor cell lines can help identify potential mechanisms of action of DTP compounds
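One simple way to compare "biological fingerprints" is to treat each compound's activity values across the cell lines as a vector and compute the Pearson correlation between two such vectors. The sketch below assumes that representation; the function names and the idea of ranking library compounds against a query profile are illustrative assumptions, not a description of the workflow's actual code.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length activity profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sd_x * sd_y)


def rank_by_profile(query, library):
    """Sort library compounds by correlation with the query profile, best first."""
    scored = [(name, pearson(query, profile)) for name, profile in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Compounds whose profiles correlate strongly with that of a well-docked compound are candidates for sharing its mechanism of action, which is the kind of hypothesis the final workflow step is meant to surface.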

Taverna Workflow

[Screenshot with labeled panels: visual depiction of workflow; workflow definition; available web services (WSDL)]

Taverna in Action

CGL Contributions to CICC

• Build Web/Grid services for connecting
  • Data sources
  • Applications (simulation, data mining, data assimilation, imaging, etc.)
  • Computing resources
  • Information services
• Third party tool evaluation
  • Workflow (Taverna)
  • Grid tools: Globus and Condor (for interacting with TeraGrid)
• Building standards-based Web portal environments
  • OGCE grid portal project
  • JSR 168 Java standards
  • This activity will begin in earnest over the summer.

Digital Chemistry (BCI) Clustering Service Methods

Service Method | Description | Input | Output
makebitsGenerate | Generate fingerprints from a SMILES structure | SMIstring | Fingerprint string
divkmGenerate | Cluster fingerprints with Divkmeans | SCNstring | Clustered hierarchy
smile2dkm | Makebits + divkm | SMIstring | Clustered hierarchy
optclusGenerate | Generate the best levels in a hierarchy | DKMstring | Best partition cluster level
rnnclusGenerate | Extract individual cluster partitions | DKMstring | Individual cluster partitions
smile2ClusterPartitioned | Generate a new SMILES structure with an extra column | SMIstring | New SMILES structure

Local Web Service Methods for WWMM of PMR’s Group

Service | Description | Input | Output
InChIGoogle | Search an InChI structure through Google | inchiBasic, type | Search result in HTML format
InChIServer | Generate InChI | version, format | An InChI structure
OBServer | Transform a chemical format to another using Open Babel | format, inputData, outputData, options | Converted chemical structure string
CMLRSSServer | Generate CMLRSS feed from CML data | mol, title, description, link, source | Converted CMLRSS feed of CML data

More Services

• VOTables and related services: General purpose service for manipulating tabular data. Comes with third party tools for parsing, manipulating, and displaying data. Includes import tools. We are using this as an intermediary for data exchange between databases.
• Draw2d: Uses CDK tools to create 2D images from SDF-formatted data.
• Common Substructure: Another CDK service that can be used to calculate the common substructure between two molecules.
• Other CDK Services: See http://www.chembiogrid.org/wiki/index.php/Web_Services_Infrastructure. Based on Dr. Rajarshi Guha’s services.

ToxTree

• An in silico toxicology prediction suite
• Based on the CDK toolkit
• Built on CML
• Released as open source under the GPL
• Standalone PC software
• User Manual: http://ecb.jrc.it/DOCUMENTS/QSAR/TOXTREE/toxTree_user_manual.pdf

ToxTree Service

• An open source Java application by Nina Jeliazkova
• Estimates toxic hazard by applying a decision tree approach
• Encodes the Cramer scheme (Cramer G. M., R. A. Ford, R. L. Hall, Estimation of Toxic Hazard - A Decision Tree Approach, J. Cosmet. Toxicol., Vol. 16, pp. 255-276, Pergamon Press, 1978)
• Could be applied to datasets from various compatible file types
• We are converting this GUI application to a text-based web service

Chemoinformatics Education at IU

• School of Informatics degree programs
  • BS, MS, PhD
• Programs offered at both the Indianapolis (IUPUI) and Bloomington (IUB) campuses

Other Educational Activities

• Graduate Certificate Program in Chemical Informatics (4 courses by Distance Education)
  • I571 Chemical Information Technology (3 cr.)
  • I572 Computational Chemistry and Molecular Modeling (3 cr.)
  • I573 Programming Techniques for Chemical and Life Science Informatics (3 cr.)
  • I553 Independent Study in Chemical Informatics (3 cr.)
• I571 as CIC Courseshare offering w. Michigan
• Experiments with teleconferencing as a distance education tool

PhD in Informatics

• Began in August 2005
• Tracks:
  • bioinformatics; chemical informatics; health informatics; human-computer interaction design; social and organizational informatics
• Under development:
  • complex systems, networks, modeling and simulation; cybersecurity; discovery and application of information; logical and mathematical foundations; music informatics

Graduate Enrollment: Chemo-, Laboratory, Bio-, Health Informatics

MS | Chem | Lab | Bio | Health
IUB | 3 | 0 | 38 | 0
IUPUI | 6 | 15 | 34 | 36
TOTAL | 9 | 15 | 72 | 36

PhD | Chem | Lab | Bio | Health
IUB | 1 | 0 | 3 | 0
IUPUI | 1 | 0 | 4 | 3
TOTAL | 2 | 0 | 7 | 3

Software/DBs Used in the Program

Company | Products and/or (Target Area)
ArgusLab | (Molecular modeling)
Digital Chemistry | Toolkit (Clustering)
Cambridge Cryst Data Ctr | Cambridge Structural DB & GOLD
CambridgeSoft | ChemDraw Ultra
Chemical Abstracts Service | SciFinder Scholar
Chemaxon | Marvin (and other software)
Daylight Chemical Info System | Toolkit
FIZ Karlsruhe | Inorganic Crystal Structure DB
IO-Informatics | Sentient
MDL | CrossFire Beilstein and Gmelin
OpenEye | Toolkit (and other software)
Sage Informatics | ChemTK
Serena Software | PCMODEL
Spotfire | DecisionSite
STN International | STN Express with Discover (Anal Ed)
Wavefunction | Spartan

Closing quote

“The future of chemistry depends on the automated analysis of chemical knowledge, combining disparate data sources in a single resource, . . . which can be analysed using computational techniques to assess and build on these data.”
Townsend et al. Org. Biomol. Chem. 2004, 2, 3299.

We all need help when overloaded!


Web 2.0

• Social Software: allows group interactions
• Enables groups to form and organize themselves
• Examples
  • Wikis
  • Blogs
  • RSS (now found on chemistry.org)
  • Podcasting/Coursecasting
  • Webcasting/Webinars
  • Flickr
  • Jybe
  • FURL

FURL (Frame Uniform Resource Locator)

• For archiving and sharing of web pages
• Furler can capture the pages for a discussion group
• Tracks useful pages for a discussion
• http://www.furl.net/home.jsp

Jybe (Join Your Browser with Everyone)

• Collaboration and communication in real time with IE and Firefox
• Screen-sharing AND editing
• Privacy protected: must be invited
• Upload documents to convert to HTML
• http://www.jybe.com
