Mopping up the Flood of Data with Web Services
  Gary Wiggins
  Indiana University School of Informatics
  [email protected]
Overview of the Talk
  Data Mining and Knowledge Discovery
  DMKD in Bioinformatics
  DMKD in Chemistry
  Public Chemistry Databases for DMKD
  Overview of Web Services
  NIH-funded Projects Underway or Planned at Indiana University
  Educational Opportunities at IU
Data Mining and Knowledge Discovery (DMKD)
  Techniques began to be used around 1989
  Rapid growth in the mid 1990s, with DMKD field emerging around 1995
  Built on DM tools such as Machine Learning
Data Mining
  One of the steps in Knowledge Discovery
  Concerned with the actual extraction of knowledge from data
  Efficient and scalable methods for mining interesting patterns and knowledge and discovering hidden facts contained in large databases
Data Mining Techniques
  Efficient classification methods
  Clustering
  Outlier analysis
  Frequent, sequential, and structured pattern analysis
  Visualization and spatial/temporal analysis tools
Knowledge Discovery (KD)
  “KD is a nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns from large collections of data.”
  --Fayyad et al., as quoted by Cios and Kurgan
  The KD process involves:
    Understanding and preparation of the data
    Data Mining (DM)
    Verification and application of the discovered knowledge
Framework for KD Process
  Steps range from very few, e.g.,
    Data collection and understanding
    Data mining
    Implementation
  To multi-step models, e.g., Cios and Kurgan’s six-step DMKD process model
Cios and Kurgan’s Six-Step DMKD Process Model
  Understanding the problem domain
  Understanding the data
  Preparation of the data
    ~50% or more of effort spent on this step
  Data mining
  Evaluation of the discovered knowledge
  Using the discovered knowledge
General Data Mining/Data Analysis Systems
  SAS Enterprise Miner
  SPSS
  Insightful S-Plus
  IBM DB2 Intelligent Miner
  Microsoft SQL Server 2005
  SGI MLC++ and MineSet Tree Visualizer
  Inxight VizServer
Trends: Major Conferences
  Knowledge Discovery and Data Mining (KDD) 2005
    http://www.informatik.uni-trier.de/~ley/db/conf/kdd/kdd2005.html
  International Conference on Machine Learning (ICML) 2006
    http://www.icml2006.org/icml2006/technical/accepted.html
  SIAM Conference on Data Mining 2006
    http://www.siam.org/meetings/sdm06/proceedings.htm
12th Annual SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, August 20-23, 2006
  Areas of Interest on the Research Track:
    Applications of data mining (biomedicine, business, e-commerce, defense)
    Data and result visualization
    Data warehousing
    Data mining for community generation, social network analysis and graph-structured data
    Foundations of data mining
    Interactive and online data mining
    KDD framework and process
    Mining data streams
    Mining high-dimensional data
    Mining sensor data
    Mining text and semi-structured data
    Mining multi-media data
    Novel data mining algorithms
    Privacy and data mining
    Robust and scalable statistical methods
    Pre-processing and post-processing for data mining
    Security issues
    Spatial and temporal data mining
Trends in DMKD
  OLAP (On-Line Analytical Processing)
  Data warehousing
  Association rules
  High Performance DMKD systems
  Visualization techniques
  Applications of DM
  More recently:
    Database products that incorporate DM tools
    New developments in design and implementation of the DMKD process
    Information visualization products as end-user queries
    XML
XML: the Key to DM and KD?
  Or simply a data exchange protocol?
  Allows for the description and storage of structured or semi-structured data and their relationships
  Can be used to exchange data in a platform-independent way
  BUT only one paper at the major conferences listed earlier dealt with XML
XML helps:
  Standardize communication between diverse DM tools and databases (I/O procedures)
  Build standard data repositories sharing data between different DM tools that work on different software platforms
  Implement communication protocols between DM tools
  Provide a framework for integration of and communication between different DMKD steps
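As a minimal sketch of the platform-independent exchange described on this slide, the Python snippet below serializes a small record set to XML with the standard library and reads it back, as a second DM tool might. The element names (`dataset`, `record`, `field`) and the sample values are hypothetical, chosen only for illustration; they do not come from any DM product or standard.

```python
import xml.etree.ElementTree as ET

def records_to_xml(records):
    """Serialize a list of dicts to an XML string (hypothetical schema)."""
    root = ET.Element("dataset")
    for rec in records:
        node = ET.SubElement(root, "record")
        for name, value in rec.items():
            field = ET.SubElement(node, "field", name=name)
            field.text = str(value)
    return ET.tostring(root, encoding="unicode")

def xml_to_records(xml_text):
    """Parse the XML produced above back into a list of dicts of strings."""
    root = ET.fromstring(xml_text)
    return [
        {f.get("name"): f.text for f in rec.findall("field")}
        for rec in root.findall("record")
    ]

if __name__ == "__main__":
    sample = [{"compound": "aspirin", "activity": "0.82"}]
    payload = records_to_xml(sample)
    print(payload)
    print(xml_to_records(payload))
```

All values come back as strings, so a real exchange format would also need to carry type information, which is part of what schema-based formats such as PMML add.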
Predictive Model Markup Language (PMML) and Other Tools
  In conjunction with XML, PMML enables the automation of sharing of discovered knowledge between different domains and tools
  XML-RPC
  SOAP (Simple Object Access Protocol)
  UDDI
  OLAP
  OLE DB-DM
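Of the protocols listed, XML-RPC is the simplest to illustrate: a remote call is just an XML document naming a method and its parameters. The sketch below uses Python's standard `xmlrpc.client` module to marshal and unmarshal such a call without touching the network. The method name `dm.classify` and its arguments are hypothetical.

```python
import xmlrpc.client

def encode_call(method_name, *params):
    """Marshal a method call into an XML-RPC request body."""
    return xmlrpc.client.dumps(params, methodname=method_name)

def decode_call(xml_text):
    """Unmarshal an XML-RPC request body into (method_name, params)."""
    params, method_name = xmlrpc.client.loads(xml_text)
    return method_name, params

if __name__ == "__main__":
    # Hypothetical call: classify one sample described by three descriptors.
    request = encode_call("dm.classify", "sample-001", [0.12, 3.4, 5.6])
    print(request)
    print(decode_call(request))
```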
Discovery Informatics: Definition
  "Discovery Informatics is the study and practice of employing the full spectrum of computing and analytical science and technology to the singular pursuit of discovering new information by identifying and validating patterns in data."
  --William W. Agresti in 2003
Discovery Informatics
  Discovery and Application of Information
  Data Mining and Machine Learning are two aspects of Discovery Informatics.
Overview of the Talk
  Data Mining and Knowledge Discovery
  DMKD in Bioinformatics
  DMKD in Chemistry
  Public Chemistry Databases for DMKD
  Overview of Web Services
  NIH-funded Projects Underway or Planned at Indiana University
  Educational Opportunities at IU
Trends: Bioinformatics Conferences
  International Conference on Intelligent Systems for Molecular Biology (ISMB) 2006
    http://ismb2006.cbi.cnptia.embrapa.br/papers.html
  Research in Computational Molecular Biology (RECOMB) 2006
    http://www.informatik.uni-trier.de/~ley/db/conf/recomb/recomb2006.html
  Pacific Symposium on Biocomputing (PSB) 2006
    http://helix-web.stanford.edu/psb06/
Main Areas of Research in Bioinformatics
  Sequence alignment
  Alternative splicing
  Microarray analysis
  Functional analysis
  Analysis of single nucleotide polymorphisms (SNPs)
  Natural language text analysis
DMKD Sessions at Major Bioinformatics Conferences
  Databases and Data Integration
  Text Mining and Information Extraction
  Semantic Webs
Data Mining in Bioinformatics (Bajcsy)
  Data cleaning, data preprocessing, and semantic integration of heterogeneous, distributed biomedical databases
  Existing data mining tools for biodata analysis
  Development of advanced, effective, and scalable data mining methods in biodata analysis
Preprocessing of Biodata
  Integration of multiple microarray gene experiments must resolve inconsistent labels of genes to form a coherent data store.
  Focus on quantitative quality metrics based on analytical and statistical data descriptors and on relationships among variables.
Semantic Integration of Heterogeneous Biomedical Databases
  Combine multiple sources into a coherent data store
  Find semantically equivalent real-world entities from several biomedical sources
  Problems
    Different labels for the same concept: gene_id vs. g_id
    Time asynchronization: same gene analyzed at multiple development stages
Approaches for Semantic Integration of Biodata
  Construction of integrated biodata warehouses or biodatabases
  Construction of a federation of heterogeneous distributed biodatabases
    Must build up mapping rules or semantic ambiguity resolution rules across multiple databases
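A mapping rule of the kind mentioned here can be as simple as a lookup table from source-specific labels to one canonical label. The following Python sketch applies such rules to the `gene_id` vs. `g_id` case and keeps measurements from different development stages apart. The rule table, the `stage`/`dev_stage` labels, and the sample records are hypothetical; real integration projects need far richer resolution rules.

```python
# Hypothetical mapping rules: source-specific column label -> canonical label.
MAPPING_RULES = {
    "gene_id": "gene_id",
    "g_id": "gene_id",
    "stage": "development_stage",
    "dev_stage": "development_stage",
}

def normalize(record, rules=MAPPING_RULES):
    """Rename a record's keys to canonical labels; unknown keys are kept as-is."""
    return {rules.get(key, key): value for key, value in record.items()}

def integrate(*sources):
    """Merge records from several sources, keyed by (gene_id, development_stage).

    Including the stage in the key stops measurements of the same gene taken
    at different development stages from overwriting each other.
    """
    store = {}
    for source in sources:
        for raw in source:
            rec = normalize(raw)
            key = (rec["gene_id"], rec.get("development_stage"))
            store.setdefault(key, {}).update(rec)
    return store

if __name__ == "__main__":
    lab_a = [{"gene_id": "G1", "stage": "embryo", "expr_a": 2.5}]
    lab_b = [{"g_id": "G1", "dev_stage": "embryo", "expr_b": 2.7},
             {"g_id": "G1", "dev_stage": "adult", "expr_b": 0.4}]
    for key, rec in integrate(lab_a, lab_b).items():
        print(key, rec)
```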
Existing Data Mining Tools for Biodata Analysis-I
  Sequence Analysis, e.g.,
    NCBI/BLAST, ClustalW, HMMER, PHYLIP, MEME, TRANSFAC, MDScan, Vector NTI, Sequencher, MacVector
  Structure Prediction and Visualization, e.g.,
    RasMol, Raster3D, Swiss-Model, Scope, MolScript, Cn3D
Existing Data Mining Tools for Biodata Analysis-II
  Genome Analysis, e.g.,
    CAP3, Paracel GenomeAssembler, GenomeScan, GeneMark, GenScan, X-Grail, ORF Finder, GeneBuilder
  Pathway Analysis and Visualization, e.g.,
    KEGG, EcoCyc/MetaCyc, GenMapp
  Microarray Analysis, e.g.,
    ScanAlyze/Cluster/TreeView, Scanalytics MicroArray Suite, Profiler, Silicon Genetics
Biospecific Data Analysis Software Systems
  Agilent GeneSpring
  Spotfire
  Invitrogen VectorNTI
Text Mining in Bioinformatics
  Techniques have progressed from simple recognition of terms to extraction of interaction relationships in complex sentences.
  Search objectives have broadened to a range of problems, e.g.,
    Improving homology search
    Identifying cellular location
    Deriving genetic network technologies
Current Work in Biomedical Text Mining (Cohen and Hersh)
  Text mining operates at a finer level of granularity than information retrieval and text summarization.
  TM examines relationships between specific kinds of information contained within and between documents.
  Areas of active research:
    Named entity recognition (genes, proteins, etc.)
    Text classification
    Synonym and abbreviation extraction
    Relationship extraction
    Hypothesis generation
    Integrated frameworks
Systems Biology
  Requires a shift in focus from genes and proteins to the system’s structure and dynamics
  Four key properties:
    System structures
    System dynamics
    Control method
    Design method
  Systems Biology Markup Language (SBML) and CellML
iSpecies.org
Overview of the Talk
  Data Mining and Knowledge Discovery
  DMKD in Bioinformatics
  DMKD in Chemistry
  Public Chemistry Databases for DMKD
  Overview of Web Services
  NIH-funded Projects Underway or Planned at Indiana University
  Educational Opportunities at IU
Data Mining in Chemistry
  “Modern experimentation (whether “classical” or high-throughput) should be based on the productive interplay of statistical techniques (design-of-experiments), molecular modeling as well as cheminformatics.”
  --Ulrich S. Schubert
Session on “Integration of Informatics and Knowledge Management Informatics”*
  Integration of Informatics at the Systems Level and at the Data Level
    Chris L. Waller, Ph.D., Director, World Wide Chemistry Informatics, Pfizer Global Research & Development
  Integrated Knowledge Management at Bayer HealthCare: Pharmacophore Informatics
    William J. Scott, Ph.D., Team Leader, Department for Chemistry Research, Bayer Pharmaceuticals Corporation
  Building a Knowledge Enabled Organization
    Cory R. Brouwer, Ph.D., Associate Director, Knowledge Management Informatics, Pfizer Global Research & Development
  Knowledge Management: Building a Knowledge Enabled Organization
    Victor Lobanov, Ph.D., Principal Scientist, MDI, Johnson & Johnson Pharmaceutical R&D
  *10th Annual Cheminformatics Conference, May 23-16, 2006, Philadelphia
Impact of HTS and Combinatorial Chemistry Research
  Most impact in:
    the pharmaceutical industry
    medical research
    catalyst research
  More recently:
    polymer and materials research.
Diversity of Data Mining in Chemistry
  On 5/7/2006 there were 4072 references to either “datamining” or “data mining” in Chemical Abstracts.
  3416 different index terms were assigned to those records.
    2772 used 1-5 times (81%)
    298 used 6-10 times (9%)
    103 used 11-15 times (3%)
    71 used 16-20 times (2%)
    38 used 21-25 times (1%)
    24 used 26-30 times (1%)
    110 used 31-480 times (3%)
  Most frequent co-term: “bioinformatics” with 480 hits or 12% of the occurrences
  [Bar chart: share of index terms (0% to 90%) by number of times used: 1-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-480]
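The percentages quoted on this slide can be re-derived from the counts. The short Python check below uses only the numbers given above (the variable names are mine): it confirms that the seven buckets add up to the 3416 index terms and recomputes each share, rounded to whole percent.

```python
# Counts of index terms by how often each term was used (figures from the slide).
TERM_COUNTS = {
    "1-5": 2772,
    "6-10": 298,
    "11-15": 103,
    "16-20": 71,
    "21-25": 38,
    "26-30": 24,
    "31-480": 110,
}
TOTAL_TERMS = 3416   # distinct index terms
TOTAL_REFS = 4072    # references matching "datamining" or "data mining"
TOP_TERM_HITS = 480  # hits for the co-term "bioinformatics"

def shares(counts):
    """Return each bucket's share of all index terms, as a rounded percentage."""
    total = sum(counts.values())
    return {bucket: round(100 * n / total) for bucket, n in counts.items()}

if __name__ == "__main__":
    assert sum(TERM_COUNTS.values()) == TOTAL_TERMS
    for bucket, pct in shares(TERM_COUNTS).items():
        print(f"{bucket:>7}: {pct}%")
    print(f"bioinformatics: {round(100 * TOP_TERM_HITS / TOTAL_REFS)}% of references")
```

The long tail is the point of the slide: about four out of five index terms occur five times or fewer.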
SFS graph
Components of the Semantic Web for Chemistry
  XML – eXtensible Markup Language
  RDF – Resource Description Framework
  RSS – Rich Site Summary
  Dublin Core – allows metadata-based newsfeeds
  OWL – for ontologies
  BPEL4WS – for workflow and web services
  Murray-Rust et al. Org. Biomol. Chem. 2004, 2, 3192-3203.
Chemical Markup Language (CML)
  Much of the semantics in a chemical article can be supported by CML
    Molecules
    Structures
    Reactions and reaction schemes
    Spectra (including annotations)
    Physicochemical data
  XML dictionaries and lexicons provide linguistic and semantic support for markup
  Will lead to quicker authoring and higher quality of embedded structures and data through machine validation
Key Factors in the Success of the Chemical Semantic Web
  Institutional Repositories: services deployed and supported at an institutional level to offer dissemination management, stewardship, and where appropriate, long-term preservation of both the intellectual work created by an institutional community and the records of the intellectual and cultural life of the institutional community
  Open Access Movement
Knowledge-Driven Bioinformatics Enhanced with Chemistry
Text Mining (Banville)
  “In the pharmaceutical field, it is ideally the marriage of biological and chemical information that needs to be the ultimate focus of text data mining applications.”
  Problems:
    Lack of universal publication standards for identifying each unique chemical entity
    Selective indexing policies of A&I services
    Need to understand how chemical structures link to biological processes
OSCAR3 Service
  Open-source Java application under development by Peter Murray-Rust’s group at Cambridge (not published yet)
  Extracts chemical information from either a paragraph of experimental data or a full paper (e.g. melting points, infra-red and NMR data, and mass spectral information)
  Produces an XML instance highlighting the chemical information with an Extensible Stylesheet Language (XSL) file
  At IU, we are attaching a SOAP input/output engine to build a web service based on OSCAR3.
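To give a feel for the kind of extraction described on this slide, here is a minimal regular-expression sketch in Python that pulls melting points out of an experimental paragraph. It is not OSCAR3's method or API, which this talk does not describe in detail; the pattern, the function name `find_melting_points`, and the sample sentence are assumptions made for illustration, and real experimental text is far more varied.

```python
import re

# Hypothetical pattern for phrases such as "mp 120-122 °C" or "m.p. 85 °C".
MP_PATTERN = re.compile(
    r"\bm\.?p\.?:?\s*"                 # "mp" or "m.p.", optional colon
    r"(\d+(?:\.\d+)?)"                 # lower bound (or single value)
    r"(?:\s*[-–]\s*(\d+(?:\.\d+)?))?"  # optional upper bound of a range
    r"\s*°\s*C",                       # unit
    re.IGNORECASE,
)

def find_melting_points(text):
    """Return a list of (low, high) melting points in °C found in text.

    A single reported value is returned with low == high.
    """
    results = []
    for low, high in MP_PATTERN.findall(text):
        low_val = float(low)
        high_val = float(high) if high else low_val
        results.append((low_val, high_val))
    return results

if __name__ == "__main__":
    paragraph = "White solid, mp 120-122 °C; IR 1710 cm-1. The salt had m.p. 85 °C."
    print(find_melting_points(paragraph))
```

A production extractor also has to handle units, decomposition notes, and literature values, which is where a dedicated tool earns its keep.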
OSCAR at Work in the Future
Semantic Scholars’ Grid I
  [Diagram]
  Components: Local MD Store, Local Harvest Store, Gatherer, Analyzer, Indexer
  Steps: query and get list; fetch MD and documents; run filter such as OSCAR2 on harvested MD and documents; store new MD; index all local MD
  Sources: Science.gov, PubMed, Google Scholar, Dspace, e-Prints, etc.
Semantic Scholars’ Grid II
  [Diagram]
  Components: Local MD Store, Updater, Foreign User Interface, SSG Viewer, Community Tools, Plug-in
  Functions: update and view foreign MD; update local MD; control foreign interactions; view all MD; access Community Tools; synchronize SSG and foreign MD; Instant Citation Index etc.
  External services: CiteULike, Connotea, Del.icio.us, etc.
  Sources: ACM, IEEE, Google Scholar, Wiley, etc.
Chemical Datamining Software
  SureChem
    http://surechem.reeltwo.com/
  CLiDE
    Recognizes structures, reactions, and text
    http://www.simbiosys.ca/clide/
  OSCAR
    “OSCAR1” to check experimental data
      • http://www.ch.cam.ac.uk/magnus/checker.html
      • http://www.rsc.org/Publishing/ReSourCe/AuthorGuidelines/AuthoringTools/ExperimentalDataChecker/
  CSR (Chemical Structure Reconstruction)
    http://www.scai.fraunhofer.de/uploads/media/MZ-ERCIM05_04.pdf
  MDL DocSearch: combines MDL’s Isentris platform and EMC’s Documentum
Overview of the Talk
  Data Mining and Knowledge Discovery
  DMKD in Bioinformatics
  DMKD in Chemistry
  Public Chemistry Databases for DMKD
  Overview of Web Services
  NIH-funded Projects Underway or Planned at Indiana University
  Educational Opportunities at IU
ChemDB
  http://cdb.ics.uci.edu/CHEM/Web/
ChEBI, Chemical Entities of Biological Interest
  Dictionary of molecular entities focused on small chemical compounds
  Features an ontological classification, showing the relationships between molecular entities or classes of entities and their parents and/or children
Vioxx Entry in ChEBI
The IUPAC International Chemical Identifier (InChI)
  Open source, non-proprietary, public-domain identifier for chemicals
  String of characters that uniquely represents a molecular substance
  Independent of the way the chemical structure is drawn
  Enables reliable structure recognition and easy linking of diverse data compilations
  Accepts as input MOLfiles (or SDfiles) and CML files
  Download the program to your computer at:
    http://www.iupac.org/inchi/license.html
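An InChI string is a sequence of layers separated by `/`, which is what makes it easy to use as a linking key. The Python sketch below splits an identifier into its version, formula, and prefixed layers using plain string handling. It does not generate or validate InChIs, which requires the IUPAC program mentioned above. The ethanol identifier is used only as a small example, and the function name `split_inchi` is hypothetical.

```python
def split_inchi(inchi):
    """Split an InChI string into version, formula, and prefixed layers.

    Layers after the formula are returned as (prefix, body) pairs in order,
    e.g. ("c", ...) for connections and ("h", ...) for hydrogens.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    version = parts[0]
    formula = parts[1] if len(parts) > 1 else ""
    layers = [(part[0], part[1:]) for part in parts[2:] if part]
    return {"version": version, "formula": formula, "layers": layers}

if __name__ == "__main__":
    # Ethanol, used here purely as an illustration.
    print(split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
```

Because the string is canonical, two databases that draw the same structure differently still produce the same identifier, so an exact string match is enough to join their records.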
Generation of InChI for Vioxx with wInChI
Vioxx Entry in PubChem Compounds Found with InChI
Vioxx Bioassay Data in PubChem
Vioxx PubChem Link to External Sources of Information
PubChem Link to Elsevier MDL
DiscoveryGate
  www.discoverygate.com
  Provides access to integrated scientific content from databases, journal articles, patent publications and reference works
  Information providers include Elsevier, Thomson-Derwent, FIZ CHEMIE, the U.S. FDA, Prous Science and Thieme
  MDL Compound Index (the master list of substances included in DiscoveryGate data sources) now exceeds 14 million unique chemical structures with the addition of 5 million chemical structures from the PubChem database.
The Elsevier MDL/NIH Link via PubChem and DiscoveryGate
  Cross-indexes PubChem to the Compound Index hosted on Elsevier MDL’s DiscoveryGate platform
  MDL added 5 million structures from PubChem to their index, resulting in over 14 million unique chemical structures
  Links go both ways
    Can move from biological data in PubChem to bioactivity, chemical sourcing, synthetic methodology, and EHS data in DiscoveryGate sources
Elsevier MDL’s xPharm
  Comprehensive set of records linking:
    Agents (compounds) (2300)
    Targets (600)
    Disorders (450)
    Principles that govern their interactions (180)
  Answers questions such as:
    • What targets are associated with control of blood pressure?
    • What adverse effects are associated with monoamine oxidase inhibitors?
Web Guide for Essential Cheminformatics Resources
• http://www.chembiogrid.org
• http://www.indiana.edu/~cheminfo/cicc/
ChemBioGrid Chemical Databases
Overview of the Talk
• Data Mining and Knowledge Discovery
• DMKD in Bioinformatics
• DMKD in Chemistry
• Public Chemistry Databases for DMKD
• Overview of Web Services
• NIH-funded Projects Underway or Planned at Indiana University
• Educational Opportunities at IU
Web Services Overview
• What are “Web Services”?
• A distributed invocation system built on Grid computing
  • Independent of platform and programming language
  • Built on existing Web standards
• A service oriented architecture with
  • Interfaces based on Internet protocols
  • Messages in XML (except for binary data attachments)
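
To make the "messages in XML" point concrete, here is a minimal sketch of how a client might assemble a SOAP 1.1 request envelope using only the Python standard library. The operation name `getSimilarStructures`, its parameters, and the namespace `http://example.org/chem` are hypothetical and are not taken from any service described in this talk.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/chem"  # hypothetical service namespace (assumption)


def build_soap_request(operation, params):
    """Return a SOAP 1.1 request envelope as an XML string."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # The operation element and its children carry the call and its arguments.
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        child = ET.SubElement(call, f"{{{SERVICE_NS}}}{name}")
        child.text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical operation and parameters, for illustration only.
request_xml = build_soap_request(
    "getSimilarStructures", {"smiles": "c1ccccc1", "cutoff": 0.8}
)
print(request_xml)
```

Because the message is plain XML sent over an Internet protocol, the same request could be produced by a client written in any language, which is the platform independence the slide refers to.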
Web Services for Chemistry: Problems
• Performance and scalability
• Proprietary data
• Competition from high-performance desktop applications
-- Geoff Hutchison, it’s a puzzle blog, 2005-01-05
ALSO:
• Lack of a substantial body of trustworthy Open Access databases
• Non-standard chemical data formats (over 40 in regular use and requiring normalization to one another)
DM Internet Toolbox Architecture
Overview of the Talk
• Data Mining and Knowledge Discovery
• DMKD in Bioinformatics
• DMKD in Chemistry
• Public Chemistry Databases for DMKD
• Overview of Web Services
• NIH-funded Projects Underway or Planned at Indiana University
• Educational Opportunities at IU
Indiana University Planned Projects:
• http://www.chembiogrid.org
• Design of a Grid-based distributed data architecture
• Development of tools for HTS data analysis and virtual screening
• Database for quantum mechanical simulation data
• Chemical prototype projects
  • Novel routes to enzymatic reaction mechanisms
  • Mechanism-based drug design
  • Data-inquiry-based development of new methods in natural product synthesis
Web Services for Chemistry at IU
• Interaction Layer
  • Purpose: Interactive software for creative access and exploitation of information by humans
  • Technologies: Microsoft .NET Smart Clients, portlets, Java applets, email and browser clients, visualization technologies
• Aggregation Layer
  • Purpose: Workflows and data schemas customized for particular domains, applications and users
  • Technologies: BPEL, Taverna and other workflow modeling tools, aggregate web services
• Web Service Layer
  • Purpose: Comprehensive data and computation provision including storage, calculation, semantics and meta-data exposed as web services
  • Technologies: Apache web services, SOAP wrappers, WSDL, UDDI, XML, Microsoft .NET
NCI Developmental Therapeutics Program (DTP)
• Downloadable data:
  • In vitro 60 cell line results
  • In vitro anti-HIV results
  • Yeast assay
  • 200,000+ chemical structures
  • Molecular targets
  • Microarray data
• Or search the database at:
  • http://dtp.nci.nih.gov/docs/dtp_search.html
IU Database of NIH DTP Data
• Contains over 200,000 chemical structures tested in 60 cellular assays from different human tumor cell lines
• Also includes microarray assay profiles for the untreated cell lines (~14,000 datapoints)
• The data is held in a local PostgreSQL database that is exposed as a web service
• Using workflows and complex SQL queries, we can do advanced data mining that exploits the chemical, biological and genomic information for particular audiences (chemists, biologists, etc.)
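
As a rough illustration of the kind of SQL query meant here, the sketch below joins a compound table to per-cell-line assay results and ranks compounds by mean activity. It uses an in-memory SQLite database as a stand-in for the PostgreSQL database mentioned on the slide; the table names, column names, cell-line labels, and values are all invented for the example and do not reflect the actual DTP schema or data.

```python
import sqlite3

# In-memory SQLite stands in for the local PostgreSQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE compound (id INTEGER PRIMARY KEY, smiles TEXT);
    CREATE TABLE assay_result (compound_id INTEGER, cell_line TEXT, activity REAL);
    """
)
# Invented toy rows; not real DTP data.
conn.executemany(
    "INSERT INTO compound VALUES (?, ?)",
    [(1, "CCO"), (2, "c1ccccc1"), (3, "CC(=O)O")],
)
conn.executemany(
    "INSERT INTO assay_result VALUES (?, ?, ?)",
    [
        (1, "LINE_A", 4.2), (1, "LINE_B", 4.4),
        (2, "LINE_A", 6.8), (2, "LINE_B", 7.1),
        (3, "LINE_A", 5.0), (3, "LINE_B", 6.5),
    ],
)


def most_active(connection, min_mean):
    """Return (id, smiles, mean activity) for compounds at or above min_mean."""
    query = """
        SELECT c.id, c.smiles, AVG(r.activity) AS mean_activity
        FROM compound c
        JOIN assay_result r ON r.compound_id = c.id
        GROUP BY c.id, c.smiles
        HAVING AVG(r.activity) >= ?
        ORDER BY mean_activity DESC
    """
    return connection.execute(query, (min_mean,)).fetchall()


hits = most_active(conn, 5.5)
for compound_id, smiles, mean_activity in hits:
    print(compound_id, smiles, round(mean_activity, 2))
```

Wrapping a parameterized query like this behind a service method is one plausible way such a database could be exposed to workflow tools, though the talk does not describe the actual interface.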
Mining the NIH DTP database
[Figure: data matrix of ~200,000 compounds by 60 cell lines, with ~14,000 gene expression values]
• Cell lines can be clustered based on gene expression similarity
• Compounds can be clustered based on similarity of profile across cell lines, or by chemical structure fingerprint similarity
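
The two similarity notions on this slide can be sketched in a few lines. The example below uses Pearson correlation for activity profiles across cell lines and the Tanimoto coefficient for fingerprints (represented as sets of on-bit positions), then groups items by a simple threshold linkage. These specific measures, the grouping rule, and the toy data are assumptions chosen for illustration; the talk does not say which measures or clustering method were used.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / (norm_x * norm_y)


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def group_by_similarity(names, similarity, threshold):
    """Merge items into groups whenever any pair reaches the threshold."""
    groups = []
    for name in names:
        linked = [g for g in groups
                  if any(similarity(name, other) >= threshold for other in g)]
        merged = [name]
        for g in linked:
            merged.extend(g)
            groups.remove(g)
        groups.append(merged)
    return [sorted(g) for g in groups]


# Invented activity profiles across four hypothetical cell lines.
profiles = {
    "cmpd_a": [1.0, 2.0, 3.0, 4.0],
    "cmpd_b": [2.0, 4.1, 6.0, 8.2],
    "cmpd_c": [4.0, 3.0, 2.0, 1.0],
}
clusters = group_by_similarity(
    list(profiles),
    lambda a, b: pearson(profiles[a], profiles[b]),
    threshold=0.9,
)
print(clusters)
```

Passing `tanimoto` over fingerprint sets instead of `pearson` over profiles gives the structure-based grouping with the same function.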
Use of Taverna at IU
• A protein implicated in tumor growth is supplied to the docking program (in this case HSP90 taken from the PDB 1Y4 complex)
• A 2D structure is supplied for input into the similarity search (in this case, the extracted bound ligand from the PDB 1Y4 complex)
• The workflow employs our local NIH DTP database service to search 200,000 compounds tested in human tumor cellular assays for structures similar to the ligand
• Client portlets are used to browse these structures
• Similar structures are filtered for druggability and are automatically passed to the OpenEye FRED docking program for docking into the target protein
• Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the Jmol applet
• Correlation of docking results and “biological fingerprints” across the human tumor cell lines can help identify potential mechanisms of action of DTP compounds
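
The steps on this slide (similarity search, drug-likeness filter, docking, ranking) form a linear pipeline, which can be sketched as a function that chains pluggable stages. Everything below is a stand-in: the compounds, fingerprints, and scores are invented, the filter checks only a molecular-weight ceiling of 500, and the "docking" stage just returns a stored toy score where a real workflow would call a docking program such as FRED.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of fingerprint on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def run_workflow(query, library, similarity, is_druglike, dock,
                 min_similarity=0.7, top_n=2):
    """Chain the stages: search, filter, dock, then rank by score."""
    similar = [c for c in library if similarity(query, c) >= min_similarity]
    candidates = [c for c in similar if is_druglike(c)]
    scored = [(c["name"], dock(c)) for c in candidates]
    scored.sort(key=lambda pair: pair[1])  # assumption: lower score is better
    return scored[:top_n]


# Invented query ligand and compound library (not real DTP data).
query = {"name": "ligand", "bits": {1, 2, 3, 4}}
library = [
    {"name": "cmpd_1", "bits": {1, 2, 3, 4, 5}, "mol_weight": 320.0, "toy_score": -7.5},
    {"name": "cmpd_2", "bits": {1, 2, 3}, "mol_weight": 610.0, "toy_score": -9.0},
    {"name": "cmpd_3", "bits": {7, 8}, "mol_weight": 250.0, "toy_score": -8.0},
    {"name": "cmpd_4", "bits": {1, 2, 3, 4}, "mol_weight": 410.0, "toy_score": -8.2},
]

ranked = run_workflow(
    query,
    library,
    similarity=lambda q, c: tanimoto(q["bits"], c["bits"]),
    is_druglike=lambda c: c["mol_weight"] <= 500.0,  # toy filter, one criterion only
    dock=lambda c: c["toy_score"],  # placeholder for a call to a docking service
)
print(ranked)
```

In a workflow tool such as Taverna, each callable would instead be a web service invocation, with the tool handling the data hand-off between stages.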
Taverna Workflow
• Visual depiction of workflow
• Workflow definition
• Available web services (WSDL)
Taverna in Action
CGL Contributions to CICC
• Build Web/Grid services for connecting
  • Data sources
  • Applications (simulation, data mining, data assimilation, imaging, etc.)
  • Computing resources
  • Information services
• Third party tool evaluation
  • Workflow (Taverna)
  • Grid tools: Globus and Condor (for interacting with TeraGrid)
• Building standards-based Web portal environments
  • OGCE grid portal project
  • JSR 168 Java standards
  • This activity will begin in earnest over the summer.
Digital Chemistry (BCI) Clustering Service Methods
Service Method | Description | Input | Output
makebitsGenerate | Generate fingerprints from a SMILES structure | SMIstring | Fingerprint string
divkmGenerate | Cluster fingerprints with Divkmeans | SCNstring | Clustered hierarchy
smile2dkm | Makebits + divkm | SMIstring | Clustered hierarchy
optclusGenerate | Generate the best levels in a hierarchy | DKMstring | Best partition cluster level
rnnclusGenerate | Extract individual cluster partitions | DKMstring | Indiv. cluster partitions
smile2ClusterPartitioned | Generate a new SMILES structure w/ extra col. | SMIstring | New SMILES structure
Local Web Service Methods for WWMM of PMR’s Group
Service | Description | Input | Output
InChIGoogle | Search an InChI structure through Google | inchiBasic, type | Search result in HTML format
InChIServer | Generate InChI | version, format | An InChI structure
OBServer | Transform a chemical format to another using Open Babel | format, inputData, outputData, options | Converted chemical structure string
CMLRSSServer | Generate CMLRSS feed from CML data | mol, title, description, link, source | Converted CMLRSS feed of CML data
More Services
Service | Description
VOTables and related services | General purpose service for manipulating tabular data. Comes with third party tools for parsing, manipulating, displaying data. Includes import tools. Using this as an intermediary for data exchange between databases.
Draw2d | Uses CDK tools to create 2D images from SDF formatted data.
Common Substructure | Another CDK service that can be used to calculate the common substructure between two molecules.
Other CDK Services | See http://www.chembiogrid.org/wiki/index.php/Web_Services_Infrastructure. Based on Dr. Rajarshi Guha’s services.
ToxTree
• An in silico toxicology prediction suite
• Based on the CDK toolkit
• Built on CML
• Released as Open Source under the GPL
• Standalone PC software
• User Manual: http://ecb.jrc.it/DOCUMENTS/QSAR/TOXTREE/toxTree_user_manual.pdf
ToxTree Service
• An open-source Java application by Nina Jeliazkova
• Estimates toxic hazard by applying a decision tree approach
• Encodes the Cramer scheme
  • (Cramer G. M., R. A. Ford, R. L. Hall, Estimation of Toxic Hazard - A Decision Tree Approach, J. Cosmet. Toxicol., Vol. 16, pp. 255-276, Pergamon Press, 1978)
• Can be applied to datasets from various compatible file types
• We are converting this GUI application to a text-based web service
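
A decision tree approach of this kind walks a compound through a series of yes/no questions until it reaches a hazard class. The sketch below shows the general mechanism only. The two questions and the compound flags are invented placeholders, not rules from the Cramer scheme or from ToxTree, and the three class labels are used purely as leaf names.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Question:
    """One yes/no node; each branch leads to another node or a class label."""
    text: str
    test: Callable[[dict], bool]
    if_yes: Union["Question", str]
    if_no: Union["Question", str]


def classify(node, compound):
    """Follow the tree from node and return (class label, answered questions)."""
    path = []
    while isinstance(node, Question):
        answer = node.test(compound)
        path.append((node.text, answer))
        node = node.if_yes if answer else node.if_no
    return node, path


# Placeholder questions (assumptions, not the published Cramer rules).
simple_structure = Question(
    "Is the structure flagged as simple?",
    lambda c: c["simple"],
    if_yes="Class I",
    if_no="Class II",
)
root = Question(
    "Does it contain a flagged functional group?",
    lambda c: c["flagged_group"],
    if_yes="Class III",
    if_no=simple_structure,
)

label, path = classify(root, {"flagged_group": False, "simple": True})
print(label)
for question, answer in path:
    print(question, "->", "yes" if answer else "no")
```

Returning the path along with the label gives a text trace of the decision, which is the sort of output a text-based service could hand back in place of a GUI display.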
Overview of the Talk
• Data Mining and Knowledge Discovery
• DMKD in Bioinformatics
• DMKD in Chemistry
• Public Chemistry Databases for DMKD
• Overview of Web Services
• NIH-funded Projects Underway or Planned at Indiana University
• Educational Opportunities at IU
Chemoinformatics Education at IU
• School of Informatics degree programs
  • BS, MS, PhD
• Programs offered at both the Indianapolis (IUPUI) and Bloomington (IUB) campuses
Other Educational Activities
• Graduate Certificate Program in Chemical Informatics (4 courses by Distance Education)
  • I571 Chemical Information Technology (3 cr.)
  • I572 Computational Chemistry and Molecular Modeling (3 cr.)
  • I573 Programming Techniques for Chemical and Life Science Informatics (3 cr.)
  • I553 Independent Study in Chemical Informatics (3 cr.)
• I571 as CIC Courseshare offering w. Michigan
• Experiments with teleconferencing as a distance education tool
PhD in Informatics
• Began in August 2005
• Tracks:
  • bioinformatics; chemical informatics; health informatics; human-computer interaction design; social and organizational informatics
• Under development:
  • complex systems, networks, modeling and simulation; cybersecurity; discovery and application of information; logical and mathematical foundations; music informatics
Graduate Enrollment: Chemo-, Laboratory, Bio-, Health Informatics
MS | Chem | Lab | Bio | Health
IUB | 3 | 0 | 38 | 0
IUPUI | 6 | 15 | 34 | 36
TOTAL | 9 | 15 | 72 | 36
PhD | Chem | Lab | Bio | Health
IUB | 1 | 0 | 3 | 0
IUPUI | 1 | 0 | 4 | 3
TOTAL | 2 | 0 | 7 | 3
Software/DBs Used in the Program
Company | Products and/or (Target Area)
ArgusLab | (Molecular modeling)
Digital Chemistry | Toolkit (Clustering)
Cambridge Cryst Data Ctr | Cambridge Structural DB & GOLD
CambridgeSoft | ChemDraw Ultra
Chemical Abstracts Service | SciFinder Scholar
ChemAxon | Marvin (and other software)
Daylight Chemical Info System | Toolkit
FIZ Karlsruhe | Inorganic Crystal Structure DB
IO-Informatics | Sentient
MDL | CrossFire Beilstein and Gmelin
OpenEye | Toolkit (and other software)
Sage Informatics | ChemTK
Serena Software | PCMODEL
Spotfire | DecisionSite
STN International | STN Express with Discover (Anal Ed)
Wavefunction | Spartan
Closing quote
“The future of chemistry depends on the automated analysis of chemical knowledge, combining disparate data sources in a single resource, . . . which can be analysed using computational techniques to assess and build on these data.”
Townsend et al. Org. Biomol. Chem. 2004, 2, 3299.
We all need help when overloaded!
Web 2.0
• Social Software: allows group interactions
• Enables groups to form and organize themselves
• Examples
  • Wikis
  • Blogs
  • RSS (now found on chemistry.org)
  • Podcasting/Coursecasting
  • Webcasting/Webinars
  • Flickr
  • Jybe
  • FURL
FURL (Frame Uniform Resource Locator)
• For archiving and sharing of web pages
• Furler can capture the pages for a discussion group
• Tracks useful pages for a discussion
• http://www.furl.net/home.jsp
Jybe (Join Your Browser with Everyone)
• Collaboration and communication in real time with IE and Firefox
• Screen-sharing AND editing
• Privacy protected: must be invited
• Upload documents to convert to HTML
• http://www.jybe.com