TRANSCRIPT
Challenges and Promises of the Semantic Web in Health Care and Life Sciences
Olivier Bodenreider
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA
Special Library Association
Washington, DC
June 15, 2009
Lister Hill National Center for Biomedical Communications 2
Outline
Semantic Web
Semantic Web for Health Care and Life Sciences
Promises
Challenges
Semantic Web and medical libraries
Semantic Web
Quick introduction
Defining the Semantic Web
“The Semantic Web... is an extension of the current web in which... information is given well-defined meaning,... better enabling computers and people to work in cooperation.”
The Semantic Web
Tim Berners-Lee, James Hendler and Ora Lassila
Scientific American, May 2001
http://www.scientificamerican.com/article.cfm?id=the-semantic-web
From linking documents to linking data
Original WWW
  Web of documents
  Links among documents
  No semantics in the links
  For humans
Semantic Web
  Web of data
  Links among concepts in resources
  Links have an explicit semantics
  For humans and machines
Set of enabling technologies and tools
Unambiguous names for resources and entities
Standard representation for explicit links among entities
Shared conceptualization of the domain
Standard tools for querying data
Standard mechanisms for reasoning over data
Semantic Web in 2009 Standards
Standards
  Uniform Resource Identifiers (URI)
  Resource Description Framework (RDF)
  Web Ontology Language (OWL)
  SPARQL query language
  Rule Interchange Format (RIF)
Semantic Web in 2009 Data
http://linkeddata.org
Linked data
Resources available in RDF
  Unique, unambiguous identifiers for entities
  Explicit relations among entities
Links across resources (federation)
  Enabled by
    Shared identifiers across resources
    Global identifiers, resolvable on the web
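
A minimal sketch of federation through shared identifiers, written in plain Python rather than with an RDF toolkit. The two small datasets, the example.org URIs, and the helper names `federate` and `describe` are hypothetical and only illustrate the idea on this slide: when two resources use the same identifier for an entity, their statements about it combine without any extra mapping step.

```python
# Two independently published datasets, each a set of
# (subject, predicate, object) triples.
# ASSUMPTION: every URI below is a made-up placeholder under example.org.
GENE = "http://example.org/gene/351"
HAS_SYMBOL = "http://example.org/vocab/has_symbol"
HAS_DISEASE = "http://example.org/vocab/has_associated_disease"

gene_resource = {(GENE, HAS_SYMBOL, "APP")}
disease_resource = {(GENE, HAS_DISEASE, "Alzheimer disease")}


def federate(*datasets):
    """Merge datasets; triples about the same identifier line up on their own."""
    merged = set()
    for dataset in datasets:
        merged |= dataset
    return merged


def describe(triples, subject):
    """Return {predicate: sorted objects} for one subject."""
    description = {}
    for s, p, o in triples:
        if s == subject:
            description.setdefault(p, []).append(o)
    return {p: sorted(objects) for p, objects in description.items()}


merged = federate(gene_resource, disease_resource)
description = describe(merged, GENE)
```

Because both datasets name the gene with the same global identifier, `description` holds the symbol from one source and the disease from the other.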
Semantic Web for Health Care and Life Sciences
The original Semantic Web scenario
“Mom needs to see a specialist and then has to have a series of physical therapy sessions. […] I'm going to have my agent set up the appointments.”
“The agent promptly retrieved information about Mom's prescribed treatment from the doctor's agent, looked up several lists of providers, and checked for the ones in-plan for Mom’s insurance within a 20-mile radius of her home and with a rating of excellent or very good on trusted rating services.”
…
The Semantic Web
Tim Berners-Lee, James Hendler and Ora Lassila
Scientific American, May 2001
From linked data…
http://linkeddata.org
…To linked biomedical data
[Tim Berners-Lee TED 2009 conference]
http://www.w3.org/2009/Talks/0204-ted-tbl/#(1)
W3C Health Care and Life Sciences IG
http://www.w3.org/2001/sw/hcls/
W3C Health Care and Life Sciences IG
Formed in 2005
Re-chartered in 2008
Broad industry participation
  Over 100 members
  Mailing list of over 600
Get involved!
  Team contact: Eric Prud’hommeaux <[email protected]>
HCLSIG activities
Document use cases to aid individuals in understanding the business and technical benefits of using Semantic Web technologies
Document guidelines to accelerate the adoption of the technology
Implement a selection of the use cases as proof-of-concept demonstrations
Develop high-level vocabularies
Disseminate information about the group’s work at government, industry, and academic events
HCLSIG task forces
BioRDF
  Use of RDF (and OWL) to represent biomedical data
Clinical Observations Interoperability
  Enable the reuse of clinical research and clinical practice data
Linking Open Drug Data
  Linking drug information sources
Translational Medicine Ontology
  Development of an ontology for biomedical data integration
Scientific Discourse
  Annotate information extracted from the scientific literature
Terminology
  Use of SW technologies for representing existing biomedical terminologies
Promises of the Semantic Web for Health Care and Life Sciences
Biomedical Semantic Web
Integration
  Data/Information
  E.g., translational research
Hypothesis generation
Knowledge discovery
[Ruttenberg, BMC Bioinf. 2007]
HCLS mashup of biomedical sources
NeuronDB
BAMS
NC Annotations
Homologene
SWAN
Entrez Gene
Gene Ontology
Mammalian Phenotype
PDSPki
BrainPharm
AlzGene
Antibodies
PubChem
MeSH
Reactome
Allen Brain Atlas
Publications
http://esw.w3.org/topic/HCLS/HCLSIG_DemoHomePage_HCLSIG_Demo
Shared identifiers Example
GO
HCLS mashup
NeuronDB: Protein (channels/receptors), Neurotransmitters, Neuroanatomy, Cell, Compartments, Currents
BAMS: Protein, Neuroanatomy, Cells, Metabolites (channels), PubMedID
NC Annotations: Genes/Proteins, Processes, Cells (maybe), PubMed ID
Allen Brain Atlas: Genes, Brain images, Gross anatomy -> neuroanatomy
Homologene: Genes, Species, Orthologies, Proofs
SWAN: PubMedID, Hypothesis, Questions, Evidence, Genes
Entrez Gene: Genes, Protein, GO, PubMedID, Interaction (g/p), Chromosome, C. location
GO: Molecular function, Cell components, Biological process, Annotation gene, PubMedID
Mammalian Phenotype: Genes, Phenotypes, Disease, PubMedID
PDSPki: Proteins, Chemicals, Neurotransmitters
BrainPharm: Drug, Drug effect, Pathological agent, Phenotype, Receptors, Channels, Cell types, PubMedID, Disease
AlzGene: Gene, Polymorphism, Population, Alz Diagnosis
Antibodies: Genes, Antibodies
PubChem: Name, Structure, Properties, MeSH term
MeSH: Drugs, Anatomy, Phenotypes, Compounds, Chemicals, PubMedID, PubChem
Reactome: Genes/proteins, Interactions, Cellular location, Processes (GO)
HCLS mashups
Based on RDF/OWL
Based on shared identifiers
  “Recombinant data” (E. Neumann)
Ontologies used in some cases
Support applications (SWAN, SenseLab, etc.)
Journal of Biomedical Informatics special issue on Semantic bio-mashups
[J. Biomedical Informatics 41(5) 2008]
Semantic bio-mashups
Bio2RDF: Towards a mashup to build bioinformatics knowledge systems
Identifying disease-causal genes using Semantic Web-based representation of integrated genomic and phenomic knowledge
Schema driven assignment and implementation of life science identifiers (LSIDs)
The SWAN biomedical discourse ontology
An ontology-driven semantic mashup of gene and biological pathway information: Application to the domain of nicotine dependence
Towards an ontology for sharing medical images and regions of interest in neuroimaging
yOWL: An ontology-driven knowledge base for yeast biologists
Dynamic sub-ontology evolution for traditional Chinese medicine web ontology
Ontology-centric integration and navigation of the dengue literature
Infrastructure for dynamic knowledge integration—Automated biomedical ontology extension using textual resources
An ontological knowledge framework for adaptive medical workflow
Semi-automatic web service composition for the life sciences using the BioMoby semantic web framework
Combining Semantic Web technologies with Multi-Agent Systems for integrated access to biological resources
[J. Biomedical Informatics 41(5) 2008]
Bio2RDF
“Semantic web atlas of postgenomic knowledge”
65M “triples”
34 biomedical sources
Publicly available
Distributed environment
Another mashup example
gene
GO
PubMed
Gene name
OMIM
Sequence
Interactions
Glycosyltransferase
Congenital muscular dystrophy
Link between glycosyltransferase activity and congenital muscular dystrophy?
…
APP(GeneID: 351)
amyloid beta A4 protein
has_protein_name
RDF triple: Gene property
subject: APP (geneid-351)
predicate: eg:has_protein_reference_name_E
object: amyloid beta A4 protein
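
A minimal sketch of this triple as a Python data structure, serialized as one N-Triples line. The subject, predicate name, and object come from the slide. The base URI behind the "eg:" prefix is an assumption (the slide does not expand it), and `Triple` and `to_ntriples` are hypothetical helper names.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# ASSUMPTION: the slide gives only the "eg:" prefix; this base URI is a placeholder.
EG = "http://example.org/eg/"

triple = Triple(
    subject=EG + "geneid-351",
    predicate=EG + "has_protein_reference_name_E",
    obj="amyloid beta A4 protein",
)


def to_ntriples(t: Triple) -> str:
    """Serialize as one N-Triples line, treating the object as a plain literal."""
    escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{t.subject}> <{t.predicate}> "{escaped}" .'


line = to_ntriples(triple)
```

Here the subject and predicate are identifiers, while the object is a literal string. An object can also be another identifier, which is what turns a collection of triples into a graph.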
RDF graph: Connecting several genes
PARK1 has_associated_disease Parkinson disease
MAPT has_associated_disease Parkinson disease
MAPT has_associated_disease Pick disease
TBP has_associated_disease Parkinson disease
TBP has_associated_disease Spinocerebellar ataxia
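
The gene and disease triples on this slide can be sketched as a small graph in plain Python. The triples are the ones shown on the slide. The index names and the helper `related_diseases` are hypothetical, and a real system would more likely use an RDF store queried with SPARQL.

```python
from collections import defaultdict

P = "has_associated_disease"
triples = [
    ("PARK1", P, "Parkinson disease"),
    ("MAPT", P, "Parkinson disease"),
    ("MAPT", P, "Pick disease"),
    ("TBP", P, "Parkinson disease"),
    ("TBP", P, "Spinocerebellar ataxia"),
]

# Index the graph in both directions so shared nodes become visible.
genes_by_disease = defaultdict(set)
diseases_by_gene = defaultdict(set)
for gene, _, disease in triples:
    genes_by_disease[disease].add(gene)
    diseases_by_gene[gene].add(disease)

# The shared node "Parkinson disease" connects the three genes.
parkinson_genes = sorted(genes_by_disease["Parkinson disease"])


def related_diseases(disease):
    """Diseases reachable from `disease` through a shared gene (two hops)."""
    found = set()
    for gene in genes_by_disease[disease]:
        found |= diseases_by_gene[gene]
    found.discard(disease)
    return sorted(found)
```

Calling `related_diseases("Parkinson disease")` follows the graph through MAPT and TBP to the other diseases those genes are associated with.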
From glycosyltransferase to congenital muscular dystrophy
EG:9215 (LARGE) has_associated_phenotype MIM:608840 (Muscular dystrophy, congenital, type 1D)
EG:9215 (LARGE) has_molecular_function GO:0008375 (acetylglucosaminyl-transferase)
GO:0008375 (acetylglucosaminyl-transferase) is linked by isa to GO:0016757 (glycosyltransferase)
[Diagram: GO:0008194 and GO:0016758 also appear in the isa hierarchy]
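
A hedged sketch of the traversal this slide describes: starting from a general molecular function, follow isa links down to more specific functions, then to the genes annotated with them, then to the associated phenotypes. The two gene triples use identifiers from the slide. The exact isa edges are not recoverable from the slide text, so the chain below is an assumption made for illustration, and the helper names are hypothetical.

```python
# Triples named on the slide.
facts = [
    ("EG:9215", "has_associated_phenotype", "MIM:608840"),
    ("EG:9215", "has_molecular_function", "GO:0008375"),
]

# ASSUMPTION: the slide lists these GO identifiers but not the exact isa edges;
# this child -> parents chain is illustrative only.
isa = {
    "GO:0008375": ["GO:0008194"],
    "GO:0008194": ["GO:0016757"],
}


def ancestors(term):
    """All terms reachable via isa, including the term itself."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(isa.get(current, []))
    return seen


def phenotypes_for_function(go_term):
    """Phenotypes of genes whose function is go_term or one of its descendants."""
    genes = {
        s for s, p, o in facts
        if p == "has_molecular_function" and go_term in ancestors(o)
    }
    return sorted(
        o for s, p, o in facts
        if p == "has_associated_phenotype" and s in genes
    )


result = phenotypes_for_function("GO:0016757")
```

The gene is annotated only with the specific function, so the link to the general glycosyltransferase term is found only because the isa hierarchy is traversed.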
Summary of promises
Flexibility
  Easier to federate than relational databases
  Inherently distributed
Based on standards
  Tooling available
Platform for data integration
Enabling technology for semantic interoperability, hypothesis generation and knowledge discovery
Challenges of the Semantic Web for Health Care and Life Sciences
Challenging issues
Permanent identifiers for biomedical entities
Bridges across ontologies
Other issues
  Availability of biomedical datasets
  Discoverability
  Formalism
  Ontology integration
  Scalability
Challenging issues
Permanent identifiers for biomedical entities
Identifying biomedical entities
Multiple identifiers for the same entity in different ontologies
  Barrier to data integration in general
    Data annotated to different ontologies cannot “recombine”
    Need for mappings across ontologies
  Barrier to data integration in the Semantic Web
    Multiple possible identifiers for the same entity
      Depending on the underlying representational scheme (URI vs. LSID)
      Depending on who creates the URI
Possible solutions
PURL http://purl.org
  One level of indirection between developers and users
  Independence from local constraints at the developer’s end
The institution creating a resource is also responsible for minting URIs
  E.g., URI for genes in Entrez Gene
Guidelines: “URI note”
  W3C Health Care and Life Sciences Interest Group
Shared names initiative
  Identify resources vs. entities
  [http://sharedname.org/]
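
The "one level of indirection" idea behind persistent URLs can be sketched with a lookup table. This is not how purl.org is implemented. The redirect table, every URL in it, and the helpers `resolve` and `relocate` are hypothetical placeholders.

```python
# Hypothetical redirect table kept by a PURL-style resolver.
# ASSUMPTION: every URL here is a made-up placeholder, not a real PURL
# or a real Entrez Gene address.
PURL = "http://purl.example.org/gene/351"

redirects = {
    PURL: "http://server-a.example.org/genes?id=351",
}


def resolve(persistent_url: str) -> str:
    """Return the current location behind a persistent identifier."""
    return redirects[persistent_url]


def relocate(persistent_url: str, new_location: str) -> None:
    """The developer moves the resource; the persistent identifier is unchanged."""
    redirects[persistent_url] = new_location


before = resolve(PURL)
relocate(PURL, "http://server-b.example.org/entity/351")
after = resolve(PURL)
```

Users keep citing the same persistent identifier before and after the move, which is the independence from local constraints that the slide mentions.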
Challenging issues
Bridges across ontologies
Trans-namespace integration
Biomedical literature: MeSH, Addison Disease (D000224)
Clinical repositories: SNOMED CT, Addison's disease (363732003)
ICD 10: Primary adrenocortical insufficiency (E27.1)
(Integrated) concept repositories
Unified Medical Language System
  http://umlsks.nlm.nih.gov
NCBO’s BioPortal
  http://www.bioontology.org/tools/portal/bioportal.html
caDSR
  http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/cadsr
Open Biomedical Ontologies (OBO)
  http://obofoundry.org/
Integrating subdomains
Biomedical literature: MeSH
Genome annotations: GO
Model organisms: NCBI Taxonomy
Genetic knowledge bases: OMIM
Clinical repositories: SNOMED CT
Anatomy: FMA
Other subdomains: …
UMLS
Trans-namespace integration
Biomedical literature: MeSH, Addison Disease (D000224)
Clinical repositories: SNOMED CT, Addison's disease (363732003)
UMLS: C0001403
Genome annotations: GO
Model organisms: NCBI Taxonomy
Genetic knowledge bases: OMIM
Anatomy: FMA
Other subdomains: …
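
A small sketch of how a shared concept identifier bridges two namespaces. The MeSH code, the SNOMED CT code, and the UMLS identifier C0001403 are taken from the slide. The annotated records (`literature`, `clinical`) and the helper names are hypothetical, and the lookup table stands in for a real mapping resource.

```python
# (source vocabulary, code) -> UMLS concept identifier, as shown on the slide.
cui_of = {
    ("MeSH", "D000224"): "C0001403",         # Addison Disease
    ("SNOMED CT", "363732003"): "C0001403",  # Addison's disease
}


def same_concept(a, b):
    """True when both codes are known and map to the same concept identifier."""
    cui_a, cui_b = cui_of.get(a), cui_of.get(b)
    return cui_a is not None and cui_a == cui_b


# ASSUMPTION: placeholder records, one indexed with MeSH and one coded
# with SNOMED CT; the record identifiers are invented.
literature = [("PMID:0000001", ("MeSH", "D000224"))]
clinical = [("record-42", ("SNOMED CT", "363732003"))]


def join_on_concept(left, right):
    """Pair records from two namespaces that refer to the same concept."""
    return [
        (left_id, right_id, cui_of[left_code])
        for left_id, left_code in left
        for right_id, right_code in right
        if same_concept(left_code, right_code)
    ]


links = join_on_concept(literature, clinical)
```

Neither dataset mentions the other's code, so the join succeeds only because both codes resolve to the same concept.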
Mappings
Created manually (e.g., UMLS)
  Purpose
  Directionality
Created automatically (e.g., BioPortal)
  Lexically: ambiguity, normalization
  Semantically: lack of / incomplete formal definitions
Key to enabling semantic interoperability
Enabling resource for the Semantic Web
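
A simplified illustration of lexical mapping through normalization, using the term names from the Addison disease example earlier in the talk. The `normalize` function is a hypothetical, much reduced stand-in for real lexical tools: it lowercases, drops possessives and punctuation, and ignores word order.

```python
import re


def normalize(term: str) -> str:
    """Lowercase, drop possessive 's, strip punctuation, and sort the words."""
    text = term.lower()
    text = re.sub(r"['’]s\b", "", text)
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(sorted(text.split()))


mesh_term = "Addison Disease"
snomed_term = "Addison's disease"
icd_term = "Primary adrenocortical insufficiency"

lexical_match = normalize(mesh_term) == normalize(snomed_term)
lexical_miss = normalize(mesh_term) == normalize(icd_term)
```

The first two names match after normalization. The third names the same condition with entirely different words, so a purely lexical approach misses it, which is one reason lexical mappings remain incomplete.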
Challenging issues
Other issues
Availability
Publicly available datasets
  Is a de facto prerequisite for Semantic Web applications
  Is a requirement from some funding agencies (“sharing plan”)
Many datasets are freely available
  Model organisms databases annotated to the Gene Ontology
  NCBI knowledge bases in Entrez
  Open Access journals
  Public archives (e.g., PubMedCentral)
  Ontologies from the Open Biomedical Ontology (OBO) family
Limited availability
  Intellectual property issues
    Some ontologies and terminologies
  Confidentiality and privacy issues
    Personally identifiable medical information
Discoverability
No universal repositories for biomedical datasets
  Some datasets made available through portals (NCBI, EBI, NCBO)
Ontology repositories
  UMLS: 152 source vocabularies (biased towards healthcare applications)
  NCBO BioPortal: ~141 ontologies (biased towards biological applications)
  Limited overlap between the two repositories
Need for discovery services
  Metadata for ontologies and biomedical datasets
Formalism
Several major formalisms for ontologies
  Web Ontology Language (OWL) – NCI Thesaurus
  OBO format – most OBO ontologies
  UMLS Rich Release Format (RRF) – UMLS, RxNorm
Biomedical datasets
  RDF/OWL
  Legacy: free text, Excel spreadsheets, PowerPoint
Conversion mechanisms
  OBO to OWL
  LexGrid (import/export to LexGrid internal format)
Ontology integration
Post hoc integration, from the bottom up
  UMLS approach
  Integrates ontologies “as is”, including legacy ontologies
  Facilitates the integration of the corresponding datasets
Coordinated development of ontologies
  OBO Foundry approach
  Ensures consistency ab initio
  Excludes legacy ontologies
Scalability
Billions of triples for raw data in biomedicine
Need for query systems to access distributed repositories (“data cloud”)
Technical solutions have emerged
  Traditional vendors have embraced RDF/OWL
  New vendors/products
Semantic Web and medical libraries
Why additional library services?
Biomedical information is growing at an ever faster pace
  High-throughput approach to knowledge processing
Information retrieval is the starting point, not the end of the journey for the researcher
  Towards “computable” knowledge
Integration between literature and other resources is insufficient
  Adequate for navigation purposes
  Insufficient for knowledge processing
What additional services?
Refined information retrieval
  Indexing on relations in addition to concepts
  Find articles asserting that IL-13 inhibits COX-2
Multi-document summarization
  Extract and visualize facts from the literature
  Summarize the top 300 papers on panic disorder
Question answering
  Clinical and biological questions
  What drugs interact with imipramine?
Knowledge discovery
  Reasoning with facts from heterogeneous resources
  From MEDLINE and UMLS together
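
A minimal sketch of the difference between indexing on concepts and indexing on relations, using the IL-13 and COX-2 query from the slide. The predications and their article identifiers are invented placeholders, not real extraction results, and the index names are hypothetical.

```python
from collections import defaultdict

# ASSUMPTION: made-up predications of the kind a text-mining step might
# produce; the article identifiers are placeholders, not real PMIDs.
predications = [
    ("article-A", "IL-13", "inhibits", "COX-2"),
    ("article-B", "IL-13", "stimulates", "COX-2"),
    ("article-C", "Imipramine", "treats", "Panic Disorder"),
]

concept_index = defaultdict(set)   # concept -> articles mentioning it
relation_index = defaultdict(set)  # (subject, relation, object) -> articles

for article, subject, relation, obj in predications:
    concept_index[subject].add(article)
    concept_index[obj].add(article)
    relation_index[(subject, relation, obj)].add(article)

# Concept-only retrieval: every article mentioning both concepts.
by_concepts = sorted(concept_index["IL-13"] & concept_index["COX-2"])

# Relation-aware retrieval: only articles asserting the specific relation.
by_relation = sorted(relation_index[("IL-13", "inhibits", "COX-2")])
```

The concept-based query returns both articles about IL-13 and COX-2. The relation-based query narrows the result to the one asserting inhibition.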
Normalized and integrated knowledge
Normalized knowledge
  Common format
  Common identification mechanism
Integrated knowledge
  Single repository
  Seamless environment
  Phenotype and genotype information together
Biomedical Knowledge Repository
Sources of knowledge
Biomedical literature
  Predications extracted from MEDLINE abstracts and publicly available full-text articles using text mining techniques
  Other corpora (e.g., ClinicalTrials.gov)
Terminological knowledge
  UMLS
Structured knowledge bases
  NCBI resources (e.g., Entrez Gene)
  Functional annotations from model organism databases
  …
Contributed knowledge
  The repository is open to collaborators outside NLM
Formalism: Triples
Facts
Assertions
Relations
Semantic predications
RDF triples
concept1 relationship concept2
Imipramine treats Panic Disorder
APP has_associated_disease Alzheimer disease
Annotated knowledge Metadata
Provenance information
  Source (e.g., PMID)
  Extraction mechanism
  Timestamp
Frequency information
  Redundancy
Collaborative annotation
  “Was this information useful?”
  Context of use/usefulness
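
A sketch of how a triple might carry the metadata listed on this slide. The two facts are the ones used in the talk. The record layout, the source labels, the extractor names, and the timestamps are assumptions made for illustration and do not describe the repository's actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Predication:
    subject: str
    relation: str
    obj: str
    source: str      # provenance: where the fact came from (e.g., a PMID)
    extractor: str   # provenance: extraction mechanism
    timestamp: str   # provenance: when it was extracted (ISO 8601)


# ASSUMPTION: sources, extractor names, and timestamps are placeholders.
records = [
    Predication("Imipramine", "treats", "Panic Disorder",
                "article-1", "text-mining", "2009-06-15T00:00:00"),
    Predication("Imipramine", "treats", "Panic Disorder",
                "article-2", "text-mining", "2009-06-15T00:00:00"),
    Predication("APP", "has_associated_disease", "Alzheimer disease",
                "knowledge-base-1", "curated", "2009-06-15T00:00:00"),
]

# Frequency information: how many sources assert the same fact (redundancy).
frequency = Counter((r.subject, r.relation, r.obj) for r in records)

# Provenance lookup: which sources support a given fact.
supporting_sources = sorted(
    r.source for r in records
    if (r.subject, r.relation, r.obj) == ("Imipramine", "treats", "Panic Disorder")
)
```

Keeping provenance on each record rather than on the merged fact means redundancy can be counted and individual sources can still be traced.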
Advanced Library Services: A research project at NLM
[Diagram: the Biomedical Knowledge Repository draws on the biomedical literature (MEDLINE, CT.gov), terminological knowledge (UMLS), structured knowledge bases (GO, Entrez Gene), and contributed knowledge, with source selection (PubMed, annotations). It supports information retrieval, question answering, knowledge discovery, and document summarization.]
Semantic Medline Prototype
http://skr3.nlm.nih.gov/SemMedDemo/
Courtesy of Tom Rindflesch (NLM)
Wrapping up
Take home points
Biomedical Semantic Web
  Data integration
  Enable applications
    Hypothesis generation
    Knowledge discovery
Paradigm shift
  Migration towards data-driven research (as opposed to hypothesis-driven research)
  Translational research
Medical Ontology Research
Olivier Bodenreider
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA
Contact:
Web: