Text mining tools for semantically enriching scientific literature
Presentation by Sophia Ananiadou at the Cheminformatics workshop, 4th March 2008
Text mining tools for semantically enriching the scientific literature
Sophia Ananiadou
Director
National Centre for Text Mining
School of Computer Science
University of Manchester
Need for enriching the literature
• Need for semantic search i.e. beyond keywords
• Need for technologies enabling focused semantic search via the creation of semantic metadata from literature
“The current scientific literature, were it to be presented in semantically accessible form, contains huge amounts of undiscovered science”
Peter Murray-Rust, Data-driven science: A Scientist’s view. NSF/JISC Repositories Workshop, 2007
Impact of text mining
• Extraction of named entities (genes, proteins, metabolites, etc.)
• Discovery of concepts allows semantic annotation of documents
– Improves information access by going beyond index terms, enabling semantic querying
– Improves clustering, classification of documents
– Visualisation based on semantic metadata derived from text mining results
Beyond named entities: facts
• Extraction of relationships, events (facts) for knowledge discovery
– Information extraction, more sophisticated annotation of texts (fact annotation)
– Enables even more advanced semantic querying
Enriched annotation
• Text Mining provides enriched annotation layers
– the user will be able to carry out an easily expressed semantic query which will deliver facts matching that semantic query rather than just sets of documents he has to read…
• Information Extraction and not just Information Retrieval
• Fact extraction and not just sentence extraction
[Figure: text processing pipeline. Raw (unstructured) text → part-of-speech tagging → named entity recognition → deep syntactic parsing → annotated (structured) text. Resources used: lexicon, ontology.]
[Figure: layered annotation of an example sentence]
Sentence: Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells .
Part-of-speech tags: NN IN NN VBZ VBN IN NN IN JJ NN NNS .
Named entity types: protein_molecule, organic_compound, cell_line
Syntactic parse: phrase labels NP, PP, VP, S
Event: negative regulation
Annotations derived from Text Mining
Multi-layered annotations
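One way to picture multi-layered annotations is as stand-off records over a shared token sequence. The sketch below is only an illustration, not the format used by any NaCTeM tool: the class names, the token-index spans chosen for each entity, and the decision to anchor the event on "abolished" are all assumptions. The sentence, the part-of-speech tags and the labels are taken from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Annotation:
    layer: str   # "pos", "entity" or "event" in this sketch
    start: int   # index of the first token covered
    end: int     # index one past the last token covered
    label: str


@dataclass
class AnnotatedSentence:
    tokens: List[str]
    annotations: List[Annotation] = field(default_factory=list)

    def add(self, layer: str, start: int, end: int, label: str) -> None:
        self.annotations.append(Annotation(layer, start, end, label))

    def layer(self, name: str) -> List[Tuple[str, str]]:
        """Return (covered text, label) pairs for one annotation layer."""
        return [
            (" ".join(self.tokens[a.start:a.end]), a.label)
            for a in self.annotations
            if a.layer == name
        ]


tokens = "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells .".split()
pos_tags = "NN IN NN VBZ VBN IN NN IN JJ NN NNS .".split()

sent = AnnotatedSentence(tokens)
for i, tag in enumerate(pos_tags):
    sent.add("pos", i, i + 1, tag)

# Assumed spans: which tokens carry each label is inferred, not stated on the slide.
sent.add("entity", 2, 3, "protein_molecule")
sent.add("entity", 6, 7, "organic_compound")
sent.add("entity", 9, 11, "cell_line")
sent.add("event", 4, 5, "negative regulation")

for text, label in sent.layer("entity"):
    print(f"{text}\t{label}")
```

Keeping every layer as spans over the same tokens means a new layer (for example, parse constituents) can be added without touching the existing ones.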
Mining associations from MEDLINE
• FACTA: Finding Associated Concepts with Text Analysis
– What diseases are related to a particular chemical?
– What proteins are related to a particular disease?
– etc.
• EBIMed http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp
• PubMatrix http://pubmatrix.grc.nia.nih.gov/
• FACTA http://text0.mib.man.ac.uk/software/facta/
– Quick and interactive
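The kind of question listed above ("what diseases are related to a particular chemical?") can be approximated by counting which concepts co-occur with the query concept in the same abstract. The following is a minimal sketch under that assumption, not a description of how FACTA itself ranks results. The concept identifiers and the four toy "abstracts" are invented placeholders, and the function name `associated` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Toy input: each set holds the concept labels assumed to have been
# recognised in one abstract. These are placeholders, not real data.
docs = [
    {"chemical:C1", "disease:D1"},
    {"chemical:C1", "disease:D1", "protein:P1"},
    {"chemical:C1", "disease:D2"},
    {"chemical:C2", "disease:D2"},
]


def associated(query: str, docs: Iterable[Set[str]], prefix: str) -> List[Tuple[str, int]]:
    """Rank concepts of one type by how many documents they share with the query."""
    counts: Counter = Counter()
    for concepts in docs:
        if query in concepts:
            counts.update(
                c for c in concepts if c != query and c.startswith(prefix)
            )
    return counts.most_common()


ranking = associated("chemical:C1", docs, "disease:")
print(ranking)
```

Raw co-occurrence counts favour frequent concepts; a real system would likely normalise them with an association measure, which this sketch leaves out.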
Innovative Technologies applied to:
• Term recognition
• Named entity recognition
• Fact extraction
• Semantic mark-up improves search
• Classifying, linking documents
• Knowledge discovery, hidden links, associations, hypothesis generation
Natural Language Processing technologies
• Part-of-speech tagging: GENIA
– Tuned to biomedical text: 97-99% precision
• Dictionary-based named-entity recognition
• Deep parsing
– Predicate argument relations (90%)
• Protein-protein interaction extraction
• Event / fact extraction
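Dictionary-based named-entity recognition, listed above, can be sketched as a greedy longest-match lookup over token n-grams. This is a generic illustration rather than the implementation used in the tools named in this talk. The function name `dictionary_ner` and the three-entry lexicon are assumptions; the lexicon entries reuse the entity types from the example sentence elsewhere in the slides.

```python
from typing import Dict, List, Tuple

Entity = Tuple[int, int, str, str]  # (start token, end token, text, type)


def dictionary_ner(tokens: List[str], lexicon: Dict[str, str], max_len: int = 3) -> List[Entity]:
    """Scan left to right, preferring the longest lexicon entry at each position."""
    entities: List[Entity] = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so multi-word terms win over their parts.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in lexicon:
                match = (i, i + n, candidate, lexicon[candidate])
                break
        if match:
            entities.append(match)
            i = match[1]   # continue after the matched term
        else:
            i += 1
    return entities


# Hypothetical mini-lexicon for illustration only.
lexicon = {
    "TNF": "protein_molecule",
    "BHA": "organic_compound",
    "U937 cells": "cell_line",
}
tokens = "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells .".split()
found = dictionary_ner(tokens, lexicon)
print(found)
```

Exact string lookup misses spelling variants, which is one reason a lexicon with normalised term variants is useful alongside the dictionary.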
Automatic Term Recognition
http://www.nactem.ac.uk/software/termine/
Recognising and Disambiguating Acronyms in Biomedical Literature
http://www.nactem.ac.uk/software/acromine
Named-entity recognition
• Entity types (defined by ontologies)
– Gene/protein names
– Enzymes, substances, metabolites, etc.
– GO ontology, KEGG, ChEBI, etc.
Example: The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes …
Entity labels shown on the example: DNA, virus, cell_type
Leveraging resources
• Annotated texts (GENIA corpus, GENIA event corpus)
• Resources for bio-text mining
– resource-building NLP tools for text-based knowledge harvesting (NaCTeM)
– BioLexicon
• Over 1.5M lexical entries for bio-text mining and growing…
• Containing rich linguistic information for bio-text mining
Population Process
[Figure: BioLexicon population process. Sources: existing repositories, Medline abstracts. Processing steps: named entity recognition, term mapping by normalization, subclustering of term variants, verb subcategorization, manual curation. Contents: gene/protein names; chemical, disease, enzyme, species names; terminological verbs; new gene/protein names; verb subcategorization frames. Part of the process is marked "on-going".]
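"Term mapping by normalization" can be illustrated by reducing each surface form to a canonical key and grouping the forms that share a key. The rules below (lower-casing, replacing spelled-out Greek letters, dropping spaces and hyphens) are assumptions chosen for this sketch, not the BioLexicon's actual normalisation rules, and the variant list is a small hand-picked example.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed mapping of spelled-out Greek letters to single letters.
GREEK = {"alpha": "a", "beta": "b", "kappa": "k"}


def normalize(term: str) -> str:
    """Map a term variant to a crude canonical key."""
    key = term.lower()
    for name, letter in GREEK.items():
        key = key.replace(name, letter)
    # Remove whitespace and common separators.
    return re.sub(r"[\s\-_/]+", "", key)


variants = ["NF-kappa B", "NF kappaB", "NFkB", "nf-kB", "TNF-alpha", "TNF alpha"]

clusters: Dict[str, List[str]] = defaultdict(list)
for variant in variants:
    clusters[normalize(variant)].append(variant)

for key, members in clusters.items():
    print(key, members)
```

Rules this aggressive will also merge unrelated terms in places, which is presumably where manual curation comes in.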
Semantic search based on facts
• MEDIE: an interactive advanced IR system retrieving facts
• Performs a semantic search
• Core technology annotates texts with syntactic structures and facts
– GENIA tagger
– Enju (deep parser)
– Dictionary-based named entity recognition
J. Tsujii
MEDIE system overview
[Figure: MEDIE architecture. Off-line: input textbase → deep parser and entity recognizer → semantically-annotated textbase. On-line: query → region algebra search engine → search results.]
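The idea behind a region algebra search can be sketched as set operations over annotated extents: every annotation is a (start, end) region, and a query combines regions with containment operators. The code below is a generic illustration of that idea and makes no claim about MEDIE's actual engine or query language. The `Region` type, the `containing` operator and the second sentence's annotations are assumptions for the example.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    start: int      # first token index
    end: int        # one past the last token index
    tag: str        # kind of annotation, e.g. "sentence", "verb", "entity"
    value: str = ""


def containing(outer: List[Region], inner: List[Region]) -> List[Region]:
    """Regions from `outer` that fully contain at least one region from `inner`."""
    return [
        o for o in outer
        if any(o.start <= i.start and i.end <= o.end for i in inner)
    ]


# Annotated textbase: sentence 1 follows the slide's example sentence;
# sentence 2 is a made-up placeholder so the query has something to reject.
textbase = [
    Region(0, 12, "sentence"),
    Region(2, 3, "entity", "protein_molecule"),
    Region(4, 5, "verb", "abolish"),
    Region(6, 7, "entity", "organic_compound"),
    Region(12, 20, "sentence"),
    Region(14, 15, "verb", "bind"),
    Region(15, 16, "entity", "protein_molecule"),
]

sentences = [r for r in textbase if r.tag == "sentence"]
abolish = [r for r in textbase if r.tag == "verb" and r.value == "abolish"]
proteins = [r for r in textbase if r.value == "protein_molecule"]

# Query: sentences containing the verb "abolish" and a protein mention.
hits = containing(containing(sentences, abolish), proteins)
print(hits)
```

Because the annotations are computed off-line, the on-line step only has to intersect precomputed regions, which fits the interactive use described on the slide.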
Sentence Retrieval System
Using Semantic Representation
MEDIE
InfoPubMed
• An interactive Information Extraction system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins, and the interactions between them.
• System components
– Deep parsing technology
– Extraction of protein-protein interactions
– Multi-window interface on a browser
InfoPubMed
Interactions and not just co-occurrences. Calculated using ML and deep semantics.
Semantic Information Retrieval
• KLEIO: a semantically enriched information retrieval system for biology
• Offers textual and metadata searches across MEDLINE
• Leverages terminology technologies
• Named entity recognition: gene, protein, metabolite, organ, disease, symptom
http://nactem4.mc.man.ac.uk:8080/Kleio/
KLEIO architecture
Fewer documents with more precise query
Linking and enriching pathways with text
• REFINE (BBSRC): MCISB and NaCTeM (Kell, Ananiadou, Tsujii)
– to integrate text mining techniques with visualisation technologies for better understanding of the evidence for biochemical and signalling pathways
– to enrich pathway models encoded in the Systems Biology Markup Language (SBML) with evidence derived from text mining
2 Steps for linking text with pathways
[Figure: Literature → (Event Extraction) → Biological events → (Pathway Construction) → Pathways. Literature snippets: "… IkappaB is phosphorylated …", "… Ikappa B ubiquitination …", "… degradation of IkB…". The pathway diagram shows IkB nodes marked P and U.]
Tsujii-lab, Tokyo
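The event extraction step can be pictured with a deliberately naive trigger-word matcher applied to the three literature snippets on this slide. This is only a sketch: real event extraction, as described elsewhere in this talk, relies on deep parsing rather than string matching. The trigger stems, the regular expression for the name variants of IkB, and the function name `extract_events` are all assumptions, and the pathway construction step is not attempted here.

```python
import re
from typing import List, Tuple

# Assumed trigger stems mapped to event types.
TRIGGERS = {
    "phosphorylat": "phosphorylation",
    "ubiquitinat": "ubiquitination",
    "degradation": "degradation",
}

# Matches the three spellings used in the snippets: IkappaB, Ikappa B, IkB.
IKB = re.compile(r"\bI(?:kappa\s?|k)B\b")


def extract_events(snippets: List[str]) -> List[Tuple[str, str]]:
    """Return (event type, theme) pairs for snippets mentioning IkB and a trigger."""
    events = []
    for snippet in snippets:
        if not IKB.search(snippet):
            continue
        lowered = snippet.lower()
        for stem, event_type in TRIGGERS.items():
            if stem in lowered:
                events.append((event_type, "IkB"))  # theme normalised to one name
    return events


snippets = [
    "… IkappaB is phosphorylated …",
    "… Ikappa B ubiquitination …",
    "… degradation of IkB…",
]
events = extract_events(snippets)
print(events)
```

Normalising the three spellings to a single theme is what lets the separate events be attached to the same pathway entity in the second step.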
Event Annotation - Example
Statistics & References
• Statistics
– 36,114 events have been identified from and annotated to 1,000 Medline abstracts, which contain 9,372 sentences
• Kim, Jin-Dong, Tomoko Ohta and Jun'ichi Tsujii (2008) Corpus annotation for mining biomedical events from literature. BMC Bioinformatics
• http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA
Acknowledgements
• Junichi Tsujii and his lab (University of Tokyo): MEDIE, InfoPubMed, event annotation
• Yoshimasa Tsuruoka (NER, FACTA, KLEIO, REFINE)
• Naoaki Okazaki (TerMine, AcroMine)
• Yutaka Sasaki (BioLexicon, NER, KLEIO)
• John McNaught (BioLexicon, BOOTStrep project)
• Chikashi Nobata (KLEIO)
• Douglas Kell (REFINE)