© 2003 The MITRE Corporation. ALL RIGHTS RESERVED.
MITRE
Critical Assessment of Information Extraction Systems in Biology (BioCreAtIvE)
Marc Colosimo, Lynette Hirschman, Alexander Morgan, Alexander Yeh
http://www.mitre.org/public/biocreative
Outline
Past evaluation
- KDD Cup 2002
Current evaluation
- BioCreAtIvE
Summary
Past Evaluation: KDD 2002 Challenge Cup Evaluation
We were invited to run a task for KDD Cup 2002*
We ran one of two tasks for 2002
- Alexander Yeh was the chair for Task 1 (fly genes)
- Mark Craven (U. Wisc.) was the chair for Task 2 (yeast genes)
Data-mining conference: NOT biology or text processing
*http://www.biostat.wisc.edu/~craven/kddcup/tasks.html
Task 1: For a Set of Papers on Genetics or Molecular Biology
We provided for each paper
- The full text of the paper
- A list of the genes mentioned in that paper
The task was to
- Rank the curatable papers before the non-curatable papers
- Decide whether each paper contains any curatable gene product information (Yes/No)
- For each curatable gene mentioned in the paper, decide whether that paper has experimental results for
  Transcript(s) of that gene (Yes/No)
  Protein(s) of that gene (Yes/No)
Results
The winner and honorable mentions were all combined teams from 2 or 3 organizations
Winner: a team from ClearForest and Celera
- Used manually generated rules and patterns to perform information extraction
- Also had the best score in each of the 3 sub-tasks

                          Best    Median
  Ranked-list:            84%     69%
  Yes/No curate paper:    78%     58%
  Yes/No gene products:   67%     35%

18 teams submitted test results
Outline
Past evaluation
- KDD Cup 2002
Current evaluation
- BioCreAtIvE
Summary
Current Evaluation: BioCreAtIvE
Organized by MITRE, CNB (Madrid) and others
- Under the umbrella of the ISCB BioLINK Special Interest Group for Text Data Mining*
Two tasks
- Entity extraction (MITRE)
  Gene name mentions (NCBI)
  Gene list (MITRE)
- Functional curation (CNB-Madrid)
  Automatically map text to GO (Gene Ontology) terms for proteins described in text
*http://www.pdg.cnb.uam.es/BioLINK
Schedule
July 2003: initial training data & guidelines
Nov-Dec. 2003: test data released, results due
Participants may choose which tasks and sub-tasks they want to participate in. You do not have to enter just one task, nor all of them.
Why Evaluate Entity Extraction for Molecular Biology?
Entity extraction is a basic text mining operation
- It indicates the items discussed in a document
- Variations in nomenclature constitute a major stumbling block to accessing the biomedical literature
Many groups are working on entity extraction
- But there is no way to compare the systems
  Different data sets
  Different tasks
Challenge Evaluations have been successful at making comparisons
- This work should also lead to resources and standards for handling nomenclature
Progress in Speech Recognition
Source: Pallett, D., Garofolo, J., and Fiscus, J. (NIST). Measurements in Support of Research Accomplishments. Feb 2000. Communications of the ACM: Special Section on Broadcast News Understanding.
Results show decrease in error rate over time, measured by results from best system each year
Note that the research community selected new, harder problems over time
Can we expect the same progress for accessing biological literature?
Some Challenges of Extracting Entities in Molecular Biology Texts
Entity mentions are often common nouns (as opposed to proper nouns)
In fact, many entities are named with ordinary words
- E.g., some Drosophila gene names: by, for, if, blue, saw, period, white, midget
Also, new entities are constantly being discovered and/or renamed
“Complete” Entity Extraction is More Than Finding Mentions in the Text
For each mention, it is important to determine which entity is being discussed
This is non-trivial in molecular biology
- An entity can have synonyms
- The same word(s) can refer to different entities
  E.g., Sek1 refers to two different proteins in mice (Map2k4 and Epha4)
- Mentions can share text: e.g., “MEK1/2” is about both MEK1 and MEK2
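
The complications listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the evaluation: the `SYNONYMS` table holds just the examples from this slide, and `expand_shared_mention` and `candidate_entities` are hypothetical helper names.

```python
import re

# Toy synonym table built only from the slide's examples: in mouse,
# "Sek1" is a synonym of two different genes. Real tables are far larger.
SYNONYMS = {
    "sek1": ["Map2k4", "Epha4"],
    "mek1": ["MEK1"],
    "mek2": ["MEK2"],
}

# Matches a shared-text mention such as "MEK1/2": a stem ending in a letter,
# a number, then one or more "/number" alternatives.
SHARED = re.compile(r"^(?P<stem>.*?[A-Za-z])(?P<first>\d+)(?P<rest>(?:/\d+)+)$")


def expand_shared_mention(mention):
    """Split "MEK1/2" into ["MEK1", "MEK2"]; return other mentions unchanged."""
    match = SHARED.match(mention)
    if match is None:
        return [mention]
    numbers = [match.group("first")] + match.group("rest").strip("/").split("/")
    return [match.group("stem") + number for number in numbers]


def candidate_entities(mention):
    """Return every entity a mention could refer to (possibly more than one)."""
    candidates = []
    for name in expand_shared_mention(mention):
        for entity in SYNONYMS.get(name.lower(), []):
            if entity not in candidates:
                candidates.append(entity)
    return candidates
```

Calling `candidate_entities("Sek1")` yields two candidates, so a complete system would still need context to decide which protein is meant; the lookup only narrows the choice.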
Entity Extraction Task 1A: Gene Name Mention
Data provided by Lorrie Tanabe & John Wilbur, NCBI
- 15,000 sentences manually annotated for genes
  7,500 sentences for training
  2,500 sentences for development test
  5,000 sentences for testing
Example (transformed for display purposes)
- Data are marked for occurrences of gene-related mentions (underlined), including binding sites, motifs, domains, proteins, promoters, etc.
Structure and expression of a gene from Arabidopsis thaliana encoding a protein related to SNF1 protein kinase.
Entity Extraction Task 1B: Gene List Annotation
Given a set of abstracts
We have screened the Drosophila X chromosome for genes whose dosage affects the function of the homeotic gene Deformed. One of these genes, extradenticle, encodes a homeodomain transcription factor that heterodimerizes with Deformed and other homeotic Hox proteins. Mutations in the nejire gene, which encodes a transcriptional adaptor protein belonging to the CBP/p300 family, also interact with Deformed. The other previously characterized gene identified as a Deformed interactor is Notch, which encodes a transmembrane receptor. These three genes underscore the importance of transcriptional regulation and cell-cell signaling in Hox function. Four novel genes were also identified in the screen. One of these, rancor, is required for appropriate embryonic expression of Deformed and another homeotic gene, labial. Both Notch and nejire affect the function of another Hox gene, Ultrabithorax, indicating they may be required for homeotic activity in general.
Entity Extraction Task 1B: What a Contestant’s System Should Return
Return a list of the standardized names of the genes mentioned in each abstract:
Also return 1 text mention for each gene in list
We have screened the Drosophila X chromosome for genes whose dosage affects the function of the homeotic gene Deformed. One of these genes, extradenticle, encodes a homeodomain transcription factor that heterodimerizes with Deformed and other homeotic Hox proteins. Mutations in the nejire gene, which encodes a transcriptional adaptor protein belonging to the CBP/p300 family, also interact with Deformed. The other previously characterized gene identified as a Deformed interactor is Notch, which encodes a transmembrane receptor. These three genes underscore the importance of transcriptional regulation and cell-cell signaling in Hox function. Four novel genes were also identified in the screen. One of these, rancor, is required for appropriate embryonic expression of Deformed and another homeotic gene, labial. Both Notch and nejire affect the function of another Hox gene, Ultrabithorax, indicating they may be required for homeotic activity in general.
0004656, 0002522, 0015624, 0000439, 0012384, 0004647, 0000611
Task 1B: Data Availability
Abstracts from PubMed/Medline
- Training
- Development test
- Test
Gene lists for papers from model organism databases (Drosophila, mouse, yeast)
- A list of genes (standardized names) for each paper is available
- Note that gene list is for full paper, but the text we can get is just the abstract
Synonym lists provided by each database to map alternate gene names, as mentioned in text, to their unique database identifier
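
A baseline use of such a synonym list can be sketched as a plain dictionary lookup. This is a minimal sketch, not the organizers' method: `SYNONYM_TO_ID` and `gene_list` are hypothetical names, and the identifiers are placeholders rather than real database entries.

```python
import re

# Hypothetical synonym list: alternate gene name (lowercased) -> identifier.
# The identifiers are placeholders, not real database entries.
SYNONYM_TO_ID = {
    "deformed": "GENE:0001",
    "dfd": "GENE:0001",
    "extradenticle": "GENE:0002",
    "notch": "GENE:0003",
}


def gene_list(abstract):
    """Return {identifier: first matching text mention} for one abstract."""
    found = {}
    for token in re.finditer(r"[A-Za-z0-9][A-Za-z0-9\-]*", abstract):
        word = token.group(0)
        gene_id = SYNONYM_TO_ID.get(word.lower())
        # Keep only the first mention seen for each gene.
        if gene_id is not None and gene_id not in found:
            found[gene_id] = word
    return found
```

A lookup this simple would also tag ordinary uses of words such as "white" or "deformed", and it cannot tell which gene an ambiguous synonym refers to, which is why the task is harder than matching names against a list.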
Task 1B: References Associated with Lists of Genes
Task 1B: Data Set Size (in Abstracts*)
                    Fly      Mouse    Yeast
  Training          5000     5000     5000
  Development Test  150      250      (150)
  Test              (250)    (250)    (250)
*Each abstract is around 250 words
Task 2: Functional Annotation
Data provided by Swiss-Prot (Rolf Apweiler) and being run by Christian Blaschke (CNB-Madrid)
Task:
- Automatically generate evidence for Gene Ontology annotations for a set of proteins from the text of an article
Gold standard:
- SWISS PROT Human Genome Annotations
- SWISS PROT curators will also check the correctness and utility of the pointers to the evidence
Functional Annotation: Sub-tasks
1. Return text evidence for GO annotations found in a paper
- Given a full text paper, protein(s) and associated GO term(s)
2. Generate GO term(s) and evidence for a protein
- Given a paper and protein(s) in the paper
- Note that more than one GO term might be associated with a protein
3. Exploratory. Given a set of proteins, find relevant GO annotations and evidence from full text articles
Task 2: Find Text Evidence Supporting SWISS PROT GO Annotation
SWISS PROT entry for: Small inducible cytokine A8 precursor;
Synonyms: CCL8; Monocyte chemotactic protein 2; MCP-2; Monocyte chemoattractant protein 2; HC14
GO Annotation: 0006816 Calcium ion transport
Task 2: Find Text Evidence (cont.)
Full text article…
…cpt-cAMP (1 mM) pretreatment of the cells completely inhibited RANTES-, MIP-1-, and MCP-2-induced Ca2+ mobilization …
Protein: Small inducible cytokine A8 precursor
Synonym: MCP-2
GO Annotation: 0006816 Calcium ion transport
Evidence:
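
One naive way to look for such evidence is to keep every passage that mentions both a name of the protein and a cue for the GO term. The sketch below is an assumption-laden illustration, not the task's scoring or any participant's system: `find_evidence` is a hypothetical helper, and the cue terms for "Calcium ion transport" are hand-picked for this example.

```python
def find_evidence(passages, protein_names, go_cues):
    """Return passages that mention both a protein name and a GO cue term."""
    hits = []
    for passage in passages:
        lowered = passage.lower()
        # Plain substring tests; a real system would need tokenization.
        has_protein = any(name.lower() in lowered for name in protein_names)
        has_cue = any(cue.lower() in lowered for cue in go_cues)
        if has_protein and has_cue:
            hits.append(passage)
    return hits


# Names taken from the SWISS PROT entry shown on the previous slide.
protein_names = ["CCL8", "MCP-2", "HC14"]
# Hand-picked cues (an assumption): the article says "Ca2+", not "calcium".
go_cues = ["calcium", "Ca2+"]

passages = [
    "cpt-cAMP (1 mM) pretreatment of the cells completely inhibited "
    "RANTES-, MIP-1-, and MCP-2-induced Ca2+ mobilization",
    # Invented sentence, included only to show a non-match.
    "MCP-2 was purified before the assay.",
]

evidence = find_evidence(passages, protein_names, go_cues)
```

The example also shows why the task is hard: the article never uses the words of the GO term itself, so matching on "calcium ion transport" alone would miss the passage.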
Outline
Past evaluation
- KDD Cup 2002
Current evaluation
- BioCreAtIvE
Summary
Summary
We are trying to help the curators by providing common challenge evaluations based on relevant problems that they face
Common evaluations provide a means to directly compare different methods and help to advance research in the area
There is still time to compete in the current challenge
http://www.mitre.org/public/biocreative
The End
Linking Literature, Databases, Ontologies, Data
[Diagram labels: MEDLINE; Literature Collections; Genbank; Databases; SwissProt; Ontologies; Data integration via metaschemas; Improved search and indexing; Pathway Discovery; DB update; Data Interpretation; Experimental Data; DB curation]