cross-species gene normalization by species inference
Raunak Shrestha
24th November 2011
http://reginanuzzo.com/wp-content/iHOP_chimp.gif
BioCreative
“BioCreative ….. consists of a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. … ”
http://biocreative.sourceforge.net/
BioCreative III
• 3 Tasks:
• Gene Normalization (GN)
• Protein-protein interactions (PPI)
• Interactive demonstration task for gene indexing and retrieval (IAT)
• Goal
• Map genes and proteins mentioned in biomedical literature to their standard database identifiers
• BioCreative III
• No species information was provided
• Produce a list of EntrezGene identifiers, across all species, for the gene mentions in full-text biomedical articles
Relation Extraction
Named Entity Normalization
genes, proteins, diseases → specific database identifier
GN task in BioCreative III
• 3 major challenges:
• Gene Mention Variations
• Orthographical variation: TLR7 vs. TLR-7
• Morphological variation: GHF-1 transcriptional factor vs. GHF-1 transcription factor
• Enumeration variation: TLR7/8 vs. TLR7, TLR8
• Variation with abbreviation: SLC11A1 vs. NRAMP1
• Orthologous Gene Ambiguity
• Orthologous genes belong to different species and should be mapped to different database identifiers
• Intra-species Gene Ambiguity
• Different genes have the same name
• “CAS” maps to multiple database identifiers:
• EntrezGene Id: 1434 (“Cellular apoptosis susceptibility protein”)
• EntrezGene Id: 9564 (“Breast cancer antiestrogen resistance 1”)
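The gene mention variations listed above can be illustrated with a small sketch. This is not GenNorm's implementation; the function names, the regular expressions, and the one-entry synonym table are assumptions made for illustration, using only the examples from the slide. Morphological variation is not handled here.

```python
import re
from typing import List

# Illustrative synonym table: only the pair named on the slide.
SYNONYMS = {"NRAMP1": "SLC11A1"}

def normalize_orthography(mention: str) -> str:
    """Drop hyphens and whitespace and upper-case, so 'TLR-7' and 'TLR7' compare equal."""
    return re.sub(r"[-\s]+", "", mention).upper()

def expand_enumeration(mention: str) -> List[str]:
    """Expand 'TLR7/8' into ['TLR7', 'TLR8']; return any other mention unchanged."""
    match = re.fullmatch(r"([A-Za-z-]+?)(\d+)((?:/\d+)+)", mention)
    if not match:
        return [mention]
    stem, first, rest = match.groups()
    numbers = [first] + rest.strip("/").split("/")
    return [stem + n for n in numbers]

def canonical_forms(mention: str) -> List[str]:
    """Expand enumerations, normalize spelling, then map known synonyms."""
    forms = [normalize_orthography(m) for m in expand_enumeration(mention)]
    return [SYNONYMS.get(f, f) for f in forms]
```

For example, `canonical_forms("TLR7/8")` yields one entry per enumerated gene, and `canonical_forms("NRAMP1")` is mapped through the synonym table. A real system would draw its synonyms from a gene lexicon rather than a hand-written dictionary.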
Method
• GenNorm
• An integrative method to handle the three issues of the GN task
• GenNorm uses 3 modules:
• Gene Name Recognition (GNR) module
• Species Assignation (SA) module
• Species-specific Gene Normalization (SGN) module
Architecture of GenNorm.
Gene Name Recognition (GNR) module
• AIIA-GMT is an XML-RPC client of a web server that recognizes named entities in biomedical literature
Gene Name Recognition (GNR) module
• Identifier Extraction
• If a gene mention cannot be matched with a particular database identifier:
• Queries the EntrezGene database
• Attaches all the associated ids (swissprot_id, SGD_id, etc.)
Distillation
Species Assignation (SA) module
• Aggregates three different species name lexicons:
• NCBI Taxonomy
• list of cell lines from Wikipedia
• the corpus of Linnaeus
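Aggregating the three lexicons can be pictured as merging name-to-species tables. The sketch below is only an illustration: the `(name, NCBI Taxonomy ID)` layout, the toy entries, and the function name `build_species_lexicon` are assumptions, not the actual contents or format of the resources GenNorm uses.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Toy entries (assumed layout: surface name, NCBI Taxonomy ID).
NCBI_TAXONOMY = [("Homo sapiens", 9606), ("human", 9606),
                 ("Mus musculus", 10090), ("mouse", 10090)]
CELL_LINES = [("HeLa", 9606)]                      # a human cell line
LINNAEUS_STYLE = [("humans", 9606), ("mice", 10090)]

def build_species_lexicon(
    *lexicons: Iterable[Tuple[str, int]]
) -> Dict[str, Set[int]]:
    """Merge several lexicons into one case-insensitive name -> taxonomy IDs map."""
    merged: Dict[str, Set[int]] = defaultdict(set)
    for lexicon in lexicons:
        for name, tax_id in lexicon:
            merged[name.lower()].add(tax_id)
    return dict(merged)

SPECIES_LEXICON = build_species_lexicon(NCBI_TAXONOMY, CELL_LINES, LINNAEUS_STYLE)
```

Keeping a set of IDs per name leaves room for names that are ambiguous between species, which a later step would have to resolve.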
Species Assignation (SA) module
Guaranteed Inference
Co-occurrence Inference
Species sub-type can disambiguate the species name
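The two inference labels above can be sketched as a two-tier rule. The slide does not spell out the rules, so the ones used here are assumptions: "guaranteed" is taken to mean that a species name immediately precedes the gene mention, and "co-occurrence" that the nearest species name appears elsewhere in the same sentence. The lexicon and the function name `assign_species` are likewise hypothetical.

```python
import re
from typing import Dict, Optional, Tuple

# Toy species lexicon: name -> NCBI Taxonomy ID.
SPECIES: Dict[str, int] = {"human": 9606, "mouse": 10090}

def assign_species(sentence: str, gene: str,
                   lexicon: Dict[str, int] = SPECIES) -> Optional[Tuple[int, str]]:
    """Return (taxonomy ID, rule used) for a gene mention, or None if undecided."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9-]+", sentence)]
    if gene.lower() not in tokens:
        return None
    gene_pos = tokens.index(gene.lower())
    # Assumed "guaranteed" rule: species name directly before the gene mention.
    if gene_pos > 0 and tokens[gene_pos - 1] in lexicon:
        return lexicon[tokens[gene_pos - 1]], "guaranteed"
    # Assumed "co-occurrence" rule: nearest species name in the same sentence.
    candidates = [(abs(i - gene_pos), tok)
                  for i, tok in enumerate(tokens) if tok in lexicon]
    if candidates:
        _, nearest = min(candidates)
        return lexicon[nearest], "co-occurrence"
    return None
```

With this sketch, "Expression of mouse TLR7 was reduced" resolves by the first rule, while "TLR7 was knocked out in human cells" falls through to the second.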
Species Assignation (SA) module
5,933,419 Entrez Gene Ids belonging to > 6,000 species
Species-specific Gene Normalization (SGN) module
• Measures the inference scores of candidate Entrez Gene Ids in articles
• Entity Inference: inference by exact match
• tagged entity = gene name entity
• Bag of words: inference by partial match
• The tagged entity has at least one word matching the bag of words of the gene name entity
• Example bag of words: “Hypoxia-inducible factor-1 alpha” → “Hypoxia”, “inducible”, “factor”, “1”, “alpha”
• Example: “CCRL1” = “chemokine receptor like 1”
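The difference between exact and partial matching can be shown in a few lines. This is a hedged sketch, not GenNorm's scoring: the tokenization, the function names, and the three labels returned are assumptions, and the actual inference scores are not given on the slide.

```python
import re
from typing import Set

def bag_of_words(name: str) -> Set[str]:
    """Split a gene name into lower-cased alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def inference_type(tagged_entity: str, gene_name: str) -> str:
    """Classify the match between a tagged entity and a lexicon gene name."""
    if tagged_entity.lower() == gene_name.lower():
        return "exact"      # entity inference: tagged entity = gene name entity
    if bag_of_words(tagged_entity) & bag_of_words(gene_name):
        return "partial"    # bag of words: at least one shared word
    return "none"
```

Here `bag_of_words("Hypoxia-inducible factor-1 alpha")` gives the five words shown on the slide (lower-cased), and a tagged entity such as "factor-1 alpha" counts as a partial match against the full name.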
Test Data
• Two sets of test data:
• A: fully annotated by a group of trained and experienced curators from different model organism databases
• B: partially annotated by curators at NLM
• 507 full-text articles from various BMC and PLoS journals
Result
• The post-processing step was the most useful
• Identifier Extraction was the most efficient
Conclusion and Critique
• GenNorm tries to address key problems in cross-species gene normalization
• Orthologous gene ambiguity is still a challenging task
• An even more challenging step is filtering out the high-false-positive set described in Table 5
• The paper briefly describes some of the biomedical text-mining systems available to date
• GenNorm seems to be a highly integrated system
• Grabs all the “cream” from the best available resources in the field of biomedical literature text mining