cross-species gene normalization by species inference

Raunak Shrestha

24th November 2011


http://reginanuzzo.com/wp-content/iHOP_chimp.gif

BioCreative


“BioCreative ….. consists of a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. … ”

http://biocreative.sourceforge.net/

BioCreative III

• 3 Tasks:

• Gene Normalization (GN)

• Protein-protein interactions (PPI)

• Interactive demonstration task for gene indexing and retrieval (IAT)

• Goal

• Map genes and proteins mentioned in biomedical literature to their standard database identifiers

• BioCreative III

• No species information was provided

• Produce a list of EntrezGene identifiers, across all species, for the gene mentions in full-text biomedical articles


Relation Extraction


Named Entity Normalization

genes, proteins, diseases -> specific database identifier

GN task in BioCreative III

• 3 MAJOR CHALLENGES !!!

• Gene Mention Variations

• Orthographical Variation : TLR7 vs. TLR-7

• Morphological Variation : GHF-1 transcriptional factor vs. GHF-1 transcription factor

• Enumeration Variation : TLR7/8 vs. TLR7, TLR8

• Variation with Abbreviation

• SLC11A1 vs. NRAMP1

• Orthologous Gene Ambiguity

• Orthologous genes belong to different species and should be mapped to different database identifiers

• Intra-species Gene Ambiguity

• Different genes have the same name

“CAS” -> multiple database identifiers

• EntrezGene Id:1434 (“Cellular apoptosis susceptibility protein”)

• EntrezGene Id:9564 (“Breast cancer antiestrogen resistance 1”)
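The variation and ambiguity examples above can be illustrated with a minimal sketch. It is not GenNorm's implementation: the function names, the regular expression, and the one-entry toy lexicon are assumptions made for illustration, and only the two Entrez Gene ids for “CAS” come from the slide.

```python
import re

def expand_enumeration(mention):
    """Expand an enumerated mention such as 'TLR7/8' into ['TLR7', 'TLR8'].

    Mentions that do not look like 'STEM<number>/<number>...' are returned unchanged.
    """
    match = re.fullmatch(r"([A-Za-z]+-?)(\d+)((?:/\d+)+)", mention)
    if not match:
        return [mention]
    stem, first, rest = match.groups()
    numbers = [first] + rest.strip("/").split("/")
    return [stem + number for number in numbers]

def normalize_orthography(mention):
    """Lowercase and drop hyphens and whitespace so 'TLR-7' and 'TLR7' compare equal."""
    return re.sub(r"[-\s]", "", mention).lower()

# Hypothetical toy lexicon: normalized gene name -> set of Entrez Gene ids.
# The two ids for "CAS" are the ones quoted on the slide.
TOY_LEXICON = {"cas": {1434, 9564}}

def candidate_ids(mention):
    """Return every Entrez Gene id the toy lexicon lists for a mention."""
    return TOY_LEXICON.get(normalize_orthography(mention), set())
```

An ambiguous name such as “CAS” returns more than one candidate id, which is why a later disambiguation step is needed.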

Method


• GenNorm

• an integrative method to handle the three issues of the GN task

• GenNorm uses 3 modules

• Gene Name Recognition (GNR) module

• Species Assignation (SA) module

• Species-specific Gene Normalization (SGN) module

Architecture of GenNorm.

Gene Name Recognition (GNR) module

• AIIA-GMT is an XML-RPC client of a web server that recognizes named entities in biomedical literature


Gene Name Recognition (GNR) module

• Identifiers Extraction

• If gene mentions cannot be matched with a particular database identifier

• Queries the Entrez Gene database

• Attaches all the associated ids (swissprot_id, SGD_id, etc.)

Distillation

Species Assignation (SA) module

• Aggregates three different species name lexicons:

• NCBI taxonomy

• list of cell lines from Wikipedia

• the corpus of Linnaeus
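A minimal sketch of how three lexicons could be merged into one lookup and applied to text. The NCBI Taxonomy ids are real (9606 human, 10090 mouse), but the toy entries, which source contributes which name, and the function names are assumptions made for illustration rather than the SA module's actual code.

```python
import re

# Toy excerpts standing in for the three sources; the real lexicons are far larger.
NCBI_TAXONOMY = {"homo sapiens": 9606, "mus musculus": 10090}
CELL_LINES = {"hela": 9606}  # HeLa is a human cell line
LINNAEUS_NAMES = {"human": 9606, "mouse": 10090, "mice": 10090}

def build_species_lexicon(*sources):
    """Merge several name -> taxonomy id maps into one name -> set of ids map."""
    lexicon = {}
    for source in sources:
        for name, tax_id in source.items():
            lexicon.setdefault(name.lower(), set()).add(tax_id)
    return lexicon

def find_species(text, lexicon):
    """Return the taxonomy ids of every lexicon name found as a whole word or phrase."""
    lowered = text.lower()
    found = set()
    for name, tax_ids in lexicon.items():
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            found |= tax_ids
    return found
```

Mapping each name to a set of ids rather than a single id keeps the lookup usable when two sources disagree about a name.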


Species Assignation (SA) module


Guaranteed Inference

Co-occurrence Inference

Species sub-type can disambiguate the species name
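The slide names the inference strategies without defining them. The sketch below shows only one possible reading of co-occurrence inference: prefer a species mentioned in the same sentence as the gene, otherwise fall back to the most frequently mentioned species in the article. The function name, the input layout, and the fallback rule are all assumptions, not the method described in the paper.

```python
from collections import Counter

def assign_species(gene_sentence_idx, species_by_sentence):
    """Pick a taxonomy id for a gene mention (hypothetical helper).

    species_by_sentence holds one list of taxonomy ids per sentence of the article.
    """
    # Assumed rule 1: a species in the same sentence as the gene wins.
    same_sentence = species_by_sentence[gene_sentence_idx]
    if same_sentence:
        return Counter(same_sentence).most_common(1)[0][0]
    # Assumed rule 2: otherwise use the most frequent species in the whole article.
    all_mentions = [tax_id for sentence in species_by_sentence for tax_id in sentence]
    if all_mentions:
        return Counter(all_mentions).most_common(1)[0][0]
    return None  # no species mentioned anywhere
```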

Species Assignation (SA) module

5,933,419 Entrez Gene Ids belonging to > 6,000 species

Species-specific Gene Normalization (SGN) module

• measures the inference scores of candidate Entrez Gene Ids in articles

• Entity Inference:

• Inference by exact match

• Tagged entity = gene name entity

• Bags of words

• Inference by partial match

• Tagged entity has at least one word matching the gene name entity from the bag of words

“Hypoxia-inducible factor-1 alpha” -> “Hypoxia”, “inducible”, “factor”, “1”, “alpha”

“CCRL1” = “chemokine receptor like 1”
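A minimal sketch of the two inference types, exact match and bag-of-words partial match. The tokenization rule and the function names are assumptions made for illustration; the actual scoring used by the SGN module is not given on the slide and is not reproduced here.

```python
import re

def bag_of_words(name):
    """Split a gene name into lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def inference_type(tagged_entity, gene_name):
    """Classify how a tagged entity matches a gene name from the lexicon."""
    # Exact match: the tagged entity equals the gene name (ignoring case).
    if tagged_entity.lower() == gene_name.lower():
        return "exact"
    # Partial match: at least one word is shared between the two bags of words.
    if bag_of_words(tagged_entity) & bag_of_words(gene_name):
        return "partial"
    return "none"
```

For example, `inference_type("inducible factor", "Hypoxia-inducible factor-1 alpha")` is a partial match because the two names share the words “inducible” and “factor”.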

Test Data

• Two sets of test data

• A: fully annotated by a group of trained and experienced curators from different model organism databases

• B: partially annotated by curators at NLM

• 507 full-text articles from various BMC and PLoS journals


Result

• The post-processing step was the most useful

• Identifier Extraction was the most efficient

Conclusion and Critique

• GenNorm tries to address key problems in cross-species gene normalization

• Orthologous gene ambiguity is still a challenging task

• An even more challenging step is filtering out the high-false-positive set, as described in Table-5

• The paper briefly describes some of the systems available in biomedical text mining to date

• GenNorm seems to be a highly integrated system

• Grabs all the “cream” out of the best available resources in the field of biomedical literature text mining
