Cross-species gene normalization by species inference

Raunak Shrestha, 24th November 2011

Upload: raunak-shrestha

Post on 05-Jul-2015

TRANSCRIPT

Page 1: Cross-species gene normalization by species inference

Raunak Shrestha

24th November 2011

Page 2: Cross-species gene normalization by species inference

http://reginanuzzo.com/wp-content/iHOP_chimp.gif

Page 3: Cross-species gene normalization by species inference

BioCreative

“BioCreative ….. consists of a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. … ”

http://biocreative.sourceforge.net/

Page 4: Cross-species gene normalization by species inference

BioCreative III

• 3 tasks:

  • Gene Normalization (GN)

  • Protein-protein interactions (PPI)

  • Interactive demonstration task for gene indexing and retrieval (IAT)

• Goal

  • Map genes and proteins mentioned in biomedical literature to their standard database identifiers

• BioCreative III

  • No species information was provided

  • Produce a list of Entrez Gene identifiers, across all species, for the gene mentions in full-text biomedical articles

Page 5: Cross-species gene normalization by species inference

Relation Extraction

Named Entity Normalization

genes, proteins, diseases -> specific database identifier

Page 6: Cross-species gene normalization by species inference

GN task in BioCreative III

• 3 MAJOR CHALLENGES!!!

• Gene Mention Variations

  • Orthographical Variation: TLR7 vs. TLR-7

  • Morphological Variation: “GHF-1 transcriptional factor” vs. “GHF-1 transcription factor”

  • Enumeration Variation: TLR7/8 -> TLR7, TLR8

  • Variation with Abbreviation: SLC11A1 vs. NRAMP1

• Orthologous Gene Ambiguity

  • Orthologous genes belong to different species and should be mapped to different database identifiers

• Intra-species Gene Ambiguity

  • Different genes have the same name

  • “CAS” -> multiple database identifiers

    • EntrezGene Id: 1434 (“Cellular apoptosis susceptibility protein”)

    • EntrezGene Id: 9564 (“Breast cancer antiestrogen resistance 1”)
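The variation and ambiguity cases above can be made concrete with a minimal Python sketch. It is only an illustration of the problem, not GenNorm's implementation: the function names and the tiny `AMBIGUOUS_NAMES` table are hypothetical, and the only identifiers used are the two “CAS” ids quoted on the slide.

```python
import re
from typing import Dict, List

# Toy table for illustration; the two ids are the ones quoted for "CAS" above.
AMBIGUOUS_NAMES: Dict[str, List[int]] = {
    "CAS": [1434, 9564],
}


def normalize_orthography(mention: str) -> str:
    """Drop hyphens/whitespace and upper-case, so 'TLR-7' and 'TLR7' compare equal."""
    return re.sub(r"[\s\-]+", "", mention).upper()


def expand_enumeration(mention: str) -> List[str]:
    """Expand 'TLR7/8' into ['TLR7', 'TLR8']; other mentions are returned unchanged."""
    match = re.fullmatch(r"([A-Za-z\-]+)(\d+(?:/\d+)+)", mention)
    if match is None:
        return [mention]
    stem, numbers = match.groups()
    return [stem + number for number in numbers.split("/")]


def candidate_ids(mention: str) -> List[int]:
    """Return every identifier that shares this name (empty list if unknown)."""
    return AMBIGUOUS_NAMES.get(normalize_orthography(mention), [])


print(normalize_orthography("TLR-7"))
print(expand_enumeration("TLR7/8"))
print(candidate_ids("CAS"))
```

A name such as “CAS” returns more than one candidate, which is why a later disambiguation step (species assignment and inference scoring) is needed.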

Page 7: Cross-species gene normalization by species inference

Method

• GenNorm

  • An integrative method to handle the three issues of the GN task

• GenNorm uses 3 modules

  • Gene Name Recognition (GNR) module

  • Species Assignation (SA) module

  • Species-specific Gene Normalization (SGN) module

Architecture of GenNorm.

Page 8: Cross-species gene normalization by species inference

Gene Name Recognition (GNR) module

• AIIA-GMT is an XML-RPC client of a web server that recognizes named entities in biomedical literature

Page 9: Cross-species gene normalization by species inference

Gene Name Recognition (GNR) module

• Identifiers Extraction

  • If gene mentions cannot be matched with a particular database identifier

  • Queries the Entrez Gene database

  • Attaches all the associated ids (swissprot_id, SGD_id, etc.)

Distillation

Page 10: Cross-species gene normalization by species inference

Species Assignation (SA) module

• Aggregates three different species name lexicons:

  • NCBI taxonomy

  • List of cell lines from Wikipedia

  • The corpus of Linnaeus
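As a rough illustration of how three lexicons could be aggregated and then used to tag species mentions, here is a small Python sketch. The lexicon contents are toy examples chosen for this sketch (9606 and 10090 are the NCBI Taxonomy ids for human and mouse), and the function names are hypothetical; the slides do not show how GenNorm actually merges its resources.

```python
import re
from typing import Dict, List, Tuple

# Toy stand-ins for the three resources; real lexicons are far larger.
NCBI_TAXONOMY = {"homo sapiens": 9606, "human": 9606,
                 "mus musculus": 10090, "mouse": 10090}
CELL_LINES = {"hela": 9606}                      # a cell line implies its species
LINNAEUS_TERMS = {"humans": 9606, "mice": 10090}


def build_species_lexicon(*lexicons: Dict[str, int]) -> Dict[str, int]:
    """Merge lexicons into one name -> taxonomy id map (first source wins on conflicts)."""
    merged: Dict[str, int] = {}
    for lexicon in lexicons:
        for name, tax_id in lexicon.items():
            merged.setdefault(name.lower(), tax_id)
    return merged


def tag_species(text: str, lexicon: Dict[str, int]) -> List[Tuple[int, str, int]]:
    """Return (offset, matched text, taxonomy id) for every whole-word lexicon hit."""
    hits = []
    for name, tax_id in lexicon.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.group(0), tax_id))
    return sorted(hits)


lexicon = build_species_lexicon(NCBI_TAXONOMY, CELL_LINES, LINNAEUS_TERMS)
for hit in tag_species("HeLa cells and mouse fibroblasts", lexicon):
    print(hit)
```

Keeping the cell-line list in the same map means a mention such as “HeLa” contributes species evidence even when the species itself is never named.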

Page 11: Cross-species gene normalization by species inference

Species Assignation (SA) module

Guaranteed Inference

Co-occurrence Inference

Species sub-type can disambiguate the species name
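The slide only names the two inference types, so the following Python sketch rests on an assumed reading: “guaranteed” is taken to mean that a species name directly precedes the gene mention, and “co-occurrence” that a species name appears anywhere in the same sentence. The rules, the function name, and the toy lexicon are all assumptions made for illustration.

```python
import re
from typing import Dict, Optional, Tuple

SPECIES = {"human": 9606, "mouse": 10090}  # toy lexicon: name -> NCBI Taxonomy id


def assign_species(sentence: str, gene: str,
                   lexicon: Dict[str, int]) -> Optional[Tuple[int, str]]:
    """Return (taxonomy id, rule used) for a gene mention, or None if no evidence."""
    lowered = sentence.lower()
    gene_pattern = re.escape(gene.lower())

    # Assumed "guaranteed inference": species name immediately before the gene.
    for name, tax_id in lexicon.items():
        if re.search(r"\b" + re.escape(name) + r"\s+" + gene_pattern + r"\b", lowered):
            return tax_id, "guaranteed"

    # Assumed "co-occurrence inference": species name elsewhere in the sentence.
    for name, tax_id in lexicon.items():
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            return tax_id, "co-occurrence"

    return None


print(assign_species("Human TLR7 is expressed in mouse cells", "TLR7", SPECIES))
print(assign_species("TLR7 is expressed in mouse cells", "TLR7", SPECIES))
print(assign_species("TLR7 binds single-stranded RNA", "TLR7", SPECIES))
```

Checking the stricter rule first means a directly attached species name wins over a weaker mention elsewhere in the sentence.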

Page 12: Cross-species gene normalization by species inference

Species Assignation (SA) module

5,933,419 Entrez Gene Ids belonging to > 6,000 species

Page 13: Cross-species gene normalization by species inference

Species-specific Gene Normalization (SGN) module

• Measures the inference scores of candidate Entrez Gene Ids in articles

• Entity Inference:

  • Inference by exact match

  • Tagged entity = gene name entity

• Bag of words

  • Inference by partial match

  • Tagged entity has at least one word matching the gene name entity from the bag of words

“Hypoxia-inducible factor-1 alpha” -> “Hypoxia”, “inducible”, “factor”, “1”, “alpha”

“CCRL1” = “chemokine receptor like 1”
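A minimal Python sketch of the two inference modes, exact match and bag-of-words partial match, might look as follows. The candidate keys `GENE_A` and `GENE_B` are placeholders rather than real Entrez Gene Ids, and scoring a partial match by the fraction of matched words is an assumption of this sketch; the slide only states that at least one word must match.

```python
import re
from typing import Dict, List, Set

# Placeholder candidates: key -> known names. Keys are NOT real Entrez Gene Ids.
GENE_NAMES: Dict[str, List[str]] = {
    "GENE_A": ["CCRL1", "chemokine receptor like 1"],
    "GENE_B": ["Hypoxia-inducible factor-1 alpha"],
}


def bag_of_words(name: str) -> Set[str]:
    """Split a name into lower-cased alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def inference_scores(tagged: str, gene_names: Dict[str, List[str]]) -> Dict[str, float]:
    """Score each candidate: 1.0 for an exact match, else the best word overlap."""
    tagged_words = bag_of_words(tagged)
    scores: Dict[str, float] = {}
    for gene_id, names in gene_names.items():
        if any(tagged.lower() == name.lower() for name in names):
            scores[gene_id] = 1.0  # inference by exact match
            continue
        best = 0.0
        for name in names:
            bag = bag_of_words(name)
            if bag:
                best = max(best, len(tagged_words & bag) / len(bag))
        if best > 0:  # inference by partial match: at least one shared word
            scores[gene_id] = best
    return scores


print(inference_scores("CCRL1", GENE_NAMES))
print(inference_scores("hypoxia inducible factor", GENE_NAMES))
print(inference_scores("factor 1", GENE_NAMES))
```

A short mention such as “factor 1” scores against both candidates, which shows why partial matches alone produce false positives and need to be combined with other evidence.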

Page 14: Cross-species gene normalization by species inference

Test Data

• Two sets of test data

  • A: fully annotated by a group of trained and experienced curators from different model organism databases

  • B: partially annotated by curators at NLM

• 507 full-text articles from various BMC and PLoS journals

Page 15: Cross-species gene normalization by species inference

Result

• The post-processing step was the most useful

• Identifier Extraction was the most efficient

Page 16: Cross-species gene normalization by species inference

Conclusion and Critique

• GenNorm tries to address key problems in cross-species gene normalization

• Orthologous gene ambiguity is still a challenging task

• An even more challenging step is to filter out the high-false-positive set, as described in Table-5

• The paper briefly describes some of the systems available in biomedical text mining to date

• GenNorm seems to be a highly integrated system

• Grabs all the “cream” out of the best available resources in the field of biomedical literature text mining
