UCB BioText TREC 2003 Genomics Track Participants: Marti Hearst, Gaurav Bhalotia, Preslav Nakov, Ariel Schwartz University of California, Berkeley Genomics: tasks 1 and 2

Post on 22-Dec-2015


UCB BioText

TREC 2003 Genomics Track

Participants: Marti Hearst, Gaurav Bhalotia, Preslav Nakov, Ariel Schwartz

University of California, Berkeley

Genomics: tasks 1 and 2


Overview

- UCB BioText group took part in Task 1 and Task 2
  - Task 1: Information Retrieval + Information Extraction (+ Text Classification)
  - Task 2: Text Classification + Information Extraction
- Commonalities for both tasks
  - Named entity recognition in the text
    - Genes and synonyms
    - MeSH concepts
  - Text classification algorithms


MeSH Hierarchy

- Unique identifier: e.g. Abdomen has D000005
- UMLS semantic tags: e.g. Enzyme, Gene or Genome, Mammal, Tissue, Virus etc.
- Alphanumeric descriptor codes

[A] Anatomy
    Body Regions [A01]
        Abdomen [A01.047]
        Back [A01.176]
        Breast [A01.236]
        Extremities [A01.378]
        Head [A01.456]
        Neck [A01.598]
    Musculoskeletal System [A02]
    Digestive System [A03]
    Respiratory System [A04]
    Urogenital System [A05]
    ...
[B] [C] [D] [E] [F] [G]
[H] Physical Sciences
    Electronics
        Amplifiers
        Electronics, Medical
        Transducers
    Astronomy
    Nature
    Time
[I] [J] [K]
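Descriptor codes encode their position in the hierarchy, so the level and the ancestors of a concept can be read off the code itself. The following is a minimal sketch, assuming dot-separated codes such as A01.047; the helper names `tree_level`, `truncate`, and `ancestors` are hypothetical and not part of any MeSH tool.

```python
def tree_level(code: str) -> int:
    """Depth of a descriptor code: 'A01' is level 1, 'A01.047' is level 2."""
    return len(code.split("."))


def truncate(code: str, level: int) -> str:
    """Cut a descriptor code at the given level, e.g. 'G14.330.100' at level 2."""
    return ".".join(code.split(".")[:level])


def ancestors(code: str) -> list[str]:
    """All proper ancestors of a code, from the most general to the closest."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


if __name__ == "__main__":
    print(tree_level("A01.047"))      # level of Abdomen
    print(truncate("A01.047", 1))     # its parent, Body Regions
    print(ancestors("G14.330.100"))   # hypothetical three-level code
```

Truncating a code this way is one plausible reading of features described by level (level 1: G14, level 2: G14.330).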


Task 1


TREC Task 1: Overview

- Search 525,938 MEDLINE records
  - Titles, abstracts, MeSH category terms, citation information
- Topics: taken from the GeneRIF portion of the LocusLink database
  - We are supplied with gene names
- Definition of a GeneRIF:

For gene X, find all MEDLINE references that focus on the basic biology of the gene or its protein products from the designated organism. Basic biology includes isolation, structure, genetics and function of genes/proteins in normal and disease states.


TREC Task 1: Sample Query

3 2120 Homo sapiens OFFICIAL_GENE_NAME ets variant gene 6 (TEL oncogene)
3 2120 Homo sapiens OFFICIAL_SYMBOL ETV6
3 2120 Homo sapiens ALIAS_SYMBOL TEL
3 2120 Homo sapiens PREFERRED_PRODUCT ets variant gene 6
3 2120 Homo sapiens PRODUCT ets variant gene 6
3 2120 Homo sapiens ALIAS_PROT TEL1 oncogene

- The first column is the official topic number (1-50).
- The second column contains the LocusLink ID for the gene.
- The third column contains the name of the organism.
- The fourth column contains the gene name type.
- The fifth column contains the gene name.
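The five-column layout can be read into records and grouped by topic and name type. The sketch below assumes the columns are tab-separated (the organism and gene name contain spaces, so whitespace splitting would not work); the names `TopicLine`, `parse_topic_line`, and `group_names` are hypothetical.

```python
from typing import NamedTuple


class TopicLine(NamedTuple):
    topic: int          # official topic number (1-50)
    locuslink_id: int   # LocusLink ID for the gene
    organism: str       # organism name, e.g. "Homo sapiens"
    name_type: str      # e.g. OFFICIAL_SYMBOL, ALIAS_SYMBOL
    name: str           # the gene name itself


def parse_topic_line(line: str) -> TopicLine:
    """Parse one query line; assumes exactly five tab-separated columns."""
    topic, locus_id, organism, name_type, name = line.rstrip("\n").split("\t")
    return TopicLine(int(topic), int(locus_id), organism, name_type, name)


def group_names(lines):
    """Map topic number -> name type -> list of names."""
    names = {}
    for line in lines:
        rec = parse_topic_line(line)
        names.setdefault(rec.topic, {}).setdefault(rec.name_type, []).append(rec.name)
    return names


if __name__ == "__main__":
    sample = [
        "3\t2120\tHomo sapiens\tOFFICIAL_SYMBOL\tETV6",
        "3\t2120\tHomo sapiens\tALIAS_SYMBOL\tTEL",
    ]
    print(group_names(sample))
```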


General Architecture

[Figure: system architecture.
 Query gene description -> high-confidence gene expansions, matched in title (weight J) and in abstract (weight J);
   low-confidence gene expansions, matched in title (weight K) and in abstract (weight K);
   match in MeSH (weight L); all five feed a weighted Sum.
 Query organism description -> Organism Filter.
 The filtered Sum (weight 1) and the "has GeneRIF" Classifier (weight 0.01) feed a second Sum that yields the Ordered Documents.]

Main Challenges

- Task 1: given a gene and an organism, find documents likely to have a GeneRIF
  - Relevance judgment: GeneRIF references from LocusLink
- Main challenges
  - Ranking
  - Recall
    - Find more gene synonym variations
  - Precision
    - Filter out abstracts with genes from incorrect organisms
    - Lower the rank of documents not likely to have a GeneRIF

Gene Synonym List Creation

How to Find Gene Name Synonyms?

- Strategy: compile a list of gene names from the text
  - Start with a list of gene names from LocusLink and MeSH
  - Use an n-gram-based approximate match algorithm to find alternative representations of these genes in MEDLINE abstracts
  - Look for commonalities and regularities
  - Create a set of name transformation rules
    - Some are better than others
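One way to picture the approximate match is a Dice coefficient over character n-grams. This is a sketch under assumptions: the slides do not say whether the n-grams are characters or words, nor which n was used, so character n-grams with a configurable n are a guess. The 0.5 to 1.0 range is the one the deck reports for its matches; `char_ngrams`, `dice`, and `candidate_variants` are hypothetical names.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Set of lowercase character n-grams; short strings yield themselves."""
    s = text.lower()
    if len(s) < n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def dice(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient 2|X & Y| / (|X| + |Y|) over character n-gram sets."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(x & y) / (len(x) + len(y))


def candidate_variants(known: str, observed: list[str], n: int = 3,
                       low: float = 0.5, high: float = 1.0) -> list[str]:
    """Observed strings whose similarity to a known gene name is in [low, high]."""
    return [o for o in observed if low <= dice(known, o, n) <= high]


if __name__ == "__main__":
    # Hypothetical spelling variants of the sample topic's symbol.
    print(candidate_variants("ETV6", ["etv-6", "ETV 6", "p53"], n=2))
```

Pairs surfaced this way would still need inspection before turning a regularity into a transformation rule.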


Gene Expansion: Sample Expansion Pairs

Matches whose Dice coefficient falls between 0.5 and 1.0

Gene Expansion: High Confidence Rules

- Matches whose Dice coefficient falls between 0.5 and 1.0
- Rules determined by inspection


Organism Filtering

Organism Filtering: Strategy

- Problem: the query describes the organism name using the LocusLink terminology, which differs from MEDLINE's
- Strategy: semi-automatically determine the translation
  - For a given LocusLink organism name, search for that term against the MEDLINE title, abstract, and MeSH terms
  - Display the most frequent MeSH terms that result
  - The translation appeared as one of the top 3
  - Could be a useful strategy for other translation problems
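The semi-automatic translation step amounts to a search followed by a frequency count. Below is a minimal sketch, assuming each record is a dict with `title`, `abstract`, and `mesh` keys (a hypothetical in-memory layout, not the group's actual database schema) and using plain substring search in place of a real text index.

```python
from collections import Counter


def top_mesh_terms(records, organism: str, k: int = 3):
    """Most frequent MeSH terms among records that mention the organism name."""
    needle = organism.lower()
    counts = Counter()
    for rec in records:
        # Search title, abstract, and MeSH terms, as the strategy describes.
        haystack = " ".join([rec["title"], rec["abstract"], *rec["mesh"]]).lower()
        if needle in haystack:
            counts.update(rec["mesh"])
    return counts.most_common(k)


if __name__ == "__main__":
    # Toy records invented for illustration only.
    records = [
        {"title": "Etv6 in development", "abstract": "Mus musculus embryos were used.",
         "mesh": ["Mice", "Animals"]},
        {"title": "The Mus musculus genome", "abstract": "", "mesh": ["Mice"]},
        {"title": "ETV6 fusions", "abstract": "Homo sapiens samples.", "mesh": ["Humans"]},
    ]
    print(top_mesh_terms(records, "Mus musculus"))
```

A person then picks the right MeSH term from the short list, which is what makes the procedure semi-automatic.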


Organism Filtering: Results

Sample Top-Ranked MeSH Terms

GeneRIF Classification

GeneRIF Classification: Training

- Used for our second run
- Motivation
  - Only MEDLINE documents that have been assigned GeneRIFs are considered relevant
- Strategy to improve precision: identify documents likely to have a GeneRIF assigned
  - Naïve Bayes classifier (WEKA ML tools)
- Training:
  - 50 gene names, not in the TREC training/testing set
  - Train on the 1000 top-ranked documents for each gene
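For readers unfamiliar with the classifier, here is a small multinomial Naïve Bayes with add-one smoothing written from scratch. It is only an illustration of the technique: the group used WEKA, whose implementation and settings are not reproduced here, and the class name `NaiveBayes` and the toy features are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over bags of nominal features (e.g. MeSH codes)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(docs, labels):
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        best, best_lp = None, -math.inf
        for label, n in self.class_counts.items():
            lp = math.log(n / total)  # log prior
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            for f in feats:
                if f in self.vocab:  # features unseen in training are ignored
                    lp += math.log((self.feat_counts[label][f] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best


if __name__ == "__main__":
    # Toy training data: True means "has a GeneRIF"; feature values are made up.
    docs = [["D000005", "A01"], ["D000005", "G14"], ["X1", "Z9"], ["X1", "Z8"]]
    labels = [True, True, False, False]
    model = NaiveBayes().fit(docs, labels)
    print(model.predict(["D000005"]))
```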


GeneRIF Classification: Results

[Figure: bar chart of classifier accuracy (axis 65% to 90%) on the training set and under cross-validation, for the feature sets: MeSH unique identifier; MeSH descriptor codes: whole; stems; MeSH descriptor codes: whole + level 2 + level 3.]


Document Ranking

- DB2 Net Search Extender
- Score = weighted SUM:
    1.0 * (H compared to phrases in titles)
  + 1.0 * (H compared to phrases in abstracts)
  + 0.015 * (L compared to phrases in titles)
  + 0.015 * (L compared to phrases in abstracts)
  + 1.4 * (query MeSH compared to document MeSH)
- H: high confidence gene rules; L: low confidence gene rules
- Weights determined experimentally
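The weighted sum can be written out directly. In this sketch the five component scores are assumed to be numbers already produced by the text search engine (DB2 Net Search Extender in the group's system, not called here); the dictionary keys and the functions `retrieval_score`, `final_score`, and `rank` are hypothetical. The 1 and 0.01 weights in `final_score` follow the architecture diagram's labels for the retrieval sum and the "has GeneRIF" classifier.

```python
# Weights as stated on the slide (determined experimentally by the authors).
WEIGHTS = {
    "high_title": 1.0,      # H compared to phrases in titles
    "high_abstract": 1.0,   # H compared to phrases in abstracts
    "low_title": 0.015,     # L compared to phrases in titles
    "low_abstract": 0.015,  # L compared to phrases in abstracts
    "mesh": 1.4,            # query MeSH compared to document MeSH
}


def retrieval_score(components: dict) -> float:
    """Weighted sum of the component match scores; missing components count as 0."""
    return sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())


def final_score(components: dict, has_generif: float = 0.0) -> float:
    """Combine the retrieval sum (weight 1) with the classifier output (weight 0.01)."""
    return 1.0 * retrieval_score(components) + 0.01 * has_generif


def rank(docs: dict) -> list:
    """Document ids ordered by decreasing retrieval score."""
    return sorted(docs, key=lambda d: retrieval_score(docs[d]), reverse=True)


if __name__ == "__main__":
    docs = {"a": {"low_title": 1.0}, "b": {"high_abstract": 1.0}}
    print(rank(docs))
```

With a 0.015 weight, a low-confidence match alone can never outrank a comparable high-confidence match; it mainly breaks ties and adds recall.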


Task 1: TREC Evaluation

- MAP on TREC training data
  - using GeneRIF classifier: 0.5101
  - without GeneRIF classifier: 0.5028
- MAP on TREC testing data
  - using GeneRIF classifier: 0.3912
  - without GeneRIF classifier: 0.3753
- Analysis
  - Using the classifier performs better on 27 out of 50 queries (equal on 12).
  - Tuning the parameters on the test set (tried afterwards) results in only minor improvement.

[Figure: bar chart of MAP (axis 0 to 0.6) on TREC training and TREC testing data, using and without the GeneRIF classifier.]
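MAP here is mean average precision over the topics. As a reminder of how the figure is computed, the following sketch implements the standard definition; it is not the official TREC evaluation tool, and the function names and toy rankings are illustrative.

```python
def average_precision(ranked: list, relevant: set) -> float:
    """Mean of precision@i over the ranks i where a relevant document appears."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: list) -> float:
    """Average of per-topic AP; runs is a list of (ranked_list, relevant_set)."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    runs = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["x", "y"], {"y"}),
    ]
    print(mean_average_precision(runs))
```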


Task 2


TREC Task 2

Problem definition: given GeneRIFs formatted as:

1    355    12107169    J Biol Chem 2002 Sep 13;277(37):34343-8.    the death effector domain of FADD is involved in interaction with Fas.

2    355    12177303    Nucleic Acids Res 2002 Aug 15;30(16):3609-14.    In the case of Fas-mediated apoptosis, when we transiently introduced these hybrid-ribozyme libraries into Fas-expressing HeLa cells, we were able to isolate surviving clones that were resistant to or exhibited a delay in Fas-mediated apoptosis w

… reproduce the GeneRIF from the MEDLINE record.  


Preliminary study

- Find the GeneRIF text in the abstract
  - 33,662 MEDLINE abstracts with GeneRIFs
  - Best match of the GeneRIF text in the abstract
  - Modified Unigram Dice coefficient
  - Accepted if scored above 80%


Baseline

- Baseline: pick the whole title verbatim
  - Motivation
    - the best match was a substring of the title: 46.30%
    - the whole title was the best match in 65.10%
  - Baseline: Modified Unigram Dice score 53.39%
- Choose: title vs. last sentence
  - Observation: the best match is the title OR the last sentence: 73.40%
  - If we choose a whole sentence: title vs. last sentence
    - Upper bound (best choice each time): 66.33%
    - Lower bound (worst choice each time): 22.62%
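The upper and lower bounds follow from per-document scores: take the better (or worse) of the two candidates for every document and average. A minimal sketch, assuming each document contributes a pair (score of title, score of last sentence) against its GeneRIF; the function name and the toy numbers are illustrative and are not the study's data.

```python
def choice_bounds(pairs: list) -> dict:
    """Average score under always-best, always-worst, and always-title choices."""
    n = len(pairs)
    return {
        "upper": sum(max(title, last) for title, last in pairs) / n,
        "lower": sum(min(title, last) for title, last in pairs) / n,
        "always_title": sum(title for title, _ in pairs) / n,
    }


if __name__ == "__main__":
    # (title score, last-sentence score) for two made-up documents.
    print(choice_bounds([(0.9, 0.1), (0.2, 0.6)]))
```

A classifier that picks between the two sentences can score anywhere between the two bounds, and it only helps if it beats the always-title baseline.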


Features

We experimented with the following features:

- Nominal features
  - words/stems
  - verbs (most frequent: e.g. bind, block, accept etc.; nominalized)
  - genes
  - gene_freq (number of gene names mentioned)
  - MeSH_unique_ID (e.g. D005796)
  - MeSH_codes (level 1: G14, or level 2: G14.330)
  - MeSH_semantic_type (e.g. cell, human, biological function)
  - journal
  - publication_date (month and year, e.g. 10_2003)
- Boolean features
  - target_gene (is the target gene mentioned?)
  - is_title (is the current sentence the title?)
  - is_last_sentence (is this the last sentence?)


Best Features

- Standard feature set
  - verbs (most frequent: e.g. bind, block, accept etc.; nominalized)
  - genes_freq (number of gene names mentioned)
  - MeSH_code (cut at level 2, e.g. G14.330)
  - target_gene (is the target gene mentioned?)
  - is_title (is the current sentence the title?)
  - is_last_sentence (is this the last sentence?)
- The last two were not used in the final tests.
- Weighted using TF.IDF (except the Boolean features)
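TF.IDF weighting multiplies how often a feature occurs in an example by how rare it is across the collection. The slides do not say which TF.IDF variant was used; the sketch below uses raw term frequency times log(N / df), one common choice, and the function name `tfidf` and the toy examples are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs: list) -> list:
    """Per-document feature weights: tf * log(N / df)."""
    n = len(docs)
    # Document frequency: number of documents containing each feature.
    df = Counter(f for doc in docs for f in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({f: tf[f] * math.log(n / df[f]) for f in tf})
    return weighted


if __name__ == "__main__":
    docs = [["bind", "bind", "G14.330"], ["block", "G14.330"]]
    for weights in tfidf(docs):
        print(weights)
```

Under this variant a feature that occurs in every example gets weight 0, so it cannot influence the classifier.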


Title vs. Last Sentence

- Text classification
  - Choose: title (class A) vs. last sentence (class B)
  - Naïve Bayes classifier (WEKA ML tools)
  - The standard features
- Training and testing
  - Each document represents one example
    - Features: extracted from the title and the last sentence only
    - Features from the title and from the last sentence are not distinguished; distinguishing them lowers the accuracy.
  - Training set: labeled by Modified Unigram Dice overlap with the GeneRIF
  - Stratified 10-fold cross-validation


Task 2: Evaluation

- Training
  - Document collections
    - 1000, 2000, 10000, 20000, 33662
    - finally, limited the set to the 5 target journals
  - Classification algorithm selection
    - tried: decision tree, boosting, kNN, logistic regression etc.
  - Feature selection tuning, for a fixed feature set
    - tuned the best minimum frequency thresholds for verbs and MeSH_codes: 12 and 5, respectively
- TREC run
  - Training: 5 journals, except the 139 abstracts from the TREC test
  - Feature frequency thresholds as found during training: 12 and 5
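A minimum frequency threshold simply drops features that are too rare in the training data. The sketch below assumes features are (kind, value) pairs and that frequency means the number of training examples containing the feature; the slides do not specify which count was used. The thresholds 12 (verbs) and 5 (MeSH codes) are the ones reported; the function names are hypothetical.

```python
from collections import Counter

# Thresholds reported on the slide: verbs 12, MeSH codes 5.
MIN_FREQ = {"verb": 12, "mesh_code": 5}


def frequent_features(examples: list, min_freq: dict = MIN_FREQ) -> set:
    """Features seen in at least min_freq[kind] examples (default threshold 1)."""
    counts = Counter(feat for ex in examples for feat in set(ex))
    return {
        (kind, value)
        for (kind, value), c in counts.items()
        if c >= min_freq.get(kind, 1)
    }


def filter_example(example: list, keep: set) -> list:
    """Drop the features of one example that did not pass the threshold."""
    return [f for f in example if f in keep]


if __name__ == "__main__":
    # Made-up training examples, one feature each, to show the cut-off.
    examples = (
        [[("verb", "bind")]] * 12
        + [[("verb", "block")]] * 11
        + [[("mesh_code", "G14.330")]] * 5
    )
    print(sorted(frequent_features(examples)))
```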


Task 2: Results

[Figure: bar chart of scores (axis 30% to 70%) under four measures (CD, MUD, MBD, MBDP) for five series: upper bound, cross-validation, tuned thresholds, TREC run, baseline.]

Discussion

- Test sets are small and much harder than training sets
- Task 1
  - Organism filter was very helpful
  - Noisy GeneRIF assignment limits the help given by the classifier
  - Initial runs supplied by other research groups were very helpful
- Task 2
  - Sentence truncation could improve the results
  - Need ranking, rather than classification, algorithms
  - Better feature selection needed
    - sensitivity to frequency thresholds
    - MeSH ambiguity
    - verb nominalization


Thank you!