Lecture 12a: Text-based Information Retrieval
CS540 TRANSCRIPT
4/16/19
Material borrowed (with permission) from James Pustejovsky & Marc Verhagen of Brandeis. Mistakes are mine.
Motivating Example: Biocuration
Over 50,000 articles relevant to cancer research are published per year.
No expert can read or remember that many.
DARPA's goal:
◦ Create an agent that reads every article
◦ Create an interface to let cancer boards access this information
◦ Implement well-informed, individualized cancer treatments
Pipeline of NLP IR Tools
◦ Scraping (not covered here)
◦ Sentence splitting
◦ Tokenization
◦ (Stemming / Lemmatization)
◦ Part-of-speech tagging
◦ Shallow parsing
◦ Named entity recognition
◦ Syntactic parsing
◦ (Semantic Role Labeling)
The early, index-based stages are covered today; the later NLP stages are forthcoming.
Sentence splitting
Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
Split into sentences:
Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.
However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs.
Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
A heuristic rule for sentence splitting
sentence boundary = period + space(s) + capital letter
Regular expression in Perl
s/\. +([A-Z])/\.\n\1/g;
Errors
Two solutions:
◦ Add more rules to handle exceptions
◦ Machine learning
IL-33 is known to induce the production of Th2-associated cytokines (e.g. IL-5 and IL-13).
IL-33 is known to induce the production of Th2-associated cytokines (e.g.
IL-5 and IL-13).
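The Perl rule above can be transcribed into Python to see both the rule and its failure mode. The function name and sample sentences below are illustrative, not part of the original slides:

```python
import re

# Python analogue of the slide's Perl rule s/\. +([A-Z])/\.\n\1/g;
# (a sketch: real splitters need exception lists or machine learning).
def split_sentences(text):
    return re.sub(r"\. +([A-Z])", r".\n\1", text).split("\n")

# The heuristic handles ordinary prose correctly...
ok = split_sentences("Protocols reduce cytokines. However, production matters.")
# ...but wrongly splits after the abbreviation "e.g."
bad = split_sentences("IL-33 induces Th2 cytokines (e.g. IL-5 and IL-13).")
```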
Tokenization
Convert a sentence into a sequence of tokens
Why do we tokenize?
Because we do not want to treat a sentence as a sequence of characters!
The protein is activated by IL2.
The protein is activated by IL2 .
Tokenization Issues
Separate possessive endings or abbreviated forms from preceding words:
◦ Mary's → Mary 's
◦ Mary's → Mary is
◦ Mary's → Mary has
Separate punctuation marks and quotes from words:
◦ Mary. → Mary .
◦ "new" → " new "
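A minimal tokenizer along these lines can be sketched in Python. The function and its two rules are illustrative only and do not handle the harder cases on the next slide:

```python
import re

def tokenize(sentence):
    """Minimal tokenizer sketch: detach possessive 's, punctuation, and quotes."""
    sentence = re.sub(r"(\w)('s)\b", r"\1 \2", sentence)   # Mary's -> Mary 's
    sentence = re.sub(r"([.,!?\"])", r" \1 ", sentence)    # Mary.  -> Mary .
    return sentence.split()

tokens = tokenize('The protein is activated by IL2.')
# -> ['The', 'protein', 'is', 'activated', 'by', 'IL2', '.']
```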
Tokenization problems
Commas:
◦ 2,6-diaminohexanoic acid
◦ tricyclo(3.3.1.13,7)decanone
Four kinds of hyphens:
◦ Syntactic: calcium-dependent, Hsp-60
◦ Knocked-out gene: lush-- flies
◦ Negation: -fever
◦ Electric charge: Cl-
(K. Cohen, NAACL 2007)
Full text → Index terms
[Diagram: a document passes through structure recognition (separating structure from full text), then text normalization (accents, spacing, etc.), then stopword removal, noun-group detection, and stemming; automatic or manual indexing then produces the index terms.]
Automatic Indexing
Choose from the terms in a document those which are most indicative of its content.
◦ Contrast with full-text retrieval.
For non-Boolean retrieval, include weights with terms (more later).
Normalizing terms
Should numbers, units ("km/h"), etc. be included?
Should "traffic" and "Traffic" be one term?
Should "compute", "computer", "computation", "computerisation" all be one term?
◦ Stemming is the process of removing suffixes so that these are all mapped to "comput".
Word frequency characteristics
Zipf's Law: rank × frequency ≈ constant. (The most frequent word is twice as common as the second most frequent, three times as common as the third most frequent, etc.)
[Plot: word frequency (0-1200) against rank (1-28), showing the characteristic sharp drop.]
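The relationship can be checked with a few lines of arithmetic. The top-rank count of 1200 below is a made-up figure matching the plot's scale, not real corpus data:

```python
# Zipf's law sketch: rank * frequency is roughly constant, so the word at
# rank r is expected to occur about f(1) / r times.
top_frequency = 1200  # assumed frequency of the most common word
expected = [round(top_frequency / rank) for rank in range(1, 6)]
# rank 1..5 -> 1200, 600, 400, 300, 240; each product rank * frequency
# comes back to the same constant:
products = [rank * freq for rank, freq in enumerate(expected, start=1)]
```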
Statistical Indexing - Basis
The most common words are content-free "function" terms:
◦ the, and, or, but, of, in, it, her, a, ...
◦ Not useful for indexing.
Rare words are content-heavy, but they appear in so few documents that they too are not useful for indexing.
Middle-frequency words are the best for indexing documents.
Basic Indexing Strategy
1. List the unique words in the documents.
2. Remove stopwords (about 250 for English).
3. Stem the remaining words (remove endings: ing, 's, ful, etc.).
4. Assign as index terms either:
   A - all resulting terms, or
   B - all but very rare terms, or
   C - terms that are most frequent in the document, or
   D - terms weighted highly by other measures.
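Steps 1-4 can be sketched as a small pipeline. The stopword set and the toy suffix stripper below are stand-ins; a real system would use ~250 stopwords and a proper stemmer such as Porter's:

```python
def index_terms(text, stopwords, stem):
    """Steps 1-4 as a sketch: unique words, minus stopwords, stemmed."""
    words = {w.lower() for w in text.split()}              # step 1
    return sorted({stem(w) for w in words                  # step 3
                   if w not in stopwords})                 # step 2

def toy_stem(word):
    """Toy stand-in for step 3: strip a few common endings."""
    for suffix in ("ing", "ful", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

terms = index_terms("the king was laughing at the hopeful emperor",
                    stopwords={"the", "was", "at"}, stem=toy_stem)
# -> ['emperor', 'hope', 'king', 'laugh']
```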
Implementation Details
Indexing results in records like:
Doc12: napoleon, france, revolution, emperor
or (weighted terms):
Doc12: napoleon-8, france-6, revolution-4, emperor-7
To find all documents about Napoleon would involve looking at every document's index record (possibly thousands or millions). (Assume that Doc12 references another file which contains other details about the document.)
Implementation - Inverted Files
Instead, the information is inverted:
napoleon: doc12, doc56, doc87, doc99
or (weighted):
napoleon: doc12-8, doc56-3, doc87-5, doc99-2
The inverted file contains one record per index term and is organized so that a given index term can be found quickly.
Inverted File (on Tokens)
Inverted file: a list of the tokens in a set of documents and the documents in which they appear.

Word    Documents
abacus  3, 19, 22
actor   2, 19, 29
aspen   5
atoll   11, 34

Stop words are removed before building the index.
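Building such an inverted file is a short exercise in Python. The document IDs and text below are toy data echoing the Napoleon example:

```python
from collections import defaultdict

def build_inverted_file(docs, stopwords=frozenset()):
    """Invert doc -> tokens into token -> sorted doc ids (stop words removed first)."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token not in stopwords:
                inverted[token].add(doc_id)
    # one record per index term, sorted for fast lookup and merging
    return {term: sorted(ids) for term, ids in inverted.items()}

index = build_inverted_file(
    {12: "napoleon france revolution emperor",
     56: "napoleon france",
     87: "napoleon"})
# index["napoleon"] -> [12, 56, 87]
```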
Keywords and Controlled Vocabulary
Keyword: a term that is used to describe the subject matter in a document. It is sometimes called an index term.
Keywords can be extracted automatically from a document or assigned by a human cataloguer or indexer.
Controlled vocabulary: A list of words that can be used as keywords, e.g., in a medical system, a list of medical terms.
Inverted file (more complete definition):
A list of the keywords that apply to a set of documents and the documents in which they appear.
Enhancements to Inverted Files
Location: the inverted file holds information about the location of each term within the document.
Uses:
◦ adjacency and near operators
◦ user interface design -- highlight the location of search terms
Frequency: the inverted file includes the number of postings for each term.
Uses:
◦ term weighting
◦ query processing optimization
◦ user interface design
Inverted File (Enhanced)

Word    Postings  (Document, Location)
abacus  4         (3, 94), (19, 7), (19, 212), (22, 56)
actor   3         (2, 66), (19, 213), (29, 45)
aspen   1         (5, 43)
atoll   3         (11, 3), (11, 70), (34, 40)
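Both enhancements (posting counts and term locations) can be sketched together. The function and the toy documents below are illustrative:

```python
from collections import defaultdict

def build_positional_index(docs):
    """Enhanced inverted file sketch: term -> posting count plus a list of
    (doc id, word position) pairs, usable for near operators and weighting."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            postings[token].append((doc_id, position))
    return {t: {"postings": len(p), "locations": p} for t, p in postings.items()}

idx = build_positional_index({3: "abacus counting", 19: "abacus again abacus"})
# idx["abacus"] -> {'postings': 3, 'locations': [(3, 0), (19, 0), (19, 2)]}
```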
Organization of Inverted Files
[Diagram: the index (vocabulary) file lists the terms (ant, bee, cat, dog, elk, fox, gnu, hog), each with a pointer into the postings file; the inverted lists in the postings file in turn point into the documents file.]
Efficiency Criteria
Storage: inverted files are big, typically 10% to 100% of the size of the collection of documents.
Update performance: it must be possible, with a reasonable amount of computation, to:
(a) add a large batch of documents
(b) add a single document
Retrieval performance: retrieval must be fast enough to satisfy users without using excessive resources.
Index File
If an index is held on disk, search time is dominated by the number of disk accesses.
Suppose that an index has 1,000,000 distinct terms, and each index entry (the term plus a pointer to its inverted list) averages 100 characters.
The index is then about 100 megabytes, which can easily be held in memory.
Postings File
Since inverted lists may be very long, it is important to match postings efficiently.
Usually, the inverted lists will be held on disk. Therefore algorithms for matching postings use sequential file processing.
For efficient matching, the inverted lists should all be sorted in the same sequence, usually alphabetic order (a "lexicographic index").
Merging inverted lists is the most computationally intensive task in many information retrieval systems.
Efficiency and Query Languages
Some query options may require huge computation, e.g.:
Regular expressions: if inverted files are stored in alphabetical order,
◦ comp* can be processed efficiently
◦ *comp cannot be processed efficiently
Boolean terms: if A and B are search terms,
◦ A or B can be processed by comparing two moderate-sized lists
◦ (not A) or (not B) requires two very large lists
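The cost difference comes down to linear merges over sorted posting lists. The sketch below shows "A or B" as a single pass over two small lists (the list contents are made up):

```python
def merge_or(list_a, list_b):
    """Union of two sorted posting lists via a single linear merge (a sketch)."""
    i = j = 0
    merged = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] < list_b[j]:
            merged.append(list_a[i]); i += 1
        elif list_a[i] > list_b[j]:
            merged.append(list_b[j]); j += 1
        else:                       # same doc in both lists: emit once
            merged.append(list_a[i]); i += 1; j += 1
    return merged + list_a[i:] + list_b[j:]

# "A or B" only touches the two moderate-sized lists:
docs = merge_or([12, 56, 87], [12, 40, 99])
# -> [12, 40, 56, 87, 99]
```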
Lexeme, Lexicon & Lemma
Lexeme: the smallest unit of language which has a meaning (roughly, a dictionary entry), e.g. run.
◦ Takes various inflected word forms, e.g. runs, running, ran.
◦ conduct (verb) is a different lexeme from conduct (noun).
Lexicon: a finite set of lexemes (roughly, a dictionary).
Lemma: the canonical or basic form that represents the lexeme, e.g. run.
Lemmatization
The process of mapping word forms to their lemmas, e.g. running → run.
Typically done using morphological analysis.
Often done in NLP to avoid data sparsity, but depending on the application it may sometimes be best to keep the word forms.
Lemmatization is not Trivial
May depend on the context:
◦ He found the ball → find
◦ He will found the Institute → found
Depends on the part of speech:
◦ He conducted the orchestra → conduct (verb)
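The context dependence can be made concrete with a lemma table keyed by (word form, part of speech). The table entries are the slide's own examples; the Penn-style POS tags (VBD for past tense, VB for base form) are an assumed convention:

```python
# Lemma lookup keyed by (word form, part of speech): a sketch of why
# lemmatization needs context.
LEMMAS = {
    ("found", "VBD"): "find",      # past tense: "He found the ball"
    ("found", "VB"): "found",      # base verb: "He will found the Institute"
    ("conducted", "VBD"): "conduct",
}

def lemmatize(word, pos):
    """Return the lemma for (word, pos), falling back to the lowercased form."""
    return LEMMAS.get((word.lower(), pos), word.lower())
```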
Stemming
The removal of inflectional endings from words (stripping off affixes):
◦ laughing, laugh, laughs, laughed → laugh
Problems:
◦ Can conflate semantically different words: gallery and gall may both be stemmed to gall.
◦ A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization.
Porter Stemmer
A lexicon-free stemmer based on rewrite rules:
◦ ATIONAL → ATE (e.g. relational → relate)
◦ FUL → ε (e.g. hopeful → hope)
◦ SSES → SS (e.g. caresses → caress)
Errors of commission:
◦ organization → organ
◦ policy → police
Errors of omission:
◦ urgency (not stemmed to urgent)
◦ European (not stemmed to Europe)
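The three rewrite rules on this slide can be run directly. This is a tiny sketch, not the full Porter stemmer, which has many more rules plus "measure" conditions on the stem:

```python
# Suffix-rewrite rules from the slide, tried in order (longest-matching first).
RULES = [("ational", "ate"), ("sses", "ss"), ("ful", "")]

def stem(word):
    """Apply the first matching suffix rewrite; leave the word alone otherwise."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

examples = {w: stem(w) for w in ("relational", "hopeful", "caresses", "organization")}
# relational -> relate, hopeful -> hope, caresses -> caress; note that
# "organization" is untouched by these three rules, though the full Porter
# stemmer over-stems it to "organ" (the slide's error of commission).
```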
Is stemming useful?
For IR, some improvement, especially for smaller documents.
Helps on average, but not a lot:
◦ Word sense disambiguation on query terms: business may be stemmed to busy, saw (the tool) to see.
◦ Most studies of stemming for IR were done for English; it may help more for other languages.
◦ The possibility of letting people interactively influence the stemming has not been studied much.
Improved by using a dictionary:
◦ If the stem is not in the dictionary, use the original word.
◦ Often called lemmatization when a dictionary is used.