Lecture 12a: Text-based Information Retrieval
CS540 TRANSCRIPT
4/16/19
Material borrowed (with permission) from James Pustejovsky & Marc Verhagen of Brandeis. Mistakes are mine.
Motivating Example: Biocuration
Over 50,000 articles relevant to cancer research are published per year.
No expert can read or remember that many.
DARPA's goal:
◦ Create an agent that reads every article
◦ Create an interface to let cancer boards access this information
◦ Implement well-informed, individualized cancer treatments
Pipeline of NLP IR Tools
◦ Scraping (not covered here)
◦ Sentence splitting
◦ Tokenization
◦ (Stemming / Lemmatization)
◦ Part-of-speech tagging
◦ Shallow parsing
◦ Named entity recognition
◦ Syntactic parsing
◦ (Semantic Role Labeling)
The early, index-based stages are covered today; the later NLP stages are forthcoming.
Sentence splitting
Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
Split into sentences:
Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.
However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs.
Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
A heuristic rule for sentence splitting
sentence boundary = period + space(s) + capital letter
Regular expression in Perl
s/\. +([A-Z])/\.\n\1/g;
Errors
Two solutions:
◦ Add more rules to handle exceptions
◦ Machine learning
IL-33 is known to induce the production of Th2-associated cytokines (e.g. IL-5 and IL-13).
IL-33 is known to induce the production of Th2-associated cytokines (e.g.
IL-5 and IL-13).
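The Perl rule above can be transcribed into Python to see both the rule and its failure mode. The function name and sample sentences below are illustrative, not part of the original slides:

```python
import re

# Python analogue of the slide's Perl rule s/\. +([A-Z])/\.\n\1/g;
# (a sketch: real splitters need exception lists or machine learning).
def split_sentences(text):
    return re.sub(r"\. +([A-Z])", r".\n\1", text).split("\n")

# The heuristic handles ordinary prose correctly...
ok = split_sentences("Protocols reduce cytokines. However, production matters.")
# ...but wrongly splits after the abbreviation "e.g."
bad = split_sentences("IL-33 induces Th2 cytokines (e.g. IL-5 and IL-13).")
```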
Tokenization
Convert a sentence into a sequence of tokens
Why do we tokenize?
Because we do not want to treat a sentence as a sequence of characters!
The protein is activated by IL2.
The protein is activated by IL2 .
Tokenization Issues
Separate possessive endings or abbreviated forms from preceding words:
◦ Mary's → Mary 's
◦ Mary's → Mary is
◦ Mary's → Mary has
Separate punctuation marks and quotes from words:
◦ Mary. → Mary .
◦ "new" → " new "
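A minimal tokenizer along these lines can be sketched in Python. The function and its two rules are illustrative only and do not handle the harder cases on the next slide:

```python
import re

def tokenize(sentence):
    """Minimal tokenizer sketch: detach possessive 's, punctuation, and quotes."""
    sentence = re.sub(r"(\w)('s)\b", r"\1 \2", sentence)   # Mary's -> Mary 's
    sentence = re.sub(r"([.,!?\"])", r" \1 ", sentence)    # Mary.  -> Mary .
    return sentence.split()

tokens = tokenize('The protein is activated by IL2.')
# -> ['The', 'protein', 'is', 'activated', 'by', 'IL2', '.']
```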
Tokenization problems
Commas:
◦ 2,6-diaminohexanoic acid
◦ tricyclo(3.3.1.13,7)decanone
Four kinds of hyphens:
◦ Syntactic: calcium-dependent, Hsp-60
◦ Knocked-out gene: lush-- flies
◦ Negation: -fever
◦ Electric charge: Cl-
(K. Cohen, NAACL 2007)
Full text → Index terms
[Diagram: a document passes through structure recognition (separating structure from full text), then text normalization (accents, spacing, etc.), then stopword removal, noun-group detection, and stemming; automatic or manual indexing then produces the index terms.]
Automatic Indexing
Choose from the terms in a document those which are most indicative of its content.
◦ Contrast with full-text retrieval.
For non-Boolean retrieval, include weights with terms (more later).
Normalizing terms
Should numbers, units ("km/h"), etc. be included?
Should "traffic" and "Traffic" be one term?
Should "compute", "computer", "computation", "computerisation" all be one term?
◦ Stemming is the process of removing suffixes so that these are all mapped to "comput".
Word frequency characteristics
Zipf's Law: rank × frequency ≈ constant. (The most frequent word is twice as common as the second most frequent, three times as common as the third most frequent, etc.)
[Plot: word frequency (0-1200) against rank (1-28), showing the characteristic sharp drop.]
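The relationship can be checked with a few lines of arithmetic. The top-rank count of 1200 below is a made-up figure matching the plot's scale, not real corpus data:

```python
# Zipf's law sketch: rank * frequency is roughly constant, so the word at
# rank r is expected to occur about f(1) / r times.
top_frequency = 1200  # assumed frequency of the most common word
expected = [round(top_frequency / rank) for rank in range(1, 6)]
# rank 1..5 -> 1200, 600, 400, 300, 240; each product rank * frequency
# comes back to the same constant:
products = [rank * freq for rank, freq in enumerate(expected, start=1)]
```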
Statistical Indexing - Basis
The most common words are content-free "function" terms:
◦ the, and, or, but, of, in, it, her, a, ...
◦ Not useful for indexing.
Rare words are content-heavy, but they appear in so few documents that they too are not useful for indexing.
Middle-frequency words are the best for indexing documents.
Basic Indexing Strategy
1. List the unique words in the documents.
2. Remove stopwords (about 250 for English).
3. Stem the remaining words (remove endings: ing, 's, ful, etc.).
4. Assign as index terms either:
   A - all resulting terms, or
   B - all but very rare terms, or
   C - terms that are most frequent in the document, or
   D - terms weighted highly by other measures.
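Steps 1-4 can be sketched as a small pipeline. The stopword set and the toy suffix stripper below are stand-ins; a real system would use ~250 stopwords and a proper stemmer such as Porter's:

```python
def index_terms(text, stopwords, stem):
    """Steps 1-4 as a sketch: unique words, minus stopwords, stemmed."""
    words = {w.lower() for w in text.split()}              # step 1
    return sorted({stem(w) for w in words                  # step 3
                   if w not in stopwords})                 # step 2

def toy_stem(word):
    """Toy stand-in for step 3: strip a few common endings."""
    for suffix in ("ing", "ful", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

terms = index_terms("the king was laughing at the hopeful emperor",
                    stopwords={"the", "was", "at"}, stem=toy_stem)
# -> ['emperor', 'hope', 'king', 'laugh']
```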
Implementation Details
Indexing results in records like:
Doc12: napoleon, france, revolution, emperor
or (weighted terms):
Doc12: napoleon-8, france-6, revolution-4, emperor-7
To find all documents about Napoleon would involve looking at every document's index record (possibly thousands or millions). (Assume that Doc12 references another file which contains other details about the document.)
Implementation - Inverted Files
Instead, the information is inverted:
napoleon: doc12, doc56, doc87, doc99
or (weighted):
napoleon: doc12-8, doc56-3, doc87-5, doc99-2
The inverted file contains one record per index term and is organized so that a given index term can be found quickly.
Inverted File (on Tokens)
Inverted file: a list of the tokens in a set of documents and the documents in which they appear.

Word    Documents
abacus  3, 19, 22
actor   2, 19, 29
aspen   5
atoll   11, 34

Stop words are removed before building the index.
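Building such an inverted file is a short exercise in Python. The document IDs and text below are toy data echoing the Napoleon example:

```python
from collections import defaultdict

def build_inverted_file(docs, stopwords=frozenset()):
    """Invert doc -> tokens into token -> sorted doc ids (stop words removed first)."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token not in stopwords:
                inverted[token].add(doc_id)
    # one record per index term, sorted for fast lookup and merging
    return {term: sorted(ids) for term, ids in inverted.items()}

index = build_inverted_file(
    {12: "napoleon france revolution emperor",
     56: "napoleon france",
     87: "napoleon"})
# index["napoleon"] -> [12, 56, 87]
```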
Keywords and Controlled Vocabulary
Keyword: a term that is used to describe the subject matter in a document. It is sometimes called an index term.
Keywords can be extracted automatically from a document or assigned by a human cataloguer or indexer.
Controlled vocabulary: A list of words that can be used as keywords, e.g., in a medical system, a list of medical terms.
Inverted file (more complete definition):
A list of the keywords that apply to a set of documents and the documents in which they appear.
Enhancements to Inverted Files
Location: the inverted file holds information about the location of each term within the document.
Uses:
◦ adjacency and near operators
◦ user interface design -- highlight the location of search terms
Frequency: the inverted file includes the number of postings for each term.
Uses:
◦ term weighting
◦ query processing optimization
◦ user interface design
Inverted File (Enhanced)

Word    Postings  (Document, Location)
abacus  4         (3, 94), (19, 7), (19, 212), (22, 56)
actor   3         (2, 66), (19, 213), (29, 45)
aspen   1         (5, 43)
atoll   3         (11, 3), (11, 70), (34, 40)
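Both enhancements (posting counts and term locations) can be sketched together. The function and the toy documents below are illustrative:

```python
from collections import defaultdict

def build_positional_index(docs):
    """Enhanced inverted file sketch: term -> posting count plus a list of
    (doc id, word position) pairs, usable for near operators and weighting."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            postings[token].append((doc_id, position))
    return {t: {"postings": len(p), "locations": p} for t, p in postings.items()}

idx = build_positional_index({3: "abacus counting", 19: "abacus again abacus"})
# idx["abacus"] -> {'postings': 3, 'locations': [(3, 0), (19, 0), (19, 2)]}
```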
Organization of Inverted Files
[Diagram: the index (vocabulary) file lists the terms (ant, bee, cat, dog, elk, fox, gnu, hog), each with a pointer into the postings file; the inverted lists in the postings file in turn point into the documents file.]
Efficiency Criteria
Storage: inverted files are big, typically 10% to 100% of the size of the collection of documents.
Update performance: it must be possible, with a reasonable amount of computation, to:
(a) add a large batch of documents
(b) add a single document
Retrieval performance: retrieval must be fast enough to satisfy users without using excessive resources.
Index File
If an index is held on disk, search time is dominated by the number of disk accesses.
Suppose that an index has 1,000,000 distinct terms, and each index entry (the term plus a pointer to its inverted list) averages 100 characters.
The index is then about 100 megabytes, which can easily be held in memory.
Postings File
Since inverted lists may be very long, it is important to match postings efficiently.
Usually, the inverted lists will be held on disk. Therefore algorithms for matching postings use sequential file processing.
For efficient matching, the inverted lists should all be sorted in the same sequence, usually alphabetic order (a "lexicographic index").
Merging inverted lists is the most computationally intensive task in many information retrieval systems.
Efficiency and Query Languages
Some query options may require huge computation, e.g.:
Regular expressions: if inverted files are stored in alphabetical order,
◦ comp* can be processed efficiently
◦ *comp cannot be processed efficiently
Boolean terms: if A and B are search terms,
◦ A or B can be processed by comparing two moderate-sized lists
◦ (not A) or (not B) requires two very large lists
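The cost difference comes down to linear merges over sorted posting lists. The sketch below shows "A or B" as a single pass over two small lists (the list contents are made up):

```python
def merge_or(list_a, list_b):
    """Union of two sorted posting lists via a single linear merge (a sketch)."""
    i = j = 0
    merged = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] < list_b[j]:
            merged.append(list_a[i]); i += 1
        elif list_a[i] > list_b[j]:
            merged.append(list_b[j]); j += 1
        else:                       # same doc in both lists: emit once
            merged.append(list_a[i]); i += 1; j += 1
    return merged + list_a[i:] + list_b[j:]

# "A or B" only touches the two moderate-sized lists:
docs = merge_or([12, 56, 87], [12, 40, 99])
# -> [12, 40, 56, 87, 99]
```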
Lexeme, Lexicon & Lemma
Lexeme: the smallest unit of language which has a meaning (roughly, a dictionary entry), e.g. run.
◦ Takes various inflected word forms, e.g. runs, running, ran.
◦ conduct (verb) is a different lexeme from conduct (noun).
Lexicon: a finite set of lexemes (roughly, a dictionary).
Lemma: the canonical or basic form that represents the lexeme, e.g. run.
Lemmatization
The process of mapping word forms to their lemmas, e.g. running → run.
Typically done using morphological analysis.
Often done in NLP to avoid data sparsity, but depending on the application it may sometimes be best to keep the word forms.
Lemmatization is not Trivial
May depend on the context:
◦ He found the ball → find
◦ He will found the Institute → found
Depends on the part of speech:
◦ He conducted the orchestra → conduct (verb)
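The context dependence can be made concrete with a lemma table keyed by (word form, part of speech). The table entries are the slide's own examples; the Penn-style POS tags (VBD for past tense, VB for base form) are an assumed convention:

```python
# Lemma lookup keyed by (word form, part of speech): a sketch of why
# lemmatization needs context.
LEMMAS = {
    ("found", "VBD"): "find",      # past tense: "He found the ball"
    ("found", "VB"): "found",      # base verb: "He will found the Institute"
    ("conducted", "VBD"): "conduct",
}

def lemmatize(word, pos):
    """Return the lemma for (word, pos), falling back to the lowercased form."""
    return LEMMAS.get((word.lower(), pos), word.lower())
```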
Stemming
The removal of inflectional endings from words (stripping off affixes):
◦ laughing, laugh, laughs, laughed → laugh
Problems:
◦ Can conflate semantically different words: gallery and gall may both be stemmed to gall.
◦ A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization.
Porter Stemmer
A lexicon-free stemmer based on rewrite rules:
◦ ATIONAL → ATE (e.g. relational → relate)
◦ FUL → ε (e.g. hopeful → hope)
◦ SSES → SS (e.g. caresses → caress)
Errors of commission:
◦ organization → organ
◦ policy → police
Errors of omission:
◦ urgency (not stemmed to urgent)
◦ European (not stemmed to Europe)
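The three rewrite rules on this slide can be run directly. This is a tiny sketch, not the full Porter stemmer, which has many more rules plus "measure" conditions on the stem:

```python
# Suffix-rewrite rules from the slide, tried in order (longest-matching first).
RULES = [("ational", "ate"), ("sses", "ss"), ("ful", "")]

def stem(word):
    """Apply the first matching suffix rewrite; leave the word alone otherwise."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

examples = {w: stem(w) for w in ("relational", "hopeful", "caresses", "organization")}
# relational -> relate, hopeful -> hope, caresses -> caress; note that
# "organization" is untouched by these three rules, though the full Porter
# stemmer over-stems it to "organ" (the slide's error of commission).
```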
Is stemming useful?
For IR, some improvement, especially for smaller documents.
Helps on average, but not a lot:
◦ Word sense disambiguation on query terms: business may be stemmed to busy, saw (the tool) to see.
◦ Most studies of stemming for IR were done for English; it may help more for other languages.
◦ The possibility of letting people interactively influence the stemming has not been studied much.
Improved by using a dictionary:
◦ If the stem is not in the dictionary, use the original word.
◦ Often called lemmatization when a dictionary is used.