![Page 1: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/1.jpg)
Distance functions and IE – 4?
William W. Cohen
CALD
![Page 2: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/2.jpg)
Announcements
• Current statistics:
  – days with unscheduled student talks: 6
  – students with unscheduled student talks: 4
  – Projects are due: 4/28 (last day of class)
  – Additional requirement: draft (for comments) no later than 4/21
![Page 3: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/3.jpg)
The data integration problem
![Page 4: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/4.jpg)
String distance metrics so far...
• Term-based (e.g. TF/IDF as in WHIRL)
  – Distance depends on the set of words contained in both s and t, so it is sensitive to spelling errors.
  – Usually weight words to account for “importance”
  – Fast comparison: O(n log n) for |s|+|t|=n
• Edit-distance metrics
  – Distance is the cost of the shortest sequence of edit commands that transforms s to t.
  – No notion of word importance
  – More expensive: O(n^2)
• Other metrics
  – Jaro metric & variants
  – Monge-Elkan’s recursive string matching
  – etc?
• Which metrics work best, for which problems?
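To make the O(n^2) cost of edit-distance metrics concrete, here is a minimal sketch of unit-cost Levenshtein distance by dynamic programming. The function name is illustrative and not taken from any toolkit mentioned in these slides.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance between s and t, in O(|s| * |t|) time."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # substitute a by b, or copy if equal
            ))
        prev = curr
    return prev[-1]
```

The two nested loops over the characters of s and t are where the quadratic cost comes from; a term-based metric only needs to sort or hash the tokens.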
![Page 5: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/5.jpg)
Jaro metric
![Page 6: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/6.jpg)
Winkler-Jaro metric
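The slide bodies for the Jaro and Winkler-Jaro metrics did not survive extraction. As a hedged reminder of the commonly cited definitions (not necessarily the exact variant shown in the lecture), the sketch below counts characters that match within a sliding window, halves the number of out-of-order matches to get transpositions, and then applies Winkler's prefix boost. The defaults `p=0.1` and `max_prefix=4` are conventional values assumed here.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1] using the usual window-based matching."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters, each in the order they appear in their own string.
    s_common = [c for c, hit in zip(s, s_hit) if hit]
    t_common = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_common, t_common)) // 2
    m = matches
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3


def jaro_winkler(s: str, t: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Boost the Jaro score for strings that share a common prefix."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)
```

The same prefix adjustment can be applied to any similarity with range [0,1], not just Jaro.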
![Page 7: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/7.jpg)
![Page 8: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/8.jpg)
![Page 9: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/9.jpg)
So which metric should you use?
SecondString (Cohen, Ravikumar, Fienberg):
• Java toolkit of string-matching methods from AI, Statistics, IR and DB communities
• Tools for evaluating performance on test data
• Exploratory tool for adding, testing, combining string distances
  – e.g. SecondString implements a generic “Winkler rescorer” which can rescale any distance function with range of [0,1]
• URL – http://secondstring.sourceforge.net
• Distribution also includes several sample matching problems.
![Page 10: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/10.jpg)
SecondString distance functions
• Edit-distance like:
  – Levenshtein – unit costs
  – untuned Smith-Waterman
  – Monge-Elkan (tuned Smith-Waterman)
  – Jaro and Jaro-Winkler
![Page 11: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/11.jpg)
Results - Edit Distances
Monge-Elkan is the best on average....
![Page 12: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/12.jpg)
Edit distances
![Page 13: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/13.jpg)
SecondString distance functions
• Term-based, for sets of terms S and T:
  – TFIDF distance
  – Jaccard distance: $\mathrm{sim}(S,T) = \frac{|S \cap T|}{|S \cup T|}$
  – Language models: construct $P_S$ and $P_T$ and use a distance between the two distributions
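The Jaccard similarity of two term sets is the size of their intersection divided by the size of their union. A minimal sketch, assuming whitespace tokenization and case folding (both are illustrative choices, not something the slides specify):

```python
def jaccard(s: str, t: str) -> float:
    """Jaccard similarity |S & T| / |S | T| over lowercase whitespace tokens."""
    S, T = set(s.lower().split()), set(t.lower().split())
    if not S and not T:
        return 1.0  # convention assumed here: two empty strings are identical
    return len(S & T) / len(S | T)
```

For example, "William W. Cohen" and "William Cohen" share two of three distinct tokens, so the similarity is 2/3.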
![Page 14: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/14.jpg)
SecondString distance functions
• Term-based, for sets of terms S and T:
  – TFIDF distance
  – Jaccard distance
  – Jensen-Shannon distance
    • smoothing toward union of S,T reduces cost of disagreeing on common terms
    • unsmoothed $P_S$, Dirichlet smoothing, Jelinek-Mercer
  – “Simplified Fellegi-Sunter”
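As a hedged illustration of the language-model approach, the sketch below builds token distributions for S and T, smooths each toward the pooled tokens of both strings with Jelinek-Mercer interpolation, and compares them with the Jensen-Shannon divergence (base-2 logs, so the value lies in [0,1]). The function names and the default `lam=0.5` are assumptions made for this example; SecondString's actual implementation may differ.

```python
import math
from collections import Counter


def smoothed_lm(tokens, background, lam):
    """Jelinek-Mercer: (1 - lam) * P_ml(w) + lam * P_background(w)."""
    counts = Counter(tokens)
    n = len(tokens)
    return {
        w: (1 - lam) * (counts[w] / n if n else 0.0) + lam * p_bg
        for w, p_bg in background.items()
    }


def kl(p, q):
    """KL divergence in bits; terms with p(w) = 0 contribute nothing."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence of two distributions over the same vocabulary."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def js_distance(s_tokens, t_tokens, lam=0.5):
    """Smooth both models toward the pooled tokens of S and T, then compare."""
    pooled = Counter(s_tokens) + Counter(t_tokens)
    total = sum(pooled.values())
    background = {w: c / total for w, c in pooled.items()}
    p = smoothed_lm(s_tokens, background, lam)
    q = smoothed_lm(t_tokens, background, lam)
    return jensen_shannon(p, q)
```

With `lam=0` the models are unsmoothed and two strings with no shared tokens are at the maximum distance of 1; any `lam > 0` pulls both models toward the pooled distribution and lowers that cost.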
![Page 15: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/15.jpg)
![Page 16: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/16.jpg)
Results – Token Distances
![Page 17: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/17.jpg)
![Page 18: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/18.jpg)
SecondString distance functions
• Hybrid term-based & edit-distance based:
  – Monge-Elkan’s “recursive matching scheme”, segmenting strings at token boundaries (rather than separators like commas)
  – SoftTFIDF
    • Like TFIDF but consider not just tokens in both S and T, but tokens in S “close to” something in T (“close to” relative to some distance metric)
    • Downweight close tokens slightly
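A rough sketch of the SoftTFIDF idea follows. Each token in S is paired with its closest token in T; the pair contributes to the score only if the closeness is at least a threshold `theta`, and the contribution is scaled by that closeness (the "downweight" step). Several things here are assumptions for illustration: the token similarity is Python's `difflib.SequenceMatcher` ratio rather than Jaro-Winkler, the IDF table is passed in by the caller, and unseen tokens get an IDF of 1.0.

```python
import math
from difflib import SequenceMatcher


def tfidf_vector(tokens, idf):
    """Unit-length TF-IDF weights, with weight log(tf + 1) * idf per token."""
    raw = {w: math.log(tokens.count(w) + 1) * idf.get(w, 1.0) for w in set(tokens)}
    norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
    return {w: v / norm for w, v in raw.items()}


def token_sim(a, b):
    """Stand-in character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def soft_tfidf(s_tokens, t_tokens, idf, theta=0.9):
    """Sum weight products over tokens of S that are close to some token of T."""
    vs, vt = tfidf_vector(s_tokens, idf), tfidf_vector(t_tokens, idf)
    score = 0.0
    for w in vs:
        best, closeness = max(
            ((v, token_sim(w, v)) for v in vt),
            key=lambda pair: pair[1],
            default=(None, 0.0),
        )
        if closeness >= theta:
            score += vs[w] * vt[best] * closeness
    return score
```

Setting `theta=1.0` recovers ordinary TFIDF cosine over exactly matching tokens; lowering it lets a misspelled token such as "cohn" still contribute when compared with "cohen".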
![Page 19: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/19.jpg)
![Page 20: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/20.jpg)
Results – Hybrid distances
![Page 21: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/21.jpg)
Results - Overall
![Page 22: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/22.jpg)
![Page 23: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/23.jpg)
Prospective test on two clustering tasks
![Page 24: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/24.jpg)
An anomalous dataset
![Page 25: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/25.jpg)
An anomalous dataset: census
![Page 26: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/26.jpg)
An anomalous dataset: census
Why?
![Page 27: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/27.jpg)
Other results with SecondString
• Distance functions over structured data records (first name, last name, street, house number)
• Learning to combine distance functions
• Unsupervised/semi-supervised training for distance functions over structured data
![Page 28: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/28.jpg)
Combining Information Extraction and Similarity Computations
1) Bunescu et al
2) Krauthammer et al
![Page 29: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/29.jpg)
Experiments
• Hand-tagged 50 abstracts for gene/protein entities (pre-selected to be about human genes)
• Collected dictionary of 40,000+ protein names from on-line sources
  – not complete
  – example matching is not sufficient
• Approach: use hand-coded heuristics to propose likely generalizations of existing dictionary entries.
  – not hand-coded or off-the-shelf similarity metrics
![Page 30: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/30.jpg)
Example name generalizations
![Page 31: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/31.jpg)
Basic idea behind the algorithm
original dictionary
carefully-tuned heuristics (aka hacks)
similar (but not identical) process applied to word n-grams from text to do IE: extract if n-gram -> CD
![Page 32: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/32.jpg)
Example: canonicalizing “short names” (different procedure for “full names” and “one-word” names)
![Page 33: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/33.jpg)
Example: canonicalizing “short names” (different procedure for “full names” and “one-word” names)
NF-25 in OD -> NF<n>
NF<n>, Nf<n>
Recognize:
“... NF-kappa B...” -> NF<g><l>
NF in CD? (<x><g><l>)
NF => CD (from <x><n>)
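The diagram fragments above suggest that names are generalized by replacing numbers, Greek letter names and single letters with the placeholders `<n>`, `<g>` and `<l>`. The sketch below is a loose illustration of that kind of normalization only; the tokenization rule, the Greek-letter list and the function name are all hypothetical, and the paper's tuned heuristics are certainly more involved.

```python
import re

# Hypothetical, deliberately short list of Greek letter names.
GREEK = {"alpha", "beta", "gamma", "delta", "kappa", "lambda", "sigma"}


def generalize(name: str) -> str:
    """Replace numbers, Greek letter names and single letters by placeholders."""
    out = []
    for tok in re.split(r"[\s\-]+", name.strip()):
        if not tok:
            continue
        if tok.isdigit():
            out.append("<n>")
        elif tok.lower() in GREEK:
            out.append("<g>")
        elif len(tok) == 1 and tok.isalpha():
            out.append("<l>")
        else:
            out.append(tok)  # keep the stem, e.g. "NF"
    return "".join(out)


# Canonical dictionary built from a toy original dictionary (OD).
canonical = {generalize(entry) for entry in ["NF-25"]}
```

With this toy dictionary, an unseen name such as "NF-37" normalizes to `NF<n>` and is therefore found in `canonical`, which is the sense in which normalization acts as a similarity function.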
![Page 34: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/34.jpg)
Results
• Why is precision less than 100%?
• When should you use “similarity by normalization”?
• Could a simpler algorithm do as well?
• Is there overfitting? (50 abstracts, <750 proteins)
![Page 35: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/35.jpg)
...
![Page 36: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/36.jpg)
Combining Information Extraction and Similarity Computations
1) Bunescu et al
2) Krauthammer et al
![Page 37: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/37.jpg)
Background
• Common task in proteomics/genomics:
  – look for (soft) matches to a query sequence in a large “database” of sequences.
  – want to find subsequences (genes) that are highly similar (and hence probably related)
  – want to ignore “accidental” matches
  – possible technique is Smith-Waterman (local alignment)
    • want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance
![Page 38: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/38.jpg)
![Page 39: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/39.jpg)
Smith-Waterman distance
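|   | c | o | h | e | n | d | o | r | f |
|---|---|---|---|---|---|---|---|---|---|
| m | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| c | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| o | 0 | 2 | 1 | 0 | 0 | 0 | 2 | 1 | 0 |
| h | 0 | 1 | 4 | 3 | 2 | 1 | 1 | 1 | 0 |
| n | 0 | 0 | 3 | 3 | 5 | 4 | 3 | 2 | 1 |
| s | 0 | 0 | 2 | 2 | 4 | 4 | 3 | 2 | 1 |
| k | 0 | 0 | 1 | 1 | 3 | 3 | 3 | 2 | 1 |
| i | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 1 |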
dist=5
![Page 40: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/40.jpg)
In general “peaks” in the matrix scores indicate highly similar substrings.
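A minimal sketch of the Smith-Waterman recurrence that fills such a matrix: each cell is the best of zero, a diagonal step (match or mismatch), or a gap from above or from the left, and the largest cell is the local-alignment score. The scores `match=2`, `mismatch=-1`, `gap=-1` are assumed for illustration, so the output need not agree cell for cell with the matrix on the slide.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Return (best local-alignment score, score matrix) for strings s and t.

    Rows of the matrix correspond to characters of t, columns to characters of s.
    """
    rows, cols = len(t) + 1, len(s) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if t[i - 1] == s[j - 1] else mismatch
            H[i][j] = max(
                0,                      # start a new local alignment here
                H[i - 1][j - 1] + sub,  # align t[i-1] with s[j-1]
                H[i - 1][j] + gap,      # skip a character of t
                H[i][j - 1] + gap,      # skip a character of s
            )
            best = max(best, H[i][j])
    return best, H
```

Because every cell is clamped at zero, dissimilar regions reset the score, and the "peaks" are exactly the ends of well-aligned substrings.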
![Page 41: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/41.jpg)
Background
• Common task in proteomics/genomics:
  – look for (soft) matches to a query sequence in a large “database” of sequences.
  – possible technique is Smith-Waterman (local alignment)
    • want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance
    • based on substitutability theory for amino acids
  – doesn’t scale well
    • BLAST and FASTA: fast approximate S-W
![Page 42: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/42.jpg)
BLAST/FASTA ideas
• Find all char n-grams (“words”) in the query string.
• FASTA:
  – Use inverted indices to find out where these words appear in the DB sequence
  – Use S-W only near DB sections that contain some of these words
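The FASTA idea can be sketched as a seed-and-extend filter: index every character n-gram of the database by position, look up the query's n-grams, and return only the neighbourhoods of the hits as candidates for full Smith-Waterman alignment. The function names, `n=3` and the padding of 5 characters are assumptions for this example, not FASTA's real parameters.

```python
from collections import defaultdict


def ngrams(text, n):
    """All (position, n-gram) pairs of text."""
    return [(i, text[i:i + n]) for i in range(len(text) - n + 1)]


def build_index(db, n=3):
    """Inverted index from each n-gram to the positions where it occurs in db."""
    index = defaultdict(list)
    for pos, gram in ngrams(db, n):
        index[gram].append(pos)
    return index


def candidate_regions(query, index, n=3, pad=5):
    """Merged (start, end) windows of the DB that contain some query word."""
    hits = sorted(pos for _, gram in ngrams(query, n) for pos in index.get(gram, []))
    regions = []
    for pos in hits:
        start, end = max(0, pos - pad), pos + n + pad
        if regions and start <= regions[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions
```

Only the slices `db[start:end]` for the returned windows would then be passed to the quadratic alignment step, which is where the speed-up comes from.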
![Page 43: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/43.jpg)
BLAST/FASTA ideas
• Find all char n-grams (“words”) in the query string.
• BLAST:
  – Generate variations of these words by looking for changes that would lead to strong similarities
  – Discard “low IDF” words (where accidental matches are likely)
  – Use expanded set of n-grams to focus search
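The word-expansion step can be illustrated with a toy neighbourhood generator: for a query word, enumerate all same-length words over the alphabet and keep those whose position-by-position substitution score against the original reaches a threshold. The scoring function, alphabet and threshold below are made up for the example, the brute-force enumeration is only practical for short words, and the low-IDF filter from the slide is omitted.

```python
from itertools import product


def word_score(a, b, sub_score):
    """Sum of per-position substitution scores for two equal-length words."""
    return sum(sub_score(x, y) for x, y in zip(a, b))


def expand_word(word, alphabet, sub_score, threshold):
    """All words of the same length scoring at least threshold against word."""
    return [
        "".join(candidate)
        for candidate in product(alphabet, repeat=len(word))
        if word_score(word, candidate, sub_score) >= threshold
    ]


def simple_score(x, y):
    """Toy substitution score: +2 for identical characters, -1 otherwise."""
    return 2 if x == y else -1


neighbours = expand_word("ACG", "ACGT", simple_score, threshold=3)
```

With these toy scores the neighbourhood of "ACG" is the word itself plus every word that differs in exactly one position.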
![Page 44: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/44.jpg)
query string
words and expansions
![Page 45: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/45.jpg)
BLAST/FASTA ideas
• Find all char n-grams (“words”) in the query string.
• BLAST:
  – Generate variations of these words by looking for changes that would lead to strong similarities
  – Discard “low IDF” words (where accidental matches are likely)
  – Use expanded set of n-grams to focus search
• The BLAST program:
  – Widely used,
  – Fast implementation,
  – Supports asking multiple queries against a database at once...
  – Can one use it to find soft matches of protein names (from a dictionary) in text?
![Page 46: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/46.jpg)
Basic idea:
• Protein database
• Query strings
• Proposed alignment (query->database)
• Query algorithm: BLAST

• Biomedical paper
• Protein name dictionary
• Extracted protein name (dict. entry->text)
• IE system: dictionaries+BLAST (optimized for this problem)
![Page 47: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/47.jpg)
1) Mapping text to DNA sequences (Q: what sort of char similarity is this?)
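One plausible way to feed text to a nucleotide-alignment tool is a fixed-width code that maps every character to its own short nucleotide string. The sketch below assumes four nucleotides per character, derived from the character's 8-bit code; this code table is invented for illustration and is not the table used in the paper.

```python
NUCLEOTIDES = "ACGT"


def char_to_dna(ch: str) -> str:
    """Encode one 8-bit character as four nucleotides (two bits each)."""
    code = ord(ch)
    if code >= 256:
        raise ValueError(f"character {ch!r} is outside the 8-bit range")
    return "".join(NUCLEOTIDES[(code >> shift) & 3] for shift in (6, 4, 2, 0))


def text_to_dna(text: str) -> str:
    """Concatenate the per-character codes into one pseudo-DNA sequence."""
    return "".join(char_to_dna(ch) for ch in text)
```

Under a code like this, two characters look similar to the aligner when their codes share nucleotides, which has nothing to do with how similar the characters are linguistically; that is one way to read the question on the slide.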
![Page 48: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/48.jpg)
2) Optimizing BLAST
• Split protein-name database into several parts (for short, medium-length, long protein names)
• Require space chars before and after “short” protein names.
• Manually search (grid search?) for better settings for certain key parameters for each protein-name subdatabase
  – With what data?
• Evaluate on one review article, 1162 protein names
  – inter-annotator agreement not great (70-85%)
![Page 49: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/49.jpg)
2) Optimizing BLAST
![Page 50: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/50.jpg)
2) Optimizing BLAST
![Page 51: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/51.jpg)
Results
![Page 52: Distance functions and IE – 4?](https://reader037.vdocuments.mx/reader037/viewer/2022110212/56813fa3550346895daa9084/html5/thumbnails/52.jpg)
Results
Overall: precision 71.1%, recall 78.8% (opt)