IE by Candidate Classification: Jansche & Abney, Cohen et al.

William Cohen

1/19/03



SCAN: Search & Summarization for Audio Collections (AT&T Labs)

Why IE from personal voicemail

• Unified interface for email, voicemail, fax, … requires uniform headers:
  – Sender, Time, Subject, …
  – Headers are key for uniform interface
• Independently, voicemail access is slow:
  – useful to have fast access to important parts of message (contact number, caller)

Why else to read this paper

• Robust information extraction
  – Generalizing from manual transcripts (i.e., human-produced written versions of voicemail) to automatic (ASR) transcripts
• Place of hand-coding vs. learning in information extraction
  – How to break up the task
  – Where and how to use engineering

Propose-and-filter pipeline (diagram): Candidate Generator → candidate phrase → Learned filter → extracted phrase
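The propose-and-filter idea can be sketched in a few lines. This is only an illustration: `propose_and_filter`, `capitalized_words`, and `not_message_initial` are hypothetical names, and the hand-written rule stands in for what would be a learned filter.

```python
from typing import Callable, Iterable, List


def propose_and_filter(text: str,
                       generate: Callable[[str], Iterable[str]],
                       keep: Callable[[str, str], bool]) -> List[str]:
    """Run a high-recall candidate generator, then keep only accepted candidates."""
    return [cand for cand in generate(text) if keep(text, cand)]


# Hypothetical stand-ins: a crude generator and a crude rule in place of a learned filter.
def capitalized_words(text: str) -> List[str]:
    return [w for w in text.split() if w[:1].isupper()]


def not_message_initial(text: str, cand: str) -> bool:
    return not text.startswith(cand)


extracted = propose_and_filter("Hi there it's Bill calling",
                               capitalized_words, not_message_initial)
print(extracted)
```

The generator is allowed to over-propose ("Hi" and "Bill"); the filter is where precision is recovered.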

Voicemail corpus

• About 10,000 manually transcribed and annotated voice messages.

• 1869 used for evaluation

Observation: caller phrases are short and near the beginning of the message.

Caller-phrase extraction

• Propose start positions i_1, …, i_N

• Use a learned decision tree to pick the best i

• Propose end positions i+j_1, i+j_2, …, i+j_M

• Use a learned decision tree to pick the best j
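A minimal sketch of the two-stage choice above, assuming the decision trees can be treated as scoring functions. `extract_phrase`, `toy_start_score`, and `toy_end_score` are hypothetical; the toy scorers are hand-written rules, not the learned trees from the paper.

```python
def extract_phrase(tokens, starts, lengths, score_start, score_end):
    """Pick the best start i, then the best end i + j (end index is exclusive)."""
    i = max(starts, key=lambda s: score_start(tokens, s))
    ends = [i + j for j in lengths if i + j <= len(tokens)]
    if not ends:  # no proposed length fits; fall back to the end of the message
        return i, len(tokens)
    end = max(ends, key=lambda e: score_end(tokens, i, e))
    return i, end


# Toy scorers standing in for the two learned decision trees.
def toy_start_score(tokens, i):
    return 1.0 if tokens[i].lower() in {"it's", "this"} else 0.0


def toy_end_score(tokens, i, e):
    return 1.0 if e < len(tokens) and tokens[e] == "and" else 0.0


tokens = "Hi there it's Bill and I'm calling".split()
span = extract_phrase(tokens, [0, 1, 2, 3], [1, 2, 3], toy_start_score, toy_end_score)
print(tokens[span[0]:span[1]])
```

Choosing the start first and the end second keeps each decision small: N start candidates, then M end candidates, rather than N×M spans.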

Baseline (HZP, Col log-linear)

• IE as tagging:

• Pr(tag_i | word_i, word_{i-1}, …, word_{i+1}, …, tag_{i-1}, …) estimated via a MAXENT model

• Beam search to find the best tag sequence given the word sequence

• Features of the model are words, word pairs, word pair + tag trigrams, …

Hi there it’s Bill and…

OUT OUT IN IN OUT…
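The tagging baseline's search step can be sketched as a generic beam search. The scoring function here is a hand-made toy, not a trained MAXENT model; `beam_search` and `toy_log_prob` are hypothetical names, and the toy scorer ignores the tag history that a real model would condition on.

```python
import math


def beam_search(words, tags, log_prob, beam_width=3):
    """Keep the beam_width best partial tag sequences after each word."""
    beam = [(0.0, [])]  # (cumulative log-probability, tags so far)
    for i in range(len(words)):
        expanded = [(score + log_prob(words, i, history, tag), history + [tag])
                    for score, history in beam
                    for tag in tags]
        expanded.sort(key=lambda item: item[0], reverse=True)
        beam = expanded[:beam_width]
    return beam[0][1]


def toy_log_prob(words, i, history, tag):
    """Toy stand-in for Pr(tag_i | words, previous tags)."""
    p_in = 0.9 if words[i] in {"it's", "Bill"} else 0.1
    return math.log(p_in if tag == "IN" else 1.0 - p_in)


words = ["Hi", "there", "it's", "Bill", "and"]
best_tags = beam_search(words, ["IN", "OUT"], toy_log_prob)
print(best_tags)
```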

Performance

Observation: caller names are really short and near the beginning of the message.

What about ASR transcripts?

Extracting phone numbers

• Phase 1: a hand-coded grammar proposes candidate phone numbers
  – Not too hard, due to limited vocabulary
  – Optimize recall (96%), not precision (30%)
• Phase 2: a learned decision tree filters candidates
  – Use length, position, context, …
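A rough sketch of the two phases, under strong simplifying assumptions: the "grammar" here is just maximal runs of spoken digit words, and the filter is a hand-written rule on length and left context in place of the learned decision tree. `DIGIT_WORDS`, `propose_digit_runs`, and `keep_candidate` are hypothetical; a real grammar would need to handle more spoken forms.

```python
DIGIT_WORDS = {"zero", "oh", "one", "two", "three", "four", "five",
               "six", "seven", "eight", "nine"}


def propose_digit_runs(tokens, min_len=4):
    """Phase 1: propose every maximal run of at least min_len digit words."""
    runs, start = [], None
    for i, tok in enumerate(tokens + ["<end>"]):  # sentinel closes a trailing run
        if tok in DIGIT_WORDS:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


def keep_candidate(tokens, span):
    """Phase 2 stand-in: accept 7- or 10-digit runs not introduced by 'extension'."""
    start, end = span
    after_extension = start > 0 and tokens[start - 1] == "extension"
    return (end - start) in (7, 10) and not after_extension


tokens = ("call me back at five five five one two one two "
          "extension four four one two").split()
candidates = propose_digit_runs(tokens)
phone_numbers = [c for c in candidates if keep_candidate(tokens, c)]
print(candidates, phone_numbers)
```

The generator deliberately over-proposes (both digit runs), which mirrors the high-recall, low-precision setting described on the slide.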

Results

Their Conclusions

Cohen, Wang, Murphy

• Another paper with a similar flavor:
  – IE for a particular task
  – IE using a similar propose-and-filter approach
  – When and how do you engineer, and when and how do you use learning?

Background – subcellular localization

The most important tool for studying protein localizations is fluorescence microscopy.

New image processing techniques can automatically produce a quantitative description of subcellular localization.

Two golgi proteins that cannot be distinguished by eye

Entrez: “a new 376kD Golgi complex outer membrane protein”
SWISSProt: “INTEGRAL MEMBRANE PROTEIN. GOLGI MEMBRANE”

Entrez: “GPP130; type II Golgi membrane protein”
SWISSProt: nothing

Overview of SLIF: image analysis of existing images from online publications

Diagram components: On-line paper, Figure finder, Figure, Image, Panel Splitter, Panel Classifier, Fl. Micr. Panel, Scale Finder, Micr. Scale

End result: collection of on-line fluorescence microscope images, with quantitative description of localization.

E.g.: we know this figure section shows a tubulin-like protein…

…but not which one!

Background – overview of SLIF2.0

Diagram components, caption side: Caption, Image Pointer Finder, Scope Finder, Name Finder, Panel Label Matcher

Diagram components, image side: Image, Panel Splitter, Panel Classifier, Scale Finder, Fl. Micr. Panel

Outputs: Micr. Scale, Cell Type, Protein Name

Figure 1. (A) Single confocal optical section of BY-2 cells expressing U2B″-GFP, double labeled with GFP (left panel) and autoantibody against p80 coilin (right panel). Three nuclei are shown, and the bright GFP spots colocalize with bright foci of anti-coilin labeling. There is some labeling of the cytoplasm by anti-p80 coilin. (B) Single confocal optical section of BY-2 cells expressing U2B″-GFP, double labeled with GFP (left panel) and 4G3 antibody (right panel). Three nuclei are shown. Most coiled bodies are in the nucleoplasm, but occasionally are seen in the nucleolus (arrows). All coiled bodies that contain U2B″ also express the U2B″-GFP fusion. Bars, 5 μm. Movement of Coiled Bodies Vol. 10, July 1999 2299

An old issue: entity recognition

BY-2

U2B″-GFP

p80-coilin

anti-p80 coilin

A new issue: “caption understanding” - where are the entities in the image?

Why caption understanding?

- Location proteomics.
- Remove extraneous junk from caption text for “ordinary” IE, NLP, indexing, …
- Better text- or content-based image retrieval for scientific images.

Identify image pointers: substrings that refer to parts of the image

Will focus on text issues, not matching

Classify image pointers as citation-style or bullet-style.

Compute scopes:
– The scope of a bullet-style image pointer is all words after it, up to the next “bullet” (in the example caption, the scope of (A) ends where (B) begins).
– The scope of a citation-style image pointer is some set of words near it (heuristically determined by separating words and punctuation).
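The bullet-style rule is simple enough to sketch directly. This assumes, for illustration only, that every parenthesized capital letter such as "(A)" is a bullet-style pointer; in the actual system that decision is made by the pointer classifier. `BULLET` and `bullet_scopes` are hypothetical names, and the shortened caption is made up for the example.

```python
import re

# Assumed pattern for bullet-style pointers, e.g. "(A)" or "(B)".
BULLET = re.compile(r"\(([A-Z])\)")


def bullet_scopes(caption):
    """Map each bullet label to the text between it and the next bullet."""
    matches = list(BULLET.finditer(caption))
    scopes = {}
    for k, m in enumerate(matches):
        end = matches[k + 1].start() if k + 1 < len(matches) else len(caption)
        scopes[m.group(1)] = caption[m.end():end].strip()
    return scopes


caption = "Figure 1. (A) Cells expressing GFP. (B) Cells labeled with 4G3 antibody."
scopes = bullet_scopes(caption)
print(scopes)
```

Citation-style scopes are not covered here, since the slide only says they are determined heuristically.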

Image pointers share all entities in their “scope”.

Entities are assigned to panels based on matches of image-pointers to annotations in panels.

Outline

• Details on caption understanding
  – Baseline hand-coded methods
  – Learning methods
  – Experimental results

Task

• Identify image pointers in captions.
• Classify image pointers:
  – bullet-style, citation-style, or NP-style
    • E.g., “Panels A and C show the …”
• Won’t talk about scoping
• Will focus first on extracting image pointers
  – i.e., binary classification of substrings: “is this an image pointer?”
• Data: 100 captions from 100 papers, about 600 positive examples.

Baseline methods

• Labeled 100 sample figure captions.

• HANDCODE-1: patterns like (A), (B-E), (c and d), etc.

• HANDCODE-2: all short parenthesized expressions & patterns like “panel A” or “in B-C”

            HC-1   HC-2
Precision   98.5   74.5
Recall      45.6   98.0
F1          62.3   84.6
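A HANDCODE-1-style matcher might look like the regular expression below. The exact patterns used in the paper are not given on the slide, so `HANDCODE1` is a guess covering only the three example shapes (A), (B-E), and (c and d).

```python
import re

# Assumed pattern: one letter, optionally followed by "-", "," or "and" and a second letter.
HANDCODE1 = re.compile(r"\(([A-Za-z])(?:\s*(?:-|,|and)\s*([A-Za-z]))?\)")

text = "(A) foo (B-E) bar (c and d) baz (arrows) (left panel)"
pointers = [m.group(0) for m in HANDCODE1.finditer(text)]
print(pointers)
```

A narrow pattern like this trades recall for precision: it skips "(arrows)" and "(left panel)", but it would also miss pointers written as "panel A".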

Some plausible tricks (like filtering HC-2) don’t help much…

            HC-1   HC-2f  HC-2
Precision   98.5   89.0   74.5
Recall      45.6   54.8   98.0
F1          62.3   67.8   84.6
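The F1 row is the harmonic mean of precision and recall, which can be checked against the table. The helper name `f1` and the dictionary layout are just for this example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
baselines = {"HC-1": (98.5, 45.6), "HC-2f": (89.0, 54.8), "HC-2": (74.5, 98.0)}
f1_scores = {name: round(f1(p, r), 1) for name, (p, r) in baselines.items()}
print(f1_scores)
```

Because the harmonic mean is dominated by the smaller value, HC-1's high precision cannot make up for its low recall.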

How hard is the problem?

Some citation-style image pointers

NP-style

non-image pointers

The difficulty of the task suggests using a learning approach

Another use of propose-and-filter

Propose-and-filter pipeline (diagram): Candidate Generator → candidate phrase → Learned filter → extracted phrase

Note that HANDCODE-2 (recall 98%) is a natural candidate generator.

We’ll start with “off the shelf” features…

Learning methods: boosting

Generalized version of AdaBoost (Schapire & Singer, 99)

Allows “real-valued” predictions for each “base hypothesis”, including a value of zero (abstaining).
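To see what a prediction of zero does, here is a sketch of the standard confidence-rated weight update, in which example i is reweighted by exp(-y_i · h(x_i)) and then normalized. `reweight` is a hypothetical helper, and the four-example data set is invented for illustration.

```python
import math


def reweight(weights, labels, predictions):
    """One boosting round: scale each weight by exp(-y * h(x)), then normalize."""
    raw = [w * math.exp(-y * h) for w, y, h in zip(weights, labels, predictions)]
    total = sum(raw)
    return [r / total for r in raw]


# Labels are +1/-1; a prediction of 0.0 means the base hypothesis abstains.
weights = [0.25, 0.25, 0.25, 0.25]
labels = [+1, -1, +1, -1]
predictions = [1.0, 1.0, 0.0, 0.0]
new_weights = reweight(weights, labels, predictions)
print(new_weights)
```

Examples on which the hypothesis abstains keep their relative weight, while the misclassified example gains weight and the correctly classified one loses it.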

Learning methods: boosting rules

Weak learner: to find weak hypothesis t:

1. Split Data into Growing and Pruning sets

2. Let Rt be an empty conjunction

3. Greedily add conditions to Rt guided by Growing set:

4. Greedily remove conditions from Rt guided by Pruning set:

5. Convert to weak hypothesis:

where

Constraint: W+ > W-

and the caret (^) denotes smoothing
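The formulas on this slide are not preserved in the text. As a hedged sketch only, the quantities below follow the usual published description of SLIPPER: W+ and W- are the total weights of positive and negative examples covered by the rule, the growing step favors large sqrt(W+) - sqrt(W-), and the rule's confidence is a smoothed half log-ratio. The smoothing constant 1/(2n) and all function names are assumptions here and may differ from what the slide showed.

```python
import math


def growth_objective(w_plus, w_minus):
    """Quantity to maximize while adding conditions (assumed form)."""
    return math.sqrt(w_plus) - math.sqrt(w_minus)


def rule_confidence(w_plus, w_minus, n):
    """Smoothed confidence of a rule; eps = 1/(2n) is an assumed smoothing term."""
    eps = 1.0 / (2 * n)
    return 0.5 * math.log((w_plus + eps) / (w_minus + eps))


def weak_hypothesis(covers, confidence):
    """Predict `confidence` on covered examples and abstain (0.0) elsewhere."""
    return lambda x: confidence if covers(x) else 0.0


conf = rule_confidence(w_plus=0.4, w_minus=0.1, n=100)
h = weak_hypothesis(lambda cand: cand.startswith("("), conf)
print(conf, h("(A)"), h("arrows"))
```

With the constraint W+ > W-, the confidence is positive, so a rule only ever votes for the positive class and abstains otherwise.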

SLIPPER also produces fairly compact rule sets.

Learning methods: BWI

• Boosted wrapper induction (BWI) learns to extract substrings from a document.
  – Learns three concepts: firstToken(x), lastToken(x), substringLength(k)
  – Conditions are tests on tokens before/after x
    • E.g., tok_{i-2} = ‘from’, isNumber(tok_{i+1})
  – SLIPPER weak learner, no pruning.
  – Greedy search extends window size by at most L in each iteration, uses lookahead L, no fixed limit on window size.
• Good results in (Freitag and Kushmerick, 2000)

Learning methods: ABWI

• “Almost boosted wrapper induction” (ABWI) learns to extract substrings:
  – Learns to filter candidate substrings (HANDCODE-2)
  – Conditions are the same tests on tokens near x:
    • E.g., tok_{i-2} = ‘from’, isNumber(tok_{i+1})
  – SLIPPER weak learner, no pruning.
  – Greedy search extends window size by any amount, uses no lookahead, has a fixed limit on window size.
• Optimal window sizes for this problem seem to be small…

Learning methods

• Features: W tokens before/after, all tokens inside.
• Learner: 100 rounds of boosting conjunctions of feature tests
  – Inspired by BWI (Freitag & Kushmerick)
  – Implemented with the SLIPPER learner

            HC-1   HC-2f  HC-2   ABWI (W=2)
Precision   98.5   89.0   74.5   89.7
Recall      45.6   54.8   98.0   91.0
F1          62.3   67.8   84.6   90.3
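The window features can be sketched as follows. The feature naming scheme (`tok[-1]=…`, `inside=…`) and the function `window_features` are hypothetical; the slides only say that the W tokens before and after the candidate and all tokens inside it are used.

```python
def window_features(tokens, start, end, w=2):
    """Features for candidate tokens[start:end]: W tokens on each side plus inside tokens."""
    feats = set()
    for k in range(1, w + 1):
        if start - k >= 0:
            feats.add(f"tok[-{k}]={tokens[start - k]}")
        if end - 1 + k < len(tokens):
            feats.add(f"tok[+{k}]={tokens[end - 1 + k]}")
    for tok in tokens[start:end]:
        feats.add(f"inside={tok}")
    return feats


tokens = ["shown", "in", "(", "A", ")", "and", "B"]
features = window_features(tokens, start=2, end=5, w=2)
print(sorted(features))
```

A boosted rule is then a conjunction of tests on such features, for example "tok[-1]=in and inside=A".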

Other learning methods

            HC-1   HC-2f  HC-2   ABWI (W=2)  ABWI Slipper  ABWI Ripper  ABWI SVM1  ABWI SVM2
Precision   98.5   89.0   74.5   89.7        96.1          88.1         69.0       100.0
Recall      45.6   54.8   98.0   91.0        85.2          87.1         78.0       75.2
F1          62.3   67.8   84.6   90.3        90.3          87.6         73.2       85.6

All learning methods are competitive with hand-coded methods

Additional features

• Check if candidate contains certain “special” substrings:
  – Matches color name: labeled color
  – Matches HANDCODE-1 pattern: handcode1
  – Matches “mm”, “mg”, etc.: measure
  – Matches 1980, …, 2003, “et al”: citation
  – Matches “top”, “left”, etc.: place
• Added “sentence boundary” substrings:
  – Feature is “distance to boundary”.
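A few of these detectors can be sketched with small word lists and regular expressions. The lists `COLORS` and `PLACES` and the function `special_features` are assumptions for illustration; the handcode1 and sentence-boundary features are left out to keep the example short.

```python
import re

COLORS = {"red", "green", "blue", "yellow"}   # assumed color list
PLACES = {"top", "bottom", "left", "right"}   # assumed place list


def special_features(candidate):
    """Return the set of 'special substring' labels that fire on a candidate."""
    text = candidate.lower()
    words = set(re.findall(r"[a-z]+", text))
    feats = set()
    if words & COLORS:
        feats.add("color")
    if re.search(r"\b(mm|mg)\b", text):
        feats.add("measure")
    years = re.findall(r"\b(?:19|20)\d\d\b", text)
    if "et al" in text or any(1980 <= int(y) <= 2003 for y in years):
        feats.add("citation")
    if words & PLACES:
        feats.add("place")
    return feats


examples = ["(Smith et al., 1999)", "(left panel)", "(5 mm)", "(green)"]
labels = [special_features(e) for e in examples]
print(labels)
```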

Learning with expanded feature set

            HC-1   HC-2f  HC-2   ABWI (W=2)  ABWI + NA
Precision   98.5   89.0   74.5   89.7        85.9
Recall      45.6   54.8   98.0   91.0        92.2
F1          62.3   67.8   84.6   90.3        89.0

Many new features are inversely correlated with class (e.g. citation), but ABWI looks only for positively-correlated patterns.

            HC-1   HC-2f  HC-2   ABWI (W=2)  ABWI + NA  SABWI + NA
Precision   98.5   89.0   74.5   89.7        85.9       88.6
Recall      45.6   54.8   98.0   91.0        92.2       93.8
F1          62.3   67.8   84.6   90.3        89.0       91.1

SABWI is a symmetric version of ABWI: can use rules and/or conditions negatively or positively correlated with the class

Task

• Identify image pointers in captions.

• Classify image pointers:
  – bullet-style, citation-style, or NP-style
• Combine these to get a four-class problem:
  – bullet-style, citation-style, NP-style, or other
  – no hand-coded baseline methods

Four-class extraction results

Error rate by method and window size W:

Method     W=2    W=3    W=5
ABWI       24.6   27.5   26.7
ABWI+NA    26.7   22.2   26.7
SABWI+NA   24.2   18.2   22.6

Further improvement is probable with additional labeled data