![Page 1: Bootstrapping Information Extraction with Unlabeled Data Rayid Ghani Accenture Technology Labs Rosie Jones Carnegie Mellon University & Overture (With](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649f295503460f94c42f4f/html5/thumbnails/1.jpg)
Bootstrapping Information Extraction with Unlabeled Data
Rayid Ghani, Accenture Technology Labs
Rosie Jones, Carnegie Mellon University & Overture
(With contributions from Tom Mitchell and Ellen Riloff)

What is Information Extraction?
- Analyze unrestricted text in order to extract pre-specified types of events, entities, or relationships
- Recent commercial applications
  - Database of job postings extracted from corporate web pages (flipdog.com)
  - Extracting specific fields from resumes to populate HR databases (mohomine.com)
  - Information integration (fetch.com)
  - Shopping portals

IE Approaches
- Hand-constructed rules
- Supervised learning
  - Still costly to train and port to new domains
    - 3-6 months to port to a new domain (Cardie 98)
    - 20,000 words to learn named entity extraction (Seymore et al 99)
    - 7000 labeled examples to learn MUC extraction rules (Soderland 99)
- Semi-supervised learning

Semi-Supervised Approaches
- Several algorithms have been proposed for different tasks (semantic tagging, text categorization) and tested on different corpora
  - Expectation-Maximization, Co-Training, CoBoost, Meta-Bootstrapping, Co-EM, etc.
- Goal: systematically analyze and test
  - The assumptions underlying the algorithms
  - The effectiveness of the algorithms on a common set of problems and a common corpus

Tasks
- Extract noun phrases (NPs) belonging to the following semantic classes
  - Locations
  - Organizations
  - People

Aren’t you missing the obvious?
- Acquire lists of proper nouns
  - Locations: countries, states, cities
  - Organizations: online database
  - People: names
- Named entity extraction?
- But not all instances are proper nouns
  - *by the river*, *customer*, *client*

Use context to disambiguate
- A lot of NPs are unambiguous
  - “The corporation”
- A lot of contexts are also unambiguous
  - Subsidiary of <NP>
- But as always, there are exceptions, and a LOT of them in this case
  - customer, John Hancock, Washington

Bootstrapping Approaches
- Utilize redundancy in text
  - Noun phrases: New York, China, place we met last time
  - Contexts: Located in <X>, Traveled to <X>
- Learn two models
  - Use NPs to label contexts
  - Use contexts to label NPs

Interesting Dimensions for Bootstrapping Algorithms
- Incremental vs. iterative
- Symmetric vs. asymmetric
- Probabilistic vs. heuristic

Algorithms for Bootstrapping
- Meta-Bootstrapping (Riloff & Jones, 1999): incremental, asymmetric, heuristic
- Co-Training (Blum & Mitchell, 1999): incremental, symmetric, probabilistic(?)
- Co-EM (Nigam & Ghani, 2000): iterative, symmetric, probabilistic
- Baselines
  - Seed-labeling: label all NPs that match the seeds
  - Head-labeling: label all NPs whose head matches the seeds
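
The two baselines can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the head of an NP is approximated by its last token, which is an assumption (the slides do not say how heads are found, and multi-word seeds would need extra handling).

```python
def head_of(np_text):
    """Crude head finder: last token of the NP (an assumption, not a parser)."""
    tokens = np_text.lower().split()
    return tokens[-1] if tokens else ""


def seed_label(noun_phrases, seeds):
    """Seed-labeling: label an NP only if the whole NP matches a seed."""
    seed_set = {s.lower() for s in seeds}
    return {np_ for np_ in noun_phrases if np_.lower() in seed_set}


def head_label(noun_phrases, seeds):
    """Head-labeling: label an NP if its (approximate) head matches a seed."""
    seed_set = {s.lower() for s in seeds}
    return {np_ for np_ in noun_phrases if head_of(np_) in seed_set}


# Toy data, not taken from the corpus.
nps = ["china", "the company", "a large company", "xerox", "the river"]
org_seeds = ["company", "xerox"]

print(sorted(seed_label(nps, org_seeds)))
print(sorted(head_label(nps, org_seeds)))
```

On this toy input, seed-labeling picks up only the exact match, while head-labeling also covers the longer NPs headed by "company".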

Data Set
- ~4200 corporate web pages (WebKB project at CMU)
- Test data marked up manually by labeling every NP as one or more of the following semantic categories: location, organization, person, none
- Preprocessed (parsed) to generate NPs and extraction patterns using AutoSlog (Riloff, 1996)

Seeds
- Location: australia, canada, china, england, france, germany, united states, switzerland, mexico, japan
- People: customer, customers, subscriber, people, users, shareholders, individuals, clients, leader, director
- Organizations: inc, praxair, company, companies, marine group, xerox, arco, timberlands, puretec, halter, marine group, ravonier

Intuition Behind Bootstrapping
- Noun phrases: the dog, australia, france, the canary islands
- Contexts: <X> ran away, travelled to <X>, <X> is beautiful

Co-Training (Blum & Mitchell, 99)
Incremental, symmetric, probabilistic
1. Initialize with positive and negative NP seeds
2. Use NPs to label all contexts
3. Add the n top-scoring contexts for both the positive and negative class
4. Use the new contexts to label all NPs
5. Add the n top-scoring NPs for both the positive and negative class
6. Loop
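
The loop above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation behind these slides: the corpus is a list of (NP, context) pairs, and an item is scored by the share of its co-occurring partners already labeled positive or negative, which stands in for the classifier a real co-training setup would use. The names `score`, `add_top_n`, and `co_train` are hypothetical.

```python
from collections import defaultdict


def score(item_to_partners, labeled_pos, labeled_neg):
    """For each item, the share of its partners labeled positive and negative."""
    scores = {}
    for item, partners in item_to_partners.items():
        pos = sum(1 for p in partners if p in labeled_pos)
        neg = sum(1 for p in partners if p in labeled_neg)
        scores[item] = (pos / len(partners), neg / len(partners))
    return scores


def add_top_n(scores, pos_set, neg_set, n):
    """Add the n best unlabeled items to each class (ties broken by name)."""
    unlabeled = [x for x in scores if x not in pos_set and x not in neg_set]
    best_pos = sorted(unlabeled, key=lambda x: (-scores[x][0], x))[:n]
    best_neg = sorted(unlabeled, key=lambda x: (-scores[x][1], x))[:n]
    # Only add an item when the evidence for that class strictly dominates.
    pos_set.update(x for x in best_pos if scores[x][0] > scores[x][1])
    neg_set.update(x for x in best_neg if scores[x][1] > scores[x][0])


def co_train(pairs, pos_np_seeds, neg_np_seeds, n=1, rounds=3):
    np_to_ctx, ctx_to_np = defaultdict(list), defaultdict(list)
    for np_, ctx in pairs:
        np_to_ctx[np_].append(ctx)
        ctx_to_np[ctx].append(np_)
    pos_np, neg_np = set(pos_np_seeds), set(neg_np_seeds)  # step 1
    pos_ctx, neg_ctx = set(), set()
    for _ in range(rounds):                                 # step 6
        add_top_n(score(ctx_to_np, pos_np, neg_np), pos_ctx, neg_ctx, n)  # steps 2-3
        add_top_n(score(np_to_ctx, pos_ctx, neg_ctx), pos_np, neg_np, n)  # steps 4-5
    return pos_np, neg_np, pos_ctx, neg_ctx


# Toy corpus for the "location" class, not real data.
pairs = [
    ("australia", "travelled to <X>"),
    ("france", "travelled to <X>"),
    ("france", "<X> is beautiful"),
    ("the canary islands", "travelled to <X>"),
    ("the canary islands", "<X> is beautiful"),
    ("the dog", "<X> ran away"),
    ("the cat", "<X> ran away"),
]
pos_np, neg_np, pos_ctx, neg_ctx = co_train(pairs, {"australia"}, {"the dog"})
print(sorted(pos_np), sorted(neg_np))
```

Because only n items per class are added in each round, early mistakes are hard decisions that later rounds build on, which is the "incremental" property named on the slide.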
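
Co-EM (Nigam & Ghani, 2000)
- Iterative, symmetric, probabilistic
- Similar to Co-Training
- Probabilistically labels and adds all NPs and contexts to the labeled set

$$P(\text{class}_j \mid \text{context}_i) = \sum_{k} P(\text{class}_j \mid NP_k)\, P(NP_k \mid \text{context}_i)$$

$$P(\text{class}_j \mid NP_i) = \sum_{k} P(\text{class}_j \mid \text{context}_k)\, P(\text{context}_k \mid NP_i)$$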
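
The two updates on this slide can be sketched for a single class as below. Several things are assumptions for illustration: each update is read as a sum over co-occurring partners, the conditional probabilities P(NP | context) and P(context | NP) are estimated from co-occurrence counts, seed NPs keep their initial probability in every iteration, and the function name `co_em` is hypothetical.

```python
from collections import Counter, defaultdict


def co_em(pairs, seed_probs, iterations=5):
    """pairs: (NP, context) occurrences; seed_probs: {NP: P(class | NP)} for seeds."""
    count = Counter(pairs)
    np_total, ctx_total = Counter(), Counter()
    for (np_, ctx), c in count.items():
        np_total[np_] += c
        ctx_total[ctx] += c

    # Non-seed NPs start at 0.0 (assumed initialisation).
    p_np = {np_: seed_probs.get(np_, 0.0) for np_ in np_total}
    p_ctx = {}
    for _ in range(iterations):
        # P(class | context_i) = sum_k P(class | NP_k) * P(NP_k | context_i)
        p_ctx = defaultdict(float)
        for (np_, ctx), c in count.items():
            p_ctx[ctx] += p_np[np_] * c / ctx_total[ctx]
        # P(class | NP_i) = sum_k P(class | context_k) * P(context_k | NP_i)
        new_p_np = defaultdict(float)
        for (np_, ctx), c in count.items():
            new_p_np[np_] += p_ctx[ctx] * c / np_total[np_]
        # Seeds stay clamped; every other NP takes its new soft label.
        p_np = {np_: seed_probs.get(np_, new_p_np[np_]) for np_ in np_total}
    return p_np, dict(p_ctx)


# Toy corpus for the "location" class, not real data.
pairs = [
    ("australia", "travelled to <X>"),
    ("france", "travelled to <X>"),
    ("france", "<X> is beautiful"),
    ("the canary islands", "travelled to <X>"),
    ("the canary islands", "<X> is beautiful"),
    ("the dog", "<X> ran away"),
    ("the cat", "<X> ran away"),
]
p_np, p_ctx = co_em(pairs, {"australia": 1.0, "the dog": 0.0})
for np_, p in sorted(p_np.items()):
    print(f"{np_}: {p:.3f}")
```

Unlike the incremental loop of Co-Training, every NP and context receives a soft label in every iteration, so no hard decision is ever locked in.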

Meta-Bootstrapping (Riloff & Jones, 99)
- Incremental, asymmetric, heuristic
- Two-level process
- NPs are used to score contexts according to co-occurrence frequency and diversity
- After the first level, all contexts are discarded and only the best NPs are retained
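
A rough sketch of the context-scoring step follows. The slide does not give the scoring formula, so the RlogF-style score used here, (F / N) * log2(F) with F the number of known category members a context extracts and N the number of NPs it extracts, is an assumption, as are the function names. The second level (discarding contexts and keeping only the best NPs) is not shown.

```python
import math


def rlogf(extracted_nps, category_nps):
    """Assumed RlogF-style score: (F / N) * log2(F)."""
    f = len(extracted_nps & category_nps)  # known category members extracted
    n = len(extracted_nps)                 # all NPs extracted by the context
    return (f / n) * math.log2(f) if f > 0 else float("-inf")


def best_context(ctx_to_nps, category_nps, used=()):
    """Pick the highest-scoring context not used yet (ties broken by name)."""
    candidates = {
        ctx: rlogf(nps, category_nps)
        for ctx, nps in ctx_to_nps.items()
        if ctx not in used
    }
    return max(candidates, key=lambda ctx: (candidates[ctx], ctx))


# Toy data for the "location" class, not real data.
ctx_to_nps = {
    "travelled to <X>": {"australia", "france", "the canary islands"},
    "<X> is beautiful": {"france", "the canary islands"},
    "<X> ran away": {"the dog", "the cat"},
}
category = {"australia", "france"}

chosen = best_context(ctx_to_nps, category)
category |= ctx_to_nps[chosen]  # the chosen context's NPs join the category
print(chosen, sorted(category))
```

The asymmetry on the slide shows up here: contexts are only a means of finding NPs, and only the NP set is carried forward.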

Common Assumptions
- Seeds
  - Seed density in the corpus
  - Head-labeling accuracy
- Syntactic-semantic agreement
- Redundancy
  - Feature sets are redundant and sufficient
  - Labeling disagreement

Feature Set Ambiguity
- Feature sets: NPs and contexts
- If the feature sets were redundantly sufficient, either of them alone would be enough to correctly classify the instance
- Calculate the ambiguity for each feature set
  - Washington, Went to <X>, Visit <X>
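
One way to compute such an ambiguity figure is sketched below. It assumes that a feature value (an NP or a context) counts as ambiguous when it appears with more than one class in the labeled data; the function name and the toy instances are hypothetical.

```python
from collections import defaultdict


def ambiguity(labeled, view):
    """Share of distinct NPs (view="np") or contexts seen with more than one class."""
    classes = defaultdict(set)
    for np_, ctx, label in labeled:
        classes[np_ if view == "np" else ctx].add(label)
    ambiguous = sum(1 for seen in classes.values() if len(seen) > 1)
    return ambiguous / len(classes)


# Toy labeled instances: (NP, context, class). Not real data.
labeled = [
    ("washington", "went to <X>", "location"),
    ("washington", "<X> said", "person"),
    ("france", "went to <X>", "location"),
    ("the board", "went to <X>", "organization"),
    ("xerox", "subsidiary of <X>", "organization"),
]

print(ambiguity(labeled, "np"))
print(ambiguity(labeled, "context"))
```

Here "washington" is the only ambiguous NP, while "went to <X>" is ambiguous as a context because it occurs with both a location and an organization.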
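
NP Ambiguity
2% of NPs are ambiguous

| Ambiguity type | Class(es) | Number of NPs |
|---|---|---|
| 1 | None | 3574 |
| 1 | Location | 114 |
| 1 | Organization | 451 |
| 1 | Person | 189 |
| 2 | Location, None | 6 |
| 2 | Organization, None | 31 |
| 2 | Person, None | 25 |
| 2 | Loc, Org | 6 |
| 2 | Org, Person | 13 |
| 3 | Loc, Org, None | 1 |
| 3 | Org, Person, None | 3 |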
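
Context Ambiguity
36% of contexts are ambiguous

| Ambiguity type | Class(es) | Number of contexts |
|---|---|---|
| 1 | None | 1068 |
| 1 | Location | 25 |
| 1 | Organization | 98 |
| 1 | Person | 59 |
| 2 | Location, None | 51 |
| 2 | Organization, None | 271 |
| 2 | Person, None | 206 |
| 2 | Loc, Org | 5 |
| 2 | Org, Person | 50 |
| 3 | Loc, Org, None | 18 |
| 3 | Org, Person, None | 83 |
| 4 | Loc, Org, Per, None | 6 |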

Labeling Disagreement
- Agreement among human labelers
- Same set of instances but different levels of information
  - NP only
  - Context only
  - NP and context
  - NP, context, and the entire sentence from the corpus

Labeling Disagreement
- 90.5% agreement when NP, context, and sentence are given
- 88.5% when the sentence is not given
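
For reference, a simple percent-agreement figure between two labelers can be computed as below. The slides do not say how agreement was measured, so treating it as the share of instances on which two labelers give the same label is an assumption, and the function name and labels are made up.

```python
def percent_agreement(labels_a, labels_b):
    """Percentage of instances on which two labelers chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Toy labels for four instances, not real annotation data.
labeler_1 = ["location", "person", "none", "organization"]
labeler_2 = ["location", "none", "none", "organization"]
print(percent_agreement(labeler_1, labeler_2))
```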

Results Comparing Bootstrapping Algorithms
- Meta-Bootstrapping, Co-Training, Co-EM
- Locations, organizations, people

[Figures: three result plots, each comparing Co-EM, MetaBoot, and Co-Training]

More Results
- Bootstrapping outperforms both baselines
- Improvement is less pronounced for the “people” class
- Ambiguous classes don’t benefit as much from bootstrapping?

Why does Co-EM work well?
- Co-EM outperforms Meta-Bootstrapping and Co-Training
- Co-EM is probabilistic and does not do hard classifications
- Reflective of the ambiguity among classes

Summary
- Starting with 10 seed words, extract NPs matching specific semantic classes using Meta-Bootstrapping, Co-Training, and Co-EM
- Probabilistic bootstrapping with redundant feature sets is effective, even for ambiguous classes
- Co-EM performs robustly even when the underlying assumptions are violated

Ongoing Work
- Varying initial seed size and type
- Collecting the training corpus automatically (from the Web)
- Incorporating the user in the loop (Active Learning)