self-adjustable bootstrapping for named entity set expansion

July 30th, 2009 Lexical Knowledge from Ngrams 1 Self-adjustable bootstrapping for Named Entity set expansion Sushant Narsale (JHU) Satoshi Sekine (NYU)

Page 1: Self-adjustable bootstrapping for Named Entity set expansion

July 30th, 2009 Lexical Knowledge from Ngrams

Self-adjustable bootstrapping for Named Entity set expansion

Sushant Narsale (JHU)
Satoshi Sekine (NYU)

Page 2: Self-adjustable bootstrapping for Named Entity set expansion

Nail: Set (NE list) Expansion using bootstrapping

Expand Named Entity Sets for 150 Named Entity Categories

ngrams

Self-adjustable bootstrapping

Page 3: Self-adjustable bootstrapping for Named Entity set expansion

Our Task

• Input: Seeds for 150 Named Entity Categories
• Output: More examples like seeds
• Motivation
  – “Creating lists of Named Entities on Web is critical for query analysis, document categorization and ad matching” - Web Scale Distributional Similarity and Entity Set Expansion, Pantel et al.

Page 4: Self-adjustable bootstrapping for Named Entity set expansion

Examples of 3 categories from 150

Awards (1091): AAASS/Orbis Books Prize, Abd-el-Tif prize, Abel Prize, Academy Award, ACM Turing Award, Adalbert Stifter Prize, Adriano Gonzalez Leon Biennial Novel Prize, Aga Khan Prize for Fiction, Agatha Award, Agatha Awards, AIA Gold Medal

Academy (215): Aboriginal Studies, Accounting, Actuarial Science and Statistics, Administration of Justice, Administrative and Policy Studies, African Studies, Africana Studies, American Cultures, American Studies, Anatomy, Anesthesiology, Anthropology

Title (8): Mr., Mr, Mister, Mrs., Mrs, Miss, Ms., Ms

Page 5: Self-adjustable bootstrapping for Named Entity set expansion

150 category Named Entity

Page 6: Self-adjustable bootstrapping for Named Entity set expansion

Bootstrapping

• Get more of similar
  – Set of names (e.g. Presidents)
    • Clinton, Bush, Putin, Chirac
  – They must share something…
    • They share the same context in texts
      • President * said yesterday
      • of President * in
      • President * , the
      • President * , who
    • The contexts may be shared by other Presidents
      • Yeltsin, Zemin, Hussein, Obama

We need a scoring function to score the candidates
We need to set the number of contexts/examples to learn
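The loop described on this slide can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the toy `NGRAMS` data, the `*` placeholder convention, the helper names, and the simple "number of distinct seeds" context score are all assumptions made for the example.

```python
from collections import Counter

# Toy corpus of tokenized n-grams (hypothetical data, for illustration only).
NGRAMS = [
    ("President", "Clinton", "said", "yesterday"),
    ("President", "Bush", "said", "yesterday"),
    ("President", "Yeltsin", "said", "yesterday"),
    ("of", "President", "Putin", "in"),
    ("of", "President", "Obama", "in"),
    ("Mr.", "Smith", "said", "yesterday"),
]


def contexts_of(ngram):
    """Yield (context, filler) pairs by replacing one token with '*'."""
    for i, token in enumerate(ngram):
        yield ngram[:i] + ("*",) + ngram[i + 1:], token


def bootstrap_once(seeds, ngrams, num_contexts=2):
    """One round: seeds -> best contexts -> new candidate names."""
    # 1. For each context, collect the distinct seeds that fill its slot.
    seeds_per_context = {}
    for ngram in ngrams:
        for context, filler in contexts_of(ngram):
            if filler in seeds:
                seeds_per_context.setdefault(context, set()).add(filler)
    # 2. Keep the contexts shared by the most seeds (assumed scoring).
    ranked = sorted(seeds_per_context,
                    key=lambda c: (-len(seeds_per_context[c]), c))
    top = set(ranked[:num_contexts])
    # 3. Non-seed fillers of the kept contexts become candidates.
    candidates = Counter()
    for ngram in ngrams:
        for context, filler in contexts_of(ngram):
            if context in top and filler not in seeds:
                candidates[filler] += 1
    return candidates


candidates = bootstrap_once({"Clinton", "Bush", "Putin"}, NGRAMS)
print(sorted(candidates))  # ['Obama', 'Yeltsin']
```

The two open questions on the slide correspond to the two assumed pieces here: the key used to rank contexts (the scoring function) and `num_contexts` (how many contexts to learn from).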

Page 7: Self-adjustable bootstrapping for Named Entity set expansion

Problem

• Different NE categories need different parameter settings in bootstrapping
  – “Academic” has a small number of strong contexts (Department of … at)
  – “Company” has a large number of weak contexts (… was bankrupted, … hires)
  – “Award” has strong suffix feature (… Award/Prize)
  – “Nationality” has a specific length (1), “Book” has a wide length variation

Page 8: Self-adjustable bootstrapping for Named Entity set expansion

Self-Adjustable bootstrapping

• We need to find the best parameter setting for each category
• Idea: Bootstrapping + Machine Learning Approach

Use 80% of seeds for training (train-data), 20% of seeds to optimize the functions and thresholds (dev-data)
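A minimal sketch of the 80/20 idea, assuming a deterministic shuffle and a caller-supplied scoring function. The names `split_seeds`, `pick_best_setting` and `evaluate` are hypothetical and do not come from the slides.

```python
import random


def split_seeds(seeds, train_fraction=0.8, rng_seed=0):
    """Shuffle the seeds reproducibly and split them into (train, dev)."""
    items = sorted(seeds)                    # fixed order before shuffling
    random.Random(rng_seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


def pick_best_setting(settings, train, dev, evaluate):
    """Return the parameter setting with the highest dev score.

    `evaluate(setting, train, dev)` is a hypothetical callback: it should
    run bootstrapping from `train` with `setting` and return a number that
    says how well the held-out `dev` seeds were retrieved.
    """
    return max(settings, key=lambda s: evaluate(s, train, dev))


train, dev = split_seeds(["seed%d" % i for i in range(10)])
print(len(train), len(dev))  # 8 2
```

Running `pick_best_setting` once per category is what makes the procedure "self-adjustable": each category ends up with its own setting.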

Page 9: Self-adjustable bootstrapping for Named Entity set expansion

Our Approach

• Parameters
  1. Context
     • Formulas to score Contexts and Targets
     • Number of contexts to be used
  2. Suffix/Prefix
     e.g. Suffix=Awards, for award categories
  3. Length
     a bias on lengths of retrieved Entity set
  – Weighted Linear Interpolation of three functions
• Optimization Function: Total Reciprocal Rank
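One way to read "weighted linear interpolation of three functions" is a weighted sum of the context, suffix/prefix and length scores of a candidate. The sketch below assumes exactly that; the function name, argument names and default weights are hypothetical.

```python
def combined_score(context_score, affix_score, length_score,
                   w_context=1.0, w_affix=1.0, w_length=1.0):
    """Weighted sum of the three per-candidate feature scores (assumed form)."""
    return (w_context * context_score
            + w_affix * affix_score
            + w_length * length_score)


def rank_candidates(features, **weights):
    """features maps candidate -> (context, affix, length) scores."""
    return sorted(features,
                  key=lambda c: combined_score(*features[c], **weights),
                  reverse=True)
```

Under this reading, the weights are among the parameters tuned per category on the dev-data.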

Page 11: Self-adjustable bootstrapping for Named Entity set expansion

1. Scoring formulas

• Scoring Targets
  1. Fi / CF
  2. Ft / log(CF)
  3. Ft * log(Fi) / CF
  4. log(Fi) * Ft / CF
  5. log(Fi) * Ft / log(CF)

• Scoring Context
  1. Fi / CF
  2. Ft / log(CF)
  3. Fi * log(Ft) / CF
  4. log(Fi) * Ft / CF
  5. log(Fi) * Ft / log(CF)

Fi = Co-occurrence frequency of targets and the context
Ft = Number of target types co-occurred with the context
CF = Corpus frequency of the context

We observed that different scoring formulas work best for different categories
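The five context-scoring formulas translate directly into code. The sketch below assumes natural logarithms and counts with Fi ≥ 1, Ft ≥ 1 and CF > 1 (so that no logarithm is undefined or zero in a denominator); the slides do not state the log base or how such edge cases are handled, and the names `CONTEXT_FORMULAS`, `score_context` and `rank_contexts` are made up for the example.

```python
import math

# Context-scoring formulas 1-5 as listed on the slide.
# fi: co-occurrence frequency of targets and the context
# ft: number of target types co-occurring with the context
# cf: corpus frequency of the context
CONTEXT_FORMULAS = {
    1: lambda fi, ft, cf: fi / cf,
    2: lambda fi, ft, cf: ft / math.log(cf),
    3: lambda fi, ft, cf: fi * math.log(ft) / cf,
    4: lambda fi, ft, cf: math.log(fi) * ft / cf,
    5: lambda fi, ft, cf: math.log(fi) * ft / math.log(cf),
}


def score_context(formula_id, fi, ft, cf):
    """Score one context; assumes fi >= 1, ft >= 1 and cf > 1."""
    return CONTEXT_FORMULAS[formula_id](fi, ft, cf)


def rank_contexts(stats, formula_id, top_n):
    """stats maps context -> (fi, ft, cf); return the top_n best contexts."""
    ordered = sorted(stats,
                     key=lambda c: score_context(formula_id, *stats[c]),
                     reverse=True)
    return ordered[:top_n]
```

Here `formula_id` and `top_n` play the role of the two context parameters: which formula to use and how many contexts to keep.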

Page 13: Self-adjustable bootstrapping for Named Entity set expansion

2. Prefix/Suffix

Award: Academy Award, American Book Awards, Filmfare Awards, BAFTA Awards, Batty Weber Prize, Booker Prize, Cameos Prize, Carnegie Prize, World Cup, Edgar Award
  S=Award (19%), S=Prize (16%)

Lake: Aberdeen Lake, White Rock Lake, Tucker Lake, Summersville Lake, Belmont Lake, Lake Monroe, Lake Nakuwa, Lake Muhlenberg, Lake Lacanau, Lake Columbia
  P=Lake (47%), S=Lake (30%)

Bridge: Albert Bridge, George Washington Bridge, Auckland Harbor Bridge, Benjamin Franklin Bridge, Yokohama Bay Bridge, Walter Taylor Bridge
  S=Bridge (70%), S=bridge (8%)

Bird: African Cuckoo, Afep Pigeon, Owl, Acorn Woodpecker, Penguin, Hawk, Eagle, Parrot, Crow
  N/A
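Percentages like the ones above could be computed from a seed list roughly as follows. This sketch assumes that P and S stand for the first and last whitespace-separated token of a multi-word name and that matching is case-sensitive (the slide lists S=Bridge and S=bridge separately); the function name is hypothetical.

```python
from collections import Counter


def affix_distribution(seeds):
    """Return (prefix_share, suffix_share): the fraction of seeds that
    start or end with each token. Single-token seeds count toward the
    total but contribute no prefix or suffix."""
    prefixes, suffixes = Counter(), Counter()
    for name in seeds:
        tokens = name.split()
        if len(tokens) < 2:
            continue
        prefixes[tokens[0]] += 1
        suffixes[tokens[-1]] += 1
    total = len(seeds)
    prefix_share = {tok: n / total for tok, n in prefixes.items()}
    suffix_share = {tok: n / total for tok, n in suffixes.items()}
    return prefix_share, suffix_share


lakes = ["Aberdeen Lake", "Tucker Lake", "Lake Monroe", "Lake Columbia"]
prefix_share, suffix_share = affix_distribution(lakes)
print(prefix_share["Lake"], suffix_share["Lake"])  # 0.5 0.5
```

A category such as Bird, where no single token dominates, would simply yield no strong prefix or suffix, which matches the N/A entry.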

Page 15: Self-adjustable bootstrapping for Named Entity set expansion

3. Length

• Set bias for length of retrieved entity set based on distribution of length over the seed words.

[Figure: Nationality. Distribution of seed lengths; x-axis: length (1 to 5); y-axis: 0 to 1; series: Nationality.txt]
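A length bias of this kind might be computed as below. The sketch assumes that length is measured in whitespace-separated tokens and that a candidate is scored by the relative frequency of its length among the seeds; both function names are hypothetical.

```python
from collections import Counter


def length_distribution(seeds):
    """Relative frequency of each name length (in tokens) over the seeds."""
    counts = Counter(len(seed.split()) for seed in seeds)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}


def length_score(candidate, distribution):
    """Score a candidate by how common its length is among the seeds."""
    return distribution.get(len(candidate.split()), 0.0)


dist = length_distribution(["Japanese", "French", "Indian", "South African"])
print(length_score("German", dist))           # 0.75
print(length_score("War and Peace", dist))    # 0.0
```

For a category like Nationality the distribution is sharply peaked, so the bias is strong; for a category like Book it is flat and the bias carries little information.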

Page 16: Self-adjustable bootstrapping for Named Entity set expansion

3. Length

• Set bias for length of retrieved entity set based on distribution of length over the seed words.

[Figure: Bird. Distribution of seed lengths; x-axis: length (1 to 5); y-axis: 0 to 1; series: Bird.txt, Nationality.txt]

Page 17: Self-adjustable bootstrapping for Named Entity set expansion

3. Length

• Set bias for length of retrieved entity set based on distribution of length over the seed words.

[Figure: Book. Distribution of seed lengths; x-axis: length (1 to 5); y-axis: 0 to 1; series: Bird.txt, Book.txt, Nationality.txt]

Page 18: Self-adjustable bootstrapping for Named Entity set expansion

3. Length

• Set bias for length of retrieved entity set based on distribution of length over the seed words.

[Figure: Distribution of seed lengths; x-axis: length (1 to 5); y-axis: 0 to 1; series: Academic.txt, Award.txt, Conference.txt, Bird.txt, Book.txt, Money_Form.txt, Lake.txt, Airport.txt, Nationality.txt]

Page 20: Self-adjustable bootstrapping for Named Entity set expansion

Optimization Function

• TRR (Total Reciprocal Rank)
  – We want to get higher score for parameters which retrieve our test examples at the top of the retrieved set.

Ranks of test examples: 1, 2, 8
Score = 1/1 + 1/2 + 1/8 = 1.625

Ranks of test examples: 2, 3, 4, 6
Score = 1/2 + 1/3 + 1/4 + 1/6 = 1.25

TRR = Σ_i 1/r_i, where r_i is the rank of the i-th test example in the retrieved set
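As the worked scores suggest, TRR is the sum of reciprocal ranks of the held-out examples that appear in the retrieved list. A minimal sketch, assuming 1-based ranks, a retrieved list without duplicates, and a zero contribution from examples that are not retrieved (the function name is hypothetical):

```python
def total_reciprocal_rank(retrieved, held_out):
    """Sum 1/rank over the held-out examples found in the ranked list."""
    held = set(held_out)
    return sum(1.0 / rank
               for rank, item in enumerate(retrieved, start=1)
               if item in held)


# Held-out examples at ranks 1, 2 and 8, as in the first example above.
retrieved = ["a", "b", "x1", "x2", "x3", "x4", "x5", "c"]
print(total_reciprocal_rank(retrieved, {"a", "b", "c"}))  # 1.625
```

Because early ranks dominate the sum, a setting that places a few held-out seeds at the very top can outscore one that places more of them slightly lower.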

Page 21: Self-adjustable bootstrapping for Named Entity set expansion

Experiment

• Data
  – The dataset consists of seeds for all 150 NEs
  – The number of seeds varies from 20 to 20,000, extracted from Wikipedia list pages and other list pages (Sekine et al. 2004)
  – Examples

    Category   Seeds   Examples
    Academic   214     Artificial Intelligence, Asia-Pacific Studies, Biochemistry
    Airport    1054    A.P Hill Army Airfield, Aberdeen Airport, Afron Municipal Airport
    Bridge     1174    10th Avenue Bridge, 23rd Street viaduct, Acosta Bridge

• Program
  – N-gram search engine for Wikipedia.
  – 1.7 billion tokens and 1.2 billion 7-grams.

Page 22: Self-adjustable bootstrapping for Named Entity set expansion

Optimization Result

• Different parameter settings give the best results for different categories

Column labels: Context scoring func. | Target scoring func. | threshold for context | p/suffix feat. weight | length feat. weight

Academic (43):   #2  #2  200  100  10   0.90  1.09  0.26
Airport (210):   #5  #4  50   700  10   0.24  1.56  0.57
Bridge (235):    #5  #4  200  700  50   0.24  1.22  0.77
Baseline:        #2  #5  50   100  100  0.83  1.47  0.56

Last line is the best for all categories combined (baseline)

Page 23: Self-adjustable bootstrapping for Named Entity set expansion

Results

            Our Method               Baseline
            Rec.  Prec.  F           Rec.  Prec.  F
Academic    71    61     66 (+11)    61    50     55
Airport     23    61     33 (+0)     23    60     33
Bridge      16    47     24 (+10)    9     29     14

Recall: percentage of held-out seed examples in top 2,000

Precision: percentage of correct targets in a random sample of 100 from the top 2,000
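The F columns are consistent with the usual harmonic mean of recall and precision, although the slides do not define F explicitly. A small check under that assumption (the function name is hypothetical):

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision (both in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision) pairs taken from the results table.
rows = {
    "Academic, our method": (71, 61),
    "Academic, baseline": (61, 50),
    "Bridge, our method": (16, 47),
    "Bridge, baseline": (9, 29),
}
for label, (rec, prec) in rows.items():
    print(label, round(f_measure(rec, prec)))
```

Rounded to whole percentages, these reproduce the reported values of 66, 55, 24 and 14.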

Page 24: Self-adjustable bootstrapping for Named Entity set expansion

Future Work

• More Features
  – Phrase Clustering
  – Genre information
  – Longer dependency
• Better optimization
• Start with smaller number of seeds
• Other targets (e.g. relation)
• Make a tool (like Google Sets)

Page 25: Self-adjustable bootstrapping for Named Entity set expansion

Using Phrase Clusters

Matching %   Total Matches   Category         Cluster ID
92%          237             Airport          287
91%          23              Incident         548
83%          24              Ocassions        474
70%          72              Facility_Other   769
64%          11566           Titles           545
10%          7252            Flora            950
10%          103009          City             441
9.7%         11272           Religion         464
9.4%         3107            Planet           332
9.5%         2883            Train            326

Page 26: Self-adjustable bootstrapping for Named Entity set expansion

Airport Cluster

• 1301 “airport” in Cluster #287
• Chicago 's O'Hare Airport
• Ben Gurion International Airport
• Little Rock National Airport
• London 's Heathrow airport
• Austin airport
• Burbank airport
• London 's Heathrow Airport
• Memphis airport
• La Guardia airport
• Corpus Christi International Airport
• Boston 's Logan Airport
• Cincinnati/Northern Kentucky International Airport
• Sea-Tac airport

Page 27: Self-adjustable bootstrapping for Named Entity set expansion

Conclusion

• A solution for “Different methods work for different categories”

• Large dictionary of 150 category Named Entities
