self-adjustable bootstrapping for named entity set expansion

July 30th, 2009, Lexical Knowledge from Ngrams

Self-adjustable bootstrapping for Named Entity set expansion

Sushant Narsale (JHU)
Satoshi Sekine (NYU)


Nail: Set (NE list) Expansion using bootstrapping


Expand Named Entity Sets for 150 Named Entity Categories




Our Task

• Input: Seeds for 150 Named Entity Categories
• Output: More examples like the seeds

• Motivation
– “Creating lists of Named Entities on Web is critical for query analysis, document categorization and ad matching” (Web Scale Distributional Similarity and Entity Set Expansion, Pantel et al.)



Examples of 3 categories from 150

• Awards (1091): AAASS/Orbis Books Prize, Abd-el-Tif prize, Abel Prize, Academy Award, ACM Turing Award, Adalbert Stifter Prize, Adriano Gonzalez Leon Biennial Novel Prize, Aga Khan Prize for Fiction, Agatha Award, Agatha Awards, AIA Gold Medal

• Academy (215): Aboriginal Studies, Accounting, Actuarial Science and Statistics, Administration of Justice, Administrative and Policy Studies, African Studies, Africana Studies, American Cultures, American Studies, Anatomy, Anesthesiology, Anthropology

• Title (8): Mr., Mr, Mister, Mrs., Mrs, Miss, Ms., Ms


150 category Named Entity


Bootstrapping

• Get more of the same kind
– Set of names (e.g. Presidents)
  • Clinton, Bush, Putin, Chirac
– They must share something…
  • They share the same contexts in texts
    • President * said yesterday
    • of President * in
    • President * , the
    • President * , who
  • The contexts may be shared by other Presidents
    • Yeltsin, Zemin, Hussein, Obama

We need a scoring function to score the candidates.
We need to set the number of contexts/examples to learn.
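One round of the bootstrapping loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `extract_contexts` and `find_candidates`, the toy n-gram list, and the restriction to single-token names are all assumptions made for the example, and no scoring is applied yet.

```python
from collections import Counter

def _pattern(tokens, i):
    """Replace the token at position i with the wildcard '*'."""
    return " ".join(tokens[:i] + ["*"] + tokens[i + 1:])

def extract_contexts(ngrams, seeds):
    """Count the contexts (n-grams with the seed wildcarded) that contain a seed."""
    contexts = Counter()
    for ngram in ngrams:
        tokens = ngram.split()
        for i, tok in enumerate(tokens):
            if tok in seeds:
                contexts[_pattern(tokens, i)] += 1
    return contexts

def find_candidates(ngrams, contexts, seeds):
    """Collect non-seed tokens that fill the wildcard of a known context."""
    candidates = Counter()
    for ngram in ngrams:
        tokens = ngram.split()
        for i, tok in enumerate(tokens):
            if tok not in seeds and _pattern(tokens, i) in contexts:
                candidates[tok] += 1
    return candidates

# Toy data (hypothetical), standing in for the n-gram search engine.
ngrams = [
    "President Clinton said yesterday",
    "President Bush said yesterday",
    "President Yeltsin said yesterday",
    "of President Putin in",
    "of President Obama in",
    "the weather said yesterday",
]
seeds = {"Clinton", "Bush", "Putin", "Chirac"}

contexts = extract_contexts(ngrams, seeds)
candidates = find_candidates(ngrams, contexts, seeds)
print(sorted(candidates))
```

In a full system the contexts and candidates would be ranked with a scoring function and only the top ones kept, which is where the per-category parameters come in.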



Problem

• Different NE categories need different parameter settings in bootstrapping

– “Academic” has a small number of strong contexts (Department of … at)

– “Company” has a large number of weak contexts (… was bankrupted, … hires)

– “Award” has a strong suffix feature (… Award/Prize)

– “Nationality” has a specific length (1), “Book” has a wide length variation


Self-Adjustable bootstrapping

• We need to find the best parameter setting for each category
• Idea:

Bootstrapping + Machine Learning Approach

Use 80% of seeds for training (train-data), 20% of seeds to optimize the functions and thresholds (dev-data)
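The 80/20 split of the seeds can be sketched like this. The function name `split_seeds`, the fixed random seed, and the placeholder seed names are assumptions for illustration; the slides do not say how the split was drawn.

```python
import random

def split_seeds(seeds, train_fraction=0.8, rng_seed=0):
    """Shuffle the seeds reproducibly and split them into train-data and dev-data."""
    shuffled = sorted(seeds)          # sort first so the result does not depend on input order
    random.Random(rng_seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical seed list of 10 items: 8 go to train-data, 2 to dev-data.
train_data, dev_data = split_seeds([f"seed_{i}" for i in range(10)])
print(len(train_data), len(dev_data))
```

The train-data seeds drive the bootstrapping, while the dev-data seeds are held out and used only to compare parameter settings.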


Our Approach

• Parameters
1. Context
  • Formulas to score Contexts and Targets
  • Number of contexts to be used
2. Suffix/Prefix
  e.g. Suffix=Awards, for award categories
3. Length
  a bias on the lengths of the retrieved Entity set

– Weighted Linear Interpolation of the three functions

• Optimization Function : Total Reciprocal Rank
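A weighted linear interpolation of the three feature scores might look like the sketch below. The function name `combined_score`, the argument order, and the example values are hypothetical; the slides do not give the exact form or normalization of the individual scores.

```python
def combined_score(context_score, affix_score, length_score, weights):
    """Weighted linear combination of the context, prefix/suffix and length scores."""
    w_context, w_affix, w_length = weights
    return (w_context * context_score
            + w_affix * affix_score
            + w_length * length_score)

# Hypothetical feature scores for one candidate and one weight setting.
score = combined_score(1.0, 0.5, 0.2, weights=(1.0, 2.0, 5.0))
print(score)
```

The weights are per-category parameters, so the same candidate can rank differently depending on which setting the optimization selects.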










1. Scoring formulas

• Scoring Targets
1. Fi / CF
2. Ft / log(CF)
3. Ft * log(Fi) / CF
4. log(Fi) * Ft / CF
5. log(Fi) * Ft / log(CF)

• Scoring Context
1. Fi / CF
2. Ft / log(CF)
3. Fi * log(Ft) / CF
4. log(Fi) * Ft / CF
5. log(Fi) * Ft / log(CF)

Fi = Co-occurrence frequency of targets and the context
Ft = Number of target types co-occurring with the context
CF = Corpus frequency of the context

We observed that different scoring formulas work best for different categories
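The five context-scoring formulas can be written as a small lookup table, which makes the formula number itself a tunable parameter. This is a sketch under assumptions: the names `CONTEXT_SCORERS` and `score_context` are made up here, the natural logarithm is used because the slides do not state the base, and the guard for CF ≤ 1 is an added safety check.

```python
import math

# Formula number -> function of (Fi, Ft, CF), following the "Scoring Context" list.
CONTEXT_SCORERS = {
    1: lambda fi, ft, cf: fi / cf,
    2: lambda fi, ft, cf: ft / math.log(cf),
    3: lambda fi, ft, cf: fi * math.log(ft) / cf,
    4: lambda fi, ft, cf: math.log(fi) * ft / cf,
    5: lambda fi, ft, cf: math.log(fi) * ft / math.log(cf),
}

def score_context(formula_id, fi, ft, cf):
    """Score one context with the chosen formula (Fi, Ft >= 1 and CF >= 1 assumed)."""
    if formula_id in (2, 5) and cf <= 1:
        raise ValueError("log(CF) is zero or undefined for CF <= 1")
    return CONTEXT_SCORERS[formula_id](fi, ft, cf)

# Hypothetical counts: the context co-occurs 10 times with 5 distinct seeds
# and appears 100 times in the corpus.
for formula_id in sorted(CONTEXT_SCORERS):
    print(formula_id, score_context(formula_id, fi=10, ft=5, cf=100))
```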










2. Prefix/Suffix

Most frequent prefix (P) and suffix (S) per category, with example seeds:

• Award: S=Award (19%), S=Prize (16%)
  Academy Award, American Book Awards, Filmfare Awards, BAFTA Awards, Batty Weber Prize, Booker Prize, Cameos Prize, Carnegie Prize, World Cup, Edgar Award

• Lake: P=Lake (47%), S=Lake (30%)
  Aberdeen Lake, White Rock Lake, Tucker Lake, Summersville Lake, Belmont Lake, Lake Monroe, Lake Nakuwa, Lake Muhlenberg, Lake Lacanau, Lake Columbia

• Bridge: S=Bridge (70%), S=bridge (8%)
  Albert Bridge, George Washington Bridge, Auckland Harbor Bridge, Benjamin Franklin Bridge, Yokohama Bay Bridge, Walter Taylor Bridge

• Bird: N/A
  African Cuckoo, Afep Pigeon, Owl, Acorn Woodpecker, Penguin, Hawk, Eagle, Parrot, Crow
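Prefix and suffix statistics like the ones above can be estimated directly from the seed list. The sketch below assumes that a prefix is the first whitespace-separated token and a suffix the last one; the function name `affix_distribution` is hypothetical. It runs on the ten Lake examples from the slide only, so its fractions differ from the percentages reported for the full seed list.

```python
from collections import Counter

def affix_distribution(seeds):
    """Fraction of seeds that start (P) or end (S) with each token."""
    counts = Counter()
    for name in seeds:
        tokens = name.split()
        if len(tokens) < 2:
            continue  # a single-token name has no separate prefix or suffix
        counts[("P", tokens[0])] += 1
        counts[("S", tokens[-1])] += 1
    return {affix: count / len(seeds) for affix, count in counts.items()}

lake_seeds = [
    "Aberdeen Lake", "White Rock Lake", "Tucker Lake", "Summersville Lake",
    "Belmont Lake", "Lake Monroe", "Lake Nakuwa", "Lake Muhlenberg",
    "Lake Lacanau", "Lake Columbia",
]
dist = affix_distribution(lake_seeds)
print(dist[("P", "Lake")], dist[("S", "Lake")])
```

A category such as Bird, where no token dominates either position, would yield only small fractions, which matches the N/A entry.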











3. Length

• Set bias for length of retrieved entity set based on distribution of length over the seed words.
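A length bias of this kind can be sketched as an empirical distribution over seed lengths. The sketch assumes length is counted in whitespace-separated tokens (consistent with Nationality having length 1); the function names and the example seeds are hypothetical.

```python
from collections import Counter

def length_distribution(seeds):
    """Empirical distribution of seed lengths, measured in tokens."""
    counts = Counter(len(seed.split()) for seed in seeds)
    total = sum(counts.values())
    return {length: count / total for length, count in counts.items()}

def length_score(candidate, distribution):
    """Score a candidate by how common its length is among the seeds."""
    return distribution.get(len(candidate.split()), 0.0)

# Hypothetical Nationality seeds: every seed is a single token.
nationality = length_distribution(["Japanese", "French", "Indian", "American"])
print(length_score("Russian", nationality))        # single token, like the seeds
print(length_score("New Zealander", nationality))  # two tokens, unseen length
```

For a category such as Book, where seed lengths vary widely, the distribution is flatter and the bias has less influence on the ranking.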

[Figure: distribution of seed lengths for Nationality; x-axis 1 to 5, y-axis 0 to 1]


[Figure: distribution of seed lengths for Bird and Nationality; x-axis 1 to 5, y-axis 0 to 1]


[Figure: distribution of seed lengths for Bird, Book and Nationality; x-axis 1 to 5, y-axis 0 to 1]


[Figure: distribution of seed lengths for Academic, Award, Conference, Bird, Book, Money_Form, Lake, Airport and Nationality; x-axis 1 to 5, y-axis 0 to 1]










Optimization Function

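• TRR (Total Reciprocal Rank)
– We want a higher score for parameters which retrieve our test examples at the top of the retrieved set.

Test examples retrieved at ranks 1, 2, 8:
Score = 1/1 + 1/2 + 1/8 = 1.625

Test examples retrieved at ranks 2, 3, 4, 6:
Score = 1/2 + 1/3 + 1/4 + 1/6 = 1.25

$\mathrm{TRR} = \sum_{i} \frac{1}{r_i}$, where $r_i$ is the rank of the i-th test example found in the retrieved set.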
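Total Reciprocal Rank is straightforward to compute from a ranked list and the held-out examples. The function name `total_reciprocal_rank` and the letter-named toy items are assumptions for illustration.

```python
def total_reciprocal_rank(ranked, held_out):
    """Sum 1/rank over every held-out example that appears in the ranked list."""
    held = set(held_out)
    return sum(1.0 / rank
               for rank, item in enumerate(ranked, start=1)
               if item in held)

# Toy retrieved list of eight items; held-out examples sit at ranks 1, 2 and 8.
ranked = list("abcdefgh")
trr = total_reciprocal_rank(ranked, {"a", "b", "h"})
print(trr)  # 1/1 + 1/2 + 1/8 = 1.625
```

Because the contribution of each example falls off as 1/rank, a setting that places a few held-out seeds at the very top can outscore one that retrieves more of them further down.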


Experiment

• Data
– The dataset consists of seeds for all 150 NE categories
– The number of seeds varies from 20 to 20,000, extracted from Wikipedia list pages and other list pages (Sekine et al. 2004)
– Examples

  Category   Seeds   Examples
  Academic   214     Artificial Intelligence, Asia-Pacific Studies, Biochemistry
  Airport    1054    A.P Hill Army Airfield, Aberdeen Airport, Afron Municipal Airport
  Bridge     1174    10th Avenue Bridge, 23rd Street viaduct, Acosta Bridge

• Program
– N-gram search engine for Wikipedia
– 1.7 billion tokens and 1.2 billion 7-grams


Optimization Result

• Different parameter settings give the best results for different categories

Columns: context scoring func., target scoring func., threshold for context, p/suffix feat. weight, length feat. weight

Academic (43)               #2  #2  200  100  10   0.90  1.09  0.26
Airport (210)               #5  #4  50   700  10   0.24  1.56  0.57
Bridge (235)                #5  #4  200  700  50   0.24  1.22  0.77
All categories (baseline)   #2  #5  50   100  100  0.83  1.47  0.56

Last line is the best for all categories combined (baseline)


Results

            Our Method                 Baseline
            Rec.   Prec.   F           Rec.   Prec.   F
Academic    71     61      66 (+11)    61     50      55
Airport     23     61      33 (+0)     23     60      33
Bridge      16     47      24 (+10)    9      29      14

Recall: percentage of held-out seed examples in top 2,000

Precision: percentage of correct targets in a random sample of 100 from the top 2,000
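The F column is consistent with the harmonic mean of recall and precision. The sketch below assumes that definition (balanced F-measure), which the slides do not state explicitly; the function name `f_measure` is hypothetical.

```python
def f_measure(recall, precision):
    """Balanced F-measure: harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Recall and precision values taken from the results table (percentages).
results = {
    "Academic (our method)": (71, 61),
    "Academic (baseline)": (61, 50),
    "Bridge (our method)": (16, 47),
    "Bridge (baseline)": (9, 29),
}
for name, (recall, precision) in results.items():
    print(name, round(f_measure(recall, precision)))
```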


Future Work

• More Features
– Phrase Clustering
– Genre information
– Longer dependency

• Better optimization
• Start with a smaller number of seeds
• Other targets (e.g. relation)
• Make a tool (like Google Sets)


Using Phrase Clusters

Matching %   Total Matches   Category         Cluster ID
92%          237             Airport          287
91%          23              Incident         548
83%          24              Ocassions        474
70%          72              Facility_Other   769
64%          11566           Titles           545
10%          7252            Flora            950
10%          103009          City             441
9.7%         11272           Religion         464
9.4%         3107            Planet           332
9.5%         2883            Train            326


Airport Cluster

• 1301 “airport” in Cluster #287
• Chicago 's O'Hare Airport
• Ben Gurion International Airport
• Little Rock National Airport
• London 's Heathrow airport
• Austin airport
• Burbank airport
• London 's Heathrow Airport
• Memphis airport
• La Guardia airport
• Corpus Christi International Airport
• Boston 's Logan Airport
• Cincinnati/Northern Kentucky International Airport
• Sea-Tac airport


Conclusion

• A solution for “Different methods work for different categories”

• Large dictionary of 150 category Named Entities

