

The SemEval-2007 WePS Evaluation: Establishing a Benchmark for the Web People Search Task

Javier Artiles, Julio Gonzalo, Satoshi Sekine

UNED NLP & IR Group, Madrid, Spain

nlp.uned.es/~{javier, julio}

CS Department, New York University, USA (nlp.cs.nyu.edu/sekine)

Aarhus, 19 Sep 2008


The WePS Task

The Web People Search problem


The WePS 1 Task

Input: first 100 results for a person name search

Output: the documents clustered according to the actual people they refer to (a minimal data sketch follows the example below)

John Smith 1 (Captain)

Captain John Smith - www.apva.org

John Smith Wikipedia - en.wikipedia.org/wiki…

John Smith 2 (Labour leader)

BBC: Labour leader John Smith – news.bbc.co.uk…

John Smith Wikipedia - en.wikipedia.org/wiki…

John Smith 3 (IBM researcher)

John Smith 4 (Film director)

John Smith 5 (Shoe company)

John Smith 6 (Writer)
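As a purely illustrative sketch (not part of the WePS material), the input/output contract above could be represented as follows in Python; the SearchResult class, the example data, and the one-cluster-per-document baseline are hypothetical.

# Illustrative only: input = ranked search results for one name,
# output = a grouping of those results, one group per distinct person.
from dataclasses import dataclass

@dataclass
class SearchResult:          # hypothetical container for one retrieved page
    rank: int
    title: str
    url: str

results = [
    SearchResult(1, "Captain John Smith", "www.apva.org"),
    SearchResult(2, "John Smith - Wikipedia", "en.wikipedia.org/wiki/..."),
    SearchResult(3, "BBC: Labour leader John Smith", "news.bbc.co.uk/..."),
]

# A system answer: lists of result ranks, one list per person.
# Shown here: the trivial "one person per document" baseline.
clusters = [[r.rank] for r in results]   # -> [[1], [2], [3]]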


The WePS Task

Person names are a frequent type of Web search: approximately 30% of queries to Web search engines include a person name.

But names can be very ambiguous: according to the U.S. Census Bureau, 90,000 different names are shared by 100 million people.

We can find:

– High ambiguity (e.g. 82 different people in 100 pages that mention “Martha Edwards”)

– Monopolized names (e.g. the top 100+ results for the search “Scarlett Johansson” mention only the famous actress)

An end-user task with a clear application.


Why the WePS Task?

Connections with traditional WSD, but also some exciting differences:

– Unknown number of “senses” (sense discrimination).
– Much higher average ambiguity…
– … but sharper boundaries between senses.
– A document might refer to different people with the same ambiguous name (multiclass problem).

Receiving increasing attention from the IR/IE research community and from companies:
– ZoomInfo people search engine (www.zoominfo.com).
– Spock, which started a similar challenge just a few months ago (www.spock.com).


Also a relevant multilingual task!



Data: training and test datasets

Training:
name source    av. entities   av. documents
Wikipedia      23.14          99.00
ECDL06         15.30          99.20
WEB03 *         5.90          47.20
total av.      10.76          71.02

Test:
name source    av. entities   av. documents
Wikipedia      56.50          99.30
ACL06          31.00          98.40
Census         50.30          99.10
total av.      45.93          98.93

Random selection of names.

Different sources (Wikipedia, US Census, CS conferences).

For each person name, retrieve at most the top 100 documents (via the Yahoo! search API).

Manual clustering of each set of documents.

* Gideon S. Mann, "Multidocument Statistical Fact Extraction and Fusion", Johns Hopkins University, 2006.


Different name sources should provide different ambiguity scenarios.

But we found high and unpredictable variability across test cases. This affected the balance between training and test data, and added an (unintentional) challenge for systems.



Evaluation measures and Baselines

Purity: rewards less noise in each cluster.

Inverse Purity: rewards grouping all the elements of a category into the same cluster.

F-measure (α = 0.5): harmonic mean of Purity and Inverse Purity.
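For reference, these measures are usually formalised as follows (a standard formulation, not copied from the slides), where the C_i are system clusters, the L_j gold categories and n the total number of documents:

\[
\mathrm{Purity} = \frac{1}{n}\sum_i \max_j |C_i \cap L_j|, \qquad
\mathrm{Inverse\ Purity} = \frac{1}{n}\sum_j \max_i |C_i \cap L_j|,
\]
\[
F_{\alpha=0.5} = \frac{1}{0.5\,/\,\mathrm{Purity} \;+\; 0.5\,/\,\mathrm{Inverse\ Purity}}.
\]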

Example (figure): three clusterings of the same document set, compared against the gold standard:
Scattered: P: 1.00, IP: 0.48, F0.5: 0.65
Joined: P: 0.50, IP: 1.00, F0.5: 0.67
Combined: P: 0.75, IP: 1.00, F0.5: 0.86
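A minimal Python sketch of these measures (not the official WePS scorer), assuming each document belongs to exactly one system cluster and one gold category; the toy data at the bottom is hypothetical:

# system / gold: lists of sets of document ids.
def purity(system, gold):
    # Weighted average, over system clusters, of the share of documents
    # that belong to the cluster's best-matching gold category.
    n = sum(len(c) for c in system)
    return sum(max(len(c & g) for g in gold) for c in system) / n

def inverse_purity(system, gold):
    # Same computation with the roles of clusters and categories swapped.
    return purity(gold, system)

def f_measure(system, gold, alpha=0.5):
    # Weighted harmonic mean; alpha = 0.5 gives the plain harmonic mean.
    p, ip = purity(system, gold), inverse_purity(system, gold)
    return 1.0 / (alpha / p + (1.0 - alpha) / ip)

# Hypothetical toy example: four documents, two actual people.
gold = [{"d1", "d2"}, {"d3", "d4"}]
all_in_one = [{"d1", "d2", "d3", "d4"}]           # "joined" style baseline
one_in_one = [{"d1"}, {"d2"}, {"d3"}, {"d4"}]     # "scattered" style baseline
print(f_measure(all_in_one, gold))   # 0.67 (P = 0.50, IP = 1.00)
print(f_measure(one_in_one, gold))   # 0.67 (P = 1.00, IP = 0.50)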


Results

16 groups submitted results (the largest single task at SemEval-2007).


Other issues

Current standard clustering evaluation measures can be cheated (see combined baseline).

Adapted B-Cubed measure: Enrique Amigó, Julio Gonzalo and Javier Artiles (2007), "Evaluation metrics for clustering tasks: a comparison based on formal constraints". http://nlp.uned.es/docs/amigo2007a.pdf
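As a rough sketch of the basic B-Cubed idea (the adapted version cited above additionally handles overlapping clusters; function and variable names here are illustrative):

# system_of / gold_of: dicts mapping each document id to its system
# cluster id and gold category id, respectively (hard clustering only).
def bcubed(system_of, gold_of):
    docs = list(system_of)

    def avg(a_of, b_of):
        # For each document d: among documents sharing d's label under a_of,
        # the fraction that also share d's label under b_of; averaged over d.
        total = 0.0
        for d in docs:
            same_a = [e for e in docs if a_of[e] == a_of[d]]
            total += sum(1 for e in same_a if b_of[e] == b_of[d]) / len(same_a)
        return total / len(docs)

    precision = avg(system_of, gold_of)   # B-Cubed precision
    recall = avg(gold_of, system_of)      # B-Cubed recall
    return precision, recall

# Hypothetical usage:
# p, r = bcubed({"d1": "c1", "d2": "c1", "d3": "c2"},
#               {"d1": "p1", "d2": "p2", "d3": "p2"})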

Inter-annotator agreement?

Double annotation of WePS test data.

No significant swaps in the ranking.



WePS 2

Clustering task (group documents by person) + Information Extraction task (extract person attributes)

Workshop in April 2009 (together with WWW 2009), in Madrid.

More info: http://nlp.uned.es/weps
