Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Basis Technology – Human Language Technology Conference 2012 1 Things, not Strings: From Entity Extraction to Entity Resolution David Murgatroyd VP, Engineering Basis Technology

Upload: basis-technology

Post on 25-Dec-2014

DESCRIPTION

Entity extraction finds names in documents, providing important raw material for big decisions. But finding all mentions of the name “George Bush” is very different than finding all mentions of the 43rd US President. Making big decisions from big data is hopeless unless analytics advance from providing snippets of text to providing statements of truth. Such advances present challenges both of accuracy and of usability. We’ll explore these challenges and demonstrate ways of addressing them. View more slides from the Human Language Technology Conference 2012 here: http://info.basistech.com/hlt-2012-slides

TRANSCRIPT

Page 1: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Basis Technology – Human Language Technology Conference 2012 1

Things, not Strings: From Entity Extraction to Entity Resolution David Murgatroyd

VP, Engineering

Basis Technology

Page 2: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Your job is to analyze reciprocal antagonism between Christian and Islamic extremists across the globe. You want to find information on the Internet on Christian extremist reaction to the killing of the U.S. Ambassador to Libya.

Motivation

Pages 3–13: [image-only slides; only ✓ and ✗ marks survive]

Page 14: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

That was a lot of work. Can text analytics help?

Help?

Page 15: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗ ✗

Filter out pages with the wrong guy?

Filter?

Page 16: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference
Page 17: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Filter Example

Page 18: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗ ✗

Add some filters (a/k/a facets)…

Filter?

Page 19: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗ ✗

Add some filters (a/k/a facets)…

Filter?

Page 20: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗ ✗

Add some filters (a/k/a facets)…

Filter?

Filter results by…

People: <choice 1>, <choice 2>, <choice 3>, …

Page 21: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗ ✗

But what can we use as choices?

Filter?

Filter results by…

People: <choice 1>, <choice 2>, <choice 3>, …

Page 22: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Find names of persons, places, and organizations in a document.

Entity Extraction (Name Tagging)

Page 23: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Group names referring to the same person, within a document.

In-document Coreference Resolution

Page 24: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗ ✗

But what can we use as choices?

Filter choices?

Filter results by…

People: <choice 1>, <choice 2>, <choice 3>, …

Page 25: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗ ✗

Choices: first way that each person was mentioned in each document?

Filter choices?

Filter results by…

Persons named: Kris Stephens, Chris Stephens, Dan Cathy, George Little, …

Page 26: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗

Choices: first name string for each person in each document?

Filter?

Add filters…

Persons named: Dan Cathy, George Little, …

Filtered by…

Persons named: Chris Stephens

✗

Page 27: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗

Choices: first name string for each person in each document?

Filter?

Add filters…

Persons named: Dan Cathy, George Little, …

Filtered by…

Persons named: Chris Stephens

Page 28: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗

Problem: Ambiguity – one name, many entities

Filter?

Add filters…

Persons named: Dan Cathy, George Little, …

Filtered by…

Persons named: Chris Stephens

Page 29: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗

Problem: Variety – one person, many names

Filter?

Add filters…

Persons named: Dan Cathy, George Little, …

Filtered by…

Persons named: Chris Stephens

Page 30: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗

Problem: Variety – one person, many names

Filter?

Add filters…

Persons named: Dan Cathy, George Little, …, Chris Stevens, J. Christopher Stevens, …

Filtered by…

Persons named: Chris Stephens
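The two problems above, ambiguity and variety, can be reproduced with a toy example. This is only an illustrative sketch: the document ids, the "who it is really about" labels, and the helper `filter_by_name` are hypothetical and do not come from the talk. It filters on the literal name string, as the facet on these slides does, and shows that the hits mix different people while other spellings of the wanted person are missed.

```python
# Hypothetical toy corpus: (doc id, extracted name string, who the doc is really about).
DOCS = [
    ("d1", "Chris Stephens", "ambassador"),          # the person we want, under this spelling
    ("d2", "Chris Stephens", "pastor"),              # a different person with the same name
    ("d3", "Chris Stevens", "ambassador"),           # same person, different spelling
    ("d4", "J. Christopher Stevens", "ambassador"),  # same person, fuller name
    ("d5", "Dan Cathy", "other"),
]


def filter_by_name(docs, name):
    """Return ids of documents whose extracted name string equals `name`."""
    return [doc_id for doc_id, mention, _ in docs if mention == name]


hits = filter_by_name(DOCS, "Chris Stephens")

# Ambiguity: one name string, several real-world entities among the hits.
entities_in_hits = {entity for doc_id, _, entity in DOCS if doc_id in hits}

# Variety: documents about the wanted person that the string filter missed.
wanted = [doc_id for doc_id, _, entity in DOCS if entity == "ambassador"]
missed = [doc_id for doc_id in wanted if doc_id not in hits]

print(hits)                      # ['d1', 'd2']
print(sorted(entities_in_hits))  # ['ambassador', 'pastor']
print(missed)                    # ['d3', 'd4']
```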

Page 31: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Where does your favorite data set fall?

[Chart: axes Ambiguity, Variety, and # of documents (1, Thousands, Millions, Billions)]

Page 32: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗ ✗

Magically group names by person across documents.

Deal with ambiguity and variety?

Filter results by…

People: <choice 1>, <choice 2>, <choice 3>, …

Page 33: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗ ✗

But there’s still the problem of choices…

Labels for choices?

Filter results by…

People: <choice 1>, <choice 2>, <choice 3>, …

Page 34: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗ ✗

Use person’s name from highest ranked doc? Still some ambiguity.

Labels for choices?

Filter results by…

People: Kris Stephens, Chris Stephens 1, Chris Stephens 2, …

Page 35: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗ ✗

Entity Resolution: group and also link to a database of known entities (e.g., Wikipedia).

Labels for choices?

Filter results by…

People: Kris Stephens, Chris Stephens 1, Chris Stephens 2, …

Kris Stephens, J. Christopher Stevens, Chris Stephens, …
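The linking step described on this slide can be sketched as a lookup against a small knowledge base. Everything here is an assumption made for illustration: the `KB` contents, the entry ids, the second (invented) entry, and the scoring rule (shared alias plus overlapping context words) are not taken from the talk and are not Basis Technology's method. A group of names that matches no entry returns `None`, which corresponds to the "not in the database" case.

```python
# Hypothetical knowledge base: entry id -> known aliases and a few context words.
KB = {
    "J._Christopher_Stevens": {
        "aliases": {"j. christopher stevens", "chris stevens", "christopher stevens"},
        "context": {"ambassador", "libya", "benghazi"},
    },
    "Chris_Stevens_(invented_example)": {
        "aliases": {"chris stevens"},
        "context": {"season", "team", "coach"},
    },
}


def link(names, context_words, kb=KB):
    """Return the id of the best-matching KB entry, or None if nothing matches."""
    names = {n.lower() for n in names}
    words = {w.lower() for w in context_words}
    best_id, best_score = None, 0
    for entry_id, entry in kb.items():
        if not names & entry["aliases"]:
            continue  # no alias in common, so this entry is not a candidate
        score = len(words & entry["context"])  # shared context words break ties
        if score > best_score:
            best_id, best_score = entry_id, score
    return best_id


# A cross-document group of names plus words seen near them (hypothetical).
ambassador = link(["Chris Stevens", "J. Christopher Stevens"],
                  ["ambassador", "Libya", "killed"])
pastor = link(["Chris Stephens"], ["pastor", "church"])

print(ambassador)  # J._Christopher_Stevens
print(pastor)      # None
```

Requiring at least one shared context word before linking is a deliberately cautious choice for this sketch; a real system would need a far richer notion of evidence.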

Page 36: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗ ✗

For items not in the database, infer a unique label (e.g., for a hypothetical Wikipedia page).

Labels for choices?

Filter results by…

People: Kris Stephens, J. Christopher Stevens, Chris Stephens, …

Page 37: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗ ✗

For items not in the database, infer a unique label (e.g., for a hypothetical Wikipedia page).

Filter?

Filter results by…

People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor)

Page 38: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗ ✗

Let’s give it a try…

Filter.

Filter results by…

People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …

Page 39: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓ ✗

Let’s give it a try…

Filter.

Add filters…

People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …

Filtered by…

People: J. Christopher Stevens

✗

Page 40: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓

Let’s give it a try…

Filter.

Add filters…

People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …

Filtered by…

People: J. Christopher Stevens

Page 41: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓

Let’s give it a try…

Filter.

Add filters…

People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …

Filtered by…

People: J. Christopher Stevens

Page 42: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

✓

Let’s give it a try…

Filter.

Add filters…

People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …

Filtered by…

People: J. Christopher Stevens

✓ ✓

Page 43: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Does it work?

How do you measure?

Page 44: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Imagine this was the result of applying the filter with the name from Wikipedia.

How do you measure?

Add filters…

People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …

Filtered by…

People: J. Christopher Stevens

Page 45: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Precision: for each document, how much of the stuff grouped with it is correct?

How do you measure?

Add filters…

People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …

Filtered by…

People: J. Christopher Stevens

✓ 2 / 3 = 67%

✓ 2 / 3 = 67%

✗ 1 / 3 = 33%

Page 46: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Recall: for each document, how much of the correct stuff is grouped with it?

How do you measure?

Add filters…

People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …

Filtered by…

People: J. Christopher Stevens

✓ 2 / 5 = 40%

✓ 2 / 5 = 40%

✗ ✗ ✗

Page 47: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Does it work?

We often combine Precision and Recall measurements into a single measurement, called “F”.
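The per-document precision and recall from the previous two slides, and their combination into F, can be worked through in a few lines. This is a sketch under stated assumptions: the document ids and entity labels are made up to reproduce the 2/3 and 2/5 figures, the function names are hypothetical, and the balanced F (harmonic mean of precision and recall) is used because the talk does not say which weighting it applies.

```python
def per_document_scores(system_groups, true_entity):
    """For each document, return (precision, recall) relative to its group."""
    # How many documents are truly about each entity.
    total = {}
    for entity in true_entity.values():
        total[entity] = total.get(entity, 0) + 1

    scores = {}
    for group in system_groups:
        for doc in group:
            # Documents in this group that share the document's true entity.
            same = sum(1 for other in group
                       if true_entity[other] == true_entity[doc])
            precision = same / len(group)
            recall = same / total[true_entity[doc]]
            scores[doc] = (precision, recall)
    return scores


def f_measure(precision, recall):
    """Balanced F: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Five documents are truly about entity "A"; the system grouped two of them
# with one document about "B" and left the other three on their own.
true_entity = {"d1": "A", "d2": "A", "d3": "B", "d4": "A", "d5": "A", "d6": "A"}
system_groups = [["d1", "d2", "d3"], ["d4"], ["d5"], ["d6"]]

scores = per_document_scores(system_groups, true_entity)
p, r = scores["d1"]
print(round(p, 2), round(r, 2), round(f_measure(p, r), 2))  # 0.67 0.4 0.5
```

Averaging these per-document values over a whole corpus gives single precision, recall, and F figures for a system.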

Page 48: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Where does your favorite data set fall?

[Chart: axes Ambiguity, Variety, and # of documents (1, Thousands, Millions, Billions)]

Page 49: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Where does your favorite data lie?

[Chart: corpora plotted by Ambiguity, Variety, and # of documents (1, Thousands, Millions, Billions); labels F>=85, F>=70, F>=?]

Corpus: ACE 2005, WEPS-2, TAC pre-2012, TAC eng 2012, TAC zho 2012, TAC spa 2012, Basis Balanced, Basis Ambig, Basis Variance 1, Basis Variance 2

Page 50: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Let’s pretend you’re researching the pastors instead.

Trading off Errors

Filter results by…

People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …

Page 51: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

What if you think there are too many (or too few)? Add a slider to make the filter finer (or coarser).

Trading off Errors

Add filters…

People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …

Filtered by…

People: Kris Stephens (pastor)

Page 52: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Make the filter finer.

Trading off Errors

Add filters…

People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …

Filtered by…

People: Kris Stephens (pastor)
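One way to picture the slider is as a similarity threshold that controls how readily mentions are merged into one group. The sketch below is an assumption, not the method shown in the talk: it swaps in a simple greedy grouping over Jaccard similarity of context words, and the mention data and function names are hypothetical. A low threshold gives coarse groups (fewer, larger); a high threshold gives fine groups (more, smaller).

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: shared items over all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_mentions(mentions, threshold):
    """Greedily place each mention in the first group with a similar member."""
    groups = []
    for doc_id, words in mentions:
        for group in groups:
            if any(jaccard(words, other) >= threshold for _, other in group):
                group.append((doc_id, words))
                break
        else:
            groups.append([(doc_id, words)])  # nothing similar enough: new group
    return [[doc_id for doc_id, _ in group] for group in groups]


# Hypothetical mentions of the same name, each with words seen nearby.
MENTIONS = [
    ("d1", {"pastor", "church", "sermon"}),
    ("d2", {"pastor", "church", "protest"}),
    ("d3", {"pastor", "radio", "broadcast"}),
]

coarse = group_mentions(MENTIONS, threshold=0.1)
medium = group_mentions(MENTIONS, threshold=0.4)
fine = group_mentions(MENTIONS, threshold=0.9)

print(coarse)  # [['d1', 'd2', 'd3']]
print(medium)  # [['d1', 'd2'], ['d3']]
print(fine)    # [['d1'], ['d2'], ['d3']]
```

Moving the threshold trades one kind of error for the other: coarser groups risk merging different people, finer groups risk splitting one person.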

Page 53: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Demo

Page 54: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Questions

• Suggested questions:
  – Doesn’t Google already do this?
  – Speed? Scale?
  – Multi-lingual?
  – What other uses are there for entity resolution beyond faceted search?

Page 55: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

For more information: Visit www.basistech.com

Write to [email protected]

Call 617-386-2090

Thank you!

Page 56: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Doesn’t Google already do this?

Some, when searching for famous entities.

Page 57: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Speed/Scale

• Support from BRAVE for scale in CY13!
• Research version:
  – Tested up to 1m docs
  – Sub-second per document
  – Incremental updates (i.e., you see documents published minutes ago)

Page 58: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Doesn’t Google already do this?

Page 59: Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Other uses for entity resolution?

• Supporting relationship resolution by resolving the entities participating in the relationships
• Knowledge base population
• Integrating disparate data sets
• Alerting
• Improving relevance of search results
• Predictive Analytics