From Text to Truth: Real World Facets for Multilingual Search

Lucene/SOLR Revolution 2013 1 From Text to Truth: Real World Facets for Multilingual Search Benson Margulies Executive Vice President and Chief Technical Officer

Upload: lucenerevolution

Post on 13-Dec-2014

DESCRIPTION

Presented by Benson Margulies, Executive Vice President and Chief Technology Officer, Basis Technology.

Solr's ability to facet search results gives end users a valuable way to drill down to what they want. But for unstructured documents, deriving facets such as the persons mentioned requires advanced analytics. Even if names can be extracted from documents, the user doesn't want a "George Bush" facet that intermingles documents mentioning either the 41st or the 43rd U.S. President, nor does she want separate facets for "George W. Bush" or even "乔治·沃克·布什" (a Chinese translation) that are limited to just one string. We'll explore the benefits and challenges of empowering Solr users with real-world facets.

TRANSCRIPT

Page 1: From text to truth real world facets for multilingual search

From Text to Truth: Real World Facets for Multilingual Search
Benson Margulies
Executive Vice President and Chief Technical Officer

Page 2: From text to truth real world facets for multilingual search

Motivation

Your job is to analyze reciprocal antagonism between Christian and Islamic extremists across the globe. You want to find information on the Internet on Christian extremist reaction to the killing of the U.S. Ambassador to Libya.

Pages 3-13: [image-only slides; no text survives in this transcript]

Page 14: From text to truth real world facets for multilingual search

Help?

That was a lot of work. Can text analytics help?

Page 15: From text to truth real world facets for multilingual search

Filter?

Filter out pages with the wrong guy?

Page 16: From text to truth real world facets for multilingual search

Filter?

Add some filters (a/k/a facets)…

Page 17: From text to truth real world facets for multilingual search

Filter?

Add some filters (a/k/a facets)…

Page 18: From text to truth real world facets for multilingual search

Filter?

Add some filters (a/k/a facets)…

Filter results by…
People: <choice 1>, <choice 2>, <choice 3>, …

Page 19: From text to truth real world facets for multilingual search

Filter?

But what can we use as choices?

Filter results by…
People: <choice 1>, <choice 2>, <choice 3>, …

Page 20: From text to truth real world facets for multilingual search

Entity Extraction (Name Tagging)

Find names of persons, places, and organizations in the document.

Page 21: From text to truth real world facets for multilingual search

In-document Coreference Resolution

Group names referring to the same person, within a document.
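To make the grouping step concrete, here is a minimal Python sketch. It assumes the names have already been extracted from one document, and it uses a naive token-subset rule as a stand-in for real coreference resolution; the function name `group_mentions` and the sample mentions are illustrative, not taken from the talk.

```python
def group_mentions(mentions):
    """Group name mentions from ONE document by a naive token-subset rule.

    Illustrative heuristic only: a mention joins an existing group when all
    of its tokens appear in that group's longest name, so "Stevens" joins
    "J. Christopher Stevens".
    """
    groups = []  # each group is a list of mentions; group[0] is the longest
    # Visit longer names first so that full names seed the groups.
    for mention in sorted(mentions, key=lambda m: -len(m.split())):
        tokens = set(mention.lower().split())
        for group in groups:
            if tokens <= set(group[0].lower().split()):
                group.append(mention)
                break
        else:
            groups.append([mention])
    return groups


doc_mentions = ["J. Christopher Stevens", "Stevens", "Dan Cathy", "Cathy"]
grouped = group_mentions(doc_mentions)
print(grouped)
```

A rule this simple breaks as soon as two people in one document share a surname, which is part of why the later slides move on to resolution against known entities.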

Page 22: From text to truth real world facets for multilingual search

Filter choices?

But what can we use as choices?

Filter results by…
People: <choice 1>, <choice 2>, <choice 3>, …

Page 23: From text to truth real world facets for multilingual search

Filter choices?

Choices: first way that each person was mentioned in each document?

Filter results by…
Persons named: Kris Stephens, Chris Stephens, Dan Cathy, George Little, …

Page 24: From text to truth real world facets for multilingual search

Filter?

Choices: first name string for each person in each document?

Add filters…
Persons named: Dan Cathy, George Little, …

Filtered by…
Persons named: Chris Stephens

Page 25: From text to truth real world facets for multilingual search

Filter?

Choices: first name string for each person in each document?

Add filters…
Persons named: Dan Cathy, George Little, …

Filtered by…
Persons named: Chris Stephens

Page 26: From text to truth real world facets for multilingual search

Filter?

Problem: Ambiguity – one name, many entities

Add filters…
Persons named: Dan Cathy, George Little, …

Filtered by…
Persons named: Chris Stephens

Page 27: From text to truth real world facets for multilingual search

Filter?

Problem: Variety – one person, many names

Add filters…
Persons named: Dan Cathy, George Little, …

Filtered by…
Persons named: Chris Stephens

Page 28: From text to truth real world facets for multilingual search

Filter?

Problem: Variety – one person, many names

Add filters…
Persons named: Dan Cathy, George Little, …, Chris Stevens, J. Christopher Stevens, …

Filtered by…
Persons named: Chris Stephens

Page 29: From text to truth real world facets for multilingual search

Deal with ambiguity and variety?

Magically group names by person across documents.

Filter results by…
People: <choice 1>, <choice 2>, <choice 3>, …

Page 30: From text to truth real world facets for multilingual search

Labels for choices?

But there’s still the problem of choices…

Filter results by…
People: <choice 1>, <choice 2>, <choice 3>, …

Page 31: From text to truth real world facets for multilingual search

Labels for choices?

Use person’s name from highest ranked doc? Still some ambiguity.

Filter results by…
People: Kris Stephens, Chris Stephens 1, Chris Stephens 2, …

Page 32: From text to truth real world facets for multilingual search

Labels for choices?

Entity Resolution: group and also link to a database of known entities (e.g., Wikipedia).

Filter results by…
People: Kris Stephens, Chris Stephens 1, Chris Stephens 2, …

Kris Stephens, J. Christopher Stevens, Chris Stephens, …
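The effect of linking mentions to known entities can be sketched in a few lines of Python. This is only an illustration: the `KNOWN_ENTITIES` table, the ID `E1`, and the alias lists are hypothetical, and a plain alias lookup ignores the document context that a real resolver would need to tell apart two people who share a name.

```python
from collections import Counter

# Hypothetical knowledge base: entity ID -> display label and known aliases.
KNOWN_ENTITIES = {
    "E1": {
        "label": "J. Christopher Stevens",
        "aliases": {"chris stevens", "j. christopher stevens"},
    },
}


def resolve(name):
    """Return the ID of a known entity with a matching alias, else None."""
    key = name.lower()
    for entity_id, entity in KNOWN_ENTITIES.items():
        if key in entity["aliases"]:
            return entity_id
    return None


def facet_counts(docs):
    """Count documents per resolved entity rather than per raw name string."""
    counts = Counter()
    for names in docs:
        # A document counts once per entity, however many spellings it uses.
        counts.update({resolve(n) for n in names} - {None})
    return counts


docs = [
    ["Chris Stevens"],
    ["J. Christopher Stevens", "Chris Stevens"],
    ["Kris Stephens"],
]
counts = facet_counts(docs)
print(counts)
```

Both spellings land in the same facet bucket, while "Kris Stephens" stays unresolved because the table has no entry for that name.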

Page 33: From text to truth real world facets for multilingual search

Labels for choices?

For items not in the database, infer a unique label (e.g., for hypothetical Wikipedia page).

Filter results by…
People: Kris Stephens, J. Christopher Stevens, Chris Stephens, …

Page 34: From text to truth real world facets for multilingual search

Filter?

For items not in the database, infer a unique label (e.g., for hypothetical Wikipedia page).

Filter results by…
People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor)

Page 35: From text to truth real world facets for multilingual search

Filter.

Let’s give it a try…

Filter results by…
People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …

Page 36: From text to truth real world facets for multilingual search

Filter.

Let’s give it a try…

Add filters…
People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …

Filtered by…
People: J. Christopher Stevens

Page 37: From text to truth real world facets for multilingual search

Filter.

Let’s give it a try…

Add filters…
People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …

Filtered by…
People: J. Christopher Stevens

Page 38: From text to truth real world facets for multilingual search

Filter.

Let’s give it a try…

Add filters…
People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …

Filtered by…
People: J. Christopher Stevens

Page 39: From text to truth real world facets for multilingual search

Filter.

On a cross-lingual index, real-world entity facets can open results up across languages, unlike search strings.

Add filters…
People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …

Filtered by…
People: J. Christopher Stevens

Language: English, Chinese, Arabic
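In Solr terms, filtering on an entity rather than a string could look roughly like the request built below. The parameters `q`, `fq`, `facet`, `facet.field`, and `facet.mincount` are standard Solr query parameters; the field names `person_entity_id` and `language` and the ID `E1` are assumptions made for this sketch, since the talk does not show its schema.

```python
from urllib.parse import parse_qs, urlencode


def entity_facet_query(text_query, entity_id=None):
    """Build a Solr query string that facets on entity IDs and language."""
    params = [
        ("q", text_query),
        ("facet", "true"),
        ("facet.field", "person_entity_id"),  # hypothetical field name
        ("facet.field", "language"),          # hypothetical field name
        ("facet.mincount", "1"),
    ]
    if entity_id is not None:
        # The filter names an entity, not a spelling, so documents in any
        # language that were linked to this entity can match.
        params.append(("fq", f'person_entity_id:"{entity_id}"'))
    return urlencode(params)


query_string = entity_facet_query("ambassador libya", "E1")
print(query_string)
print(parse_qs(query_string)["fq"])
```

Because the filter value is an ID, the same `fq` applies whether the document spells the name in English, Chinese, or Arabic.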

Page 40: From text to truth real world facets for multilingual search

Trading off Errors

Let’s pretend you’re researching the pastors instead.

Filter results by…
People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …

Page 41: From text to truth real world facets for multilingual search

Trading off Errors

What if you think there are too many results (or too few)? Add a slider to make the filter finer (or coarser).

Add filters…
People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …

Filtered by…
People: Kris Stephens (pastor)

Page 42: From text to truth real world facets for multilingual search

Trading off Errors

Make the filter finer.

Add filters…
People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …

Filtered by…
People: Kris Stephens (pastor)
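One way to read the slider is as a threshold on how confident the resolver is that a document mentions the selected entity. The sketch below works under that assumption, which the slides do not confirm; the document records, the `links` scores, and the ID `E2` are all hypothetical.

```python
def filter_by_entity(docs, entity_id, min_confidence):
    """Keep documents whose link to entity_id scores at least min_confidence."""
    return [
        doc["id"]
        for doc in docs
        if doc["links"].get(entity_id, 0.0) >= min_confidence
    ]


# Hypothetical index: each document carries entity links with a confidence.
docs = [
    {"id": "a", "links": {"E2": 0.95}},
    {"id": "b", "links": {"E2": 0.60}},
    {"id": "c", "links": {"E1": 0.90}},
]

coarse = filter_by_entity(docs, "E2", 0.5)  # slider toward "coarse"
fine = filter_by_entity(docs, "E2", 0.9)    # slider toward "fine"
print(coarse, fine)
```

Raising the threshold trades missed documents for fewer wrong ones, which is the error trade-off these slides describe.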

Page 43: From text to truth real world facets for multilingual search

Demo

Page 44: From text to truth real world facets for multilingual search

RNI Similarity Matching “Tamerlan Tsarnaev”

And the problem only gets worse with Multiple Languages

Page 45: From text to truth real world facets for multilingual search

Fuzzy name search in Solr

•  Facets are one way to navigate names
   o  assume that you've found some interesting data with an ordinary query
   o  what if you are having trouble getting started?
•  Name-specific comparison search is another
•  More complex algorithm than Levenshtein distance on names
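For reference, plain Levenshtein distance, the baseline this slide says is not enough for names, can be written as below. This is the textbook dynamic-programming version, not the name-matching algorithm used in the talk.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


# Two different people from the slides: only 2 edits apart.
different_people = levenshtein("Chris Stephens", "Chris Stevens")
# Two names for the same person: 9 edits apart.
same_person = levenshtein("Chris Stevens", "J. Christopher Stevens")
print(different_people, same_person)
```

The character-level distance ranks the two different people as closer than the two spellings of one person, which is the kind of failure a name-specific comparison has to avoid.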

Page 46: From text to truth real world facets for multilingual search

Plugging in more complex search

•  Open up the 'search component pipeline'
•  First component preprocesses query
   o  Maps from "Fred Chopin" to a complex Lucene query that looks for possible matches across languages and scripts
•  Second component rescores results
   o  detailed comparison of pairs of names to derive final score.
•  Sad limitation (so far): scores not normalized to ordinary Lucene values
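The two-stage shape described above (a cheap, recall-oriented query expansion followed by a detailed pairwise rescoring) can be sketched in plain Python. Nothing here uses the Solr API: the `VARIANTS` table, the sample index, and the function names are hypothetical, and `difflib.SequenceMatcher` is only a placeholder for the detailed name comparison.

```python
from difflib import SequenceMatcher

# Hypothetical variant table standing in for cross-language expansion.
VARIANTS = {"fred": ["fred", "frederic", "fryderyk"]}


def preprocess(query):
    """Stage 1: expand each query token into a set of acceptable variants."""
    return [set(VARIANTS.get(tok, [tok])) for tok in query.lower().split()]


def candidates(expanded, index):
    """Recall step: keep names that match some variant of every token."""
    return [
        name for name in index
        if all(variants & set(name.lower().split()) for variants in expanded)
    ]


def rescore(query, names):
    """Stage 2: compare the query with each candidate, best score first."""
    scored = [
        (SequenceMatcher(None, query.lower(), name.lower()).ratio(), name)
        for name in names
    ]
    return sorted(scored, reverse=True)


index = ["Frederic Chopin", "Fryderyk Chopin", "Fred Astaire", "Kate Chopin"]
found = candidates(preprocess("Fred Chopin"), index)
ranked = rescore("Fred Chopin", found)
print(ranked)
```

The scores from the second stage live on their own scale, which mirrors the limitation noted on the slide: they are not comparable with ordinary Lucene scores.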

Page 47: From text to truth real world facets for multilingual search

And it does SolrCloud, too ...

•  Preprocessor runs before fan-out to shards
•  Rescoring runs out on the shards
•  So the work of checking candidate matches is divided up amongst the shards.
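The division of labor can be simulated in plain Python: the query is preprocessed once, each shard scores only its own candidates, and the coordinator merges lists that are already sorted. This is a simulation, not SolrCloud code; the shard contents and the toy `token_overlap` scorer are hypothetical.

```python
import heapq


def token_overlap(query_tokens, name):
    """Toy scorer: how many query tokens appear in the name."""
    return len(query_tokens & set(name.lower().split()))


def search_shards(query_tokens, shards, score, k=2):
    """Send one preprocessed query to every shard and merge the top hits."""
    per_shard = []
    for shard in shards:
        # Rescoring "on the shard": each shard scores only its own names.
        hits = sorted(((score(query_tokens, n), n) for n in shard), reverse=True)
        per_shard.append(hits[:k])
    # The coordinator merges the pre-sorted lists without rescoring anything.
    return list(heapq.merge(*per_shard, reverse=True))[:k]


query_tokens = {"fred", "chopin"}  # preprocessed once, before fan-out
shards = [
    ["Fred Chopin", "Kate Chopin"],
    ["Fred Astaire", "Frederic Chopin"],
]
top = search_shards(query_tokens, shards, token_overlap)
print(top)
```

Because the expensive comparison runs where the data lives, adding shards spreads the rescoring cost instead of concentrating it on the coordinator.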

Page 48: From text to truth real world facets for multilingual search

Questions

•  Suggested questions:
   – Doesn’t Google already do this?
   – Speed? Scale?
   – Multi-lingual?
   – What other uses are there for entity resolution beyond faceted search?

Page 49: From text to truth real world facets for multilingual search

Doesn’t Google already do this?

Some, when searching for famous entities.

Page 50: From text to truth real world facets for multilingual search

Speed/Scale

•  Future plans include scaling experiments
•  Research version:
   – Tested up to 1m docs
   – Sub-second per document
   – Incremental updates (i.e., you see documents published minutes ago)

Page 51: From text to truth real world facets for multilingual search

Other uses for entity resolution?

•  Supporting relationship resolution by resolving the entities that participate in them
•  Knowledge base population
•  Integrating disparate data sets
•  Alerting
•  Improving relevance of search results
•  Predictive analytics

Page 52: From text to truth real world facets for multilingual search

For more information: Visit www.basistech.com

Write to [email protected]

Call 617-386-2090

Thank you!

Page 53: From text to truth real world facets for multilingual search

CONFERENCE PARTY
The Tipsy Crow: 770 5th Ave
Starts after Stump The Chump
Your conference badge gets you in the door

TOMORROW
Breakfast starts at 7:30
Keynotes start at 8:30

CONTACT
Benson Margulies
[email protected]