From Text to Truth: Real World Facets for Multilingual Search
DESCRIPTION
Presented by Benson Margulies, Executive Vice President and Chief Technology Officer, Basis Technology
Solr's ability to facet search results gives end users a valuable way to drill down to what they want. But for unstructured documents, deriving facets such as the persons mentioned requires advanced analytics. Even if names can be extracted from documents, the user doesn't want a "George Bush" facet that intermingles documents mentioning either the 41st or the 43rd U.S. President, nor does she want separate facets for "George W. Bush" and "乔治·沃克·布什" (a Chinese translation) that are each limited to just one string. We'll explore the benefits and challenges of empowering Solr users with real-world facets.
TRANSCRIPT
Lucene/SOLR Revolution 2013 1
From Text to Truth: Real World Facets for Multilingual Search
Benson Margulies
Executive Vice President and Chief Technical Officer
Your job is to analyze reciprocal antagonism between Christian and Islamic extremists across the globe. You want to find information on the Internet on Christian extremist reaction to the killing of the U.S. Ambassador to Libya.
Motivation
[Search results, marked irrelevant (✗) or relevant (✓): ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗]
That was a lot of work. Can text analytics help?
Help?
✓
✗
✗
Filter out pages with the wrong guy?
Filter?
✓
✗
✗
Add some filters (a/k/a facets)…
Filter?
✓
✗
✗
Add some filters (a/k/a facets)…
Filter?
Filter results by…
People <choice 1> <choice 2> <choice 3> …
✓
✗
✗
But what can we use as choices?
Filter?
Filter results by…
People <choice 1> <choice 2> <choice 3> …
Find names of persons, places, and organizations in a document.
Entity Extraction (Name Tagging)
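As a rough illustration of what name tagging produces, here is a minimal sketch that assumes a tiny hand-made gazetteer. The `GAZETTEER` table and the `tag_names` function are hypothetical and invented for this example; production name taggers generally rely on statistical models rather than a fixed list.

```python
import re

# Hypothetical toy gazetteer (an assumption for illustration only).
GAZETTEER = {
    "J. Christopher Stevens": "PERSON",
    "Chris Stevens": "PERSON",
    "Libya": "LOCATION",
    "State Department": "ORGANIZATION",
}

def tag_names(text):
    """Return (start, end, surface, type) tuples, preferring longer names."""
    mentions = []
    taken = [False] * len(text)  # characters already covered by a mention
    for name in sorted(GAZETTEER, key=len, reverse=True):
        for match in re.finditer(re.escape(name), text):
            if not any(taken[match.start():match.end()]):
                mentions.append(
                    (match.start(), match.end(), name, GAZETTEER[name])
                )
                for i in range(match.start(), match.end()):
                    taken[i] = True
    return sorted(mentions)  # document order

sample = (
    "J. Christopher Stevens was the U.S. Ambassador to Libya. "
    "Chris Stevens worked for the State Department."
)
for mention in tag_names(sample):
    print(mention)
```

The output is a list of typed spans, which is the raw material for the grouping steps that follow.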
Group names referring to the same person, within a document.
In-document Coreference Resolution
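A minimal sketch of in-document grouping, assuming a deliberately naive rule: a shorter mention joins a group when all of its tokens appear in that group's longest mention. The `group_mentions` function and the rule itself are assumptions for illustration, not the method used in the talk.

```python
def _tokens(mention):
    """Lowercased tokens with periods removed."""
    return set(mention.lower().replace(".", "").split())

def group_mentions(mentions):
    """Group person mentions taken from ONE document.

    Toy rule (an assumption): a mention joins the first group whose
    longest mention contains every one of its tokens.
    """
    groups = []  # each group is a list; its first item is the longest mention
    for mention in sorted(mentions, key=len, reverse=True):
        for group in groups:
            if _tokens(mention) <= _tokens(group[0]):
                group.append(mention)
                break
        else:
            groups.append([mention])
    return groups

doc_mentions = ["Chris Stephens", "Dan Cathy", "Stephens", "Cathy"]
print(group_mentions(doc_mentions))
```

A rule this simple only works inside a single document, where a bare surname almost always refers back to a person already named in full.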
✓
✗
✗
But what can we use as choices?
Filter choices?
Filter results by…
People <choice 1> <choice 2> <choice 3> …
✓
✗
✗
Choices: first way that each person was mentioned in each document?
Filter choices?
Filter results by…
Persons named Kris Stephens Chris Stephens Dan Cathy George Little …
✓
✗
Choices: first name string for each person in each document?
Filter?
Add filters…
Persons named Dan Cathy George Little …
Filtered by…
Persons named Chris Stephens ✗
✓
✗
Choices: first name string for each person in each document?
Filter?
Add filters…
Persons named Dan Cathy George Little …
Filtered by…
Persons named Chris Stephens
✓
✗
Problem: Ambiguity – one name, many entities
Filter?
Add filters…
Persons named Dan Cathy George Little …
Filtered by…
Persons named Chris Stephens
✓
✗
Problem: Variety – one person, many names
Filter?
Add filters…
Filtered by…
Add filters…
Persons named Dan Cathy George Little …
Filtered by…
Persons named Chris Stephens
✓
✗
Problem: Variety – one person, many names
Filter?
Add filters…
Persons named Dan Cathy George Little … Chris Stevens J. Christopher Stevens …
Filtered by…
Persons named Chris Stephens
✓
✗
✗
Magically group names by person across documents.
Deal with ambiguity and variety?
Filter results by…
People <choice 1> <choice 2> <choice 3> …
✓
✗
✗
But there’s still the problem of choices…
Labels for choices?
Filter results by…
People <choice 1> <choice 2> <choice 3> …
✓
✗
✗
Use person’s name from highest ranked doc? Still some ambiguity.
Labels for choices?
Filter results by…
People Kris Stephens Chris Stephens 1 Chris Stephens 2 …
✓
✗
✗
Entity Resolution: group and also link to a database of known entities (e.g., Wikipedia).
Labels for choices?
Filter results by…
People Kris Stephens Chris Stephens 1 Chris Stephens 2 …
Kris Stephens J. Christopher Stevens Chris Stephens …
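To make the linking step concrete, here is a minimal sketch that assumes a two-entry knowledge base keyed by Wikipedia-style identifiers. The `KB` contents, the `link` function, and the context-word overlap score are all assumptions for illustration; they borrow the "George Bush" example from the talk description rather than any real resolver.

```python
# Hypothetical mini knowledge base: id -> known aliases and context words.
KB = {
    "George_H._W._Bush": {
        "aliases": {"george bush", "george h. w. bush"},
        "context": {"41st", "gulf", "1989"},
    },
    "George_W._Bush": {
        "aliases": {"george bush", "george w. bush", "乔治·沃克·布什"},
        "context": {"43rd", "iraq", "2001"},
    },
}

def link(mention, context_words):
    """Pick the KB entry whose aliases include the mention and whose
    context words overlap the document's words the most.

    Returns None when no alias matches (an entity not in the database).
    """
    candidates = [
        entity_id for entity_id, entry in KB.items()
        if mention.lower() in entry["aliases"]
    ]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda entity_id: len(KB[entity_id]["context"] & context_words),
    )

print(link("George Bush", {"iraq", "2001"}))
print(link("乔治·沃克·布什", set()))
print(link("Dan Cathy", set()))
```

One ambiguous string resolves to different identifiers depending on context, and strings in different scripts resolve to the same identifier, which covers both the ambiguity and the variety problems.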
✓
✗
✗
Labels for choices?
Filter results by…
People
For items not in the database, infer a unique label (e.g., for hypothetical Wikipedia page).
Kris Stephens J. Christopher Stevens Chris Stephens …
✓
✗
✗
For items not in the database, infer a unique label (e.g., for hypothetical Wikipedia page).
Filter?
Filter results by…
People Kris Stephens (pastor) J. Christopher Stevens Chris Stephens (pastor)
✓
✗
✗
Let’s give it a try…
Filter.
Filter results by…
People Kris Stephens (pastor) J. Christopher Stevens Chris Stephens (pastor) Dan Cathy George Little …
✓
✗
Let’s give it a try…
Filter.
Add filters…
People Kris Stephens (pastor) Chris Stephens (pastor) Dan Cathy George Little …
Filtered by…
People J. Christopher Stevens
✗
✓
Let’s give it a try…
Filter.
Add filters…
People Kris Stephens (pastor) Chris Stephens (pastor) Dan Cathy George Little …
Filtered by…
People J. Christopher Stevens
✓
On a cross-lingual index, real-world entity facets can open up results across languages, which plain search strings cannot.
Filter.
Add filters…
People Kris Stephens (pastor) Chris Stephens (pastor) Dan Cathy George Little …
Filtered by…
People J. Christopher Stevens
✓
✓
Language English Chinese Arabic
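One way such a facet could be requested from Solr is sketched below. It assumes each document was indexed with a resolved entity identifier in a field named `person_id` and a language code in a field named `language`; both field names and the `entity_facet_query` helper are hypothetical. The request parameters themselves (`q`, `fq`, `facet`, `facet.field`, `facet.mincount`, `wt`) are standard Solr parameters.

```python
from urllib.parse import urlencode

def entity_facet_query(text_query, entity_id=None):
    """Build the path and query string for a Solr /select request that
    facets on resolved entities instead of raw name strings.

    `person_id` and `language` are assumed field names.
    """
    params = [
        ("q", text_query),
        ("facet", "true"),
        ("facet.field", "person_id"),
        ("facet.field", "language"),
        ("facet.mincount", "1"),
        ("wt", "json"),
    ]
    if entity_id is not None:
        # Filter on the entity itself, so matching documents in any
        # language or script are kept.
        params.append(("fq", 'person_id:"%s"' % entity_id))
    return "/select?" + urlencode(params)

print(entity_facet_query("ambassador libya", "J._Christopher_Stevens"))
```

Because the filter is on an identifier rather than a string, a document that spells the name in Chinese or Arabic passes the same filter as an English one.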
Let’s pretend you’re researching the pastors instead.
Trading off Errors
Filter results by…
People Kris Stephens (pastor) J. Christopher Stevens Chris Stephens (pastor) Dan Cathy George Little …
What if you think there are too many results (or too few)? Add a slider to make the filter finer (or coarser).
Trading off Errors
Add filters…
People J. Christopher Stevens Chris Stephens (pastor) Dan Cathy George Little …
Filtered by…
People Kris Stephens (pastor)
Make the filter finer.
Trading off Errors
Add filters…
People J. Christopher Stevens Chris Stephens (pastor) Dan Cathy George Little …
Filtered by…
People Kris Stephens (pastor)
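One plausible reading of the slider is a similarity threshold on cross-document grouping: a higher threshold merges fewer name groups (finer filters, more risk of splitting one person), a lower one merges more (coarser filters, more risk of mixing people). The sketch below assumes made-up pairwise scores in `SCORES` and simple single-link merging; neither is taken from the talk.

```python
# Hypothetical pairwise similarity scores between per-document person groups.
SCORES = {
    ("Kris Stephens", "Chris Stephens"): 0.80,
    ("Chris Stephens", "Chris Stevens"): 0.60,
    ("Kris Stephens", "Chris Stevens"): 0.40,
}

def cluster(names, threshold):
    """Single-link clustering: merge every pair scoring >= threshold."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links up to the cluster representative.
        while parent[name] != name:
            name = parent[name]
        return name

    for (a, b), score in SCORES.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())

names = ["Kris Stephens", "Chris Stephens", "Chris Stevens"]
for slider in (0.5, 0.7, 0.9):
    print(slider, cluster(names, slider))
```

Moving the slider from 0.5 to 0.9 takes the three names from one merged entity to three separate ones, which is the trade-off between the two kinds of error.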
Demo
RNI Similarity Matching “Tamerlan Tsarnaev”
And the problem only gets worse with multiple languages.
Fuzzy name search in Solr
• Facets are one way to navigate names
  o assume that you've found some interesting data with an ordinary query
  o what if you are having trouble getting started?
• Name-specific comparison search is another
• More complex algorithm than Levenshtein distance on names
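To see why plain Levenshtein distance falls short on names, consider this small sketch. It implements only the textbook edit distance, not the name-specific algorithm mentioned above, and the example names are reused from the earlier slides.

```python
def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # delete from a
                current[j - 1] + 1,                    # insert into a
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]

query = "Chris Stevens"
print(levenshtein(query, "Chris Stephens"))          # a different person
print(levenshtein(query, "J. Christopher Stevens"))  # the same person
```

The different person is 2 edits away while the same person's full name is 9 edits away, so ranking by raw edit distance puts the wrong entity first. Nicknames, initials, and transliterations across scripts make the gap wider still.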
Plugging in more complex search
• Open up the 'search component pipeline'
• First component preprocesses query
  o Maps from "Fred Chopin" to a complex Lucene query that looks for possible matches across languages and scripts
• Second component rescores results
  o detailed comparison of pairs of names to derive final score.
• Sad limitation (so far): scores not normalized to ordinary Lucene values
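The two-stage shape described here can be sketched outside Solr in a few lines. Everything below is an assumption for illustration: the `VARIANTS` table stands in for the query preprocessor, the in-memory `index` list stands in for the Lucene index, and `difflib.SequenceMatcher` stands in for the detailed name comparison.

```python
import difflib
import unicodedata

# Hypothetical variant table; a real preprocessor would generate these.
VARIANTS = {"fred": ["fred", "frederic", "fryderyk"]}

def fold(text):
    """Lowercase and strip accents so that 'Frédéric' matches 'frederic'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def expand_query(name):
    """Stage 1: cheap and recall-oriented. One set of alternatives per token."""
    return [set(VARIANTS.get(token, [token])) for token in fold(name).split()]

def candidates(index, expanded):
    """Keep names matching at least one alternative for every query token."""
    return [
        name for name in index
        if all(alts & set(fold(name).split()) for alts in expanded)
    ]

def rescore(query, names):
    """Stage 2: costlier pairwise comparison, applied to candidates only."""
    scored = [
        (difflib.SequenceMatcher(None, fold(query), fold(name)).ratio(), name)
        for name in names
    ]
    return sorted(scored, reverse=True)

index = ["Frédéric Chopin", "Fryderyk Chopin", "Fred Astaire", "Kate Chopin"]
shortlist = candidates(index, expand_query("Fred Chopin"))
for score, name in rescore("Fred Chopin", shortlist):
    print(round(score, 3), name)
```

The design point is that the expensive comparison runs only on the shortlist produced by the cheap first pass. The scores it produces live on their own scale, which mirrors the limitation noted in the last bullet.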
And it does SolrCloud, too ...
• Preprocessor runs before fan-out to shards
• Rescoring runs out on the shards
• So the work of checking candidate matches is divided up amongst the shards.
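The division of labor can be simulated in plain Python. This is a sketch under stated assumptions: shards are modeled as lists of names, the token-overlap score in `rescore_on_shard` is a placeholder for the real name comparison, and the loop over shards stands in for work that would run in parallel on separate nodes.

```python
import heapq

def preprocess(query):
    """Runs once, on the node that receives the request."""
    return set(query.lower().split())

def rescore_on_shard(shard_names, query_tokens):
    """Runs on each shard, against that shard's own candidates only."""
    scored = []
    for name in shard_names:
        tokens = set(name.lower().split())
        overlap = len(tokens & query_tokens) / len(tokens | query_tokens)
        scored.append((overlap, name))
    return scored

def search(shards, query, k=2):
    query_tokens = preprocess(query)  # before fan-out
    per_shard = [rescore_on_shard(shard, query_tokens) for shard in shards]
    # Merge step: keep the k best hits across all shards.
    return heapq.nlargest(k, (hit for hits in per_shard for hit in hits))

shards = [
    ["Chris Stevens", "Dan Cathy"],
    ["Chris Stephens", "J. Christopher Stevens"],
]
for score, name in search(shards, "Chris Stevens"):
    print(round(score, 3), name)
```

Each shard compares the query only against its own names, so the pairwise comparison cost is spread across the cluster and only small scored lists travel back for merging.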
Questions
• Suggested questions:
  – Doesn’t Google already do this?
  – Speed? Scale?
  – Multi-lingual?
  – What other uses are there for entity resolution beyond faceted search?
Doesn’t Google already do this? Some, when searching for famous entities.
Speed/Scale
• Future plans include scaling experiments
• Research version:
  – Tested up to 1M docs
  – Sub-second per document
  – Incremental updates (i.e., you see documents published minutes ago)
Other uses for entity resolution?
• Supporting relationship resolution by resolving the entities participating in them
• Knowledge base population
• Integrating disparate data sets
• Alerting
• Improving relevance of search results
• Predictive analytics
For more information: Visit www.basistech.com
Write to [email protected]
Call 617-386-2090
Thank you!
CONFERENCE PARTY
The Tipsy Crow: 770 5th Ave
Starts after Stump The Chump
Your conference badge gets you in the door
TOMORROW
Breakfast starts at 7:30
Keynotes start at 8:30
CONTACT
Benson Margulies [email protected]