Simple Fuzzy Name Matching in Elasticsearch (Paris Meetup)

Name Matching with Elasticsearch 29th of July, 2015 Declan Trezise [email protected]

Upload: basis-technology

Post on 16-Apr-2017


Page 1: Simple fuzzy name matching in elasticsearch   paris meetup

Name Matching with Elasticsearch

29th of July, 2015
Declan Trezise

[email protected]

Page 2: Simple fuzzy name matching in elasticsearch   paris meetup

Quick survey: How many of us...

● Regularly develop Elastic applications?
● Develop Elastic applications that include names of…
  ○ ...People?
  ○ ...Places?
  ○ ...Products?
  ○ ...Organisations?
  ○ ...(other entity types)?
● Have names in languages besides English?
● Want to have better name search?
● Are Elasticsearch or plugin developers?

Page 3: Simple fuzzy name matching in elasticsearch   paris meetup

Motivating Questions...

● How could a border officer know whether you’re on a terrorist watch list?

● How does your bank know if you’re wiring money to a Colombian drug lord?

● How can an ecommerce site treat “Ho-medics Ultra sonic” and “Homedics Ultrasconic” as the same thing?

● How can a system search for mentions of people across news articles?

Page 4: Simple fuzzy name matching in elasticsearch   paris meetup

Reality...

Page 5: Simple fuzzy name matching in elasticsearch   paris meetup

April 15, 2013, 2:49 PM

Page 6: Simple fuzzy name matching in elasticsearch   paris meetup
Page 7: Simple fuzzy name matching in elasticsearch   paris meetup
Page 8: Simple fuzzy name matching in elasticsearch   paris meetup
Page 9: Simple fuzzy name matching in elasticsearch   paris meetup
Page 10: Simple fuzzy name matching in elasticsearch   paris meetup
Page 11: Simple fuzzy name matching in elasticsearch   paris meetup

Real life example

David K. Murgatroyd
VP of Engineering

Boarding Pass

Page 12: Simple fuzzy name matching in elasticsearch   paris meetup

Current Best Practice?

● multi_field type with a field per possible variation (http://stackoverflow.com/questions/20632042/elasticsearch-searching-for-human-names)

"mappings": { ...
  "type": "multi_field",
  "fields": {
    "pty_surename": { "type": "string", "analyzer": "simple" },
    "metaphone": { "type": "string", "analyzer": "metaphone" },
    "porter": { "type": "string", "analyzer": "porter" }
    …

● Complex query against each field

● Generally gives high recall (but how do you get high precision too?)
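To make the "complex query against each field" concrete, here is a minimal sketch that builds one boosted match clause per subfield and combines them in a bool/should query. It assumes the parent field is called pty_surename, so the subfields are addressed as pty_surename.metaphone and pty_surename.porter. The class name, the boost values and the lack of JSON escaping are illustrative assumptions, not something taken from the slides.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Elasticsearch or of the plugin on the slides.
class MultiFieldNameQuery {

    /** Builds a bool/should query with one boosted match clause per (sub)field. */
    static String build(String queryName, Map<String, Double> fieldBoosts) {
        StringJoiner clauses = new StringJoiner(", ");
        for (Map.Entry<String, Double> e : fieldBoosts.entrySet()) {
            // NOTE: queryName is not JSON-escaped here; acceptable for a sketch only.
            clauses.add(String.format(Locale.ROOT,
                "{ \"match\": { \"%s\": { \"query\": \"%s\", \"boost\": %.1f } } }",
                e.getKey(), queryName, e.getValue()));
        }
        return "{ \"query\": { \"bool\": { \"should\": [ " + clauses + " ] } } }";
    }

    public static void main(String[] args) {
        // Assumed field names and boosts: exact match counts most, stemmed least.
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("pty_surename", 3.0);
        boosts.put("pty_surename.metaphone", 2.0);
        boosts.put("pty_surename.porter", 1.0);
        System.out.println(build("Smith", boosts));
    }
}
```

Every extra variation adds another clause and another boost to tune by hand, which is the maintenance burden the next slides try to move into a single field type.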

Page 13: Simple fuzzy name matching in elasticsearch   paris meetup

So can a name field-type do this?

● Manage all the subfields

● Contribute score that reflects phenomena

● Be part of queries using many field types

● Have multiple fields per document

● Have multiple values per field (coming soon)

Page 14: Simple fuzzy name matching in elasticsearch   paris meetup

“Jesus Alfonso Lopez Diaz”

vs.

“LobezDias, Chuy”

Page 15: Simple fuzzy name matching in elasticsearch   paris meetup

Can we do better?

● Incorporates our proprietary name matching technology

● Provides similarity scores to name pairs
● Uses Elasticsearch's Rescore query
● Allows for higher precision ranking and thresholding
● Multi-lingual name search

Page 16: Simple fuzzy name matching in elasticsearch   paris meetup

RNI

Page 17: Simple fuzzy name matching in elasticsearch   paris meetup

Elastic + RNI

Page 18: Simple fuzzy name matching in elasticsearch   paris meetup

Plug-in Implementation

Indexing
  User Doc: { name: "Robert Smith", dob: "1987/02/13" }
  Index:    { name: "Robert Smith", name.Key1: …, name.Key2: …, name.Key3: …, dob: "1987/02/13" }

Query
  User Query:    match : { name: "Bob Smitty" }
  Main Query:    bool: name.Key1: ... name.Key2: ... name.Key3: ...
  Index → subset → Rescore
  Rescore Query: name_score : { field : "name", name : "Bob Smitty" }
  Result:        name: "Robert Smith", dob: 2/13/1987, score : .79

Page 19: Simple fuzzy name matching in elasticsearch   paris meetup

Demo

Page 20: Simple fuzzy name matching in elasticsearch   paris meetup

How could you use such a Field?

● Plugin contains custom mapper which does all the work behind the scenes

PUT /ofac/ofac/_mapping
{
  "ofac" : {
    "properties" : {
      "name" : { "type" : "rni_name" },
      "aka" : { "type" : "rni_name" }
    }
  }
}

Page 21: Simple fuzzy name matching in elasticsearch   paris meetup

What happens at index time?

● NameMapper indexes keys for different phenomena in separate (sub) fields

@Override
public void parse(ParseContext context) throws IOException {
    Name name = NameBuilder.data(nameString).build();

    // Generate keys for name
    Collection<FieldSpec> fields = helper.deriveFieldsForName(name);

    // Parse each key with the appropriate Mapper
    for (FieldSpec field : fields) {
        Mapper mapper = keyMappers.get(field.getField().fieldName());
        context = context.createExternalValueContext(field.getStringValue());
        mapper.parse(context);
    }
}
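The slides do not say which keys helper.deriveFieldsForName produces, so the following is only a toy stand-in for the idea of deriving several keys from one name. The three keys (normalised tokens, a crude consonant skeleton, initials), the class name and the mapping onto name.Key1 to name.Key3 are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy illustration only; the real plugin's keys are not described on the slides.
class NameKeySketch {

    /** Lowercases, turns anything that is not a letter into a space, and splits. */
    static List<String> tokens(String name) {
        String cleaned = name.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim();
        List<String> out = new ArrayList<>();
        if (!cleaned.isEmpty()) {
            for (String t : cleaned.split(" ")) out.add(t);
        }
        return out;
    }

    /** Crude consonant skeleton: keep the first letter, drop later vowels, collapse repeats. */
    static String skeleton(String token) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean vowel = "aeiouy".indexOf(c) >= 0;
            if (i > 0 && vowel) continue;
            if (sb.length() > 0 && sb.charAt(sb.length() - 1) == c) continue;
            sb.append(c);
        }
        return sb.toString();
    }

    /** Returns subfield name -> key values, mimicking one sub-field per phenomenon. */
    static Map<String, List<String>> deriveKeys(String fieldName, String name) {
        List<String> toks = tokens(name);
        List<String> skeletons = new ArrayList<>();
        List<String> initials = new ArrayList<>();
        for (String t : toks) {
            skeletons.add(skeleton(t));
            initials.add(t.substring(0, 1));
        }
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put(fieldName + ".Key1", toks);       // exact normalised tokens
        fields.put(fieldName + ".Key2", skeletons);  // spelling-tolerant key
        fields.put(fieldName + ".Key3", initials);   // initials
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(deriveKeys("name", "Robert Smith"));
    }
}
```

At query time the same derivation would run over the query name, so that the main query can look for overlapping keys in the sub-fields.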

Page 22: Simple fuzzy name matching in elasticsearch   paris meetup

What happens at query time?

● Step #1: NameMapper generates analogous keys for a custom Lucene query that finds good candidates for re-scoring

@Override
public Query termQuery(Object value, @Nullable QueryParseContext context) {
    // Parse name string
    Name name = NameBuilder.data(value.toString()).build();
    QuerySpec spec = helper.buildQuerySpec(new NameIndexQuery(name));

    // Build Lucene query
    Query query = spec.accept(new ESQueryVisitor(names.indexName() + "."));
    return query;
}

Page 23: Simple fuzzy name matching in elasticsearch   paris meetup

What else happens at query time?

● Step #2: Uses a Rescore query to score names in the best candidate documents and reorder accordingly
  ○ Tuned for high precision name matching
  ○ Computationally expensive

"rescore" : {
  "query" : {
    "rescore_query" : {
      "function_score" : {
        "name_score" : {
          "field" : "name",
          "query_name" : "LobEzDiaS, Chuy"
        }
        ...

Page 24: Simple fuzzy name matching in elasticsearch   paris meetup

What does that function do?

● The 'name_score' function matches the query name against the indexed name in every candidate document and returns the similarity score

@Override
public double score(int docId, float subQueryScore) {
    // Create a scorer for the query name
    CachedScorer cs = createCachedScorer(queryName);

    // Retrieve name data from doc values
    // (i is the index of the value within the document; its declaration is not shown on the slide)
    nameByteData.setDocument(docId);
    Name indexName = bytesToName(nameByteData.valueAt(i).bytes);

    // Score the query against the indexed name in this document
    return cs.score(indexName);
}
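The actual scorer is proprietary and not shown, so the sketch below swaps in a deliberately naive technique: order-insensitive token matching with normalised Levenshtein distance. The class and method names are hypothetical. It shows the shape of a pairwise score in the range 0 to 1, not how the plugin computes it.

```java
import java.util.Locale;

// Naive stand-in for a name scorer; NOT the algorithm used by the plugin.
class NameScoreSketch {

    /** Classic two-row Levenshtein edit distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /** 1.0 for identical tokens, falling towards 0.0 as more edits are needed. */
    static double tokenSimilarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    static String[] tokens(String name) {
        String cleaned = name.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split(" ");
    }

    /** Averages, over the query tokens, the best similarity to any indexed token. */
    static double score(String queryName, String indexedName) {
        String[] q = tokens(queryName);
        String[] d = tokens(indexedName);
        if (q.length == 0 || d.length == 0) return 0.0;
        double total = 0.0;
        for (String qt : q) {
            double best = 0.0;
            for (String dt : d) best = Math.max(best, tokenSimilarity(qt, dt));
            total += best;
        }
        return total / q.length;
    }

    public static void main(String[] args) {
        System.out.println(score("Smith, Robert", "Robert Smith"));
        System.out.println(score("Bob Smitty", "Robert Smith"));
    }
}
```

Plain edit distance copes with reordering and small typos but knows nothing about nicknames such as "Bob" for "Robert" or "Chuy" for "Jesus", which is the kind of phenomenon a dedicated name scorer is meant to cover.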

Page 25: Simple fuzzy name matching in elasticsearch   paris meetup

Trading Off Accuracy for Speed

High Recall Query (Elastic)
→ Subset of High Recall Results (Total < windowsize & Score > minimumScoreThreshold)
→ High Precision Re-scoring Query
→ Scored Results

● window_size
  ○ Controls how many of the subset documents to rescore (imagine a HUGE name index)
  ○ Trade-off accuracy vs speed
● minScoreToCheck - (Added by Us)
  ○ Lucene score threshold subset docs must meet to be rescored
  ○ Trade-off accuracy vs speed
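The two knobs can be sketched as a filter in front of the expensive scorer. This is a simplified, single-list model written under assumptions: the hits arrive sorted by first-pass score, rescored hits are ranked ahead of the rest, and the class, method and parameter names are hypothetical. Elasticsearch itself applies the window per shard and can combine the two scores with weights.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Simplified model of windowed rescoring; not the plugin's or Elasticsearch's code.
class RescoreWindowSketch {

    static final class Hit {
        final String name;
        final double score;

        Hit(String name, double score) {
            this.name = name;
            this.score = score;
        }
    }

    /**
     * Rescores only hits that are inside the window AND above the first-pass
     * threshold; all other hits keep their original order after the rescored ones.
     * firstPass is assumed to be sorted by descending first-pass score.
     */
    static List<Hit> rescore(List<Hit> firstPass, int windowSize, double minScoreToCheck,
                             ToDoubleFunction<String> expensiveScorer) {
        List<Hit> rescored = new ArrayList<>();
        List<Hit> untouched = new ArrayList<>();
        for (int i = 0; i < firstPass.size(); i++) {
            Hit h = firstPass.get(i);
            if (i < windowSize && h.score > minScoreToCheck) {
                rescored.add(new Hit(h.name, expensiveScorer.applyAsDouble(h.name)));
            } else {
                untouched.add(h);
            }
        }
        rescored.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        rescored.addAll(untouched);
        return rescored;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
            new Hit("Bob Smith", 5.0), new Hit("Robert Smith", 4.0), new Hit("Rob Smythe", 0.5));
        // Stand-in for the expensive name scorer: favours one specific name.
        List<Hit> out = rescore(hits, 2, 1.0, n -> n.equals("Robert Smith") ? 0.9 : 0.4);
        for (Hit h : out) System.out.println(h.name + " " + h.score);
    }
}
```

A smaller window or a higher threshold means fewer calls to the expensive scorer, at the risk of never rescoring a true match that the first-pass query ranked low.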

Page 26: Simple fuzzy name matching in elasticsearch   paris meetup

What Challenges Were There?

● Design based on similar Solr plugin
● 1-2 months solo development time
● Nice plugin infrastructure
● Missing some useful javadocs/comments
● No (official) plugin development guide
● Used other plugin implementations as guides https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html#_plugins

Page 27: Simple fuzzy name matching in elasticsearch   paris meetup

Summary: How it works

● Custom field type mapping
  ○ Splits a single field into multiple fields covering different phenomena
  ○ Supports multiple name fields in a document
  ○ Intercepts the query to inject a custom Lucene query
● Custom re-score function
  ○ Re-scores documents with algorithm specific to name matching
  ○ Limits intense calculations to only top candidates
  ○ Highly configurable

Page 28: Simple fuzzy name matching in elasticsearch   paris meetup

Simple Fuzzy Name Matching with Elasticsearch

21st of July, 2015
Declan Trezise

[email protected]

Page 29: Simple fuzzy name matching in elasticsearch   paris meetup

Major terrorist attack is ‘inevitable’ as Isis fighters return, say EU officials
The Guardian, Thursday 25 September 2014

Trojan horse: ISIS militants come to Europe disguised as refugees, US intel sources claim
RT.com, Thursday 9 October 2014

Europe fears more 9/11 or 7/7 terrorist attacks delivered by rising Militant Group ISIS

Europeans are returning from Syria en masse. Many fear the rise of ISIS, but many may also be radicalised natives. This is bad not only for Europe: the US may also have to reconsider its visa waiver program for Europe.

ISIS militants may enter Europe posing as refugees

The Turkish border is the issue: poor passport control and no visa requirements mean free crossing of the Syrian-Turkish border, and many refugees take this route. It is nearly impossible to separate jihadists from legitimate refugees.

Page 30: Simple fuzzy name matching in elasticsearch   paris meetup

The problem at the border

● Land border control is at the heart of the problem
  ○ Islamic State averse to using air travel
  ○ lack of a visa requirement between Turkey and Syria
  ○ large number of refugees crossing Turkey-Syria land border
  ○ large number of European ex-pats leaving Syria
  ○ The Islamic State ‘Trojan horse’ - Jihadist terrorists radicalised from ex-pats or posing as refugees
● There is currently a high dependence on visas for preventing movements
  ○ This reliance could be relieved by having effective control at the borders for name / identity checking

(Image caption: Islamic State militant poses with flag)

Page 31: Simple fuzzy name matching in elasticsearch   paris meetup

The problem in numbers

● ~5000 EU citizens currently fighting alongside IS in Syria
● Including 500+ German citizens
● 1000+ French citizens, only ~150 returned to France so far (Charlie Hebdo was only a handful)
● Compare with fewer than 200 EU citizens who fought alongside al-Qaeda / Taliban
● FBI watchlist: 400,000 suspects, 1M aliases, 1,600 new names, 600 deletions, 4,800 corrections every single day!

Page 32: Simple fuzzy name matching in elasticsearch   paris meetup

“Chuy”

Page 33: Simple fuzzy name matching in elasticsearch   paris meetup
Page 34: Simple fuzzy name matching in elasticsearch   paris meetup
Page 35: Simple fuzzy name matching in elasticsearch   paris meetup
