gids2016

Real Time Fuzzy Matching With Spark and ElasticSearch

Upload: sonal-goyal

Post on 13-Jan-2017






Page 1: Gids2016

Real Time Fuzzy Matching With Spark and ElasticSearch

Page 2: Gids2016

BFSI

Page 3: Gids2016

Wilful Defaulters?

Page 4: Gids2016

Sanctions Screening

PEP

HMT

OFAC SDN

..and many others

Page 5: Gids2016

However ...

7TH OF TIR

7TH OF TIR COMPLEX

7TH OF TIR INDUSTRIAL COMPLEX

7TH OF TIR INDUSTRIES

7TH OF TIR INDUSTRIES OF ISFAHAN/ESFAHAN

SEVENTH of Tir
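The variants above defeat exact string comparison. As a minimal sketch of what a fuzzy comparison looks like (not the method used in the talk), the snippet below scores each variant against one query with Python's standard `difflib`. The `normalize` and `similarity` helpers are hypothetical names introduced here for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so case and spacing do not matter."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] based on difflib's ratio."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Name variants taken from the slide.
variants = [
    "7TH OF TIR",
    "7TH OF TIR COMPLEX",
    "7TH OF TIR INDUSTRIAL COMPLEX",
    "7TH OF TIR INDUSTRIES",
    "7TH OF TIR INDUSTRIES OF ISFAHAN/ESFAHAN",
    "SEVENTH of Tir",
]

query = "7TH OF TIR"
for variant in variants:
    exact = normalize(query) == normalize(variant)
    print(f"exact={exact!s:5}  score={similarity(query, variant):.2f}  {variant}")
```

Only the first variant matches exactly, while every other variant still gets a non-zero score that a threshold or a learned model could act on.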

Page 6: Gids2016

Entity Resolution

Page 7: Gids2016

Directory Listings


Dew Drops, Shop no - A-152, super mart 1, Gurgaon - 122001, DLF Phase 4

DewDrop Florist, A 152, DLF City Phase 4, Near Galleria Market, Super Mart 1

Page 8: Gids2016

Ecommerce

Cherry Mobile Amethyst Android 4.2 Jelly Bean (Black) with Free Smart and Globe SIM

Cherry Mobile Amethyst (White) with 1 Smart SIM

CHERRY MOBILE AMETHYST + 1 SMART SIM

Cherry Mobile Amethyst Android 4.2 Jelly Bean

Cherry Mobile Amethyst (White) with 1 Samsung Galaxy V

CHERRY MOBILE AMETHYST + 1 SAMSUNG GALAXY V. + 1 SMART AND GLOBE SIM

Page 9: Gids2016

Government of ..

● Benefit rollouts
● Surveillance
● Licenses
● Linking NPR with Passport

Page 10: Gids2016

360 view

ID     Company   Name            Project
12345  UBM Asia  Dave Chan       HK - Fine Jewellery
13222  UBM A     Dave C          HK - Fashion Jewellery
15656  UBM       Davechan        HK - Beauty
14456  ubmAsia   Mr. Dave CChan  HK - Fine Jewellery

Page 11: Gids2016

“In order to be irreplaceable, one must always be different.”

― Coco Chanel

Page 12: Gids2016

Other uses

● Cross selling
● Data Quality
● Vendor consolidation
● Master Data Management
● CRM Deduplication

Page 13: Gids2016

Challenges

● Discovering and maintaining rules is extremely tough

● Custom coding and domain specific logic makes maintenance a nightmare

● No one size fits all, big custom implementations needed every time even after using existing tools

Page 14: Gids2016

Challenges..

● High data volumes
● Each record has multiple dimensions
● Exact matches are rare
● Comparing each record with every other is not possible
● Languages have unique issues
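To see why comparing every record with every other does not scale, note that n records produce n(n-1)/2 pairs. The sketch below computes that count and shows blocking, one common way to cut it down by comparing only records that share a cheap key. The key used here (the first three alphanumeric characters) is a hypothetical choice for illustration, not something the slides prescribe.

```python
from collections import defaultdict
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered record pairs among n records."""
    return n * (n - 1) // 2


def blocking_key(name: str) -> str:
    """Hypothetical key: first three alphanumeric characters, lowercased."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum())
    return cleaned[:3]


def candidate_pairs(records):
    """Yield only the pairs whose records fall into the same block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for members in blocks.values():
        yield from combinations(members, 2)


records = ["Dew Drops", "DewDrop Florist", "UBM Asia", "ubmAsia", "Cherry Mobile Amethyst"]

print("all pairs:      ", pair_count(len(records)))
print("blocked pairs:  ", len(list(candidate_pairs(records))))
print("pairs for 1M:   ", pair_count(1_000_000))
```

A key this crude trades recall for speed: "SEVENTH of Tir" and "7TH OF TIR" would land in different blocks, so a real system needs keys that tolerate such variation.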

Page 15: Gids2016

Let's start wishing...

● Data variety
● Scalable
● No manual configuration of rules or algorithms
● Multi language
● Real time

Page 16: Gids2016

Our Approach

- Learn from the data
- Divide the load

Page 17: Gids2016

Reifier Workflow

[Workflow diagram. Nodes: Configure data; Have training data? (Yes / No); Reifier Interactive Learner; Reifier Match; Linked Result]

Page 18: Gids2016

1. Select Data

Page 19: Gids2016

2. Field Selection and Stop Words

Page 20: Gids2016

Strata Hadoop World Singapore 2015

3. Choose Training Set

Page 21: Gids2016


4. Run the Spark Job

Page 22: Gids2016


5. Enjoy the results

Page 23: Gids2016


Fuzzy Match in Reifier – Stop words

At the beginning (without Chinese stop words):

亚洲博闻有限公司  Dave Chan
亚洲华乐有限公司  David Chan

In this case, the similarity between the 2 records is very high, because both company names share 亚洲 ("Asia") and 有限公司 ("Company Limited").

What if we add the stop words (亚洲, 有限公司)?

博闻  Dave Chan
华乐  David Chan

The company names for these records now do not match at all, and the system will not group them together.
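The stop-word effect on this slide can be reproduced with a small sketch. It uses Python's standard `difflib` as a stand-in similarity measure, which is an assumption; the slides do not say which measure Reifier uses. `strip_stop_words` is a hypothetical helper.

```python
from difflib import SequenceMatcher

# Stop words named on the slide: 亚洲 ("Asia") and 有限公司 ("Company Limited").
STOP_WORDS = ["亚洲", "有限公司"]


def strip_stop_words(name: str, stop_words=STOP_WORDS) -> str:
    """Remove every occurrence of each stop word from the name."""
    for word in stop_words:
        name = name.replace(word, "")
    return name


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] based on difflib's ratio."""
    return SequenceMatcher(None, a, b).ratio()


first, second = "亚洲博闻有限公司", "亚洲华乐有限公司"

raw_score = similarity(first, second)
stripped_score = similarity(strip_stop_words(first), strip_stop_words(second))

print(f"without stop words configured: {raw_score:.2f}")
print(f"with stop words removed:       {stripped_score:.2f}")
```

With the shared boilerplate characters removed, only 博闻 and 华乐 are compared, and they have no characters in common.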

Page 24: Gids2016

Reifier Interactive Learner

Page 25: Gids2016

Reifier Interactive Learner

Page 26: Gids2016

Reifier Interactive Learner

Page 27: Gids2016

Reifier Interactive Learner

Page 28: Gids2016

Spark Benefits

● Distributed
● Scalable
● Fast
● Machine Learning
● Sampling
● No need to orchestrate multiple jobs

Page 29: Gids2016

Real Time

Spark + ElasticSearch
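The slide does not say how the two systems are wired together. One plausible arrangement, assumed here, is that Spark handles batch learning and matching while Elasticsearch retrieves a handful of candidate records for each incoming record in real time. The sketch below only builds the request body for such a lookup, using the Elasticsearch `match` query with its `fuzziness` option. The field name `name`, the result size, and the helper `build_candidate_query` are hypothetical.

```python
import json


def build_candidate_query(name: str, field: str = "name", size: int = 10) -> dict:
    """Build a search body that retrieves fuzzy candidates for one record.

    The field name and size are illustrative assumptions.
    """
    return {
        "size": size,
        "query": {
            "match": {
                field: {
                    "query": name,
                    # "AUTO" lets Elasticsearch choose the allowed edit
                    # distance from the length of each term.
                    "fuzziness": "AUTO",
                }
            }
        },
    }


body = build_candidate_query("7TH OF TIR INDUSTRIES")
print(json.dumps(body, indent=2))
```

The returned candidates would then be scored by the learned similarity model, so the expensive comparison runs on a few records rather than the whole index.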

Page 30: Gids2016

Advantages

● Point and Shoot - Zero config
● Learning similarity definitions from data
  ■ No hard coding of business rules
  ■ Domain agnostic
  ■ Handle multiple languages (English, Chinese, Japanese, Thai)

Page 31: Gids2016

Advantages

● Scalability

● Real time as well as batch