atlhug 20150625

Cardlytics & Drill Use Case: Matching Big Data David Kim Principal Engineer 2015.06.25

Upload: mapr-technologies

Post on 03-Aug-2015


Page 1: Atlhug 20150625

Cardlytics & Drill Use Case: Matching Big Data

David Kim

Principal Engineer

2015.06.25

Page 2: Atlhug 20150625

About Cardlytics

© 2013 Cardlytics. Proprietary and Confidential. 2

•  Privately held company leveraging proprietary purchase-driven intelligence platform to provide actionable insights into consumer behavior to numerous organizations using consumer purchase data that we have exclusive rights to

•  Founded in 2008 by Scott Grimes (CEO) and Lynne Laube (COO), both former executives at Capital One

•  Headquartered in Atlanta, we have 320 employees with offices in NY, Chicago, San Francisco & London

•  Owns multiple patents and nearly 700 banking relationships in the US and the UK representing over 100 million households and $1 trillion in yearly spend

Page 3: Atlhug 20150625

Problem Statement

A customer (advertiser) requested analysis that would give them insight into their own business and customer base so they could make better business decisions.

•  Must match advertiser customers to Cardlytics customers

•  Matches must be highly confident and unique


Page 4: Atlhug 20150625

Our Approach: Pattern Matching

[Figure: diagram with a time axis]


Page 5: Atlhug 20150625

Challenges

•  Matches must be unique

•  Matches must be highly confident

•  Limited information available to match data points (no PII)

•  Missing data points

•  Scale (Drill)

»  Depending on the advertiser, data points are sparse or densely packed


Page 6: Atlhug 20150625

Scale Issues with Dense Data Points


Page 7: Atlhug 20150625

Scale Issues

•  60M x 40M = 2.4 quadrillion (2.4 x 10^15) potential matches evaluated

•  120M x 120M = 14.4 quadrillion (1.44 x 10^16) potential matches evaluated

•  590M x 130M = 76.7 quadrillion (7.67 x 10^16) potential matches evaluated
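As a quick check on the scale, the number of potential matches is simply the product of the two data-set sizes. A minimal Python sketch; the names `POCS` and `potential_matches` are illustrative and not from the talk, and the counts are the rounded figures on the slide:

```python
# Data-set sizes (advertiser points, Cardlytics points) as given on the slide.
POCS = {
    "POC 1": (60_000_000, 40_000_000),
    "POC 2": (120_000_000, 120_000_000),
    "POC 3": (590_000_000, 130_000_000),
}


def potential_matches(left: int, right: int) -> int:
    """Candidate pairs when every left point is compared with every right point."""
    return left * right


if __name__ == "__main__":
    for name, (left, right) in POCS.items():
        # Scientific notation keeps the order of magnitude easy to read.
        print(f"{name}: {potential_matches(left, right):.2e} candidate pairs")
```

Each product is on the order of 10^15 to 10^16 pairs, which is why a naive pairwise comparison is impractical on a single server.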


Page 8: Atlhug 20150625

Our Environment…

SQL Server: 64 cores (32 physical), 256GB RAM, direct-attached storage w/enterprise disks

Hadoop Cluster: 10 nodes, 32 cores/node, 128GB RAM, 12 x 4TB consumer grade disks


Page 9: Atlhug 20150625

Actual Results…

•  POC 1: 60M customer data points x 40M Cardlytics data points collected over 2 years

»  SQL Server: ~20 hours

•  POC 2: 120M x 120M over 6 months

»  SQL Server: 1-2 months

»  Hive: Killed after several days (estimated to take about a week)

»  Drill: 17-18 hours yielding 91+B matching data points

•  POC 3: 590M x 130M over 1 year

»  Drill: ~17 hours to yield 1.3T matches and 72TB

»  Required some tweaking and turning some secret knobs


…PROBABLY

Page 10: Atlhug 20150625

…from the MapR Drill team

Compliments of Jacques Nadeau/Aman Sinha

•  store.format

•  store.parquet.block-size

•  planner.broadcast_threshold

•  planner.broadcast_factor

•  planner.join.row_count_estimate_factor

•  planner.enable_multiphase_agg

•  planner.enable_mux_exchange

•  exec.min_hash_table_size

•  planner.enable_hashjoin

•  select * from sys.options;
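Drill options such as these can be changed for the current session with an `ALTER SESSION SET` statement, and `sys.options` shows the values in effect. The sketch below only renders such statements as text. The helper `render_option` and every value in `EXAMPLE_SETTINGS` are placeholders chosen for illustration; the talk does not say which values Cardlytics actually used.

```python
def render_option(name: str, value) -> str:
    """Render one Drill `ALTER SESSION SET` statement for a session-level option."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        literal = "true" if value else "false"
    elif isinstance(value, (int, float)):
        literal = str(value)
    else:
        # String options are single-quoted; embedded quotes are doubled.
        literal = "'" + str(value).replace("'", "''") + "'"
    return f"ALTER SESSION SET `{name}` = {literal};"


# Placeholder values only (assumptions, not the settings used in the POCs).
EXAMPLE_SETTINGS = {
    "store.format": "parquet",
    "planner.enable_hashjoin": True,
    "planner.broadcast_threshold": 10_000_000,
}

if __name__ == "__main__":
    for option, value in EXAMPLE_SETTINGS.items():
        print(render_option(option, value))
```

Checking `sys.options` before and after a change is a simple way to confirm that a setting took effect for the session.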


Page 11: Atlhug 20150625

Other Nuggets

•  Drill is memory intensive

•  You will always know more about your data than Drill

•  Hadoop and Drill are great tools but don’t solve stupidity

•  Some of the basic principles of querying a dataset still apply

»  Intelligent batching

»  Applying filters early to work with smaller datasets

»  Bringing back only the data that you need

»  Partitioning

»  Understanding the configurations and internals of your tools
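To illustrate why batching and early filtering matter at this scale, here is a rough arithmetic sketch. It assumes, purely for illustration, that both data sets split evenly into time windows and that points only need to be compared within the same window; neither assumption comes from the talk, and `pairs_with_batching` is a hypothetical helper.

```python
def pairs_with_batching(left: int, right: int, batches: int) -> int:
    """Candidate pairs when both sides are split evenly into `batches` windows
    and points are compared only within the same window."""
    if batches < 1:
        raise ValueError("batches must be at least 1")
    per_left = left // batches    # points per window on the left side
    per_right = right // batches  # points per window on the right side
    return batches * per_left * per_right


if __name__ == "__main__":
    full = pairs_with_batching(120_000_000, 120_000_000, 1)
    monthly = pairs_with_batching(120_000_000, 120_000_000, 6)
    print(f"single batch: {full:.2e} pairs")
    print(f"six windows:  {monthly:.2e} pairs")
```

Under these assumptions, splitting into k windows cuts the candidate pairs by a factor of k, since k * (N/k) * (M/k) = N * M / k. Real data is rarely this evenly spread, so the actual saving depends on how the points are distributed.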


Page 12: Atlhug 20150625

"Louis, I think this is the beginning of a beautiful friendship."

Our close partnership with MapR includes…

•  Semi-weekly check-ins with Drill dev team

•  Weekly check-ins with MapR product managers

•  Improving Drill with real world applications, tests, and data

•  Input to future roadmap

»  Large IN-clause

»  DST support

»  Auto-partitioning

»  Windowing functions

»  Support for inserts


Page 13: Atlhug 20150625

Grab a seat at the cool kids’ table!!

Careers @Cardlytics

http://cardlytics.com/cardlytics/?s=career

Apache Drill

https://drill.apache.org/

https://drill.apache.org/docs/

MapR

https://www.mapr.com/products/product-overview/apache-drill


Michael Fabacher, VP of Data Development

David Kim, Principal Engineer