ExStreamly Cheap - Insight Data Engineering 2016a Project

…where the best deals find you in real time. Emmanuel Awa

Upload: emmanuel-awa

Post on 14-Apr-2017

TRANSCRIPT

Page 1: ExStreamly Cheap - Insight Data Engineering 2016a Project

…where the best deals find you in real time.

Emmanuel Awa

Page 2: ExStreamly Cheap - Insight Data Engineering 2016a Project

For the love of deals: we all just love them.

Real world engineering challenge.

MOTIVATION

Page 3: ExStreamly Cheap - Insight Data Engineering 2016a Project

ONE platform: searches and shopping inspired by the user's preferences.

MOTIVATION

Page 4: ExStreamly Cheap - Insight Data Engineering 2016a Project

Sqoot API. Scaled to all categories offered by the API.

Sample Data

Page 5: ExStreamly Cheap - Insight Data Engineering 2016a Project

User Interaction – Engineered 1B users

Current Data Source

Page 6: ExStreamly Cheap - Insight Data Engineering 2016a Project

Any trending deals?

Top selling providers

Categorize deals based on price and discount percentages.

Friends purchase pattern

Sample Queries.
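
As an illustration of the "categorize deals based on price and discount percentages" query, here is a minimal Python sketch. The bucket boundaries, the bucket labels and the assumption that `discount_percentage` is a fraction between 0 and 1 are all hypothetical; only the field names `price`, `discount_percentage` and `id` come from the sample table in this deck.

```python
from collections import defaultdict

# Hypothetical bucket boundaries (upper bound, label); not taken from the slides.
PRICE_BUCKETS = [(25.0, "under_25"), (100.0, "25_to_100")]
DISCOUNT_BUCKETS = [(0.25, "low"), (0.50, "medium")]


def bucket(value, buckets, overflow):
    """Return the label of the first bucket whose upper bound exceeds value."""
    for upper, label in buckets:
        if value < upper:
            return label
    return overflow


def categorize(deals):
    """Group deal ids by (price bucket, discount bucket)."""
    groups = defaultdict(list)
    for deal in deals:
        key = (
            bucket(deal["price"], PRICE_BUCKETS, "over_100"),
            bucket(deal["discount_percentage"], DISCOUNT_BUCKETS, "high"),
        )
        groups[key].append(deal["id"])
    return dict(groups)
```

In the pipeline itself this grouping would presumably run in the batch or serving layer rather than in plain Python; the sketch only shows the bucketing logic.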

Page 7: ExStreamly Cheap - Insight Data Engineering 2016a Project

Complex queries? Real time response?

Sample Queries.

Page 8: ExStreamly Cheap - Insight Data Engineering 2016a Project

Current Pipeline

API

INGESTION

BATCH LAYER

SERVING LAYER

Hybrid Streaming

API Interaction and deals collection

Page 9: ExStreamly Cheap - Insight Data Engineering 2016a Project

API DESIGN Bad or Good?

Biggest Engineering Challenges

Page 10: ExStreamly Cheap - Insight Data Engineering 2016a Project

Pagination limits and constant API updates.

http://api.sqoot.com/v2/deals?api_key=xxxxxx;category_slug=home_goods;page=1;per_page=100

Freezing time for a real-time, non-fire-hose data source is hard.

Data Source Constraints
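
To make the pagination constraint concrete, the request URL above can be assembled per page. This is a sketch only: the helper name `page_url` is hypothetical, and the `;` separator and parameter names are copied from the example URL on this slide rather than from any API reference.

```python
from urllib.parse import quote

BASE_URL = "http://api.sqoot.com/v2/deals"


def page_url(api_key, category_slug, page, per_page=100):
    """Build one paginated request URL in the shape shown on the slide."""
    params = [
        ("api_key", api_key),
        ("category_slug", category_slug),
        ("page", page),
        ("per_page", per_page),
    ]
    # The slide's URL separates parameters with ';' rather than '&'.
    query = ";".join(f"{name}={quote(str(value), safe='')}" for name, value in params)
    return f"{BASE_URL}?{query}"
```

Because each page is a separate request, the result set can shift between requests, which is the "freezing time" problem named above.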

Page 11: ExStreamly Cheap - Insight Data Engineering 2016a Project

Biggest Project Challenge

Three queries issued at the same time return inconsistent results. Not fun. Pagination depends largely on the total.

New Page refresh New

Page 12: ExStreamly Cheap - Insight Data Engineering 2016a Project

ASYNC DISTRIBUTED QUERYING ENGINE

First Stage Master Producer (FSM)

Intermediate Hybrid Consumer-Producer

Final Stage Consumer

Design to solve this?

Page 13: ExStreamly Cheap - Insight Data Engineering 2016a Project

Architecture

Page 14: ExStreamly Cheap - Insight Data Engineering 2016a Project

FIRST STAGE MASTER

Compute page chunks. Leaky bucket approach.
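
A rough sketch of what the first stage could look like, assuming the master knows the total number of deals and splits the resulting pages into fixed-size chunks, then releases them at a steady rate. The names `page_chunks` and `LeakyBucket`, the chunk size and the rate are all hypothetical; the slides name the technique but do not show its implementation.

```python
import math
import time


def page_chunks(total, per_page=100, chunk_size=10):
    """Split the pages needed for `total` items into inclusive (first, last) ranges."""
    pages = math.ceil(total / per_page)
    return [
        (start, min(start + chunk_size - 1, pages))
        for start in range(1, pages + 1, chunk_size)
    ]


class LeakyBucket:
    """Release work at a constant rate: at most one item per 1/rate seconds."""

    def __init__(self, rate_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate_per_sec
        self.clock = clock
        self.sleep = sleep
        self.next_slot = clock()  # earliest time the next item may leave

    def acquire(self):
        """Block until the next slot is free; return the seconds waited."""
        now = self.clock()
        wait = self.next_slot - now
        if wait > 0:
            self.sleep(wait)
            now = self.next_slot
        self.next_slot = now + self.interval
        return max(wait, 0.0)
```

A master loop would call `acquire()` before publishing each chunk, so downstream workers never hit the API faster than the configured rate.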

Page 15: ExStreamly Cheap - Insight Data Engineering 2016a Project

FIRST STAGE MASTER Cont’d

Page 16: ExStreamly Cheap - Insight Data Engineering 2016a Project

HYBRID CONSUMER-PRODUCER

Fetch and produce actual data.
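
The hybrid stage consumes page chunks and produces the deals fetched for those pages. In the minimal sketch below, `queue.Queue` objects stand in for the Kafka topics, and `fetch_page` is a hypothetical callable that returns the deals for one page; neither is taken from the project's code.

```python
import queue


def hybrid_worker(chunks_in, deals_out, fetch_page):
    """Consume (first, last) page ranges, fetch each page, and emit the deals."""
    fetched = 0
    while True:
        chunk = chunks_in.get()
        if chunk is None:  # sentinel: the master has no more chunks
            break
        first, last = chunk
        for page in range(first, last + 1):
            for deal in fetch_page(page):
                deals_out.put(deal)
                fetched += 1
    return fetched
```

Several such workers can share the same input queue, which is roughly what a balanced consumer group gives you on a partitioned Kafka topic.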

Page 17: ExStreamly Cheap - Insight Data Engineering 2016a Project

FINAL STAGE CONSUMER

Persist data - HDFS
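
One way the final stage might batch records before persisting them is sketched below. It writes newline-delimited JSON files to a local directory as a stand-in; the actual upload to HDFS is left out, and the function name `persist`, the file naming scheme and the batch size are assumptions.

```python
import json
import os


def persist(deals, out_dir, batch_size=1000):
    """Write deals to numbered JSON-lines files, batch_size records per file."""
    os.makedirs(out_dir, exist_ok=True)
    paths, batch = [], []

    def flush():
        path = os.path.join(out_dir, f"deals-{len(paths):05d}.jsonl")
        with open(path, "w", encoding="utf-8") as handle:
            for record in batch:
                handle.write(json.dumps(record, sort_keys=True) + "\n")
        paths.append(path)
        batch.clear()

    for deal in deals:
        batch.append(deal)
        if len(batch) >= batch_size:
            flush()
    if batch:  # write the final partial batch
        flush()
    return paths
```

Batching keeps the number of files small, which matters on HDFS because many tiny files are costly for the NameNode.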

Page 18: ExStreamly Cheap - Insight Data Engineering 2016a Project

Nigerian. Master's in Computer Science, Brandeis University, MA. Software Engineer, 2 ½ years. Hobbyist photographer.

About Me.

Page 19: ExStreamly Cheap - Insight Data Engineering 2016a Project

PyKafka vs. Kafka-Python. Balanced consumer. Topic to partition assignment – Hash partitioning.

Engineering architecture to handle complex real world data source.

Deep dive. Tweak source code for use case.

DevOps

General learning curves.

Other Challenges
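
Hash partitioning, mentioned above, maps each message key to a fixed partition so that messages with the same key always land together. The sketch below illustrates the idea with `zlib.crc32` as a stand-in hash; Kafka client libraries use their own hash functions, so the partition numbers produced here are illustrative only, and `partition_for` is a hypothetical helper.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a string key to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

Keying messages by category slug, for example, would keep every deal of one category on the same partition and therefore in order for a single consumer.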

Page 20: ExStreamly Cheap - Insight Data Engineering 2016a Project

CREATE TABLE trending_categories_with_price (
    category text,
    created_at timestamp,
    updated_at timestamp,
    expires_at timestamp,
    description text,
    fine_print text,
    price float,
    discount_percentage float,
    id bigint,
    merchant_address text,
    merchant_country text,
    merchant_id bigint,
    merchant_latitude text,
    merchant_longitude text,
    merchant_locality text,
    merchant_name text,
    merchant_phone_number text,
    merchant_region text,
    number_sold float,
    online boolean,
    provider_name text,
    title text,
    url text,
    PRIMARY KEY ((price), category, discount_percentage)
) WITH CLUSTERING ORDER BY (category ASC, discount_percentage DESC);

Sample tables

Page 21: ExStreamly Cheap - Insight Data Engineering 2016a Project

Elasticsearch or Cassandra or Elasticsearch on Cassandra

Elasticsearch: good at preserving indexed data. Great for workloads with more reads than writes. Analytics. Search.

Cassandra: good for fast writes. Preserves the data schema. Uptime critical. Time series.

Elasticsearch vs Cassandra

Page 22: ExStreamly Cheap - Insight Data Engineering 2016a Project

Benchmarking Pipeline

API

INGESTION

BATCH LAYER

SERVING LAYER

Hybrid Streaming

API Interaction and deals collection