exstreamly cheap - insight data engineering 2016a project

Post on 14-Apr-2017

…where the best deals find you in real time.

Emmanuel Awa

For the love of deals, we all just love it.

Real world engineering challenge.

MOTIVATION

ONE platform: user-preference-inspired searches and shopping.

MOTIVATION

Sqoot API. Scaled to all categories offered by the API.

Sample Data

User Interaction – Engineered 1B users

Current Data Source

Any trending deals?

Top selling providers

Categorize deals based on price and discount percentages.

Friends purchase pattern
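A minimal sketch of the "categorize deals based on price and discount percentages" query listed above. The field names `price` and `discount_percentage` match the sample table in this deck; the bucket edges and labels are hypothetical, and the sketch assumes the discount is stored as a fraction between 0 and 1.

```python
def bucket(value, edges):
    """Return the index of the first edge that value falls below, else len(edges)."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


def categorize(deal, price_edges=(25, 100), discount_edges=(0.25, 0.5)):
    """Label a deal by price band and discount band.

    The edges and labels are illustrative assumptions, not values from the slides.
    """
    price_labels = ("low", "mid", "high")
    discount_labels = ("small", "medium", "large")
    return (
        price_labels[bucket(deal["price"], price_edges)],
        discount_labels[bucket(deal["discount_percentage"], discount_edges)],
    )
```

In the pipeline itself this grouping would more likely be done in the batch layer or by the serving-layer table design; the function only illustrates the idea.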

Sample Queries.

Complex queries? Real time response?

Sample Queries.

Current Pipeline

API

INGESTION

BATCH LAYER

SERVING LAYER

Hybrid Streaming

API Interaction and deals collection

API DESIGN Bad or Good?

Biggest Engineering Challenges

Pagination limits and constant API updates.

http://api.sqoot.com/v2/deals?api_key=xxxxxx;category_slug=home_goods;page=1;per_page=100

Freezing time for a real-time, non-fire-hose data source is hard.
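To make the pagination constraint concrete, here is a rough sketch of how the page URLs for one category could be enumerated. The URL layout is copied from the example above; `total_deals` is an assumed input (the total reported for the category at query time), and the function name is hypothetical.

```python
import math

BASE_URL = "http://api.sqoot.com/v2/deals"


def page_urls(api_key, category_slug, total_deals, per_page=100):
    """Return one URL per page needed to cover total_deals results."""
    pages = math.ceil(total_deals / per_page)
    return [
        f"{BASE_URL}?api_key={api_key};category_slug={category_slug}"
        f";page={page};per_page={per_page}"
        for page in range(1, pages + 1)
    ]
```

Because the total can change between requests, a list computed this way is only a snapshot, which is the "freezing time" problem the slide describes.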

Data Source Constraints

Biggest Project Challenge

Three queries issued at the same time return inconsistent results. Not fun. Pagination depends largely on the total count.

New Page refresh New

ASYNC DISTRIBUTED QUERYING ENGINE

First Stage Master Producer (FSM)

Intermediate Hybrid Consumer-Producer

Final Stage Consumer

Design to solve this?

Architecture

FIRST STAGE MASTER

Compute page chunks

Leaky bucket approach
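The two ideas on this slide can be sketched as follows. This is an illustration under assumptions, not the project's code: `page_chunks` and `LeakyBucket` are hypothetical names, and the bucket is the simplest pacing variant (one release per fixed interval).

```python
import time


def page_chunks(total_pages, chunk_size):
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


class LeakyBucket:
    """Pace callers so that at most one proceeds every `interval` seconds."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock      # injectable so the pacing can be tested
        self.sleep = sleep
        self.next_release = clock()

    def wait(self):
        """Block until the next slot is free, then reserve the following one."""
        now = self.clock()
        if now < self.next_release:
            self.sleep(self.next_release - now)
            now = self.next_release
        self.next_release = now + self.interval
```

A master process could call `wait()` before publishing each chunk, so that downstream fetchers never exceed the API's request rate.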

FIRST STAGE MASTER Cont’d

HYBRID CONSUMER-PRODUCER

Fetch and produce actual data.

FINAL STAGE CONSUMER

Persist data - HDFS
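The three stages above can be sketched end to end with in-memory queues. This is a simplified stand-in: `queue.Queue` plays the role of the message topics between stages, a list replaces HDFS, and `fetch_page` is a hypothetical callable that returns the deals for one page.

```python
import queue


def first_stage_master(total_pages, chunk_size, chunk_topic):
    """Publish page ranges only; this stage never fetches deals itself."""
    for start in range(1, total_pages + 1, chunk_size):
        chunk_topic.put((start, min(start + chunk_size - 1, total_pages)))


def hybrid_consumer_producer(chunk_topic, deal_topic, fetch_page):
    """Consume page ranges, fetch each page, and publish the deals found."""
    while not chunk_topic.empty():
        first, last = chunk_topic.get()
        for page in range(first, last + 1):
            for deal in fetch_page(page):
                deal_topic.put(deal)


def final_stage_consumer(deal_topic, sink):
    """Drain fetched deals into the sink (HDFS in the slides, a list here)."""
    while not deal_topic.empty():
        sink.append(deal_topic.get())
```

Run in sequence, the stages move every page's deals from the master's chunk list into the sink. In a distributed setting each stage would run as its own set of processes and loop indefinitely instead of stopping when its queue is empty.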

Nigerian. Master's in Computer Science – Brandeis University, MA. Software Engineer, 2 ½ years. Hobbyist photographer.

About Me.

PyKafka vs. Kafka-Python. Balanced consumer. Topic to partition assignment – Hash partitioning.
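As a rough illustration of hash partitioning, the sketch below maps a message key to a partition number. It uses CRC32 only because it is deterministic and in the standard library; real Kafka clients ship their own hash functions, so this is not the partitioner of PyKafka or kafka-python. The function name is hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a string key to a partition index in [0, num_partitions)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

Messages that share a key (for example, a category slug) always land in the same partition, which preserves their relative order for whichever consumer owns that partition.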

Engineering architecture to handle complex real world data source.

Deep dive. Tweak source code for use case.

DevOps

General learning curves.

Other Challenges

CREATE TABLE trending_categories_with_price (
    category text,
    created_at timestamp,
    updated_at timestamp,
    expires_at timestamp,
    description text,
    fine_print text,
    price float,
    discount_percentage float,
    id bigint,
    merchant_address text,
    merchant_country text,
    merchant_id bigint,
    merchant_latitude text,
    merchant_longitude text,
    merchant_locality text,
    merchant_name text,
    merchant_phone_number text,
    merchant_region text,
    number_sold float,
    online boolean,
    provider_name text,
    title text,
    url text,
    PRIMARY KEY ((price), category, discount_percentage)
) WITH CLUSTERING ORDER BY (category ASC, discount_percentage DESC);

Sample tables

Elasticsearch or Cassandra or Elasticsearch on Cassandra

Elasticsearch – Good at preserving indexed data. Great for more reads than writes. Analytics. Search.

Cassandra – Good for fast writes. Preserves data schema. Uptime critical. Time series.

Elasticsearch vs Cassandra

Benchmarking Pipeline

API

INGESTION

BATCH LAYER

SERVING LAYER

Hybrid Streaming

API Interaction and deals collection
