exstreamlycheap final slides

ExStreamlyCheap.club

Upload: emmanuel-awa

Post on 14-Apr-2017

Category: Software


TRANSCRIPT

Page 1: ExStreamlycheap Final Slides

ExStreamlyCheap.club

Page 2: ExStreamlycheap Final Slides

hello!

I am Emmanuel Awa, Insight Data Engineering Fellow, 2016 Spring cohort.

You can find me at @awaemma

Page 3: ExStreamlycheap Final Slides

content.

1. Project motivation.

2. Engineering solution.

3. Challenges.

4. Takeaways.

Page 4: ExStreamlycheap Final Slides

1.

Project Motivation

...what did I spend three weeks working on?

Page 5: ExStreamlycheap Final Slides

“I have learnt to seek my happiness in limiting my desires, rather than attempting to satisfy them.”

~ John Stuart Mill

Page 6: ExStreamlycheap Final Slides

Big concept

1. Real time scalable deals serving platform.

2. Insights and searching capabilities.

3. Maximizing time and profit.

Page 7: ExStreamlycheap Final Slides

sample queries

● User’s holistic view - real time trends visualization.

Page 8: ExStreamlycheap Final Slides

sample queries

● One highly scalable search platform: price or discount options.

Page 9: ExStreamlycheap Final Slides

sample queries

● Engineer user purchase interaction…

Page 10: ExStreamlycheap Final Slides

sample queries

● … and reaction.

Page 11: ExStreamlycheap Final Slides

2.

Engineering Solution

...Finding the right tools for the job.

Page 12: ExStreamlycheap Final Slides

data source - sqoot api

● Rich merchant info.

  ○ Location based queries.

Page 13: ExStreamlycheap Final Slides

data source - sqoot api

● Scaled to all categories.

  ○ ~11 million served.

  ○ Over 80 categories.

Page 14: ExStreamlycheap Final Slides

pipeline - λ architecture

Ingestion → Batch / Speed layer → Serving layer → Queries
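
As a rough illustration of how a serving layer in a λ architecture can answer a query, the sketch below adds a precomputed batch view to a small real-time view. The names `batch_view`, `speed_view` and `query_deal_count`, and the per-category counts, are hypothetical and not taken from the project.

```python
from collections import Counter


def query_deal_count(batch_view, speed_view, category):
    """Combine the batch count with events not yet absorbed by a batch run."""
    return batch_view.get(category, 0) + speed_view.get(category, 0)


# Hypothetical views: the batch layer recomputes its view periodically,
# while the speed layer only holds events that arrived since the last batch run.
batch_view = Counter({"home_goods": 1200, "travel": 800})
speed_view = Counter({"home_goods": 15})

home_goods_total = query_deal_count(batch_view, speed_view, "home_goods")
travel_total = query_deal_count(batch_view, speed_view, "travel")
```

The point of the split is that the batch view can be rebuilt from scratch at any time, while the speed view stays small and cheap to update.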

Page 15: ExStreamlycheap Final Slides

batch pipeline

Ingestion Layer → Batch Layer → Serving Layer → Queries

Diagram labels: hybrid streaming, API interaction, Async Hybrid Distributed Query Engine, RESTful API.

Page 16: ExStreamlycheap Final Slides

Real Time pipeline

Ingestion Layer → Speed Layer → Serving Layer → Queries

Diagram labels: Async Hybrid Distributed Query Engine, RESTful API, engineered user data (~200k events / min).

Page 17: ExStreamlycheap Final Slides

3.

Challenges

...now it was easy, right?

Page 18: ExStreamlycheap Final Slides

project challenges.

BAD API DESIGN

1. Pagination.

2. Max #100 per page.

3. Dynamic api without firehose or sockets.

4. Duplicate deals with sync api calls.

ROBUST PLUGINS.

PyKafka vs Kafka-Python

1. Balanced consumer.

2. Topic to Partition assignment - HASH PARTITIONING.
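
A minimal sketch of the idea behind hash partitioning, assuming a key-based scheme where the same message key always lands on the same partition. The helper `partition_for` and the choice of `zlib.crc32` are illustrative assumptions; Kafka client libraries use their own hash functions.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index with a stable hash."""
    # crc32 is used here only because it is deterministic across runs;
    # it is an assumption of this sketch, not what any Kafka client uses.
    return zlib.crc32(key) % num_partitions


num_partitions = 4
keys = [b"home_goods", b"travel", b"home_goods"]
assignments = [partition_for(k, num_partitions) for k in keys]
```

Because the mapping depends only on the key and the partition count, all messages for one key keep their relative order within a single partition.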

GENERAL ENGR. CONSTRAINTS.

1. Design and architecture choice.

2. Tools deep dive - Tweak source code.

3. Constant Cassandra crashes - real time writes.

4. DevOps

Page 19: ExStreamlycheap Final Slides

4.

Takeaways

...some lessons learned.

Page 20: ExStreamlycheap Final Slides

PROPER DB INDEXES

� Partition and Clustering keys

CREATE TABLE trend_with_price (..., PRIMARY KEY (price, discount)) WITH CLUSTERING ORDER BY (discount DESC);

� Secondary indexes

CREATE INDEX trend_with_price_category_idx ON trend_with_price (category);
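
To make the effect of `PRIMARY KEY (price, discount)` concrete, here is a small in-memory model, not a Cassandra client: `price` acts as the partition key that groups rows, and `discount` acts as the clustering key that orders rows inside each group, descending. The sample rows and the helper `build_partitions` are hypothetical.

```python
from collections import defaultdict


def build_partitions(rows):
    """Group rows by partition key (price), sorted by clustering key (discount, descending)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["price"]].append(row)
    for group in partitions.values():
        group.sort(key=lambda r: r["discount"], reverse=True)
    return dict(partitions)


# Hypothetical rows; the real table's other columns are not shown in the slides.
rows = [
    {"price": 20, "discount": 10, "category": "travel"},
    {"price": 20, "discount": 50, "category": "home_goods"},
    {"price": 35, "discount": 25, "category": "travel"},
]
partitions = build_partitions(rows)
discounts_at_20 = [r["discount"] for r in partitions[20]]
```

A query for one price then reads a single group that is already sorted by discount. Note that in Cassandra the primary key also identifies the row, so two deals with the same price and discount would overwrite each other under this key.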

secret to answering the

questions?

Page 21: ExStreamlycheap Final Slides

KAFKA CONSUMPTION OFFSETS

� Topic Offsets

Set the right start offset per partition
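
One way to reason about "the right start offset per partition" is sketched below: use the committed offset when it is still within the range the broker holds, otherwise fall back to a reset policy. The function `choose_start_offsets` and the offset values are hypothetical; with kafka-python, offsets chosen this way could then be applied through `KafkaConsumer.seek`.

```python
def choose_start_offsets(committed, earliest, latest, reset="earliest"):
    """Pick a start offset for every partition.

    committed: partition -> last committed offset, or None if nothing was committed
    earliest/latest: partition -> first/next offset available on the broker
    """
    start = {}
    for partition in earliest:
        offset = committed.get(partition)
        in_range = offset is not None and earliest[partition] <= offset <= latest[partition]
        if not in_range:
            # No usable committed offset: fall back to the reset policy.
            offset = earliest[partition] if reset == "earliest" else latest[partition]
        start[partition] = offset
    return start


# Hypothetical state for a three-partition topic.
committed = {0: 120, 1: None, 2: 5}
earliest = {0: 100, 1: 100, 2: 50}
latest = {0: 200, 1: 180, 2: 90}
start_offsets = choose_start_offsets(committed, earliest, latest)
```

Starting from the earliest offset replays everything still retained, while starting from the latest skips the backlog; which one is "right" depends on whether the consumer must see old messages.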

secret to answering the

questions?

Page 22: ExStreamlycheap Final Slides

secret to answering the

questions?

Page 23: ExStreamlycheap Final Slides

that’s all

folks!

Page 24: ExStreamlycheap Final Slides

About me

- Masters in CS

- 2 ½ yrs SE in Travel

- Nigerian - Hobbyist Photographer

Page 25: ExStreamlycheap Final Slides

thanks!

Any questions?

You can find me at

@[email protected]/awaemmanuel

linkedin.com/in/emmanuelawa

Page 26: ExStreamlycheap Final Slides

BACKUP SLIDES

Page 27: ExStreamlycheap Final Slides

Benchmarking exercises

Elasticsearch and cassandra

METRICS - 20 GB of dirty data

1. I/O - reads and writes.

2. EC2 - four (4) m4.xlarge clusters.

CONSIDERATIONS.

1. ElasticSearch vs Cassandra.

2. ElasticSearch on Cassandra.

GENERAL PROCESS FLOW.

1. Read dirty python dictionary from DB.

2. Parse and process.

3. Write back to DB.

4. Profile process.
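
The process flow above can be sketched roughly as follows, with an in-memory dict standing in for Elasticsearch or Cassandra. The cleaning rule in `clean_record` and the sample records are assumptions made for illustration; the slides do not say what "dirty" meant in the actual benchmark.

```python
import time


def clean_record(record):
    """Parse step: trim keys and values and drop blank fields (a stand-in for real cleaning)."""
    return {k.strip(): v.strip() for k, v in record.items() if v and v.strip()}


def run_benchmark(db):
    """Read every record, clean it, write it back, and return the elapsed seconds."""
    started = time.perf_counter()
    for key in list(db):
        db[key] = clean_record(db[key])
    return time.perf_counter() - started


# Hypothetical dirty records keyed by deal id.
db = {1: {" title ": "  Spa day ", "note": "   "}, 2: {"title": "Pizza"}}
elapsed = run_benchmark(db)
```

Timing the whole read-process-write loop, rather than each call, gives one comparable number per datastore.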

ElasticSearch Advantages

1. Good for preserving data indexes.

2. Great for more reads than writes.

3. Analytics and text search.

Cassandra Advantages.

1. Good for fast writes.

2. Preserving data schemas.

3. Uptime critical and time series data.

Page 28: ExStreamlycheap Final Slides

benchmarking pipeline

Ingestion Layer → Batch Layer → Serving Layer → Queries

Diagram labels: Async Hybrid Distributed Query Engine, RESTful API.

Page 29: ExStreamlycheap Final Slides

BIGGEST PROJECT CHALLENGE - api constraints

API pagination and max per page:

http://api.sqoot.com/v2/deals?api_key=xxxxxx;category_slug=home_goods;page=1;per_page=100

Freezing time for a real-time, non-firehose data source is hard.

1. Three queries done at the same time.

2. Not fun - inconsistent.

3. Application depends largely on total counts.

Page #1 loads first time → Page #1 refresh in millisecs → Page #2 loads
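
A minimal sketch of one way to cope with a paginated source that shifts between requests: read the total once, walk a fixed number of pages, and keep one copy of each deal by id. The response shape (`total`, `deals`, `id`) and the stand-in `fake_fetch_page` are assumptions for illustration, not the documented Sqoot response format.

```python
import math


def fetch_all_deals(fetch_page, per_page=100):
    """Walk every page and keep one copy of each deal, keyed by its id."""
    first = fetch_page(1, per_page)
    # The total is read once, up front, so the page count stays fixed during the crawl.
    pages = math.ceil(first["total"] / per_page)
    deals = {d["id"]: d for d in first["deals"]}
    for page in range(2, pages + 1):
        for deal in fetch_page(page, per_page)["deals"]:
            deals.setdefault(deal["id"], deal)  # skip deals already seen on an earlier page
    return list(deals.values())


def fake_fetch_page(page, per_page):
    """Hypothetical stand-in for the HTTP call; page 2 repeats a deal from page 1."""
    data = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 2}, {"id": 3}]}
    return {"total": 4, "deals": data.get(page, [])}


all_deals = fetch_all_deals(fake_fetch_page, per_page=2)
deal_ids = [d["id"] for d in all_deals]
```

Deduplicating by id removes repeats, but it cannot recover a deal that slipped between pages while the crawl was running, which is why a firehose or socket feed would have been easier to work with.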

Page 30: ExStreamlycheap Final Slides

Async distributed query engine - (Async DQE)

1. First Stage Master Producer (FSM)

2. Intermediate Hybrid Consumer-Producer

3. Final Stage Consumer
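
The three stages listed above can be sketched as a small asynchronous pipeline. Here `asyncio.Queue` objects stand in for whatever message transport connected the stages in the project, and the work done at each stage (categories in, page requests out) is an assumption made to keep the example self-contained.

```python
import asyncio


async def first_stage_producer(out_q, categories):
    """Stage 1: emit one query task per category."""
    for category in categories:
        await out_q.put(category)
    await out_q.put(None)  # sentinel: no more work


async def hybrid_consumer_producer(in_q, out_q, pages_per_category):
    """Stage 2: consume categories and produce one page request per page."""
    while (category := await in_q.get()) is not None:
        for page in range(1, pages_per_category + 1):
            await out_q.put((category, page))
    await out_q.put(None)  # pass the sentinel on to the final stage


async def final_stage_consumer(in_q, results):
    """Stage 3: consume page requests; a real consumer would fetch and store deals."""
    while (item := await in_q.get()) is not None:
        results.append(item)


async def run_pipeline(categories, pages_per_category=2):
    q1, q2, results = asyncio.Queue(), asyncio.Queue(), []
    await asyncio.gather(
        first_stage_producer(q1, categories),
        hybrid_consumer_producer(q1, q2, pages_per_category),
        final_stage_consumer(q2, results),
    )
    return results


results = asyncio.run(run_pipeline(["home_goods", "travel"]))
```

Each stage only knows about its input and output queue, so stages can be scaled or replaced independently.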

Page 31: ExStreamlycheap Final Slides

THE END...