Moving from MySQL to Elasticsearch for Analytics

Yannick Dawant & Vinh Nguyen Moving from MySQL to Elasticsearch for Analytics

Upload: percolate

Post on 08-Jan-2017


Category: Data & Analytics



TRANSCRIPT

Page 1: Moving From MySQL to Elasticsearch for Analytics

Yannick Dawant & Vinh Nguyen

Moving from MySQL to Elasticsearch for Analytics

Page 2: Moving From MySQL to Elasticsearch for Analytics

Agenda

— What is Analytics, and why is it important to Percolate?
— Analytics 1.0 - MySQL
— Analytics 2.0 - Elasticsearch
— Next Steps

Page 3: Moving From MySQL to Elasticsearch for Analytics

The System of Record for Marketing

Page 4: Moving From MySQL to Elasticsearch for Analytics

What does Analytics mean to Percolate? How does it work?

Page 5: Moving From MySQL to Elasticsearch for Analytics

Analytics 1.0 - Design

[Diagram: Crawlers collect metrics from Facebook, Twitter, Instagram, LinkedIn, […] and store them in MySQL, which feeds the API and the UI.]

Page 6: Moving From MySQL to Elasticsearch for Analytics

MySQL Data Model

[post]
post_id  service_id  tag            created_at
1        1           blog           2016-01-01 10:11:15
2        1           blog, video    2016-01-01 12:12:30
3        2           election 2016  2016-01-01 10:10:57

[service]
service_id  name
1           facebook
2           twitter
3           instagram

[metric_names]
metric_id  service_id  name
1          1           likes
2          1           comments
3          1           follows
4          2           follows
5          2           mentions
6          2           retweets

[post_metrics]
post_id  metric_id  metric_value  captured_at
1        1          10            2016-01-01 10:11:15
1        1          20            2016-01-01 12:12:30
2        2          5             2016-01-01 10:10:57
2        2          10            2016-01-01 13:12:20
3        1          15            2016-01-01 13:12:45
3        2          30            2016-01-01 17:05:11
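Reading one human-friendly row out of this model means joining all four tables. The following is a minimal sketch of that query, assuming SQLite (via Python's built-in `sqlite3`) as a stand-in for MySQL. The table and column names and the sample rows come from the slide; the helper names `build_db` and `latest_metrics` are made up for illustration.

```python
import sqlite3


def build_db():
    """Create the four tables from the slide in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
    conn.executescript("""
        CREATE TABLE service (service_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE post (post_id INTEGER PRIMARY KEY, service_id INTEGER,
                           tag TEXT, created_at TEXT);
        CREATE TABLE metric_names (metric_id INTEGER PRIMARY KEY,
                                   service_id INTEGER, name TEXT);
        CREATE TABLE post_metrics (post_id INTEGER, metric_id INTEGER,
                                   metric_value INTEGER, captured_at TEXT);
    """)
    conn.executemany("INSERT INTO service VALUES (?, ?)",
                     [(1, "facebook"), (2, "twitter"), (3, "instagram")])
    conn.executemany("INSERT INTO post VALUES (?, ?, ?, ?)", [
        (1, 1, "blog", "2016-01-01 10:11:15"),
        (2, 1, "blog, video", "2016-01-01 12:12:30"),
        (3, 2, "election 2016", "2016-01-01 10:10:57"),
    ])
    conn.executemany("INSERT INTO metric_names VALUES (?, ?, ?)", [
        (1, 1, "likes"), (2, 1, "comments"), (3, 1, "follows"),
        (4, 2, "follows"), (5, 2, "mentions"), (6, 2, "retweets"),
    ])
    conn.executemany("INSERT INTO post_metrics VALUES (?, ?, ?, ?)", [
        (1, 1, 10, "2016-01-01 10:11:15"), (1, 1, 20, "2016-01-01 12:12:30"),
        (2, 2, 5, "2016-01-01 10:10:57"), (2, 2, 10, "2016-01-01 13:12:20"),
        (3, 1, 15, "2016-01-01 13:12:45"), (3, 2, 30, "2016-01-01 17:05:11"),
    ])
    return conn


def latest_metrics(conn):
    """Return (post_id, service, metric, value) for the newest snapshot of each metric."""
    return conn.execute("""
        SELECT p.post_id, s.name, m.name, pm.metric_value
        FROM post_metrics pm
        JOIN post p ON p.post_id = pm.post_id
        JOIN service s ON s.service_id = p.service_id
        JOIN metric_names m ON m.metric_id = pm.metric_id
        WHERE pm.captured_at = (
            SELECT MAX(x.captured_at) FROM post_metrics x
            WHERE x.post_id = pm.post_id AND x.metric_id = pm.metric_id)
        ORDER BY p.post_id, m.name
    """).fetchall()


for row in latest_metrics(build_db()):
    print(row)
```

Three joins and a correlated subquery are needed just to show the current value of each metric, and the subquery scans an ever-growing `post_metrics` table.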

Page 7: Moving From MySQL to Elasticsearch for Analytics

Why MySQL?

— Relational data models
    — Very well known pattern
    — Application-level objects map cleanly to DB tables
    — Joins are easy to do
— Easy to use
    — Amazon RDS for managed hosting/deployment/monitoring
    — Very familiar to Ops team and other developers, shared knowledge base
    — Lots of support available online
— Met product requirements

Page 8: Moving From MySQL to Elasticsearch for Analytics

Seems reasonable. What are the tradeoffs?

Page 9: Moving From MySQL to Elasticsearch for Analytics

MySQL Tradeoffs

— Data Modeling Issues
    — Starts easy but becomes complex over time (increasing number of tables)
    — Schema inflexibility (dynamic changes, unused columns)
    — Hard to modify live schemas, may require downtime
— Slow Queries
    — Lots of joins at query time
    — Tables grow larger and larger over time
    — Hard to partition time series data
    — Expensive post-processing on application side

Page 10: Moving From MySQL to Elasticsearch for Analytics

MySQL Tradeoffs

— Scalability Issues
    — Database grows larger and larger over time
    — Scaling is mostly vertical (add more CPU/RAM/disk to same node), may require downtime
    — Hard to scale horizontally
— Not suitable for our Search needs

Page 11: Moving From MySQL to Elasticsearch for Analytics

Where do we go from here?

Page 12: Moving From MySQL to Elasticsearch for Analytics

Analytics 1.0 - Design

[Diagram: Crawlers collect metrics from Facebook, Twitter, Instagram, LinkedIn, […] and store them in MySQL, which feeds the API and the UI.]

Page 13: Moving From MySQL to Elasticsearch for Analytics

Analytics 2.0 - Design

[Diagram components: Crawlers (Facebook, Twitter, Instagram, LinkedIn, […]), metrics, MySQL, Kafka, Data Transformation (shown twice), Elasticsearch, API, UI.]

Page 14: Moving From MySQL to Elasticsearch for Analytics

Why Kafka?

— Decouples data collection from storage
— Enhances reliability of our data pipelines
    — Message queue persistence, replay
— Enhances horizontal scalability of our data pipelines
    — Multiple brokers, parallel consumers/producers
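The persistence and replay points can be illustrated without a broker. Below is a toy sketch, not Kafka itself: an in-memory append-only log where each consumer group tracks its own offset. The class `ToyLog` and the group names are hypothetical; a real Kafka topic adds partitions, durability on disk, and replication.

```python
from collections import defaultdict


class ToyLog:
    """Append-only log with one read offset per consumer group (toy stand-in for a topic)."""

    def __init__(self):
        self._records = []                # records are never removed on read
        self._offsets = defaultdict(int)  # group name -> next offset to read

    def produce(self, record):
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group, max_records=10):
        """Return the next batch for this group and advance only its offset."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Rewind (or skip ahead) a group, which is what makes replay possible."""
        self._offsets[group] = offset


log = ToyLog()
for post_id in (1, 2, 3):
    log.produce({"post_id": post_id, "likes": post_id * 10})  # the crawler side

print(log.poll("indexer"))   # the storage side reads at its own pace
log.seek("indexer", 0)       # after a failed write, replay from the start
print(log.poll("indexer"))
```

Because reading does not delete records, the crawlers never wait on the datastore, and a second group can consume the same events independently.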

Page 15: Moving From MySQL to Elasticsearch for Analytics

Data Transformation

— Applies data transformation rules
    — Validation, enrichment, denormalization, rollups
— Writes data to various indexes in ES
— Error handling
    — Network issues, ES load/timeout issues, mapping conflicts
— Multiple workers to increase overall throughput
— Real time and asynchronous workers
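As a rough illustration of the validation, enrichment, and denormalization rules, here is a minimal sketch of one transformation step. The shape of the raw event, the function name `transform`, and the specific rules are assumptions; only the service names and the output field names come from the slides. Rollups are not shown.

```python
from datetime import datetime, timezone

# Lookup taken from the [service] table of the MySQL model.
SERVICES = {1: "facebook", 2: "twitter", 3: "instagram"}

REQUIRED = ("post_id", "service_id", "captured_at", "metrics")


def transform(event):
    """Turn one raw crawler event (hypothetical shape) into a flat document."""
    # Validation: reject events that cannot be indexed meaningfully.
    for field in REQUIRED:
        if field not in event:
            raise ValueError(f"missing field: {field}")
    if event["service_id"] not in SERVICES:
        raise ValueError(f"unknown service_id: {event['service_id']}")

    # Enrichment: normalize the timestamp, assuming UTC when no zone is given.
    captured = datetime.fromisoformat(event["captured_at"])
    if captured.tzinfo is None:
        captured = captured.replace(tzinfo=timezone.utc)

    # Denormalization: inline the service name and all metrics in one document,
    # so no join is needed at query time.
    return {
        "post_id": str(event["post_id"]),
        "service": SERVICES[event["service_id"]],
        "captured_at": captured.isoformat(),
        "metrics": {name: int(value) for name, value in event["metrics"].items()},
        "tags": sorted(set(event.get("tags", []))),
    }


raw = {
    "post_id": 19398339,
    "service_id": 1,
    "captured_at": "2016-10-31T20:32:17+00:00",
    "metrics": {"likes": "50", "comments": 13},
    "tags": ["video", "blog", "blog"],
}
print(transform(raw))
```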

Page 16: Moving From MySQL to Elasticsearch for Analytics

Elasticsearch Data Model

{
  "_index": "analytics_2016-11-01",
  "_type": "post",
  "_id": "f6065582-a2d7-11e6-bee7-22000ae51cc9",
  "post_id": "19398339",
  "service": "facebook",
  "captured_at": "2016-10-31T20:32:17+00:00",
  "metrics": {
    "comments": 13,
    "consumptions": 132,
    "engaged": 24,
    "impressions": 132,
    "likes": 50,
    "negative_feedback": 5,
    "reach": 93,
    "shares": 76,
    "video_views": 42
  },
  "tags": ["blog", "video"]
}
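The slide shows document metadata (`_index`, `_type`, `_id`) next to the document's own fields. When writing through the Elasticsearch bulk API, the two are sent on separate lines: an action line carrying the metadata, then a source line carrying the fields. The sketch below only builds that newline-delimited body and sends nothing; the helper name `bulk_body` is hypothetical, and `_type` is optional because later Elasticsearch versions removed mapping types.

```python
import json


def bulk_body(docs, index, doc_type=None):
    """Build a newline-delimited bulk request body from (doc_id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        meta = {"_index": index, "_id": doc_id}
        if doc_type is not None:
            meta["_type"] = doc_type               # only for versions that still use types
        lines.append(json.dumps({"index": meta}))  # action line
        lines.append(json.dumps(source))           # source line
    return "\n".join(lines) + "\n"                 # the body must end with a newline


docs = [(
    "f6065582-a2d7-11e6-bee7-22000ae51cc9",
    {"post_id": "19398339", "service": "facebook",
     "metrics": {"likes": 50, "shares": 76}, "tags": ["blog", "video"]},
)]
print(bulk_body(docs, "analytics_2016-11-01", doc_type="post"))
```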

Page 17: Moving From MySQL to Elasticsearch for Analytics

Why Elasticsearch?

— Document-based datastore
    — Flexible schemas, dynamic mapping, mapping templates
    — JSON, rich data structures, nested objects
    — REST APIs make integration simple
— Query performance
    — Shards spread across nodes (versus entire MySQL DB/table on single node)
    — Rolling indexes for time series data: queries touch only the indexes needed (versus entire MySQL table)
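With rolling indexes, a query for a date range only needs to name the indexes that cover it. Here is a minimal sketch that assumes one index per day with the `analytics_YYYY-MM-DD` naming seen in the sample document. Which timestamp decides a document's index is not stated in the slides (the sample has a `captured_at` of 2016-10-31 inside `analytics_2016-11-01`), so treat that as an open assumption. The function name is hypothetical.

```python
from datetime import date, timedelta


def indexes_for_range(start, end, prefix="analytics_"):
    """Return the daily index names covering start..end, inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days
    return [f"{prefix}{(start + timedelta(days=i)).isoformat()}"
            for i in range(days + 1)]


names = indexes_for_range(date(2016, 10, 30), date(2016, 11, 1))
# Elasticsearch accepts a comma-separated list of index names in the request path.
print("/" + ",".join(names) + "/_search")
```

Older days that fall outside every query range are never opened, which is also what makes archiving them cheap.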

Page 18: Moving From MySQL to Elasticsearch for Analytics

Why Elasticsearch?

— Search
    — Rich set of built-in queries
    — Powerful aggregations (and sub aggregations)
— Scalability
    — More control over shards and indexes
    — Horizontally scale by adding more nodes and clusters
    — Easy to archive old data/indexes to free up resources
— Meets current and *new* product requirements
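An aggregation with a sub aggregation might look like the request body below: group documents by service, then compute the highest like count seen in each group. This is a sketch of the query DSL as a Python dict. It assumes the field names from the sample document and that `service` is indexed as an exact-value (keyword-style) field; the function name is hypothetical.

```python
import json


def max_likes_by_service(start, end):
    """Build a search body: a terms aggregation with a max sub aggregation."""
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "query": {"range": {"captured_at": {"gte": start, "lt": end}}},
        "aggs": {
            "by_service": {
                "terms": {"field": "service"},
                "aggs": {
                    "max_likes": {"max": {"field": "metrics.likes"}},
                },
            }
        },
    }


body = max_likes_by_service("2016-10-01", "2016-11-01")
print(json.dumps(body, indent=2))
```

The grouping and the per-group statistic both run inside the cluster, replacing the joins and application-side post-processing of the MySQL design.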

Page 19: Moving From MySQL to Elasticsearch for Analytics

Seems reasonable. What are the tradeoffs?

Page 20: Moving From MySQL to Elasticsearch for Analytics

Elasticsearch Tradeoffs

— Data updates are more complex
    — Update by query, upserts, script security issues
— Not truly schema-less
— Reindexing is time consuming
    — Adding fields, mapping conflicts
— Still need a custom index management layer
    — Index mappings, settings, templates, naming patterns, data retention, backup/restore
— Operating ES requires effort
    — Deployment, configuration, performance tuning, monitoring
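"Not truly schema-less" can be made concrete with a toy model of dynamic mapping: the first value seen for a field fixes its type, and a later document with a different type for the same field is rejected. This is a deliberately simplified simulation, not Elasticsearch's actual behavior (the real engine coerces some values and treats arrays differently). The class `ToyMapping` is hypothetical.

```python
class ToyMapping:
    """Remember the first Python type seen per field path and reject mismatches."""

    def __init__(self):
        self.fields = {}  # dotted field path -> type name

    def index(self, doc, prefix=""):
        for key, value in doc.items():
            path = prefix + key
            if isinstance(value, dict):
                self.index(value, path + ".")  # recurse into nested objects
                continue
            kind = type(value).__name__
            known = self.fields.setdefault(path, kind)  # first document wins
            if known != kind:
                raise TypeError(f"mapping conflict on {path}: {known} vs {kind}")


mapping = ToyMapping()
mapping.index({"service": "facebook", "metrics": {"likes": 50}})
try:
    mapping.index({"service": "twitter", "metrics": {"likes": "fifty"}})
except TypeError as err:
    print(err)
```

Once a field is mapped, changing its type means creating a new index and reindexing, which is why mapping conflicts appear under both error handling and the reindexing tradeoff.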

Page 21: Moving From MySQL to Elasticsearch for Analytics

Next Steps

— More index management
    — Better support for different types of indexes, each with own settings
    — Add APIs + Tools for operations
    — Avoid oversharding, which causes cluster stability issues
— More focus on UPDATE operations
    — Field updates (e.g. tags) require update by query/script
    — Faster reindexing (e.g. adding new fields, changing field mappings)
    — Slow updates/reindexing can affect other system operations/transactions
— Data denormalization vs joins
— More production monitoring
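Because tags are denormalized into every snapshot document, changing the tags of one post means rewriting all of its documents. One way to express that is an update-by-query request with a script, sketched below as a path and body only. The function name is hypothetical, and the script key is an assumption about the version in use: newer Elasticsearch releases call it `source`, while early 5.x releases used `inline`.

```python
import json


def retag_request(index, post_id, tags):
    """Build the path and body of an update-by-query that replaces a post's tags."""
    path = f"/{index}/_update_by_query"
    body = {
        "query": {"term": {"post_id": post_id}},  # every snapshot of this post
        "script": {
            "lang": "painless",
            "source": "ctx._source.tags = params.tags",  # "inline" on early 5.x
            "params": {"tags": tags},  # passed as params, not spliced into the script
        },
    }
    return path, body


path, body = retag_request("analytics_2016-11-01", "19398339", ["blog", "video", "launch"])
print("POST", path)
print(json.dumps(body, indent=2))
```

Each matching document is read, modified, and indexed again, which is why slow updates can compete with other operations on the cluster.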

Page 22: Moving From MySQL to Elasticsearch for Analytics
Page 23: Moving From MySQL to Elasticsearch for Analytics

https://percolate.com/careers/

We’re Hiring!