moving from mysql to elasticsearch for analytics
TRANSCRIPT
Yannick Dawant & Vinh Nguyen
Moving from MySQL to Elasticsearch for Analytics
— What is Analytics, and why is it important to Percolate?
— Analytics 1.0 - MySQL
— Analytics 2.0 - Elasticsearch
— Next Steps
Agenda
The System of Record for Marketing
What does Analytics mean to Percolate?How does it work?
Analytics 1.0 - Design
Crawlers MySQL
API
UIFacebook
[…]
metrics
MySQL Data Model
post_id service_id tag created_at
1 1 blog 2016-01-01 10:11:15
2 1 blog, video 2016-01-01 12:12:30
3 2 election 2016 2016-01-01 10:10:57
metric_id service_id name
1 1 likes
2 1 comments
3 1 follows
4 2 follows
5 2 mentions
6 2 retweets
post_id metric_id metric_value captured_at
1 1 10 2016-01-01 10:11:15
1 1 20 2016-01-01 12:12:30
2 2 5 2016-01-01 10:10:57
2 2 10 2016-01-01 13:12:20
3 1 15 2016-01-01 13:12:45
3 2 30 2016-01-01 17:05:11
[post]service_id name
1 facebook
2 twitter
3 instagram
[service]
[post_metrics] [metric_names]
— Relational data models — Very well known pattern
— Application-level objects map cleanly to DB tables
— Joins are easy to do
— Easy to use — Amazon RDS for managed hosting/deployment/monitoring
— Very familiar to Ops team and other developers, shared knowledge base
— Lots of support available online
— Met product requirements
Why MySQL?
Seems reasonable.What are the tradeoffs?
— Data Modeling Issues — Starts easy but becomes complex over time (increasing number of tables)
— Schema inflexibility (dynamic changes, unused columns)
— Hard to modify live schemas, may require downtime
— Slow Queries — Lots of joins at query time
— Tables grow larger and larger over time
— Hard to partition Time series data
— Expensive post-processing on application side
MySQL Tradeoffs
— Scalability Issues — Database grows larger and larger over time
— Scaling is mostly vertical (add more CPU/RAM/disk to same node), may require downtime
— Hard to scale horizontally
— Not suitable for our Search needs
MySQL Tradeoffs
Where do we go from here?
Analytics 1.0 - Design
Crawlers MySQL
API
UIFacebook
[…]
metrics
Analytics 2.0 - Design
Crawlers Elasticsearch
API
UIFacebook
[…] MySQL
Kafka Data Transformation
metrics
Data Transformation
— Decouples data collection from storage
— Enhances reliability of our data pipelines — Message queue persistence, replay
— Enhances horizontal scalability of our data pipelines — Multiple brokers, parallel consumers/producers
Why Kafka?
— Applies data transformation rules — Validation, enrichment, denormalization, rollups
— Writes data to various indexes in ES
— Error handling — Network issues, ES load/timeout issues, mapping conflicts
— Multiple workers to increase overall throughput
— Real time and asynchronous workers
Data Transformation
{ "_index" : "analytics_2016-11-01", "_type" : "post", "_id" : "f6065582-a2d7-11e6-bee7-22000ae51cc9", "post_id": "19398339", "service": "facebook", "captured_at": "2016-10-31T20:32:17+00:00", "metrics": { "comments": 13, "consumptions": 132, “engaged": 24, "impressions": 132, "likes": 50, “negative_feedback": 5, "reach": 93, "shares": 76 “video_views": 42 }, "tags": ["blog","video"] }
Elasticsearch Data Model
— Document based datastore — Flexible schemas, dynamic mapping, mapping templates
— JSON, rich data structures, nested objects
— REST APIs make integration simple
— Query performance — Shards spread across nodes (versus entire MySQL DB/table on single node)
— Rolling indexes for Time series data == querying only the indexes needed (versus entire MySQL table)
Why Elasticsearch?
— Search — Rich set of built-in queries
— Powerful aggregations (and sub aggregations) — Scalability
— More control over shards and indexes
— Horizontally scale by adding more nodes and clusters
— Easy to archive old data/indexes to free up resources
— Meets current and *new* product requirements
Why Elasticsearch?
Seems reasonable.What are the tradeoffs?
— Data updates are more complex — Update by query, upserts, script security issues
— Not truly schema-less
— Reindexing is time consuming — Adding fields, mapping conflicts
— Still need custom, index management layer — Index mappings, settings, templates, naming patterns, data retention, backup/restore
— Operating ES requires effort — Deployment, configuration, performance tuning, monitoring
Elasticsearch Tradeoffs
— More index management — Better support for different types of indexes, each with own settings
— Add APIs + Tools for operations
— Avoid oversharding, which causes cluster stability issues
— More focus on UPDATE operations — Field updates (i.e. tags) require update by query/script
— Faster reindexing (i.e. adding new fields, changing field mappings)
— Slow updates/reindexing can affect other system operations/transactions
— Data denormalization vs joins
— More production monitoring
Next Steps