moving from mysql to elasticsearch for analytics

Download Moving From MySQL to Elasticsearch for Analytics

Post on 08-Jan-2017



Data & Analytics

1 download

Embed Size (px)


  • Yannick Dawant & Vinh Nguyen

    Moving from MySQL to Elasticsearch for Analytics

  • What is Analytics, and why is it important to Percolate?

    Analytics 1.0 - MySQL

    Analytics 2.0 - Elasticsearch

    Next Steps


  • The System of Record for Marketing

  • What does Analytics mean to Percolate?How does it work?

  • Analytics 1.0 - Design

    Crawlers MySQL







  • MySQL Data Model

    post_id service_id tag created_at

    1 1 blog 2016-01-01 10:11:15

    2 1 blog, video 2016-01-01 12:12:30

    3 2 election 2016 2016-01-01 10:10:57

    metric_id service_id name

    1 1 likes

    2 1 comments

    3 1 follows

    4 2 follows

    5 2 mentions

    6 2 retweets

    post_id metric_id metric_value captured_at

    1 1 10 2016-01-01 10:11:15

    1 1 20 2016-01-01 12:12:30

    2 2 5 2016-01-01 10:10:57

    2 2 10 2016-01-01 13:12:20

    3 1 15 2016-01-01 13:12:45

    3 2 30 2016-01-01 17:05:11

    [post]service_id name

    1 facebook

    2 twitter

    3 instagram


    [post_metrics] [metric_names]

  • Relational data models Very well known pattern

    Application-level objects map cleanly to DB tables

    Joins are easy to do

    Easy to use Amazon RDS for managed hosting/deployment/monitoring

    Very familiar to Ops team and other developers, shared knowledge base

    Lots of support available online

    Met product requirements

    Why MySQL?

  • Seems reasonable.What are the tradeoffs?

  • Data Modeling Issues Starts easy but becomes complex over time (increasing number of tables)

    Schema inflexibility (dynamic changes, unused columns)

    Hard to modify live schemas, may require downtime

    Slow Queries Lots of joins at query time

    Tables grow larger and larger over time

    Hard to partition Time series data

    Expensive post-processing on application side

    MySQL Tradeoffs

  • Scalability Issues Database grows larger and larger over time

    Scaling is mostly vertical (add more CPU/RAM/disk to same node), may require downtime

    Hard to scale horizontally

    Not suitable for our Search needs

    MySQL Tradeoffs

  • Where do we go from here?

  • Analytics 1.0 - Design

    Crawlers MySQL







  • Analytics 2.0 - Design

    Crawlers Elasticsearch





    [] MySQL

    Kafka Data Transformation


    Data Transformation

  • Decouples data collection from storage

    Enhances reliability of our data pipelines Message queue persistence, replay

    Enhances horizontal scalability of our data pipelines Multiple brokers, parallel consumers/producers

    Why Kafka?

  • Applies data transformation rules Validation, enrichment, denormalization, rollups

    Writes data to various indexes in ES

    Error handling Network issues, ES load/timeout issues, mapping conflicts

    Multiple workers to increase overall throughput

    Real time and asynchronous workers

    Data Transformation

  • { "_index" : "analytics_2016-11-01", "_type" : "post", "_id" : "f6065582-a2d7-11e6-bee7-22000ae51cc9", "post_id": "19398339", "service": "facebook", "captured_at": "2016-10-31T20:32:17+00:00", "metrics": { "comments": 13, "consumptions": 132, engaged": 24, "impressions": 132, "likes": 50, negative_feedback": 5, "reach": 93, "shares": 76 video_views": 42 }, "tags": ["blog","video"] }

    Elasticsearch Data Model

  • Document based datastore Flexible schemas, dynamic mapping, mapping templates

    JSON, rich data structures, nested objects

    REST APIs make integration simple

    Query performance Shards spread across nodes (versus entire MySQL DB/table on single node)

    Rolling indexes for Time series data == querying only the indexes needed (versus entire MySQL table)

    Why Elasticsearch?

  • Search Rich set of built-in queries

    Powerful aggregations (and sub aggregations) Scalability

    More control over shards and indexes

    Horizontally scale by adding more nodes and clusters

    Easy to archive old data/indexes to free up resources

    Meets current and *new* product requirements

    Why Elasticsearch?

  • Seems reasonable.What are the tradeoffs?

  • Data updates are more complex Update by query, upserts, script security issues

    Not truly schema-less

    Reindexing is time consuming Adding fields, mapping conflicts

    Still need custom, index management layer Index mappings, settings, templates, naming patterns, data retention, backup/restore

    Operating ES requires effort Deployment, configuration, performance tuning, monitoring

    Elasticsearch Tradeoffs

  • More index management Better support for different types of indexes, each with own settings

    Add APIs + Tools for operations

    Avoid oversharding, which causes cluster stability issues

    More focus on UPDATE operations Field updates (i.e. tags) require update by query/script

    Faster reindexing (i.e. adding new fields, changing field mappings)

    Slow updates/reindexing can affect other system operations/transactions

    Data denormalization vs joins

    More production monitoring

    Next Steps


    Were Hiring!