Real-Time Data Loading from MySQL to Hadoop
DESCRIPTION
Hadoop is an increasingly popular means of analyzing transaction data from MySQL. Up until now, mechanisms for moving data between MySQL and Hadoop have been rather limited. Continuent Tungsten Replicator provides enterprise-quality replication from MySQL to Hadoop under a GPL V2 license. Continuent Tungsten handles MySQL transaction types including INSERT/UPDATE/DELETE operations and can materialize binlogs as well as mirror-image data copies in Hadoop. Continuent Tungsten also has the high performance necessary to load data from busy source MySQL systems into Hadoop clusters with minimal load on source systems as well as Hadoop itself.
This webinar covers the following topics:
- How Hadoop works and why it's useful for processing transaction data from MySQL
- Setting up Continuent Tungsten replication from MySQL to Hadoop
- Transforming MySQL data within Hadoop to enable efficient analytics
- Tuning replication to maximize performance
You do not need to be an expert in Hadoop or MySQL to benefit from this webinar. By the end, listeners will have enough background knowledge to start setting up replication between MySQL and Hadoop using Continuent Tungsten. The software we are discussing is 100% open source and available from the Tungsten Replicator website at code.google.com.
TRANSCRIPT
©Continuent 2014
Real-Time Loading from MySQL to Hadoop
Featuring Continuent Tungsten
Robert Hodges, CEO
Introducing Continuent
• The leading provider of clustering and replication for open source DBMS
• Our Product: Continuent Tungsten
• Clustering - Commercial-grade HA, performance scaling and data management for MySQL
• Replication - Flexible, high-performance data movement
Quick Continuent Facts
• Largest Tungsten installation processes over 700 million transactions daily on 225 terabytes of data
• Tungsten Replicator was application of the year at the 2011 MySQL User Conference
• Wide variety of topologies including MySQL, Oracle, Vertica, and MongoDB are in production now
• MySQL to Hadoop deployments are now in progress with multiple customers
Selected Continuent Customers
Five Minute Hadoop Introduction
What Is Hadoop, Exactly?
a. A distributed file system
b. A method of processing massive quantities of data in parallel
c. The Cutting family’s stuffed elephant
d. All of the above
Hadoop Distributed File System
[Diagram: clients (Java Client, Hive, Pig, hadoop command) find a file through the NameNode (directory), then read block(s) from the DataNodes (replicated data)]
Map/Reduce
Input block 1:
Acme,2013,4.75
Spitze,2013,25.00
Acme,2013,55.25
Excelsior,2013,1.00
Spitze,2013,5.00
Input block 2:
Spitze,2014,60.00
Spitze,2014,9.50
Acme,2014,1.00
Acme,2014,4.00
Excelsior,2014,1.00
Excelsior,2014,9.00
MAP (block 1):
Acme,60.00
Excelsior,1.00
Spitze,30.00
MAP (block 2):
Acme,5.00
Excelsior,10.00
Spitze,69.50
REDUCE:
Acme,65.00
Excelsior,11.00
Spitze,99.50
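The totals on this slide can be reproduced with a small sketch. This is plain Python rather than a real Hadoop job: the function names `map_block` and `reduce_totals` are illustrative assumptions, and the per-block summing simply mirrors what the slide labels as MAP.

```python
from collections import defaultdict
from decimal import Decimal

# Input blocks copied from the slide: company,year,amount
BLOCK_1 = """Acme,2013,4.75
Spitze,2013,25.00
Acme,2013,55.25
Excelsior,2013,1.00
Spitze,2013,5.00"""

BLOCK_2 = """Spitze,2014,60.00
Spitze,2014,9.50
Acme,2014,1.00
Acme,2014,4.00
Excelsior,2014,1.00
Excelsior,2014,9.00"""


def map_block(block):
    """MAP step (as labeled on the slide): total amounts per company in one block."""
    totals = defaultdict(Decimal)
    for line in block.splitlines():
        company, _year, amount = line.split(",")
        totals[company] += Decimal(amount)
    return dict(totals)


def reduce_totals(partials):
    """REDUCE step: merge the per-block totals into one total per company."""
    merged = defaultdict(Decimal)
    for partial in partials:
        for company, amount in partial.items():
            merged[company] += amount
    return dict(merged)


# Each block can be mapped independently (in Hadoop, on different nodes).
partials = [map_block(BLOCK_1), map_block(BLOCK_2)]
result = reduce_totals(partials)
for company in sorted(result):
    print(f"{company},{result[company]}")
```

Because each block is processed on its own, the MAP work can run in parallel, and only the small per-company totals have to be brought together for the REDUCE step.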
Typical MySQL to Hadoop Use Case
[Diagram: Transaction Processing -> Hadoop Cluster -> Hive (Analytics)]
Open questions: Initial Load? Changes? Latency? App changes? App load? Materialized views?
Options for Loading Data
• Manual Loading (CSV Files)
• Sqoop
• Tungsten Replicator
Comparing Methods in Detail
                        | Manual via CSV            | Sqoop                        | Tungsten Replicator
Process                 | Manual/Scripted           | Manual/Scripted              | Fully automated
Incremental Loading     | Possible with DDL changes | Requires DDL changes         | Fully supported
Latency                 | Full-load                 | Intermittent                 | Real-time
Extraction Requirements | Full table scan           | Full and partial table scans | Low-impact binlog scan
Replicating MySQL Data to Hadoop using Tungsten Replicator
What is Tungsten Replicator?
A real-time, high-performance, open source database replication engine
GPL V2 license - 100% open source
Download from https://code.google.com/p/tungsten-replicator/
Annual support subscription available from Continuent
“Golden Gate without the Price Tag”®
Tungsten Replicator Overview
[Diagram: on the Master, a Replicator extracts transactions from the DBMS Logs into the THL (Transactions + Metadata); on the Slave, a Replicator receives the THL (Transactions + Metadata) and applies it]
Tungsten Replicator 3.0 & Hadoop
• Extract from MySQL or Oracle
• Base Hadoop plus commercial distributions: Cloudera and Hortonworks
• Provision using Sqoop or parallel extraction
• Automatic replication of incremental changes
• Transformation to preferred HDFS formats
• Schema generation for Hive
• Tools for generating materialized views
Basic MySQL to Hadoop Replication
[Diagram: MySQL (binlog_format=row) -> MySQL Binlog -> Tungsten Master Replicator (labeled "hadoop") -> Tungsten Slave Replicator (labeled "hadoop") -> CSV Files -> Hadoop Cluster]
Extract from MySQL binlog
Load raw CSV to HDFS (e.g., via LOAD DATA to Hive)
Access via Hive
Master-Side Filtering
• pkey - Fill in pkey info
• colnames - Fill in names
• cdc - Add update type and schema/table info
• source - Add source DBMS
• replicate - Subset tables to be replicated
Hadoop Data Loading - Gory Details
[Diagram: Transactions from master -> Replicator (labeled "hadoop") -> CSV Files -> Staging “Tables” -> Base Tables / Materialized Views]
• Write data to CSV
• Load using hadoop command (Javascript load script, e.g. hadoop.js)
• (Generate Table Definitions) for the staging “tables”
• (Run Map/Reduce) to build the base tables / materialized views
• (Generate Table Definitions) for the base tables
Demo #1
Replicating sysbench data
Viewing MySQL Data in Hadoop
Generating Staging Table Schema
$ ddlscan -template ddl-mysql-hive-0.10-staging.vm \
    -user tungsten -pass secret \
    -url jdbc:mysql:thin://logos1:3306/db01 -db db01
...
DROP TABLE IF EXISTS db01.stage_xxx_sbtest;

CREATE EXTERNAL TABLE db01.stage_xxx_sbtest
(
  tungsten_opcode STRING ,
  tungsten_seqno INT ,
  tungsten_row_id INT ,
  id INT ,
  k INT ,
  c STRING ,
  pad STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/user/tungsten/staging/db01/sbtest';
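To make the staging format concrete, here is a minimal sketch of reading one staging line in Python. The column names and the '\001' delimiter and backslash escape come from the CREATE EXTERNAL TABLE statement on this slide. The helper names, the sample values, and the "I" opcode are assumptions for illustration, and the sketch treats an escaped character literally, which is simpler than Hive's full escape handling.

```python
# Column order taken from the staging table definition on the slide.
STAGING_COLUMNS = [
    "tungsten_opcode", "tungsten_seqno", "tungsten_row_id",
    "id", "k", "c", "pad",
]
INT_COLUMNS = {"tungsten_seqno", "tungsten_row_id", "id", "k"}


def split_fields(line, delimiter="\x01", escape="\\"):
    """Split on the delimiter, keeping any escaped character as literal text."""
    fields, current, escaped = [], [], False
    for ch in line:
        if escaped:
            current.append(ch)
            escaped = False
        elif ch == escape:
            escaped = True
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def parse_staging_line(line):
    """Turn one staging line into a dict keyed by column name."""
    values = split_fields(line.rstrip("\n"))
    if len(values) != len(STAGING_COLUMNS):
        raise ValueError(f"expected {len(STAGING_COLUMNS)} fields, got {len(values)}")
    row = dict(zip(STAGING_COLUMNS, values))
    for name in INT_COLUMNS:
        row[name] = int(row[name])
    return row


# Hypothetical sample line; the "c" value contains an escaped delimiter.
sample = "\x01".join(["I", "42", "1", "7", "5000", "some\\\x01text", "padding"]) + "\n"
row = parse_staging_line(sample)
print(row)
```

The three tungsten_* columns are what later make it possible to order changes and decide which version of a row is current.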
Generating Base Table Schema
$ ddlscan -template ddl-mysql-hive-0.10.vm -user tungsten \
    -pass secret -url jdbc:mysql:thin://logos1:3306/db01 -db db01
...
DROP TABLE IF EXISTS db01.sbtest;

CREATE TABLE db01.sbtest
(
  id INT ,
  k INT ,
  c STRING ,
  pad STRING )
;
Creating a Materialized View in Theory
Log #1 Log #2 Log #N...
MAP Sort by key(s), transaction order
REDUCE Emit last row per key if not a delete
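The MAP and REDUCE steps above can be sketched in a few lines of Python. This is an in-memory illustration, not the Hadoop job itself: the tuple layout, the `materialize` function, and the opcode values "I" (insert) and "D" (delete) are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical change-log rows: (opcode, seqno, row_id, key, value).
# "I" = insert and "D" = delete are assumed opcode values for this sketch.
LOG_1 = [("I", 1, 1, 10, "a"), ("I", 1, 2, 11, "b")]
LOG_2 = [("D", 2, 1, 10, "a"), ("I", 2, 2, 10, "a2"), ("D", 3, 1, 11, "b")]


def materialize(*logs):
    """Build the current view of the table from any number of change logs."""
    rows = [row for log in logs for row in log]
    # MAP: sort by key, then by transaction order (seqno, row_id).
    rows.sort(key=lambda r: (r[3], r[1], r[2]))
    view = []
    # REDUCE: for each key, emit the last row unless it is a delete.
    for _key, group in groupby(rows, key=itemgetter(3)):
        last = list(group)[-1]
        if last[0] != "D":
            view.append((last[3], last[4]))
    return view


view = materialize(LOG_1, LOG_2)
print(view)
```

In this sample, key 10 is inserted, deleted, and re-inserted, so its final value survives, while key 11 ends with a delete and drops out of the view.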
Creating a Materialized View in Hive
$ hive
...
hive> ADD FILE /home/rhodges/github/continuent-tools-hadoop/bin/tungsten-reduce;
hive> FROM (
  SELECT sbx.*
  FROM db01.stage_xxx_sbtest sbx
  DISTRIBUTE BY id
  SORT BY id,tungsten_seqno,tungsten_row_id
) map1
INSERT OVERWRITE TABLE db01.sbtest
  SELECT TRANSFORM(
    tungsten_opcode,tungsten_seqno,tungsten_row_id,id,k,c,pad)
  USING 'perl tungsten-reduce -k id -c tungsten_opcode,tungsten_seqno,tungsten_row_id,id,k,c,pad'
  AS id INT,k INT,c STRING,pad STRING;
...
MAP: the inner SELECT with DISTRIBUTE BY / SORT BY
REDUCE: the SELECT TRANSFORM step using tungsten-reduce
Comparing MySQL and Hadoop Data
$ export TUNGSTEN_EXT_LIBS=/usr/lib/hive/lib
...
$ /opt/continuent/tungsten/bristlecone/bin/dc \
    -url1 jdbc:mysql:thin://logos1:3306/db01 \
    -user1 tungsten -password1 secret \
    -url2 jdbc:hive2://localhost:10000 \
    -user2 'tungsten' -password2 'secret' -schema db01 \
    -table sbtest -verbose -keys id \
    -driver org.apache.hive.jdbc.HiveDriver
22:33:08,093 INFO DC - Data comparison utility
...
22:33:24,526 INFO Tables compare OK
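The idea behind a keyed table comparison can be illustrated with a short sketch. It does not describe how the dc utility works internally: the `compare_tables` function and the sample rows are hypothetical, and real tables would be read over JDBC rather than held in memory.

```python
def compare_tables(source_rows, target_rows, key):
    """Compare two row sets by key and report rows that are missing, extra, or different."""
    source = {row[key]: row for row in source_rows}
    target = {row[key]: row for row in target_rows}
    return {
        # Keys present in the source but not in the target.
        "missing": sorted(source.keys() - target.keys()),
        # Keys present in the target but not in the source.
        "extra": sorted(target.keys() - source.keys()),
        # Keys present in both whose rows do not match.
        "different": sorted(
            k for k in source.keys() & target.keys() if source[k] != target[k]
        ),
    }


# Hypothetical rows shaped like the sbtest table (id, k, c, pad).
mysql_rows = [
    {"id": 1, "k": 10, "c": "x", "pad": "p"},
    {"id": 2, "k": 20, "c": "y", "pad": "p"},
    {"id": 3, "k": 30, "c": "z", "pad": "p"},
]
hive_rows = [
    {"id": 1, "k": 10, "c": "x", "pad": "p"},
    {"id": 2, "k": 21, "c": "y", "pad": "p"},
    {"id": 4, "k": 40, "c": "w", "pad": "p"},
]

report = compare_tables(mysql_rows, hive_rows, key="id")
print(report)
```

An empty report for all three categories corresponds to the "Tables compare OK" outcome shown in the log output.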
Doing it all at once
$ git clone \
    https://github.com/continuent/continuent-tools-hadoop.git

$ cd continuent-tools-hadoop

$ bin/load-reduce-check \
    -U jdbc:mysql:thin://logos1:3306/db01 \
    -s db01 --verbose
Demo #2
Constructing and Checking a Materialized View
Scaling It Up!
MySQL to Hadoop Fan-In Architecture
[Diagram: Masters m1, m2, and m3, each with its own Replicator (m1 (master), m2 (master), m3 (master)), send RBR streams to a single Replicator running the Slaves m1 (slave), m2 (slave), and m3 (slave), which loads into the Hadoop Cluster (many nodes)]
Integration with Provisioning
[Diagram: MySQL (binlog_format=row) -> MySQL Binlog -> Tungsten Master (labeled "hadoop") -> Tungsten Slave (labeled "hadoop") -> CSV Files -> Hadoop Cluster; Sqoop/ETL (Initial provisioning run) also loads from MySQL into the Hadoop Cluster]
Access via Hive
On-Demand Provisioning via Parallel Extract
[Diagram: MySQL (binlog_format=row) and MySQL Binlog -> Tungsten Master Replicator (labeled "hadoop") -> Tungsten Slave Replicator (labeled "hadoop") -> CSV Files -> Hadoop Cluster]
Extract from MySQL tables
Load raw CSV to HDFS (e.g., via LOAD DATA to Hive)
Access via Hive
Master-Side Filtering
• pkey - Fill in pkey info
• colnames - Fill in names
• cdc - Add update type and schema/table info
• source - Add source DBMS
• replicate - Subset tables to be replicated
• (other filters as needed)
Tungsten Replicator Roadmap
• Parallel CSV file loading
• Partition loaded data by commit time
• Data formats and tools to support additional Hadoop clients as well as HBase
• Replication out of Hadoop
• Integration with emerging real-time analytics based on HDFS (Impala, Spark/Shark, Stinger,...)
Getting Started with Continuent Tungsten
Where Is Everything?
• Tungsten Replicator 3.0 builds are now available on code.google.com http://code.google.com/p/tungsten-replicator/
• Replicator 3.0 documentation is available on the Continuent website http://docs.continuent.com/tungsten-replicator-3.0/deployment-hadoop.html
• Tungsten Hadoop tools are available on GitHub https://github.com/continuent/continuent-tools-hadoop
Contact Continuent for support
Commercial Terms
• Replicator features are open source (GPL V2)
• Investment Elements
  • POC / Development (Walk Away Option)
  • Production Deployment
  • Annual Support Subscription
• Governing Principles
  • Annual Subscription Required
  • More Upfront Investment -> Less Annual Subscription
We Do Clustering Too!
Tungsten clusters combine off-the-shelf open source MySQL servers into data services with:
• 24x7 data access
• Scaling of load on replicas
• Simple management commands
...without app changes or data migration
[Diagram: GonzoPortal.com (apache/php) in Amazon US West, connecting through two Connectors]
In Conclusion: Tungsten Offers...
• Fully automated, real-time replication from MySQL into Hadoop
• Support for automatic transformation to HDFS data formats and creation of full materialized views
• Positions users to take advantage of evolving real-time features in Hadoop
Continuent Web Page: http://www.continuent.com
Tungsten Replicator 2.0: http://code.google.com/p/tungsten-replicator
Our Blogs: http://scale-out-blog.blogspot.com http://mcslp.wordpress.com http://www.continuent.com/news/blogs
560 S. Winchester Blvd., Suite 500 San Jose, CA 95128 Tel +1 (866) 998-3642 Fax +1 (408) 668-1009 e-mail: [email protected]