Colorado Springs Open Source Hadoop/MySQL
![Page 1: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/1.jpg)
MYSQL AND HADOOP INTEGRATION
![Page 2: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/2.jpg)
STRUCTURED AND UNSTRUCTURED DATA
•Structured Data (Deductive Logic)
  • Analysis of defined relationships
  • Defined Data Architecture
  • SQL Compliant for fast processing with certainty
  • Precision, Speed
•Unstructured Data (Inductive Logic)
  • Hypothesis testing against unknown relationships
    • Unknown (being less than 100% certainty)
  • Iterative analysis to a level-of-certainty
  • Open standards and tools
    • Extremely high rate of change in processing/tooling options
  • Volume, Speed
![Page 3: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/3.jpg)
STRUCTURED DATA: RDBMS
•Capabilities
  • Defined Data Architecture/Structured Schema
  • Data Integrity
  • ACID Compliant
    • Atomicity - Requires that each transaction is "all or nothing"
    • Consistency - Any transaction will bring the database from one valid state to another
    • Isolation - Concurrent transactions leave the database in the same state as if they had been issued serially
    • Durability - Once the transaction is committed it persists
  • Real-time processing and analysis against known relationships
•Limitations
  • Comparatively static data architecture
  • Requires defined data architecture for all data stored
  • Relatively smaller, defined, more discrete data sets
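The "all or nothing" behaviour of atomicity can be sketched with a small transfer between two rows. This is a minimal illustration that uses Python's built-in `sqlite3` module as a stand-in for MySQL; the `accounts` table, the `transfer` helper, and the CHECK constraint are assumptions made for the example and do not come from the slides.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            # Credit first, so a failing debit has something to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the debit; the credit was rolled back too.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "bob", "alice", 500)  # would overdraw bob, so nothing changes
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

The second transfer credits one row before the debit fails, yet neither change is visible afterwards, which is the guarantee the Atomicity and Consistency bullets describe.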
![Page 4: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/4.jpg)
UNSTRUCTURED DATA: NOSQL
•Capabilities
  • Key Value lookup
  • Designed for fast single row lookup
  • Loose Schema designed for fast lookup
  • MySQL NoSQL Interface
  • Used to augment Big Data solutions
•Limitations
  • Not designed for Analytics
  • Does not support two to three of the V's of Big Data
  • On its own, NoSQL is not considered Big Data
![Page 5: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/5.jpg)
UNSTRUCTURED DATA: HADOOP
•Capabilities
  • High data load with different data formats
  • Allows discovery and hypothesis testing against large data sets
  • Near Real-time processing and analysis against unknown relationships
•Limitations
  • Not ACID compliant
  • No transactional consistency
  • High latency system
  • Not designed for real time lookups
  • Limited BI tool integration
![Page 6: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/6.jpg)
WHAT IS HADOOP?
•What is Hadoop?
  • Fastest growing, commercial Big Data Technology
  • Basic Composition:
    • Mapper
    • Reducer
    • Hadoop Distributed File System (HDFS)
  • Approx 30 tools/subcomponents in the eco-system
    • Primarily produced so developers and admins do not have to write raw map/reduce code in Java
•Systems Architecture:
  • Linux
  • Commodity x86 Servers
  • JBOD storage (HDFS block size typically 64-128MB)
  • Virtualization not recommended due to high I/O requirements
•Open Source Project:
  • http://hadoop.apache.org/
![Page 7: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/7.jpg)
HADOOP: QUICK HISTORY
•Map Reduce Theory Paper:
  • Published 2004
  • Jeffrey Dean and Sanjay Ghemawat
  • Built on top of GFS (Google File System), which was described in a separate 2003 paper
  • Problem:
    • Ingest and search large data sets
•Hadoop:
  • Doug Cutting (at Yahoo during Hadoop's early development, later at Cloudera)
    • Lucene (1999) – indexing large files
    • Nutch (2004) – search massive amounts of web data
    • Hadoop (2007) – first release 2007, started in 2005
![Page 8: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/8.jpg)
WHY IS HADOOP SO POPULAR?
•Store everything regardless
  • Analyze Now, or analyze Later
•Schema On-Read methodology
  • Allows you to store all the data and determine how to use it later
•Low cost, scale out infrastructure
  • Low cost hardware and large storage pools
  • Allows for more of a load-it and forget-it approach
•Usage
  • Sentiment analysis
  • Marketing campaign analysis
  • Customer churn modeling
  • Fraud detection
  • Research and Development
  • Risk Modeling
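The schema-on-read idea can be sketched in a few lines: raw records are kept as they arrived, and column names and types are applied only when a question is asked. The sample records, the `read_with_schema` helper, and the field names are assumptions made for illustration.

```python
import csv
import io

# Raw records are stored exactly as they arrived; nothing is validated on write.
RAW_STORE = [
    "2014-01-05,alice,login",
    "2014-01-05,bob,purchase,19.99",
    "2014-01-06,alice,purchase,5.00",
]


def read_with_schema(raw_lines, fields, converters=None):
    """Apply a schema at read time: name the columns and convert selected types."""
    converters = converters or {}
    for row in csv.reader(io.StringIO("\n".join(raw_lines))):
        # zip() stops at the shorter sequence, so extra or missing columns are tolerated.
        record = dict(zip(fields, row))
        for name, convert in converters.items():
            if name in record:
                record[name] = convert(record[name])
        yield record


# First question: who did what? Three columns are enough.
events = list(read_with_schema(RAW_STORE, ["day", "user", "action"]))

# Later question: how much revenue? The same raw data is read with a wider schema.
purchases = [
    record
    for record in read_with_schema(
        RAW_STORE, ["day", "user", "action", "amount"], {"amount": float}
    )
    if record["action"] == "purchase"
]
revenue = sum(record["amount"] for record in purchases)
print(events[0], round(revenue, 2))
```

Neither question required reloading or restructuring the stored data, which is the contrast with the schema-on-write approach of an RDBMS.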
![Page 9: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/9.jpg)
HADOOP 1.0
![Page 10: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/10.jpg)
MAP/REDUCE
•Programming and execution framework
•Taken from functional programming
  • Map – operate on every element
  • Reduce – combine and aggregate results
•Abstracts storage, concurrency, execution
  • Just write two Java functions
  • Contrast with MPI
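A word count is the usual way to show the two functions. The sketch below is written in Python rather than Java and runs in a single process; `run_job` is a hypothetical stand-in for the framework's shuffle step, which on a real cluster is distributed across nodes.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: combine all values that were emitted for one key."""
    return word, sum(counts)


def run_job(lines):
    """Stand-in for the framework: map, group values by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


word_counts = run_job(["to be or not to be", "to err is human"])
print(word_counts)
```

Only `map_fn` and `reduce_fn` contain application logic; everything in `run_job` is what the framework abstracts away.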
![Page 11: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/11.jpg)
HDFS
•Based on GFS
•Distributed, fault-tolerant filesystem
•Primarily designed for cost and scale
  • Works on commodity hardware
  • 20PB / 4000 node cluster at Facebook
![Page 12: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/12.jpg)
HDFS ASSUMPTIONS
•Failures are common
  • Massive scale means more failures
  • Disks, network, node
•Files are append-only
•Files are large (GBs to TBs)
•Accesses are large and sequential
![Page 13: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/13.jpg)
HDFS PRIMER
•Same concepts as the FS on your laptop
  • Directory tree
  • Create, read, write, delete files
•Filesystems store metadata and data
  • Metadata: filename, size, permissions, …
  • Data: contents of a file
![Page 14: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/14.jpg)
HDFS ARCHITECTURE
[Diagram: one NameNode and four DataNodes distributed across Rack 1 and Rack 2]
![Page 15: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/15.jpg)
MAP REDUCE AND HDFS SUMMARY
•GFS and MR co-design
  • Cheap, simple, effective at scale
•Fault-tolerance baked in
  • Replicate data 3x
  • Incrementally re-execute computation
  • Avoid single points of failure
•Held the world sort record (0.578TB/min)
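The block and replication ideas can be made concrete with a small calculation. This is a simplified sketch, not the NameNode's actual placement algorithm: it assumes a 128MB block size, at least two racks with at least two nodes each, and a round-robin choice that puts one replica on one rack and two on another. The rack and node names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size, upper end of the 64-128MB range


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of every block a file of `file_size` bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(block_index, racks):
    """Pick three distinct nodes on two racks for one block (simplified round-robin)."""
    names = sorted(racks)
    near = racks[names[block_index % len(names)]]
    far = racks[names[(block_index + 1) % len(names)]]
    first = near[block_index % len(near)]
    start = block_index % len(far)
    # One replica on the "near" rack, two on different nodes of the "far" rack.
    return [first, far[start], far[(start + 1) % len(far)]]


RACKS = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
blocks = split_into_blocks(300 * 1024 * 1024)  # a 300MB file
placement = [place_replicas(index, RACKS) for index in range(len(blocks))]
for index, nodes in enumerate(placement):
    print(f"block {index}: {blocks[index]} bytes on {nodes}")
```

Because every block lives on two racks, losing a disk, a node, or a whole rack still leaves at least one readable copy.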
![Page 16: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/16.jpg)
SQOOP
• Performs bidirectional data transfers between Hadoop and almost any SQL database with a JDBC driver
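The two directions correspond to the `sqoop import` and `sqoop export` tools. The sketch below only builds the command lines and does not run them; the JDBC URL, table names, HDFS paths, and user name are hypothetical, and `sqoop_command` is a helper written for this example.

```python
import shlex


def sqoop_command(direction, jdbc_url, table, hdfs_dir, username):
    """Build the argument list for a Sqoop import (DB to HDFS) or export (HDFS to DB)."""
    if direction not in ("import", "export"):
        raise ValueError("direction must be 'import' or 'export'")
    # Imports write to a target directory; exports read from an export directory.
    dir_flag = "--target-dir" if direction == "import" else "--export-dir"
    return [
        "sqoop", direction,
        "--connect", jdbc_url,
        "--username", username,
        "-P",  # prompt for the password instead of placing it on the command line
        "--table", table,
        dir_flag, hdfs_dir,
    ]


import_cmd = sqoop_command(
    "import", "jdbc:mysql://db.example.com/shop", "orders", "/data/orders", "etl"
)
export_cmd = sqoop_command(
    "export", "jdbc:mysql://db.example.com/shop", "order_summary",
    "/data/order_summary", "etl",
)
rendered = " ".join(shlex.quote(part) for part in import_cmd)
print(rendered)
```

An argument list like this can be handed to `subprocess.run` on a machine where Sqoop and the MySQL JDBC driver are installed.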
![Page 17: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/17.jpg)
FLUME
• Streaming data collection and aggregation
• Massive volumes of data, such as RPC services, Log4J, Syslog, etc.
[Diagram: four Clients sending data to three Agents]
![Page 18: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/18.jpg)
HIVE
• Relational database abstraction using a SQL like dialect called HiveQL
• Statements are executed as one or more Map Reduce Jobs
SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN k ON (s.word = k.word)
WHERE s.freq >= 5;
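How a join like the one above might run as a MapReduce job can be sketched as a reduce-side join: the map step keys every row by the join column and tags it with its source table, and the reduce step pairs rows that share a key. This is an illustrative single-process sketch, not Hive's actual execution plan; the sample rows and the `run_join` driver are assumptions.

```python
from collections import defaultdict


def map_rows(tag, rows):
    """Map: key every (word, freq) row by its join column and tag its source table."""
    for word, freq in rows:
        yield word, (tag, freq)


def reduce_join(word, tagged_values):
    """Reduce: pair rows from both tables for one word, applying s.freq >= 5."""
    left = [freq for tag, freq in tagged_values if tag == "s"]
    right = [freq for tag, freq in tagged_values if tag == "k"]
    for s_freq in left:
        if s_freq >= 5:
            for k_freq in right:
                yield word, s_freq, k_freq


def run_join(s_rows, k_rows):
    """Stand-in for the framework: group mapped values by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in list(map_rows("s", s_rows)) + list(map_rows("k", k_rows)):
        groups[key].append(value)
    result = []
    for word in sorted(groups):
        result.extend(reduce_join(word, groups[word]))
    return result


s_rows = [("love", 12), ("thou", 7), ("sword", 3)]
k_rows = [("love", 9), ("thou", 20), ("ark", 4)]
joined = run_join(s_rows, k_rows)
print(joined)
```

Words present in only one table, or below the frequency threshold, produce no output rows, matching inner-join semantics.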
![Page 19: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/19.jpg)
PIG
• High-level scripting language for executing one or more MapReduce jobs
• Created to simplify authoring of MapReduce jobs
• Can be extended with user defined functions
emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 200000;
sorted_rich = ORDER rich BY salary DESC;
STORE sorted_rich INTO 'rich_people.txt';
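For readers unfamiliar with Pig Latin, the same load, filter, and order data flow can be written in plain Python. The sample rows are made up, and the tab delimiter is an assumption based on Pig's default loader; on a cluster, Pig would compile these steps into MapReduce jobs rather than run them in one process.

```python
import csv
import io

# Hypothetical contents of people.txt, tab-delimited.
PEOPLE_TXT = "1\tAnn\t250000\n2\tBob\t90000\n3\tCy\t310000\n"


def load(text, fields):
    """LOAD ... AS (...): read delimited rows and name the columns."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(fields, row)) for row in reader]


emps = load(PEOPLE_TXT, ["id", "name", "salary"])
# FILTER emps BY salary > 200000
rich = [emp for emp in emps if int(emp["salary"]) > 200000]
# ORDER rich BY salary DESC
sorted_rich = sorted(rich, key=lambda emp: int(emp["salary"]), reverse=True)
print([emp["name"] for emp in sorted_rich])
```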
![Page 20: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/20.jpg)
HBASE
• Low-latency, distributed, columnar key-value store
• Based on BigTable
• Efficient random reads/writes on HDFS
• Useful for frontend applications
![Page 21: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/21.jpg)
OOZIE
• Workflow engine and scheduler built specifically for large-scale job orchestration on a Hadoop cluster
![Page 22: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/22.jpg)
HUE
• Hue is an open source web-based application for making it easier to use Apache Hadoop.
•Hue features
  • File Browser for HDFS
  • Job Designer/Browser for MapReduce
  • Query editors for Hive, Pig and Cloudera Impala
  • Oozie
![Page 23: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/23.jpg)
ZOOKEEPER
• ZooKeeper is a distributed consensus engine
• Provides well-defined concurrent access semantics:
  • Leader election
  • Service discovery
  • Distributed locking / mutual exclusion
  • Message board / mailboxes
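Leader election is commonly built on sequence numbers: every participant registers under a shared parent, the service hands out increasing numbers, and the participant holding the lowest live number leads. The sketch below imitates that idea in memory and does not talk to a ZooKeeper server; the `ElectionRoom` class and the client names are hypothetical.

```python
class ElectionRoom:
    """In-memory stand-in for a parent node that holds sequentially numbered members."""

    def __init__(self):
        self._next_seq = 0
        self._members = {}  # sequence number -> client name

    def join(self, client):
        """Register a client and return the sequence number it was given."""
        seq = self._next_seq
        self._next_seq += 1
        self._members[seq] = client
        return seq

    def leave(self, seq):
        """Remove a member, as would happen when its session ends or it crashes."""
        self._members.pop(seq, None)

    def leader(self):
        """The member with the lowest live sequence number is the leader."""
        if not self._members:
            return None
        return self._members[min(self._members)]


room = ElectionRoom()
seq_a = room.join("worker-a")
seq_b = room.join("worker-b")
seq_c = room.join("worker-c")
first_leader = room.leader()
room.leave(seq_a)  # the current leader goes away
second_leader = room.leader()
print(first_leader, second_leader)
```

Leadership passes to the next-lowest number without any vote, so the remaining participants agree on the new leader as long as they see the same membership.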
![Page 24: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/24.jpg)
CASCADING
•Next gen software abstraction layer for Map/Reduce
•Create and execute complex data processing workflows
  • Specifically for a Hadoop cluster using any JVM-based language
    • Java
    • JRuby
    • Clojure
•Generally acknowledged as a better alternative to Hive/Pig
![Page 25: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/25.jpg)
MYSQL AND BIG DATA
![Page 26: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/26.jpg)
CHARACTERISTICS OF BIG DATA
•Big Data covers 4 dimensions
  • Volume - 90% of all the data stored in the world has been produced in the last 2 years
  • Velocity – The ability to perform advanced analytics on Terabytes or Petabytes of data in minutes to hours compared to days
  • Variety – Any data type from structured to unstructured data including image files, social media, relational database content, and text data from weblogs or sensors
  • Veracity - 1 in 3 business leaders don’t trust the information they use to make decisions. How do we ensure the results are accurate and meaningful?
![Page 27: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/27.jpg)
BIG DATA CHALLENGES
• Loading web logs into MySQL
  • How do you parse and keep all the Data?
  • What about the variability of the Query String Parameters?
  • What if the web log format changes?
• Integration of other data sources
  • Social Media – Back in the early days even Facebook didn’t keep all the data. How do we know what is important in the stream?
  • Video and Image data – How do we store that type of data so we can extract the metadata information?
  • Sensor Data – Imagine all the different devices producing data and the different formats of the data.
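The query string problem is easy to see with a few log lines. The sketch below assumes Common Log Format style requests and made-up URLs; it extracts the parameters from each line and collects every parameter name seen, which is the set of columns a fixed relational schema would have to anticipate.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Request part of a log line, for example: "GET /path?query HTTP/1.1"
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<target>\S+) (?P<protocol>[^"]+)"')


def query_params(log_line):
    """Return the query string parameters of one log line as a dict."""
    match = REQUEST_RE.search(log_line)
    if match is None:
        return {}
    query = urlsplit(match.group("target")).query
    # parse_qs returns a list per name; keep the last value if a name repeats.
    return {name: values[-1] for name, values in parse_qs(query).items()}


LINES = [
    '10.0.0.1 - - [05/Jan/2014:10:00:00 -0700] "GET /search?q=hadoop&page=2 HTTP/1.1" 200 512',
    '10.0.0.2 - - [05/Jan/2014:10:00:01 -0700] "GET /item?id=42&ref=email&utm_source=news HTTP/1.1" 200 1024',
    '10.0.0.3 - - [05/Jan/2014:10:00:02 -0700] "GET /index.html HTTP/1.1" 200 2048',
]
params = [query_params(line) for line in LINES]
all_columns = sorted({name for line_params in params for name in line_params})
print(all_columns)
```

Three requests already produce five distinct parameter names, and each new page or campaign can add more, which is why a fixed table definition struggles with this kind of data.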
![Page 28: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/28.jpg)
[Diagram labels: Web Servers, Ecommerce Database, Order Processing, CRM, Operational Data Store, Enterprise Data Warehouse]
![Page 29: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/29.jpg)
[Diagram labels: Web Servers, Ecommerce Database, Order Processing, CRM, Operational Data Store, Enterprise Data Warehouse]
![Page 30: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/30.jpg)
LIFE CYCLE OF BIG DATA
•Acquire
  • Data captured at source
  • Part of ongoing operational processes (Web Log, RDBMS)
•Organize
  • Data transferred from operational systems to Big Data Platform
•Analyze
  • Data processed in batch by Map/Reduce
  • Data processed by Hadoop Tools (Hive, Pig)
  • Can Pre-condition data that is loaded back into RDBMS
•Decide
  • Load back into Operational Systems
  • Load into BI Tools and ODS
![Page 31: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/31.jpg)
MYSQL INTEGRATION WITH THE BIG DATA LIFE CYCLE
[Diagram: the Acquire, Organize, Analyze, Decide cycle, with labels Applier, NoSQL, Sensor Logs, Web Logs, BI Tools]
![Page 32: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/32.jpg)
LIFE CYCLE OF BIG DATA: MYSQL
•Acquire
  • MySQL as a Data Source
  • MySQL’s NoSQL
    • New NoSQL APIs
    • Ingest high volume, high velocity data, with veracity
    • ACID guarantees not compromised
  • Data Pre-processing or Conditioning
    • Run Real-time analytics against new data
    • Pre-process or condition data before loading into Hadoop
      • For example healthcare records can be anonymized
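The conditioning step for healthcare records might look like the sketch below, which drops direct identifiers, replaces the patient id with a keyed hash, and coarsens the birth date to a year. The field names, the sample record, and the secret key are hypothetical, and keyed hashing is strictly pseudonymization, so it is only one ingredient of a real anonymization process.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key, kept outside Hadoop


def pseudonymize(value, key=SECRET_KEY):
    """Replace an identifier with a keyed hash so rows stay joinable but unreadable."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def condition_record(record):
    """Keep only the fields needed for analysis, in a less identifying form."""
    return {
        "patient": pseudonymize(record["patient_id"]),
        "birth_year": record["birth_date"][:4],  # coarsen the date to a year
        "diagnosis": record["diagnosis"],
        # "name" is deliberately not copied over
    }


raw = {
    "patient_id": "P-1001",
    "name": "Jane Doe",
    "birth_date": "1970-03-14",
    "diagnosis": "J45",
}
clean = condition_record(raw)
print(clean)
```

Because the hash is deterministic for a given key, records for the same patient still group together in later Map/Reduce jobs.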
![Page 33: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/33.jpg)
LIFE CYCLE OF BIG DATA: MYSQL
•Organize
  • Data transferred in batches from MySQL tables to Hadoop using Apache Sqoop or MySQL Applier
  • With Applier, users can also invoke real-time change data capture processes to stream new data from MySQL to HDFS as it is committed by the client.
•Analyze
  • Multi-structured, multi-sourced data consolidated and processed
  • Run Map/Reduce Jobs and/or Hadoop Tools (Hive, Pig, others)
•Decide
  • Results loaded back to MySQL via Apache Sqoop
  • Provide new data for real-time operational processes
  • Provide broader, normalized data sets for BI Tool analytics
![Page 34: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/34.jpg)
TOOLS: MYSQL APPLIER
•Overview
  • Provides real-time replication of events from MySQL to Hadoop
•Usage
  • MySQL Applier for Hadoop connects to the MySQL master through the Binlog API and writes to HDFS through libhdfs (a C library precompiled with Hadoop)
  • Reads the binary log and then:
    • Fetches the row insert events occurring on the master
    • Decodes events, extracts data inserted into each field of the row
    • Uses content handlers to get it in required format
    • Appends it to a text file in HDFS
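The fetch, decode, format, and append steps can be sketched as a small pipeline. This is not the Applier's code or API: the events are simulated dictionaries rather than real binary log events, a local temporary file stands in for HDFS, and all function and field names are hypothetical.

```python
import os
import tempfile


def decode_insert(event):
    """Extract timestamp, primary key, and field values from one simulated insert event."""
    return [event["timestamp"], event["primary_key"], *event["fields"]]


def to_text(values, delimiter=","):
    """Content handler: render one decoded row as a delimited text line."""
    return delimiter.join(str(value) for value in values) + "\n"


def apply_events(events, path):
    """Append every row-insert event to a text file and skip other event types."""
    written = 0
    with open(path, "a", encoding="utf-8") as out:
        for event in events:
            if event["type"] != "insert":
                continue  # only row inserts are handled
            out.write(to_text(decode_insert(event)))
            written += 1
    return written


events = [
    {"type": "insert", "timestamp": 1389000000, "primary_key": 1, "fields": ["alice", 30]},
    {"type": "update", "timestamp": 1389000001, "primary_key": 1, "fields": ["alice", 31]},
    {"type": "insert", "timestamp": 1389000002, "primary_key": 2, "fields": ["bob", 25]},
]
target = os.path.join(tempfile.mkdtemp(), "datafile1.txt")
written = apply_events(events, target)
with open(target, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)
```

The append-only output matches the HDFS assumption that files are append-only, which is also why updates and deletes are harder to support than inserts.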
![Page 35: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/35.jpg)
TOOLS: MYSQL APPLIER
•Capabilities
  • Streaming real-time updates from MySQL into Hadoop for immediate analysis
  • Addresses performance issues from bulk loading
  • Exploits existing replication protocol
  • Provides row-based replication
  • Consumable by other tools
  • Possibilities for update/delete
•Limitations
  • DDL not handled
  • Only row inserts
![Page 36: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/36.jpg)
TOOLS: MYSQL APPLIER
[Diagram: Binary Log events pass through the Binlog API into MySQL Applier for Hadoop, which decodes each row (timestamp, primary key, data) and writes it to HDFS through libhdfs]
![Page 37: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/37.jpg)
TOOLS: MYSQL NOSQL
•Overview
  • NoSQL interfaces directly to the InnoDB and MySQL Cluster (NDB) storage engines
  • Bypass the SQL layer completely
  • Without SQL parsing and optimization, Key-Value data can be written directly to MySQL tables up to 9x faster, while maintaining ACID guarantees.
•Usage
  • Key Value Definition/Lookup
    • Designed for fast single row lookup
    • Loose Schema designed for fast lookup
  • Data Pre-processing or Conditioning
    • Run Real-time analytics against new data
    • Pre-process or condition data before loading into Hadoop
      • For example healthcare records can be anonymized
![Page 38: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/38.jpg)
TOOLS: MYSQL NOSQL
•Capabilities
  • Ingest high volume, high velocity data, with veracity
  • ACID guarantees are not compromised
  • Single stack for RDBMS and NoSQL
  • High volume KVP processing
    • Single-node processing:
      • 70k transactions per second
    • Clustered processing:
      • 650k ACID-compliant writes per sec
      • 19.5M writes per sec
  • Auto-sharding across distributed clusters of commodity nodes
  • Shared-nothing, fault-tolerant architecture for 99.999% uptime
•Limitations
  • <specify>
![Page 39: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/39.jpg)
WHAT’S NEXT
![Page 40: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/40.jpg)
MAHOUT
•Scalable machine learning algorithms
•Primary focus:
  • Collaborative filtering
  • Clustering
  • Classification
![Page 41: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/41.jpg)
SPARK
•In-memory cluster computing
  • Allows user programs to load data into a cluster's memory and query it repeatedly
•Streaming Architecture
![Page 42: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/42.jpg)
MESOS
•Cluster manager that manages resources across distributed systems
•Allows fine-grained control over system resources
• Stateful versus stateless (i.e. traditional virtualization architecture)
![Page 43: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/43.jpg)
SENTRY
•Granular role-based access control to data
•Addresses both data and metadata
![Page 44: Colorado Springs Open Source Hadoop/MySQL](https://reader033.vdocuments.mx/reader033/viewer/2022061300/54c6fc484a795944168b4641/html5/thumbnails/44.jpg)
Questions?