Hadoop World 2011: Building Scalable Data Platforms; Hadoop & Netezza Deployment Models
DESCRIPTION
Hadoop has rapidly emerged as a viable platform for Big Data analytics. Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. In this presentation, you will learn about the similarities and differences between Hadoop and parallel data warehouses, and typical best practices. Edmunds will discuss how they increased delivery speed, reduced risk, and achieved faster reporting by combining ELT and ETL. For example, Edmunds ingests raw data into Hadoop and HBase, then reprocesses the raw data in Netezza. You will also learn how Edmunds uses prototyping to work on nearly raw data with the company’s Analytics Team using Netezza.
TRANSCRIPT
Building Scalable Data Platforms
Hadoop and Netezza Deployment Models
Krishnan Parasuraman, Netezza
Greg Rokita, Edmunds.com
Hadoop World 2011
Talking Points
• Building scalable data platforms
– Architectural considerations
• Hadoop and Massively Parallel Databases
– Similarities and differences
– Usage patterns
• Practitioner’s View Point
– Edmunds.com data warehouse platform
Building scalable data platforms
Typical Digital Media Information Processing Pipeline
Clicks
Visits
Page Views
Likes
Tweets
Impressions
Locations
Real Time Decision Engine
• Display Ads
• Recommendation
• Personalized Content
Data Processing
• Correlate
• Structure
• Consolidate
Analytics and Optimization
• Scoring
• Yield optimization
• Audience Analytics
Reporting
• Aggregate
• Summarize
• Ad-hoc analysis
Building scalable data platforms
DATA PLATFORM
Clicks
Visits
Page Views
Likes
Tweets
Impressions
Real Time Decision Engine
Locations
Data Processing
Analytics and Optimization
Reporting
Building scalable data platforms
Real Time Decision Engine
Workloads: Real Time, High Concurrency, Transactional, High Throughput
Data: Structured, Unstructured, Key-Value pairs
Capability: Stream Processing, Memory resident, Key-based lookups
Data Processing
Workloads: High Velocity, Linearly Scalable, Disk bound
Data: Structured, Unstructured, Machine Generated
Capability: Low Disk I/O, Fast Processing, Low Cost/TB
Analytics and Optimization
Workloads: Cached Queries, Low Latency, High Concurrency
Data: Mostly Structured, Some unstructured
Capability: In-DB computation, SQL and MR, Analytic Libraries
Reporting
Workloads: Compute intensive, Full table scans, Disk bound
Data: Structured, Relational
Capability: OLAP, Columnar
NoSQL Databases
Hadoop
Graph DB
Massively Parallel DB
Plain Ole’ DB on steroids
In-Memory DB
Myth
A single technology will meet all the considerations for our scalable data platform needs
Best Practices
Workloads scale differently – Monolithic architectures don’t work
Minimize components – Data movement is painful
Understand tradeoffs – Performance, Price, Effort
Start with the core architecture and work in the edge cases
Massively parallel data warehouses
[Diagram: hosts (host controllers, SQL and MR), a network fabric, and massively parallel compute nodes, each with FPGA, memory and CPU, over distributed storage]
Hadoop
[Diagram: a master node (Name Node, Job Tracker), a network fabric, and parallel compute nodes, each running a Data Node and a Task Tracker; Map Reduce over distributed storage]
There are striking similarities….
Highly Available
Scalable
Execute code & algorithms next to data
Massive parallelism
But also key differences
Hadoop
• Data loading = file copy ("Look Ma, no ETL")
• Schema on read – data loading is fast
• Batch-mode data access
• Lower cost of data storage
• Process unstructured data
Netezza
• Optimized for performance
• Real-time access, random reads, query optimizer, co-located joins
• SQL and Map Reduce
• Hardware-accelerated queries
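The "schema on read" point can be illustrated with a minimal Python sketch. It is not from the talk: the tab-separated log layout, the field names, and the helpers `load` and `read_with_schema` are assumptions made for illustration. Loading is a plain append of raw lines, and structure is applied only when the data is read.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical raw store: loading appends lines with no parsing or validation.
raw_store: List[str] = []

def load(lines: List[str]) -> None:
    """'Data loading = file copy': keep the raw lines exactly as they arrive."""
    raw_store.extend(lines)

# A schema is an ordered list of (field name, parser) pairs, chosen at read time.
Schema = List[Tuple[str, Callable[[str], object]]]

def read_with_schema(schema: Schema) -> Iterator[Dict[str, object]]:
    """Apply the schema while scanning; lines that do not fit are skipped."""
    for line in raw_store:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < len(schema):
            continue
        try:
            yield {name: parse(value) for (name, parse), value in zip(schema, parts)}
        except ValueError:
            continue

# Illustrative data only.
load(["2011-11-08\t/new-cars\t200", "2011-11-08\t/used-cars\t404", "garbage"])
page_views: Schema = [("day", str), ("path", str), ("status", int)]
rows = list(read_with_schema(page_views))
```

The malformed line is still stored, so a different schema can be applied to the same raw data later. The cost is that every read pays for parsing, which fits the batch-mode access listed above.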
These differences lead to opportunities for co-existence for Hadoop in a Netezza environment
1. Scalable ETL engine
– Complex data
– Relationships not defined
– Evolving schema
2. Queryable Archive
– Moving computation is cheaper than moving data
3. Analytics sandbox
– Exploratory analysis
Netezza-Hadoop: Deployment Patterns
unstructured data
semi-structured data
structured data
Create context (classification, text mining)
Analyze
Parse, aggregate
Analyze, report
Analyze, report
Active archival
Long running queries
Pattern 1: Data Processing Engine (ETL)
[Diagram: Raw Weblogs, a Hadoop Cluster (NameNode/JobTracker and three DataNode/TaskTracker nodes), and the Netezza Environment]
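A rough Python sketch of the ETL role in Pattern 1 follows. Everything specific in it is assumed rather than taken from the slides: the three-field weblog layout, the function names, and the pipe-delimited output format. It only shows the shape of the work: a map step that parses raw lines, a reduce step that aggregates, and a delimited result that a warehouse bulk loader could ingest.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

def map_weblog(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: parse each raw line and emit (page, 1); malformed lines are dropped."""
    for line in lines:
        fields = line.split()
        # Assumed layout: <timestamp> <visitor_id> <page>
        if len(fields) != 3:
            continue
        yield fields[2], 1

def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the counts per page."""
    totals: Dict[str, int] = defaultdict(int)
    for page, count in pairs:
        totals[page] += count
    return dict(totals)

def to_delimited(totals: Dict[str, int]) -> str:
    """Render the aggregate as pipe-delimited rows, sorted by page."""
    return "\n".join(f"{page}|{count}" for page, count in sorted(totals.items()))

# Illustrative input only.
raw = [
    "2011-11-08T10:00:00 v1 /new-cars",
    "2011-11-08T10:00:05 v2 /new-cars",
    "broken line",
    "2011-11-08T10:01:00 v1 /used-cars",
]
summary = reduce_counts(map_weblog(raw))
export = to_delimited(summary)
```

On a real cluster the map and reduce steps would run in parallel across the data nodes; here they are chained in one process to keep the example self-contained.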
Pattern 2: Low cost storage and dynamic provisioning
[Diagram: Amazon Cloud (Amazon S3, Elastic MapReduce) and the Netezza Environment, connected by steps numbered 1 to 3]
Pattern 3: Queryable Archive
[Diagram: Data Sources and the Netezza Environment, connected by steps numbered 1 to 3]
No part of this document or the information it contains may be used, or disclosed to any person or entity, for any purpose other than advancing the best interests of the Edmunds.com, Inc., and any such disclosure requires the express approval of Edmunds.com, Inc.
Edmunds.com and Scale
o Premier online resource for automotive information, launched in 1995 as the first automotive information Web site
o 15 million unique visitors
o 210 million page views
o 1 million+ new inventory items per day
o 2 TB of new data every month
o 40 node Hadoop cluster aggregating logs, advertising, vehicle, pricing, inventory and other data sets
Edmunds Proposition
We have developed an iterative approach to data warehouse development that has dropped the time it takes for us to deliver reports to our users from months to weeks.
How did we do it?
o Process
o Technology
o Understanding of Value
Process: agile approach
o Continuous and fast delivery of new features
o Collaboration between users and developers
o Make new data available quickly and inexpensively
o Quick problem resolution
o No wasting of entire development cycle if data is not useful
o Encouragement of exploration and creation of new applications
Process
Post-process:
• Filtered
• Transformed
• Modeled as star schema
• Optimized
• Slow turn-around
• High retention
• Fast performance
Pre-process:
• Complete
• Raw
• Modeled as source data
• Generically loaded
• Quick turn-around
• Low retention
• Slower performance
Post-Process Sandbox
Prototype
Yes: Develop Optimized Pipeline
• data is confirmed to be useful
• effort is warranted
No: Discard
• prevents shadow production
• little effort lost
Technology
Edmunds Publishing System
Generic flow for pre-process
Generic, written once
What architecture enables generic consumer?
o Message
o Delivery
o Routing
o Persistence
o Durability
o Retries
o Throttling
o Versioning
o Monitoring
ActiveMQ
Camel
Thrift
Flexibility for Producers and Consumers: Support for Topologies
Field | Example Values | Purpose
Environment | PROD, TEST, DEV | Promotion cycle of deployment units
Index | Blue, Green, Stage | Environment Index
Data Center | LAX1, EC2 | The data center where deployment unit is located
Site | Edmunds, Insideline | Company’s Product
Application | HBase, Digital Asset Manager | Deployment Unit
Producer-Consumer matching
[Diagram labels: Producer, Consumer, BrokerDestinationInterceptor, Virtual Topic Name (PublishInventory), Queue Name (PublishInventory)]
Topologies shown:
• Prod / Lax / Edmunds / Inventory
• Prod, Test / Lax, EC2 / Edmunds / Dealer
• Prod / Lax, EC2 / Edmunds / Inventory
• Test / EC2 / Edmunds / Dealer
Match!
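One way to read the matching shown on this slide is as set membership: a consumer lists the topology values it accepts, and a producer matches when each of its values is in the corresponding set. The Python sketch below illustrates only that reading. Treating the first topology as the producer and the other three as consumers is an assumption, as are the class and function names; the actual mechanism in the talk is a broker-side interceptor, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable

@dataclass(frozen=True)
class Topology:
    """Where a producer runs; field names follow the topology table, minus Index."""
    environment: str
    data_center: str
    site: str
    application: str

@dataclass(frozen=True)
class Subscription:
    """Sets of topology values a consumer is willing to receive from."""
    environments: FrozenSet[str]
    data_centers: FrozenSet[str]
    sites: FrozenSet[str]
    applications: FrozenSet[str]

    def matches(self, producer: Topology) -> bool:
        """True only when every producer field is accepted by this consumer."""
        return (
            producer.environment in self.environments
            and producer.data_center in self.data_centers
            and producer.site in self.sites
            and producer.application in self.applications
        )

def subscription(envs: Iterable[str], dcs: Iterable[str],
                 sites: Iterable[str], apps: Iterable[str]) -> Subscription:
    """Hypothetical helper that builds a Subscription from plain lists."""
    return Subscription(frozenset(envs), frozenset(dcs), frozenset(sites), frozenset(apps))

producer = Topology("Prod", "Lax", "Edmunds", "Inventory")
consumers = [
    subscription(["Prod", "Test"], ["Lax", "EC2"], ["Edmunds"], ["Dealer"]),
    subscription(["Prod"], ["Lax", "EC2"], ["Edmunds"], ["Inventory"]),
    subscription(["Test"], ["EC2"], ["Edmunds"], ["Dealer"]),
]
matched = [c for c in consumers if c.matches(producer)]
```

Under this reading only the second consumer matches, since the other two differ in application or environment.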
HBase: how to handle data generically
Column Family | Columns | Role
Binary | Serialized Thrift Object | System of record
Binary | Hashcode of the Thrift Object | Check if updates are necessary (optimization)
Discrete | Thrift Object Field 1, Field 2, Field 3 | Versioning at the most granular level for lookups
Type 2 | Start Date, End Date, List of fields | Versioning for optimized dimension tables
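The "check if updates are necessary" optimization can be sketched as follows. This is an illustration under stated assumptions, not the Edmunds implementation: a plain dictionary stands in for the HBase table (no HBase client API is used), the column names `binary:object` and `binary:hash` are hypothetical, and SHA-1 over the serialized bytes stands in for whatever hashcode the real system stores.

```python
import hashlib
from typing import Dict

# Stand-in for an HBase table: row key -> {column name: bytes}.
table: Dict[str, Dict[str, bytes]] = {}

def put_if_changed(row_key: str, serialized: bytes) -> bool:
    """Write the serialized object only when its hash differs from the stored one.

    Returns True when a write happened and False when the update was skipped.
    """
    digest = hashlib.sha1(serialized).digest()
    row = table.get(row_key)
    if row is not None and row.get("binary:hash") == digest:
        return False  # identical payload already stored, nothing to do
    table[row_key] = {"binary:object": serialized, "binary:hash": digest}
    return True

# Illustrative payloads only.
first = put_if_changed("modelyear:1368", b"celica-1993")
repeat = put_if_changed("modelyear:1368", b"celica-1993")
changed = put_if_changed("modelyear:1368", b"celica-1993-gts")
```

Comparing a short hash avoids reading back and comparing the full serialized object on every incoming message.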
Generic Thrift Persistence in HBase
Column Name | Value
[ModelYear]|F:id|T:long|I:0 | 1368
[ModelYear]|F:midYear|T:boolean|I:1 | false
[ModelYear]|F:year|T:int|I:2 | 1993
[ModelYear]|F:name|T:java.lang.String|I:4 | Celica
[ModelYear]#[attributss][0]|F:_key|T:java.lang.Long | 64
[ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][0]|F:value|T:java.lang.String|I:1 | Standard Sport
[ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:value|T:java.lang.String|I:1 | V:GT-S 2dr Hatchback
[ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:id|T:long|I:2 | 441
[ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][3]|F:value|T:java.lang.String|I:1 | V:GT-S
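The column names on this slide look like paths through a nested object, and a simplified Python sketch of that idea follows. It is an approximation under assumptions: it flattens plain dictionaries and lists rather than Thrift objects, it records Python type names instead of Java ones, it omits the trailing `I:` part, and it assumes that every list element and map value is itself a dictionary. The function name `flatten` and the sample data are illustrative.

```python
from typing import Any, Dict

def flatten(path: str, obj: Dict[str, Any], out: Dict[str, Any]) -> None:
    """Write one entry per scalar field, naming it by its path from the root."""
    for name, value in obj.items():
        if isinstance(value, dict):
            # Map-valued field: one nested path segment per map key.
            for key, child in value.items():
                flatten(f"{path}#[{name}][{key}]", child, out)
        elif isinstance(value, list):
            # List-valued field: one nested path segment per position.
            for index, child in enumerate(value):
                flatten(f"{path}#[{name}][{index}]", child, out)
        else:
            out[f"{path}|F:{name}|T:{type(value).__name__}"] = value

# Illustrative object, loosely modeled on the values shown on the slide.
model_year = {
    "id": 1368,
    "midYear": False,
    "year": 1993,
    "name": "Celica",
    "styles": [{"name": "GT-S 2dr Hatchback"}],
}
columns: Dict[str, Any] = {}
flatten("[ModelYear]", model_year, columns)
```

Because each scalar lands in its own column, a single field can be looked up or versioned without deserializing the whole object.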
Netezza: Time is Money
Compared to Oracle: Up to 12x faster load times
Business value:
• Can reload data more frequently
• Failed workflows are no longer a big problem
• Helps in transition to real time system: we can now create intraday reports for Leads!
Compared to Oracle: Up to 400x faster query times
Business value:
• More productive Business Intelligence
• Queries that could ‘never’ finish in Oracle are now providing business value
Generic and reusable Oozie actions for Netezza
Value
o Data warehouse proves product value both internally and to our customers
o Failing fast and quick turnaround allow us to know when we are building the right reporting and analytical products without a large up-front investment
o By combining all data in a single system we are enabling new products to be developed that we previously could not
Building Scalable Data Platforms
Hadoop and Netezza Deployment Models
Krishnan Parasuraman, @kparasuraman
Greg Rokita, Edmunds.com