Hadoop: What It Is and What It's Not


Upload: inside-analysis

Post on 20-Aug-2015


TRANSCRIPT

Twitter Tag: #briefr

The Briefing Room


•  Reveal the essential characteristics of enterprise software, good and bad

•  Provide a forum for detailed analysis of today’s innovative technologies

•  Give vendors a chance to explain their product to savvy analysts

•  Allow audience members to pose serious questions... and get answers!


•  November: Cloud

•  December: Innovators

•  January: Big Data

•  February: Performance

•  March: Integration


•  The Data Warehouse was once considered the Holy Grail of Business Intelligence, but as data volumes increase exponentially, we’re finding that data warehousing cannot be all things for all users.

•  Hadoop grew out of the open source Nutch search engine project, was developed extensively at Yahoo!, and has since turned into the poster child for open source Big Data processing.

•  While Hadoop is not a data warehouse, its capabilities can help organizations store and analyze huge volumes of data.


Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, data integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark has received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institution. He is an international speaker and a contributor at Forbes Online and Information Management. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net


•  Hortonworks is an enterprise software company that focuses on the development and support of Apache Hadoop.

•  Its product is the Hortonworks Data Platform, an open source platform for storing, processing and analyzing large volumes of data from many sources and in a variety of formats.

•  Hortonworks recently introduced its Hive ODBC Driver 1.0, which allows users to integrate its Hadoop platform with the BI apps running on top.


Jim Walker is the Director of Product Marketing at Hortonworks. He is a recovering developer, professional marketer and amateur photographer with nearly twenty years' experience building products and developing emerging technologies. During his career, he has brought multiple products to market in a variety of fields, including data loss prevention, master data management and now big data. At Hortonworks, Jim is focused on accelerating the development and adoption of Apache Hadoop.

© Hortonworks Inc. 2012

Hadoop: What It Is & Isn’t October 2012

Jim Walker, Director, Product Marketing, Hortonworks


Big Data: Organizational Game Changer

Figure: data sources plotted by data volume and by variety/complexity.

Volume axis: Megabytes, Gigabytes, Terabytes, Petabytes
Variety axis: Increasing Data Variety and Complexity

•  ERP: Purchase detail; Purchase record; Payment record
•  CRM: Offer details; Support Contacts; Customer Touches; Segmentation
•  WEB: Web logs; Offer history; A/B testing; Dynamic Pricing; Affiliate Networks; Search Marketing; Behavioral Targeting; Dynamic Funnels
•  BIG DATA: User Generated Content; Mobile Web; SMS/MMS Sentiment; External Demographics; HD Video, Audio, Images; Speech to Text; Product/Service Logs; Social Interactions & Feeds; Business Data Feeds; User Click Stream; Sensors / RFID / Devices; Spatial & GPS Coordinates

Transactions + Interactions + Observations = BIG DATA


What is a Data Driven Business?

• DEFINITION: Better use of available data in the decision making process

• RULE: Key metrics derived from data should be tied to goals

• PROVEN RESULTS: Firms that adopt Data-Driven Decision Making have output and productivity that is 5-6% higher than what would be expected given their investments and usage of information technology*


* “Strength in Numbers: How Does Data-Driven Decisionmaking Affect Firm Performance?” Brynjolfsson, Hitt and Kim (April 22, 2011)


Big Data: Optimize Outcomes at Scale

•  Media: optimize content
•  Intelligence: optimize detection
•  Finance: optimize algorithms
•  Advertising: optimize performance
•  Fraud: optimize prevention
•  Retail / Wholesale: optimize inventory turns
•  Manufacturing: optimize supply chains
•  Healthcare: optimize patient outcomes
•  Education: optimize learning outcomes
•  Government: optimize citizen services

Source: Geoffrey Moore. Hadoop Summit 2012 keynote presentation.


Enterprise Big Data Flows

Diagram components:
•  Business Transactions & Interactions: CRM, ERP; Web, Mobile; Point of sale
•  Business Intelligence & Analytics: Dashboards, Reports, Visualization, …
•  Classic Data Integration & ETL
•  Big Data Platform
•  Data sources: Unstructured Data; Log files; DB data; Exhaust Data; Social Media; Sensors, devices

Flow steps:
1. Capture Big Data: Collect data from all sources, structured & unstructured
2. Process: Transform, refine, aggregate, analyze, report
3. Distribute Results: Interoperate and share data with applications/analytics
4. Feedback: Use operational data w/in big data platform, preserve data


Data Platform for Big Data

Data Platform Requirements for Big Data


Capture
•  Collect data from all sources: structured and unstructured data
•  All speeds: batch, async, streaming, real-time

Process
•  Transform, refine, aggregate, analyze, report

Exchange
•  Deliver data with enterprise data systems
•  Share data with analytic applications and processing

Operate
•  Provision, monitor, diagnose, manage at scale
•  Reliability, availability, affordability, scalability, interoperability

Across all deployment models: Operating Systems; Virtual Platforms; Cloud Platforms; Big Data Appliances


Big Data Transactions, Interactions, Observations

Apache Hadoop & Big Data Use Cases


Refine Explore Enrich

Business Case


Operational Data Refinery
Hadoop as platform for ETL modernization

Capture
•  Capture new unstructured data along with log files, all alongside existing sources
•  Retain inputs in raw form for audit and continuity purposes

Process
•  Parse the data & cleanse
•  Apply structure and definition
•  Join datasets together across disparate data sources

Exchange
•  Push to existing data warehouse for downstream consumption
•  Feeds operational reporting and online systems

Diagram labels: Unstructured, Log files, DB data; Refinery (Capture and archive; Parse & Cleanse; Structure and join); Upload; Enterprise Data Warehouse. Use-case tabs: Refine, Explore, Enrich.
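
The refinery steps above (capture raw input, parse and cleanse, apply structure, join with existing sources) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the log format, the field names and the `customers` lookup are hypothetical and do not come from the slides, and on a cluster the same steps would more likely be written in Pig, Hive or MapReduce.

```python
import re
from typing import Dict, Iterable, List, Optional

# Hypothetical log format: "<ISO timestamp> <customer_id> <action>"
LOG_PATTERN = re.compile(r"^(\S+)\s+(\d+)\s+(\w+)$")


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Parse one raw log line; return None for lines that do not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    timestamp, customer_id, action = match.groups()
    # Apply structure: named fields, action normalized to lower case.
    return {"timestamp": timestamp, "customer_id": customer_id, "action": action.lower()}


def refine(raw_lines: Iterable[str], customers: Dict[str, str]) -> List[Dict[str, str]]:
    """Parse, cleanse and join log events with customer records."""
    rows = []
    for line in raw_lines:
        event = parse_line(line)
        if event is None:
            continue  # cleanse: skip malformed lines (the raw archive is untouched)
        # Join: attach the customer name from the existing (DB) source.
        event["customer_name"] = customers.get(event["customer_id"], "unknown")
        rows.append(event)
    return rows


# Raw inputs are retained as-is; refinement produces a separate structured output.
raw_archive = [
    "2012-10-01T12:00:00 42 LOGIN",
    "garbled ### line",
    "2012-10-01T12:05:00 7 Purchase",
]
customers = {"42": "Acme Corp"}
warehouse_rows = refine(raw_archive, customers)
```

Here `warehouse_rows` stands in for the structured output that would be pushed to the data warehouse, while `raw_archive` is left unmodified for audit purposes.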


Big Data Exploration & Visualization
Hadoop as agile, ad-hoc data mart

Capture
•  Capture multi-structured data and retain inputs in raw form for iterative analysis

Process
•  Parse the data into queryable format
•  Explore & analyze using Hive, Pig, Mahout and other tools to discover value
•  Label data and type information for compatibility and later discovery
•  Pre-compute stats, groupings, patterns in data to accelerate analysis

Exchange
•  Use visualization tools to facilitate exploration and find key insights
•  Optionally move actionable insights into EDW or datamart

Diagram labels: Unstructured, Log files, DB data; Capture and archive; Structure and join; Categorize into tables; upload via JDBC / ODBC; Explore; Visualization Tools; EDW / Datamart (optional). Use-case tabs: Refine, Explore, Enrich.
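
As a rough illustration of the "pre-compute stats, groupings" step, the sketch below groups parsed records by key and stores count, sum and mean so that later queries read the summary instead of rescanning raw data. It is plain Python with a hypothetical record layout (page, load time); in the Hadoop stack described here this would more typically be a Hive or Pig aggregation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def precompute_stats(records: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    """Group (key, value) records and keep count, sum and mean per key."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for key, value in records:
        totals[key]["count"] += 1
        totals[key]["sum"] += value
    return {
        key: {"count": t["count"], "sum": t["sum"], "mean": t["sum"] / t["count"]}
        for key, t in totals.items()
    }


# Hypothetical parsed page-view records: (page, load time in seconds)
page_views = [("/home", 0.5), ("/cart", 1.5), ("/home", 1.5)]
stats = precompute_stats(page_views)
```

A visualization tool could then read `stats` directly, which is the point of pre-computing: the expensive pass over raw data happens once, in batch.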


Application Enrichment
Deliver Hadoop analysis to online apps

Capture
•  Capture data that was once too bulky and unmanageable

Process
•  Uncover aggregate characteristics across data
•  Use Hive, Pig and MapReduce to identify patterns
•  Filter useful data from mass streams (Pig)
•  Micro or macro batch oriented schedules

Exchange
•  Push results to HBase or other NoSQL alternative for real time delivery
•  Use patterns to deliver right content/offer to the right person at the right time

Diagram labels: Unstructured, Log files, DB data; Capture; Parse; Derive/Filter; Scheduled & near real time; NoSQL, HBase (Low Latency); Enrich; Online Applications. Use-case tabs: Refine, Explore, Enrich.
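
The enrichment pattern above splits into a batch step that derives a pattern and an online step that only does a key lookup. The sketch below shows that split in plain Python. The event layout, the function names and the "most viewed category" rule are hypothetical, and an ordinary dict stands in for HBase or another NoSQL store.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def top_category_per_user(events: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Batch step: find each user's most frequently viewed category."""
    counts = defaultdict(Counter)
    for user_id, category in events:
        counts[user_id][category] += 1
    return {user: c.most_common(1)[0][0] for user, c in counts.items()}


def recommend(store: Dict[str, str], user_id: str, default: str = "bestsellers") -> str:
    """Online step: a single key lookup, cheap enough to run per request."""
    return store.get(user_id, default)


# Hypothetical click events: (user_id, category)
events = [("u1", "books"), ("u1", "music"), ("u1", "books"), ("u2", "games")]

# A plain dict stands in for HBase or another low-latency store.
low_latency_store = top_category_per_user(events)
```

The online application never touches the raw events; it only calls something like `recommend(low_latency_store, "u1")`, which keeps request latency independent of data volume.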


Hadoop in Enterprise Data Architectures

Diagram labels, grouped by layer:
•  Existing Business Infrastructure: EDW; ODS & Datamarts; Applications & Spreadsheets; Visualization & Intelligence; Discovery Tools; IDE & Dev Tools
•  Low Latency/NoSQL
•  Web: Web Applications
•  Operations: Custom; Existing
•  Hadoop platform components: Templeton; Sqoop; WebHDFS; Flume; HCatalog; Pig; HBase; Hive; Ambari; HA; Oozie; ZooKeeper; MapReduce; HDFS
•  Big Data Sources (transactions, observations, interactions): CRM; ERP; Exhaust Data; logs; files; financials; Social Media
•  New Tech: Datameer; Tableau; Karmasphere; Splunk


Where Does It Fit into Your Business?

Use cases by vertical (slide columns: Refine, Explore, Enrich)

Retail & Web
•  Log Analysis/Site Optimization
•  Social Network Analysis
•  Dynamic Pricing
•  Session & Content Optimization

Retail
•  Loyalty Program Optimization
•  Brand and Sentiment Analysis
•  Dynamic Pricing/Targeted Offer

Intelligence
•  Threat Identification
•  Person of Interest Discovery
•  Cross Jurisdiction Queries

Finance
•  Risk Modeling & Fraud Identification
•  Trade Performance Analytics
•  Surveillance and Fraud Detection
•  Customer Risk Analysis
•  Real-time upsell, cross sales marketing offers

Energy
•  Smart Grid: Production Optimization
•  Grid Failure Prevention
•  Smart Meters
•  Individual Power Grid

Manufacturing
•  Supply Chain Optimization
•  Customer Churn Analysis
•  Dynamic Delivery
•  Replacement parts

Healthcare & Payer
•  Electronic Medical Records (EMPI)
•  Clinical Trials Analysis
•  Insurance Premium Determination


We believe that by the end of 2015, more than half the world's data will be processed by Apache Hadoop.

Hortonworks Vision & Leadership

Trusted
•  Stewards of core Hadoop
•  Original builders and operators of Hadoop
•  100+ years Hadoop development experience
•  Managed every viable, stable Hadoop release
•  HDP built on Hadoop 1.0

Open
•  100% open platform
•  No POS holdback
•  Open to the Hadoop community
•  Open to the Hadoop ecosystem
•  Closely aligned to Hadoop core

Innovative
•  Innovating current platform with HCatalog, Ambari, HA
•  Innovating future platform with YARN, HA
•  Complete vision for Hadoop-based platform
•  Enable the Hadoop ecosystem


Hortonworks Data Platform

•  Simplify deployment to get started quickly and easily

•  Monitor, manage any size cluster with familiar console and tools

•  Only platform to include data integration services to interact with any data

•  Metadata services open the platform for integration with existing applications

•  Dependable high availability architecture

•  Tested at scale to future proof your cluster growth

✓  Reduce risks and cost of adoption
✓  Lower the total cost to administer and provision
✓  Integrate with your existing ecosystem


© Third Nature Inc.

“In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers.”

Grace Hopper


What’s different today?

We’re not getting more CPU speed, but more CPU cycles.

There are too many CPUs relative to other resources, creating an imbalance in hardware platforms.

We therefore use nodes to aggregate memory, network bandwidth and IOPS.

Most software is designed for a single worker, not for high degrees of parallelism, and won’t scale well.
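
To make the "single worker versus many workers" point concrete, here is a minimal map-and-reduce style word count in Python. It is a sketch of the programming pattern only, not of Hadoop itself: the partitions, function names and worker count are hypothetical, and Python threads are used just to show the structure (independent partial results that are merged afterwards), not to demonstrate a speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List


def count_words(partition: List[str]) -> Counter:
    """Map step: each worker counts words in its own partition only."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(partitions: List[List[str]], workers: int = 4) -> Counter:
    """Run the map step per partition, then merge the partial counts (reduce step)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, partitions))
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical input, already split into partitions as it would be across nodes.
partitions = [["big data big"], ["data platform"], ["big platform"]]
totals = parallel_word_count(partitions)
```

Because no worker needs another worker's data until the final merge, adding nodes adds memory, bandwidth and I/O capacity along with CPU, which is the aggregation the slide describes.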


Data volume is the oldest, easiest problem

Teradata


Analytics makes the data volume problem bigger

Many of the processing problems are O(n²) or worse, so even moderate data volumes can be a problem for DW architectures
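
A quick worked example of why O(n²) hurts. The slide does not name a specific algorithm, so this sketch assumes an all-pairs comparison (for instance, checking every record against every other record) as the example workload.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} records -> {pairwise_comparisons(n):>13,} comparisons")
```

Growing the data 100-fold, from 1,000 to 100,000 records, grows the work roughly 10,000-fold, from 499,500 to 4,999,950,000 comparisons.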


A common problem with new projects or unexpected business problems…

“It would be logical to keep all the data in one place.”

“I need that data now.”

“It will take 6 months.”


The proposed solution? Load Hadoop and analyze


Welcome to the Hadoop schema!

Why soft / no schema can be good:
•  Easier programming
•  Easier modeling, since you don’t have to be perfect in advance, and it’s change-resilient
•  Join elimination = I/O savings (if no updates)
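
One way to picture the "soft schema" idea is schema-on-read: raw records are stored as they arrive, and each reader applies only the structure it needs. The sketch below assumes JSON lines as the raw format; the field names and the `read_with_schema` helper are hypothetical and used only for illustration.

```python
import json
from typing import Callable, Dict, Iterable, Iterator


def read_with_schema(
    raw_lines: Iterable[str], fields: Dict[str, Callable]
) -> Iterator[dict]:
    """Apply a schema at read time: pick the requested fields and cast them."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else None
            for name, cast in fields.items()
        }


# Raw data is stored untouched; the second record carries a field added later.
raw = [
    '{"user": "u1", "amount": "19.99"}',
    '{"user": "u2", "amount": "5", "coupon": "FALL12"}',
]

# Two readers, two schemas, same raw data. Neither required a reload.
rows_v1 = list(read_with_schema(raw, {"user": str, "amount": float}))
rows_v2 = list(read_with_schema(raw, {"user": str, "coupon": str}))
```

The new `coupon` field did not break the first reader and needed no up-front model change, which is the change resilience the slide refers to. The cost is that every reader repeats the parsing and casting work.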


Whether to switch from a DB isn’t the right discussion

[Image labels: SQL..., SQL!, SQL?, SQL, Hadoop]


Strategy: There’s a pony in there somewhere

…but you need a unicorn to find the pony


Questions for discussion

1. Is scale of data really that much of a problem for most organizations?

2. Hadoop is designed for batch work – how good is it for interactive use? Real-time use cases?

3. How do you define “platform”?

4. ETL modernization is mentioned, but isn’t this a reversion to manual coding?

5. How do you design for long-term use rather than one-off analysis projects?

6. Does open source really matter for this part of the stack?


CC Image Attributions

Thanks to the people who supplied the Creative Commons licensed images used in this presentation:

Phone dump - Richard Barnes
ponies in field.jpg - http://www.flickr.com/photos/bulle_de/352732514/


•  This Month: Database

•  November: Cloud

•  December: Innovators

•  January: Big Data

•  2013 Editorial Calendar (www.insideanalysis.com)
