the business advantage of hadoop: lessons from the field – cloudera summer webinar series: 451...

Upload: cloudera-inc

Post on 20-Aug-2015

Category:

Business


TRANSCRIPT

Page 1: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

THE BUSINESS ADVANTAGE OF HADOOP: LESSONS FROM THE FIELD

Matt Aslett, Research Manager, 451 Research

Mike Olson, CEO, Cloudera

Bill Theisinger, Executive Director, Platform Data Services, YP

Aaron Wiebe, Blackberry Infrastructure Architect, Research In Motion

Page 2: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Introducing our Speakers

Aaron Wiebe

Bill Theisinger

Mike Olson

Matt Aslett

Page 3: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

© 2012 by The 451 Group. All rights reserved

Big Data, Total Data… Hadoop

Matt Aslett - @maslett

• Research manager, data management and analytics

Total Data

• Assesses data management approaches in an era of ‘big data’
• Explores the drivers behind new approaches to data management and analytics
• Explains the new and existing technologies used to store and process and deliver value from data

Page 4: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

“Big data” describes the realization of greater business intelligence by storing, processing and analyzing data that was previously ignored due to the limitations of traditional data management technologies to handle its volume, velocity and/or variety.

‘Big Data’

Velocity: The data is being produced at a rate that is beyond the performance limits of traditional systems

Volume: The volume of data is too large for traditional database software tools to cope with

Variety: The data lacks the structure to make it suitable for storage and analysis in traditional databases and data warehouses

Page 5: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

The adoption of non-traditional data processing technologies is driven not just by the nature of the data, but also by the user’s particular data processing requirements.

‘Total Data’

Exploration: The interest in exploratory analytic approaches, in which schema is defined in response to the nature of the query.

Totality: The desire to process and analyze data in its entirety, rather than analyzing a sample of data and extrapolating the results.

Dependency: The reliance on existing technologies and skills, and the need to balance investment in those existing technologies and skills with the adoption of new techniques.

Frequency: The desire to increase the rate of analysis in order to generate more accurate and timely business intelligence.

Page 6: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

A virtuous circle?

Increased use of interactive applications and data-generating machines

New commercial opportunities for analyzing previously ignored data

Increased desire to store and process all available data

More economically feasible to store and process previously ignored data

New infrastructure investments to support new data processing software

Page 7: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Distributed data storage (HDFS) and processing (MapReduce)

Multiple associated data management projects

• Open source
• Vendor-supported
• Clusters of commodity servers
• Storage of large data volumes
• Structured, unstructured and semi-structured data
• Flexible, schema-on-read processing
• Complex data sets
• Connectors to existing databases, data integration and business intelligence tools
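
Schema-on-read, listed above, means the data is stored as it arrives and a structure is applied only when a query reads it. The following is a minimal single-process sketch of that idea in plain Python, not Hadoop code; the record contents, field names and the `read_with_schema` helper are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no upfront schema).
RAW_LINES = [
    '{"user": "a1", "action": "search", "query": "pizza"}',
    '{"user": "b2", "action": "click", "listing_id": 42}',
    'not valid json at all',
    '{"user": "a1", "action": "click", "listing_id": 7, "referrer": "search"}',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time by projecting each record onto `fields`.

    Missing fields become None. Lines that cannot be parsed are skipped at
    read time instead of being rejected when the data is stored.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        yield {name: record.get(name) for name in fields}


# Two different "schemas" over the same stored lines, chosen per query.
clicks = [
    row
    for row in read_with_schema(RAW_LINES, ["user", "listing_id"])
    if row["listing_id"] is not None
]
actions = list(read_with_schema(RAW_LINES, ["user", "action"]))

print(clicks)
print(actions)
```

The same stored lines serve both queries, and a record with an unexpected extra field (`referrer`) needs no schema migration before it can be stored.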

What is Apache Hadoop?

HBase

ZooKeeper

Pig

Flume

Mahout

Avro

Chukwa

Sqoop

Whirr

HDFS

MapReduce

Hive

Hadoop Common

Hama

Page 8: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Hadoop as a platform for storing data that could not previously be efficiently stored.

Hadoop as a large scale data ingestion/ETL layer that complements existing databases.

Hadoop as a platform for new exploratory analytic applications.

What is Apache Hadoop for?

Big-data analytics

Big-data integration

Big-data storage

Page 9: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

THE EVOLUTION OF HADOOP

And how it’s used in the real world today

Mike Olson

CEO & Co-Founder, Cloudera

Page 10: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Fastest sort of a TB: 62 seconds over 1,460 nodes

Sorted a PB in 16.25 hours over 3,658 nodes
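
The two sort figures above can be turned into rough throughput numbers. The sketch below is back-of-the-envelope arithmetic only; it assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes), which the slide does not state, and the `throughput` helper is a hypothetical name.

```python
TB = 10**12  # assumption: decimal terabyte
PB = 10**15  # assumption: decimal petabyte


def throughput(total_bytes, seconds, nodes):
    """Return (aggregate bytes/second, bytes/second per node)."""
    aggregate = total_bytes / seconds
    return aggregate, aggregate / nodes


# Figures taken from the slide: 1 TB in 62 s on 1,460 nodes,
# and 1 PB in 16.25 hours on 3,658 nodes.
tb_total, tb_per_node = throughput(TB, 62, 1460)
pb_total, pb_per_node = throughput(PB, 16.25 * 3600, 3658)

print(f"1 TB sort: {tb_total / 1e9:.1f} GB/s aggregate, "
      f"{tb_per_node / 1e6:.1f} MB/s per node")
print(f"1 PB sort: {pb_total / 1e9:.1f} GB/s aggregate, "
      f"{pb_per_node / 1e6:.1f} MB/s per node")
```

Under these assumptions both runs sustain an aggregate sort rate in the range of 16 to 17 GB/s, with the per-node rate lower on the larger cluster.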

Page 11: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

©2011 Cloudera, Inc. All Rights Reserved.

Hadoop Distributed File System (HDFS)

File Sharing & Data Protection Across Physical Servers

MapReduce

Distributed Computing Across Physical Servers

Has the Flexibility to Store and Mine Any Type of Data

Ask questions across structured and unstructured data that were previously impossible to ask or solve

Not bound by a single schema

Excels at Processing Complex Data

Scale-out architecture divides workloads across multiple nodes

Flexible file system eliminates ETL bottlenecks

Scales Economically

Can be deployed on commodity hardware

Open source platform guards against vendor lock-in

Apache Hadoop is a platform for data storage and processing that is…

Scalable

Fault tolerant

Open source

CORE HADOOP COMPONENTS

Page 12: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

2008 - CLOUDERA FOUNDED BY MIKE OLSON, AMR AWADALLAH & JEFF HAMMERBACHER

2009 - HADOOP CREATOR DOUG CUTTING JOINS CLOUDERA

2009 - CDH: FIRST COMMERCIAL APACHE HADOOP DISTRIBUTION

2010 - CLOUDERA MANAGER: FIRST MANAGEMENT APPLICATION FOR HADOOP

2011 - CLOUDERA REACHES 100 PRODUCTION CUSTOMERS

2011 - CLOUDERA UNIVERSITY EXPANDS TO 140 COUNTRIES

2012 - CLOUDERA ENTERPRISE 4: THE STANDARD FOR HADOOP IN THE ENTERPRISE

2012 - CLOUDERA CONNECT REACHES 300 PARTNERS

BEYOND… TRANSFORMING HOW COMPANIES THINK ABOUT DATA

CLOUDERA ENTERPRISE 4

CHANGING THE WORLD ONE PETABYTE AT A TIME

Page 13: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

CLOUDERA ENTERPRISE

CDH: BIG DATA STORAGE, PROCESSING & ANALYTICS PLATFORM BASED ON APACHE HADOOP – 100% OPEN SOURCE

CLOUDERA MANAGER: END-TO-END MANAGEMENT APPLICATION FOR THE DEPLOYMENT & OPERATION OF CDH

CLOUDERA SUPPORT: OUR TEAM OF EXPERTS ON CALL TO HELP YOU MEET YOUR SERVICE LEVEL AGREEMENTS (SLAS)

PROFESSIONAL SERVICES

USE CASE DISCOVERY

NEW HADOOP DEPLOYMENT

PROOF OF CONCEPT

PRODUCTION PILOTS

PROCESS & TEAM DEVELOPMENT

DEPLOYMENT CERTIFICATION

EDUCATION

DEVELOPERS

ADMINISTRATORS

CERTIFICATION PROGRAMS

DATA SCIENTISTS

Page 14: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Cloudera’s software is never installed all by itself

It’s always deployed alongside mission-critical systems that represent enormous investment

Extracting value from data requires sharing it across boundaries and among systems

Goal: The right storage and the right processing in the right place at the right time

©2012 Cloudera, Inc. All Rights Reserved.

Page 15: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

✛ Disparate data sources
✛ Disparate systems for transforming, processing and analyzing data
✛ Disparate systems for capturing and reporting data, and for enforcing business and legislative governance requirements

All need to be connected for usability and to unlock the unique value of each

Page 16: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Logs, Files, Web Data, Relational Databases

IDEs, BI / Analytics, Enterprise Reporting

Enterprise Data Warehouse

Operational Rules Engines

Management Tools

OPERATORS, ENGINEERS, ANALYSTS, BUSINESS USERS

Cloudera Enterprise
• CDH
• Cloudera Manager
• Technical Support

Consulting Services

Cloudera University

Web Application

CUSTOMERS

Page 17: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

INDUSTRY | DATA PROCESSING | ADVANCED ANALYTICS
Web | Clickstream Sessionization | Social Network Analysis
Media | Engagement | Content Optimization
Telecom | Mediation | Network Analytics
Retail | Data Factory | Loyalty & Promotions
Financial | Trade Reconciliation | Fraud Analysis
Government | Signal Intelligence (SIGINT) | Entity Analysis
Biotech / Pharma | Genome Mapping | Sequencing Analysis
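
Clickstream sessionization, listed above as the Web data processing use case, groups each user's page events into visits separated by periods of inactivity. The sketch below shows one common way to do that on a single machine; the 30-minute timeout, the `(user_id, timestamp)` event shape and the `sessionize` function are assumptions for illustration and are not taken from the presentation.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumption: 30-minute inactivity timeout


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp_seconds) events into per-user sessions.

    Returns {user_id: [[ts, ts, ...], ...]}. A new session starts whenever
    the gap since the user's previous event exceeds `gap`.
    """
    by_user = defaultdict(list)
    for user_id, timestamp in events:
        by_user[user_id].append(timestamp)

    sessions = {}
    for user_id, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for timestamp in stamps[1:]:
            if timestamp - user_sessions[-1][-1] > gap:
                user_sessions.append([timestamp])  # inactivity: new session
            else:
                user_sessions[-1].append(timestamp)
        sessions[user_id] = user_sessions
    return sessions


# Hypothetical events, deliberately out of order.
EVENTS = [("u1", 0), ("u1", 600), ("u2", 100), ("u1", 4000), ("u2", 200)]
RESULT = sessionize(EVENTS)
print(RESULT)
```

On a cluster, the grouping by user corresponds to the shuffle step of a MapReduce job, and the per-user sort and split happens inside each reducer.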

Page 18: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Page 19: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

© 2012 YP Holdings LLC Intellectual Property. All rights reserved. YP Holdings LLC, the YP Holdings LLC logo and all other YP Holdings LLC marks contained herein are trademarks of YP Holdings LLC Intellectual Property and/or YP Holdings LLC affiliated companies. All other marks contained herein are the property of their respective owners. (INTERNAL USE ONLY)

Hadoop@YP

Sept 26, 2012

William Theisinger

Executive Director, Platform Computing

Page 20: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Challenges

Page 21: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

• Increasing volume of traffic data through our distribution network

• Need for a system to support changing data complexity and detail

• Adhere to tighter SLAs

• Provide intra-day reporting

• Benefit from the intelligence trapped in our data

What we were facing

Page 22: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Legacy processing flow

Application Log Data

ETL processing

Data Load

Data Load

Data Load

Data Warehouse

• Drop reportable events on the floor

• Loading multiple DBs

• Processing time was significant

• Reporting lag was in days, not hours

• High maintenance effort required

Data Layer

Page 23: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Hadoop Platform

Page 24: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Hadoop processing flow

Applications LWES

Data Layer

Data Warehouse

• All ETL processing in Hadoop

• Several systems integrate to Hadoop platform

• All Java MapReduce with some Hive for end user and dependent systems

• Reporting lag in hours, not days

• Actual reduction in maintenance needs
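
The flow above runs its ETL as MapReduce jobs. As a rough illustration of that programming model, the sketch below simulates the map, shuffle and reduce phases in a single Python process to produce hourly event counts of the kind intra-day reporting needs. It is not the Java MapReduce code used at YP; the log format, event names and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical log format: "<ISO timestamp> <event_type>"
LOG_LINES = [
    "2012-09-26T09:15:02 search",
    "2012-09-26T09:47:51 click",
    "2012-09-26T09:59:59 search",
    "2012-09-26T10:03:10 search",
    "malformed line",
]


def map_phase(line):
    """Emit ((hour, event_type), 1) for each well-formed log line."""
    parts = line.split()
    if len(parts) != 2 or len(parts[0]) < 13:
        return  # skip lines that do not match the assumed format
    timestamp, event_type = parts
    hour = timestamp[:13]  # e.g. "2012-09-26T09"
    yield (hour, event_type), 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


HOURLY_COUNTS = run_job(LOG_LINES)
print(HOURLY_COUNTS)
```

Because each key is reduced independently, a real cluster can spread the map and reduce work across many nodes, which is what lets a job of this shape keep pace as log volume grows.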

Data Collection

Hadoop Platform

Page 25: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Next Generation

Page 26: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Hadoop processing flow

Applications LWES

Data Layer

Data Warehouse

• Migrating some reporting to HBase

• Exposing core business KPIs via APIs

• Replacing various data marts with HBase tables/schemas

• Reducing TCO

• Alignment of core skill sets

Data Collection

Hadoop Platform

HBase Platform

Page 27: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Hadoop @ Research In Motion

Aaron Wiebe

BlackBerry Infrastructure Architect

Page 28: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Internal Use Only

The Problem

1. BlackBerry Services currently generate 500TB of instrumentation data daily (and growing rapidly).

2. Traditional systems unable to cope with both growth and access requests.

3. Total global dataset of ~100PB.

Confidential and Proprietary

Page 29: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

The Old Way

- Focus on reducing data to required data set

- Pipeline data flows to avoid hitting disk

- Scalability issues at most stages

- Going back to the Archive was really time consuming

Services

Filter and Split

Streaming ETL

Streaming ETL

Event Monitoring

Data Warehouse

Complex Correlation

Alerting

Archive Storage

Page 30: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

The Hadoop Way

- Archive storage moved to HDFS

- ETL processes converted to Hadoop (Pig + Hive)

- Some data warehouse functions migrating to Hadoop

Services

Filter and Split

Event Monitoring

Alerting

Hadoop

Archive Storage

ETL

Correlation

Stage 1 DWH

Data Warehouse

Page 31: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Real Results

- 90% code base reduction for ETL tools

- Example performance:

  - Previous ad-hoc query would take around 4 days

  - Now takes 53 minutes

- Significant capital cost reductions over previous system
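
The query example above can be expressed as a speedup factor. The short calculation below treats "around 4 days" as exactly 4 days of wall-clock time, which is an assumption; the slide gives only approximate figures.

```python
MINUTES_PER_DAY = 24 * 60

# Assumption: "around 4 days" is taken as exactly 4 days of wall-clock time.
before_minutes = 4 * MINUTES_PER_DAY
after_minutes = 53  # figure from the slide

speedup = before_minutes / after_minutes
reduction_pct = (1 - after_minutes / before_minutes) * 100

print(f"Speedup: about {speedup:.0f}x")
print(f"Elapsed time reduced by about {reduction_pct:.1f}%")
```

Under that assumption the ad-hoc query runs roughly two orders of magnitude faster than before.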

Page 32: The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Introducing our Speakers

Aaron Wiebe

Bill Theisinger

Mike Olson

Matt Aslett