Boston HUG - Cloudera presentation

PROFIT FROM ALL OF YOUR DATA February 2012 Hadoop in the Enterprise Adam Smieszny | Systems Engineer

Upload: reedshea

Post on 26-Jan-2015


Page 1: Boston HUG - Cloudera presentation

PROFIT FROM ALL OF YOUR DATA

February 2012

Hadoop in the Enterprise

Adam Smieszny | Systems Engineer

Page 2: Boston HUG - Cloudera presentation

©2011 Cloudera, Inc. All Rights Reserved.

Agenda

• Hadoop Overview
• History of Hadoop
• What is Hadoop
• Hadoop in the Enterprise

Page 3: Boston HUG - Cloudera presentation


Existing Data Management

[Chart axes: years 2005, 2010, 2015; scale 0 to 10,000]

Current Database Solutions are designed for structured data.

Optimized to answer known questions quickly

Schemas dictate form/context

Difficult to adapt to new data types and new questions

Expensive at Petabyte scale

[Chart: gigabytes of data created (in billions); series: structured data, unstructured data; label: 10%]

Page 4: Boston HUG - Cloudera presentation


Why the Need for Hadoop?

[Chart axes: years 2005, 2010, 2015; scale 0 to 10,000]

1.8 trillion gigabytes of data was created in 2011…

More than 90% is unstructured data

Approx. 500 quadrillion files

Quantity doubles every 2 years

[Chart: gigabytes of data created (in billions); series: structured data, unstructured data]

Source: IDC 2011

More Devices

New Sources

More Content

New & Better Info

Page 5: Boston HUG - Cloudera presentation


The Origins of Hadoop

Open source web crawler project created by Doug Cutting

[Google] publishes MapReduce and GFS papers

Open source MapReduce and HDFS project created by Doug Cutting

[Yahoo!] runs 4,000-node Hadoop cluster

Hadoop wins Terabyte sort benchmark

[Facebook] launches SQL support for Hadoop

[Cloudera] releases CDH and Cloudera Enterprise

[Timeline axis: 2002, 2007, 2012]

Page 6: Boston HUG - Cloudera presentation


What is Apache Hadoop?

Hadoop Distributed File System (HDFS)

File Sharing & Data Protection Across Physical Servers

MapReduce

Distributed Computing Across Physical Servers

Flexibility

A single repository for storing, processing & analyzing any type of data

Not bound by a single schema

Scalability

Scale-out architecture divides workloads across multiple nodes

Flexible file system eliminates ETL bottlenecks

Low Cost

Can be deployed on commodity hardware

Open source platform guards against vendor lock-in

Hadoop is a platform for data storage and processing that is scalable, fault tolerant, and open source.

CORE HADOOP COMPONENTS
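The MapReduce model named on this slide can be illustrated with a small single-process sketch. This is not Hadoop's API and does not run on a cluster; it only imitates the three stages (map, shuffle, reduce) in plain Python. The function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count`, and the sample input, are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    # Hypothetical sample input; on a real cluster each line could live on a different node.
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(sample))
```

In a real deployment the map and reduce steps would run in parallel on the servers that hold the data blocks in HDFS; the sketch keeps everything in one process so the data flow is easy to follow.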


Page 7: Boston HUG - Cloudera presentation


What is CDH?

Fastest Path to Success

No need to write your own scripts or do integration testing on different components

Works with a wide range of operating systems, hardware, databases and data warehouses

Stable and Reliable

Extensive Cloudera QA systems, software & processes

Tested & run in production at scale

Proven at scale in dozens of enterprise environments

Community Driven

Incorporates only main-line components from the Apache Hadoop ecosystem – no forks or proprietary underpinnings

FREE

Cloudera’s Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is…

• 100% Apache open source
• Contains all components needed for deployment
• Fully documented and supported
• Released on a reliable schedule


Page 8: Boston HUG - Cloudera presentation

More coming…

Packaging, testing

Sqoop framework, adapters

Drivers, language enhancements, testing

Coordination: APACHE ZOOKEEPER

Data Integration: APACHE FLUME, APACHE SQOOP

Fast Read/Write Access: APACHE HBASE

Languages / Compilers: APACHE PIG, APACHE HIVE

Workflow: APACHE OOZIE

Scheduling: APACHE OOZIE

Metadata: APACHE HIVE

File System Mount: FUSE-DFS

UI Framework: HUE

SDK: HUE SDK


CDH & Enterprise Ecosystem

Page 9: Boston HUG - Cloudera presentation

unstructured data

semi-structured data

structured data

Create context (classification, text mining)

Analyze

Parse, aggregate

Analyze, report

Analyze, report

Active archival

Long running queries


Slide borrowed from Krishnan Parasuraman presentation at Enzee’11

Hadoop / RDBMS Use Cases

EDW

EDW

EDW

Page 10: Boston HUG - Cloudera presentation


Hadoop in Production

How Apache Hadoop fits into your existing infrastructure.

Logs

Files

Web Data

Relational Data

IDEs

BI / Analytics

Enterprise Reporting

Enterprise Data Warehouse

Low-Latency Serving Systems

Web Application

Management Tools

OPERATORS ENGINEERS ANALYSTS BUSINESS USERS CUSTOMERS

Page 11: Boston HUG - Cloudera presentation


Hadoop Use Cases

Industry | Advanced Analytics Use Case | Data Processing Use Case
Web | Social Network Analysis | Clickstream Sessionization
Media | Content Optimization | Clickstream Sessionization
Telco | Network Analytics | Mediation
Retail | Loyalty & Promotions Analysis | Data Factory
Financial | Fraud Analysis | Trade Reconciliation
Federal | Entity Analysis | SIGINT

Bioinformatics: Genome Mapping, Sequencing Analysis

Page 12: Boston HUG - Cloudera presentation

Use Case: Customer Risk

Build a comprehensive data picture of customer-side risk

Publish a consolidated set of attributes for analysis

Map ratings across products

Parse and aggregate data from different sources

Credit and debit cards, product payments, deposits and savings

Banking activity, browsing behavior, call logs, e-mails and chats

Merge data into a single view

A “fuzzy join” among data sources

Structure and normalize attributes

Sentiment analysis, pattern recognition
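The “fuzzy join” step on this slide can be sketched as follows. The slide does not say how the join was implemented, so this is only one possible approach under stated assumptions: records are plain dictionaries, the join key is a customer name, and similarity is scored with the standard library's `difflib.SequenceMatcher`. The function names, field names, sample records, and the 0.85 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity score between two strings, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def fuzzy_join(left, right, key, threshold=0.85):
    """Pair each left record with its best-scoring right record if the score meets the threshold."""
    joined = []
    for left_record in left:
        best, best_score = None, 0.0
        for right_record in right:
            score = similarity(left_record[key], right_record[key])
            if score > best_score:
                best, best_score = right_record, score
        if best is not None and best_score >= threshold:
            # Left-hand fields win on conflict, so the left spelling of the key is kept.
            joined.append({**best, **left_record})
    return joined


if __name__ == "__main__":
    # Hypothetical records from two sources that spell the same customer differently.
    card_payments = [{"name": "Jonathan Smith", "card_spend": 1200}]
    call_logs = [
        {"name": "Jonathon Smith", "calls": 3},
        {"name": "Maria Garcia", "calls": 1},
    ]
    print(fuzzy_join(card_payments, call_logs, "name"))
```

The nested loop compares every pair of records, which is fine for a sketch but would not scale to the data volumes the slide describes; a distributed version would typically first group candidate records by a coarser key so that only likely matches are compared.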

Copyright 2010 Cloudera Inc. All rights reserved

Page 13: Boston HUG - Cloudera presentation

Use Case: Sentiment Analysis


Internet generates a lot of chatter about brands

Understanding what’s being said is crucial to protecting brand value

Facebook, Twitter generate a lot of data for a global top brand

Capturing and processing direct feedback

Better engagement and alerting via Sentiment Analysis

Not yet ready for fully automated customer service

Hadoop handles the diverse data types and processing

Sources of data changing and semantics continuously evolving

Sophistication of algorithms is improving daily

Page 14: Boston HUG - Cloudera presentation


Journey of CDH Users

Discover the Benefits of Apache Hadoop

• Gain the flexibility to store and mine all types of data
• Leverage the scale-out architecture for complex data analysis
• Easily scale to meet growing data requirements
• Avoid vendor lock-in with an open source technology

Deploy CDH

• The fastest, surest path to success with Apache Hadoop
• Stable, reliable version of Apache Hadoop without the vendor lock-in imposed by proprietary vendors
• Integrates with your other technology platforms, ensuring investment protection

Subscribe to Cloudera Enterprise

• Simplify and accelerate Apache Hadoop deployment
• Reduce adoption costs and risks
• More effectively manage cluster resources
• Leverage the experience of our experts

Page 15: Boston HUG - Cloudera presentation


Get Hadoop: http://www.cloudera.com/hadoop/

cloudera.com

twitter.com/cloudera

facebook.com/cloudera