Post on 18-Dec-2015


Cloudera & Hadoop Use Cases

Rob Lancaster | Omer Trajman

"Big Data" ... Applications From Enterprises to Individuals

©2011 Cloudera, Inc. All Rights Reserved.

The ‘Big Data’ Phenomenon

Big Data Drivers:

The proliferation of data capture and creation technologies

Increased “interconnectedness” drives consumption (creating more data)

Inexpensive storage makes it possible to keep more, longer

Innovative software and analysis tools turn data into information

Big Data encompasses not only the content itself, but how it’s consumed.

More Devices

More Consumption

More Content

New & Better Information

Every gigabyte of stored content can generate a petabyte or more of transient data*

The information about you is much greater than the information you create

*Source: IDC 2011


Big Data Challenges

It’s not just about “big”

Cost-effectively managing the volume, velocity and variety of data

Deriving value across structured and unstructured data

Adapting to context changes and integrating new data sources and types


Common Challenges

1 Network Analysis and Sessionization

2 Content Optimization and Engagement Modeling

3 Usage Analysis and Mediation

4 Entity Surveillance and Signal Monitoring

5 Recommendations and Modeling

6 Loyalty, Promotion Analysis and Targeting

7 Fraud Analysis, Reconciliation and Risk

8 Time Series Analysis, Mapping and Modeling
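
To make the first item concrete: sessionization groups each user's click events into sessions separated by a period of inactivity. The following is a minimal single-process sketch, not anything shown in the deck. The 30-minute idle timeout, the (user_id, timestamp) event shape and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed idle timeout; the slides do not specify one.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(
    events: Iterable[Tuple[str, int]],
    gap: int = SESSION_GAP_SECONDS,
) -> Dict[str, List[List[int]]]:
    """Group (user_id, unix_timestamp) events into per-user sessions.

    A new session starts whenever the time since the user's previous
    event exceeds `gap` seconds.
    """
    by_user: Dict[str, List[int]] = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions: Dict[str, List[List[int]]] = {}
    for user_id, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - user_sessions[-1][-1] > gap:
                user_sessions.append([ts])      # idle too long: open a new session
            else:
                user_sessions[-1].append(ts)    # still within the current session
        sessions[user_id] = user_sessions
    return sessions
```

On a cluster the grouping by user would typically be the shuffle step and the gap check would run in the reducer; the logic per user stays the same.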


What is Apache Hadoop?

Apache Hadoop is a platform for data storage and processing that is:

Scalable

Fault tolerant

Open source

CORE HADOOP COMPONENTS

Hadoop Distributed File System (HDFS)

MapReduce

Consolidates Mixed Data
Complex and relational data into a single repository

Stores Inexpensively
Keep raw data always available

Processes at the Source
Eliminate ETL bottlenecks
Mine data first, govern later
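
For readers new to MapReduce, the programming model can be sketched in a few lines. This is a single-process illustration of the map, shuffle and reduce phases using the classic word-count example. It does not use Hadoop itself, and the function names are chosen here for illustration only.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum all counts that share a key."""
    return key, sum(values)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run map, shuffle (group values by key) and reduce in one process."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)           # shuffle: collect values per key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    lines = ["big data big clusters", "data at the source"]
    print(run_mapreduce(lines, map_words, reduce_counts))
```

In Hadoop the map and reduce functions run in parallel on the nodes that hold the data in HDFS, and the framework performs the shuffle between them.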


Cloudera in Production

[Diagram: Cloudera in production. Labels recovered from the slide:]

Logs, Files, Web Data, Relational Databases

IDEs, BI / Analytics, Enterprise Reporting

Enterprise Data Warehouse

Operational Rules Engines

Management Tools

Web Application

OPERATORS, ENGINEERS, ANALYSTS, BUSINESS USERS, CUSTOMERS

Cloudera’s Distribution Including Apache Hadoop (CDH) & SCM Express

Cloudera Enterprise: Cloudera Management Suite, Cloudera Support

Consulting Services

Cloudera University


What Can Hadoop Do For You?

Two Core Use Cases Applied Across Industries

INDUSTRY     1 ADVANCED ANALYTICS             2 DATA PROCESSING
Web          Social Network Analysis          Clickstream Sessionization
Media        Content Optimization             Engagement
Telco        Network Analytics                Mediation
Retail       Loyalty & Promotions Analysis    Data Factory
Financial    Fraud Analysis                   Trade Reconciliation
Federal      Entity Analysis                  SIGINT

Genomics

Bioinformatics: Genome Mapping, Sequencing Analysis

Cost of DNA Sequencing Falling Very Fast
Raw data needs to be aligned and matched
Scientists want to collect and analyze these sequences

Hadoop Can Read Native Format
hadoop-bam: Java library for manipulation of Binary Alignment/Map
Alignment, SNP discovery, genotyping

Genomic Tools Based On Hadoop
SEAL: distributed short read alignment
BlastReduce: parallel read mapping
Crossbow: whole genome re-sequencing analysis
CloudBurst: sensitive MapReduce alignment
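
As a rough illustration of why read alignment fits MapReduce, the sketch below maps short reads to a reference by matching fixed-length seeds (k-mers). It is a toy written for this transcript and is not the algorithm of SEAL, BlastReduce, Crossbow or CloudBurst. The seed length, function names and exact-match voting scheme are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

K = 4  # seed length, chosen only for this toy example


def kmers(seq: str, k: int = K) -> Iterator[Tuple[int, str]]:
    """Yield (position, k-mer) for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def map_reference(reference: str) -> Iterator[Tuple[str, tuple]]:
    """Map step for the reference: key each position by its seed."""
    for pos, seed in kmers(reference):
        yield seed, ("ref", pos)


def map_read(read_id: str, read: str) -> Iterator[Tuple[str, tuple]]:
    """Map step for a read: key each read offset by its seed."""
    for offset, seed in kmers(read):
        yield seed, ("read", read_id, offset)


def reduce_seed(values: List[tuple]) -> Iterator[Tuple[str, int]]:
    """Reduce step for one seed: pair reference hits with read hits and
    emit (read_id, implied start of the read on the reference)."""
    ref_positions = [v[1] for v in values if v[0] == "ref"]
    read_hits = [(v[1], v[2]) for v in values if v[0] == "read"]
    for pos in ref_positions:
        for read_id, offset in read_hits:
            yield read_id, pos - offset


def candidate_alignments(reference: str, reads: Dict[str, str]) -> Dict[str, int]:
    """Return, per read, the reference start supported by the most seeds."""
    groups: Dict[str, List[tuple]] = defaultdict(list)
    for seed, value in map_reference(reference):
        groups[seed].append(value)
    for read_id, read in reads.items():
        for seed, value in map_read(read_id, read):
            groups[seed].append(value)

    votes: Dict[str, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for values in groups.values():
        for read_id, start in reduce_seed(values):
            votes[read_id][start] += 1
    return {rid: max(starts, key=starts.get) for rid, starts in votes.items()}
```

Real aligners tolerate mismatches and gaps and then verify each candidate position; this sketch only shows how keying by seed lets the shuffle bring matching reference and read fragments together.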


Biodiversity Indexing

Consolidation and serving of biological data
Provide free and open access to biodiversity data
Collection, search, discovery and access to a variety of data

Data matching and cleansing
Geography, water/land mapping
Dictionaries and taxonomic services

Data is harvested into multiple RDBMS
Sqoop to Hadoop for processing workflows and index generation
Sqoop back to MySQL for Web app serving
Future development is to crawl into and serve from HBase


Processing Seismic Data

Optimize the IO-intensive phases of seismic processing
Incorporate additional parallelism where it makes sense
Simplify gather/transpose operations with MapReduce

Seismic Unix for Core Algorithms
Well-known, used at many grad programs in geophysics
SU file format can be easily transformed for processing on HDFS

Hadoop Streaming
Seismic Unix, SEPlib, Javaseis: non-Java code in MR
Framework is aware of parameter files needed by SU commands
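
The gather operation mentioned above maps naturally onto Hadoop Streaming, where the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. The sketch below is a hypothetical Python pair and not the framework described in the slides. It assumes trace headers have already been exported as text lines of the form trace_id,gather_key,offset, which is an invented format and not the SU binary layout.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'gather_key<TAB>offset<TAB>trace_id' for each trace header line.

    Input lines are assumed to look like 'trace_id,gather_key,offset'.
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        trace_id, gather_key, offset = line.strip().split(",")
        yield f"{gather_key}\t{offset}\t{trace_id}"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Collect each gather's traces, ordered by offset.

    Streaming delivers reducer input sorted by key, so consecutive lines
    with the same key belong to the same gather. Order within a key is
    not guaranteed, so the traces are sorted here.
    """
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines if line.strip())
    for gather_key, rows in groupby(parsed, key=lambda row: row[0]):
        traces = sorted((float(offset), trace_id) for _, offset, trace_id in rows)
        yield gather_key + "\t" + ",".join(trace_id for _, trace_id in traces)


# Run as 'script.py map' or 'script.py reduce' when used as a streaming step.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Because the framework sorts and groups by key between the two steps, the gather itself needs no explicit code; the reducer only orders traces within each gather.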


Targeted Offers

The checkout lane is everywhere
Cookies track users through ad impressions
Purchasing behavior is time sensitive

Logs collected from on-site and off-site browsing
Data is ingested incrementally
Process happens at a variety of time scales

Data logged to HBase as primary store
Some events naturally associate, others require deeper analysis
Random access useful for debugging algorithms

Recommendations and Forecasting


Collect and serve personalization information
Wide variety of constantly changing data sources
Data guaranteed to be messy

Data ingestion includes collection of raw data
Filtering and fixing of poorly formatted data
Normalization and matching across data sources

Analysis looks for reliable attributes and groupings
Interpretation (e.g. gender by name)
Aggregation across likely matching identifiers
Identify possible predicted attributes or preferences
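
The ingestion steps above (filter badly formatted records, normalize identifiers, aggregate across sources) can be sketched as follows. This is a minimal illustration, assuming records arrive as dictionaries and that an email address is the matching identifier. The field names, the deliberately loose email pattern and the function names are hypothetical and do not come from the slides.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# Deliberately loose pattern: something@something.something, no whitespace.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def normalize_email(raw: Optional[str]) -> Optional[str]:
    """Return a trimmed, lowercase email, or None if the value is unusable."""
    if not raw:
        return None
    email = raw.strip().lower()
    return email if EMAIL_PATTERN.fullmatch(email) else None


def merge_profiles(records: Iterable[dict]) -> Dict[str, Dict[str, List[str]]]:
    """Aggregate attribute values across sources, keyed by normalized email."""
    profiles: Dict[str, Dict[str, set]] = defaultdict(lambda: defaultdict(set))
    for record in records:
        key = normalize_email(record.get("email"))
        if key is None:
            continue  # filter: no usable identifier to match on
        for field, value in record.items():
            if field != "email" and value not in (None, ""):
                profiles[key][field].add(str(value).strip())
    return {
        key: {field: sorted(values) for field, values in attrs.items()}
        for key, attrs in profiles.items()
    }
```

Keeping every observed value per attribute, instead of overwriting, leaves the later analysis step free to decide which values are reliable.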


Who is Cloudera?

The #1 commercial and non-commercial Apache Hadoop distribution.

Helps organizations profit from all their data

Largest contributor to Hadoop ecosystem

Provides the most widely used open source distribution

Develops the most sophisticated Hadoop operations software

Supports mission critical Hadoop clusters

Trained the largest number of Hadoop Developers and Administrators

Complete, Integrated Hadoop Stack

Coordination: Apache ZooKeeper
Data Integration: Apache Flume, Apache Sqoop
Fast Read/Write Access: Apache HBase
Languages / Compilers: Apache Pig, Apache Hive
Workflow: Apache Oozie
Scheduling: Apache Oozie
Metadata: Apache Hive
File System Mount: FUSE-DFS
UI Framework: Hue
SDK: Hue SDK


Cloudera helps you profit from all your data.

cloudera.com

+1 (888) [email protected]

twitter.com/cloudera

facebook.com/cloudera

Get Hadoop