Cloudera & Hadoop Use Cases
Rob Lancaster | Omer Trajman
"Big Data" ... Applications From Enterprises to Individuals
©2011 Cloudera, Inc. All Rights Reserved.
The ‘Big Data’ Phenomenon
Big Data Drivers:
The proliferation of data capture and creation technologies
Increased “interconnectedness” drives consumption (creating more data)
Inexpensive storage makes it possible to keep more, longer
Innovative software and analysis tools turn data into information
Big Data encompasses not only the content itself, but how it’s consumed.
More Devices
More Consumption
More Content
New & Better Information
Every gigabyte of stored content can generate a petabyte or more of transient data*
The information about you is much greater than the information you create
*Source: IDC 2011
Big Data Challenges
It’s not just about “big”
Cost-effectively managing the volume, velocity and variety of data
Deriving value across structured and unstructured data
Adapting to context changes and integrating new data sources and types
Common Challenges
1 Network Analysis and Sessionization
2 Content Optimization and Engagement Modeling
3 Usage Analysis and Mediation
4 Entity Surveillance and Signal Monitoring
5 Recommendations and Modeling
6 Loyalty, Promotion Analysis and Targeting
7 Fraud Analysis, Reconciliation and Risk
8 Time series Analysis, Mapping and Modeling
What is Apache Hadoop?
Apache Hadoop is a platform for data storage and processing that is scalable, fault tolerant and open source.
Core Hadoop Components
Hadoop Distributed File System (HDFS)
MapReduce
Consolidates Mixed Data
Complex and relational data into a single repository
Stores Inexpensively
Keep raw data always available
Processes at the Source
Eliminate ETL bottlenecks
Mine data first, govern later
©2011 Cloudera, Inc. All Rights Reserved. Confidential. Reproduction or redistribution without written permission is prohibited.
Cloudera in Production
[Diagram. Labels recovered from the slide:]
Data sources: Logs, Files, Web Data, Relational Databases
Connected systems: IDEs, BI / Analytics, Enterprise Reporting, Enterprise Data Warehouse, Operational Rules Engines, Management Tools, Web Application
Users: Operators, Engineers, Analysts, Business Users, Customers
Platform: Cloudera’s Distribution Including Apache Hadoop (CDH) & SCM Express
Cloudera Enterprise: Cloudera Management Suite, Cloudera Support
Services: Consulting Services, Cloudera University
What Can Hadoop Do For You?
Two Core Use Cases Applied Across Industries
Industry | 1. Advanced Analytics (industry term) | 2. Data Processing (industry term)
Web | Social Network Analysis | Clickstream Sessionization
Media | Content Optimization | Engagement
Telco | Network Analytics | Mediation
Retail | Loyalty & Promotions Analysis | Data Factory
Financial | Fraud Analysis | Trade Reconciliation
Federal | Entity Analysis | SIGINT
Bioinformatics | Sequencing Analysis | Genome Mapping
Genomics
Cost of DNA Sequencing Falling Very Fast
Raw data needs to be aligned and matched
Scientists want to collect and analyze these sequences
Hadoop Can Read Native Format
hadoop-bam Java library for manipulation of Binary Alignment/Map
Alignment, SNP discovery, genotyping
Genomic Tools Based On Hadoop
SEAL – distributed short read alignment
BlastReduce – parallel read mapping
Crossbow – whole genome re-sequencing analysis
CloudBurst – sensitive MapReduce alignment
Copyright 2010 Cloudera Inc. All rights reserved
Biodiversity Indexing
Consolidation and serving of biological data
Provide free and open access to biodiversity data
Collection, search, discovery and access to a variety of data
Data matching and cleansing
Geography, water/land mapping
Dictionaries and taxonomic services
Data is harvested into multiple RDBMS
Sqoop to Hadoop for processing workflows and index generation
Sqoop back to MySQL for Web app serving
Future development is to crawl into and serve from HBase
Processing Seismic Data
Optimize the IO-intensive phases of seismic processing
Incorporate additional parallelism where it makes sense
Simplify gather/transpose operations with MapReduce
Seismic Unix for Core Algorithms
Well-known, used at many grad programs in geophysics
SU file format can be easily transformed for processing on HDFS
Hadoop Streaming
Seismic Unix, SEPlib, JavaSeis - non-Java code in MR
Framework is aware of parameter files needed by SU commands
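The gather step mentioned above can be sketched as a streaming-style map and reduce. This is only an illustration under stated assumptions, not the framework described on the slide: the tab-separated text layout (shot id, CDP id, samples) and the function names are hypothetical, and real SU traces are binary records. The in-memory sort stands in for the shuffle/sort that Hadoop performs between the two phases.

```python
from itertools import groupby


def map_traces(lines):
    """Mapper: re-key each trace line by its gather key.

    Assumed (hypothetical) input layout: shot_id <TAB> cdp_id <TAB> samples.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        shot_id, cdp_id, samples = line.split("\t")
        # Emit the CDP id first so the framework sorts and groups on it.
        yield f"{cdp_id}\t{shot_id}\t{samples}"


def reduce_gathers(sorted_lines):
    """Reducer: input arrives sorted by key, so consecutive lines
    sharing a CDP id form one gather."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for cdp_id, group in groupby(parsed, key=lambda fields: fields[0]):
        shots = [fields[1] for fields in group]
        yield f"{cdp_id}\t{','.join(shots)}"


# Small in-memory run; sorted() stands in for Hadoop's shuffle/sort.
lines = [
    "s1\tc10\t0.1 0.2",
    "s1\tc11\t0.3 0.4",
    "s2\tc10\t0.5 0.6",
]
mapped = sorted(map_traces(lines))
gathers = list(reduce_gathers(mapped))
print(gathers)
```

Under Hadoop Streaming, the same two functions would typically be wrapped in small scripts that read sys.stdin and print to stdout, which is how non-Java code plugs into a MapReduce job.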
Targeted Offers
The checkout lane is everywhere
Cookies track users through ad impressions
Purchasing behavior is time sensitive
Logs collected from on-site and off-site browsing
Data is ingested incrementally
Process happens at a variety of time scales
Data logged to HBase as primary store
Some events naturally associate, others require deeper analysis
Random access useful for debugging algorithms
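One common way to associate cookie-keyed log events is to group them into sessions, splitting whenever a user has been inactive for too long. The sketch below is a minimal in-memory illustration of that idea, not the pipeline described on the slide; the (cookie_id, timestamp, url) event layout and the 30-minute timeout are assumptions.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity; an assumed threshold


def sessionize(events, timeout=SESSION_TIMEOUT):
    """Group (cookie_id, unix_ts, url) events into per-cookie sessions.

    Returns {cookie_id: [session, ...]}, where each session is a
    time-ordered list of (unix_ts, url) tuples.
    """
    by_cookie = defaultdict(list)
    for cookie_id, ts, url in events:
        by_cookie[cookie_id].append((ts, url))

    sessions = {}
    for cookie_id, hits in by_cookie.items():
        hits.sort()  # order each user's events by timestamp
        current = [hits[0]]
        result = [current]
        for prev, hit in zip(hits, hits[1:]):
            if hit[0] - prev[0] > timeout:
                current = [hit]  # gap too long: start a new session
                result.append(current)
            else:
                current.append(hit)
        sessions[cookie_id] = result
    return sessions


events = [
    ("c1", 0, "/ad"),
    ("c1", 600, "/product"),
    ("c1", 4000, "/checkout"),
    ("c2", 100, "/ad"),
]
sessions = sessionize(events)
print(sessions)
```

In a MapReduce setting the cookie id would serve as the shuffle key, so each reducer sees one user's events together and can apply the same gap rule.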
Recommendations and Forecasting
Collect and serve personalization information
Wide variety of constantly changing data sources
Data guaranteed to be messy
Data ingestion includes collection of raw data
Filtering and fixing of poorly formatted data
Normalization and matching across data sources
Analysis looks for reliable attributes and groupings
Interpretation (e.g. gender by name)
Aggregation across likely matching identifiers
Identify possible predicted attributes or preferences
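The ingestion and analysis steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the system described on the slide: the record fields, the use of a lowercased email as the matching identifier, and the tiny NAME_TO_GENDER lookup are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lookup for "interpretation (e.g. gender by name)".
NAME_TO_GENDER = {"alice": "F", "bob": "M"}


def normalize(record):
    """Filter and fix one raw record; return None if it is unusable."""
    email = (record.get("email") or "").strip().lower()
    if "@" not in email:
        return None  # drop records without a usable identifier
    parts = (record.get("name") or "").split()
    first_name = parts[0].lower() if parts else ""
    interests = [i.strip().lower() for i in record.get("interests") or []]
    return {"email": email, "first_name": first_name, "interests": interests}


def build_profiles(records):
    """Aggregate records that share an identifier, then interpret attributes."""
    profiles = defaultdict(lambda: {"first_name": "", "interests": set()})
    for record in records:
        clean = normalize(record)
        if clean is None:
            continue
        profile = profiles[clean["email"]]
        profile["first_name"] = profile["first_name"] or clean["first_name"]
        profile["interests"].update(clean["interests"])
    for profile in profiles.values():
        # A predicted attribute; None when the name is not in the lookup.
        profile["gender"] = NAME_TO_GENDER.get(profile["first_name"])
    return dict(profiles)


raw = [
    {"email": " Alice@Example.com ", "name": "Alice Smith", "interests": ["Hiking"]},
    {"email": "alice@example.com", "name": "", "interests": ["hiking", "Jazz"]},
    {"email": "not-an-email", "name": "Bob"},
]
profiles = build_profiles(raw)
print(profiles)
```

Matching on a single exact identifier is the simplest possible choice; "likely matching identifiers" in practice would call for fuzzier rules, which this sketch does not attempt.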
Who is Cloudera?
The #1 commercial and non-commercial Apache Hadoop distribution.
Helps organizations profit from all their data
Largest contributor to Hadoop ecosystem
Provides the most widely used open source distribution
Develops the most sophisticated Hadoop operations software
Supports mission critical Hadoop clusters
Trained the largest number of Hadoop Developers and Administrators
Complete, Integrated Hadoop Stack
Coordination: Apache ZooKeeper
Data Integration: Apache Flume, Apache Sqoop
Fast Read/Write Access: Apache HBase
Languages / Compilers: Apache Pig, Apache Hive
Workflow: Apache Oozie
Scheduling: Apache Oozie
Metadata: Apache Hive
File System Mount: FUSE-DFS
UI Framework: Hue
SDK: Hue SDK
Cloudera helps you profit from all your data.
cloudera.com
+1 (888) [email protected]
twitter.com/cloudera
facebook.com/cloudera
Get Hadoop