vBACD July 2012 - Apache Hadoop, Now and Beyond

© Hortonworks Inc. 2012 Apache Hadoop & the Cloud Jim Walker Dir. Product Marketing, Hortonworks Twitter @jaymce July 10, 2012


Post on 10-May-2015


DESCRIPTION

“Apache Hadoop, Now and Beyond”, Jim Walker, Director of Product Marketing, Hortonworks

Hadoop is an open source project that allows you to gain insight from massive amounts of structured and unstructured data quickly and without significant investment. It is shifting the way many traditional organizations think of analytics and business models. While it is designed to take advantage of cheap commodity hardware, it is also perfect for the cloud, as it is built to scale up or down without system interruption. In this presentation, Jim Walker will provide an overview of Apache Hadoop and its current state of adoption in and out of the cloud.

TRANSCRIPT

Page 1: vBACD July 2012 - Apache Hadoop, Now and Beyond

© Hortonworks Inc. 2012

Apache Hadoop & the Cloud

Jim Walker Dir. Product Marketing, Hortonworks Twitter @jaymce July 10, 2012

Page 2: vBACD July 2012 - Apache Hadoop, Now and Beyond


1941

2012

Page 3: vBACD July 2012 - Apache Hadoop, Now and Beyond


Next Generation Data Warehouse

•  MPP columnar data warehouse appliances •  In-memory analytics engines •  Fast data loading

Hardware Software Distributions ETL & Mgmnt Analytics Applications Services

•  Storage •  Servers •  Networking

•  OSS Apache Hadoop

•  Enterprise Distributions

•  Non-Hadoop big data frameworks

•  Distributed file stores

•  NoSQL databases

•  Data integration

•  Data quality & governance

•  Analytic application development platforms

•  Advanced analytics applications

•  Data visualization tools

•  Business intelligence applications

•  Consulting •  Training •  Tech support •  Software maintenance •  Hardware maintenance •  Hosting

Big data market segments

Page 4: vBACD July 2012 - Apache Hadoop, Now and Beyond

Big data market segments (chart repeated from Page 3)

cloud cloud cloud cloud

Page 5: vBACD July 2012 - Apache Hadoop, Now and Beyond


Analytics started with basic purchase history…

Megabytes Purchase detail Purchase record Payment record

ERP

Increasing Data Variety and Complexity

Source: Created in conjunction with Teradata, Inc.

Page 6: vBACD July 2012 - Apache Hadoop, Now and Beyond


then we added customer information…

Megabytes

Gigabytes

Purchase detail Purchase record Payment record

ERP

CRM

Offer details

Support Contacts

Customer Touches

Segmentation

Increasing Data Variety and Complexity

Source: Created in conjunction with Teradata, Inc.

Page 7: vBACD July 2012 - Apache Hadoop, Now and Beyond


and the web started to impact…

Megabytes

Gigabytes

Terabytes

Purchase detail Purchase record Payment record

ERP

CRM

WEB

Offer details

Support Contacts

Customer Touches

Segmentation

Web logs

Offer history

A/B testing

Dynamic Pricing

Affiliate Networks

Search Marketing

Behavioral Targeting

Dynamic Funnels

Increasing Data Variety and Complexity

Source: Created in conjunction with Teradata, Inc.

Page 8: vBACD July 2012 - Apache Hadoop, Now and Beyond


Big data changes the game

Source: Created in conjunction with Teradata, Inc.

Megabytes

Gigabytes

Terabytes

Petabytes

Purchase detail Purchase record Payment record

ERP

CRM

WEB

BIG DATA

Offer details

Support Contacts

Customer Touches

Segmentation

Web logs

Offer history

A/B testing

Dynamic Pricing

Affiliate Networks

Search Marketing

Behavioral Targeting

Dynamic Funnels

User Generated Content

Mobile Web

SMS/MMS Sentiment

External Demographics

HD Video, Audio, Images

Speech to Text

Product/Service Logs

Social Interactions & Feeds

Business Data Feeds

User Click Stream

Sensors / RFID / Devices

Spatial & GPS Coordinates

Increasing Data Variety and Complexity

Transactions + Interactions + Observations = BIG DATA

Page 9: vBACD July 2012 - Apache Hadoop, Now and Beyond


Next-gen data architecture drivers

Business Drivers

•  Enable new business models & drive faster growth (20%+)

•  Find insights for competitive advantage & optimal returns

Technical Drivers

•  Data continues to grow exponentially

•  Data is increasingly everywhere and in many formats

•  Legacy solutions unfit for new requirements growth

Financial Drivers

•  Cost of data systems, as % of IT spend, continues to grow

•  Cost advantages of commodity hardware & open source

cloud

Page 10: vBACD July 2012 - Apache Hadoop, Now and Beyond


One of the best examples of open source driving innovation and creating a market

•  Foundation for big data solutions

•  Enables a rational economics model

•  Powers data-driven business

•  Commodity hardware

•  Loosely coupled, ship early/ship often

•  Consists of many specialized sub-projects

Apache Hadoop Open Source Data Management Software

Page 11: vBACD July 2012 - Apache Hadoop, Now and Beyond

Apache Hadoop & Cloud Makes Sense

cloud

•  Broader access of Hadoop to end users, IT professionals, and developers

•  Easy installation and configuration and simplified programming

•  Enterprise-ready distribution with greater security, performance, ease of management and options for Hybrid IT usage.

•  Integrate with everything via RESTful API

•  Spin up a cluster on demand

•  Ease management

Page 12: vBACD July 2012 - Apache Hadoop, Now and Beyond

5 Reasons for Hadoop in the Cloud

People say "should you run Hadoop in the cloud?"

I say "it depends".

http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html

Page 13: vBACD July 2012 - Apache Hadoop, Now and Beyond

5 Reasons for Hadoop in the Cloud

1. If your data is stored in a cloud, local analysis may make more sense… "work near the data"

2. For periodic processing (nightly, etc…) it might make sense to just rent.

3. No upfront capital expense, fund from success

4. Easier to expand a cluster; no need to buy, just find

5. Eliminate networking concerns

http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html

Page 14: vBACD July 2012 - Apache Hadoop, Now and Beyond


What is Apache Hadoop?

1. STORAGE – Hadoop Distributed File System

•  Distributed across “nodes”

•  Natively redundant

•  Name node tracks locations

2. PROCESSING – Map/Reduce

•  Splits a task across processors “near” the data & assembles results

•  2004 white paper “MapReduce: Simplified Data Processing on Large Clusters”

•  Base of much new tech
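The "split a task near the data and assemble results" idea can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle and reduce steps, not Hadoop's API; the function names and the sample splits are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    for line in split:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(splits):
    # On a cluster each split would be mapped on a node near its data;
    # here the splits are simply processed one after another.
    pairs = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input: two splits, each a list of text lines.
splits = [["big data", "big cluster"], ["data near data"]]
counts = word_count(splits)
```

Because each map call only sees its own split and each reduce call only sees one key, both phases can in principle run on many nodes at once.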

Page 15: vBACD July 2012 - Apache Hadoop, Now and Beyond


Apache Hadoop related projects

Hive 3

Apache Hive is a data warehouse infrastructure built on top of Hadoop (originally by Facebook) for providing data summarization, ad-hoc query, and analysis of large datasets. It provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL (HQL).

HBase 4

HCatalog 5

Pig 6

Oozie 7

Ambari 8

Sqoop 9

Zookeeper 10
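The Hive description on this slide, projecting structure onto existing data and then querying it, might be pictured roughly as follows. This is a plain-Python sketch and not Hive itself; the log lines, the `SCHEMA` list and the table name `logs` in the comment are hypothetical.

```python
from collections import Counter

# Hypothetical raw, tab-delimited log lines as they might sit in a file.
RAW_LINES = [
    "2012-07-10\t/home\t200",
    "2012-07-10\t/cart\t500",
    "2012-07-11\t/home\t200",
]

# The "table definition": column names and types, applied only at read time.
SCHEMA = [("day", str), ("path", str), ("status", int)]

def read_rows(lines, schema):
    """Project the schema onto each raw line, yielding one dict per row."""
    for line in lines:
        fields = line.split("\t")
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}

def hits_per_path(rows):
    """Rough analogue of: SELECT path, COUNT(*) FROM logs GROUP BY path."""
    return Counter(row["path"] for row in rows)

summary = hits_per_path(read_rows(RAW_LINES, SCHEMA))
```

The raw lines are never rewritten; the schema is only an interpretation applied when the data is read, which is the point of the "project structure onto this data" wording.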

Page 16: vBACD July 2012 - Apache Hadoop, Now and Beyond

Apache Hadoop related projects

HBase 4

HBase is a non-relational database. It is columnar and provides fault-tolerant storage and quick access to large quantities of sparse data. It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes.
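To make "sparse data" and "updates, inserts and deletes" on this slide concrete, here is a toy in-memory model. It is an assumption-laden sketch, not the HBase client API; the class `SparseTable` and its methods are invented for illustration, and only the `family:qualifier` column naming follows HBase convention.

```python
class SparseTable:
    """Toy in-memory stand-in for a sparse, column-oriented table."""

    def __init__(self):
        self._rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Insert or update a single cell."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column=None):
        """Return one cell, or all stored cells of the row if column is None."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def delete(self, row_key, column):
        """Remove a single cell; drop the row once it has no cells left."""
        row = self._rows.get(row_key)
        if row is not None:
            row.pop(column, None)
            if not row:
                del self._rows[row_key]

table = SparseTable()
table.put("user1", "info:name", "Ada")
table.put("user1", "info:city", "London")
table.put("user2", "info:name", "Lin")    # no city is stored for user2
table.put("user1", "info:city", "Paris")  # update an existing cell
table.delete("user2", "info:name")        # delete a cell
```

Cells that were never written take no space at all, which is what makes this layout a good fit for wide tables where most columns are empty for most rows.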

Page 17: vBACD July 2012 - Apache Hadoop, Now and Beyond

Apache Hadoop related projects

HCatalog 5

HCatalog is a metadata management service for Apache Hadoop. It opens up the platform and allows interoperability across data processing tools such as Pig, MapReduce and Hive. It also provides a table abstraction so that users need not be concerned with where or how their data is stored. Aster SQL-H interfaces with HCatalog.

Page 18: vBACD July 2012 - Apache Hadoop, Now and Beyond

Apache Hadoop related projects

Pig 6

Apache Pig allows you to write complex MapReduce transformations using a simple scripting language. Pig Latin (the language) defines a set of transformations on a data set such as aggregate, join and sort, among others. Pig Latin is sometimes extended using UDFs (User Defined Functions), which the user can write in Java and then call directly from the language.
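The kind of data flow that the Pig description refers to (join, then aggregate, then sort) could look like this when written out by hand. This is plain Python rather than Pig Latin, and the `users` and `orders` relations are made-up sample data.

```python
from collections import defaultdict

# Hypothetical inputs: (user_id, name) and (user_id, amount).
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, 30.0), (2, 5.0), (1, 20.0), (2, 10.0)]

def join(left, right):
    """Inner join on the first field of each tuple."""
    index = defaultdict(list)
    for key, *rest in right:
        index[key].append(tuple(rest))
    return [(key, *l_rest, *r_rest)
            for key, *l_rest in left
            for r_rest in index[key]]

def total_by_name(rows):
    """Group by name and sum the amount column."""
    totals = defaultdict(float)
    for _user_id, name, amount in rows:
        totals[name] += amount
    return totals

def sort_desc(totals):
    """Order (name, total) pairs by total, largest first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

result = sort_desc(total_by_name(join(users, orders)))
```

In Pig each of these three steps would be a single statement, and the framework would turn the chain into MapReduce jobs.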

Page 19: vBACD July 2012 - Apache Hadoop, Now and Beyond

Apache Hadoop related projects

Oozie 7

Oozie coordinates jobs written in multiple languages such as MapReduce, Pig and Hive. It is a workflow system that links these jobs and allows specification of order and dependencies between them.
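The "order and dependencies" part of the Oozie description is essentially a dependency graph. Oozie itself takes workflow definitions in XML; the sketch below only illustrates the ordering idea using Python's standard `graphlib` module, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
WORKFLOW = {
    "load_logs": set(),
    "clean_logs_pig": {"load_logs"},
    "sessionize_mapreduce": {"clean_logs_pig"},
    "report_hive": {"sessionize_mapreduce", "clean_logs_pig"},
}

def execution_order(workflow):
    """Return one job order in which every job runs after its dependencies."""
    return list(TopologicalSorter(workflow).static_order())

order = execution_order(WORKFLOW)
```

A workflow with a circular dependency has no valid order; `TopologicalSorter` reports that case by raising `graphlib.CycleError`.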

Page 20: vBACD July 2012 - Apache Hadoop, Now and Beyond

Apache Hadoop related projects

Ambari 8

Apache Ambari operationalizes Hadoop. It provides a mechanism to monitor and manage a cluster. It also provisions nodes. Ambari is a monitoring, administration and lifecycle management project for Apache Hadoop clusters.

Page 21: vBACD July 2012 - Apache Hadoop, Now and Beyond

Apache Hadoop related projects

Sqoop 9

Sqoop is a set of tools for transferring bulk data between Hadoop and non-Hadoop data stores such as traditional relational databases and data warehouses.
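As a rough picture of the bulk-transfer role Sqoop plays, the sketch below copies a relational table into delimited text. It is not Sqoop: `sqlite3` stands in for the relational database, an in-memory buffer stands in for a file in HDFS, and the table, its rows and the helper `dump_table` are all hypothetical.

```python
import csv
import io
import sqlite3

def dump_table(conn, table, out):
    """Write every row of `table` to `out` as comma-delimited text."""
    # Table names cannot be bound as SQL parameters; `table` is trusted here.
    cursor = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out, lineterminator="\n")
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count

# A small in-memory relational table to copy from.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "ann", 30.0), (2, "bob", 5.5)])

buffer = io.StringIO()  # stands in for a file in HDFS
rows_written = dump_table(conn, "orders", buffer)
```

Once the rows exist as delimited text, tools such as Pig or Hive can read them like any other file.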

Page 22: vBACD July 2012 - Apache Hadoop, Now and Beyond

Apache Hadoop related projects

ZooKeeper 10

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Page 23: vBACD July 2012 - Apache Hadoop, Now and Beyond


Big Data Refinery

Hadoop in Action

Web Logs

Website Interactions

1. Web log files via WebHDFS APIs

DB Order Data

DB Customer Data

2. Customer & order data via Talend & HCatalog for schema

3. Pre-processes, refines, and joins data via Talend, Pig, & HCatalog

4. Interfaces with HCatalog to analyze website visits by the type of end results

Page 24: vBACD July 2012 - Apache Hadoop, Now and Beyond


We believe that by the end of 2015, more than half the world's data will be processed by Apache Hadoop.

Hortonworks Vision & Role

1. Be diligent stewards of the open source core

2. Be tireless innovators beyond the core

3. Provide robust data platform services & open APIs

4. Enable the ecosystem at each layer of the stack

5. Make the platform enterprise-ready & easy to use

Page 25: vBACD July 2012 - Apache Hadoop, Now and Beyond

Balancing Innovation & Stability

[Chart: relative % customers over time, with The CHASM marked]

Customers want solutions & convenience

Customers want technology & performance

Innovators, technology enthusiasts

Early adopters, visionaries

Early majority, pragmatists

Late majority, conservatives

Laggards, Skeptics

Source: Geoffrey Moore - Crossing the Chasm

Page 26: vBACD July 2012 - Apache Hadoop, Now and Beyond


Enabling Hadoop as Enterprise Big Data Platform

DEVELOPER Data Platform Services & Open APIs

Hortonworks Data Platform

Applications, Business Tools, Development Tools, Open APIs and access Data Movement & Integration, Data Management Systems, Systems Management

Installation & Configuration, Administration, Monitoring, High Availability, Replication, Multi-tenancy, ..

Metadata, Indexing, Search, Security, Management, Data Extract & Load, APIs

Page 27: vBACD July 2012 - Apache Hadoop, Now and Beyond


•  Tightly aligned with core Apache code line

•  All code committed back to open source

•  Most complete Apache Hadoop platform

•  Comprehensive management and monitoring

•  Intuitive graphical data integration tools

•  Centralized metadata services for easy data sharing

The ONLY 100% open source data platform for Hadoop

Hortonworks Data Platform


Page 28: vBACD July 2012 - Apache Hadoop, Now and Beyond


•  Simplify deployment to get started quickly and easily

•  Monitor, manage any size cluster with familiar console and tools

•  Only platform to include data integration services to interact with any data source

•  Metadata services open the platform for integration with existing applications

•  Dependable high availability architecture

Hortonworks Data Platform

Delivers enterprise grade functionality on a proven Apache Hadoop distribution to ease management, simplify use and ease integration into the enterprise

The only 100% open source data platform for Apache Hadoop

Page 29: vBACD July 2012 - Apache Hadoop, Now and Beyond


Hortonworks Distribution

Built on Hadoop 1.0 (a.k.a. 0.20.205)

•  Proven at large scale enterprise implementations

•  Most stable and reliable version of Hadoop to date

•  First Apache line supporting security, HBase, WebHDFS

•  Driven by core committers and architects at Hortonworks

Includes necessary components already integrated and tested together. Most stable versions of all components are chosen.

Apache Distribution Stack

Component    Version
Core         1.0.3
HCatalog     0.4.0
Pig          0.9.2
Hive         0.9.0+
HBase        0.92.1+
Sqoop        0.9.0+
Oozie        3.1.3
Zookeeper    3.3.4
Ambari       beta
Talend       5.1.1

Tested, Hardened & Proven Distribution Reduces Risk

Page 30: vBACD July 2012 - Apache Hadoop, Now and Beyond


Management & Monitoring Svcs

Hortonworks Management Center

– View the health of cluster operations, server utilization and performance levels

– Customizable dashboards

– APIs for integration into 3rd party monitoring tools

– 100% open source management & monitoring, powered by Apache Ambari, Puppet, Nagios and Ganglia

– Simple wizard-based installation, configuration & provisioning of any size Hadoop cluster

Optimize performance for your Hadoop cluster

Simplify Installation and provisioning

Page 31: vBACD July 2012 - Apache Hadoop, Now and Beyond


Data Integration Services

•  Intuitive graphical data integration tools for HDFS, Hive, HBase, HCatalog and Pig

•  Oozie scheduling allows you to manage and stage jobs

•  Connectors for any database, business application or system

•  Integrated HCatalog storage


Bridge the gap between legacy data & Hadoop

Simplify and speed development

Page 32: vBACD July 2012 - Apache Hadoop, Now and Beyond

Which is best for the cloud?

vs.

Page 33: vBACD July 2012 - Apache Hadoop, Now and Beyond


HCatalog

Table access Aligned metadata REST API

•  Raw Hadoop data •  Inconsistent, unknown •  Tool specific access

Apache HCatalog provides flexible metadata services across tools and external access

Metadata Services

•  Consistency of metadata and data models across tools (MapReduce, Pig, HBase and Hive)

•  Accessibility: share data as tables in and out of HDFS

•  Availability: enables flexible, thin-client access via REST API

Shared table and schema management opens the platform

Page 34: vBACD July 2012 - Apache Hadoop, Now and Beyond


HDFS HBase External Store

Existing & New Applications

MapReduce Pig Hive

HCatalog

HCatalog RESTful Web Services

Services Integration

Provides RESTful API as “front door” for Hadoop

•  Opens the door to languages other than Java

•  Thin clients via web services vs. fat-clients in gateway

•  Insulation from interface changes release to release

Opens Hadoop to integration with existing and new applications

WebHDFS
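Since WebHDFS exposes HDFS operations as plain HTTP URLs, a thin client mostly needs to build the right URL. The helper below is a minimal sketch of that; it only constructs URLs and makes no network call. The host name and file paths are hypothetical, and the default port 50070 assumes a Hadoop 1.x NameNode with default settings.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS REST URL for operation `op` on the HDFS `path`."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"

# List a directory, then read a file from the start (both are GET requests).
list_url = webhdfs_url("namenode.example.com", "/logs/2012-07-10", "LISTSTATUS")
open_url = webhdfs_url("namenode.example.com", "/logs/part 0.txt", "OPEN", offset=0)
```

Any language with an HTTP library can issue these requests, which is what the slide means by opening the door to languages other than Java.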

Page 35: vBACD July 2012 - Apache Hadoop, Now and Beyond


Use cases: optimize outcomes at scale

Media Content

Intelligence Detection

Investment Algorithms

Advertising Performance

Fraud Prevention

Regulation Compliance

Retail / Wholesale Inventory turns

Manufacturing Supply chains

Healthcare Patient outcomes

Education Learning outcomes

Government Citizen services

Source: Geoffrey Moore. Hadoop Summit 2012 keynote presentation.

Page 36: vBACD July 2012 - Apache Hadoop, Now and Beyond


Business Transactions & Interactions

Web, Mobile, CRM, ERP, SCM, …

Business Intelligence & Analytics

Dashboards, Reports, Visualization, …

1. Classic ETL processing

Connecting Transactions + Interactions + Observations

5. Retain runtime models and historical data for ongoing refinement & analysis

6. Retain historical data to unlock additional value

Audio, Video, Images

Docs, Text, XML

Web Logs, Clicks

Social, Graph, Feeds

Sensors, Devices, RFID

Spatial, GPS

Events, Other

Big Data Refinery

Store, aggregate, and transform multi-structured data to unlock value

3. Share refined data & runtime models

4. Data Discovery & Investigative Analytics: interactive data exploration

2

Page 37: vBACD July 2012 - Apache Hadoop, Now and Beyond

5 Reasons for Hadoop in the Cloud

1. If your data is stored in a cloud, local analysis may make more sense… "work near the data"

2. For periodic processing (nightly, etc…) it might make sense to just rent.

3. No upfront capital expense, fund from success

4. Easier to expand a cluster; no need to buy, just find

5. Eliminate networking concerns

http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html

Page 38: vBACD July 2012 - Apache Hadoop, Now and Beyond

THANK YOU

1. Get Hortonworks Data Platform: hortonworks.com/download

2. Use the getting started guide: hortonworks.com/get-started

3. Learn more… get support: hortonworks.com/training, hortonworks.com/support

Jim Walker [email protected] @jaymce