© Hortonworks Inc. 2012
Apache Hadoop & the Cloud
Jim Walker, Dir. Product Marketing, Hortonworks
Twitter: @jaymce
July 10, 2012
1941
2012
Next Generation Data Warehouse
• MPP columnar data warehouse appliances • In-memory analytics engines • Fast data loading
Hardware Software Distributions ETL & Mgmnt Analytics Applications Services
• Storage • Servers • Networking
• OSS Apache Hadoop
• Enterprise Distributions
• Non-Hadoop big data frameworks
• Distributed file stores
• NoSQL databases
• Data integration
• Data quality & governance
• Analytic application development platforms
• Advanced analytics applications
• Data visualization tools
• Business intelligence applications
• Consulting • Training • Tech support • Software maintenance • Hardware maintenance • Hosting
Big data market segments
[Build of the previous slide: the same big data market segments, with four "cloud" labels overlaid]
Analytics started with basic purchase history…
ERP (Megabytes): Purchase detail, Purchase record, Payment record
Increasing Data Variety and Complexity
Source: Created in conjunction with Teradata, Inc.
then we added customer information…
CRM (Gigabytes): Offer details, Support Contacts, Customer Touches, Segmentation
and the web started to impact…
WEB (Terabytes): Web logs, Offer history, A/B testing, Dynamic Pricing, Affiliate Networks, Search Marketing, Behavioral Targeting, Dynamic Funnels
Big data changes the game
BIG DATA (Petabytes): User Generated Content; Mobile Web; SMS/MMS; Sentiment; External Demographics; HD Video, Audio, Images; Speech to Text; Product/Service Logs; Social Interactions & Feeds; Business Data Feeds; User Click Stream; Sensors / RFID / Devices; Spatial & GPS Coordinates
Transactions + Interactions + Observations = BIG DATA
Next-gen data architecture drivers
Business Drivers
• Enable new business models & drive faster growth (20%+)
• Find insights for competitive advantage & optimal returns
Technical Drivers
• Data continues to grow exponentially
• Data is increasingly everywhere and in many formats
• Legacy solutions unfit for new requirements growth
Financial Drivers
• Cost of data systems, as % of IT spend, continues to grow
• Cost advantages of commodity hardware & open source
Apache Hadoop: Open Source Data Management Software
One of the best examples of open source driving innovation and creating a market
• Foundation for big data solutions
• Enables a rational economics model
• Powers data-driven business
• Commodity hardware
• Loosely coupled, ship early/ship often
• Consists of many specialized sub-projects
Apache Hadoop & Cloud Makes Sense
• Broader access to Hadoop for end users, IT professionals, and developers
• Easy installation and configuration, and simplified programming
• Enterprise-ready distribution with greater security, performance, ease of management and options for hybrid IT usage
• Integrate with everything via RESTful API
• Spin up a cluster on demand
• Ease management
5 Reasons for Hadoop in the Cloud
People ask, "Should you run Hadoop in the cloud?"
I say, "It depends."
http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
1 If your data is stored in a cloud, local analysis may make more sense: "work near the data"
2 For periodic processing (nightly, etc.), it might make sense to just rent
3 No upfront capital expense; fund from success
4 Easier to expand a cluster; no need to buy, just find
5 Eliminate networking concerns
http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
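The rent-versus-buy reasoning for periodic processing comes down to simple arithmetic. The sketch below is only an illustration: the function names and every price, node count, and duration in it are hypothetical placeholders, not figures from this deck or from any provider.

```python
def monthly_rental_cost(nodes, hours_per_run, runs_per_month, price_per_node_hour):
    """Cost of renting a cluster only while a periodic job runs."""
    return nodes * hours_per_run * runs_per_month * price_per_node_hour


def monthly_owned_cost(nodes, purchase_price_per_node, amortization_months, opex_per_node_month):
    """Cost of owning the same cluster, whether or not it is busy."""
    capex = nodes * purchase_price_per_node / amortization_months
    opex = nodes * opex_per_node_month
    return capex + opex


def break_even_hours_per_month(owned_cost, nodes, price_per_node_hour):
    """Cluster hours per month above which owning becomes cheaper than renting."""
    return owned_cost / (nodes * price_per_node_hour)


# All figures below are hypothetical placeholders, not real prices.
rented = monthly_rental_cost(nodes=20, hours_per_run=4, runs_per_month=30, price_per_node_hour=0.50)
owned = monthly_owned_cost(nodes=20, purchase_price_per_node=6000, amortization_months=36, opex_per_node_month=100)
break_even = break_even_hours_per_month(owned, nodes=20, price_per_node_hour=0.50)
print(f"rent: {rented:.2f}/month, own: {owned:.2f}/month, break-even: {break_even:.0f} hours/month")
```

With these made-up inputs a nightly four-hour job is far below the break-even point, which is the situation where renting may make sense; a cluster that is busy most of the month would shift the balance the other way.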
What is Apache Hadoop?
STORAGE – Hadoop Distributed File System
• Distributed across “nodes”
• Natively redundant
• Name node tracks locations
PROCESSING – MapReduce
• Splits a task across processors “near” the data & assembles results
• Based on Google's 2004 white paper "MapReduce: Simplified Data Processing on Large Clusters"
• Base of much new tech
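The "split a task near the data and assemble results" idea can be sketched in a few lines. This is a single-process toy in plain Python, not the Hadoop API: the function names and the sample input are made up for illustration, and each inner list stands in for a block of a file that would be processed on the node storing it.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine every value seen for one key into a single result."""
    return key, sum(values)


def word_count(splits):
    # Each split is mapped independently; only the small (word, 1) pairs move.
    mapped = chain.from_iterable(map_phase(line) for split in splits for line in split)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


splits = [["big data big cluster"], ["data near the data"]]
counts = word_count(splits)
print(counts)
```

In a real cluster the map calls run in parallel on different machines and the framework performs the shuffle over the network; the three-step shape stays the same.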
Apache Hadoop related projects
Hive 3
Apache Hive is a data warehouse infrastructure built on top of Hadoop (originally by Facebook) that provides data summarization, ad hoc query, and analysis of large datasets. It provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL (HQL).
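To give a feel for "projecting structure onto data and querying it with a SQL-like language", here is a rough sketch that uses Python's built-in sqlite3 module as a stand-in. It is not Hive and does not run on Hadoop; the table name, columns, and sample records are invented for illustration. The point is only the shape of the workflow: raw delimited records, a declared schema, then a summarization query similar to what one would write in HiveQL.

```python
import sqlite3

# Raw, tab-delimited records as they might sit in a file (hypothetical sample data).
raw_lines = [
    "2012-07-01\t/home\t200",
    "2012-07-01\t/download\t200",
    "2012-07-02\t/home\t404",
]

conn = sqlite3.connect(":memory:")
# Project a structure (column names and types) onto the raw records.
conn.execute("CREATE TABLE page_views (view_date TEXT, url TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    (line.split("\t") for line in raw_lines),
)

# A summarization query: page views per URL.
summary = conn.execute(
    "SELECT url, COUNT(*) FROM page_views GROUP BY url ORDER BY url"
).fetchall()
print(summary)
```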
HBase 4
HBase is a non-relational database. It is column-oriented and provides fault-tolerant storage and quick access to large quantities of sparse data. It also adds row-level transactional capabilities to Hadoop, allowing users to perform updates, inserts and deletes.
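A toy model can make "sparse data with updates, inserts and deletes" more concrete. The class below is an in-memory sketch in plain Python, not the HBase client API; the class name, method names, and the "family:qualifier" column strings are assumptions chosen for illustration. Only cells that actually hold a value take up space, which is what makes the table sparse.

```python
class SparseTable:
    """Toy in-memory model of a sparse, column-oriented table (not the HBase API)."""

    def __init__(self):
        self._rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Insert or update a single cell."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column, default=None):
        """Read one cell by row key and column name."""
        return self._rows.get(row_key, {}).get(column, default)

    def delete(self, row_key, column):
        """Remove one cell; drop the row once it has no cells left."""
        cells = self._rows.get(row_key, {})
        cells.pop(column, None)
        if not cells:
            self._rows.pop(row_key, None)

    def cell_count(self):
        """Number of cells actually stored (absent cells cost nothing)."""
        return sum(len(cells) for cells in self._rows.values())


table = SparseTable()
table.put("user1", "info:email", "a@example.com")   # insert
table.put("user2", "info:phone", "555-0100")        # insert a different column
table.put("user1", "info:email", "b@example.com")   # update in place
table.delete("user2", "info:phone")                 # delete
print(table.get("user1", "info:email"), table.cell_count())
```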
HCatalog 5
HCatalog is a metadata management service for Apache Hadoop. It opens up the platform and allows interoperability across data processing tools such as Pig, MapReduce and Hive. It also provides a table abstraction so that users need not be concerned with where or how their data is stored. Aster SQL-H interfaces with HCatalog.
Pig 6
Apache Pig allows you to write complex MapReduce transformations using a simple scripting language. Pig Latin (the language) defines a set of transformations on a data set, such as aggregate, join and sort. Pig Latin can be extended with UDFs (User Defined Functions), which the user can write in Java and then call directly from the language.
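The join, aggregate, and sort transformations mentioned for Pig can be illustrated as a small data flow. The sketch below is plain Python rather than Pig Latin, and the two sample relations are hypothetical; the comments name the Pig Latin operators (JOIN, GROUP, SUM, ORDER) that play the corresponding roles.

```python
from collections import defaultdict

# Hypothetical sample relations.
orders = [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)]
customers = [("alice", "US"), ("bob", "UK")]

# JOIN orders with customers on the customer name.
country_of = dict(customers)
joined = [(name, amount, country_of[name]) for name, amount in orders if name in country_of]

# GROUP by country and aggregate with SUM.
totals = defaultdict(float)
for _name, amount, country in joined:
    totals[country] += amount

# ORDER the result by total, largest first.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

In Pig each of these steps would be one statement, and Pig would compile the whole flow into MapReduce jobs.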
Oozie 7
Oozie coordinates jobs of multiple types, such as MapReduce, Pig and Hive. It is a workflow system that links these jobs and allows the order and dependencies between them to be specified.
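The core idea of a workflow with order and dependencies can be sketched with a topological sort. Oozie itself defines workflows in XML; the snippet below is not Oozie, just the ordering idea in plain Python (standard-library graphlib, Python 3.9+), and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "import_orders_sqoop": set(),
    "clean_logs_pig": set(),
    "join_mapreduce": {"import_orders_sqoop", "clean_logs_pig"},
    "report_hive": {"join_mapreduce"},
}

# static_order() yields every job after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

The two jobs without dependencies may come out in either order, which mirrors the fact that a scheduler is free to run them in parallel.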
Ambari 8
Apache Ambari operationalizes Hadoop. It is a monitoring, administration and lifecycle management project for Apache Hadoop clusters: it provides a mechanism to monitor and manage a cluster, and it also provisions nodes.
Sqoop 9
Sqoop is a set of tools that allow Hadoop to exchange data with non-Hadoop data stores, such as traditional relational databases and data warehouses.
ZooKeeper 10
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Big Data Refinery: Hadoop in Action
Sources: Web Logs (Website Interactions), DB Order Data, DB Customer Data
1 Web log files via WebHDFS APIs
2 Customer & order data via Talend & HCatalog for schema
3 Pre-processes, refines, and joins data via Talend, Pig, & HCatalog
4 Interfaces with HCatalog to analyze website visits by the type of end results
Hortonworks Vision & Role
We believe that by the end of 2015, more than half the world's data will be processed by Apache Hadoop.
1 Be diligent stewards of the open source core
2 Be tireless innovators beyond the core
3 Provide robust data platform services & open APIs
4 Enable the ecosystem at each layer of the stack
5 Make the platform enterprise-ready & easy to use
Balancing Innovation & Stability
[Figure: technology adoption life cycle; x-axis: time; y-axis: relative % customers]
Innovators, technology enthusiasts
Early adopters, visionaries
The CHASM
Early majority, pragmatists
Late majority, conservatives
Laggards, skeptics
Customers want solutions & convenience
Customers want technology & performance
Source: Geoffrey Moore, Crossing the Chasm
Enabling Hadoop as Enterprise Big Data Platform
DEVELOPER Data Platform Services & Open APIs
Hortonworks Data Platform
Applications, Business Tools, Development Tools, Open APIs and access
Data Movement & Integration, Data Management Systems, Systems Management
Installation & Configuration, Administration, Monitoring, High Availability, Replication, Multi-tenancy, …
Metadata, Indexing, Search, Security, Management, Data Extract & Load, APIs
• Tightly aligned with core Apache code line
• All code committed back to open source
• Most complete Apache Hadoop platform
• Comprehensive management and monitoring
• Intuitive graphical data integration tools
• Centralized metadata services for easy data sharing
The ONLY 100% open source data platform for Hadoop
Hortonworks Data Platform
• Simplify deployment to get started quickly and easily
• Monitor, manage any size cluster with familiar console and tools
• Only platform to include data integration services to interact with any data source
• Metadata services open the platform for integration with existing applications
• Dependable high availability architecture
Hortonworks Data Platform
Delivers enterprise-grade functionality on a proven Apache Hadoop distribution to ease management, simplify use and ease integration into the enterprise
The only 100% open source data platform for Apache Hadoop
Hortonworks Distribution
Built on Hadoop 1.0 (a.k.a. 0.20.205)
• Proven at large scale enterprise implementations
• Most stable and reliable version of Hadoop to date
• First Apache line supporting security, HBase, WebHDFS
• Driven by core committers and architects at Hortonworks
Includes necessary components already integrated and tested together
Most stable versions of all components are chosen
Apache Distribution Stack
Component   Version
Core        1.0.3
HCatalog    0.4.0
Pig         0.9.2
Hive        0.9.0+
HBase       0.92.1+
Sqoop       0.9.0+
Oozie       3.1.3
Zookeeper   3.3.4
Ambari      beta
Talend      5.1.1
Tested, Hardened & Proven Distribution Reduces Risk
Management & Monitoring Svcs
Hortonworks Management Center
– View the health of cluster operations, server utilization and performance levels
– Customizable dashboards
– APIs for integration into 3rd party monitoring tools
– 100% open source management & monitoring, powered by Apache Ambari, Puppet, Nagios and Ganglia
– Simple wizard-based installation, configuration & provisioning of any size Hadoop cluster
Optimize performance for your Hadoop cluster
Simplify installation and provisioning
Data Integration Services
• Intuitive graphical data integration tools for HDFS, Hive, HBase, HCatalog and Pig
• Oozie scheduling allows you to manage and stage jobs
• Connectors for any database, business application or system
• Integrated HCatalog storage
Bridge the gap between legacy data & Hadoop
Simplify and speed development
Which is best for the cloud?
HCatalog
• Raw Hadoop data → Table access
• Inconsistent, unknown → Aligned metadata
• Tool specific access → REST API
Apache HCatalog provides flexible metadata services across tools and external access
Metadata Services
• Consistency of metadata and data models across tools (MapReduce, Pig, HBase and Hive)
• Accessibility: share data as tables in and out of HDFS
• Availability: enables flexible, thin-client access via REST API
Shared table and schema management opens the platform
HDFS HBase External Store
Existing & New Applications
MapReduce Pig Hive
HCatalog
HCatalog RESTful Web Services
Services Integration
Provides RESTful API as “front door” for Hadoop
• Opens the door to languages other than Java
• Thin clients via web services vs. fat clients in gateway
• Insulation from interface changes release to release
Opens Hadoop to integration with existing and new applications
WebHDFS
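As a rough illustration of the REST "front door", the sketch below builds WebHDFS request URLs, which follow the pattern http://host:port/webhdfs/v1/path?op=OPERATION. The helper name, host name, and file paths are placeholders, and port 50070 assumes the default NameNode HTTP port of Hadoop 1.x; nothing here contacts a cluster. Any language with an HTTP client could issue the resulting requests, which is what opens the door to languages other than Java.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL of the form http://host:port/webhdfs/v1/<path>?op=..."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Host name, port and file paths are placeholders for illustration.
list_url = webhdfs_url("namenode.example.com", 50070, "/logs/web", "LISTSTATUS")
open_url = webhdfs_url(
    "namenode.example.com", 50070, "/logs/web/access.log", "OPEN", offset=0, length=1024
)
print(list_url)
print(open_url)
```

A GET on the first URL would list a directory and a GET on the second would read the first kilobyte of a file, assuming a reachable cluster with WebHDFS enabled.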
Use cases: optimize outcomes at scale
Media: optimize content
Intelligence: optimize detection
Investment: optimize algorithms
Advertising: optimize performance
Fraud: optimize prevention
Regulation: optimize compliance
Retail / Wholesale: optimize inventory turns
Manufacturing: optimize supply chains
Healthcare: optimize patient outcomes
Education: optimize learning outcomes
Government: optimize citizen services
Source: Geoffrey Moore. Hadoop Summit 2012 keynote presentation.
Connecting Transactions + Interactions + Observations
Business Transactions & Interactions: Web, Mobile, CRM, ERP, SCM, …
Business Intelligence & Analytics: Dashboards, Reports, Visualization, …
Data sources: Audio, Video, Images; Docs, Text, XML; Web Logs, Clicks; Social, Graph, Feeds; Sensors, Devices, RFID; Spatial, GPS; Events, Other
1 Classic ETL processing
2 Big Data Refinery: store, aggregate, and transform multi-structured data to unlock value
3 Share refined data & runtime models
4 Data Discovery & Investigative Analytics: interactive data exploration
5 Retain runtime models and historical data for ongoing refinement & analysis
6 Retain historical data to unlock additional value
THANK YOU
1 Get Hortonworks Data Platform: hortonworks.com/download
2 Use the getting started guide: hortonworks.com/get-started
3 Learn more… get support: hortonworks.com/training, hortonworks.com/support
Jim Walker [email protected] @jaymce