What Is Hadoop and Why Is It Important?

Copyright © 2013, SAS Institute Inc. All rights reserved. HADOOP PRIMER STEVE HOLDER

Upload: sas-canada

Post on 19-Aug-2015

HADOOP PRIMER

STEVE HOLDER

HADOOP AGENDA

• Why Big Data, and why is it different?
• What is Hadoop?
• The players

BIG DATA AN ERA OF ABUNDANCE

[Figure: "Where we are now" timeline, 2005 to 2013. Labels: Lots of data; Processing Power; HADOOP ANALYTICS; Intelligence]

BIG DATA

Large Volumes of Unstructured Data
• Mine data
• Detect nuggets of relevant data while disregarding unimportant data

Smaller Structured Data Sets
• Run queries for insight
• Know what you’re looking for

BIG DATA DEALING WITH NOISY DATA

Example: What’s Important Data in a Web Log?

Visitor A views 1st product
Visitor A - irrelevant data
[Skip 2 seconds (12,600 record entries) for next meaningful action from this user]
Visitor A views 2nd product

Record sent to the EDW:

User            Product     Product     etc
43.251.164.128  B003ZX8B3W  B00365F6EG  etc
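The web log example can be sketched in a few lines of Python. This is only an illustration: the log format, the event names (`view_product`, `heartbeat`, `asset_request`) and the function `extract_product_views` are assumptions made for the sketch, not something taken from the slide. It shows the idea of keeping the meaningful actions and collapsing them into one row per visitor.

```python
from collections import defaultdict

# Hypothetical log format for illustration: (visitor_ip, event_type, detail).
# Only "view_product" events are treated as meaningful; the rest is noise.
RAW_LOG = [
    ("43.251.164.128", "view_product", "B003ZX8B3W"),
    ("43.251.164.128", "heartbeat", ""),
    ("43.251.164.128", "asset_request", "/img/logo.png"),
    ("43.251.164.128", "view_product", "B00365F6EG"),
]


def extract_product_views(log):
    """Collapse a noisy event log into one list of viewed products per visitor."""
    views = defaultdict(list)
    for visitor, event_type, detail in log:
        if event_type == "view_product":  # keep the meaningful action
            views[visitor].append(detail)
        # every other event type is dropped as irrelevant data
    return dict(views)


rows = extract_product_views(RAW_LOG)
for visitor, products in rows.items():
    print(visitor, *products)
```

In a real cluster the same filter would run in parallel over many log files; the single-process loop here only demonstrates the selection logic.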

WHAT IS APACHE HADOOP?

Apache Hadoop is open source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers.
• Reliable - Software is fault tolerant; it expects and handles hardware and software failures
• Scalable - Designed for massive scale of processors, memory, and locally attached storage
• Distributed - Handles replication. Offers a massively parallel programming model, MapReduce

The Hadoop framework handles partitioning, scheduling, dispatch, execution, communication, failure handling, monitoring, reporting, and more.
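To make the MapReduce programming model more concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) using the classic word count. It does not use any Hadoop API; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are illustrative assumptions. In Hadoop the framework would run the map and reduce steps on many nodes and handle the partitioning, scheduling and failure handling listed above.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

Because each `map_phase` call looks at one record and each `reduce_phase` call looks at one key, both steps can be spread across machines without coordination, which is what makes the model massively parallel.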

HADOOP LOGICAL VIEW

Hadoop Distributed File System (HDFS)
• Reliable and cheap data storage
• Uses commodity hardware

YARN
• Resource manager
• Key to enterprise scalability
• Provides hooks into HDFS

MapReduce
• Programming model
• Create queries
• Manages execution

Hive
• Solution on top of Hadoop
• Direct access to HDFS and HBase
• Provides access to Hadoop
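How HDFS gets reliable storage out of commodity hardware can be sketched as follows: a file is split into fixed-size blocks and each block is copied to several nodes, so losing one node does not lose data. The sketch below is a simplification under stated assumptions. The 128 MB block size and replication factor of 3 are common HDFS defaults (both configurable), the round-robin placement stands in for HDFS's real rack-aware policy, and the function and node names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # common HDFS default block size (configurable)
REPLICATION = 3      # common HDFS default replication factor (configurable)


def place_blocks(file_size_mb, nodes, block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION):
    """Return {block_index: [node, ...]} using simple round-robin placement.

    Real HDFS placement is rack-aware; round-robin is a simplification.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement


def readable_after_failure(placement, failed_node):
    """A file stays readable if every block still has a surviving replica."""
    return all(any(node != failed_node for node in replicas)
               for replicas in placement.values())


layout = place_blocks(300, ["node1", "node2", "node3", "node4"])
print(layout)
print(readable_after_failure(layout, "node2"))
```

With three replicas per block, any single node can fail and every block still has two surviving copies, which is why inexpensive servers are acceptable.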

APACHE HADOOP BOTTOM LINE

Strengths
+ Huge data volumes
+ Unstructured data
+ Reliable
+ Scalable
+ Lowest cost
+ Open source
+ No hardware lock-in
+ Batch processing

Weaknesses
- Limited to no built-in analytics
- Not efficient at small scale
- Requires skilled engineering, operations and analyst resources
- Hiring qualified talent
- Less mature than SQL
- Governance
- Lack of user role support in access model

HADOOP ECOSYSTEM

THE HADOOP ARMS RACE

WHO’S WHO IN THE ZOO

HADOOP VENDORS

CLOUDERA
• First in the market
• Proprietary software to enhance ecosystem
• Single place to store, process and analyze all your data
• In many large accounts as the incumbent
• Partner approach has been more conservative

HADOOP VENDORS

HORTONWORKS
• 100% open source
• Driving the most Apache projects
• Creator of and leader in YARN
• Seeing a good deal of traction due to being 100% open source
• Partner approach has been much more open and beneficial

SAS POSITION

SAS & HADOOP HOW?

SAS & Hadoop intersect in many ways:

SAS can treat Hadoop just as any other data source, pulling data FROM Hadoop when it is most convenient;

SAS can work WITH Hadoop, lifting data into a purpose-built advanced analytics in-memory environment;

SAS can work directly IN Hadoop, leveraging the distributed processing capabilities of Hadoop.

HADOOP QUICK OVERVIEW

Why is it important?
• Hadoop has moved to the enterprise - it’s the go-to for Big Data
• Adoption is faster than most other technologies
• Hadoop cost per TB is cheaper than traditional DBs

Why is Hadoop important to SAS?
• We make them look good - #1 reason for adoption of Hadoop = Analytics
• They are selling around us - why not partner?
• We have an amazing story - From, With and In
• Joint knowledge will mitigate Open Source threat

sas.com

QUESTIONS