Copyright © 2015, SAS Institute Inc. All rights reserved. THE ELEPHANT IN THE ROOM SAS & HADOOP

Upload: kelly-hancock

Post on 18-Jan-2016

TRANSCRIPT

Page 1:

THE ELEPHANT IN THE ROOM SAS & HADOOP

Page 2:

SAS & HADOOP ABOUT THE PRESENTER

• Jim Watson
• SAS Education, Canberra
• Background in SAS Programming, SQL programming, Database Processing, Grid Processing, et al.
• With SAS since 1999

Page 3:

SAS & HADOOP LIST OF TOPICS

• What is Hadoop?
• How SAS integrates with Hadoop
  • HDFS
  • LIBNAME Engine
  • Explicit Pass-through
  • High Performance Analytics

Page 4:

SAS & HADOOP WHAT IS HADOOP?

• Apache Hadoop is an Open Source Software Framework
• Written in Java
• For Distributed Storage and processing of very large datasets on computer clusters
• Built from Commodity Hardware

Page 5:

SAS & HADOOP ADVANTAGES OF HADOOP

• Some characteristics of Hadoop include:
  • Open-source
  • Simple to use distributed file system
  • Supports highly parallel processing
  • It’s scalable, so it’s suitable for massive amounts of data
  • It is designed to work on low-cost hardware
  • It’s fault tolerant (redundant) at the data level
    • automatic replication of data
    • automatic fail-over

Page 6:

SAS & HADOOP HADOOP FUNDAMENTALS

• HDFS – “Hadoop Distributed File System”
  • Files are distributed across the Hadoop cluster
• Hadoop YARN
  • A framework for job scheduling and cluster resource management
• MapReduce
  • Files are processed locally and in parallel
  • Based on YARN

These modules handle reading, writing, and processing large files in a distributed environment. This allows the cluster to be exploited as if it were a single massively powerful server.

Page 7:

SAS & HADOOP HADOOP DISTRIBUTED FILE SYSTEM

• HDFS is hierarchical with LINUX style paths and file ownership and permissions.
• HADOOP FS commands are similar to LINUX commands.
• HDFS is not built into the operating system.
• Files are append-only after they are written.

$ hadoop fs -ls /user/student1
Found 4 items
drwxr-xr-x   - student1 sasapp          0 2014-05-30 20:00 /user/student1/.Trash
drwx------   - student1 sasapp          0 2014-05-30 10:05 /user/student1/.stage
drwxr-xr-x   - student1 sasapp          0 2014-05-28 15:25 /user/student1/data
drwxr-xr-x   - student1 sasapp          0 2014-05-28 13:59 /user/student1/users
$ hadoop fs -mkdir /user/student1/newdir
$

Page 8:

SAS & HADOOP MAPREDUCE

• MapReduce is a framework written in Java that is built into Hadoop. It automates the distributed processing of data files.

map: processing of individual rows (filtering, row calculations)
shuffle and sort: grouping rows for summarisation
reduce: summary calculations within groups

The MapReduce framework coordinates multiple mapping, sorting, and reducing tasks that execute in parallel across the computer cluster.

Page 9:

SAS & HADOOP WHAT’S INSIDE

[Architecture diagram showing: SAS Client; SAS metadata server; SAS workspace server; Hadoop NameNode; Hive; Hadoop DataNode 1; Hadoop DataNode 2; Hadoop DataNode 3]

Page 10:

SAS & HADOOP PARALLEL PROCESSING EXAMPLE

• A MapReduce Example: Summarise a detailed order table to derive total revenue by state. The table is already distributed in HDFS.

Input:

id  st  rev
1   NC  10
2   GA  12
3   VA  8
4   NC  9
5   VA  22
6   NC  18
7   NC  2
8   GA  53
...

Output:

st  totrev
GA  65
NC  39
VA  30
...
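The summarisation above can be sketched in plain Python to make the three phases concrete. This is a single-process illustration rather than Hadoop code: the function names and the `block_size` parameter are hypothetical, and the "blocks" are just list slices standing in for HDFS file blocks. On a real cluster, the map calls would run in parallel on the nodes that hold each block.

```python
from collections import defaultdict

# Order rows (id, st, rev) copied from the example table on this slide.
ORDERS = [
    (1, "NC", 10), (2, "GA", 12), (3, "VA", 8), (4, "NC", 9),
    (5, "VA", 22), (6, "NC", 18), (7, "NC", 2), (8, "GA", 53),
]


def split_into_blocks(rows, block_size):
    """Stand-in for HDFS storing a file as blocks (block_size is hypothetical)."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def map_block(block):
    """Map: emit one (state, revenue) pair per input row."""
    return [(st, rev) for _id, st, rev in block]


def shuffle(mapped_blocks):
    """Shuffle and sort: gather all values that share a key, ordered by key."""
    groups = defaultdict(list)
    for pairs in mapped_blocks:
        for st, rev in pairs:
            groups[st].append(rev)
    return dict(sorted(groups.items()))


def reduce_groups(groups):
    """Reduce: compute one summary value per key."""
    return {st: sum(revs) for st, revs in groups.items()}


def total_revenue_by_state(rows, block_size=4):
    blocks = split_into_blocks(rows, block_size)
    mapped = [map_block(b) for b in blocks]  # sequential here, parallel on a cluster
    return reduce_groups(shuffle(mapped))


if __name__ == "__main__":
    print(total_revenue_by_state(ORDERS))
```

The result does not depend on how the rows are split into blocks, which is what lets the map phase run independently on each block.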

Page 11:

SAS & HADOOP PARALLEL PROCESSING EXAMPLE

[Diagram: file blocks -> map -> shuffle -> reduce -> output]

File blocks

Block 1:
id  st   rev
1   NSW  10
2   QLD  12
3   VIC  8
4   NSW  9
...
Block n:
id  st   rev
5   VIC  22
6   NSW  18
7   NSW  2
8   QLD  53

map (one output per block)

Block 1:
st   rev  ct
NSW  10   1
QLD  12   1
VIC  8    1
NSW  9    1
...
Block n:
st   rev  ct
VIC  22   1
NSW  18   1
NSW  2    1
QLD  53   1

shuffle (rows grouped by state)

st   rev  ct
NSW  10   1
NSW  9    1
NSW  18   1
NSW  2    1

st   rev  ct
VIC  8    1
VIC  22   1
...

reduce (one output per group)

st   totrev
NSW  39

st   totrev
VIC  30
...

Page 12:

SAS & HADOOP “PIG” & “HIVE”

Pig and Hive provide less complex, higher-level programming methods for parallel processing of Hadoop data files.

Pig: A platform for data analysis that includes stepwise procedural programming that converts to MapReduce.

Hive: A data warehousing framework to query and manage large data sets stored in Hadoop. Provides a mechanism to structure the data and query the data using an SQL-like language called HiveQL. Most HiveQL queries are compiled into MapReduce programs.

Page 13:

SAS & HADOOP THE HADOOP ECOSYSTEM

The Apache Hadoop core technologies of HDFS, YARN, and MapReduce, along with additional projects including Pig, Hive, and others, are collectively called the Hadoop ecosystem.

Page 14:

SAS & HADOOP EXPLOITING THE HDFS

• The Hadoop FILENAME engine
  • Upload local data to Hadoop
  • Read data from Hadoop
  • Use normal SAS PROC & DATA Steps
• PROC HADOOP
  • Submit HDFS Commands
  • Submit MapReduce & PIG programs

Page 15:

SAS & HADOOP THE FILENAME STATEMENT & HDFS

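/* Fileref for the Hadoop cluster configuration file */
filename hadconfg "/workshop/hadoop_config.xml";

/* Fileref for an HDFS directory; CONCAT reads its files as one stream */
filename mapres hadoop "/user/&std/data/mapoutput" concat
         cfg=hadconfg user="&std";

/* Read the tab-delimited MapReduce output into a SAS data set */
data work.commonwords;
   infile mapres dlm='09'x;
   input word $ count;
   …
run;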

Page 16:

SAS & HADOOP PROC HADOOP

• PROC HADOOP submits
  • Hadoop file system (HDFS) commands
  • MapReduce programs
  • PIG language code.

PROC HADOOP <Hadoop-server-option(s)>;
   HDFS <Hadoop-server-option(s)> <hdfs-command-option(s)>;
   MAPREDUCE <Hadoop-server-option(s)> <mapreduce-option(s)>;
   PIG <Hadoop-server-option(s)> <pig-code-option(s)>;
   PROPERTIES <configuration-properties>;
RUN;

Page 17:

SAS & HADOOP PROC HADOOP – HDFS STATEMENTS

HDFS COPYFROMLOCAL='local-file' OUT='output-location' <DELETESOURCE> <OVERWRITE>;

HDFS COPYTOLOCAL='HDFS-file' OUT='output-location' <DELETESOURCE> <OVERWRITE> <KEEPCRC>;

HDFS DELETE='HDFS-file' <NOWARN>;

HDFS MKDIR='HDFS-path';

HDFS RENAME='HDFS-file' OUT='new-name';

Page 18:

SAS & HADOOP ACCESS HIVE TABLES VIA SAS

Two main methods to exploit Hadoop Hive tables in SAS:
• The LIBNAME Engine (aka “Implicit Pass Through”)
  • Assign a LIBREF to Hive and use SAS code upon the LIBREF
  • SAS Code is automatically converted to Hive
• Explicit Pass Through
  • Hive code is embedded in SAS code and is submitted verbatim to Hadoop

Page 19:

SAS & HADOOP THE HADOOP LIBNAME ENGINE

libname hivedb hadoop server=namenode subprotocol=hive2
        port=10000 schema=diacchad user=studentX pw=StudentX;

LIBNAME libref engine-name <connection options> <LIBNAME-options>;

23   libname hivedb hadoop server=namenode
24           subprotocol=hive2
25           port=10000 schema=diacchad
26           user="&std" pw="&stdpw";
NOTE: Libref HIVEDB was successfully assigned as follows:
      Engine:        HADOOP
      Physical Name: jdbc:hive2://namenode:10000/diacchad

Page 20:

SAS & HADOOP LIBNAME ENGINE EXAMPLE

options sastrace=',,,d' sastraceloc=saslog nostsuffix;

proc means data=hivedb.order_fact sum mean;
   var total_retail_price;
run;

proc freq data=hivedb.order_fact;
   tables order_type;
run;

options sastrace=off;

Page 21:

SAS & HADOOP LIBNAME ENGINE EXAMPLE

NOTE: SQL generation will be used to perform the initial summarization.

HADOOP_41: Executed: on connection 7
select T1.ZSQL1, T1.ZSQL2, T1.ZSQL3, T1.ZSQL4 from
   ( select COUNT(*) as ZSQL1,
            COUNT(*) as ZSQL2,
            COUNT(TXT_1.`total_retail_price`) as ZSQL3,
            SUM(TXT_1.`total_retail_price`) as ZSQL4
     from `ORDER_FACT` TXT_1 ) T1
where T1.ZSQL1 > 0

ACCESS ENGINE: SQL statement was passed to the DBMS for fetching data.
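The trace shows that only counts and a sum are requested from Hive, so the SUM and MEAN statistics are presumably finished on the SAS side from those aggregates. The sketch below illustrates that idea with Python's built-in `sqlite3` module standing in for Hive. It is not SAS or Hive code: the `summarise` function and the sample prices are hypothetical, and the query only mirrors the shape of the generated SQL.

```python
import sqlite3


def summarise(prices):
    """Return (sum, mean) using only aggregates computed inside the database."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE order_fact (total_retail_price REAL)")
    con.executemany("INSERT INTO order_fact VALUES (?)", [(p,) for p in prices])
    # Same shape as the traced query: row count, non-missing count, and sum.
    n_rows, n_nonmissing, total = con.execute(
        "SELECT COUNT(*), COUNT(total_retail_price), SUM(total_retail_price) "
        "FROM order_fact"
    ).fetchone()
    con.close()
    if not n_nonmissing:
        # No non-missing values, so there is nothing to summarise.
        return None, None
    # The mean is derived client-side from the two pushed-down aggregates.
    return total, total / n_nonmissing


if __name__ == "__main__":
    # Hypothetical prices; None plays the role of a missing value.
    print(summarise([10.0, 20.0, None, 30.0]))
```

Only one summary row crosses the network in this pattern, however large the table is, which is the point of pushing the summarisation down to the database.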

Page 22:

SAS & HADOOP EXPLICIT PASS THROUGH

proc sql;
   connect to hadoop
      (server=namenode subprotocol=hive2 schema=diacchad user="&std");

   select * from connection to hadoop
      (select employee_name, salary
         from salesstaff
         where emp_hire_date between '2011-01-01' and '2011-12-31'
      );

   disconnect from hadoop;
quit;

Page 23:

SAS & HADOOP HIGH PERFORMANCE ANALYTICS

Interface: High-Performance Analytics Procedures
Purpose: Perform complex analytical computations on Hadoop tables within the data nodes of the Hadoop distribution via SAS procedure language. HPDS2 allows for manipulation of data structure (column derivation).
Product: SAS High-Performance Analytics Solutions

Interface: SAS Visual Analytics
Purpose: A web interface to generate graphical visualizations of data distributions and relationships on Hadoop tables pre-loaded into memory within the data nodes of the Hadoop distribution.
Product: SAS Visual Analytics

Interface: PROC IMSTAT
Purpose: A programming interface to perform complex analytical calculations on Hadoop tables pre-loaded into memory within the data nodes of the Hadoop distribution.
Product: SAS In-Memory Statistics

Interface: DS2
Purpose: A SAS proprietary language for table manipulation that translates to database language and executes in parallel in the data nodes of a distributed database.
Product: SAS In-Database Code Accelerators; Data Loader for Hadoop

Page 24:

SAS & HADOOP HIGH PERFORMANCE ANALYTICS

[Architecture diagram showing: SAS Client; SAS metadata server; SAS workspace server; Hadoop NameNode; Hive; Hadoop DataNodes 1, 2, and 3; one SAS High Performance Analytics Root Node; three SAS High Performance Analytics Worker Nodes]

SAS processes in each HDFS data node execute in parallel.

Page 25:

SAS & HADOOP LEARNING MORE

• SAS Website
• SAS Education
  • Introduction to SAS & Hadoop
    • 2 Day course requiring some SAS Programming & SQL knowledge
  • DS2 Programming: Essentials
    • 2 Day course, requires intermediate SAS Programming knowledge
  • DS2 Programming Essentials with Hadoop
    • 1 ½ day course, requires intermediate SAS Programming knowledge