THE ELEPHANT IN THE ROOM: SAS & HADOOP
Copyright © 2015, SAS Institute Inc. All rights reserved.

SAS & HADOOP: ABOUT THE PRESENTER
• Jim Watson
• SAS Education, Canberra
• Background in SAS Programming, SQL Programming, Database Processing, Grid Processing, et al.
• With SAS since 1999

SAS & HADOOP: LIST OF TOPICS
• What is Hadoop?
• How SAS integrates with Hadoop
  • HDFS
  • LIBNAME Engine
  • Explicit Pass-through
  • High Performance Analytics

SAS & HADOOP: WHAT IS HADOOP?
• Apache Hadoop is an Open Source Software Framework
• Written in Java
• For distributed storage and processing of very large datasets on computer clusters
• Built from Commodity Hardware

SAS & HADOOP: ADVANTAGES OF HADOOP
• Some characteristics of Hadoop include:
  • Open-source
  • Simple-to-use distributed file system
  • Supports highly parallel processing
  • It's scalable, so it's suitable for massive amounts of data
  • It is designed to work on low-cost hardware
  • It's fault tolerant (redundant) at the data level
    • automatic replication of data
    • automatic fail-over

SAS & HADOOP: HADOOP FUNDAMENTALS
• HDFS: "Hadoop Distributed File System"
  • Files are distributed across the Hadoop cluster
• Hadoop YARN
  • A framework for job scheduling and cluster resource management
• MapReduce
  • Files are processed locally and in parallel
  • Based on YARN
These modules handle reading, writing, and processing large files in a distributed environment. This allows the cluster to be used as if it were a single, massively powerful server.
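
To make "files are distributed across the cluster" concrete, here is a minimal Python sketch of how a file might be cut into blocks and each block copied to several DataNodes. It is an illustration only: the node names are hypothetical, the 128 MB block size and replication factor of 3 are common Hadoop defaults rather than values from these slides, and the simple round-robin placement is an assumption (real HDFS uses a rack-aware placement policy).

```python
import math

def place_blocks(file_size_mb, nodes, block_size_mb=128, replication=3):
    """Split a file into fixed-size blocks and assign each block to
    `replication` distinct nodes (simplified round-robin, NOT the real
    HDFS rack-aware policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }

def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )

# Hypothetical four-node cluster holding one 300 MB file.
nodes = ["datanode1", "datanode2", "datanode3", "datanode4"]
placement = place_blocks(300, nodes)
for block, replicas in placement.items():
    print(block, replicas)
print(readable_after_failure(placement, "datanode2"))
```

Because every block lives on more than one node, losing a single node leaves all blocks readable, which is the data-level fault tolerance described on the previous slide.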

SAS & HADOOP: HADOOP DISTRIBUTED FILE SYSTEM
• HDFS is hierarchical, with Linux-style paths, file ownership, and permissions.
• Hadoop FS commands are similar to Linux commands.
• HDFS is not built into the operating system.
• Files are append-only after they are written.

$ hadoop fs -ls /user/student1
Found 4 items
drwxr-xr-x   - student1 sasapp          0 2014-05-30 20:00 /user/student1/.Trash
drwx------   - student1 sasapp          0 2014-05-30 10:05 /user/student1/.stage
drwxr-xr-x   - student1 sasapp          0 2014-05-28 15:25 /user/student1/data
drwxr-xr-x   - student1 sasapp          0 2014-05-28 13:59 /user/student1/users
$ hadoop fs -mkdir /user/student1/newdir
$

SAS & HADOOP: MAPREDUCE
• MapReduce is a framework written in Java that is built into Hadoop. It automates the distributed processing of data files.

  map               processing of individual rows (filtering, row calculations)
  shuffle and sort  grouping rows for summarisation
  reduce            summary calculations within groups

The MapReduce framework coordinates multiple mapping, sorting, and reducing tasks that execute in parallel across the computer cluster.
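
The three phases can be sketched in a few lines of plain Python. This is a single-process illustration of the idea, not Hadoop's Java API: the function names and the sample sentences are made up for the example, and a real cluster would run many map and reduce tasks on different nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: process one input row, emitting (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_and_sort(pairs):
    """Shuffle and sort: group all values that share a key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce: summarise the values within one group."""
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

lines = ["the elephant in the room", "the room"]
print(word_count(lines))
# prints {'elephant': 1, 'in': 1, 'room': 2, 'the': 3}
```

Word counting is used here because the map step needs no knowledge of other rows and the reduce step needs only the rows for one key, which is what lets the framework spread both steps across the cluster.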

SAS & HADOOP: WHAT'S INSIDE
[Architecture diagram. Components shown: SAS Client; SAS metadata server; SAS workspace server; Hadoop NameNode; Hive; Hadoop DataNode 1; Hadoop DataNode 2; Hadoop DataNode 3]

SAS & HADOOP: PARALLEL PROCESSING EXAMPLE
• A MapReduce example: summarise a detailed order table to derive total revenue by state. The table is already distributed in HDFS.

Input (detailed orders):
  id  st  rev
  1   NC  10
  2   GA  12
  3   VA  8
  4   NC  9
  5   VA  22
  6   NC  18
  7   NC  2
  8   GA  53
  ...

Output (total revenue by state):
  st  totrev
  GA  65
  NC  39
  VA  30
  ...

SAS & HADOOP: PARALLEL PROCESSING EXAMPLE
[Diagram: File blocks -> map -> shuffle -> reduce -> output]

File blocks (input):
  Block 1          Block n
  id  st   rev     id  st   rev
  1   NSW  10      5   VIC  22
  2   QLD  12      6   NSW  18
  3   VIC  8       7   NSW  2
  4   NSW  9       8   QLD  53

map (one output table per block):
  Block 1          Block n
  st   rev  ct     st   rev  ct
  NSW  10   1      VIC  22   1
  QLD  12   1      NSW  18   1
  VIC  8    1      NSW  2    1
  NSW  9    1      QLD  53   1

shuffle (rows grouped by state):
  st   rev  ct     st   rev  ct
  NSW  10   1      VIC  8    1
  NSW  9    1      VIC  22   1
  NSW  18   1      ...
  NSW  2    1

reduce (one output per group):
  st   totrev      st   totrev
  NSW  39          VIC  30
  ...
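
The flow in this diagram can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not how Hadoop is implemented: the two lists stand in for the file blocks on the slide, a thread pool stands in for map tasks running on separate data nodes, and all function names are invented for the example. The QLD total is not visible on the slide; it simply follows from the same input rows.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# The two file blocks from the slide: (id, st, rev).
BLOCKS = [
    [(1, "NSW", 10), (2, "QLD", 12), (3, "VIC", 8), (4, "NSW", 9)],
    [(5, "VIC", 22), (6, "NSW", 18), (7, "NSW", 2), (8, "QLD", 53)],
]

def map_block(block):
    """Map: keep only state and revenue, and attach a count of 1 per row."""
    return [(st, (rev, 1)) for _id, st, rev in block]

def shuffle(mapped_blocks):
    """Shuffle: bring together the rows for each state from every block."""
    groups = defaultdict(list)
    for mapped in mapped_blocks:
        for st, value in mapped:
            groups[st].append(value)
    return groups

def reduce_group(st, values):
    """Reduce: total revenue and row count within one state."""
    totrev = sum(rev for rev, _ct in values)
    count = sum(ct for _rev, ct in values)
    return st, totrev, count

def revenue_by_state(blocks):
    # Each block is mapped independently, so the map step can run in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_block, blocks))
    groups = shuffle(mapped)
    return [reduce_group(st, groups[st]) for st in sorted(groups)]

for st, totrev, count in revenue_by_state(BLOCKS):
    print(st, totrev, count)
# prints:
# NSW 39 4
# QLD 65 2
# VIC 30 2
```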

SAS & HADOOP: "PIG" & "HIVE"
Pig and Hive provide less complex, higher-level programming methods for parallel processing of Hadoop data files.

  Pig   A platform for data analysis that includes stepwise procedural programming that converts to MapReduce.
  Hive  A data warehousing framework to query and manage large data sets stored in Hadoop. Provides a mechanism to structure the data and query the data using an SQL-like language called HiveQL. Most HiveQL queries are compiled into MapReduce programs.

SAS & HADOOP: THE HADOOP ECOSYSTEM
The Apache Hadoop core technologies of HDFS, YARN, and MapReduce, along with additional projects including Pig, Hive, and others, are collectively called the Hadoop ecosystem.

SAS & HADOOP: EXPLOITING THE HDFS
• The Hadoop FILENAME engine
  • Upload local data to Hadoop
  • Read data from Hadoop
  • Use normal SAS PROC & DATA Steps
• PROC HADOOP
  • Submit HDFS Commands
  • Submit MapReduce & PIG programs

SAS & HADOOP: THE FILENAME STATEMENT & HDFS
filename hadconfg "/workshop/hadoop_config.xml";

filename mapres hadoop "/user/&std/data/mapoutput" concat
         cfg=hadconfg user="&std";

data work.commonwords;
   infile mapres dlm='09'x;
   input word $ count;
   …
run;

SAS & HADOOP: PROC HADOOP
• PROC HADOOP submits
  • Hadoop file system (HDFS) commands
  • MapReduce programs
  • PIG language code

PROC HADOOP <Hadoop-server-option(s)>;
   HDFS <Hadoop-server-option(s)> <hdfs-command-option(s)>;
   MAPREDUCE <Hadoop-server-option(s)> <mapreduce-option(s)>;
   PIG <Hadoop-server-option(s)> <pig-code-option(s)>;
   PROPERTIES <configuration-properties>;
RUN;

SAS & HADOOP: PROC HADOOP – HDFS STATEMENTS
HDFS COPYFROMLOCAL='local-file' OUT='output-location' <DELETESOURCE> <OVERWRITE>;
HDFS COPYTOLOCAL='HDFS-file' OUT='output-location' <DELETESOURCE> <OVERWRITE> <KEEPCRC>;
HDFS DELETE='HDFS-file' <NOWARN>;
HDFS MKDIR='HDFS-path';
HDFS RENAME='HDFS-file' OUT='new-name';

SAS & HADOOP: ACCESS HIVE TABLES VIA SAS
Two main methods to exploit Hadoop Hive tables in SAS:
• The LIBNAME Engine (aka "Implicit Pass Through")
  • Assign a LIBREF to Hive and use SAS code upon the LIBREF
  • SAS code is automatically converted to Hive
• Explicit Pass Through
  • Hive code is embedded in SAS code and is submitted verbatim to Hadoop

SAS & HADOOP: THE HADOOP LIBNAME ENGINE
libname hivedb hadoop server=namenode
        subprotocol=hive2
        port=10000 schema=diacchad
        user=studentX pw=StudentX;

LIBNAME libref engine-name <connection options> <LIBNAME-options>;

23   libname hivedb hadoop server=namenode
24           subprotocol=hive2
25           port=10000 schema=diacchad
26           user="&std" pw="&stdpw";
NOTE: Libref HIVEDB was successfully assigned as follows:
      Engine:        HADOOP
      Physical Name: jdbc:hive2://namenode:10000/diacchad

SAS & HADOOP: LIBNAME ENGINE EXAMPLE
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

proc means data=hivedb.order_fact sum mean;
   var total_retail_price;
run;

proc freq data=hivedb.order_fact;
   tables order_type;
run;

options sastrace=off;

SAS & HADOOP: LIBNAME ENGINE EXAMPLE
NOTE: SQL generation will be used to perform the initial summarization.

HADOOP_41: Executed: on connection 7
select T1.ZSQL1, T1.ZSQL2, T1.ZSQL3, T1.ZSQL4
from ( select COUNT(*) as ZSQL1,
              COUNT(*) as ZSQL2,
              COUNT(TXT_1.`total_retail_price`) as ZSQL3,
              SUM(TXT_1.`total_retail_price`) as ZSQL4
       from `ORDER_FACT` TXT_1 ) T1
where T1.ZSQL1 > 0

ACCESS ENGINE: SQL statement was passed to the DBMS for fetching data.
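
The log shows the engine sending a summary query to Hive instead of pulling every row back to SAS. The Python sketch below illustrates why that matters, using an in-memory SQLite table as a stand-in for the Hive table. Everything in it is an assumption made for illustration: the four price values are invented, SQLite is not Hive, and the functions only mimic the idea of in-database summarisation, not the SAS/ACCESS engine itself.

```python
import sqlite3

def build_demo_table():
    """Create a tiny stand-in for ORDER_FACT (values are made up)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table order_fact (total_retail_price real)")
    conn.executemany(
        "insert into order_fact values (?)",
        [(10.0,), (20.0,), (None,), (30.0,)],
    )
    return conn

def summarise_in_database(conn):
    """Push the aggregation to the database: only one row comes back."""
    _n_rows, n, total = conn.execute(
        "select count(*), count(total_retail_price), sum(total_retail_price) "
        "from order_fact"
    ).fetchone()
    return {"rows_fetched": 1, "n": n, "sum": total,
            "mean": total / n if n else None}

def summarise_on_client(conn):
    """Fetch every row, then aggregate locally: same answer, more traffic."""
    rows = conn.execute("select total_retail_price from order_fact").fetchall()
    values = [v for (v,) in rows if v is not None]
    total = sum(values)
    return {"rows_fetched": len(rows), "n": len(values), "sum": total,
            "mean": total / len(values) if values else None}

conn = build_demo_table()
print(summarise_in_database(conn))
print(summarise_on_client(conn))
```

Both functions return the same sum and mean, but the in-database version transfers a single summary row, while the client-side version transfers the whole table. On a large Hive table that difference is the point of the SQL generation noted in the log.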

SAS & HADOOP: EXPLICIT PASS THROUGH
proc sql;
   connect to hadoop
      (server=namenode subprotocol=hive2 schema=diacchad user="&std");
   select * from connection to hadoop
      (select employee_name, salary
       from salesstaff
       where emp_hire_date between '2011-01-01' and '2011-12-31'
      );
   disconnect from hadoop;
quit;

SAS & HADOOP: HIGH PERFORMANCE ANALYTICS
Interface: High-Performance Analytics Procedures
Purpose:   Perform complex analytical computations on Hadoop tables within the data nodes of the Hadoop distribution via SAS procedure language. HPDS2 allows for manipulation of data structure (column derivation).
Product:   SAS High-Performance Analytics Solutions

Interface: SAS Visual Analytics
Purpose:   A web interface to generate graphical visualizations of data distributions and relationships on Hadoop tables pre-loaded into memory within the data nodes of the Hadoop distribution.
Product:   SAS Visual Analytics

Interface: PROC IMSTAT
Purpose:   A programming interface to perform complex analytical calculations on Hadoop tables pre-loaded into memory within the data nodes of the Hadoop distribution.
Product:   SAS In-Memory Statistics

Interface: DS2
Purpose:   A SAS proprietary language for table manipulation that translates to database language and executes in parallel in the data nodes of a distributed database.
Product:   SAS In-Database Code Accelerators; Data Loader for Hadoop

SAS & HADOOP: HIGH PERFORMANCE ANALYTICS
[Architecture diagram. Components shown: SAS Client; SAS metadata server; SAS workspace server; Hadoop NameNode; Hive; SAS High Performance Analytics Root Node; Hadoop DataNode 1, 2, and 3, each with a SAS High Performance Analytics Worker Node]
SAS processes in each HDFS data node execute in parallel.

SAS & HADOOP: LEARNING MORE
• SAS Website
• SAS Education
  • Introduction to SAS & Hadoop
    • 2-day course requiring some SAS programming and SQL knowledge
  • DS2 Programming: Essentials
    • 2-day course, requires intermediate SAS programming knowledge
  • DS2 Programming Essentials with Hadoop
    • 1½-day course, requires intermediate SAS programming knowledge