Hadoop Workflows Using SAS® Data Integration Studio

Lal Puthenveedu Rajanpillai
Solution Architect, United HealthCare

#AnalyticsX
Copyright © 2016, SAS Institute Inc. All rights reserved.



AGENDA

· Analytics Platform highlights
· Hadoop cluster – Architecture changes
· ETL Process changes
· Leveraging SAS DI Studio
· Best practices & Lessons learned
· Questions


Platform highlights

· Claims and financial data
· 750+ users across the enterprise
· Connected to multiple data sources
· Metadata-driven SAS Grid environment
· Access from SAS Clients and SAS Add-ons


Pre Hadoop

[Diagram: Analytics Platform, Legacy]
· Data sources: Claims, Revenue, Membership, Financial, Clinical, Operational, Call Center
· Platform components: Enterprise Warehouse, SAS Analytics Platform, Other Analytic Tools, SAS Client Tools
· User groups: Accounting, Regulatory, Actuarial, Data Science, Marketing, Planning, Leadership


What changed

· Unified data lake in Hadoop
· Co-location of data from multiple sources
· Hadoop cluster for storage and processing
· New users with diverse client tools


With Hadoop

[Diagram: Analytics Platform with Hadoop]
· Data sources: Claims, Revenue, Membership, Financial, Clinical, Operational, Call Center
· Platform components: Enterprise Warehouse, Hadoop Cluster, SAS Analytics Platform, Other Analytic Tools, SAS Client Tools
· Other labels: SAS Accelerators, SAS Access, SAS In-memory, Publish, Direct access to Hadoop
· User groups: Accounting, Regulatory, Actuarial, Data Science, Marketing, Planning, Leadership


Benefits of new architecture

· Minimize SAN storage utilization
· Replace legacy ETL process by leveraging Hadoop
· Co-location of analytics data with the enterprise data lake
· Better access from non-SAS clients
· Efficient scheduling process


Old Process

· Multiple data sources
· Data loaded to staging by legacy ETL
· Triggers custom SAS jobs
· Staging tables to SAS warehouse
· Reconciles to source systems
· Updates analytics and reporting datasets

[Diagram: Data sources, Legacy ETL, Staging tables, Custom SAS ETL, SAS data (Raw, Enrich, Reconcile, Analytics)]


New Process

· HDFS for data landing and archive
· Hive staging tables
· Job processing with SAS DI Studio
· Enriched and reconciled data in SAS and Hive
· Current data (recent 3 years) in SAS
· History data in Hive

[Diagram: Data sources, Hive staging tables, SAS DI Studio, SAS data (Raw, Enrich, Reconcile, Analytics), Hive tables (Raw, Enrich, Reconcile, Analytics)]


Advantages of new process

· SAS DI Studio
· UI based development
· Hadoop containers for HDFS, Pig and Hive
· Automatic status handling
· Better readability and maintainability of code
· Hadoop cluster for processing and storage


SAS DI Studio and Hadoop

· Containers for Hadoop jobs
· Pig container with PROC HADOOP
· Hive container with PROC SQL
· Hadoop file reader/writer for direct access to HDFS files
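
The code that DI Studio generates for these containers is not shown on the slides. As a rough hand-written sketch of the two procedure calls named above, a Pig script can be submitted with PROC HADOOP and a HiveQL statement sent with explicit pass-through in PROC SQL. The file paths, Hive server, schema, table names, and the hdp_user / hdp_pass macro variables are all hypothetical, and the PROC HADOOP option that names the configuration file has changed across releases, so check the documentation for the release in use.

```sas
/* Hypothetical paths and connection details; replace with site values. */
filename hdpcfg  '/sas/config/hadoop/cluster-config.xml';  /* Hadoop client config */
filename pigcode '/sas/jobs/pig/file_dq.pig';              /* Pig Latin script     */

/* Roughly what a Pig container does: submit a Pig script to the cluster. */
proc hadoop cfg=hdpcfg username="&hdp_user" password="&hdp_pass" verbose;
   pig code=pigcode;
run;

/* Roughly what a Hive container does: send HiveQL via explicit pass-through. */
/* Assumes SAS/ACCESS Interface to Hadoop is licensed and configured.         */
proc sql;
   connect to hadoop (server="hive.example.com" port=10000
                      schema=staging user="&hdp_user" password="&hdp_pass");
   execute (
      insert into table claims_hist select * from claims_stage
   ) by hadoop;
   disconnect from hadoop;
quit;
```

Explicit pass-through keeps the work on the cluster: the text inside EXECUTE is handed to Hive as written rather than being translated by SAS.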


SAS DI Studio job

1: INITIALIZE. Create new run, copy file, file DQ. Pig Container.
2: SAS STAGE. Load file to SAS staging. Data step with LIBNAME to HDFS.
3: SAS LOAD. Stage to SAS data. Data Append.
4: HIVE STAGE. Load files to Hive staging. Hive Containers.
5: HIVE LOAD. Stage to Hive data. Hive Container.


Hadoop Containers

· Pig Container
    · HDFS file moves
    · Quality checks
    · Data filtering
· Hive Container
    · Hive table load, read, updates
· Hadoop file reader/writer
    · HDFS file read/write from/to SAS data


Status Handling

· Conditional process flow
· All status handling at job level
· Functionality based sub-jobs
· Return codes and error messages from containers

* For SAS 9.4M1, the Pig container needs a workaround for status handling.
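
DI Studio provides its own transformations for conditional flow, and the slides do not show how the job was wired. Purely as an illustration of the idea of gating each sub-job on a job-level return code, the following hand-written macro sketch runs a step only when nothing before it has failed. The sub-job macros and the run_if_ok helper are hypothetical; job_rc is the job-level return code variable named elsewhere in the deck.

```sas
/* job_rc mirrors the job-level return code macro variable used in the deck. */
%global job_rc;
%let job_rc = 0;

/* Hypothetical sub-jobs; a real one would set job_rc itself on failure. */
%macro sas_stage;  %put NOTE: running SAS stage;  %mend sas_stage;
%macro hive_stage; %put NOTE: running Hive stage; %mend hive_stage;

/* Run a sub-job only when everything before it succeeded. */
%macro run_if_ok(step);
   %if &job_rc = 0 %then %do;
      %&step
   %end;
   %else %put WARNING: skipping &step because job_rc=&job_rc..;
%mend run_if_ok;

%run_if_ok(sas_stage)
%run_if_ok(hive_stage)
```

Keeping the check in one place matches the "all status handling at job level" point: sub-jobs only report a return code, and the job decides what runs next.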


Compatibility

· SAS and Hadoop compatibility
    · SAS 9.4M1 version
    · Hive 0.13: table formats to avoid space issues and ensure proper data conversion
· Error handling of Pig container
    · Hadoop error code not populated. Used error text to set RC

    %if "&SYSERRORTEXT" ne "" %then %do;
        %let trans_rc = 9999;
        %let job_rc = 9999;
    %end;
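
SAS 9.4M1 does not allow %IF in open code, so the check above has to run inside a macro. A minimal sketch of one way to package it follows; the macro name check_pig_status is hypothetical, and %superq is a swapped-in alternative to the quoted comparison so that quotes or macro triggers inside the error text do not break the test.

```sas
/* Make sure the return-code variables exist globally so the %LET
   statements inside the macro update them rather than creating locals. */
%global trans_rc job_rc;

%macro check_pig_status;
   /* SYSERRORTEXT holds the text of the last error written to the log. */
   %if %length(%superq(syserrortext)) > 0 %then %do;
      %let trans_rc = 9999;   /* transformation-level return code */
      %let job_rc   = 9999;   /* job-level return code            */
   %end;
%mend check_pig_status;

/* Call right after the Pig step. */
%check_pig_status
```

One caveat with this approach: SYSERRORTEXT keeps the last error from anywhere earlier in the session, so an unrelated earlier error would also set the return codes.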


and more...

· Optimized the job stream
    · Two job streams
    · First updated SAS datasets, with minimal dependency on the cluster
    · Second updated the Hive tables
· The loop functionality provided by SAS was used effectively
· Hive with ORC SerDe and Snappy compression
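
The table definitions used in the project are not given. As an illustrative sketch only, a Hive table stored as ORC with Snappy compression could be created from SAS through explicit pass-through as follows. The server, schema, table, columns, and the hdp_user / hdp_pass macro variables are hypothetical.

```sas
/* Assumes SAS/ACCESS Interface to Hadoop; all names below are placeholders. */
proc sql;
   connect to hadoop (server="hive.example.com" port=10000 schema=analytics
                      user="&hdp_user" password="&hdp_pass");
   execute (
      create table if not exists claims_hist (
         claim_id    string,
         member_id   string,
         paid_amount decimal(18,2),
         service_dt  date
      )
      stored as orc
      tblproperties ('orc.compress'='SNAPPY')
   ) by hadoop;
   disconnect from hadoop;
quit;
```

ORC is a columnar format, and the orc.compress table property selects the codec applied to its data; Snappy trades some compression ratio for faster reads and writes.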


Questions?
