

Page 1: Luncheon Webinar Series December 18th, 2015 - DataStagedsxchange.net/uploads/12182015_DS_on_Hadoop.pdf · Luncheon Webinar Series December 18th, 2015 ... Container size estimation

Luncheon Webinar Series
December 18th, 2015

How to get started with DataStage (aka IBM InfoSphere Information Server) running natively on Hadoop

presented by Beate Porst


How to get started with DataStage (aka IBM InfoSphere Information Server) running natively on Hadoop

Questions and suggestions regarding presentation topics? - send to

[email protected]

Downloading the presentation

• http://www.dsxchange.net/20151218dsx.html

• Replay will be available within one day with email with details

Pricing and configuration - send to [email protected] Subject line : Pricing

For those who stay through the entire presentation, we have an extra giveaway!

Bonus Offer – Free premium membership for your DataStage management! Submit your manager’s email address and we will offer them access on your behalf.

• Email [email protected] subject line “Managers special”.

• Join us all at Linkedin http://tinyurl.com/DSXmembers


How to get started with DataStage v11.5 running natively on Hadoop

December, 2015

Beate Porst ([email protected])
Product Manager, IBM InfoSphere Information Server

© 2015 IBM Corporation


Agenda

• Quick Introduction into InfoSphere Information Server v11.5

• Architecture and System topologies for Information Server on Hadoop

• Installation & Setup

• Performance Observations

• Q&A


Information Empowerment for your Data Ecosystem

.. powered by Information Server: Integrating and transforming data and content to deliver accurate, consistent, timely and complete information on a single platform unified by a common metadata layer

Information Governance Catalog: Understand & Collaborate
• Catalog technical metadata & align w/ business language
• Manage (big) data lineage
• New compliance reporting

Data Quality: Cleanse & Monitor
• Analyze & validate w/ enhanced classification
• Cleanse & standardize
• Define, manage & monitor data rules + exceptions

Data Integration: Transform & Deliver
• Massive scalability
• Power for any complexity
• Deliver in batch and/or real-time with change capture

• common connectivity
• shared metadata
• security (new data privacy functions included)
• common execution engine with flexible deployments (new native MPP runtime on Hadoop)


Information Server Release History

[Timeline figure, 2006 to 2015. Releases shown: 8.0, 8.1, 8.5, 8.7, 9.1, 9.1.2, 11.3, 11.5. Callouts: "GA: 9/25: New", "EOS: 9/2016", "EOS: 4/2017".]


Information Server Recent Activity

[Timeline, 2012 to 2015: 9.1, 9.1.2, 11.3, 11.3.1, 11.3.1+FP1]

9.1

Business Driven Governance
- Policy and rules support for information governance
- Web-based blueprints
- Integrated metadata mgmt enhancements

Sustainable Quality
- Data Quality Console
- Standardization Rules Designer
- Data Rules Advancements

Agile Integration
- InfoSphere Data Click
- Enhanced Workload Mgmt
- ODM Integration
- Hadoop Balanced Optimization
- HDFS Extensions

9.1.2

Business Driven Governance
- IDA 8.5
- Additional Workflow Roles
- Data Rules Metadata
- Bulk metadata import

Sustainable Quality
- Profiling Big Data
- Exception Stage
- New QS standardization rulesets

Agile Integration
- Big Data Features
  * JSON support
  * JDBC connector
- DB2 on z/OS load optimization
- Data Click new data sources/targets

11.3

Business Driven Governance
- Info Governance Catalog
- Shop for Data
- Smart Hover
- Collect & Share
- Lineage@Scale

Sustainable Quality
- Governance Dashboard integration
- Performance Optimizations
- Productivity Enhancements
- Global Geocoding

Agile Integration
- Self-service Data integration
- Cloud Connectors
- MDM Integration
- Sort compress
- Hadoop currency
- Greenplum Connector

11.3.1 / 11.3.1+FP1

Business Driven Governance
- Subscription Manager
- Stewardship Center (w/BPM)
- Term Custom Attributes
- Customizable attribute display
- Lineage Admin Console
- Prebuilt Governance Content
- IGC Data Classification

Sustainable Quality
- Data Quality Exception Management Updates
- Exception SQL Views
- Stewardship Center Data Remediation Workflow
- Data Classification
- Global Geocoding support

Agile Integration
- Cognos TM1 Connector and Metadata Import
- HDFS Secure Connector
- IDAA pushdown support
- Hypervisor support for v11.3.1
- BigInsights v4 support


Summary Information Server v11.5

[Timeline, 2015 to 2016: 11.5, FP1, FP2]

Platform Extensions
- Native execution on Hadoop
- In-place upgrade v11.3.1 → v11.5

Business Driven Governance
- Governance Catalog Extensible Framework
- Column-level lineage for Hadoop files
- Multi-language support
- XML Schema Definition support
- Data class definitions
- Asset interchange for extended lineage content

Sustainable Quality
- Enhanced Data Classification
- Address Verification and Enrichment Advancements

Agile Integration
- Data Integration running natively on Hadoop
- Automatic HDFS metadata import
- Comprehensive and fast HDFS Connectivity
- Out of the Box Database Pushdown
- Out of the Box ERP Pack support
- Embedded sensitive data protection


V11.5 Detailed Capability Comparison

[Comparison table: capabilities (rows) by offering (columns). Only the footnote markers of the cells are recoverable. Legend: New offering.]

Offerings (columns):
• InfoSphere Information Governance Catalog
• InfoSphere Information Server for Data Integration
• InfoSphere Information Server for Data Quality
• InfoSphere Information Server Enterprise Edition
• BigInsights BigIntegrate
• BigInsights BigQuality

Capabilities (rows):
• Business Glossary (note 1)
• Metadata Management and Lineage (note 1)
• Logical and Physical Data Modeling
• Data Cleansing and Enrichment
• Data Quality Validation & Monitoring
• Data Stewardship
• SOA Deployment
• Data Specification Mapping
• Extract, transform, load (ETL)
• Change Data Delivery (note 2)
• Self Serve Data Access
• Data Masking (note 5)
• View reports in Cognos (note 3)
• IBM BigInsights included (see notes) (note 4)
• Runs natively in Hadoop

Notes:
1 Limited to 250 assets (any combination of glossary terms, categories, information governance policies and information governance rules)
2 One database Source or Capture Agent excluding z/OS and must be used with DataStage as target
3 View only access for any pre-defined report provided for Information Server
4 Maximum of 5-node cluster of IBM BigInsights Data Scientist v4.1 install in support of Information Server
5 Requires additional entitlement for Optim ODPP

Separate add-on purchases: data replication, ERP connectors (SAP, SAS), Postal address verification / geo-coding


Key Use Cases for Data Integration on Hadoop

[Diagram labels: HDFS, MDM, warehouse]

• Data Reservoir & Logical Warehouse: Modernize warehouse architecture through the Data Reservoir, improving efficiency (TCO) and extending analytics
• Enhanced 360º view: Enhance insight of key business entities (e.g. customer) by integrating and correlating new data sources and building an integrated view
• Warehouse Offloading: Improve efficiency of existing warehouse investments by offloading “dark data” or augmenting it with sandboxes
• Exploratory Analysis: Discover & explore new insights more rapidly and in a more agile & iterative manner

For each use case: Integrate | Transform | Cleanse | Govern


Information Server – BigIntegrate
Ingest, transform, process and deliver any data into & within Hadoop

Satisfy the most complex transformation requirements with the most scalable runtime available in batch or real-time

Connect
• Connect to wide range of traditional enterprise data sources as well as Hadoop data sources
• Native connectors with highest level of performance and scalability for key data sources

Design & Transform
• Transform and aggregate any data volume
• Benefit from hundreds of built-in transformation functions
• Leverage metadata-driven productivity and enable collaboration

Manage & Monitor
• Use a simple, web-based dashboard to manage your runtime environment


Information Server – BigQuality
Analyze, cleanse and monitor your big data

Most comprehensive data quality capabilities that run natively on Hadoop

Analyze
• Discovers data of interest to the org based on business defined data classes
• Analyzes data structure, content and quality
• Automates your data analysis process

Cleanse
• Investigate, standardize, match and survive data at scale and with the full power of common data integration processes

Monitor
• Assess and monitor the quality of your data in any place and across systems
• Align quality indicators to business policies
• Engage data steward team when issues exceed thresholds of the business


Information Server on Hadoop Offering

• The most scalable Transformation and Data Integration and Quality engine now runs natively on Hadoop

• Runs 10x-20x faster than MapReduce

• Get enterprise-class transformation and cleansing for your Hadoop data

• Use the power of your Hadoop cluster to integrate, transform & cleanse data without writing a single line of code

• Hadoop distribution currency:

– BigInsights 4.0 & 4.1

– HortonWorks 2.2 & 2.3

– Cloudera 5.3 & 5.4


Native Hadoop Runtime

Optimize your Integration/Transformation and Data Quality workload based on data locality and resource availability

Design your integration, data preparation or cleansing once and run it on your Hadoop Cluster, on your traditional engine or optimize to run on your database


Information Server on Hadoop Features

• Full support for Information Analyzer, QualityStage, DataStage and DataClick jobs

• Support for Kerberos enabled cluster

• Full Edge/Client node support for Engine Tier install

• Automatic binary distribution (if not detected) to data nodes or NFS mount

• Data locality support for HDFS file reads (e.g. BDFS, DataSet etc.)

• Container size estimation

• Visibility in DS Job log (Hadoop tracking URL) & YARN Job browser

• Support for Hadoop Node Labels

• Support for YARN scheduler queues

• Support for ODP distributions (BigInsights, HortonWorks, Pivotal etc.) and Cloudera


RUNTIME ARCHITECTURE & DEPLOYMENT OPTIONS


System Topology

[Diagram: Hadoop Cluster of DataNodes, each with /opt/IBM/InformationServer; Hadoop Edge Node with the IS Engine Tier under /opt/IBM/InformationServer; IS Service Tier, IS Metadata Repository Tier, IS Client Tier]

IS Engine Tier installed on Hadoop Edge Node

All other IS Tiers can be on the Edge Node or outside the cluster

Information Server binaries live on all DataNodes that will run DataStage jobs

Information Server binaries are copied to DataNodes at job run time using HDFS if binaries don’t already exist


Grid Deployments on and off Hadoop

[Figure: stand-alone Information Server Grid alongside Information Server Grid on Hadoop]


Deployment Models
Information Server on Hadoop:

[Figure: Typical Hadoop Environment]

3 different deployment models for Information Server within a typical Hadoop Environment


One Information Server Instance – Multiple Engines
On and off Hadoop

[Diagram: Services & Repository; PX Engine “Stand-alone” with DS Project A; PX Engine “On Hadoop” with DS Project B]

Requirement:
• needs to be v11.5 (no version mix between components)


DataStage Job Runtime Architecture on Hadoop

[Diagram: Hadoop Cluster of DataNodes, each with /opt/IBM/InformationServer, running YARN Containers that hold a Section Leader with Player 1, Player 2 … Player N; IS Application Master; Hadoop Edge Node with IS Engine Tier, Conductor and IS YARN Client; IS Service Tier, IS Metadata Repository Tier, IS Client Tier. Numbered arrows 1 to 6 match the steps below.]

Jobs are submitted from an IS Client (1)

Conductor asks IS YARN Client for an Application Master (AM) to run the job (2)

IS YARN Client manages IS AM pool, starts new ones when necessary (3)

Conductor passes IS AM resource requirements and commands to start Section Leaders (4)

IS AM gets containers from YARN Resource Manager (not pictured)

YARN Node Managers (NM) on DataNodes start YARN containers with Section Leaders (5)

Section Leaders connect back to Conductor and start players (6)


INSTALLATION & SETUP


Installation – Edge Node Provisioning

[Diagram: Hadoop Cluster of DataNodes with a Hadoop Edge Node]

Provisioned through Ambari (pictured), Cloudera Manager, or manually.

Required clients to install are HDFS and YARN

Validate by running yarn and hdfs commands
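A minimal sketch of the validation step, assuming the `hdfs` and `yarn` client commands are expected on the edge node's PATH. The function names are hypothetical helpers for illustration; `hdfs version` and `yarn version` are the only Hadoop commands used.

```python
import shutil
import subprocess

# Client commands the edge node needs, per the slide (HDFS and YARN).
REQUIRED_CLIENTS = ("hdfs", "yarn")


def missing_clients(which=shutil.which):
    """Return the required client commands that cannot be found on PATH."""
    return [cmd for cmd in REQUIRED_CLIENTS if which(cmd) is None]


def client_banners():
    """Run `<cmd> version` for each installed client; return the first output line."""
    banners = {}
    for cmd in REQUIRED_CLIENTS:
        if shutil.which(cmd) is None:
            continue  # skip clients that are not installed
        result = subprocess.run([cmd, "version"], capture_output=True,
                                text=True, check=False)
        lines = result.stdout.splitlines()
        banners[cmd] = lines[0] if lines else ""
    return banners
```

On a correctly provisioned edge node, `missing_clients()` should return an empty list and `client_banners()` should show a version line for both commands.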


Installation – Information Server on Hadoop

[Diagram: Hadoop Cluster of DataNodes; Hadoop Edge Node with the IS Engine Tier under /opt/IBM/InformationServer; IS Service Tier, IS Metadata Repository Tier, IS Client Tier]

Information Server Tiers are installed in the typical fashion through the IBM Information Server install.


Validate Engine Tier Install

Make sure a simple job with Transform can compile and run locally

Run with default config file on local node

Don’t run on Hadoop yet!

[Screenshot callout: APT_YARN_CONFIG]


Creating local Information Server Binary Paths

[Diagram: Hadoop Cluster of DataNodes, each with /opt/IBM/InformationServer; Hadoop Edge Node with the IS Engine Tier under /opt/IBM/InformationServer; IS Service Tier, IS Metadata Repository Tier, IS Client Tier]

Currently a manual step since jobs don’t run as root

Be careful to create with correct permissions

Cluster settings affect who the owner should be


Setting up Users on Hadoop

• Gather the User & Group names that will run Jobs

• Create HDFS permissions for those users

– sudo -u hdfs hadoop fs -mkdir /user/InfoSphere_Information_Server_user_name

– sudo -u hdfs hadoop fs -chown InfoSphere_Information_Server_user_name:InfoSphere_Information_Server_user_group /user/InfoSphere_Information_Server_user_name

– E.g., to create a user folder for the user dsadm, issue:

• sudo -u hdfs hadoop fs -mkdir /user/dsadm

• sudo -u hdfs hadoop fs -chown dsadm:dstage /user/dsadm

• Additional settings might be required if not running on an Edge node
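The two commands above can be generated per user. This is a small sketch that only builds the command lines shown on the slide; `hdfs_home_commands` and `render` are hypothetical helper names, and nothing is executed.

```python
import shlex


def hdfs_home_commands(user, group):
    """Build the mkdir/chown commands for one Information Server user."""
    home = f"/user/{user}"
    return [
        ["sudo", "-u", "hdfs", "hadoop", "fs", "-mkdir", home],
        ["sudo", "-u", "hdfs", "hadoop", "fs", "-chown", f"{user}:{group}", home],
    ]


def render(commands):
    """Render argument lists as shell command strings for review."""
    return [shlex.join(cmd) for cmd in commands]


for line in render(hdfs_home_commands("dsadm", "dstage")):
    print(line)
```

Keeping the commands as argument lists makes it straightforward to pass them to `subprocess.run` later, once the rendered strings have been reviewed.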


Starting the Information Server YARN Client

[Diagram: Hadoop Cluster of DataNodes, each with /opt/IBM/InformationServer; IS YARN Client and two IS Application Masters; Hadoop Edge Node with the IS Engine Tier under /opt/IBM/InformationServer; IS Service Tier, IS Metadata Repository Tier, IS Client Tier]

Can be started manually using PXEngine/etc/yarn_conf/start-pxyarn.sh

Will be started automatically with first job run on Hadoop

Will start 2 ApplicationMasters by default

Tuneable with APT_YARN_AM_POOL_SIZE

Troubleshoot with PXEngine/logs/yarn_logs/yarn_client_out.0


Create Static Configuration File with All Cluster Nodes

• This will localize binaries on all nodes with first job run

node "conductor_node"
{
    fastname "myconductor.mycompany.com"
    pools "conductor" "export"
    resource disk "/data" {pool "" "export" "conductor_node"}
    resource scratchdisk "/scratch" {}
}
node "node0"
{
    fastname "compute1.mycompany.com"
    pools ""
    resource disk "/data" {pool "" "export" "node0"}
    resource scratchdisk "/scratch" {}
}
node "node1"
{
    fastname "compute2.mycompany.com"
    pools ""
    resource disk "/data" {pool "" "export" "node1"}
    resource scratchdisk "/scratch" {}
}
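For clusters with many nodes, the node entries can be generated rather than typed. The sketch below only reproduces the node entries exactly as they appear on the slide (same keywords, `/data` and `/scratch` paths); `node_block` and `static_config` are hypothetical helper names, and any enclosing syntax a complete configuration file may need is not covered.

```python
def node_block(name, fastname, pools, data_dir="/data", scratch_dir="/scratch"):
    """Render one node entry in the layout used on the slide."""
    pool_list = " ".join(f'"{p}"' for p in pools)
    return (
        f'node "{name}"\n'
        "{\n"
        f'    fastname "{fastname}"\n'
        f"    pools {pool_list}\n"
        f'    resource disk "{data_dir}" {{pool "" "export" "{name}"}}\n'
        f'    resource scratchdisk "{scratch_dir}" {{}}\n'
        "}\n"
    )


def static_config(conductor_host, compute_hosts):
    """Render the conductor entry followed by one entry per compute host."""
    blocks = [node_block("conductor_node", conductor_host, ["conductor", "export"])]
    for index, host in enumerate(compute_hosts):
        blocks.append(node_block(f"node{index}", host, [""]))
    return "".join(blocks)


print(static_config("myconductor.mycompany.com",
                    ["compute1.mycompany.com", "compute2.mycompany.com"]))
```

Listing every cluster node here matters because, per the slide, the first job run with this file localizes the binaries on all listed nodes.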


Validate Running on Hadoop

Make sure a simple job with Transform can run on Hadoop

Run with static config file on all nodes

APT_YARN_CONFIG = /opt/IBM/InformationServer/Server/PXEngine/etc/yarn_conf/yarnconfig.cfg

In yarnconfig.cfg: APT_YARN_MODE=true


How Binary Localization Works

[Diagram: Hadoop Cluster of DataNodes, each with /opt/IBM/InformationServer; IS YARN Client and two IS Application Masters; Hadoop Edge Node with the IS Engine Tier under /opt/IBM/InformationServer; IS Service Tier, IS Metadata Repository Tier, IS Client Tier]

Cached in HDFS by IS YARN Client on startup

Localized by jobs from HDFS cache if they don’t exist at job run time

Requires ~4GB of space in /tmp

Tuneable with APT_YARN_BINARY_COPY_MODE


Dynamic Configuration Files

• Dynamic configuration files take advantage of resource management and HDFS for DataSets
– Predefined dynamic config file: /opt/IBM/InformationServer/Server/dynamic_config

node "conductor_node"
{
    fastname "myconductor.mycompany.com"
    pools "conductor" "export"
    resource disk "/data" {pool "" "export" "conductor_node"}
    resource scratchdisk "/scratch" {}
}
node "node0"
{
    fastname "$host"
    pools ""
    resource disk "/data" {pool "" "export" "node0"}
    resource scratchdisk "/scratch" {}
}
node "node1"
{
    fastname "$host"
    pools ""
    resource disk "/data" {pool "" "export" "node1"}
    resource scratchdisk "/scratch" {}
}

[Callouts: HDFS, Local Disk]


The Information Server Yarn Config File

yarnconfig.cfg, located in: /opt/IBM/InformationServer/Server/PXEngine/etc/yarn_conf/yarnconfig.cfg

APT_YARN_MODE=true
If defined and set to 1 or true, runs the given PX job on the local Hadoop install in YARN mode.

APT_YARN_CONTAINER_SIZE=64
Defines the size in MBs of the containers that will be requested to run PX Section Leader and Player processes in. The default is 64MB if not set.

APT_YARN_CONTAINER_VCORES=0
Defines the number of virtual cores that the containers will request to run PX Section Leader and Player processes in. The default is 0, which means "Don't set it".

APT_YARN_AM_CONTAINER_SIZE=256
Defines the size in MBs of the container that will be requested to run the PX Application Master process. The default is 256MB if not set.

APT_YARN_AM_POOL_SIZE=2
The number of pre-started Application Masters; default is 2.

APT_YARN_NODE_LABEL_EXPR=
Defines the node label that Information Server jobs should use when being submitted to the YARN scheduler.

APT_YARN_SCHEDULER_QUEUE=
Defines the default queue that Information Server jobs should use when being submitted to the YARN scheduler. The default is empty, which will use the default scheduler queue.
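To check which values are in effect, the settings can be read back with the documented defaults applied. This is a sketch only: it assumes yarnconfig.cfg consists of plain `KEY=VALUE` lines and that lines starting with `#` are comments (both are assumptions); `parse_yarnconfig` and `yarn_mode_enabled` are hypothetical helper names.

```python
# Defaults as stated on the slide; values are kept as strings, as in the file.
DEFAULTS = {
    "APT_YARN_CONTAINER_SIZE": "64",
    "APT_YARN_CONTAINER_VCORES": "0",
    "APT_YARN_AM_CONTAINER_SIZE": "256",
    "APT_YARN_AM_POOL_SIZE": "2",
    "APT_YARN_NODE_LABEL_EXPR": "",
    "APT_YARN_SCHEDULER_QUEUE": "",
}


def parse_yarnconfig(text):
    """Parse KEY=VALUE lines, filling in the documented defaults."""
    settings = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and '#' comment lines carry no settings.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def yarn_mode_enabled(settings):
    """YARN mode is on when APT_YARN_MODE is set to 1 or true."""
    return settings.get("APT_YARN_MODE", "").lower() in ("1", "true")
```

For example, `parse_yarnconfig(open(path).read())` on the file location given above would return the effective settings as a dictionary.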


DataStage Job Run Time Logs

[Screenshot of a job log. Callouts: YARN Client Connection; Hadoop tracking URL; Application Master Connection; YARN Container Allocation; Job Processes Running]


DataStage Job Runtime Hadoop Console

[Screenshot of the Hadoop console. Callouts: DataStage Application Master Information; Application Run Time; Container Allocated Resources]


Using Hadoop Node Labels

[Diagram: Hadoop Cluster of DataNodes, some labelled IISNode (with /opt/IBM/InformationServer) and some labelled GPUNode; Hadoop Edge Node with the IS Engine Tier under /opt/IBM/InformationServer; IS Service Tier, IS Metadata Repository Tier, IS Client Tier]

Separate application workloads

Supported by Apache Hadoop 2.6, HDP 2.2, CDH 5.4, IOP 4.0

IIS node label can be controlled by Hadoop scheduler queue or passed with jobs

Unlabelled nodes available to any application dependent on queue configuration

Not supported for Fair Scheduler yet (YARN-2497)

Apache Hadoop 2.8 allows borrowing nodes to increase cluster utilization


HDFS Data Replication

[Diagram: Hadoop Cluster of DataNodes, some labelled IISNode and some labelled GPUNode, showing where the blocks of partition files P1 and P2 and their replicas are placed; Hadoop Edge Node with the IS Engine Tier under /opt/IBM/InformationServer; IS Service Tier, IS Metadata Repository Tier, IS Client Tier]

IIS Job writes two partition data files P1 and P2

One block will always reside local to the writing node

Other blocks replicated based on HDFS rack awareness algorithm

Number of replicas depends on HDFS configuration, Default=3

IIS Job that reads P1 and P2 requests to run local to the blocks

Job will read block from another node if locality isn’t possible


HADOOP / YARN Environment Settings

Parameter: yarn.log-aggregation-enable
Description: Manages YARN log files. Set this parameter to false if you want the log files stored in the local file system.
Default value: true
Recommended value: false

Parameter: yarn.nodemanager.log.retain-seconds
Description: Specifies the duration in seconds that Hadoop retains container logs.
Default value: 10800

Parameter: yarn.nodemanager.pmem-check-enabled
Description: Determines if physical memory limits exist for containers. If set to true, the job is stopped if a container uses more than the physical memory limit that you specify. Set this parameter to false if you do not want jobs to fail when the containers consume more memory than they are allocated.
Default value: true

Parameter: yarn.nodemanager.resource.memory-mb
Description: Sets the amount of physical memory that can be allocated for containers.
Default value: 8192 MB

Parameter: yarn.nodemanager.vmem-check-enabled
Description: Determines if virtual memory limits exist for containers. If this parameter is set to true, the job is stopped if a container is using more than the virtual limit that you specify. Set this parameter to false if you do not want jobs to fail when the containers consume more memory than they are allocated.
Default value: true

Parameter: yarn.nodemanager.vmem-pmem-ratio
Description: Sets the ratio of virtual memory to physical memory limits for containers. If yarn.nodemanager.vmem-check-enabled is set to true, jobs might be stopped by YARN if the ratio of the virtual memory that a container consumes compared to the physical memory is greater than the ratio that you specify.
Default value: 2.1

Parameter: yarn.resourcemanager.nodemanagers.heartbeat-interval-ms
Description: Controls the start time for parallel jobs. For clusters that have fewer than 50 nodes, 1000 ms is often too long and leads to a longer start time for parallel jobs. You can set this value to 50 milliseconds to ensure parallel jobs start in a timely manner.
Default value: 1000 ms
Recommended value: 50 milliseconds
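The memory checks above come down to simple arithmetic. The sketch below is a simplified model of the checks as the table describes them, not YARN's implementation; the function names are hypothetical, and the container size is taken to be the physical memory limit.

```python
def vmem_limit_mb(container_mb, vmem_pmem_ratio=2.1):
    """Virtual memory limit implied by the container size and the ratio."""
    return container_mb * vmem_pmem_ratio


def exceeds_limits(pmem_used_mb, vmem_used_mb, container_mb,
                   vmem_pmem_ratio=2.1, pmem_check=True, vmem_check=True):
    """Return True if an enabled check would flag the container."""
    over_pmem = pmem_check and pmem_used_mb > container_mb
    over_vmem = vmem_check and vmem_used_mb > vmem_limit_mb(container_mb,
                                                            vmem_pmem_ratio)
    return bool(over_pmem or over_vmem)
```

With the default ratio of 2.1, a 64 MB container has a virtual memory limit of about 134.4 MB, which is why small containers can be stopped by the virtual memory check even when physical usage is within bounds.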


Parameter: yarn.scheduler.capacity.maximum-am-resource-percent
Description: Specifies the maximum percentage of resources for all queues in the cluster that can be used to run application masters, and controls the number of concurrent active applications.
Default value: Defaults vary between distributions of Hadoop.

Parameter: yarn.scheduler.capacity.queue-path.maximum-am-resource-percent
Description: Specifies the maximum percentage of resources for a single queue in the cluster that can be used to run application masters, and controls the number of concurrent active applications.
Default value: Defaults vary between distributions of Hadoop.

Parameter: yarn.scheduler.increment-allocation-mb
Description: This value indicates how much the container size can be incremented. If you submit tasks with resource requests lower than the minimum-allocation value, the requests are set to the minimum-allocation value.
Default value: 512 MB on Cloudera

Parameter: yarn.scheduler.minimum-allocation-mb
Description: This parameter helps conserve resources on the cluster by setting the minimum amount of memory that can be requested for a container. The default container size for parallel processes is 64 MB.
Note: If changing the yarn.scheduler.minimum-allocation-mb value with Ambari-2.1, you must specify whether the changes should be applied to the MapReduce specific resource settings. If you are significantly reducing the value of yarn.scheduler.minimum-allocation-mb, do not change the MapReduce values based on the new value, because it could cause MapReduce jobs to fail.
Default value: 1024 MB for most Hadoop distributions
Recommended value: 256 MB or l
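A worked example of why the minimum allocation matters for 64 MB parallel processes. This is a simplified model: raising a request to the minimum follows the description above, while rounding up to a multiple of the step size is an assumption about scheduler behaviour; the function names are hypothetical.

```python
import math


def allocated_container_mb(requested_mb, minimum_allocation_mb=1024,
                           increment_mb=None):
    """Estimate the container size YARN grants for a request."""
    # Requests below the minimum are raised to the minimum (per the table).
    size = max(requested_mb, minimum_allocation_mb)
    # Assumption: sizes are rounded up to a multiple of the increment,
    # or of the minimum allocation when no increment is given.
    step = increment_mb if increment_mb else minimum_allocation_mb
    return math.ceil(size / step) * step


def containers_per_node(node_memory_mb, requested_mb, minimum_allocation_mb=1024):
    """How many such containers fit in the memory a node offers to YARN."""
    granted = allocated_container_mb(requested_mb, minimum_allocation_mb)
    return node_memory_mb // granted
```

Under this model, a node offering 8192 MB fits 8 containers for 64 MB requests when the minimum allocation is 1024 MB, and 32 when it is lowered to 256 MB.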


PERFORMANCE OBSERVATIONS


Performance ObservationsRunning Information Server jobs natively on Hadoop / Yarn

• Running Information Server jobs natively under YARN scales out linearly!

– Throughput doubles if number of Hadoop data double

• YARN introduces some overhead for Job startup time

– Job startup time is slightly slower then a non-YARN start up

• Storing data on HDFS is up to 13% slower then native OS storage

• Observations when running a realistic DataStage workload on a YARN-managed Hadoop cluster:

– Using static configuration files:

• Performance running on and off Hadoop is similar (for similar resources)

• This is mostly because DataStage-specific files do not need to be stored on HDFS, as jobs run on statically defined nodes

– Using dynamic configuration files:

• We observed a performance penalty on Hadoop of up to 13% due to the HDFS usage

• Storing data on HDFS is significantly slower than native OS storage due to factors such as the replication factor
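
The observations above can be turned into a rough back-of-the-envelope estimate. The sketch below is only an illustration under stated assumptions: it reads "up to 13%" as a worst-case increase in elapsed time, applies it only to dynamic configuration files, and uses a hypothetical function name and a hypothetical 60-minute baseline that do not come from the presentation.

```python
HDFS_PENALTY_MAX = 0.13  # "up to 13%" from the observations (worst case)


def estimated_elapsed_minutes(baseline_minutes: float,
                              dynamic_config: bool,
                              penalty: float = HDFS_PENALTY_MAX) -> float:
    """Estimate job elapsed time on Hadoop from an off-Hadoop baseline.

    Static configuration files: performance assumed similar to the baseline.
    Dynamic configuration files: baseline inflated by the HDFS penalty.
    """
    if baseline_minutes < 0:
        raise ValueError("baseline_minutes must be non-negative")
    if not 0.0 <= penalty <= HDFS_PENALTY_MAX:
        raise ValueError("penalty must be between 0 and the observed maximum")
    if dynamic_config:
        return baseline_minutes * (1.0 + penalty)
    return baseline_minutes


baseline = 60.0  # hypothetical off-Hadoop run time in minutes
print(f"static config:  {estimated_elapsed_minutes(baseline, False):.1f} min")
print(f"dynamic config: {estimated_elapsed_minutes(baseline, True):.1f} min (worst case)")
```

Actual results depend on the workload and cluster, so measured numbers should replace the assumed penalty wherever they are available.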


Test System Topology

[Diagram: Information Server (Services, Repository, Engine); BigInsights Cluster (Master Node, Data Node 1 ... Data Node N); DB2 Server (Data Warehouse for the TPC-DI Workload)]

• Number of systems: 11

• The specs for each box are identical (IBM xSeries High Volume Racks x3630 M4)

– CPU: 32 cores (4 Sandy Bridge-EP, each with 8 cores)

– Memory: 64 GB

– Disk: 14 x 1 TB

– Network: interconnected with 10GbE


Scale Out Test

• DataStage throughput doubled when doubling the number of Hadoop data nodes.
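
One way to check a scale-out claim like this against your own measurements is to compute a scaling efficiency. The following is a minimal sketch: the function name scaling_efficiency and the sample numbers are hypothetical, not figures from the test described here.

```python
def scaling_efficiency(base_nodes: int, base_throughput: float,
                       scaled_nodes: int, scaled_throughput: float) -> float:
    """Return observed speedup divided by ideal (linear) speedup.

    1.0 means perfectly linear scale-out; values below 1.0 mean the added
    nodes delivered less than a proportional throughput gain.
    """
    if base_nodes <= 0 or scaled_nodes <= 0:
        raise ValueError("node counts must be positive")
    if base_throughput <= 0 or scaled_throughput <= 0:
        raise ValueError("throughput values must be positive")
    ideal_speedup = scaled_nodes / base_nodes
    observed_speedup = scaled_throughput / base_throughput
    return observed_speedup / ideal_speedup


# Hypothetical measurements (rows per second) before and after doubling the data nodes.
efficiency = scaling_efficiency(base_nodes=4, base_throughput=100.0,
                                scaled_nodes=8, scaled_throughput=200.0)
print(f"scaling efficiency: {efficiency:.2f}")
```

A result close to 1.0 corresponds to the linear behavior reported on this slide.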


TPC-DI Workload Performance in Different Modes


Q&A


Where to get more Information?

• Product Documentation: IBM Information Server Knowledge Center:

– http://www-01.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com.ibm.swg.im.iis.ishadoop.nav.doc/containers/cont_iisinfsrv_hadoop.html?lang=en

– Remember: BigIntegrate / BigQuality are only offerings – the actual product is Information Server

• Tutorial on how to set up Information Server on Hadoop on Cloudera CDH 5.4

– https://app.box.com/s/b0wonh8vv5bn8g8eaaj76cy7deui27cx

• Contact: Beate Porst ([email protected]) -- Product Manager Data Integration


Q&A

• What are IBM BigInsights BigIntegrate & IBM BigInsights BigQuality?

– These are offerings (specific bundles/licenses/prices) for your Hadoop Data Integration & Data Quality needs. These offerings are powered by InfoSphere Information Server, now running natively on Hadoop / YARN.

• Which Hadoop distributions are supported?

– ODP distributions (e.g. IBM BigInsights, Hortonworks, Pivotal) and Cloudera, running on Linux OS (x86).

• Can I connect (read/write) to data sources outside of Hadoop?

– Yes, you can connect to pretty much any data source accessible by Information Server (from mainframe to cloud).

• Where will data transformation / quality processes run?

– Processes will run on any or all of the Data Nodes in the Hadoop distribution on which the product is installed. The number of data nodes used to run a particular job depends on the partitioning level associated with the job at job startup (configuration file).

• Do I need to know how to write Java, HiveQL, Pig or any other programming language to create Data Integration or quality processes?

– No, data integration and quality processes are designed using an intuitive graphical design interface. You compose your transformation logic out of pre-built operators (think of them as LEGO bricks) that you hook together to form a final flow of data.


Q&A

• Will I be able to get Data Lineage or Impact Analysis for jobs running on Hadoop?

– Yes, Information Server on Hadoop uses Information Server’s shared metadata feature, which automatically captures design & operational metadata and derives data lineage and dependency analysis no matter where the job runs.

• Is Information Server on Hadoop using Map/Reduce?

– No, jobs are processed by the Information Server Parallel Execution Engine, which is a highly scalable MPP (cluster) engine. Each data node has a copy of the PX engine libraries, and therefore a job can run in parallel on multiple data nodes.

• Are BigIntegrate & BigQuality offerings the only option to license Information Server on Hadoop?

– No, any of the Information Server v11.5 offerings can be deployed on Hadoop.

• Is the Information Server Parallel Execution Engine (PX) faster than Spark?

– The IBM PX engine and Spark are both high-performance cluster computing MPP engines. Based on internal tests, we have seen many use cases, specifically when processing large volumes of data, where the IBM PX engine performed better than Spark.


THANK YOU
