Big Data for Dummies using DataStage, by Peter Bjelvert, InfoSphere Architect, Middlecon AB
TRANSCRIPT
Big Data for Dummies using DataStage
By Peter Bjelvert, InfoSphere Architect
Middlecon AB
ETL – Relational DB
Extract Transform in DataStage
Load
Your powerful DataStage server handles all the complex transformations; the database is used only for reading and writing.
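A minimal sketch of that ETL pattern (table and column names are hypothetical, with SQLite standing in for a real relational DB): the Python process plays the role of the DataStage server and does the transformation itself.

```python
import sqlite3

# The "DataStage server" here is the Python process: it extracts raw
# rows, transforms them in memory, and loads the result; the database
# only reads and writes.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (branch_city TEXT, balance REAL)")
con.execute("CREATE TABLE tgt (branch_city TEXT, balance REAL)")
con.executemany("INSERT INTO src VALUES (?, ?)",
                [("stockholm", 100.0), ("malmo", -5.0)])

# Extract
rows = con.execute("SELECT branch_city, balance FROM src").fetchall()
# Transform (in the ETL engine, not in the database)
clean = [(city.upper(), bal) for city, bal in rows if bal >= 0]
# Load
con.executemany("INSERT INTO tgt VALUES (?, ?)", clean)
print(con.execute("SELECT * FROM tgt").fetchall())  # [('STOCKHOLM', 100.0)]
```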
ELT – Relational DB
Extract, Load with Transform
If you have powerful database servers, you can push much of the work down to the database; DataStage then mostly controls the flow.
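The ELT variant of the same toy flow (same hypothetical tables, SQLite again standing in for a powerful database server): the controller issues a single SQL statement and the transformation runs inside the database engine.

```python
import sqlite3

# ELT: the flow controller only issues one SQL statement; the
# transformation (filter + upper-case) executes inside the database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (branch_city TEXT, balance REAL)")
con.execute("CREATE TABLE tgt (branch_city TEXT, balance REAL)")
con.executemany("INSERT INTO src VALUES (?, ?)",
                [("stockholm", 100.0), ("malmo", -5.0)])

# Extract + Load with Transform, pushed down as a single statement
con.execute("INSERT INTO tgt "
            "SELECT UPPER(branch_city), balance FROM src WHERE balance >= 0")
print(con.execute("SELECT * FROM tgt").fetchall())  # [('STOCKHOLM', 100.0)]
```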
Balanced Optimization
Bal. Opt. creates a second copy of the job that pushes everything into the target, generating one big SQL statement.
Bal. Opt. creates a new copy of the job that pushes the load into Source and Target.
Use DataStage Balanced Optimization to select where to push the load: to Source, to Target, or to Both.
The DataStage job is re-written into SQL code.
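A toy sketch of that rewrite idea (the stage description and the target table name are hypothetical; real Balanced Optimization works on the job design, not on strings like this): a source stage, a de-duplication stage and a target stage collapse into one pushed-down SQL statement.

```python
# Hypothetical rewrite: collapse a source stage, a de-duplication
# stage and a target stage into one SQL statement for the database.
def push_to_target(source_table, columns, target_table):
    select = "SELECT DISTINCT {} FROM {}".format(", ".join(columns), source_table)
    return "INSERT INTO {} {}".format(target_table, select)

sql = push_to_target("JK_BANK2.BANK_BRANCH",
                     ["BRANCH_CITY", "BRANCH_STATE", "BRANCH_ZIP"],
                     "JK_BANK2.BRANCH_SUMMARY")  # target name is made up
print(sql)
```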
ETL Balanced Optimization feature of DataStage
ELT – PushDown
DataStage is doing the main work.
Bal. Opt. creates a new copy of the job with SQL code: SELECT * FROM (SELECT DISTINCT BRANCH_CITY, BRANCH_STATE, BRANCH_ZIP FROM JK_BANK2.BANK_BRANCH) AS A, (SELECT DISTINCT BRANCH_CITY, …
The DB server is doing the main job.
Hadoop Distributed File System - HDFS
Application Layer
Workload mgmt Layer
Data Layer
One file, 3 copies
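A conceptual sketch of that replication ("one file, 3 copies"): each block of a file is stored on 3 of the cluster's nodes, so losing any single node never loses data. The round-robin placement here is a simplification of HDFS's real rack-aware policy.

```python
# Simplistic stand-in for HDFS block placement: every block gets
# 3 replicas on distinct nodes, chosen round-robin.
def place_blocks(num_blocks, nodes, replicas=3):
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + i) % len(nodes)] for i in range(replicas)]
    return placement

nodes = ["node1", "node2", "node3", "node4", "node5", "node6"]
placement = place_blocks(4, nodes)
for block, holders in placement.items():
    print(block, holders)  # each block lives on 3 different nodes
```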
ETL – HDFS
Extract Transform in DataStage Load
[Diagram: two HDFS clusters of six nodes each]
Your powerful DataStage server can read and write to the distributed file system.
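Outside of DataStage, the same files are reachable over HDFS's WebHDFS REST API; this sketch only builds the request URLs (the namenode host, port and paths are assumptions) rather than calling a live cluster.

```python
# Build WebHDFS request URLs: OPEN reads a file, CREATE writes one.
# Host, port and paths below are made up for illustration.
def webhdfs_url(namenode, port, path, op):
    return "http://{}:{}/webhdfs/v1{}?op={}".format(namenode, port, path, op)

read_url = webhdfs_url("namenode.example.com", 50070, "/data/customers.csv", "OPEN")
write_url = webhdfs_url("namenode.example.com", 50070, "/data/out.csv", "CREATE")
print(read_url)
print(write_url)
```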
ELT – Hadoop system
Extract
Use DataStage Balanced Optimization to select where to push the load: to Source, to Target, or to Both.
The DataStage job is re-written into JAQL code.
Load with Transform
[Diagram: two Hadoop clusters of six nodes each]
DataStage JAQL example
Bal. Opt. creates a second copy of the job that pushes everything into the target, generating one big JAQL statement.
HDFS: DataStage is doing the main work. Bal. Opt. creates a new copy of the job with JAQL code: setOptions({conf:{"mapred.job.name":"DataStage BalOp job BIGDATA:dstage1 ff_read_write_to_hadoop_jaql_balopt_join CustomerTarget 16_#DSJobInvocationId#"}}); setOptions({conf:{"mapred.reduce.tasks":1}});
The Hadoop application server executes the JAQL code on all nodes.
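What "executes on all nodes" means can be sketched with the classic word-count shape (a plain-Python simulation, not JAQL): each node maps over its own data partition, and the per-node results are reduced into one answer.

```python
from collections import Counter

# Each "node" counts words in its own partition (map); the per-node
# counts are then merged into one result (reduce).
partitions = [["big", "data", "big"], ["data", "etl"], ["big"]]

def map_phase(partition):        # runs independently on each node
    return Counter(partition)

def reduce_phase(counters):      # merges the per-node results
    total = Counter()
    for c in counters:
        total.update(c)
    return total

result = reduce_phase(map_phase(p) for p in partitions)
print(dict(result))  # {'big': 3, 'data': 2, 'etl': 1}
```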
[Diagram: BDFS and Hadoop clusters of six nodes each]
Extract, Transform and filter in DataStage; Load good data into HDFS
[Diagram: BDFS cluster of six nodes]
DataStage can read from many different sources. Convert common data (like time/date) to facilitate the queries that follow. Send unwanted data to garbage.
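A sketch of that clean-before-load step (the accepted date formats are assumptions): normalize mixed date formats to ISO, and route rows that match none of them to a reject ("garbage") pile instead of into HDFS.

```python
from datetime import datetime

# Try each known format; rows that fit none of them go to "garbage"
# instead of into HDFS. The format list is an assumption.
FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y"]

def normalize_date(value):
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            pass
    return None

good, garbage = [], []
for raw in ["2014-05-01", "01/05/2014", "not a date"]:
    iso = normalize_date(raw)
    (good if iso else garbage).append(iso or raw)

print(good)     # ['2014-05-01', '2014-05-01']
print(garbage)  # ['not a date']
```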
A good scenario for a DS customer
Analytic functions, AQL, …
Borrowed images from Google:
Slide 6 - https://yoyoclouds.wordpress.com/tag/hadoop/
Slide 7 - http://kickstarthadoop.blogspot.se/2011/04/word-count-hadoop-map-reduce-example.html
Slide 8 - http://www.rosebt.com/1/post/2012/07/hadoop-internal-software-architecture.html
Slide 9 - http://www.ndm.net/datawarehouse/IBM/ibm-infosphere-biginsights