TRANSCRIPT
Get deeper insights from more data sources using IBM Fluid Query
Doug Dailey, Offering Manager, PureData System for Analytics/Netezza Products, IBM
Welcome to Virtual Enzee
Call Logistics
Audio is broadcast via your computer speakers (no dial-in)
Ask questions via the Q&A button
[in lower left hand of your screen]
Experiencing technical difficulties? Let us know via the Q&A button
Download a copy of the presentation via the Content button
Virtual Enzee Replays are available On-Demand @ http://ibm.biz/dwwebinars
Agenda
Market Perspective
Fluid Query Introduction
Data Virtualization
Data Integration
Fluid Query 1.7.1
Uniformity across IBM
Data is the new currency and at the core of every business, but…
1. Only 15% of organizations fully leverage data and analytics. Unlock the potential of all your data, available in all data types, and combine it with public or 3rd-party data sets.
2. Many users don't have direct or timely access to information. Short-cut or avoid dependencies and democratize access, with integrated governance enabling self-service.
3. 90% of the world's data cannot be googled. Leverage data where it resides and bring analytic capabilities and cloud benefits to your data.
4. The cloud journey is a marathon, not a sprint. Hybrid cloud solutions offer faster, incremental value at lower risk.
[Slide graphic: Self-Service Platform, Inside Information Supply Chain, Integration Engine]
Agenda: Fluid Query Introduction
IBM Fluid Query Helps Unify Today's Advanced Analytic Ecosystems
• Excellent for provisioning your business
• Easier, faster consumption of data
• Better, more transparent access to required data sources
Connect / Query / Bulk Data Copy
Transparency Across Your Enterprise
Extending to Hadoop
[Diagram: Fluid Query connecting to dashDB Local and Big SQL]
Bulk Data Copy
Data:
• Custom filters (tables, DBs)
• Single or multiple tables with "where" clause support
• Determine format and compression

Copy:
• Simple-to-use CLI or SQL-based function
• Support for Hive partitions and clusters
• Export Hive registered data to PDA

Bulk:
• Import or export data to and from Hadoop
• Parallel data transfer (n-stream)
• Parallel via map/reduce jobs across the Hadoop cluster

Native transfer between PDA and Hadoop
Read all about it:
• Endpoint validation checking
• Checksum capability to ensure data integrity
• Supports Netezza external table options
• Retailer imported 75M rows in under 5 minutes
• ¼ rack tested at 2-3 TB/hr on a 6-node cluster
• 2-3X faster than Sqoop using nzbak format
• Planned for Big SQL, dashDB and LUW
(A minimal end-to-end CLI sketch follows.)
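For orientation before the details that follow: a whole bulk copy is a single CLI call. A minimal sketch of an import run, where fq.command=import and fq.tables are assumed property names modeled on the export examples later in this deck, not confirmed shipped defaults:

# Hedged sketch of a PDA-to-Hadoop bulk copy via the FDM CLI.
# conf.xml carries the connection settings; -D overrides properties per run.
# fq.command=import and fq.tables are illustrative assumptions.
./fdm.sh -conf conf.xml -D fq.command=import -D fq.tables='SALES_2016'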
Hadoop storage considerations
[Diagram: choosing among Text, Avro, Parquet, ORC and RCFile based on Hadoop distro, schema evolution, CPU/IO, reads and writes, export from HDFS, and storage needs]

Supported compression modes:
• Parquet: snappy, gzip, uncompressed
• Avro: snappy, deflate
• ORC: snappy, zlib, none
• RCFile: snappy, zlib, none

A simple test, converting an uncompressed 1.8 GB CSV file into various formats, achieves much smaller disk footprints:
• Avro: 1.5 GB
• Avro with Snappy compression: 750 MB
• Parquet with Snappy compression: 300 MB
Query performance across Hive, Big SQL and Impala improved as the files became smaller.
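The footprint comparison above is easy to reproduce on any Hive cluster. A sketch, assuming a CSV-backed Hive table t_csv already exists (table names, host and warehouse paths are illustrative):

# Rewrite the CSV-backed table as Parquet with Snappy, then compare directory sizes.
beeline -u jdbc:hive2://localhost:10000/default -e "
  SET parquet.compression=SNAPPY;
  CREATE TABLE t_parquet STORED AS PARQUET AS SELECT * FROM t_csv;"
hdfs dfs -du -h /user/hive/warehouse/t_csv /user/hive/warehouse/t_parquet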
Bulk Data Copy: sample times measured by IBM between a PDA N2001 model and BigInsights

File format      GB/h   Time to import 1 TB   TB/h
Text             2444   25m                   2.444
Mixed mode       3081   19m                   3.15
Netezza binary   3561   17m                   3.561

Hadoop formats with NO compression:
Avro              800   1h 16m                0.800
Parquet           862   1h 10m                0.862
RC file          1216   50m                   1.20
ORC file          896   1h 8m                 0.896

Hadoop formats WITH compression (use > 12 splits):
Parquet gzip      344   2h 58m                0.344
Avro deflate      421   2h 25m                0.421
RC file gzip      459   2h 13m                0.459
ORC file Snappy   997   1h 2m                 0.997
Disclaimer: Performance test results are measured using specific computer systems and / or components. Any difference in system hardware or software design or configuration may affect actual performance. Readers should consult other sources of information to evaluate the performance of systems or components they are using.
Test configuration:
• PDA N2001-005: 10 GbE NIC, local network, NPS 7.2.1, INZA 3.2.1, Fluid Query 1.7
• BigInsights v4.1: 5 data nodes, Apache Hive
PDA, dashDB Local and Hadoop
Agenda: Data Virtualization
Data Virtualization using Fluid Query

Netezza architecture:
• JAE instantiates JDBC for data access
• UDTF via Java Analytic Executable (AE)
• UDTFs encapsulated in VIEWs
• Ad hoc via nzsql or SQL client
• Dynamic SQL pushdown not supported

Cost-based optimizer:
• Decomposes, rewrites and distributes queries
• Chooses the query plan with SQL pushdown
• Query execution engine combines results
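A minimal sketch of that pattern, assuming the connector has registered a UDTF; the fqRead function and the view name are illustrative, not the shipped identifiers:

# The UDTF opens a JDBC connection to the remote source and streams rows
# back; wrapping it in a VIEW gives users plain-SQL access from nzsql.
nzsql -d EDW -c "CREATE VIEW remote_orders AS
  SELECT * FROM TABLE WITH FINAL (fqRead('SELECT * FROM orders'));"

# Ad hoc federation from any SQL client; remember: no dynamic SQL pushdown.
nzsql -d EDW -c "SELECT COUNT(*) FROM remote_orders;"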
Federation Comparison

Merrill Lynch

Challenge:
• Need a complete view of data from various sources, including DB2, Oracle, Sybase, and SQL Server, in order to perform data mining and profiling
• Need to provide up-to-date, detailed information to call center staff to make tactical business decisions
• Explore data sources without moving the data, to focus on mining the data

Solution:
• Federated data from heterogeneous data sources throughout the day
• Complex data transformations are performed using ETL at the virtual database

Business Benefits:
• Call center representatives can now access real-time operational data, complete with detail from the data warehouse
• On-demand scoring of data profiling and mining processes from a larger number of sources

Technology Benefits:
• Saves the time to build ETL processes to bring data to the mining database
• Extends the value of the data warehouse to users

Taikang Life Insurance Co.

Challenge:
• Dealt with different types of data that needed to be shared across branches to provide a complete view of the customer
• Producing business reports took more than 10 employees to compile and took too long

Solution:
• Combined data from various data sources into a single view
• Used Cognos as the reporting tool to provide real-time insight for strategic and tactical business decisions

Business Benefits:
• 99% reduction in process time for ad hoc queries
• Reduced costs and higher productivity
• Greater cross-sell and up-sell opportunities

Technology Benefits:
• 59% reduction in batch window
• Reduced time to develop new applications
Healthcare
• Ability to query data in Hadoop and other data sources such as Oracle and Netezza. Currently in development, but they want to leverage dimension tables stored on Oracle to supplement local data on PDA through a view definition. Their top needs are Oracle, Hadoop, and Spark.
• Fluid Query is currently installed primarily as a queryable archive of data on Hadoop. Enabled for test and production and used for discovery. Business analysts use Fluid Query to JOIN additional data from DB2 and other RDBMSs into existing reports, to better enrich overall content. Once business analysts identify new data sources, IT tests and creates a process to support them.

Banking and Retail
• Primary work is data movement between Netezza and BigInsights. They are interested in federation for use as a queryable archive.
• Mortgage Banking: the team supports the EDW space in mortgages. Their inflow includes some BigInsights Hadoop used internally, but this has not expanded. Currently exploring the integration between Netezza appliances and existing Oracle databases.
• This customer has an Oracle Data Warehouse and is working to migrate data to two new Netezza boxes. They like the ease of use that Fluid Query offers, which lets them copy data from Oracle with a simple SELECT statement. They especially like that they can use CTAS to create tables and avoid dealing with DDL conversion. Fluid Query creates connections to all of their Oracle schemas so that they can synchronize data with Netezza while they migrate their batch framework: no coding effort, only CTAS and INSERT/SELECT statements to move the data (a sketch of this pattern follows below).
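The CTAS pattern these teams describe needs no ETL tooling at all. A sketch, assuming the Fluid Query connector has created a federated view ora_customers over the Oracle table (all names illustrative):

# One-time copy: CTAS pulls the data and derives the Netezza DDL, so no
# manual DDL conversion from Oracle is needed.
nzsql -d EDW -c "CREATE TABLE customers AS SELECT * FROM ora_customers;"

# Ongoing synchronization while the batch framework migrates: plain INSERT/SELECT.
nzsql -d EDW -c "INSERT INTO customers
                   SELECT * FROM ora_customers WHERE load_date = CURRENT_DATE;"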
Agenda: Data Integration
Integrating Data Sources using Fluid Query
[Diagram: open source and commercial data sources, Hadoop, and Netezza, connected via nzload, nz_migrate, external tables, and Fluid Query]
Large Canadian Bank

Business challenge: create an aggregated Risk data warehouse from each Risk engine in their environment, in adherence to regulation BCBS 239.

The consolidated reports go to senior executives, board members and regulators. Some reports are used in the publishing of financial statements. These have strict timeline requirements; if they are missed there will be reputational and/or fiscal consequences, and regulators can impose fines on banks for not submitting the reports on time.
(www.markit.com/bcbs239, 2014)

Success Criteria:
− Consolidate all Risk engine data into a single data warehouse
− Meet SLAs required for time to load each respective Risk area (largest daily load is 10 million rows in under 15 minutes)
− Provide daily and weekly reporting for critical business functions
− Provide the ability to filter by custom dimensions

Environment:
− Risk engines on Solaris 5.5.1
− Full rack N2002-010, NPS 7.2.0.5, INZA 3.20, Fluid Query 1.6
− Smaller systems @ 50 GB, larger systems @ 9-10 TB
− ~200 tables ranging from 300 to 10 million records

Netezza was chosen as the platform for consolidation based on its capacity, performance and ability to leverage Fluid Query through a straightforward setup that met the estimated processing window. Netezza also had a cost saving in its favor (Fluid Query is a no-charge component).
Evaluating Options: ETL vs. Fluid Query

Environment: Fluid Query populates the EDW with data from critical Risk engines, then combines it with local data on PDA for comprehensive reporting for BCBS 239.
[Diagram: Risk engines, Fluid Query, PDA and Cognos Analytics]
Implementation
1. A custom utility extracts the DDL from the source and converts it to Netezza DDL. It provides data type mapping and allows the distribution key to be adjusted in PDA.
2. A stored procedure (SP) is called as Risk engines complete scoring and capital calculations.
3. The SP kicks off 10 connectors that are manually grouped against data in each DB.
Risk engine loads:
• 10 SPs invoked in parallel via nzsql (see the sketch below)
• Each SP pulls a different set of tables
• Logical partition by table
• Execution varies daily, weekly, monthly
• Simple SELECT *, no JOINs, INSERT INTO
• Runs off-hours
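A sketch of that invocation pattern, with illustrative database and procedure names; each nzsql call is backgrounded so the ten loads run in parallel:

# Launch one stored procedure per risk area in parallel, then wait for all.
for i in $(seq 1 10); do
    nzsql -d RISK_EDW -c "CALL LOAD_RISK_AREA_${i}();" &
done
wait    # the SLA is met only when every parallel load has finished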
Summary
− Fluid Query met the project deadline (fast time to market)
− Met regulatory compliance for BCBS 239
− No additional capital requirements
− No measurable operating costs
− Consistently meets all SLAs for their business window
− Straightforward implementation
− No need to transfer files or for additional storage space
− Fully native and lightweight package

Architecture Review
Largest Risk engine:
Source: Oracle Exadata
Total tables: 99
Total rows: 21.5 million
Data volume: 6.75 GB
Load time: < 12 minutes
Big Fish Games

[Diagram: installs, sessions, custom and streaming data flowing from on-premises sources through Fluid Query (export, SQL insert) into the EDW; the separate staging server is eliminated]

Fluid Query ships SQL requests to Hadoop (IBM BigInsights Big SQL v4.1) and MySQL to populate the EDW. Fluid Query 1.7.1 now supports export of the AVRO file format.

ETL takeout: Fluid Query replaced an entire ETL infrastructure with just two lines of code, making it quicker and easier to collect and process data for fast, insightful reporting.

Replaced components:
− Talend ETL software (open source)
− Separate staging server

Environment:
− Oracle 11g
− Netezza 3001-020, NPS 7.2.0.5, INZA 3.20, Fluid Query 1.7
− BigInsights Big SQL v4.1
Benefits of Fluid Query for Data Integration
• No-charge NPS software component
• Simple SQL front-end
• Ability to migrate RDBMS via SQL
• Automatic data type conversion
• No ETL staging server required
• No formal ETL required
• Allows for custom predicates
• Does not require additional skills
• Allows access to HOT data
• Flexible and easy to set up
Agenda: Fluid Query 1.7.1
Fluid Query 1.7.1 New Features

Hadoop Integration and Fast Data Movement:
– Support for Hive partitioning and Cluster By functionality
– Ability to export Hadoop files to PDA (including select DBs and tables)
– Automatic VIEW creation on PDA for data newly imported to Hadoop
– Bulk data copy support for Netezza external table options
– Checksum utility ensures data integrity when copying data between PDA and Hadoop

Access and Authentication:
– Automatic password encryption and storage control
– Automated kinit authentication for kerberized Hadoop services (sketched below)
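For context, this is what the automated kinit support replaces; a manual sketch with an illustrative principal and keytab path:

# Obtain a Kerberos ticket before Fluid Query touches a kerberized
# Hive/HDFS service; 1.7.1 now automates this step.
kinit -kt /etc/security/keytabs/fquser.keytab fquser@EXAMPLE.COM
klist    # verify the ticket cache before running fdm.sh or queries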
Updated support for data sources:
• BigInsights 4.2 / Big SQL 4.2
• Hortonworks 2.5
• Cloudera 5.9
• Spark 1.6.1
• MapR
• DB2 v11.1
Supported Database and Hadoop Providers
[Chart: provider support timeline, 2013-2017]

[Figure: Hive partition pruning example: partitioning a large table (10B rows) by month and day narrows scans to 50M and then 100K rows]
Import to Hadoop Using Hive Partitions
The fq.hive.partitioned.by property for FDM determines by which column(s) the imported table will be partitioned.
Hive partitioned tables allow queries with predicates to run much faster on Hadoop and encourage more efficient MapReduce job processing.

./fdm.sh -conf conf.xml -D fq.hive.partitioned.by='col1'

Note: some restrictions apply when using partitions and clusters.
Import to Hadoop Using Hive Clustering
New parameters for enabling Hive clustering:
– fq.hive.clustered.by
– fq.hive.clustered.buckets

There are two ways to set clustering:
– Default clustering: use the same value that was set for the table on NPS with the DISTRIBUTE ON parameter, by setting fq.hive.clustered.by to empty:
./fdm.sh -conf conf.xml -D fq.hive.clustered.by='' -D fq.hive.clustered.buckets='252'
– Cluster by specific columns:
./fdm.sh -conf conf.xml -D fq.hive.clustered.by='month' -D fq.hive.clustered.buckets='12'
(The Hive-side result of these options is sketched below.)
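On the Hive side, the two options above translate into ordinary PARTITIONED BY and CLUSTERED BY table definitions. A sketch of roughly what FDM produces for the commands shown, with illustrative column names and types:

# Rough Hive-side equivalent of an import with partitioning and clustering.
beeline -u jdbc:hive2://localhost:10000/default -e "
  CREATE TABLE sales (order_id INT, amount DOUBLE, month STRING)
  PARTITIONED BY (col1 STRING)            -- from fq.hive.partitioned.by='col1'
  CLUSTERED BY (month) INTO 12 BUCKETS;   -- from fq.hive.clustered.by='month'"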
Hive Table Export
Export Hive registered tables from Hadoop to PDA:
• Specify table and WHERE clause
• Export from Hadoop, or via SQL in NPS

New FDM parameters:
– fq.hive.tablename
– fq.hive.where

Execute from Hadoop via CLI:
./fdm.sh -conf conf.xml -D fq.command=export -D fq.hive.tablename=sales -D fq.hive.schema=xmas_campaign -D fq.hive.where='order_month=December and order_item=2345pt_aws'

Execute on PDA using the SQL-based fromHadoop() call:
call fromHadoop('', 'testtab', '', 'fq.input.path=', 'fq.hive.tablename=sales', 'fq.hive.schema=xmas_campaign', 'fq.hive.where=order_month=December and order_item=2345pt_aws');
Automated View Creation on Import to Hadoop
Enable using SQL-based toHadoop() on NPS:

$ ./fqConfigure.sh --service fqdm --provider ibm --fqdm-conf conf/fq-remote-conf_cloudera_115.xml --auto-connector federated_view --database kdtest --driver-path /nzscratch/BVT/fqdm_libs/BVT_Cloudera_115/ --config Cl115

log4j:WARN No appenders could be found for logger (com.ibm.nz.fq.FqConfiguration).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Database set to: kdtest
Fluid Query version: 1.7.1.0-B3 [Build 170214-354]
Checking access to HDFS for user: hdfs. Using the following URL: hdfs://cloudera560master.kraklab.pl.ibm.com:8020 ... [OK]
Checking connection to warehouse jdbc:netezza://9.167.40.45:5480/KDTEST ... [OK]
Checking map-reduce jobs ... [OK]
Init destination database engine ... [OK]
Checking Hive connection. jdbc:hive2://9.167.43.115:10000/default; ... [OK]
Done
Connection configuration success.

$ ./fqRegister.sh --config Cl115 --db kdtest --udtf toCl115,fromCl115
Functions and credentials are successfully registered in database "kdtest".

KDTEST.ADMIN(ADMIN)=> call toCl115('kdtest','ttab11','','','fq.append.mode=overwrite');
TOCL115
Fluid Query Data Movement finished successfully.
Auto connector enabled. View(s) with suffix toCl115_federated_view were created.
(112,234 rows)
FDM External Table Options
FDM now supports the use of Netezza external table options for configuring import/export between PDA and Hadoop.

New FDM parameter:
– fq.custom.exttab.options: allows multiple options, which supersede any other settings

<property>
  <name>fq.custom.exttab.options</name>
  <value>MaxErrors 1 SocketBufSize 8000000</value>
</property>

Available options include: BoolStyle, Compress, CRinString, CtrlChars, DataObject, DataDelim, DateStyle, DecimalDelim, Delimiter, Encoding, EscapeChar, FillRecord, Format, IgnoreZero, SkipRows, MaxRows, ...
The full list can be referenced on IBM Knowledge Center:
https://www.ibm.com/support/knowledgecenter/SSULQD_7.2.0/com.ibm.nz.load.doc/c_load_options.html
(A per-run CLI variant is sketched below.)
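The same option string can presumably also be supplied per run with -D rather than in conf.xml, mirroring the other FDM parameters in this release; a sketch, not a documented example:

# One-off override of the external table options from the CLI.
./fdm.sh -conf conf.xml -D fq.custom.exttab.options='MaxErrors 1 SocketBufSize 8000000'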
FDM Checksum Capability
Checksum functionality allows you to check data consistency after moving data between PDA and Hadoop using Fluid Query.

New FDM parameters:
– fq.checksum: indicates whether a checksum is calculated after the import/export.
  NONE: no checksum (default)
  FULL: calculate checksum using all columns in each table
  ROWCOUNT: only row count
  COLUMNS: columns specified by fq.checksum.columns
– fq.checksum.columns: the list of columns for checksum calculation. When used with multiple tables for import/export, the checksum is calculated only on the listed columns within the tables; otherwise only a row count check is performed.

FDM Checksum by ROWCOUNT
ROWCOUNT will only check the count of rows between the source and target.
<property>
  <name>fq.checksum</name>
  <value>ROWCOUNT</value>
</property>

Command:
fdm.sh -conf file.xml -D fq.checksum=rowcount

Output:
2017-02-21 13:36:38,078 132941 [main] INFO com.ibm.nz.fq.cksum.TableInfo - The checksum calculated for the transferred data for table ADMIN.TAB1 returned identical values on both systems (rows: 100, sum 0)
2017-02-21 13:36:38,081 132944 [main] INFO com.ibm.nz.fq.NzTransfer - Import summary:
Used filter: ADMIN.tab1
Found 1 tables matching the filter: TAB1
Imported 1 out of 1 table(s): TAB1
==================================== CHECKSUM REPORT =====================================
TABLE        STATUS   PDA CNT   HD CNT   PDA SUM   HD SUM
ADMIN.TAB1   EQUAL    100       100      0         0
FDM Checksum by FULL
FULL will check all columns for every row between the source and target.
<property>
  <name>fq.checksum</name>
  <value>FULL</value>
</property>

Command:
fdm.sh -conf file.xml -D fq.checksum=full

Output:
2017-02-21 13:22:45,861 123232 [main] INFO com.ibm.nz.fq.cksum.TableInfo - The checksum calculated for the transferred data for table ADMIN.TAB1 returned identical values on both systems (rows: 100, sum 153.7944052845138)
2017-02-21 13:22:45,865 123236 [main] INFO com.ibm.nz.fq.NzTransfer - Import summary:
Used filter: ADMIN.tab1
Found 1 tables matching the filter: TAB1
Imported 1 out of 1 table(s): TAB1
==================================== CHECKSUM REPORT =====================================
TABLE        STATUS   PDA CNT   HD CNT   PDA SUM              HD SUM
ADMIN.TAB1   EQUAL    100       100      153.7944052845138    153.7944052845138
FDM Checksum by COLUMNS
COLUMNS will check every row for the selected columns across the source and target.
<property>
  <name>fq.checksum</name>
  <value>COLUMNS</value>
</property>
<property>
  <name>fq.checksum.columns</name>
  <value>CALL_COUNT</value>
</property>

Command:
fdm.sh -conf file.xml -D fq.checksum=column -D fq.checksum.columns="CALL_COUNT"

Output:
2017-02-21 13:51:34,075 130493 [main] INFO com.ibm.nz.fq.cksum.TableInfo - The checksum calculated for the transferred data for table ADMIN.TAB1 returned identical values on both systems (rows: 100, sum 154.0898104558758)
2017-02-21 13:51:34,078 130496 [main] INFO com.ibm.nz.fq.NzTransfer - Import summary:
Used filter: ADMIN.tab1
Found 1 tables matching the filter: TAB1
Imported 1 out of 1 table(s): TAB1
==================================== CHECKSUM REPORT =====================================
TABLE        STATUS   PDA CNT   HD CNT   PDA SUM              HD SUM
ADMIN.TAB1   EQUAL    100       100      154.0898104558758    154.0898104558758
Password Encryption and Key Storage

Manually generate the encryption key (the user is responsible for the security of the encryption key):
Generate a 128-bit key and store it in a file:
dd if=/dev/urandom of=KeyFile_Name bs=16 count=1
Provide this keyfile to the fqConfigure.sh script:
./fqConfigure.sh --host 9.167.40.23 --provider horton --service hive --port 10000 --username root --config CUST_TEST_properties/userprovidedkeyconfig --keyFile keyfile

Using the autoGenerateKey feature (auto-generated key files have read-only permission for their owner):
Generate and store the key at a user-provided location:
./fqConfigure.sh --host 9.167.40.23 --provider horton --service hive --port 10000 --username root --config CUST_TEST_properties/autogenkeyfileconfig --autoGenerateKey --keyFileOut /tmp/keyfile
Generate and store the key in a file at the default location /nz/export/ae/products/fluidquery/AutoGenKeys/autoGeneratedkey:
./fqConfigure.sh --host 9.167.40.23 --provider horton --service hive --port 10000 --username root --config CUST_TEST_properties/autogenkeyfileconfig --autoGenerateKey
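Since the user owns the security of a manually generated key, it is worth mirroring the read-only permission that --autoGenerateKey applies. A sketch:

# Restrict the manually generated key to owner read-only, matching the
# permission Fluid Query sets on auto-generated key files.
chmod 400 KeyFile_Name
ls -l KeyFile_Name    # expect -r-------- (owner read-only)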
Agenda: Uniformity across IBM
One Vision Across our Portfolio
1. Enable query federation across all IBM Analytics repositories
2. Deliver an integrated out-of-the-box experience (installation and configuration)
3. Distribute a robust set of ODBC/JDBC drivers
4. Easily define, manage and monitor federation via the DSM or console user interface
5. Provide native bulk data copy with Hadoop across all IBM database sources
[Diagram: federation technology embedded across the portfolio: Big SQL on BigInsights/IOP (Hive or Spark SQL), dashDB Local, dashDB Analytics (ground/cloud), and Netezza Fluid Query on IBM PDA and PDOA]
IBM Fluid Query: Unifying Agent for Hybrid Analytics

Connect: hybrid environments; ODBC/JDBC connectivity; on-premises and cloud data sources; Cloudera, BigInsights, Hortonworks, Pivotal and MapR

Query: intelligent query routing; cost-based optimizer; SQL pushdown; local data caching; ANSI-compliant SQL

Monitor: easily manage federation through a single pane; simple point and click to discover and query; monitor and visualize active queries

Move: bulk data copy to and from Hadoop; parallel transfer; filtered subsets of data; support for AVRO, ORC, Parquet and RCFile; checksum validation

Access: provision hybrid cloud environments; on-premises, cloud and Hadoop data; intelligent query routing; bulk data transfer
The manual federation setup this replaces:
1. Address data source client install prerequisites
2. Install data source clients
3. Grant file system permissions to fenced user ID
4. Register data source server
5. Enable federation
6. Configure DB2 registry variable for data source driver
7. Configure environment variable for data source driver
8. Configure db2dj.ini file for data source driver
9. Restart DB2 instance
10. Create wrapper
11. Create server
12. Create label
13. Query remote data
(Steps 10-13 are sketched in SQL below.)
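For comparison, steps 10 through 13 of that manual flow are plain DB2 SQL. A hedged sketch against an Oracle source, where server, schema and credential values are illustrative, a real setup also needs the wrapper library and node configured, and step 12's "create label" presumably corresponds to creating a nickname:

# Steps 10-13 of the manual DB2 federation flow, sketched for Oracle.
db2 "CREATE WRAPPER net8"                                         # 10. create wrapper
db2 "CREATE SERVER ora_srv TYPE oracle VERSION '11' WRAPPER net8 OPTIONS (NODE 'ORCL')"   # 11. create server
db2 "CREATE USER MAPPING FOR USER SERVER ora_srv OPTIONS (REMOTE_AUTHID 'scott', REMOTE_PASSWORD 'tiger')"
db2 "CREATE NICKNAME orders FOR ora_srv.SCOTT.ORDERS"             # 12. nickname for the remote table
db2 "SELECT COUNT(*) FROM orders"                                 # 13. query remote data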
Simplified User Experience: Point and Click
Run SQL against remote objects seamlessly and export to Excel format. Review the visual explain and pushdown of the remote query.
Monitor Queries that Access Remote Data
Questions?
Type your question in the Q&A panel on your screen.
PRODUCT LINKS:
IBM Fix Central:https://www.ibm.com/support/fixcentral/
PureData System for Analytics Support Page:http://ibm.biz/pda_support
IBM Knowledge Center – PureData: http://ibm.biz/pda_knowledgecenter
For more information
Web Site: IBM PureData System for Analytics http://www.ibm.com/software/data/puredata/analytics/system/
Blogs/Articles: IBM Big Data & Analytics Hub – http://www.ibmbigdatahub.com
Community: Upcoming & On-Demand Webinarshttp://ibm.biz/dwwebinars
Enzee Community: http://ibm.biz/enzeecommunity
Make sure to JOIN the community to get the latest updates and join in on the conversation! [Select "Log In" in the top right hand of the screen to register and JOIN.]
Take dashDB Local for a Spin! http://ibm.biz/dashDBLocal
TRY!!
Next Webinar
TechTalk: Client Self-Service Series
Maintenance Tasks Preparation (for PureData System for Analytics clients)
May 4 @ 11 AM ET
http://ibm.biz/enzee_0504
Virtual Enzee Schedule and On-Demand replays are available at:
http://ibm.biz/dw_webinars
Thank you for Joining Virtual Enzee
Previous topics include:
• TechTalk: Replication Troubleshooting
• TechTalk: PureData System for Analytics Complete Preventive Health Check
• TechTalk: Life Saving Checkup and Upgrade of your Netezza Platform Software (NPS)
• Unifying Data Access Across the Logical Data Warehouse with IBM Fluid Query
• And more…
Follow Us on Social: #Enzee @IBMNetezza @IBMdataWH @IBMdashDB