TRANSCRIPT
Get deeper insights from more data sources using IBM Fluid Query
Doug Dailey, Offering Manager, PureData System for Analytics/Netezza Products, IBM
Welcome to Virtual Enzee
Call Logistics
Audio is broadcast via your computer speakers (no dial-in)
Ask questions via the Q&A button
[in lower left hand of your screen]
Experiencing technical difficulties? Let us know via the Q&A button
Download a copy of the presentation via the Content button
Virtual Enzee Replays are available On-Demand @ http://ibm.biz/dwwebinars
Agenda
Market Perspective
Fluid Query Introduction
Data Virtualization
Data Integration
Fluid Query 1.7.1
Uniformity across IBM
Data is the new currency and at the core of every business, but…
1. Only 15% of organizations fully leverage data and analytics. Unlock the potential of all your data, available in all data types, and combine it with public or 3rd-party data sets.
2. Many users don't have direct or timely access to information. Short-cut or avoid dependencies and democratize access, with integrated governance enabling self-service.
3. 90% of the world's data cannot be googled. Leverage data where it resides and bring analytic capabilities and cloud benefits to your data.
4. The cloud journey is a marathon, not a sprint. Hybrid cloud solutions offer faster, incremental value at lower risk.
[Slide graphic: Self-Service Platform, Inside Information Supply Chain, Integration Engine]
Agenda: Fluid Query Introduction
IBM Fluid Query Helps Unify Today's Advanced Analytic Ecosystems
• Excellent for provisioning your business
• Easier, faster consumption of data
• Better, more transparent access to required data sources
Connect / Query / Bulk Data Copy
Transparency Across Your Enterprise
Extending to Hadoop
[Diagram: Fluid Query connecting to dashDB Local and Big SQL]
Bulk Data Copy
Data:
• Custom filters (tables, DBs)
• Single or multiple tables with "where" clause support
• Determine format and compression

Copy:
• Simple-to-use CLI or SQL-based function
• Support for Hive partitions and clusters
• Export Hive registered data to PDA

Bulk:
• Import or export data to and from Hadoop
• Parallel data transfer (n-stream)
• Parallel via map/reduce jobs across the Hadoop cluster

Native transfer between PDA and Hadoop
Read all about it:
• Endpoint validation checking
• Checksum capability to ensure data integrity
• Supports Netezza external table options
• Retailer imported 75M rows in under 5 minutes
• ¼ rack tested at 2-3 TB/hr on a 6-node cluster
• 2-3X faster than Sqoop using nzbak format
• Planned for Big SQL, dashDB and LUW
(A minimal end-to-end CLI sketch follows.)
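For orientation before the details that follow: a whole bulk copy is a single CLI call. A minimal sketch of an import run, where fq.command=import and fq.tables are assumed property names modeled on the export examples later in this deck, not confirmed shipped defaults:

# Hedged sketch of a PDA-to-Hadoop bulk copy via the FDM CLI.
# conf.xml carries the connection settings; -D overrides properties per run.
# fq.command=import and fq.tables are illustrative assumptions.
./fdm.sh -conf conf.xml -D fq.command=import -D fq.tables='SALES_2016'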
Hadoop storage considerations
[Diagram: choosing among Text, Avro, Parquet, ORC and RCFile based on Hadoop distro, schema evolution, CPU/IO, reads and writes, export from HDFS, and storage needs]

Supported compression modes:
• Parquet: snappy, gzip, uncompressed
• Avro: snappy, deflate
• ORC: snappy, zlib, none
• RCFile: snappy, zlib, none

A simple test, converting an uncompressed 1.8 GB CSV file into various formats, achieves much smaller disk footprints:
• Avro: 1.5 GB
• Avro with Snappy compression: 750 MB
• Parquet with Snappy compression: 300 MB
Query performance across Hive, Big SQL and Impala improved as the files became smaller.
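The footprint comparison above is easy to reproduce on any Hive cluster. A sketch, assuming a CSV-backed Hive table t_csv already exists (table names, host and warehouse paths are illustrative):

# Rewrite the CSV-backed table as Parquet with Snappy, then compare directory sizes.
beeline -u jdbc:hive2://localhost:10000/default -e "
  SET parquet.compression=SNAPPY;
  CREATE TABLE t_parquet STORED AS PARQUET AS SELECT * FROM t_csv;"
hdfs dfs -du -h /user/hive/warehouse/t_csv /user/hive/warehouse/t_parquet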
Bulk Data Copy: sample times measured by IBM between a PDA N2001 model and BigInsights

File format      GB/h   Time to import 1 TB   TB/h
Text             2444   25m                   2.444
Mixed mode       3081   19m                   3.15
Netezza binary   3561   17m                   3.561

Hadoop formats with NO compression:
Avro              800   1h 16m                0.800
Parquet           862   1h 10m                0.862
RC file          1216   50m                   1.20
ORC file          896   1h 8m                 0.896

Hadoop formats WITH compression (use > 12 splits):
Parquet gzip      344   2h 58m                0.344
Avro deflate      421   2h 25m                0.421
RC file gzip      459   2h 13m                0.459
ORC file Snappy   997   1h 2m                 0.997
Disclaimer: Performance test results are measured using specific computer systems and / or components. Any difference in system hardware or software design or configuration may affect actual performance. Readers should consult other sources of information to evaluate the performance of systems or components they are using.
Test configuration:
• PDA N2001-005: 10 GbE NIC, local network, NPS 7.2.1, INZA 3.2.1, Fluid Query 1.7
• BigInsights v4.1: 5 data nodes, Apache Hive
PDA, dashDB Local and Hadoop
Agenda: Data Virtualization
Data Virtualization using Fluid Query

Netezza architecture:
• JAE instantiates JDBC for data access
• UDTF via Java Analytic Executable (AE)
• UDTFs encapsulated in VIEWs
• Ad hoc via nzsql or SQL client
• Dynamic SQL pushdown not supported

Cost-based optimizer:
• Decomposes, rewrites and distributes queries
• Chooses the query plan with SQL pushdown
• Query execution engine combines results
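A minimal sketch of that pattern, assuming the connector has registered a UDTF; the fqRead function and the view name are illustrative, not the shipped identifiers:

# The UDTF opens a JDBC connection to the remote source and streams rows
# back; wrapping it in a VIEW gives users plain-SQL access from nzsql.
nzsql -d EDW -c "CREATE VIEW remote_orders AS
  SELECT * FROM TABLE WITH FINAL (fqRead('SELECT * FROM orders'));"

# Ad hoc federation from any SQL client; remember: no dynamic SQL pushdown.
nzsql -d EDW -c "SELECT COUNT(*) FROM remote_orders;"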
Federation Comparison

Merrill Lynch

Challenge:
• Need a complete view of data from various sources, including DB2, Oracle, Sybase, and SQL Server, in order to perform data mining and profiling
• Need to provide up-to-date, detailed information to call center staff to make tactical business decisions
• Explore data sources without moving the data, to focus on mining the data

Solution:
• Federated data from heterogeneous data sources throughout the day
• Complex data transformations are performed using ETL at the virtual database

Business Benefits:
• Call center representatives can now access real-time operational data, complete with detail from the data warehouse
• On-demand scoring of data profiling and mining processes from a larger number of sources

Technology Benefits:
• Saves the time to build ETL processes to bring data to the mining database
• Extends the value of the data warehouse to users

Taikang Life Insurance Co.

Challenge:
• Dealt with different types of data that needed to be shared across branches to provide a complete view of the customer
• Producing business reports took more than 10 employees to compile and took too long

Solution:
• Combined data from various data sources into a single view
• Used Cognos as the reporting tool to provide real-time insight for strategic and tactical business decisions

Business Benefits:
• 99% reduction in process time for ad hoc queries
• Reduced costs and higher productivity
• Greater cross-sell and up-sell opportunities

Technology Benefits:
• 59% reduction in batch window
• Reduced time to develop new applications
Healthcare
• Ability to query data in Hadoop and other data sources such as Oracle and Netezza. Currently in development, but they want to leverage dimension tables stored on Oracle to supplement local data on PDA through a view definition. Their top needs are Oracle, Hadoop, and Spark.
• Fluid Query is currently installed primarily as a queryable archive of data on Hadoop. Enabled for test and production and used for discovery. Business analysts use Fluid Query to JOIN additional data from DB2 and other RDBMSs into existing reports, to better enrich overall content. Once business analysts identify new data sources, IT tests and creates a process to support them.

Banking and Retail
• Primary work is data movement between Netezza and BigInsights. They are interested in federation for use as a queryable archive.
• Mortgage Banking: the team supports the EDW space in mortgages. Their inflow includes some BigInsights Hadoop used internally, but this has not expanded. Currently exploring the integration between Netezza appliances and existing Oracle databases.
• This customer has an Oracle Data Warehouse and is working to migrate data to two new Netezza boxes. They like the ease of use that Fluid Query offers, which lets them copy data from Oracle with a simple SELECT statement. They especially like that they can use CTAS to create tables and avoid dealing with DDL conversion. Fluid Query creates connections to all of their Oracle schemas so that they can synchronize data with Netezza while they migrate their batch framework: no coding effort, only CTAS and INSERT/SELECT statements to move the data (a sketch of this pattern follows below).
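The CTAS pattern these teams describe needs no ETL tooling at all. A sketch, assuming the Fluid Query connector has created a federated view ora_customers over the Oracle table (all names illustrative):

# One-time copy: CTAS pulls the data and derives the Netezza DDL, so no
# manual DDL conversion from Oracle is needed.
nzsql -d EDW -c "CREATE TABLE customers AS SELECT * FROM ora_customers;"

# Ongoing synchronization while the batch framework migrates: plain INSERT/SELECT.
nzsql -d EDW -c "INSERT INTO customers
                   SELECT * FROM ora_customers WHERE load_date = CURRENT_DATE;"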
Agenda: Data Integration
Integrating Data Sources using Fluid Query
[Diagram: open source and commercial data sources, Hadoop, and Netezza, connected via nzload, nz_migrate, external tables, and Fluid Query]
Large Canadian Bank

Business challenge: create an aggregated Risk data warehouse from each Risk engine in their environment, in adherence to regulation BCBS 239.

The consolidated reports go to senior executives, board members and regulators. Some reports are used in the publishing of financial statements. These have strict timeline requirements; if they are missed there will be reputational and/or fiscal consequences, and regulators can impose fines on banks for not submitting the reports on time.
(www.markit.com/bcbs239, 2014)

Success Criteria:
− Consolidate all Risk engine data into a single data warehouse
− Meet SLAs required for time to load each respective Risk area (largest daily load is 10 million rows in under 15 minutes)
− Provide daily and weekly reporting for critical business functions
− Provide the ability to filter by custom dimensions

Environment:
− Risk engines on Solaris 5.5.1
− Full rack N2002-010, NPS 7.2.0.5, INZA 3.20, Fluid Query 1.6
− Smaller systems @ 50 GB, larger systems @ 9-10 TB
− ~200 tables ranging from 300 to 10 million records

Netezza was chosen as the platform for consolidation based on its capacity, performance and ability to leverage Fluid Query through a straightforward setup that met the estimated processing window. Netezza also had a cost saving in its favor (Fluid Query is a no-charge component).
Evaluating Options: ETL vs. Fluid Query

Environment: Fluid Query populates the EDW with data from critical Risk engines, then combines it with local data on PDA for comprehensive reporting for BCBS 239.
[Diagram: Risk engines, Fluid Query, PDA and Cognos Analytics]
Implementation
1. A custom utility extracts the DDL from the source and converts it to Netezza DDL. It provides data type mapping and allows the distribution key to be adjusted in PDA.
2. A stored procedure (SP) is called as Risk engines complete scoring and capital calculations.
3. The SP kicks off 10 connectors that are manually grouped against data in each DB.
Risk engine loads:
• 10 SPs invoked in parallel via nzsql (see the sketch below)
• Each SP pulls a different set of tables
• Logical partition by table
• Execution varies daily, weekly, monthly
• Simple SELECT *, no JOINs, INSERT INTO
• Runs off-hours
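A sketch of that invocation pattern, with illustrative database and procedure names; each nzsql call is backgrounded so the ten loads run in parallel:

# Launch one stored procedure per risk area in parallel, then wait for all.
for i in $(seq 1 10); do
    nzsql -d RISK_EDW -c "CALL LOAD_RISK_AREA_${i}();" &
done
wait    # the SLA is met only when every parallel load has finished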
Summary
− Fluid Query met the project deadline (fast time to market)
− Met regulatory compliance for BCBS 239
− No additional capital requirements
− No measurable operating costs
− Consistently meets all SLAs for their business window
− Straightforward implementation
− No need to transfer files or for additional storage space
− Fully native and lightweight package

Architecture Review
Largest Risk engine:
Source: Oracle Exadata
Total tables: 99
Total rows: 21.5 million
Data volume: 6.75 GB
Load time: < 12 minutes
Big Fish Games

[Diagram: installs, sessions, custom and streaming data flowing from on-premises sources through Fluid Query (export, SQL insert) into the EDW; the separate staging server is eliminated]

Fluid Query ships SQL requests to Hadoop (IBM BigInsights Big SQL v4.1) and MySQL to populate the EDW. Fluid Query 1.7.1 now supports export of the AVRO file format.

ETL takeout: Fluid Query replaced an entire ETL infrastructure with just two lines of code, making it quicker and easier to collect and process data for fast, insightful reporting.

Replaced components:
− Talend ETL software (open source)
− Separate staging server

Environment:
− Oracle 11g
− Netezza 3001-020, NPS 7.2.0.5, INZA 3.20, Fluid Query 1.7
− BigInsights Big SQL v4.1
Benefits of Fluid Query for Data Integration
• No-charge NPS software component
• Simple SQL front-end
• Ability to migrate RDBMS via SQL
• Automatic data type conversion
• No ETL staging server required
• No formal ETL required
• Allows for custom predicates
• Does not require additional skills
• Allows access to HOT data
• Flexible and easy to set up
Agenda: Fluid Query 1.7.1
Fluid Query 1.7.1 New Features

Hadoop Integration and Fast Data Movement:
– Support for Hive partitioning and Cluster By functionality
– Ability to export Hadoop files to PDA (including select DBs and tables)
– Automatic VIEW creation on PDA for data newly imported to Hadoop
– Bulk data copy support for Netezza external table options
– Checksum utility ensures data integrity when copying data between PDA and Hadoop

Access and Authentication:
– Automatic password encryption and storage control
– Automated kinit authentication for kerberized Hadoop services (sketched below)
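For context, this is what the automated kinit support replaces; a manual sketch with an illustrative principal and keytab path:

# Obtain a Kerberos ticket before Fluid Query touches a kerberized
# Hive/HDFS service; 1.7.1 now automates this step.
kinit -kt /etc/security/keytabs/fquser.keytab fquser@EXAMPLE.COM
klist    # verify the ticket cache before running fdm.sh or queries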
Updated support for data sources:
• BigInsights 4.2 / Big SQL 4.2
• Hortonworks 2.5
• Cloudera 5.9
• Spark 1.6.1
• MapR
• DB2 v11.1
Supported Database and Hadoop Providers
[Chart: provider support timeline, 2013-2017]

[Figure: Hive partition pruning example: partitioning a large table (10B rows) by month and day narrows scans to 50M and then 100K rows]
Import to Hadoop Using Hive Partitions
The fq.hive.partitioned.by property for FDM determines by which column(s) the imported table will be partitioned.
Hive partitioned tables allow queries with predicates to run much faster on Hadoop and encourage more efficient MapReduce job processing.

./fdm.sh -conf conf.xml -D fq.hive.partitioned.by='col1'

Note: some restrictions apply when using partitions and clusters.
Import to Hadoop Using Hive Clustering
New parameters for enabling Hive clustering:
– fq.hive.clustered.by
– fq.hive.clustered.buckets

There are two ways to set clustering:
– Default clustering: use the same value that was set for the table on NPS with the DISTRIBUTE ON parameter, by setting fq.hive.clustered.by to empty:
./fdm.sh -conf conf.xml -D fq.hive.clustered.by='' -D fq.hive.clustered.buckets='252'
– Cluster by specific columns:
./fdm.sh -conf conf.xml -D fq.hive.clustered.by='month' -D fq.hive.clustered.buckets='12'
(The Hive-side result of these options is sketched below.)
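On the Hive side, the two options above translate into ordinary PARTITIONED BY and CLUSTERED BY table definitions. A sketch of roughly what FDM produces for the commands shown, with illustrative column names and types:

# Rough Hive-side equivalent of an import with partitioning and clustering.
beeline -u jdbc:hive2://localhost:10000/default -e "
  CREATE TABLE sales (order_id INT, amount DOUBLE, month STRING)
  PARTITIONED BY (col1 STRING)            -- from fq.hive.partitioned.by='col1'
  CLUSTERED BY (month) INTO 12 BUCKETS;   -- from fq.hive.clustered.by='month'"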
Hive Table Export
Export Hive registered tables from Hadoop to PDA:
• Specify table and WHERE clause
• Export from Hadoop, or via SQL in NPS

New FDM parameters:
– fq.hive.tablename
– fq.hive.where

Execute from Hadoop via CLI:
./fdm.sh -conf conf.xml -D fq.command=export -D fq.hive.tablename=sales -D fq.hive.schema=xmas_campaign -D fq.hive.where='order_month=December and order_item=2345pt_aws'

Execute on PDA using the SQL-based fromHadoop() call:
call fromHadoop('', 'testtab', '', 'fq.input.path=', 'fq.hive.tablename=sales', 'fq.hive.schema=xmas_campaign', 'fq.hive.where=order_month=December and order_item=2345pt_aws');
Automated View Creation on Import to Hadoop
Enable using SQL-based toHadoop() on NPS:

$ ./fqConfigure.sh --service fqdm --provider ibm --fqdm-conf conf/fq-remote-conf_cloudera_115.xml --auto-connector federated_view --database kdtest --driver-path /nzscratch/BVT/fqdm_libs/BVT_Cloudera_115/ --config Cl115

log4j:WARN No appenders could be found for logger (com.ibm.nz.fq.FqConfiguration).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Database set to: kdtest
Fluid Query version: 1.7.1.0-B3 [Build 170214-354]
Checking access to HDFS for user: hdfs. Using the following URL: hdfs://cloudera560master.kraklab.pl.ibm.com:8020 ... [OK]
Checking connection to warehouse jdbc:netezza://9.167.40.45:5480/KDTEST ... [OK]
Checking map-reduce jobs ... [OK]
Init destination database engine ... [OK]
Checking Hive connection. jdbc:hive2://9.167.43.115:10000/default; ... [OK]
Done
Connection configuration success.

$ ./fqRegister.sh --config Cl115 --db kdtest --udtf toCl115,fromCl115
Functions and credentials are successfully registered in database "kdtest".

KDTEST.ADMIN(ADMIN)=> call toCl115('kdtest','ttab11','','','fq.append.mode=overwrite');
TOCL115
Fluid Query Data Movement finished successfully.
Auto connector enabled. View(s) with suffix toCl115_federated_view were created.
(112,234 rows)
FDM External Table Options
FDM now supports the use of Netezza external table options for configuring import/export between PDA and Hadoop.

New FDM parameter:
– fq.custom.exttab.options: allows multiple options, which supersede any other settings

<property>
  <name>fq.custom.exttab.options</name>
  <value>MaxErrors 1 SocketBufSize 8000000</value>
</property>

Available options include: BoolStyle, Compress, CRinString, CtrlChars, DataObject, DataDelim, DateStyle, DecimalDelim, Delimiter, Encoding, EscapeChar, FillRecord, Format, IgnoreZero, SkipRows, MaxRows, ...
The full list can be referenced on IBM Knowledge Center:
https://www.ibm.com/support/knowledgecenter/SSULQD_7.2.0/com.ibm.nz.load.doc/c_load_options.html
(A per-run CLI variant is sketched below.)
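The same option string can presumably also be supplied per run with -D rather than in conf.xml, mirroring the other FDM parameters in this release; a sketch, not a documented example:

# One-off override of the external table options from the CLI.
./fdm.sh -conf conf.xml -D fq.custom.exttab.options='MaxErrors 1 SocketBufSize 8000000'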
FDM Checksum Capability
Checksum functionality allows you to check data consistency after moving data between PDA and Hadoop using Fluid Query.

New FDM parameters:
– fq.checksum: indicates whether a checksum is calculated after the import/export.
  NONE: no checksum (default)
  FULL: calculate checksum using all columns in each table
  ROWCOUNT: only row count
  COLUMNS: columns specified by fq.checksum.columns
– fq.checksum.columns: the list of columns for checksum calculation. When used with multiple tables for import/export, the checksum is calculated only on the listed columns within the tables; otherwise only a row count check is performed.

FDM Checksum by ROWCOUNT
ROWCOUNT will only check the count of rows between the source and target.
<property>
  <name>fq.checksum</name>
  <value>ROWCOUNT</value>
</property>

Command:
fdm.sh -conf file.xml -D fq.checksum=rowcount

Output:
2017-02-21 13:36:38,078 132941 [main] INFO com.ibm.nz.fq.cksum.TableInfo - The checksum calculated for the transferred data for table ADMIN.TAB1 returned identical values on both systems (rows: 100, sum 0)
2017-02-21 13:36:38,081 132944 [main] INFO com.ibm.nz.fq.NzTransfer - Import summary:
Used filter: ADMIN.tab1
Found 1 tables matching the filter: TAB1
Imported 1 out of 1 table(s): TAB1
==================================== CHECKSUM REPORT =====================================
TABLE        STATUS   PDA CNT   HD CNT   PDA SUM   HD SUM
ADMIN.TAB1   EQUAL    100       100      0         0
FDM Checksum by FULL
FULL will check all columns for every row between the source and target.
<property>
  <name>fq.checksum</name>
  <value>FULL</value>
</property>

Command:
fdm.sh -conf file.xml -D fq.checksum=full

Output:
2017-02-21 13:22:45,861 123232 [main] INFO com.ibm.nz.fq.cksum.TableInfo - The checksum calculated for the transferred data for table ADMIN.TAB1 returned identical values on both systems (rows: 100, sum 153.7944052845138)
2017-02-21 13:22:45,865 123236 [main] INFO com.ibm.nz.fq.NzTransfer - Import summary:
Used filter: ADMIN.tab1
Found 1 tables matching the filter: TAB1
Imported 1 out of 1 table(s): TAB1
==================================== CHECKSUM REPORT =====================================
TABLE        STATUS   PDA CNT   HD CNT   PDA SUM              HD SUM
ADMIN.TAB1   EQUAL    100       100      153.7944052845138    153.7944052845138
FDM Checksum by COLUMNS
COLUMNS will check every row for the selected columns across the source and target.
<property>
  <name>fq.checksum</name>
  <value>COLUMNS</value>
</property>
<property>
  <name>fq.checksum.columns</name>
  <value>CALL_COUNT</value>
</property>

Command:
fdm.sh -conf file.xml -D fq.checksum=column -D fq.checksum.columns="CALL_COUNT"

Output:
2017-02-21 13:51:34,075 130493 [main] INFO com.ibm.nz.fq.cksum.TableInfo - The checksum calculated for the transferred data for table ADMIN.TAB1 returned identical values on both systems (rows: 100, sum 154.0898104558758)
2017-02-21 13:51:34,078 130496 [main] INFO com.ibm.nz.fq.NzTransfer - Import summary:
Used filter: ADMIN.tab1
Found 1 tables matching the filter: TAB1
Imported 1 out of 1 table(s): TAB1
==================================== CHECKSUM REPORT =====================================
TABLE        STATUS   PDA CNT   HD CNT   PDA SUM              HD SUM
ADMIN.TAB1   EQUAL    100       100      154.0898104558758    154.0898104558758
Password Encryption and Key Storage

Manually generate the encryption key (the user is responsible for the security of the encryption key):
Generate a 128-bit key and store it in a file:
dd if=/dev/urandom of=KeyFile_Name bs=16 count=1
Provide this keyfile to the fqConfigure.sh script:
./fqConfigure.sh --host 9.167.40.23 --provider horton --service hive --port 10000 --username root --config CUST_TEST_properties/userprovidedkeyconfig --keyFile keyfile

Using the autoGenerateKey feature (auto-generated key files have read-only permission for their owner):
Generate and store the key at a user-provided location:
./fqConfigure.sh --host 9.167.40.23 --provider horton --service hive --port 10000 --username root --config CUST_TEST_properties/autogenkeyfileconfig --autoGenerateKey --keyFileOut /tmp/keyfile
Generate and store the key in a file at the default location /nz/export/ae/products/fluidquery/AutoGenKeys/autoGeneratedkey:
./fqConfigure.sh --host 9.167.40.23 --provider horton --service hive --port 10000 --username root --config CUST_TEST_properties/autogenkeyfileconfig --autoGenerateKey
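Since the user owns the security of a manually generated key, it is worth mirroring the read-only permission that --autoGenerateKey applies. A sketch:

# Restrict the manually generated key to owner read-only, matching the
# permission Fluid Query sets on auto-generated key files.
chmod 400 KeyFile_Name
ls -l KeyFile_Name    # expect -r-------- (owner read-only)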
Agenda: Uniformity across IBM
One Vision Across our Portfolio
1. Enable query federation across all IBM Analytics repositories
2. Deliver an integrated out-of-the-box experience (installation and configuration)
3. Distribute a robust set of ODBC/JDBC drivers
4. Easily define, manage and monitor federation via the DSM or console user interface
5. Provide native bulk data copy with Hadoop across all IBM database sources
[Diagram: federation technology embedded across the portfolio: Big SQL on BigInsights/IOP (Hive or Spark SQL), dashDB Local, dashDB Analytics (ground/cloud), and Netezza Fluid Query on IBM PDA and PDOA]
IBM Fluid Query: Unifying Agent for Hybrid Analytics

Connect: hybrid environments; ODBC/JDBC connectivity; on-premises and cloud data sources; Cloudera, BigInsights, Hortonworks, Pivotal and MapR

Query: intelligent query routing; cost-based optimizer; SQL pushdown; local data caching; ANSI-compliant SQL

Monitor: easily manage federation through a single pane; simple point and click to discover and query; monitor and visualize active queries

Move: bulk data copy to and from Hadoop; parallel transfer; filtered subsets of data; support for AVRO, ORC, Parquet and RCFile; checksum validation

Access: provision hybrid cloud environments; on-premises, cloud and Hadoop data; intelligent query routing; bulk data transfer
The manual federation setup this replaces:
1. Address data source client install prerequisites
2. Install data source clients
3. Grant file system permissions to fenced user ID
4. Register data source server
5. Enable federation
6. Configure DB2 registry variable for data source driver
7. Configure environment variable for data source driver
8. Configure db2dj.ini file for data source driver
9. Restart DB2 instance
10. Create wrapper
11. Create server
12. Create label
13. Query remote data
(Steps 10-13 are sketched in SQL below.)
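For comparison, steps 10 through 13 of that manual flow are plain DB2 SQL. A hedged sketch against an Oracle source, where server, schema and credential values are illustrative, a real setup also needs the wrapper library and node configured, and step 12's "create label" presumably corresponds to creating a nickname:

# Steps 10-13 of the manual DB2 federation flow, sketched for Oracle.
db2 "CREATE WRAPPER net8"                                         # 10. create wrapper
db2 "CREATE SERVER ora_srv TYPE oracle VERSION '11' WRAPPER net8 OPTIONS (NODE 'ORCL')"   # 11. create server
db2 "CREATE USER MAPPING FOR USER SERVER ora_srv OPTIONS (REMOTE_AUTHID 'scott', REMOTE_PASSWORD 'tiger')"
db2 "CREATE NICKNAME orders FOR ora_srv.SCOTT.ORDERS"             # 12. nickname for the remote table
db2 "SELECT COUNT(*) FROM orders"                                 # 13. query remote data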
Simplified User Experience: Point and Click
Run SQL against remote objects seamlessly and export to Excel format. Review the visual explain and pushdown of the remote query.
Monitor Queries that Access Remote Data
Questions?
Type your question in the Q&A panel on your screen.
PRODUCT LINKS:
IBM Fix Central:https://www.ibm.com/support/fixcentral/
PureData System for Analytics Support Page:http://ibm.biz/pda_support
IBM Knowledge Center – PureData: http://ibm.biz/pda_knowledgecenter
For more information
Web Site: IBM PureData System for Analytics http://www.ibm.com/software/data/puredata/analytics/system/
Blogs/Articles: IBM Big Data & Analytics Hub – http://www.ibmbigdatahub.com
Community: Upcoming & On-Demand Webinarshttp://ibm.biz/dwwebinars
Enzee Community: http://ibm.biz/enzeecommunity
Make sure to JOIN the community to get the latest updates and join in on the conversation! [Select "Log In" in the top right hand of the screen to register and JOIN.]
Take dashDB Local for a Spin! http://ibm.biz/dashDBLocal
TRY!!
Next Webinar
TechTalk: Client Self-Service Series
Maintenance Tasks Preparation (for PureData System for Analytics clients)
May 4 @ 11 AM ET
http://ibm.biz/enzee_0504
Virtual Enzee Schedule and On-Demand replays are available at:
http://ibm.biz/dw_webinars
Thank you for Joining Virtual Enzee
Previous topics include:
• TechTalk: Replication Troubleshooting
• TechTalk: PureData System for Analytics Complete Preventive Health Check
• TechTalk: Life Saving Checkup and Upgrade of your Netezza Platform Software (NPS)
• Unifying Data Access Across the Logical Data Warehouse with IBM Fluid Query
• And more…
Follow Us on Social: #Enzee @IBMNetezza @IBMdataWH @IBMdashDB