TRANSCRIPT
When queries are rerouted from DB2 to IDAA, they typically perform better
by one to two orders of magnitude. This presentation shows how data
distribution and clustering within IDAA can further improve response times.
We also discuss approaches to automatically detect and implement efficient
distribution and organizing keys by analyzing IDAA data access patterns and
data statistics. Reporting IDAA benefits to management will also be
highlighted. Finally, we compare IDAA to Oracle's In-Memory Column
Store option from a technical perspective, including a comparison of query
response times for different OLAP queries.
1
2
Disclaimer:
The information contained in this presentation has not been submitted to any formal Swiss Mobiliar or other review and is distributed on an 'as is' basis without any warranty, either expressed or implied. The use of this information is the user's responsibility. The procedures, results and measurements presented in this paper were run in either the test and development environment or the production environment at Swiss Mobiliar in Berne, Switzerland. There is no guarantee that the same or similar results will be obtained elsewhere. Users attempting to adapt these procedures and data to their own environments do so at their own risk. All procedures presented have been designed and developed for educational purposes only.
3
4
5
6
7
8
9
10
11
In contrast to many other IDAA installations, the scope of IDAA at Swiss
Mobiliar is focused on operational data, i.e. data of the core
information systems, and not on replicated data typically used for decision
support systems.
Additionally, data from other platforms is also replicated into IDAA in order
to benefit from Netezza's MPP architecture, which produces much better
response times, and to allow joining this non-DB2 data with DB2 data in an
efficient way.
12
13
So far, existing workload has been rerouted to IDAA and optimized for
speed. Due to the reduced response times, many more such reports were
produced.
In order to create even more opportunities, the focus now switches to making
the business aware of new types of queries, new business functions, etc.:
increase the business awareness of IDAA.
14
After installation and first tests, ad-hoc analytical queries were rerouted to
IDAA, followed by scheduled workload, COBOL programs, SQL code in
Excel macros, reporting tools, etc.
From a business perspective, the first IDAA-based application was log
analysis on DB2 tables to derive insights into application usage and access
patterns, followed by improved end-of-month processing, still more ad-hoc
reports from a larger user base, and improved ETL flows.
Eventually, a couple of DB2 secondary indexes could be removed, and
physical database design options such as "append on insert" or member
clustering were applied more often, leading to even better response times
and reduced CPU consumption.
15
We don't store Facebook or Twitter data in DB2, not even the session information of internet
users of our web pages. As IDAA is DB2-based, such information is not within the scope of
IDAA. But to make business units aware of the possibilities of IDAA-based reports, we
analyzed the behaviour of our agencies' employees when it comes to using our CRM system.
Instructions exist on using it as a primary hub before switching to applications containing
more detailed information. A large bubble in the slide highlights an agency where this policy
is strictly followed; the smaller the bubbles, the less this policy is applied, with employees
directly accessing the detail applications without starting at the CRM system.
This information on internal system usage showed some surprising results, and as the
business units are directly concerned, their awareness of this kind of report, and of IDAA in
general, is raised.
But what about real-time analytics? The slide above represents an observation interval of
14 months. How did they do this morning?
Most agencies don't seem to be much impressed by those statistics. However, at least some
of them changed their behaviour: see Hochdorf in the Lucerne region (marked in green),
which progressed from average to top.
16
17
18
19
20
21
Ad-hoc queries and other analytical workload running directly on DB2 often
require indexes, which means that these queries must be known in advance
and optimized by your DBA. In other words, this is an expensive and time-
consuming process: if you come up with a new idea not yet supported by
indexes or MQTs, they have to be built first. If these queries run without
proper support, they can severely impact online transaction response
times. Another downside of indexes is their need for space, and the resources
necessary to keep them updated. Eventually, the whole system either gets
over-indexed or new queries run without index support; either way, the effect
is that end users no longer query the database because response times have
become unacceptable, and therefore the information residing in the database
is no longer analyzed.
22
23
Data residing in IDAA is stored and accessed in a column-based
paradigm rather than the row-oriented paradigm used by DB2. This makes the
amount of data to be scanned much smaller: a typical analytical query reads a
few attributes of very many rows and has to access the referenced columns
only, not the whole table as is the case for row-based access. Thus, indexes
become obsolete for this kind of access. Furthermore, data compression is
much more effective on IDAA compared to DB2, and inherent parallelism for
query processing is widely used without much administration effort due to
Netezza's broad usage of its MPP (massively parallel processing) architecture.
The downside of this technology is that directly accessing a single table row is
as complex as a scan through the whole table, and updating a single row
becomes very expensive.
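The scan-volume advantage of column-oriented storage can be illustrated with a back-of-the-envelope calculation. This is a simplified sketch with made-up table dimensions, not a description of IDAA internals:

```python
# Simplified sketch: why a column store scans less data than a row store.
# Table dimensions and value sizes below are illustrative assumptions.
rows, cols, bytes_per_value = 1_000_000, 20, 8
referenced_cols = 2  # a typical analytical query touches few attributes

row_store_bytes = rows * cols * bytes_per_value             # whole rows read
col_store_bytes = rows * referenced_cols * bytes_per_value  # referenced columns only

print(row_store_bytes // col_store_bytes)  # -> 10, i.e. 10x less data scanned
```

Compression on top of this (which works especially well on sorted, low-cardinality columns) widens the gap further.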
24
25
26
Rules of thumb for selecting distribution keys:
- A random distribution key provides good access paths.
- For tables with more than 100 million rows, an explicit distribution key
  should be selected.
- The choice of the distribution key is driven by:
  - data skew
  - processing skew
  - avoidance of data redistribution
- A distribution key should consist of only one column.
- Use a random distribution key for small reference tables.
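The interplay of key cardinality and skew can be sketched in a few lines. This is a hypothetical model of hash distribution across MPP worker nodes; the worker count and key values are made up and do not reflect Netezza's actual hashing:

```python
# Hypothetical sketch: hash-distributing rows across MPP worker nodes.
from collections import Counter

def rows_per_worker(keys, workers=4):
    """Assign each row to a worker by hashing its distribution-key value."""
    counts = Counter(hash(k) % workers for k in keys)
    return [counts.get(w, 0) for w in range(workers)]

# Low-cardinality key (e.g. a status code): rows pile up on few workers,
# so one worker does most of the scanning (processing skew).
skewed = rows_per_worker([1] * 900 + [2] * 100)
# High-cardinality key (e.g. a policy number): rows spread evenly.
even = rows_per_worker(range(1_000))

print(max(skewed) - min(skewed))  # large spread between workers
print(max(even) - min(even))      # -> 0
```

A scan is only as fast as the busiest worker, which is why a skewed key wastes most of the MPP parallelism.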
27
28
29
30
Rules of thumb for selecting organizing keys:
- Small tables don't benefit from organizing keys, due to the small
  amount of data to be scanned.
- Large tables (> 1 million records) benefit most, assuming that
  queries restrict on column values which are physically
  scattered across the table.
- There is no preference for any of the organizing key columns.
- Not all organizing keys need to be referenced in a query
  for organizing keys to improve query performance.
- All data types are supported:
  - only the first 8 bytes of CHAR data types are considered
  - for numerics, up to 18 digits are considered
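The reason scattered column values benefit from organizing keys is the zone-map idea: per data extent, the minimum and maximum of the organized column are recorded, so a scan can skip extents that cannot match. The sketch below is a simplified model of that idea, not Netezza's actual on-disk format:

```python
# Simplified zone-map sketch: per extent, record min/max of the organized
# column; a scan skips extents whose range cannot contain the predicate value.

def build_zone_map(values, extent_size=1_000):
    extents = [values[i:i + extent_size] for i in range(0, len(values), extent_size)]
    return [(min(e), max(e)) for e in extents]

def extents_to_scan(zone_map, predicate_value):
    return [i for i, (lo, hi) in enumerate(zone_map) if lo <= predicate_value <= hi]

# Organized (clustered) by the key column: one extent needs scanning.
clustered = sorted(range(10_000))
# Scattered: every value occurs in every extent, so nothing can be skipped.
scattered = [v % 10 for v in range(10_000)]

print(len(extents_to_scan(build_zone_map(clustered), 4321)))  # -> 1
print(len(extents_to_scan(build_zone_map(scattered), 4)))     # -> 10
```

This also explains why small tables gain little: with only a handful of extents, there is hardly anything to skip.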
31
32
33
34
The following query calculates the column skew value displayed on the slide.
Example for CARDF = 54321 (the top 10% of the distinct values, i.e. 5432 rows):

select (100.0 * sum(no) / (select count(*) from T1) - 10.0)
from
  (select C1, dec(count(*), 15, 3) as no
   from T1
   group by C1
   order by 2 desc fetch first 5432 rows only) t

This query calculates the percentage of rows in the table whose column value
is among the 10% most frequent column values, minus the 10% expected for a
uniform distribution. In general, replace the fixed 10% value (n = 0.1) with
any individually selected value for n:

select (100.0 * sum(no) / (select count(*) from T1) - 100 * n)
from
  (select C1, dec(count(*), 15, 3) as no
   from T1
   group by C1
   order by 2 desc fetch first CARDF * n rows only) t
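For checking the formula against sample data outside of DB2, the same skew measure can be re-implemented in a few lines of Python. The function and test data below are our own illustrative construction, not part of the IDAA tooling:

```python
# Illustrative re-implementation of the skew formula: share of rows held by
# the n*100 % most frequent values, minus the n*100 % a uniform column yields.
from collections import Counter

def column_skew(values, n=0.1):
    freq = sorted(Counter(values).values(), reverse=True)
    top = freq[: int(len(freq) * n)]  # the CARDF * n most frequent values
    return 100.0 * sum(top) / len(values) - 100.0 * n

uniform = list(range(1_000)) * 10                 # every value equally frequent
dominated = [0] * 9_000 + list(range(1, 1_001))  # one value dominates

print(column_skew(uniform))    # -> approximately 0 (no skew)
print(column_skew(dominated))  # large positive value
```

A value near 0 means the column is a reasonable distribution-key candidate; a large positive value signals data skew.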
35
36
37
38
//AQTSCS03 EXEC PGM=IKJEFT01
//* parameter #1 for accelerator name
//AQTP1 DD *
IDAADB2P
/*
//* parameter #2 for alter containing tables specification
//AQTP2 DD *
<?xml version="1.0" encoding="UTF-8"?>
<aqttables:tableSpecifications
xmlns:aqttables="http://www.ibm.com/xmlns/prod/dwa/2011" version="1.0">
<table name="TEO_TDSTAT2" schema="DB2PROD">
<distributionKey>
<column name="C63654"/>
<column name="C32006"/>
</distributionKey>
<organizingKey name="C63654"/>
<organizingKey name="C63655"/>
</table>
</aqttables:tableSpecifications>
/*
//* parameter #3 for message input to control trace
//AQTMSGIN DD *
<?xml version="1.0" encoding="UTF-8" ?>
<spctrl:messageControl
xmlns:spctrl="http://www.ibm.com/xmlns/prod/dwa/2011"
version="1.0" versionOnly="false" >
</spctrl:messageControl>
/*
//SYSTSPRT DD SYSOUT=*
//SYSPRINT DD SYSOUT=*
//SYSUDUMP DD SYSOUT=*
//SYSTSIN DD *
DSN SYSTEM(DB2P)
RUN PROGRAM(AQTSCALL) PLAN(AQTSCALL) -
LIB('SYS1.DAA310B.U.LOAD') PARMS('ALTERTABLES')
END
/*
39
40
41
42
43
44
45
By avoiding the unnecessary movement of data off-platform, real-time
analytics becomes possible: decisions are made based on the most accurate
data available, not on some stale copy. Integrating analytics technologies
with transactional systems enables insights to be injected directly into
operational decision processes. Analytics will share the same
business-critical support that operational systems enjoy today.
46
47