Experiences with Real-Time Data Warehousing Using Oracle Database 10g
Mike Schmitz
High Performance Data Warehousing
[email protected]
Michael Brey
Principal Member Technical Staff
ST/NEDC Oracle Engineering
Oracle Corporation
Agenda
The meaning of Real-Time in Data Warehousing
Customer Business Scenario
  Customer Environment
  “Real-Time” Requirement
Our Real-Time Solution
  Real-Time data architecture
  Incremental Operational Source Change Capture
  Transformation and Population into DW Target
Simplified Functional Demonstration
Asynchronous Change Data Capture (Oracle)
  Performance Characteristics and Considerations
My Background
An independent data warehousing consultant specializing in the dimensional approach to data warehouse / data mart design and implementation, with in-depth experience utilizing efficient, scalable techniques whether dealing with large-scale data warehouses or small-scale, platform-constrained data mart implementations. I deliver dimensional design and implementation as well as ETL workshops in the U.S. and Europe.
I have helped implement data warehouses using Redbrick, Oracle, Teradata, DB2, Informix, and SQL Server on mainframe, UNIX, and NT platforms, working with small and large businesses across a variety of industries, including such customers as Hewlett Packard, American Express, General Mills, AT&T, Bell South, MCI, Oracle Slovakia, J.D. Power and Associates, Mobil Oil, The Health Alliance of Greater Cincinnati, and the French Railroad SNCF.
Real-Time in Data Warehousing
Data Warehousing Systems are complex environments
  Business rules
  Various data process flows and dependencies
Almost never pure Real-Time
  Some latency is a given
What do you need?
  Real-Time
  Near Real-Time
  Just in Time for the business
Customer Business Scenario
Client provides software solutions for utility companies
Utility companies have plants generating energy supply
  Recommended maximum output capacity
  Reserve capacity
  Buy supplemental energy as needed
Peak demand periods are somewhat predictable
  Each day is pre-planned on historical behavior
  Cheaper to buy energy ahead
  Expensive to have unused capacity
Existing data warehouse supports the planning function
  Reduced option expenses
  Reduced supplemental energy costs
Customer “Real-Time” Requirement
Getting more in-time accuracy enhances operational business
  Compare today's plant output volumes to yesterday's or last week's average
  Know when to purchase additional options or supplies
Customer Target
  Actual data within a 5-minute lag
  Use a single query
  Use a single tool
Sample Analysis Graph
[Figure: Plant A output from 8am to 10am, y-axis 0 to 100,000; series: Today, Yesterday, Last Week Avg, Max]
Our Real-Time Solution: Overview
Three-Step Approach:
1. Implement a real-time DW data architecture
2. Near real-time incremental change capture from operational system
3. Transformation and propagation (population) of change data to DW
Our Real-Time Solution: Real-Time DW Data Architecture
Add a Real-Time “Partition” to our Plant Output Fact Table for current day activity
  Separate physical table
  No indexes or referential integrity (RI) constraints during daily activity (incoming data already has RI enforced)
  Combined with the Plant Output Fact Table through a UNION ALL view
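The architecture above can be sketched with SQLite standing in for Oracle. This is a minimal illustration under assumptions, not the presenters' DDL: the table, view, and column names follow the demo schema shown later in the deck, while the key values, the quantities, and the idea that consecutive days have consecutive day keys are made up for the example.

```python
import sqlite3

# SQLite stands in for Oracle; key values and quantities are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Historical fact table (partitioned and indexed in the real warehouse).
    CREATE TABLE f_plant_output (
        output_day_key INTEGER, output_minute_key INTEGER,
        generating_plant_key INTEGER, output_actual_qty_in_kwh INTEGER);
    -- Current-day "real-time partition": same columns, no indexes or constraints.
    CREATE TABLE f_current_day_plant_output (
        output_day_key INTEGER, output_minute_key INTEGER,
        generating_plant_key INTEGER, output_actual_qty_in_kwh INTEGER);
    -- One view, so a single query sees history and today together.
    CREATE VIEW v_plant_output AS
        SELECT * FROM f_plant_output
        UNION ALL
        SELECT * FROM f_current_day_plant_output;
""")
conn.executemany("INSERT INTO f_plant_output VALUES (?, ?, ?, ?)",
                 [(100, 800, 1, 52000), (100, 801, 1, 53000)])  # yesterday
conn.executemany("INSERT INTO f_current_day_plant_output VALUES (?, ?, ?, ?)",
                 [(101, 800, 1, 61000)])                         # today so far

# Single query comparing today with yesterday for the same plant and minute.
# Assumption: yesterday's day key is today's day key minus one.
compare_sql = """
    SELECT t.output_minute_key,
           t.output_actual_qty_in_kwh AS today_kwh,
           y.output_actual_qty_in_kwh AS yesterday_kwh
    FROM v_plant_output t
    JOIN v_plant_output y
      ON y.generating_plant_key = t.generating_plant_key
     AND y.output_minute_key = t.output_minute_key
     AND y.output_day_key = t.output_day_key - 1
    WHERE t.output_day_key = 101
"""
comparison = conn.execute(compare_sql).fetchall()
view_row_count = conn.execute("SELECT COUNT(*) FROM v_plant_output").fetchone()[0]
print(comparison)
```

Because the query goes through the view, the reporting tool does not need to know which physical table holds which day.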
Our Real-Time Solution: Change Capture and Population
1. Incremental change capture from operational site
  Synchronous or Asynchronous
2. Transformation and propagation (population) of change data to the DW
  Continuous trickle feed or periodic batch
[Diagram: Operations, Staging, and DW, connected by Synch CDC / Asynch CDC and by Trigger / Batch feeds]
Our Real-Time Solution: Incremental Change Capture
Done with Oracle’s Change Data Capture (CDC) functionality
  Synchronous CDC available with Oracle9i
  Asynchronous CDC with Oracle10g
Asynchronous CDC is the preferred mechanism
  Decoupling of change capture from the operational transaction
Asynchronous CDC
SQL interface to change data
Publish/subscribe paradigm
Parallel access to log files, leveraging Oracle Streams
Parallel transformation of data
[Diagram: OLTP DB, redo log files, logical change data (based on LogMiner), transform (SQL, PL/SQL, Java), Oracle10g DW tables]
Our Real-Time Solution: Population of Change Data into DW
Continuous
  Change table owner creates trigger to populate warehouse real-time partition
Periodic Batch
  Utilize the Subscribe Interface
    Subscribe to specific table and column changes through view
    Sets a window and extracts the changes at required period
    Purges view and moves window
Our Real-Time Solution: The Daily Process
Integrate daily changes into historical fact table
At the end of the day
  Index the current day table and apply constraints (no validate)
  Create new fact table partition
  Exchange current day table with new partition
  Create next day's “Real-Time Partition” table
Simplified Functional Demo: Schema Owners
AO_CDC_OP
  Owns the operational schema
AO_CDC
  Owns the CDC change sets and change tables (needs special CDC privileges): CDC Publish Role
AO_CDC_DW
  Owns the data warehouse schema (also needs special CDC privileges): CDC Subscribe Role
Simplified Functional Demo: Operational Schema
Simplified Functional Demo: Data Warehouse Schema
D_GENERATING_PLANT
  GENERATING_PLANT_KEY: NUMBER(4)
  PLANT_ID: VARCHAR2(24)
  PLANT_NAME: VARCHAR2(32)
  PLANT_STATUS: VARCHAR2(15)
  PLANT_TARGET_MAX_CAPACITY_KWH: NUMBER(15)
  PLANT_ABSOL_MAX_CAPACITY_KWH: NUMBER(15)
  UPDATE_TS: TIMESTAMP(6)
D_OUTPUT_MINUTE
  OUTPUT_MINUTE_KEY: NUMBER
D_OUTPUT_DAY
  OUTPUT_DAY_KEY: NUMBER
F_CURRENT_DAY_PLANT_OUTPUT
  OUTPUT_DAY_KEY: NUMBER(7)
  OUTPUT_MINUTE_KEY: NUMBER(4)
  GENERATING_PLANT_KEY: NUMBER(4)
  OUTPUT_ACTUAL_QTY_IN_KWH: NUMBER(15)
F_PLANT_OUTPUT
  OUTPUT_DAY_KEY: NUMBER
  OUTPUT_MINUTE_KEY: NUMBER
  GENERATING_PLANT_KEY: NUMBER(4)
  OUTPUT_ACTUAL_QTY_IN_KWH: NUMBER(15)
What do we have?
Operational transaction table
  AO_CDC_OP.PLANT_OUTPUT
DW historical partitioned fact table
  AO_CDC_DW.F_PLANT_OUTPUT
DW current day table (“Real-Time Partition”)
  AO_CDC_DW.F_CURRENT_DAY_PLANT_OUTPUT
Data Warehouse UNION ALL view
  AO_CDC_DW.V_PLANT_OUTPUT
First
The CDC user publishes
  Create a Change Set (CDC_DW)
  Add supplemental logging for the operational table
  Create a change table for the operational table (CT_PLANT_OUTPUT)
  Force database logging on the tablespace to catch any bulk insert /*+ APPEND */ (non-logged) activity
Next – Transform and Populate
One of two ways
Continuous Feed
  Logged insert activity
  Permits nearer real-time
  Constant system load
Periodic Batch Feed
  Permits non-logged bulk operations
  You set the lag time: how often do you run the batch process?
    Hourly
    Every five minutes
  Less system load overall
The Continuous Feed
Put an insert trigger on the change table which joins to the dimension tables picking up the dimension keys and does any necessary transformations
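A minimal sketch of such a trigger, with SQLite standing in for Oracle. The trigger name, the sample rows, and the text timestamp format are assumptions for illustration; the table and column names follow the demo schema, and the day key lookup is left out for brevity.

```python
import sqlite3

# SQLite stands in for Oracle; the trigger name and sample data are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE d_generating_plant (generating_plant_key INTEGER, plant_id TEXT);
    CREATE TABLE d_output_minute (output_minute_key INTEGER, output_time_24hr_nbr INTEGER);
    -- Change table filled by the capture process.
    CREATE TABLE ct_plant_output (plant_id TEXT, output_ts TEXT, output_in_kwh INTEGER);
    -- Current-day fact table (day key omitted in this sketch).
    CREATE TABLE f_current_day_plant_output (
        generating_plant_key INTEGER, output_minute_key INTEGER,
        output_actual_qty_in_kwh INTEGER);

    -- Each captured row is transformed and pushed to the warehouse immediately.
    CREATE TRIGGER trg_ct_plant_output_feed
    AFTER INSERT ON ct_plant_output
    BEGIN
        INSERT INTO f_current_day_plant_output
        SELECT p.generating_plant_key, m.output_minute_key, NEW.output_in_kwh
        FROM d_generating_plant p
        JOIN d_output_minute m
          ON m.output_time_24hr_nbr = CAST(strftime('%H%M', NEW.output_ts) AS INTEGER)
        WHERE p.plant_id = NEW.plant_id;
    END;
""")
conn.execute("INSERT INTO d_generating_plant VALUES (7, 'PLANT_A')")
conn.execute("INSERT INTO d_output_minute VALUES (496, 815)")

# A change arriving in the change table fires the trigger.
conn.execute("INSERT INTO ct_plant_output VALUES ('PLANT_A', '2004-09-09 08:15:00', 61000)")
fact_rows = conn.execute("SELECT * FROM f_current_day_plant_output").fetchall()
print(fact_rows)
```

The trade-off is the one named on the previous slide: every change row pays the cost of the dimension lookups as it arrives, which keeps latency low but the load constant.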
The Batch Feed
The CDC schema owner
  Authorizes AO_CDC_DW to select from the change table (the select will be accomplished via a generated view)
The DW schema owner
  Subscribes to the change table and the columns he needs (with a centralized EDW approach this would usually be the whole change table) with a subscription and view name
  Activates the subscription
Extract
  Extends the window
  Extracts changed data via the view (same code as trigger)
  Purges the window (logical delete; physical deletion is handled by the CDC schema owner)
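The extend, extract, purge cycle can be modeled in a few lines of plain Python. This is only a conceptual sketch: the class and method names are hypothetical and echo the steps on the slide; they are not Oracle's subscribe API.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeTable:
    """Append-only list of captured changes, owned by the CDC schema owner."""
    rows: list = field(default_factory=list)

    def append(self, row):
        self.rows.append(row)


@dataclass
class Subscription:
    """Subscriber-side window [low, high) over a change table."""
    change_table: ChangeTable
    low: int = 0
    high: int = 0

    def extend_window(self):
        # Move the high-water mark to include everything captured so far.
        self.high = len(self.change_table.rows)

    def view(self):
        # What the subscriber view exposes: only rows inside the window.
        return self.change_table.rows[self.low:self.high]

    def purge_window(self):
        # Logical delete: the subscriber stops seeing rows it has processed.
        # The rows stay in the change table until its owner removes them.
        self.low = self.high


def run_batch(sub, load):
    """One batch cycle: extend the window, load what the view shows, purge."""
    sub.extend_window()
    batch = sub.view()
    load(batch)
    sub.purge_window()
    return len(batch)


ct = ChangeTable()
sub = Subscription(ct)
loaded = []

ct.append({"plant_id": "PLANT_A", "output_in_kwh": 61000})
ct.append({"plant_id": "PLANT_A", "output_in_kwh": 61500})
first_batch = run_batch(sub, loaded.extend)

ct.append({"plant_id": "PLANT_A", "output_in_kwh": 62000})
second_batch = run_batch(sub, loaded.extend)
print(first_batch, second_batch, len(loaded))
```

Running the batch every five minutes versus every hour changes only how often `run_batch` is called; each change row is still loaded exactly once.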
Extraction from Change Table View
insert /*+ APPEND */ into ao_cdc_dw.F_CURRENT_DAY_PLANT_OUTPUT
  (generating_plant_key, output_day_key, output_minute_key, output_actual_qty_in_kwh)
select p.generating_plant_key
      ,d.output_day_key
      ,m.output_minute_key
      ,new.output_in_kwh
from ao_cdc_dw.PO_ACTIVITY_VIEW new
  inner join ao_cdc_dw.d_generating_plant p
    on new.plant_id = p.plant_id
  inner join ao_cdc_dw.d_output_day d
    on trunc(new.output_ts) = d.output_day
  inner join ao_cdc_dw.d_output_minute m
    -- HH24MI yields the 24-hour time as a number, e.g. 08:15 becomes 815
    on to_number(to_char(new.output_ts, 'HH24MI')) = m.output_time_24hr_nbr;
Next Step
Add the current day's activity (the contents of the current day fact table) to the historical fact table as a new partition
  Index and apply constraints to the current day fact table
  Add a new empty partition to the fact table
  Exchange the current day fact table with the partition
  Create the new current day fact table
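The four end-of-day steps can be sketched as a small Python model. It is an illustration only: the function name and the dict-of-lists representation are hypothetical, the validation check is a stand-in for indexing and applying constraints, and swapping list references merely imitates the fact that a partition exchange moves no rows.

```python
def end_of_day(fact_partitions, current_day_rows, day_key):
    """Roll the current-day table into the partitioned fact table.

    fact_partitions: dict mapping day key -> list of rows (one list per partition).
    current_day_rows: list of row dicts loaded during the day.
    Returns a new, empty current-day table for the next day.
    """
    if day_key in fact_partitions:
        raise ValueError(f"partition {day_key} already exists")

    # Step 1: stand-in for "index and apply constraints":
    # every row must belong to the day being closed.
    if any(row["output_day_key"] != day_key for row in current_day_rows):
        raise ValueError("current-day table contains rows for another day")

    # Step 2: add a new empty partition to the fact table.
    fact_partitions[day_key] = []

    # Step 3: exchange. The two tables trade places; no row is copied.
    fact_partitions[day_key], current_day_rows = current_day_rows, fact_partitions[day_key]

    # Step 4: create the new current-day table.
    return []


partitions = {100: [{"output_day_key": 100, "output_actual_qty_in_kwh": 52000}]}
today = [{"output_day_key": 101, "output_actual_qty_in_kwh": 61000}]

next_day_table = end_of_day(partitions, today, 101)
print(sorted(partitions), len(partitions[101]), next_day_table)
```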
Let’s step thru this live
Summary
We created a real-time partition for current day activity
We put CDC on the operational table and created a change table populated by an asynchronous process (reads redo log)
We demonstrated continuous feed to the DW by using a trigger-based approach
We demonstrated a batch DW feed by using the CDC subscribe process
We showed how to add the current day table to the fact table and set up the next day's table
An electronic copy of the SQL used to build this prototype is available by emailing [email protected]
Michael Brey
Principal Member Technical Staff
ST/NEDC Oracle Engineering
Oracle Corporation
Overview
Benchmark Description
System Description
Database Parameters
Performance Data
The Benchmark
Customer OLTP benchmark run internally at Oracle
Insurance application handling customer inquiries and quotes over the phone
N users perform M quotes
Quote = actual work performed during a call with a customer
  Mixture of inserts, updates, deletes, singleton selects, cursor fetches, rollbacks/commits, savepoints
Compute average time for all quotes across users
System Info
SunFire 4800
  A standard Shared Memory Processor (SMP)
  8 900-MHz CPUs
  16 GB physical memory
  Solaris 5.8
Database storage: striped across 8 Sun StorEdge T3 arrays (9 x 36.4 GB each)
Database Parameters
Parallel_max_servers   20
Streams_pool_size      400M (default 10% shared pool)
Shared_pool_size       600M
Buffer cache           128M
Redo buffers           4M
Processes              600
Change Data Capture (CDC)
                     Sync               Async HotLog       Async AutoLog
Available            Oracle 9i          Oracle 10g         Oracle 10g
Source system cost   System resources   System resources   Minimal
Part of txn          YES                NO                 NO
Changes seen         Real time          Near real time     Variable
Systems              1                  1                  2
Tests
Conducted tests with Asynchronous HotLog CDC enabled and disabled, and with Sync CDC
Asynchronous HotLog CDC tests conducted at different log usage levels
  Approx. 10, 50, and 100% of all OLTP tables with DML operations were included in CDC
Tests run with:
  250 concurrent users
  Continuous peak workload after ramp-up
  175 transactions per second
Impact on Transaction Time
[Figure: bar chart, y-axis scale 0.9 to 1.5; categories: no CDC, no CDC with supplemental logging, Async 10%, Async 50%, Async 100%, Sync 100%]
CPU Consumption: Supplemental Logging
[Figure: USR + SYS Time; x-axis Time (s), y-axis Usage (# CPUs) from 0 to 5; series: no CDC, no CDC w/ suppl]
CPU Consumption: 10% DML Change Tracking
[Figure: USR + SYS Time; x-axis Time (s), y-axis Usage (# CPUs) from 0 to 5; series: no CDC w/ suppl, CDC 10%]
CPU Consumption: 50% DML Change Tracking
[Figure: USR + SYS Time; x-axis Time (s), y-axis Usage (# CPUs) from 0 to 5; series: no CDC w/ suppl, CDC 50%]
CPU Consumption: 10%, 100% DML Change Tracking
[Figure: USR + SYS Time; x-axis Time (s), y-axis Usage (# CPUs) from 0 to 8; series: no CDC w/ suppl, CDC 10%, CDC 100%]
Latency of Change Tracking
Latency is defined as the time between the actual change and its reflection in the Change Capture Table
  Latency = time[change record insert] – time[redo log insert]
Latency measurements were made for the 100% Asynchronous HotLog CDC run
  99.7% of records arrived in less than 2 secs
  53.5% of records arrived in less than 1 sec
  Remaining records arrived in less than 3 secs
Asynchronous CDC kept up with the constant high OLTP workload all the time
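The latency definition above translates directly into a small calculation. The sketch below is illustrative only: the function name is hypothetical and the four timestamp pairs are invented sample data, not the benchmark's measurements.

```python
def latency_summary(pairs, thresholds=(1.0, 2.0, 3.0)):
    """Percent of records whose capture latency is below each threshold.

    pairs: iterable of (redo_log_insert_ts, change_record_insert_ts) in seconds.
    Latency = time[change record insert] - time[redo log insert].
    """
    latencies = [change_ts - redo_ts for redo_ts, change_ts in pairs]
    if not latencies:
        raise ValueError("no records to summarize")
    total = len(latencies)
    return {
        limit: 100.0 * sum(1 for value in latencies if value < limit) / total
        for limit in thresholds
    }


# Invented sample: latencies of roughly 0.4, 0.9, 1.5 and 2.5 seconds.
sample = [(0.0, 0.4), (1.0, 1.9), (2.0, 3.5), (3.0, 5.5)]
summary = latency_summary(sample)
print(summary)
```

Reporting cumulative percentages per threshold matches how the slide states its results (share of records under 1, 2, and 3 seconds).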
Summary
Change Data Capture enables enterprise-ready near real-time capturing of change data
  No fallback for constant high-load OLTP environments
  Minimal impact on origin OLTP transactions
  Predictable additional resource requirements, solely driven by the amount of change tracking
Oracle provides the flexibility to meet your “on-time” business needs
Q&A
Next Steps: Data Warehousing DB Sessions
Monday
  11:00 AM   #40153, Room 304   Oracle Warehouse Builder: New Oracle Database 10g Release
  3:30 PM    #40176, Room 303   Security and the Data Warehouse
  4:00 PM    #40166, Room 130   Oracle Database 10g SQL Model Clause
Tuesday
  8:30 AM    #40125, Room 130   Oracle Database 10g: A Spatial VLDB Case Study
  3:30 PM    #40177, Room 303   Building a Terabyte Data Warehouse, Using Linux and RAC
  5:00 PM    #40043, Room 104   Data Pump in Oracle Database 10g: Foundation for Ultrahigh-Speed Data Movement
For More Info On Oracle BI/DW Go To http://otn.oracle.com/products/bi/db/dbbi.html
Next Steps: Data Warehousing DB Sessions
Thursday
  8:30 AM    #40179, Room 304   Oracle Database 10g Data Warehouse Backup and Recovery
  11:00 AM   #36782, Room 304   Experiences with Real-Time Data Warehousing using Oracle 10g
  1:00 PM    #40150, Room 102   Turbocharge your Database, Using the Oracle Database 10g SQLAccess Advisor
Business Intelligence and Data Warehousing Demos, All Four Days in the Oracle Demo Campground
  Oracle Database 10g
  Oracle OLAP
  Oracle Data Mining
  Oracle Warehouse Builder
  Oracle Application Server 10
Reminder – please complete the OracleWorld online session survey
Thank you.