The Oracle9i Multi-Terabyte Data Warehouse
Jeff Parker, Manager, Data Warehouse Development
Amazon.com
The Challenges
• Rapidly evolving business
• Growing data volumes
• Do more with less
The Challenges
• Rapidly evolving business
– New international markets
– Continual innovation of features on Amazon
• Buy it used
• Magazine subscriptions
– Marketplace partnerships – Toys R Us, Target
• Growing data volumes
• Do more with less
The Challenges
• Rapidly evolving business
• Growing data volumes
– 2X growth yearly over the past 5 years
– Currently 10 terabytes of raw data
• Do more with less
Data Growth
[Chart: raw data volume in terabytes per year, 1999–2003, growing from under 5 TB toward 30+ TB]
The Challenges
• Rapidly evolving business
• Growing data volumes
• Do more with less
– Innovative use of technology and resources
– Throwing money and people at the problem is not an option
– Leverage existing investment in Oracle
Addressing the issues
• Rapidly evolving business
– Denormalize only for performance reasons
– Create a solution that brings new datasets into the DW rapidly, without high maintenance costs
• Growing data volumes
• Do more with less
Addressing the issues
• Rapidly evolving business
• Growing data volumes
– Dual database approach to ETL
• Staging database for efficient transformation of large datasets; SQL and hash joins let transforms scale far better than row-at-a-time processing
• Second database optimized for analytics
– Oracle as an API
• Simplifies the ETL architecture
• Better scalability than traditional ETL tools
• Do more with less
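The staging-database idea above can be sketched as a single set-based statement. This is a hypothetical illustration, not a query from the actual Amazon system; all table and column names are assumed:

```sql
-- Hypothetical staging transform: resolve an attribute lookup as one
-- set-based hash join instead of a row-at-a-time lookup loop.
INSERT /*+ APPEND */ INTO stg_order_items_xf
SELECT /*+ USE_HASH(oi p) */
       oi.order_id
     , oi.order_day
     , p.product_group                       -- attribute resolved via the join
     , oi.quantity * oi.unit_price AS extended_price
FROM   stg_order_items oi
JOIN   stg_products    p ON p.product_id = oi.product_id;
```

One pass over both tables, with the optimizer free to hash-join them in parallel, is what lets a transform like this keep up as volumes double each year.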
Addressing the issues
• Rapidly evolving business
• Growing data volumes
• Do more with less
– One DW schema supports all countries
– Cut costs by eliminating unneeded software
– Data-driven load functionality
The ETL Process
• Extract data from source
• The Load process
• Dimensional Transforms
[Diagram: EXTRACT pulls data from OLTP sources into flat files; row-level transforms (RT) load the staging database; a second set of flat files and dimensional transforms (DT) populate the data warehouse. RT = row-level data transform, DT = dimensional transform.]
The ETL Process
• Extract data from source
– Can create one or more files to be loaded
– Must produce metadata upon which the Load process can depend
• The Load Process
• Dimensional Transforms
Extract-produced Metadata
• Describes each field in database type terms
• Changes as the dataset changes
• Can reference multiple files
• Very reliable
• No additional overhead
XML Based Metadata

<DATA CHARSET="UTF8" DELIMITER="\t" ROWS="1325987">
  <COLUMNS>
    <COLUMN ID="dataset_id" DATA_TYPE="NUMBER" DATA_PRECISION="38" DATA_SCALE="0"/>
    <COLUMN ID="dataset_name" DATA_TYPE="VARCHAR2" DATA_LENGTH="80"/>
    <COLUMN ID="CREATION_DATE" DATA_TYPE="DATE" DATE_MASK="YYYY/MM/DD.HH24:MI:SS"/>
    <COLUMN ID="CREATED_BY" DATA_TYPE="VARCHAR2" DATA_LENGTH="8"/>
  </COLUMNS>
  <FILES>
    <FILE PATHNAME="/flat/datasets_20020923_US.txt.1"/>
    <FILE PATHNAME="/flat/datasets_20020923_US.txt.2"/>
  </FILES>
</DATA>
The ETL Process
• Extract data from source
• The Load Process
– Makes extensive use of External Tables
–MERGE and Bulk Insert
–Contains integrated DBA tasks
–Every load is tracked in an operational database
• Dimensional Transforms
The Load Process
[Diagram: metadata and data files are exposed through an external table (XT); row-level transforms and SQL INSERT/MERGE move the rows into the data warehouse; load tasks and DBA tasks log load stats and perform cleanup, with everything tracked in the DW operational database (OPSDW).]
The Load Process
• External Tables
– Provide access to files on the operating system
– A building block in a broader ETL process
• MERGE & Bulk Insert
• Integrated DBA tasks
The External Table
• Created by using Metadata from the Extract process
• Data is read-only
• No indexes
• Use DBMS_STATS to set number of rows
[Diagram: data files feed the external table DATA_SETS (dataset_id NUMBER, dataset_name VARCHAR2(80), creation_date DATE, created_by VARCHAR2(8))]
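Because an external table is never analyzed, the row count carried in the extract metadata (the ROWS attribute in the XML) can be handed straight to the optimizer. A minimal sketch; the schema name is assumed:

```sql
-- Feed the optimizer the row count from the extract metadata, since
-- the external table itself carries no statistics.
BEGIN
  dbms_stats.set_table_stats( ownname => 'DW'                -- assumed schema
                            , tabname => 'XT_DATASETS_77909'
                            , numrows => 1325987 );          -- ROWS from the XML metadata
END;
/
```

With a realistic cardinality in place, the optimizer is far more likely to choose hash joins over nested loops when the external table is joined to large warehouse tables.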
Example External Table
1. Copy the data to the database server. Data must reside in a file system location specified by the DBAs:
- create directory DAT_DIR as '/stage/flat';
Example External Table
2. Create the external table using the DDL generated from the extract metadata.

CREATE TABLE XT_datasets_77909
( dataset_id    NUMBER
, dataset_name  VARCHAR2(80)
, creation_date DATE
, created_by    VARCHAR2(8)
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY dat_dir
  ACCESS PARAMETERS
  ( records delimited by newline
    characterset UTF8
    fields terminated by '\t'
  )
  LOCATION ( 'datasets_20020923_US.txt.1'
           , 'datasets_20020923_US.txt.2' )
);
The External Table
• No pre-staging of data
• Ability to describe a flat file to Oracle
• Handles horizontally partitioned files
• Good error messaging
The Load Process
• External Tables
• MERGE
– Can be run in parallel
– Combined with external tables, provides a powerful set of ETL tools
• Integrated DBA tasks
MERGE
• Allows update or insert in a single statement
– Does the key value already exist?
• Yes: update the row
• No: insert the row
• MERGE statement is auto-generated
• Row level column transforms are supported
MERGE
[Diagram: external table columns (dataset_id, dataset_name, created_by) plus metadata-supplied sysdate (as last_updated) feed the MERGE into the permanent table (dataset_id, dataset_name, created_by, last_updated)]
MERGE example

MERGE INTO datasets ds
USING ( SELECT xt.dataset_id
             , xt.dataset_name
             , xt.creation_date
             , nvl(xt.created_by, 'nobody') AS created_by
        FROM XT_datasets_77909 xt ) src
ON ( src.dataset_id = ds.dataset_id )
WHEN MATCHED THEN UPDATE
  SET ds.dataset_name  = src.dataset_name
    , ds.creation_date = src.creation_date
    , ds.created_by    = src.created_by
    , ds.last_updated  = sysdate
WHEN NOT MATCHED THEN INSERT
  ( dataset_id, dataset_name, creation_date, created_by, last_updated )
  VALUES ( src.dataset_id, src.dataset_name, src.creation_date, src.created_by, sysdate );
MERGE
• Issues we faced
– Duplicate records in the dataset
–NESTED-LOOPS from external table
–Parallelism is not enabled by default
–Bulk Load partition determination
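The parallelism issue above deserves spelling out: parallel DML is off by default in an Oracle session, so a hinted MERGE still runs serially unless the session enables it first. A minimal sketch, with table names assumed for illustration:

```sql
-- Parallel DML must be enabled per session before a parallel
-- MERGE or INSERT will actually execute in parallel.
ALTER SESSION ENABLE PARALLEL DML;

MERGE /*+ PARALLEL(ds, 8) */ INTO datasets ds
USING ( SELECT dataset_id, dataset_name FROM XT_datasets_77909 ) src
ON ( src.dataset_id = ds.dataset_id )
WHEN MATCHED THEN UPDATE SET ds.dataset_name = src.dataset_name
WHEN NOT MATCHED THEN INSERT ( dataset_id, dataset_name )
                      VALUES ( src.dataset_id, src.dataset_name );

COMMIT;  -- a parallel-DML transaction must commit before the table is re-read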
The Load Process
• External Tables
• MERGE
• Integrated DBA tasks
– Reduces workload required by the DBA team
–Streamlines the load process
–Eliminates human error
Integrated DBA Tasks
• Provided by the DBA team
– Managed by the DBA team
– ETL team does not need special knowledge of table layout
–ETL team does not need special knowledge of table layout
Integrated DBA Tasks
• Truncate Partition
Developer makes the call:
truncate_partition( 'TABLE-NAME', partition-key1, partition-key2, partition-key3 )
The DBA utility translates this and executes:
alter table TABLE-NAME drop partition dbi20020930_101;
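One way a utility like this might be implemented is a DBA-owned PL/SQL wrapper that maps the partition keys to a physical partition name and issues the DDL. This is a hypothetical sketch, not the actual Amazon utility; it leans on the get_partition_name call described later:

```sql
-- Hypothetical sketch: callers never need to know the physical
-- partition naming scheme, only the logical partition keys.
CREATE OR REPLACE PROCEDURE truncate_partition
( p_table IN VARCHAR2
, p_key1  IN VARCHAR2
, p_key2  IN VARCHAR2
, p_key3  IN VARCHAR2 )
AS
  l_partition VARCHAR2(30);
BEGIN
  -- get_partition_name encapsulates the naming convention
  l_partition := get_partition_name( p_table, p_key1, p_key2, p_key3 );
  EXECUTE IMMEDIATE 'alter table ' || p_table
                 || ' truncate partition ' || l_partition;
END;
/
```

Keeping the dynamic DDL inside a procedure the DBA team owns is what makes the load "data driven": the ETL code stays identical even when the partitioning scheme changes.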
Integrated DBA Tasks
• Analyze Partition
Developer makes the call:
analyze_partition( 'TABLE-NAME', partition-key1, partition-key2, partition-key3 )
The DBA utility translates this and executes:
dbms_stats.gather_table_stats( ownname, tabname, partname, cascade, estimate_percent, granularity );
Integrated DBA Tasks
• Return Partition Name
Developer makes the call:
get_partition_name( 'TABLE-NAME', partition-key1, partition-key2, partition-key3 )
The DBA utility translates this and returns the appropriate partition name. This is very useful when bulk loading tables.
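The returned name can then be used in partition-extended syntax, so a direct-path bulk load touches only the target partition. A sketch with an assumed table name (the partition name reuses the slide's example):

```sql
-- Direct-path load into just the partition returned by
-- get_partition_name; table name here is illustrative.
INSERT /*+ APPEND */ INTO daily_buy_impact PARTITION (dbi20020930_101)
SELECT dataset_id, dataset_name, creation_date, created_by
FROM   XT_datasets_77909;
```

Naming the partition explicitly also causes the load to fail fast if a key maps to the wrong partition, rather than silently double-loading data.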
Integrated DBA Tasks
• Partitioning utilities
– Helps to streamline the process
–Reduces workload of DBA team
–Helps to eliminate the problem of double loads for Snapshot tables and partitions
The Load Process
• External Tables
– Provide access to flat files outside the database
• MERGE
– Parallel "upsert" simplifies ETL
– Row-level transforms can be performed in SQL
• Integrated DBA tasks
– Reduces workload required by the DBA team
– Streamlines the load process
– Eliminates human error
• Loads are repeatable processes
Summary
• Reduction in time to integrate new subject areas
• Oracle parallelism scales well
• Eliminated unneeded software
Summary
• Oracle has delivered on the DW promise
– Oracle External Tables combined with MERGE are a viable alternative to other ETL tools
– The ETL tools are ready today
Q U E S T I O N S  &  A N S W E R S
Reminder – please complete the OracleWorld session survey
Thank you.