Stack It And Unpack It

DESCRIPTION

Partitioning and Compression for Data Warehouses.

TRANSCRIPT

1. Stack It & Pack It: Partitioning And Compression For Warehouses / VLDB, Jeff Moss

2. Who Dunnit?

3. Agenda

  • My background
  • Squeeze your data with data segment compression
  • Partition for success
  • Questions

4. My Background

  • Independent Consultant
  • 13 years Oracle experience
  • Blog: http://oramossoracle.blogspot.com/
  • Focused on warehousing / VLDB since 1998
  • First project
    • UK Music Sales Data Mart
    • Produces BBC Radio 1 Top 40 chart and many more
    • 2 billion row sales fact table
    • 1 TB total database size
  • Currently working with E.ON UK (Powergen)
    • 4 TB Production Warehouse, 8 TB total storage
    • Oracle Product Stack

5. What Is Data Segment Compression ?

  • Compresses data by eliminating intra-block repeated column values
  • Reduces the space required for a segment
    • but only if there are appropriate repeats!
  • Self-contained
  • Lossless algorithm
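
A minimal sketch of observing the space saving, assuming a SALES table exists in the current schema (all object names are illustrative):

    -- Build an uncompressed and a compressed copy of the same data
    -- (CTAS is a direct path operation, so COMPRESS takes effect)
    CREATE TABLE sales_nocomp NOCOMPRESS AS SELECT * FROM sales;
    CREATE TABLE sales_comp   COMPRESS   AS SELECT * FROM sales;

    -- Compare the space each segment actually uses
    SELECT segment_name, blocks, ROUND(bytes/1024/1024) AS mb
    FROM   user_segments
    WHERE  segment_name IN ('SALES_NOCOMP', 'SALES_COMP');

The saving depends entirely on how many repeated column values end up in the same blocks.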

6. Where Can Data Segment Compression Be Used ?

  • Can be used with a number of segment types
    • Heap & Nested Tables
    • Range or List Partitions
    • Materialized Views
  • Can't be used with
    • Subpartitions
    • Hash Partitions
    • Indexes (though these have their own row-level compression)
    • IOTs
    • External Tables
    • Tables that are part of a Cluster
    • LOBs
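
A hedged sketch of declaring compression at the partition and materialized view level (all object names are illustrative):

    CREATE TABLE sales_fact (
      sales_date  DATE,
      customer_id NUMBER,
      amount      NUMBER
    )
    PARTITION BY RANGE (sales_date) (
      PARTITION p_2004 VALUES LESS THAN (DATE '2005-01-01') COMPRESS,    -- old, static data
      PARTITION p_2005 VALUES LESS THAN (DATE '2006-01-01') NOCOMPRESS   -- current, volatile data
    );

    CREATE MATERIALIZED VIEW mv_sales_by_month COMPRESS AS
      SELECT TRUNC(sales_date, 'MM') AS sales_month, SUM(amount) AS amount
      FROM   sales_fact
      GROUP  BY TRUNC(sales_date, 'MM');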

7. How Does Segment Compression Work?

  • Each database block holds a symbol table and a row data area: repeated column values are stored once in the symbol table and referenced from the rows.
  • Block overhead: common header (20 bytes), transaction header (24 bytes fixed + 24 bytes per ITL), table directory (8 bytes), row directory (2 bytes per row), data header (14 bytes), compressed data header (16 bytes, variable), tail (4 bytes).
  • Worked example (columns ID, DESCRIPTION, CONTACT TYPE, OUTCOME, FOLLOWUP): values such as "Call to discuss new product", "TEL", "NO" and "N/A" repeat across rows 100 to 102 and are replaced by references into the symbol table.

8. What Affects Compression?

  • Undisclosed Algorithm
    • I asked but support wouldn't play ball!
  • Many Factors
    • Block size
    • Anything which affects block overhead
      • Interested Transaction Lists (INITRANS)
      • Number of columns
      • Number of rows
      • PCTFREE
    • Number of repeats (in the block)
    • Length of column value(s)
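
A rough way to see some of these factors at work: build compressed copies of the same data with different block overhead settings and compare segment sizes (a sketch, assuming a SALES table; names are illustrative):

    CREATE TABLE sales_c_lean
      PCTFREE 0 INITRANS 1 COMPRESS
      AS SELECT * FROM sales;

    CREATE TABLE sales_c_fat
      PCTFREE 20 INITRANS 10 COMPRESS      -- more per-block overhead, less room for repeats
      AS SELECT * FROM sales;

    SELECT segment_name, blocks
    FROM   user_segments
    WHERE  segment_name LIKE 'SALES_C_%'
    ORDER  BY blocks;

The test_*_compression.sql scripts listed in the references exercise each factor in turn.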

9. Compression v Block Size

  • 200K rows, Non ASSM Uniform Local extents
  • Bigger block size = more chance of repeats in any given block = better compression

10. Compression v ITL

  • 10K rows, Non ASSM Uniform Local extents
  • More ITL = more overhead = less repeats

11. Compression v Number Of Columns

  • 500K rows, Non ASSM Uniform Local extents
  • Same amount of data to store
  • More columns = more overhead = less repeats

12. Compression v PCTFREE

  • 200K rows, Non ASSM Uniform Local extents
  • Higher PCTFREE = less usable space per block = less repeats

13. Compression v NDV

  • 200K rows, Non ASSM Uniform Local extents
  • Higher NDV = less repeats

14. Compression v Column Length

  • 80K rows, Non ASSM Uniform Local extents
  • Minimum 6 characters for compression
  • Longer Length = more compression savings

15. Compression v Ordering

  • Colocate data to maximise compression benefits
  • For maximum compression
    • Minimise the total space required by the segment
    • Identify the most compressible column(s)
  • For optimal access
    • We know how the data is to be queried
    • Order the data by
      • Access path columns
      • Then the next most compressible column(s)
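
A minimal sketch of the reorder-then-compress idea (assumes a SALES_FACT table; column choices are illustrative):

    -- Rebuild the segment with rows colocated on the access path column first,
    -- then on the most compressible remaining column, so repeats land in the same blocks
    CREATE TABLE sales_fact_comp COMPRESS AS
      SELECT *
      FROM   sales_fact
      ORDER  BY sales_date,       -- most common access path
                contact_type;     -- most compressible remaining column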

Diagram: the values 1 to 5 uniformly distributed across blocks (1 2 3 4 5 repeated) versus colocated (1 1 1 1, 2 2 2 2, ...), showing how colocation concentrates repeats within each block.

16. Get Max Compression Order Package

    PROCEDURE mgmt_p_get_max_compress_order
     Argument Name                  Type                    In/Out Default?
     ------------------------------ ----------------------- ------ --------
     P_TABLE_OWNER                  VARCHAR2                IN     DEFAULT
     P_TABLE_NAME                   VARCHAR2                IN
     P_PARTITION_NAME               VARCHAR2                IN     DEFAULT
     P_SAMPLE_SIZE                  NUMBER                  IN     DEFAULT
     P_PREFIX_COLUMN1               VARCHAR2                IN     DEFAULT
     P_PREFIX_COLUMN2               VARCHAR2                IN     DEFAULT
     P_PREFIX_COLUMN3               VARCHAR2                IN     DEFAULT

    BEGIN
      mgmt_p_get_max_compress_order(p_table_owner => 'AE_MGMT'
                                   ,p_table_name  => 'BIG_TABLE'
                                   ,p_sample_size => 10000);
    END;
    /
Running mgmt_p_get_max_compress_order...
----------------------------------------------------------------------------------------------------
Table: BIG_TABLE   Sample Size: 10000   Unique Run ID: 25012006232119   ORDER BY Prefix:
----------------------------------------------------------------------------------------------------
Creating MASTER Table: TEMP_MASTER_25012006232119
Creating COLUMN Table 1: COL1
Creating COLUMN Table 2: COL2
Creating COLUMN Table 3: COL3
----------------------------------------------------------------------------------------------------
The output below lists each column in the table and the number of blocks/rows and space used when
the table data is ordered by only that column or, where a prefix has been specified, by the prefix
and then that column. From this one can determine whether a specific ORDER BY can be applied to the
data to maximise compression within the table while, if a prefix is present, still ordering the data
as efficiently as possible for the most common access path(s).
----------------------------------------------------------------------------------------------------
NAME                             COLUMN   BLOCKS   ROWS    SPACE_GB
===============================  =======  =======  ======  ========
TEMP_COL_001_25012006232119      COL1     290      10000   .0022
TEMP_COL_002_25012006232119      COL2     345      10000   .0026
TEMP_COL_003_25012006232119      COL3     555      10000   .0042

17. Pros & Cons

  • Pros
    • Saves space
      • Reduces LIO / PIO
      • Speeds up backup/recovery
      • Improves query response time
    • Transparent
      • To readers
      • and writers
    • Decreases time to perform some DML
      • Deletes should be quicker
      • Bulk inserts may be quicker

18. Pros & Cons

  • Cons
    • Increases CPU load
    • Compression is only applied during Direct Path operations
      • CTAS
      • Serial Inserts using INSERT /*+ APPEND */
      • Parallel Inserts (PDML)
      • ALTER TABLE ... MOVE
      • Direct Path SQL*Loader
    • Increases time to perform some DML
      • Bulk inserts may be slower
      • Updates are slower
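
A hedged sketch of the direct path operations that actually produce compressed blocks (table names are illustrative):

    -- Serial direct path insert
    INSERT /*+ APPEND */ INTO sales_fact_comp
      SELECT * FROM sales_staging;
    COMMIT;

    -- Parallel DML (direct path once PDML is enabled)
    ALTER SESSION ENABLE PARALLEL DML;
    INSERT /*+ PARALLEL(sales_fact_comp, 4) */ INTO sales_fact_comp
      SELECT * FROM sales_staging;
    COMMIT;

    -- Re-pack an existing segment
    ALTER TABLE sales_fact_comp MOVE COMPRESS;

Conventional path inserts into the same table still work, but the new rows are stored uncompressed.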

19. Data Warehousing Specifics

  • Star Schema compresses better than Normalized
    • More redundant data
  • Focus on
    • Fact Tables and Summaries in Star Schema
    • Transaction tables in Normalized Schema
  • Performance Impact 1
    • Space Savings
      • Star schema: 67%
      • Normalized: 24%
    • Query Elapsed Times
      • Star schema: 16.5%
      • Normalized: 10%

1 - Table Compression in Oracle 9iR2: A Performance Analysis

20. Things To Watch Out For

  • DROP COLUMN is awkward
    • ORA-39726: Unsupported add/drop column operation on compressed tables
    • Uncompress the table and try again - still gives ORA-39726!
  • After UPDATEs data is uncompressed
    • Performance impact
    • Row migration
  • Use appropriate physical design settings
    • PCTFREE 0 - pack each block
    • Large block size - reduce overhead / increase repeats per block
    • Minimise INITRANS - reduce overhead
  • Order data for best compression / access path
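
A sketch of re-packing a partition after heavy UPDATE activity, applying the physical design settings above (object names are illustrative):

    -- Rebuild the partition compressed, with no reserved free space and minimal ITL overhead
    ALTER TABLE sales_fact
      MOVE PARTITION p_jan_2005
      PCTFREE 0 INITRANS 1 COMPRESS;

    -- The move leaves local index partitions UNUSABLE, so rebuild them
    ALTER INDEX sales_fact_lix1 REBUILD PARTITION p_jan_2005;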

21. A Funny Thing

  • Block dump trace files still show 9iR2 even in 10g releases
  • ALTER SYSTEM DUMP DATAFILE x BLOCK y;

Thanks to Julian Dyke for the block dumping information: http://www.juliandyke.com

22. What Is Partitioning?

  • "Partitioning addresses key issues in supporting very large tables and indexes by letting you decompose them into smaller and more manageable pieces called partitions." - Oracle Database Concepts Manual, 10gR2
  • Introduced in Oracle 8.0
  • Numerous improvements since
  • Subpartitioning adds another level of decomposition
  • Partitions and Subpartitions are logical containers
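
A minimal sketch of a range partitioned table with hash subpartitions as the second level of decomposition (all names are illustrative):

    CREATE TABLE sales_fact (
      sales_date  DATE,
      customer_id NUMBER,
      amount      NUMBER
    )
    PARTITION BY RANGE (sales_date)
    SUBPARTITION BY HASH (customer_id) SUBPARTITIONS 4 (
      PARTITION p_jan_2005 VALUES LESS THAN (DATE '2005-02-01'),
      PARTITION p_feb_2005 VALUES LESS THAN (DATE '2005-03-01')
    );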

23. Partition To Tablespace Mapping

  • Partitions map to tablespaces
    • Partition can only be in One tablespace
    • Tablespace can hold many partitions
    • Highest granularity is One tablespace per partition
    • Lowest granularity is One tablespace for all the partitions
  • Tablespace volatility
    • Read / Write
    • Read Only
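
A sketch of the monthly-partition-to-quarterly-tablespace mapping in the diagram below (tablespace names are illustrative and must already exist):

    CREATE TABLE sales_fact (
      sales_date DATE,
      amount     NUMBER
    )
    PARTITION BY RANGE (sales_date) (
      PARTITION p_jan_2005 VALUES LESS THAN (DATE '2005-02-01') TABLESPACE t_q1_2005,
      PARTITION p_feb_2005 VALUES LESS THAN (DATE '2005-03-01') TABLESPACE t_q1_2005,
      PARTITION p_mar_2005 VALUES LESS THAN (DATE '2005-04-01') TABLESPACE t_q1_2005,
      PARTITION p_apr_2005 VALUES LESS THAN (DATE '2005-05-01') TABLESPACE t_q2_2005
      -- ... and so on, one tablespace per quarter
    );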

Diagram: monthly partitions P_JAN_2005 through P_MAR_2006 mapped onto quarterly tablespaces T_Q1_2005 through T_Q1_2006; completed quarters are Read Only, the current quarter is Read / Write.

24. Read Only Tablespaces

  • Quicker checkpointing
  • Quicker backup
  • Quicker recovery
  • Reduced space use via compression
  • But the benefit depends on the partition-to-tablespace granularity
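
A sketch of switching a completed quarter to read only so routine backups can skip it (tablespace name is illustrative):

    ALTER TABLESPACE t_q1_2005 READ ONLY;

    -- Once a final backup of the tablespace exists, RMAN can leave it out of routine backups
    RMAN> BACKUP DATABASE SKIP READONLY;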

25. Why Partition? - Performance

  • Improved query performance
    • Pruning or elimination
    • Partition wise joins
      • Full
      • Partial
  • Selective Compression
    • By Partition
  • Selective Reorganisation
    • Index Partition REBUILD
    • Table Partition MOVE
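
A hedged sketch of a full partition-wise join: both tables are hash partitioned identically on the join key, so matching partition pairs can be joined independently and in parallel (all names are illustrative):

    CREATE TABLE sales_hash (
      customer_id NUMBER,
      amount      NUMBER
    )
    PARTITION BY HASH (customer_id) PARTITIONS 8;

    CREATE TABLE customer_hash (
      customer_id NUMBER,
      region      VARCHAR2(30)
    )
    PARTITION BY HASH (customer_id) PARTITIONS 8;

    -- Equi-join on the common partitioning key
    SELECT /*+ PARALLEL(s, 4) PARALLEL(c, 4) */
           c.region, SUM(s.amount)
    FROM   sales_hash s, customer_hash c
    WHERE  c.customer_id = s.customer_id
    GROUP  BY c.region;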

SELECT SUM(sales)
FROM   part_tab
WHERE  sales_date BETWEEN '01-JAN-2005' AND '30-JUN-2005';

Diagram: a Sales Fact Table partitioned by month (JAN through DEC); only the JAN to JUN partitions need to be scanned (pruning example after the Oracle 10gR2 Data Warehousing manual).

26. Why Partition? - Manageability

  • Archiving
    • Use a rolling window approach
    • ALTER TABLE ADD/SPLIT/DROP PARTITION
  • Easier ETL Processing
    • Build a new dataset in a staging table
    • Add indexes and constraints
    • Collect statistics
    • Then swap the staging table for a partition on the target
      • ALTER TABLE ... EXCHANGE PARTITION
  • Easier Maintenance
    • Table partition move, e.g. to compress data
    • Local Index partition rebuild
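
A hedged sketch of the staging-table swap (object names are illustrative; the staging table must match the partitioned table's structure):

    -- 1. Build the new month's data offline in a staging table
    CREATE TABLE stg_sales_jan AS
      SELECT * FROM sales_fact WHERE 1 = 0;        -- same shape, no rows

    INSERT /*+ APPEND */ INTO stg_sales_jan
      SELECT sales_date, customer_id, amount
      FROM   new_sales_feed                        -- hypothetical source of the new data
      WHERE  sales_date < DATE '2005-02-01';
    COMMIT;

    -- 2. Index and analyse the staging table, then swap it in as the January partition
    CREATE INDEX stg_sales_jan_ix1 ON stg_sales_jan (customer_id);

    ALTER TABLE sales_fact
      EXCHANGE PARTITION p_jan_2005 WITH TABLE stg_sales_jan
      INCLUDING INDEXES WITHOUT VALIDATION;

The exchange itself is a data dictionary operation, so the swap is near-instant regardless of the partition size.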

27. Why Partition ? - Scalability

  • Performance is generally consistent and predictable
    • Assuming an appropriate partitioning key is used
    • and data has an even distribution across the key
  • Read only approach
    • Scalable backups - read only tablespaces are ignored
    • so partitions in those tablespaces are ignored
  • Pruning allows consistent query performance

28. Why Partition ? - Availability

  • Offline data impact minimised
    • depending on granularity
    • Quicker recovery
    • Pruned data not missed
    • EXCHANGE PARTITION
      • Allows offline build
      • Quick swap over

Diagram: as before, monthly partitions mapped onto quarterly tablespaces, with completed quarters Read Only and the current quarter Read / Write.

29. Fact Table Partitioning: Transaction Date v Load Date

  • Partitioning by Load Date
    • Easier ETL Processing
      • Each load deals with only 1 partition
    • No use to end user queries!
      • Can't prune - full scans!
  • Partitioning by Transaction Date
    • Harder ETL Processing
      • But still uses EXCHANGE PARTITION
    • Useful to end user queries
      • Allows full pruning capability

Diagram: the same sample rows (Tran Date, Customer, Load Date) placed into January to April partitions twice - once partitioned by Load Date and once by Transaction Date - showing how a row such as Customer 7 (transacted 21-JAN-2005 but loaded 04-APR-2005) lands in a different partition under each scheme.

30. Watch Out For

  • Partition exchange and table statistics 1
    • Partition stats updated
    • but Global stats are NOT!
    • Affects queries accessing multiple partitions
    • Solution
      • Gather stats on staging table prior to EXCHANGE
      • Partition exchange
      • Gather stats on partitioned table using GLOBAL
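
A sketch of that sequence using DBMS_STATS (owner and object names are illustrative):

    BEGIN
      -- 1. Gather stats on the staging table before the exchange
      DBMS_STATS.GATHER_TABLE_STATS(ownname => 'AE_MGMT', tabname => 'STG_SALES_JAN');
    END;
    /

    -- 2. Exchange the staging table for the partition (its stats come across with it)
    ALTER TABLE sales_fact
      EXCHANGE PARTITION p_jan_2005 WITH TABLE stg_sales_jan;

    BEGIN
      -- 3. Refresh the global stats on the partitioned table
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname     => 'AE_MGMT',
        tabname     => 'SALES_FACT',
        granularity => 'GLOBAL');
    END;
    /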

1 - Jonathan Lewis: Cost-Based Oracle Fundamentals, Chapter 2

31. Partitioning Feature / Characteristic Matrix

Matrix mapping features (Partition Truncation, Exchange Partition, Archiving, Pruning / Partition Elimination, Partition-wise joins, Parallel DML, Local Indexes, Read Only Partitions) against the characteristics they support (Performance, Manageability, Scalability, Availability).

32. Questions?

33. References: Papers

  • Table Compression in Oracle 9iR2: A Performance Analysis
  • Table Compression in Oracle 9iR2: An Oracle White Paper
  • Scaling To Infinity, Partitioning In Oracle Data Warehouses, Tim Gorman
  • Decision Speed: Table Compression In Action

34. References: Online Presentation / Code

  • http://www.oramoss.demon.co.uk/presentations/stackitandpackit.ppt
  • http://www.oramoss.demon.co.uk/Code/mgmt_p_get_max_compression_order.prc
  • http://www.oramoss.demon.co.uk/Code/test_dml_performance_delete.sql
  • http://www.oramoss.demon.co.uk/Code/test_dml_performance_insert.sql
  • http://www.oramoss.demon.co.uk/Code/test_dml_performance_update.sql
  • http://www.oramoss.demon.co.uk/Code/test_block_size_compression.sql
  • http://www.oramoss.demon.co.uk/Code/test_column_length_compression.sql
  • http://www.oramoss.demon.co.uk/Code/test_itl_compression.sql
  • http://www.oramoss.demon.co.uk/Code/test_ndv_compression.sql
  • http://www.oramoss.demon.co.uk/Code/test_num_cols_compression.sql
  • http://www.oramoss.demon.co.uk/Code/test_pctfree_compression.sql