Parallel Multi-Dimensional ROLAP Indexing
Andrew Rau-Chaplin
Faculty of Computer Science
Dalhousie University
Joint work with
Frank Dehne, Carleton Univ.
Todd Eavis, Dalhousie Univ.
Data Warehousing for Decision Support
Operational data collected into DW
DW used to support multi-dimensional views
Views form the basis of OLAP processing
Our focus: the OLAP server
[Figure: data warehouse architecture. External sources and operational databases are extracted, cleaned, transformed, loaded and refreshed (data cleaning and integration) into the data warehouse and data marts (data storage, with meta data repository, monitoring and administration). OLAP servers (OLAP engines) produce output for the front-end tools: query/reports, analysis and data mining.]
Multi-dimensional views
Collection of feature attributes
Aggregate along one or more measure attributes
Reduce the granularity by “collapsing” dimensions
Points generated by:
distributive functions (e.g., sum)
algebraic functions (e.g., average)
holistic functions (e.g., median)
[Figure: data cube over Make (Chevy, Ford), Colour (Red, White, Blue) and Year (1990 to 1993), with the group-bys By Make, By Colour, By Year, By Make & Colour, By Make & Year and By Colour & Year.]
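A minimal sketch of "collapsing" dimensions, assuming a tiny made-up fact table (the rows, the `DIMS` tuple and the `group_by` helper are illustrative and not from the talk). It also shows why the function class matters: a distributive function such as sum can be applied to the measure values directly, while a holistic one such as median needs every raw value of the group.

```python
from collections import defaultdict
from statistics import median

# Hypothetical fact table: (make, colour, year, sales). Values are made up.
SALES = [
    ("Chevy", "Red", 1990, 5),
    ("Chevy", "Blue", 1990, 3),
    ("Ford", "Red", 1991, 4),
    ("Ford", "White", 1991, 6),
    ("Chevy", "Red", 1991, 2),
]
DIMS = ("make", "colour", "year")


def group_by(rows, keep, agg=sum):
    """Collapse every dimension not listed in `keep` and aggregate the measure."""
    idx = [DIMS.index(d) for d in keep]
    groups = defaultdict(list)
    for row in rows:
        # The last column is the measure; the key is the kept feature values.
        groups[tuple(row[i] for i in idx)].append(row[-1])
    return {key: agg(values) for key, values in groups.items()}


by_make = group_by(SALES, ("make",))                     # distributive: sum
by_colour_median = group_by(SALES, ("colour",), median)  # holistic: median
grand_total = group_by(SALES, ())                        # the "ALL" view
```

Each call corresponds to one view of the cube: fewer entries in `keep` means coarser granularity.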
Data Cube Generation
Proposed by Gray et al. in 1995
Can be generated “manually” from a relational DB but this is very inefficient
Exploit the relationship between cuboids to compute all 2^d cuboids
In OLAP environments, we typically pre-compute these views to improve query response time
ABC
AB AC BC
A C B
ALL
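The relationship between cuboids can be sketched as follows: every view is computed from an already computed parent that has one more dimension, instead of from the base data. This is only an illustration for a distributive measure (sum); the function names and the "smallest parent" rule are assumptions, not the method from the talk.

```python
from collections import defaultdict
from itertools import combinations


def lattice(dims):
    """All 2^d subsets of dims, largest first (ABC, AB, AC, BC, A, B, C, ALL)."""
    return [c for k in range(len(dims), -1, -1) for c in combinations(dims, k)]


def aggregate(parent_dims, parent, child_dims):
    """Roll a parent view (dict: key tuple -> sum) up to a child view."""
    idx = [parent_dims.index(d) for d in child_dims]
    out = defaultdict(int)
    for key, total in parent.items():
        out[tuple(key[i] for i in idx)] += total
    return dict(out)


def build_cube(dims, base):
    """Compute every view from its smallest already computed parent."""
    cube = {tuple(dims): base}
    for view in lattice(dims)[1:]:
        parents = [p for p in cube
                   if len(p) == len(view) + 1 and set(view) <= set(p)]
        best = min(parents, key=lambda p: len(cube[p]))  # fewest rows to scan
        cube[view] = aggregate(best, cube[best], view)
    return cube


# Made-up base view over dimensions A, B, C.
cube = build_cube(("A", "B", "C"), {(1, 1, 1): 2, (1, 2, 1): 3, (2, 1, 2): 5})
```

With d = 3 this produces all 2^3 = 8 views of the lattice shown above.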
Existing Parallel Results
Goil & Choudhary
MOLAP solution
in-memory structures
global partition + d communication rounds
distributed views
Limitations
Memory for multi-dimensional arrays
Expensive communication for larger d
J. of Data Mining & Knowledge Discovery 1(4), 1997
Our Approach
ROLAP solution
Construct and cost the data cube lattice
Find a “least cost” spanning tree
Partition the spanning tree over the processors equally, construct views and distribute
Can handle partial cubes
Limitations
What about indexing?
ABCD
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D
All
CCGrid’01 + J. Dist. & Parallel Databases 11(2), 2001
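The three steps (cost the lattice, pick a spanning tree, partition it) can be sketched roughly as below. Everything specific here is an assumption made for illustration: the cost model (product of dimension cardinalities, capped at the row count), the rule of picking the cheapest parent per view, and the greedy assignment of whole subtrees to the least loaded processor. The published algorithm may differ in all three.

```python
from itertools import combinations
from math import prod


def estimated_size(view, card, n_rows):
    """Assumed cost model: product of cardinalities, capped at n_rows."""
    return min(n_rows, prod(card[d] for d in view))


def spanning_tree(dims, card, n_rows):
    """Map every non-root view to its cheapest parent (one extra dimension)."""
    parent = {}
    for k in range(len(dims) - 1, -1, -1):
        for view in combinations(dims, k):
            candidates = [tuple(d for d in dims if d in view or d == extra)
                          for extra in dims if extra not in view]
            parent[view] = min(
                candidates, key=lambda p: estimated_size(p, card, n_rows))
    return parent


def partition(parent, root, p, cost):
    """Greedily give whole subtrees (below the root) to the least loaded processor."""
    def top(view):
        while parent[view] != root:
            view = parent[view]
        return view

    subtrees = {}
    for view in parent:
        subtrees.setdefault(top(view), []).append(view)
    loads = [0] * p
    assigned = [[] for _ in range(p)]
    # Heaviest subtree first, each to the currently lightest processor.
    for views in sorted(subtrees.values(),
                        key=lambda vs: -sum(cost(v) for v in vs)):
        i = loads.index(min(loads))
        assigned[i].extend(views)
        loads[i] += sum(cost(v) for v in views)
    return assigned


dims = ("A", "B", "C", "D")
card = {"A": 10, "B": 5, "C": 4, "D": 2}  # made-up cardinalities
parent = spanning_tree(dims, card, 1000)
parts = partition(parent, dims, 2,
                  lambda v: estimated_size(parent[v], card, 1000))
```

Keeping a subtree on one processor means each view's parent is local when it is computed. Supporting a partial cube would amount to restricting the set of views before building the tree.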
Parallel Multi-dimensional Indexing
Query specifies a range on multiple dimensions
Forms a hypercube in the point space
General Approach
No multidimensional index is universally successful
Exploit domain specific information and the features of a particular index
OLAP
Data is provided up front
Updates are batch oriented
Design Goals
A framework for distributed high-performance indexing of ROLAP cubes
Practical to implement
Low communication volume
Fully adapted to external memory (disks)
No shared disk required
Incrementally maintainable
Efficient for high D spatial searches
Scalable in terms of data size, dimensions, processors
Challenge
How to order and partition data such that
Number of records retrieved per node is as balanced as possible
Minimize the number of disk seeks required in answering a query
[Figure: view ABC partitioned across processors P1 P2 P3 P4]
Indexing the Data Cube
Combine the strengths of a space-filling curve and an R-tree index
Use Hilbert curve to load buckets
Index buckets with R-tree
Update indexes with merge/sort
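A two-dimensional sketch of the idea, assuming integer grid coordinates: points are sorted along a Hilbert curve, cut into fixed-size buckets, and each bucket gets a bounding box, which corresponds to the leaf level of a packed R-tree. The function names, the bucket capacity and the restriction to 2-D are illustrative assumptions; upper tree levels and the merge/sort update are left out.

```python
def hilbert_index(order, x, y):
    """Position of (x, y) on a Hilbert curve over a 2^order x 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate/flip the quadrant so the curve stays continuous
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def pack_buckets(points, order, capacity):
    """Sort points along the curve, cut into buckets, box each bucket."""
    ordered = sorted(points, key=lambda p: hilbert_index(order, p[0], p[1]))
    buckets = [ordered[i:i + capacity]
               for i in range(0, len(ordered), capacity)]
    boxes = [(min(x for x, _ in b), min(y for _, y in b),
              max(x for x, _ in b), max(y for _, y in b)) for b in buckets]
    return buckets, boxes


def candidate_buckets(boxes, lo, hi):
    """Bucket numbers whose box intersects the query rectangle [lo, hi]."""
    return [i for i, (x0, y0, x1, y1) in enumerate(boxes)
            if x0 <= hi[0] and x1 >= lo[0] and y0 <= hi[1] and y1 >= lo[1]]


grid = [(x, y) for x in range(4) for y in range(4)]
buckets, boxes = pack_buckets(grid, order=2, capacity=4)
hits = candidate_buckets(boxes, lo=(0, 0), hi=(1, 1))
```

Because neighbours on the curve are neighbours in space, each bucket covers a compact region, so a range query touches few buckets and those buckets tend to be adjacent on disk.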
Space Filling Curves & Striping
Query Retrieval
[Figure: a query on view ABC is resolved on processors P1 P2 P3 P4, each holding its own part of ABC]
Example
[Figure: the original space, and the points held by Processor 1 and Processor 2]
8 points to be reported
Processor 1 reports: 2 consecutive blocks & 4 points
Processor 2 reports: 2 consecutive blocks & 4 points
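The example can be mimicked with a small sketch, assuming that striping means dealing consecutive blocks of the curve-ordered records to the processors in round-robin fashion. Records are represented only by their position on the curve, and the block size of 2 and the function names are made up for illustration.

```python
def stripe(ordered_records, p, block_size):
    """Deal consecutive blocks of curve-ordered records to p processors."""
    parts = [[] for _ in range(p)]
    for b, start in enumerate(range(0, len(ordered_records), block_size)):
        parts[b % p].extend(ordered_records[start:start + block_size])
    return parts


def local_hits(part, wanted):
    """Records of one processor's partition that fall inside the query."""
    return [r for r in part if r in wanted]


# 32 records in curve order, 2 processors, 2 records per block.
parts = stripe(list(range(32)), p=2, block_size=2)
query = set(range(8, 16))  # a query covering 8 consecutive curve positions
hits_p1 = local_hits(parts[0], query)
hits_p2 = local_hits(parts[1], query)
```

Each processor ends up reporting 4 of the 8 points, and on each processor those 4 points sit in two blocks that are adjacent in its local order, so the work is balanced and needs few seeks.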
The Parallel Framework
A single view is partitioned across p processors
Partial Hilbert/r-tree indexes are computed locally
Queries are answered concurrently
Queries answered individually or “piggy-backed”
The Virtual Data Cube
Problem: Full cube often too large to materialize
Solution: Use surrogate views
Surrogate Processing
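One way to read the surrogate idea, sketched under assumptions: a query on a view that was not materialized is answered from a materialized view whose dimensions cover it, with the extra dimensions aggregated away at query time. The selection rule (smallest covering view), the sum measure and the names `pick_surrogate` and `answer` are hypothetical, not taken from the talk.

```python
from collections import defaultdict


def pick_surrogate(query_dims, materialized):
    """Smallest materialized view whose dimensions cover the query."""
    covering = [v for v in materialized if set(query_dims) <= set(v)]
    if not covering:
        raise LookupError("no materialized view covers " + repr(query_dims))
    return min(covering, key=lambda v: len(materialized[v]))


def answer(query_dims, materialized):
    """Aggregate the surrogate down to the requested dimensions (sum)."""
    view = pick_surrogate(query_dims, materialized)
    idx = [view.index(d) for d in query_dims]
    out = defaultdict(int)
    for key, total in materialized[view].items():
        out[tuple(key[i] for i in idx)] += total
    return dict(out)


# Made-up partial cube: only ABC and AB are stored.
materialized = {
    ("A", "B", "C"): {(1, 1, 1): 2, (1, 1, 2): 1, (1, 2, 1): 3, (2, 1, 2): 5},
    ("A", "B"): {(1, 1): 3, (1, 2): 3, (2, 1): 5},
}
by_a = answer(("A",), materialized)  # served from AB
by_c = answer(("C",), materialized)  # only ABC covers C
```

The trade-off is storage against query-time aggregation: fewer materialized views means more work per query on the surrogate.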
Other issues…
Dimension ordering
Query piggybacking
Batch updating
Managing hierarchies of views
Experimental Results
Machine
17 node cluster
Node = 1.8 GHz Xeon, 1 GB RAM, 2 * 40 GB IDE drives, running Linux
Interconnect = Intel Fast Ethernet switch
Test Data
10 dimensions and 1,000,000 records
RCUBE index Construction
Output: ~640 million rows, 16 Gigabytes
Distributed Query Resolution
Test: Random queries returning ~15% of points (10 experiments per point)
Disk blocks retrieved vs. Disk Seeks
Test: Random queries returning 5-15% of points (15 experiments per point)
Distributed Query Resolution in Surrogate Group-bys
Thank You
Questions?