Parallel Star Join + DataIndexes: Efficient Query Processing in Data Warehousing and OLAP
Anindya Datta, Debra VanderMeer, Krithi Ramamritham
Presented by: Ashutosh Joshi
Motivation
OLAP involves efficient retrieval of data from data warehouses for decision-support purposes.
Data warehouses are extremely large, and queries against them are computationally expensive.
A DataIndex is a storage structure that serves as both index and data.
Parallel Star Join (PSJ) is an efficient algorithm for performing star joins in parallel.
The Road Map
A physical design principle for exploiting parallelism
The Parallel Star Join algorithm
Experimental results
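The Star Schema
Each column is listed with its size; each table ends with its total row size, and its row count is given in parentheses.
Fact table:
SALES (6,000,000 rows): PartKey 4, SuppKey 4, CustKey 4, Quantity 8, ExtPrice 8, Discount 8, Tax 8, RetFlag 1, Status 1, ShipDate 2, CommitDate 2, ReceiptDate 2, ShipInstruct 25, ShipMode 10, Comment 44; total 137
Dimension tables:
PART (200,000 rows): PartKey 4, Name 55, Mfgr 25, Brand 10, Type 25, Size 4, Others... 41; total 164
CUSTOMER (150,000 rows): CustKey 4, Name 25, Address 40, Nation 25, Region 25, Phone 15, AcctBal 8, MktSegment 10, Comment 117; total 269
SUPPLIER (10,000 rows): SuppKey 4, Name 25, Address 40, Nation 25, Region 25, Phone 15, AcctBal 8, Comment 101; total 243
TIME (2,557 rows): TimeKey 2, Alpha 10, Year 4, Month 4, Week 4, Day 4; total 28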
A Physical Design Principle
DataIndexes serve as both index and data.
They are based on vertical partitioning of tables.
Two types: Projection Index (PI) and Join Index (JI)
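Projection Index
[Figure: a base table split column-wise into three PIs]
Base table (CustKey, Qty, Discount, ExtPrice): (CK1, Q1, D1, E1), (CK2, Q2, D2, E2), (CK3, Q3, D3, E3), (CK4, Q4, D4, E4)
PI on CustKey: CK1, CK2, CK3, CK4
PI on Qty: Q1, Q2, Q3, Q4
PI on Discount and ExtPrice: (D1, E1), (D2, E2), (D3, E3), (D4, E4)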
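The vertical partitioning behind a Projection Index can be sketched in a few lines of Python. This is only an illustration under assumptions: the table contents are placeholders, and `build_projection_index` and `reconstruct_row` are hypothetical helper names, not something defined in the presentation.

```python
# Hypothetical four-row base table; the values are placeholders, not real data.
COLUMNS = ("CustKey", "Qty", "Discount", "ExtPrice")
BASE_TABLE = [
    ("CK1", "Q1", "D1", "E1"),
    ("CK2", "Q2", "D2", "E2"),
    ("CK3", "Q3", "D3", "E3"),
    ("CK4", "Q4", "D4", "E4"),
]


def build_projection_index(rows, column_names, wanted):
    """Copy only the `wanted` columns, preserving row order.

    Position i in the result corresponds to row i of the base table,
    so the position acts as an implicit row identifier.
    """
    positions = [column_names.index(name) for name in wanted]
    return [tuple(row[p] for p in positions) for row in rows]


pi_custkey = build_projection_index(BASE_TABLE, COLUMNS, ["CustKey"])
pi_qty = build_projection_index(BASE_TABLE, COLUMNS, ["Qty"])
pi_disc_price = build_projection_index(BASE_TABLE, COLUMNS, ["Discount", "ExtPrice"])


def reconstruct_row(position):
    """Rebuild one base-table row by reading the same position in every PI."""
    return pi_custkey[position] + pi_qty[position] + pi_disc_price[position]


print(reconstruct_row(2))
```

Because every PI keeps the base table's row order, the PIs together hold all the data and the base table itself no longer needs to be stored.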
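Join Index
[Figure: a fact table and a dimension table decomposed into one JI and several PIs]
Base fact table (Tax, CustKey, Discount, ExtPrice): Tax = T1, T2, T3, T4; CustKey = CK1, CK2, CK3, CK3; (Discount, ExtPrice) = (D1, E1), (D2, E2), (D3, E3), (D4, E4)
Base dimension table (CustKey, Name, Address): (CK1, N1, A1), (CK2, N2, A2), (CK3, N3, A3)
JI replacing the fact table's CustKey column, holding dimension RIDs: RID1, RID2, RID3, RID3
The remaining columns are stored as PIs: Tax; Discount and ExtPrice; the dimension table's CustKey; Name and Address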
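A Join Index can be sketched as the fact table's foreign key column rewritten into row identifiers of the dimension table. The snippet below is a minimal sketch under assumptions: the data is hypothetical, `build_join_index` is an illustrative name, and RIDs are 0-based list positions rather than the RID1, RID2, ... labels of the figure.

```python
# Hypothetical data; RIDs here are 0-based positions in the dimension table.
DIM_CUSTKEY = ["CK1", "CK2", "CK3"]          # dimension primary keys, in row order
DIM_NAME = ["N1", "N2", "N3"]                # a PI on the dimension's Name column
FACT_CUSTKEY = ["CK1", "CK2", "CK3", "CK3"]  # fact-table foreign key column


def build_join_index(foreign_keys, dimension_keys):
    """Replace each foreign key value with the RID of the matching dimension row."""
    rid_of_key = {key: rid for rid, key in enumerate(dimension_keys)}
    return [rid_of_key[key] for key in foreign_keys]


join_index = build_join_index(FACT_CUSTKEY, DIM_CUSTKEY)
print(join_index)  # [0, 1, 2, 2]

# Following the JI from the fourth fact row straight to the dimension's Name PI:
print(DIM_NAME[join_index[3]])  # N3
```

With the RID in hand, a fact row reaches its dimension attributes by direct positional access, with no key comparison at query time.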
The Principle
Each foreign key column in the fact table is stored as a Join Index (JI).
All remaining columns, in both the dimension tables and the fact table, are stored as Projection Indexes (PIs).
Parallel Star Join
Data placement strategy:
Based on a shared-nothing architecture with $N$ processors
Assume a $d$-dimensional data warehouse
Partition the $N$ processors into $d+1$ groups
Assign to each group $j$ the dimension table $D_j$ and $J_j$, the corresponding fact table join index
Assign the metric PIs to group $d+1$
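Processor Group Partitioning
The number of processors in a group is governed by the size of its dimension table $D_j$.
Size of the $j$th processor group: the slide's formula relates $N_j$ and $N_{j,min}$ to the quantities $m$ and $s$; its exact form did not survive extraction.
Size of the metric group: $N_{d+1} = N \cdot \frac{\text{MetricPIsize}}{\text{Warehousesize}}$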
Physical Data Placement
Horizontally partition JIs across all processors
Replicate PIs on all processors
Use a round-robin strategy for partitioning JIs
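Round-robin horizontal partitioning of a JI can be sketched as follows. The function name, the sample JI, and the choice to keep each entry's original fact-row position alongside its RID are assumptions made for illustration.

```python
def round_robin_partition(join_index, num_processors):
    """Deal JI entries out to processors in turn, like cards.

    Each fragment keeps (fact_row_position, rid) pairs so that results
    computed on a fragment can be mapped back to fact-table rows.
    """
    fragments = [[] for _ in range(num_processors)]
    for fact_row, rid in enumerate(join_index):
        fragments[fact_row % num_processors].append((fact_row, rid))
    return fragments


# Hypothetical six-entry JI spread over three processors.
fragments = round_robin_partition([0, 1, 2, 2, 0, 1], 3)
for processor, fragment in enumerate(fragments):
    print(processor, fragment)
```

Round-robin placement gives every processor the same number of entries (to within one), regardless of how the key values are distributed.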
The Parallel Star Join Algorithm
A general $k$-dimensional star join query:
SELECT $A^d_P$, $A^m_P$
FROM $F, D_1, \ldots, D_k$
WHERE $P_{join}$ AND $P_{select}$
The algorithm has three phases:
Local rowset generation
Global rowset synthesis
Output preparation
Local Rowset Generation
Load PI fragment
[Figure: processors P1, P2, ..., Pc each load one PI fragment. Example: applying Qty > 10 to a PI fragment holding 25, 5, 7, 15 yields the rowset fragment 1001.]
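The first step of local rowset generation, evaluating a selection predicate over a PI fragment to get a bit vector, might look like the sketch below. The sample values 25, 5, 7, 15 and the predicate Qty > 10 are used as an example; the function name is hypothetical.

```python
def local_rowset_fragment(pi_fragment, predicate):
    """Return one bit per PI entry: 1 if the entry satisfies the predicate."""
    return [1 if predicate(value) else 0 for value in pi_fragment]


qty_fragment = [25, 5, 7, 15]
rowset_fragment = local_rowset_fragment(qty_fragment, lambda qty: qty > 10)
print(rowset_fragment)  # [1, 0, 0, 1]
```

Each processor needs only its own PI fragment for this step, so all processors in a group can run it independently.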
Local Rowset Generation (contd)
Merge dimension rowset fragments
Distribute dimension rowset
[Figure: the rowset fragments from processors P1 to P4 are combined with OR into the dimension rowset $R_{dim,i}$.]
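Merging rowset fragments with OR can be sketched as below. One assumption is made loudly here: each processor is taken to emit a full-length bit vector with zeros for the rows it does not own, which the presentation does not state. The function name and sample bits are illustrative.

```python
def merge_fragments_or(fragments):
    """Bitwise OR across equally long bit vectors, one per processor."""
    return [int(any(bits)) for bits in zip(*fragments)]


# Hypothetical output of four processors (assumed zero-padded to full length).
fragments = [
    [1, 0, 0, 0],  # P1
    [0, 0, 0, 0],  # P2
    [0, 0, 0, 1],  # P3
    [0, 0, 0, 0],  # P4
]
r_dim = merge_fragments_or(fragments)
print(r_dim)  # [1, 0, 0, 1]
```

After the merge, the combined rowset is what gets distributed back to the processors of the group for the next step.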
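Local Rowset Generation (contd)
Load JI fragment
Merge partial fact rowsets
[Figure: the dimension rowset $R_{dim,i}$ (1001) is applied to the JI fragment $JI_i$ (RIDs: RID1, RID2, RID3, RID3) to produce the fact rowset $R_{fact,i}$ (1000).]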
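Turning a dimension rowset into a fact rowset through the JI amounts to one bit lookup per JI entry. The sketch below assumes 0-based RIDs and uses a hypothetical function name; the sample bits are chosen for illustration.

```python
def fact_rowset_from_join_index(ji_fragment, dimension_rowset):
    """A fact row qualifies exactly when the dimension row it points to qualifies."""
    return [dimension_rowset[rid] for rid in ji_fragment]


r_dim = [1, 0, 0, 1]       # dimension rows 0 and 3 satisfy the predicate
ji_fragment = [0, 1, 2, 2]  # 0-based RIDs for four fact rows
r_fact = fact_rowset_from_join_index(ji_fragment, r_dim)
print(r_fact)  # [1, 0, 0, 0]
```

Only the first fact row references a qualifying dimension row, so only its bit is set; no dimension data is read in this step, just the bit vector and the JI.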
Global Rowset Synthesis
Merge local fact rowsets
Distribute global rowset to groups participating in the output phase
[Figure: local fact rowsets such as $R_{fact,1}$ and $R_{fact,2}$, produced by groups G1 to G4, are combined with AND into $R_{global}$.]
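Global rowset synthesis, the AND of the per-dimension fact rowsets, can be sketched as follows. The function name and the sample rowsets are hypothetical.

```python
def synthesize_global_rowset(local_fact_rowsets):
    """Bitwise AND: a fact row survives only if every dimension group accepts it."""
    return [int(all(bits)) for bits in zip(*local_fact_rowsets)]


r_fact_1 = [1, 1, 0, 1]  # hypothetical result from one dimension group
r_fact_2 = [1, 1, 1, 0]  # hypothetical result from another dimension group
r_global = synthesize_global_rowset([r_fact_1, r_fact_2])
print(r_global)  # [1, 1, 0, 0]
```

AND is the right operator here because the star join's selection predicates on different dimensions must all hold for a fact row to appear in the result.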
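Output Preparation
Distribute global rowset to individual processors
Load PI columns necessary for output
Merge output
[Figure: $R_{global}$ = 1100 selects the first two entries of $JI_i$ (RIDs: RID1, RID2, RID3, RID3); looking these up in $PI_i$ (CustKey: CK1, CK2, CK3, CK4) gives the output CK1, CK2.]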
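Output preparation for a dimension attribute can be sketched as: for every set bit in the global rowset, follow the JI to a dimension RID and read the needed PI at that position. The names and data below are illustrative assumptions, with 0-based RIDs.

```python
def prepare_output(global_rowset, join_index, dimension_pi):
    """Collect the PI value reached through the JI for each qualifying fact row."""
    return [
        dimension_pi[join_index[fact_row]]
        for fact_row, bit in enumerate(global_rowset)
        if bit
    ]


r_global = [1, 1, 0, 0]
join_index = [0, 1, 2, 2]                  # 0-based RIDs into the dimension
custkey_pi = ["CK1", "CK2", "CK3", "CK4"]  # PI on the dimension's CustKey
print(prepare_output(r_global, join_index, custkey_pi))  # ['CK1', 'CK2']
```

Only the PI columns named in the query's select list have to be loaded for this phase; all other columns stay untouched on disk.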
Performance Comparison
The PSJ algorithm was compared with the Bitmapped Join Index (BJI) algorithm and the pipelined hash join (HASH) algorithm.
Two performance metrics were used: response time in block accesses (RTBA) and aggregate data transmission (ADT).
Scalability Experiments
The curves rise as the scale factor and the number of processors increase.
PSJ cost is much lower than the BJI and HASH costs.
At large memory sizes, PSJ approaches "near-perfect" scalability.
Scalability Experiments (contd)
Transmission costs for PSJ and BJI are the same.
Both curves exhibit imperfect scalability.
HASH has substantially higher transmission costs than PSJ.
Conclusion
DataIndexes are a physical design strategy that provides efficient partitioning of the schema.
The Parallel Star Join algorithm provides a means to perform star joins in parallel.
PSJ performs better than the BJI and HASH algorithms in terms of I/O and transmission costs.