Parallel Star Join + DataIndexes: Efficient Query Processing in Data Warehousing and OLAP

Anindya Datta, Debra VanderMeer, Krithi Ramamritham

Presented by Ashutosh Joshi

Motivation

OLAP involves efficient retrieval of data from data warehouses for decision-support purposes

Data warehouses are extremely large, and queries against them are computationally expensive

A DataIndex is a storage structure that serves as both index and data

Parallel Star Join (PSJ) is an efficient algorithm for performing star joins in parallel

The Road Map

A physical design principle for exploiting parallelism

Parallel Star Join algorithm

Experimental results

The Star Schema

Each table is listed with its row count, then its attributes with their widths, then the total row width.

Fact Table

SALES (6,000,000 rows): PartKey 4, SuppKey 4, CustKey 4, Quantity 8, ExtPrice 8, Discount 8, Tax 8, RetFlag 1, Status 1, ShipDate 2, CommitDate 2, ReceiptDate 2, ShipInstruct 25, ShipMode 10, Comment 44; total 137

Dimension Tables

PART (200,000 rows): PartKey 4, Name 55, Mfgr 25, Brand 10, Type 25, Size 4, Others... 41; total 164

CUSTOMER (150,000 rows): CustKey 4, Name 25, Address 40, Nation 25, Region 25, Phone 15, AcctBal 8, MktSegment 10, Comment 117; total 269

SUPPLIER (10,000 rows): SuppKey 4, Name 25, Address 40, Nation 25, Region 25, Phone 15, AcctBal 8, Comment 101; total 243

TIME (2,557 rows): TimeKey 2, Alpha 10, Year 4, Month 4, Week 4, Day 4; total 28

A Physical Design Principle: DataIndexes

Serve as both index and data

Based on vertical partitioning of tables

Two types: Projection Index (PI) and Join Index (JI)

Projection Index

Base Table

CustKey  Qty  Discount  ExtPrice
CK1      Q1   D1        E1
CK2      Q2   D2        E2
CK3      Q3   D3        E3
CK4      Q4   D4        E4

PI on CustKey: CK1, CK2, CK3, CK4

PI on Qty: Q1, Q2, Q3, Q4

PI on (Discount, ExtPrice): (D1, E1), (D2, E2), (D3, E3), (D4, E4)

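Join Index

Base Fact Table

CustKey  Tax
CK1      T1
CK2      T2
CK3      T3
CK3      T4

Base Dimension Table

CustKey  Name  Address
CK1      N1    A1
CK2      N2    A2
CK3      N3    A3

JI on the fact table's CustKey (RIDs of the matching dimension rows): RID1, RID2, RID3, RID3

PI on Tax: T1, T2, T3, T4

PI on the dimension table's CustKey: CK1, CK2, CK3

PI on (Name, Address): (N1, A1), (N2, A2), (N3, A3)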

The Principle

Each foreign key column in the fact table is stored as a Join Index (JI)

The remaining columns (of both the dimension tables and the fact table) are stored as Projection Indexes (PIs)
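
The principle can be illustrated with a minimal Python sketch. The function name `build_data_indexes`, the list-of-dicts table layout, and the use of 0-based list positions as RIDs are assumptions made for illustration; the slides do not prescribe a storage format.

```python
def build_data_indexes(fact_rows, dim_rows, fk, dim_key):
    """Vertically partition one fact/dimension pair into PIs and one JI.

    Hypothetical helper: tables are lists of dicts, RIDs are 0-based positions.
    """
    # Map each dimension key value to the position (RID) of its row.
    rid_of = {row[dim_key]: rid for rid, row in enumerate(dim_rows)}
    # JI: the fact table's foreign key column, stored as dimension RIDs.
    ji = [rid_of[row[fk]] for row in fact_rows]
    # PIs: every remaining fact column, stored positionally.
    fact_pis = {c: [r[c] for r in fact_rows] for c in fact_rows[0] if c != fk}
    # PIs: every dimension column, stored positionally.
    dim_pis = {c: [r[c] for r in dim_rows] for c in dim_rows[0]}
    return ji, fact_pis, dim_pis


fact = [
    {"CustKey": "CK1", "Tax": "T1"},
    {"CustKey": "CK2", "Tax": "T2"},
    {"CustKey": "CK3", "Tax": "T3"},
    {"CustKey": "CK3", "Tax": "T4"},
]
dim = [
    {"CustKey": "CK1", "Name": "N1", "Address": "A1"},
    {"CustKey": "CK2", "Name": "N2", "Address": "A2"},
    {"CustKey": "CK3", "Name": "N3", "Address": "A3"},
]
ji, fact_pis, dim_pis = build_data_indexes(fact, dim, "CustKey", "CustKey")
print(ji)        # [0, 1, 2, 2]
print(fact_pis)  # {'Tax': ['T1', 'T2', 'T3', 'T4']}
```

The JI keeps one entry per fact row, so position i in the JI and position i in every fact PI refer to the same fact row.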

Parallel Star Join: Data Placement Strategy

Based on a shared-nothing architecture with N processors

Assume a d-dimensional data warehouse

Partition the N processors into d+1 groups

Assign to each group j (1 ≤ j ≤ d) the dimension table Dj and Jj, the fact table join index on Dj

Assign the metric PIs to group d+1

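Processor Group Partitioning

The number of processors in group j is governed by the size of dimension table Dj

Size of the jth processor group: [equation in $m$, $N$, $s$, $N_j$, and $N_{j,\min}$]

Size of the metric group: $N_{d+1} = N \cdot \dfrac{\text{Metric PI size}}{\text{Warehouse size}}$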

Physical Data Placement

Horizontally partition JIs across all processors

Replicate PIs on all processors

Use a round-robin strategy for partitioning JIs
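
Round-robin horizontal partitioning can be sketched as follows. The function name `round_robin_partition` and the choice to keep each entry's original row position alongside its value are assumptions for illustration, not details taken from the slides.

```python
def round_robin_partition(values, n_processors):
    """Deal the entries of one JI out to n_processors fragments in turn."""
    fragments = [[] for _ in range(n_processors)]
    for position, value in enumerate(values):
        # Keep the original row position so results can be re-aligned later.
        fragments[position % n_processors].append((position, value))
    return fragments


ji = [0, 1, 2, 2, 1]  # hypothetical join index of dimension RIDs
fragments = round_robin_partition(ji, 2)
print(fragments)  # [[(0, 0), (2, 2), (4, 1)], [(1, 1), (3, 2)]]
```

Round-robin placement gives every processor a fragment whose size differs from the others by at most one entry, regardless of the values stored.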

The Parallel Star Join Algorithm

A general k-dimensional star join query:

Select $A^d_P$, $A^m_P$
from $F, D_1, \ldots, D_k$
where $P_{join}$ and $P_{select}$

The algorithm has three phases: local rowset generation, global rowset synthesis, and output preparation

Local Rowset Generation

Load PI fragment: each processor P1, P2, ..., Pc loads its PI fragment

Example: evaluating Qty > 10 on the PI fragment (25, 5, 7, 15) gives the rowset fragment 1001
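
The slide's example (predicate Qty > 10 over the values 25, 5, 7, 15) can be reproduced with a short sketch. The function name `rowset_fragment` and the representation of a rowset as a list of 0/1 flags are assumptions for illustration.

```python
def rowset_fragment(pi_fragment, predicate):
    """Evaluate a selection predicate over a PI fragment, one bit per row."""
    return [1 if predicate(value) else 0 for value in pi_fragment]


qty_fragment = [25, 5, 7, 15]
bits = rowset_fragment(qty_fragment, lambda qty: qty > 10)
print(bits)  # [1, 0, 0, 1]
```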

Local Rowset Generation (contd)

Merge dimension rowset fragments

Distribute dimension rowset

[Figure: the rowset fragments from P1, P2, P3, and P4 are combined with OR into the dimension rowset Rdim,i]
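
One way to picture the OR merge is sketched below. It assumes that each processor's partial rowset is already padded with zeros to the full length of the dimension, which the slides do not state; `merge_or` is a hypothetical name.

```python
def merge_or(partial_rowsets):
    """Combine equal-length partial rowsets with a bitwise OR."""
    merged = [0] * len(partial_rowsets[0])
    for rowset in partial_rowsets:
        merged = [a | b for a, b in zip(merged, rowset)]
    return merged


# Assumed layout: two processors each report matches for different rows.
r_dim = merge_or([[1, 0, 0, 0], [0, 0, 0, 1]])
print(r_dim)  # [1, 0, 0, 1]
```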

Local Rowset Generation (contd)

Load JI fragment

Merge partial fact rowsets

[Figure: Rdim,i and JIi are used to produce Rfact,i; example bit vectors 1001 and 1000]
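
A minimal sketch of how a dimension rowset might be carried over to the fact table through a JI fragment. It assumes the JI stores, for each fact row, the 0-based RID of the referenced dimension row; the name `fact_rowset` and the sample values are illustrative and not taken from the slide.

```python
def fact_rowset(ji_fragment, dim_rowset):
    """Mark a fact row as selected when its dimension row is selected."""
    # Each JI entry is the RID of a dimension row, used as a direct lookup.
    return [dim_rowset[rid] for rid in ji_fragment]


dim_rowset = [1, 0, 0, 1]      # dimension rows 0 and 3 satisfy the predicate
ji_fragment = [0, 1, 2, 2, 3]  # hypothetical fact rows and their dimension RIDs
r_fact = fact_rowset(ji_fragment, dim_rowset)
print(r_fact)  # [1, 0, 0, 0, 1]
```

Because the JI already holds RIDs, the probe is a positional lookup rather than a value comparison against the dimension key.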

Global Rowset Synthesis

Merge local fact rowsets

Distribute global rowset to groups participating in the output phase

[Figure: the local fact rowsets (Rfact,1, Rfact,2, ...) from groups G1, G2, G3, and G4 are combined with AND into Rglobal]
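
The AND merge can be sketched as follows, assuming every group's fact rowset has one bit per fact row in the same order. The name `global_rowset` and the sample bit vectors are illustrative.

```python
def global_rowset(fact_rowsets):
    """Intersect the per-group fact rowsets with a bitwise AND."""
    result = list(fact_rowsets[0])
    for rowset in fact_rowsets[1:]:
        result = [a & b for a, b in zip(result, rowset)]
    return result


r_global = global_rowset([[1, 1, 0, 1], [1, 1, 1, 0]])
print(r_global)  # [1, 1, 0, 0]
```

A fact row survives only if every participating dimension selected it, which is why the groups' results are intersected rather than united.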

Output Preparation

Distribute global rowset to individual processors

Load PI columns necessary for output

Merge output

[Figure: with Rglobal = 1100, the CustKey PI (CK1, CK2, CK3, CK4) yields the output CK1, CK2; other labels: PIi, JIi]
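
The slide's example (global rowset 1100 applied to the CustKey column) can be sketched as a positional filter over a PI. The function name `project` is an assumption for illustration.

```python
def project(pi_column, rowset):
    """Return the PI values at the positions selected by the rowset."""
    return [value for value, bit in zip(pi_column, rowset) if bit]


cust_key_pi = ["CK1", "CK2", "CK3", "CK4"]
output = project(cust_key_pi, [1, 1, 0, 0])
print(output)  # ['CK1', 'CK2']
```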

Performance Comparison

The PSJ algorithm was compared with the Bitmapped Join Index (BJI) algorithm and the Pipelined Hash Join (HASH) algorithm

Two performance metrics were used: Response Time in Block Accesses (RTBA) and Aggregate Data Transmission (ADT)

Scalability Experiments

The curves rise as the scale factor and the number of processors increase

PSJ cost is much lower than BJI and HASH costs

At large memory sizes, PSJ approaches “near-perfect” scalability

Scalability Experiments (contd)

Transmission costs for PSJ and BJI are the same

Both curves exhibit imperfect scalability

HASH has substantially higher transmission costs than PSJ

Conclusion

DataIndexes are a physical design strategy that provides efficient partitioning of the schema

The Parallel Star Join algorithm provides a means to perform star joins in parallel

PSJ performs better than the BJI and HASH algorithms in terms of I/O and transmission costs
