Prof. Bayer, DWH, Ch.7, SS2000 1
Chapt. 7 Multidimensional Hierarchical Clustering
Fig. 3.1 Hierarchies in the `Juice and More´ schema (top level first, counts as given in the figure):
PRODUCT: All Products > Type (5) > Brand (8) > Category (19) > Container (10)
CUSTOMER: All Customer > Region (8) > Nation (7) > Trade Type (2) > Business Type (7)
DISTRIBUTION: All Distributions > Sales Organization (5) > Distribution Channel (3)
TIME: All Time > Year (3) > Month (12)
(b) Tables of the schema and their columns:
FACT (26M rows): PRODKEY, CUSTKEY, DISTKEY, TIMEKEY, SALES, DISTCOST, ...
PRODUCT (2180 rows): PRODKEY, TYPE, BRAND, CATEGORY, CONTAINER, ...
CUSTOMER (7064 rows): CUSTKEY, REGION, NATION, TRADE-TYPE, BUSINESS-TYPE, ...
DISTRIBUTION (12 rows): DISTKEY, SALESORG, CHANNEL, ...
TIME (36 rows): TIMEKEY, YEAR, MONTH
Size of completely aggregated Cube
$$\frac{(6 \cdot 9 \cdot 20 \cdot 11)(9 \cdot 8 \cdot 3 \cdot 8)(6 \cdot 4)(4 \cdot 13)}{(5 \cdot 8 \cdot 19 \cdot 10)(8 \cdot 7 \cdot 2 \cdot 7)(5 \cdot 3)(3 \cdot 12)} = \frac{4 \cdot 6 \cdot 6 \cdot 9 \cdot 11 \cdot 13}{5 \cdot 5 \cdot 7 \cdot 7 \cdot 19} = \frac{185328}{23275} = 7.96$$
i.e. 7.96 times larger than the base cube
Base Cube has 2.245.024.000 cells * 4 B ~ 9 GB
Number of available facts: 26 million
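The ratio for the completely aggregated cube can be checked with a small Python sketch. It assumes, as the formula on the slide suggests, that every hierarchy level with n values gets one additional aggregate value, so n becomes n + 1 in the numerator. The names `FANOUT` and `aggregation_ratio` are made up for this illustration.

```python
from fractions import Fraction
from math import prod

# Values per hierarchy level, as used in the denominator on the slide.
FANOUT = {
    "PRODUCT": [5, 8, 19, 10],
    "CUSTOMER": [8, 7, 2, 7],
    "DISTRIBUTION": [5, 3],
    "TIME": [3, 12],
}

def aggregation_ratio(fanout):
    """Ratio of the completely aggregated cube to the base cube.

    Assumption: each level contributes (n + 1) instead of n values.
    """
    all_levels = [n for levels in fanout.values() for n in levels]
    return Fraction(prod(n + 1 for n in all_levels), prod(all_levels))

ratio = aggregation_ratio(FANOUT)
print(ratio, round(float(ratio), 2))  # 185328/23275, about 7.96
```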
Sparsity:
$$\frac{26 \cdot 10^{6}}{2.245 \cdot 10^{9}} = 0.0116$$
i.e. 1.16 % of the cells are occupied: 100 % - 1.16 % = 98.84 % sparsity
Hierarchically aggregated Cube
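Cells per dimension, summed over all hierarchy levels:
PRODUCT: (1+5+40+760+7600) = 8406
CUSTOMER: (1+8+56+112+784) = 961
DISTRIBUTION: (1+5+15) = 21
TIME: (1+3+24) = 28
8406 * 961 * 21 * 28 = 4.749.961.608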
Size of base cube 2.145.024.000
Number of aggregate cells 2.504.937.608
==> Juice and More database has 96 times more hierarchically aggregated cells than occupied base cells!
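The size of the hierarchically aggregated cube can be recomputed with a short Python sketch. The per-level counts are copied unchanged from the slide; the names `LEVEL_SIZES` and `hierarchical_cube_cells` are chosen only for this illustration.

```python
from math import prod

# Members per hierarchy level (root "All" first), copied from the slide.
LEVEL_SIZES = {
    "PRODUCT": [1, 5, 40, 760, 7600],
    "CUSTOMER": [1, 8, 56, 112, 784],
    "DISTRIBUTION": [1, 5, 15],
    "TIME": [1, 3, 24],
}

def hierarchical_cube_cells(level_sizes):
    """Multiply the per-dimension sums over all hierarchy levels."""
    return prod(sum(sizes) for sizes in level_sizes.values())

total_cells = hierarchical_cube_cells(LEVEL_SIZES)
print(total_cells)  # 4749961608
```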
Star-Joins
Restrictions on several dimension tables, which are then joined with fact table
In addition: grouping, computation of aggregates, sorting of results.
Example:
Select <MEASURE AGGREGATION>
From Fact F, Customer C, Distribution D, Product P, Time T
Where F.ProdKey = P.ProdKey AND
      F.CustKey = C.CustKey AND
      F.TimeKey = T.TimeKey AND
      F.DistKey = D.DistKey AND
      <CUSTOMER RESTRICTION> AND
      <DISTRIBUTION RESTRICTION> AND
      <PRODUCT RESTRICTION> AND
      <TIME RESTRICTION>
Select <MEASURE AGGREGATION>
From Fact F
Where F.ProdKey BETWEEN Pkey1 AND Pkey2 AND
      F.DistKey BETWEEN Dkey1 AND Dkey2 AND
      F.CustKey BETWEEN Ckey1 AND Ckey2 AND
      F.TimeKey BETWEEN Tkey1 AND Tkey2
Key Question:
How to compute star-joins efficiently?
• Secondary indexes on foreign keys of fact table (standard B-trees), see chapter 5 for details
  - intersect result lists
  - retrieve tuples from fact table randomly
• Bitmaps
Bitmap Index Intersection
                                  Page 1      Page 2      Page 3      Page 4      Page 5
bitmap for organization = "TM"    1.....1.11  1.1...1.1.  1.1...1.1.  ...1.1....  ..1.1...1.   (34 % of tuples)
bitmap for region = "Asia"        11.1......  1.11.....1  .1.1..1...  1.1.1.....  .1..1.1...   (32 % of tuples)
result of bitmap intersection     1.........  1.1.......  ......1...  ..........  ....1.....   (10 % of tuples)
accessed disk pages (shaded in the figure): 80 % of pages
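The figures of this example can be reproduced with a small Python sketch. The bitmaps are copied from the slide, with one string per disk page and `.` standing for a 0 bit; the helper `intersect` and the variable names are made up for this illustration and do not represent how a database system stores bitmaps.

```python
# One string per disk page, "1" = tuple qualifies, "." = it does not.
PAGES_TM = ["1.....1.11", "1.1...1.1.", "1.1...1.1.", "...1.1....", "..1.1...1."]
PAGES_ASIA = ["11.1......", "1.11.....1", ".1.1..1...", "1.1.1.....", ".1..1.1..."]

def intersect(page_a, page_b):
    """Bitwise AND of two page bitmaps given as strings."""
    return "".join(
        "1" if a == "1" and b == "1" else "." for a, b in zip(page_a, page_b)
    )

result = [intersect(a, b) for a, b in zip(PAGES_TM, PAGES_ASIA)]

hits = sum(page.count("1") for page in result)       # qualifying tuples
tuples = sum(len(page) for page in result)           # all tuples
touched = sum(1 for page in result if "1" in page)   # pages with a hit

print(result)
print(f"{hits / tuples:.0%} of tuples, {touched / len(result):.0%} of pages")
```

Only 5 of the 50 tuples qualify, yet 4 of the 5 pages contain at least one hit and have to be read.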
Problem: even for small result sets of a few percent, almost all pages of the fact table must be fetched from disk if the hits of the result set are not clustered on disk.
Ex: 8 KB pages hold 20 to 400 tuples per page, so at 5 % to 0.25 % hits there is on average one hit per page, and almost all pages must be fetched.
At least tuple clustering, and preferably page clustering, is desirable, but how?
Goal: encode the hierarchies in such a way that a star-join with the fact table only requires a query box on the fact table.
Basic Idea for Multidimensional Clustering
Example Hierarchy in Member Set Representation
Level 0 (All Products): $m_1^0$ = {OJ 0.33L; OJ 0.7L; OJ 1L; Apple Juice 0.5L; Apple Juice 1L}, label All
Level 1 (Product Category): $m_1^1$ = {OJ 0.33L; OJ 0.7L; OJ 1L}, label Orange Juice; $m_2^1$ = {Apple Juice 0.5L; Apple Juice 1L}, label Apple Juice
Level 2: $m_1^2$ = {OJ 0.33L}, $m_2^2$ = {OJ 0.7L}, $m_3^2$ = {OJ 1L}, $m_4^2$ = {A-Juice 0.5L}, $m_5^2$ = {A-Juice 1L}, with labels 0.33L, 0.7L, 1L, 0.5L, 1L
Legend: each node shows the level label, the member ordinal (e.g., 1) and the member label (e.g., 0.7L)
Dimension D consists of
Value Set $V = \{v_1, v_2, \dots, v_n\}$
Hierarchy H of height h consisting of h+1 hierarchy levels $H = \{L_0, L_1, \dots, L_h\}$
Level $L_i$ is a set of sets $L_i = \{m_1^i, \dots, m_j^i\}$ with $m_k^i \subseteq V$
The $m_k^i$ get names, e.g. "Orange Juice" as label($m_1^1$), in general label($m_k^i$)
Constraint: every $m_l^{i+1}$ must be a subset of some $m_k^i$
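As an illustration, the juice example can be written down as member sets in Python and the subset constraint checked mechanically. This is only a sketch: the representation with `frozenset` and the names `V`, `LEVELS` and `satisfies_constraint` are assumptions made for this example.

```python
# Value set V: the five products of the juice example.
V = frozenset({"OJ 0.33L", "OJ 0.7L", "OJ 1L", "Apple Juice 0.5L", "Apple Juice 1L"})

# LEVELS[i] is hierarchy level L_i, a list of member sets m_k^i.
LEVELS = [
    [V],                                                   # L_0: All Products
    [frozenset({"OJ 0.33L", "OJ 0.7L", "OJ 1L"}),          # L_1: Orange Juice
     frozenset({"Apple Juice 0.5L", "Apple Juice 1L"})],   #      Apple Juice
    [frozenset({v}) for v in                               # L_2: single products
     ("OJ 0.33L", "OJ 0.7L", "OJ 1L", "Apple Juice 0.5L", "Apple Juice 1L")],
]

def satisfies_constraint(levels):
    """True if every member of level i+1 is a subset of some member of level i."""
    return all(
        any(lower_member <= upper_member for upper_member in upper)
        for upper, lower in zip(levels, levels[1:])
        for lower_member in lower
    )

print(satisfies_constraint(LEVELS))
```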
Hierarchic Relationships
The children of $m_k^i$ are all those sets $m_l^{i+1}$ of the lower level $i+1$ with the property $m_l^{i+1} \subseteq m_k^i$, formally:
$$\mathrm{children}(m_k^i) := \{\, m_l^{i+1} \in L_{i+1} : m_l^{i+1} \subseteq m_k^i \,\}$$
$$\mathrm{parent}(m_k^i) := \{\, m_l^{i-1} \in L_{i-1} : m_l^{i-1} \supseteq m_k^i \,\}$$
Principle: the children of $m$ are numbered by the bijective function $\mathrm{ord}_m$, starting at 1 or 0
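The following Python sketch shows one possible reading of children, parent and the numbering function for the juice example. The list order inside each level is taken as the numbering order, which is an assumption of this sketch, as are the function names and the `start` parameter.

```python
OJ = frozenset({"OJ 0.33L", "OJ 0.7L", "OJ 1L"})
AJ = frozenset({"Apple Juice 0.5L", "Apple Juice 1L"})
LEVELS = [
    [OJ | AJ],                                   # L_0: All Products
    [OJ, AJ],                                    # L_1: Product Category
    [frozenset({v}) for v in                     # L_2: single products
     ("OJ 0.33L", "OJ 0.7L", "OJ 1L", "Apple Juice 0.5L", "Apple Juice 1L")],
]

def children(levels, i, m):
    """Members of level i+1 that are subsets of member m of level i."""
    if i + 1 >= len(levels):
        return []
    return [c for c in levels[i + 1] if c <= m]

def parent(levels, i, m):
    """Members of level i-1 that are supersets of member m of level i."""
    if i == 0:
        return []
    return [p for p in levels[i - 1] if p >= m]

def ord_m(levels, i, m, child, start=0):
    """Number of `child` among the children of m, counting from `start`."""
    return children(levels, i, m).index(child) + start

print(ord_m(LEVELS, 1, OJ, frozenset({"OJ 0.7L"})))
```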
Enumeration and Surrogate Functions
Let A be an enumeration type
$A = \{a_0, a_1, \dots, a_k\}$
$f : A \to \{0, 1, \dots, k\}$ defined as
$f(a_i) = i$
then $i$ is called the surrogate of $a_i$
Hierarchies and composite Surrogates
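Basic Idea: concatenate the surrogates of successive hierarchy levels (compound surrogates cs)
Note: the root ALL of the hierarchy is not encoded
Def: compound surrogate cs for hierarchy H
$$\mathrm{ord}_m : \mathrm{children}(m) \to \{0, 1, \dots, |\mathrm{children}(m)| - 1\}$$
$$\mathrm{cs}(H, m^i) := \begin{cases} \mathrm{ord}_{\mathrm{father}(m^i)}(m^i) & \text{if } i = 1 \\ \mathrm{cs}(H, \mathrm{father}(m^i)) \circ \mathrm{ord}_{\mathrm{father}(m^i)}(m^i) & \text{otherwise} \end{cases}$$
where $\circ$ denotes concatenation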
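The recursive definition can be sketched in Python as follows. The sketch uses the ordinals of the path North America --> USA --> Retail --> Bar from the customer hierarchy example; the dictionary `FATHER` and the assumption that member names are unique are simplifications made only for this illustration.

```python
# member -> (father, ordinal of the member among the father's children).
# "ALL" is the root, which is not encoded.
FATHER = {
    "North America": ("ALL", 4),
    "USA": ("North America", 1),
    "Retail": ("USA", 1),
    "Bar": ("Retail", 2),
}

def cs(member):
    """Compound surrogate as a tuple of ordinals, one per hierarchy level."""
    father, ordinal = FATHER[member]
    if father == "ALL":                 # level 1: just the ordinal
        return (ordinal,)
    return cs(father) + (ordinal,)      # otherwise: concatenate

print(cs("Bar"))  # (4, 1, 1, 2)
```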
Example (a):
REGION             f(REGION)
South Europe       0
Middle Europe      1
Northern Europe    2
Western Europe     3
North America      4
Latin America      5
Asia               6
Australia          7
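In code, such a surrogate function is simply the position of a value in the enumeration. A minimal Python sketch with the regions of the example (the names `REGIONS`, `f` and `f_inverse` are chosen for this illustration):

```python
REGIONS = [
    "South Europe", "Middle Europe", "Northern Europe", "Western Europe",
    "North America", "Latin America", "Asia", "Australia",
]

# Surrogate function f and its inverse, derived from the enumeration order.
f = {region: i for i, region in enumerate(REGIONS)}
f_inverse = dict(enumerate(REGIONS))

print(f["North America"], f_inverse[6])  # 4 Asia
```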
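Fig.: Surrogates for Region and the entire Customer Hierarchy. The tree below CUSTOMER shows, with their surrogates: the regions South Europe (0), North America (4), Asia (6) and Australia (7); the nations Canada (0) and USA (1); the trade types Wholesale (0) and Retail (1); the business type Bar (2); and the customers Kana´s Sushi Bar and Joe's Sports Bar.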
Example: the path
North America --> USA --> Retail --> Bar
has the compound surrogate $4 \circ 1 \circ 1 \circ 2$
Next Idea: for every hierarchy level, determine the highest branching degree (plus a safety margin for future extensions) and encode the level with a fixed number of bits.
$$\mathrm{surrogates}(H, i) := \max \{\, \mathrm{cardinality}(\mathrm{children}(H, m)) : m \in \mathrm{level}(H, i-1) \,\}$$
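How many bits each level needs can be sketched in Python. The sketch assumes that the counts of the CUSTOMER hierarchy in Fig. 3.1 (8, 7, 2, 7) are the highest branching degrees per level; `bits_for` and its `margin` parameter are hypothetical names for this illustration.

```python
def bits_for(max_children, margin=0):
    """Bits needed to number max_children + margin children starting at 0."""
    return max(1, (max_children + margin - 1).bit_length())

# Assumed highest branching degree per level of the CUSTOMER hierarchy:
# Region, Nation, Trade Type, Business Type.
CUSTOMER_FANOUT = [8, 7, 2, 7]

widths = [bits_for(n) for n in CUSTOMER_FANOUT]
print(widths, sum(widths))  # [3, 3, 1, 3] 10
```

A safety margin simply enlarges the count before the bit width is derived, e.g. `bits_for(8, margin=4)` reserves 4 bits for the region level.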
handwritten page 6.6 ???
Problem with double indices?
Properties of MHC Encoding
• very compact encoding of fixed length
• the lexicographic order of the compound surrogates is preserved, i.e. it is isomorphic to the integer order of the encoded keys
• point restrictions on arbitrary hierarchy levels lead to interval restrictions on the compound surrogates
Example: path to USA is:
North America --> USA
$4 = 100_2$, $1 = 001_2$
leads to the range on cs: $100\,001\,0\,000_2$ to $100\,001\,1\,111_2$
and to the decimal range 528 to 543, or [528 : 543]
==> star join with restriction North America.USA leads to an interval restriction on the fact table
==> point restrictions on arbitrary hierarchy levels of several dimensions lead to Query Boxes on the fact table.
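The step from a point restriction to an interval can be sketched in Python. The bit widths 3, 3, 1, 3 are read off the binary range in the example; the function name `cs_interval` and its signature are made up for this illustration.

```python
WIDTHS = [3, 3, 1, 3]  # bits for Region, Nation, Trade Type, Business Type

def cs_interval(prefix, widths=WIDTHS):
    """Interval of compound surrogates sharing the given ordinal prefix."""
    value = 0
    for ordinal, width in zip(prefix, widths):
        if not 0 <= ordinal < (1 << width):
            raise ValueError(f"ordinal {ordinal} does not fit into {width} bits")
        value = (value << width) | ordinal
    free_bits = sum(widths[len(prefix):])   # bits of the unrestricted levels
    low = value << free_bits                # remaining bits all 0
    high = low | ((1 << free_bits) - 1)     # remaining bits all 1
    return low, high

print(cs_interval((4, 1)))  # (528, 543), North America.USA
```

Restricting more levels narrows the interval: the full path (4, 1, 1, 2) yields the single value 538, while the region alone, (4,), yields 512 to 639.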
Complex Hierarchies
• time with months and weeks: both restrictions lead to intervals on the level of days
• Example of Fig. 4-4
• proposal for multiple hierarchies: choose the most useful one (depending on the query profile) or treat multiple hierarchies as several independent hierarchies. Caution: this increases the number of dimensions!
• time-variant hierarchies: extend by the time interval of validity, see Example Fig. 4-5
Fig.: Complex Hierarchy Graphs. (a) YEAR, with MONTH and WEEK as parallel levels above DAY. (b) Levels REGION, NATION, TRADE TYPE, CUSTOMER TYPE, CUSTOMER SIZE, CUSTOMER.
Fig.: Change of a hierarchy over time. Below CUSTOMER: South Europe, North America, ...; below North America: Canada, USA; then Retail, Wholesale; then Bar, Restaurant. Joe's Sports Bar is attached by two edges labeled Year <= 1997 and Year > 1997.
[Figure; labels: Orange Juice, Asia]
Fig.: Processing a query box in sort order with the Tetris algorithm (labels: Apple Juice, Asia)