Hongwei Zhao, Xiaojun Ye
MOLAP on Cloud
Interactive, Cluster Data Warehouse
Tsinghua University
hwzhao73@gmail.com, yexj@mail.tsinghua.edu.cn

Motivation
Extend the cube model to support OLAP operations on Big Data:
» OLAP operations
» Interactive queries

Outline
• Cube modelling
• Building and querying
• Experimenting

Data Transform for Cube
TPC-DS tables → Star views → Cube data
User queries

A Simplified Cube Model
[Diagram labels: Cube Instance, Cube Metadata, Dimension Instance (Key → Dimension Member), Cuboid Instance (Key → Measure Node, Key → Measure Cell)]
Cuboid lattice over dimensions A, B, C: ABC; AB, AC, BC; A, B, C; * (diagram labels: Base Cuboids, Result)
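
The cuboid lattice on this slide can be enumerated mechanically: every subset of the dimension set is one cuboid, and the empty subset is the apex `*`. The sketch below is only an illustration; the function name `cuboid_lattice` is hypothetical and not taken from the talk.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """List every cuboid (subset of dimensions), finest first, ending with '*'."""
    lattice = []
    for size in range(len(dimensions), -1, -1):
        for combo in combinations(dimensions, size):
            # The empty combination is the fully aggregated apex cuboid.
            lattice.append("".join(combo) if combo else "*")
    return lattice


lattice = cuboid_lattice(["A", "B", "C"])
print(lattice)  # ['ABC', 'AB', 'AC', 'BC', 'A', 'B', 'C', '*']
```

For n dimensions the lattice has 2^n cuboids, which is why only some of them are usually materialized up front.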

Example: TPC-DS Query 7
select i_item_id,
       avg(ss_quantity) agg1,
       avg(ss_list_price) agg2,
       avg(ss_coupon_amt) agg3,
       avg(ss_sales_price) agg4
from store_sales, customer_demographics, date_dim, item, promotion
where ss_sold_date_sk = d_date_sk
  and ss_item_sk = i_item_sk
  and ss_cdemo_sk = cd_demo_sk
  and ss_promo_sk = p_promo_sk
  and cd_gender = '[GEN]'
  and cd_marital_status = '[MS]'
  and cd_education_status = '[ES]'
  and (p_channel_email = 'N' or p_channel_event = 'N')
  and d_year = [YEAR]
group by i_item_id
order by i_item_id

Relation Schema
[Star schema diagram: Store Sales, Date Dim, Item, Promotion, Customer Demographics]

Converting to BitKey
Fact rows:
Dimension A   Dimension B   Dimension C   Measure
A1            B1            C1            M1
A2            B1            C2            M2
A3            B2            C2            M3
Member encoding (in processing order):
Dimension Member   BitKey   Dimension Mask
A1                 000001   000001
B1                 000010   000010
C1                 000100   000100
A2                 001000   001001
B1                 000010   000010
C2                 010000   010100
Resulting cells:
BitKey   Value
000111   M1
011010   M2
[Diagram labels: Fact1, Fact2, Intermediate Result1, Result1, Result2]
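
One way to reproduce the bit patterns on this slide is sketched below. It rests on an assumption inferred from the example values, not on anything the slides state: each dimension numbers its members 1, 2, 3, ... and claims the next free global bit whenever the ordinal needs one more bit. The class `BitKeyEncoder` and its methods are hypothetical names.

```python
class BitKeyEncoder:
    """Assign bitkeys to dimension members, allocating bits on demand (assumed scheme)."""

    def __init__(self):
        self.next_bit = 0   # next unused global bit position
        self.dim_bits = {}  # dimension -> global bit positions it owns
        self.members = {}   # dimension -> {member: bitkey}

    def encode(self, dim, member):
        table = self.members.setdefault(dim, {})
        if member in table:
            return table[member]
        ordinal = len(table) + 1  # ordinals start at 1
        bits = self.dim_bits.setdefault(dim, [])
        while ordinal.bit_length() > len(bits):
            bits.append(self.next_bit)  # claim one more global bit
            self.next_bit += 1
        # Scatter the ordinal's binary digits onto the dimension's bit positions.
        key = sum(1 << pos for i, pos in enumerate(bits) if (ordinal >> i) & 1)
        table[member] = key
        return key

    def mask(self, dim):
        return sum(1 << pos for pos in self.dim_bits.get(dim, []))

    def cell_key(self, row):
        key = 0
        for dim, member in row.items():
            key |= self.encode(dim, member)
        return key


enc = BitKeyEncoder()
facts = [
    {"A": "A1", "B": "B1", "C": "C1"},
    {"A": "A2", "B": "B1", "C": "C2"},
    {"A": "A3", "B": "B2", "C": "C2"},
]
keys = [format(enc.cell_key(row), "06b") for row in facts]
print(keys)                        # ['000111', '011010', '111001']
print(format(enc.mask("A"), "06b"))  # 001001
```

The first two keys match the cells shown on the slide; the third key is what the assumed scheme yields for the third fact row, which the slide does not list.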

Cube Storage
Storage model: Table, Region, ColumnFamily, Row, Column, Version, Value, Cell
One table for dimension instances storage:
• Row Key: Dimension Name
• Column Family: Default
• Column: Member BitKey
• Value: Member Value
Multiple tables for cuboid instances:
• Table Name: Cuboid Name
• Row Key: Cell BitKey
• Column Family: Default
• Column: Measure Name
• Value: Measure Value

MDX for Query 7
select { i_item_id } on rows,
       { avg(ss_quantity), avg(ss_list_price), avg(ss_coupon_amt), avg(ss_sales_price) } on columns
from store_sales_cube
where (cd_gender.[Male], cd_marital_status.[Single], cd_education_status.[College], d_year.[2000])

Cube Implementation
Base cuboid building with 4 stages:
• Dimension constructing
• Hive query
• Aggregation
• Saving
Query execution with 4 stages:
• Loading dimension
• Other cuboid constructing
• Mapping
• Reducing
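
The aggregation stage and the "other cuboid constructing" stage can be sketched as follows. Cells keep a count and a sum per measure (the backup slide shows Mea_count and Mea_sum columns), because an average cannot be re-aggregated directly. Deriving a coarser cuboid by masking out the dropped dimensions' bits is an assumption about how the roll-up could work; the function names and measure values are hypothetical.

```python
from collections import defaultdict


def aggregate(facts):
    """Build base-cuboid cells: bitkey -> [count, sum] for one measure."""
    cells = defaultdict(lambda: [0, 0.0])
    for key, value in facts:
        cells[key][0] += 1
        cells[key][1] += value
    return dict(cells)


def roll_up(cells, keep_mask):
    """Derive a coarser cuboid by clearing the bits of the dropped dimensions."""
    coarse = defaultdict(lambda: [0, 0.0])
    for key, (count, total) in cells.items():
        coarse[key & keep_mask][0] += count
        coarse[key & keep_mask][1] += total
    return dict(coarse)


def average(cell):
    count, total = cell
    return total / count


# Hypothetical measure values attached to bitkeys from the BitKey example.
facts = [(0b000111, 10.0), (0b011010, 20.0), (0b000111, 30.0), (0b111001, 40.0)]
base = aggregate(facts)
MASK_B = 0b100010                   # bits owned by dimension B
cuboid_b = roll_up(base, MASK_B)    # keep only dimension B
print(average(cuboid_b[0b000010]))  # 20.0 (average for member B1)
```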

OLAP System
[Architecture diagram: Engine (Dispatcher Node, Worker Nodes with cache data) on top of a Columnar Database (Master Node, Region Nodes) holding the cube data]
Cluster Framework: Dispatcher Node, Worker Nodes
• Dynamically distribute cube data onto worker nodes
• Parallelize OLAP operations into a concurrent model

Actor of Akka
• State
• Behavior
• Mailbox
• Lifecycle
• Fault tolerance

Execute Query
[Diagram: actors Query Dispatcher, Cuboid Manager, Dimension Manager, Mapper, Reducer; numbered steps 1 to 4; message labels: require, Cuboid ready, Dimension load, data ready, Extract Query, Hit Cell]
Actors for Query
• Load dimension members
• Build other cuboids
• Mapping
• Reducing

Compiling & Mapping
Query 7 condition: GEN=M and MS=S and ES=College and YEAR=2000
GEN Mask:  000000011   Male:    000000010
MS Mask:   000011100   Single:  000001100
ES Mask:   001100000   College: 001000000
YEAR Mask: 110000000   2000:    010000000
Mask:      111111111
FilterKey: 011001110
[Diagram: Query Dispatcher sends the compiled filter to Mapper1, Mapper2, Mapper3]
For each cell in mapper {
  If ((key & mask) == FilterKey)
    Send to Reducer
}
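
A minimal sketch of the compile and map steps, using the masks and member keys listed on this slide. OR-ing the per-dimension masks and member keys gives the combined Mask and FilterKey; a cell passes when its masked key equals the FilterKey. The function names and the sample cells (including the extra high bit standing in for i_item_id) are hypothetical.

```python
# (dimension mask, selected member key) pairs from the slide
CONDITIONS = [
    (0b000000011, 0b000000010),  # GEN  = Male
    (0b000011100, 0b000001100),  # MS   = Single
    (0b001100000, 0b001000000),  # ES   = College
    (0b110000000, 0b010000000),  # YEAR = 2000
]


def compile_filter(conditions):
    """Combine per-dimension conditions into one (mask, filter_key) pair."""
    mask = filter_key = 0
    for dim_mask, member_key in conditions:
        mask |= dim_mask
        filter_key |= member_key
    return mask, filter_key


def map_cells(cells, mask, filter_key):
    """Keep the cells whose masked key equals the filter key."""
    return [(key, value) for key, value in cells if (key & mask) == filter_key]


mask, filter_key = compile_filter(CONDITIONS)
print(format(mask, "09b"), format(filter_key, "09b"))  # 111111111 011001110

# Hypothetical cells; bit 9 stands in for a dimension outside the filter.
cells = [
    ((1 << 9) | 0b011001110, 1.5),  # matches every condition
    ((1 << 9) | 0b011001101, 2.5),  # GEN bits differ, filtered out
]
hits = map_cells(cells, mask, filter_key)
```

The parentheses around `key & mask` matter: in Python and in C-like languages `==` binds tighter than `&`.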

Query Execution
[Diagram: Master exchanges messages and results with three Workers; labels Region 1 to 3 and Cache 1 to 3]
• Master sends task messages to workers
• Each worker caches its region's data
• Subsequent tasks reuse the cached data
The first query on the 1G dataset takes 48 seconds; subsequent queries with different parameters take 2.4 seconds.
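
The caching behaviour described above can be illustrated with a small sketch: a worker loads its region once, on the first task, and answers later tasks from memory. This is plain Python rather than Akka actors, and the `Worker` class, its `loads` counter, and the sample region are hypothetical.

```python
class Worker:
    """Holds one region's cells in memory after the first task (illustrative only)."""

    def __init__(self, region_loader):
        self._load = region_loader  # callable that reads the region from storage
        self._cache = None
        self.loads = 0              # how many times storage was actually read

    def run(self, mask, filter_key):
        if self._cache is None:     # only the first task pays the loading cost
            self._cache = self._load()
            self.loads += 1
        return [(k, v) for k, v in self._cache if (k & mask) == filter_key]


worker = Worker(lambda: [(0b01, 1.0), (0b10, 2.0)])
first = worker.run(0b11, 0b01)   # loads the region, then filters
second = worker.run(0b11, 0b10)  # different parameters, served from the cache
print(worker.loads)              # 1
```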

Experiments on TPC-DS
[Bar chart: fact records vs. cube cells for the 1G, 10G, and 100G datasets]
                   1G          10G          100G
Records number     2,653,108   26,532,571   265,325,821
Cube cell number   1,836,162   10,190,922   41,892,286
4 nodes:
• 2 × Intel Xeon CPU E5-2630
• 4 × 600G 15,000 rpm SAS
• 256G RAM
• 10Gb network
Dimensions: 1. i_item_id, 2. cd_gender, 3. cd_marital_status, 4. cd_education_status, 5. p_channel_email, 6. p_channel_event, 7. d_year
Measures: 8. ss_quantity_avg, 9. ss_list_price_avg, 10. ss_coupon_amt_avg, 11. ss_sales_price_avg
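
As a quick check on the record and cell counts reported on this slide, the ratio of cube cells to fact records can be computed directly. The numbers are copied from the slide; the variable names are illustrative.

```python
records = {"1G": 2_653_108, "10G": 26_532_571, "100G": 265_325_821}
cells = {"1G": 1_836_162, "10G": 10_190_922, "100G": 41_892_286}

# Fraction of cube cells per fact record at each scale factor.
ratios = {size: cells[size] / records[size] for size in records}
for size, ratio in ratios.items():
    print(f"{size}: {ratio:.1%} cells per fact record")
```

The ratio falls from roughly 69% at 1G to about 38% at 10G and about 16% at 100G, so the cube grows much more slowly than the fact data.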

Build Cube for Query 7
[Bar chart: running time (seconds, 0 to 6000) by TPC-DS data size (1G, 10G, 100G), split into querying, aggregating, and saving]
• Partition by the largest dimension (i_item_id)
• In-memory aggregation
• The saving stage can be ignored (cache)

Execute Query 7
[Chart: running time (seconds, 0 to 400) by iteration number (1 to 5) for 4, 8, and 16 workers]
First execution on the cube includes:
• Dimension loading
• Other cuboids construction
• Caching
• Mapping
• Reducing
Subsequent executions include:
• Mapping
• Reducing

Hive Query for Fact Data
select p_channel_email, p_channel_event, cd_gender, cd_marital_status,
       cd_education_status, i_item_id, d_year,
       ss_quantity, ss_list_price, ss_coupon_amt, ss_sales_price
from store_sales
join date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
join item on (store_sales.ss_item_sk = item.i_item_sk)
join customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
join promotion on (store_sales.ss_promo_sk = promotion.p_promo_sk)

Compare with Hive
[Two bar charts: Hive vs. prototype on the 1G, 10G, and 100G datasets, axis range 0 to 1400]
• First query time comparison: 2-3X
• Subsequent execution time comparison: 30-50X

Future work
• Cube Model:
  • Demand-driven & Data-driven
• Cube Data:
  • Model-driven & Requirement-driven
• More experiments on TPC-DS queries
  • Report, ad hoc, iterative, data mining
• MDX/XMLA compliance
Thanks.

Storage for Example
Table: Dimension
Row Key       Column Family: default
Dimension A   Mask     000001   001000   001001
              001001   A1       A2       A3
Dimension B   Mask     000010   100000
              100010   B1       B2
Table: Cuboid_ABC
Row Key   Column Family: default
000111    Mea_count   Mea_sum
          1           M1
011010    Mea_count   Mea_sum
          1           M2
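
The two tables above can be modelled as nested dictionaries to show how a cell's row key decodes back to dimension members: mask the key with the dimension's Mask column, then look the result up in the same row. This is an in-memory illustration only, not a client for any particular store; the helper `decode` is hypothetical, and M1 and M2 are kept as the symbolic values from the slide.

```python
# Table "Dimension": row key = dimension name, columns = Mask plus member bitkeys.
dimension_table = {
    "Dimension A": {"Mask": 0b001001, 0b000001: "A1", 0b001000: "A2", 0b001001: "A3"},
    "Dimension B": {"Mask": 0b100010, 0b000010: "B1", 0b100000: "B2"},
}

# Table "Cuboid_ABC": row key = cell bitkey, columns = measure names.
cuboid_abc = {
    0b000111: {"Mea_count": 1, "Mea_sum": "M1"},
    0b011010: {"Mea_count": 1, "Mea_sum": "M2"},
}


def decode(cell_key, dimension):
    """Return the member of `dimension` that a cell bitkey refers to."""
    row = dimension_table[dimension]
    return row[cell_key & row["Mask"]]


for key in cuboid_abc:
    print(format(key, "06b"), decode(key, "Dimension A"), decode(key, "Dimension B"))
# 000111 A1 B1
# 011010 A2 B1
```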