Scaling up analytical queries with column-stores
Ioannis Alagiannis Manos Athanassoulis Anastasia Ailamaki
École Polytechnique Fédérale de Lausanne
Drinking from a data firehose
Fast and high-quality data analysis for smart business decisions
Data warehouses: 1/3 of the database market ($$$)
Column-stores are here to stay!
Need for multiple concurrent users: 100s to 1000s of queries*
Many concurrent queries + column-stores = ???
* "High-performance data warehousing", TDWI best practices report
Multiple concurrent queries
[Figure: a DBMS on top of two groups of eight cores (CORE 1 to CORE 8), with memory (MEM) and disk (HDD)]
Find all restaurants with rating over 3.5 and close to East Village
steak? pasta? indian? vegan?
High contention for resources
throughput, response time
Throughput (memory-resident workload)
[Figure: throughput (kQ/h) vs. # clients, ideal and real curves; marked on the plot: total # HW contexts, saturation point]
Concurrency can hurt performance
TPC-H (sf: 30)
Experimental setup
Column-stores
• System-A and System-B (commercial)
• System-C (open-source)
Hardware
• Dual-socket Intel(R) Xeon(R) CPU E5-2660
• 2 sockets x 8 cores x 2 threads (32 HW contexts)
• 128 GB RAM, 1600 MHz DIMMs
• L1: 64 KB and L2: 256 KB (per core), L3: 20 MB (shared)
Workloads
TPC-H
• Scale factor: 30 (32 GB on disk)
• Qtpch = {10 query templates}
SSB (Star Schema Benchmark)
• Scale factor: 30 (18 GB on disk)
• Qssb = {all 13 query templates}
Throughput exp. with 25 query instances
Memory-resident
Hot-runs
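A minimal sketch of how a hot-run concurrency experiment like the one described above could be driven. It is not the authors' harness: `run_query` is a hypothetical stand-in for submitting one query to a DBMS (here it only sleeps), and the client model (one thread per client, each issuing queries back to back) is an assumption.

```python
import threading
import time
from statistics import mean


def run_query(template_id):
    """Hypothetical stand-in for submitting one query to the DBMS."""
    time.sleep(0.01)  # placeholder for real query execution


def run_clients(num_clients, queries_per_client, query_fn=run_query):
    """Run concurrent clients; return (queries per hour, avg response time in s)."""
    latencies = []
    lock = threading.Lock()

    def client():
        for i in range(queries_per_client):
            start = time.perf_counter()
            query_fn(i)
            elapsed = time.perf_counter() - start
            with lock:  # protect the shared list of measurements
                latencies.append(elapsed)

    threads = [threading.Thread(target=client) for _ in range(num_clients)]
    wall_start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall = time.perf_counter() - wall_start
    throughput = len(latencies) / wall * 3600  # completed queries per hour
    return throughput, mean(latencies)


def hot_run(num_clients, queries_per_client, query_fn=run_query):
    """Discard one warm-up pass, then measure with data assumed to be cached."""
    run_clients(num_clients, 1, query_fn)  # warm-up, result ignored
    return run_clients(num_clients, queries_per_client, query_fn)
```

Calling `hot_run(n, 25)` for increasing `n` would give one throughput and one average response time per concurrency level.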
Experiment 1:
How does increased concurrency affect response time?
Scaling up TPCH Q1
[Figure: avg. response time (sec, 0 to 500) vs. # concurrent queries (0 to 250) for System-A, System-B and System-C]
Linear increase in response time
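One way to check a "linear increase" claim on measured data is an ordinary least-squares fit of average response time against the number of concurrent queries. The sketch below is a generic fit, not the authors' analysis; the function name `linear_fit` and any input values are illustrative.

```python
def linear_fit(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 compares the residual error with the total variance of ys.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared
```

An R² close to 1 over the measured concurrency levels would support the linear reading; the slope is the extra response time per additional concurrent query.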
Scaling up SSB Q3.1
[Figure: avg. response time (sec, 0 to 250) vs. # concurrent queries (0 to 250) for System-A, System-B and System-C]
Similar behavior in SSB
Experiment 2:
What is the variability of query response time?
Variability of System-A
Groups of short, medium and long-running queries
TPCH (64 clients)
Variability of System-B
Balanced resource allocation: lower variation
TPCH (64 clients)
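The slides do not say which variability metric was plotted. As one common option, the coefficient of variation (standard deviation divided by mean) per query template can be sketched as follows; the function names and the input layout (a dict from template name to a list of response times in seconds) are assumptions.

```python
from statistics import mean, pstdev


def coefficient_of_variation(samples):
    """Population standard deviation over mean, for positive response times."""
    return pstdev(samples) / mean(samples)


def variability_by_template(response_times):
    """Map each query template to the CV of its observed response times."""
    return {
        template: coefficient_of_variation(samples)
        for template, samples in response_times.items()
    }
```

A CV near 0 means instances of the same template finish in similar time; larger values mean less predictable response times.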
Variability of System-C
System-C uses an admission control mechanism
TPCH (64 clients)
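How System-C implements admission control is not described in the slides. The general idea, capping how many queries execute at once and making the rest wait, can be sketched with a semaphore; `AdmissionController` and `max_active` are hypothetical names.

```python
import threading


class AdmissionController:
    """Allow at most max_active queries to execute at the same time."""

    def __init__(self, max_active):
        self._slots = threading.BoundedSemaphore(max_active)

    def run(self, query_fn, *args):
        # Blocks until a slot is free, then runs the query and frees the slot.
        with self._slots:
            return query_fn(*args)
```

Queries beyond the cap queue at the semaphore instead of competing for cores and caches, which trades waiting time for less contention among the admitted queries.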
Experiment 3:
How does increasing concurrency affect throughput?
Throughput - TPCH
[Figure: throughput (kQueries/h, 0 to 16000) vs. # concurrent clients (0 to 250) for System-A, System-B and System-C]
Throughput decreases after the saturation point
Plot annotations: 48%, 32% drop, 35% drop
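The slides do not state how the drop percentages were computed. A plausible reading, assumed here, is the relative loss from peak throughput to the throughput at the highest measured concurrency; `drop_from_peak` is an illustrative helper.

```python
def drop_from_peak(throughputs):
    """Return (index of the peak, percent drop from the peak to the last point).

    throughputs is ordered by increasing number of concurrent clients.
    """
    peak = max(throughputs)
    peak_index = throughputs.index(peak)
    drop_percent = (peak - throughputs[-1]) / peak * 100
    return peak_index, drop_percent
```

The peak index identifies the saturation point in the measured series; a drop of 0% would mean peak performance is sustained.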
Throughput - SSB
[Figure: throughput (kQueries/h, 0 to 20000) vs. # concurrent clients (0 to 250) for System-A, System-B and System-C]
Plot annotations: 29% drop, 39% drop, throughput plateaus
Exploiting sharing: sustain peak performance
When concurrency in column-stores is increased:
Response time increases linearly
… with high variability
After saturation, peak performance is not sustained (except for System-B on SSB)
Where do we go from here?
QPipe, DataPath, CJoin, SharedDB, Blink, Recycler (MonetDB), cooperative scans, CCM (cracking)
[Figure: throughput vs. # clients, ideal and real curves, with the saturation point marked]
Adaptive resource (re)allocation
Work sharing techniques
Contention-aware scheduling
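To illustrate what work sharing can mean for concurrent scans, the sketch below answers several threshold queries over one column in a single pass instead of one pass per query. It is a toy example of the general idea, not the mechanism of any system named above; the function names and the "count ratings above a threshold" query shape are assumptions.

```python
def separate_scans(ratings, thresholds):
    """Baseline: each query scans the whole column on its own."""
    return [sum(1 for r in ratings if r > t) for t in thresholds]


def shared_scan(ratings, thresholds):
    """Shared scan: read each value once and evaluate all queries on it."""
    counts = [0] * len(thresholds)
    for r in ratings:  # single pass over the column
        for i, t in enumerate(thresholds):
            if r > t:
                counts[i] += 1
    return counts
```

Both functions return the same counts; the shared version reads the column once regardless of how many queries are active, so the data movement no longer grows with the number of concurrent queries.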
Thank you!