Scaling up analytical queries with column-stores

Ioannis Alagiannis, Manos Athanassoulis, Anastasia Ailamaki

École Polytechnique Fédérale de Lausanne

Drinking from a data firehose

Fast and high-quality data analysis for smart business decisions

Data warehouses
• 1/3 of the database market ($$$)
• Column-stores are here to stay!

Need for multiple concurrent users
• 100s to 1000s of queries*

Many concurrent queries + column-stores = ???

* "High-performance data warehousing", TDWI best practices report

Multiple concurrent queries

[Figure: a DBMS on a multicore server (CORE 1 to CORE 8, MEM, HDD) serving several queries at once]

Find all restaurants with rating over 3.5 and close to East Village (steak? pasta? indian? vegan?)

High contention for resources

throughput, response time

Throughput (memory-resident workload)

[Figure: throughput (kQ/h) vs. # clients, ideal vs. real curves, TPCH (sf: 30); annotations: total #HW contexts, saturation point]

Concurrency can hurt performance

Experimental setup

Column-stores
• System-A and System-B (commercial)
• System-C (open-source)

Hardware
• Dual-socket Intel(R) Xeon(R) CPU E5-2660
• 2 sockets x 8 cores x 2 threads (32 HW contexts)
• 128 GB RAM, 1600 MHz DIMMs
• L1: 64 KB and L2: 256 KB (per core), L3: 20 MB (shared)

Workloads

TPC-H
• Scale factor: 30 (32 GB on disk)
• Qtpch = {10 query templates}

SSB (Star Schema Benchmark)
• Scale factor: 30 (18 GB on disk)
• Qssb = {all 13 query templates}

Throughput experiments with 25 query instances

Memory-resident, hot runs
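A throughput experiment of this kind can be sketched as follows. This is a minimal sketch, not the authors' harness: `run_query` is a hypothetical stand-in for submitting one query instance to the system under test, and the client and instance counts are parameters rather than the values used in the talk.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_clients(run_query, n_clients, instances_per_client):
    """Run n_clients concurrent clients; each issues its queries back to back.

    run_query is a hypothetical callable that executes one query instance
    and blocks until the result is available.
    Returns (completed_queries, elapsed_seconds).
    """
    def client(client_id):
        for i in range(instances_per_client):
            run_query(client_id, i)
        return instances_per_client

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        completed = sum(pool.map(client, range(n_clients)))
    elapsed = time.perf_counter() - start
    return completed, elapsed


def throughput_per_hour(completed, elapsed_seconds):
    """Convert a completed-query count and wall-clock time to queries/hour."""
    return completed * 3600.0 / elapsed_seconds
```

Sweeping `n_clients` and plotting `throughput_per_hour` against it would give a curve of the shape discussed in the talk, assuming the caches are warmed up first so that the runs are hot.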

Experiment 1:

How does increased concurrency affect response time?

Scaling up TPCH Q1

[Figure: avg. response time (sec) vs. # concurrent queries (0 to 250) for System-A, System-B and System-C]

Linear increase in response time
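One way to read the linear trend is through Little's law for a closed system with no think time, N = X · R: once throughput X stops growing, the average response time R = N / X grows in proportion to the number of concurrent queries N. The slides do not state this model; the sketch below is an interpretation, and the throughput value is an arbitrary placeholder, not a measurement.

```python
def avg_response_time(n_concurrent, throughput_per_sec):
    """Little's law for a closed system with zero think time: R = N / X."""
    if throughput_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return n_concurrent / throughput_per_sec


# Placeholder throughput (queries/sec), assumed flat past saturation.
X = 0.5
for n in (50, 100, 200):
    print(n, avg_response_time(n, X))
```

Doubling the number of concurrent queries doubles the predicted response time as long as throughput stays flat; if throughput drops past saturation, response time grows faster than linearly.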

Scaling up SSB Q3.1

[Figure: avg. response time (sec) vs. # concurrent queries (0 to 250) for System-A, System-B and System-C]

Similar behavior in SSB

Experiment 2:

What is the variability of query response time?
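A common way to quantify this kind of variability is the coefficient of variation (standard deviation divided by mean) of the per-query response times. The sketch below is illustrative only: the talk does not say which statistic its plots use, and the sample values are made up.

```python
import statistics


def coefficient_of_variation(response_times):
    """Population standard deviation divided by the mean (dimensionless)."""
    mean = statistics.fmean(response_times)
    if mean == 0:
        raise ValueError("mean response time is zero")
    return statistics.pstdev(response_times) / mean


# Made-up response times (seconds) for one query template under load.
sample = [12.0, 15.0, 11.0, 40.0, 13.0]
print(round(coefficient_of_variation(sample), 2))
```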

Variability of System-A

TPCH (64 clients)

Groups of short, medium and long running queries

Variability of System-B

TPCH (64 clients)

Balanced resource allocation, lower variation

Variability of System-C

TPCH (64 clients)

System-C uses an admission control mechanism
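The slides do not describe how System-C's admission control works. As a generic illustration of the idea, the sketch below caps the number of queries executing at once with a semaphore, so extra queries wait for a slot instead of competing for resources. `AdmissionController` and `max_active` are hypothetical names.

```python
import threading


class AdmissionController:
    """Admit at most max_active queries at a time; the rest wait for a slot."""

    def __init__(self, max_active):
        self._slots = threading.BoundedSemaphore(max_active)
        self._lock = threading.Lock()
        self.active = 0
        self.peak_active = 0

    def run(self, query_fn):
        with self._slots:  # blocks while max_active queries are running
            with self._lock:
                self.active += 1
                self.peak_active = max(self.peak_active, self.active)
            try:
                return query_fn()
            finally:
                with self._lock:
                    self.active -= 1
```

Each client thread would call `ctrl.run(lambda: execute(query))`, where `execute` is whatever function submits the query. Queries admitted late wait in the queue, which is one plausible source of the response-time spread seen under admission control.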

Experiment 3:

How does increasing concurrency affect throughput?

Throughput - TPCH

[Figure: throughput (kQueries/h) vs. # concurrent clients (0 to 250) for System-A, System-B and System-C; annotations: 48%, 32% drop, 35% drop]

Throughput decreases after the saturation point

Throughput - SSB

[Figure: throughput (kQueries/h) vs. # concurrent clients (0 to 250) for System-A, System-B and System-C; annotations: 29% drop, 39% drop, throughput plateaus]

Exploiting sharing sustains peak performance
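The drop percentages on these plots can be read as the relative loss from peak throughput to the throughput at the highest concurrency level. That definition is an assumption (the slides do not spell it out), and the series below is made up for illustration.

```python
def drop_from_peak(throughputs):
    """Percent decrease from the peak to the last measurement in the series."""
    peak = max(throughputs)
    if peak <= 0:
        raise ValueError("peak throughput must be positive")
    return 100.0 * (peak - throughputs[-1]) / peak


# Made-up throughput series (queries/hour) for increasing client counts.
series = [4000, 9000, 12000, 10000, 7800]
print(drop_from_peak(series))  # 35.0
```

A series that ends at its peak yields 0.0, which corresponds to the plateau case.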

When concurrency in column-stores is increased:

• Response time increases linearly
• … with high variability
• After saturation, peak performance is not sustained (except for System-B on SSB)

Where do we go from here?

QPipe, Datapath, CJoin, ShareDB, Blink, Recycler (MonetDB), cooperative scans, CCM (cracking)

[Figure: throughput vs. # clients, ideal vs. real curves, with the saturation point marked]

• Adaptive resource (re)allocation
• Work sharing techniques
• Contention-aware scheduling
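As an illustration of the work-sharing direction, the sketch below evaluates the predicates of several concurrent queries in one pass over a column, instead of scanning the column once per query. It is a toy example under assumed names (`shared_scan`, the restaurant-rating data), not the mechanism of any of the systems named on this slide.

```python
def shared_scan(column, predicates):
    """Count matching values for every query in a single pass over the column.

    predicates maps a query id to a function value -> bool.
    """
    counts = {query_id: 0 for query_id in predicates}
    for value in column:  # the column is read once, for all queries
        for query_id, matches in predicates.items():
            if matches(value):
                counts[query_id] += 1
    return counts


# Hypothetical data: restaurant ratings, with two concurrent queries.
ratings = [3.2, 4.1, 3.9, 2.5, 4.8]
print(shared_scan(ratings, {
    "q1": lambda r: r > 3.5,
    "q2": lambda r: r > 4.0,
}))  # {'q1': 3, 'q2': 2}
```

The memory traffic for the column is paid once regardless of how many queries are attached, which is the property that lets sharing keep throughput from falling as concurrency grows.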

Thank you!
