tractor pulling on data warehouse

Tractor Pulling on Data Warehouses

Martin Kersten, Volker Markl, Meikel Poess, Kai-Uwe Sattler, Alfons Kemper, Ani Nica

DBTest 2011

The good old days

• The early eighties, when
  – Oracle appeared on the scene
  – Ingres was a respected innovator in RDBMS
  – System R fought the Codasyl battle
  – IMS was still dominating the market

• There was a need for a metric to evaluate the solutions

The good old days

• Turned into an organised battle
  – TPC-C, TPC-H, TPC-D, TPC-W, …
  – hundreds of benchmarks to prove one’s muscles

• We need tools to assess a solution space

• We don’t need weapons to win a ‘war’

Dagstuhl 2010 Robust Query Processing

• With each step in the pull, the tension on the tractor increases (exponentially)

• The Tractor driver is throttling and changing gears to keep it going

Ingredients of the DBMS Tractor Pull

• A tractor pull is a series of workload steps for which we measure the performance

• Each step is defined by
  – Catalog changes
  – Database load, delete+load+create index
  – Query processing, BI grouped statistics
  – Concurrency
  – Act of God operations

A database soil

Generate a small database < RAM
Use a single data type

A database soil

COPY the smaller relation into the larger one

A database soil

Query template

SELECT R0.B0, ..., Ri.Bi, count(*),
       avg(R0.B0), avg(R1.B0), avg(R1.B1), ..., avg(Ri.B0), ...
FROM R0, ..., Ri
WHERE selectpattern(R0, ..., Ri)
  AND joinpattern(R0, ..., Ri)
GROUP BY R0.B0, ..., Ri.Bi
ORDER BY R0.B0, ..., Ri.Bi

Linear, Cyclic, Star-based, Clique query patterns

The n-th query load includes the (n-1)-th query load
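The four join patterns can be sketched as edge lists over the relations, which are then plugged into the template. This is a minimal illustration, not the toolkit's own generator: the names `join_edges` and `build_query` are hypothetical, every join is assumed to be an equi-join on column B0, only one grouping column per relation is used, and the select pattern is left out.

```python
from itertools import combinations


def join_edges(n, pattern):
    """Return the pairs (a, b) of relation indexes joined under a pattern."""
    if pattern == "linear":   # chain: R0-R1-...-R(n-1)
        return [(k, k + 1) for k in range(n - 1)]
    if pattern == "cyclic":   # the chain, closed back to R0
        edges = [(k, k + 1) for k in range(n - 1)]
        return edges + ([(n - 1, 0)] if n > 2 else [])
    if pattern == "star":     # R0 is the hub joined to all others
        return [(0, k) for k in range(1, n)]
    if pattern == "clique":   # every pair of relations is joined
        return list(combinations(range(n), 2))
    raise ValueError(f"unknown pattern: {pattern}")


def build_query(n, pattern):
    """Instantiate the template for R0..R(n-1); joins on B0 are an assumption."""
    rels = [f"R{k}" for k in range(n)]
    keys = ", ".join(f"{r}.B0" for r in rels)
    aggs = ", ".join(f"avg({r}.B0)" for r in rels)
    joins = " AND ".join(f"R{a}.B0 = R{b}.B0" for a, b in join_edges(n, pattern))
    where = f" WHERE {joins}" if joins else ""
    return (f"SELECT {keys}, count(*), {aggs} FROM {', '.join(rels)}"
            f"{where} GROUP BY {keys} ORDER BY {keys}")
```

Calling `build_query(4, "star")` yields a query in which R0 is joined to R1, R2 and R3; switching the pattern string changes only the WHERE clause.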

Scenarios

• Tractor pull workload
• W(N) = <S, L, Pre, Qry, Post, qry, db>
  – S: schema adjustments
  – L: loading the database
  – Pre: pre-optimization
  – Qry: query execution
  – Post: post-optimization
  – qry: query characteristics
  – db: database growth function

Hill scenario

• The Hill scenario models a data warehouse that grows with a modest growth rate of g ∈ (0, 1) (e.g., g = 0.2).

• It starts out from a main-memory focus until it overflows into a few disks.

• It will highlight a system’s robustness to deal with the memory-disk performance chasm.

Hill scenario

A modestly growing warehouse with a single user.
The database fits in memory and spills over to disk.

d ∈ (0%, 100%), g ∈ (0, 1)
Number of connections at track i: 1
db(0) = (d × RAM) × 1 / (2 × dom)
db(i) = g × i × db(0)
qry(0) = 1, qry(i) = 4
|Q(i)| = 1 + 4 × i
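The Hill formulas can be turned into a small sizing helper. This is only a sketch of the arithmetic on the slide: the function name `hill_track` and the default values are hypothetical, and `dom` is assumed here to be the width of the single data type, which the slides do not spell out.

```python
def hill_track(i, ram, d=0.5, g=0.2, dom=8):
    """Sizes for track i of the Hill scenario, following the slide formulas.

    ram is the memory size, d the fraction of RAM used by the initial
    database, g the growth rate, and dom the (assumed) data type width.
    """
    db0 = (d * ram) / (2 * dom)   # db(0) = (d x RAM) x 1 / (2 x dom)
    growth = g * i * db0          # db(i) = g x i x db(0)
    queries = 1 + 4 * i           # |Q(i)| = 1 + 4 x i
    return {"connections": 1, "db0": db0, "growth": growth, "queries": queries}
```

For example, `hill_track(3, ram=1600)` starts from db(0) = 50 and reports 13 queries at track 3.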

Meadow scenario

A stable warehouse with multiple users.
Query templates stress complexity.

d ∈ (0%, 100%), g = 0, C > 1
Number of connections at track i: C
db(0) = (d × RAM) × 1 / (2 × dom)
db(i) = 0 (no growth)
qry(0) = 0, qry(i) = C
|Q(i)| = 1 + C × i

Rockies scenario

A growing warehouse with multiple users.
Query templates stress complexity.

d ∈ (0%, 100%), g ∈ (0, 10)
Number of connections at track i: i
db(0) = (d × RAM) × 1 / (2 × dom)
db(i) = g × i × db(0)
qry(0) = 0, qry(i) = i × 4
|Q(i)| = 1 + 4 × i × (i + 1) / 2
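The closed forms for |Q(i)| in the Hill, Meadow and Rockies scenarios can be checked by summing the queries added per track. The sketch below reads |Q(i)| as one initial query plus the queries added at tracks 1 to i, which is an interpretation of the slides rather than something they state; `cumulative_queries` and the value of `C` are hypothetical.

```python
def cumulative_queries(qry, i):
    """One initial query plus the queries added at tracks 1..i."""
    return 1 + sum(qry(k) for k in range(1, i + 1))


C = 3  # hypothetical number of concurrent users in the Meadow scenario

# scenario name -> (queries added at track k, closed form for |Q(i)|)
scenarios = {
    "hill":    (lambda k: 4,     lambda i: 1 + 4 * i),
    "meadow":  (lambda k: C,     lambda i: 1 + C * i),
    "rockies": (lambda k: 4 * k, lambda i: 1 + 4 * i * (i + 1) // 2),
}

# The per-track sums agree with the closed forms on the first few tracks.
for name, (qry, closed_form) in scenarios.items():
    for i in range(6):
        assert cumulative_queries(qry, i) == closed_form(i), (name, i)
```

The Rockies load grows quadratically because both the number of connections and the queries added per track increase with i, whereas Hill and Meadow grow linearly.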

Robustness metrics

• It is a multi-dimensional metric aimed at measuring the deviation from the expected norm
• Robust(N) = <L, S, QO, QOk, QE, QEk, H>
  – Standard deviation of the loading time L
  – Standard deviation of the storage requirements
  – Standard deviation of query optimization (per track)
  – Standard deviation of query execution (per track)
  – Standard deviation, holistic
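A minimal sketch of how such a metric vector could be computed from raw measurements follows. It makes several assumptions that the slides do not confirm: QO and QE are taken over all tracks while QOk and QEk are taken within each track, the population standard deviation is used, and the holistic component H is omitted because its definition is not given. The function name `robustness` is hypothetical.

```python
from statistics import pstdev


def robustness(load, storage, opt, exe):
    """Standard deviations of tractor pull measurements.

    load, storage: one measurement per track.
    opt, exe: one list of per-query timings per track.
    """
    return {
        "L": pstdev(load),
        "S": pstdev(storage),
        "QO": pstdev([t for track in opt for t in track]),   # all tracks
        "QOk": [pstdev(track) for track in opt],             # per track
        "QE": pstdev([t for track in exe for t in track]),   # all tracks
        "QEk": [pstdev(track) for track in exe],             # per track
    }
```

A system whose per-track deviations stay flat as the tracks progress behaves predictably under growing load; a sharp rise at one track points to where the pull became hard.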

A hill scenario

A meadow scenario

A Rockies scenario

Takeaways

• Robustness is all about comparisons. We need methods to quickly determine differences in behavior.

• If the system reaches the end of the field, we are happy. If it blows up, or if the queries behave worse along the way, it is not robust.

Conclusions

• Tractorpulling is an effective new toolkit for testing the robustness of a DBMS in various dimensions

• Refinements for ease of analysis are needed (GUIs)

• http://sourceforge.net/projects/tractorpulling
