
Page 1: Staged Database Systems

@Carnegie Mellon Databases

Staged Database Systems

Thesis Oral

Stavros Harizopoulos

Page 2: Staged Database Systems

Database world: a 30,000 ft view

[Diagram: internet users reach the DBMS, which offloads data to a separate system]

OLTP: Online Transaction Processing: many short-lived requests (Sarah: “Buy this book”)

DSS: Decision Support Systems: few long-running queries (Jeff: “Which store needs more advertising?”)

DB systems fuel most e-applications

Improved performance has an impact on everyday life

Page 3: Staged Database Systems

New HW/SW requirements

• More capacity, throughput efficiency

• CPUs run much faster than they can access data

[Diagram: CPU-to-memory access cost has grown from roughly 1 cycle in the ’80s to 10-300 cycles today; DSS workloads also stress the I/O subsystem]

Need to optimize all levels of memory hierarchy

Page 4: Staged Database Systems


The further, the slower

• Keep data close to CPU

• Locality and predictability are key

DBMS core design contradicts above goals

Overlap memory accesses with computation

Modify algorithms and structures to exhibit more locality

Page 5: Staged Database Systems


Thread-based execution in DBMS

• Queries are handled by a pool of threads

• Threads execute independently

• No means to exploit common operations

[Diagram: in a conventional DBMS, pool threads execute with no coordination; StagedDB introduces coordination across them]

New design to expose locality across threads

Page 6: Staged Database Systems


Staged Database Systems

• Organize system components into stages

• No need to change algorithms / structures

[Diagram: in StagedDB, queries flow through Stage 1, Stage 2, and Stage 3, instead of directly into a monolithic DBMS]

High concurrency and locality across requests
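The staged idea above can be sketched in a few lines. This is a hypothetical toy, not code from the thesis: each stage owns its own queue, and draining that queue in one batch is what lets many requests reuse the same operator code while it is cache-resident.

```python
from collections import deque

class Stage:
    """A self-contained stage: its own queue plus the operator it runs.

    Hypothetical sketch: processing all queued requests in one batch keeps
    the stage's code and data hot in cache across requests.
    """
    def __init__(self, name, op):
        self.name = name
        self.op = op
        self.queue = deque()

    def enqueue(self, request):
        self.queue.append(request)

    def run_batch(self):
        # Drain the queue in one pass: every request reuses the same
        # operator code while it is cache-resident.
        results = []
        while self.queue:
            results.append(self.op(self.queue.popleft()))
        return results

# A toy two-stage pipeline: parse, then execute.
parse = Stage("parse", lambda q: q.upper())
execute = Stage("execute", lambda q: q + "!")

for query in ["select", "update"]:
    parse.enqueue(query)
for parsed in parse.run_batch():    # batch 1: parse both queries together
    execute.enqueue(parsed)
print(execute.run_batch())          # batch 2: execute both together
```

Note that no algorithm inside a stage changes; only the scheduling of requests through the stages does, which is the point of the design.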

Page 7: Staged Database Systems


Thesis

“By organizing and assigning system components into self-contained stages, database systems can exploit instruction and data commonality across concurrent requests, thereby improving performance.”

Page 8: Staged Database Systems

Summary of main results

• STEPS: 56%-96% fewer I-cache misses; full-system evaluation on Shore

• QPipe: 1.2x-2x throughput; full-system evaluation on BerkeleyDB

[Diagram: memory hierarchy, from the L1 I- and D-caches through L2-L3, RAM, and disks]

Page 9: Staged Database Systems

Contributions and dissemination

• Introduced StagedDB design; scheduling algorithms for staged systems

• Built novel query engine design; QPipe engine maximizes data and work sharing

• Addressed instruction cache in OLTP; STEPS applies to any DBMS with few changes

Venues: CIDR’03, CMU-TR’02, IEEE Data Eng. ’05, VLDB’04, SIGMOD’05, ICDE’06 demo subm., CMU-TR’05, HDMS’05, VLDB J. subm., TODS subm.

Page 10: Staged Database Systems

Outline

• Introduction

• QPipe

• STEPS

• Conclusions

Page 11: Staged Database Systems


Query-centric design of DB engines

• Queries are evaluated independently

• No means to share across queries

• Need new design to exploit common data, instructions, and work across operators

Page 12: Staged Database Systems

QPipe: operator-centric engine

• Conventional: “one query, many operators”

• QPipe: “one operator, many queries”

• Relational operators become Engines

• Queries break up into tasks and queue up

[Diagram: conventional runtime vs. QPipe, where each operator has its own queue]
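The operator-centric routing above can be made concrete with a small sketch. The `Packet` and `Dispatcher` names are hypothetical, chosen for illustration: each query's plan is broken into per-operator packets, and all packets for one operator type land in one queue, regardless of which query they came from.

```python
from collections import defaultdict, deque

class Packet:
    """Hypothetical unit of work: one operator from some query's plan."""
    def __init__(self, query_id, op_type, payload):
        self.query_id = query_id
        self.op_type = op_type      # e.g. "scan", "join", "aggregate"
        self.payload = payload

class Dispatcher:
    """Routes each packet to the queue of the engine for its operator type."""
    def __init__(self):
        self.engines = defaultdict(deque)   # op_type -> queue of packets

    def dispatch(self, packet):
        self.engines[packet.op_type].append(packet)

    def queued(self, op_type):
        # All queries' packets for one operator sit in one queue, so the
        # engine can spot overlapping work across queries.
        return [p.query_id for p in self.engines[op_type]]

d = Dispatcher()
# Two queries, each broken into per-operator packets.
d.dispatch(Packet(1, "scan", "LINEITEM"))
d.dispatch(Packet(1, "aggregate", "avg"))
d.dispatch(Packet(2, "scan", "LINEITEM"))
print(d.queued("scan"))   # both queries' scans share one engine queue
```

Because both scans of LINEITEM sit side by side in the scan engine's queue, the engine is in a position to detect and exploit the overlap, which a one-thread-per-query design never sees.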

Page 13: Staged Database Systems

QPipe design

[Diagram: in the conventional design, query plans are handled by a thread pool over the storage engine; in QPipe, a packet dispatcher routes packets to Engine-S, Engine-J, and Engine-A (scan, join, aggregate), each with its own queue and thread pool, reading and writing through the storage engine]

Page 14: Staged Database Systems


Reusing data & work in QPipe

• Detect overlap at run time

• Shared pages and intermediate results are simultaneously pipelined to parent nodes

[Diagram: the plans of Q1 and Q2 overlap in one operator; simultaneous pipelining lets both queries consume that operator's output]
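A minimal sketch of the run-time attachment described above (the `SPCoordinator` name is hypothetical): one in-progress operator feeds every query currently attached to it, and a query that attaches mid-production only sees results from that point on.

```python
class SPCoordinator:
    """Hypothetical simultaneous-pipelining coordinator: one in-progress
    operator pipelines each result to every attached query."""
    def __init__(self):
        self.consumers = {}          # query_id -> list of received results

    def attach(self, query_id):
        # A newly arrived query with an overlapping operator hooks in here;
        # it receives only results produced from this point on.
        self.consumers[query_id] = []

    def produce(self, result):
        # One production step is pipelined to all attached queries at once.
        for received in self.consumers.values():
            received.append(result)

coord = SPCoordinator()
coord.attach("Q1")
coord.produce("page-1")          # only Q1 is attached yet
coord.attach("Q2")               # Q2 arrives mid-scan and attaches
coord.produce("page-2")          # now both queries share each page
print(coord.consumers["Q1"])
print(coord.consumers["Q2"])
```

In this toy, Q2 misses "page-1"; handling what a late query missed (e.g. wrapping around an order-insensitive scan) is the part of the real mechanism not modeled here.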

Page 15: Staged Database Systems

Mechanisms for sharing

• Multi-query optimization: requires workload knowledge, not used in practice

• Materialized views: requires workload knowledge

• Buffer pool management: opportunistic

• Shared scans (RedBrick, Teradata, SQL Server): limited use

QPipe complements above approaches

Page 16: Staged Database Systems

Experimental setup

• QPipe prototype: built on top of BerkeleyDB, 7,000 lines of C++; shared-memory buffers, native OS threads

• Platform: 2GHz Pentium 4, 2GB RAM, 4 SCSI disks

• Benchmarks: TPC-H (4GB)

Page 17: Staged Database Systems

Sharing order-sensitive scans

• Two clients send TPC-H Query 4 at different intervals

• QPipe performs 2 separate joins

[Diagram: both plans run index scans of ORDERS and LINEITEM into a merge-join (M-J), with sort (S) and aggregate (A) above; the scans are order-sensitive, so the two merge-joins stay separate, whereas order-insensitive scans could be combined into one]

Page 18: Staged Database Systems

Sharing order-sensitive scans

• Two clients send query at different intervals; QPipe performs 2 separate joins

[Chart: total response time (sec), 0-300, vs. time difference between arrivals (0-140 sec), for Baseline and QPipe w/SP]

Page 19: Staged Database Systems

TPC-H workload

• Clients use a pool of 8 TPC-H queries

• QPipe reuses large scans, runs up to 2x faster

• ...while maintaining low response times

[Chart: throughput (queries/hr), 0-80, vs. number of clients (0-12), for QPipe w/SP, DBMS X, and Baseline]

Page 20: Staged Database Systems

QPipe: conclusions

• DB engines evaluate queries independently

• Limited existing mechanisms for sharing

• QPipe requires few code changes

• SP is a simple yet powerful technique

• Allows dynamic sharing of data and work

• Other benefits (not described here): I-cache and D-cache performance; efficient execution of MQO plans

Page 21: Staged Database Systems

Outline

• Introduction

• QPipe

• STEPS

• Conclusions

Page 22: Staged Database Systems

Online Transaction Processing

• High-end servers, non I/O bound

• L1-I stalls are 20-40% of execution time

• Instruction caches cannot grow

[Chart: cache size (10KB to 10MB) vs. year introduced (’96-’04): L1-I sizes for various CPUs stay flat while the max on-chip L2/L3 cache keeps growing]

Need solution for instruction cache-residency

Page 23: Staged Database Systems

Related work

• Hardware and compiler approaches

  • Increased block size, stream buffer [Ranganathan98]

  • Code layout optimizations [Ramirez01]

• Database software approaches

  • Instruction cache for DSS [Padmanabhan01][Zhou04]

  • Instruction cache for OLTP: challenging!

Page 24: Staged Database Systems

STEPS for cache-resident code

STEPS: Synchronized Transactions through Explicit Processor Scheduling

• Microbenchmark: eliminates 96% of L1-I misses

• TPC-C: eliminates 2/3 of misses, 1.4 speedup

[Diagram: a transaction (Begin, Select, Update, Insert, Delete, Commit); STEPS keeps the thread model and inserts synchronization points; each operation is still larger than the I-cache, so execution is multiplexed to reuse instructions]

Page 25: Staged Database Systems

I-cache aware context-switching

[Diagram: select() is split into segments s1-s7; each segment fits in the I-cache and ends at a context-switch (CTX) point. Without STEPS, thread 1 runs all of select() (all misses) and then thread 2 runs it again (all misses again). With STEPS, thread 1 runs one segment (misses), context-switches, and thread 2 runs the same segment while its instructions are still cached (hits)]
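The miss pattern on this slide can be reproduced with a toy trace model. This is a hypothetical simulation, not the thesis's simulator: the "cache" holds a few code segments with LRU eviction, and we compare running each thread to completion against STEPS-style multiplexing one segment at a time.

```python
def run(schedule, cache_size):
    """Count I-cache misses for a sequence of (thread, segment) steps.

    Toy model (assumption): the cache holds the last `cache_size`
    distinct code segments, evicted in LRU order.
    """
    cache, misses = [], 0
    for _, segment in schedule:
        if segment in cache:
            cache.remove(segment)       # refresh LRU position
        else:
            misses += 1
            if len(cache) == cache_size:
                cache.pop(0)            # evict least recently used
        cache.append(segment)
    return misses

segments = ["s%d" % i for i in range(1, 8)]   # select() split into s1..s7
threads = [1, 2]

# No STEPS: each thread runs the whole code path before the next starts.
plain = [(t, s) for t in threads for s in segments]
# STEPS: every thread runs one cache-resident segment before moving on.
steps = [(t, s) for s in segments for t in threads]

cache_size = 4   # the cache holds fewer segments than the full code path
print(run(plain, cache_size), run(steps, cache_size))   # prints: 14 7
```

With the plain schedule, every segment misses for both threads (14 misses); with the STEPS schedule, only the first thread misses on each segment and the second thread hits (7 misses), mirroring the M/H pattern on the slide.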

Page 26: Staged Database Systems

Placing CTX calls in source

AutoSTEPS tool: run the DBMS binary under valgrind to collect the trace of instruction memory references; feed the trace to a STEPS simulation, which selects the memory addresses for CTX points; then map those addresses back through gdb to the source lines (e.g. file1.c:30, file2.c:40) where CTX calls are inserted.

Evaluation:

• Comparable performance to manual placement

• ...while being more conservative

Page 27: Staged Database Systems

Experimental setup (1st part)

• Implemented on top of Shore

• AMD AthlonXP: 64KB L1-I + 64KB L1-D, 256KB L2

• Microbenchmark: index fetch, in-memory index

• Fast CTX for both systems, warm cache

Page 28: Staged Database Systems

Microbenchmark: L1-I misses

STEPS eliminates 92-96% of misses for additional threads

[Chart (AthlonXP): L1-I cache misses (1K-4K) vs. concurrent threads (1-10), for Shore and Shore w/STEPS]

Page 29: Staged Database Systems

L1-I misses & speedup

STEPS achieves max performance for 6-10 threads

• No need for larger thread groups

[Charts (AthlonXP): L1-I miss reduction (40-100%, with an upper limit line) and speedup (1.1-1.4) vs. concurrent threads (10-80)]

Page 30: Staged Database Systems

Challenges in full-system operation

So far:

• Threads are interested in same Op

• Uninterrupted flow

• No thread scheduler

Full-system requirements:

• High concurrency on similar Ops

• Handle exceptions: disk I/O, locks, latches, abort

• Co-exist with system threads: deadlock detection, buffer pool housekeeping

Page 31: Staged Database Systems

System design

• Fast CTX through fixed scheduling

• Repair thread structures at exceptions

• Modify only the thread package

[Diagram: per-operation STEPS wrappers (Op X, Op Y, Op Z); each wrapper runs an execution team, and a stray thread leaves its team for another Op]
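The team-plus-stray mechanism can be sketched as follows. The `ExecutionTeam` class and its methods are hypothetical names for illustration: threads on the same operation are switched in a fixed round-robin, and a thread that hits an exception (disk I/O, lock wait) drops out of the team as a stray rather than stalling everyone.

```python
class ExecutionTeam:
    """Hypothetical STEPS-style team: threads executing the same operation
    are context-switched in a fixed round-robin; a thread that blocks
    (I/O, lock, abort) becomes a stray and leaves the team."""
    def __init__(self, op, thread_ids):
        self.op = op
        self.team = list(thread_ids)
        self.strays = []

    def run_segment(self, blocks=()):
        # One CTX round: every team member executes the current
        # cache-resident code segment in turn.
        for tid in list(self.team):
            if tid in blocks:            # e.g. a page miss forces disk I/O
                self.team.remove(tid)
                self.strays.append(tid)  # repaired and regrouped later

team = ExecutionTeam("index_fetch", [1, 2, 3, 4])
team.run_segment()                # all four threads reuse each segment
team.run_segment(blocks={3})      # thread 3 blocks on I/O, goes stray
print(team.team, team.strays)
```

The fixed schedule is what keeps context switches cheap; the stray path is the "repair thread structures at exceptions" bullet, so system threads and blocked transactions can coexist with the teams.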

Page 32: Staged Database Systems

Experimental setup (2nd part)

• AMD AthlonXP: 64KB L1-I + 64KB L1-D, 256KB L2

• TPC-C (wholesale parts supplier): 2GB RAM, 2 disks

• 10-30 warehouses (1-3GB), 100-300 users

• Zero think time, in-memory, lazy commits

Page 33: Staged Database Systems

One transaction: payment

• STEPS outperforms baseline system

• 1.4 speedup, 65% fewer L1-I misses

[Chart: normalized count (20-100%) of cycles and L1-I misses for 100, 200, and 300 users]

Page 34: Staged Database Systems

Mix of four transactions

• Xaction mix reduces team size

• Still, 56% fewer L1-I misses

[Chart: normalized count (20-100%) of cycles and L1-I misses for 100 and 200 users]

Page 35: Staged Database Systems

STEPS: conclusions

• STEPS can handle full OLTP workloads

• Significant improvements in TPC-C: 65% fewer L1-I misses, 1.2-1.4 speedup

STEPS minimizes both capacity and conflict misses without increasing I-cache size or associativity

Page 36: Staged Database Systems

StagedDB: future work

• Promising platform for chip multiprocessors

  • DBMSs suffer from CPU-to-CPU cache misses

  • StagedDB allows work to follow data, not the other way around!

• Resource scheduling

  • Stages cluster requests for DB locks, I/O

  • Potential for deeper, more effective scheduling

Page 37: Staged Database Systems

Conclusions

• New hardware, new requirements

• Server core design remains the same

• Need new design to fit modern hardware

StagedDB: optimizes all memory hierarchy levels

A promising design for future installations

Page 38: Staged Database Systems

The speaker would like to thank:

his academic advisor, Anastassia Ailamaki;

his thesis committee members, Panos K. Chrysanthis, Christos Faloutsos, Todd C. Mowry, and Michael Stonebraker;

and his coauthors, Kun Gao, Vladislav Shkapenyuk, and Ryan Williams.

Thank you

Page 39: Staged Database Systems


QPipe backup

Page 40: Staged Database Systems

An Engine in detail

• Tuple batching: I-cache

• Query grouping: I- & D-cache

References on the slide: Harizopoulos04 (VLDB), Zhou03 (VLDB), Padmanabhan01 (ICDE), Zhou04 (SIGMOD)

[Diagram: an Engine with its queue, parameters, main routine, and relational operator code; a scheduling thread manages free and busy threads, and simultaneous pipelining hooks into the operator]

Page 41: Staged Database Systems

Simultaneous Pipelining in QPipe

[Diagram, without SP vs. with SP: without SP, Q1's join writes its results and, once COMPLETE, Q2 copies and reads them; with SP, Q2 attaches (1) to the in-progress join, whose SP coordinator copies what was already produced (2, 3) and pipelines (4) new join results to both Q1 and Q2]

Page 42: Staged Database Systems

Sharing data & work across queries

Query 1: “Find average age of students enrolled in both class A and class B” [plan: scans of TABLE A and TABLE B feed a merge-join, then an aggregate]

Query 2: max over a scan of TABLE A (data sharing opportunity with Query 1's scan)

Query 3: min over the same merge-join of TABLE A and TABLE B (work sharing opportunity with Query 1's join)

Page 43: Staged Database Systems

Sharing opportunities at run time

• Q1 executes operator R

• Q2 arrives with R in its plan

[Diagram: the sharing potential spans the window where result production for R in Q1 overlaps result production for R in Q2; without SP, Q1 writes R's results and Q2 reads them; with SP, an SP coordinator pipelines R's output so Q2 reads along with Q1]

Page 44: Staged Database Systems

TPC-H workload

• Clients use a pool of 8 TPC-H queries

• QPipe reuses large scans, runs up to 2x faster

• ...while maintaining low response times

[Charts: throughput (queries/hr), 0-80, vs. number of clients (0-12), for QPipe w/SP, DBMS X, and Baseline; average response time (0-1200) vs. think time (0-240 sec), for Baseline and QPipe w/SP]

Page 45: Staged Database Systems


STEPS backup

Page 46: Staged Database Systems

Smaller L1-I cache

STEPS outperforms Shore even on smaller caches (PIII)

• 62-64% fewer mispredicted branches on both CPUs

[Chart (AthlonXP and Pentium III, 10 threads): normalized counts (20-120%) of cycles, L1-I misses, mispredicted branches, L1-D misses, branches, branches that missed the BTB, and instruction stalls (cycles); one bar reaches 209%]

Page 47: Staged Database Systems

SimFlex: L1-I misses

STEPS eliminates all capacity misses (16KB, 32KB caches)

• Up to 89% overall miss reduction (upper limit is 90%)

[Chart (AthlonXP, 64-byte cache blocks, 10 threads): L1-I cache misses (2K-10K) vs. associativity (direct, 2-way, 4-way, 8-way, full), for Shore and STEPS at 16KB, 32KB, and 64KB caches, each with its MIN line]

Page 48: Staged Database Systems

One Xaction: payment

STEPS outperforms Shore

• 1.4 speedup, 65% fewer L1-I misses

• 48% fewer mispredicted branches

[Chart: normalized counts (20-100%) of cycles, L1-I misses, L1-D misses, L2-I misses, L2-D misses, and mispredicted branches for 10, 20, and 30 warehouses]

Page 49: Staged Database Systems

Mix of four Xactions

• Xaction mix reduces average team size (4.3 in 10W)

• Still, STEPS has 56% fewer L1-I misses (out of 77% max)

[Chart: normalized counts of cycles, L1-I misses, L1-D misses, L2-I misses, L2-D misses, and mispredicted branches for 10 and 20 warehouses; two bars reach 121% and 125%]