
Parallel Database Systems

Analyzing LOTS of Data

Jim Gray
Microsoft Research

Main Message

Technology trends give
– many processors and storage units
– inexpensively

To analyze large quantities of data
– sequential (regular) access patterns are 100x faster
– parallelism is 1000x faster (trades time for money)
– Relational systems show many parallel algorithms.

Moore’s Law

• XXX doubles every 18 months: 60% increase per year
  – Microprocessor speeds
  – Chip density
  – Magnetic disk density
  – Communications bandwidth: WAN bandwidth approaching LANs.

[Figure: 1-chip memory size (2 MB to 32 MB) by year, 1970 to 2000. Chip capacity in bits: 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M. An 8 KB mark also appears.]

Magnetic Storage Cheaper than Paper

• File Cabinet:
    cabinet (4 drawer)         250$
    paper (24,000 sheets)      250$
    space (2x3 @ 10$/ft2)      180$
    total                      700$
    3 ¢/sheet
• Disk:
    disk (4 GB)                500$
    ASCII: 2 M pages           0.025 ¢/sheet (100x cheaper)
    Image: 200 K pages         0.25 ¢/sheet (10x cheaper)
• Store everything on disk
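The per-sheet figures on this slide can be rechecked with a few lines of arithmetic. The sketch below uses only the prices and page counts quoted above; the helper name `cents_per_sheet` is hypothetical, and the cabinet total is computed from the three line items (680$) rather than from the rounded 700$ shown on the slide.

```python
def cents_per_sheet(total_dollars: float, sheets: int) -> float:
    """Storage cost in cents per sheet (hypothetical helper)."""
    return 100.0 * total_dollars / sheets

# Line items from the slide: cabinet + paper + floor space.
cabinet_total = 250 + 250 + 180           # 680$, rounded to 700$ on the slide

paper = cents_per_sheet(cabinet_total, 24_000)
disk_ascii = cents_per_sheet(500, 2_000_000)   # 4 GB disk holding 2 M ASCII pages
disk_image = cents_per_sheet(500, 200_000)     # same disk holding 200 K page images

print(f"paper      : {paper:.3f} cents/sheet")
print(f"disk ASCII : {disk_ascii:.3f} cents/sheet ({paper / disk_ascii:.0f}x cheaper)")
print(f"disk image : {disk_image:.3f} cents/sheet ({paper / disk_image:.0f}x cheaper)")
```

The ratios come out slightly above the slide's round 100x and 10x, which is consistent with the slide quoting order-of-magnitude figures.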

Today’s Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs

[Figure: two plots against Access Time (seconds, 10^-9 to 10^3). "Size vs Speed" plots Size (B); "Price vs Speed" plots $/MB. Storage levels labeled: Main, Secondary, Online Tape, Nearline Tape, Offline Tape.]

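Storage Latency: How Far Away is the Data?

[Figure: storage levels compared with travel distances. Levels: Registers, On Chip Cache, On Board Cache, Memory, Tape/Optical Robot. Distances: My Head, This Room, This Campus, Sacramento, Andromeda. Times: 10 min, 1.5 hr, 2 Years, 2,000 Years.]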

Thesis: Many Little will Win over Few

[Figure: Mainframe, Mini, Micro, Nano at 1 M$, 100 K$, 10 K$; disk form factors 5.25", 3.5", 2.5", 1.8"; labels: 1k SPECint CPU, 500 GB Disc, 2 GB RAM.]

Implications of Hardware Trends

• Large Disc Farms will be inexpensive (10 K$/TB)
• Large RAM databases will be inexpensive (1 K$/GB)
• Processors will be inexpensive

So the building block will be a processor with
– large RAM
– lots of Disc
– lots of network bandwidth

CyberBrick™

Implication of Hardware Trends: Clusters

Future Servers are CLUSTERS of processors, discs

Distributed Database techniques make clusters work

[Figure labels: 50 GB Disc, 5 GB RAM]

Summary

• Tech trends => pipeline & partition parallelism
  – Lots of bytes & bandwidth per dollar
  – Lots of latency
  – Lots of MIPS per dollar
  – Lots of processors
• Putting it together: Scaleable Networks and Platforms
  – Build clusters of commodity processors & storage
  – Commodity interconnect is key (S of PMS)
    • Traditional interconnects give 100k$/slice.
  – Commodity Cluster Operating System is key
  – Fault isolation and tolerance is key
  – Automatic Parallel Programming is key

Parallelism: Performance is the Goal

Goal is to get 'good' performance.

Law 1: parallel system should be faster than serial system

Law 2: parallel system should give near-linear scaleup or near-linear speedup or both.

Kinds of Parallel Execution

• Pipeline: Any Sequential Program feeds Any Sequential Program
• Partition: outputs split N ways, inputs merge M ways

[Figure: a pipeline of sequential programs, and partitioned copies of a sequential program]

The Perils of Parallelism

[Figure: three speedup plots against Processors & Discs, titled "The Good Speedup Curve", "A Bad Speedup Curve", and "A Bad Speedup Curve: 3 Factors". Curve labels: Linearity, No Parallelism Benefit.]

Startup: Creating processes, opening files, optimization

Interference: Device (cpu, disc, bus), logical (lock, hotspot, server, log, ...)

Skew: If tasks get very small, variance > service time

Why Parallel Access To Data?

• 1 Terabyte at 10 MB/s: 1.2 days to scan
• 1 Terabyte, 1,000 x parallel: 1.5 minute SCAN.

Parallelism: divide a big problem into many smaller ones to be solved in parallel.
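The scan times quoted on this slide follow from dividing the data size by the aggregate bandwidth. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and perfectly even partitioning across the parallel streams; the function name `scan_seconds` is hypothetical.

```python
TB = 1e12   # bytes, decimal units assumed
MB = 1e6    # bytes

def scan_seconds(size_bytes: float, bytes_per_second: float, ways: int = 1) -> float:
    """Time to scan the data with `ways` equal-speed streams reading in parallel."""
    return size_bytes / (bytes_per_second * ways)

serial = scan_seconds(1 * TB, 10 * MB)
parallel = scan_seconds(1 * TB, 10 * MB, ways=1000)

print(f"serial scan  : {serial / 86_400:.2f} days")
print(f"1,000-way    : {parallel / 60:.2f} minutes")
```

Under these assumptions the serial scan takes about 1.2 days and the 1,000-way scan takes about 100 seconds, the same order as the 1.5 minutes quoted on the slide.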

Data Flow Programming: Prefetch & Postwrite Hide Latency

Can't wait for the data to arrive (2,000 years!)

Memory that gets the data in advance (100 MB/s)

Solution:
– Pipeline from storage (tape, disc...) to cpu cache
– Pipeline results to destination

[Figure labels: Bandwidth, Latency]

Parallelism: Speedup & Scaleup

Speedup: Same Job, More Hardware, Less time (100 GB -> 100 GB)

Scaleup: Bigger Job, More Hardware, Same time (100 GB -> 1 TB)

Transaction Scaleup: more clients/servers, Same response time (1 k clients -> 10 k clients)
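The two metrics on this slide are usually written as ratios of elapsed times. The sketch below states them that way; the timings in the example are made-up numbers for illustration only, not measurements from the talk.

```python
def speedup(small_system_seconds: float, big_system_seconds: float) -> float:
    """Same job on both systems. Linear speedup on N times the hardware equals N."""
    return small_system_seconds / big_system_seconds

def scaleup(small_job_small_system_seconds: float,
            big_job_big_system_seconds: float) -> float:
    """Job and hardware grow together. Linear scaleup equals 1.0."""
    return small_job_small_system_seconds / big_job_big_system_seconds

# Hypothetical timings: a 100 GB job on 1 node vs. 10 nodes,
# and a 1 TB job on the 10-node system.
t_100gb_1node = 1000.0
t_100gb_10nodes = 110.0
t_1tb_10nodes = 1050.0

print(f"speedup on 10 nodes: {speedup(t_100gb_1node, t_100gb_10nodes):.2f} (linear would be 10)")
print(f"scaleup to 1 TB    : {scaleup(t_100gb_1node, t_1tb_10nodes):.2f} (linear would be 1)")
```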

Benchmark Buyer's Guide

The Benchmark Report

[Figure: "The Whole Story (for any system)", plotted against Processors & Discs]

Things to ask
• When does it stop scaling?
• Throughput numbers, not ratios.
• Standard benchmarks allow
  – comparison to others
  – comparison to sequential

Ratios & non-standard benchmarks are red flags.

Why are Relational Operators So Successful for Parallelism?

Relational data model: uniform operators on uniform data stream, closed under composition

• Each operator consumes 1 or 2 input streams
• Each stream is a uniform collection of data
• Sequential data in and out: Pure dataflow

Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation

AUTOMATIC PARALLELISM

Database Systems “Hide” Parallelism

• Automate system management via tools
  – data placement
  – data organization (indexing)
  – periodic tasks (dump / recover / reorganize)
• Automatic fault tolerance
  – duplex & failover
  – transactions
• Automatic parallelism
  – among transactions (locking)
  – within a transaction (parallel execution)

Automatic Parallel OR DB

    Select image
    from landsat
    where date between 1970 and 1990
      and overlaps(location, :Rockies)
      and snow_cover(image) > .7;

Assign one process per processor/disk:
– find images with right date & location
– analyze image, if 70% snow, return it

[Figure: Landsat table with columns date (1/2/72 ... 4/8/95), loc (33N 120W ... 34N 120W), and image; Temporal and Spatial labels; date, location, & image tests produce the Answer.]

Automatic Data Partitioning

Split a SQL table to subset of nodes & disks

Partition within set:
  Range (A...E, F...J, K...N, O...S, T...Z): Good for equi-joins, range queries, group-by
  Hash: Good for equi-joins
  Round Robin: Good to spread load

Shared disk and memory less sensitive to partitioning. Shared nothing benefits from "good" partitioning.
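The three partitioning schemes named on this slide can each be written as a function from a row to a partition number. This is a minimal sketch, not any product's implementation: the function names are hypothetical, the range bounds mirror the A...E / F...J / K...N / O...S / T...Z figure, and CRC32 stands in for whatever hash function a real system would use.

```python
import bisect
import zlib

# Inclusive upper bound of every range partition except the last (T...Z).
UPPER_BOUNDS = ["E", "J", "N", "S"]

def range_partition(key: str, upper_bounds=UPPER_BOUNDS) -> int:
    """Partition by the first letter of the key: A...E -> 0, ..., T...Z -> 4."""
    return bisect.bisect_left(upper_bounds, key[0].upper())

def hash_partition(key: str, partitions: int) -> int:
    """Partition by a deterministic hash of the key (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % partitions

def round_robin_partition(row_number: int, partitions: int) -> int:
    """Partition by arrival order, ignoring the key entirely."""
    return row_number % partitions

names = ["Adams", "Gray", "Kim", "Olsen", "Turing"]
for i, name in enumerate(names):
    print(name, range_partition(name), hash_partition(name, 5), round_robin_partition(i, 5))
```

Range partitioning keeps neighbouring keys together, which is why it suits range queries and group-by; hashing sends equal keys to the same partition, which is what an equi-join needs; round robin only balances load.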

Index Partitioning

Hash indices partition by hash (figure: 0...9, 10..19, 20..29, 30..39, 40..)

B-tree indices partition as a forest of trees. One tree per range (figure: A..C, D..F, G...M, N...R, S..Z)

Primary index clusters data

Secondary Index Partitioning

In shared nothing, secondary indices are Problematic

Partition by base table key ranges
– Insert: completely local (but what about unique?)
– Lookup: examines ALL trees (see figure)
– Unique index involves lookup on insert.

Partition by secondary key ranges
– Insert: two nodes (base and index)
– Lookup: two nodes (index -> base)
– Uniqueness is easy

Teradata solution

[Figure: Base Table partitioned A..C, D..F, G...M, N...R, S..; secondary index trees each covering A..Z]

Data Rivers: Split + Merge Streams

• Producers add records to the river, Consumers consume records from the river
• Purely sequential programming.
• River does flow control and buffering, does partition and merge of data records
• River = Split/Merge in Gamma = Exchange operator in Volcano.

[Figure: N producers, M consumers, N x M data streams]
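To make the split and merge roles concrete, here is a single-threaded sketch of a hash-partitioned river. It is an illustration only: it leaves out the flow control, buffering limits, and concurrency that the slide assigns to the river, and the names `river` and `key` are hypothetical. Each of the N producers is an ordinary sequential iterator, and each of the M consumers reads an ordinary sequential queue.

```python
import zlib
from collections import deque

def river(producers, consumers: int, key):
    """Route every record from N producer iterables into M consumer queues.

    Split: each record is sent to the queue chosen by hashing its key.
    Merge: each queue receives records from all producers.
    """
    queues = [deque() for _ in range(consumers)]
    for producer in producers:
        for record in producer:
            target = zlib.crc32(key(record).encode("utf-8")) % consumers
            queues[target].append(record)
    return queues

# Two producers, three consumers; records are plain strings keyed by themselves.
producers = [iter(["apple", "bob", "carol"]), iter(["apple", "dave"])]
queues = river(producers, consumers=3, key=lambda record: record)

for number, queue in enumerate(queues):
    print(f"consumer {number}: {list(queue)}")
```

Because routing depends only on the key, both "apple" records land at the same consumer, which is the property a partitioned aggregate or join relies on.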

Partitioned Execution

Spreads computation and IO among processors

Partitioned data gives NATURAL parallelism

[Figure: A Table partitioned A...E, F...J, K...N, O...S, T...Z, with one Count per partition]

N x M way Parallelism

N inputs, M outputs, no bottlenecks.

[Figure: partitions A...E, F...J, K...N, O...S, T...Z feeding Merge operators; labels: Partitioned Data, Partitioned and Pipelined Data Flows]

Picking Data Ranges

Disk Partitioning
– For range partitioning, sample load on disks. Cool hot disks by making range smaller
– For hash partitioning, cool hot disks by mapping some buckets to others

River Partitioning
– Use hashing and assume uniform
– If range partitioning, sample data and use histogram to level the bulk

Teradata, Tandem, Oracle use these tricks
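One way to "sample data and use histogram to level the bulk" is to pick range boundaries at equally spaced positions of the sorted sample (an equi-depth histogram). The sketch below shows that idea under simplifying assumptions: the function names are hypothetical, and the whole key set is used as the sample to keep the output deterministic, where a real system would draw a small random sample.

```python
import bisect
from collections import Counter

def split_points(sample, partitions: int):
    """Boundaries that cut the sorted sample into `partitions` equal-sized pieces."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // partitions] for i in range(1, partitions)]

def partition_of(key, points) -> int:
    """Index of the range partition that `key` falls into."""
    return bisect.bisect_right(points, key)

# Skewed keys: squares cluster near zero, so equal-width ranges would be unbalanced.
keys = [x * x for x in range(100)]
points = split_points(keys, partitions=4)
load = Counter(partition_of(k, points) for k in keys)

print("split points:", points)
print("rows per partition:", sorted(load.items()))
```

The boundaries are uneven in key space but each partition receives the same number of rows, which is the levelling effect the slide describes.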

Blocking Operators = Short Pipelines

An operator is blocking if it does not produce any output until it has consumed all its input

Examples: Sort, Aggregates, Hash-Join (reads all of one operand)

Blocking operators kill pipeline parallelism. Make partition parallelism all the more important.

Database Load Template has three blocked phases

[Figure: database load data flow. Labels: Tape File, SQL Table, Process, Scan, Sort Runs, Merge Runs, Table Insert, Index Insert, SQL Table, Index 1, Index 2, Index 3.]
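The difference between a pipelined and a blocking operator can be observed directly with Python generators. In this sketch (all names hypothetical), `traced` records each row as the downstream operator pulls it, so the log shows how much input an operator consumed before emitting its first output.

```python
def traced(rows, log):
    """Yield rows one at a time, recording each row in `log` as it is consumed."""
    for row in rows:
        log.append(row)
        yield row

def select(rows, predicate):
    """Pipelined operator: emits a row as soon as it passes the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def sort_op(rows, key=None):
    """Blocking operator: must read all input before emitting anything."""
    yield from sorted(rows, key=key)

select_log = []
first_selected = next(select(traced([3, 1, 2], select_log), lambda r: r > 0))
print("select emitted", first_selected, "after consuming", select_log)

sort_log = []
first_sorted = next(sort_op(traced([3, 1, 2], sort_log)))
print("sort emitted", first_sorted, "after consuming", sort_log)
```

The select produces output after one input row, so a downstream operator can start immediately; the sort consumes every row first, which is why it ends a pipeline.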

Simple Aggregates (sort or hash?)

Simple aggregates (count, min, max, ...) can use indices
– More compact
– Sometimes have aggregate info.

GROUP BY aggregates
– scan in category order if possible (use indices)
– Else if categories fit in RAM, use RAM category hash table
– Else make temp of <category, item>, sort by category, do math in merge step.
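The hash and sort strategies for GROUP BY listed above can be compared side by side. This is a small in-memory sketch with hypothetical function names; the sort variant stands in for the temp-file and merge-step approach, which a real system would run on disk.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def hash_group_sum(rows):
    """GROUP BY via a RAM hash table keyed on category (categories must fit in memory)."""
    totals = defaultdict(int)
    for category, item in rows:
        totals[category] += item
    return dict(totals)

def sort_group_sum(rows):
    """GROUP BY via sorting <category, item> pairs, then summing each run of equal keys."""
    ordered = sorted(rows, key=itemgetter(0))
    return {category: sum(item for _, item in group)
            for category, group in groupby(ordered, key=itemgetter(0))}

rows = [("fruit", 3), ("veg", 5), ("fruit", 4), ("dairy", 1), ("veg", 2)]
print(hash_group_sum(rows))
print(sort_group_sum(rows))
```

Both produce the same totals; the hash version needs one pass and memory proportional to the number of categories, while the sort version needs memory (or disk) proportional to the number of rows.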

Parallel Aggregates

For aggregate function, need a decomposition strategy, where s(i) is the i-th partition of S:
  count(S) = Σ count(s(i)), ditto for sum()
  avg(S) = (Σ sum(s(i))) / (Σ count(s(i)))
  and so on...

For groups,
– sub-aggregate groups close to the source
– drop sub-aggregates into a hash river.

[Figure: A Table partitioned A...E, F...J, K...N, O...S, T...Z, with one Count per partition]
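The decomposition above can be sketched as a local step run on each partition followed by a combine step. The function names are hypothetical and the partitions are plain lists; the point is that avg() must be rebuilt from partial sums and counts, because averaging the per-partition averages gives a different answer when partitions differ in size.

```python
def local_aggregate(partition):
    """Runs next to the data: returns the partial (sum, count) for one partition."""
    return sum(partition), len(partition)

def combine(partials):
    """Runs at the coordinator: merges partial results into sum, count, and avg."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count, total / count

partitions = [[1, 2, 3], [4, 5], [6]]
partials = [local_aggregate(p) for p in partitions]
total, count, average = combine(partials)

naive = sum(sum(p) / len(p) for p in partitions) / len(partitions)
print("sum", total, "count", count, "avg", average)
print("average of averages (incorrect for unequal partitions):", naive)
```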

Parallel Sort

M inputs, N outputs

[Figure labels: Scan or other source; Range or Hash Partition River; Sub-sorts generate runs; Merge runs]

River is range or hash partitioned

Disk and merge not needed if sort fits in memory

Scales “linearly” because log(10^12) / log(10^6) = 12/6 => 2x slower

Sort is benchmark from hell for shared nothing machines: net traffic = disk bandwidth, no data filtering at the source
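A range-partitioned parallel sort can be sketched in a few lines: records flow through a range-partitioned river, each partition is sorted independently, and the sorted partitions are concatenated in range order. This is an in-memory illustration with a hypothetical function name and caller-supplied split points; the per-partition sorts run one after another here, whereas each would run on its own node in a shared-nothing system.

```python
import bisect

def parallel_sort(records, split_points):
    """Range-partition the records, sort each partition, concatenate in range order."""
    partitions = [[] for _ in range(len(split_points) + 1)]
    for record in records:
        partitions[bisect.bisect_right(split_points, record)].append(record)

    # Each sub-sort is independent, so these could run on separate processors.
    runs = [sorted(partition) for partition in partitions]

    # Ranges do not overlap, so concatenation yields a fully sorted result.
    return [record for run in runs for record in run]

records = [42, 7, 99, 15, 63, 7, 88, 1]
print(parallel_sort(records, split_points=[25, 75]))
```

Every record crosses the river exactly once and nothing is filtered out along the way, which matches the slide's remark that network traffic equals disk bandwidth.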

Hash Join: Combining Two Tables

• Hash smaller table into N buckets (hope N=1)
• If N=1, read larger table, hash to smaller
• Else, hash outer to disk then bucket-by-bucket hash join.
• Purely sequential data behavior
• Always beats sort-merge and nested unless data is clustered.
• Good for equi, outer, exclusion join
• Lots of papers, products just appearing (what went wrong?)
• Hash reduces skew

[Figure: Left Table, Right Table, Hash Buckets]
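The two cases on this slide, N=1 and N>1 buckets, can be sketched as follows. This is an in-memory illustration for equi-joins with hypothetical function names: lists stand in for the on-disk buckets, and CRC32 of the key's `repr` stands in for the partitioning hash function.

```python
import zlib
from collections import defaultdict

def hash_join(small, large, small_key, large_key):
    """N=1 case: build a hash table on the smaller table, probe with the larger."""
    table = defaultdict(list)
    for row in small:
        table[small_key(row)].append(row)
    return [(s, l) for l in large for s in table.get(large_key(l), [])]

def partitioned_hash_join(small, large, small_key, large_key, buckets):
    """N>1 case: hash both tables into buckets, then join bucket by bucket."""
    def bucket_of(key):
        return zlib.crc32(repr(key).encode("utf-8")) % buckets

    small_buckets = [[] for _ in range(buckets)]
    large_buckets = [[] for _ in range(buckets)]
    for row in small:
        small_buckets[bucket_of(small_key(row))].append(row)
    for row in large:
        large_buckets[bucket_of(large_key(row))].append(row)

    result = []
    for s_bucket, l_bucket in zip(small_buckets, large_buckets):
        result.extend(hash_join(s_bucket, l_bucket, small_key, large_key))
    return result

first = lambda row: row[0]
small = [(1, "a"), (2, "b")]
large = [(1, "x"), (1, "y"), (3, "z"), (2, "w")]
print(hash_join(small, large, first, first))
print(sorted(partitioned_hash_join(small, large, first, first, buckets=4)))
```

Matching keys always hash to the same bucket number, so each bucket pair can be joined independently, which is also what makes the algorithm easy to spread across nodes.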

Parallel Hash Join

• ICL implemented hash join with bitmaps in CAFS machine (1976)!
• Kitsuregawa pointed out the parallelism benefits of hash join in early 1980’s (it partitions beautifully)
• We ignored them! (why?) But now, everybody's doing it (or promises to do it).
• Hashing minimizes skew, requires little thinking for redistribution
• Hashing uses massive main memory

What Systems Work This Way

Shared Nothing
– Teradata: 400 nodes 80x12 nodes
– Tandem: 110 nodes
– IBM / SP2 / DB2: 128 nodes
– Informix/SP2: 100 nodes
– ATT & Sybase: 8x14 nodes

Shared Disk
– Oracle: 170 nodes
– Rdb: 24 nodes

Shared Memory
– Informix: 9 nodes
– RedBrick: ? nodes

[Figure labels: CLIENTS, Memory, Processors]

Outline

Why Parallelism:
– technology push
– application pull

Parallel Database Techniques
– partitioned data
– partitioned and pipelined execution
– parallel relational operators

Main Message

Technology trends give
– many processors and storage units
– inexpensively

To analyze large quantities of data
– sequential (regular) access patterns are 100x faster
– parallelism is 1000x faster (trades time for money)
– Relational systems show many parallel algorithms.
