Parallel Database Systems: Analyzing Lots of Data. Jim Gray, Microsoft Research
Post on 30-Dec-2015
3
Main Message
Technology trends give
– many processors and storage units
– inexpensively
To analyze large quantities of data
– sequential (regular) access patterns are 100x faster
– parallelism is 1000x faster (trades time for money)
– Relational systems show many parallel algorithms.
5
Moore’s Law
[Figure: 1 chip memory size (2 MB to 32 MB), plotted from 1970 to 2000. Chip capacities run from 1 Kbit to 256 Mbits; memory sizes run from 8 KB to 1 GB.]
• XXX doubles every 18 months: 60% increase per year
• Microprocessor speeds
• Chip density
• Magnetic disk density
• Communications bandwidth: WAN bandwidth approaching LANs.
6
Magnetic Storage Cheaper than Paper
• File Cabinet:
  cabinet (4 drawer) 250$
  paper (24,000 sheets) 250$
  space (2x3 @ 10$/ft2) 180$
  total 700$
  3 ¢/sheet
• Disk:
  disk (4 GB) 500$
  ASCII: 2 m pages (100x cheaper) 0.025 ¢/sheet
  Image: 200 k pages (10x cheaper) 0.25 ¢/sheet
• Store everything on disk
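The per-sheet figures on this slide follow from simple division. A small sketch that reproduces them, using only the dollar amounts and page counts quoted above (the helper name `cents_per_sheet` is made up for illustration):

```python
# Figures copied from the slide; the helper name is illustrative only.
def cents_per_sheet(total_dollars, sheets):
    """Return the storage cost of one sheet in cents."""
    return total_dollars * 100 / sheets

cabinet = cents_per_sheet(700, 24_000)        # file cabinet, paper, floor space
disk_ascii = cents_per_sheet(500, 2_000_000)  # 4 GB disk holding ASCII pages
disk_image = cents_per_sheet(500, 200_000)    # 4 GB disk holding page images

print(f"paper {cabinet:.2f}, ASCII {disk_ascii:.3f}, image {disk_image:.2f} cents/sheet")
```

The paper-to-disk ratios come out slightly above the slide's rounded "100x" and "10x" because the slide rounds paper to 3 ¢/sheet.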
9
Today’s Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
[Figure, two panels. Both plot cache, main memory, secondary storage (disc), online tape, nearline tape, and offline tape against access time (seconds, 10^-9 to 10^3). Panel "Size vs Speed": size in bytes, 10^3 to 10^15. Panel "Price vs Speed": $/MB, 10^-4 to 10^4.]
10
Storage Latency: How Far Away is the Data?
Clock ticks   Storage level          Analogy       Travel time
1             Registers              My Head       1 min
2             On Chip Cache          This Room
10            On Board Cache         This Campus   10 min
100           Memory                 Sacramento    1.5 hr
10^6          Disk                   Pluto         2 Years
10^9          Tape / Optical Robot   Andromeda     2,000 Years
11
Thesis: Many Little will Win over Few Big
[Figure: price classes 1 M$, 100 K$, 10 K$ for Mainframe, Mini, Micro, Nano; disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8"]
12
Implications of Hardware Trends
Large Disc Farms will be inexpensive (10k$/TB)
Large RAM databases will be inexpensive (1K$/GB)
Processors will be inexpensive
So the building block will be a processor with
  large RAM
  lots of Disc
  lots of network bandwidth
CyberBrick™
[Figure: a CyberBrick with a 1k SPECint CPU, 500 GB Disc, 2 GB RAM]
13
Implication of Hardware Trends: Clusters
Future Servers are CLUSTERS of processors, discs
Distributed Database techniques make clusters work
[Figure: cluster nodes, each with a CPU, 50 GB Disc, 5 GB RAM]
15
Summary
• Tech trends => pipeline & partition parallelism
– Lots of bytes & bandwidth per dollar
– Lots of latency
– Lots of MIPS per dollar
– Lots of processors
• Putting it together: Scaleable Networks and Platforms
– Build clusters of commodity processors & storage
– Commodity interconnect is key (S of PMS)
  • Traditional interconnects give 100k$/slice.
– Commodity Cluster Operating System is key
– Fault isolation and tolerance is key
– Automatic Parallel Programming is key
18
Parallelism: Performance is the Goal
Goal is to get 'good' performance.
Law 1: a parallel system should be faster than the serial system
Law 2: a parallel system should give near-linear scaleup or near-linear speedup or both.
19
Kinds of Parallel Execution
Pipeline: the output of one sequential program feeds the next
Partition: outputs split N ways, inputs merge M ways
[Figure: "Any Sequential Program" boxes chained as a pipeline, and copies of "Any Sequential Program" running side by side on partitions]
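The two kinds of parallel execution can be sketched with ordinary sequential functions. This is an illustrative toy, not anything from the slides: the stage names (`scan`, `keep_even`, `count`) are invented, and Python generators only model the pipeline's dataflow shape; they do not run the stages concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def scan(rows):
    """Sequential stage 1: emit rows one at a time."""
    for row in rows:
        yield row

def keep_even(rows):
    """Sequential stage 2: filter the rows produced by the previous stage."""
    for row in rows:
        if row % 2 == 0:
            yield row

def count(rows):
    """Sequential stage 3: consume the stream and return its length."""
    return sum(1 for _ in rows)

def pipeline(rows):
    # Pipeline shape: each stage consumes the previous stage's output.
    return count(keep_even(scan(rows)))

def partitioned(rows, n):
    # Partition shape: split the input N ways, run the same sequential
    # program on each part, then merge the partial results.
    parts = [rows[i::n] for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        return sum(pool.map(pipeline, parts))

data = list(range(1000))
print(pipeline(data), partitioned(data, 4))
```

Each stage stays an ordinary sequential program; only the plumbing around it changes.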
20
The Perils of Parallelism
Speedup = OldTime / NewTime
[Figure: three plots of speedup against processors & discs. "The Good Speedup Curve": linearity. "A Bad Speedup Curve": no parallelism benefit. "A Bad Speedup Curve, 3 Factors": startup, interference, skew.]
Startup: creating processes, opening files, optimization
Interference: device (cpu, disc, bus), logical (lock, hotspot, server, log, ...)
Skew: if tasks get very small, variance > service time
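A toy cost model shows how the three factors bend the speedup curve. The formula below is an assumption made up for illustration (the slides give no formula): startup grows with the number of processors, interference slows every processor a little for each other processor, and skew stretches the slowest partition.

```python
def parallel_time(work, n, startup=0.0, interference=0.0, skew=0.0):
    """Toy elapsed time for `work` seconds of work on n processors.

    startup:      seconds added per processor (process creation, file opens)
    interference: fractional slowdown each extra processor imposes on the rest
    skew:         fraction by which the slowest partition exceeds the average
    """
    per_processor = (work / n) * (1 + interference * (n - 1)) * (1 + skew)
    return startup * n + per_processor

def speedup(work, n, **factors):
    # Speedup = OldTime / NewTime, with the serial run as the old time.
    return work / parallel_time(work, n, **factors)

for n in (1, 10, 100):
    print(n, round(speedup(100, n), 2), round(speedup(100, n, startup=1.0), 2))
```

With no overheads the model gives linear speedup; with one second of startup per processor, 100 processors are slower than one.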
21
Why Parallel Access To Data?
At 10 MB/s: 1.2 days to scan 1 Terabyte
1,000 x parallel: 1.5 minute SCAN of 1 Terabyte
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
[Figure label: Bandwidth]
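The scan times are a matter of arithmetic. A minimal sketch, assuming decimal units (1 Terabyte = 10^12 bytes, 10 MB/s = 10^7 bytes/s) and data spread evenly over the disks:

```python
TERABYTE = 10**12          # bytes
DISK_RATE = 10 * 10**6     # bytes per second for one disk (10 MB/s)

def scan_seconds(size_bytes, rate_bytes_per_s, ways=1):
    """Seconds to scan the data when it is spread evenly over `ways` disks."""
    return size_bytes / (rate_bytes_per_s * ways)

serial = scan_seconds(TERABYTE, DISK_RATE)
parallel = scan_seconds(TERABYTE, DISK_RATE, ways=1000)
print(f"serial: {serial / 86_400:.2f} days, 1,000-way: {parallel / 60:.1f} minutes")
```

Under these assumptions the serial scan takes about 1.2 days and the 1,000-way scan takes 100 seconds, close to the slide's rounded 1.5 minutes.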
22
Data Flow Programming: Prefetch & Postwrite Hide Latency
Can't wait for the data to arrive (2,000 years!)
Memory that gets the data in advance (100 MB/s)
Solution:
  Pipeline from storage (tape, disc...) to cpu cache
  Pipeline results to destination
[Figure label: Latency]
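Prefetching can be sketched as a bounded buffer filled by a background reader, so the consumer rarely waits on storage. This is a minimal illustration using Python's standard `queue` and `threading` modules; the function name `prefetching_reader` and the `depth` parameter are assumptions, not something the slides define.

```python
import queue
import threading

def prefetching_reader(blocks, depth=4):
    """Yield blocks while a background thread reads ahead by up to `depth`."""
    buffer = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for block in blocks:
            buffer.put(block)   # waits when the buffer is full (flow control)
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        block = buffer.get()
        if block is done:
            return
        yield block

total = sum(len(b) for b in prefetching_reader([b"abc", b"de", b"f"]))
print(total)
```

Postwrite is the mirror image: the consumer drops results into a bounded buffer and a background writer drains it to the destination.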
23
Parallelism: Speedup & Scaleup
Speedup: Same Job, More Hardware, Less time (100GB -> 100GB)
Scaleup: Bigger Job, More Hardware, Same time (100GB -> 1 TB)
Transaction Scaleup: more clients/servers, Same response time (1 k clients -> 10 k clients, 100GB -> 1 TB per server)
24
Benchmark Buyer's Guide
The Whole Story (for any system)
[Figure: "The Benchmark Report", throughput plotted against processors & discs]
Things to ask:
  When does it stop scaling?
  Throughput numbers, not ratios.
Standard benchmarks allow
  comparison to others
  comparison to sequential
Ratios & non-standard benchmarks are red flags.
25
Why are Relational Operators So Successful for Parallelism?
Relational data model: uniform operators on uniform data streams, closed under composition
Each operator consumes 1 or 2 input streams
Each stream is a uniform collection of data
Sequential data in and out: pure dataflow
Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation
AUTOMATIC PARALLELISM
26
Database Systems “Hide” Parallelism
• Automate system management via tools
  • data placement
  • data organization (indexing)
  • periodic tasks (dump / recover / reorganize)
• Automatic fault tolerance
  • duplex & failover
  • transactions
• Automatic parallelism
  • among transactions (locking)
  • within a transaction (parallel execution)
27
Automatic Parallel OR DB
Select image
from landsat
where date between 1970 and 1990
  and overlaps(location, :Rockies)
  and snow_cover(image) > .7;
[Figure: the Landsat table has columns date (temporal, 1/2/72 ... 4/8/95), loc (spatial, 33N 120W ... 34N 120W), and image. Date, location, & image tests select the answer images.]
Assign one process per processor/disk:
  find images with right date & location
  analyze image, if 70% snow, return it
28
Automatic Data Partitioning
Split a SQL table to subset of nodes & disks
Partition within set:
  Range (A...E F...J K...N O...S T...Z): good for equi-joins, range queries, group-by
  Hash: good for equi-joins
  Round Robin: good to spread load
Shared disk and memory less sensitive to partitioning, Shared nothing benefits from "good" partitioning
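The three partitioning schemes can be sketched as functions that map a record key to a node index. This is an illustrative sketch only: the function names are invented, the range boundaries copy the A...E through T...Z ranges from the slide, and `zlib.crc32` stands in for whatever hash function a real system would use.

```python
import bisect
import itertools
import zlib

def range_partition(key, upper_bounds):
    """Node index for `key`, given the sorted inclusive upper bound of each range."""
    return bisect.bisect_left(upper_bounds, key[0].upper())

def hash_partition(key, n):
    """Node index from a stable hash of the key."""
    return zlib.crc32(key.encode()) % n

def round_robin(n):
    """Endless cycle of node indexes 0..n-1, ignoring the key."""
    return itertools.cycle(range(n))

BOUNDS = ["E", "J", "N", "S", "Z"]   # A...E F...J K...N O...S T...Z

names = ["Adams", "Echo", "Gray", "Stone", "Zed"]
print([range_partition(k, BOUNDS) for k in names])
print([hash_partition(k, 5) for k in names])
```

Range partitioning keeps neighbouring keys together, which is why it suits range queries. Hashing sends equal keys to the same node, which is what an equi-join needs. Round robin ignores the key and only balances load.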
29
Index Partitioning
Hash indices partition by hash (0...9 10..19 20..29 30..39 40..)
B-tree indices partition as a forest of trees. One tree per range (A..C D..F G...M N...R S..Z)
Primary index clusters data
30
Secondary Index Partitioning
In shared nothing, secondary indices are problematic
Partition by base table key ranges
  Insert: completely local (but what about unique?)
  Lookup: examines ALL trees (see figure)
  Unique index involves lookup on insert.
Partition by secondary key ranges (Teradata solution)
  Insert: two nodes (base and index)
  Lookup: two nodes (index -> base)
  Uniqueness is easy
[Figure: with base-table-key partitioning, every node holds an index tree spanning A..Z over its part of the base table; with secondary-key partitioning, the index is split into ranges A..C, D..F, G...M, N...R, S.. over the base table]
31
Data Rivers: Split + Merge Streams
Producers add records to the river, consumers consume records from the river
Purely sequential programming.
River does flow control and buffering, and does partition and merge of data records
River = Split/Merge in Gamma = Exchange operator in Volcano.
[Figure: N producers feed M consumers through the river, N x M data streams]
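A river's split and merge behaviour can be sketched in a few lines. This single-threaded toy assumes hash splitting and omits the flow control, buffering, and concurrency a real river provides; the `river` function and its parameters are invented for illustration.

```python
import zlib
from collections import deque

def river(producers, m, key=lambda record: record):
    """Route every record from N producers into one of M consumer streams."""
    streams = [deque() for _ in range(m)]
    for producer in producers:                 # merge: drain every producer
        for record in producer:
            slot = zlib.crc32(str(key(record)).encode()) % m   # split by hash
            streams[slot].append(record)
    return streams

producers = [["ann", "bob"], ["cat", "bob"], ["dan"]]   # N = 3
streams = river(producers, m=2)                          # M = 2
print([list(s) for s in streams])
```

Producers and consumers each see a plain sequential stream; records with equal keys always land in the same consumer stream.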
32
Partitioned Execution
Spreads computation and IO among processors
Partitioned data gives NATURAL parallelism
[Figure: a table partitioned A...E, F...J, K...N, O...S, T...Z; one Count per partition feeds a final Count]
33
N x M way Parallelism
N inputs, M outputs, no bottlenecks.
Partitioned Data; Partitioned and Pipelined Data Flows
[Figure: partitions A...E, F...J, K...N, O...S, T...Z each feed a Join and a Sort, whose outputs flow into three Merge operators]
34
Picking Data Ranges
Disk Partitioning
  For range partitioning, sample load on disks. Cool hot disks by making range smaller
  For hash partitioning, cool hot disks by mapping some buckets to others
River Partitioning
  Use hashing and assume uniform
  If range partitioning, sample data and use histogram to level the bulk
Teradata, Tandem, Oracle use these tricks
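Choosing range boundaries from a sample can be sketched as taking evenly spaced quantiles, so each range receives about the same number of records even when the keys are skewed. The function names and the sample data are assumptions made for illustration; the slides do not specify an algorithm.

```python
import bisect

def range_boundaries(sample, n):
    """Pick n-1 split keys so each of n ranges holds about the same sample share."""
    ordered = sorted(sample)
    return [ordered[(len(ordered) * i) // n] for i in range(1, n)]

def partition_of(key, boundaries):
    """Index of the range that owns `key`; a boundary key starts the next range."""
    return bisect.bisect_right(boundaries, key)

sample = [1, 2, 2, 3, 3, 3, 4, 50, 60, 70, 80, 90]   # skewed toward small keys
bounds = range_boundaries(sample, 4)
loads = [0] * 4
for key in sample:
    loads[partition_of(key, bounds)] += 1
print(bounds, loads)
```

Equal-width ranges over 1..90 would send most of these records to the first partition; the sampled boundaries give every partition the same load.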
35
Blocking Operators = Short Pipelines
An operator is blocking if it does not produce any output until it has consumed all its input
Examples: Sort, Aggregates, Hash-Join (reads all of one operand)
Blocking operators kill pipeline parallelism
Make partition parallelism all the more important.
[Figure: Database Load Template has three blocked phases. Sources (tape, file, SQL table, process) are scanned into sort runs; merge runs feed the table insert into the SQL table; further sort runs and merge runs feed the index inserts for Index 1, Index 2, Index 3.]
36
Simple Aggregates (sort or hash?)
Simple aggregates (count, min, max, ...) can use indices
  More compact
  Sometimes have aggregate info.
GROUP BY aggregates
  scan in category order if possible (use indices)
  Else if categories fit in RAM
    use RAM category hash table
  Else
    make temp of <category, item>
    sort by category,
    do math in merge step.
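The hash and sort strategies for GROUP BY can be compared side by side. This sketch uses sum as the aggregate and runs entirely in memory; a real sort-based plan would spill the <category, item> pairs to a temporary file. The function names are illustrative.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def hash_group_sum(rows):
    """GROUP BY via a RAM hash table keyed on category."""
    totals = defaultdict(int)
    for category, item in rows:
        totals[category] += item
    return dict(totals)

def sort_group_sum(rows):
    """GROUP BY via sorting <category, item> pairs, then summing each run."""
    ordered = sorted(rows, key=itemgetter(0))
    return {cat: sum(item for _, item in run)
            for cat, run in groupby(ordered, key=itemgetter(0))}

rows = [("tea", 2), ("ink", 5), ("tea", 3), ("ink", 1), ("pen", 4)]
print(hash_group_sum(rows))
print(sort_group_sum(rows))
```

The hash version makes one pass but needs memory proportional to the number of categories; the sort version needs only enough memory to merge runs.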
37
Parallel Aggregates
For aggregate function, need a decomposition strategy:
  count(S) = Σ count(s(i)), ditto for sum()
  avg(S) = (Σ sum(s(i))) / (Σ count(s(i)))
  and so on...
For groups,
  sub-aggregate groups close to the source
  drop sub-aggregates into a hash river.
[Figure: a table partitioned A...E, F...J, K...N, O...S, T...Z; one Count per partition feeds a final Count]
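The decomposition can be sketched as two steps: each partition computes a small (count, sum) pair next to its data, and a final step combines the pairs. The function names and the sample partitions are illustrative.

```python
def sub_aggregate(partition):
    """Computed next to the data: one (count, sum) pair per partition."""
    return len(partition), sum(partition)

def combine(partials):
    """Merge the sub-aggregates into global count, sum, and average."""
    total_count = sum(c for c, _ in partials)
    total_sum = sum(s for _, s in partials)
    return total_count, total_sum, total_sum / total_count

partitions = [[1, 2, 3], [10], [4, 5]]
count, total, average = combine([sub_aggregate(p) for p in partitions])
print(count, total, average)
```

The average has to be rebuilt from the sums and counts. Averaging the per-partition averages would weight the one-row partition as heavily as the three-row one and give the wrong answer.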
38
Parallel Sort
M inputs, N outputs
[Figure: scan or other source -> range or hash partition river -> sub-sorts generate runs -> merge runs]
River is range or hash partitioned
Disk and merge not needed if sort fits in memory
Scales “linearly” because log(10^12) / log(10^6) = 12/6 => 2x slower
Sort is benchmark from hell for shared nothing machines: net traffic = disk bandwidth, no data filtering at the source
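A range-partitioned parallel sort can be sketched as routing followed by independent local sorts. This in-memory toy assumes the range boundaries are already known and skips run generation and merging, which the slide says are not needed when the sort fits in memory. The function name and data are illustrative.

```python
def parallel_sort(inputs, boundaries):
    """Range-partition records from M inputs into N outputs, then sort each output."""
    outputs = [[] for _ in range(len(boundaries) + 1)]
    for source in inputs:
        for record in source:
            # Range partition river: send the record to the output owning its range.
            slot = sum(record >= b for b in boundaries)
            outputs[slot].append(record)
    # Each output sorts its own range independently (the sub-sort step).
    return [sorted(out) for out in outputs]

inputs = [[9, 1, 14], [3, 12, 7], [5, 10]]            # M = 3
outputs = parallel_sort(inputs, boundaries=[5, 10])   # N = 3
print(outputs)
```

Because the outputs own disjoint, ordered ranges, concatenating them yields the fully sorted result with no final merge across nodes. Every record still crosses the river once, which is the network cost the slide complains about.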
39
Hash Join: Combining Two Tables
Hash smaller table into N buckets (hope N=1)
If N=1 read larger table, hash to smaller
Else, hash outer to disk then bucket-by-bucket hash join.
Purely sequential data behavior
Always beats sort-merge and nested unless data is clustered.
Good for equi, outer, exclusion join
Lots of papers, products just appearing (what went wrong?)
Hash reduces skew
[Figure: left table and right table hashed into hash buckets]
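The N=1 case, where the smaller table fits in memory, can be sketched as a build phase and a probe phase. This covers an inner equi-join only; the function name, key positions, and sample tables are assumptions for illustration.

```python
from collections import defaultdict

def hash_join(smaller, larger, key_small=0, key_large=0):
    """Equi-join: build a hash table on the smaller input, probe with the larger."""
    table = defaultdict(list)
    for row in smaller:                      # build phase
        table[row[key_small]].append(row)
    for row in larger:                       # probe phase, one sequential pass
        for match in table.get(row[key_large], []):
            yield match + row

depts = [(10, "ops"), (20, "lab")]
staff = [(10, "ann"), (20, "bob"), (10, "cy"), (30, "di")]
joined = list(hash_join(depts, staff))
print(joined)
```

Both inputs are read once, in order, which is the "purely sequential data behavior" the slide points to. When the smaller table does not fit, both tables are first hashed into matching buckets on disk and the same routine runs bucket by bucket.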
40
Parallel Hash Join
ICL implemented hash join with bitmaps in CAFS machine (1976)!
Kitsuregawa pointed out the parallelism benefits of hash join in early 1980’s (it partitions beautifully)
We ignored them! (why?)
But now, everybody's doing it (or promises to do it).
Hashing minimizes skew, requires little thinking for redistribution
Hashing uses massive main memory
42
What Systems Work This Way
Shared Nothing
  Teradata: 400 nodes, 80x12 nodes
  Tandem: 110 nodes
  IBM / SP2 / DB2: 128 nodes
  Informix/SP2: 100 nodes
  ATT & Sybase: 8x14 nodes
Shared Disk
  Oracle: 170 nodes
  Rdb: 24 nodes
Shared Memory
  Informix: 9 nodes
  RedBrick: ? nodes
[Figure: clients connected to the three architectures, showing processors and memory]
43
Outline
Why Parallelism:
– technology push
– application pull
Parallel Database Techniques
– partitioned data
– partitioned and pipelined execution
– parallel relational operators