TRANSCRIPT
RADIO: Managing the Performance of Large, Distributed Storage Systems
Scott A. Brandt
Carlos Maltzahn, Anna Povzner, Roberto Pineiro, Andrew Shewmaker, and Tim Kaldewey
Computer Science Department, University of California Santa Cruz
and Richard Golding and Ted Wong, IBM Almaden Research Center
UPC—July 7, 2009
Who am I?
• Professor, Computer Science Department, UC Santa Cruz
• Director, UCSC/LANL Institute for Scalable Scientific Data Management (ISSDM)
• Director, UCSC Systems Research Laboratory (SRL)
• Background
• 1999 Ph.D. CS, Colorado
• 1987/1993 B. Math/MS CS, Minnesota
• 1982-1994 Programmer/Research Scientist/VP, CPT, B-Tree, Honeywell SRC, Theseus Research, Alliant TechSystems RTS, Secure Computing
• My Research
• High-performance petascale storage
• Real-time systems
• Performance management and virtualization
• Active object-based storage
• Other Research
• Secure operating systems
• Asynchronous circuits
• Real-time image processing
Distributed systems need performance guarantees
• Many distributed systems and applications need (or want) I/O performance guarantees
• Multimedia, high-performance simulation, transaction processing, virtual machines, service level agreements, real-time data capture, sensor networks, ...
• Systems tasks like backup and recovery
• Even so-called best-effort applications
• Providing such guarantees is difficult because it involves:
• Multiple interacting resources
• Dynamic workloads
• Interference among workloads
• Non-commensurable metrics: CPU utilization, network throughput, cache space, disk bandwidth
In a nutshell
• Big distributed systems
• Serve many users/jobs
• Process petabytes of data
• Data center design
• Use rules of thumb
• Over-provision
• Isolate
• Ad hoc performance management creates marginal storage systems that cost more than necessary
• A better system would guarantee each user the performance they need from the CPUs, memory, disks, and network
Outline
1. Problem: Managing the performance of large, distributed storage systems
2. Approach: End-to-end performance management
3. Model: RAD
4. Instances:
• Disk
• Network
• Buffer cache
5. Application: Data Center Performance Management and Monitoring
End-to-end I/O performance guarantees
• Goal: Improve end-to-end performance management in large distributed systems
• Manage performance
• Isolate traffic
• Provide high performance
• Targets: High-performance storage (LLNL), data centers (LANL), satellite communications (IBM), virtual machines (VMware), sensor networks, ...
• Approach:
1. Develop a uniform model for managing performance
2. Apply it to each resource
3. Integrate the solutions
Our current target
• High-performance I/O
• From client, across network, through server, to disk
• Up to hundreds of thousands of processing nodes
• Up to tens of thousands of I/O nodes
• Big, fat, network interconnect
• Up to thousands of storage nodes with cache and disk
• Challenges
• Interference between I/O streams, variability of workloads, variety of resources, variety of applications, legacy code, system management tasks, scale
Stages in the I/O path
[Diagram: stages in the I/O path, from applications through the client cache and network transport to the server cache and disk storage. Annotations: flow control with one client; connection management between clients; I/O selection and head scheduling; prefetch and writeback based on utilization and QoS; integration between client and server cache.]
1. Disk I/O
2. Server cache
3. Flow control across network
• Within one client’s session and between clients
4. Client cache
System architecture
[Diagram: clients make requests through a QoS broker, which issues utilization reservations; the network, server caches, and disks along the I/O path maintain the guarantees.]
• Client: task, host, distributed application, VM, file, ...
• Reservations made via broker
• Specify workload: throughput, read/write ratio, burstiness, etc.
• Broker does admission control
• Requirements + workload are translated to utilization
• Utilizations are summed to see if they are feasible (sketched below)
• Once admitted, I/O streams are guaranteed (subject to workload adherence)
• Disk, caches, network controllers maintain guarantees
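A minimal, hypothetical sketch of that admission test (the class names and the simple additive-utilization model are illustrative, not the actual QoS broker API):

```python
# Hypothetical sketch of broker admission control: requirements plus a
# workload description are translated to a device utilization, and a
# reservation is admitted only if the utilizations still sum to <= 100%.

from dataclasses import dataclass

@dataclass
class Reservation:
    throughput_iops: float   # requested I/Os per second
    avg_io_time: float       # expected seconds per I/O for this workload
                             # (depends on read/write ratio, burstiness, ...)

    def utilization(self) -> float:
        # fraction of the device's time this stream will consume
        return self.throughput_iops * self.avg_io_time

class Broker:
    def __init__(self):
        self.admitted = []

    def admit(self, r: Reservation) -> bool:
        total = sum(x.utilization() for x in self.admitted) + r.utilization()
        if total <= 1.0:              # feasible: the time shares fit on the device
            self.admitted.append(r)
            return True
        return False                  # reject; existing guarantees would be violated

broker = Broker()
print(broker.admit(Reservation(400, 0.0005)))   # 20% utilization -> admitted
print(broker.admit(Reservation(800, 0.0005)))   # 40% more -> admitted
print(broker.admit(Reservation(1200, 0.0005)))  # 60% more -> rejected (120% total)
```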
Achieving robust guaranteeable resources
• Goal: Unified resource management algorithms capable of providing
• Good performance
• Arbitrarily hard or soft performance guarantees with
• Arbitrary resource allocations
• Arbitrary timing / granularity
• Complete isolation between workloads
• All resources: CPU, disk, network, server cache, client cache
➡ Virtual resources indistinguishable from “real” resources with fractional performance
Isolation is key
• CPU
• 20% of a 3 GHz CPU should be indistinguishable from a 600 MHz CPU
• Running: compiler, editor, audio, video
• Disk
• 20% of a disk with 100 MB/second bandwidth should be indistinguishable from a disk with 20 MB/second bandwidth
• Serving: 1 stream, n streams, sequential, random
Scott’s epistemology of virtualization
• Virtual Machines and LUNs provide good HW virtualization
• Question: Given perfect HW virtualization, how can a process tell the difference between a virtual resource and a real resource?
• Answer: By not getting its share of the resource when it needs it
Observation
• Resource management consists of two distinct decisions
• Resource Allocation: How much of each resource to allocate?
• Dispatching: When to provide the allocated resources?
• Most resource managers conflate them
• Best-effort, proportional-share, real-time
Separating them is powerful!
• Separately managing resource allocation and dispatching gives direct control over the delivery of resources to tasks
• Enables direct, integrated support of all types of timeliness needs
[Figure: timeliness taxonomy on two axes, Resource Allocation (constrained vs. unconstrained) and Dispatching (constrained vs. unconstrained). Hard Real-Time constrains both; Rate-Based constrains allocation but not dispatching; Soft Real-Time (SRT) tolerates a missed deadline; Best-Effort constrains neither, covering CPU-bound and I/O-bound tasks.]
The resource allocation/dispatching (RAD) scheduling model
[Diagram: the RAD model. A process's rate specifies its share of resources; its deadlines specify the times at which allocation must equal share. Together these yield a series of jobs with budgets and deadlines, which the dispatcher executes.]
Supporting different timeliness requirements with RAD
[Diagram: supporting different timeliness requirements with RAD. Hard real-time supplies period and WCET, soft real-time supplies period and ACET, rate-based supplies rate bounds, and best-effort supplies priority; the scheduling policy translates each into rates and deadlines, and the scheduling mechanism dispatches the resulting set of jobs with budgets and deadlines to the runtime system.]
Rate-Based Earliest Deadline (RBED) CPU scheduler
[Diagram: RBED instantiates RAD for the CPU. The scheduling policy converts rates and deadlines into a set of jobs with budgets and deadlines; the scheduling mechanism is EDF plus timers.]
• Processes have rate & period
• ∑ rates ≤ 100%
• Periods based on processing characteristics, latency needs, etc.
• Jobs have budget & deadline
• budget = rate × period
• Deadlines based on period or other characteristics
• Jobs dispatched via Earliest Deadline First (EDF)
• Budgets enforced with timers
• Meeting all budgets & deadlines meets all rates & periods (sketched below)
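As a concrete illustration, a minimal sketch of the RBED idea (the tick-based simulation, class names, and parameters are ours, not the actual scheduler):

```python
# Minimal RBED-style sketch: processes reserve a rate and period; each job
# gets budget = rate * period and deadline = release + period; the dispatcher
# runs the job with the earliest deadline and enforces budgets.

import heapq

class Process:
    def __init__(self, name, rate, period):
        self.name, self.rate, self.period = name, rate, period

def admit(processes):
    return sum(p.rate for p in processes) <= 1.0   # sum of rates <= 100%

def simulate(processes, ticks):
    assert admit(processes)
    jobs = []  # heap of [deadline, name, remaining_budget]
    for t in range(ticks):
        for p in processes:
            if t % p.period == 0:  # release a new job each period
                heapq.heappush(jobs, [t + p.period, p.name, p.rate * p.period])
        if jobs:
            job = jobs[0]          # Earliest Deadline First
            job[2] -= 1            # run one tick; timers would enforce budgets
            print(f"t={t}: running {job[1]} (deadline {job[0]})")
            if job[2] <= 0:
                heapq.heappop(jobs)

# A reserves 50% at period 4, B reserves 25% at period 8: admissible (75%)
simulate([Process("A", 0.5, 4), Process("B", 0.25, 8)], ticks=8)
```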
Adapting RAD to disk, network, and buffer cache
• Fahrrad: guaranteed disk request scheduling (Anna Povzner, UCSC)
• RADoN: guaranteeing storage network performance (Andrew Shewmaker, UCSC and LANL)
• Radium: buffer management for I/O guarantees (Roberto Pineiro, UCSC)
Disk
Guaranteed disk request scheduling
• Goals
• Hard and soft performance guarantees
• Isolation between I/O streams
• Good I/O performance
• Challenging because disk I/O is:
• Stateful
• Non-deterministic
• Non-preemptable, and
• Best- and worst-case times vary by 3–4 orders of magnitude
Fahrrad
• Manages disk time instead of disk throughput
• Adapts RAD/RBED to disk I/O
• Reorders aggressively to provide good performance, without violating guarantees
[Diagram: I/O streams A, B, C, and best-effort (BE) enter Fahrrad, which schedules them onto the disk.]
A bit more detail
• Reservations in terms of disk time utilization and period (granularity)
• All I/Os feasible before the earliest deadline in the system are moved to a Disk Scheduling Set (DSS), as sketched below
• I/Os in the DSS are issued in the most efficient way
• I/O charging model is critical
• Overhead reservation ensures exact utilization
• 2 WCRTs (worst-case request times) per period for “context switches”
• 1 WCRT per period to ensure the last I/Os complete
• 1 WCRT for the process with the shortest period, due to non-preemptability
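A minimal sketch of the DSS idea, under simplifying assumptions (every request costs one WCRT, and "most efficient" is approximated by an LBA sort; the data structures are illustrative):

```python
# Sketch of the Fahrrad DSS idea (illustrative structures, not the real
# scheduler): every I/O that can complete before the earliest deadline in
# the system is moved into the Disk Scheduling Set, and the DSS is then
# issued in the most seek-efficient order (here: a simple LBA sort).

def build_dss(streams, now, wcrt):
    """streams: list of (deadline, [lba, ...]) pairs, one per I/O stream."""
    earliest_deadline = min(d for d, _ in streams)
    dss, time = [], now
    for deadline, ios in streams:
        for lba in ios:
            # feasible if even a worst-case request finishes in time
            if time + wcrt <= earliest_deadline:
                dss.append(lba)
                time += wcrt
    return sorted(dss)   # issue in an efficient (elevator-like) order

# Stream 1 has deadline 100, stream 2 deadline 400; WCRT = 25 time units.
print(build_dss([(100, [900, 20, 500]), (400, [30, 700])], now=0, wcrt=25))
# -> [20, 30, 500, 900]: four I/Os fit before the earliest deadline
```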
Fahrrad outperforms Linux
• Workload
• Media 1: 400 sequential I/Os per second (20%)
• Media 2: 800 sequential I/Os per second (40%)
• Transaction: short bursts of random I/Os at random times (30%)
• Background: random (10%)
• Result: Better isolation AND better throughput
[Figure: throughput over time for the four streams, Linux vs. Linux w/Fahrrad.]
New work: virtual disks
• Provide workload-independent performance guarantees
• Isolate from other workloads concurrently accessing the device
• LUNs virtualize storage capacity
• Fahrrad virtualizes storage performance
Fahrrad virtual disks
• Implemented with the Fahrrad real-time I/O scheduler
• Guarantee a reserved and isolated share of the time on the storage device
• Hard guarantees on performance isolation
• Virtual disk throughput same as equivalent standalone throughput
• Amount of data transferred: ∀i, D_i(x%, t) = D_i(100%, x%·t), i.e., stream i moves as much data on a virtual disk with an x% share over time t as it would on the whole disk over time x%·t
[Diagram: a virtual disk's share of disk time, shown over time.]
Guaranteeing performance isolation
• Virtual disk reservation: disk share (utilization) and time granularity (period)
• Account for all extra (inter-stream) seeks
• Reserve overhead utilization to do them
• Charge each I/O stream for all of the time it uses, including inter- and intra-stream seeks
• Reservation = Disk Share + Overhead utilization
[Diagram: disk time divided among reservations of 25% at 1 sec, 30% at 250 ms, and 19% at 1 min granularity.]
Extra seeks
• Intra-stream seeks caused by workload
• Inter-stream seeks caused by reservations
• Short periods (fine time granularity) cause more frequent seeks
• At most two extra seeks per stream per period
• To and away from the stream, to meet deadlines
➡ Extra seeks caused by un-queued requests
Charging model and reservations
• Charge the streams responsible for inter-stream seeking
• From overhead: for seeks caused by reservations
• From reservation: for seeks caused by bursty behavior
• Overhead utilization is needed for hard guarantees
• Overhead utilization = 2·WCRT/p + WCRT/p + WCRT/p_min (worked example below)
• We can trade off hard guarantees for lower overhead by assuming less than worst-case request time
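A hypothetical worked example of that overhead formula (the WCRT and period values are illustrative, not measured):

```python
# Overhead reservation for a Fahrrad stream, following the components above:
#   2 WCRT/p  for inter-stream "context switches",
#   1 WCRT/p  to ensure the last I/Os of the period complete,
#   1 WCRT/p_min charged against the shortest period, for non-preemptability.

def overhead_utilization(wcrt, period, shortest_period):
    return 2 * wcrt / period + wcrt / period + wcrt / shortest_period

wcrt = 0.025  # assume a 25 ms worst-case request time
print(overhead_utilization(wcrt, period=1.0, shortest_period=0.25))
# 3 * 25ms / 1s + 25ms / 250ms = 7.5% + 10% = 17.5% overhead utilization
```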
Performance: guaranteeing throughput
• Guarantee reserved utilization; account for inter-stream seeks; maintain 2 outstanding requests
• Throughput is determined by reservation and workload
• Each virtual disk reserves 20% with 1 second granularity
[Figure: amount of data transferred (MB) vs. run length of semi-sequential stream (MB) for sequential, semi-sequential, and random streams on virtual disks (20% share, 150 s), compared with the same streams run standalone (100%, 30 s).]
Performance: Controlling throughput
• Each virtual disk is isolated from the other
• Performance is fully determined by the reservation and workload
[Figure: data transferred (MB) vs. reserved disk time share (%) for the sequential stream.]
Performance: Controlling latency
• Reservation granularity bounds latency: reserve period = latency/2 (e.g., a 250 ms period bounds latency at 500 ms)
• Virtual device serves periodic semi-sequential stream and shares storage with random background stream. Four experiments for different period reservations.
[Figure: fraction of I/Os vs. latency for the four period reservations, with upper bounds marked; utilization vs. period of virtual disk.]
Performance: Isolation guarantees
• Hard guarantees require high overhead (proportional to reservation granularity)
• Three virtual disks each serving one sequential stream with many outstanding I/Os share a storage system with a random background stream.
[Figure: disk time reservation (%) and data transferred (MB) vs. period of virtual disk 3.]
Performance: Soft guarantees w/isolation
• Overhead based on less than worst-case I/O time
• Increased short-term throughput variation
• Virtual disk (10%, 1 sec) runs one sequential stream with a 400 I/O/sec arrival rate and shares the system with 5 virtual disks, each running one random stream.
[Figure: disk time reservation (%) and data transferred (%) vs. percentile of observed service times.]
Performance: Soft guarantees w/isolation
• Linux fails to support Cello99 (variation up to 30% from standalone)
• Fahrrad Virtual Disks provide Cello99 and OpenMail performance close to standalone
• The Cello99 and OpenMail virtual disks share the system with a random background stream.
[Figure: throughput (I/Os per second) over time, Linux vs. Fahrrad Virtual Disks.]
Fahrrad Virtual Disks
1. Guarantee throughput by accounting for overhead and guaranteeing utilization
2. Guarantee isolation between workloads by accurately accounting for all disk time
3. Provide high throughput (w/guarantees) by minimizing interference between workloads
4. Result: performance of virtual disk depends only on reservation, workload, and performance of device
Guaranteeing storage network performance
• Goals
• Hard and soft performance guarantees
• Isolation between I/O streams
• Good I/O performance
• Challenging because network I/O is:
• Distributed
• Non-deterministic (due to collisions or switch queue overflows)
• Non-preemptable
• Assumption: closed network
What we want
[Diagram: three clients and three servers; client streams receive guaranteed shares of 30%, 50%, and 20% of the network.]
What we have
• Switched fat tree w/full bisection bandwidth
• Issue 1: Capacity of shared links
• Issue 2: Switch queue contention
Congestion in a simple switch model
• Each transmit port on the switch is a collision domain
[Diagram: an 8-port switch; each transmit/receive port drains a shared FIFO fed through the switch fabric.]
Congestion in a simple switch model
• One of the packets arriving at the same switch transmit port is delayed on the queue
[Diagram: ports 1 and 2 both send to port 5; their packets congest on port 5's queue.]
Congestion in a simple switch model
• Delayed packets from unrelated streams affect each other on the queue
[Diagram: ports 1 and 2 send to port 5 while ports 3 and 4 send to port 8; streams 1 and 2 congest, streams 3 and 4 congest, and packets from the unrelated streams 2 and 4 delay each other on the shared queue.]
TCP
• “Those who do not understand TCP are destined to reimplement it.” (Jon Postel)
• Ack-clocked flow control
• Packet loss based congestion control
• Sawtooth throughput
• Incast throughput collapse
Network resource usage measurements
• Round trip time: RTT_i = C_i − S_i
• Combines queueing effects on forward and reverse path + response time
[Diagram: request i leaves the sender (clock 1) at S_i and its response returns to the sender at C_i; clock 2 is at the receiver.]
Network resource usage measurements
• One-way delay: OWD_i = R_i − S_i
• Isolates queueing effects on the forward path, but
• Requires synchronized clocks
[Diagram: packet i is sent at S_i on clock 1 and received at R_i on clock 2.]
Network resource usage measurements
• Relative forward delay: RFD_{i,j} = (R_j − R_i) − (S_j − S_i)
• Isolates queueing effects on the forward path, and
• Does not require synchronized clocks
• But they must be relatively stable (all three measurements are sketched below)
[Diagram: packets i and j are sent at S_i and S_j on clock 1 and received at R_i and R_j on clock 2.]
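The three delay measurements side by side, as a small illustrative sketch (the timestamps are made up; only the arithmetic comes from the definitions above):

```python
# S = send times on the sender's clock, R = receive times on the receiver's
# clock, C = completion times back on the sender's clock.

def rtt(S, C, i):
    return C[i] - S[i]                      # one clock; forward + reverse path

def owd(S, R, i):
    return R[i] - S[i]                      # needs synchronized clocks

def rfd(S, R, i, j):
    # relative forward delay: clocks need only be stable, not synchronized,
    # because any constant offset between them cancels out of the difference
    return (R[j] - R[i]) - (S[j] - S[i])

S = [0.0, 1.0]; R = [5.2, 6.5]; C = [10.3, 12.9]   # illustrative timestamps
print(rtt(S, C, 0), owd(S, R, 0), rfd(S, R, 0, 1))  # 10.3, 5.2, 0.3
```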
RADoN
• A reservation has a network share (utilization) and a time granularity (period)
• Two real-time scheduling algorithms
• Earliest Deadline First (EDF): absolute deadlines
• Least Laxity First (LLF): relative laxities
[Diagram: job timeline showing release, now, laxity, and deadline.]
Approximating optimal scheduling
• Flow control: throttling senders
• Execution time (per period): e = utilization × period
• Budget in packets: m = e × packets_per_second
• Congestion control: avoiding switch contention by adjusting the wait time between packets (sketched below)
• Percent budget: %budget = (1 − %laxity) = e/(d − t)
• Packet wait time: w = w_min / %budget
• Step size: w_Δ = −|w_i − w_min|/2
• New wait time: w_{i+1} = min(w_max, max(w_min, w_Δ))
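A hedged sketch of the sender-side pacing this implies, using dimensionally consistent readings of the formulas above (it implements only the w = w_min/%budget rule; the real RADoN update rule may differ in details):

```python
# Sketch of RADoN-style sender pacing: compute the per-period transmit
# budget from the reservation, then stretch the inter-packet gap as the
# remaining budget shrinks relative to the time left before the deadline.

def per_period_budget(utilization, period, packets_per_second):
    e = utilization * period            # transmit time reserved per period
    m = e * packets_per_second          # budget in packets per period
    return e, m

def next_wait_time(e_remaining, deadline, now, w_min, w_max):
    pct_budget = e_remaining / (deadline - now)   # = 1 - %laxity
    w = w_min / pct_budget              # widen the gap when laxity is large
    return min(w_max, max(w_min, w))    # clamp to [w_min, w_max]

# 30% reservation at a 100 ms period on a 10,000 packet/s link
e, m = per_period_budget(utilization=0.3, period=0.1, packets_per_second=10000)
print(e, m)   # 0.03 s of transmit time, 300 packets per period
print(next_wait_time(e_remaining=0.03, deadline=0.1, now=0.0,
                     w_min=1e-4, w_max=1e-2))   # ~0.33 ms between packets
```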
Queue modeling: single network stream
• No contention: 765 Mbps w/no lost packets
[Figure: queue depth (packets) vs. packet sequence number for the basic, median-filter, and pathload models.]
Queue modeling: punctuated stream
• Contention: 5 bursts of 250 Mbps
[Figure: queue depth (packets) vs. packet sequence number for the basic, median-filter, and pathload models, with lost packets marked.]
The median filter detects congestion before packet loss occurs, and detects the decreasing queue size after congestion (a sketch of the idea follows).
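A minimal sketch of the median-filter idea (the window size, threshold, and queue-depth samples are assumptions for illustration):

```python
# Smooth per-packet queue-depth estimates (e.g., derived from relative
# forward delays) with a sliding median, so that a sustained rise signals
# congestion before packets are actually lost.

from statistics import median

def median_filter(samples, window=5):
    out = []
    for i in range(len(samples)):
        lo = max(0, i - window + 1)
        out.append(median(samples[lo:i + 1]))  # median over trailing window
    return out

depth = [0, 1, 0, 2, 9, 11, 12, 10, 3, 1]      # illustrative queue depths
smoothed = median_filter(depth)
congested = [d > 8 for d in smoothed]           # threshold is an assumption
print(smoothed)    # brief spikes are suppressed; sustained rises survive
print(congested)   # congestion flagged while the queue stays deep
```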
Queue modeling: punctuated adaptive stream
• Contention: 5 bursts of 250 Mbps
Adapting to median-filter model decreases packet loss
[Figure: queue depth (packets) vs. packet sequence number for the basic, median-filter, and pathload models, with lost packets marked.]
Userspace RADoN prototype
• Detects congestion using Relative Forward Delay
• Responds to congestion using RAD real-time theory
• Decreases packet loss significantly
• Improves goodput
• Requires no global knowledge or synchronization
• Ongoing: RADoN kernel implementation
Buffer management for I/O guarantees
• Goals
• Hard and soft performance guarantees
• Isolation between I/O streams
• Improved I/O performance
• Challenging because:
• Buffer is space-shared rather than time-shared
• Space limits time guarantees
• Best- and worst-case are opposite of disk
• Buffering affects performance in non-obvious ways
Guarantees in the buffer cache
• Role
• Improve performance
• Preserve & enhance guarantees
• App-specific guarantees:
• Hard at core
• Soft when possible
• Predictable
• Hard isolation
• Device time utilization
[Diagram: apps 1..n issue I/O through the buffer cache and disk scheduler to the disk; reservations are coordinated by a resource broker.]
Buffering roles in storage servers
• Staging and de-staging data
• Decouples sender and receiver
• Speed matching
• Allows slower and faster devices to communicate
• Traffic shaping
• Shapes traffic to optimize the performance of interfacing devices
• Assumption: reuse primarily occurs at the client
Radium
• I/O into and out of the buffer has rates and time granularities (periods)
• Period transformation: the period into the cache may be shorter than from the cache to the disk (sketched below)
• Rate transformation: the rate into the cache may be higher than the disk can support
• Partition cache based on I/O characteristics and performance requirements
• Cache policies enhance performance within constraints determined by I/O requirements
• Use slack to prefetch reads and delay writes
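A small sketch of what period transformation implies for buffer space, under a simple double-buffering assumption (the sizing rule is illustrative, not Radium's actual policy):

```python
# Sketch of Radium-style period transformation: if writes enter the cache
# with a short period but drain to disk with a longer one, the cache must
# hold the data that accumulates across the longer disk-side period.

def buffer_needed(rate_bytes_per_sec, period_in, period_out):
    assert period_out >= period_in       # the cache absorbs the difference
    # one period_out worth of incoming data, double-buffered so the next
    # period_out can fill while the previous one drains to disk
    return 2 * rate_bytes_per_sec * period_out

# A 10 MB/s stream buffered from a 100 ms app period to a 1 s disk period
print(buffer_needed(10 * 2**20, period_in=0.1, period_out=1.0) / 2**20, "MB")
# -> 20.0 MB of cache space to preserve the guarantee across the transform
```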
Enhancing guarantees in the buffer cache
• Reclaim unused resources (e.g., unused overhead)
• Use slack to prefetch reads and delay writes
• Allow more unguaranteed services
• Resource redistribution (buffer swapping) accommodates burstiness
• Period transformation: period into cache may be shorter than from cache to disk
• Rate transformation: rate into cache may be higher than disk can support
Managing a sequential workload
[Figure: throughput (thousands of I/Os per sec) for the sequential workload, executed in isolation (no-cache vs. monolithic) and combined with a random workload with reservations (no-cache, monolithic, Radium); schedulers compared: fifo, noop, deadline, anticipatory, cfq (50%/50%), quanta (50%/50%, 2 sec), and RAD (50%/50%, 2 sec); target performance and reservation marked.]
Managing a random workload
[Figure: throughput (tens of I/Os per sec) for the random workload, executed in isolation and combined with the sequential workload with reservations, under the same configurations and schedulers; target performance and reservation marked.]
Managing combined workloads
[Figure: combined throughput (thousands of I/Os per sec) of the random (top) and sequential (bottom) workloads under fifo, noop, deadline, anticipatory, cfq, quanta, and RAD, for the no-cache, monolithic, and Radium configurations; target performance marked.]
Controlling throughput w/mixed workloads
[Figure: actual vs. target throughput (I/Os per sec) for streams 1 and 2 (2 sec periods) against their ideal lines, under Radium+FIFO, Radium+NOOP, Radium+deadline, Radium+Anticipatory, Radium+CFQ, and Radium+RAD.]
Consistent and predictable throughput for arbitrary reservations
Controlling latency w/mixed workloads
[Figure: cumulative distribution of latency (ms) under Radium+CFQ and Radium+RAD for periods of 1 sec, 750 ms, 500 ms, and 250 ms, with upper bounds marked.]
Precise control over the service times of each stream
Results w/complex workloads
[Figure: average throughput (I/Os per sec) over time under Radium+CFQ and Radium+RAD for soft streams 1 and 2 (period 500 ms), hard stream 3 (period 500 ms), and greedy streams 4 and 5 (period 1 sec).]
Reasonable control with complex workloads
Data center performance management
• Big distributed systems
• Serve many users/jobs
• Process petabytes of data
• Data center design
• Use rules of thumb
• Over-provision
• Isolate
• Ad hoc performance management creates marginal storage systems that cost more than necessary
• A better system would guarantee each user the performance they need from the CPUs, memory, disks, and network
Data center performance mgmt. goals
1. A first-principles model for data center perf. mgmt.
2. Full-system performance metrics for client processing nodes, buffer cache, network, server buffer cache, and disk
3. Performance visualization by application, client node, reservation, or device
4. Application workload profiling and modeling
5. Full system performance provisioning and management based on all of the above
6. Online machine-learning based performance monitoring for real-time diagnostics
RADIX
• $1 million from UC Lab Fee program
• Based on schedulers and workload-independent utilization metrics from our E2E QoS research
• Plan
1. Performance model and metrics
2. Tools for profiling, prediction, and planning
3. Operating systems components
4. Performance monitors and visualization tools
• Case study: LANL data centers
Conclusion
• Distributed I/O performance management requires management of many separate components
• An integrated approach is needed
• RAD provides the basis for a solution
• It has been successfully applied to several resources: CPU, disk, network, and buffer cache
• We are on our way to an integrated solution
• There are many useful applications: Data center performance management, full storage virtualization, ...