will computer systems with performance guarantees ever go ...keynote. hase 2014. miami, fl, usa. jan...
TRANSCRIPT
Will Computer Systems With Performance
Guarantees Ever Go Mainstream?
Keynote. HASE 2014.
Miami, FL, USA. Jan 10, 2014
Juan A. Colmenares, Ph.D. 1
Will Computer Systems With Performance Guarantees Ever Go Mainstream?
Keynote
Presented at the 15th IEEE International Symposium on High Assurance Systems Engineering (HASE 2014)
January 10, 2014 / Miami, Florida, USA
Juan A. Colmenares
Computer Science Laboratory (CSL)
Samsung Research America – Silicon Valley (SRA-SV)
Disclaimer
• No part of this presentation necessarily represents the views and opinions of my current and former employers or my research collaborators.
Introduction
• Performance guarantees are key in mission-critical, cyber-physical systems
– Considered an ultra-specialized area
• Average performance
– Today’s common figure of merit for software systems
• For 10+ years, sustained demand for high-quality multimedia applications
– Multi-party video conferencing and video/audio on demand
– Typical motivation for (probabilistic) performance guarantees
• Now, Internet-based service providers are starting to show interest in offering predictably responsive interactive services
– To differentiate themselves from the competition
– To retain existing customers/users and attract new ones
Introduction
• Developing distributed computing systems with performance guarantees is
– Harder
– More expensive
– More time consuming
• We do it only when it is strictly necessary
• Naturally, we defer such hard problems until there is no other choice but to face them
– Notable recent example: parallel computing
Questions
• Will current trends force us to develop massively used computer systems with some type of performance guarantees?
• And if so, are we prepared?
Mainstream Applications and Systems: Some Targets
• Currently popular or expected to become popular (used by millions)
• With clear demands for performance guarantees
• Developed and supported by multiple, large teams
Targets: Data Centers (Cloud Computing); Networked Sensors and Actuators
Image sources: http://www.thinkgig.com, http://www.isisingenieria.com
[ Phrase borrowed from Prof. Jan Rabaey, UC Berkeley Swarm Lab ]
Cloud Computing: A Target for Performance Guarantees
• Run in data centers
– Serving millions of people
• Examples of what to guarantee
– Their contributions to the service response times experienced by users
– Throughput for media content delivery
Examples: Web Search; Media Content Delivery
Image sources: www.adobe.com, thinkjudd.com
Networked Sensors and Actuators: A Target for Performance Guarantees
• Some apps are basically control systems, wired and/or wireless
• Deployed in the environment
• Examples of what to guarantee
– Response latencies of critical actions
• Some apps with strict requirements
– Autonomous Cars (from article by Tom Vanderbilt, Feb 2012: http://www.wired.com/magazine/2012/01/ff_autonomouscars/)
– Amazon Prime Air Rotorcraft (http://www.amazon.com)
– Robotic Assistants (Source: Honda)
Swarm of Devices at the Edge of the Cloud
[ Prof. Jan Rabaey, ASPDAC’08 ]
[Figure labels: Infrastructural core; The Cloud; Mobile Access & Relay; The Swarm]
Swarm of Devices at the Edge of the Cloud
http://www.wired.com
www.popsci.com.au
The Cloud
Smart Homes and Spaces: Swarm of Devices at the Edge of the Cloud
• Samsung Smart Home (http://www.samsung.com)
• Corning’s A Day Made of Glass (http://www.corning.com)
Smart Cities: Focus of the TerraSwarm Research Center [TS12]
• Meant to handle two cases
– Normal operation and disasters
• Integrate
– Fixed infrastructure
• e.g., environmental monitoring, energy usage, tracking and mapping
– Mobile assets (automatic vehicles, UAVs, robots)
– Immersive humans
• Cloud as a companion, but data locality is key for latency
[TS12] Lee et al. The TerraSwarm Research Center (TSRC) (A White Paper). Tech Report No. UCB/EECS-2012-207
Smart Cities: Swarm of Devices at the Edge of the Cloud
• Some initiatives in industry and academia
– IBM’s The Smarter City
– Schneider Electric’s Smart Cities Solution
– TerraSwarm Research Center @ UC Berkeley
– Center for Urban Science + Progress (CUSP) @ NYU
http://www.schneider-electric.com
Clear Demands for Performance Guarantees May Not Be Enough
• GUIs should provide response-time guarantees to users
– At least for some meaningful actions
• But I don’t expect major improvement here soon
– Desktops: popularity in decline, so no major interest in improving guarantees of desktop GUIs
– Mobile devices (e.g., smartphones and tablets): popular indeed, but battery life is a much more pressing issue
– Wearable devices (e.g., smart watches and glasses, such as Samsung Gear and Google Glass): on the rise, but with a similar battery-life issue
• GUI hardware acceleration is key in keeping response times low
What Performance Guarantees Are We Talking About?
• We often seek guarantees on:
– Throughput (e.g., requests per second)
– Response latency (e.g., service time)
• Other interesting performance metrics
– Energy and power consumption (e.g., energy/power budget)
– Time to recovery (e.g., guaranteed maximum recovery time)
• What type of guarantees?
– Probabilistic with high confidence (mostly)
• Often easier targets than hard guarantees
• Leave more room for tradeoffs instead of largely overprovisioning for the infrequent worst cases
Our focus today
Any Performance Guarantees Offered by Public Clouds?
• None so far
– At least by 3 major cloud providers
• Service Level Agreements (SLAs) are only about availability and accessibility
– e.g., monthly availability > 99.95%; otherwise, you get service credits
• Will competition have any effect?
https://cloud.google.com http://aws.amazon.com http://www.windowsazure.com
I/O Bandwidth Provisioning in Amazon Elastic Block Store (EBS)
Signs of Improvement in Public Clouds
• Amazon EBS offers volumes
– Durable, block-level storage devices
• For a virtual-machine instance, an EBS volume appears as a native block device similar to a hard drive
• Provisioned IOPS volumes
– Offer consistent performance for I/O-intensive workloads (e.g., databases) in Amazon EC2
– Designed to deliver within 10% of the specified IOPS rate 99.9% of the time
• But this is NOT part of any SLA!
– IOPS rate up to 4000 IOPS per volume
– Volume sizes from 10 GB to 1 TB
• Possibly inspired by
– Gulati et al. mClock: Handling throughput variability for hypervisor IO scheduling. OSDI 2010.
Source: http://aws.amazon.com/ebs/piops/
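As a rough illustration of the design target quoted above ("within 10% of the specified IOPS rate 99.9% of the time"), the Python sketch below checks a series of per-interval IOPS measurements against it. The function name, the sampling scheme, and the reading of "within 10%" as "at least 90% of the provisioned rate" are assumptions made for illustration. This is not Amazon's API or measurement method.

```python
def meets_iops_target(samples, provisioned_iops, tolerance=0.10, fraction=0.999):
    """Check measured IOPS samples against a probabilistic design target.

    samples: IOPS observed in successive measurement intervals (assumed input).
    Returns True if at least `fraction` of the samples reach
    (1 - tolerance) * provisioned_iops.
    """
    if not samples:
        raise ValueError("need at least one sample")
    floor = (1.0 - tolerance) * provisioned_iops
    conforming = sum(1 for s in samples if s >= floor)
    return conforming / len(samples) >= fraction
```

A monitoring job could call `meets_iops_target(window, 4000)` over a sliding window of samples. Since the target is not part of any SLA, a failed check is only an observation, not a contract violation.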
SolidFire’s All-Flash Storage Infrastructure with QoS
Signs of Improvement in Public Clouds
• At full scale (100 nodes) able to deliver
– 3.4 PB of effective capacity
– 7.5 million IOPS
• Reduced cost
– Below $3/GB and below $1/IOPS (60 TB to 3.4 PB)
• Below the cost of traditional performance disk solutions
• Able to guarantee performance to thousands of volumes within a shared storage system
– Pending patent on QoS capabilities
http://www.solidfire.com
July 2013
Allocated Bandwidth for Streaming Servers in Windows Azure
Signs of Improvement in Public Clouds
• Media Services enable creation, management, and distribution of media
– e.g., encoding and on-demand streaming
• Reserved Units (RUs)
– Dedicated set of resources for media processing tasks
– Highly recommended for on-demand streaming
• Actually, the availability SLA is only valid with RUs
• Each RU provides bandwidth up to 200 Mbps for streaming origin servers
– Bandwidth allocation NOT part of any SLA
– Availability SLA only applies when using <= 80% of available bandwidth
Source: http://www.windowsazure.com/en-us/support/legal/sla/
Research Efforts: Clear Interest in Improving
• Barker and Shenoy (UMass). Empirical evaluation of latency-sensitive application performance in the cloud. MMSys 2010.
– Focus on interference of dynamically varying background load on latency-sensitive tasks
– Careful configurations mitigate, but do not eliminate, interference
• Dean and Barroso (Google). The tail at scale. CACM 2013.
– Latency tail-tolerant software techniques to build predictable systems out of less predictable parts
• Ferguson et al. (MSR). Jockey: Guaranteed job latency in data parallel clusters. EuroSys 2012.
– Latency guarantees for parallel data processing jobs using a resource allocation control loop
• Terry et al. (MSR). Consistency-based service level agreements for cloud storage. SOSP 2013.
– A replicated key-value store that allows applications to declare their consistency and latency priorities via consistency-based SLAs
So Far …
• In cloud computing
– No public offerings with high-confidence performance guarantees available
• As far as we can tell
– How about private offerings?
• In some apps with networked sensors and actuators
– Clear requirements of performance guarantees due to safety
• Future swarm applications
– A number will require performance guarantees
– Also need to interact with the Cloud
Design Principles and Techniques to Build Software Systems with High-Confidence Performance Guarantees
• Already available for system developers to adopt
• Some challenges
– Performance guarantees considered less important than other requirements
• Not perceived as a differentiating factor
– Additional complexity
– End-to-end properties
• Multi-layered factors
• A piece-by-piece game with distributed responsibility
– Input dependent
– Cost effectiveness
• Especially considering the investment in existing systems
– Trained workforce
Design Principles and Techniques to Build Software Systems with High-Confidence Performance Guarantees
• Next we will discuss
– Divide-and-conquer design principle
– Limiting system load
– Mitigating performance variability
Divide and Conquer
• Systems should be built to enable systematic evaluation (via analysis and/or measurements) of:
– The individual contributions of factors that make up the system’s performance
– The effects of the combination of those factors on the system’s performance
• This design principle is key
– Systems include multiple components or sub-systems
– Performance guarantees of interest are usually end-to-end
– Multiple factors influence the system’s performance
• e.g., architectural features, algorithmic efficiency, task scheduling, memory management, I/O behavior, thread affinity, cache locality, etc.
• But too many factors can influence performance
– To make it practical, we should consider the most important ones for the system at hand
Performance Decoupling of System Components: Enabling Divide and Conquer
• Extension of software componentization to performance aspects
– Software components are used to divide the system’s logic into parts of manageable complexity
• The idea is to evaluate the contributions of individual components to the system’s performance
• KV-Cache [UCC13]
– Hash table coupled with a replacement logic
– Exploits a software absolute zero-copy approach and aggressive customization to offer high performance
An In-Memory Key-Value Cache: Example of Performance Decoupling and Customization
[Figure labels: Application Layer; Non-Blocking Queue-based CLOCK (Replacement Logic); Hash Table (with Fine Grain Locks); Comm & Mem Mgmt Layer (10G NIC Driver + UDP + Mem Pools); Non-Blocking Channels; > Decoupling <]
Implemented on Genode/Fiasco.OC µkernel
[UCC13] Waddington, Colmenares, Kuang and Song. KV-Cache: A scalable high-performance web-object caching for manycore. 6th IEEE/ACM Int’l Conference on Utility and Cloud Computing.
In-Memory Web-Object Caching: Overview
• Widely used by Internet-based service providers to reduce latency and increase system throughput
– Memcached: a popular example
• www.memcached.org
Typical Side-Cache Deployment
A De Facto Figure of Merit: Capacity [IGCC11, BagLRU]
Maximum throughput (in RPS) the system can sustain with an average round-trip time (RTT) below 1 ms
[BagLRU] Wiggins and Langston. Enhancing the scalability of memcached. Intel Tech. Rep. 2012
(http://software.intel.com/en-us/articles/enhancing-the-scalability-of-memcached-0)
[IGCC11] Berezecki et al. Manycore key-value store. Proc. of the 2011 Int’l Green Computing Conference. 2011.
Experimental Results: KV-Cache vs. Intel’s Bag-LRU Memcached [BagLRU]
[Figure: Latency comparison for one million GET requests at 600K RPS (a slow rate)]
[Figure: Throughput comparison with average round-trip time < 1 ms; annotation: 2x]
[BagLRU] Wiggins and Langston. Enhancing the scalability of memcached. Intel Tech. Rep. 2012
An In-Memory Key-Value Cache: Example of Performance Decoupling and Customization
• We could have a stricter figure of merit (stronger guarantees)
– Maximum throughput (in RPS) the system can sustain with
• A target RTT of 1 ms observed on average, and
• No more than 0.1% of late responses, arriving after the target RTT
[Figure: Round-trip time distribution at 3 million RPS for a single NIC (3 million GET requests)]
KV-Cache [UCC13] never exceeded the target round-trip time of 1 ms!
[UCC13] Waddington, Colmenares, Kuang and Song. KV-Cache: A scalable high-performance web-object caching for manycore. 6th IEEE/ACM Int’l Conference on Utility and Cloud Computing.
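The stricter figure of merit above can be written down as a small check over measured round-trip times. The sketch below is only an illustration: the function names, the input layout (a mapping from offered load in RPS to RTT samples in milliseconds), and the use of the arithmetic mean for "observed on average" are assumptions, not the procedure used in [UCC13].

```python
from statistics import fmean

def meets_figure_of_merit(rtts_ms, target_ms=1.0, max_late_fraction=0.001):
    """True if the mean RTT is within the target and at most
    max_late_fraction of the responses arrive after the target RTT."""
    if not rtts_ms:
        raise ValueError("need at least one RTT sample")
    late = sum(1 for r in rtts_ms if r > target_ms)
    return fmean(rtts_ms) <= target_ms and late / len(rtts_ms) <= max_late_fraction

def capacity(rtts_by_rps, **criteria):
    """Highest offered load (RPS) whose samples satisfy the figure of merit,
    or None if no load level does."""
    passing = [rps for rps, rtts in rtts_by_rps.items()
               if meets_figure_of_merit(rtts, **criteria)]
    return max(passing, default=None)
```

With this definition, a run whose average is fine but whose tail is heavy no longer counts toward capacity, which is the point of the stronger guarantee.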
Space-Time Partitioning: Enabling Divide and Conquer
[Figure: partitions laid out over Space and Time; the yellow partition grows due to adaptation]
• Spatial partition: key for performance isolation
– Hard boundaries and controlled communication between partitions
• Spatial partitioning is not static and may vary over time
– Partitions can be time multiplexed; resources are gang-scheduled
– Partitioning adapts to system’s needs
• Each partition receives a vector of basic resources
– A number of hardware threads, memory pages, a portion of cache segments, memory bandwidth, and energy budget
• A partition may also receive
– Exclusive access to other resources (e.g., a device)
– Guaranteed fractional services from other partitions
Controlled multiplexing is key
The Cell: Our Partitioning Abstraction
User-level Software Container with Guaranteed Access to Resources
[Figure labels: Cell A; Cell B; Task; Address Space A; Address Space B; 2nd-level Scheduling; 2nd-level Mem Mgmt; axes Space and Time; the yellow partition grows due to adaptation]
• Basic properties of cells
– Full control over the resources a cell owns when mapped to hardware
– One or more address spaces (protection domains)
– Efficient inter-cell communication channels
2nd-level runtime must be adaptive, too
Basis of a Component-based Model with Composable Performance
• Applications = Set of interacting components deployed on different cells
– Applications split into performance-incompatible and mutually distrusting cells with controlled communication
– OS services are independent servers that provide QoS
• Requires fast inter-cell communication
– Could use hardware acceleration for fast messaging
[Figure labels: Application Component; Device Drivers; File Service; Real-time Cell; Core Application; Parallel Library; Channel; Storage Device; Tessellation Kernel (Partition Support)]
Customizable User-Level Runtimes
PULSE: A framework for Preemptive User-Level SchEdulers
• Available preemptive schedulers
– Round-robin (and pthreads)
– EDF and Fixed Priority
– Multiprocessor Constant Bandwidth Server (M-CBS) [ECRTS’04]
– Juggle: A load balancer for SPMD applications [CLUSTER’12]
• Able to handle cell resizing
[Figure labels: Application; Cell; PULSE Framework; Scheduler X; Hardware cores; Timer interrupts; Tessellation Kernel (Partition Support)]
[ECRTS’04] S. Baruah et al. Executing aperiodic jobs in a multiprocessor constant-bandwidth server implementation. ECRTS'04.
[CLUSTER’12] S. Hofmeyr, J. Colmenares et al. Juggle: Addressing extrinsic load imbalances in SPMD applications on multicore computers. Cluster Computing Journal.
Network Service: An OS Service with QoS Guarantees [DAC’13, JAES’13]
• Supports reservations (i.e., differentiated service classes) and proportional share of bandwidth
– Using mClock scheduling algorithm [OSDI’10] (on top of PULSE)
• NIC driver is entirely contained in user-space
– No system calls when transmitting and receiving buffers
[Figure annotation: Avg. throughput = 125.2 KB/s]
[DAC’13] Colmenares, et al. Tessellation: Refactoring the OS around explicit resource containers with continuous adaptation.
[JAES’13] Colmenares, et al. A multi-core operating system with QoS-guarantees for network audio applications.
[OSDI’10] A. Gulati et al. mClock: handling throughput variability for hypervisor IO scheduling.
A Divide and Conquer Approach to Deriving Time Bounds
[Figure: two stages, each combined via a hybrid approach (H+)]
• Stage 1: individual functions running in isolation
– Analytically derived execution-time bounds of functions
– Execution-time measurements of functions
– Result: tight execution-time bounds of functions
• Stage 2: concurrent function executions, resource sharing, and communication activities
– Analytically derived service-time bounds of functions
– Service-time measurements of functions
– Result: tight service-time bounds of functions
[Figure: on the time axis, the adopted bound lies between the maximum observed value (optimistic) and the analytically derived bound (pessimistic)]
[IESS09] Colmenares et al. Experimental evaluation of a hybrid approach for deriving service-time bounds of methods in real-time distributed computing objects. Proc. Int'l Embedded Systems Symposium 2009.
Basic Approaches for Deriving Time Bounds
• Static Analysis Approaches
– Hard bound with a practically zero probability of being violated at run time
– Tend to produce excessively loose bounds when applied to modern fully-featured processors
• Measurement-based Approaches
– Bound = maximum measured execution-time value + safety margin
– Soft bound with a non-negligible probability of being exceeded at run time
– May not cover the worst case
We want a tight time bound in between.
But how to determine the safety margin?
Curve Fitting Technique: Central to the Hybrid Approach
• Combines (1) measurements and (2) loose but analytically-derived hard bounds to produce reasonably safe and tight time bounds
[Figure labels: α; Margin value; Probability of the soft bound being exceeded at run time]
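To make the idea concrete, here is a minimal Python sketch of how a fitted curve could be turned into a bound. It assumes that a Richards (generalized logistic) curve of the form F(t) = (1 + q·exp(-b·(t - m)))^(-1/nu) has already been fitted to the empirical CDF of the measurements, and it treats α as the acceptable probability of exceeding the soft bound, which is my reading of the figure labels. The parameter names (b, m, q, nu), the function names, and the clamping step are illustrative assumptions, not the procedure of [IESS09]. The fitting step itself is not shown.

```python
import math

def richards_cdf(t, b, m, q=1.0, nu=1.0):
    """Richards (generalized logistic) curve with asymptotes 0 and 1."""
    return (1.0 + q * math.exp(-b * (t - m))) ** (-1.0 / nu)

def richards_quantile(p, b, m, q=1.0, nu=1.0):
    """Inverse of richards_cdf: the time t at which the curve reaches p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return m - math.log((p ** (-nu) - 1.0) / q) / b

def adopted_bound(alpha, analytic_bound, max_observed, b, m, q=1.0, nu=1.0):
    """Soft bound exceeded with probability about alpha under the fitted
    curve, kept between the maximum observed value and the analytic bound."""
    soft = richards_quantile(1.0 - alpha, b, m, q, nu)
    return min(max(soft, max_observed), analytic_bound)
```

Smaller values of α push the soft bound toward the analytically derived hard bound. Larger values pull it toward the maximum observed value. That is the tradeoff between safety and tightness that the margin is meant to control.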
A Televideo Application
[Figure labels: Node 1; Node 2; Local User; Remote User; Display Windows; Video Streams; Performance Metric Reports (feedback); TVTMO; TMOSM; OS/HW Platform; Network]
Performance Metrics
• Throughput (at the application level)
• Message loss rate
• End-to-end delay
A Televideo Application
• Frame size: 320 x 240
• Frame rate: 10 fps
• Color depth: 24 bits
• CODEC: MPEG-4 (FFmpeg implementation)
Obtaining a Tight Service Bound for a Function via the Hybrid Approach
[Figure: estimated probability versus time (0 to 60 ms), showing the CDF and the Richards model. Analytically derived bound: 54 ms. Adopted bound: 30 ms.]
Limiting System Load
• We can only guarantee performance under certain load limits and conditions (i.e., input)
[Figure: Example; labels: Avg.; 100% GET requests]
On-Line Admission Control: Limiting System Load
• Hey system! Can you guarantee performance X for this job?
• Some possible answers
– Sure, no problema! [The rare happy case]
– Yes, but let me put the house in order
• Possible downgrading and revocation
– Nope. I am sorry. Bye.
– Nope, but let’s negotiate a little bit
• Not with performance X, but with performance Y. Is this OK with you?
• Typical issues
– Computational cost
– Reduction in effective system utilization due to pessimism in the analysis
• Some efforts to deal with those issues
– Nie et al. Capacity-based admission control for mixed periodic and aperiodic real time service processes. SOCA 2011.
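The dialogue above can be sketched as a tiny admission controller. The Python below is a deliberately simplified illustration under stated assumptions: "performance X" is modeled as a reserved request rate in RPS, the system has a single fixed capacity, and the class and field names are hypothetical. It covers three of the answers (accept, counter-offer, reject). The "put the house in order" case, which needs downgrading or revocation of existing reservations, is left out.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    admitted: bool
    offered_rps: int   # rate granted, or rate proposed in a counter-offer
    note: str

class AdmissionController:
    """Capacity-based admission test over a single resource (assumed model)."""

    def __init__(self, capacity_rps, min_offer_rps=1):
        self.capacity_rps = capacity_rps
        self.min_offer_rps = min_offer_rps
        self.reserved_rps = 0

    def request(self, wanted_rps, negotiable=True):
        free = self.capacity_rps - self.reserved_rps
        if wanted_rps <= free:
            # "Sure, no problema!"
            self.reserved_rps += wanted_rps
            return Decision(True, wanted_rps, "accepted")
        if negotiable and free >= self.min_offer_rps:
            # "Not with performance X, but with performance Y."
            return Decision(False, free, "counter-offer")
        # "Nope. I am sorry. Bye."
        return Decision(False, 0, "rejected")
```

A counter-offer reserves nothing by itself. A caller that accepts it simply issues a new `request` for the offered rate. Integer rates keep the bookkeeping exact in this toy model. A real test would also have to account for the pessimism mentioned above.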
Load Regulation and Shaping: Limiting System Load
• Limit request rate or progress rate
– Maximum number of requests in a given interval, or maximum inter-arrival rate (MIR)
• Leaky bucket
– Classic textbook example of traffic shaping
• Handling excess of work
– Queue requests, and drop if too many
– Trade off content quality
• Good-enough in-time content can be better than late content
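For reference, a leaky bucket used as a meter fits in a few lines. In the sketch below each admitted request adds one unit to the bucket, the bucket drains at a fixed rate, and a request that would overflow it is reported as non-conforming so the caller can queue or drop it. The class name and the choice to pass the current time explicitly (for example, from `time.monotonic()`) are illustrative assumptions.

```python
class LeakyBucket:
    """Leaky bucket as a meter: drains at rate_per_s, holds at most capacity."""

    def __init__(self, rate_per_s, capacity):
        self.rate_per_s = float(rate_per_s)
        self.capacity = float(capacity)
        self.level = 0.0
        self.last_time = None

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) conforms."""
        if self.last_time is not None:
            elapsed = max(0.0, now - self.last_time)
            # Drain the bucket for the time that has passed since the last call.
            self.level = max(0.0, self.level - elapsed * self.rate_per_s)
        self.last_time = now
        if self.level + 1.0 <= self.capacity:
            self.level += 1.0
            return True
        return False
```

Here `capacity` bounds the burst size and `rate_per_s` bounds the sustained request rate. What to do with non-conforming requests (queue, drop, or serve at lower quality) is the policy decision listed above.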
Mitigating Performance Variability
• Computer systems (architectures, networking, and software) are often built favoring average performance over performance predictability
– e.g., multi-level caches and deep pipelines with dynamic dispatch and speculative execution
• Often in practice, building the system from scratch to remove/reduce unpredictability is not economically feasible
– So, to learn to live with it, we need. [Yoda!]
• Common technique: Overprovisioning
However, some are trying to reintroduce timing predictability and repeatability from the ground up for safety-critical systems
• Precision Timed (PRET) Machines @ UC Berkeley [http://chess.eecs.berkeley.edu/pret/]
• Time-Predictable Multi-Core Architecture for Embedded Systems (T-CREST) -- An EU Research Project [http://www.t-crest.org/]
Mitigating Latency Variability in Data Centers [CACM13]
• Issue the same request to multiple replicas and use the first response you get (hedged requests)
– Copies of the same request are sent with a short delay among them
– The client cancels outstanding requests once it gets the response
• Requests are sent to multiple servers, and the servers exchange cross-server status updates (tied requests)
– e.g., a server sends cancellations to the others once it starts servicing the request
• Can reduce latency with a modest load increase
– If causes of variability do not simultaneously affect the replicas
[CACM13] Dean and Barroso (Google). The tail at scale. Communications of the ACM. 2013.
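A hedged request, as described above, can be sketched with Python's asyncio. In this illustration each replica is a hypothetical async callable that takes the request and returns a response. The client sends the request to the first replica, waits a short hedge delay, sends a copy to the next replica if nothing has come back, and cancels whatever is still outstanding once the first response arrives. The function and parameter names are assumptions. Tied requests, which need server-side cooperation, are not covered.

```python
import asyncio

async def hedged_request(replicas, request, hedge_delay_s):
    """Send `request` to replicas one at a time, `hedge_delay_s` apart,
    and return the first response that completes."""
    if not replicas:
        raise ValueError("need at least one replica")
    tasks = []
    try:
        for i, replica in enumerate(replicas):
            tasks.append(asyncio.create_task(replica(request)))
            is_last = i == len(replicas) - 1
            # Wait for a response; give up after the hedge delay unless
            # there is no further replica to try.
            done, _ = await asyncio.wait(
                tasks,
                timeout=None if is_last else hedge_delay_s,
                return_when=asyncio.FIRST_COMPLETED,
            )
            if done:
                # Re-raises if the first finished replica failed.
                return next(iter(done)).result()
    finally:
        # Cancel outstanding copies and let the cancellations settle.
        for task in tasks:
            if not task.done():
                task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
```

The hedge delay controls the extra load: with a delay near the tail of the expected latency distribution, only a small fraction of requests are ever duplicated. This sketch returns the first completion even if it is an error. A production client would likely keep waiting for another replica in that case.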
Mitigating Latency Variability in Data Centers [CACM13]
• Latency-induced probation
– In some situations the system performs better by excluding a particularly slow machine and putting it on probation
• Slowness is often caused by temporary phenomena
• Interesting point
– Removal of serving capacity from a live system during periods of high load actually improves latency
[CACM13] Dean and Barroso (Google). The tail at scale. Communications of the ACM. 2013.
Adaptive Resource Allocation: A Complementary Technique
• Systems need to adapt to changes in the workloads (application and request mixes) and resource availability
• A number of efforts in this area:
– Yang et al. Redline: First class support for interactivity in commodity operating systems. OSDI 2008.
– Padala et al. Automated control of multiple virtualized resources. EuroSys 2009.
– Hoffmann et al. SEEC: A general and extensible framework for self-aware computing. Technical Report MIT-CSAIL-TR-2011-046.
– Sharifi et al. METE: Meeting end-to-end QoS in multicores through system-wide resource management. SIGMETRICS Perform. Eval. Rev., 39(1):13–24, June 2011.
Example Adaptive Control Loop
[Figure labels: Running System (Data Plane): Application 1; Application 2; Block Service; Network Service; GUI Service; QoS-aware Scheduler; Channel; Cell. Resource Allocation (Control Plane): Observation and Modeling; Partitioning and Distribution. Flows between the planes: Performance Reports; Resource Assignments.]
[DAC13] Colmenares et al. Tessellation: refactoring the OS around explicit resource containers with continuous adaptation. DAC 2013.
Other Complementary Techniques
• Workload characterization
• Load balancing
• Differentiating service classes
• Managing background activities and synchronized disruption
• Software customization
• High-precision global time
– e.g., Precision Time Protocol (PTP) -- IEEE 1588
Conclusions
• Current trends indicate that distributed software systems with performance guarantees are likely to
– Become very popular
– Demand a large number of software developers
• When!?
• Obstacles
– Other requirements perceived as more urgent
• Power and energy efficiency
• Security and privacy
• High availability
– Legal hurdles for motivating apps (e.g., autonomous cars)
• Design principles and techniques are available
– But need to be adapted to the system at hand
• Major challenges
– Cost effectiveness
– Trained workforce
THANKS
Questions?