Managing the Performance Impact of Administrative Utilities
Paper by S. Parekh, K. Rose, J. Hellerstein, S. Lightstone, M. Huras, and V. Chang
Presentation and Discussion Led by N. Tchervenski
CS 848, University of Waterloo
November 1, 2006
Outline
Introduction – performance impact of administrative utilities
Proposed solution
Architecture and control theory
Tests performed
Conclusion
Discussion
Performance Impact of Administrative Utilities
Administrative utilities are essential to the system, but have a performance impact
With 24/7 operation, it is never a good time to suffer performance degradation
Solution: find a way to slow them down
Example of DB Running a Backup
* Throughput and response time averaged over 60s intervals
How to Slow Down a Utility
Performance impact is dynamic – both for utilities and regular workloads (WLs)
Low-level approach: per-resource quotas / priorities are difficult to manage
Admin utility performance policy: at most x% degradation of production work
How to throttle utilities? SIS – self-imposed sleep
How to translate the policy requirement into throttling units?
SIS – Self-imposed Sleep
Action Interval and Sleep Fraction
Action interval = workTime + sleepTime
With the action interval held constant, we need only the sleep fraction:
Sleep fraction = sleepTime / action interval
Sleep fraction = 0 means unthrottled; 1 means fully stopped
Suggested value for the action interval: at least a few iterations of the “main loop” of the utility
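The definitions above translate directly into code. A minimal Python sketch of self-imposed sleep, assuming the utility exposes one iteration of its main loop as a callable (`utility_step` is a hypothetical name for illustration, not from the paper):

```python
import time

def run_throttled(utility_step, sleep_fraction, action_interval=1.0):
    """Run a utility with self-imposed sleep (SIS).

    Each action interval is split into work time and sleep time:
        action_interval = workTime + sleepTime
        sleep_fraction  = sleepTime / action_interval
    sleep_fraction = 0 -> unthrottled; 1 -> fully stopped
    (with sleep_fraction = 1 this loop only sleeps, i.e. the utility halts).
    """
    sleep_time = sleep_fraction * action_interval
    work_time = action_interval - sleep_time
    while True:
        deadline = time.monotonic() + work_time
        # Run main-loop iterations until the work portion of the interval is used up.
        while time.monotonic() < deadline:
            if not utility_step():   # utility_step() returns False when done
                return
        time.sleep(sleep_time)
```

In a real utility the throttle manager would update `sleep_fraction` between action intervals rather than fixing it up front.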
Throttle Manager Architecture
(Architecture diagram: the x% degradation policy feeds a PI controller, which outputs the sleepTime; the action interval is held constant; a linear model fitted on <sleepTime, performance> pairs closes the feedback loop.)
Degradation Estimator
Baseline estimator – system performance without utilities
Degradation = 1 – performance / baseline
How to determine the baseline?
Stopping all utilities is unattractive: WL surges distort short-term performance, and resources are underutilized
Instead: linear fitting of <sleepTime, performance> pairs
Performance = f(sleepTime) = Q1*sleepTime + Q0
Estimated online via recursive least squares with exponential forgetting
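A sketch of the online estimator described above: recursive least squares with exponential forgetting fits performance = Q1*sleepTime + Q0, and the baseline is read off as the predicted performance at full throttling (sleepTime = action interval). The forgetting-factor value, class name, and initialization are assumptions for illustration, not values from the paper:

```python
import numpy as np

class ThrottleModel:
    """Online linear fit performance = Q1*sleepTime + Q0 via recursive
    least squares (RLS) with exponential forgetting (assumed factor 0.95)."""

    def __init__(self, forgetting=0.95):
        self.lam = forgetting
        self.theta = np.zeros(2)      # [Q1, Q0]
        self.P = np.eye(2) * 1e4      # large initial covariance: trust data, not prior

    def update(self, sleep_time, performance):
        x = np.array([sleep_time, 1.0])
        # Standard RLS update with forgetting factor lam
        k = self.P @ x / (self.lam + x @ self.P @ x)
        self.theta = self.theta + k * (performance - x @ self.theta)
        self.P = (self.P - np.outer(k, x @ self.P)) / self.lam

    def baseline(self, action_interval):
        # Baseline = predicted performance with the utility fully throttled,
        # i.e. sleepTime = action interval.
        q1, q0 = self.theta
        return q1 * action_interval + q0

    def degradation(self, performance, action_interval):
        return 1.0 - performance / self.baseline(action_interval)
```

The forgetting factor trades adaptation speed against noise sensitivity, which is exactly the knob mentioned later under “Causes for Deviation”.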
Linear Fit Example of Sleep/Throughput
Steady workload, backup throttling kept constant for 20 minute intervals
(Plot legend: estimated baseline vs. actual baseline.)
Controller
Goal: current degradation = degradation limit
Error = degradation limit – current degradation
PI controller used:
Throttling(k+1) = Kp * error(k) + Ki * Sum(error(i), i=0..k)
Kp – proportional gain – increases the speed of response
Ki – integral gain – eliminates steady-state error
Kp, Ki and the control interval can be hard-coded or determined at runtime
Kp and Ki can be estimated by pole placement from control theory, but experimental results are necessary to confirm them [2]
Experiments in this paper: control interval = 20 seconds; Kp and Ki the same across all experiments
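The controller formula above can be sketched as follows. Since increasing sleepTime reduces degradation, the plant gain is negative, so with the error sign used on the slide the gains Kp and Ki come out negative; the gain values and the toy plant in the test below are assumptions for illustration, not the paper's:

```python
def make_pi_controller(kp, ki, limit):
    """PI controller for the degradation policy (sketch).

    Implements the slide's law:
        error(k)        = degradation_limit - current_degradation
        Throttling(k+1) = Kp * error(k) + Ki * Sum(error(i), i=0..k)
    With this sign convention Kp and Ki are negative, because more
    throttling (sleep) lowers degradation.
    """
    state = {"integral": 0.0}

    def control(current_degradation):
        error = limit - current_degradation
        state["integral"] += error            # running Sum(error(i), i=0..k)
        throttling = kp * error + ki * state["integral"]
        # The output is a sleep fraction, so clamp to [0, 1].
        return min(1.0, max(0.0, throttling))

    return control
```

Called once per control interval (20 s in the paper's experiments), the integral term drives the steady-state degradation to the limit even when the plant gain is only approximately known.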
Tests Performed
Testbed description: DB2 v8.1, 4-CPU RS/6000, 2 GB RAM, AIX 4.3, 8 physical disks
Workload similar to TPC-C
Initial “warm-up” period of 10 minutes, to stabilize the system / bufferpools / etc.
Utility used: parallelized BACKUP – multiple processes reading from multiple tablespaces, and multiple other processes writing to separate disks
OS Priorities vs. SIS (Sleep Fraction)
No performance gain from changing the OS priority of the backup process
OS priority works for CPU-intensive WLs; here we have an I/O-intensive WL, and the CPU is idle 80% of the time
Linear effect when throttling using sleep
(Plot legend: WL alone vs. 100% throttling.)
Dynamic Effect of SIS. Does “Turning the Knob” Actually Do Something?
As in previous slide, we don’t get back to 100% throughput when fully throttled, but we’re close.
(Plot annotations: backup started; 15 tps average.)
Feedback Control
X=30% degradation policy
Feedback Control Effectiveness
Without BACKUP: 15 tps
With x = 30% and a steady workload of 25 users: 9.4 tps, i.e. 38% degradation
Why the throttling slump? The throttling system compensates for the decreasing resource demands of the backup?
With x = 30% and a workload surge at 1500 s (from 10 users to 25 users): pre-surge degradation of 36%, post-surge degradation of 19% – still good results, close to the 30% policy
Causes for Deviation
Baseline estimator: actual throughput is 15.1 tps vs. a projected value of 13.2 tps
System stochastics: degradation is not always estimated correctly – for example, the drop in throttling at t = 1800 s; the system is quick to self-correct and gives correct results in the long term
Short-term violations could be avoided by trading off adaptation speed, i.e. by adjusting the forgetting factor in the online estimator
Conclusion
Administrative utilities must be run, but there is no timeslot for them
Proposed an application-based throttling mechanism – only the application's code needs to change, and it is OS/system independent
Easy for administrators to just specify degradation policy
Applicable to various systems. Main requirements:
The utility's work must be identifiable, so the sleeps can be inserted there
Performance must be measurable without much overhead
Limitations and Future Work
Test on multiple utilities – throttle each utility separately?
Propose and analyze different approaches for the controller: PI algorithm, recursive least squares estimator, etc. How should their parameters be specified?
Automate the determination of controller parameters, as they are system dependent
Discussion
Why the throttling slump in the feedback control? Even when the backup is fully throttled, the system may not return to its earlier peak performance, since it needs more time to stabilize (i.e. bufferpools again). This may be a better explanation for the difference between the projected baseline and the actual baseline.
Even if the tasks were CPU-intensive, assigning them a priority via the OS is not guaranteed to work, since they may interact with other parts of the engine – issue queries, etc. We can't slow the engine down for that.
Obviously this works since it’s been implemented in DB2 v8 and v9 – backup / rebalance / auto-runstats – all I/O intensive tasks.
Other ways to limit/control the impact of backup to DB system. Controlling bufferpools / memory. Automatic tuning of memory is introduced in DB2 v9.
How do we handle peak loads? How do we guarantee QoS? Can we monitor not only the TPS output, but also try to “expect” what the WL performance would be, based on the number of clients, the number of queries compiled/executed, and bufferpool activity/misses?
References
[1] Sujay Parekh, Kevin Rose, Joseph L. Hellerstein, Sam Lightstone, Matthew Huras, and Victor Chang. Managing the performance impact of administrative utilities. In Self-Managing Distributed Systems - 14th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management (DSOM 2003), number 2867 in Lecture Notes in Computer Science. Springer-Verlag, 2003.
[2] Diao, Y., Gandhi, N., Hellerstein, J. L., Parekh, S., Tilbury, D. M.: Using MIMO feedback control to enforce policies for interrelated metrics with application to the Apache web server. In: Proceedings of Network Operations and Management (2002).