Managing the Performance Impact of Administrative Utilities
Paper by S. Parekh, K. Rose, J. Hellerstein, S. Lightstone, M. Huras, and V. Chang
Presentation and Discussion Led by N. Tchervenski
CS 848, University of Waterloo
November 1, 2006
Outline
Introduction – performance impact of administrative utilities
Proposed solution
Architecture and control theory
Tests performed
Conclusion
Discussion
Performance Impact of Administrative Utilities
Administrative utilities are essential to the system, but have a performance impact
With 24/7 operation, it is never a good time to suffer performance degradation
Solution: find a way to slow them down
Example of DB Running a Backup
* Throughput and response time averaged over 60s intervals
How to Slow Down a Utility
Performance impact is dynamic – both for utilities and regular workloads (WLs)
Low-level approach: per-resource quotas / priorities are difficult to manage
Admin utility performance policy: at most x% degradation of production work
How to throttle utilities? SIS – self-imposed sleep
How to translate the policy requirement into throttling units?
SIS – Self-imposed Sleep
Action Interval and Sleep Fraction
Action interval = workTime + sleepTime
With the action interval held constant, we need only the sleep fraction:
Sleep fraction = sleepTime / action interval
Sleep fraction = 0 means unthrottled; 1 means fully stopped
Suggested value for the action interval: at least a few iterations of the “main loop” of the utility
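The definitions above translate directly into code. A minimal Python sketch of self-imposed sleep, assuming the utility exposes one iteration of its main loop as a callable (`utility_step` is a hypothetical name for illustration, not from the paper):

```python
import time

def run_throttled(utility_step, sleep_fraction, action_interval=1.0):
    """Run a utility with self-imposed sleep (SIS).

    Each action interval is split into work time and sleep time:
        action_interval = workTime + sleepTime
        sleep_fraction  = sleepTime / action_interval
    sleep_fraction = 0 -> unthrottled; 1 -> fully stopped
    (with sleep_fraction = 1 this loop only sleeps, i.e. the utility halts).
    """
    sleep_time = sleep_fraction * action_interval
    work_time = action_interval - sleep_time
    while True:
        deadline = time.monotonic() + work_time
        # Run main-loop iterations until the work portion of the interval is used up.
        while time.monotonic() < deadline:
            if not utility_step():   # utility_step() returns False when done
                return
        time.sleep(sleep_time)
```

In a real utility the throttle manager would update `sleep_fraction` between action intervals rather than fixing it up front.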
Throttle Manager Architecture
(Architecture diagram: the x% degradation policy feeds a PI controller, which outputs the sleepTime; the action interval is held constant; a linear model fitted on <sleepTime, performance> pairs closes the feedback loop.)
Degradation Estimator
Baseline estimator – system performance without utilities
Degradation = 1 – performance / baseline
How to determine the baseline?
Stopping all utilities is unattractive: WL surges distort short-term performance, and resources are underutilized
Instead: linear fitting of <sleepTime, performance> pairs
Performance = f(sleepTime) = Q1*sleepTime + Q0
Estimated online via recursive least squares with exponential forgetting
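A sketch of the online estimator described above: recursive least squares with exponential forgetting fits performance = Q1*sleepTime + Q0, and the baseline is read off as the predicted performance at full throttling (sleepTime = action interval). The forgetting-factor value, class name, and initialization are assumptions for illustration, not values from the paper:

```python
import numpy as np

class ThrottleModel:
    """Online linear fit performance = Q1*sleepTime + Q0 via recursive
    least squares (RLS) with exponential forgetting (assumed factor 0.95)."""

    def __init__(self, forgetting=0.95):
        self.lam = forgetting
        self.theta = np.zeros(2)      # [Q1, Q0]
        self.P = np.eye(2) * 1e4      # large initial covariance: trust data, not prior

    def update(self, sleep_time, performance):
        x = np.array([sleep_time, 1.0])
        # Standard RLS update with forgetting factor lam
        k = self.P @ x / (self.lam + x @ self.P @ x)
        self.theta = self.theta + k * (performance - x @ self.theta)
        self.P = (self.P - np.outer(k, x @ self.P)) / self.lam

    def baseline(self, action_interval):
        # Baseline = predicted performance with the utility fully throttled,
        # i.e. sleepTime = action interval.
        q1, q0 = self.theta
        return q1 * action_interval + q0

    def degradation(self, performance, action_interval):
        return 1.0 - performance / self.baseline(action_interval)
```

The forgetting factor trades adaptation speed against noise sensitivity, which is exactly the knob mentioned later under “Causes for Deviation”.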
Linear Fit Example of Sleep/Throughput
Steady workload, backup throttling kept constant for 20 minute intervals
(Plot legend: estimated baseline vs. actual baseline.)
Controller
Goal: current degradation = degradation limit
Error = degradation limit – current degradation
PI controller used:
Throttling(k+1) = Kp * error(k) + Ki * Sum(error(i), i=0..k)
Kp – proportional gain – increases the speed of response
Ki – integral gain – eliminates steady-state error
Kp, Ki and the control interval can be hard-coded or determined at runtime
Kp and Ki can be estimated by pole placement from control theory, but experimental results are necessary to confirm them [2]
Experiments in this paper: control interval = 20 seconds; Kp and Ki the same across all experiments
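The controller formula above can be sketched as follows. Since increasing sleepTime reduces degradation, the plant gain is negative, so with the error sign used on the slide the gains Kp and Ki come out negative; the gain values and the toy plant in the test below are assumptions for illustration, not the paper's:

```python
def make_pi_controller(kp, ki, limit):
    """PI controller for the degradation policy (sketch).

    Implements the slide's law:
        error(k)        = degradation_limit - current_degradation
        Throttling(k+1) = Kp * error(k) + Ki * Sum(error(i), i=0..k)
    With this sign convention Kp and Ki are negative, because more
    throttling (sleep) lowers degradation.
    """
    state = {"integral": 0.0}

    def control(current_degradation):
        error = limit - current_degradation
        state["integral"] += error            # running Sum(error(i), i=0..k)
        throttling = kp * error + ki * state["integral"]
        # The output is a sleep fraction, so clamp to [0, 1].
        return min(1.0, max(0.0, throttling))

    return control
```

Called once per control interval (20 s in the paper's experiments), the integral term drives the steady-state degradation to the limit even when the plant gain is only approximately known.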
Tests Performed
Testbed description: DB2 v8.1, 4-CPU RS/6000, 2 GB RAM, AIX 4.3, 8 physical disks
Workload similar to TPC-C
Initial “warm-up” period of 10 minutes, to stabilize the system / bufferpools / etc.
Utility used: parallelized BACKUP – multiple processes reading from multiple tablespaces, and multiple other processes writing to separate disks
OS Priorities vs. SIS (Sleep Fraction)
No performance gain from changing the OS priority of the backup process
OS priority works for CPU-intensive WLs; here we have an I/O-intensive WL, and the CPU is idle 80% of the time
Linear effect when throttling using sleep
(Plot legend: WL alone vs. 100% throttling.)
Dynamic Effect of SIS. Does “Turning the Knob” Actually Do Something?
As in previous slide, we don’t get back to 100% throughput when fully throttled, but we’re close.
(Plot annotations: backup started; 15 tps average.)
Feedback Control
X=30% degradation policy
Feedback Control Effectiveness
Without BACKUP: 15 tps
With x = 30% and a steady workload of 25 users: 9.4 tps, i.e. 38% degradation
Why the throttling slump? The throttling system compensates for the decreasing resource demands of the backup?
With x = 30% and a workload surge at 1500 s (from 10 users to 25 users): pre-surge degradation of 36%, post-surge degradation of 19% – still good results, close to the 30% policy
Causes for Deviation
Baseline estimator: actual throughput is 15.1 tps vs. a projected value of 13.2 tps
System stochastics: degradation is not always estimated correctly – for example, the drop in throttling at t = 1800 s; the system is quick to self-correct and gives correct results in the long term
Short-term violations could be avoided by trading off adaptation speed, i.e. by adjusting the forgetting factor in the online estimator
Conclusion
Administrative utilities must be run, but there is no timeslot for them
Proposed an application-based throttling mechanism – only the application's code needs to change, and it is OS/system independent
Easy for administrators to just specify degradation policy
Applicable to various systems. Main requirements:
The utility's work must be identifiable, so the sleeps can be inserted there
Performance must be measurable without much overhead
Limitations and Future Work
Test on multiple utilities – throttle each utility separately?
Propose and analyze different approaches for the controller: PI algorithm, recursive least squares estimator, etc. How should their parameters be specified?
Automate the determination of controller parameters, as they are system dependent
Discussion
Why the throttling slump in the feedback control? Even when the backup is fully throttled, the system may not return to its earlier peak performance, since it needs more time to stabilize (i.e. bufferpools again). This may be a better explanation for the difference between the projected baseline and the actual baseline.
Even if the tasks were CPU-intensive, assigning them a priority via the OS is not guaranteed to work, since they may interact with other parts of the engine – issue queries, etc. We can't slow the engine down for that.
Obviously this works since it’s been implemented in DB2 v8 and v9 – backup / rebalance / auto-runstats – all I/O intensive tasks.
Other ways to limit/control the impact of backup to DB system. Controlling bufferpools / memory. Automatic tuning of memory is introduced in DB2 v9.
How do we handle peak loads? How do we guarantee QoS? Can we monitor not only the TPS output, but also try to “expect” what the WL performance would be, based on the number of clients, the number of queries compiled/executed, and bufferpool activity/misses?
References
[1] Sujay Parekh, Kevin Rose, Joseph L. Hellerstein, Sam Lightstone, Matthew Huras, and Victor Chang. Managing the performance impact of administrative utilities. In Self-Managing Distributed Systems - 14th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management (DSOM 2003), number 2867 in Lecture Notes in Computer Science. Springer-Verlag, 2003.
[2] Diao, Y., Gandhi, N., Hellerstein, J. L., Parekh, S., Tilbury, D. M.: Using MIMO feedback control to enforce policies for interrelated metrics with application to the Apache web server. In: Proceedings of Network Operations and Management (2002).