performance comparison of locking caches under static and dynamic schedulers

PERFORMANCE COMPARISON OF LOCKING CACHESUNDER STATIC AND DYNAMIC SCHEDULERS

A. Martí Campoy, S. Sáez, A. Perles, J.V. Busquets-Mataix

Departamento de Informática de Sistemas y ComputadoresE-46022, Universidad Politécnica de Valencia, SPAIN

{amarti,ssaez,aperles,vbusque}@disca.upv.es

Abstract: Static use of locking caches is a useful solution to take advantage ofcache memories in real-time systems. Locking cache operates preloading andlocking a set of instructions, thus cache contents are a-priori known and remainunchanged during system operation. This solution eliminates the unpredictablebehavior of conventional caches, making easy to accomplish the schedulabilitytest through simple and well-known tools. Once attained predictability, in thispaper we analyze the performance of this schema compared to conventionalcache, as function of system size and cache size. We also study the influence ofthe scheduler (either fixed or dynamic priority). Copyright © 2003 IFAC

Keywords: real-time systems, cache memories, scheduling algorithms,performance analysis, genetic algorithms.

1. INTRODUCTION

Microprocessors include cache memories in theirmemory hierarchy to increase system performance.General-purpose systems benefit directly from thisarchitectural improvement, but hard real-timesystems need additional hardware resources and/orsystem analysis to guarantee the time correctness ofthe system behavior when cache memories arepresent. In multitask, preemptive real-time systems,the use of cache memories presents two problems.The first problem is to calculate the Worst CaseExecution Time (WCET) due to intra-task or intrinsicinterference. Intra-task interference occurs when atask removes its own instructions from the cache dueto conflict and capacity misses. This way, executiontime of instructions is not constant. The secondproblem is to calculate the task response time due tointer-task or extrinsic interference. Inter-taskinterference occurs in preemptive multitasking

systems when a task displaces the working set of anyother task from the cache. When the preempted taskresumes execution, a burst of cache misses increasesits execution time. This effect, called cache-refillpenalty or cache-related preemption delay must beconsidered in the schedulability analysis, since itsituates task execution time over the precalculatedWCET.

Several solutions have been proposed for the use ofcache memories in real-time systems. In (Healy, etal., 1999; Lim, et al., 1994; Li, et al., 1996) cachebehavior is analyzed to estimate task execution timeconsidering the intra-task interference. In (Lee, et al.,1996; Busquets, et al., 1996) cache behavior isanalyzed to estimate task response time consideringthe inter-task interference, using a precalculatedcached WCET. Alternative architectures toconventional cache memories have been proposed, inorder to eliminate or reduce cache unpredictability,

a a

27th IFAC/IFIP/IEEE Workshop on Real-Time Porgramming. Lagow, Poland May 2003

making easy the sechedulability analysis. In (Kirk,1989; Liedtke, et al., 1997; Wolfe, 1993) hardwareand software techniques are used to divide the cachememory into partitions, dedicating one or morepartitions to each task, avoiding the inter-taskinterference.

The main drawback of these solutions is thecomplexity of the algorithms and necessary methodsin order to accomplish the schedulability analysis orpartitioning the cache. Also, each method considersonly one face of the problem, intra-task interferenceor inter-task interference, but not together.

The use of locking caches has been proposed in(Martí, et al., 2001b; Martí, et al., 2003) as analternative to conventional caches solving both intra-task and inter-task interference problem. The staticuse of locking caches fully eliminates the intra-taskinterference, allowing the use of simple algorithms inorder to estimate the WCET of tasks. Regarding theinter-task interference, the static use of lockingcaches reduces the cache-refill penalty to a very low,constant time for all preemptions suffered by anytask, so inter-task interference can be incorporated toschedulability analysis in a simple way.

This work presents a statistical analysis of worst-caseperformance offered by the static use of lockingcaches versus actual worst-case performance offeredby conventional, dynamic and non-deterministiccaches (hereinafter, we will abbreviate "worst-caseperformance" by "performance"). This analysis ispresented for fixed-priority scheduler and EarliestDeadline First (EDF) scheduler. The behavior of theperformance of locking cache under both schedulersis compared

2. OVERVIEW OF THE STATIC USE OFLOCKING CACHES.

The locking cache is a direct mapped cache with noreplacement of contents when locked, joined with atemporal buffer of one-line size. The temporal bufferreduces access time to the memory blocks that arenot loaded in the cache, since only references to thefirst instruction in the block produce cache miss.During system start-up, a small routine is executed topreload and lock the cache. Preloaded instructionscan belong to any task of the system, and may belarge consecutive instruction sequences or small,individual separate blocks. When the system beginsits full-operational execution, the instruction cache isloaded with a well-known set of instructions, and itscontents will never change, eliminating both intra-task and inter-task interference.

In locking caches, an instruction will always or neverbe in the cache. In this way and memory related, theexecution time of the instruction is always constantand a-priori known. Thus, the WCET of a task

running on a locking cache can be estimated usingthe worst-path analysis (Shaw, 1989) applied tomachine code, taking into account the state, locked ornot, of each instruction.

For fixed-priority schedulers when locking cache isused, the schedulability analysis is accomplishedusing CRTA (Busquets, et al., 1996). Equation 1shows the expression of CRTA, where cache-refillpenalty is added for each preemption a task suffers.For EDF scheduler when locking cache is used,schedulability analysis is accomplised using theInitial Critical Instant (ICI) test (Ripoll, et al., 1996).In EDF is not easy to know the number ofpreemptions a task suffers, but it is known themaximum number of preemptions a task produces. Atask produces a preemption in its arrival or never willproduce it. So, cache-refill penalty can be added topreempting task instead to preempted task. Equations2 and 3 show how to include cache-refill penalty inthe ICI test.

In equations 1 to 3, Ci is the WCET of τi withoutpreemptions but considering locking-cache effects,and γ is the value of cache-refill penalty. In static useof locking cache, inter-task interference doesn’t existexcept for a small extrinsic interference introducedby the temporal buffer. After a preemption, a taskmust reload, in the worst case, only this temporalbuffer. This way, the value of γ is Tmiss, for allpreemptions, all tasks, where Tmiss is the time totransfer a block from main memory to temporalbuffer.

Instructions to be loaded and locked in cache areselected by a genetic algorithm executed duringsystem design. The developed algorithm provides theset of main memory blocks, and the result of theschedulability test. Also, this algorithm providessystem utilization. Further details can be found in(Martí, et al., 2001a).

)()(

1 γ+

+= ∑

∈

+j

ihpj j

ni

ini Cx

Tw

Cw (1)

∑=

+=

n

i ii P

tCtG

1

)()( γ (2)

∑=

−++=

n

i i

iii P

DPtCtH

1

)()( γ (3)

3. EXPERIMENTS

Experiments developed are composed of a set oftasks and a cache with definite size. The tasks used inthe experiments are synthetic created in order tostress the locking cache and exercise the geneticalgorithm. The main parameters of the experiments

https://www.researchgate.net/publication/3638099_Adding_instruction_cache_effect_to_schedulability_analysis_of_preemptive_real-time_systems?el=1_x_8&enrichId=rgreq-556d2ffb-4668-4773-bc40-f564c45f27b2&enrichSource=Y292ZXJQYWdlOzIzMDU2MzAxNTtBUzoxMDM2OTcyNzA5NjgzMzJAMTQwMTczNDc2NTEyOQ==

https://www.researchgate.net/publication/220414278_Improvement_in_Feasibility_Testing_for_Real-Time_Tasks?el=1_x_8&enrichId=rgreq-556d2ffb-4668-4773-bc40-f564c45f27b2&enrichSource=Y292ZXJQYWdlOzIzMDU2MzAxNTtBUzoxMDM2OTcyNzA5NjgzMzJAMTQwMTczNDc2NTEyOQ==

https://www.researchgate.net/publication/2687742_Reasoning_about_Time_in_Higher-Level_Language_Software?el=1_x_8&enrichId=rgreq-556d2ffb-4668-4773-bc40-f564c45f27b2&enrichSource=Y292ZXJQYWdlOzIzMDU2MzAxNTtBUzoxMDM2OTcyNzA5NjgzMzJAMTQwMTczNDc2NTEyOQ==

are described in Table 1. More than 360 experimentshave been defined. Four (two for each scheduler)kinds of runs have been accomplished for eachexperiment:

• Executing the genetic algorithm for eachexperiment.

• Simulating the conventional caches, using direct-mapped, two-way, four-way and full associativecache with LRU replacement algorithm. Thebest map-function is selected as performancevalue for each experiment. Simulations areaccomplished using the SPIM tool (Pattersonand Hennessy,1994), a MIPS R2000 simulator.

4. PERFORMANCE ANALYSIS

Regarding performance, it is not easy to compare theperformance of a real-time system running overdifferent hardware. If the same tasks-set isschedulable in both architectures, there are manycharacteristics and metrics useful to compareperformance. The approach proposed in this work tocompare the performance obtained from both lockingcache and conventional cache is the processorutilization. Utilization of conventional cache comefrom the simulation of the hyperperiod (Ucs for fixed-

priority scheduled experiments and Uce for EDFscheduled experiments). This would be an upperbound of the performance obtained by any analysistool. Utilization of locking cache is obtained by theexecution of the genetic algorithm (Uls for fixed-priority scheduled experiments and Ule for EDFscheduled experiments).

To compare the behaviors of the locking andconventional caches, Performance (πs and πe) isdefined as the utilization of the conventional cachedivided by the utilization of the locking cache (πs=Ucs/Uls and πe= Uce/Ule). π values greater than 1indicate that locking caches provide a lowerutilization than conventional caches, thus providingbetter performance.

Fig. 1 presents the accumulated frequency of πs andπe. Y-axis value is the percentage of experimentswith a π value higher than x-axis value. Table 2shows a statistical summary of πs and πe. Both figureand table show that in a great number of experiments,locking caches provide the same or betterperformance than conventional caches, reachingabout 60% of experiments with no loss or very slightloss of performance. Besides, locking cache seems tobehave the same way for both static and dynamicschedulers. Further analysis corroborates thishypothesis.

Intuitively, the most important factor in the behaviorof π is the relationship between the cache size andthe size of experiment code, since all tasks in thesystem compete for preloading their instructions inthe cache. System Size Ratio (SSR) is defined ascache size divided by the sum of all system tasks size(equation 4).

SizeCodeSizeCache

SSR__

=(4)

Fig. 2 shows the scatterplot of πs versus SSR. Eachpoint is the performance (πs=Ucs/Uls) of eachexperiment. The x-axis is shown in logarithmic scale.In the same way, Fig. 3 shows the scatterplot of πe

versus SSR. Again, the behavior of locking cache

Table 1 Main characteristics of experiments

Item Minimum MaximumNumber of tasks 3 8Size of task 2 KB 32 KbSystem size (sum oftasks’ size)

8 Kb 60 Kb

Instructions executed bytask

50,000 8,000,000

Instructions executed bysystem

200,000 10,000,000

Cache size 1Kb 64KbCache line size 16 bytes 16 bytesExecution time fromcache or temporalbuffer (Thit)

1 cycle 1 cycle

Time to transfer frommain memory (Tmiss)

10 cycles 10 cycles

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 1,1 1,2 1,3 1,4 1,5

Performance

Acu

mu

l. F

req

uen

cy

EDF (πe )

FIXED(πs)

Fig. 1. Frequency histogram for πs and πe.

Table 2 Statistical summary for Performance

Statistic Fixed sched. EDF sched.Average 0.90275396 0,900469293Median 1.01112334 1,000110695Minim 0.35296379 0,358470076Maxim 1.35798524 1,357281619Std. deviation 0.23985561 0,232068Low quartile 0.715842316 0,71499322High quartile 1.058288365 1,046493603# of experiments 182 182Exps with π > 1 101 (55.49%) 93 (51.1%)Exps. with π > 0.9 140 (62.64%) 115 (63,19%)

under both schedulers seems to be the same, and avery strong interaction may be noticed betweenperformance of locking cache and the System SizeRatio.

In order to study this interaction, Fig. 4 and Fig. 5presents the performance grouped by SSR values.Seven groups, representing the seven cache sizesused in experiments, have been defined, using thefollowing rules. Table 3 shows the limits of eachgroup:

• Upper limit of group n is defined as z/64, wherez=2n and n = 1..6

• An experiment xi, with ratio_size r, will belongto group n if: Upper limit of group n-1 < r <Upper limit of group n

Once again, behavior of locking cache is the sameindependently of the scheduler used. However, thereare some slight differences between πs and πe . Tostudy these differences with more detail, the datapresented in Fig. 4 and Fig. 5 is divided into fourspaces regarding the value of SSR. These fourspaces, called from A to D are described below.Table 4 and Table 5 shows a statistical summary ofspaces A to D for both schedulers.

Space A: This space comprises groups 1, 2, and partof 3 (SSR below 0.08). In this space, around 80%(84% for fixed-priority scheduler and 82% for EDFscheduler) of the experiments present a π value equalor greater than 1 with low variability. Also, π rises asthe SSR increases. The behavior of this space is dueto the large difference between cache size and code

size. For small cache size, the intra-task interferenceis very high, and only a few number of instructionsremain in the cache for two or more executions.However, the locking cache avoids replacement, thusthe instructions locked in the cache always producehit. For instructions not locked, their behavior is thesame as in conventional small caches, replacing fromtemporal buffer after each execution. Space A showsa small increase of performance for the locking cacheon the right side. While the conventional cacheexperiences the same intra-task interference whencache size increases in few bytes, locking cachesprofit from each byte added to cache size. Inaddition, the inter-task interference is high for theconventional cache, but not for the locking cache.

Space B: This space comprises the part of group 3not included in space A, group 4 and part of group 5(SSR below 0.37). Only 21% of the experiments (forboth schedulers) have a performance equal or greaterthan 1. The variability is very high, as the distancebetween upper and low quartile reflects. Performanceof the locking cache falls down quickly as SSR rises.In this space, the dynamic behavior of theconventional cache profits from cache size, becausemany pieces of code with high degree of locality fitinto cache. However, the locking cache must assignthe cache lines to only a small set of instructions. Theeffect of inter-task interference is favorable to thelocking cache, but its effect on performance ofconventional cache is relatively low since cache sizeis still low.

Table 3 Upper limits and cache size for grouping

Group Cache size Upper limit1 1 Kb 0,031252 2 Kb 0,06253 4 Kb 0,1254 8 Kb 0,255 16 Kb 0,56 32Kb 17 64 Kb ∞

0,00

0,20

0,40

0,60

0,80

1,00

1,20

1,40

0,01 0,1 1 10Log (SSR)

Perf

orm

an

ce (ð

s)

Fig. 2. Scatterplot of πs versus SSR

0,00

0,20

0,40

0,60

0,80

1,00

1,20

1,40

0,01 0,1 1 10Log (SSR)

Perf

orm

an

ce (ð

e)

Fig. 3. Scatterplot of πe versus SSR

0,00

0,20

0,40

0,60

0,80

1,00

1,20

1,40

1,60

1 2 3 4 5 6 7

Groups

Pe

rfo

rma

nc

e

dataaveragemedianLow quartilHigh quartil

Fig. 4. πs versus groups of SSR.

Space C: This space comprises part of group 5 andall experiments of group 6 (SSR lower than 1). In thisspace, only 17% of experiments presents aperformance equal or greater than 1 for fixed-priorityscheduler, and this percentage lowers to 6% for theEDF scheduler. The behavior of the locking cache inthis space is the inverse than in space B. As SSRrises, the performance of the locking cache improves,because only instructions with very low degree oflocality are forced to remain in the main memory.When the value of SSR is close to 1, the value of π isclose to 1. Regarding inter-task interference, sincecache size is now large, the impact of cache-refillpenalty is very high and preemptions penalizeperformance in conventional caches.

Space D: This space comprises experiments fitting ingroup 7, all of them with SSR equal or greater than 1.In all cases, all experiments have performance equalor greater than 1, because all instructions arepreloaded and locked in the cache. Neither intra-tasknor inter-task interference exists, both inconventional and locking caches. But the lockingcache may present a slight improvement inperformance because when the system begins fulloperation, the cache is fully loaded, and nomandatory misses happen.

Conclusions from the four-spaces analysis are clear.The relationship between locking cache size andsystem size (as the sum of all tasks’ sizes) allows the

designer of real-time systems to estimate theperformance provided by static use of locking cachein front of conventional caches.

However, as opposed to the previously shown results,the analysis of the locking cache performance bymeans of four spaces show that there are differencesbetween some characteristics of πs and πe.Apparently, locking caches provides worseperformance under EDF scheduler than under fixed-priority scheduler, since the number of experimentswith values of π above 1 is greater for fixed-priorityscheduler, and the average value of πs is greater thanaverage value of πe.

A paired-sample analysis (Jain, 1991) for each spaceprovides information about the differences betweenaverage values of πs and πe. This paired-sampleanalysis is possible because each set of tasks hasbeen evaluated under both schedulers. Table 6 showsthe analysis summary. Avg. of Diff. is the average ofdifferences of each pair of experiments, that is, πs

value of experiment i minus πe value of the sameexperiment i. The result is the response given by theNull Hypothesis Test, using a t-test with alpha = 0.01(99%) and hypothesis = 0.00. Reject means that theaverage significantly differs from 0. P-Value is ameasure of significance. Value equal or greater than0,01 indicates that the Null hypothesis (that is,average = 0,00) cannot be rejected at the 99% ofconfidence level.

The result of the test is the same for the four spaces:there is no significant difference between averagevalues of πs and πe. That is, in a general sense,performance of locking cache is the same for bothfixed-priority and EDF schedulers.

0,00

0,20

0,40

0,60

0,80

1,00

1,20

1,40

1,60

0 1 2 3 4 5 6 7 8Groups

Pe

rfo

rma

nc

e

dataaveragemedianLow quartilHigh quartil

Fig. 5. πe versus groups of SSR.

Table 4 Statistical summary for spaces A to D of πs

Space A B C DTotal experiments 56 56 34 36Exps. with π> 1 83.9% 21.9% 17.6% 100%Exps. with π> 0.9 91.1% 26.8% 35.3% 100%Average 1.03 0.77 0.79 1.05Median 1.05 0.67 0.79 1.03Minimum 0.47 0.35 0.43 1.00Maximum 1.36 1.29 1.07 1.16Low quartile 1.01 0.53 0.65 1.02High quartile 1.10 0.93 0.96 1.09Std. deviation 0.17 0.27 0.19 0.05

Table 5 Statistical summary for spaces A to D of πe

Space A B C DTotal experiments 56 56 34 36Exps. with π > 1 82,5% 21,4% 6,1% 100%Exps. with π > 0.9 89,5% 30,4% 33,3% 100%Average 1,02 0,76 0,79 1,03Median 1,05 0,70 0,83 1,00Minimum 0,47 0,36 0,45 1,00Maximum 1,36 1,29 1,02 1,13Low quartile 1,02 0,55 0,66 1,00High quartile 1,10 0,93 0,94 1,04Std. deviation 0,19 0,26 0,17 0,04

Table 6 Paired-sample analysis for πs vs. πe

Space Avg. of Diff P-Value ResultA 0,00178593 0,518664 No RejectB -0,00471085 0,249892 No RejectC -0,00294133 0,737337 No RejectD 0,00246365 0,0251184 No Reject

5. CONCLUSIONS

This work presents an analysis of performance of thestatic use of locking cache, under fixed-priorityscheduler and Earliest Deadline First scheduler, withregard to two system main characteristics. Also,comparison of performance behavior under bothschedulers has been presented. Performance has beendefined as the relationship between the systemutilization when conventional cache is used, versussystem utilization when static use of locking cache isused.

The behavior of locking-cache performance versusSystem Size Ratio has been divided into fourscenarios. A strong relationship between lockingcache performance and this ratio has been noticedand identified, as well as the variability in the valuesof performance. Therefore, the real-time designermay estimate, without any experiment, the cost ofusing a locking cache, that is, the probability to loseperformance and get a worse utilization than usingconventional caches. For extreme values of therelationship between cache size and system size, thelocking cache offers in most cases better performancethan the conventional cache. For central values ofthis relationship, the locking cache offers worseperformance than the conventional, unpredictablecache.

The paired-sample analysis shows that there is nosignificant difference in average values ofperformance when using static or dynamic scheduler.Results from this test, joined with the several figuresand statistical summaries show that the behavior oflocking cache, from the performance point of view, isthe same independently of the algorithm used toschedule the tasks. This way, the kind of scheduler isnot a major parameter in deciding the use or not oflocking cache.

ACKNOWLEDGEMENTS

This work was supported in part by ComisiónInterministerial de Ciencia y Tecnología underproject CICYT-TAP 990443-C05-02 and GeneralitatValenciana under project CTIDIA/2002/27.

REFERENCES

Busquets, J. V., J. J. Serrano, R. Ors, P. Gil, A.Wellings (1996). Adding Instruction CacheEffect to Schedulability Analysis of PreemptiveReal-Time Systems. IEEE Real-TimeTechnology and Applications Symposium

Healy, C. A., R. D. Arnold, F. Mueller, D. Whalleyand M. G. Harmon (1999). Bounding Pipelineand Instruction Cache Performance. IEEE

Transaction on Computers. Volume 48, pages53-70

Jain, Raj. (1991) The Art of Computer SystemsPerformance Analysis. Jhon Wiley & sons Ed.New York

Kirk, D. B. (1989) SMART (Strategic MemoryAllocation for Real-Time) Cache Design. Proc.of the 10th IEEE Real-Time Systems Symposium.

Lee, C. G., J. Hahn, Y. M. Seo, S. L. Min, R. Ha, S.Hong, C. Y. Park, M. Lee, C. S. Kim (1997).Enhanced Analysis of Cache-RelatedPreemption Delay in Fixed-Priority PreemptiveScheduling. Proc. of the 18th IEEE Real-TimeSystem Symposium.

Li, Y. S., S. Malik, and A. Wolfe (1996). CacheModeling for Real-Time Software: BeyondDirect Mapped Instruction Caches. Proc. of the17th IEEE Real-Time Systems Symposium.

Liedtke, J., H. Härtig, M. Hohmuth (1997). OS-Controlled Cache Predictability for Real-timeSystems. Proc of the IEEE Real-TimeTechnology and Applications Symposium.

Lim, S. S., Y. H. Bae, G. T. Jang, B. D. Rhee, S. L.Min, C. Y. Park, H. Shin, K. Park, and C. S. Kim(1994). An Accurate Worst Case TimingAnalysis Technique for RISC Processors. Proc.of the 15th IEEE Real-Time Systems Symposium.

Martí Campoy, A., A. Pérez, A. Perles, J.V.Busquets (2001a). Using Genetics Algorithms inContent Selection for Locking- Caches.IAESTED International Conference on AppliedInformatics.

Martí Campoy, A., A. Perles, J.V. Busquets (2001b)Static Use of Locking Caches in MultitaskPreemptive Real-Time Systems. IEEE Real-TimeEmbedded Systems Workshop. London, UK

Martí Campoy, A. ,S. Sáez, A. Perles andJ.V.Busquets (2003). Schedulability Analysis inthe EDF Scheduler with Cache Memories . The9th International Conference on Real-Time andEmbedded Computing Systems and Applications.Tainan, TW

Patterson, D. and J. L. Hennessy (1994). ComputerOrganization and Design. The Hardware/Software Interface. Morgan Kaufmann. SanMateo.

Ripoll, I., A. Crespo and A. Mok (1996).Improvements in feasibility testing for real-timetasks. The Journal of Real-Time Systems , Vol. 11pp. 19-39.

Shaw, A. (1989).Reasoning About Time in Higher-Level Language Software. IEEE Transaction onSoftware Engineering. Vol. 15, Num. 7.

Wolfe, A. (1993). Software-Based Cache Partitioningfor real-time Applications. Proc of the 3thInternational Workshop on ResponsiveComputer Systems.

performance comparison of locking caches under static and dynamic schedulers

Documents