
994 IEEE TRANSACTIONS ON COMPUTERS, VOL. 43, NO. 9, SEPTEMBER 1994

The Susceptibility of Programs to Context Switching

Wen-mei W. Hwu, Member, IEEE, and Thomas M. Conte

Abstract- Modern memory systems are composed of several levels of caching. Design of these levels is largely an empirical practice. One highly effective empirical method is the single-pass method, wherein all caches in a broad design space are evaluated in one pass over the trace. Multiprogramming degrades memory system performance since (process) context switching reduces the effectiveness of cache memories. Few single-pass methods exist which account for multiprogramming effects. This paper uses a general model of single-pass algorithms, the recurrence/conflict model, and extends the model for recording the effects due to both voluntary context switches (e.g., system calls) and involuntary context switches (e.g., time quantum expiration). Involuntary context switches are modeled using the distribution of lengths between a reference to an address and the re-reference to the same address. The paper makes the assumptions that involuntary context switches are equally likely to occur between each reference, and that one can independently estimate f_cs, the fraction of a cache’s contents flushed between context switches. The case where f_cs = 1 is used to measure the effect of worst-case context switch penalty (the susceptibility) of several members of the SPEC89 benchmark set to context switching. Some empirical results of f_cs are presented to illustrate the case where f_cs < 1. The model is validated against its assumptions by comparing its results with more-restrictive methods.

Index Terms- Multiprogramming, cache, simulation, single-pass algorithm, memory hierarchy, performance analysis, benchmarking, SPEC.

I. INTRODUCTION

MULTIPLE levels of caching and buffering have become the norm in memory system design. These systems are typically designed using simulation to determine the performance of a wide range of memory system organizations. The inputs to the simulator are benchmarks that represent nominal system workloads. The designer’s job is to choose the most cost-effective organization using the simulation results as a guide. A class of powerful simulation methods, called single-pass stack methods, has become available to memory system designers [1]-[5]. With these methods, the memory system performance of thousands of organizations can be determined

Manuscript received March 25, 1991; revised October 1992. This work was supported by the National Science Foundation (NSF) under Grant MIP-8809478, AT&T Global Information Solutions, the AMD Corp. 29K Advanced Processor Development Division, the National Aeronautics and Space Administration (NASA) under Contract NASA NAG 1-613 in cooperation with the Illinois Computer Laboratory for Aerospace Systems and Software (ICLASS), and the Office of Naval Research under Contract N00014-88-K-0656, and an equipment donation from the Hewlett-Packard Co.

W.-m. W. Hwu is with the Center for Reliable and High-Performance Computing, University of Illinois, Urbana, IL 61801 USA; e-mail: [email protected].

T. M. Conte is with the Department of Electrical and Computer Engineering, University of South Carolina, Columbia, SC 29208 USA; e-mail: [email protected].

IEEE Log Number 9400780.

using a single pass through the memory access trace of the benchmark, whereas traditional, multiple-pass methods require one pass per potential memory system design.

Multiprogramming degrades memory system performance since (process) context switching reduces the effectiveness of cache memories. This occurs when cache contents that will be needed after the process returns from a context switch are purged by the intervening processes. The cache contents that may fall victim to context switching are determined by the process’ reference pattern (a program characteristic) and the cache dimension (a system design parameter). The portion of the cache contents that is actually purged by intervening processes is determined by the load of the system, the number of ready processes and access patterns of these processes. The method presented in this paper accurately records, for all cache dimensions and all context switching intensities in a single pass, the total amount of cache contents that will be needed after the process returns. This information is defined as the susceptibility of the program to the effect of context switching.

Several other approaches have been used to measure the effects of context switching [6]-[14]. The earliest approaches flushed the cache being simulated at fixed intervals in the trace [6], [7]. Shedler and Slutz [8] approached the problem by stochastically merging several memory reference traces. Easton [9] used the average working set size of the memory reference trace to estimate cold-start miss ratios. Haikala [12] simplified Easton’s approach by estimating cold-start miss ratios using a Markov chain model. Cold-start miss ratios can be used to approximate the multiprogramming effects. Switching between multiple memory reference traces at a fixed interval was used by Smith [10] to measure multiprogramming effects. Also, measurements of actual multiprogrammed workloads were performed by Clark [11], Agarwal et al. [13], and Mogul and Borg [15]. Apart from the approximations of Easton [9] and Haikala [12], no work has been done to extend single-pass methods to model the effects of context switching exactly. Since multiprogramming effects can account for a 4%-12% degradation in performance [11]-[13], this omission in the literature has limited the usefulness of single-pass methods.

One obvious extension to single-pass methods to model context switching effects is to flush the LRU stack periodically. The shortcoming of this approach is that one simulation would have to be performed for each context switching intensity (e.g., time quantum and I/O workload). A more desirable method is to record the context switching effects for all intensities in one pass. This paper introduces a single-pass method for measuring the susceptibility of a program to the effects of context switching for all cache dimensions and all intensities. It is demonstrated that the susceptibility measures

0018-9340/94$04.00 © 1994 IEEE


HWU AND CONTE: THE SUSCEPTIBILITY OF PROGRAMS TO CONTEXT SWITCHING


Reference:  a  b  c  d  e  f  g  h
Address:    0  1  2  3  1  2  1  0

Fig. 1. An example trace of addresses.

Reference:  a     b     c     d     e      f    g    h
Address:    0     1     2     3     1      2    1    0
Result:     miss  miss  miss  miss  miss*  hit  hit  miss*

* Dimensional conflict

Fig. 2. An example two-block direct-mapped cache behavior.

can be combined with system load parameters and context switching intensity to yield the performance degradation in various multiprogramming environments without resimulation. Obtaining memory system performance degradation under many different system loads allows the memory system to be designed with a degree of robustness. It further increases the advantage of single-pass stack methods over multiple-pass methods. This is the first such study to make the dichotomy between program susceptibility and multiprogramming effects. The measured performance of the method is compared to results from periodic and random flushing of the LRU stack.

II. RECURRENCES, CONFLICTS, AND CONTEXT SWITCHES

The metric used in many memory system studies is the miss ratio. This is the ratio of the number of references that are not satisfied (i.e., that miss) for a cache at a level of the memory system hierarchy over the total number of references made at that level. The miss ratio has served as a good metric for memory systems since it is a characteristic of the workload (e.g., the memory trace) yet independent of the access time of the memory elements. A given miss ratio can be used to decide whether or not a potential memory element technology will meet the required access time for the memory system [6].

The recurrence/conflict model of the miss ratio is best illustrated with an example. Consider the trace of Fig. 1. The recurrences in the trace are accesses e, f, g, and h. In the ideal case of an infinite cache, the miss ratio may be expressed as

$$\rho = \frac{N - R}{N}, \tag{1}$$

where R is the total number of recurrences and N is the total number of references. Nonideal behavior occurs due to conflicts. A dimensional conflict is defined as an event which converts a recurrence into a miss due to limited cache capacity or mapping inflexibility. For illustration, consider a direct-mapped cache composed of two one-byte blocks shown in Fig. 2. (Note that in practice, such a small cache would be impractical to build.) A miss occurs for the recurring reference e because reference d purges address 1 from the cache due to insufficient cache capacity. Similarly, a miss occurs for recurring reference h due to reference c. References d and c represent a dimensional conflict for the recurrences e and h, respectively. The other misses, a, b, c, and d, occur because these are the first references to addresses 0, 1, 2, and 3, respectively. The following formula can be used for deriving the cache miss ratio, ρ, for a given trace and a given cache dimension:

$$\rho = \frac{N - (R - D)}{N}, \tag{2}$$

where D is the total number of dimensional conflicts. (For the example, ρ = (8 - (4 - 2))/8 = 0.75.) This is a general model and can be extended to account for other effects. This paper extends this model to address conflicts due to context switching.

A multiprogramming conflict is defined as an event which converts a recurrence into a miss due to a context switch. For example, both f and g are dimensional hits of the cache in Fig. 2. If a context switch occurs between references e and f which purges addresses 1 and 2 from the cache, two multiprogramming conflicts will occur, one to reference f and one to reference g. Equation (2) can be extended to account for these multiprogramming conflicts:

$$\rho = \frac{N - (R - D - M)}{N}, \tag{3}$$

where M is the total number of multiprogramming conflicts.
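The example of Figs. 1 and 2 can be checked with a short script. This is only an illustrative sketch: the function names `direct_mapped_counts` and `miss_ratio` are hypothetical, and the cache is modeled as two one-address blocks indexed by the address modulo the number of blocks, which is an assumption about how the two-block direct-mapped cache of Fig. 2 maps addresses.

```python
def direct_mapped_counts(trace, num_blocks):
    """Count N, R, and D for a direct-mapped cache of one-address blocks."""
    cache = [None] * num_blocks   # cache[i] holds the address resident in block i
    seen = set()                  # addresses referenced at least once so far
    n = r = d = 0
    for addr in trace:
        n += 1
        slot = addr % num_blocks  # assumed mapping: low-order bits select the block
        if addr in seen:
            r += 1                # a recurrence
            if cache[slot] != addr:
                d += 1            # dimensional conflict: the recurrence misses
        seen.add(addr)
        cache[slot] = addr
    return n, r, d


def miss_ratio(n, r, d, m=0):
    """Equation (3); with m = 0 it reduces to Equation (2)."""
    return (n - (r - d - m)) / n


trace = [0, 1, 2, 3, 1, 2, 1, 0]          # references a..h of Fig. 1
n, r, d = direct_mapped_counts(trace, num_blocks=2)
print(n, r, d)                            # 8 4 2
print(miss_ratio(n, r, d))                # 0.75
print(miss_ratio(n, r, d, m=2))           # 1.0
```

The last line corresponds to the context switch between e and f described above: with M = 2, every reference in the trace misses.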

A. Reference Streams and Cache Dimensions

A formal abstraction of a benchmark's trace is termed a reference stream. This is a sequence of references to addresses, w(k), of length N (0 ≤ k < N). When required, the addresses are represented by lower-case Greek letters, such as α, β, γ. The reference stream is assumed to be generated by a single process in a multiprogramming system. Note that a reference at w(k) occurs later than w(k - 1) in time, but the parameter k does not represent parameterized time since it does not take into account the difference in service times between cache hits and cache misses. For this reason, k is referred to as the reference count. The trace also contains information about voluntary context switching. A reference is called a voluntary context-switch event if the benchmark relinquishes the CPU after the reference (e.g., a system call is performed).

The dimension of a cache is expressed using the notation (C, B, S) for a cache of size 2^C bytes, with block size 2^B bytes, and 2^S blocks contained in each associativity set. The term set size is used to mean associativity level, or the number of blocks per set. Cache size is the total number of bytes per cache. Block size has been called line size elsewhere [10]. Note that C ≥ B + S. The notation (C, B, ∞) is an abbreviation for the dimension of a fully-associative cache (S = C - B). For example, a cache of dimension (10, 6, 0) is a 1-KB direct-mapped cache with a block size of 64 bytes, and a cache of dimension (21, 10, 11) (alternately, (21, 10, ∞)) is of size 2 MB with 1-KB blocks and is fully associative. A dash is substituted for an entry in the triple to indicate all caches of that dimension: (-, 5, 1) are all caches with block size of 32 bytes and having two-way associativity. Caches are assumed to use LRU replacement and map addresses into sets using bit selection [3].

Fig. 3. An example of LRU stack operation.
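As a quick check of the (C, B, S) notation, the following sketch decodes a triple into byte and block counts. The helper name `describe_cache` and the dictionary keys are hypothetical; only the powers of two come from the definition above.

```python
def describe_cache(c, b, s):
    """Decode a (C, B, S) cache dimension into sizes and counts."""
    if c < b + s:
        raise ValueError("a cache dimension must satisfy C >= B + S")
    return {
        "cache_bytes": 2 ** c,             # total cache size
        "block_bytes": 2 ** b,             # block (line) size
        "blocks_per_set": 2 ** s,          # set size, i.e., associativity
        "num_sets": 2 ** (c - b - s),      # 1 for a fully-associative cache
    }


print(describe_cache(10, 6, 0))    # 1-KB direct-mapped cache, 64-byte blocks
print(describe_cache(21, 10, 11))  # 2-MB fully-associative cache, 1-KB blocks
```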

It is useful to partition the reference stream by setting the block offset portion of all addresses in the stream to zero. This produces a block reference stream, w_B(k), defined such that

$$w_B(k) = 2^B \left\lfloor \frac{w(k)}{2^B} \right\rfloor .$$

In binary, this is equivalent to setting the least-significant B bits of each address to zero.
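A minimal sketch of forming the block reference stream by clearing the low-order B bits of every address. The function name `block_reference_stream` and the sample addresses are hypothetical.

```python
def block_reference_stream(stream, b):
    """Clear the least-significant b bits of every address in the stream."""
    mask = ~((1 << b) - 1)
    return [addr & mask for addr in stream]


# Three byte addresses; the first two fall in the same 16-byte block (B = 4).
print(block_reference_stream([0x1234, 0x1237, 0x1240], b=4))  # [4656, 4656, 4672]
```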

B. Least Recently Used (LRU) Stack Operation

LRU stacks were first introduced by Mattson et al. in [1] as a way to model the behavior of paging systems. An LRU stack operates as follows: when an address, w_B(k) = α, is encountered in the block reference stream, the LRU stack is checked to see if α is present on the stack. If α is not present, it is pushed onto the stack.

However, if α is present (i.e., it is a recurring reference), it is removed from the stack, then repushed onto the stack. This is illustrated in Fig. 3 for the example reference stream at the beginning of this section (Fig. 1). This stack maintenance policy is specific to a particular block size, as is the discussion below.

A stack is represented as S_B(k), maintained for a block size B at time k. The ith ordered item of S_B(k) is expressed as S_B(k)[i]. The stack may also be expressed as an ordered list, such that S_B(k) = {S_B(k)[0], S_B(k)[1], ..., S_B(k)[m]}, where m is the depth of the stack. The following operations are defined for a stack: the push(·) function,

and the repush(·) function,

1.      if α ∈ S_B(k - 1) then
1.1       determine D from Γ
1.2       S_B(k) ← repush(S_B(k - 1), α),
2.      else S_B(k) ← push(S_B(k - 1), α)
3.      N ← N + 1

Fig. 4. The least recently used management policy for a stack, S_B(k) (adapted from Mattson et al.).

Both functions are defined as side-effect-free functions rather than procedures. This is to remove dependence on the time variable, k.

For an address α = w_B(k), the least recently used (LRU) management policy for a stack is shown in Fig. 4. In Step 1.1, the references between the top of stack and the recurring reference have been referred to as the set Γ = {β_i | β_i = S_B(k - 1)[i], 0 ≤ i ≤ Δ}.

Fig. 4 is applied to α = w_B(k) for all k. The LRU policy is essentially a definition for calculating S_B(k) from S_B(k - 1) and α. In most situations, S_B(k) is calculated in order to obtain other statistics, such as the stack depth distribution. (Step 1.1 is explained in detail in [16].)
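The LRU policy of Fig. 4 can be sketched with side-effect-free push and repush functions. This is an illustration under stated assumptions, not the authors' implementation: the stack is modeled as a Python tuple whose index 0 is the top, and the name `lru_step` and the returned stack depth are hypothetical conveniences.

```python
def push(stack, addr):
    """Return a new stack with addr placed on top; the input is not modified."""
    return (addr,) + stack


def repush(stack, addr):
    """Return a new stack with addr removed from its position and placed on top."""
    depth = stack.index(addr)
    return (addr,) + stack[:depth] + stack[depth + 1:]


def lru_step(stack, addr):
    """Apply the LRU policy to one reference.

    Returns (new_stack, depth), where depth is the position of addr in the
    old stack, or None if addr is not a recurring reference.
    """
    if addr in stack:
        return repush(stack, addr), stack.index(addr)
    return push(stack, addr), None


stack = ()
depths = []
for addr in [0, 1, 2, 3, 1, 2, 1, 0]:     # the trace of Fig. 1
    stack, depth = lru_step(stack, addr)
    depths.append(depth)

print(depths)  # [None, None, None, None, 2, 2, 1, 3]
print(stack)   # (0, 1, 2, 3)
```

Because each call returns a new tuple, the stack at reference count k is a pure function of the stack at k - 1 and the current address.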

C. Types of Context Switching

Context switching occurs due to two distinct events: 1) a voluntary context switch, where the benchmark relinquishes the processor, and 2) an involuntary context switch, where the benchmark's execution is suspended due to external interrupts. Voluntary context switches are a characteristic of the benchmark. They occur at the same place in the execution between different benchmark runs. On the other hand, involuntary context switches are determined by the I/O system behavior (device interrupts), clock frequency (timer interrupts), etc. They do not occur at the same place between runs of the benchmark, and are not characteristic of the benchmark. Page faults are treated as involuntary context switches because page faults depend on the interaction of processes in the system, whose interaction is assumed to be pseudo-random in nature.

Since involuntary context switches occur at random instances, it is assumed that involuntary context switches can occur with equal probability for each reference in the reference stream [12]. This probability is denoted q and termed the involuntary context switching intensity. Separation of the system's characteristics from the characteristics of the benchmark allows many different systems to be considered without resimulating the benchmark's behavior. This is the main goal of single-pass techniques in general [2]. Although the occurrence of involuntary context switches is not a characteristic of the benchmark, the benchmark's susceptibility to their occurrence is. This susceptibility can be measured as the expected number of multiprogramming conflicts due to random involuntary context switching. A method to measure this susceptibility is presented below that records the benchmark's susceptibility to all context-switching intensities in a single pass through the trace. The empirical results discussed in Section III-A demonstrate the validity of this single-pass approach.

The working set of a process (benchmark) may have been flushed from the cache before it re-enters the run state after a context switch. Let f_cs represent the fraction of the cache's contents flushed between context switches. The number of processes executed before a process returns from a context switch is a function of the system load and the operating system scheduling policy. Furthermore, the particular cache blocks flushed due to a context switch also depend on the reference patterns of the processes executing on the system. This makes f_cs highly dependent on several volatile variables and therefore difficult to measure. (Several empirical estimates of f_cs are presented in Section III-E.) Some virtual memory system implementations force a cache flush to eliminate problems with page sharing of writable pages [13]. Also, it has been shown that for small cache sizes, a context switch effectively flushes the cache, therefore f_cs = 1 [10]. For larger caches, this provides an upper bound for the effects of context switching.

D. The Components of Multiprogramming Conflicts

Multiprogramming conflicts are defined in terms of potential victims. A recurring reference that is not removed from a specific cache by a dimensional conflict, yet that may be removed by a context switch, is a potential victim of the context switch. The numbers of each type of potential victims are defined as X_V[C, B, S] and X_I[C, B, S, q], for all voluntary and involuntary context switches, respectively. X_V[C, B, S] is the total number of potential victims due to voluntary context switching for caches of dimension (C, B, S). X_I[C, B, S, q] is the expected number of potential victims due to involuntary context switching of intensity q. The multiprogramming conflicts are expressed in terms of victims as,

$$M[C, B, S, q] = f_{cs}\left(X_V[C, B, S] + X_I[C, B, S, q]\right). \tag{4}$$

The equation for the miss ratio (2) can be modified to take into account the new conflicts,

$$\rho = \frac{N - (R - D - M)}{N} = \frac{N - \left(R - D - f_{cs}\left(X_V[C, B, S] + X_I[C, B, S, q]\right)\right)}{N}. \tag{5}$$

Determining the multiprogramming conflicts involves measuring X_V and X_I from the reference stream. The measurement can be done by extending the recurrence/conflict single-pass technique. The miss ratio is then calculated by first calculating M[C, B, S, q] using Equation 4 for a value of f_cs, then using the result to complete Equation 5.
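The two-step calculation described above can be sketched as follows. The function names and all of the counts are hypothetical values chosen only to make the arithmetic easy to follow; they are not measurements from the paper.

```python
def multiprogramming_conflicts(f_cs, x_v, x_i):
    """Equation (4): M = f_cs * (X_V + X_I) for one cache dimension and one q."""
    return f_cs * (x_v + x_i)


def miss_ratio(n, r, d, m):
    """Equation (5): miss ratio including multiprogramming conflicts."""
    return (n - (r - d - m)) / n


# Hypothetical counts for a single cache dimension (C, B, S) and intensity q.
n, r, d = 1000, 900, 50      # references, recurrences, dimensional conflicts
x_v, x_i = 20, 100.0         # potential victims: voluntary and (expected) involuntary

for f_cs in (0.0, 0.5, 1.0):
    m = multiprogramming_conflicts(f_cs, x_v, x_i)
    print(f_cs, m, round(miss_ratio(n, r, d, m), 4))
```

With f_cs = 0 the result is the uniprogramming miss ratio of Equation (2); f_cs = 1 gives the worst case, in which every potential victim becomes a miss.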

E. Multiprogramming Extensions to LRU Stack Operation

The extensions required to the recurrence/conflict single-pass technique to measure X_V and X_I are shown in Fig. 6. The procedure for determining X_V[C, B, S] is illustrated in Fig. 5. The procedure operates as follows: When α is processed, if it is not a recurring reference (i.e., the test of Step 1 of Fig. 6 fails), then it cannot be a victim since it cannot produce a hit. However, if α is a voluntary context switch event, it is marked as such when it is pushed on the stack in Step 2 (marked references are shown using asterisks in Fig. 5).

Fig. 5. An example for voluntary context switch of the modified LRU stack operation. (* marks a marked stack position.)

If α is a recurring reference, X_V[C, B, S] is conditionally incremented if a marked reference is encountered when the dimensional conflicts are calculated. X_V[C, B, S] is only incremented for all dimensions in which α does not have a dimensional conflict. If X_V[C, B, S] were incremented for all dimensions, a reference might be counted more than once as a conflict, once as a multiprogramming conflict and once as a dimensional conflict. Notice that the references immediately below repushed, marked references inherit the marking in Fig. 5 (Step 1.6 and its substeps of Fig. 6). This is done to ensure all subsequent recurring references that cross the context switch event are subject to a voluntary context switch.

The procedure for determining X_I[C, B, S, q] using an LRU stack is somewhat more complicated than that for determining X_V[C, B, S]. Recall that an involuntary context switch may occur between every reference. Let L, the context switch distance, be the number of potential involuntary context switch events for the recurring reference α at reference count k (i.e., α = w_B(k - L) = w_B(k)). Let p_L be the probability that at least one involuntary context switch occurs between times k - L and k. Then,

$$p_L = 1 - (1 - q)^L. \tag{6}$$

Define n_L[C, B, S] to be the number of recurrences not subject to dimensional conflicts that have a context switch distance of L. Therefore,

$$X_I[C, B, S, q] = \sum_{L \ge 1} p_L \, n_L[C, B, S]. \tag{7}$$

Equation 7 expresses the expected number of potential victims due to involuntary context switching.

The equation fits naturally into a stack-based method. The new metric n_L[C, B, S] can be recorded by annotating the references on the stack.
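The expected number of potential victims can be evaluated for any intensity q once the distribution of context switch distances is known. The sketch below assumes p_L = 1 - (1 - q)^L, which follows from an involuntary context switch being equally likely (probability q) at each of the L opportunities. The function names are hypothetical, and the histogram holds the distances of the four recurrences in the Fig. 1 trace (e: 3, f: 3, g: 2, h: 7, counted directly) for a cache large enough to have no dimensional conflicts.

```python
def p_at_least_one_switch(q, distance):
    """Probability of at least one involuntary switch in `distance` opportunities."""
    return 1.0 - (1.0 - q) ** distance


def expected_involuntary_victims(n_l, q):
    """Sum of p_L * n_L over all recorded context switch distances L."""
    return sum(count * p_at_least_one_switch(q, distance)
               for distance, count in n_l.items())


# n_L for the Fig. 1 trace with no dimensional conflicts: {distance L: count}.
n_l = {2: 1, 3: 2, 7: 1}

for q in (0.0, 0.001, 0.01, 0.5, 1.0):
    print(q, expected_involuntary_victims(n_l, q))
```

The histogram is gathered once; each additional value of q costs only one pass over the histogram rather than a pass over the trace.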

Figure 7 shows an example of calculating X_I[C, B, S]. The figure shows that a counter of the number of context switch events affecting α is kept, defined as c_I(α). Initially and after a recurring reference is repushed, c_I(α) ← 1 (Steps 2.1 and 1.8 of Fig. 6). In Step 1.3 and its substeps and Step 1.4, L is computed from one plus the sum of the counters of entries above α on the stack. (Notice that c_I(α) is not part of the calculation of L; Fig. 7 illustrates this.) In Step 1.5 and its substeps, n_L[C, B, S] is incremented for all caches in which there are no dimensional conflicts. Let β_{Δ-1} = S_B(k - 1)[Δ - 1] be the address that is directly above α in the stack S_B(k - 1). As a bookkeeping step, c_I(β_{Δ-1}) is incremented by c_I(α) (Step 1.7). In this way, all the references deeper in the stack than α in S_B(k - 1) will arrive at the correct context switch distance.

The algorithm shows n_L[C, B, S] being maintained for all values of L. Not all values of L must be recorded using n_L[C, B, S]. Rather, power-of-two sized categories can be retained. The scheme used for the simulations presented below uses 14 categories. The first category contains n_L[C, B, S] for 1 ≤ L < 4; following this, the ith category contains n_L[C, B, S] for 2^(i+2) ≤ L < 2^(i+3). This quantization scheme is based on observations of the distribution of n_L[C, B, S] vs. L. The scheme does, however, produce error for small q, and this is commented on in the following section.

Notice that the calculation of n_L[C, B, S] is independent of the context switching intensity distribution assumptions. The function used to calculate p_L in (7) need not be (6). It is possible to substitute other context switching intensity distributions into (7) without altering the presented single-pass method. The impact of this observation is that the method is more general than the assumption of uniformly-distributed involuntary context switching of (6).

1.          if α ∈ S_B(k - 1) then
1.1           vol_cs ← false
1.2           L ← 0
1.3           for i ← 0 to Δ - 1 do
1.3.1           β_i ← S_B(k - 1)[i]
1.3.2           if β_i marked as a voluntary context switch event then
1.3.2.1           vol_cs ← true
1.3.3           L ← L + c_I(β_i)
1.4           L ← L + 1
1.5           for all (C, B, S) without a dimensional conflict do
1.5.1           n_L[C, B, S] ← n_L[C, B, S] + 1
1.5.2           if vol_cs then
1.5.2.1           X_V[C, B, S] ← X_V[C, B, S] + 1
1.6           if α marked as a voluntary context switch event then
1.6.1           mark β_{Δ+1}
1.6.2           unmark α
1.7           c_I(β_{Δ-1}) ← c_I(β_{Δ-1}) + c_I(α)
1.8           c_I(α) ← 1
1.9           S_B(k) ← repush(S_B(k - 1), α),
2.          else
2.1           c_I(α) ← 1
2.2           S_B(k) ← push(S_B(k - 1), α)

Fig. 6. An LRU stack method modified for context switching.

Fig. 7. An example for involuntary context switching of the modified LRU stack operation. (c_I is the stack counter; see text.)

TABLE I

Benchmark    Description
doduc        Monte Carlo simulation of the time evolution of a thermohydraulical modelization for a nuclear reactor.
eqntott      Generates truth table from logic equations.
espresso     Performs PLA optimization.
gcc          GNU C compiler, version 1.35.
matrix300    Performs 300 x 300 matrix multiply.
xlisp        Lisp interpreter (the application) executing the Nine-Queens problem (the recursive benchmark).
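The counter bookkeeping used to obtain the context switch distance can be sketched in a few lines. This is a simplified illustration rather than the full method of Fig. 6: it assumes a single block size, a fully-associative cache large enough to have no dimensional conflicts, and no voluntary context switch marks. The function name `context_switch_distances` is hypothetical, and the handling of a re-reference to the entry already on top of the stack (its counter is incremented, since no entry lies above it) is an assumption that the text does not spell out.

```python
def context_switch_distances(block_stream):
    """Return the context switch distance L of every recurrence, in trace order."""
    stack = []       # index 0 is the top of the stack; entries are [address, counter]
    distances = []
    for addr in block_stream:
        depth = next((i for i, entry in enumerate(stack) if entry[0] == addr), None)
        if depth is None:
            stack.insert(0, [addr, 1])        # first reference: push with counter 1
            continue
        # L is one plus the sum of the counters of the entries above addr.
        distances.append(1 + sum(counter for _, counter in stack[:depth]))
        entry = stack.pop(depth)
        if depth > 0:
            stack[depth - 1][1] += entry[1]   # hand the counter to the entry directly above
            entry[1] = 1                      # reset the counter of the repushed entry
        else:
            entry[1] += 1                     # assumed top-of-stack case: extend its own counter
        stack.insert(0, entry)
    return distances


print(context_switch_distances([0, 1, 2, 3, 1, 2, 1, 0]))  # [3, 3, 2, 7]
```

For the trace of Fig. 1, the computed distances agree with counting references directly: address 1 recurs after 3 and then 2 references, address 2 after 3, and address 0 after 7.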

III. EMPIRICAL RESULTS OF PROGRAM SUSCEPTIBILITY

The validity of the single-pass method of the previous section is discussed below by comparing the method's results with the results from other techniques that have similar assumptions. The results from the model are presented and discussed for members of the SPEC89 benchmark set presented in Table I (from [18]).

The dimensional conflicts that occur due to different cache sizes are discussed in Section III-D to compare their performance degradation with that of context switching. Empirically observed values of the parameter f_cs are also presented. It is found that f_cs < 1 for moderate multiprogramming loads, confirming the observation that f_cs = 1 produces overly pessimistic results.

A. The Validity of the Single-Pass Method

It is important to question whether the single-pass method extended to measure context switching produces performance estimates that are consistent with the assumptions made in Section II-C. (Whether these assumptions are valid themselves is beyond the scope of this study.) The approach used in testing the validity of the method is to compare its predictions against methods used for traditional cache simulators.

One commonly-used simulation technique to measure the effects of context switching is to flush the state of the simulation at context switching events [6], [7], [10]. It is clear that in this case, f_cs = 1 is assumed. The decision of when to flush the stack for voluntary context switch events is known since these are present in the trace. The decision of when to flush the cache for involuntary context switch events is made by distributing involuntary context switch events throughout the trace uniformly. This random-interval simulation flushes the contents of the stack based on a uniformly-distributed random number with mean q. Note that the random-interval simulation requires a simulation for each value of q. The single-pass method does not have this restriction since it measures the effects of all q in one pass over the trace.

The random-interval simulation method approximates the assumptions of Section II-C, except that the simulation produces results for one particular random distribution of context switch events across the trace. The single-pass method measures the average effect of all distributions. This discrepancy can be eased by averaging the results of several random-interval simulations. Random-interval simulations were performed iteratively until the results converged.
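The comparison can be illustrated on a toy scale. The sketch below is not the authors' simulator: it assumes an unbounded cache (so there are no dimensional conflicts), f_cs = 1, no voluntary context switches, and a flush that occurs independently with probability q before every reference except the first. The function names, the seed, and the run count are hypothetical choices.

```python
import random


def flush_simulation_misses(trace, q, rng):
    """One random-interval run: count misses with random whole-cache flushes."""
    cache = set()
    misses = 0
    for k, addr in enumerate(trace):
        if k > 0 and rng.random() < q:
            cache.clear()              # f_cs = 1: the entire cache is lost
        if addr not in cache:
            misses += 1
            cache.add(addr)
    return misses


def expected_misses(trace, q):
    """Expected misses from the distances alone: first references plus the sum of p_L."""
    last_seen = {}
    expected = 0.0
    for k, addr in enumerate(trace):
        if addr in last_seen:
            distance = k - last_seen[addr]
            expected += 1.0 - (1.0 - q) ** distance   # the recurrence is a victim
        else:
            expected += 1.0                           # first reference always misses
        last_seen[addr] = k
    return expected


trace = [0, 1, 2, 3, 1, 2, 1, 0]       # the trace of Fig. 1
q = 0.1
rng = random.Random(1994)              # fixed seed so the run is repeatable
runs = 20000
average = sum(flush_simulation_misses(trace, q, rng) for _ in range(runs)) / runs
print(expected_misses(trace, q), average)
```

Averaging many random-interval runs approaches the expectation computed from the distances, but each value of q needs its own set of runs, whereas the expectation is recomputed from the same distance data.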

Figure 8 presents the difference between the random-interval simulation and the single-pass method for the benchmarks


             16 bytes               32 bytes               64 bytes
Average      21.2%  9.73%  4.35%    13.9%  4.76%  1.69%    9.72%  3.52%  1.04%
Std. dev.    16.7%  6.64%  2.64%    12.2%  4.96%  1.66%    8.24%  2.72%  0.71%

Fig. 8. Absolute error of the miss ratio for random-interval simulation vs. single-pass method, q = 0.01 and q = 0.001. (Horizontal axis: cache size, log2 bytes, 10 to 20; one curve per benchmark for each value of q.)

expressed as the absolute error of the miss ratio, for q = 0.01 and q = 0.001, respectively. Only the absolute errors for fully-associative caches are shown in the figure for brevity. Smaller levels of associativity were found to have lower error.

The figure demonstrates that the difference between the simulation and the model is approximately 2.5% to 3.3% for q = 0.01 and less than 1% for q = 0.001. The increase in error for larger q values is due to the quantization of L discussed in the previous section. This error of 2.5% to 3.3% is large for q = 0.01; however, empirical evidence suggests that q values are typically in the range of q = 0.001 to q = 0.0002 (calculated from [19] and [20] and assuming one reference made every five instructions). The single-pass technique using the power-of-two quantization of L produces results within 1% of the random-interval simulation for practical q values. This evidence suggests the single-pass method produces results that are consistent with its assumptions.

B. Involuntary Context Switching Susceptibility

It is useful to define Δρ = M[C, B, S, q]/N as a measure of benchmark susceptibility to context switching. This is the difference between the uniprogramming and multiprogramming miss ratios. Figure 9 presents Δρ for gcc and xlisp for block size 16 bytes. The results in the figure are for the cache (31, 4, ∞), to eliminate the effects of dimensional conflicts. Section III-D discusses the effects of dimensional conflicts. The figure considers only involuntary context switching. The effects of voluntary context switching are discussed in Section III-C. Also, complete cache flushing (f_cs = 1) is used to emphasize the worst-case context switching penalty (Section III-E discusses other values of f_cs). The figure demonstrates that when the intensity of context switching, q, is small, Δρ approaches zero such that context switching has little effect for q ≤ 0.0001. This value of q corresponds to an average context switching interval of 10000 references. The gcc benchmark is slightly more susceptible to context switching than the xlisp benchmark for q = 0.1. This situation reverses itself and xlisp becomes more susceptible for q > 0.01. This phenomenon can be explained with the cumulative distribution of n_L vs. L, which is plotted in Fig. 10. The figure has a logarithmic axis

Fig. 9. Δρ (involuntary) of gcc and xlisp vs. q for block size 16 bytes, cache dimension (31, 4, ∞).

Fig. 10. Cumulative distribution of n_L vs. L for block size 16 bytes, cache dimension (31, 4, ∞). (Horizontal axis: context switching distance, L, log base 2, 0 to 15; curves for gcc and xlisp.)

TABLE II
INVOLUNTARY CONTEXT SWITCHING SUSCEPTIBILITY (Δρ) FOR CACHES (31, -, ∞)

for the independent variable, L. From the figure, it is apparent that xlisp has a higher number of recurrences for L ≤ 2^5: 85% for gcc vs. 87.4% for xlisp. This would imply that a context switch frequency of greater than every 2^5 = 32 references would impact xlisp more than gcc. This explains the behavior observed in Fig. 9.

Figures 11 and 12 present Δρ for gcc, espresso, and xlisp for block sizes 32 bytes and 64 bytes, respectively. The data for all the benchmarks are presented in Table II. The corresponding cumulative distribution functions of n_L vs. L are presented in Figs. 13 and 14, respectively. Together, Figs. 9-12 demonstrate that the susceptibility to context switching decreases as block size increases. From Table II, for q = 0.01, the difference in

Fig. 11. Δρ (involuntary) of gcc, espresso, and xlisp vs. q for block size 32 bytes, cache dimension (31, 5, ∞). (Horizontal axis: context switch probability, q, 1e-05 to 0.1.)

Fig. 12. Δρ (involuntary) of gcc, espresso, and xlisp vs. q for block size 64 bytes, cache dimension (31, 6, ∞).

the miss ratio is 21.2% for block size 16 bytes and 13.9% for block size 32 bytes, on average. One possible reason for this is that the block reference streams for larger block sizes have smaller context switching distances. This occurs since more references occupy the same cache block for larger block sizes than for smaller block sizes.
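The block-size effect described above can be seen on a synthetic trace. The sketch below is a hypothetical illustration and does not use data from the benchmarks: it walks a 4-KB array of 4-byte words twice and reports, for several block sizes, the fraction of block recurrences whose distance is at most 32 references. The trace and the function name are assumptions.

```python
def short_recurrence_fraction(addresses, block_bits, max_distance=32):
    """Fraction of block recurrences with distance <= max_distance."""
    last_seen = {}
    short = 0
    total = 0
    for i, addr in enumerate(addresses):
        block = addr >> block_bits
        if block in last_seen:
            total += 1
            if i - last_seen[block] <= max_distance:
                short += 1
        last_seen[block] = i
    return short / total if total else 0.0


# Synthetic trace (an assumption): two sequential passes over 1024 words.
trace = [4 * word for _ in range(2) for word in range(1024)]

# Keys are block sizes in bytes: 16, 32, 64.
fractions = {1 << bits: short_recurrence_fraction(trace, bits)
             for bits in (4, 5, 6)}
```

For this trace the fraction of short recurrences grows with block size, since more of the references to a block fall within the same sequential run.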

Comparison between the benchmarks reveals significant variance in susceptibility. The matrix300 and eqntott benchmarks have the lowest change in the miss ratio, whereas benchmarks such as doduc and xlisp are quite sensitive to the value of q. For q = 0.01, Δρ = 37.8% for doduc compared to 5.98% for xlisp. This confirms that susceptibility is a characteristic of the benchmark and that workload choice influences the observed effects of context switching.

C. Voluntary Context Switching Susceptibility

The susceptibility of the benchmarks to voluntary context switching effects is relatively small compared to the involuntary effects. This can be seen in Table III, which presents the voluntary susceptibility (Δρ) for fully-associative caches of the largest dimension for gcc and espresso. The largest-dimensional fully-associative caches were selected so that Δρ would be at its maximum, since no dimensional conflicts occur.

Fig. 13. Cumulative distribution of n_L vs. L for block size 32 bytes, cache dimension (31, 5, ∞). [Plot of espresso, gcc and xlisp; horizontal axis: context switching distance, L (log base 2).]

Fig. 14. Cumulative distribution of n_L vs. L for block size 64 bytes, cache dimension (31, 6, ∞). [Plot of espresso, gcc and xlisp; horizontal axis: context switching distance, L (log base 2).]

TABLE III VOLUNTARY CONTEXT SWITCHING SUSCEPTIBILITY VS. BLOCK SIZE

Benchmark    Block size (bytes)
             16             32             64
doduc        0.07%          0.04%          0.03%
espresso     0.03%          0.02%          0.01%
gcc          3.1%           2.0%           [illegible]
xlisp        6.0×10^[?]%    6.0×10^[?]%    6.0×10^[?]%
eqntott      5.6×10^-3%     3.8×10^-3%     2.4×10^-3%
matrix300    1.9×10^[?]%    1.8×[...]      1.7×[...]

The occurrences of voluntary context switches are rare for these benchmarks as well as for other members of the SPEC89 set [17]. This explains the small susceptibility due to voluntary context switches. This may well be an artifact of benchmark selection and should not be taken as a general statement that voluntary context switches do not have much effect. One of the benchmarks, gcc, is selected to serve as an example for the discussions that follow to illustrate the behavior of the susceptibility model.

D. Dimensional Conflict Effects

The dimensional conflicts have been excluded from consideration thus far by considering large, fully-associative caches to isolate the effects of context switching. The relative importance of dimensional conflicts to multiprogramming conflicts is

Fig. 15. Δρ (involuntary) of gcc for caches (10, 4, -). [Horizontal axis: context switch probability, q.]

Fig. 16. M/D vs. q for gcc, caches (10, 4, -) and (13, 4, -). [Horizontal axis: context switch probability, q.]

interesting because some cache designs may be more resilient to context switching than others due to the influences of dimensional conflicts. Consider caches of size 1-K bytes. This size is selected as a worst case since it should experience a high percentage of dimensional conflicts due to its extremely small size. Fig. 15 shows Δρ vs. q for gcc using caches of 1-K bytes and several set-associativities.
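The cache dimension triples used throughout this section can be decoded mechanically. The helper below is a sketch that assumes the reading implied by the text: (C, B, S) denotes a cache of 2^C bytes with 2^B-byte blocks and 2^S blocks per set, with S = ∞ meaning fully associative. The function name and the returned dictionary are hypothetical.

```python
import math


def decode_dimension(c, b, s):
    """Decode a cache dimension (C, B, S) into sizes.

    Assumes C = log2(cache bytes), B = log2(block bytes),
    S = log2(associativity); pass s=math.inf for fully associative.
    """
    size_bytes = 1 << c
    block_bytes = 1 << b
    n_blocks = size_bytes // block_bytes
    if math.isinf(s):
        ways = n_blocks                  # one set holding every block
    else:
        ways = min(1 << s, n_blocks)     # cannot exceed the block count
    return {
        "size_bytes": size_bytes,
        "block_bytes": block_bytes,
        "ways": ways,
        "sets": n_blocks // ways,
    }


direct_mapped_1k = decode_dimension(10, 4, 0)
fully_assoc_1k = decode_dimension(10, 4, math.inf)
two_way_8k = decode_dimension(13, 4, 1)
```

With this reading, (10, 4, 0) is a 1-K-byte direct-mapped cache with 16-byte blocks and (13, 4, 1) is an 8-K-byte 2-way set-associative cache.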

Calculating the miss ratios for the uniprogrammed case for gcc reveals a variation in the miss ratio from 18% for (10, 4, 0) to 15% for (10, 4, ∞) (these data are not shown in the figure). However, there is much less variation in Δρ apparent in Fig. 15. This same effect is apparent from the data collected for the other benchmarks.

The above data suggest that dimensional conflicts dominate over context switching effects for small caches. To quantify this, the ratio of the multiprogramming conflicts to the dimensional conflicts, M[C, B, S, q]/D[C, B, S], can be used as a measure of the relative impact of multiprogramming conflicts to dimensional conflicts. This ratio is plotted against q using caches of dimension (10, 4, -) and (13, 4, -) for gcc, and the results are shown in Fig. 16.

The figure demonstrates that for small q, dimensional conflicts dominate. The two kinds of conflicts have equal effect (i.e., M/D = 1.0) for q ≈ 0.02 with caches (10, 4, -) and for

Fig. 17. log(M/D) vs. cache size, for various block sizes (q = 0.02).

q ≈ 0.00003 with caches (13, 4, -). As associativity increases, the performance depends more on the multiprogramming conflicts than dimensional conflicts. Also, the importance of associativity increases with overall cache size. This implies that when associativity is used, multiprogramming effects can decide the cache size, which is similar to the observations of [13] concerning associativity.

To show that the effects observed are not an artifact of the test cache sizes of 1-K and 8-K bytes, Fig. 17 presents M/D ratios for various cache and block sizes.

Any value of q would have been sufficient to demonstrate the general relationship between M/D and C. The data from Fig. 16 were used to select q = 0.02 for Fig. 17. Since in this region the effects of associativity are relatively minor, the associativity is fixed at 2-way (i.e., all caches are (-, -, 1)). (Note that here, unlike the earlier figure, M/D is presented using a logarithmic scale.) From the figure, it is immediately apparent that the worst-case relative impact of multiprogramming (i.e., M/D) increases approximately linearly with cache size (both axes are logarithmic). Also, as a refinement of the observations made in Section III-B, block size is inversely related to program susceptibility for small caches (less than 2-K bytes). However, program susceptibility appears to increase with block size for moderately large cache sizes (4-K bytes up to 256-K bytes), after which the trend reverses itself again.

E. Some Empirical Measurement of Fraction of Cache Flush (fcs)

The majority of this section has assumed fcs = 1 in order to measure the worst-case susceptibility of programs to context switching. This section presents some empirical estimates of the fcs parameter. Note that these measurements are presented as some examples of fcs values. They should not be used to make general conclusions about fcs. The data is intended to illustrate the role of fcs in the model of context switching used in this paper.

The method used is to simulate a multiprogrammed system using a trace composed of interleaved sections of traces taken from several benchmarks. At each reference, the interleaver

Fig. 18. fcs vs. cache size for espresso, q = 0.01. [Horizontal axis: cache size (lg bytes).]

Fig. 19. fcs vs. cache size for espresso, q = 0.001. [Horizontal axis: cache size (lg bytes).]

makes a decision whether to continue processing the current trace or to switch to another waiting trace. This probability is assumed to be uniformly distributed with mean q.
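A trace interleaver of this kind can be sketched as follows. This is a minimal sketch, not the tool used for these experiments. It assumes an independent switch decision with probability q before each reference, a round-robin choice of the next waiting trace, and that an exhausted trace simply drops out; all three are assumptions, and the function name is hypothetical.

```python
import random


def interleave(traces, q, seed=0):
    """Yield (trace_index, address) pairs from several traces.

    Before each reference, switch to the next waiting trace with
    probability q (round-robin order is an assumption).
    """
    rng = random.Random(seed)
    iterators = [iter(t) for t in traces]
    active = list(range(len(iterators)))   # traces not yet exhausted
    pos = 0                                # position of the running trace
    while active:
        if len(active) > 1 and rng.random() < q:
            pos = (pos + 1) % len(active)  # context switch
        index = active[pos]
        try:
            address = next(iterators[index])
        except StopIteration:
            active.pop(pos)                # trace finished, drop it
            if active:
                pos %= len(active)
            continue
        yield index, address


# With q = 0 the traces run back to back, with no switches.
sequential = list(interleave([[1, 2, 3], [10, 20]], q=0.0))
```

Tagging each reference with its trace index lets a cache simulator attribute misses to an individual benchmark in the multiprogrammed run.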

The values of fcs for each benchmark can be derived by comparing the number of misses for a uniprogramming cache to the observed number of misses for the multiprogramming cache. Any additional misses in the multiprogrammed case must be due to context switching. Let $D_u[C, B, S]$ represent the dimensional misses from the uniprogramming trace and $D_m[C, B, S]$ represent the dimensional misses measured from the multiprogramming trace. Let $\hat{M}$ represent the estimated multiprogramming conflicts. Then,

$$\hat{M}[C, B, S, q] = D_m[C, B, S] - D_u[C, B, S]. \tag{8}$$

Note that it can always be assumed that $D_m[C, B, S] > D_u[C, B, S]$, since any additional dimensional conflicts for a benchmark trace must come as a result of the interleaving of traced events. Using (4) and neglecting the $X_V$ term, as justified by the experimental results of Section III-C, then,

$$\hat{f}_{cs} = \frac{D_m[C, B, S] - D_u[C, B, S]}{X_I[C, B, S, q]} \tag{9}$$

where $\hat{f}_{cs}$ is the experimental value for fcs. Equation (9) requires knowledge of $X_I$, which would require use of the algorithm of Fig. 7.
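As an illustration of how (8) and (9) would be applied, the sketch below counts the misses charged to one benchmark in a uniprogrammed trace and in an interleaved trace using a simple LRU set-associative cache, then forms the two estimates. It is a hypothetical sketch, not the simulator used here: the function names, the (process, address) trace format, and the LRU policy are assumptions, and the X_I term must be supplied separately.

```python
from collections import OrderedDict


def count_misses(trace, c, b, ways, owner):
    """Misses charged to `owner` in a trace of (process_id, address) pairs.

    Assumes a cache of 2**c bytes with 2**b-byte blocks, `ways` blocks
    per set, and LRU replacement within each set.
    """
    n_sets = max(1, ((1 << c) >> b) // ways)
    sets = [OrderedDict() for _ in range(n_sets)]
    misses = 0
    for pid, addr in trace:
        block = addr >> b
        lines = sets[block % n_sets]
        key = (pid, block)             # address spaces assumed disjoint
        if key in lines:
            lines.move_to_end(key)     # mark as most recently used
            continue
        if pid == owner:
            misses += 1
        if len(lines) >= ways:
            lines.popitem(last=False)  # evict the least recently used
        lines[key] = True
    return misses


def estimate_m_hat(d_m, d_u):
    """Estimated multiprogramming conflicts, following (8)."""
    return d_m - d_u


def estimate_fcs(d_m, d_u, x_i):
    """Experimental fcs following (9); x_i is the X_I term."""
    if x_i <= 0:
        raise ValueError("x_i must be positive")
    return (d_m - d_u) / x_i


# Hypothetical traces: process 0 alone, then interleaved with process 1.
uni = [(0, 0), (0, 16), (0, 0), (0, 16)]
multi = [(0, 0), (0, 16), (1, 0), (1, 16), (0, 0), (0, 16)]
d_u = count_misses(uni, c=5, b=4, ways=2, owner=0)
d_m = count_misses(multi, c=5, b=4, ways=2, owner=0)
```

Keying cache lines on the (process, block) pair is one simple way to keep identical addresses from different benchmarks from being treated as hits on each other's data.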

The two experiments interleave espresso, gcc, and xlisp with q = 0.01 and q = 0.001. A block size of 32 bytes was assumed for these experiments. Values of fcs across cache sizes and associativities, measured from the perspective of the espresso benchmark, are presented in Fig. 18 (q = 0.01) and Fig. 19 (q = 0.001).

It is clear from these two figures that fcs is a function of cache size, as suggested by (9). For small caches, fcs ≈ 13%-18%, whereas for large caches, fcs ≈ 8%, when q = 0.01. The effect of associativity on fcs is less pronounced than the effect of cache size for q = 0.01 (Fig. 18). For q = 0.001 (Fig. 19), fcs ≈ 0 for large, fully-associative caches. This is not true for smaller associativities, since dimensional conflicts occur between the references of the three benchmarks regardless of cache size. The effects of associativity also become less noticeable as cache size increases, possibly because fewer dimensional conflicts occur between the references from espresso and those of gcc and xlisp.

These experiments show that fcs = 1 is a pessimistic assumption under moderate load. When workloads are decomposed into workload elements, or workload elements are taken from standard benchmark sets such as SPEC89, it is not possible to predict the different combinations of benchmarks that may execute together in the final system. In this situation, a conservative assumption such as fcs = 1 is appropriate. The results for the susceptibility measures presented above for fcs = 1 suggest that the difference in the miss ratio will not change considerably for values of q ≤ 10^-4. If designs are selected to satisfy a required maximum miss ratio, this observation suggests that selecting prototypes with fcs = 1 adds a degree of tolerance to context switching to the designs at a small increase in cost.

IV. CONCLUSION

This paper presented a method for measuring the worst-case context switching penalty (susceptibility) on the cache performance of a benchmark in the presence of multiprogramming. This was done by extending existing single-pass methods to measure the susceptibility in terms of potential victims of context switching. The method removes the single-pass simulation's dependence on involuntary context switching intensity and system load effects, allowing performance to be calculated for values of these parameters without the need for re-simulation. This generalization of single-pass methods extends their usefulness into domains where multiple-pass methods were the only option.

The experimentation performed in this paper revealed that a benchmark’s susceptibility to context switching can be minimized by using large block sizes with small and large cache sizes. Interestingly, for medium-sized caches (4K-256K bytes for gcc) small block sizes minimize the impact of context switching.

An increase in context-switching intensity has a roughly linear effect on a benchmark's susceptibility. For all but extremely high intensities, dimensional conflicts dominate the miss ratio. Since all the benchmarks elicited very small involuntary context switching distances, a relatively high intensity of context switching (q ≥ 0.0001) was needed to have significant effects.

It is not true that all workloads will have susceptibilities similar to the SPEC89 benchmark members considered here. However, the method itself is not limited to a specific type of benchmark, and other results are easily generated. The benchmark results were shown to demonstrate the approach's usefulness and validity. The method was shown to perform comparably to less-general multiple-pass test methods. Also, the behavior of the multiprogramming miss ratio agrees with actual multiprogramming behavior results presented by other researchers, suggesting the results obtained using the single-pass method are reliable for design purposes.

ACKNOWLEDGMENT

The authors would like to thank all members of the IMPACT research group for their support, comments and suggestions.

REFERENCES

[1] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger, "Evaluation techniques for storage hierarchies," IBM Syst. J., vol. 9, no. 2, pp. 78-117, 1970.

[2] I. L. Traiger and D. R. Slutz, "One-pass techniques for the evaluation of memory hierarchies," IBM Res. Rep. RJ 892, IBM, San Jose, CA, July 1971.

[3] M. D. Hill and A. J. Smith, "Evaluating associativity in CPU caches," IEEE Trans. Comput., vol. 38, pp. 1612-1630, Dec. 1989.

[4] T. M. Conte and W. W. Hwu, "Single-pass memory system evaluation for multiprogramming workloads," Tech. Rep. CSG-122, Ctr. for Reliable and High-Performance Computing, Univ. of Illinois, Urbana, IL, May 1990.

[5] W.-H. Wang and J.-L. Baer, "Efficient trace-driven simulation methods for cache performance analysis," ACM Trans. Comput. Syst., vol. 9, pp. 222-241, Aug. 1991.

[6] K. R. Kaplan and R. O. Winder, "Cache-based computer systems," Computer, vol. 6, pp. 30-36, Mar. 1973.

[7] W. D. Strecker, "Cache memories for (PDP-11) family computers," in Proc. 3rd Annu. Int. Symp. Comput. Architecture, Jan. 1976, pp. 155-158.

[8] G. S. Shedler and D. R. Slutz, "Derivation of miss ratios for merged access streams," IBM J. Res. Develop., vol. 20, pp. 505-517, Sept. 1976.

[9] M. C. Easton, "Computation of cold-start miss ratios," IEEE Trans. Comput., vol. C-27, pp. 404-408, May 1978.

[10] A. J. Smith, "Cache memories," ACM Computing Surveys, vol. 14, no. 3, pp. 473-530, 1982.

[11] D. W. Clark, "Cache performance in the VAX-11/780," ACM Trans. Comput. Syst., vol. 1, pp. 24-37, Feb. 1983.

[12] I. J. Haikala, "Cache hit ratios with geometric task switch intervals," in Proc. 11th Annu. Int. Symp. Comput. Architecture, Ann Arbor, MI, June 1984, pp. 364-371.

[13] A. Agarwal, J. Hennessy, and M. Horowitz, "Cache performance of operating system and multiprogramming workloads," ACM Trans. Comput. Syst., vol. 6, pp. 393-431, Nov. 1988.

[14] D. Thiebaut and H. S. Stone, "Footprints in the cache," ACM Trans. Comput. Syst., vol. 5, pp. 305-329, Nov. 1987.

[15] J. C. Mogul and A. Borg, "The effect of context switches on cache performance," in Proc. Fourth Int. Conf. on Architectural Support for Prog. Lang. and Operating Syst., Santa Clara, CA, Apr. 1991, pp. 75-84.

[16] T. M. Conte, "Systematic computer architecture prototyping," Ph.D. thesis, Dept. of Elect. and Comput. Eng., Univ. of Illinois, Urbana, IL, 1992.

[17] T. M. Conte and W. W. Hwu, "Benchmark characterization," IEEE Computer, pp. 48-56, Jan. 1991.

[18] "SPEC newsletter," SPEC, Fremont, CA, Feb. 1989.

[19] J. S. Emer and D. W. Clark, "A characterization of processor performance in the VAX-11/780," in Proc. 11th Annu. Int. Symp. Comput. Architecture, Ann Arbor, MI, June 1984, pp. 301-309.

[20] D. W. Clark, P. J. Bannon, and J. B. Keller, "Measuring VAX 8800 performance with a histogram hardware monitor," in Proc. 15th Annu. Int. Symp. Comput. Architecture, Honolulu, HI, May 1988, pp. 176-185.

Wen-mei W. Hwu (S’81-M’87) received the Ph.D. degree in computer science from the University of California, Berkeley, in 1987.

He is an Associate Professor at the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign. His research interest is in the area of architecture, implementation, and compilation for high performance computer systems. He is the director of the IMPACT project, which has delivered new compiler and computer architecture technologies to the computer industry since 1987. The IMPACT project has been sponsored by NSF, ONR, NASA as well as major corporations such as Hewlett-Packard, Intel, SUN, NCR, AMD, and Matsushita.

In recognition of his contributions to the areas of compiler optimization and computer architecture, the Intel Corporation named Dr. Hwu the Intel Associate Professor at the College of Engineering, University of Illinois in 1992. He received the National Eta Kappa Nu Outstanding Young Electrical Engineer Award for 1993 and the 1994 Senior Xerox Award for Faculty Research.

Thomas M. Conte received the BSEE degree from the University of Delaware, Newark and the M.S. and Ph.D. degrees in electrical engineering from the University of Illinois, Urbana-Champaign.

He is an Assistant Professor in the Department of Electrical and Computer Engineering at the University of South Carolina, Columbia, South Carolina. His interests are in computer architecture and the history of computing. Dr. Conte is a member of the IEEE Computer Society, the ACM, Tau Beta Pi, and Eta Kappa Nu.
