
Column-Associative Caches:

A Technique for Reducing the Miss Rate of Direct-Mapped Caches

Anant Agarwal and Steven D. Pudar

Laboratory for Computer Science

Massachusetts Institute of Technology

Cambridge, MA 02139

Abstract

Direct-mapped caches are a popular design choice for high-performance processors; unfortunately, direct-mapped caches suffer systematic interference misses when more than one address maps into the same cache set. This paper describes the design of column-associative caches, which minimize the conflicts that arise in direct-mapped accesses by allowing conflicting addresses to dynamically choose alternate hashing functions, so that most of the conflicting data can reside in the cache. At the same time, however, the critical hit access path is unchanged. The key to implementing this scheme efficiently is the addition of a rehash bit to each cache set, which indicates whether that set stores data that is referenced by an alternate hashing function. When multiple addresses map into the same location, these rehashed locations are preferentially replaced. Using trace-driven simulations and an analytical model, we demonstrate that a column-associative cache removes virtually all interference misses for large caches, without altering the critical hit access time.

1 Introduction

The cache is an important component of the memory system of workstations and mainframe computers, and its performance is often a critical factor in the overall performance of the system. The advent of RISC processors and VLSI technology has driven down processor cycle times and made frequent references to main memory unacceptable.

Caches are characterized by several parameters, such as their size, their replacement algorithm, their block size, and their degree of associativity [1]. For cache accesses, a typical address a is divided into at least two fields, the tag field (typically the high-order bits) and the index field (the low-order bits), as shown in Figure 1. The index field is used to reference one of the sets, and the tag field is compared to the tags of the data blocks within that set. If the tag field of the address matches one of the tag fields of the referenced set, then we have a hit, and the data can be obtained from the block that exhibited the hit.¹

Figure 1: Indexing into a direct-mapped cache using bit-selection hashing.

In a d-way set-associative cache, each set contains d distinct blocks of data accessed by addresses with common index fields but different tags. When the degree of associativity is reduced to one, each set can then hold no more than one block of data. This configuration is called a direct-mapped cache.
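As a concrete illustration of this bit-selection decomposition, the following Python sketch (ours, not from the paper) splits an address into tag, index, and offset fields. The 16-byte block size matches the simulations later in the paper; the 256-set cache is an arbitrary parameter chosen for the example.

```python
# Minimal sketch of address decomposition for a direct-mapped cache.
# Assumed parameters: 16-byte blocks (4 offset bits) and 256 sets
# (8 index bits); only the block size comes from the paper itself.
BLOCK_BYTES = 16
NUM_SETS = 256
OFFSET_BITS = (BLOCK_BYTES - 1).bit_length()   # 4
INDEX_BITS = (NUM_SETS - 1).bit_length()       # 8

def split_address(addr: int):
    """Return the (tag, index, offset) fields of an address."""
    offset = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)   # bit selection: b[a]
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```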

For a cache of given size, the choice of its degree of associativity influences many performance parameters, such as the silicon area (or, alternatively, the number of chips) required to implement the cache, the cache access time, and the miss rate. Because a direct-mapped cache allows only one data block to reside in the cache set that is directly specified by the address index field, its miss rate (the ratio of misses to total references) tends to be worse than that of a set-associative cache of the same total size. However, the higher miss rate of direct-mapped caches is mitigated by their smaller hit access time [2, 3]. A set-associative cache of the same total size always displays a higher hit access time because an associative search of a set is required during each reference, followed by a multiplexing of the appropriate data word to the processor. Furthermore, direct-mapped caches are simpler and easier to design, and they require less area. Overall, direct-mapped caches are often the most economical choice for use in workstations, where cost-performance is the most important criterion.

1.1 The Problem

Unfortunately, the large number of interference misses that occur in direct-mapped caches is still a major problem. An interference miss (also known as a conflict miss) occurs when two addresses map into the same cache set in a direct-mapped cache, as shown in Figure 1.


¹ In most caches, more than one data word can reside in a data block. In this case, an offset is the third and lowest-order field in the address, and it is used to select the appropriate data word.


Consider referencing a cache with two addresses, a_i and a_j, that differ only in some of the higher-order bits (which often occurs in multiprogramming environments). In this case, the addresses will have different tags but identical index fields; therefore, they will reference the same set. If we denote the set that is selected by choosing the low-order bits of an address a as b[a], then we have b[a_i] = b[a_j] for conflicting addresses. The name b comes from the bit-selection operation performed on the bits to obtain the index.

Assume the following reference pattern: a_i a_j a_i a_j a_i a_j .... A set-associative cache will not suffer a miss if the program issues the above sequence of references, because the data referenced by a_i and a_j can co-reside in a set. In a direct-mapped cache, however, the reference to a_j will result in an interference miss because the data from a_i occupies the selected cache block. The percentage of misses that are due to conflicts varies widely among different applications, but it is often a substantial portion of the overall miss rate.

We believe these interference misses can be largely eliminated by implementing control logic which makes better use of cache area. The challenge, then, is determining a simple, area-efficient cache control algorithm to reduce the number of interference misses and to boost the performance without increasing the degree of associativity.

1.2 Contributions of This Paper

This paper presents the design of a column-associative cache that resolves conflicts by allowing alternate hashing functions, which results in significantly better use of cache area. Using trace-driven simulations, we demonstrate that its miss rate is much better than that of Jouppi's victim cache [4] and the hash-rehash cache of Agarwal, Horowitz, and Hennessy [5], and virtually the same as that of a two-way set-associative cache. Furthermore, its hit access time is the same as that of a direct-mapped cache. To help explain the behavior of the column-associative cache, we also develop and validate an analytical model for this cache.

The rest of this paper is organized as follows. The next section discusses other efforts with similar goals. Section 3 presents the column-associative cache, and Section 4 develops an analytical model for this cache. Section 5 presents the results of trace-driven simulations comparing the performance of several cache designs, and Section 6 concludes the paper.

2 Previous Work

Several schemes have been proposed for reducing the number of interference misses. A general approach to improving direct-mapped cache access is Jouppi's victim cache [4]. A victim cache is a small, fully-associative cache that provides some extra cache lines for data removed from the direct-mapped cache due to misses. Thus, for a reference stream of conflicting addresses, such as a_i a_j a_i a_j ..., the second reference, a_j, will miss and force the data indexed by a_i out of the set. The data that is forced out is placed in the victim cache. Consequently, the third reference, a_i, will not require accessing main memory because the data can be found in the victim cache.

However, this scheme requires a sizable victim cache for adequate performance because it must store all conflicting data blocks. Like the column-associative cache, it requires two or more access times to fetch a conflicting datum. (One cycle is needed to check the primary cache, the second to check the victim cache, and a possible third to store the datum into the primary cache.) Because of its fixed size relative to the primary direct-mapped cache, both our results and those presented by Jouppi (see Figure 3-6 in [4]) show that it is not very effective at resolving conflicts for large primary caches. On the other hand, because the area available to resolve conflicts in the column-associative cache increases with primary cache size, it resolves virtually all conflicts in large caches.

The scheme in [6] is proposed for instruction caches and uses two instruction buffers (of size equal to a cache line) between the instruction cache and the instruction register, and an instruction encoding that makes it easy to detect the presence of branch instructions in the buffers.

Kessler et al. [7] propose inexpensive implementations of set-associative caches by placing the multiple blocks in a set in sequential locations of cache memory. Tag checks, done serially, avoid the wide datapath requirements of conventional set-associative caches. The principal focus of this study was a reduction in implementation cost. The performance (measured in terms of average access time) of this scheme could often be worse than a direct-mapped cache for long strings of consecutive addresses, which occur commonly. For example, a long sequential reference stream of length equal to the cache size would fit into a direct-mapped cache, and subsequent references to any of these locations would result in a first-time hit. However, in a d-way set-associative implementation of this scheme, only 1/d of the references would succeed in the first access.

A similar problem exists in the MRU scheme proposed by So et al. [8]. The MRU scheme is a means for speeding up set-associative cache accesses. It maintains a few bits with each cache set indicating the most recently used block in the set. An access to a given set immediately reads out its MRU block, betting on the likelihood that it is the desired block. If it isn't, then an associative search accompanies a second access. Clearly, a two-way set-associative cache does not require an associative search, but does require a second access. Unfortunately, only 1/d of the references in a long sequential address stream would result in first-time hits into a d-way set-associative cache using this scheme.

A more desirable cache design would reduce the interference miss rate to the same extent as a set-associative cache, but at the same time, it would maintain the critical hit access path of the direct-mapped cache. The hash-rehash cache [5] had similar goals, but in Section 3.1 we demonstrate that it has one serious drawback. The technique introduced in Section 3 removes this drawback and largely eliminates interference misses by implementing slightly more complex control logic to make better use of the cache area. By maintaining direct-mapped cache access, these schemes do not affect the critical hit access time. With proper design, the few additional cycles required to execute the algorithms in case of a miss are balanced by the decrease in the miss rate due to fewer conflicts. This decrease in the interference miss rate is achieved not by set associativity but by exploiting temporal locality to make more efficient use of the given cache area, a notion called column associativity.

3 Column-Associative Caches

The fundamental idea behind a column-associative cache is to resolve conflicts by dynamically choosing different locations (accessed by different hashing functions) in which conflicting data can reside.


Figure 2: Comparison of column-associative and two-way set-associative caches of equal size. The conflict b[a_i] = b[a_j] is resolved by both schemes.

Figure 3: Indexing into a cache by bit selection and by bit flipping. The conflict b[a_i] = b[a_j] is resolved by the bit-flipping rehash.

Figure 2 compares the column-associative cache with a two-way set-associative cache of equal size. When presented with conflicting addresses (b[a_i] = b[a_j]), the set-associative cache resolves the conflict statically by referencing another location within the same set. On the other hand, the column-associative cache is direct-mapped, and when presented with conflicting addresses, a different hashing function is dynamically applied in order to place or locate the data in a different set. One simple choice for this other hashing function is bit selection with the highest-order bit inverted, which we term bit flipping. If b[a] = 010, then f[a] = 110, as illustrated in Figure 3. Therefore, conflicts are resolved not within a set but within the entire cache, which can be thought of as a column of sets, hence the name column associativity.
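A minimal sketch of the two hashing functions (our own illustration, reusing the OFFSET_BITS and NUM_SETS parameters assumed in the earlier sketch): bit flipping is just bit selection with the top index bit inverted.

```python
def b(addr: int) -> int:
    """First hashing function: bit selection of the index field."""
    return (addr >> OFFSET_BITS) & (NUM_SETS - 1)

def f(addr: int) -> int:
    """Second hashing function: bit flipping. XORing with the top
    index bit turns b[a] = 010 into f[a] = 110, as in Figure 3."""
    return b(addr) ^ (NUM_SETS >> 1)
```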

Column associativity can obviously improve upon direct-mapped caching by resolving a large number of the conflicts encountered in an address stream. In addition, as long as the control logic used to implement column associativity is simple and fast, the benefits of direct-mapped caches over set-associative caches (as discussed in Section 1) are maintained, especially the lower hit access time. Because hits are much more frequent than misses, the extra cycles required to implement the column-associative algorithm on a miss can be easily balanced by the small improvement in hit access time on every hit, resulting in a smaller average memory access time when compared to a two-way set-associative cache. Of course, column associativity could be extended to emulate degrees of associativity higher than two, but it is likely that the complexity of implementing such an algorithm would add little to the performance and might even degrade it.

Additionally, the column-associative implementation uses sets within the cache itself to store conflicting data; only a simple rehash of the address is required to access this data. By comparison, a victim-cache implementation requires an entirely separate, fully-associative cache to store the conflicting data. Not only does the victim cache consume extra area, but it can also be quite slow due to the need for an associative search and for the logic to maintain a least-recently-used replacement policy.

Figure 4: Appending the high-order bit of the index to the tag. This technique is necessary when bit flipping is implemented.

Of course, storing conflicting data within the cache, instead of in a separate victim cache, very likely results in the loss of useful data, but this effect (henceforth referred to as clobbering) can be minimized as discussed in Section 3.2.

The remainder of our discussion proceeds in two steps. First, we describe a basic system that uses multiple hashing functions and discuss its drawbacks. Then, we add rehash bits to this design to alleviate its problems.

3.1 Multiple Hashing Functions

Like the hash-rehash cache in [5], column-associative caches use two (or possibly more) distinct hashing functions, h1 and h2, to access the cache, where h1[a] denotes the index obtained by applying hashing function h1 to the address a. If h1[a_i] indexes to valid data, a first-time hit occurs; if it misses, h2[a_i] is then used to access the cache. If a second-time hit occurs, the data is retrieved. The data in the two cache lines are then swapped so that the next access will likely result in a first-time hit. However, if the second access also misses, then the data is retrieved from main memory, placed in the cache line indexed by h2[a_i], and swapped with the data in the first location.

Using two or more hashing functions mimics set associativity, because for conflicting addresses (that is, a_i and a_j for which h1[a_i] = h1[a_j]), rehashing a_j with h2 resolves the conflict with a high probability (that is, h1[a_i] ≠ h2[a_j]). However, notice that the hit access time of a first-time hit remains unchanged. For simplicity and for speed, the first-time access is performed with bit selection (that is, h1 = b), and bit flipping is often used for h2 (that is, h2 = f).

The use of bit flipping as a second hashing function results in a potential problem. Consider two addresses, a_i and a_x, which differ only in the high-order bit of the index field (that is, f[a_i] = b[a_x]). These two addresses are distinct; however, the tag fields are identical, thus a rehash access with f[a_i] results in a hit with a data block that should only be accessed by b[a_x]. This is unacceptable, because a data block must have a one-to-one correspondence with a unique address. For addresses whose indexes are the same and which thus reference the same set, the tags are compared in order to determine whether an address should access the data block. This suggests a simple solution to the situation: appending the high-order bit of the index field to the tag, as illustrated in Figure 4. The rehash with f[a_i] will correctly fail because the data block is once again referenced by a unique address, a_x. This scheme is assumed to be in place whenever bit flipping is used.
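Putting the access algorithm of this section together with the extended tag of Figure 4, a hash-rehash lookup might be sketched as follows (our illustration, building on the earlier sketches; the Line record and the fetch callback are hypothetical stand-ins for real cache-line storage and the memory interface):

```python
from collections import namedtuple

Line = namedtuple("Line", "valid tag data")

def full_tag(addr: int) -> int:
    # Extended tag: the ordinary tag with the high-order index bit
    # appended (Figure 4), so a rehashed block cannot hit falsely.
    return addr >> (OFFSET_BITS + INDEX_BITS - 1)

def hash_rehash_access(cache, addr, fetch):
    """cache: list of Line, one per set; fetch: services a miss."""
    i1, i2 = b(addr), f(addr)
    tag = full_tag(addr)
    if cache[i1].valid and cache[i1].tag == tag:
        return cache[i1].data                        # first-time hit
    if cache[i2].valid and cache[i2].tag == tag:     # second-time hit:
        cache[i1], cache[i2] = cache[i2], cache[i1]  # swap for locality
        return cache[i1].data
    cache[i2] = Line(True, tag, fetch(addr))         # second-time miss:
    cache[i1], cache[i2] = cache[i2], cache[i1]      # clobber f[a], swap
    return cache[i1].data
```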


Figure 5: Decision tree for the hash-rehash algorithm.

mnemonic    action                                          cycles
b[a]        bit-selection access                            1
f[a]        bit-flipping access                             1
swap        swap data in sets accessed by b[a] and f[a]     2
clobber2    get data from memory, place in set f[a]         M
clobber1    get data from memory, place in set b[a]         M
Rbit=1?     check if set b[a] is a rehashed location        0

Table 1: Decision tree mnemonics and cycle times for each action.

To illustrate the operations more clearly, the hash-rehash algorithm has been expressed as the decision tree in Figure 5, simply a translation of the verbal description of the hash-rehash algorithm into a tree structure. Table 1 explains the mnemonics used in this decision tree and in the others which are introduced in this paper. The table also includes the number of cycles required to complete an action, which is necessary for the calculation of average access time.

In the decision tree, note that after a first-time miss and a second-time hit, which require two cycles to complete, a swap is performed. According to Table 1, the swap requires an additional two cycles to complete. The design requirements for accomplishing a swap in two cycles are discussed in Section A of the appendix. However, given an extra buffer for the cache, this swap need not involve the processor, which may be able to do other useful work while waiting for the cache to become available again. If this is the case half of the time, then the time wasted by a swap is one cycle. Therefore, for all decision trees in this paper, we assume that a swap adds only one cycle to the execution time. (However, we provide access time results for both one- and two-cycle swaps.) Thus, the three cycles indicated in the swap branch of Figure 5 result from one cycle for the initial cache access, one cycle for the rehash access, and one cycle wasted during the swap.

Unfortunately, the hash-rehash cache has a serious drawback, which often reduces its performance to that of a direct-mapped cache, as can be seen in Section 5.3. The source of its problems is that a rehash is attempted after every first-time miss, which can replace potentially useful data in the rehashed location, even when the primary location had an inactive block. Consider the following reference pattern: a_i a_j a_x a_j a_x a_j ..., where the addresses a_i and a_j map into the same cache location with bit selection, and a_x is an address which maps into the same location with bit flipping (that is, where b[a_i] = b[a_j], and f[a_i] = b[a_x]). This situation is illustrated in Figure 6. After the first two references, both the hash-rehash and the column-associative algorithms will have the data referenced by a_j (which will be called j for brevity) and the data i in the non-rehashed and rehashed locations, respectively.

Figure 6: The potential for secondary thrashing in a reference stream of the form a_i a_j a_x a_j a_x a_j .... Different fonts are used to indicate different index fields and tags. In this case, b[a_i] = b[a_j] and f[a_i] = b[a_x].

When the next address, a_x, is encountered, both algorithms attempt to access the set b[a_x], which contains the rehashed data i. But when this first-time miss occurs, the hash-rehash algorithm next tries to access f[a_x], which results in a second-time miss and the clobbering of the data j. This pattern continues as long as a_j and a_x alternate; the data referenced by one of them is clobbered as the inactive data block i is swapped back and forth but never replaced. We will refer to this negative effect as secondary thrashing in the future.

The following section describes how the use of a rehash bit can lessen the effects of these limitations.

3.2 Rehash Bits

The key to implementing column associativity effectively is inhibiting a rehash access if the location reached by the first-time access itself contains a rehashed data block. This idea can be implemented as follows. Every cache set contains an extra bit which indicates whether the set is a rehashed location, that is, whether the data in this set is indexed by f[a]. This algorithm, which is illustrated as a decision tree in Figure 7, is similar to that of the hash-rehash cache; however, the key difference lies in the fact that when a cache set must be replaced, a rehashed location is always chosen, immediately if possible. Thus, if the first-time access is a miss, then the rehashed-location bit (or rehash bit for short) of that set is checked (Rbit=1?, as listed in Table 1). If it has been set to one, then no rehash access will be attempted, and the data retrieved from memory is placed in that location. Then the rehash bit is reset to zero to indicate that the data in this set is to be indexed by b[a] in the future. On the other hand, if the rehash bit is already a zero, then upon a first-time miss the rehash access will continue as described in Section 3.1. Note that if a second-time miss occurs, then the set whose data will be replaced is again a rehashed location, as desired.
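The following Python sketch (ours) mirrors this decision tree, building on the hash-rehash sketch above; the rbit field and the fetch callback are illustrative, not a prescribed implementation:

```python
CLine = namedtuple("CLine", "valid tag data rbit")   # rbit: rehash bit

def column_assoc_access(cache, addr, fetch):
    """One access to a column-associative cache. At start-up every
    line should have rbit = 1 (empty lines count as rehashed)."""
    i1, i2 = b(addr), f(addr)
    tag = full_tag(addr)
    line1 = cache[i1]
    if line1.valid and line1.tag == tag:
        return line1.data                                 # first-time hit
    if line1.rbit:                                        # Rbit = 1?
        cache[i1] = CLine(True, tag, fetch(addr), False)  # clobber1
        return cache[i1].data
    line2 = cache[i2]
    if line2.valid and line2.tag == tag:                  # second-time hit:
        cache[i1] = CLine(True, tag, line2.data, False)   # swap so the MRU
        cache[i2] = line1._replace(rbit=True)             # data is primary
        return cache[i1].data
    cache[i1] = CLine(True, tag, fetch(addr), False)      # clobber2 + swap:
    cache[i2] = line1._replace(rbit=True)                 # old primary data
    return cache[i1].data                                 # becomes rehashed
```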

Of course, at start-up (or after a cache flush), all of the empty cache locations should have their rehash bits set to one. The reason that this algorithm can correctly replace a location with a set rehash bit immediately after a first-time miss is based on the fact that bit flipping is used as the second hashing function. Given two addresses a_i and a_x, if f[a_i] = b[a_x], then it must be true that f[a_x] = b[a_i]. Therefore, if a_i accesses a location using b[a_i] whose rehash bit is set to one, then there are only two possibilities:

1. The accessed location is an empty location from start-up, or

2. there exists a non-rehashed location at f[a_i] (or b[a_x]) which previously encountered a conflict and placed the data in its rehashed location, f[a_x].


Figure 7: Decision tree for a column-associative cache.


In both cases, it makes sense to replace the location reached during the first-time access that had its rehash bit set to one.

However, it must be proven that a third possibility does not exist; namely, that the location b[a_i] has its rehash bit set to one, but the data referenced by a_i actually resides in f[a_i] simultaneously. Consider the actions taken by the algorithm when one of the conditions precedes the other. First, if b[a_i] is a rehashed location, then any first-time miss results in the immediate clobbering of that location and the resetting of the rehash bit to zero. Therefore, it is not possible for the placement of the data into f[a_i] to follow this condition.

On the other hand, if the data referenced by a_i already resides in f[a_i] due to a conflict, then the rehash bit of b[a_i] must be a zero, because it contains the most recently accessed data. The only way to change this bit is if b[a_i] were to be used as a rehashed location in order to resolve a different conflict. However, because bit flipping is the rehashing function, the only location for which this situation can occur is f[a_i] itself. A first-time access to this location, though, would automatically clobber the rehashed data. Therefore, it is clear that the two conditions for this third possibility can never occur simultaneously. This important property could not be utilized in the column-associative algorithm if bit flipping were not the second hashing function or if more than two hashing functions were included.

Like the hash-rehash cache, the column-associative algorithm attempts to exploit temporal locality by swapping the most recently accessed data into the non-rehashed location, if a rehash is indeed attempted. The use of the rehash bit helps utilize cache area more efficiently because it immediately indicates whether a location is rehashed and should be replaced in preference over a non-rehashed location.

In addition to limiting rehash accesses and clobbering, the rehash bits in the column-associative cache eliminate secondary thrashing. Referring to the reference stream a_i a_j a_x a_j a_x a_j ... in Figure 6, the third reference accesses b[a_x], but it finds the rehash bit set to one. Thus, the data i is replaced immediately by x, the desired action. Of course, this column-associative cache suffers thrashing if three or more conflicting addresses alternate, as in a_i a_j a_x a_i a_j a_x a_i ..., but this case is much less probable than two alternating addresses.

4 A Simple Analytical Model for Column-Associative Caches

We have developed a simple analytical model for the column-associative cache that predicts the percentage of interference misses removed from a direct-mapped cache using only one measured parameter, the size of the program's working set, from an address trace. Our model builds on the self-interference component of the direct-mapped cache model of Agarwal, Horowitz, and Hennessy [9], and it estimates the percentage of interference misses removed by computing the percentage of cache block conflicts removed by the rehash algorithm. Because the behavior is captured in a simple, closed-form expression, our model yields valuable insights into the behavior of the column-associative cache. Validations against empirically derived cache miss rates suggest that the model's predictions are fairly accurate as well.

Like the self-interference model in [9], the percentage reduction in cache block conflicts in the column-associative cache is captured by two parameters: S and u. The parameter S represents the number of cache sets; in direct-mapped caches, the product of S and the block size yields the cache size. The parameter u denotes the working-set size of the program, and must be measured from an address trace of the program. The working set of a program is the set of distinct blocks a program accesses within some interval of time.

The model makes the assumption that blocks have a uniform probability of mapping to any cache set, and that the mappings for different blocks are independent of each other. The same assumption is also made for the rehash accesses. This assumption is commonly made in cache modeling studies [10, 11, 9]. Although this assumption makes the models generally overestimate miss rates, its effect is less severe when we are interested in the ratios of the number of conflicting blocks in direct-mapped caches and column-associative caches.

A detailed derivation of the model appears in [12]; this section summarizes the major results. Let c_d denote the number of conflicting blocks in a direct-mapped cache, and c_ca the corresponding number of conflicting blocks in a column-associative cache. Blocks are said to conflict when multiple blocks from the working set of a program map to a given cache set. In a column-associative cache, conflicting blocks are blocks that conflict even after a rehash is attempted. Section 5.1 provides further discussion on the notion of conflicts.

Expressions are derived in [12] for the number of conflicting blocks in direct-mapped and column-associative caches in terms of P(d), which is the probability that d program blocks (out of a total of u) map to a given cache set. Because blocks are assumed to map with equal likelihood to any cache set, the distribution of the number of blocks in a cache set is binomial, which yields

P(d) = \binom{u}{d} \left(\frac{1}{S}\right)^{d} \left(1 - \frac{1}{S}\right)^{u-d}    (1)

The following are expressions for the number of conflicting blocks:

c_d = u - S P(1)
c_{ca} = u - S P(1) - S P(2) (1 + P(0) - P(1) - P(2))


We estimate the percentage of interference misses removed by the percentage reduction in the number of conflicting blocks. Our validation experiments indicate that this is a good approximation. Thus, the percentage of interference misses removed is

\frac{c_d - c_{ca}}{c_d} = \frac{S P(2) (1 + P(0) - P(1) - P(2))}{u - S P(1)}    (2)

It is instructive to take the first-order approximation of the expression in Equation 2 after substituting for P(d) from Equation 1 and simplifying the resulting expression. The first-order approximation is valid when S ≫ u and u ≫ 1, which allows us to use (1 - 1/S)^{u-1} ≈ (1 - u/S). Proceeding along these lines, we obtain

\frac{c_d - c_{ca}}{c_d} \approx 1 - \frac{2u}{S}    (3)

It is easy to see from the above equation that the percentage of conflicts removed by rehashing will approach unity as the cache size is increased. Similarly, roughly 50% of the conflicts are removed when the cache is four times larger than the working set of the program.
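For illustration, the model is easy to evaluate directly; the sketch below (ours) computes Equation 2 exactly and compares it with the first-order approximation of Equation 3:

```python
from math import comb

def pct_removed(u: int, S: int) -> float:
    """Equation 2: fraction of direct-mapped conflicts removed by
    column associativity, for working-set size u and S cache sets."""
    P = lambda d: comb(u, d) * (1 / S) ** d * (1 - 1 / S) ** (u - d)
    c_d = u - S * P(1)          # conflicting blocks, direct-mapped
    return S * P(2) * (1 + P(0) - P(1) - P(2)) / c_d

def pct_removed_approx(u: int, S: int) -> float:
    """Equation 3: first-order approximation, valid for S >> u >> 1."""
    return 1 - 2 * u / S

# Example: the approximation predicts that a cache four times the
# working set removes about half of the conflicts, as noted above.
print(pct_removed(392, 4 * 392), pct_removed_approx(392, 4 * 392))
```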

To demonstrate the accuracy of the model, we plot in Figure 13 the measured values of the average percentages of interference misses removed and the values obtained using Equation 2 for our traces. The predictions for each of the individual traces are also fairly accurate, as displayed in Figures 8 and 9. Both the model and the simulations use a block size of 16 bytes. The analytical model uses only one parameter, the working-set size u, measured from each trace. Table 3 shows the working-set sizes for each of our traces.

5 Results

This section presents the data obtained through simulation of the various caches and an analysis of these results. First, the metrics which have been used to evaluate the performance of the caches must be described.

5.1 Cache Performance Metrics

We use three cache performance metrics in our results: the cache miss rate, the percentage of interference misses removed, and the average memory access time.

The miss rate is the ratio of the number of misses to the total number of references.

The percentage of interference misses removed is the percentage by which the number of interference misses in the cache under consideration is reduced over those in a direct-mapped cache. An interference miss is defined as a miss that results when a block that was previously displaced from the cache is subsequently referenced. In a single-processor environment, the total number of misses minus the misses due to first-time references is the number of interference misses.²

² A similar parameter was used by Jouppi [4] as a useful measure of the performance of victim caches. We note that our interference metric measures the sum of the intrinsic interference misses and the extrinsic interference misses in the classification of Agarwal, Horowitz, and Hennessy [9], and the sum of the capacity, conflict, and context-switching misses in the terminology of Hill and Smith [13].

This metric is particularly useful for determining the success of a particular scheme because all cache implementations must share the same compulsory or first-time miss rate for a given reference stream, but they may have different interference miss rates. The percentage of interference misses removed is calculated by the equation

\frac{\text{direct miss rate} - \text{miss rate}}{\text{direct miss rate} - \text{compulsory miss rate}} \times 100\%

where, for a given address trace and cache size, the miss rate is that of the particular cache design, and the direct miss rate is that of a direct-mapped cache of equal size. The compulsory miss rate is the ratio of unique references to total references for that trace.
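A one-line sketch (ours) of this metric, with all rates expressed as fractions of total references:

```python
def pct_interference_removed(miss_rate, direct_rate, compulsory_rate):
    """Percentage of interference misses removed relative to a
    direct-mapped cache of equal size."""
    return 100.0 * (direct_rate - miss_rate) / (direct_rate - compulsory_rate)
```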

Finally, the average memory access time is defined as the average number of cycles required to complete one reference in a particular address stream. This metric is useful in assessing the performance of a specific caching scheme because although a particular cache design may demonstrate a lower miss rate than a direct-mapped cache, it may do so at the expense of the hit access time. As mentioned earlier, our graphs include access time results for both one-cycle and two-cycle swaps.

Let the cache access time for a hit be one cycle, and let M represent the number of cycles required to service a miss from the main memory (in our simulations, M = 20). If R is the total number of references in the trace, H1 is the total number of hits on a first-time access, and H2 is the total number of hits on a second-time access, then the average memory access time for the various schemes can be computed from the decision trees of Section 3 as shown below.

For direct-mapped caches, the access time is one for hits, and one plus M for misses. Thus,

t_{ave} = \frac{1}{R} \left[ H_1 + (M + 1)(R - H_1) \right]

For hash-rehash caches, the access time is one for first-time hits, 3 for rehash hits (every first-time miss is followed by a rehash), and (M + 3) otherwise:

t_{ave} = \frac{1}{R} \left[ H_1 + 3H_2 + (M + 3)(R - H_1 - H_2) \right]

For column-associative caches, we need an additional parameter, R2, which is the total number of second-time accesses. (Recall that second-time accesses are attempted only when the rehash bit is zero.) Thus the access time is one for first-time hits, and three for the H2 hits during a rehash attempt. If a rehash is not attempted, then (M + 1) cycles are spent. Rehash attempts that miss suffer a penalty of (M + 3) cycles. Therefore,³

t_{ave} = \frac{1}{R} \left[ H_1 + 3H_2 + (M + 1)(R - H_1 - R_2) + (M + 3)(R_2 - H_2) \right]

The simulator described in the next section measures R, R2, H1, and H2 for each of the cache types, and it derives average memory access times from the above equations.
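The three expressions translate directly into code; the sketch below (ours) reproduces them with the paper's miss penalty of M = 20 cycles:

```python
M = 20  # cycles to service a miss from main memory, as in the simulations

def t_direct(R, H1):
    """Direct-mapped: 1 cycle on a hit, M + 1 on a miss."""
    return (H1 + (M + 1) * (R - H1)) / R

def t_hash_rehash(R, H1, H2):
    """Hash-rehash: every first-time miss triggers a rehash."""
    return (H1 + 3 * H2 + (M + 3) * (R - H1 - H2)) / R

def t_column_assoc(R, R2, H1, H2):
    """Column-associative: a rehash is attempted only for the R2
    accesses whose first-time location has a zero rehash bit."""
    return (H1 + 3 * H2
            + (M + 1) * (R - H1 - R2)     # no rehash attempted (clobber1)
            + (M + 3) * (R2 - H2)) / R    # rehash attempted and missed
```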

³ The cycles per instruction (or CPI) assuming single-cycle instruction execution can be calculated easily from the average access time. For a unified instruction and data cache with a single-cycle access time, the CPI with a 100% hit rate is (1 + l), where l is the fraction of instructions that are loads or stores. In the presence of cache misses, however, the average access time becomes t_ave, and the CPI becomes (1 + l) t_ave.


name      trace description
LISP0     LISP runs of BOYER (a theorem prover)
DEC0.1    Behavioral simulator of cache hardware, DECSIM
SPIC0     SPICE simulating a 2-input tristate NAND buffer
IVEX0     Interconnect verify, a DEC program checking net lists in a VLSI chip
FORL0     FORTRAN compile of LINPACK

Table 2: Description of uniprocessor traces used during simulation.

trace       u      unique     total     compulsory miss rate (%)
LISP0      392      1,789    262,760        0.6808
DEC0.1     463      2,418    334,775        0.7223
SPIC0      740      2,834    358,168        0.7912
IVEX0      774     11,087    307,172        3.6097
FORL0      826      6,787    314,110        2.1607
MUL6.0               5,267    400,698        1.3145

Table 3: Number of references (both instructions and data) and compulsory miss rate for each of the address traces simulated. The block size for measuring u and unique is set to 16 bytes (four words).

5.2 Simulator and Trace Descriptions

We wrote trace-driven simulators for direct-mapped, set-associative, victim, hash-rehash, and column-associative caches. Multiprogrammed simulations assume that a process identifier is associated with each reference to distinguish between the data of different processes. All caches are assumed to be combined instruction and data caches.

The traces used in this study come from the ATUM experiment of Sites and Agarwal [14]. The ATUM traces comprise realistic workloads and include both operating system and multiprogramming activity. The five uniprocessor traces, derived from large programs running under VMS, are described in Table 2. We also use a multiprogramming trace called MUL6.0, which includes activity from six processes including a FORTRAN compile, a directory search, and a microcode address allocator. Each trace length is on the order of a half million references. We believe these lengths are adequate for our purposes, since we explicitly subtract the number of first-time misses and present the percentage of interference misses removed, and because it is possible to differentiate the performance of the various caching methods without resorting to measurement methods that yield cache miss rates with a degree of accuracy exceeding the first or second decimal place. The compulsory miss rates and other parameters for these traces are listed in Table 3. In the table, u is the average number of unique blocks in 10,000-reference windows, while unique is the total number of unique blocks in the entire trace.

5.3 Measurements and Analysis

In this section, the results of the trace-driven simulations are plotted and interpreted. Before introducing the plots, a few of their features must be explained. If the miss rate of a cache happens to be worse than that of a direct-mapped cache for the particular cache size, as is occasionally the case for a hash-rehash cache, then the percentage of interference misses removed becomes a negative quantity. On the graph this is instead indicated by a point at zero percent.⁴

The victim cache size has been set to 16 entries. This is based on simulation data which suggests that the removal of conflicts quickly saturates beyond this size. In addition, remember that each victim-cache entry is a complete cache line, storing the tag, status bits, and the data block, which contains four words in these simulations.

5.3.1 Miss Rates and Interference Misses Removed

LISP0 and DEC0.1. The results for the LISP0 and DEC0.1 traces are very similar, so only LISP0 results are plotted in Figure 8. It is evident that all of the cache designs exhibit much lower miss rates than the direct-mapped cache. The lowest miss rates are achieved by the two-way set-associative and the column-associative caches. The victim and hash-rehash caches have higher miss rates.

A striking feature of the miss rate plots is the relationship between the direct-mapped and hash-rehash caches. Whenever doubling the cache size results in a sharp decrease in the direct-mapped miss rate, the same change in cache size yields a sharp and similarly sized increase of the hash-rehash miss rate. This effect makes sense intuitively: a hash-rehash cache is designed to resolve conflicts through the use of alternate cache locations. It is successful as long as the number of conflicts decreases only slightly as the cache size increases. However, if an increase in cache size itself suddenly removes a large portion of the conflicts, then the hash-rehash algorithm clobbers many locations and suffers a sharp drop in the second-time hit rate because it is attempting to resolve conflicts which no longer exist. Notice that the column-associative cache does not suffer from this degradation because its access algorithm is designed specifically to alleviate the problems of clobbering and low second-time hit rates.

Referring to the percentages of interference misses removed in Figure 8, notice that the dashed curve corresponding to the predictions of the model is very close to the curve obtained from simulations. The LISP0 trace has a small working set compared to the other traces (see Table 3), and therefore the percentage of interference misses removed quickly approaches 100% for all but the victim cache, which is a phenomenon readily explained by the approximate analytical expression for this metric: (1 - 2u/S).

SPIC0, IVEX0, and FORL0. The results for these traces are also similar enough to be grouped together. The data for the SPIC0 trace has been plotted in Figure 9. Nearly all of the results from the previous section apply to the simulations with these three traces, but there are several important differences. Because the working-set sizes of these traces are larger than the LISP0 trace, the percentages of interference misses removed by column associativity start at much lower values and approach 100% more slowly. Because the victim cache is much more sensitive to working-set size, it does not attain the same percentages found for LISP0 and DEC0.1; for these traces, the victim cache lies around 25% or less for this metric. Recall that the victim cache size remains constant, while the column-associative and the set-associative caches can devote larger areas to resolve conflicts as cache size increases.

⁴ This is why the points for the hash-rehash cache are not connected in the graphs showing percentage of interference misses removed.

⁵ The addition of one, high-order bit to the index could separate two groups of addresses which conflict often because they differ for the first time in that bit.


Figure 8: Miss rates and percentages of interference misses removed versus cache size for LISP0. Block size is 16 bytes.

Figure 9: Miss rates and percentages of interference misses removed versus cache size for SPIC0. Block size is 16 bytes.


Figure 10: Miss rates and percentages of interference misses removed versus cache size for MUL6.0. Block size is 16 bytes.

The plots for SPIC0 in Figure 9 reveal another interesting fact: the column-associative cache outperforms the two-way set-associative cache for some of the cache sizes. A hypothesis that explains this behavior is based on the fact that when comparing the two caches at an equal cache size, the set-associative cache has only half that number of cache sets. As seen before, doubling the cache size and thus adding a high-order bit to the index may eliminate a large number of conflicts that have been occurring because many addresses differ for the first time in that bit.⁵ For example, consider the addresses 0001111, 0101111, and 1011111. All three result in multiple conflicts (thrashing) if only the four low-order bits are used as the index. This is a cache size of 2⁴, or 16, for the column-associative cache, but the total cache size is 32 for the two-way set-associative cache, and it still exhibits thrashing. Note that both caches have 16 sets. A 32-set column-associative cache, however, uses five bits for the index. In this case, the conflicts between 0101111 and 1011111 are automatically eliminated because of the different fifth bits.

MUL6.0. The miss rates and percentages of interference misses removed for the multiprogramming trace are plotted in Figure 10. Once again, many of the observations made for the other trace results apply to MUL6.0. Perhaps the most telling result is the relatively poor performance of the victim cache. Its miss rate is virtually the same as that of the direct-mapped cache (for cache sizes greater than 2K blocks). The large working sets of multiprogramming workloads make the fixed size of the victim cache a serious liability. The larger available area for storing conflicts in the column-associative cache is clearly a big win in this situation.

5.3.2 Average Memory Access Times

Two key factors must be considered when interpreting the access time data. First, although the average memory access times of set-associative caches are in reality increased due to their higher hit access times, the graphs in this paper assume their hit access times are the same as that of direct-mapped caches. If realistic access times of two-way set-associative caches are considered, their average memory access times might well become greater than those of column-associative caches. (This is why the corresponding curves are labeled "Ideal.")

Second, the average memory access time is very sensitive to the time required to service a miss (M). The results assume M = 20 cycles. For larger (and still reasonable) miss penalties, the designs such as column-associative caches which reduce the number of accesses to main memory (R - H1 - H2) will look even more impressive than indicated by our results.

The results for the LISP0 and SPIC0 traces are presented together in Figure 11. As before, DEC0.1 is similar to LISP0, while IVEX0 and FORL0 are similar to SPIC0. All the average memory access time plots are largely similar in shape to the miss rate plots, which is expected, because t_ave is a linear function of the miss rate.

The graph for LISP0 shows that column associativity achieves much lower average memory access times than a direct-mapped cache. The improvement is about 0.3 cycles for most cache sizes. For SPIC0, the column-associative cache exceeds 0.2-cycle improvements only for small caches. This fact is confirmed when the miss rate plot is considered: the direct-mapped interference miss rate is not much higher than the compulsory miss rate, unlike the case for LISP0. The results for MUL6.0 are largely similar: the column-associative cache saves about 0.2 cycles over direct-mapped caches, and the two-way set-associative cache saves a further 0.1 cycle, when the caches are less than 4K blocks. (With a 16-byte block, the cache size is 64K bytes.) The savings are smaller for larger caches.

Perhaps most important, however, is the fact that the column-associative cache achieves an average access time close to the two-way set-associative cache, even though the hit access time of the set-associative cache was (unrealistically) kept the same as that of a direct-mapped cache.

5.4 Summary

This section presents data for each of the metrics averaged over all of the traces. The resulting plots serve as excellent examples for reviewing the major points made in this section.

When the miss rates of all six traces are averaged for each cache size, the plot in Figure 12 is the result. The direct-mapped miss rate is the baseline for comparison and falls quickly from 6.0% to 2.0%, before settling toward the average compulsory miss rate of about 1.5%.


Figure 11: Average memory access times (in cycles) versus cache size for LISP0 and SPIC0. Block size is 16 bytes. The hit access time of two-way set-associative caches is assumed to be the same as that of a direct-mapped cache.

The other cache designs can be split into two groups, based not only on their similar miss rate curves but also on the relationships among their access algorithms. The first group contains the hash-rehash cache and the victim cache, which have similar control algorithms. The hash-rehash cache is usually an improvement upon direct-mapped caching: the miss rate drops more quickly from 6.0% to about 1.5%. However, at the transition point, the hash-rehash miss rate increases about as much as the direct-mapped miss rate decreases. This is due to the fact that once the cache size exceeds the working-set size, the interference miss rate drops markedly. The rehash accesses performed by the hash-rehash algorithm now are more likely to clobber live data than to resolve conflicts. The victim cache does not suffer from this effect, because it is designed to alleviate the main problems with the hash-rehash algorithm: clobbering and low second-time hit rates.

The second group consists of the two-way set-associative and the column-associative caches. The miss rates of these caches are almost 2.0% lower than direct-mapped miss rates for small caches, just under 1.0% near the transition, and right at the compulsory miss rate for large caches. As predicted in Section 3, the column-associative cache achieves two-way set-associative miss rates.

The plot in Figure 13 shows the average percentages of interference misses removed. (This average does not include the MUL6.0 numbers, so that we could compare the simulation averages with the model.) The curves for set-associative and column-associative caches are almost identical, starting at about 40% and climbing to 100% when the cache size reaches 256K blocks. As predicted in Section 2, the performance of the victim cache relative to the column-associative cache degrades with cache size. Finally, the dashed curve for the model is seen to be surprisingly close to simulation results when the individual trace anomalies are averaged out.

The average memory access time (t_ave) data for the six traces have been averaged and plotted in Figure 14. Based on this average plot and on most of the other data, the column-associative cache appears to be a good choice under most operating conditions. In this example, t_ave is reduced by over 0.2 cycles for small to moderate caches, and by about 0.1 cycles for moderate to large caches.

Figure 12: Miss rates versus cache size, averaged over all six traces. Block size is 16 bytes.

6 Conclusions

The goal of this research has been to develop area-efficient cache control algorithms for improved cache performance. The main metrics used to evaluate cache performance have been the miss rate and average memory access time; unfortunately, minimizing one of them usually affects the other adversely. The optimal cache design would remove interference misses as well as a two-way set-associative cache but would maintain the fast hit access times of a direct-mapped cache.

Two previous solutions which attempted to achieve this are the hash-rehash cache and the victim cache. Although some performance gain is achieved by both these schemes, the success of the hash-rehash cache is very erratic and is hampered by clobbering and low second-time hit rates. The drawbacks of the victim cache include the need for a large, fully-associative buffer and its lack of robust performance (in terms of its miss rate) as the size of the primary cache increases.

This paper proposed the design of a column-associative cache that has the good hit access time of a direct-mapped cache and the high hit rate of a set-associative cache.


Figure 13: Percentages of interference misses removed versus cache size, averaged over the single-process traces. Block size is 16 bytes.

Figure 14: Average memory access times versus cache size, averaged over all six traces. Block size is 16 bytes. The hit access time of two-way set-associative caches is assumed to be the same as that of a direct-mapped cache.

The fundamental idea behind column associativity is to resolve conflicts by dynamically choosing different locations in which the conflicting data can reside. The key aspect which distinguishes the column-associative cache is the use of a rehash bit to indicate whether a cache set is a rehashed location.

Trace-driven simulations confirm that the column-associative cache removes almost as many interference misses as does the two-way set-associative cache. In addition, the average memory access times for this cache are close to that of an ideal two-way set-associative cache, even when the access time of the two-way set-associative cache is assumed to be the same as that of a direct-mapped cache. Finally, the hardware costs of implementing this scheme are minor, and almost negligible if the state represented by the rehash bit could be encoded into the existing status bits of many practical cache designs.

7 Acknowledgments

The research reported in this paper is funded by NSF grant # MIP-9012773 and DARPA contract # N00014-87-K-0825.

References

[1] Alan Jay Smith. Cache Memories. Computing Surveys, 14(4):473-530, September 1982.

[2] Steven Przybylski, Mark Horowitz, and John Hennessy. Performance Tradeoffs in Cache Design. In Proceedings of the 15th Annual Symposium on Computer Architecture, pages 290-298. IEEE Computer Society Press, June 1988.

[3] Mark D. Hill. A Case for Direct-Mapped Caches. IEEE Computer, 21(12):25-40, December 1988.

[4] Norman P. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. In Proceedings of the 17th Annual Symposium on Computer Architecture, pages 364-373. IEEE Computer Society Press, August 1990.

[5] Anant Agarwal, John Hennessy, and Mark Horowitz. Cache Performance of Operating Systems and Multiprogramming. ACM Transactions on Computer Systems, 6(4):393-431, November 1988.

[6] Matthew K. Farrens and Andrew R. Pleszkun. Improving Performance of Small On-Chip Instruction Caches. In Proceedings of the 16th Annual Symposium on Computer Architecture, pages 234-241. IEEE Computer Society Press, May 1989.

[7] R. E. Kessler, Richard Jooss, Alvin Lebeck, and Mark D. Hill. Inexpensive Implementations of Set-Associativity. In Proceedings of the 16th Annual Symposium on Computer Architecture, pages 131-139. IEEE Computer Society Press, May 1989.

[8] Kimming So and Rudolph N. Rechtschaffen. Cache Operations by MRU Change. Technical Report RC 11613, IBM T. J. Watson Research Center, November 1985.

[9] Anant Agarwal, Mark Horowitz, and John Hennessy. An Analytical Cache Model. ACM Transactions on Computer Systems, 7(2):184-215, May 1989.

[10] Alan Jay Smith. A Comparative Study of Set Associative Memory Mapping Algorithms and Their Use for Cache and Main Memory. IEEE Transactions on Software Engineering, SE-4(2):121-130, March 1978.

[11] Dominique Thiebaut and Harold S. Stone. Footprints in the Cache. ACM Transactions on Computer Systems, 5(4):305-329, November 1987.

[12] Anant Agarwal and Steven Pudar. Column-Associative Caches: A Technique for Reducing the Miss Rate of Direct-Mapped Caches. Technical Report LCS TM 484, MIT, March 1993.

[13] M. D. Hill and A. J. Smith. Evaluating Associativity in CPU Caches. IEEE Transactions on Computers, 38(12):1612-1630, December 1989.

[14] Richard L. Sites and Anant Agarwal. Multiprocessor Cache Analysis using ATUM. In Proceedings of the 15th International Symposium on Computer Architecture, pages 186-195, New York, June 1988. IEEE.

A Cache Implementation Example

Figure 15: Column-associative cache implementation. Every cache set must have a rehash bit appended to it.

The datapaths required for a column-associative cache are displayed in Figure 15. Since the rehashing function used is bit flipping, the functional block f(x) is simply an inverter. In order to accomplish the swap of conflicting data, a data buffer is required. All buffers are assumed to be edge-triggered. An n-bit multiplexer can be used to switch the current contents of the memory address register (MAR) between the two conflicting addresses. A MUX is also needed at the input of the data buffer, so that it may read data from either the swap buffer or the data bus. Finally, a rehash bit is added to each cache set; when this bit is read out into the data buffer, it then serves as a control signal. In some implementations the rehash state can be encoded using the existing state bits associated with each cache line, thus eliminating the need for an extra bit.

First-time hits proceed as in direct-mapped caches; however, if there is a first-time miss and the rehash bit of this location is a one, then the column-associative algorithm requires that this location be replaced by the data from memory, which is accomplished in the XWAIT state. When the memory acknowledges completion (MACK), the data is taken off the data bus (LD) and written back into the cache (WT). On the other hand, if the first location is not rehashed (!HB), then a rehash is to be performed.

state    input        output                       next state
IDLE     OP           LM, RD                       b[a]
b[a]     HIT                                       IDLE
         !HIT, !HB    STALL, MSEL, LM, RD, LS      f1[a]
         !HIT, HB     MEM, STALL                   XWAIT
f1[a]    HIT          MSEL, LM, WT, DSEL, LD       f2[a]
         !HIT                                      WAIT1
f2[a]                 MSEL, LM, WT                 IDLE
WAIT1    MACK         MSEL, LM, WT, DSEL, LD       WAIT2
WAIT2                 MSEL, LM, WT                 IDLE
XWAIT    MACK         LD, WT                       IDLE

Table 4: State flow table for the control logic of a column-associative cache. In constructing the state flow table, all cache accesses are assumed to be reads.

The processor is stalled (STALL), MSEL and LM are asserted to load the MAR with f[a], the second-time access is begun (RD), LS is asserted to move the first datum into the swap buffer, and the state changes to f1[a].

If there is a second-time hit, then the correct datum resides in the data buffer. In order to perform the swap, state f1[a] loads the MAR with the original address, f(f(a)), and issues a write (WT). State f1[a] also moves the datum accessed the first time from the swap buffer to the data buffer (by asserting DSEL and LD), where it can be written back into the rehashed location in the next state, f2[a]. A second-time miss is handled similarly by states WAIT1 and WAIT2, except that the correct datum to be swapped into the non-rehashed location comes from the memory.
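As a cross-check of Table 4, the state flow can be transliterated into a small lookup table (our sketch; signal and state names follow the table as reconstructed above, and the empty-input rows fire unconditionally):

```python
# (state, inputs) -> (asserted outputs, next state), following Table 4.
TRANSITIONS = {
    ("IDLE",  "OP"):       (("LM", "RD"), "b[a]"),
    ("b[a]",  "HIT"):      ((), "IDLE"),
    ("b[a]",  "!HIT,!HB"): (("STALL", "MSEL", "LM", "RD", "LS"), "f1[a]"),
    ("b[a]",  "!HIT,HB"):  (("MEM", "STALL"), "XWAIT"),
    ("f1[a]", "HIT"):      (("MSEL", "LM", "WT", "DSEL", "LD"), "f2[a]"),
    ("f1[a]", "!HIT"):     ((), "WAIT1"),
    ("f2[a]", ""):         (("MSEL", "LM", "WT"), "IDLE"),
    ("WAIT1", "MACK"):     (("MSEL", "LM", "WT", "DSEL", "LD"), "WAIT2"),
    ("WAIT2", ""):         (("MSEL", "LM", "WT"), "IDLE"),
    ("XWAIT", "MACK"):     (("LD", "WT"), "IDLE"),
}

def step(state, inputs):
    """Return (outputs, next_state) for one clock edge."""
    return TRANSITIONS[(state, inputs)]
```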
