


Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy∗

Snehasish Kumar, Hongzhou Zhao†, Arrvindh Shriraman, Eric Matthews∗, Sandhya Dwarkadas†, Lesley Shannon∗

School of Computing Sciences, Simon Fraser University; †Department of Computer Science, University of Rochester; ∗School of Engineering Science, Simon Fraser University

Abstract

The fixed geometries of current cache designs do not adapt to the working set requirements of modern applications, causing significant inefficiency. The short block lifetimes and moderate spatial locality exhibited by many applications result in only a few words in the block being touched prior to eviction. Unused words occupy between 17%-80% of a 64K L1 cache and between 1%-79% of a 1MB private LLC. This effectively shrinks the cache size, increases miss rate, and wastes on-chip bandwidth. Scaling limitations of wires mean that unused-word transfers comprise a large fraction (11%) of on-chip cache hierarchy energy consumption.

We propose Amoeba-Cache, a design that supports a variable number of cache blocks, each of a different granularity. Amoeba-Cache employs a novel organization that completely eliminates the tag array, treating the storage array as uniform and morphable between tags and data. This enables the cache to harvest space from unused words in blocks for additional tag storage, thereby supporting a variable number of tags (and correspondingly, blocks). Amoeba-Cache adjusts individual cache line granularities according to the spatial locality in the application. It adapts to the appropriate granularity both for different data objects in an application as well as for different phases of access to the same data. Overall, compared to a fixed granularity cache, the Amoeba-Cache reduces miss rate on average (geometric mean) by 18% at the L1 level and by 18% at the L2 level, and reduces L1-L2 miss bandwidth by ~46%. Correspondingly, Amoeba-Cache reduces on-chip memory hierarchy energy by as much as 36% (mcf) and improves performance by as much as 50% (art).

1 Introduction

A cache block is the fundamental unit of space allocation and data transfer in the memory hierarchy. Typically, a block is an aligned fixed granularity of contiguous words (1 word = 8 bytes). Current processors fix the block granularity largely based on the average spatial locality across workloads, while taking tag overhead into consideration. Unfortunately, many applications (see Section 2 for details) exhibit low to moderate spatial locality, and most of the words in a cache block are left untouched during the block's lifespan. Even for applications with good spatial behavior, the short lifespan of a block caused by cache geometry limitations can cause low cache utilization.

Technology trends make it imperative that caching efficiency improves to reduce wastage of interconnect bandwidth. Recent reports from industry [2] show that on-chip networks can contribute up to 28% of total chip power. In the future, an L2-to-L1 transfer can cost up to 2.8× more energy than the L2 data access [14, 20]. Unused words waste ~11% (4%-21% in commercial workloads) of the cache hierarchy energy.

∗This material is based upon work supported in part by grants from the National Science and Engineering Research Council, MARCO Gigascale Research Center, Canadian Microelectronics Corporation, and National Science Foundation (NSF) grants CNS-0834451, CCF-1016902, and CCF-1217920.

Figure 1: Cache designs optimizing different memory hierarchy parameters (bandwidth, miss rate, space utilization): sector caches (e.g., [29, 31]), cache filtering (e.g., [30]), and Amoeba-Cache. Arrows indicate the parameters that are targeted and improved compared to a conventional cache.

Figure 1 organizes past research on cache block granularity along the three main parameters influenced by cache block granularity: miss rate, bandwidth usage, and cache space utilization. Sector caches [32, 15] have been used to minimize bandwidth by fetching only sub-blocks, but they miss opportunities for spatial prefetching. Prefetching [17, 29] may help reduce the miss rate for utilized sectors, but on applications with low to moderate or variable spatial locality, unused sectors due to misprediction, or unused regions within sectors, still pollute the cache and consume bandwidth. Line distillation [30] filters out unused words from the cache at evictions using a separate word-granularity cache. Other approaches identify dead cache blocks and replace or eliminate them eagerly [19, 18, 13, 21]. While these approaches improve utilization and potentially miss rate, they continue to consume bandwidth and interconnect energy for the unutilized words. Word-organized cache blocks also dramatically increase cache associativity and lookup overheads, which impacts their scalability.

Determining a fixed optimal point for the cache line granularity at hardware design time is a challenge. Small cache lines tend to fetch fewer unused words, but impose significant performance penalties by missing opportunities for spatial prefetching in applications with high spatial locality. Small line sizes also introduce high tag overhead, increase lookup energy, and increase miss processing overhead (e.g., control messages). Larger cache line sizes minimize tag overhead and effectively prefetch neighboring words, but introduce the negative effect of unused words that increase network bandwidth. Prior approaches have proposed the use of multiple caches with different block sizes [33, 12]. These approaches require word-granularity caches that increase lookup energy, impose high tag overhead (e.g., 50% in [33]), and reduce cache efficiency when there is good spatial locality.


In this paper, we propose a novel cache architecture, Amoeba-Cache, to improve memory hierarchy efficiency by supporting fine-grain (per-miss) dynamic adjustment of cache block size and the # of blocks per set. To enable variable granularity blocks within the same cache, the tags maintained per set need to grow and shrink as the # of blocks/set varies. Amoeba-Cache eliminates the conventional tag array and collocates the tags with the cache blocks in the data array. This enables us to segment and partition a cache set in different ways: for example, in a configuration comparable to a traditional 4-way 64K cache with 256 sets (256 bytes per set), we can hold eight 32-byte cache blocks, thirty-two 8-byte blocks, or any other collection of cache blocks of varying granularity. Different sets may hold blocks of different granularity, providing maximum flexibility across address regions of varying spatial locality. The Amoeba-Cache effectively filters out unused words in a conventional block and prevents them from being inserted into the cache, allowing the resulting free space to be used to hold tags or data of other useful blocks. The Amoeba-Cache can adapt to the available spatial locality; when there is low spatial locality, it will hold many blocks of small granularity, and when there is good spatial locality, it can adapt and segment the cache into a few big blocks.

Compared to a fixed granularity cache, Amoeba-Cache improves cache utilization by 90%-99% for most applications, saves miss rate by up to 73% (omnetpp) at the L1 level and up to 88% (twolf) at the LLC level, and reduces miss bandwidth by up to 84% (omnetpp) at the L1 and 92% (twolf) at the LLC. We compare against other approaches such as Sector Cache and Line Distillation and show that Amoeba-Cache can optimize miss rate and bandwidth better across many applications, with lower hardware overhead. Our synthesis of the cache controller hit path shows that Amoeba-Cache can be implemented with low energy impact and 0.7% area overhead for a latency-critical 64K L1.

The overall paper is organized as follows: § 2 provides quantitative evidence for the acuteness of the spatial locality problem. § 3 details the internals of the Amoeba-Cache organization and § 4 analyzes the physical implementation overhead. § 5 deals with wider chip-level issues (i.e., inclusion and coherence). § 6-§ 10 evaluate the Amoeba-Cache, commenting on the optimal block granularity, impact on overall on-chip energy, and performance improvement. § 11 outlines related work.

2 Motivation for Adaptive Blocks

In traditional caches, the cache block defines the fundamental unit of data movement and space allocation in caches. The blocks in the data array are uniformly sized to simplify the insertion/removal of blocks, simplify cache refill requests, and support low complexity tag organization. Unfortunately, conventional caches are inflexible (fixed block granularity and fixed # of blocks) and caching efficiency is poor for applications that lack high spatial locality. Cache blocks influence multiple system metrics including bandwidth, miss rate, and cache utilization. The block granularity plays a key role in exploiting spatial locality by effectively prefetching neighboring words all at once. However, the neighboring words could go unused due to the low lifespan of a cache block. The unused words occupy interconnect bandwidth and pollute the cache, which increases the # of misses. We evaluate the influence of a fixed granularity block below.

2.1 Cache Utilization

In the absence of spatial locality, multi-word cache blocks (typically 64 bytes on existing processors) tend to increase cache pollution and fill the cache with words unlikely to be used. To quantify this pollution, we segment the cache line into words (8 bytes) and track the words touched before the block is evicted. We define utilization as the average # of words touched in a cache block before it is evicted. We study a comprehensive collection of workloads from a variety of domains: 6 from PARSEC [3], 7 from SPEC2006, 2 from SPEC2000, 3 Java workloads from DaCapo [4], 3 commercial workloads (Apache, SpecJBB2005, and TPC-C [22]), and the Firefox web browser. Subsets within benchmark suites were chosen based on demonstrated miss rates on the fixed granularity cache (i.e., whose working sets did not fit in the cache size evaluated) and with a spread and diversity in cache utilization. We classify the benchmarks into 3 groups based on the utilization they exhibit: Low (<33%), Moderate (33%-66%), and High (66%+) utilization (see Table 1).
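To make this bookkeeping concrete, a minimal sketch is shown below (our illustration in C, not the simulator used in the paper; struct and function names are ours): each resident block carries a one-bit-per-word mask, and the count of set bits at eviction is the per-block word count that gets averaged into utilization.

#include <stdint.h>
#include <stdio.h>

#define WORDS_PER_BLOCK 8   /* 64-byte block, 8-byte words */

typedef struct {
    uint8_t touched;        /* one bit per 8-byte word of the block */
} block_stats_t;

/* Mark the word containing byte_offset (0..63) as touched. */
static void on_access(block_stats_t *b, unsigned byte_offset) {
    b->touched |= (uint8_t)(1u << (byte_offset / 8));
}

/* At eviction, count how many distinct words were touched. */
static int words_touched(const block_stats_t *b) {
    int n = 0;
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        n += (b->touched >> i) & 1;
    return n;
}

int main(void) {
    block_stats_t b = {0};
    on_access(&b, 0);    /* word 0 */
    on_access(&b, 20);   /* word 2 */
    on_access(&b, 21);   /* word 2 again: still one word */
    printf("words touched before eviction: %d of %d\n",
           words_touched(&b), WORDS_PER_BLOCK);
    return 0;
}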

Table 1: Benchmark Groups

  Group      Utilization %   Benchmarks
  Low        0-33%           art, soplex, twolf, mcf, canneal, lbm, omnetpp
  Moderate   34-66%          astar, h2, jbb, apache, x264, firefox, tpc-c, freqmine, fluidanimate
  High       67-100%         tradesoap, facesim, eclipse, cactus, milc, ferret

Figure 2: Distribution of words touched in a cache block (1-2 Words, 3-4 Words, 5-6 Words, 7-8 Words) across the benchmarks; average utilization (%) is shown on top of each bar. (Config: 64K, 4-way, 64-byte block.)

Figure 2 shows the histogram of words touched at the time of eviction in a cache line of a 64K, 4-way cache (64-byte block, 8 words per block) across the different benchmarks. Seven applications have less than 33% utilization and 12 of them are dominated (>50%) by 1-2 word accesses. In applications with good spatial locality (cactus, ferret, tradesoap, milc, eclipse), more than 50% of the evicted blocks have 7-8 words touched. Despite similar average utilization for applications such as astar and h2 (39%), their distributions are dissimilar; ~70% of the blocks in astar have 1-2 words accessed at the time of eviction, whereas ~50% of the blocks in h2 have 1-2 words accessed per block. Utilization for a single application also changes over time; for example, ferret's average utilization, measured as the average fraction of words used in evicted cache lines over 50 million instruction windows, varies from 50% to 95% with a periodicity of roughly 400 million instructions.


2.2 Effect of Block Granularity on Miss Rate and Bandwidth

Cache miss rate directly correlates with performance, while under current and future wire-limited technologies [2], bandwidth directly correlates with dynamic energy. Figure 3 shows the influence of block granularity on miss rate and bandwidth for a 64K L1 cache and a 1M L2 cache, keeping the number of ways constant. For the 64K L1, the plots highlight the pitfalls of simply decreasing the block size to accommodate the Low group of applications: miss rate increases by 2× for the High group when the block size is changed from 64B to 32B, and it increases by 30% for the Moderate group. A smaller block size decreases bandwidth proportionately but increases miss rate. With a 1M L2 cache, the lifetime of the cache lines increases significantly, improving overall utilization. Increasing the block size from 64B to 256B halves the miss rate for all application groups, while bandwidth increases by 2× for the Low and Moderate groups.

Since miss rate and bandwidth have different optimal block granularities, we use the metric 1 / (Miss Rate × Bandwidth) to determine a fixed block granularity suited to an application that takes both criteria into account. Table 2 shows the block size that maximizes the metric for each application. It can be seen that different applications have different block granularity requirements. For example, the metric is maximized for apache at 128 bytes and for firefox (similar utilization) at 32 bytes. Furthermore, the optimal block sizes vary with the cache size as the cache lifespan changes. This highlights the challenge of picking a single block size at design time, especially when the working set does not fit in the cache.
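As a small worked illustration of how this metric selects a block size, the sketch below scores candidate granularities by 1 / (miss rate × bandwidth) and keeps the maximum. The miss-rate and bandwidth values are placeholder numbers for one hypothetical application, not measurements from the paper.

#include <stdio.h>

int main(void) {
    const int    sizes[]     = {32, 64, 128, 256};       /* candidate block sizes (bytes)            */
    const double miss_rate[] = {9.0, 6.5, 5.0, 4.5};     /* misses per 1K instructions (placeholder) */
    const double bandwidth[] = {3.0, 4.2, 6.0, 9.5};     /* KB per 1K instructions (placeholder)     */
    const int n = sizeof(sizes) / sizeof(sizes[0]);

    int best = 0;
    double best_score = 0.0;
    for (int i = 0; i < n; i++) {
        double score = 1.0 / (miss_rate[i] * bandwidth[i]);   /* the paper's combined metric */
        printf("%3dB: score = %.4f\n", sizes[i], score);
        if (score > best_score) { best_score = score; best = i; }
    }
    printf("best fixed block size under this metric: %dB\n", sizes[best]);
    return 0;
}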

2.3 Need for adaptive cache blocks

Our observations motivate the need for adaptive cache line granularities that match the spatial locality of the data access patterns in an application. In summary:

• Smaller cache lines improve utilization but tend to increase miss rate and potentially traffic for applications with good spatial locality, affecting the overall performance.

• Large cache lines pollute the cache space and interconnect with unused words for applications with poor spatial locality, significantly decreasing the caching efficiency.

• Many applications waste a significant fraction of the cache space. Spatial locality varies not only across applications but also within each application, for different data structures as well as different phases of access over time.

Table 2: Optimal block size. Metric: 1 / (Miss rate × Bandwidth)

  64K, 4-way
  Block   Benchmarks
  32B     cactus, eclipse, facesim, ferret, firefox, fluidanimate, freqmine, milc, tpc-c, tradesoap
  64B     art
  128B    apache, astar, canneal, h2, jbb, lbm, mcf, omnetpp, soplex, twolf, x264

  1M, 8-way
  Block   Benchmarks
  64B     apache, astar, cactus, eclipse, facesim, ferret, firefox, freqmine, h2, lbm, milc, omnetpp, tradesoap, x264
  128B    art
  256B    canneal, fluidanimate, jbb, mcf, soplex, tpc-c, twolf

Figure 3: Bandwidth (KB / 1K ins) vs. Miss Rate (Misses / 1K ins). Panels: (a) 64K - Low, (b) 1M - Low, (c) 64K - Moderate, (d) 1M - Moderate, (e) 64K - High, (f) 1M - High. (a), (c), (e): 64K, 4-way L1; (b), (d), (f): 1M, 8-way LLC. Markers on the plots indicate cache block size. Note the different scales for different groups.

3 Amoeba-Cache: Architecture

The Amoeba-Cache architecture enables the memory hierarchy to fetch and allocate space for a range of words (a variable granularity cache block) based on the spatial locality of the application. For example, consider a 64K cache (256 sets) that allocates 256 bytes per set. These 256 bytes can adapt to support, for example, eight 32-byte blocks, thirty-two 8-byte blocks, or four 32-byte blocks and sixteen 8-byte blocks, based on the set of contiguous words likely to be accessed. The key challenge to supporting variable granularity blocks is how to grow and shrink the # of tags as the # of blocks per set varies with block granularity. Amoeba-Cache adopts a solution inspired by software data structures, where programs hold metadata and actual data entries in the same address space. To achieve maximum flexibility, Amoeba-Cache completely eliminates the tag array and collocates the tags with the actual data blocks (see Figure 4). We use a bitmap (T? bitmap) to indicate which words in the data array represent tags. We also decouple the conventional valid/invalid bits (typically associated with the tags) and organize them into a separate array (V?: Valid bitmap) to simplify block replacement and insertion. The V? and T? bitmaps each require 1 bit for every word (64 bits) in the data array (total overhead of 3%). Amoeba-Cache tags are represented as a range (Start and End address) to support variable granularity blocks. We next discuss the overall architecture.
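A minimal C sketch of this organization is shown below (our rendering; type and field names such as amoeba_set_t are ours, not the paper's): the set is a flat array of 64-bit words shared by tags and data, the T? and V? bitmaps carry one bit per word, and a tag word logically packs <Region Tag, Start, End>.

#include <stdint.h>
#include <stdio.h>

#define WORDS_PER_SET 32   /* 256 bytes per set / 8-byte words (64K, 256-set example) */

/* One cache set: a uniform array of 64-bit words that holds both tag words
 * and data words, plus the T? and V? bitmaps (1 bit per word each). */
typedef struct {
    uint64_t word[WORDS_PER_SET];
    uint32_t t_bitmap;   /* T?: bit i set => word[i] is a tag word */
    uint32_t v_bitmap;   /* V?: bit i set => word[i] is allocated (tag or data) */
} amoeba_set_t;

/* Logical contents of a tag word: <Region Tag, Start, End>. Start and End
 * each need only log2(RMAX) bits; the rest of the 64-bit word is spare. */
typedef struct {
    uint64_t region_tag;
    uint8_t  start;      /* first word of the region covered by this block */
    uint8_t  end;        /* last word of the region covered by this block  */
} amoeba_tag_t;

int main(void) {
    amoeba_set_t set = {{0}, 0, 0};
    /* place a 3-word Amoeba-Block (1 tag word + 2 data words) at words 4..6 */
    set.t_bitmap |= 1u << 4;                             /* word 4 is the tag  */
    set.v_bitmap |= (1u << 4) | (1u << 5) | (1u << 6);   /* words 4..6 in use */
    printf("tag words: %#x, allocated words: %#x\n", set.t_bitmap, set.v_bitmap);
    return 0;
}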

3.1 Amoeba Blocks and Set-Indexing

The Amoeba-Cache data array holds a collection of varied granularity Amoeba-Blocks that do not overlap. Each Amoeba-Block is a 4-tuple consisting of <Region Tag, Start, End, Data-Block> (Figure 4). A Region is an aligned block of memory of size RMAX bytes. The boundaries of any Amoeba-Block (Start and End) always lie within the region's boundaries. The minimum granularity of the data in an Amoeba-Block is 1 word and the maximum is RMAX words. We can encode Start and End in log2(RMAX) bits. The set indexing function masks the lower log2(RMAX) bits to ensure that all Amoeba-Blocks (every memory word) from a region index to the same set. The Region Tag and Set-Index are identical for every word in the Amoeba-Block. Retaining the notion of sets enables fast lookups and helps elude challenges such as synonyms (the same memory word mapping to different sets). When comparing against a conventional cache, we set RMAX to 8 words (64 bytes), ensuring that the set indexing function is identical to that in the conventional cache to allow for a fair evaluation.
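The set-indexing function can be sketched as below, assuming RMAX = 8 words (64-byte regions), 8-byte words, and 256 sets as in the 64K example; the helper names are ours. Every word of one region produces the same set index and region tag, which is what keeps all of a region's Amoeba-Blocks in a single set.

#include <stdint.h>
#include <stdio.h>

#define WORD_BYTES  8
#define RMAX_WORDS  8                       /* log2(RMAX) = 3 word-offset bits */
#define NUM_SETS    256

static unsigned set_index(uint64_t addr) {
    uint64_t word = addr / WORD_BYTES;
    return (word / RMAX_WORDS) % NUM_SETS;  /* drop the low log2(RMAX) word bits */
}

static uint64_t region_tag(uint64_t addr) {
    uint64_t word = addr / WORD_BYTES;
    return (word / RMAX_WORDS) / NUM_SETS;  /* bits above the set index */
}

int main(void) {
    /* every word of one 64-byte region maps to the same set and region tag */
    for (uint64_t a = 0x1000; a < 0x1040; a += WORD_BYTES)
        printf("addr 0x%llx -> set %u, region tag 0x%llx\n",
               (unsigned long long)a, set_index(a),
               (unsigned long long)region_tag(a));
    return 0;
}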

3.2 Data Lookup

Figure 5 describes the steps of the lookup. (1) The lower log2(RMAX) bits are masked from the address and the set index is derived from the remaining bits. (2) In parallel with the data read out, the T? bitmap activates the words in the data array corresponding to the tags for comparison. Note that since the minimum size of an Amoeba-Block is two words (1 word for the tag metadata, 1 word for the data), adjacent words cannot both be tags. We need Nwords/2 2:1 multiplexers that route one of each pair of adjacent words to the comparator (the ∈ operator). The comparator generates the hit signal for the word selector. The ∈ operator consists of two comparators: (a) an aligned Region tag comparator, a conventional == comparator (64 − log2 Nsets − log2 RMAX bits wide, e.g., 50 bits) that checks whether the Amoeba-Block belongs to the same region, and (b) a Start <= W < End range comparator (log2 RMAX bits wide, e.g., 3 bits) that checks whether the Amoeba-Block holds the required word. Finally, in (3), the tag match activates and selects the appropriate word.
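The ∈ operator can be sketched as the pair of comparisons below (our code; whether End is treated as inclusive or exclusive depends on the chosen encoding, and this sketch follows the Start <= W < End form stated above).

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Tag fields for one Amoeba-Block (names are ours). */
typedef struct {
    uint64_t region_tag;
    unsigned start, end;     /* word offsets within the RMAX-word region */
} amoeba_tag_t;

/* The ∈ check: region-tag equality plus a range compare on word offset W. */
static bool block_holds(const amoeba_tag_t *t, uint64_t req_region, unsigned w) {
    return t->region_tag == req_region       /* same aligned region?        */
        && t->start <= w && w < t->end;      /* Start <= W < End, as stated */
}

int main(void) {
    amoeba_tag_t t = { .region_tag = 0x40, .start = 2, .end = 6 };
    printf("word 3 hit: %d\n", block_holds(&t, 0x40, 3));   /* 1 */
    printf("word 7 hit: %d\n", block_holds(&t, 0x40, 7));   /* 0 */
    return 0;
}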

3.3 Amoeba Block Insertion

On a miss for the desired word, a spatial granularity predictor is invoked (see Section 5.1), which specifies the range of the Amoeba-Block to refill. To determine a position in the set to slot the incoming block, we can leverage the V? (valid bits) bitmap. The V? bitmap has 1 bit per word in the cache; a "1" bit indicates the word has been allocated (valid data or a tag). To find space for an incoming data block, we perform a substring search on the V? bitmap of the cache set for contiguous 0s (empty words). For example, to insert an Amoeba-Block of five words (four words for data and one word for tag), we perform a substring search for 00000 in the set's V? bitmap (e.g., 32 bits wide for a 64K cache). If we cannot find the space, we keep triggering the replacement algorithm until we create the requisite contiguous space. Following this, the Amoeba-Block tuple (tag and data block) is inserted, and the corresponding bits in the T? and V? arrays are set. The 0s substring search can be accomplished with a lookup table; many current processors already include a substring instruction [11, PCMISTRI].
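The contiguous-zeros search over the per-set V? bitmap can be sketched as below (our illustration; a hardware implementation would use the lookup-table or substring-instruction approach mentioned above). It returns the first run of `need` free words, or -1 so the caller can trigger replacement and retry.

#include <stdint.h>
#include <stdio.h>

#define WORDS_PER_SET 32

/* Find `need` contiguous 0 bits (free words) in the set's V? bitmap.
 * Returns the starting word index, or -1 if no such run exists. */
static int find_free_run(uint32_t v_bitmap, int need) {
    int run = 0;
    for (int i = 0; i < WORDS_PER_SET; i++) {
        if ((v_bitmap >> i) & 1u) {
            run = 0;                 /* word in use: restart the run */
        } else if (++run == need) {
            return i - need + 1;     /* found `need` consecutive free words */
        }
    }
    return -1;
}

int main(void) {
    uint32_t v = 0x0000F0FFu;        /* example occupancy pattern */
    printf("5-word block fits at word %d\n", find_free_run(v, 5));
    return 0;
}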

3.4 Replacement: Pseudo-LRU

To reclaim the space from an Amoeba-Block, the T? (tag) and V? (valid) bits corresponding to the block are unset. The key issue is identifying the Amoeba-Block to replace. Classical pseudo-LRU algorithms [16, 26] keep the metadata for the replacement algorithm separate from the tags to reduce port contention. To be compatible with pseudo-LRU and other algorithms such as DIP [27] that work with a fixed # of ways, we can logically partition a set in Amoeba-Cache into N ways. For instance, if we partition a 32-word cache set into 4 logical ways, any access to an Amoeba-Block tag found in words 0-7 of the set is treated as an access to logical way 0. Finding a replacement candidate involves identifying the selected replacement way and then picking (possibly randomly) a candidate Amoeba-Block. More refined replacement algorithms that require per-tag metadata can harvest the space in the tag word of the Amoeba-Block, which is 64 bits wide (for alignment purposes) while physical addresses rarely extend beyond 48 bits.
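The logical-way mapping can be sketched as below (our code): the word index at which an Amoeba-Block's tag resides selects the logical way whose replacement state (pseudo-LRU, DIP, etc.) is consulted and updated.

#include <stdio.h>

#define WORDS_PER_SET 32
#define LOGICAL_WAYS  4

/* Words 0-7 map to way 0, words 8-15 to way 1, and so on. */
static int logical_way(int tag_word_index) {
    return tag_word_index / (WORDS_PER_SET / LOGICAL_WAYS);
}

int main(void) {
    printf("tag at word 5  -> logical way %d\n", logical_way(5));
    printf("tag at word 19 -> logical way %d\n", logical_way(19));
    return 0;
}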

3.5 Partial Misses

With variable granularity data blocks, a challenging although rare case (5 in every 1K accesses) that occurs is a partial miss. It is observed primarily when the spatial locality changes. Figure 6 shows an example. Initially, the set contains two blocks from a region R: one Amoeba-Block caches words 1-3 (Region: R, Start: 1, End: 3) and the other holds words 5-6 (Region: R, Start: 5, End: 6). The CPU reads word 4, which misses, and the spatial predictor requests an Amoeba-Block with range Start: 0 and End: 7. The cache has Amoeba-Blocks that hold subparts of the incoming Amoeba-Block, and some words (0, 4, and 7) need to be fetched.

Amoeba-Cache removes the overlapping sub-blocks and allocates a new Amoeba-Block. This is a multi-step process: (1) on a miss, the cache identifies the overlapping sub-blocks in the cache using the tags read out during lookup; the intersection is non-null (∩ ≠ NULL) if Start_new < End_Tag and End_new > Start_Tag (where "new" is the incoming block and "Tag" is a resident Amoeba-Block in the set).
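The intersection test can be sketched as below (our code, using the example ranges from the text); both resident blocks overlap the requested range and would therefore be removed and folded into the new Amoeba-Block.

#include <stdbool.h>
#include <stdio.h>

/* Overlap test from the partial-miss check above: the incoming block
 * (Start_new, End_new) intersects a resident block (Start_tag, End_tag)
 * when Start_new < End_tag and End_new > Start_tag. */
static bool overlaps(int start_new, int end_new, int start_tag, int end_tag) {
    return start_new < end_tag && end_new > start_tag;
}

int main(void) {
    /* example from the text: predictor requests words 0..7; the set holds
     * blocks covering words 1..3 and 5..6 of the same region */
    printf("overlaps [1,3]: %d\n", overlaps(0, 7, 1, 3));  /* 1 */
    printf("overlaps [5,6]: %d\n", overlaps(0, 7, 5, 6));  /* 1 */
    printf("overlaps [8,9]: %d\n", overlaps(0, 7, 8, 9));  /* 0: disjoint */
    return 0;
}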

N sets

TT T T

T T TT TTT

DataRegionTag START END

64b1 -- RMAX

.........

T? Bitmap

Nwords

3.T

hew

rite

rw

aits

unti

litr

ecei

ves

ackn

owle

dgem

ents

from

all

the

shar

ers.

One

ofth

ese

ackn

owle

dgem

ents

wil

lco

ntai

nth

eda

ta.

Alt

erna

tivel

y,th

ew

rite

rw

ill

rece

ive

the

data

from

mem

ory.

4.T

hew

rite

rno

tifi

esth

edi

rect

ory

that

the

tran

sact

ion

isco

m-

plet

e.

TL

oper

ates

iden

tica

lly

exce

ptfo

rw

hen

itse

lect

s,in

step

2(b)

,apr

ovid

erth

atdo

esno

tha

veth

eda

ta.

Inpr

acti

ce,t

hepr

obab

ilit

yof

the

sele

cted

shar

erno

thav

ing

aco

pyis

very

low

;thu

s,th

eco

mm

onca

sefo

rT

Lpr

ocee

dsth

esa

me

asth

eor

igin

alpr

otoc

ol.

Inth

ele

ssco

mm

onca

se,

whe

nth

epr

ovid

erdo

esno

tha

vea

copy

,it

send

sa

NA

ckto

the

wri

ter

inst

ead

ofan

ackn

owle

dgem

ent.

Thi

sle

aves

thre

epo

ssib

lesc

enar

ios:

1.N

oco

pyex

ists

:T

hew

rite

rw

ill

wai

tto

rece

ive

all

ofth

eac

know

ledg

emen

tsan

dth

eN

Ack

,an

dsi

nce

itha

sno

tre

ceiv

edan

yda

ta,i

twil

lsen

da

requ

estt

oth

edi

rect

ory

whi

chth

enas

ksm

emor

yto

prov

ide

the

data

.3

2.A

clea

nco

pyex

ists

:F

orsi

mpl

icit

y,th

isis

hand

led

iden

-ti

call

yto

firs

tca

se.

All

shar

ers

inva

lida

teth

eir

copi

esan

dda

tais

retr

ieve

dfr

omm

emor

yin

stea

d.T

his

may

resu

ltin

ape

rfor

man

celo

ssif

itha

ppen

sfr

eque

ntly

,bu

tS

ecti

on5

show

sth

atno

perf

orm

ance

loss

occu

rsin

prac

tice

.

3.A

dirt

yco

pyex

ists

:O

neof

the

othe

rsh

arer

s,th

eow

ner,

wil

lbe

inth

eM

odifi

edor

Ow

ned

stat

e.T

heow

ner

cann

otdi

scar

dth

edi

rty

data

sinc

eit

does

not

know

that

the

prov

ider

has

aco

py.

Acc

ordi

ngly

,the

owne

ral

way

sin

clud

esth

eda

tain

its

ackn

owle

dgem

ent

toth

ew

rite

r.In

this

case

,th

ew

rite

rw

ill

rece

ive

the

prov

ider

’sN

Ack

and

the

owne

r’s

vali

dda

taan

dca

nth

usco

mpl

ete

its

requ

est.

Ifth

epr

ovid

erha

sa

copy

asw

ell,

the

wri

ter

wil

lre

ceiv

etw

oid

enti

cal

copi

esof

the

data

.S

ecti

on5.

5de

mon

stra

tes

that

this

isra

reen

ough

that

itdo

esno

tres

ulti

nan

ysi

gnifi

cant

incr

ease

inba

ndw

idth

util

izat

ion.

Inal

lca

ses,

all

tran

sact

ions

are

seri

aliz

edth

roug

hth

edi

rect

ory,

sono

addi

tion

alra

ceco

nsid

erat

ions

are

intr

oduc

ed.

Sim

ilar

ly,

the

unde

rlyi

ngde

adlo

ckan

dliv

eloc

kav

oida

nce

mec

hani

sms

rem

ain

appl

icab

le.

3T

his

requ

esti

sse

rial

ized

thro

ugh

the

dire

ctor

yto

avoi

dra

ces

wit

hev

icts

.

3.2.

3E

vict

ions

and

Inva

lida

tes

Whe

na

bloc

kis

evic

ted

orin

vali

date

dT

Ltr

ies

tocl

ear

the

appr

opri

ate

bits

inth

eco

rres

pond

ing

Blo

omfi

lter

tom

aint

ain

accu

racy

.E

ach

Blo

omfi

lter

repr

esen

tsth

ebl

ocks

ina

sing

lese

t,m

akin

git

sim

ple

tode

term

ine

whi

chbi

tsca

nbe

clea

red.

The

cach

esi

mpl

yev

alua

tes

all

the

hash

func

tion

sfo

ral

lre

mai

ning

bloc

ksin

the

set.

Thi

sre

quir

esno

extr

ata

gar

ray

acce

sses

asal

lta

gsar

eal

read

yre

adas

part

ofth

eno

rmal

cach

eop

erat

ion.

Any

bitm

appe

dto

byth

eev

icte

dor

inva

lida

ted

bloc

kca

nbe

clea

red

ifno

neof

the

rem

aini

ngbl

ocks

inth

ese

tm

apto

the

sam

ebi

t.W

hen

usin

gan

incl

usiv

eL

2ca

che,

asin

our

exam

ple,

itis

suffi

cien

tto

chec

kon

lyth

eon

eL

2ca

che

set.

Ifth

eL

2ca

che

isno

tin

clus

ive,

then

addi

tion

alL

1ac

cess

esm

aybe

nece

ssar

y.

3.3

AP

ract

ical

Impl

emen

tati

onC

once

ptua

lly,

TL

cons

ists

ofan

Pgr

idof

Blo

omfi

lter

s,w

here

Nis

the

num

ber

ofca

che

sets

,an

dP

isth

enu

mbe

rof

cach

es,

assh

own

inF

igur

e1(

c).

Eac

hB

loom

filt

erco

ntai

nsk

bit-

vect

ors.

For

clar

ity

we

wil

lre

fer

toth

esi

zeof

thes

ebi

tve

ctor

sas

the

num

ber

ofbu

ckets

.In

prac

tice

,th

ese

filt

ers

can

beco

mbi

ned

into

afe

wS

RA

Mar

rays

prov

ided

that

afe

wco

ndit

ions

are

met

:1)

all

the

filt

ers

use

the

sam

ese

tof

kha

shfu

ncti

ons,

and

2)an

enti

rero

wof

Pfi

lter

sis

acce

ssed

inpa

rall

el.

The

sere

quir

emen

tsdo

not

sign

ifica

ntly

affe

ctth

eef

fect

iven

ess

ofth

eB

loom

filt

ers.

Con

cept

uall

y,ea

chro

wco

ntai

nsP

Blo

omfi

lter

s,ea

chw

ith

kha

shfu

ncti

ons.

Fig

ure

2(a)

show

son

esu

chro

w.

By

alw

ays

acce

ssin

gth

ese

filt

ers

inpa

rall

elus

ing

the

sam

eha

shfu

ncti

ons,

the

filt

ers

can

beco

mbi

ned

tofo

rma

sing

lelo

okup

tabl

e,as

show

nin

Fig

ure

2(b)

,w

here

each

row

cont

ains

aP

-bit

shar

ing

vect

or.

As

are

sult

,ea

chlo

okup

dire

ctly

prov

ides

aP

-bit

shar

ing

vect

orby

AN

D-i

ngk

shar

ing

vect

ors.

Sin

ceev

ery

row

ofB

loom

filt

ers

uses

the

sam

eha

shfu

ncti

ons,

we

can

furt

her

com

bine

the

look

upta

bles

for

alls

ets

for

each

hash

func

tion

,as

show

nin

Fig

ure

3(a)

.T

hese

look

upta

bles

can

beth

ough

tof

asa

“sea

ofsh

arin

gve

ctor

s”an

dca

nbe

orga

nize

din

toan

yst

ruct

ure

wit

hth

efo

llow

ing

cond

itio

ns:

1)a

uniq

uese

t,ha

shfu

ncti

onin

dex,

and

bloc

kad

dres

sm

apto

asi

ngle

uniq

uesh

arin

gve

ctor

;and

2)k

shar

ing

vect

ors

can

beac

cess

edin

para

llel

,w

here

kis

the

num

ber

ofha

shfu

ncti

ons.

An

exam

ple

impl

emen

tati

onus

esa

sing

le-p

orte

d,tw

o-di

men

sion

alS

RA

Mar

ray

for

each

hash

func

tion

,w

here

the

set

inde

xse

lect

sa

row

,an

dth

eha

shfu

ncti

onse

lect

sa

shar

ing

vect

or,

assh

own

inF

igur

e3(

b).

As

apr

acti

cal

exam

ple,

Sec

tion

5sh

ows

that

aT

Lde

sign

wit

hfo

urha

shta

bles

and

64bu

cket

sin

each

filt

erpe

rfor

ms

wel

lfo

r

!"#$ %

!"#$ &

'$(

%)*+(,--#$..

/0./% /0./1

,--#$..

/0./% /0./1

(a)

!"#$%

!"#$&

!"#$%&'()*$+,&-./%0*

122*.''

()'(3 ()'(4

5.%

(b)

Fig

ure

2:(a

)Ana

ïve

impl

emen

tati

onw

ith

one

filte

rpe

rco

re.(

b)C

ombi

ning

allo

fthe

filte

rsfo

ron

ero

w.

V? Bitmap

.........

Amoeba-Block

3.Th

e writ

erwa

itsun

tilit

rece

ives

ackn

owled

gem

ents

from

allth

esh

arer

s.On

eof

thes

eac

know

ledge

men

tswi

llco

ntain

the

data.

Alter

nativ

ely, t

hewr

iter w

illre

ceiv

eth

eda

tafro

mm

emor

y.

4.Th

ewr

iter n

otifi

esth

edi

recto

ryth

atth

etra

nsac

tion

isco

m-

plete

.

TLop

erate

s ide

ntica

llyex

cept

for w

hen

itse

lects,

inste

p2(

b), a

prov

ider

that

does

not h

ave t

heda

ta.In

prac

tice,

the p

roba

bilit

yof

the s

electe

d sha

rer n

otha

ving

a cop

y is v

ery l

ow; t

hus,

the c

omm

onca

sefo

r TL

proc

eeds

the s

ame a

s the

orig

inal

prot

ocol

. In

the l

ess

com

mon

case

, whe

nth

epr

ovid

erdo

esno

t hav

ea

copy

, it s

ends

aNA

ckto

the

write

r ins

tead

ofan

ackn

owled

gem

ent.

This

leave

sth

ree p

ossib

lesc

enar

ios:

1.No

copy

exist

s:Th

ewr

iter

will

wait

tore

ceiv

eall

ofth

eac

know

ledge

men

tsan

dth

eNA

ck, a

ndsin

ceit

has

not

rece

ived

any d

ata, i

t will

send

a req

uest

toth

e dire

ctory

which

then

asks

mem

ory

topr

ovid

e the

data.

3

2.A

clean

copy

exist

s:Fo

r sim

plici

ty,th

isis

hand

ledid

en-

ticall

yto

first

case

.Al

l sha

rers

inva

lidate

their

copi

esan

dda

tais

retri

eved

from

mem

ory

inste

ad.

This

may

resu

ltin

ape

rform

ance

loss

ifit

happ

ens

frequ

ently

, but

Secti

on5

show

s tha

t no

perfo

rman

celo

ssoc

curs

inpr

actic

e.

3.A

dirty

copy

exist

s:On

e of t

heot

her s

hare

rs,th

e own

er, w

illbe

inth

e Mod

ified

orOw

ned s

tate.

The o

wner

cann

otdi

scar

dth

edi

rtyda

tasin

ceit

does

not k

now

that

the

prov

ider

has a

copy

. Acc

ordi

ngly,

the o

wner

alwa

ysin

clude

s the

data

inits

ackn

owled

gem

ent t

oth

ewr

iter.

Inth

isca

se, t

hewr

iter w

illre

ceiv

eth

epr

ovid

er’s

NAck

and

the

owne

r’sva

lidda

taan

dca

nth

usco

mpl

eteits

requ

est.

Ifth

epr

ovid

erha

s aco

pyas

well,

the w

riter

will

rece

ive t

woid

entic

alco

pies

ofth

e data

.Se

ction

5.5

dem

onstr

ates t

hat t

his i

s rar

e eno

ugh

that

itdo

esno

t res

ult i

n any

signi

fican

t inc

reas

e in b

andw

idth

utili

zatio

n.

Inall

case

s,all

trans

actio

nsar

e ser

ialize

dth

roug

hth

e dire

ctory

,so

noad

ditio

nal r

ace

cons

ider

ation

s are

intro

duce

d.Si

mila

rly, t

heun

derly

ing

dead

lock

and

livelo

ckav

oida

nce

mec

hani

sms

rem

ainap

plica

ble.

3 This

requ

est i

s ser

ialize

d thr

ough

the d

irecto

ryto

avoi

d rac

eswi

thev

icts.

3.2.

3Ev

ictio

nsan

dIn

valid

ates

Whe

na

bloc

kis

evict

edor

inva

lidate

dTL

tries

tocle

arth

eap

prop

riate

bits

inth

eco

rresp

ondi

ngBl

oom

filter

tom

ainta

inac

cura

cy.

Each

Bloo

mfil

terre

pres

ents

the

bloc

ksin

asin

gle

set,

mak

ing

itsim

ple t

o dete

rmin

e whi

chbi

tsca

n be c

leare

d.Th

e cac

hesim

ply

evalu

ates a

llth

eha

shfu

nctio

nsfo

r all

rem

ainin

gbl

ocks

inth

ese

t.Th

isre

quire

sno

extra

tagar

ray

acce

sses

asall

tags

are

alrea

dyre

adas

part

ofth

e nor

mal

cach

e ope

ratio

n.An

y bit

map

ped

toby

the

evict

edor

inva

lidate

dbl

ock

can

becle

ared

ifno

neof

the

rem

ainin

gbl

ocks

inth

ese

t map

toth

esa

me

bit.

Whe

nus

ing

anin

clusiv

eL2

cach

e,as

inou

r exa

mpl

e,it

issu

fficie

ntto

chec

kon

lyth

eon

eL2

cach

ese

t.If

the

L2ca

che

isno

t inc

lusiv

e,th

enad

ditio

nal L

1ac

cess

esm

aybe

nece

ssar

y.

3.3A

Prac

tical

Impl

emen

tatio

nCo

ncep

tuall

y,TL

cons

ists

ofan

Pgr

idof

Bloo

mfil

ters,

wher

eN

isth

enu

mbe

rof

cach

ese

ts,an

dP

isth

enu

mbe

rof

cach

es, a

s sho

wnin

Figu

re1(

c). E

ach

Bloo

mfil

terco

ntain

s kbi

t-ve

ctors.

For c

larity

wewi

llre

fer t

oth

e size

ofth

ese

bit v

ecto

rsas

the

num

ber o

f buc

kets

. In

prac

tice,

thes

efil

ters c

anbe

com

bine

din

toa f

ewSR

AMar

rays

prov

ided

that

a few

cond

ition

s are

met:

1)all

the fi

lters

use t

hesa

me s

etof

kha

shfu

nctio

ns, a

nd2)

anen

tire

row

ofP

filter

s is a

cces

sed

inpa

ralle

l.Th

ese r

equi

rem

ents

dono

tsig

nific

antly

affe

ctth

e effe

ctive

ness

ofth

e Blo

omfil

ters.

Conc

eptu

ally,

each

row

cont

ains

PBl

oom

filter

s,ea

chwi

thk

hash

func

tions

.Fi

gure

2(a)

show

son

esu

chro

w.By

alway

sac

cess

ing

thes

e filte

rsin

para

llel u

sing

the s

ame h

ash f

uncti

ons,

the

filter

s can

beco

mbi

ned

tofo

rma s

ingl

elo

okup

table,

assh

own

inFi

gure

2(b)

, whe

reea

chro

wco

ntain

sa

P-bi

t sha

ring

vecto

r.As

are

sult,

each

look

updi

rectl

ypr

ovid

esa

P-b

itsh

arin

gve

ctor b

yAN

D-in

gk

shar

ing

vecto

rs.Si

nce e

very

row

ofBl

oom

filter

s use

sth

e sam

e has

hfu

nctio

ns, w

e can

furth

erco

mbi

neth

e loo

kup

tables

for a

llse

tsfo

r eac

hha

shfu

nctio

n,as

show

nin

Figu

re3(

a).

Thes

elo

okup

tables

can

beth

ough

tof

asa

“sea

ofsh

arin

gve

ctors”

and

can

beor

gani

zed

into

any

struc

ture

with

the f

ollo

wing

cond

ition

s:1)

a uni

que s

et,ha

shfu

nctio

nin

dex,

and

bloc

kad

dres

sm

apto

a sin

gle u

niqu

e sha

ring

vecto

r;an

d2)

ksh

arin

gve

ctors

can

beac

cess

edin

para

llel,

wher

ek

isth

enu

mbe

r of h

ash

func

tions

.An

exam

ple i

mpl

emen

tatio

nus

esa s

ingl

e-po

rted,

two-

dim

ensio

nal

SRAM

arra

yfo

r eac

hha

shfu

nctio

n,wh

ere

the

set i

ndex

selec

tsa

row,

and

the

hash

func

tion

selec

tsa

shar

ing

vecto

r,as

show

nin

Figu

re3(

b).

Asa

prac

tical

exam

ple,

Secti

on5

show

s tha

t aTL

desig

nwi

thfo

urha

shtab

lesan

d64

buck

etsin

each

filter

perfo

rms

well

for

!"#$%

!"#$&

'$(

%)*+(

,--#$..

/0./% /0./1

,--#$..

/0./% /0./1

(a)

!"#$%

!"#$&

!"#$%&'(

)*$+,

&-./%0

*122*.''

()'(3 ()'(4

5.%

(b)

Figu

re2:

(a) A

naïve

impl

emen

tatio

nwi

thon

e filte

r per

core

. (b)

Com

bini

ngall

ofth

e filte

rsfo

r one

row.

3.Th

e writ

erwa

itsun

tilit

rece

ives

ackn

owled

gem

ents

from

allth

esh

arer

s.On

eof

thes

eac

know

ledge

men

tswi

llco

ntain

the

data.

Alter

nativ

ely, t

hewr

iter w

illre

ceiv

eth

eda

tafro

mm

emor

y.

4.Th

ewr

iter n

otifi

esth

edi

recto

ryth

atth

etra

nsac

tion

isco

m-

plete

.

TLop

erate

s ide

ntica

llyex

cept

for w

hen

itse

lects,

inste

p2(

b), a

prov

ider

that

does

not h

ave t

heda

ta.In

prac

tice,

the p

roba

bilit

yof

the s

electe

d sha

rer n

otha

ving

a cop

y is v

ery l

ow; t

hus,

the c

omm

onca

sefo

r TL

proc

eeds

the s

ame a

s the

orig

inal

prot

ocol

. In

the l

ess

com

mon

case

, whe

nth

epr

ovid

erdo

esno

t hav

ea

copy

, it s

ends

aNA

ckto

the

write

r ins

tead

ofan

ackn

owled

gem

ent.

This

leave

sth

ree p

ossib

lesc

enar

ios:

1.No

copy

exist

s:Th

ewr

iter

will

wait

tore

ceiv

eall

ofth

eac

know

ledge

men

tsan

dth

eNA

ck, a

ndsin

ceit

has

not

rece

ived

any d

ata, i

t will

send

a req

uest

toth

e dire

ctory

which

then

asks

mem

ory

topr

ovid

e the

data.

3

2.A

clean

copy

exist

s:Fo

r sim

plici

ty,th

isis

hand

ledid

en-

ticall

yto

first

case

.Al

l sha

rers

inva

lidate

their

copi

esan

dda

tais

retri

eved

from

mem

ory

inste

ad.

This

may

resu

ltin

ape

rform

ance

loss

ifit

happ

ens

frequ

ently

, but

Secti

on5

show

s tha

t no

perfo

rman

celo

ssoc

curs

inpr

actic

e.

3.A

dirty

copy

exist

s:On

e of t

heot

her s

hare

rs,th

e own

er, w

illbe

inth

e Mod

ified

orOw

ned s

tate.

The o

wner

cann

otdi

scar

dth

edi

rtyda

tasin

ceit

does

not k

now

that

the

prov

ider

has a

copy

. Acc

ordi

ngly,

the o

wner

alwa

ysin

clude

s the

data

inits

ackn

owled

gem

ent t

oth

ewr

iter.

Inth

isca

se, t

hewr

iter w

illre

ceiv

eth

epr

ovid

er’s

NAck

and

the

owne

r’sva

lidda

taan

dca

nth

usco

mpl

eteits

requ

est.

Ifth

epr

ovid

erha

s aco

pyas

well,

the w

riter

will

rece

ive t

woid

entic

alco

pies

ofth

e data

.Se

ction

5.5

dem

onstr

ates t

hat t

his i

s rar

e eno

ugh

that

itdo

esno

t res

ult i

n any

signi

fican

t inc

reas

e in b

andw

idth

utili

zatio

n.

Inall

case

s,all

trans

actio

nsar

e ser

ialize

dth

roug

hth

e dire

ctory

,so

noad

ditio

nal r

ace

cons

ider

ation

s are

intro

duce

d.Si

mila

rly, t

heun

derly

ing

dead

lock

and

livelo

ckav

oida

nce

mec

hani

sms

rem

ainap

plica

ble.

3 This

requ

est i

s ser

ialize

d thr

ough

the d

irecto

ryto

avoi

d rac

eswi

thev

icts.

3.2.

3Ev

ictio

nsan

dIn

valid

ates

Whe

na

bloc

kis

evict

edor

inva

lidate

dTL

tries

tocle

arth

eap

prop

riate

bits

inth

eco

rresp

ondi

ngBl

oom

filter

tom

ainta

inac

cura

cy.

Each

Bloo

mfil

terre

pres

ents

the

bloc

ksin

asin

gle

set,

mak

ing

itsim

ple t

o dete

rmin

e whi

chbi

tsca

n be c

leare

d.Th

e cac

hesim

ply

evalu

ates a

llth

eha

shfu

nctio

nsfo

r all

rem

ainin

gbl

ocks

inth

ese

t.Th

isre

quire

sno

extra

tagar

ray

acce

sses

asall

tags

are

alrea

dyre

adas

part

ofth

e nor

mal

cach

e ope

ratio

n.An

y bit

map

ped

toby

the

evict

edor

inva

lidate

dbl

ock

can

becle

ared

ifno

neof

the

rem

ainin

gbl

ocks

inth

ese

t map

toth

esa

me

bit.

Whe

nus

ing

anin

clusiv

eL2

cach

e,as

inou

r exa

mpl

e,it

issu

fficie

ntto

chec

kon

lyth

eon

eL2

cach

ese

t.If

the

L2ca

che

isno

t inc

lusiv

e,th

enad

ditio

nal L

1ac

cess

esm

aybe

nece

ssar

y.

3.3A

Prac

tical

Impl

emen

tatio

nCo

ncep

tuall

y,TL

cons

ists

ofan

Pgr

idof

Bloo

mfil

ters,

wher

eN

isth

enu

mbe

rof

cach

ese

ts,an

dP

isth

enu

mbe

rof

cach

es, a

s sho

wnin

Figu

re1(

c). E

ach

Bloo

mfil

terco

ntain

s kbi

t-ve

ctors.

For c

larity

wewi

llre

fer t

oth

e size

ofth

ese

bit v

ecto

rsas

the

num

ber o

f buc

kets

. In

prac

tice,

thes

efil

ters c

anbe

com

bine

din

toa f

ewSR

AMar

rays

prov

ided

that

a few

cond

ition

s are

met:

1)all

the fi

lters

use t

hesa

me s

etof

kha

shfu

nctio

ns, a

nd2)

anen

tire

row

ofP

filter

s is a

cces

sed

inpa

ralle

l.Th

ese r

equi

rem

ents

dono

tsig

nific

antly

affe

ctth

e effe

ctive

ness

ofth

e Blo

omfil

ters.

Conc

eptu

ally,

each

row

cont

ains

PBl

oom

filter

s,ea

chwi

thk

hash

func

tions

.Fi

gure

2(a)

show

son

esu

chro

w.By

alway

sac

cess

ing

thes

e filte

rsin

para

llel u

sing

the s

ame h

ash f

uncti

ons,

the

filter

s can

beco

mbi

ned

tofo

rma s

ingl

elo

okup

table,

assh

own

inFi

gure

2(b)

, whe

reea

chro

wco

ntain

sa

P-bi

t sha

ring

vecto

r.As

are

sult,

each

look

updi

rectl

ypr

ovid

esa

P-b

itsh

arin

gve

ctor b

yAN

D-in

gk

shar

ing

vecto

rs.Si

nce e

very

row

ofBl

oom

filter

s use

sth

e sam

e has

hfu

nctio

ns, w

e can

furth

erco

mbi

neth

e loo

kup

tables

for a

llse

tsfo

r eac

hha

shfu

nctio

n,as

show

nin

Figu

re3(

a).

Thes

elo

okup

tables

can

beth

ough

tof

asa

“sea

ofsh

arin

gve

ctors”

and

can

beor

gani

zed

into

any

struc

ture

with

the f

ollo

wing

cond

ition

s:1)

a uni

que s

et,ha

shfu

nctio

nin

dex,

and

bloc

kad

dres

sm

apto

a sin

gle u

niqu

e sha

ring

vecto

r;an

d2)

ksh

arin

gve

ctors

can

beac

cess

edin

para

llel,

wher

ek

isth

enu

mbe

r of h

ash

func

tions

.An

exam

ple i

mpl

emen

tatio

nus

esa s

ingl

e-po

rted,

two-

dim

ensio

nal

SRAM

arra

yfo

r eac

hha

shfu

nctio

n,wh

ere

the

set i

ndex

selec

tsa

row,

and

the

hash

func

tion

selec

tsa

shar

ing

vecto

r,as

show

nin

Figu

re3(

b).

Asa

prac

tical

exam

ple,

Secti

on5

show

s tha

t aTL

desig

nwi

thfo

urha

shtab

lesan

d64

buck

etsin

each

filter

perfo

rms

well

for

!"#$%

!"#$&

'$(

%)*+(

,--#$..

/0./% /0./1

,--#$..

/0./% /0./1

(a)

!"#$%

!"#$&

!"#$%&'(

)*$+,

&-./%0

*122*.''

()'(3 ()'(4

5.%

(b)

Figu

re2:

(a) A

naïve

impl

emen

tatio

nwi

thon

e filte

r per

core

. (b)

Com

bini

ngall

ofth

e filte

rsfo

r one

row.

RMAXRegion

3.Th

ewrit

erwa

itsun

tilit

rece

ivesa

ckno

wled

gem

ents

from

allth

esh

arer

s.On

eof

thes

eac

know

ledge

men

tswi

llco

ntain

thed

ata.A

ltern

ative

ly,th

ewrit

erwi

llre

ceive

thed

atafro

mm

emor

y.

4.Th

ewrit

erno

tifies

thed

irecto

ryth

atth

etra

nsac

tion

isco

m-

plete

.

TLop

erate

side

ntica

llyex

cept

forw

heni

tsele

cts,i

nstep

2(b)

,apr

ovid

erth

atdo

esno

thav

ethe

data.

Inpr

actic

e,th

epro

babi

lity

ofth

esele

cteds

hare

rnot

havi

ngac

opyi

sver

ylow

;thu

s,th

ecom

mon

case

forT

Lpr

ocee

dsth

esam

east

heor

igin

alpr

otoc

ol.I

nth

eles

sco

mm

onca

se,w

hen

thep

rovid

erdo

esno

thav

eaco

py,i

tsen

dsa

NAck

toth

ewr

iteri

nstea

dof

anac

know

ledge

men

t.Th

islea

ves

thre

epos

sible

scen

ario

s:

1.No

copy

exist

s:Th

ewr

iter

will

wait

tore

ceive

allof

the

ackn

owled

gem

ents

and

the

NAck

,and

since

itha

sno

tre

ceive

dany

data,

itwill

send

areq

uest

toth

edire

ctory

which

then

asks

mem

ory

topr

ovid

ethe

data.

3

2.A

clean

copy

exist

s:Fo

rsim

plici

ty,th

isis

hand

ledid

en-

ticall

yto

first

case

.Al

lsha

rers

invali

date

their

copi

esan

dda

tais

retri

eved

from

mem

ory

inste

ad.

This

may

resu

ltin

ape

rform

ance

loss

ifit

happ

ens

frequ

ently

,but

Secti

on5

show

stha

tno

perfo

rman

celo

ssoc

curs

inpr

actic

e.

3.A

dirty

copy

exist

s:On

eoft

heot

hers

hare

rs,th

eown

er,w

illbe

inth

eMod

ified

orOw

neds

tate.

Theo

wner

cann

otdi

scar

dth

edirt

yda

tasin

ceit

does

notk

now

that

thep

rovi

derh

asa

copy

.Acc

ordi

ngly,

theo

wner

alwa

ysin

clude

sthe

data

inits

ackn

owled

gem

entt

oth

ewrit

er.In

this

case

,the

write

rwill

rece

iveth

epr

ovid

er’s

NAck

and

the

owne

r’sva

lidda

taan

dca

nth

usco

mpl

eteits

requ

est.

Ifth

epro

vide

rhas

acop

yas

well,

thew

riter

will

rece

ivetw

oid

entic

alco

pies

ofth

edata

.Se

ction

5.5de

mon

strate

stha

tthi

sisr

aree

noug

hth

atit

does

notr

esul

tina

nysig

nific

anti

ncre

asei

nban

dwid

thut

iliza

tion.

Inall

case

s,all

trans

actio

nsar

eser

ialize

dth

roug

hth

edire

ctory

,so

noad

ditio

nalr

acec

onsid

erati

onsa

rein

trodu

ced.

Sim

ilarly

,the

unde

rlyin

gde

adlo

ckan

dliv

elock

avoi

danc

em

echa

nism

sre

main

appl

icabl

e.

3 This

requ

esti

sser

ialize

dthr

ough

thed

irecto

ryto

avoi

drac

eswi

thev

icts.

3.2.

3Ev

ictio

nsan

dInv

alid

ates

Whe

na

bloc

kis

evict

edor

invali

dated

TLtri

esto

clear

the

appr

opria

tebi

tsin

the

corre

spon

ding

Bloo

mfil

terto

main

tain

accu

racy

.Ea

chBl

oom

filter

repr

esen

tsth

ebl

ocks

ina

singl

ese

t,m

akin

gits

impl

etod

eterm

inew

hich

bits

canb

eclea

red.

Thec

ache

simpl

yev

aluate

sall

theh

ash

func

tions

fora

llre

main

ing

bloc

ksin

the

set.

This

requ

ires

noex

tratag

arra

yac

cess

esas

alltag

sar

ealr

eady

read

aspa

rtof

then

orm

alca

cheo

pera

tion.

Anyb

itm

appe

dto

byth

eev

icted

orinv

alida

tedbl

ock

can

becle

ared

ifno

neof

the

rem

ainin

gbl

ocks

inth

ese

tmap

toth

esa

me

bit.

Whe

nus

ing

anin

clusiv

eL2

cach

e,as

inou

rexa

mpl

e,it

issu

fficie

ntto

chec

kon

lyth

eon

eL2

cach

ese

t.If

the

L2ca

che

isno

tinc

lusiv

e,th

enad

ditio

nalL

1acc

esse

smay

bene

cess

ary.

3.3A

Prac

tical

Impl

emen

tatio

nCo

ncep

tuall

y,TL

cons

istso

fan

Pgr

idof

Bloo

mfil

ters,

wher

eN

isth

enu

mbe

rof

cach

ese

ts,an

dP

isth

enu

mbe

rof

cach

es,a

ssho

wnin

Figu

re1(

c).E

ach

Bloo

mfil

terco

ntain

skbi

t-ve

ctors.

Forc

larity

wewi

llre

fert

oth

esize

ofth

eseb

itve

ctors

asth

enum

bero

fbuc

kets

.In

prac

tice,

thes

efilte

rsca

nbe

com

bine

din

toaf

ewSR

AMar

rays

prov

ided

that

afew

cond

ition

sare

met:

1)all

thefi

lters

uset

hesa

mes

etof

kha

shfu

nctio

ns,a

nd2)

anen

tire

row

ofP

filter

sisa

cces

sed

inpa

ralle

l.Th

eser

equi

rem

ents

dono

tsig

nific

antly

affec

tthe

effec

tiven

esso

fthe

Bloo

mfil

ters.

Conc

eptu

ally,

each

row

cont

ains

PBl

oom

filter

s,ea

chwi

thk

hash

func

tions

.Fi

gure

2(a)

show

son

esu

chro

w.By

alway

sac

cess

ingt

hese

filter

sinp

arall

elus

ingt

hesa

meh

ashf

uncti

ons,

the

filter

scan

beco

mbi

ned

tofo

rmas

ingl

eloo

kup

table,

assh

own

inFi

gure

2(b)

,whe

reea

chro

wco

ntain

saP-

bits

harin

gve

ctor.

Asa

resu

lt,ea

chlo

okup

dire

ctly

prov

ides

aP

-bit

shar

ing

vecto

rby

AND-

ing

ksh

arin

gve

ctors.

Sinc

eeve

ryro

wof

Bloo

mfil

tersu

ses

thes

ameh

ash

func

tions

,wec

anfu

rther

com

bine

thel

ooku

ptab

lesfo

rall

sets

fore

achh

ash

func

tion,

assh

own

inFi

gure

3(a)

.Th

ese

look

uptab

lesca

nbe

thou

ght

ofas

a“s

eaof

shar

ing

vecto

rs”an

dcan

beor

gani

zedi

ntoa

nystr

uctu

rewi

thth

efol

lowin

gco

nditi

ons:

1)au

niqu

eset,

hash

func

tion

inde

x,an

dbl

ock

addr

ess

map

toas

ingl

euni

ques

harin

gvec

tor;

and2

)ksh

arin

gvec

tors

can

beac

cess

edin

para

llel,

wher

ek

isth

enu

mbe

rofh

ash

func

tions

.An

exam

plei

mpl

emen

tatio

nuse

sasin

gle-

porte

d,tw

o-di

men

siona

lSR

AMar

ray

fore

ach

hash

func

tion,

wher

eth

ese

tind

exse

lects

arow

,and

theh

ash

func

tion

selec

tsas

harin

gve

ctor,

assh

own

inFi

gure

3(b)

.As

apr

actic

alex

ampl

e,Se

ction

5sh

owst

hata

TLde

sign

with

four

hash

tables

and

64bu

ckets

inea

chfil

terpe

rform

swe

llfo

r

!"#$%

!"#$&

'$(

%)*+(

,--#$..

/0./% /0./1

,--#$..

/0./% /0./1

(a)

!"#$%

!"#$&

!"#$%&'(

)*$+,

&-./%0

*122*.''

()'(3 ()'(4

5.%

(b)

Figu

re2:

(a)A

naïve

impl

emen

tatio

nwith

onefi

lterp

erco

re.(

b)Co

mbi

ning

allof

thefi

lters

foron

erow

.



Figure 4: Amoeba-Cache Architecture. Left: cache layout; size = N_sets × N_words × 8 bytes. T? (tag map): 1 bit per word, indicating whether the word is a tag (1 = yes). V? (valid map): 1 bit per word, indicating whether the word is allocated (1 = yes). Total overhead: 3%. Right: address-space layout. Region: aligned regions of RMAX words (× 8 bytes); an Amoeba-Block's range cannot exceed region boundaries. Cache indexing: the lower log2(RMAX) bits are masked off and the set index is derived from the remaining bits, so all words (and all Amoeba-Blocks) within a region index to the same set.
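The indexing rule in this caption reduces to a couple of shifts and a modulus. The sketch below is a minimal rendering of just that rule; WORD_BYTES, RMAX, and N_SETS are illustrative values, not the evaluated configuration, and the region-number helper is a hypothetical convenience.

```c
#include <stdint.h>

#define WORD_BYTES 8         /* 1 word = 8 bytes                          */
#define RMAX       8         /* region size in words (illustrative)       */
#define LOG2_RMAX  3
#define N_SETS     512       /* number of cache sets (illustrative)       */

/* Drop the byte offset and the low log2(RMAX) word bits, then derive the
 * set index from the remaining bits: every word of a region (and hence
 * every Amoeba-Block inside it) maps to the same set. */
static inline unsigned amoeba_set_index(uint64_t addr) {
    uint64_t word_addr   = addr / WORD_BYTES;
    uint64_t region_addr = word_addr >> LOG2_RMAX;
    return (unsigned)(region_addr % N_SETS);
}

/* The remaining high-order bits identify the region; whether the stored
 * region tag also keeps the set-index bits is a design detail not fixed
 * by the caption. */
static inline uint64_t amoeba_region_number(uint64_t addr) {
    return (addr / WORD_BYTES) >> LOG2_RMAX;
}
```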

[Lookup-logic figure, labels only: N_words/2; Region Tag; Start W / End W; T? Bitmap; Addr ∈ Tag; Word Selector; Word (W); SetIndex; Addr; Output buffer; Critical path; Hit?; comparator widths: ==Region (>50 bits), word offsets (≤3 bits).]
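Taking these labels at face value, each candidate tag in the set appears to be checked by a wide region-tag comparison plus a narrow range test of the requested word offset against the block's Start/End words. The sketch below encodes only that per-tag predicate; the struct layout, the field names, and the inclusive End bound are assumptions, and it does not model the T?/V? bitmaps or the parallel scan of all tags in a set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of one Amoeba-Block tag stored in the data array. */
struct amoeba_tag {
    uint64_t region;    /* region tag: the wide (>50-bit) comparison      */
    uint8_t  start_w;   /* first word covered, offset within the region   */
    uint8_t  end_w;     /* last word covered (inclusive end assumed)      */
};

/* Per-tag hit test: the region must match and the requested word offset
 * must fall inside the block's [start_w, end_w] range. */
static inline bool amoeba_hit(const struct amoeba_tag *t,
                              uint64_t req_region, unsigned req_word_off) {
    return t->region == req_region &&
           req_word_off >= t->start_w &&
           req_word_off <= t->end_w;
}
```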

3. The writer waits until it receives acknowledgements from all the sharers. One of these acknowledgements will contain the data. Alternatively, the writer will receive the data from memory.

4. The writer notifies the directory that the transaction is complete.

TL operates identically except for when it selects, in step 2(b), a provider that does not have the data. In practice, the probability of the selected sharer not having a copy is very low; thus, the common case for TL proceeds the same as the original protocol. In the less common case, when the provider does not have a copy, it sends a NAck to the writer instead of an acknowledgement. This leaves three possible scenarios:

1. No copy exists: The writer will wait to receive all of the acknowledgements and the NAck, and since it has not received any data, it will send a request to the directory which then asks memory to provide the data.³

2. A clean copy exists: For simplicity, this is handled identically to the first case. All sharers invalidate their copies and data is retrieved from memory instead. This may result in a performance loss if it happens frequently, but Section 5 shows that no performance loss occurs in practice.

3. A dirty copy exists: One of the other sharers, the owner, will be in the Modified or Owned state. The owner cannot discard the dirty data since it does not know that the provider has a copy. Accordingly, the owner always includes the data in its acknowledgement to the writer. In this case, the writer will receive the provider's NAck and the owner's valid data and can thus complete its request. If the provider has a copy as well, the writer will receive two identical copies of the data. Section 5.5 demonstrates that this is rare enough that it does not result in any significant increase in bandwidth utilization.

In all cases, all transactions are serialized through the directory, so no additional race considerations are introduced. Similarly, the underlying deadlock and livelock avoidance mechanisms remain applicable.

³This request is serialized through the directory to avoid races with evicts.

3.2.3 Evictions and Invalidates
When a block is evicted or invalidated, TL tries to clear the appropriate bits in the corresponding Bloom filter to maintain accuracy. Each Bloom filter represents the blocks in a single set, making it simple to determine which bits can be cleared. The cache simply evaluates all the hash functions for all remaining blocks in the set. This requires no extra tag array accesses as all tags are already read as part of the normal cache operation. Any bit mapped to by the evicted or invalidated block can be cleared if none of the remaining blocks in the set map to the same bit. When using an inclusive L2 cache, as in our example, it is sufficient to check only the one L2 cache set. If the L2 cache is not inclusive, then additional L1 accesses may be necessary.

3.3 A Practical Implementation
Conceptually, TL consists of an N×P grid of Bloom filters, where N is the number of cache sets and P is the number of caches, as shown in Figure 1(c). Each Bloom filter contains k bit-vectors. For clarity we will refer to the size of these bit vectors as the number of buckets. In practice, these filters can be combined into a few SRAM arrays provided that a few conditions are met: 1) all the filters use the same set of k hash functions, and 2) an entire row of P filters is accessed in parallel. These requirements do not significantly affect the effectiveness of the Bloom filters. Conceptually, each row contains P Bloom filters, each with k hash functions. Figure 2(a) shows one such row. By always accessing these filters in parallel using the same hash functions, the filters can be combined to form a single lookup table, as shown in Figure 2(b), where each row contains a P-bit sharing vector. As a result, each lookup directly provides a P-bit sharing vector by AND-ing k sharing vectors. Since every row of Bloom filters uses the same hash functions, we can further combine the lookup tables for all sets for each hash function, as shown in Figure 3(a). These lookup tables can be thought of as a "sea of sharing vectors" and can be organized into any structure with the following conditions: 1) a unique set, hash function index, and block address map to a single unique sharing vector; and 2) k sharing vectors can be accessed in parallel, where k is the number of hash functions. An example implementation uses a single-ported, two-dimensional SRAM array for each hash function, where the set index selects a row, and the hash function selects a sharing vector, as shown in Figure 3(b). As a practical example, Section 5 shows that a TL design with four hash tables and 64 buckets in each filter performs well for

Figure 2: (a) A naïve implementation with one filter per core. (b) Combining all of the filters for one row.
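The combined lookup described above reduces to AND-ing the k sharing vectors selected by the k hashes. The sketch below is only an illustration of that step under assumed sizes (P = 16 caches, k = 4 hashes, 64 buckets, as in the four-table/64-bucket design point); the hash and table layout are not the actual implementation.

```cpp
#include <array>
#include <bitset>
#include <cstdint>
#include <vector>

constexpr int kCaches  = 16;  // P: number of caches (assumed)
constexpr int kHashes  = 4;   // k: hash functions
constexpr int kBuckets = 64;  // buckets per filter

using SharingVector = std::bitset<kCaches>;

// One 2-D table per hash function: [set][bucket] -> P-bit sharing vector.
struct SharingTable {
  std::vector<std::array<SharingVector, kBuckets>> rows;
  explicit SharingTable(int num_sets) : rows(num_sets) {}
};

// Illustrative bucket selection for hash function h (placeholder hash).
inline int bucket(uint64_t block_addr, int h) {
  return static_cast<int>((block_addr >> (6 + 4 * h)) % kBuckets);
}

// A lookup ANDs the k selected sharing vectors, yielding a conservative
// (possibly over-approximate) set of sharers.
SharingVector lookup(const std::array<SharingTable, kHashes>& tables,
                     int set, uint64_t block_addr) {
  SharingVector sharers;
  sharers.set();  // start from "all caches" and intersect
  for (int h = 0; h < kHashes; ++h)
    sharers &= tables[h].rows[set][bucket(block_addr, h)];
  return sharers;
}
```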

(Figure 5 datapath annotations: the Region tag "==" comparison operates on >50 bits; the Start/End "≤" range comparisons operate on ≤3-bit word offsets; steps 1–3 operate over the Data Array.)

Lookup steps. y1 Mask the lower log2(RMAX) bits and index into the data array. y2 Identify tags using the T? bitmap and initiate tag checks with the ∈ operator (Region tag comparator == and ≤ range comparator). y3 Word selection within the Amoeba-Block.

Figure 5: Lookup Logic
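The ∈ check in steps 2–3 amounts to a short predicate: a hit requires the region tags to match and the requested word offset to fall inside the block's Start/End range. The struct below is only a sketch of the information an in-array tag carries; field widths and the data-placement assumption are illustrative.

```cpp
#include <cstdint>

// Sketch of an Amoeba-Block tag word (field widths illustrative).
struct AmoebaTag {
  uint64_t region;  // aligned-region tag (== comparator, >50 bits in Fig. 5)
  uint8_t  start;   // first word covered, 0..RMAX-1 (<= comparator)
  uint8_t  end;     // last word covered,  0..RMAX-1
};

// Step 2: the "Addr ∈ Tag" test performed at every position the
// T? bitmap marks as a tag.
inline bool addr_in_block(const AmoebaTag& t, uint64_t region, uint8_t word) {
  return t.region == region && t.start <= word && word <= t.end;
}

// Step 3: word selection, assuming (for the sketch) that the data words
// immediately follow the tag word within the set.
inline unsigned word_select(const AmoebaTag& t, uint8_t word, unsigned tag_pos) {
  return tag_pos + 1 + (word - t.start);
}
```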


Tag = Amoeba-Block in set). y2 The data blocks that overlap with the miss range are evicted and moved one-at-a-time to the MSHR entry. y3 Space is then allocated for the new block, i.e., it is treated like a new insertion. y4 A miss request is issued for the entire block (START:0 — END:7) even if only some words (e.g., 0, 4, and 7) may be needed. This ensures request processing is simple and only a single refill request is sent. y5 Finally, the incoming data block is patched into the MSHR; only the words not obtained from the L1 are copied (since the lower level could be stale).

(Figure 6 diagram: a miss for words 0–7 finds resident Amoeba-Blocks covering words 1–3 and 5–6; the overlap test is == Region, Start_NEW ≤ End, and End_NEW > Start; the overlapping blocks are moved to the MSHR, a refill for 0–7 is issued, and the missing words (X) are patched in before insertion of the new block.)

y1 Identify blocks overlapping with the new block. y2 Evict overlapping blocks to the MSHR. y3 Allocate space for the new block (treat it like a new insertion). y4 Issue a refill request to the lower level for the entire block. y5 Patch only the newer words, as the lower-level data could be stale.
Figure 6: Partial Miss Handling. Upper: Identify relevant sub-blocks. Useful for other cache controller events as well, e.g., recalls. Lower: Refill of words and insertion.
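Steps 1–2 boil down to the overlap test shown in the figure (same region, Start_NEW ≤ End and End_NEW > Start) applied to every block in the indexed set. A minimal sketch, with assumed container types:

```cpp
#include <cstdint>
#include <vector>

struct Block { uint64_t region; uint8_t start, end; };  // word range, as in Fig. 6

// Overlap test from Figure 6: same region, Start_new <= End and End_new > Start.
inline bool overlaps(const Block& resident, const Block& incoming) {
  return resident.region == incoming.region &&
         incoming.start <= resident.end &&
         incoming.end   >  resident.start;
}

// Steps 1-2: collect the resident Amoeba-Blocks of the indexed set that
// overlap the miss range; the controller then evicts them one at a time
// into the MSHR entry before issuing a single refill for the full range.
std::vector<Block> identify_sub_blocks(const std::vector<Block>& set_blocks,
                                       const Block& miss_range) {
  std::vector<Block> victims;
  for (const Block& b : set_blocks)
    if (overlaps(b, miss_range)) victims.push_back(b);
  return victims;
}
```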

4 Hardware Complexity
We analyze the complexity of Amoeba-Cache along the following directions: we quantify the additions needed to the cache controller, we analyze the latency, area, and energy penalty, and finally, we study the challenges specifically introduced by large caches.

4.1 Cache Controller
The variable granularity Amoeba-Blocks need specific consideration in the cache controller. We focus on the L1 controller here, and in particular, partial misses. The cache controller manages operations at the aligned RMAX granularity. The controller permits only one in-flight cache operation per RMAX region. In-flight cache operations ensure no address overlap with stable Amoeba-Blocks in order to eliminate complex race conditions. Figure 7 shows the L1 cache controller state machine. We add two states to the default protocol, IV_B and IV_C, to handle partial misses. IV_B is a blocking state that blocks other cache operations to the RMAX region until all Amoeba-Blocks relevant to a partial miss are evicted (e.g., the 0–3 and 5–7 blocks in Figure 6). IV_C indicates partial miss completion. This enables the controller to treat the access as a full miss and issue the refill request. The other stable states (I, V, D) and transient states (IV_Data and ID_Data) are present in a conventional protocol as well. Partial-miss triggers the clean-up operations (1 and 2 in Figure 6). Local_L1_Evict is a looping event that keeps retriggering for each Amoeba-Block involved in the partial miss; Last_L1_Evict is triggered when the last Amoeba-Block involved in the partial miss is evicted to the MSHR.

(Figure 7 diagram: state machine with states NP, V, D, IV_Data, ID_Data, IV_B, and IV_C; transitions on Load, Store, Partial Miss, Local L1_Evict, Last L1_Evict, L2 DATA, L1 Evict, and L1 Writeback events.)

Cache controller states
State    Description
NP       Amoeba-Block not present in the cache.
V        All words corresponding to the Amoeba-Block present and valid (read-only).
D        Valid and at least one word in the Amoeba-Block is dirty (read-write).
IV_B     Partial miss being processed (blocking state).
IV_Data  Load miss; waiting for data from the L2.
ID_Data  Store miss; waiting for data. Set dirty bit.
IV_C     Partial miss cleanup from cache completed (treat as a full miss).
Amoeba-specific cache events
Partial miss:    Process partial miss.
Local L1 Evict:  Remove overlapping Amoeba-Block to the MSHR.
Last L1 Evict:   Last Amoeba-Block moved to the MSHR. Convert to a full miss and process the load or store.
Bold and broken lines: Amoeba-Cache additions.

Figure 7: Amoeba Cache Controller (L1 level).

A key difference between the L1 and lower-level protocols is that the Load/Store event in the lower-level protocol may need to access data from multiple Amoeba-Blocks. In such cases, similar to the partial-miss event, we read out each block independently before supplying the data (more details in § 5.2).
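The partial-miss path can be pictured as a small transition function over the states in Figure 7. This is only an illustration of the textual description above, not the exact protocol table; event and state names follow the figure.

```cpp
// States and Amoeba-specific events from Figure 7 (sketch only).
enum class State { NP, V, D, IV_B, IV_C, IV_Data, ID_Data };
enum class Event { PartialMiss, LocalL1Evict, LastL1Evict, L2Data, Load, Store };

// Illustrative transitions: block in IV_B while overlapping Amoeba-Blocks
// drain to the MSHR, then IV_C converts the access into a full miss.
State next_state(State s, Event e) {
  switch (s) {
    case State::NP:
      if (e == Event::Load)  return State::IV_Data;
      if (e == Event::Store) return State::ID_Data;
      return s;
    case State::V:
    case State::D:
      return (e == Event::PartialMiss) ? State::IV_B : s;
    case State::IV_B:
      if (e == Event::LocalL1Evict) return State::IV_B;  // loop per victim block
      if (e == Event::LastL1Evict)  return State::IV_C;  // all victims in MSHR
      return s;
    case State::IV_C:  // treated as a full miss: issue the refill
      if (e == Event::Load)  return State::IV_Data;
      if (e == Event::Store) return State::ID_Data;
      return s;
    case State::IV_Data:
      return (e == Event::L2Data) ? State::V : s;
    case State::ID_Data:
      return (e == Event::L2Data) ? State::D : s;
  }
  return s;
}
```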

4.2 Area, Latency, and Energy Overhead

The extra metadata required by Amoeba-Cache are the T? (1 tag bit per word) and V? (1 valid bit per word) bitmaps. Table 3 shows the quantitative overhead compared to the data storage. Both the T? and V? bitmap arrays are directly proportional to the size of the cache and require a constant storage overhead (3% in total). The T? bitmap is read in parallel with the data array and does not affect the critical path; T? adds 2%—3.5% (depending on cache size) to the overall cache access energy. V? is referred to only on misses when inserting a new block.
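A quick check of that figure, assuming 8-byte words: the two per-word bits cost $2/64 \approx 3.1\%$ of the data storage, and for a 64KB data array each bitmap is $64\,\text{KB} / 8\,\text{B} = 8192$ bits $= 1$ KB.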

Table 3: Amoeba-Cache Hardware Complexity.
                         64K (256B/set)   1MB (512B/set)   4MB (1024B/set)
Data RAM parameters
  Delay                  0.36ns           2ns              2.5ns
  Energy                 100pJ            230pJ            280pJ
Amoeba-Cache components (CACTI model)
  T?/V? map              1KB              16KB             64KB
  Latency                0.019ns (5%)     0.12ns (6%)      0.2ns (6%)
  Energy                 2pJ (2%)         8pJ (3.4%)       10pJ (3.5%)
  LRU                    1/8 KB           2KB              8KB
Lookup overhead (VHDL model)
  Area                   0.7%             0.1%
  Latency                0.02ns           0.035ns          0.04ns
% indicates overhead compared to the data array of the cache. The 64K cache operates in Fast mode; the 1MB and 4MB caches operate in Normal mode. We use 32nm ITRS HP transistors for 64K and 32nm ITRS LOP transistors for 1MB and 4MB.


We synthesized¹ the cache lookup logic using Synopsys and quantify the area, latency, and energy penalty. Amoeba-Cache is compatible with Fast and Normal cache access modes [28, -access-mode config], both of which read the entire set from the data array in parallel with the way selection to achieve lower latency. Fast mode transfers the entire set to the edge of the H-tree, while Normal mode only transmits the selected way over the H-tree. For synthesis, we used the Synopsys design compiler (Vision Z-2007.03-SP5).

Figure 5 shows Amoeba-Cache's lookup hardware on the critical path; we compare it against a fixed-granularity cache's lookup logic (mainly the comparators). The area overhead of the Amoeba-Cache includes registering an entire line that has been read out, the tag operation logic, and the word selector. The components on the critical path once the data is read out are the 2-way multiplexers, the ∈ comparators, and the priority encoder that selects the word; the T? bitmap is accessed in parallel and off the critical path. Amoeba-Cache is made feasible under today's wire-limited technology where the cache latency and energy is dominated by the bit/word lines, decoder, and H-tree [28]. Amoeba-Cache's comparators, which operate on the entire cache set, are 6× the area of a fixed cache's comparators. Note that the data array occupies 99% of the overall cache area. The critical path is dominated by the wide word selector since the comparators all operate in parallel. The lookup logic adds 60% to the conventional cache's comparator time. The overall critical path is dominated by the data array access, and Amoeba-Cache's lookup circuit adds 0.02ns to the access latency and ≈1pJ to the energy of a 64K cache, and 0.035ns to the latency and ≈2pJ to the energy of a 1MB cache. Finally, Amoeba-Cache amortizes the energy penalty of the peripheral components (H-tree, wordline, and decoder) over a single RAM.

Amoeba-Cache's overhead needs careful consideration when implemented at the L1 cache level. We have two options for handling the latency overhead: a) if the L1 cache is the critical stage in the pipeline, we can throttle the CPU clock by the latency overhead to ensure that the additional logic fits within the pipeline stage. This ensures that the number of pipeline stages for a memory access does not change with respect to a conventional cache, although all instructions bear the overhead of the reduced CPU clock. b) we can add an extra pipeline stage to the L1 hit path, adding a 1 cycle overhead to all memory accesses but ensuring no change in CPU frequency. We quantify the performance impact of both approaches in Section 6.

4.3 Tag-only Operations

Conventional caches support tag-only operations to reduce data port contention. While the Amoeba-Cache merges tags and data, like many commercial processors it decouples the replacement metadata and valid bits from the tags, accessing the tags only on cache lookup. Lookups can be either CPU side or network side (coherence invalidation and Wback/Forwarding). CPU-side lookups and writebacks (≈95% of cache operations) both need data, and hence Amoeba-Cache in the common case does not introduce extra overhead. Amoeba-Cache does read out the entire data array, unlike serial-mode caches (we discuss this issue in the next section). Invalidation checks and snoops can be more energy expensive with Amoeba-Cache compared to a conventional cache. Fortunately, coherence snoops are not common in many applications (e.g., 1/100 cache operations in SpecJBB) as a coherence directory and an inclusive LLC filter them out.

¹We do not have access to an industry-grade 32nm library, so we synthesized at a higher 180nm node size and scaled the results to 32nm (latency and energy scaled proportional to Vdd (taken from [36]) and Vdd² respectively).

4.4 Tradeoff with Large Caches

(Figure 8 plot: Normal-mode latency and energy relative to Serial mode for 2M, 4M, and 8M caches with 4–32 ways; values ≤ 1 mean Normal is better. Baseline: Serial. 32nm, ITRS LOP.)
Figure 8: Serial vs Normal mode cache.

Large caches with many words per set (≡ a highly associative conventional cache) need careful consideration. Typically, highly associative caches tend to serialize tag and data access, with only the relevant cache block read out on a hit and no data access on a miss. We first analyze the tradeoff between reading the entire set (normal mode), which is compatible with Amoeba-Cache, and reading only the relevant block (serial mode). We vary the cache size from 2M—8M and the associativity from 4 (256B/set) — 32 (2048B/set). Under current technology constraints (Figure 8), only at very high associativity does serial mode demonstrate a notable energy benefit. Large caches are dominated by H-tree energy consumption, and reading out the entire set at each sub-bank imposes an energy penalty when bitlines and wordlines dominate (2KB+ # of words/set).

Table 4: % of direct accesses with fast tags
              64K (256B/set)    1MB (512B/set)    2MB (1024B/set)
# Tags/set    2        4        4        8        8        16
Overhead      1KB      2KB      2KB      16KB     16KB     32KB
Benchmarks
  Low         30%      45%      42%      64%      55%      74%
  Moderate    24%      62%      46%      70%      63%      85%
  High        35%      79%      67%      95%      75%      96%

Amoeba-Cache can be tuned to minimize the hardware overhead for large caches. With many words/set the cache utilization improves due to longer block lifetimes, making it feasible to support Amoeba-Blocks with a larger minimum granularity (> 1 word). If we increase the minimum granularity to two or four words, only every third or fifth word could be a tag, meaning the # of comparators and multiplexers reduces to $N_{words/set}/3$ or $N_{words/set}/5$. When the minimum granularity is equal to the maximum granularity (RMAX), we obtain a fixed granularity cache with $N_{words/set}/RMAX$ ways. Cache organizations that collocate all the tags together at the head of the data array enable tag-only operations and serial Amoeba-Block accesses that need to activate only a portion of the data array. However, the set may need to be compacted at each insertion. Recently, Loh and Hill [23] explored such an organization for supporting tags in multi-gigabyte caches.

Finally, the use of Fast Tags helps reduce the tag lookups in the data array. Fast tags use a separate traditional tag-array-like structure to cache the tags of the recently-used blocks and provide a pointer directly to the Amoeba-Block. The # of fast tags needed per set is proportional to the # of blocks in each set, which varies with the spatial locality in the application and the # of bytes per set (more details in Section 6.1). We studied 3 different cache configurations (64K 256B/set, 1M 512B/set, and 2M 1024B/set) while varying the number of fast tags per set (see Table 4). With 8 tags/set (16KB


overhead), we can filter 64—95% of the accesses in a 1MB cache and 55—75% of the accesses in a 2MB cache.
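A fast-tag entry can be pictured as a small cached pointer from a block's region and word range to its position inside the set, letting a hit skip the in-array tag scan. The structure below is a sketch with assumed field names and sizes, not the exact organization.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Sketch of one fast-tag entry: maps a recently used Amoeba-Block to its
// location inside the set so a hit can index the data array directly.
struct FastTag {
  bool     valid   = false;
  uint64_t region  = 0;   // aligned-region tag of the cached block
  uint8_t  start   = 0;   // word range covered by the block
  uint8_t  end     = 0;
  uint16_t way_pos = 0;   // offset of the block's tag word within the set
};

// With T fast tags per set (4 at the L1, 8 at the L2 in the evaluation),
// a lookup checks the small array first and falls back to the full scan.
template <std::size_t T>
std::optional<uint16_t> fast_lookup(const FastTag (&tags)[T],
                                    uint64_t region, uint8_t word) {
  for (const FastTag& t : tags)
    if (t.valid && t.region == region && t.start <= word && word <= t.end)
      return t.way_pos;   // direct access into the data array
  return std::nullopt;    // fast-tag miss: perform the normal in-array lookup
}
```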

5 Chip-Level Issues

5.1 Spatial Patterns Prediction
(Figure 9 diagram: on a miss, the miss PC or region tag together with the critical-word index selects an access-bitmap entry in the pattern table; the matching entry yields the Block <Start, End> range to fetch.)
Figure 9: Spatial predictor invoked on an Amoeba-Cache miss.

Amoeba-Cache needs a spatial block predictor, which informs refill requests about the range of the block to fetch. Amoeba-Cache can exploit any spatial locality predictor and there have been many efforts in the compiler and architecture community [7, 17, 29, 6]. We adopt a table-driven approach consisting of a set of access bitmaps; each entry is RMAX (the maximum granularity of an Amoeba-Block) bits wide and represents whether the word was touched during the lifetime of the recently evicted cache block. On a miss, the predictor will search for an entry (indexed by either the miss PC or the region address) and choose the range of words to be fetched on the miss, on either side (left and right) of the critical word. The PC-based indexing also uses the critical word index for improved accuracy. The predictor optimizes for spatial prefetching and will overfetch (bring in potentially untouched words) if they are interspersed amongst contiguous chunks of touched words. We can also bypass the prediction when there is low confidence in the prediction accuracy. For example, for streaming applications without repeated misses to a region, we can bring in a fixed granularity block based on the overall global behavior of the application. We evaluate tradeoffs in the design of the spatial predictor in Section 6.2.
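A table-driven predictor of this kind can be sketched as a map from a prediction index to an RMAX-bit bitmap harvested at eviction, with a range chosen around the critical word on a miss. The names, sizes, and the simple contiguity-only range heuristic below are assumptions for illustration; the paper's predictor may also span small holes between touched chunks.

```cpp
#include <bitset>
#include <cstdint>
#include <unordered_map>
#include <utility>

constexpr int kRMax = 8;  // words per region (assumed)

// Pattern table: prediction index (PC- or region-based) -> per-word
// touched bitmap recorded when a block was last evicted.
using Pattern = std::bitset<kRMax>;
std::unordered_map<uint64_t, Pattern> pattern_table;

// Training: on eviction, record which words of the block were touched.
void train(uint64_t index, const Pattern& touched) { pattern_table[index] = touched; }

// Prediction: grow a contiguous range left and right of the critical word
// as long as the history marks the neighbouring words as touched.
std::pair<int, int> predict_range(uint64_t index, int critical_word) {
  auto it = pattern_table.find(index);
  if (it == pattern_table.end())
    return {0, kRMax - 1};  // no history: fall back to a default granularity
  const Pattern& p = it->second;
  int start = critical_word, end = critical_word;
  while (start > 0 && p[start - 1]) --start;
  while (end < kRMax - 1 && p[end + 1]) ++end;
  return {start, end};
}
```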

5.2 Multi-level Caches
We discuss the design of inclusive cache hierarchies including multiple Amoeba-Caches; we illustrate using a 2-level hierarchy. Inclusion means that the L2 cache contains a superset of the data words in the L1 cache; however, the two levels may include different granularity blocks. For example, the Sun Niagara T2 uses 16 byte L1 blocks and 64 byte L2 blocks. Amoeba-Cache permits non-aligned blocks of variable granularity at the L1 and the L2, and needs to deal with two issues: a) L2 recalls that may invalidate multiple L1 blocks, and b) L1 refills that may need data from multiple blocks at the L2. For both cases, we need to identify all the relevant Amoeba-Blocks that overlap with either the recall or the refill request. This situation is similar to a Niagara L2 eviction, which may need to recall 4 L1 blocks. Amoeba-Cache's logic ensures that all Amoeba-Blocks from a region map to a single set at any level (using the same RMAX for both L1 and L2). This ensures that L2 recalls or L1 refills index into only a single set. To process multiple blocks for a single cache operation, we use the step-by-step process outlined in Section 3.5 ( y1 and y2 in Figure 6). Finally, the L1-L2 interconnect needs 3 virtual networks, two of which, the L2→L1 data virtual network and the L1→L2 writeback virtual network, can have packets of variable granularity; each packet is broken down into a variable number of smaller physical flits.

5.3 Cache Coherence

There are three main challenges that variable cache line granularity introduces when interacting with the coherence protocol: 1) How is the coherence directory maintained? 2) How is variable granularity read sharing supported? and 3) What is the granularity of write invalidations? The key insight that ensures compatibility with a conventional fixed-granularity coherence protocol is that an Amoeba-Block always lies within an aligned RMAX-byte region (see § 3). To ensure correctness, it is sufficient to maintain the coherence granularity and directory information at a fixed granularity ≤ RMAX. Multiple cores can simultaneously cache any variable granularity Amoeba-Block from the same region in Shared state; all such cores are marked as sharers in the directory entry. A core that desires exclusive ownership of an Amoeba-Block in the region uses the directory entry to invalidate every Amoeba-Block corresponding to the fixed coherence granularity. All Amoeba-Blocks relevant to an invalidation will be found in the same set in the private cache (see set indexing in § 3). The coherence granularity could potentially be < RMAX so that false sharing is not introduced in the quest for higher cache utilization (larger RMAX). The core claiming ownership on a write will itself fetch only the desired granularity Amoeba-Block, saving bandwidth. A detailed evaluation of the coherence protocol is beyond the scope of the current paper.

6 Evaluation

Framework: We evaluate the Amoeba-Cache architecture with the Wisconsin GEMS simulation infrastructure [25]; we use the in-order processor timing model. We have replaced the SIMICS functional simulator with the faster Pin [24] instrumentation framework to enable longer simulation runs. We perform timing simulation for 1 billion instructions. We warm up the cache using 20 million accesses from the trace. We model the cache controller in detail, including the transient states needed for the multi-step cache operations and all the associated port and queue contention. We use a Full-LRU replacement policy, evicting Amoeba-Blocks in LRU order until sufficient space is freed up for the block to be brought in. This helps decouple our observations from the replacement policy, enabling a fairer comparison with other approaches (Section 9). Our workloads are a mix of applications whose working sets stress our caches and include SPEC CPU benchmarks, Dacapo Java benchmarks [4], commercial workloads (SpecJBB2005, TPC-C, and Apache), and the Firefox web browser. Table 1 classifies the application categories, Low, Moderate, and High, based on the spatial locality. When presenting averages of ratios or improvements, we use the geometric mean.

6.1 Improved Memory Hierarchy Efficiency
Result 1: Amoeba-Cache increases cache capacity by harvesting space from unused words and can achieve an 18% reduction in both L1 and L2 miss rate.
Result 2: Amoeba-Cache adaptively sizes the cache block granularity and reduces L1↔L2 bandwidth by 46% and L2↔Memory bandwidth by 38%.

In this section, we compare the bandwidth and miss rate properties of Amoeba-Cache against a conventional cache. We evaluate two types of caches: a Fixed cache, which represents a conventional set-associative cache, and the Amoeba-Cache. In order to isolate the benefits of Amoeba-Cache from the potentially changing accuracy of the spatial predictor across different cache geometries, we use utilization at the next eviction as the spatial prediction, determined from a prior run on a fixed granularity cache. This also ensures that the spatial granularity predictions can be replayed across multiple simulation runs. To ensure equivalent data storage space, we set the Amoeba-Cache size to the sum of the tag array and the data array in a conventional cache. At the L1 level (64K), the net capacity of


the Amoeba-Cache is 64K + 8*4*256 bytes, and at the L2 level (1M) configuration, it is 1M + 8*8*2048 bytes. The L1 cache has 256 sets and the L2 cache has 2048 sets.

(Figure 10 plots: bandwidth (KB/1K ins) vs. miss rate (misses/1K ins) for Amoeba and Fixed. Panels: (a) 64K - Low, (b) 1M - Low, (c) 64K - Moderate, (d) 1M - Moderate, (e) 64K - High, (f) 1M - High.)
Figure 10: Fixed vs. Amoeba (Bandwidth and Miss Rate). Note the different scale for different application groups.

Figure 10 plots the miss rate and the traffic characteristics of the Amoeba-Cache. Since Amoeba-Cache can hold blocks varying from 8B to 64B, each set can hold more blocks by utilizing the space from untouched words. Amoeba-Cache reduces the 64K L1 miss rate by 23% (stdev: 24) for the Low group, and by 21% (stdev: 16) for the Moderate group; even applications with high spatial locality experience a 7% (stdev: 8) improvement in miss rate. There is a 46% (stdev: 20) reduction on average in L1↔L2 bandwidth. At the 1M L2 level, Amoeba-Cache improves the Moderate group's miss rate by 8% (stdev: 10) and bandwidth by 23% (stdev: 12). Applications with moderate utilization make better use of the space harvested from unused words by Amoeba-Cache. Many low utilization applications tend to be streaming, and providing extra cache space does not help lower their miss rate. However, by not fetching unused words, Amoeba-Cache achieves a significant reduction (38% (stdev: 24) on average) in off-chip L2↔Memory bandwidth; even High utilization applications see a 17% (stdev: 15) reduction in bandwidth. Utilization and miss rate are not, however, always directly correlated (more details in § 8).

With Amoeba-Cache, the # of blocks/set varies based on the granularity of the blocks being fetched, which in turn depends on the spatial locality in the application. Table 5 shows the avg. # of blocks/set. In applications with low spatial locality, Amoeba-Cache adjusts the block size and adapts to store many smaller blocks. The 64K L1

Table 5: Avg. # of Amoeba-Blocks / Set
64K cache, 288 B/set
  4—5     ferret, cactus, firefox, eclipse, facesim, freqmine, milc, astar
  6—7     tpc-c, tradesoap, soplex, apache, fluidanimate
  8—9     h2, canneal, omnetpp, twolf, x264, lbm, jbb
  10—12   mcf, art
1M cache, 576 B/set
  3—5     eclipse, omnetpp
  8—9     cactus, firefox, tradesoap, freqmine, h2, x264, tpc-c
  10—11   facesim, soplex, astar, milc, apache, ferret
  12—13   twolf, art, jbb, lbm, fluidanimate
  15—18   canneal, mcf

(Figure 11 bar charts: for each benchmark, the % of Amoeba-Blocks that are 1-2, 3-4, 5-6, and 7-8 words wide, with the average utilization printed above each bar. Panels: (a) 64K L1 cache, (b) 1M L2 cache.)
Figure 11: Distribution of cache line granularities in the 64K L1 and 1M L2 Amoeba-Cache. Avg. utilization is on top.

Amoeba-Cache stores 10 blocks per set for mcf and 12 blocks per set for art, effectively increasing associativity without introducing fixed hardware overheads. At the L2, when the working set starts to fit in the L2 cache, the set is partitioned into fewer blocks. Note that applications like eclipse and omnetpp hold only 3—5 blocks per set on average (lower than conventional associativity) due to their low miss rates (see Table 8). With streaming applications (e.g., canneal), Amoeba-Cache increases the # of blocks/set to >15 on average. Finally, some applications like apache store between 6—7 blocks/set with a 64K cache with varied block sizes (see Figure 11): approximately 50% of the blocks store 1-2 words and 30% of the blocks store 8 words at the L1. As the size of the cache increases, and thereby the lifetime of the blocks, the Amoeba-Cache adapts to store larger blocks, as can be seen in Figure 11.


Utilization is improved greatly across all applications (90%+ in many cases). Figure 11 shows the distribution of cache block granularities in Amoeba-Cache. The Amoeba-Block distribution matches the word access distribution presented in § 2. With the 1M cache, the larger cache size improves block lifespan and thereby utilization, with a significant drop in the % of 1—2 word blocks. However, in many applications (tpc-c, apache, firefox, twolf, lbm, mcf), up to 20% of the blocks are 3–6 words wide, indicating the benefits of adaptivity and the challenges faced by Fixed.

6.2 Overall Performance and Energy
Result 3: Amoeba-Cache improves overall cache efficiency and boosts performance by 10% on commercial applications², saving up to 11% of the energy of the on-chip memory hierarchy. Off-chip L2↔Memory energy sees a mean reduction of 41% across all workloads (86% for art and 93% for twolf).

We model a two-level cache hierarchy in which the L1 is a 64K cache with 256 sets (3 cycles load-to-use) and the L2 is 1M, 8192 sets (20 cycles). We assume a fixed memory latency of 300 cycles. We conservatively assume that the L1 access is the critical pipeline stage and throttle the CPU clock by 4% (we evaluate an alternative approach in the next section). We calculate the total dynamic energy of the Amoeba-Cache using the energy #s determined in Section 4 through a combination of synthesis and CACTI [28]. We use 4 fast tags per set at the L1 and 8 fast tags per set at the L2. We include the penalty for all the extra metadata in Amoeba-Cache. We derive the energy for a single L1—L2 transfer (6.8pJ per byte) from [2, 28]. The interconnect uses full-swing wires at 32nm, 0.6V.

(Figure 12 bar charts: per-benchmark % performance improvement and % reduction in on-chip energy; off-scale bars are annotated (Energy +21% / Perf. +11%, Energy +33% / Perf. +50%, Perf. +36%).)
Figure 12: % improvement in performance and % reduction in on-chip memory hierarchy energy. Higher is better. Y-axis terminated to illustrate bars clearly. Baseline: Fixed, 64K L1, 1M L2.

Figure 12 plots the overall improvement in performance and reduction in on-chip memory hierarchy energy (L1 and L2 caches, and the L1↔L2 interconnect). On applications that have good spatial locality (e.g., tradesoap, milc, facesim, eclipse, and cactus), Amoeba-Cache

²“Commercial” applications include Apache, SpecJBB, and TPC-C.

has minimal impact on miss rate, but provides a significant bandwidth benefit. This results in on-chip energy reduction: milc's L1↔L2 bandwidth reduces by 15% and its on-chip energy reduces by 5%. Applications that suffer from cache pollution under Fixed (apache, jbb, twolf, soplex, and art) see gains in performance and energy. Apache's performance improves by 11% and on-chip energy reduces by 21%, while SpecJBB's performance improves by 21% and energy reduces by 9%. Art gains approximately 50% in performance. Streaming applications like mcf access blocks with both low and high utilization. Keeping out the unused words in the under-utilized blocks prevents the well-utilized cache blocks from being evicted; mcf's performance improves by 12% and on-chip energy by 36%.

Extra cache pipeline stage. An alternative strategy to accommodate Amoeba-Cache's overheads is to add an extra pipeline stage to the cache access, which increases hit latency by 1 cycle. The CPU clock frequency then entails no extra penalty compared to a conventional cache. We find that for applications in the moderate and low spatial locality groups (8 applications) Amoeba-Cache continues to provide a performance benefit between 6—50%. milc and canneal suffer minimal impact, with a 0.4% improvement and 3% slowdown respectively. Applications in the high spatial locality group (12 applications) suffer an average 15% slowdown (maximum 22%) due to the increase in L1 access latency. In these applications, 43% of the instructions (on average) are memory accesses, and a 33% increase in L1 hit latency imposes a high penalty. Note that all applications continue to retain the energy benefit. The cache hierarchy energy is dominated by the interconnects, and Amoeba-Cache provides a notable bandwidth reduction. While these results may change for an out-of-order, multiple-issue processor, the evaluation suggests that Amoeba-Cache, if implemented with the extra pipeline stage, is more suited for lower levels in the memory hierarchy than for the L1.

Off-chip L2↔Memory energy. The L2's higher cache capacity makes it less susceptible to pollution and provides less opportunity for improving miss rate. In such cases, Amoeba-Cache keeps out unused words and reduces off-chip bandwidth and thereby off-chip energy. We assume that the off-chip DRAM can provide adaptive granularity transfers for Amoeba-Cache's L2 refills, as in [35]. We use a DRAM model presented in a recent study [9] and model 0.5nJ per word transferred off-chip. The low spatial locality applications see a dramatic reduction in off-chip energy. For example, twolf sees a 93% reduction. On the commercial workloads the off-chip energy decreases by 31% and 24%, respectively. Even for applications with high cache utilization, off-chip energy decreases by 15%.

7 Spatial Predictor Tradeoffs

We evaluate the effectiveness of spatial pattern prediction in this section. In our table-based approach, a pattern history table records spatial patterns from evicted blocks and is accessed using a prediction index. The table-driven approach requires careful consideration of the following: the prediction index, the predictor table size, and the training period. We quantify the effects by comparing the predictor against a baseline fixed-granularity cache. We use a 64K cache since it induces enough misses and evictions to highlight the predictor tradeoffs clearly.

7.1 Predictor Indexing

A critical choice with history-based prediction is the selection of the predictor index. We explored two types of predictor indexing: a) a PC-based approach [6], based on the intuition that fields in a data structure are accessed by specific PCs and tend to exhibit similar spatial behavior. The tag includes the PC and the critical word index:


(Figure plots: per-benchmark MPKI and bandwidth for the Aligned, Finite, Infinite, Finite+FT, and History configurations. Panels: (a) Low MPKI group, (b) High MPKI group, (c) Low bandwidth group, (d) High bandwidth group.)
ALIGNED: fixed-granularity cache (64B blocks). FINITE: Amoeba-Cache with a REGION predictor (1024-entry predictor table and 4K region size). INFINITE: Amoeba-Cache with an unbounded predictor table (REGION predictor). FINITE+FT is FINITE augmented with hints for predicting a default granularity on compulsory misses (first touches). HISTORY: Amoeba-Cache uses spatial pattern hints based on utilization at the next eviction, collected from a prior run.

((PC >> 3) << 3) + (addr % 64)/8; and b) a Region-based (REGION) approach, based on the intuition that similar data objects tend to be allocated in contiguous regions of the address space and tend to exhibit similar spatial behavior. We compared the miss rate and bandwidth properties of both the PC (256 entries, fully associative) and REGION (1024 entries, 4KB region size) predictors. The size of each predictor was selected as the sweet spot in behavior for that predictor type. For all applications apart from cactus (a high spatial locality application), REGION-based prediction tends to overfetch and waste bandwidth as compared to PC-based prediction, which has 27% less bandwidth consumption on average across all applications. For 17 out of 22 applications, REGION-based prediction shows 17% better MPKI on average (max: 49% for cactus). For 5 applications (apache, art, mcf, lbm, and omnetpp), we find that PC demonstrates better accuracy when predicting the spatial behavior of cache blocks than REGION, and demonstrates a 24% improvement in MPKI (max: 68% for omnetpp).
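Both indexing schemes reduce to simple address arithmetic. The sketch below assumes 64-byte blocks of 8-byte words for the PC-based tag and a 4 KB region with a 1024-entry table for the REGION index; the direct modulo mapping into the table is an assumption for illustration.

```cpp
#include <cstdint>

// PC-based prediction index from Section 7.1: drop the PC's low 3 bits
// and fold in the critical-word index within the 64-byte block.
inline uint64_t pc_index(uint64_t pc, uint64_t addr) {
  return ((pc >> 3) << 3) + ((addr % 64) / 8);
}

// REGION-based index: blocks in the same aligned region (4 KB here)
// share a predictor entry (1024 entries in the evaluated table).
inline uint64_t region_index(uint64_t addr, uint64_t region_bytes = 4096,
                             uint64_t table_entries = 1024) {
  return (addr / region_bytes) % table_entries;
}
```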

7.2 Predictor Table
We studied the organization and size of the pattern table using the REGION predictor. We evaluated the following parameters: a) the region size, which directly correlates with the coverage of a fixed-size table, b) the size of the predictor table, which influences how many unique region patterns can be tracked, and c) the # of bits required to represent the spatial pattern.

Large region sizes effectively reduce the # of regions in the working set and require a smaller predictor table. However, a larger region is likely to have more blocks that exhibit varied spatial behavior and may pollute the pattern entry. We find that going from 1KB regions (4096 entries) to 4KB regions (1024 entries), the 4KB region granularity decreased miss rate by 0.3% and increased bandwidth by 0.4%, even though both tables provide the same working set coverage (4MB). Fixing the region size at 4KB, we studied the benefits of an unbounded table. Compared to a 1024-entry table (FINITE in Figure 6.2), the unbounded table increases miss rate by 1% and decreases bandwidth by 0.3%. A 1024-entry predictor table (4KB region granularity per entry) suffices for most applications. Organizing the 1024 entries as a 128-set × 8-way table suffices for eliminating associativity-related conflicts (<0.8% evictions due to lack of ways).

Focusing on the # of bits required to represent the pattern table, we evaluated the use of 4-bit saturating counters (instead of 1-bit bitmaps). The saturating counters seek to avoid pattern pollution when blocks with varied spatial behavior reside in the same region. Interestingly, we find that it is more beneficial to use 1-bit bitmaps for the majority of the applications (12 out of 22); the hysteresis introduced by the counters increases the training period. To summarize, we find that a REGION predictor with a 4KB region size and 1024 entries can predict the spatial pattern in a majority of the applications. CACTI indicates that the predictor table can be indexed in 0.025ns and requires 2.3pJ per miss for indexing.

7.3 Spatial Pattern Training
A widely-used approach to training the predictor is to harvest the word usage information on an eviction. Unfortunately, evictions may not be frequent, which means the predictor's training period tends to be long, during which the cache performs less efficiently and/or the application's phase may have changed in the meantime. Particularly at the time of first touch (a compulsory miss to a location), we need to infer the global spatial access patterns. We compare the finite region predictor (FINITE in Figure 6.2), which only predicts using eviction history, against FINITE+FT: this adds the optimization of inferring the default pattern (in this paper, from a prior run) when there is no predictor information. FINITE+FT demonstrates an avg. 1% (max: 6% for jbb) reduction in miss rate compared to FINITE and comes within 16% of the miss rate of HISTORY. In terms of bandwidth, FINITE+FT can save 8% of the bandwidth (up


to 32% for lbm) compared to FINITE. The percentage of first-touch accesses is shown in Table 8.

7.4 Predictor Summary
• For the majority of the applications (17/22) the address-region predictor with a 4KB region size works well. However, five applications (apache, lbm, mcf, art, omnetpp) show better performance with PC-based indexing. For best efficiency, the predictor should adapt its indexing to each application.
• Updating the predictor only on evictions leads to long training periods, which causes a loss in caching efficiency. We need to develop mechanisms to infer the default pattern based on the global behavior demonstrated by resident cache lines.
• The online predictor reduces MPKI by 7% and bandwidth by 26% on average relative to the conventional approach. However, it still has a 14% higher MPKI and 38% higher bandwidth relative to the HISTORY-based predictor, indicating room for improvement in prediction.
• The 1024-entry (4K region size) predictor table imposes a ≈0.12% energy overhead on the overall cache hierarchy energy since it is referenced only on misses.

8 Amoeba-Cache Adaptivity
We demonstrate that Amoeba-Cache can adapt better than a conventional cache to variations in spatial locality.

Tuning RMAX for High Spatial Locality. A challenge often faced by conventional caches is the desire to widen the cache block (to achieve spatial prefetching) without wasting space and bandwidth in low spatial locality applications. We study 3 specific applications: milc and tradesoap have good spatial locality and soplex has poor spatial locality. With a conventional 1M cache, when we widen the block size from 64 to 128 bytes, milc and tradesoap experience a 37% and 39% reduction in miss rate. However, soplex's miss rate increases by 2× and its bandwidth by 3.1×.

With Amoeba-Cache we do not have to make this uneasy choice, as it permits Amoeba-Blocks with granularities of 1—RMAX words (RMAX: maximum block size). When we increase RMAX from 64 bytes to 128 bytes, the miss rate reduces by 37% for milc and 24% for tradesoap, while simultaneously lowering bandwidth by 7%. Unlike the conventional cache, Amoeba-Cache is able to adapt to poor spatial locality: soplex experiences only a 2% increase in bandwidth and a 40% increase in miss rate.

(Figure 13 bar chart: % reduction in miss rate and bandwidth for soplex, milc, and tradesoap under Fixed and Amoeba; off-scale values are annotated (-213, -9, 2).)
Figure 13: Effect of increasing the block size from 64 to 128 bytes in a 1 MB cache.

Predicting Strided Accesses. Many applications (e.g., firefox and canneal) exhibit strided access patterns, which touch a few words in a block before accessing another block. Strided access patterns introduce intra-block holes (untouched words). For instance, canneal accesses ≈10K distinct fixed-granularity cache blocks with a specific access pattern, [--x--x--] (x indicates the ith word has been touched).

Any predictor that uses the access pattern history has two choices when an access misses on word 3 or 6: a) a miss-oriented policy (Policy-Miss) may refill the entire range of words 3–6 and eliminate the secondary miss, but brings in untouched words 4–5, increasing bandwidth; and b) a bandwidth-focused choice (Policy-BW) that refills only the requested word but will incur more misses. Table 6 compares the two contrasting policies for Amoeba-Cache (relative to a Fixed granularity baseline cache). Policy-BW saves 9% bandwidth compared to Policy-Miss but suffers a 25-30% higher miss rate.

Table 6: Predictor Policy Comparison
                    canneal                   firefox
                    Miss Rate   BW            Miss Rate   BW
Policy-Miss         10.31%      81.2%         11.18%      47.1%
Policy-BW           –20.08%     88.09%        –13.44%     56.82%
Spatial Patterns    [--x--x--] [x--x----]     [-x--x---] [x---x---]
–: indicates Miss or BW higher than Fixed.
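The two policies differ only in the range they refill when a strided pattern misses. A minimal sketch, reusing an illustrative 8-word pattern (not the paper's exact policy logic):

```cpp
#include <bitset>
#include <utility>

constexpr int kWords = 8;
using Pattern = std::bitset<kWords>;  // history bitmap, e.g. [--x--x--]

// Policy-Miss: refill the whole span between the outermost touched words,
// trading bandwidth (holes are fetched) for fewer secondary misses.
std::pair<int, int> policy_miss(const Pattern& p, int miss_word) {
  int start = miss_word, end = miss_word;
  for (int w = 0; w < kWords; ++w)
    if (p[w]) { if (w < start) start = w; if (w > end) end = w; }
  return {start, end};
}

// Policy-BW: refill only the requested word, saving bandwidth but
// incurring a miss for every touched word of the stride.
std::pair<int, int> policy_bw(int miss_word) { return {miss_word, miss_word}; }
```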

9 Amoeba-Cache vs other approaches
We compare Amoeba-Cache against four approaches.

• Fixed-2x: Our baseline is a fixed granularity cache with 2× the capacity of the other designs (64B block).
• Sector is the conventional sector cache design (as in the IBM Power7 [15]): 64B block and a small sector size (16 bytes or 2 words). This design targets optimal bandwidth. On any access, the cache fetches only the requisite sector, even though it allocates space for the entire line.
• Sector-Pre adds prefetching to Sector. This optimized design prefetches multiple sectors based on a spatial granularity predictor to improve the miss rate [17, 29].
• Multi$ combines features from line distillation [30] and spatial-temporal caches [12]. It is an aggressive design that partitions a cache into two: a line-organized cache (LOC) and a word-organized cache (WOC). At insertion time, Multi$ uses the spatial granularity hints to direct the cache to either refill words (into the WOC) or the entire line. We investigate two design points: 50% of the cache as a WOC (Multi$-50) and 25% of the cache as a WOC (Multi$-25).

Sector, Sector-Pre, Multi$, and Amoeba-Cache all use the same spatial predictor hints. On a demand miss, they prefetch all the sub-blocks needed together. Prefetching only changes the timing of an access; it does not change the miss bandwidth and cannot remove the misses caused by cache pollution.

Energy and Storage. The sector approaches impose low hardware complexity and energy penalty. To filter out unused words, the sector granularity has to be close to word granularity; we explored 2 words/sector, which leads to a storage penalty of ≈64KB for a 1MB cache. In Multi$, the WOC increases associativity to the number of words/block. Multi$-25 partitions a 64K 4-way cache into a 48K 3-way LOC and a 16K 8-way WOC, increasing associativity to 11. For a 1M cache, Multi$-50 increases associativity to 36. Compared to Fixed, Multi$-50 imposes over a 3× increase in lookup latency, a 5× increase in lookup energy, and a ≈4× increase in tag storage. Amoeba-Cache provides a more scalable approach to using adaptive cache lines since it varies the storage dedicated to the tags based on the spatial locality in the application.

Miss Rate and Bandwidth. Figure 14 summarizes the comparison; we focus on the moderate and low utilization groups of applications.


(Figure 14 scatter plots: bandwidth ratio vs. miss rate ratio for Sector, Sector-Pre, Multi$-25, Multi$-50, and Amoeba, relative to Fixed-2X at (1,1); off-scale Sector points are annotated. Panels: (a) 64K - Low, (b) 64K - Moderate, (c) 1M - Low, (d) 1M - Moderate.)
Figure 14: Relative miss rate and bandwidth for different caches. Baseline (1,1) is the Fixed-2x design. Labels: • Fixed-2x, ◦ Sector approaches, ∗ Multi$, △ Amoeba. (a),(b) 64K cache; (c),(d) 1M cache. Note the different Y-axis scale for each group.

On the high utilization group, all designs other than Sector have comparable miss rates. Amoeba-Cache improves the miss rate to within 5%—6% of the Fixed-2× design for the Low group and within 8%—17% for the Moderate group. Compared to the Fixed-2×, Amoeba-Cache also lowers bandwidth by 40% (64K cache) and 20% (1M cache). Compared to Sector-Pre (with prefetching), Amoeba-Cache is able to adapt better with flexible granularity and achieves a lower miss rate (up to 30% @ 64K and 35% @ 1M). Multi$'s benefits are proportional to the fraction of the cache organized as a WOC; Multi$-50 (18-way @ 64K and 36-way @ 1M) is needed to match the miss rate of Amoeba-Cache. Finally, in the moderate group, many applications exhibit strided access. Compared to Multi$'s WOC, which fetches individual words, Amoeba-Cache increases bandwidth since it chooses to fetch the contiguous chunk in order to lower the miss rate.

10 Multicore Shared Cache

We evaluate a shared cache implemented with the Amoeba-Cache design. By dynamically varying the cache block size and keeping out unused words, the Amoeba-Cache effectively minimizes the footprint of an application. Minimizing the footprint helps multiple applications effectively share the cache space. We experimented with a 1M shared Amoeba-Cache in a 4-core system. Table 7 shows the application mixes; we chose a mix of applications across all groups. We tabulate the change in miss rate per thread and the overall change in bandwidth for Amoeba-Cache with respect to a fixed granularity cache running the same mix. Minimizing the overall footprint enables a reduction in the miss rate of each application in the mix. The commercial workloads (SpecJBB and TPC-C) are able to make use of the space available and achieve a significant reduction in miss rate (avg: 18%). Only two applications suffered a small increase in miss rate (x264 in Mix #2: 2%, and ferret in Mix #3: 4%) due to contention. The overall L2 miss bandwidth improves significantly, showing a 16%—39% reduction across all workload mixes. We believe that an Amoeba-based shared cache can effectively enable the shared cache to support

more cores and increase overall throughput. We leave the design space exploration of an Amoeba-based coherent cache for future work.

Table 7: Multiprogrammed workloads on a 1M shared Amoeba-Cache. % reduction in miss rate and bandwidth. Baseline: Fixed 1M.
Mix                              Miss T1    Miss T2    Miss T3    Miss T4    BW (All)
jbb×2, tpc-c×2                   12.38%     12.38%     22.29%     22.37%     39.07%
firefox×2, x264×2                3.82%      3.61%      –2.44%     0.43%      15.71%
cactus, fluid., omnet., sopl.    1.01%      1.86%      22.38%     0.59%      18.62%
canneal, astar, ferret, milc     4.85%      2.75%      19.39%     –4.07%     17.77%
–: indicates Miss or BW higher than Fixed. T1—T4: threads in the mix, in the order of the applications listed.

11 Related Work
Burger et al. [5] defined cache efficiency as the fraction of blocks that store data that is likely to be used. We use the term cache utilization to identify touched versus untouched words residing in the cache. Past works [6, 29, 30] have also observed low cache utilization at specific levels of the cache. Some works [18, 19, 13, 21] have sought to improve cache utilization by eliminating cache blocks that are no longer likely to be used (referred to as dead blocks). These techniques do not address the problem of intra-block waste (i.e., untouched words).

Sector caches [31, 32] associate a single tag with a group of contiguous cache lines, allowing cache sizes to grow without paying the penalty of additional tag overhead. Sector caches use bandwidth efficiently by transferring only the needed cache lines within a sector. Conventional sector caches [31] may result in worse utilization due to the space occupied by invalid cache lines within a sector. Decoupled sector caches [32] help reduce the number of invalid cache lines per sector by increasing the number of tags per sector. Compared to the Amoeba-Cache, the tag space is a constant overhead and limits the # of invalid sectors that can be eliminated. Pujara et al. [29] consider a word granularity sector cache, and use a predictor to try to bring in only the used words. Our results (see Figure 14) show that smaller granularity sectors significantly increase misses, and optimizations that prefetch [29] can pollute the cache and interconnect with unused words.

Line distillation [30] applies filtering at the word granularity to eliminate unused words in a cache block at eviction. This approach requires part of the cache to be organized as a word-organized cache, which increases tag overhead, complicates lookup, and bounds performance improvements. Most importantly, line distillation does not address the bandwidth penalty of unused words. This inefficiency is increasingly important to address under current and future technology dominated by interconnects [14, 28]. Veidenbaum et al. [33] propose that the entire cache be word organized and propose an online algorithm to prefetch words. Unfortunately, a static word-organized cache has a built-in tag overhead of 50% and requires energy-intensive associative searches.

Amoeba-Cache adopts a more proactive approach that enables continuous dynamic block granularity adaptation to the available spatial locality. When there is high spatial locality, the Amoeba-Cache will automatically store a few big cache blocks (most space dedicated to data); with low spatial locality, it will adapt to storing many small cache blocks (extra space allocated for tags). Recently, Yoon et


al. have proposed an adaptive granularity DRAM architecture [35]. This provides the support necessary for variable granularity off-chip requests from an Amoeba-Cache-based LLC. Some research [10, 8] has also focused on reducing false sharing in coherent caches by splitting/merging cache blocks to avoid invalidations. These works would benefit from the Amoeba-Cache design, which manages block granularity in hardware.

There has been a significant amount of work at the compiler and software runtime level (e.g., [7]) to restructure data for improved spatial efficiency. There have also been efforts from the architecture community to predict spatial locality [29, 34, 17, 36], which we can leverage to predict Amoeba-Block ranges. Finally, cache compression is an orthogonal body of work that does not eliminate unused words but seeks to minimize the overall memory footprint [1].

12 Summary
In this paper, we propose a cache design, Amoeba-Cache, that can dynamically hold a variable number of cache blocks of different granularities. The Amoeba-Cache employs a novel organization that completely eliminates the tag array and collocates the tags with the cache blocks in the data array. This permits the Amoeba-Cache to trade the space budgeted for cache blocks for tags and to support a variable number of tags (and blocks). For applications that have low spatial locality, Amoeba-Cache can reduce cache pollution, improve the overall miss rate, and reduce the bandwidth wasted in the interconnects. When applications have moderate to high spatial locality, Amoeba-Cache coarsens the block size and ensures good performance. Finally, for applications that are streaming (e.g., lbm), Amoeba-Cache can save significant energy by eliminating unused words from being transmitted over the interconnects.

Acknowledgments
We would like to thank our shepherd, André Seznec, and the anonymous reviewers for their comments and feedback.

Appendix

Table 8: Amoeba-Cache Performance. Absolute #s.
           MPKI             BW (per 1K ins)            CPI        Predictor stats
           L1      L2       L1↔L2      L2↔Mem          Cyc/Ins    First touch    Evict win.
                            #Bytes/1K  #Bytes/1K                  (% misses)     (# ins/evict)
apache     64.9    19.6     5,000      2,067           8.3        0.4            17
art        133.7   53.0     5,475      1,425           16.0       0.0            9
astar      0.9     0.3      70         35              1.9        18.0           1,600
cactus     6.9     4.4      604        456             3.5        7.5            162
canne.     8.3     5.0      486        357             3.2        5.8            128
eclip.     3.6     <0.1     433        <1              1.8        0.1            198
faces.     5.5     4.7      683        632             3.0        41.2           190
ferre.     6.8     1.4      827        83              2.1        1.3            156
firef.     1.5     1.0      123        95              2.1        11.1           727
fluid.     1.7     1.4      138        127             1.9        39.2           629
freqm.     1.1     0.6      89         65              2.3        17.7           994
h2         4.6     0.4      328        46              1.8        1.7            154
jbb        24.6    9.6      1,542      830             5.0        10.2           42
lbm        63.1    42.2     3,755      3,438           13.6       6.7            18
mcf        55.8    40.7     2,519      2,073           13.2       0.0            19
milc       16.1    16.0     1,486      1,476           6.0        2.4            66
omnet.     2.5     <0.1     158        <1              1.9        0.0            458
sople.     30.7    4.0      1,045      292             3.1        0.9            35
tpcc       5.4     0.5      438        36              2.0        0.4            200
trade.     3.6     <0.1     410        6               1.8        0.6            194
twolf      23.3    0.6      1,326      45              2.2        0.0            49
x264       4.1     1.8      270        190             2.2        12.4           274
MPKI: Misses / 1K instructions. BW: # words / 1K instructions. CPI: Clock cycles per instruction.
Predictor — First touch: compulsory misses; % of accesses that use the default granularity. Evict window: # of instructions between evictions; a higher value indicates predictor training takes longer.

References
[1] A. R. Alameldeen. Using Compression to Improve Chip Multiprocessor Performance. Ph.D. dissertation, Univ. of Wisconsin-Madison, 2006.
[2] D. Albonesi, A. Kodi, and V. Stojanovic. NSF Workshop on Emerging Technologies for Interconnects (WETI). 2012.
[3] C. Bienia. Benchmarking Modern Multiprocessors. Ph.D. thesis, Princeton University, 2011.
[4] S. M. Blackburn et al. The DaCapo benchmarks: Java benchmarking development and analysis. In Proc. of the 21st OOPSLA, 2006.
[5] D. Burger, J. Goodman, and A. Kaigi. The Declining Effectiveness of Dynamic Caching for General-Purpose Workloads. University of Wisconsin Technical Report, 1995.
[6] C. F. Chen et al. Accurate and Complexity-Effective Spatial Pattern Prediction. In Proc. of the 10th HPCA, 2004.
[7] T. M. Chilimbi, M. D. Hill, and J. R. Larus. Cache-Conscious Structure Layout. In Proc. of PLDI, 1999.
[8] B. Choi et al. DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism. In Proc. of PACT, 2011.
[9] B. Dally. Power, Programmability, and Granularity: The Challenges of ExaScale Computing. In Proc. of IPDPS, 2011.
[10] C. Dubnicki and T. J. Leblanc. Adjustable Block Size Coherent Caches. In Proc. of the 19th ISCA, 1992.
[11] A. Fog. Instruction tables. 2012. http://www.agner.org/optimize/instruction_tables.pdf.
[12] A. González, C. Aliagas, and M. Valero. A data cache with multiple caching strategies tuned to different types of locality. In Proc. of the ACM Intl. Conf. on Supercomputing, 1995.
[13] Z. Hu, S. Kaxiras, and M. Martonosi. Timekeeping in the memory system: predicting and optimizing memory behavior. In Proc. of the 29th ISCA, 2002.
[14] C. Hughes, C. Kim, and Y.-K. Chen. Performance and Energy Implications of Many-Core Caches for Throughput Computing. IEEE Micro, 2010.
[15] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd. Power7: IBM's Next-Generation Server Processor. IEEE Micro, 2010.
[16] K. Kedzierski, M. Moreto, F. J. Cazorla, and M. Valero. Adapting Cache Partitioning Algorithms to Pseudo-LRU Replacement Policies. In Proc. of IPDPS, 2010.
[17] S. Kumar and C. Wilkerson. Exploiting spatial locality in data caches using spatial footprints. In Proc. of the 25th ISCA, 1998.
[18] A.-C. Lai and B. Falsafi. Selective, accurate, and timely self-invalidation using last-touch prediction. In Proc. of the 27th ISCA, 2000.
[19] A. R. Lebeck and D. A. Wood. Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors. In Proc. of the 22nd ISCA, 1995.
[20] S. Li et al. McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proc. of the 42nd MICRO, 2009.
[21] H. Liu et al. Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency. In Proc. of the 41st MICRO, 2008.
[22] D. R. Llanos and B. Palop. TPCC-UVA: An Open-source TPC-C implementation for parallel and distributed systems. In Proc. of the 2006 Intl. Parallel and Distributed Processing Symp., PMEO Workshop, 2006. http://www.infor.uva.es/~diego/tpcc-uva.html.
[23] G. H. Loh and M. D. Hill. Efficiently enabling conventional block sizes for very large die-stacked DRAM caches. In Proc. of the 44th MICRO, 2011.
[24] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In Proc. of PLDI, 2005.
[25] Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset. ACM SIGARCH Computer Architecture News, Sept. 2005.
[26] Sun Microsystems. OpenSPARC T1 Processor Megacell Specification. 2007.
[27] M. K. Qureshi et al. Adaptive insertion policies for high performance caching. In Proc. of the 34th ISCA, 2007.
[28] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. In Proc. of the 40th MICRO, 2007.
[29] P. Pujara and A. Aggarwal. Increasing the Cache Efficiency by Eliminating Noise. In Proc. of the 12th HPCA, 2006.
[30] M. K. Qureshi, M. A. Suleman, and Y. N. Patt. Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines. In Proc. of the 13th HPCA, 2007.
[31] J. B. Rothman and A. J. Smith. The pool of subsectors cache design. In Proc. of the 13th ACM ICS, 1999.
[32] A. Seznec. Decoupled sectored caches: conciliating low tag implementation cost. In Proc. of the 21st ISCA, 1994.
[33] A. V. Veidenbaum, W. Tang, R. Gupta, A. Nicolau, and X. Ji. Adapting cache line size to application behavior. In Proc. of the 13th ACM ICS, 1999.
[34] M. A. Watkins, S. A. McKee, and L. Schaelicke. Revisiting Cache Block Superloading. In Proc. of the 4th HiPEAC, 2009.
[35] D. H. Yoon, M. K. Jeong, and M. Erez. Adaptive granularity memory systems: a tradeoff between storage efficiency and throughput. In Proc. of the 38th ISCA, 2011.
[36] D. H. Yoon, M. K. Jeong, M. B. Sullivan, and M. Erez. The Dynamic Granularity Memory System. In Proc. of the 39th ISCA, 2012.

[36] http://cpudb.stanford.edu/