Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy∗

Snehasish Kumar, Hongzhou Zhao†, Arrvindh Shriraman, Eric Matthews∗, Sandhya Dwarkadas†, Lesley Shannon∗
School of Computing Sciences, Simon Fraser University; †Department of Computer Science, University of Rochester; ∗School of Engineering Science, Simon Fraser University
Abstract

The fixed geometries of current cache designs do not adapt to the working set requirements of modern applications, causing significant inefficiency. The short block lifetimes and moderate spatial locality exhibited by many applications result in only a few words in the block being touched prior to eviction. Unused words occupy between 17–80% of a 64K L1 cache and between 1–79% of a 1MB private LLC. This effectively shrinks the cache size, increases miss rate, and wastes on-chip bandwidth. Scaling limitations of wires mean that unused-word transfers comprise a large fraction (11%) of on-chip cache hierarchy energy consumption.

We propose Amoeba-Cache, a design that supports a variable number of cache blocks, each of a different granularity. Amoeba-Cache employs a novel organization that completely eliminates the tag array, treating the storage array as uniform and morphable between tags and data. This enables the cache to harvest space from unused words in blocks for additional tag storage, thereby supporting a variable number of tags (and correspondingly, blocks). Amoeba-Cache adjusts individual cache line granularities according to the spatial locality in the application. It adapts to the appropriate granularity both for different data objects in an application as well as for different phases of access to the same data. Overall, compared to a fixed granularity cache, the Amoeba-Cache reduces the miss rate on average (geometric mean) by 18% at the L1 level and by 18% at the L2 level, and reduces L1–L2 miss bandwidth by ~46%. Correspondingly, Amoeba-Cache reduces on-chip memory hierarchy energy by as much as 36% (mcf) and improves performance by as much as 50% (art).
1 Introduction

A cache block is the fundamental unit of space allocation and data transfer in the memory hierarchy. Typically, a block is an aligned, fixed granularity of contiguous words (1 word = 8 bytes). Current processors fix the block granularity largely based on the average spatial locality across workloads, while taking tag overhead into consideration. Unfortunately, many applications (see Section 2 for details) exhibit low to moderate spatial locality, and most of the words in a cache block are left untouched during the block's lifespan. Even for applications with good spatial behavior, the short lifespan of a block caused by cache geometry limitations can cause low cache utilization. Technology trends make it imperative that caching efficiency improve to reduce wastage of interconnect bandwidth. Recent reports from industry [2] show that on-chip networks can contribute up to 28% of total chip power. In the future, an L2-to-L1 transfer can cost up to 2.8× more energy than the L2 data access [14, 20]. Unused words waste ~11% (4–21% in commercial workloads) of the cache hierarchy energy.
∗This material is based upon work supported in part by grants from the Natural Sciences and Engineering Research Council, MARCO Gigascale Research Center, Canadian Microelectronics Corporation, and National Science Foundation (NSF) grants CNS-0834451, CCF-1016902, and CCF-1217920.
[Figure 1: diagrams comparing Sector caches (e.g., [29], [31]), Cache Filtering (e.g., [30]), and Amoeba-Cache along three axes: Bandwidth, Miss Rate, and Space Utilization.]

Figure 1: Cache designs optimizing different memory hierarchy parameters. Arrows indicate the parameters that are targeted and improved compared to a conventional cache.
Figure 1 organizes past research on cache block granularity along the three main parameters influenced by block granularity: miss rate, bandwidth usage, and cache space utilization. Sector caches [32, 15] have been used to minimize bandwidth by fetching only sub-blocks, but they miss opportunities for spatial prefetching. Prefetching [17, 29] may help reduce the miss rate for utilized sectors, but in applications with low-to-moderate or variable spatial locality, unused sectors due to misprediction, or unused regions within sectors, still pollute the cache and consume bandwidth. Line distillation [30] filters out unused words from the cache at evictions using a separate word-granularity cache. Other approaches identify dead cache blocks and replace or eliminate them eagerly [19, 18, 13, 21]. While these approaches improve utilization and potentially miss rate, they continue to consume bandwidth and interconnect energy for the unutilized words. Word-organized cache blocks also dramatically increase cache associativity and lookup overheads, which impacts their scalability.
Determining a fixed optimal point for the cache line granularity at hardware design time is a challenge. Small cache lines tend to fetch fewer unused words, but impose significant performance penalties by missing opportunities for spatial prefetching in applications with high spatial locality. Small line sizes also introduce high tag overhead, increase lookup energy, and increase miss processing overhead (e.g., control messages). Larger cache line sizes minimize tag overhead and effectively prefetch neighboring words, but introduce the negative effect of unused words that increase network bandwidth. Prior approaches have proposed the use of multiple caches with different block sizes [33, 12]. These approaches require word-granularity caches that increase lookup energy, impose high tag overhead (e.g., 50% in [33]), and reduce cache efficiency when there is good spatial locality.
In this paper, we propose a novel cache architecture, Amoeba-Cache, to improve memory hierarchy efficiency by supporting fine-grain (per-miss) dynamic adjustment of cache block size and the # of blocks per set. To enable variable granularity blocks within the same cache, the tags maintained per set need to grow and shrink as the # of blocks/set varies. Amoeba-Cache eliminates the conventional tag array and collocates the tags with the cache blocks in the data array. This enables us to segment and partition a cache set in different ways: for example, in a configuration comparable to a traditional 4-way 64K cache with 256 sets (256 bytes per set), we can hold eight 32-byte cache blocks, thirty-two 8-byte blocks, or any other collection of cache blocks of varying granularity. Different sets may hold blocks of different granularity, providing maximum flexibility across address regions of varying spatial locality. The Amoeba-Cache effectively filters out unused words in a conventional block and prevents them from being inserted into the cache, allowing the resulting free space to be used to hold tags or data of other useful blocks. The Amoeba-Cache can adapt to the available spatial locality: when there is low spatial locality, it will hold many blocks of small granularity, and when there is good spatial locality, it can segment the cache into a few big blocks.
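As an illustration of this set-packing flexibility, the sketch below counts how many variable-granularity blocks fit in one 256-byte set when the collocated 8-byte tag word (described in Section 3) is charged to each block. The helper name `fits` is ours, not the paper's.

```python
# Sketch: packing variable-granularity blocks into one Amoeba-Cache set.
# Assumption (from Section 3): each block collocates one 8-byte tag word
# with its data words, so a block with g data words occupies g + 1 words.

SET_WORDS = 32  # 256-byte set, 8-byte words

def fits(block_granularities):
    """True if blocks with the given data-word counts fit in one set."""
    return sum(g + 1 for g in block_granularities) <= SET_WORDS

print(fits([7] * 4))        # four 7-word blocks: 32 words, fits exactly
print(fits([3] * 8))        # eight 3-word blocks: 32 words, fits
print(fits([1] * 16))       # sixteen single-word blocks: 32 words, fits
print(fits([7] * 4 + [1]))  # one more block does not fit
```

Mixes of granularities (e.g., a few large blocks plus many small ones) fit the same budget, which is exactly the flexibility a fixed-geometry tag array cannot express.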
Compared to a fixed granularity cache, Amoeba-Cache improves cache utilization by 90–99% for most applications, reduces the miss rate by up to 73% (omnetpp) at the L1 level and up to 88% (twolf) at the LLC level, and reduces miss bandwidth by up to 84% (omnetpp) at the L1 and 92% (twolf) at the LLC. We compare against other approaches such as sector caches and line distillation and show that Amoeba-Cache optimizes miss rate and bandwidth better across many applications, with lower hardware overhead. Our synthesis of the cache controller hit path shows that Amoeba-Cache can be implemented with low energy impact and 0.7% area overhead for a latency-critical 64K L1.
The overall paper is organized as follows: §2 provides quantitative evidence for the acuteness of the spatial locality problem. §3 details the internals of the Amoeba-Cache organization and §4 analyzes the physical implementation overhead. §5 deals with wider chip-level issues (i.e., inclusion and coherence). §6–§10 evaluate the Amoeba-Cache, commenting on the optimal block granularity, impact on overall on-chip energy, and performance improvement. §11 outlines related work.

2 Motivation for Adaptive Blocks
In traditional caches, the cache block defines the fundamental unit of data movement and space allocation. The blocks in the data array are uniformly sized to simplify the insertion/removal of blocks, simplify cache refill requests, and support a low-complexity tag organization. Unfortunately, conventional caches are inflexible (fixed block granularity and fixed # of blocks), and caching efficiency is poor for applications that lack high spatial locality. Cache blocks influence multiple system metrics, including bandwidth, miss rate, and cache utilization. The block granularity plays a key role in exploiting spatial locality by effectively prefetching neighboring words all at once. However, the neighboring words could go unused due to the short lifespan of a cache block. The unused words occupy interconnect bandwidth and pollute the cache, which increases the # of misses. We evaluate the influence of a fixed granularity block below.

2.1 Cache Utilization
In the absence of spatial locality, multi-word cache blocks (typically 64 bytes on existing processors) tend to increase cache pollution and fill the cache with words unlikely to be used. To quantify this pollution, we segment the cache line into words (8 bytes) and track the words touched before the block is evicted. We define utilization as the average # of words touched in a cache block before it is evicted. We study a comprehensive collection of workloads from a variety of domains: 6 from PARSEC [3], 7 from SPEC2006, 2 from SPEC2000, 3 Java workloads from DaCapo [4], 3 commercial workloads (Apache, SpecJBB2005, and TPC-C [22]), and the Firefox web browser. Subsets within benchmark suites were chosen based on demonstrated miss rates on the fixed granularity cache (i.e., whose working sets did not fit in the cache size evaluated) and with a spread and diversity in cache utilization. We classify the benchmarks into 3 groups based on the utilization they exhibit: Low (<33%), Moderate (33–66%), and High (>66%) utilization (see Table 1).
Table 1: Benchmark Groups

Group      Utilization %   Benchmarks
Low        0–33%           art, soplex, twolf, mcf, canneal, lbm, omnetpp
Moderate   34–66%          astar, h2, jbb, apache, x264, firefox, tpc-c, freqmine, fluidanimate
High       67–100%         tradesoap, facesim, eclipse, cactus, milc, ferret
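The utilization bookkeeping behind this classification can be sketched as follows. This is a minimal model of the measurement, not the simulator used in the paper; the block and word sizes follow the 64-byte/8-byte configuration above.

```python
# Sketch: measuring cache-block utilization (words touched before eviction).
# A per-block bitmap records which 8-byte words of a 64-byte block were
# accessed; on eviction, the popcount feeds the utilization statistics.

BLOCK_BYTES, WORD_BYTES = 64, 8
WORDS_PER_BLOCK = BLOCK_BYTES // WORD_BYTES  # 8

class BlockStats:
    def __init__(self):
        self.touched = 0  # bitmap, one bit per word of the block

    def access(self, addr):
        word = (addr % BLOCK_BYTES) // WORD_BYTES
        self.touched |= 1 << word

    def words_used(self):
        return bin(self.touched).count("1")

# One block's lifetime: only two distinct words touched -> 25% utilization,
# which would place this block's accesses in the "1-2 Words" bucket.
b = BlockStats()
for addr in (0x1000, 0x1008, 0x100C):  # words 0, 1, 1
    b.access(addr)
print(b.words_used() / WORDS_PER_BLOCK)  # 0.25
```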
[Figure 2: stacked per-benchmark distribution of words touched per evicted block (buckets: 1-2, 3-4, 5-6, 7-8 words); y-axis: Words Accessed (%). Average utilization shown atop each bar: apache 45, art 20, astar 39, cactus 79, canneal 30, eclipse 80, facesim 77, ferret 82, firefox 49, fluidanimate 62, freqmine 55, h2 38, jbb 40, lbm 32, mcf 29, milc 81, omnetpp 33, soplex 21, tpc-c 53, tradesoap 73, twolf 29, x264 46, mean 50.]

Figure 2: Distribution of words touched in a cache block. Avg. utilization is on top. (Config: 64K, 4-way, 64-byte block.)
Figure 2 shows the histogram of words touched at the time of eviction in a cache line of a 64K, 4-way cache (64-byte block, 8 words per block) across the different benchmarks. Seven applications have less than 33% utilization, and 12 of them are dominated (>50%) by 1-2 word accesses. In applications with good spatial locality (cactus, ferret, tradesoap, milc, eclipse), more than 50% of the evicted blocks have 7-8 words touched. Despite similar average utilization for applications such as astar and h2 (39%), their distributions are dissimilar: ~70% of the blocks in astar have 1-2 words accessed at the time of eviction, whereas ~50% of the blocks in h2 have 1-2 words accessed per block. Utilization for a single application also changes over time; for example, ferret's average utilization, measured as the average fraction of words used in evicted cache lines over 50 million instruction windows, varies from 50% to 95% with a periodicity of roughly 400 million instructions.
2.2 Effect of Block Granularity on Miss Rate and Bandwidth

Cache miss rate directly correlates with performance, while under current and future wire-limited technologies [2], bandwidth directly correlates with dynamic energy. Figure 3 shows the influence of block granularity on miss rate and bandwidth for a 64K L1 cache and a 1M L2 cache, keeping the number of ways constant. For the 64K L1, the plots highlight the pitfalls of simply decreasing the block size to accommodate the Low group of applications: the miss rate increases by 2× for the High group when the block size is changed from 64B to 32B, and by 30% for the Moderate group. A smaller block size decreases bandwidth proportionately but increases miss rate. With a 1M L2 cache, the lifetime of the cache lines increases significantly, improving overall utilization. Increasing the block size from 64B to 256B halves the miss rate for all application groups, but increases bandwidth by 2× for the Low and Moderate groups.
Since miss rate and bandwidth have different optimal block granularities, we use the metric 1/(Miss Rate × Bandwidth) to determine a fixed block granularity suited to an application that takes both criteria into account. Table 2 shows the block size that maximizes the metric for each application. It can be seen that different applications have different block granularity requirements. For example, the metric is maximized for apache at 128 bytes and for firefox (similar utilization) at 32 bytes. Furthermore, the optimal block sizes vary with the cache size as the cache line lifespan changes. This highlights the challenge of picking a single block size at design time, especially when the working set does not fit in the cache.

2.3 Need for adaptive cache blocks
Our observations motivate the need for adaptive cache line granularities that match the spatial locality of the data access patterns in an application. In summary:
• Smaller cache lines improve utilization but tend to increase miss rate and potentially traffic for applications with good spatial locality, affecting overall performance.
• Large cache lines pollute the cache space and interconnect with unused words for applications with poor spatial locality, significantly decreasing caching efficiency.
• Many applications waste a significant fraction of the cache space. Spatial locality varies not only across applications but also within each application, for different data structures as well as different phases of access over time.
Table 2: Optimal block size. Metric: 1/(Miss Rate × Bandwidth)

64K, 4-way
Block   Benchmarks
32B     cactus, eclipse, facesim, ferret, firefox, fluidanimate, freqmine, milc, tpc-c, tradesoap
64B     art
128B    apache, astar, canneal, h2, jbb, lbm, mcf, omnetpp, soplex, twolf, x264

1M, 8-way
Block   Benchmarks
64B     apache, astar, cactus, eclipse, facesim, ferret, firefox, freqmine, h2, lbm, milc, omnetpp, tradesoap, x264
128B    art
256B    canneal, fluidanimate, jbb, mcf, soplex, tpc-c, twolf
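The selection behind Table 2 can be sketched as below. The per-block-size miss-rate and bandwidth numbers here are made up for illustration, not measurements from the paper; the helper name `best_block_size` is ours.

```python
# Sketch: choosing the block size that maximizes 1 / (miss_rate * bandwidth).
# Since 1/x is monotonically decreasing, this is equivalent to minimizing
# the product miss_rate * bandwidth.

def best_block_size(stats):
    """stats: {block_size_bytes: (misses_per_1k_ins, kb_per_1k_ins)}"""
    return max(stats, key=lambda b: 1.0 / (stats[b][0] * stats[b][1]))

# Hypothetical application: smaller blocks cut bandwidth faster than they
# raise the miss rate, so the metric favors 32B.
stats = {
    32: (30.0, 3.2),   # product  96.0
    64: (24.0, 4.1),   # product  98.4
    128: (21.0, 5.6),  # product 117.6
}
print(best_block_size(stats))  # 32
```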
3 Amoeba-Cache: Architecture

The Amoeba-Cache architecture enables the memory hierarchy to fetch and allocate space for a range of words (a variable granularity
[Figure 3: six scatter panels of Bandwidth (KB / 1K ins) vs. Miss Rate (Misses / 1K ins), with markers at block sizes 32/64/128 for the 64K cache and 64/128/256 for the 1M cache: (a) 64K - Low, (b) 1M - Low, (c) 64K - Moderate, (d) 1M - Moderate, (e) 64K - High, (f) 1M - High.]

Figure 3: Bandwidth vs. Miss Rate. (a),(c),(e): 64K, 4-way L1. (b),(d),(f): 1M, 8-way LLC. Markers on the plot indicate cache block size. Note the different scales for different groups.
cache block) based on the spatial locality of the application. For example, consider a 64K cache (256 sets) that allocates 256 bytes per set. These 256 bytes can adapt to support, for example, eight 32-byte blocks, thirty-two 8-byte blocks, or four 32-byte blocks and sixteen 8-byte blocks, based on the set of contiguous words likely to be accessed. The key challenge in supporting variable granularity blocks is growing and shrinking the # of tags as the # of blocks per set varies with block granularity. Amoeba-Cache adopts a solution inspired by software data structures, where programs hold metadata and actual data entries in the same address space. To achieve maximum flexibility, Amoeba-Cache completely eliminates the tag array and collocates the tags with the actual data blocks (see Figure 4). We use a bitmap (T?) to indicate which words in the data array represent tags. We also decouple the conventional valid/invalid bits (typically associated with the tags) and organize them into a separate array (V?: valid bitmap) to simplify block replacement and insertion. The V? and T? bitmaps each require 1 bit for every word (64 bits) in the data array (total overhead of 3%). Amoeba-Cache tags are represented as a range (Start and End address) to support variable granularity blocks. We next discuss the overall architecture.

3.1 Amoeba Blocks and Set-Indexing
The Amoeba-Cache data array holds a collection of varied granularity Amoeba-Blocks that do not overlap. Each Amoeba-Block is a 4-tuple consisting of <Region Tag, Start, End, Data-Block> (Figure 4). A Region is an aligned block of memory of size RMAX words. The boundaries of any Amoeba-Block (Start and End) always lie within a region's boundaries. The minimum granularity of the data in an Amoeba-Block is 1 word and the maximum is RMAX words. We can encode Start and End in log2(RMAX) bits. The set indexing function masks the lower log2(RMAX) bits to ensure that all Amoeba-Blocks (every memory word) from a region index to the same set. The Region Tag and Set-Index are identical for every word in the Amoeba-Block. Retaining the notion of sets enables fast lookups and helps elude challenges such as synonyms (the same memory word mapping to different sets). When comparing against a conventional cache, we set RMAX to 8 words (64 bytes), ensuring that the set indexing function is identical to that in the conventional cache to allow for a fair evaluation.

3.2 Data Lookup
Figure 5 describes the steps of the lookup. (1) The lower log2(RMAX) bits are masked from the address and the set index is derived from the remaining bits. (2) In parallel with the data read-out, the T? bitmap activates the words in the data array corresponding to the tags for comparison. Note that since the minimum size of an Amoeba-Block is two words (1 word for the tag metadata, 1 word for the data), adjacent words cannot both be tags. We need Nwords/2 2:1 multiplexers that route one of each pair of adjacent words to the comparator (the ∈ operator). The comparator generates the hit signal for the word selector. The ∈ operator consists of two comparators: a) an aligned Region tag comparator, a conventional == comparator (64 − log2 Nsets − log2 RMAX bits wide, e.g., 50 bits) that checks whether the Amoeba-Block belongs to the same region, and b) a Start <= W < End range comparator (log2 RMAX bits wide, e.g., 3 bits) that checks whether the Amoeba-Block holds the required word. Finally, in (3), the tag match activates and selects the appropriate word.

3.3 Amoeba Block Insertion
On a miss for the desired word, a spatial granularity predictor is invoked (see Section 5.1), which specifies the range of the Amoeba-Block to refill. To determine a position in the set to slot the incoming block, we leverage the V? (valid bits) bitmap. The V? bitmap has 1 bit per word in the cache; a "1" bit indicates the word has been allocated (valid data or a tag). To find space for an incoming data block, we perform a substring search on the V? bitmap of the cache set for contiguous 0s (empty words). For example, to insert an Amoeba-Block of five words (four words for data and one word for the tag), we perform a substring search for 00000 in the V? bitmap of the set (e.g., 32 bits wide for a 64K cache). If we cannot find the space, we keep triggering the replacement algorithm until we create the requisite contiguous space. Following this, the Amoeba-Block tuple (tag and data block) is inserted, and the corresponding bits in the T? and V? arrays are set. The 0s substring search can be accomplished with a lookup table; many current processors already include a substring instruction [11, PCMPISTRI].

3.4 Replacement: Pseudo-LRU
To reclaim the space of an Amoeba-Block, the T? (tag) and V? (valid) bits corresponding to the block are unset. The key issue is identifying the Amoeba-Block to replace. Classical pseudo-LRU algorithms [16, 26] keep the metadata for the replacement algorithm separate from the tags to reduce port contention. To be compatible with pseudo-LRU and other algorithms such as DIP [27] that work with a fixed # of ways, we can logically partition a set in Amoeba-Cache into Nways. For instance, if we partition a 32-word cache set into 4 logical ways, any access to an Amoeba-Block tag found in words 0–7 of the set is treated as an access to logical way 0. Finding a replacement candidate involves identifying the selected replacement way and then picking (possibly randomly) a candidate Amoeba-Block. More refined replacement algorithms that require per-tag metadata can harvest the space in the tag word of the Amoeba-Block, which is 64 bits wide (for alignment purposes) while physical addresses rarely extend beyond 48 bits.

3.5 Partial Misses
With variable granularity data blocks, a challenging although rare case (5 in every 1K accesses) is a partial miss. It is observed primarily when the spatial locality changes. Figure 6 shows an example. Initially, the set contains two blocks from a region R: one Amoeba-Block caches words 1–3 (Region: R, Start: 1, End: 3) and the other holds words 5–6 (Region: R, Start: 5, End: 6). The CPU reads word 4, which misses, and the spatial predictor requests an Amoeba-Block with range Start: 0 and End: 7. The cache has Amoeba-Blocks that hold subparts of the incoming Amoeba-Block, and some words (0, 4, and 7) need to be fetched.
Amoeba-Cache removes the overlapping sub-blocks and allocates a new Amoeba-Block. This is a multiple-step process: (1) On a miss, the cache identifies the overlapping sub-blocks in the cache using the tags read out during lookup. The overlap test (∩ ≠ NULL) is true if START_new < END_Tag and END_new > START_Tag (where New is the incoming block and Tag is a resident block).
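The set-level mechanisms of §3.1–§3.5 can be summarized in a short behavioral sketch. This is our own illustration, not the paper's hardware: the function names are ours, the configuration follows the 64K example (RMAX = 8 words, 256 sets, 32 words per set, 4 logical ways), and the overlap test is written in the inclusive-range form for word offsets.

```python
# Behavioral sketch of one Amoeba-Cache set: set indexing, the range-tag
# lookup, the V?-bitmap insertion search, logical-way mapping for
# pseudo-LRU, and the partial-miss overlap test.

RMAX = 8          # words per region (64 bytes)
NSETS = 256
SET_WORDS = 32    # 256-byte set
NWAYS = 4         # logical ways for pseudo-LRU compatibility

def region_and_set(word_addr):
    """Mask the low log2(RMAX) bits; the rest give (region tag, set index)."""
    region = word_addr // RMAX
    return region // NSETS, region % NSETS

def block_holds(tag, word_addr):
    """The membership operator: region-tag equality plus a range check."""
    rtag, _ = region_and_set(word_addr)
    w = word_addr % RMAX              # word offset within the region
    return tag["region"] == rtag and tag["start"] <= w <= tag["end"]

def find_free_run(vbitmap, nwords):
    """Substring search for nwords contiguous 0s (free words) in V?.
    Returns the starting word index, or None if replacement must run."""
    run = 0
    for i, bit in enumerate(vbitmap):
        run = run + 1 if bit == 0 else 0
        if run == nwords:
            return i - nwords + 1
    return None

def logical_way(tag_word_index):
    """Map a tag's word position in the set to a logical way."""
    return tag_word_index // (SET_WORDS // NWAYS)

def overlaps(new_start, new_end, tag):
    """Partial-miss test, inclusive-range form of the paper's condition."""
    return new_start <= tag["end"] and new_end >= tag["start"]

# Lookup: region tag 7, set 5, word offset 2 hits a block caching words 1-3.
addr = (7 * NSETS + 5) * RMAX + 2
print(block_holds({"region": 7, "start": 1, "end": 3}, addr))   # True

# Insertion: a 5-word block (tag + 4 data words) finds a free run at word 4.
vbitmap = [1] * 4 + [0] * 5 + [1] * 23
print(find_free_run(vbitmap, 5))                                # 4

# Partial miss (Figure 6): incoming range 0-7 overlaps resident 1-3 and 5-6.
resident = [{"region": 7, "start": 1, "end": 3},
            {"region": 7, "start": 5, "end": 6}]
print([overlaps(0, 7, t) for t in resident])                    # [True, True]
```

In hardware, the tag comparisons and the 0s search run in parallel rather than in a loop; the sketch only captures the logic of each step.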
[Figure 4: layout of the Amoeba-Cache data array (Nsets rows × Nwords). Tag words (T) are interleaved with data words; each 64b tag holds <Region Tag, Start, End>, with Start/End ranging over 1..RMAX. The T? bitmap marks which words are tags, and the V? bitmap marks the allocated words of each Amoeba-Block.]
3.T
hew
rite
rw
aits
unti
litr
ecei
ves
ackn
owle
dgem
ents
from
all
the
shar
ers.
One
ofth
ese
ackn
owle
dgem
ents
wil
lco
ntai
nth
eda
ta.
Alt
erna
tivel
y,th
ew
rite
rw
ill
rece
ive
the
data
from
mem
ory.
4.T
hew
rite
rno
tifi
esth
edi
rect
ory
that
the
tran
sact
ion
isco
m-
plet
e.
TL
oper
ates
iden
tica
lly
exce
ptfo
rw
hen
itse
lect
s,in
step
2(b)
,apr
ovid
erth
atdo
esno
tha
veth
eda
ta.
Inpr
acti
ce,t
hepr
obab
ilit
yof
the
sele
cted
shar
erno
thav
ing
aco
pyis
very
low
;thu
s,th
eco
mm
onca
sefo
rT
Lpr
ocee
dsth
esa
me
asth
eor
igin
alpr
otoc
ol.
Inth
ele
ssco
mm
onca
se,
whe
nth
epr
ovid
erdo
esno
tha
vea
copy
,it
send
sa
NA
ckto
the
wri
ter
inst
ead
ofan
ackn
owle
dgem
ent.
Thi
sle
aves
thre
epo
ssib
lesc
enar
ios:
1.N
oco
pyex
ists
:T
hew
rite
rw
ill
wai
tto
rece
ive
all
ofth
eac
know
ledg
emen
tsan
dth
eN
Ack
,an
dsi
nce
itha
sno
tre
ceiv
edan
yda
ta,i
twil
lsen
da
requ
estt
oth
edi
rect
ory
whi
chth
enas
ksm
emor
yto
prov
ide
the
data
.3
2.A
clea
nco
pyex
ists
:F
orsi
mpl
icit
y,th
isis
hand
led
iden
-ti
call
yto
firs
tca
se.
All
shar
ers
inva
lida
teth
eir
copi
esan
dda
tais
retr
ieve
dfr
omm
emor
yin
stea
d.T
his
may
resu
ltin
ape
rfor
man
celo
ssif
itha
ppen
sfr
eque
ntly
,bu
tS
ecti
on5
show
sth
atno
perf
orm
ance
loss
occu
rsin
prac
tice
.
3.A
dirt
yco
pyex
ists
:O
neof
the
othe
rsh
arer
s,th
eow
ner,
wil
lbe
inth
eM
odifi
edor
Ow
ned
stat
e.T
heow
ner
cann
otdi
scar
dth
edi
rty
data
sinc
eit
does
not
know
that
the
prov
ider
has
aco
py.
Acc
ordi
ngly
,the
owne
ral
way
sin
clud
esth
eda
tain
its
ackn
owle
dgem
ent
toth
ew
rite
r.In
this
case
,th
ew
rite
rw
ill
rece
ive
the
prov
ider
’sN
Ack
and
the
owne
r’s
vali
dda
taan
dca
nth
usco
mpl
ete
its
requ
est.
Ifth
epr
ovid
erha
sa
copy
asw
ell,
the
wri
ter
wil
lre
ceiv
etw
oid
enti
cal
copi
esof
the
data
.S
ecti
on5.
5de
mon
stra
tes
that
this
isra
reen
ough
that
itdo
esno
tres
ulti
nan
ysi
gnifi
cant
incr
ease
inba
ndw
idth
util
izat
ion.
Inal
lca
ses,
all
tran
sact
ions
are
seri
aliz
edth
roug
hth
edi
rect
ory,
sono
addi
tion
alra
ceco
nsid
erat
ions
are
intr
oduc
ed.
Sim
ilar
ly,
the
unde
rlyi
ngde
adlo
ckan
dliv
eloc
kav
oida
nce
mec
hani
sms
rem
ain
appl
icab
le.
3T
his
requ
esti
sse
rial
ized
thro
ugh
the
dire
ctor
yto
avoi
dra
ces
wit
hev
icts
.
3.2.
3E
vict
ions
and
Inva
lida
tes
Whe
na
bloc
kis
evic
ted
orin
vali
date
dT
Ltr
ies
tocl
ear
the
appr
opri
ate
bits
inth
eco
rres
pond
ing
Blo
omfi
lter
tom
aint
ain
accu
racy
.E
ach
Blo
omfi
lter
repr
esen
tsth
ebl
ocks
ina
sing
lese
t,m
akin
git
sim
ple
tode
term
ine
whi
chbi
tsca
nbe
clea
red.
The
cach
esi
mpl
yev
alua
tes
all
the
hash
func
tion
sfo
ral
lre
mai
ning
bloc
ksin
the
set.
Thi
sre
quir
esno
extr
ata
gar
ray
acce
sses
asal
lta
gsar
eal
read
yre
adas
part
ofth
eno
rmal
cach
eop
erat
ion.
Any
bitm
appe
dto
byth
eev
icte
dor
inva
lida
ted
bloc
kca
nbe
clea
red
ifno
neof
the
rem
aini
ngbl
ocks
inth
ese
tm
apto
the
sam
ebi
t.W
hen
usin
gan
incl
usiv
eL
2ca
che,
asin
our
exam
ple,
itis
suffi
cien
tto
chec
kon
lyth
eon
eL
2ca
che
set.
Ifth
eL
2ca
che
isno
tin
clus
ive,
then
addi
tion
alL
1ac
cess
esm
aybe
nece
ssar
y.
3.3
AP
ract
ical
Impl
emen
tati
onC
once
ptua
lly,
TL
cons
ists
ofan
N×
Pgr
idof
Blo
omfi
lter
s,w
here
Nis
the
num
ber
ofca
che
sets
,an
dP
isth
enu
mbe
rof
cach
es,
assh
own
inF
igur
e1(
c).
Eac
hB
loom
filt
erco
ntai
nsk
bit-
vect
ors.
For
clar
ity
we
wil
lre
fer
toth
esi
zeof
thes
ebi
tve
ctor
sas
the
num
ber
ofbu
ckets
.In
prac
tice
,th
ese
filt
ers
can
beco
mbi
ned
into
afe
wS
RA
Mar
rays
prov
ided
that
afe
wco
ndit
ions
are
met
:1)
all
the
filt
ers
use
the
sam
ese
tof
kha
shfu
ncti
ons,
and
2)an
enti
rero
wof
Pfi
lter
sis
acce
ssed
inpa
rall
el.
The
sere
quir
emen
tsdo
not
sign
ifica
ntly
affe
ctth
eef
fect
iven
ess
ofth
eB
loom
filt
ers.
Con
cept
uall
y,ea
chro
wco
ntai
nsP
Blo
omfi
lter
s,ea
chw
ith
kha
shfu
ncti
ons.
Fig
ure
2(a)
show
son
esu
chro
w.
By
alw
ays
acce
ssin
gth
ese
filt
ers
inpa
rall
elus
ing
the
sam
eha
shfu
ncti
ons,
the
filt
ers
can
beco
mbi
ned
tofo
rma
sing
lelo
okup
tabl
e,as
show
nin
Fig
ure
2(b)
,w
here
each
row
cont
ains
aP
-bit
shar
ing
vect
or.
As
are
sult
,ea
chlo
okup
dire
ctly
prov
ides
aP
-bit
shar
ing
vect
orby
AN
D-i
ngk
shar
ing
vect
ors.
Sin
ceev
ery
row
ofB
loom
filt
ers
uses
the
sam
eha
shfu
ncti
ons,
we
can
furt
her
com
bine
the
look
upta
bles
for
alls
ets
for
each
hash
func
tion
,as
show
nin
Fig
ure
3(a)
.T
hese
look
upta
bles
can
beth
ough
tof
asa
“sea
ofsh
arin
gve
ctor
s”an
dca
nbe
orga
nize
din
toan
yst
ruct
ure
wit
hth
efo
llow
ing
cond
itio
ns:
1)a
uniq
uese
t,ha
shfu
ncti
onin
dex,
and
bloc
kad
dres
sm
apto
asi
ngle
uniq
uesh
arin
gve
ctor
;and
2)k
shar
ing
vect
ors
can
beac
cess
edin
para
llel
,w
here
kis
the
num
ber
ofha
shfu
ncti
ons.
An
exam
ple
impl
emen
tati
onus
esa
sing
le-p
orte
d,tw
o-di
men
sion
alS
RA
Mar
ray
for
each
hash
func
tion
,w
here
the
set
inde
xse
lect
sa
row
,an
dth
eha
shfu
ncti
onse
lect
sa
shar
ing
vect
or,
assh
own
inF
igur
e3(
b).
As
apr
acti
cal
exam
ple,
Sec
tion
5sh
ows
that
aT
Lde
sign
wit
hfo
urha
shta
bles
and
64bu
cket
sin
each
filt
erpe
rfor
ms
wel
lfo
r
!"#$ %
!"#$ &
'$(
%)*+(,--#$..
/0./% /0./1
,--#$..
/0./% /0./1
(a)
!"#$%
!"#$&
!"#$%&'()*$+,&-./%0*
122*.''
()'(3 ()'(4
5.%
(b)
Fig
ure
2:(a
)Ana
ïve
impl
emen
tati
onw
ith
one
filte
rpe
rco
re.(
b)C
ombi
ning
allo
fthe
filte
rsfo
ron
ero
w.
V? Bitmap
.........
Amoeba-Block
3.Th
e writ
erwa
itsun
tilit
rece
ives
ackn
owled
gem
ents
from
allth
esh
arer
s.On
eof
thes
eac
know
ledge
men
tswi
llco
ntain
the
data.
Alter
nativ
ely, t
hewr
iter w
illre
ceiv
eth
eda
tafro
mm
emor
y.
4.Th
ewr
iter n
otifi
esth
edi
recto
ryth
atth
etra
nsac
tion
isco
m-
plete
.
TLop
erate
s ide
ntica
llyex
cept
for w
hen
itse
lects,
inste
p2(
b), a
prov
ider
that
does
not h
ave t
heda
ta.In
prac
tice,
the p
roba
bilit
yof
the s
electe
d sha
rer n
otha
ving
a cop
y is v
ery l
ow; t
hus,
the c
omm
onca
sefo
r TL
proc
eeds
the s
ame a
s the
orig
inal
prot
ocol
. In
the l
ess
com
mon
case
, whe
nth
epr
ovid
erdo
esno
t hav
ea
copy
, it s
ends
aNA
ckto
the
write
r ins
tead
ofan
ackn
owled
gem
ent.
This
leave
sth
ree p
ossib
lesc
enar
ios:
1.No
copy
exist
s:Th
ewr
iter
will
wait
tore
ceiv
eall
ofth
eac
know
ledge
men
tsan
dth
eNA
ck, a
ndsin
ceit
has
not
rece
ived
any d
ata, i
t will
send
a req
uest
toth
e dire
ctory
which
then
asks
mem
ory
topr
ovid
e the
data.
3
2.A
clean
copy
exist
s:Fo
r sim
plici
ty,th
isis
hand
ledid
en-
ticall
yto
first
case
.Al
l sha
rers
inva
lidate
their
copi
esan
dda
tais
retri
eved
from
mem
ory
inste
ad.
This
may
resu
ltin
ape
rform
ance
loss
ifit
happ
ens
frequ
ently
, but
Secti
on5
show
s tha
t no
perfo
rman
celo
ssoc
curs
inpr
actic
e.
3.A
dirty
copy
exist
s:On
e of t
heot
her s
hare
rs,th
e own
er, w
illbe
inth
e Mod
ified
orOw
ned s
tate.
The o
wner
cann
otdi
scar
dth
edi
rtyda
tasin
ceit
does
not k
now
that
the
prov
ider
has a
copy
. Acc
ordi
ngly,
the o
wner
alwa
ysin
clude
s the
data
inits
ackn
owled
gem
ent t
oth
ewr
iter.
Inth
isca
se, t
hewr
iter w
illre
ceiv
eth
epr
ovid
er’s
NAck
and
the
owne
r’sva
lidda
taan
dca
nth
usco
mpl
eteits
requ
est.
Ifth
epr
ovid
erha
s aco
pyas
well,
the w
riter
will
rece
ive t
woid
entic
alco
pies
ofth
e data
.Se
ction
5.5
dem
onstr
ates t
hat t
his i
s rar
e eno
ugh
that
itdo
esno
t res
ult i
n any
signi
fican
t inc
reas
e in b
andw
idth
utili
zatio
n.
Inall
case
s,all
trans
actio
nsar
e ser
ialize
dth
roug
hth
e dire
ctory
,so
noad
ditio
nal r
ace
cons
ider
ation
s are
intro
duce
d.Si
mila
rly, t
heun
derly
ing
dead
lock
and
livelo
ckav
oida
nce
mec
hani
sms
rem
ainap
plica
ble.
3 This
requ
est i
s ser
ialize
d thr
ough
the d
irecto
ryto
avoi
d rac
eswi
thev
icts.
3.2.
3Ev
ictio
nsan
dIn
valid
ates
Whe
na
bloc
kis
evict
edor
inva
lidate
dTL
tries
tocle
arth
eap
prop
riate
bits
inth
eco
rresp
ondi
ngBl
oom
filter
tom
ainta
inac
cura
cy.
Each
Bloo
mfil
terre
pres
ents
the
bloc
ksin
asin
gle
set,
mak
ing
itsim
ple t
o dete
rmin
e whi
chbi
tsca
n be c
leare
d.Th
e cac
hesim
ply
evalu
ates a
llth
eha
shfu
nctio
nsfo
r all
rem
ainin
gbl
ocks
inth
ese
t.Th
isre
quire
sno
extra
tagar
ray
acce
sses
asall
tags
are
alrea
dyre
adas
part
ofth
e nor
mal
cach
e ope
ratio
n.An
y bit
map
ped
toby
the
evict
edor
inva
lidate
dbl
ock
can
becle
ared
ifno
neof
the
rem
ainin
gbl
ocks
inth
ese
t map
toth
esa
me
bit.
Whe
nus
ing
anin
clusiv
eL2
cach
e,as
inou
r exa
mpl
e,it
issu
fficie
ntto
chec
kon
lyth
eon
eL2
cach
ese
t.If
the
L2ca
che
isno
t inc
lusiv
e,th
enad
ditio
nal L
1ac
cess
esm
aybe
nece
ssar
y.
3.3 A Practical Implementation

Conceptually, TL consists of an N×P grid of Bloom filters, where N is the number of cache sets, and P is the number of caches, as shown in Figure 1(c). Each Bloom filter contains k bit-vectors. For clarity we will refer to the size of these bit vectors as the number of buckets. In practice, these filters can be combined into a few SRAM arrays provided that a few conditions are met: 1) all the filters use the same set of k hash functions, and 2) an entire row of P filters is accessed in parallel. These requirements do not significantly affect the effectiveness of the Bloom filters.

Conceptually, each row contains P Bloom filters, each with k hash functions. Figure 2(a) shows one such row. By always accessing these filters in parallel using the same hash functions, the filters can be combined to form a single lookup table, as shown in Figure 2(b), where each row contains a P-bit sharing vector. As a result, each lookup directly provides a P-bit sharing vector by AND-ing k sharing vectors. Since every row of Bloom filters uses the same hash functions, we can further combine the lookup tables for all sets for each hash function, as shown in Figure 3(a). These lookup tables can be thought of as a "sea of sharing vectors" and can be organized into any structure with the following conditions: 1) a unique set, hash function index, and block address map to a single unique sharing vector; and 2) k sharing vectors can be accessed in parallel, where k is the number of hash functions. An example implementation uses a single-ported, two-dimensional SRAM array for each hash function, where the set index selects a row, and the hash function selects a sharing vector, as shown in Figure 3(b). As a practical example, Section 5 shows that a TL design with four hash tables and 64 buckets in each filter performs well for
Figure 2: (a) A naïve implementation with one filter per core. (b) Combining all of the filters for one row.
3. The writer waits until it receives acknowledgements from all the sharers. One of these acknowledgements will contain the data. Alternatively, the writer will receive the data from memory.

4. The writer notifies the directory that the transaction is complete.
TL operates identically except for when it selects, in step 2(b), a provider that does not have the data. In practice, the probability of the selected sharer not having a copy is very low; thus, the common case for TL proceeds the same as the original protocol. In the less common case, when the provider does not have a copy, it sends a NAck to the writer instead of an acknowledgement. This leaves three possible scenarios:
1. No copy exists: The writer will wait to receive all of the acknowledgements and the NAck, and since it has not received any data, it will send a request to the directory which then asks memory to provide the data.³
2. A clean copy exists: For simplicity, this is handled identically to the first case. All sharers invalidate their copies and data is retrieved from memory instead. This may result in a performance loss if it happens frequently, but Section 5 shows that no performance loss occurs in practice.
3. A dirty copy exists: One of the other sharers, the owner, will be in the Modified or Owned state. The owner cannot discard the dirty data since it does not know that the provider has a copy. Accordingly, the owner always includes the data in its acknowledgement to the writer. In this case, the writer will receive the provider's NAck and the owner's valid data and can thus complete its request. If the provider has a copy as well, the writer will receive two identical copies of the data. Section 5.5 demonstrates that this is rare enough that it does not result in any significant increase in bandwidth utilization.
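The three scenarios reduce to a simple decision at the writer: complete once data arrives in any acknowledgement, and fall back to a directory request for memory data only when a NAck arrives without data. A hypothetical sketch of that decision (the message tuples and outcome names are illustrative, not the protocol's actual encoding):

```python
# Writer-side outcome after collecting all sharer responses under TL.
# Each response is an illustrative tuple (kind, has_data),
# where kind is "ack" or "nack".
def resolve(responses):
    """Decide how the writer obtains the data once every response is in."""
    got_data = any(has_data for _, has_data in responses)
    got_nack = any(kind == "nack" for kind, _ in responses)
    if got_data:
        return "data-from-owner"    # scenario 3: owner's ack carried the data
    if got_nack:
        return "fetch-from-memory"  # scenarios 1 and 2: re-request via directory
    return "data-from-memory"       # original protocol: memory supplies the data

print(resolve([("nack", False), ("ack", False)]))  # fetch-from-memory
print(resolve([("nack", False), ("ack", True)]))   # data-from-owner
```

Note that the clean-copy and no-copy scenarios are indistinguishable to the writer, which is exactly why the text handles them identically.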
Inall
case
s,all
trans
actio
nsar
e ser
ialize
dth
roug
hth
e dire
ctory
,so
noad
ditio
nal r
ace
cons
ider
ation
s are
intro
duce
d.Si
mila
rly, t
heun
derly
ing
dead
lock
and
livelo
ckav
oida
nce
mec
hani
sms
rem
ainap
plica
ble.
3 This
requ
est i
s ser
ialize
d thr
ough
the d
irecto
ryto
avoi
d rac
eswi
thev
icts.
3.2.
3Ev
ictio
nsan
dIn
valid
ates
Whe
na
bloc
kis
evict
edor
inva
lidate
dTL
tries
tocle
arth
eap
prop
riate
bits
inth
eco
rresp
ondi
ngBl
oom
filter
tom
ainta
inac
cura
cy.
Each
Bloo
mfil
terre
pres
ents
the
bloc
ksin
asin
gle
set,
mak
ing
itsim
ple t
o dete
rmin
e whi
chbi
tsca
n be c
leare
d.Th
e cac
hesim
ply
evalu
ates a
llth
eha
shfu
nctio
nsfo
r all
rem
ainin
gbl
ocks
inth
ese
t.Th
isre
quire
sno
extra
tagar
ray
acce
sses
asall
tags
are
alrea
dyre
adas
part
ofth
e nor
mal
cach
e ope
ratio
n.An
y bit
map
ped
toby
the
evict
edor
inva
lidate
dbl
ock
can
becle
ared
ifno
neof
the
rem
ainin
gbl
ocks
inth
ese
t map
toth
esa
me
bit.
Whe
nus
ing
anin
clusiv
eL2
cach
e,as
inou
r exa
mpl
e,it
issu
fficie
ntto
chec
kon
lyth
eon
eL2
cach
ese
t.If
the
L2ca
che
isno
t inc
lusiv
e,th
enad
ditio
nal L
1ac
cess
esm
aybe
nece
ssar
y.
3.3A
Prac
tical
Impl
emen
tatio
nCo
ncep
tuall
y,TL
cons
ists
ofan
N×
Pgr
idof
Bloo
mfil
ters,
wher
eN
isth
enu
mbe
rof
cach
ese
ts,an
dP
isth
enu
mbe
rof
cach
es, a
s sho
wnin
Figu
re1(
c). E
ach
Bloo
mfil
terco
ntain
s kbi
t-ve
ctors.
For c
larity
wewi
llre
fer t
oth
e size
ofth
ese
bit v
ecto
rsas
the
num
ber o
f buc
kets
. In
prac
tice,
thes
efil
ters c
anbe
com
bine
din
toa f
ewSR
AMar
rays
prov
ided
that
a few
cond
ition
s are
met:
1)all
the fi
lters
use t
hesa
me s
etof
kha
shfu
nctio
ns, a
nd2)
anen
tire
row
ofP
filter
s is a
cces
sed
inpa
ralle
l.Th
ese r
equi
rem
ents
dono
tsig
nific
antly
affe
ctth
e effe
ctive
ness
ofth
e Blo
omfil
ters.
Conc
eptu
ally,
each
row
cont
ains
PBl
oom
filter
s,ea
chwi
thk
hash
func
tions
.Fi
gure
2(a)
show
son
esu
chro
w.By
alway
sac
cess
ing
thes
e filte
rsin
para
llel u
sing
the s
ame h
ash f
uncti
ons,
the
filter
s can
beco
mbi
ned
tofo
rma s
ingl
elo
okup
table,
assh
own
inFi
gure
2(b)
, whe
reea
chro
wco
ntain
sa
P-bi
t sha
ring
vecto
r.As
are
sult,
each
look
updi
rectl
ypr
ovid
esa
P-b
itsh
arin
gve
ctor b
yAN
D-in
gk
shar
ing
vecto
rs.Si
nce e
very
row
ofBl
oom
filter
s use
sth
e sam
e has
hfu
nctio
ns, w
e can
furth
erco
mbi
neth
e loo
kup
tables
for a
llse
tsfo
r eac
hha
shfu
nctio
n,as
show
nin
Figu
re3(
a).
Thes
elo
okup
tables
can
beth
ough
tof
asa
“sea
ofsh
arin
gve
ctors”
and
can
beor
gani
zed
into
any
struc
ture
with
the f
ollo
wing
cond
ition
s:1)
a uni
que s
et,ha
shfu
nctio
nin
dex,
and
bloc
kad
dres
sm
apto
a sin
gle u
niqu
e sha
ring
vecto
r;an
d2)
ksh
arin
gve
ctors
can
beac
cess
edin
para
llel,
wher
ek
isth
enu
mbe
r of h
ash
func
tions
.An
exam
ple i
mpl
emen
tatio
nus
esa s
ingl
e-po
rted,
two-
dim
ensio
nal
SRAM
arra
yfo
r eac
hha
shfu
nctio
n,wh
ere
the
set i
ndex
selec
tsa
row,
and
the
hash
func
tion
selec
tsa
shar
ing
vecto
r,as
show
nin
Figu
re3(
b).
Asa
prac
tical
exam
ple,
Secti
on5
show
s tha
t aTL
desig
nwi
thfo
urha
shtab
lesan
d64
buck
etsin
each
filter
perfo
rms
well
for
!"#$%
!"#$&
'$(
%)*+(
,--#$..
/0./% /0./1
,--#$..
/0./% /0./1
(a)
!"#$%
!"#$&
!"#$%&'(
)*$+,
&-./%0
*122*.''
()'(3 ()'(4
5.%
(b)
Figu
re2:
(a) A
naïve
impl
emen
tatio
nwi
thon
e filte
r per
core
. (b)
Com
bini
ngall
ofth
e filte
rsfo
r one
row.
RMAXRegion
3.Th
ewrit
erwa
itsun
tilit
rece
ivesa
ckno
wled
gem
ents
from
allth
esh
arer
s.On
eof
thes
eac
know
ledge
men
tswi
llco
ntain
thed
ata.A
ltern
ative
ly,th
ewrit
erwi
llre
ceive
thed
atafro
mm
emor
y.
4.Th
ewrit
erno
tifies
thed
irecto
ryth
atth
etra
nsac
tion
isco
m-
plete
.
TLop
erate
side
ntica
llyex
cept
forw
heni
tsele
cts,i
nstep
2(b)
,apr
ovid
erth
atdo
esno
thav
ethe
data.
Inpr
actic
e,th
epro
babi
lity
ofth
esele
cteds
hare
rnot
havi
ngac
opyi
sver
ylow
;thu
s,th
ecom
mon
case
forT
Lpr
ocee
dsth
esam
east
heor
igin
alpr
otoc
ol.I
nth
eles
sco
mm
onca
se,w
hen
thep
rovid
erdo
esno
thav
eaco
py,i
tsen
dsa
NAck
toth
ewr
iteri
nstea
dof
anac
know
ledge
men
t.Th
islea
ves
thre
epos
sible
scen
ario
s:
1.No
copy
exist
s:Th
ewr
iter
will
wait
tore
ceive
allof
the
ackn
owled
gem
ents
and
the
NAck
,and
since
itha
sno
tre
ceive
dany
data,
itwill
send
areq
uest
toth
edire
ctory
which
then
asks
mem
ory
topr
ovid
ethe
data.
3
2.A
clean
copy
exist
s:Fo
rsim
plici
ty,th
isis
hand
ledid
en-
ticall
yto
first
case
.Al
lsha
rers
invali
date
their
copi
esan
dda
tais
retri
eved
from
mem
ory
inste
ad.
This
may
resu
ltin
ape
rform
ance
loss
ifit
happ
ens
frequ
ently
,but
Secti
on5
show
stha
tno
perfo
rman
celo
ssoc
curs
inpr
actic
e.
3.A
dirty
copy
exist
s:On
eoft
heot
hers
hare
rs,th
eown
er,w
illbe
inth
eMod
ified
orOw
neds
tate.
Theo
wner
cann
otdi
scar
dth
edirt
yda
tasin
ceit
does
notk
now
that
thep
rovi
derh
asa
copy
.Acc
ordi
ngly,
theo
wner
alwa
ysin
clude
sthe
data
inits
ackn
owled
gem
entt
oth
ewrit
er.In
this
case
,the
write
rwill
rece
iveth
epr
ovid
er’s
NAck
and
the
owne
r’sva
lidda
taan
dca
nth
usco
mpl
eteits
requ
est.
Ifth
epro
vide
rhas
acop
yas
well,
thew
riter
will
rece
ivetw
oid
entic
alco
pies
ofth
edata
.Se
ction
5.5de
mon
strate
stha
tthi
sisr
aree
noug
hth
atit
does
notr
esul
tina
nysig
nific
anti
ncre
asei
nban
dwid
thut
iliza
tion.
Inall
case
s,all
trans
actio
nsar
eser
ialize
dth
roug
hth
edire
ctory
,so
noad
ditio
nalr
acec
onsid
erati
onsa
rein
trodu
ced.
Sim
ilarly
,the
unde
rlyin
gde
adlo
ckan
dliv
elock
avoi
danc
em
echa
nism
sre
main
appl
icabl
e.
3 This
requ
esti
sser
ialize
dthr
ough
thed
irecto
ryto
avoi
drac
eswi
thev
icts.
3.2.
3Ev
ictio
nsan
dInv
alid
ates
Whe
na
bloc
kis
evict
edor
invali
dated
TLtri
esto
clear
the
appr
opria
tebi
tsin
the
corre
spon
ding
Bloo
mfil
terto
main
tain
accu
racy
.Ea
chBl
oom
filter
repr
esen
tsth
ebl
ocks
ina
singl
ese
t,m
akin
gits
impl
etod
eterm
inew
hich
bits
canb
eclea
red.
Thec
ache
simpl
yev
aluate
sall
theh
ash
func
tions
fora
llre
main
ing
bloc
ksin
the
set.
This
requ
ires
noex
tratag
arra
yac
cess
esas
alltag
sar
ealr
eady
read
aspa
rtof
then
orm
alca
cheo
pera
tion.
Anyb
itm
appe
dto
byth
eev
icted
orinv
alida
tedbl
ock
can
becle
ared
ifno
neof
the
rem
ainin
gbl
ocks
inth
ese
tmap
toth
esa
me
bit.
Whe
nus
ing
anin
clusiv
eL2
cach
e,as
inou
rexa
mpl
e,it
issu
fficie
ntto
chec
kon
lyth
eon
eL2
cach
ese
t.If
the
L2ca
che
isno
tinc
lusiv
e,th
enad
ditio
nalL
1acc
esse
smay
bene
cess
ary.
3.3A
Prac
tical
Impl
emen
tatio
nCo
ncep
tuall
y,TL
cons
istso
fan
N×
Pgr
idof
Bloo
mfil
ters,
wher
eN
isth
enu
mbe
rof
cach
ese
ts,an
dP
isth
enu
mbe
rof
cach
es,a
ssho
wnin
Figu
re1(
c).E
ach
Bloo
mfil
terco
ntain
skbi
t-ve
ctors.
Forc
larity
wewi
llre
fert
oth
esize
ofth
eseb
itve
ctors
asth
enum
bero
fbuc
kets
.In
prac
tice,
thes
efilte
rsca
nbe
com
bine
din
toaf
ewSR
AMar
rays
prov
ided
that
afew
cond
ition
sare
met:
1)all
thefi
lters
uset
hesa
mes
etof
kha
shfu
nctio
ns,a
nd2)
anen
tire
row
ofP
filter
sisa
cces
sed
inpa
ralle
l.Th
eser
equi
rem
ents
dono
tsig
nific
antly
affec
tthe
effec
tiven
esso
fthe
Bloo
mfil
ters.
Conc
eptu
ally,
each
row
cont
ains
PBl
oom
filter
s,ea
chwi
thk
hash
func
tions
.Fi
gure
2(a)
show
son
esu
chro
w.By
alway
sac
cess
ingt
hese
filter
sinp
arall
elus
ingt
hesa
meh
ashf
uncti
ons,
the
filter
scan
beco
mbi
ned
tofo
rmas
ingl
eloo
kup
table,
assh
own
inFi
gure
2(b)
,whe
reea
chro
wco
ntain
saP-
bits
harin
gve
ctor.
Asa
resu
lt,ea
chlo
okup
dire
ctly
prov
ides
aP
-bit
shar
ing
vecto
rby
AND-
ing
ksh
arin
gve
ctors.
Sinc
eeve
ryro
wof
Bloo
mfil
tersu
ses
thes
ameh
ash
func
tions
,wec
anfu
rther
com
bine
thel
ooku
ptab
lesfo
rall
sets
fore
achh
ash
func
tion,
assh
own
inFi
gure
3(a)
.Th
ese
look
uptab
lesca
nbe
thou
ght
ofas
a“s
eaof
shar
ing
vecto
rs”an
dcan
beor
gani
zedi
ntoa
nystr
uctu
rewi
thth
efol
lowin
gco
nditi
ons:
1)au
niqu
eset,
hash
func
tion
inde
x,an
dbl
ock
addr
ess
map
toas
ingl
euni
ques
harin
gvec
tor;
and2
)ksh
arin
gvec
tors
can
beac
cess
edin
para
llel,
wher
ek
isth
enu
mbe
rofh
ash
func
tions
.An
exam
plei
mpl
emen
tatio
nuse
sasin
gle-
porte
d,tw
o-di
men
siona
lSR
AMar
ray
fore
ach
hash
func
tion,
wher
eth
ese
tind
exse
lects
arow
,and
theh
ash
func
tion
selec
tsas
harin
gve
ctor,
assh
own
inFi
gure
3(b)
.As
apr
actic
alex
ampl
e,Se
ction
5sh
owst
hata
TLde
sign
with
four
hash
tables
and
64bu
ckets
inea
chfil
terpe
rform
swe
llfo
r
!"#$%
!"#$&
'$(
%)*+(
,--#$..
/0./% /0./1
,--#$..
/0./% /0./1
(a)
!"#$%
!"#$&
!"#$%&'(
)*$+,
&-./%0
*122*.''
()'(3 ()'(4
5.%
(b)
Figu
re2:
(a)A
naïve
impl
emen
tatio
nwith
onefi
lterp
erco
re.(
b)Co
mbi
ning
allof
thefi
lters
foron
erow
.
3.Th
ewrit
erwa
itsun
tilit
rece
ivesa
ckno
wled
gem
ents
from
allth
esh
arer
s.On
eof
thes
eac
know
ledge
men
tswi
llco
ntain
thed
ata.A
ltern
ative
ly,th
ewrit
erwi
llre
ceive
thed
atafro
mm
emor
y.
4.Th
ewrit
erno
tifies
thed
irecto
ryth
atth
etra
nsac
tion
isco
m-
plete
.
TLop
erate
side
ntica
llyex
cept
forw
heni
tsele
cts,i
nstep
2(b)
,apr
ovid
erth
atdo
esno
thav
ethe
data.
Inpr
actic
e,th
epro
babi
lity
ofth
esele
cteds
hare
rnot
havi
ngac
opyi
sver
ylow
;thu
s,th
ecom
mon
case
forT
Lpr
ocee
dsth
esam
east
heor
igin
alpr
otoc
ol.I
nth
eles
sco
mm
onca
se,w
hen
thep
rovid
erdo
esno
thav
eaco
py,i
tsen
dsa
NAck
toth
ewr
iteri
nstea
dof
anac
know
ledge
men
t.Th
islea
ves
thre
epos
sible
scen
ario
s:
1.No
copy
exist
s:Th
ewr
iter
will
wait
tore
ceive
allof
the
ackn
owled
gem
ents
and
the
NAck
,and
since
itha
sno
tre
ceive
dany
data,
itwill
send
areq
uest
toth
edire
ctory
which
then
asks
mem
ory
topr
ovid
ethe
data.
3
2.A
clean
copy
exist
s:Fo
rsim
plici
ty,th
isis
hand
ledid
en-
ticall
yto
first
case
.Al
lsha
rers
invali
date
their
copi
esan
dda
tais
retri
eved
from
mem
ory
inste
ad.
This
may
resu
ltin
ape
rform
ance
loss
ifit
happ
ens
frequ
ently
,but
Secti
on5
show
stha
tno
perfo
rman
celo
ssoc
curs
inpr
actic
e.
3.A
dirty
copy
exist
s:On
eoft
heot
hers
hare
rs,th
eown
er,w
illbe
inth
eMod
ified
orOw
neds
tate.
Theo
wner
cann
otdi
scar
dth
edirt
yda
tasin
ceit
does
notk
now
that
thep
rovi
derh
asa
copy
.Acc
ordi
ngly,
theo
wner
alwa
ysin
clude
sthe
data
inits
ackn
owled
gem
entt
oth
ewrit
er.In
this
case
,the
write
rwill
rece
iveth
epr
ovid
er’s
NAck
and
the
owne
r’sva
lidda
taan
dca
nth
usco
mpl
eteits
requ
est.
Ifth
epro
vide
rhas
acop
yas
well,
thew
riter
will
rece
ivetw
oid
entic
alco
pies
ofth
edata
.Se
ction
5.5de
mon
strate
stha
tthi
sisr
aree
noug
hth
atit
does
notr
esul
tina
nysig
nific
anti
ncre
asei
nban
dwid
thut
iliza
tion.
Inall
case
s,all
trans
actio
nsar
eser
ialize
dth
roug
hth
edire
ctory
,so
noad
ditio
nalr
acec
onsid
erati
onsa
rein
trodu
ced.
Sim
ilarly
,the
unde
rlyin
gde
adlo
ckan
dliv
elock
avoi
danc
em
echa
nism
sre
main
appl
icabl
e.
3 This
requ
esti
sser
ialize
dthr
ough
thed
irecto
ryto
avoi
drac
eswi
thev
icts.
3.2.
3Ev
ictio
nsan
dInv
alid
ates
Whe
na
bloc
kis
evict
edor
invali
dated
TLtri
esto
clear
the
appr
opria
tebi
tsin
the
corre
spon
ding
Bloo
mfil
terto
main
tain
accu
racy
.Ea
chBl
oom
filter
repr
esen
tsth
ebl
ocks
ina
singl
ese
t,m
akin
gits
impl
etod
eterm
inew
hich
bits
canb
eclea
red.
Thec
ache
simpl
yev
aluate
sall
theh
ash
func
tions
fora
llre
main
ing
bloc
ksin
the
set.
This
requ
ires
noex
tratag
arra
yac
cess
esas
alltag
sar
ealr
eady
read
aspa
rtof
then
orm
alca
cheo
pera
tion.
Anyb
itm
appe
dto
byth
eev
icted
orinv
alida
tedbl
ock
can
becle
ared
ifno
neof
the
rem
ainin
gbl
ocks
inth
ese
tmap
toth
esa
me
bit.
Whe
nus
ing
anin
clusiv
eL2
cach
e,as
inou
rexa
mpl
e,it
issu
fficie
ntto
chec
kon
lyth
eon
eL2
cach
ese
t.If
the
L2ca
che
isno
tinc
lusiv
e,th
enad
ditio
nalL
1acc
esse
smay
bene
cess
ary.
3.3A
Prac
tical
Impl
emen
tatio
nCo
ncep
tuall
y,TL
cons
istso
fan
N×
Pgr
idof
Bloo
mfil
ters,
wher
eN
isth
enu
mbe
rof
cach
ese
ts,an
dP
isth
enu
mbe
rof
cach
es,a
ssho
wnin
Figu
re1(
c).E
ach
Bloo
mfil
terco
ntain
skbi
t-ve
ctors.
Forc
larity
wewi
llre
fert
oth
esize
ofth
eseb
itve
ctors
asth
enum
bero
fbuc
kets
.In
prac
tice,
thes
efilte
rsca
nbe
com
bine
din
toaf
ewSR
AMar
rays
prov
ided
that
afew
cond
ition
sare
met:
1)all
thefi
lters
uset
hesa
mes
etof
kha
shfu
nctio
ns,a
nd2)
anen
tire
row
ofP
filter
sisa
cces
sed
inpa
ralle
l.Th
eser
equi
rem
ents
dono
tsig
nific
antly
affec
tthe
effec
tiven
esso
fthe
Bloo
mfil
ters.
Conc
eptu
ally,
each
row
cont
ains
PBl
oom
filter
s,ea
chwi
thk
hash
func
tions
.Fi
gure
2(a)
show
son
esu
chro
w.By
alway
sac
cess
ingt
hese
filter
sinp
arall
elus
ingt
hesa
meh
ashf
uncti
ons,
the
filter
scan
beco
mbi
ned
tofo
rmas
ingl
eloo
kup
table,
assh
own
inFi
gure
2(b)
,whe
reea
chro
wco
ntain
saP-
bits
harin
gve
ctor.
Asa
resu
lt,ea
chlo
okup
dire
ctly
prov
ides
aP
-bit
shar
ing
vecto
rby
AND-
ing
ksh
arin
gve
ctors.
Sinc
eeve
ryro
wof
Bloo
mfil
tersu
ses
thes
ameh
ash
func
tions
,wec
anfu
rther
com
bine
thel
ooku
ptab
lesfo
rall
sets
fore
achh
ash
func
tion,
assh
own
inFi
gure
3(a)
.Th
ese
look
uptab
lesca
nbe
thou
ght
ofas
a“s
eaof
shar
ing
vecto
rs”an
dcan
beor
gani
zedi
ntoa
nystr
uctu
rewi
thth
efol
lowin
gco
nditi
ons:
1)au
niqu
eset,
hash
func
tion
inde
x,an
dbl
ock
addr
ess
map
toas
ingl
euni
ques
harin
gvec
tor;
and2
)ksh
arin
gvec
tors
can
beac
cess
edin
para
llel,
wher
ek
isth
enu
mbe
rofh
ash
func
tions
.An
exam
plei
mpl
emen
tatio
nuse
sasin
gle-
porte
d,tw
o-di
men
siona
lSR
AMar
ray
fore
ach
hash
func
tion,
wher
eth
ese
tind
exse
lects
arow
,and
theh
ash
func
tion
selec
tsas
harin
gve
ctor,
assh
own
inFi
gure
3(b)
.As
apr
actic
alex
ampl
e,Se
ction
5sh
owst
hata
TLde
sign
with
four
hash
tables
and
64bu
ckets
inea
chfil
terpe
rform
swe
llfo
r
!"#$%
!"#$&
'$(
%)*+(
,--#$..
/0./% /0./1
,--#$..
/0./% /0./1
(a)
!"#$%
!"#$&
!"#$%&'(
)*$+,
&-./%0
*122*.''
()'(3 ()'(4
5.%
(b)
Figu
re2:
(a)A
naïve
impl
emen
tatio
nwith
onefi
lterp
erco
re.(
b)Co
mbi
ning
allof
thefi
lters
foron
erow
.
3.Th
ewrit
erwa
itsun
tilit
rece
ivesa
ckno
wled
gem
ents
from
allth
esh
arer
s.On
eof
thes
eac
know
ledge
men
tswi
llco
ntain
thed
ata.A
ltern
ative
ly,th
ewrit
erwi
llre
ceive
thed
atafro
mm
emor
y.
4.Th
ewrit
erno
tifies
thed
irecto
ryth
atth
etra
nsac
tion
isco
m-
plete
.
TLop
erate
side
ntica
llyex
cept
forw
heni
tsele
cts,i
nstep
2(b)
,apr
ovid
erth
atdo
esno
thav
ethe
data.
Inpr
actic
e,th
epro
babi
lity
ofth
esele
cteds
hare
rnot
havi
ngac
opyi
sver
ylow
;thu
s,th
ecom
mon
case
forT
Lpr
ocee
dsth
esam
east
heor
igin
alpr
otoc
ol.I
nth
eles
sco
mm
onca
se,w
hen
thep
rovid
erdo
esno
thav
eaco
py,i
tsen
dsa
NAck
toth
ewr
iteri
nstea
dof
anac
know
ledge
men
t.Th
islea
ves
thre
epos
sible
scen
ario
s:
1.No
copy
exist
s:Th
ewr
iter
will
wait
tore
ceive
allof
the
ackn
owled
gem
ents
and
the
NAck
,and
since
itha
sno
tre
ceive
dany
data,
itwill
send
areq
uest
toth
edire
ctory
which
then
asks
mem
ory
topr
ovid
ethe
data.
3
2.A
clean
copy
exist
s:Fo
rsim
plici
ty,th
isis
hand
ledid
en-
ticall
yto
first
case
.Al
lsha
rers
invali
date
their
copi
esan
dda
tais
retri
eved
from
mem
ory
inste
ad.
This
may
resu
ltin
ape
rform
ance
loss
ifit
happ
ens
frequ
ently
,but
Secti
on5
show
stha
tno
perfo
rman
celo
ssoc
curs
inpr
actic
e.
3.A
dirty
copy
exist
s:On
eoft
heot
hers
hare
rs,th
eown
er,w
illbe
inth
eMod
ified
orOw
neds
tate.
Theo
wner
cann
otdi
scar
dth
edirt
yda
tasin
ceit
does
notk
now
that
thep
rovi
derh
asa
copy
.Acc
ordi
ngly,
theo
wner
alwa
ysin
clude
sthe
data
inits
ackn
owled
gem
entt
oth
ewrit
er.In
this
case
,the
write
rwill
rece
iveth
epr
ovid
er’s
NAck
and
the
owne
r’sva
lidda
taan
dca
nth
usco
mpl
eteits
requ
est.
Ifth
epro
vide
rhas
acop
yas
well,
thew
riter
will
rece
ivetw
oid
entic
alco
pies
ofth
edata
.Se
ction
5.5de
mon
strate
stha
tthi
sisr
aree
noug
hth
atit
does
notr
esul
tina
nysig
nific
anti
ncre
asei
nban
dwid
thut
iliza
tion.
Inall
case
s,all
trans
actio
nsar
eser
ialize
dth
roug
hth
edire
ctory
,so
noad
ditio
nalr
acec
onsid
erati
onsa
rein
trodu
ced.
Sim
ilarly
,the
unde
rlyin
gde
adlo
ckan
dliv
elock
avoi
danc
em
echa
nism
sre
main
appl
icabl
e.
3 This
requ
esti
sser
ialize
dthr
ough
thed
irecto
ryto
avoi
drac
eswi
thev
icts.
3.2.
3Ev
ictio
nsan
dInv
alid
ates
Whe
na
bloc
kis
evict
edor
invali
dated
TLtri
esto
clear
the
appr
opria
tebi
tsin
the
corre
spon
ding
Bloo
mfil
terto
main
tain
accu
racy
.Ea
chBl
oom
filter
repr
esen
tsth
ebl
ocks
ina
singl
ese
t,m
akin
gits
impl
etod
eterm
inew
hich
bits
canb
eclea
red.
Thec
ache
simpl
yev
aluate
sall
theh
ash
func
tions
fora
llre
main
ing
bloc
ksin
the
set.
This
requ
ires
noex
tratag
arra
yac
cess
esas
alltag
sar
ealr
eady
read
aspa
rtof
then
orm
alca
cheo
pera
tion.
Anyb
itm
appe
dto
byth
eev
icted
orinv
alida
tedbl
ock
can
becle
ared
ifno
neof
the
rem
ainin
gbl
ocks
inth
ese
tmap
toth
esa
me
bit.
Whe
nus
ing
anin
clusiv
eL2
cach
e,as
inou
rexa
mpl
e,it
issu
fficie
ntto
chec
kon
lyth
eon
eL2
cach
ese
t.If
the
L2ca
che
isno
tinc
lusiv
e,th
enad
ditio
nalL
1acc
esse
smay
bene
cess
ary.
3.3A
Prac
tical
Impl
emen
tatio
nCo
ncep
tuall
y,TL
cons
istso
fan
N×
Pgr
idof
Bloo
mfil
ters,
wher
eN
isth
enu
mbe
rof
cach
ese
ts,an
dP
isth
enu
mbe
rof
cach
es,a
ssho
wnin
Figu
re1(
c).E
ach
Bloo
mfil
terco
ntain
skbi
t-ve
ctors.
Forc
larity
wewi
llre
fert
oth
esize
ofth
eseb
itve
ctors
asth
enum
bero
fbuc
kets
.In
prac
tice,
thes
efilte
rsca
nbe
com
bine
din
toaf
ewSR
AMar
rays
prov
ided
that
afew
cond
ition
sare
met:
1)all
thefi
lters
uset
hesa
mes
etof
kha
shfu
nctio
ns,a
nd2)
anen
tire
row
ofP
filter
sisa
cces
sed
inpa
ralle
l.Th
eser
equi
rem
ents
dono
tsig
nific
antly
affec
tthe
effec
tiven
esso
fthe
Bloo
mfil
ters.
Conc
eptu
ally,
each
row
cont
ains
PBl
oom
filter
s,ea
chwi
thk
hash
func
tions
.Fi
gure
2(a)
show
son
esu
chro
w.By
alway
sac
cess
ingt
hese
filter
sinp
arall
elus
ingt
hesa
meh
ashf
uncti
ons,
the
filter
scan
beco
mbi
ned
tofo
rmas
ingl
eloo
kup
table,
assh
own
inFi
gure
2(b)
,whe
reea
chro
wco
ntain
saP-
bits
harin
gve
ctor.
Asa
resu
lt,ea
chlo
okup
dire
ctly
prov
ides
aP
-bit
shar
ing
vecto
rby
AND-
ing
ksh
arin
gve
ctors.
Sinc
eeve
ryro
wof
Bloo
mfil
tersu
ses
thes
ameh
ash
func
tions
,wec
anfu
rther
com
bine
thel
ooku
ptab
lesfo
rall
sets
fore
achh
ash
func
tion,
assh
own
inFi
gure
3(a)
.Th
ese
look
uptab
lesca
nbe
thou
ght
ofas
a“s
eaof
shar
ing
vecto
rs”an
dcan
beor
gani
zedi
ntoa
nystr
uctu
rewi
thth
efol
lowin
gco
nditi
ons:
1)au
niqu
eset,
hash
func
tion
inde
x,an
dbl
ock
addr
ess
map
toas
ingl
euni
ques
harin
gvec
tor;
and2
)ksh
arin
gvec
tors
can
beac
cess
edin
para
llel,
wher
ek
isth
enu
mbe
rofh
ash
func
tions
.An
exam
plei
mpl
emen
tatio
nuse
sasin
gle-
porte
d,tw
o-di
men
siona
lSR
AMar
ray
fore
ach
hash
func
tion,
wher
eth
ese
tind
exse
lects
arow
,and
theh
ash
func
tion
selec
tsas
harin
gve
ctor,
assh
own
inFi
gure
3(b)
.As
apr
actic
alex
ampl
e,Se
ction
5sh
owst
hata
TLde
sign
with
four
hash
tables
and
64bu
ckets
inea
chfil
terpe
rform
swe
llfo
r
!"#$%
!"#$&
'$(
%)*+(
,--#$..
/0./% /0./1
,--#$..
/0./% /0./1
(a)
!"#$%
!"#$&
!"#$%&'(
)*$+,
&-./%0
*122*.''
()'(3 ()'(4
5.%
(b)
Figu
re2:
(a)A
naïve
impl
emen
tatio
nwith
onefi
lterp
erco
re.(
b)Co
mbi
ning
allof
thefi
lters
foron
erow
.
3.Th
ewrit
erwa
itsun
tilit
rece
ivesa
ckno
wled
gem
ents
from
allth
esh
arer
s.On
eof
thes
eac
know
ledge
men
tswi
llco
ntain
thed
ata.A
ltern
ative
ly,th
ewrit
erwi
llre
ceive
thed
atafro
mm
emor
y.
4.Th
ewrit
erno
tifies
thed
irecto
ryth
atth
etra
nsac
tion
isco
m-
plete
.
TLop
erate
side
ntica
llyex
cept
forw
heni
tsele
cts,i
nstep
2(b)
,apr
ovid
erth
atdo
esno
thav
ethe
data.
Inpr
actic
e,th
epro
babi
lity
ofth
esele
cteds
hare
rnot
havi
ngac
opyi
sver
ylow
;thu
s,th
ecom
mon
case
forT
Lpr
ocee
dsth
esam
east
heor
igin
alpr
otoc
ol.I
nth
eles
sco
mm
onca
se,w
hen
thep
rovid
erdo
esno
thav
eaco
py,i
tsen
dsa
NAck
toth
ewr
iteri
nstea
dof
anac
know
ledge
men
t.Th
islea
ves
thre
epos
sible
scen
ario
s:
1.No
copy
exist
s:Th
ewr
iter
will
wait
tore
ceive
allof
the
ackn
owled
gem
ents
and
the
NAck
,and
since
itha
sno
tre
ceive
dany
data,
itwill
send
areq
uest
toth
edire
ctory
which
then
asks
mem
ory
topr
ovid
ethe
data.
3
2.A
clean
copy
exist
s:Fo
rsim
plici
ty,th
isis
hand
ledid
en-
ticall
yto
first
case
.Al
lsha
rers
invali
date
their
copi
esan
dda
tais
retri
eved
from
mem
ory
inste
ad.
This
may
resu
ltin
ape
rform
ance
loss
ifit
happ
ens
frequ
ently
,but
Secti
on5
show
stha
tno
perfo
rman
celo
ssoc
curs
inpr
actic
e.
3.A
dirty
copy
exist
s:On
eoft
heot
hers
hare
rs,th
eown
er,w
illbe
inth
eMod
ified
orOw
neds
tate.
Theo
wner
cann
otdi
scar
dth
edirt
yda
tasin
ceit
does
notk
now
that
thep
rovi
derh
asa
copy
.Acc
ordi
ngly,
theo
wner
alwa
ysin
clude
sthe
data
inits
ackn
owled
gem
entt
oth
ewrit
er.In
this
case
,the
write
rwill
rece
iveth
epr
ovid
er’s
NAck
and
the
owne
r’sva
lidda
taan
dca
nth
usco
mpl
eteits
requ
est.
Ifth
epro
vide
rhas
acop
yas
well,
thew
riter
will
rece
ivetw
oid
entic
alco
pies
ofth
edata
.Se
ction
5.5de
mon
strate
stha
tthi
sisr
aree
noug
hth
atit
does
notr
esul
tina
nysig
nific
anti
ncre
asei
nban
dwid
thut
iliza
tion.
Inall
case
s,all
trans
actio
nsar
eser
ialize
dth
roug
hth
edire
ctory
,so
noad
ditio
nalr
acec
onsid
erati
onsa
rein
trodu
ced.
Sim
ilarly
,the
unde
rlyin
gde
adlo
ckan
dliv
elock
avoi
danc
em
echa
nism
sre
main
appl
icabl
e.
3 This
requ
esti
sser
ialize
dthr
ough
thed
irecto
ryto
avoi
drac
eswi
thev
icts.
3.2.
3Ev
ictio
nsan
dInv
alid
ates
Whe
na
bloc
kis
evict
edor
invali
dated
TLtri
esto
clear
the
appr
opria
tebi
tsin
the
corre
spon
ding
Bloo
mfil
terto
main
tain
accu
racy
.Ea
chBl
oom
filter
repr
esen
tsth
ebl
ocks
ina
singl
ese
t,m
akin
gits
impl
etod
eterm
inew
hich
bits
canb
eclea
red.
Thec
ache
simpl
yev
aluate
sall
theh
ash
func
tions
fora
llre
main
ing
bloc
ksin
the
set.
This
requ
ires
noex
tratag
arra
yac
cess
esas
alltag
sar
ealr
eady
read
aspa
rtof
then
orm
alca
cheo
pera
tion.
Anyb
itm
appe
dto
byth
eev
icted
orinv
alida
tedbl
ock
can
becle
ared
ifno
neof
the
rem
ainin
gbl
ocks
inth
ese
tmap
toth
esa
me
bit.
Whe
nus
ing
anin
clusiv
eL2
cach
e,as
inou
rexa
mpl
e,it
issu
fficie
ntto
chec
kon
lyth
eon
eL2
cach
ese
t.If
the
L2ca
che
isno
tinc
lusiv
e,th
enad
ditio
nalL
1acc
esse
smay
bene
cess
ary.
3.3A
Prac
tical
Impl
emen
tatio
nCo
ncep
tuall
y,TL
cons
istso
fan
N×
Pgr
idof
Bloo
mfil
ters,
wher
eN
isth
enu
mbe
rof
cach
ese
ts,an
dP
isth
enu
mbe
rof
cach
es,a
ssho
wnin
Figu
re1(
c).E
ach
Bloo
mfil
terco
ntain
skbi
t-ve
ctors.
Forc
larity
wewi
llre
fert
oth
esize
ofth
eseb
itve
ctors
asth
enum
bero
fbuc
kets
.In
prac
tice,
thes
efilte
rsca
nbe
com
bine
din
toaf
ewSR
AMar
rays
prov
ided
that
afew
cond
ition
sare
met:
1)all
thefi
lters
uset
hesa
mes
etof
kha
shfu
nctio
ns,a
nd2)
anen
tire
row
ofP
filter
sisa
cces
sed
inpa
ralle
l.Th
eser
equi
rem
ents
dono
tsig
nific
antly
affec
tthe
effec
tiven
esso
fthe
Bloo
mfil
ters.
Conc
eptu
ally,
each
row
cont
ains
PBl
oom
filter
s,ea
chwi
thk
hash
func
tions
.Fi
gure
2(a)
show
son
esu
chro
w.By
alway
sac
cess
ingt
hese
filter
sinp
arall
elus
ingt
hesa
meh
ashf
uncti
ons,
the
filter
scan
beco
mbi
ned
tofo
rmas
ingl
eloo
kup
table,
assh
own
inFi
gure
2(b)
,whe
reea
chro
wco
ntain
saP-
bits
harin
gve
ctor.
Asa
resu
lt,ea
chlo
okup
dire
ctly
prov
ides
aP
-bit
shar
ing
vecto
rby
AND-
ing
ksh
arin
gve
ctors.
Sinc
eeve
ryro
wof
Bloo
mfil
tersu
ses
thes
ameh
ash
func
tions
,wec
anfu
rther
com
bine
thel
ooku
ptab
lesfo
rall
sets
fore
achh
ash
func
tion,
assh
own
inFi
gure
3(a)
.Th
ese
look
uptab
lesca
nbe
thou
ght
ofas
a“s
eaof
shar
ing
vecto
rs”an
dcan
beor
gani
zedi
ntoa
nystr
uctu
rewi
thth
efol
lowin
gco
nditi
ons:
1)au
niqu
eset,
hash
func
tion
inde
x,an
dbl
ock
addr
ess
map
toas
ingl
euni
ques
harin
gvec
tor;
and2
)ksh
arin
gvec
tors
can
beac
cess
edin
para
llel,
wher
ek
isth
enu
mbe
rofh
ash
func
tions
.An
exam
plei
mpl
emen
tatio
nuse
sasin
gle-
porte
d,tw
o-di
men
siona
lSR
AMar
ray
fore
ach
hash
func
tion,
wher
eth
ese
tind
exse
lects
arow
,and
theh
ash
func
tion
selec
tsas
harin
gve
ctor,
assh
own
inFi
gure
3(b)
.As
apr
actic
alex
ampl
e,Se
ction
5sh
owst
hata
TLde
sign
with
four
hash
tables
and
64bu
ckets
inea
chfil
terpe
rform
swe
llfo
r
!"#$%
!"#$&
'$(
%)*+(
,--#$..
/0./% /0./1
,--#$..
/0./% /0./1
(a)
!"#$%
!"#$&
!"#$%&'(
)*$+,
&-./%0
*122*.''
()'(3 ()'(4
5.%
(b)
Figu
re2:
(a)A
naïve
impl
emen
tatio
nwith
onefi
lterp
erco
re.(
b)Co
mbi
ning
allof
thefi
lters
foron
erow
.
3.Th
ewrit
erwa
itsun
tilit
rece
ivesa
ckno
wled
gem
ents
from
allth
esh
arer
s.On
eof
thes
eac
know
ledge
men
tswi
llco
ntain
thed
ata.A
ltern
ative
ly,th
ewrit
erwi
llre
ceive
thed
atafro
mm
emor
y.
4.Th
ewrit
erno
tifies
thed
irecto
ryth
atth
etra
nsac
tion
isco
m-
plete
.
TLop
erate
side
ntica
llyex
cept
forw
heni
tsele
cts,i
nstep
2(b)
,apr
ovid
erth
atdo
esno
thav
ethe
data.
Inpr
actic
e,th
epro
babi
lity
ofth
esele
cteds
hare
rnot
havi
ngac
opyi
sver
ylow
;thu
s,th
ecom
mon
case
forT
Lpr
ocee
dsth
esam
east
heor
igin
alpr
otoc
ol.I
nth
eles
sco
mm
onca
se,w
hen
thep
rovid
erdo
esno
thav
eaco
py,i
tsen
dsa
NAck
toth
ewr
iteri
nstea
dof
anac
know
ledge
men
t.Th
islea
ves
thre
epos
sible
scen
ario
s:
1.No
copy
exist
s:Th
ewr
iter
will
wait
tore
ceive
allof
the
ackn
owled
gem
ents
and
the
NAck
,and
since
itha
sno
tre
ceive
dany
data,
itwill
send
areq
uest
toth
edire
ctory
which
then
asks
mem
ory
topr
ovid
ethe
data.
3
2.A
clean
copy
exist
s:Fo
rsim
plici
ty,th
isis
hand
ledid
en-
ticall
yto
first
case
.Al
lsha
rers
invali
date
their
copi
esan
dda
tais
retri
eved
from
mem
ory
inste
ad.
This
may
resu
ltin
ape
rform
ance
loss
ifit
happ
ens
frequ
ently
,but
Secti
on5
show
stha
tno
perfo
rman
celo
ssoc
curs
inpr
actic
e.
3.A
dirty
copy
exist
s:On
eoft
heot
hers
hare
rs,th
eown
er,w
illbe
inth
eMod
ified
orOw
neds
tate.
Theo
wner
cann
otdi
scar
dth
edirt
yda
tasin
ceit
does
notk
now
that
thep
rovi
derh
asa
copy
.Acc
ordi
ngly,
theo
wner
alwa
ysin
clude
sthe
data
inits
ackn
owled
gem
entt
oth
ewrit
er.In
this
case
,the
write
rwill
rece
iveth
epr
ovid
er’s
NAck
and
the
owne
r’sva
lidda
taan
dca
nth
usco
mpl
eteits
requ
est.
Ifth
epro
vide
rhas
acop
yas
well,
thew
riter
will
rece
ivetw
oid
entic
alco
pies
ofth
edata
.Se
ction
5.5de
mon
strate
stha
tthi
sisr
aree
noug
hth
atit
does
notr
esul
tina
nysig
nific
anti
ncre
asei
nban
dwid
thut
iliza
tion.
Inall
case
s,all
trans
actio
nsar
eser
ialize
dth
roug
hth
edire
ctory
,so
noad
ditio
nalr
acec
onsid
erati
onsa
rein
trodu
ced.
Sim
ilarly
,the
unde
rlyin
gde
adlo
ckan
dliv
elock
avoi
danc
em
echa
nism
sre
main
appl
icabl
e.
3 This
requ
esti
sser
ialize
dthr
ough
thed
irecto
ryto
avoi
drac
eswi
thev
icts.
3.2.
3Ev
ictio
nsan
dInv
alid
ates
Whe
na
bloc
kis
evict
edor
invali
dated
TLtri
esto
clear
the
appr
opria
tebi
tsin
the
corre
spon
ding
Bloo
mfil
terto
main
tain
accu
racy
.Ea
chBl
oom
filter
repr
esen
tsth
ebl
ocks
ina
singl
ese
t,m
akin
gits
impl
etod
eterm
inew
hich
bits
canb
eclea
red.
Thec
ache
simpl
yev
aluate
sall
theh
ash
func
tions
fora
llre
main
ing
bloc
ksin
the
set.
This
requ
ires
noex
tratag
arra
yac
cess
esas
alltag
sar
ealr
eady
read
aspa
rtof
then
3. The writer waits until it receives acknowledgements from all the sharers. One of these acknowledgements will contain the data. Alternatively, the writer will receive the data from memory.
4. The writer notifies the directory that the transaction is complete.
TL operates identically except for when it selects, in step 2(b), a provider that does not have the data. In practice, the probability of the selected sharer not having a copy is very low; thus, the common case for TL proceeds the same as the original protocol. In the less common case, when the provider does not have a copy, it sends a NAck to the writer instead of an acknowledgement. This leaves three possible scenarios:
1. No copy exists: The writer will wait to receive all of the acknowledgements and the NAck, and since it has not received any data, it will send a request to the directory, which then asks memory to provide the data.³
2. A clean copy exists: For simplicity, this is handled identically to the first case. All sharers invalidate their copies and data is retrieved from memory instead. This may result in a performance loss if it happens frequently, but Section 5 shows that no performance loss occurs in practice.
3. A dirty copy exists: One of the other sharers, the owner, will be in the Modified or Owned state. The owner cannot discard the dirty data since it does not know that the provider has a copy. Accordingly, the owner always includes the data in its acknowledgement to the writer. In this case, the writer will receive the provider's NAck and the owner's valid data and can thus complete its request. If the provider has a copy as well, the writer will receive two identical copies of the data. Section 5.5 demonstrates that this is rare enough that it does not result in any significant increase in bandwidth utilization.
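The writer-side resolution of these three scenarios can be sketched as below. This is a minimal illustration, not the paper's implementation; the message kinds (`ack`, `ack_data`, `nack`) are assumed names, and the key point is that the writer decides based solely on whether any response carried data.

```python
# Sketch (assumed message names): how a TL writer resolves its request once
# all sharer responses have arrived. The provider may NAck; an owner in
# Modified/Owned always attaches the dirty data to its acknowledgement.

def resolve_write(responses):
    """responses: list of (kind, payload), kind in {'ack', 'ack_data', 'nack'}.
    Returns ('complete', data) or ('fetch_memory', None)."""
    data = None
    for kind, payload in responses:
        if kind == 'ack_data':          # scenario 3: a dirty copy existed
            data = payload
    if data is not None:
        return ('complete', data)
    # scenarios 1 and 2: no data received -> ask the directory to fetch memory
    return ('fetch_memory', None)

# Provider NAcks but the owner supplied valid data: request completes.
print(resolve_write([('nack', None), ('ack', None), ('ack_data', 0xBEEF)]))
# Only acks and a NAck: fall back to memory through the directory.
print(resolve_write([('nack', None), ('ack', None)]))
```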
In all cases, all transactions are serialized through the directory, so no additional race considerations are introduced. Similarly, the underlying deadlock and livelock avoidance mechanisms remain applicable.
³ This request is serialized through the directory to avoid races with evicts.
3.2.3 Evictions and Invalidates
When a block is evicted or invalidated, TL tries to clear the appropriate bits in the corresponding Bloom filter to maintain accuracy. Each Bloom filter represents the blocks in a single set, making it simple to determine which bits can be cleared. The cache simply evaluates all the hash functions for all remaining blocks in the set. This requires no extra tag array accesses, as all tags are already read as part of the normal cache operation.
Any bit mapped to by the evicted or invalidated block can be cleared if none of the remaining blocks in the set map to the same bit. When using an inclusive L2 cache, as in our example, it is sufficient to check only the one L2 cache set. If the L2 cache is not inclusive, then additional L1 accesses may be necessary.
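The clearing rule above can be sketched as follows. The two hash functions are illustrative stand-ins (the paper's hash functions are not specified here); the logic shows why no extra tag accesses are needed: only the surviving blocks' hash images matter.

```python
# Sketch (assumed hash functions): clearing Bloom-filter bits on eviction.
# A bit set by the evicted block is cleared only if no remaining block in
# the same set maps to that bit under any of the k hash functions.

K_HASHES = [lambda a: a % 64, lambda a: (a // 64) % 64]  # illustrative k=2

def bits_for(addr):
    """(hash index, bucket) pairs this block address sets."""
    return {(i, h(addr)) for i, h in enumerate(K_HASHES)}

def clear_on_evict(filter_bits, evicted, remaining):
    still_needed = set()
    for blk in remaining:               # re-evaluate hashes for remaining blocks
        still_needed |= bits_for(blk)
    for bit in bits_for(evicted):
        if bit not in still_needed:     # safe to clear: no survivor maps here
            filter_bits.discard(bit)
    return filter_bits

# Blocks 3 and 67 collide in hash 0 (both map to bucket 3), so that bit
# survives the eviction of block 3, while its unshared bit is cleared.
print(clear_on_evict(bits_for(3) | bits_for(67), evicted=3, remaining=[67]))
```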
3.3 A Practical Implementation
Conceptually, TL consists of an N×P grid of Bloom filters, where N is the number of cache sets and P is the number of caches, as shown in Figure 1(c). Each Bloom filter contains k bit-vectors. For clarity, we will refer to the size of these bit vectors as the number of buckets. In practice, these filters can be combined into a few SRAM arrays provided that a few conditions are met: 1) all the filters use the same set of k hash functions, and 2) an entire row of P filters is accessed in parallel. These requirements do not significantly affect the effectiveness of the Bloom filters. Conceptually, each row contains P Bloom filters, each with k hash functions. Figure 2(a) shows one such row. By always accessing these filters in parallel using the same hash functions, the filters can be combined to form a single lookup table, as shown in Figure 2(b), where each row contains a P-bit sharing vector. As a result, each lookup directly provides a P-bit sharing vector by AND-ing k sharing vectors. Since every row of Bloom filters uses the same hash functions, we can further combine the lookup tables for all sets for each hash function, as shown in Figure 3(a). These lookup tables can be thought of as a “sea of sharing vectors” and can be organized into any structure with the following conditions: 1) a unique set, hash function index, and block address map to a single unique sharing vector; and 2) k sharing vectors can be accessed in parallel, where k is the number of hash functions. An example implementation uses a single-ported, two-dimensional SRAM array for each hash function, where the set index selects a row, and the hash function selects a sharing vector, as shown in Figure 3(b). As a practical example, Section 5 shows that a TL design with four hash tables and 64 buckets in each filter performs well for
owst
hata
TLde
sign
with
four
hash
tables
and
64bu
ckets
inea
chfil
terpe
rform
swe
llfo
r
!"#$%
!"#$&
'$(
%)*+(
,--#$..
/0./% /0./1
,--#$..
/0./% /0./1
(a)
!"#$%
!"#$&
!"#$%&'(
)*$+,
&-./%0
*122*.''
()'(3 ()'(4
5.%
(b)
Figu
re2:
(a)A
naïve
impl
emen
tatio
nwith
onefi
lterp
erco
re.(
b)Co
mbi
ning
allof
thefi
lters
foron
erow
.
Figure 4: Amoeba-Cache Architecture. Left: cache layout. Size: N_sets × N_words × 8 bytes. T? (tag map): 1 bit/word — is the word a tag? Yes (1). V? (valid map): 1 bit/word — is the word allocated? Yes (1). Total overhead: 3%. Right: address-space layout. Region: aligned regions of size RMAX words (× 8 bytes); an Amoeba-Block range cannot exceed region boundaries. Cache indexing: the lower log2(RMAX) bits are masked and the set index is derived from the remaining bits; all words (and all Amoeba-Blocks) within a region index to the same set.
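The indexing rule in the Figure 4 caption can be sketched as follows. RMAX and the set count are illustrative assumptions; the invariant shown is the one the caption states: every word of an aligned RMAX-word region indexes the same set.

```python
# Sketch (assumed parameters): Amoeba-Cache set indexing per Figure 4.
# Masking the lower log2(RMAX) word-address bits makes all words (and all
# Amoeba-Blocks) of one aligned region index to the same set.

RMAX = 8        # region size in words (illustrative)
N_SETS = 64

def set_index(word_addr):
    region = word_addr >> (RMAX.bit_length() - 1)   # drop log2(RMAX) bits
    return region % N_SETS

# All 8 words of the region starting at word 16 map to one set:
print({set_index(w) for w in range(16, 24)})
```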
3.
The
wri
ter
wait
sunti
lit
receiv
es
acknow
ledgem
ents
from
all
the
share
rs.
One
of
these
acknow
ledgem
ents
wil
lconta
inth
edata
.A
ltern
ati
vely
,th
ew
rite
rw
ill
receiv
eth
edata
from
mem
ory
.
4.
The
wri
ter
noti
fies
the
dir
ecto
ryth
at
the
transacti
on
iscom
-ple
te.
TL
opera
tes
identi
call
yexcept
for
when
itsele
cts
,in
ste
p2(b
),a
pro
vid
er
that
does
not
have
the
data
.In
pra
cti
ce,
the
pro
babil
ity
of
the
sele
cte
dshare
rnot
havin
ga
copy
isvery
low
;th
us,th
ecom
mon
case
for
TL
pro
ceeds
the
sam
eas
the
ori
gin
al
pro
tocol.
Inth
ele
ss
com
mon
case,
when
the
pro
vid
er
does
not
have
acopy,
itsends
aN
Ack
toth
ew
rite
rin
ste
ad
of
an
acknow
ledgem
ent.
This
leaves
thre
epossib
lescenari
os:
1.
No
cop
yexis
ts:
The
wri
ter
wil
lw
ait
tore
ceiv
eall
of
the
acknow
ledgem
ents
and
the
NA
ck,
and
sin
ce
ithas
not
receiv
ed
any
data
,it
wil
lsend
are
questto
the
dir
ecto
ryw
hic
hth
en
asks
mem
ory
topro
vid
eth
edata
.3
2.
Acle
an
cop
yexis
ts:
For
sim
pli
cit
y,
this
ishandle
did
en-
ticall
yto
firs
tcase.
All
share
rsin
vali
date
their
copie
sand
data
isre
trie
ved
from
mem
ory
inste
ad.
This
may
result
ina
perf
orm
ance
loss
ifit
happens
frequentl
y,
but
Secti
on
5show
sth
at
no
perf
orm
ance
loss
occurs
inpra
cti
ce.
3.
Ad
irty
cop
yexis
ts:
One
of
the
oth
er
share
rs,th
eow
ner,w
ill
be
inth
eM
odifi
ed
or
Ow
ned
sta
te.
The
ow
ner
cannot
dis
card
the
dir
tydata
sin
ce
itdoes
not
know
that
the
pro
vid
er
has
acopy.
Accord
ingly
,th
eow
ner
alw
ays
inclu
des
the
data
init
sacknow
ledgem
ent
toth
ew
rite
r.In
this
case,
the
wri
ter
wil
lre
ceiv
eth
epro
vid
er’
sN
Ack
and
the
ow
ner’
svali
ddata
and
can
thus
com
ple
teit
sre
quest.
Ifth
epro
vid
er
has
acopy
as
well
,th
ew
rite
rw
ill
receiv
etw
oid
enti
cal
copie
sof
the
data
.S
ecti
on
5.5
dem
onstr
ate
sth
at
this
isra
reenough
that
itdoes
not
result
inany
sig
nifi
cant
incre
ase
inbandw
idth
uti
lizati
on.
Inall
cases,
all
transacti
ons
are
seri
ali
zed
thro
ugh
the
dir
ecto
ry,
so
no
addit
ional
race
consid
era
tions
are
intr
oduced.
Sim
ilarl
y,
the
underl
yin
gdeadlo
ck
and
livelo
ck
avoid
ance
mechanis
ms
rem
ain
appli
cable
.
3T
his
request
isseri
ali
zed
thro
ugh
the
dir
ecto
ryto
avoid
races
wit
hevic
ts.
3.2
.3E
vic
tio
ns
an
dIn
va
lid
ate
sW
hen
ablo
ck
isevic
ted
or
invali
date
dT
Ltr
ies
tocle
ar
the
appro
pri
ate
bit
sin
the
corr
espondin
gB
loom
filt
er
tom
ain
tain
accura
cy.
Each
Blo
om
filt
er
repre
sents
the
blo
cks
ina
sin
gle
set,
makin
git
sim
ple
todete
rmin
ew
hic
hbit
scan
be
cle
are
d.
The
cache
sim
ply
evalu
ate
sall
the
hash
functi
ons
for
all
rem
ain
ing
blo
cks
inth
eset.
This
requir
es
no
extr
ata
garr
ay
accesses
as
all
tags
are
alr
eady
read
as
part
of
the
norm
al
cache
opera
tion.
Any
bit
mapped
toby
the
evic
ted
or
invali
date
dblo
ck
can
be
cle
are
dif
none
of
the
rem
ain
ing
blo
cks
inth
eset
map
toth
esam
ebit
.W
hen
usin
gan
inclu
siv
eL
2cache,
as
inour
exam
ple
,it
issuffi
cie
nt
tocheck
only
the
one
L2
cache
set.
Ifth
eL
2cache
isnot
inclu
siv
e,
then
addit
ional
L1
accesses
may
be
necessary
.
3.3
AP
ra
cti
ca
lIm
ple
men
tati
on
Conceptually, TL consists of an N×P grid of Bloom filters, where N is the number of cache sets, and P is the number of caches, as shown in Figure 1(c). Each Bloom filter contains k bit-vectors. For clarity, we will refer to the size of these bit vectors as the number of buckets. In practice, these filters can be combined into a few SRAM arrays provided that a few conditions are met: 1) all the filters use the same set of k hash functions, and 2) an entire row of P filters is accessed in parallel. These requirements do not significantly affect the effectiveness of the Bloom filters. Conceptually, each row contains P Bloom filters, each with k hash functions. Figure 2(a) shows one such row. By always accessing these filters in parallel using the same hash functions, the filters can be combined to form a single lookup table, as shown in Figure 2(b), where each row contains a P-bit sharing vector. As a result, each lookup directly provides a P-bit sharing vector by AND-ing k sharing vectors. Since every row of Bloom filters uses the same hash functions, we can further combine the lookup tables for all sets for each hash function, as shown in Figure 3(a). These lookup tables can be thought of as a "sea of sharing vectors" and can be organized into any structure with the following conditions: 1) a unique set, hash function index, and block address map to a single unique sharing vector; and 2) k sharing vectors can be accessed in parallel, where k is the number of hash functions. An example implementation uses a single-ported, two-dimensional SRAM array for each hash function, where the set index selects a row, and the hash function selects a sharing vector, as shown in Figure 3(b). As a practical example, Section 5 shows that a TL design with four hash tables and 64 buckets in each filter performs well for
Figure 2: (a) A naïve implementation with one filter per core. (b) Combining all of the filters for one row.
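The combined lookup described above can be sketched in a few lines of Python. This is a minimal behavioral model, not the SRAM design; the hash functions and the values of k, P, the bucket count, and the set count are illustrative assumptions.

```python
# Sketch of the combined TL lookup: one table per hash function, each row
# holding a P-bit sharing vector; a lookup ANDs the k vectors together.
K = 4          # number of hash functions
BUCKETS = 64   # buckets per filter
P = 8          # number of caches (bits per sharing vector)
N_SETS = 128   # cache sets

# tables[h][set_index][bucket] -> P-bit sharing vector
tables = [[[0] * BUCKETS for _ in range(N_SETS)] for _ in range(K)]

def hashes(block_addr):
    # k simple illustrative hash functions over the block address
    return [(block_addr >> (6 + 3 * h)) % BUCKETS for h in range(K)]

def set_index(block_addr):
    return (block_addr >> 6) % N_SETS

def insert(block_addr, core):
    s = set_index(block_addr)
    for h, b in enumerate(hashes(block_addr)):
        tables[h][s][b] |= (1 << core)   # mark core in each filter

def lookup(block_addr):
    # AND the k sharing vectors to get the (possibly over-approximate) sharer set
    s = set_index(block_addr)
    result = (1 << P) - 1
    for h, b in enumerate(hashes(block_addr)):
        result &= tables[h][s][b]
    return result

insert(0x1040, core=2)
assert lookup(0x1040) & (1 << 2)   # core 2 reported as a sharer (no false negatives)
```

As with any Bloom filter, the AND of the k vectors can over-report sharers (false positives) but never misses a true sharer.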
The tag consists of a Region tag (>50 bits) and Start/End fields (≤3 bits each). Lookup steps: (1) Mask the lower log2(RMAX) bits and index into the data array. (2) Identify tags using the T? bitmap and initiate tag checks with the ∈ operator (Region tag == comparator and ≤ range comparator). (3) Word selection within the Amoeba-Block.
Figure 5: Lookup Logic
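The three lookup steps can be sketched with a small behavioral model. This sketch assumes 8-byte words and represents each Amoeba-Block as a dictionary with `region`, `start`, `end`, and `words` fields standing in for the tag bits; the set layout is illustrative.

```python
# Simplified model of the Amoeba-Cache lookup steps (Figure 5).
RMAX_WORDS = 8   # maximum block granularity in words
WORD_BYTES = 8

def lookup(sets, addr):
    region = addr // (RMAX_WORDS * WORD_BYTES)
    word = (addr // WORD_BYTES) % RMAX_WORDS
    # Step 1: mask the lower log2(RMAX) bits and index into the data array
    s = sets[region % len(sets)]
    # Step 2: tag checks -- region tag match (==) and range comparison (<=)
    for blk in s:
        if blk["region"] == region and blk["start"] <= word <= blk["end"]:
            # Step 3: word selection within the Amoeba-Block
            return blk["words"][word - blk["start"]]
    return None  # miss

sets = [[] for _ in range(4)]
sets[1].append({"region": 1, "start": 2, "end": 5, "words": [20, 30, 40, 50]})
# address of word 3 in region 1: (1*8 + 3) * 8 = 88
assert lookup(sets, 88) == 30
assert lookup(sets, 8) is None   # word 1 is not covered by any block -> miss
```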
Tag = Amoeba-Block in set). (2) The data blocks that overlap with the miss range are evicted and moved one at a time to the MSHR entry. (3) Space is then allocated for the new block, i.e., it is treated like a new insertion. (4) A miss request is issued for the entire block (START:0 — END:7) even if only some words (e.g., 0, 4, and 7) may be needed. This ensures request processing is simple and only a single refill request is sent. (5) Finally, the incoming data block is patched into the MSHR; only the words not obtained from the L1 are copied (since the lower level could be stale).
(1) Identify blocks overlapping with the new block. (2) Evict overlapping blocks to the MSHR. (3) Allocate space for the new block (treat it like a new insertion). (4) Issue a refill request to the lower level for the entire block. (5) Patch only newer words, as lower-level data could be stale.
Figure 6: Partial Miss Handling. Upper: identify relevant sub-blocks; useful for other cache controller events as well, e.g., recalls. Lower: refill of words and insertion.
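The five partial-miss steps can be sketched as follows, using the ranges from the figure (blocks 1–3 and 5–6 resident, miss range 0–7). The MSHR and block structures are plain Python stand-ins, not the controller hardware.

```python
# Sketch of partial-miss handling (Figure 6): evict overlapping sub-blocks
# to the MSHR, issue one refill for the full requested range, then patch in
# only the words that were not already present locally (lower level may be
# stale for those words).
def handle_partial_miss(set_blocks, start, end, refill):
    # Steps 1+2: identify and evict blocks overlapping [start, end] to the MSHR
    mshr = {}
    for blk in [b for b in set_blocks if b["start"] <= end and b["end"] >= start]:
        for w in range(blk["start"], blk["end"] + 1):
            mshr[w] = blk["words"][w - blk["start"]]
        set_blocks.remove(blk)
    # Steps 3+4: allocate the new block; a single refill covers the whole range
    # Step 5: patch -- keep local (newer) words, take the rest from the refill
    words = [mshr.get(w, refill[w - start]) for w in range(start, end + 1)]
    new_blk = {"start": start, "end": end, "words": words}
    set_blocks.append(new_blk)
    return new_blk

blocks = [{"start": 1, "end": 3, "words": [11, 12, 13]},
          {"start": 5, "end": 6, "words": [15, 16]}]
refill = list(range(100, 108))            # lower-level data for words 0..7
blk = handle_partial_miss(blocks, 0, 7, refill)
# words 1-3 and 5-6 come from the evicted local blocks, the rest from the refill
assert blk["words"] == [100, 11, 12, 13, 104, 15, 16, 107]
```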
4 Hardware Complexity

We analyze the complexity of Amoeba-Cache along the following directions: we quantify the additions needed to the cache controller; we analyze the latency, area, and energy penalty; and finally, we study the challenges specifically introduced by large caches.

4.1 Cache Controller
The variable-granularity Amoeba-Blocks need specific consideration in the cache controller. We focus on the L1 controller here, and in particular on partial misses. The cache controller manages operations at the aligned RMAX granularity. The controller permits only one in-flight cache operation per RMAX region. In-flight cache operations ensure no address overlap with stable Amoeba-Blocks in order to eliminate complex race conditions. Figure 7 shows the L1 cache controller state machine. We add two states to the default protocol, IV_B and IV_C, to handle partial misses. IV_B is a blocking state that blocks other cache operations to the RMAX region until all Amoeba-Blocks relevant to a partial miss are evicted (e.g., the 1–3 and 5–6 blocks in Figure 6). IV_C indicates partial miss completion. This enables the controller to treat the access as a full miss and issue the refill request. The other stable states (I, V, D) and transient states (IV_Data and ID_Data) are present in a conventional protocol as well. Partial_Miss triggers the clean-up operations ((1) and (2) in Figure 6). Local_L1_Evict is a looping event that keeps retriggering for each Amoeba-Block involved in the partial miss; Last_L1_Evict is triggered when the last Amoeba-Block involved in the partial miss is evicted to the MSHR. A key difference
Cache controller states:
  NP       Amoeba-Block not present in the cache.
  V        All words corresponding to the Amoeba-Block present and valid (read-only).
  D        Valid, and at least one word in the Amoeba-Block is dirty (read-write).
  IV_B     Partial miss being processed (blocking state).
  IV_Data  Load miss; waiting for data from the L2.
  ID_Data  Store miss; waiting for data. Set dirty bit.
  IV_C     Partial miss cleanup from cache completed (treat as a full miss).
Amoeba-specific cache events:
  Partial_Miss    Process partial miss.
  Local_L1_Evict  Remove overlapping Amoeba-Block to the MSHR.
  Last_L1_Evict   Last Amoeba-Block moved to the MSHR. Convert to a full miss and process the load or store.
Bold and broken lines: Amoeba-Cache additions.
Figure 7: Amoeba Cache Controller (L1 level).
between the L1 and lower-level protocols is that the Load/Store event in the lower-level protocol may need to access data from multiple Amoeba-Blocks. In such cases, similar to the partial-miss event, we read out each block independently before supplying the data (more details in § 5.2).
4.2 Area, Latency, and Energy Overhead
The extra metadata required by Amoeba-Cache are the T? (1 tag bit per word) and V? (1 valid bit per word) bitmaps. Table 3 shows the quantitative overhead compared to the data storage. Both the T? and V? bitmap arrays are directly proportional to the size of the cache and impose a constant storage overhead (3% in total). The T? bitmap is read in parallel with the data array and does not affect the critical path; T? adds 2%—3.5% (depending on cache size) to the overall cache access energy. V? is referenced only on misses, when inserting a new block.
Table 3: Amoeba-Cache Hardware Complexity.

Cache configuration          64K (256B/set)   1MB (512B/set)   4MB (1024B/set)
Data RAM parameters
  Delay                      0.36ns           2ns              2.5ns
  Energy                     100pJ            230pJ            280pJ
Amoeba-Cache components (CACTI model)
  T?/V? map                  1KB              16KB             64KB
  Latency                    0.019ns (5%)     0.12ns (6%)      0.2ns (6%)
  Energy                     2pJ (2%)         8pJ (3.4%)       10pJ (3.5%)
  LRU                        1/8KB            2KB              8KB
Lookup overhead (VHDL model)
  Area                       0.7%             0.1%
  Latency                    0.02ns           0.035ns          0.04ns

% indicates overhead compared to the data array of the cache. The 64K cache operates in Fast mode; the 1MB and 4MB caches operate in Normal mode. We use 32nm ITRS HP transistors for 64K and 32nm ITRS LOP transistors for 1MB and 4MB.
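The constant overhead follows directly from the bitmap sizes: one T? bit plus one V? bit per 64-bit word is 2/64 ≈ 3% of data storage. A quick check against the T?/V? sizes reported above (the table lists the size of one bitmap):

```python
# Each of T? and V? costs 1 bit per 8-byte (64-bit) word of data storage.
WORD_BYTES = 8

def bitmap_bytes(cache_bytes):
    words = cache_bytes // WORD_BYTES
    return words // 8              # one bit per word, for one bitmap

assert bitmap_bytes(64 * 1024) == 1 * 1024     # 1KB per map for the 64K cache
assert bitmap_bytes(1024 * 1024) == 16 * 1024  # 16KB per map for the 1MB cache
# both maps together: 2 bits per 64 data bits, about 3.1% constant overhead
overhead = 2 * bitmap_bytes(64 * 1024) / (64 * 1024)
assert round(overhead * 100, 1) == 3.1
```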
We synthesized¹ the cache lookup logic using Synopsys and quantify the area, latency, and energy penalty. Amoeba-Cache is compatible with Fast and Normal cache access modes [28, -access-mode config], both of which read the entire set from the data array in parallel with the way selection to achieve lower latency. Fast mode transfers the entire set to the edge of the H-tree, while Normal mode only transmits the selected way over the H-tree. For synthesis, we used the Synopsys Design Compiler (Version Z-2007.03-SP5).
Figure 5 shows Amoeba-Cache's lookup hardware on the critical path; we compare it against a fixed-granularity cache's lookup logic (mainly the comparators). The area overhead of Amoeba-Cache includes registering an entire line that has been read out, the tag operation logic, and the word selector. The components on the critical path once the data is read out are the 2-way multiplexers, the ∈ comparators, and the priority encoder that selects the word; the T? bitmap is accessed in parallel and is off the critical path. Amoeba-Cache is made feasible under today's wire-limited technology, where cache latency and energy are dominated by the bit/word lines, decoder, and H-tree [28]. Amoeba-Cache's comparators, which operate on the entire cache set, are 6× the area of a fixed cache's comparators. Note that the data array occupies 99% of the overall cache area. The critical path is dominated by the wide word selector, since the comparators all operate in parallel. The lookup logic adds 60% to the conventional cache's comparator time. The overall critical path is dominated by the data array access; Amoeba-Cache's lookup circuit adds 0.02ns and ≈1pJ to the access latency and energy of a 64K cache, and 0.035ns and ≈2pJ to those of a 1MB cache. Finally, Amoeba-Cache amortizes the energy penalty of the peripheral components (H-tree, wordline, and decoder) over a single RAM.
Amoeba-Cache's overhead needs careful consideration when implemented at the L1 cache level. We have two options for handling the latency overhead: a) if the L1 cache is the critical stage in the pipeline, we can throttle the CPU clock by the latency overhead to ensure that the additional logic fits within the pipeline stage. This ensures that the number of pipeline stages for a memory access does not change with respect to a conventional cache, although all instructions bear the overhead of the reduced CPU clock. b) we can add an extra pipeline stage to the L1 hit path, adding a 1-cycle overhead to all memory accesses but ensuring no change in CPU frequency. We quantify the performance impact of both approaches in Section 6.

4.3 Tag-only Operations
Conventional caches support tag-only operations to reduce data port contention. While the Amoeba-Cache merges tags and data, like many commercial processors it decouples the replacement metadata and valid bits from the tags, accessing the tags only on cache lookup. Lookups can be either CPU side or network side (coherence invalidation and writeback/forwarding). CPU-side lookups and writebacks (≈95% of cache operations) both need data, and hence Amoeba-Cache in the common case does not introduce extra overhead. Amoeba-Cache does read out the entire data array, unlike serial-mode caches (we discuss this issue in the next section). Invalidation checks and snoops can be more energy-expensive with Amoeba-Cache than with a conventional cache. Fortunately, coherence snoops are not common in many applications (e.g., 1/100 cache operations in SpecJBB), as a coherence directory and an inclusive LLC filter them out.
¹We do not have access to an industry-grade 32nm library, so we synthesized at a higher 180nm node size and scaled the results to 32nm (latency and energy scaled proportionally to Vdd (taken from [36]) and Vdd², respectively).
4.4 Tradeoff with Large Caches
Latency and energy of Normal mode relative to Serial mode for 2M, 4M, and 8M caches with 4—32 ways. Baseline: Serial; values ≤1 mean Normal is better. 32nm, ITRS LOP.
Figure 8: Serial vs. Normal mode cache.
Large caches with many words per set (≡ a highly associative conventional cache) need careful consideration. Typically, highly associative caches tend to serialize tag and data access, with only the relevant cache block read out on a hit and no data access on a miss. We first analyze the tradeoff between reading the entire set (Normal mode), which is compatible with Amoeba-Cache, and reading only the relevant block (Serial mode). We vary the cache size from 2M—8M and the associativity from 4 (256B/set) — 32 (2048B/set). Under current technology constraints (Figure 8), only at very high associativity does Serial mode demonstrate a notable energy benefit. Large caches are dominated by H-tree energy consumption, and reading out the entire set at each sub-bank imposes an energy penalty only when bitlines and wordlines dominate (2KB+ words/set).
Table 4: % of direct accesses with fast tags

                64K (256B/set)    1MB (512B/set)    2MB (1024B/set)
# Tags/set      2       4         4       8         8       16
Overhead        1KB     2KB       2KB     16KB      16KB    32KB
Benchmarks
  Low           30%     45%       42%     64%       55%     74%
  Moderate      24%     62%       46%     70%       63%     85%
  High          35%     79%       67%     95%       75%     96%
Amoeba-Cache can be tuned to minimize the hardware overhead for large caches. With many words/set, cache utilization improves due to longer block lifetimes, making it feasible to support Amoeba-Blocks with a larger minimum granularity (>1 word). If we increase the minimum granularity to two or four words, only every third or fifth word can be a tag, meaning the # of comparators and multiplexers reduces to N_words/set / 3 or N_words/set / 5. When the minimum granularity is equal to the maximum granularity (RMAX), we obtain a fixed-granularity cache with N_words/set / RMAX ways. Cache organizations that collocate all the tags together at the head of the data array enable tag-only operations and serial Amoeba-Block accesses that need to activate only a portion of the data array. However, the set may need to be compacted at each insertion. Recently, Loh and Hill [23] explored such an organization for supporting tags in multi-gigabyte caches.
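The comparator arithmetic can be sanity-checked directly. The sketch below assumes the spacing argument stated above: with a minimum granularity of g data words, a tag can occupy at most every (g+1)-th word slot.

```python
# With minimum Amoeba-Block granularity g (in words), each block occupies at
# least g data words plus its tag word, so tags can appear at most every
# (g+1)-th word slot; the # of comparators/multiplexers is words_per_set
# divided by (g+1). (At g == RMAX the design degenerates to a conventional
# fixed-granularity cache with words_per_set/RMAX ways.)
def comparators(words_per_set, min_gran):
    return words_per_set // (min_gran + 1)

WORDS_PER_SET = 64               # e.g., 512B/set with 8-byte words (illustrative)
assert comparators(WORDS_PER_SET, 2) == WORDS_PER_SET // 3   # every 3rd word
assert comparators(WORDS_PER_SET, 4) == WORDS_PER_SET // 5   # every 5th word
```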
Finally, the use of Fast Tags helps reduce the tag lookups in the data array. Fast Tags use a separate traditional tag-array-like structure to cache the tags of recently used blocks and provide a pointer directly to the Amoeba-Block. The # of Fast Tags needed per set is proportional to the # of blocks in each set, which varies with the spatial locality in the application and the # of bytes per set (more details in Section 6.1). We studied 3 different cache configurations (64K 256B/set, 1M 512B/set, and 2M 1024B/set) while varying the number of Fast Tags per set (see Table 4). With 8 tags/set (16KB overhead), we can filter 64—95% of the accesses in a 1MB cache and 55—75% of the accesses in a 2MB cache.

5 Chip-Level Issues

5.1 Spatial Patterns Prediction
On a read miss, the predictor is indexed with the PC or region tag and the miss word; the matching access bitmap entry yields the block <Start, End> range to fetch.
Figure 9: Spatial predictor invoked on an Amoeba-Cache miss.
Amoeba-Cache needs a spatial block predictor, which informs refill requests about the range of the block to fetch. Amoeba-Cache can exploit any spatial locality predictor, and there have been many efforts in the compiler and architecture community [7, 17, 29, 6]. We adopt a table-driven approach consisting of a set of access bitmaps; each entry is RMAX (the maximum granularity of an Amoeba-Block) bits wide and records whether each word was touched during the lifetime of the recently evicted cache block. On a miss, the predictor searches for an entry (indexed by either the miss PC or the region address) and chooses the range of words to be fetched on either side (left and right) of the critical word. The PC-based indexing also uses the critical word index for improved accuracy. The predictor optimizes for spatial prefetching and will overfetch (bring in potentially untouched words) if they are interspersed amongst contiguous chunks of touched words. We can also bypass the prediction when there is low confidence in the prediction accuracy. For example, for streaming applications without repeated misses to a region, we can bring in a fixed-granularity block based on the overall global behavior of the application. We evaluate tradeoffs in the design of the spatial predictor in Section 6.2.

5.2 Multi-level Caches
We discuss the design of inclusive cache hierarchies including multiple Amoeba-Caches; we illustrate using a 2-level hierarchy. Inclusion means that the L2 cache contains a superset of the data words in the L1 cache; however, the two levels may include different granularity blocks. For example, the Sun Niagara T2 uses 16-byte L1 blocks and 64-byte L2 blocks. Amoeba-Cache permits non-aligned blocks of variable granularity at the L1 and the L2, and needs to deal with two issues: a) L2 recalls that may invalidate multiple L1 blocks, and b) L1 refills that may need data from multiple blocks at the L2. For both cases, we need to identify all the relevant Amoeba-Blocks that overlap with either the recall or the refill request. This situation is similar to a Niagara L2 eviction, which may need to recall 4 L1 blocks. Amoeba-Cache's logic ensures that all Amoeba-Blocks from a region map to a single set at any level (using the same RMAX for both L1 and L2). This ensures that L2 recalls or L1 refills index into only a single set. To process multiple blocks for a single cache operation, we use the step-by-step process outlined in Section 3.5 ((1) and (2) in Figure 6). Finally, the L1-L2 interconnect needs 3 virtual networks, two of which, the L2→L1 data virtual network and the L1→L2 writeback virtual network, can carry packets of variable granularity; each packet is broken down into a variable number of smaller physical flits.

5.3 Cache Coherence
There are three main challenges that variable cache line granularity introduces when interacting with the coherence protocol: 1) How is the coherence directory maintained? 2) How is variable-granularity read sharing supported? 3) What is the granularity of write invalidations? The key insight that ensures compatibility with a conventional fixed-granularity coherence protocol is that an Amoeba-Block always lies within an aligned RMAX-byte region (see § 3). To ensure correctness, it is sufficient to maintain the coherence granularity and directory information at a fixed granularity ≤ RMAX. Multiple cores can simultaneously cache any variable-granularity Amoeba-Block from the same region in Shared state; all such cores are marked as sharers in the directory entry. A core that desires exclusive ownership of an Amoeba-Block in the region uses the directory entry to invalidate every Amoeba-Block corresponding to the fixed coherence granularity. All Amoeba-Blocks relevant to an invalidation will be found in the same set in the private cache (see set indexing in § 3). The coherence granularity could potentially be < RMAX so that false sharing is not introduced in the quest for higher cache utilization (larger RMAX). The core claiming ownership on a write will itself fetch only the desired-granularity Amoeba-Block, saving bandwidth. A detailed evaluation of the coherence protocol is beyond the scope of the current paper.

6 Evaluation
Framework: We evaluate the Amoeba-Cache architecture with the Wisconsin GEMS simulation infrastructure [25]; we use the in-order processor timing model. We have replaced the SIMICS functional simulator with the faster Pin [24] instrumentation framework to enable longer simulation runs. We perform timing simulation for 1 billion instructions. We warm up the cache using 20 million accesses from the trace. We model the cache controller in detail, including the transient states needed for the multi-step cache operations and all the associated port and queue contention. We use a Full-LRU replacement policy, evicting Amoeba-Blocks in LRU order until sufficient space is freed up for the block to be brought in. This helps decouple our observations from the replacement policy, enabling a fairer comparison with other approaches (Section 9). Our workloads are a mix of applications whose working sets stress our caches and include SPEC CPU benchmarks, Dacapo Java benchmarks [4], commercial workloads (SpecJBB2005, TPC-C, and Apache), and the Firefox web browser. Table 1 classifies the applications into three categories (Low, Moderate, and High) based on spatial locality. When presenting averages of ratios or improvements, we use the geometric mean.

6.1 Improved Memory Hierarchy Efficiency

Result 1: Amoeba-Cache increases cache capacity by harvesting space from unused words and can achieve an 18% reduction in both L1 and L2 miss rate.
Result 2: Amoeba-Cache adaptively sizes the cache block granularity and reduces L1↔L2 bandwidth by 46% and L2↔Memory bandwidth by 38%.
In this section, we compare the bandwidth and miss rate properties of Amoeba-Cache against a conventional cache. We evaluate two types of caches: a Fixed cache, which represents a conventional set-associative cache, and the Amoeba-Cache. In order to isolate the benefits of Amoeba-Cache from the potentially changing accuracy of the spatial predictor across different cache geometries, we use utilization at the next eviction as the spatial prediction, determined from a prior run on a fixed-granularity cache. This also ensures that the spatial granularity predictions can be replayed across multiple simulation runs. To ensure equivalent data storage space, we set the Amoeba-Cache size to the sum of the tag array and the data array in a conventional cache. At the L1 level (64K), the net capacity of the Amoeba-Cache is 64K + 8*4*256 bytes, and at the L2 level (1M), it is 1M + 8*8*2048 bytes. The L1 cache has 256 sets and the L2 cache has 2048 sets.
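The equivalent-capacity arithmetic can be verified directly; the expressions above correspond to 8 tag bytes per 64B block, with 4 ways at the L1 and 8 ways at the L2.

```python
# Equivalent-capacity check: Amoeba-Cache size = conventional data array +
# conventional tag array (8 tag bytes per block; 4-way L1, 8-way L2).
l1_total = 64 * 1024 + 8 * 4 * 256      # 64K data + L1 tags
l2_total = 1024 * 1024 + 8 * 8 * 2048   # 1M data + L2 tags
assert l1_total == 288 * 256            # 288 bytes/set over 256 sets
assert l2_total == 576 * 2048           # 576 bytes/set over 2048 sets
```

The per-set figures (288 B/set at the L1, 576 B/set at the L2) match the set sizes used in Table 5.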
Scatter plots of bandwidth (KB / 1K ins) versus miss rate (misses / 1K ins) for Amoeba and Fixed, per group: (a) 64K - Low, (b) 1M - Low, (c) 64K - Moderate, (d) 1M - Moderate, (e) 64K - High, (f) 1M - High.
Figure 10: Fixed vs. Amoeba (Bandwidth and Miss Rate). Note the different scale for different application groups.
Figure 10 plots the miss rate and the traffic characteristics of the Amoeba-Cache. Since Amoeba-Cache can hold blocks varying from 8B to 64B, each set can hold more blocks by utilizing the space from untouched words. Amoeba-Cache reduces the 64K L1 miss rate by 23% (stdev: 24) for the Low group and by 21% (stdev: 16) for the Moderate group; even applications with high spatial locality experience a 7% (stdev: 8) improvement in miss rate. There is a 46% (stdev: 20) reduction on average in L1↔L2 bandwidth. At the 1M L2 level, Amoeba-Cache improves the Moderate group's miss rate by 8% (stdev: 10) and bandwidth by 23% (stdev: 12). Applications with moderate utilization make better use of the space harvested from unused words by Amoeba-Cache. Many low-utilization applications tend to be streaming, and providing extra cache space does not help lower their miss rate. However, by not fetching unused words, Amoeba-Cache achieves a significant reduction (38% (stdev: 24) on average) in off-chip L2↔Memory bandwidth; even High-utilization applications see a 17% (stdev: 15) reduction in bandwidth. Utilization and miss rate are not, however, always directly correlated (more details in § 8).
With Amoeba-Cache, the # of blocks/set varies based on the granularity of the blocks being fetched, which in turn depends on the spatial locality in the application. Table 5 shows the avg. # of blocks/set. In applications with low spatial locality, Amoeba-Cache adjusts the block size and adapts to store many smaller blocks. The 64K L1
Table 5: Avg. # of Amoeba-Blocks / Set

64K Cache, 288 B/set
  4—5     ferret, cactus, firefox, eclipse, facesim, freqmine, milc, astar
  6—7     tpc-c, tradesoap, soplex, apache, fluidanimate
  8—9     h2, canneal, omnetpp, twolf, x264, lbm, jbb
  10—12   mcf, art

1M Cache, 576 B/set
  3—5     eclipse, omnetpp
  8—9     cactus, firefox, tradesoap, freqmine, h2, x264, tpc-c
  10—11   facesim, soplex, astar, milc, apache, ferret
  12—13   twolf, art, jbb, lbm, fluidanimate
  15—18   canneal, mcf
Per-application distribution of Amoeba-Block granularities (1-2 words, 3-4 words, 5-6 words, 7-8 words) for (a) the 64K L1 cache and (b) the 1M L2 cache.
Figure 11: Distribution of cache line granularities in the 64K L1 and 1M L2 Amoeba-Cache. Avg. utilization is on top.
Amoeba-Cache stores 10 blocks per set for mcf and 12 blocks per set for art, effectively increasing associativity without introducing fixed hardware overheads. At the L2, when the working set starts to fit in the L2 cache, the set is partitioned into fewer blocks. Note that applications like eclipse and omnetpp hold only 3—5 blocks per set on average (lower than conventional associativity) due to their low miss rates (see Table 8). With streaming applications (e.g., canneal), Amoeba-Cache increases the # of blocks/set to >15 on average. Finally, some applications like apache store 6—7 blocks/set with a 64K cache, with varied block sizes (see Figure 11): approximately 50% of the blocks store 1-2 words and 30% of the blocks store 8 words at the L1. As the size of the cache increases, and with it the lifetime of the blocks, the Amoeba-Cache adapts to store larger blocks, as can be seen in Figure 11.
Utilization is improved greatly across all applications (90%+ in many cases). Figure 11 shows the distribution of cache block granularities in Amoeba-Cache. The Amoeba-Block distribution matches the word access distribution presented in § 2. With the 1M cache, the larger cache size improves block lifespan and thereby utilization, with a significant drop in the % of 1—2 word blocks. However, in many applications (tpc-c, apache, firefox, twolf, lbm, mcf), up to 20% of the blocks are 3—6 words wide, indicating the benefits of adaptivity and the challenges faced by Fixed.

6.2 Overall Performance and Energy

Result 3: Amoeba-Cache improves overall cache efficiency and boosts performance by 10% on commercial applications², saving up to 11% of the energy of the on-chip memory hierarchy. Off-chip L2↔Memory energy sees a mean reduction of 41% across all workloads (86% for art and 93% for twolf).
We model a two-level cache hierarchy in which the L1 is a 64K cache with 256 sets (3 cycles load-to-use) and the L2 is 1M with 8192 sets (20 cycles). We assume a fixed memory latency of 300 cycles. We conservatively assume that the L1 access is the critical pipeline stage and throttle the CPU clock by 4% (we evaluate an alternative approach in the next section). We calculate the total dynamic energy of the Amoeba-Cache using the energy numbers determined in Section 4 through a combination of synthesis and CACTI [28]. We use 4 fast tags per set at the L1 and 8 fast tags per set at the L2. We include the penalty for all the extra metadata in Amoeba-Cache. We derive the energy for a single L1—L2 transfer (6.8pJ per byte) from [2, 28]. The interconnect uses full-swing wires at 32nm, 0.6V.
Per-application bars of % performance improvement and % on-chip energy improvement; bars that exceed the axis are annotated (e.g., Energy +21% / Perf. +11%, Energy +33% / Perf. +50%, Perf. +36%).
Figure 12: % improvement in performance and % reduction in on-chip memory hierarchy energy. Higher is better. Y-axis terminated to illustrate bars clearly. Baseline: Fixed, 64K L1, 1M L2.
Figure 12 plots the overall improvement in performance and the reduction in on-chip memory hierarchy energy (L1 and L2 caches, and the L1↔L2 interconnect). On applications that have good spatial locality (e.g., tradesoap, milc, facesim, eclipse, and cactus), Amoeba-Cache
²“Commercial” applications include Apache, SpecJBB, and TPC-C.
has minimal impact on miss rate, but provides a significant bandwidth benefit. This results in on-chip energy reduction: milc's L1↔L2 bandwidth reduces by 15% and its on-chip energy reduces by 5%. Applications that suffer from cache pollution under Fixed (apache, jbb, twolf, soplex, and art) see gains in both performance and energy. Apache's performance improves by 11% and its on-chip energy reduces by 21%, while SpecJBB's performance improves by 21% and its energy reduces by 9%. Art gains approximately 50% in performance. Streaming applications like mcf access blocks with both low and high utilization. Keeping out the unused words in the under-utilized blocks prevents the well-utilized cache blocks from being evicted; mcf's performance improves by 12% and its on-chip energy by 36%.
Extra cache pipeline stage: An alternative strategy to accommodate Amoeba-Cache's overheads is to add an extra pipeline stage to the cache access, which increases hit latency by 1 cycle. The CPU clock frequency then entails no extra penalty compared to a conventional cache. We find that for applications in the moderate and low spatial locality groups (8 applications), Amoeba-Cache continues to provide a performance benefit between 6—50%. milc and canneal suffer minimal impact, with a 0.4% improvement and a 3% slowdown, respectively. Applications in the high spatial locality group (12 applications) suffer an average 15% slowdown (maximum 22%) due to the increase in L1 access latency. In these applications, 43% of the instructions (on average) are memory accesses, and a 33% increase in L1 hit latency imposes a high penalty. Note that all applications continue to retain the energy benefit. The cache hierarchy energy is dominated by the interconnects, and Amoeba-Cache provides a notable bandwidth reduction. While these results may change for an out-of-order, multiple-issue processor, the evaluation suggests that Amoeba-Cache, if implemented with the extra pipeline stage, is better suited for levels of the memory hierarchy below the L1.
Off-chip L2↔Memory energy: The L2's higher cache capacity makes it less susceptible to pollution and provides less opportunity for improving miss rate. In such cases, Amoeba-Cache keeps out unused words and reduces off-chip bandwidth, and thereby off-chip energy. We assume that the off-chip DRAM can provide adaptive-granularity transfers for Amoeba-Cache's L2 refills, as in [35]. We use a DRAM model presented in a recent study [9] and model 0.5nJ per word transferred off-chip. The low spatial locality applications see a dramatic reduction in off-chip energy. For example, twolf sees a 93% reduction. On commercial workloads, the off-chip energy decreases by 31% and 24%, respectively. Even for applications with high cache utilization, off-chip energy decreases by 15%.

7 Spatial Predictor Tradeoffs
We evaluate the effectiveness of spatial pattern prediction in this section. In our table-based approach, a pattern history table records spatial patterns from evicted blocks and is accessed using a prediction index. The table-driven approach requires careful consideration of the following: the prediction index, the predictor table size, and the training period. We quantify the effects by comparing the predictor against a baseline fixed-granularity cache. We use a 64K cache since it induces enough misses and evictions to highlight the predictor tradeoffs clearly.

7.1 Predictor Indexing
A critical choice with history-based prediction is the selection of the predictor index. We explored two types of predictor indexing: a) a PC-based approach [6], based on the intuition that fields in a data structure are accessed by specific PCs and tend to exhibit similar spatial behavior. The tag includes the PC and the critical word index:
Per-application MPKI for the (a) Low MPKI and (b) High MPKI groups, and bandwidth for the (c) Low Bandwidth and (d) High Bandwidth groups, comparing Aligned, Finite, Infinite, Finite+FT, and History.
ALIGNED: fixed-granularity cache (64B blocks). FINITE: Amoeba-Cache with a REGION predictor (1024-entry predictor table and 4K region size). INFINITE: Amoeba-Cache with an unbounded predictor table (REGION predictor). FINITE+FT: FINITE augmented with hints for predicting a default granularity on compulsory misses (first touches). HISTORY: Amoeba-Cache uses spatial pattern hints based on utilization at the next eviction, collected from a prior run.
((PC >> 3) << 3) + (addr % 64)/8; and b) a Region-based (REGION) approach, based on the intuition that similar data objects tend to be allocated in contiguous regions of the address space and tend to exhibit similar spatial behavior. We compared the miss rate and bandwidth properties of both the PC (256 entries, fully associative) and REGION (1024 entries, 4KB region size) predictors. The size of each predictor was selected as the sweet spot in behavior for that predictor type. For all applications apart from cactus (a high spatial locality application), REGION-based prediction tends to overfetch and waste bandwidth compared to PC-based prediction, which has 27% less bandwidth consumption on average across all applications. For 17 out of 22 applications, REGION-based prediction shows 17% better MPKI on average (max: 49% for cactus). For 5 applications (apache, art, mcf, lbm, and omnetpp), we find that PC demonstrates better accuracy than REGION when predicting the spatial behavior of cache blocks, with a 24% improvement in MPKI (max: 68% for omnetpp).
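The two indexing schemes can be sketched as follows. This assumes 8-byte words within a 64B maximum block, so the critical word index is (addr % 64)/8; the 4KB region size is the one used in the REGION experiments.

```python
# The two predictor indices compared above (8-byte words assumed).
def pc_index(pc, addr):
    # PC-based tag: aligned PC plus the critical word index within a 64B block
    return ((pc >> 3) << 3) + (addr % 64) // 8

def region_index(addr, region_bytes=4096):
    # REGION-based: all blocks in an aligned region share one pattern entry
    return addr // region_bytes

assert pc_index(0x4007, 0x1038) == 0x4000 + 7         # word 7 of the block
assert region_index(0x1038) == region_index(0x1FFF)   # same 4KB region
assert region_index(0x1038) != region_index(0x2000)   # next region
```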
7.2 Predictor Table
We studied the organization and size of the pattern table using the REGION predictor. We evaluated the following parameters: a) the region size, which directly correlates with the coverage of a fixed-size table; b) the size of the predictor table, which determines how many unique region patterns can be tracked; and c) the number of bits required to represent the spatial pattern.
Large region sizes reduce the number of regions in the working set and hence require a smaller predictor table. However, a larger region is more likely to contain blocks with varied spatial behavior, which can pollute the pattern entry. Moving from 1KB regions (4096 entries) to 4KB regions (1024 entries), the 4KB granularity decreased miss rate by 0.3% and increased bandwidth by 0.4%, even though both tables provide the same working-set coverage (4MB). Fixing the region size at 4KB, we studied the benefits of an unbounded table. Compared to a 1024-entry table (FINITE in Figure 6.2), the unbounded table increases miss rate by 1% and decreases bandwidth by 0.3%. A 1024-entry predictor table (4KB region granularity per entry) thus suffices for most applications. Organizing the 1024 entries as a 128-set × 8-way table suffices to eliminate associativity-related conflicts (<0.8% of evictions are due to lack of ways).
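A minimal sketch of the 128-set × 8-way pattern-table organization described above. The set-indexing by region number follows the text; LRU replacement within a set is our assumption, since the paper does not name the table's replacement policy.

```python
from collections import OrderedDict

SETS, WAYS = 128, 8    # 1024 entries total, 128 sets x 8 ways
REGION_SIZE = 4096     # 4KB regions

class PatternTable:
    def __init__(self):
        # one LRU-ordered map of region -> spatial pattern per set
        self.sets = [OrderedDict() for _ in range(SETS)]

    def _set_of(self, addr: int) -> int:
        return (addr // REGION_SIZE) % SETS

    def lookup(self, addr: int):
        s = self.sets[self._set_of(addr)]
        region = addr // REGION_SIZE
        if region in s:
            s.move_to_end(region)   # refresh LRU position
            return s[region]
        return None                 # no history for this region

    def update(self, addr: int, pattern) -> None:
        s = self.sets[self._set_of(addr)]
        region = addr // REGION_SIZE
        if region not in s and len(s) >= WAYS:
            s.popitem(last=False)   # lack-of-ways eviction (LRU)
        s[region] = pattern
        s.move_to_end(region)
```

All addresses within one 4KB region hit the same entry; distinct regions that alias to the same set compete for the 8 ways.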
Focusing on the number of bits required to represent the pattern, we evaluated 4-bit saturating counters (instead of 1-bit bitmaps). The saturating counters seek to avoid pattern pollution when blocks with varied spatial behavior reside in the same region. Interestingly, we find that 1-bit bitmaps are more beneficial for the majority of the applications (12 out of 22); the hysteresis introduced by the counters lengthens the training period. To summarize, we find that a REGION predictor with a 4KB region size and 1024 entries can predict the spatial pattern for a majority of the applications. CACTI indicates that the predictor table can be indexed in 0.025ns and consumes 2.3pJ per miss for indexing.
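The two training alternatives can be sketched as follows; the specific increment/decrement rule for the saturating counters is our assumption of a typical implementation, not a detail given in the text.

```python
def update_bitmap(entry, used):
    # 1-bit per word: overwrite the entry with the utilization bitmap
    # harvested at eviction. Adapts instantly, but a single block with
    # unusual behavior rewrites the whole pattern.
    return list(used)

def update_counters(entry, used, max_val=15):
    # 4-bit saturating counter per word: touched words count up,
    # untouched words count down. The hysteresis resists pollution from
    # one odd block in the region, but lengthens the training period.
    return [min(c + 1, max_val) if u else max(c - 1, 0)
            for c, u in zip(entry, used)]
```

With counters, a word must be touched across several evictions before its count reflects the new behavior, which is the longer training period observed above.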
7.3 Spatial Pattern Training
A widely used approach to training the predictor is to harvest the word-usage information on an eviction. Unfortunately, evictions may not be frequent, so the predictor's training period tends to be long, during which the cache performs less efficiently and/or the application's phase may change in the meantime. In particular, at the time of a first touch (a compulsory miss to a location), we need to infer the global spatial access pattern. We compare the finite region predictor (FINITE in Figure 6.2), which predicts using eviction history only, against FINITE+FT, which adds the optimization of inferring a default pattern (in this paper, from a prior run) when there is no predictor information. FINITE+FT demonstrates an avg. 1% (max: 6% for jbb) reduction in miss rate compared to FINITE and comes within 16% of the miss rate of HISTORY. In terms of bandwidth, FINITE+FT saves 8% of the bandwidth (up
to 32% for lbm) compared to FINITE. The percentage of first-touch accesses is shown in Table 8.
7.4 Predictor Summary
• For the majority of the applications (17/22), the address-region predictor with a 4KB region size works well. However, five applications (apache, lbm, mcf, art, omnetpp) perform better with PC-based indexing. For best efficiency, the predictor should adapt its indexing to each application.
• Updating the predictor only on evictions leads to long training periods, which causes a loss of caching efficiency. We need to develop mechanisms that infer the default pattern from the global behavior demonstrated by resident cache lines.
• The online predictor reduces MPKI by 7% and bandwidth by 26%on average relative to the conventional approach. However, it stillhas a 14% higher MPKI and 38% higher bandwidth relative to theHISTORY-based predictor, indicating room for improvement inprediction.
• The 1024-entry (4KB region size) predictor table imposes only ≈0.12% energy overhead on the overall cache hierarchy energy since it is referenced only on misses.
8 Amoeba-Cache Adaptivity
We demonstrate that Amoeba-Cache can adapt better than a conventional cache to variations in spatial locality.
Tuning RMAX for High Spatial Locality A challenge often faced by conventional caches is the desire to widen the cache block (to achieve spatial prefetching) without wasting space and bandwidth on low spatial locality applications. We study 3 specific applications: milc and tradesoap have good spatial locality, while soplex has poor spatial locality. With a conventional 1M cache, when we widen the block size from 64 to 128 bytes, milc and tradesoap experience a 37% and 39% reduction in miss rate, respectively. However, soplex's miss rate increases by 2× and its bandwidth by 3.1×.
With Amoeba-Cache we do not have to make this uneasy choice, as it permits Amoeba-Blocks with granularities of 1—RMAX words (RMAX: maximum block size). When we increase RMAX from 64 bytes to 128 bytes, miss rate reduces by 37% for milc and 24% for tradesoap, while bandwidth simultaneously drops by 7%. Unlike the conventional cache, Amoeba-Cache is also able to adapt to poor spatial locality: soplex experiences only a 2% increase in bandwidth and a 40% increase in miss rate.
Figure 13: Effect of increasing the block size from 64 to 128 bytes in a 1 MB cache. Bars show the reduction (%) in miss rate and bandwidth for soplex, milc, and tradesoap under the Fixed and Amoeba designs; off-scale values: -213, -9, 2.
Predicting Strided Accesses Many applications (e.g., firefox and canneal) exhibit strided access patterns, which touch a few words in a block before accessing another block. Strided access patterns introduce intra-block holes (untouched words). For instance, canneal accesses ≈10K distinct fixed-granularity cache blocks with the specific access pattern [--x--x--] (x indicates that the ith word has been touched).
Any predictor that uses the access pattern history has two choices when an access misses on word 3 or 6: a) a miss-oriented policy (Policy-Miss) may refill the entire range of words 3–6, eliminating the secondary miss but bringing in untouched words 4–5 and increasing bandwidth; or b) a bandwidth-focused policy (Policy-BW) that refills only the requested word but incurs more misses. Table 6 compares the two contrasting policies for Amoeba-Cache (relative to a fixed-granularity baseline cache). Policy-BW saves 9% bandwidth compared to Policy-Miss but suffers a 25–30% higher miss rate.
Table 6: Predictor Policy Comparison

                       canneal                  firefox
                  Miss Rate   BW          Miss Rate   BW
Policy-Miss        10.31%    81.2%         11.18%    47.1%
Policy-BW         –20.08%    88.09%       –13.44%    56.82%
Spatial Patterns  [--x--x--] [x--x----]   [-x--x---] [x---x---]
–: indicates Miss or BW higher than Fixed.
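The two refill policies above can be sketched as follows (the function names and zero-indexed word numbering are ours; the text numbers words from 1):

```python
def refill_range(history, miss_word):
    """Policy-Miss: refill the span from the missing word to the farthest
    word the history bitmap says was touched, accepting untouched holes
    inside the span in exchange for eliminating the secondary miss."""
    touched = [i for i, bit in enumerate(history) if bit] + [miss_word]
    return (min(touched), max(touched))

def refill_word(history, miss_word):
    """Policy-BW: refill only the requested word, saving bandwidth at the
    cost of extra misses on the other historically touched words."""
    return (miss_word, miss_word)
```

For canneal's [--x--x--] pattern, a miss on either touched word makes Policy-Miss fetch the whole words-2..5 span (zero-indexed), including the two untouched holes, while Policy-BW fetches a single word.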
9 Amoeba-Cache vs. Other Approaches
We compare Amoeba-Cache against four approaches.
• Fixed-2x: Our baseline is a fixed granularity cache 2× the capacityof the other designs (64B block).
• Sector is the conventional sector cache design (as in IBM Power7 [15]): 64B block and a small sector size (16 bytes or 2 words). This design targets optimal bandwidth. On any access, the cache fetches only the requisite sector, even though it allocates space for the entire line.
• Sector-Pre adds prefetching to Sector. This optimized designprefetches multiple sectors based on a spatial granularity predictorto improve the miss rate [17, 29].
• Multi$ combines features from line distillation [30] and spatial-temporal caches [12]. It is an aggressive design that partitions a cache into two: a line-organized cache (LOC) and a word-organized cache (WOC). At insertion time, Multi$ uses the spatial granularity hints to direct the cache to refill either individual words (into the WOC) or the entire line. We investigate two design points: 50% of the cache as a WOC (Multi$-50) and 25% of the cache as a WOC (Multi$-25).
Sector, Sector-Pre, Multi$, and Amoeba-Cache all use the same
spatial predictor hints. On a demand miss, they prefetch all the sub-blocks needed together. Prefetching only changes the timing of anaccess; it does not change the miss bandwidth and cannot remove themisses caused by cache pollution.
Energy and Storage The sector approaches impose low hardware complexity and energy penalty. To filter out unused words, the sector granularity has to be close to word granularity; we explored 2 words/sector, which leads to a storage penalty of ≈64KB for a 1MB cache. In Multi$, the WOC increases associativity to the number of words/block. Multi$-25 partitions a 64K 4-way cache into a 48K 3-way LOC and a 16K 8-way WOC, increasing overall associativity to 11. For a 1M cache, Multi$-50 increases associativity to 36. Compared to Fixed, Multi$-50 imposes over a 3× increase in lookup latency, a 5× increase in lookup energy, and a ≈4× increase in tag storage. Amoeba-Cache provides a more scalable approach to adaptive cache lines since it varies the storage dedicated to tags based on the spatial locality of the application.
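The associativity figures above follow from the partitioning arithmetic. A sketch, assuming 8 words per 64B block and (our assumption, not stated in the text) an 8-way base organization for the 1M cache:

```python
def multicache_ways(base_ways, woc_frac, words_per_block=8):
    # Multi$-N turns woc_frac of the ways into a word-organized cache
    # (WOC); each WOC way splits into words_per_block word-sized ways,
    # so effective lookup associativity = LOC ways + WOC word-ways.
    loc_ways = int(base_ways * (1 - woc_frac))
    woc_ways = int(base_ways * woc_frac) * words_per_block
    return loc_ways + woc_ways
```

This reproduces the numbers in the text: a 64K 4-way cache gives 3 + 8 = 11 ways for Multi$-25, and an 8-way 1M cache gives 4 + 32 = 36 ways for Multi$-50.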
Miss Rate and Bandwidth Figure 14 summarizes the comparison; we focus on the moderate and low utilization groups of applications. On the high utilization group, all designs other than Sector have comparable miss rates. Amoeba-Cache improves miss rate to within 5%—6% of Fixed-2× for the low group and within 8%—17% for the moderate group. Compared to Fixed-2×, Amoeba-Cache also lowers bandwidth by 40% (64K cache) and 20% (1M cache). Compared to Sector-Pre (with prefetching), Amoeba-Cache is able to adapt better with flexible granularity and achieves a lower miss rate (up to 30% @ 64K and 35% @ 1M). Multi$'s benefits are proportional to the fraction of the cache organized as a WOC; Multi$-50 (18-way @ 64K and 36-way @ 1M) is needed to match the miss rate of Amoeba-Cache. Finally, in the moderate group, many applications exhibit strided access. Compared to Multi$'s WOC, which fetches individual words, Amoeba-Cache increases bandwidth since it chooses to fetch the contiguous chunk in order to lower the miss rate.

Figure 14: Relative miss rate and bandwidth for different caches. Baseline (1,1) is the Fixed-2x design. Labels: • Fixed-2x, ◦ Sector approaches, ∗ Multi$, △ Amoeba. Panels: (a) 64K - Low, (b) 64K - Moderate, (c) 1M - Low, (d) 1M - Moderate. Off-scale Sector points: (a) x:1.8; (b) x:2.9; (c) x:2.6, y:1.2; (d) x:3.9, y:1.0. Note the different Y-axis scale for each group.

10 Multicore Shared Cache
We evaluate a shared cache implemented with the Amoeba-Cache design. By dynamically varying the cache block size and keeping out unused words, Amoeba-Cache effectively minimizes the footprint of an application, which helps multiple applications share the cache space. We experimented with a 1M shared Amoeba-Cache in a 4-core system. Table 7 shows the application mixes; we chose a mix of applications across all groups. We tabulate the change in miss rate per thread and the overall change in bandwidth for Amoeba-Cache with respect to a fixed-granularity cache running the same mix. Minimizing the overall footprint enables a reduction in the miss rate of each application in the mix. The commercial workloads (SpecJBB and TPC-C) are able to make use of the available space and achieve a significant reduction in miss rate (avg: 18%). Only two applications suffered a small increase in miss rate due to contention (x264 in Mix #2: 2%; ferret in Mix #3: 4%). Overall L2 miss bandwidth improves significantly, showing a 16%—39% reduction across all workload mixes. We believe that an Amoeba-based shared cache can effectively support more cores and increase overall throughput. We leave the design space exploration of an Amoeba-based coherent cache for future work.
Table 7: Multiprogrammed Workloads on 1M Shared Amoeba-Cache. % reduction in miss rate and bandwidth. Baseline: Fixed 1M.

Mix                             Miss T1   Miss T2   Miss T3   Miss T4   BW (All)
jbb×2, tpc-c×2                  12.38%    12.38%    22.29%    22.37%    39.07%
firefox×2, x264×2                3.82%     3.61%    –2.44%     0.43%    15.71%
cactus, fluid., omnet., sopl.    1.01%     1.86%    22.38%     0.59%    18.62%
canneal, astar, ferret, milc     4.85%     2.75%    19.39%    –4.07%    17.77%
–: indicates Miss or BW higher than Fixed. T1—T4: threads in the mix, in the order of the applications listed.
11 Related Work
Burger et al. [5] defined cache efficiency as the fraction of blocks that store data that is likely to be used. We use the term cache utilization to identify touched versus untouched words residing in the cache. Past works [6, 29, 30] have also observed low cache utilization at specific levels of the cache. Some works [18, 19, 13, 21] have sought to improve cache utilization by eliminating cache blocks that are no longer likely to be used (referred to as dead blocks). These techniques do not address the problem of intra-block waste (i.e., untouched words).
Sector caches [31, 32] associate a single tag with a group of contiguous cache lines, allowing cache sizes to grow without paying the penalty of additional tag overhead. Sector caches use bandwidth efficiently by transferring only the needed cache lines within a sector. Conventional sector caches [31] may result in worse utilization due to the space occupied by invalid cache lines within a sector. Decoupled sector caches [32] reduce the number of invalid cache lines per sector by increasing the number of tags per sector. Compared to Amoeba-Cache, however, the tag space is a constant overhead and limits the number of invalid sectors that can be eliminated. Pujara et al. [29] consider a word-granularity sector cache and use a predictor to try to bring in only the used words. Our results (see Figure 14) show that smaller-granularity sectors significantly increase misses, and optimizations that prefetch [29] can pollute the cache and interconnect with unused words.
Line distillation [30] applies filtering at the word granularity to eliminate unused words in a cache block at eviction. This approach requires part of the cache to be organized as a word-organized cache, which increases tag overhead, complicates lookup, and bounds the performance improvement. Most importantly, line distillation does not address the bandwidth penalty of unused words, an inefficiency that is increasingly important under current and future technology dominated by interconnects [14, 28]. Veidenbaum et al. [33] propose that the entire cache be word-organized and present an online algorithm to prefetch words. Unfortunately, a statically word-organized cache has a built-in tag overhead of 50% and requires energy-intensive associative searches.
Amoeba-Cache adopts a more proactive approach, enabling continuous dynamic adaptation of block granularity to the available spatial locality. With high spatial locality, Amoeba-Cache will automatically store a few big cache blocks (most space dedicated to data); with low spatial locality, it will adapt to storing many small cache blocks (extra space allocated to tags). Recently, Yoon et al. have proposed an adaptive-granularity DRAM architecture [35], which provides the support necessary for variable-granularity off-chip requests from an Amoeba-Cache-based LLC. Some research [10, 8] has also focused on reducing false sharing in coherent caches by splitting/merging cache blocks to avoid invalidations. These designs would benefit from Amoeba-Cache, which manages block granularity in hardware.
There has been a significant amount of work at the compiler and software runtime level (e.g., [7]) to restructure data for improved spatial efficiency. There have also been efforts in the architecture community to predict spatial locality [29, 34, 17, 36], which we can leverage to predict Amoeba-Block ranges. Finally, cache compression is an orthogonal body of work that does not eliminate unused words but seeks to minimize the overall memory footprint [1].
12 Summary
In this paper, we propose Amoeba-Cache, a cache design that can dynamically hold a variable number of cache blocks of different granularities. Amoeba-Cache employs a novel organization that completely eliminates the tag array and collocates the tags with the cache blocks in the data array. This permits Amoeba-Cache to trade the space budgeted for cache blocks for tags and to support a variable number of tags (and blocks). For applications with low spatial locality, Amoeba-Cache can reduce cache pollution, improve the overall miss rate, and reduce bandwidth wasted in the interconnects. When applications have moderate to high spatial locality, Amoeba-Cache coarsens the block size and ensures good performance. Finally, for streaming applications (e.g., lbm), Amoeba-Cache can save significant energy by keeping unused words from being transmitted over the interconnects.
Acknowledgments
We would like to thank our shepherd, André Seznec, and the anonymous reviewers for their comments and feedback.
Appendix
Table 8: Amoeba-Cache Performance. Absolute numbers.

             MPKI          BW (bytes/1K ins.)     CPI          Predictor Stats
App       L1      L2      L1<->L2   L2<->Mem   Cycles/Ins.  First Touch  Evict Win.
                                                            (% Misses)  (#ins./Evict)
apache    64.9   19.6      5,000     2,067        8.3          0.4           17
art      133.7   53.0      5,475     1,425       16.0          0.0            9
astar      0.9    0.3         70        35        1.9         18.0        1,600
cactus     6.9    4.4        604       456        3.5          7.5          162
canne.     8.3    5.0        486       357        3.2          5.8          128
eclip.     3.6   <0.1        433        <1        1.8          0.1          198
faces.     5.5    4.7        683       632        3.0         41.2          190
ferre.     6.8    1.4        827        83        2.1          1.3          156
firef.     1.5    1.0        123        95        2.1         11.1          727
fluid.     1.7    1.4        138       127        1.9         39.2          629
freqm.     1.1    0.6         89        65        2.3         17.7          994
h2         4.6    0.4        328        46        1.8          1.7          154
jbb       24.6    9.6      1,542       830        5.0         10.2           42
lbm       63.1   42.2      3,755     3,438       13.6          6.7           18
mcf       55.8   40.7      2,519     2,073       13.2          0.0           19
milc      16.1   16.0      1,486     1,476        6.0          2.4           66
omnet.     2.5   <0.1        158        <1        1.9          0.0          458
sople.    30.7    4.0      1,045       292        3.1          0.9           35
tpcc       5.4    0.5        438        36        2.0          0.4          200
trade.     3.6   <0.1        410         6        1.8          0.6          194
twolf     23.3    0.6      1,326        45        2.2          0.0           49
x264       4.1    1.8        270       190        2.2         12.4          274

MPKI: misses per 1K instructions. BW: bytes per 1K instructions. CPI: clock cycles per instruction. First Touch: compulsory misses; % of accesses that use the default granularity. Evict window: # of instructions between evictions; a higher value indicates predictor training takes longer.
References[1] A. R. Alameldeen. Using Compression to Improve Chip Multiprocessor Perfor-
mance. In Ph.D. dissertation, Univ. of Wisconsin-Madison, 2006.[2] D. Albonesi, A. Kodi, and V. Stojanovic. NSF Workshop on Emerging Technolo-
gies for Interconnects (WETI). 2012.[3] C. Bienia. Benchmarking Modern Multiprocessors. In Ph.D. Thesis. Princeton
University, 2011.[4] S. M. Blackburn et al. The DaCapo benchmarks: java benchmarking development
and analysis. In Proc. of the 21st OOPSLA, 2006.
[5] D. Burger, J. Goodman, and A. Kägi. The Declining Effectiveness of Dynamic Caching for General-Purpose Workloads. University of Wisconsin Technical Report, 1995.
[6] C. F. Chen et al. Accurate and Complexity-Effective Spatial Pattern Prediction. InProc. of the 10th HPCA. 2004.
[7] T. M. Chilimbi, M. D. Hill, and J. R. Larus. Cache-Conscious Structure Layout. InProc. of the PLDI, 1999.
[8] B. Choi et al. DeNovo: Rethinking the Memory Hierarchy for Disciplined Paral-lelism. In Proc. of the PACT, 2011.
[9] B. Dally. Power, Programmability, and Granularity: The Challenges of ExaScaleComputing. In Proc. of the IPDPS, 2011.
[10] C. Dubnicki and T. J. Leblanc. Adjustable Block Size Coherent Caches. In Proc.of the 19th ISCA, 1992.
[11] A. Fog. Instruction tables. 2012. http://www.agner.org/optimize/instruction_tables.pdf.
[12] A. González, C. Aliagas, and M. Valero. A data cache with multiple cachingstrategies tuned to different types of locality. In Proc. of the ACM Intl. Conf. onSupercomputing, 1995.
[13] Z. Hu, S. Kaxiras, and M. Martonosi. Timekeeping in the memory system: predict-ing and optimizing memory behavior. In Proc. of the 29th ISCA, 2002.
[14] C. Hughes, C. Kim, and Y.-K. Chen. Performance and Energy Implications ofMany-Core Caches for Throughput Computing. In IEEE Micro Journal, 2010.
[15] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd. Power7: IBM’s Next-GenerationServer Processor. In IEEE Micro Journal, 2010.
[16] K. Kedzierski, M. Moreto, F. J. Cazorla, and M. Valero. Adapting Cache Parti-tioning Algorithms to Pseudo-LRU Replacement Policies. In Proc. of the IPDPS,2010.
[17] S. Kumar and C. Wilkerson. Exploiting spatial locality in data caches using spatialfootprints. In Proc. of the 25th ISCA, 1998.
[18] A.-C. Lai and B. Falsafi. Selective, accurate, and timely self-invalidation usinglast-touch prediction. In Proc. of the 27th ISCA, 2000.
[19] A. R. Lebeck and D. A. Wood. Dynamic self-invalidation: reducing coherenceoverhead in shared-memory multiprocessors. In Proc. of the 22nd ISCA. 1995.
[20] S. Li et al. McPAT: an integrated power, area, and timing modeling framework formulticore and manycore architectures. In Proc. of the 42nd MICRO, 2009.
[21] H. Liu et al. Cache bursts: A new approach for eliminating dead blocks andincreasing cache efficiency. In Proc. of the 41st MICRO, 2008.
[22] D. R. Llanos and B. Palop. TPCC-UVA: An Open-source TPC-C implementationfor parallel and distributed systems. In Proc. of the 2006 Intl. Parallel and Dis-tributed Processing Symp. PMEO Workshop, 2006. http://www.infor.uva.es/~diego/tpcc-uva.html.
[23] G. H. Loh and M. D. Hill. Efficiently enabling conventional block sizes for verylarge die-stacked DRAM caches. In Proc. of the 44th Intl. Symp. on Microarchitec-ture, 2011.
[24] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J.Reddi, and K. Hazelwood. Pin: building customized program analysis tools withdynamic instrumentation. In Proc. of the PLDI, 2005.
[25] Multifacet’s General Execution-driven Multiprocessor Simulator (GEMS) Toolset.In ACM SIGARCH Computer Architecture News, Sept. 2005.
[26] Sun Microsystems. OpenSPARC T1 Processor Megacell Specification. 2007.
[27] M. K. Qureshi et al. Adaptive insertion policies for high-performance caching. In Proc. of the 34th ISCA, 2007.
[28] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. Optimizing NUCA
Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. In Proc.of the 40th MICRO, 2007.
[29] P. Pujara and A. Aggarwal. Increasing the Cache Efficiency by Eliminating Noise.In Proc. of the 12th HPCA, 2006.
[30] M. K. Qureshi, M. A. Suleman, and Y. N. Patt. Line Distillation: Increasing CacheCapacity by Filtering Unused Words in Cache Lines. In Proc. of the 13th HPCA,2007.
[31] J. B. Rothman and A. J. Smith. The pool of subsectors cache design. In Proc. ofthe 13th ACM ICS, 1999.
[32] A. Seznec. Decoupled sectored caches: conciliating low tag implementation cost.In Proc. of the 21st ISCA, 1994.
[33] A. V. Veidenbaum, W. Tang, R. Gupta, A. Nicolau, and X. Ji. Adapting cache linesize to application behavior. In Proc. of the 13th ACM ICS. 1999.
[34] M. A. Watkins, S. A. Mckee, and L. Schaelicke. Revisiting Cache Block Super-loading. In Proc. of the 4th HIPEAC. 2009.
[35] D. H. Yoon, M. K. Jeong, and M. Erez. Adaptive granularity memory systems:a tradeoff between storage efficiency and throughput. In Proc. of the 38th ISCA.2011.
[36] D. H. Yoon, M. K. Jeong, M. B. Sullivan, and M. Erez. The Dynamic GranularityMemory System. In Proc. of the 39th ISCA, 2012.
[37] CPU DB. http://cpudb.stanford.edu/