
Characterizing the d-TLB Behavior of SPEC CPU2000 Benchmarks *

Gokul B. Kandiraju Anand Sivasubramaniam

Dept. of Computer Science and Engineering, The Pennsylvania State University

University Park, PA 16802. {kandiraj,anand}@cse.psu.edu

ABSTRACT

Despite the numerous optimization and evaluation studies that have been conducted with TLBs over the years, there is still a deficiency in an in-depth understanding of TLB characteristics from an application angle. This paper presents a detailed characterization study of the TLB behavior of the SPEC CPU2000 benchmark suite. The contributions of this work are in identifying important application characteristics for TLB studies, quantifying the SPEC2000 application behavior for these characteristics, and making pronouncements and suggestions for future research based on these results.

Around one-fourth of the SPEC2000 applications (ammp, apsi, galgel, lucas, mcf, twolf and vpr) have significant TLB miss rates. Both capacity and associativity influence miss rates, though they do not necessarily go hand-in-hand. Multi-level TLBs are definitely useful for these applications in cutting down access times without significant miss rate degradation. Superpaging to combine TLB entries may not be rewarding for many of these applications. Software management of TLBs, in terms of determining what entries to prefetch, what entries to replace, and what entries to pin, has a lot of potential to cut down miss rates considerably. Specifically, the potential benefits of prefetching TLB entries are examined, and Distance Prefetching is shown to give good prediction accuracy for these applications.

1. INTRODUCTION

Virtual to physical address translation is one of the most critical operations in computer systems, since it is invoked on every instruction fetch and data reference. To speed up address translation, computer systems that use page-based virtual memory provide a cache of recent translations called the Translation Lookaside Buffer (TLB). The TLB is usually a small structure, indexed by the virtual page number, that can be quickly looked up by the memory management unit (MMU) to get the corresponding physical frame number. The importance of the TLB has always been recognized, and emerging technological and application trends only reaffirm its dominance in determining system performance. Increases in instruction-level parallelism, higher clock frequencies, and the growing demands for larger working sets by applications continue to make TLB design and implementation critical in current and future microprocessors.

[* This research has been supported in part by several NSF grants including Career Award 9701475, 0103583, 0097998, 9988164, 9900701, 9818327 and 0130143.]

Several studies have quantified the importance of TLB performance for system execution [16]. Anderson et al. [1] show that TLB miss handling has an important consequence on performance, and that it is the most frequently executed kernel service. TLB miss handling has been shown to constitute as much as 40% of execution time [14] and up to 90% of a kernel's computation [27]. Studies with specific applications [28] have also shown that TLB misses can account for over 10% of their execution time even with an optimistic 30-50 cycle miss overhead.

With its importance widely recognized, the TLB has been the target of several optimizations to reduce access latency, miss rates and miss handling overheads. With regard to TLB structures themselves, there have been investigations of suitable organizations in terms of size, associativity and multi-level organizations [31, 23, 6]. Superpaging is a concept that has been proposed to boost TLB coverage, and hardware and software techniques for supporting this mechanism have come under a lot of scrutiny [31, 32, 11]. Most prior work on TLB optimization has targeted lowering miss rates or miss handling costs. It is only recently [28, 25, 3, 20] that the issue of prefetching TLB entries to hide all or some of the miss costs has started drawing interest. Many research findings on TLBs have also made their way into commercial offerings. A very nice survey of several of these TLB structures can be found in [15]. In all, a good deal of research has been undertaken on TLB design and evaluation. However, to our knowledge, there has not been any study characterizing the TLB support that is required from an application's perspective. At the same time, different studies have used different workloads to evaluate their designs/improvements, and the lack of a common ground makes a uniform comparison difficult. Further, one may ignore one or more issues when optimizing any particular detail, and such omissions can play an important role in the effectiveness of the optimizations. It is essential to uniformly examine a wide range of application characteristics at the same time (from an application's perspective) to gain better insights on:

- How can TLB structures/organization be optimized to reduce misses?

- How much scope is there to benefit from superpaging?

- Can we use the program's static code structure and/or dynamic instruction stream to trigger optimizations?

- Would the flexibility provided by software TLB management (see footnote 1) offset the higher overheads of miss handling compared to a hardwired approach?

- How well suited are applications to prefetching TLB entries? Which prefetching techniques should be employed, and under what circumstances?

- Do we have a large enough window to benefit from prefetching? If not, what other techniques should we employ for latency tolerance/reduction?

Characterizing the behavior of an application is crucial in any systems design endeavor, as many application-driven studies have shown [17, 9]. Application characteristics can specify what is really important from an application's perspective, identify bottlenecks in the execution for a closer look, help evaluate current/proposed designs with real workloads, and even suggest architectural/OS enhancements for better performance. While there have been characterization efforts in the context of other processor features, caches, I/O and interconnects, this issue has not been looked at previously for TLBs.

Towards addressing this deficiency, this paper sets out to examine the different characteristics of applications from the SPEC2000 suite that affect TLB performance, to answer many of the questions raised above. The SPEC2000 suite [8, 13] contains 26 candidate C, C++ and Fortran applications for CPU evaluations that encompass a wide spectrum of application domains containing interesting and important problems for several users. We have used all the applications from this suite in this study to quantify their TLB behavior for different configurations, but focus specifically on those incurring the highest miss rates for the characterization effort. We limit ourselves to the d-TLB (data references only) in this study, since data references are usually much less well behaved than instruction references in terms of misses (i-TLB miss rates in our experiments are very low).

The next section gives a quick overview of prior research on TLBs. The experimental setup is detailed in Section 3, and the results from the characterization are presented in Section 4. Section 5 summarizes the lessons learnt from this study and outlines suggestions for future TLB design and optimization.

2. RELATED WORK

As was mentioned earlier, many studies [7, 15, 1, 14, 27] have pointed out the importance of the TLB and the necessity of speeding up the miss handling process.

Several studies [31, 15, 23] have looked at hardware TLB structures/organization and their impact on system performance in terms of capacity and/or associativity. While some of these have focused on single (monolithic) TLBs, others have investigated the benefits of multi-level TLBs [6, 2]. There are also implementations of multi-level TLBs in commercial processors such as the MIPS R4000, Hal's SPARC64, the IBM AS/400 PowerPC, the AMD K-7 and the Intel Itanium. With instruction-level parallelism (ILP) being exploited by most current processors, there is a need to provide multi-ported TLBs to allow several concurrent instruction streams to access the TLB. Austin and Sohi [2] show how multiple ports can impact access latencies, and argue for interleaved and multi-level designs.

[Footnote 1: We would like to differentiate between the terms software TLB management and software TLB handling in this paper. We use the latter to denote that miss handling is done by software, i.e. the operating system; the former denotes more sophisticated software that can control placement, pinning and replacement of TLB entries. While many current systems provide software miss handlers, software TLB management has not been investigated extensively.]

TLB miss handling costs need to be kept low for good performance. Commercial processors use either a hardware mechanism or a software mechanism to fill the TLB on a miss. Unlike hardware-handled TLB misses, which have a relatively small refill penalty, handlers for software-managed TLBs need to be carefully crafted to reduce both the misses and the miss handling costs. Nagle et al. [23] study the influence of the operating system on the software-managed MIPS R2000 TLB, and investigate the impact of size, associativity and partitioning of TLB entries (between OS and application). They point out that the operating system has a considerable influence on the number and nature of misses. Bala et al. [3] focus specifically on interprocess communication activities, and illustrate software techniques for lowering miss penalties on software-managed TLBs.

Superpaging is another well-investigated technique to boost the coverage of the TLB and better utilize its capacity [31, 32, 30, 12]. Studies have looked at hardware and operating system support for providing superpage translations in the TLB. Recent work in this area [12] investigates memory controller support for remapping pages so that there is more scope for creating superpage entries (without incurring the overheads of copying).

While most of the prior work has looked at lowering TLB misses or miss handling costs, the issue of hiding/tolerating TLB fills via prefetching has come under investigation only recently [28]. Prefetching of cold-start entries has been studied in [3] and [25]. On the other hand, capacity-related misses are typically much more dominant (as our results will show), particularly for large codes such as those in the SPEC2000 suite. Prefetching TLB entries for capacity-related misses has been undertaken in [28] and [20]. More details on these approaches are discussed later in this paper when we examine their suitability for the SPEC2000 benchmarks.

3. EXPERIMENTAL SETUP

We have studied the TLB behavior of all 26 applications from the SPEC2000 suite. The benchmarks were compiled on an Alpha 21264 machine using Compaq's cc V5.9-008, cxx V6.2-024, f77 V5.3-915 and f90 V5.3-915 compilers with the -O4 (-O5 for Fortran) optimization flags, which enable loop unrolling, software pipelining using dependency analysis, vectorization of some loops, inline expansion of small functions, etc. All the simulations are conducted using the SimpleScalar-3.0 toolset [4], which simulates the Alpha architecture. Since we are mainly interested in TLB behavior, we have modified the sim-cache component of this toolset by adding a TLB simulator that traps all memory references. sim-cache does a functional (not cycle-by-cycle) simulation of instructions, and we only examine the memory references for the d-TLB investigation. While there could be some effects due to instructions retiring in a different order than with a cycle-accurate simulator, we do not feel that this will substantially change the results given in this paper, since we usually find that TLB misses are spaced far enough apart that their relative ordering is not affected. We also found that what gets replaced is not significantly affected by the coarser simulation granularity.
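For concreteness, the following is a minimal Python sketch (our own illustration, not the authors' actual sim-cache modification) of the kind of set-associative d-TLB simulator with LRU replacement that could be driven by a functional simulator's data-reference stream; all names here are assumptions.

```python
# Minimal sketch of a set-associative d-TLB simulator with LRU replacement.
from collections import OrderedDict

PAGE_SIZE = 4096  # 4KB pages, as assumed in this study

class TLB:
    def __init__(self, entries=128, ways=4):
        # ways == entries gives a fully associative TLB (one set)
        self.sets = entries // ways
        self.ways = ways
        # one LRU-ordered dict of resident vpn -> None per set
        self.table = [OrderedDict() for _ in range(self.sets)]
        self.refs = self.misses = 0

    def access(self, vaddr):
        vpn = vaddr // PAGE_SIZE
        s = self.table[vpn % self.sets]
        self.refs += 1
        if vpn in s:
            s.move_to_end(vpn)          # hit: update LRU order
        else:
            self.misses += 1
            if len(s) >= self.ways:
                s.popitem(last=False)   # evict the LRU entry
            s[vpn] = None

    def miss_rate(self):
        return self.misses / self.refs if self.refs else 0.0
```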

Appln.    i-TLB Miss Rate   Appln.    i-TLB Miss Rate   Appln.    i-TLB Miss Rate   Appln.    i-TLB Miss Rate
galgel    1.4 x 10^-8       mcf       4.4 x 10^-9       ammp      7.3 x 10^-9       apsi      2.46 x 10^-7
vpr       7.2 x 10^-9       lucas     1.1 x 10^-8       twolf     1.28 x 10^-8      facerec   8.8 x 10^-8
art       3 x 10^-9         bzip2     4.4 x 10^-9       parser    8.28 x 10^-8      vortex    2.74 x 10^-4
crafty    1.8 x 10^-8       swim      1.3 x 10^-8       applu     5.75 x 10^-8      gcc       1.08 x 10^-4
mesa      3.6 x 10^-8       mgrid     1.5 x 10^-8       equake    5.55 x 10^-9      perlbmk   3.35 x 10^-5
wupwise   1.2 x 10^-8       sixtrack  8.8 x 10^-8       gap       2.34 x 10^-8      fma3d     3.42 x 10^-5
gzip      5 x 10^-9         eon       6.8 x 10^-8

Table 1: i-TLB miss rates for all the applications using a 64-entry, 4-way set-associative i-TLB for the first 7 billion instructions.

As was mentioned earlier, we are only examining the data references (d-TLB). Table 1 shows the i-TLB miss rates, which are very low for these applications. Some studies have pointed out the influence of the OS on TLB behavior [23]. However, similar to many other studies [28, 6], in this paper we examine only application TLB references (we do not simulate the OS), since our focus is on investigating application-level characteristics. Issues of interference between application and OS TLB entries, or of reserving some entries for the OS, are not considered here (we understand that these issues can and have been shown to have a considerable influence on TLB performance). The effect of co-existing applications and context switching, i.e. TLB fills after a context switch, is not considered either. One could think of the TLB as being saved and restored at context switches to capture purely application effects. Simulations have been conducted with different TLB configurations - sizes (64, 128, 256, and 512 entries) and associativities (full, 4-way and 2-way) - and we assume a page size of 4KB. We do not consider the effect of page faults on TLB behavior (the entry needs to be invalidated on a page replacement), since we believe that page faults are far too infrequent relative to TLB misses to have a meaningful influence. Further, all these applications take less than 256MB of space, which can be accommodated in most current systems.

Simulation of these benchmarks is time-consuming, as has been pointed out by others [24, 5]. In this study, we do not attempt to execute these applications to completion. Rather, we have examined the TLB behavior over five billion instructions after skipping the first two billion instructions of the execution for each application. We believe that this ignores the initialization/start-up properties of the application and captures the representative facets of the main execution. The simulated instructions constitute around 1% (parser) to 12% (mcf) of the application run time [5].

For gzip/bzip2, perlbmk and vortex, the input files that we used are input.source, diffmail.pl and lendian3.raw respectively.

The term miss rate, which is often used in the following discussion, is defined as the number of TLB misses expressed as a percentage of the total number of memory references.

4. CHARACTERIZATION RESULTS

4.1 What is the impact of TLB structure?

As a starting step, we first examine the overall TLB performance of the different applications for 9 different TLB configurations (combinations of three sizes - 64, 128, and 256 entries - and three associativities - fully associative (FA), 4-way and 2-way). The resulting miss rates are shown in Figure 1, and we observe the following:

- A diverse spectrum of TLB characteristics is exhibited by these applications. Applications such as gzip, eon, perlbmk, gap, vortex, wupwise, swim, mgrid, applu, mesa, art, equake, facerec and fma3d incur few TLB misses. On the other hand, applications such as vpr, mcf, twolf, galgel, ammp, lucas, sixtrack and apsi have a significant number (greater than 1%) of TLB misses, at least in some of the configurations. TLB miss penalties can be quite significant. For instance, the IBM PowerPC 603, which handles misses in software, incurs a latency of 32 cycles just to invoke and return from the handler, leaving aside the cost of loading the TLBs [10]. Even if we optimistically assume a TLB miss penalty of 30 cycles, a 1% TLB miss rate can result in significant overheads: with over one-third of the instructions in these applications being memory references, that is roughly 1/3 x 0.01 x 30 = 0.1 cycles per instruction, or around 10% of the overall execution time for a CPI of 1.0. Other studies have pointed out that TLB miss handling overheads can be even higher [12, 28, 11]. Applications such as bzip, gcc, crafty and parser fall between these extremes. In the interest of space, we focus on the six applications which incur the highest miss rates - galgel, lucas, mcf, vpr, apsi and twolf - for clarity in the rest of this paper.

- Even though making the TLB larger helps, the benefit is not very significant in most of these applications except vpr, twolf and mcf.

- Applications such as vpr, mcf, twolf, lucas and apsi are more influenced by associativity than the others.

These results reiterate the importance of the TLB in the execution of these applications. Nearly a fourth of the SPEC2000 suite has TLB miss rates high enough to be affected by the miss penalty. While capacity and associativity do help in some cases, they are not universal solutions to the TLB bottleneck. In the interest of space, we focus on a 128-entry fully associative TLB in the rest of the paper.

4.2 Will a multi-level TLB help?

Processors such as the Itanium (32-entry L1, 96-entry L2) and AMD Athlon (32-entry L1, 256-entry L2) provide multi-level TLB structures instead of a single monolithic TLB: lookup proceeds first in a smaller first-level TLB, and only on a miss there do we go to the second-level TLB. With a smaller first-level TLB, the overall TLB lookup time can become much lower as long as we have good hit rates in the first level. Evaluations of 2-level TLBs have been conducted by others [6], but their benefit for the SPEC2000 workloads has not been investigated to our knowledge. An in-depth investigation of the impact of multi-level TLBs requires varying numerous parameters for each of the levels. Rather than undertake such a full factorial experiment, we limit our study to a 2-level TLB with a 32-entry fully associative 1st level and a 96-entry fully associative 2nd level, similar to the Itanium's (IA-64) structure.

[Figure 1: TLB miss rates of all the SPEC2000 applications for different TLB sizes (64, 128, 256 entries) and three associativities (2-way, 4-way and fully associative).]

We also assume that the 2-level TLB obeys the containment property (whatever is in the first level is duplicated in the second level). Table 2 shows the hit and miss rates with this 2-level structure for the applications.

To study the impact of a hierarchical TLB, we compare these results with those of a single monolithic TLB of the same size (32 + 96 = 128 entries) in Table 2. It is to be expected that the 2-level TLB makes less effective use of the overall space because of the containment property, and as a result we do find slightly higher miss rates for the six applications (especially twolf, vpr, apsi and mcf). On the other hand, the benefit of a multi-level TLB would be felt in the much lower access time of the 1st-level TLB and the avoidance of accessing the second level - which is in turn determined by the miss rate of the 1st-level TLB.

Assume that the access time for a single monolithic TLB (i.e., 128-entry in this situation) is a. Let the access times for the 1st and 2nd level TLBs in the hierarchical alternative be a1 and a2 respectively. Let us denote the miss fractions of the monolithic TLB, and of the 1st and 2nd levels of the hierarchical TLB, by m, m1, and m2 respectively. Also, let the cost of fetching a translation that is not in the TLB be denoted by C. Then, the cost of translating an address in the monolithic structure (Cm) is

    Cm = a + m * C                                   (1)

The cost of translating an address in the 2-level TLB (Cs) is given by

    Cs = a1 + m1 * (a2 + m2 * C)                     (2)

The 2-level TLB is a better alternative when

    a1 + m1 * (a2 + m2 * C) < a + m * C              (3)

i.e.,

    m1 < (a - a1 + m * C) / (a2 + m2 * C)            (4)

Actual access times for associative memories of different sizes are hardware sensitive, and fairly accurate models have been developed. We use the model described in [22]. According to this model, access times for 32-, 96- and 128-entry TLBs would be approximately 2.7ns, 2.9ns and 3.0ns respectively. Plugging in these values and assuming an optimistic 30-cycle miss penalty (C = 30) for a 1 GHz processor, we find that the miss rate (m1) in the first-level TLB has to be less than 10.15%, 4.05%, 4.97%, 2.44%, 3.04% and 21.71% for mcf, twolf, vpr, lucas, apsi and galgel respectively. If we compare these numbers with the actual miss rates in the 1st level TLB in Table 2, the multi-level structure is definitely a better choice for twolf, vpr, lucas and apsi. For mcf and galgel, the choice is questionable since the difference in miss rate is not significant (and we have used a very optimistic 30-cycle miss penalty).
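As a concrete check of inequality (4), the sketch below (the helper name is our own) plugs in the access times quoted above and the monolithic and 2nd-level miss rates from Table 2; it reproduces the break-even first-level miss rates listed in the previous paragraph.

```python
# Break-even first-level miss rate for the 2-level TLB, from Eq. (4).
def breakeven_m1(a, a1, a2, C, m, m2):
    """Largest 1st-level miss rate for which the 2-level TLB still wins."""
    return (a - a1 + m * C) / (a2 + m2 * C)

# Access times in ns for 128-, 32- and 96-entry TLBs; C = 30 cycles at 1 GHz.
a, a1, a2, C = 3.0, 2.7, 2.9, 30.0

# (monolithic miss rate m, 2nd-level miss rate m2) taken from Table 2
apps = {'mcf': (0.0899, 0.888), 'twolf': (0.0129, 0.468),
        'vpr': (0.0164, 0.434), 'lucas': (0.0163, 0.98),
        'apsi': (0.0196, 0.874), 'galgel': (0.228, 0.9995)}

for name, (m, m2) in apps.items():
    print(f'{name}: m1 must be below '
          f'{100 * breakeven_m1(a, a1, a2, C, m, m2):.2f}%')
# e.g. mcf -> 10.15%, galgel -> 21.71%
```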

4.3 Can we improve TLB coverage? (Superpaging)

The previous set of results showed how much we can gain by improving the structure (i.e., capacity and associativity) of the TLB, and these gains are obtained with an increase in hardware complexity. A recent trend [31] to improve performance without significantly increasing hardware costs is the concept of superpaging. TLBs that support superpaging use a single entry for translating a set of consecutive virtual pages - the number of pages per entry is typically a power of two - as long as these pages are physically contiguous as well. As a result, pages that are accessed at around the same time and which are virtually adjacent to each other can share the same TLB entry, thus freeing up slots for other translations (to improve TLB coverage).

Application   1st Level TLB        2nd Level TLB        Overall Miss Rate   Monolithic TLB
              Hits      Misses     Hits      Misses     for 2-level TLB     Hits      Misses
mcf           89.5%     10.5%      11.2%     88.8%      9.24%               91.01%    8.99%
twolf         96.57%    3.43%      53.2%     46.8%      1.60%               98.71%    1.29%
vpr           94.5%     4.5%       56.6%     43.4%      1.95%               98.36%    1.64%
lucas         98.33%    1.67%      2%        98%        1.64%               98.37%    1.63%
apsi          97.39%    2.61%      12.6%     87.4%      2.28%               98.04%    1.96%
galgel        77.1%     22.9%      0.05%     99.95%     22.8%               77.2%     22.8%

Table 2: Hit and miss rates for the 2-level TLB configuration. The table shows the hit and miss rates for each of the two levels as a percentage of the number of references to that level, as well as the overall miss rate, which is the percentage of references that do not find a translation in either level. Also shown are the miss rates for a single monolithic TLB of the same size.

[Figure 2: The TLB size that would suffice if contiguous virtual page translations were combined into superpage entries, plotted over the course of execution (misses on the x-axis) for apsi, lucas, mcf, galgel, twolf and vpr. A 128-entry fully associative TLB is used; the horizontal line shows the average.]


To study the potential benefits of superpaging for these applications, we next examine the contents of the TLB during the course of execution. In particular, we are interested in finding out how small a TLB would suffice to hold all the translations if it supported superpaging. At every miss (since the TLB changes only on a miss), we examine all the TLB entries and find out how many of them can be combined into single entries (i.e., they are virtually contiguous pages). The number of resulting entries is plotted in Figure 2 over the course of execution (in terms of misses on the x-axis). It should be noted that we are not restricting ourselves to powers of two, and we assume that the operating system will automatically allocate virtually contiguous pages in contiguous physical locations, since our goal here is to examine the potential of superpaging from an application perspective. For better clarity, only a small window of the execution is plotted (similar behavior is seen over a much larger window).
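A minimal sketch of this measurement under the same assumptions (arbitrary run lengths, and physical contiguity guaranteed by the OS): count the runs of consecutive virtual page numbers among the resident TLB entries, since each run could be covered by a single superpage entry. The function name is our own.

```python
# How many entries remain if runs of virtually contiguous pages in the
# TLB are each collapsed into one (super)page entry.
def combined_entry_count(vpns):
    """vpns: virtual page numbers currently resident in the TLB."""
    pages = sorted(set(vpns))
    runs, prev = 0, None
    for p in pages:
        if prev is None or p != prev + 1:   # a gap starts a new run
            runs += 1
        prev = p
    return runs

# Example: pages 10,11,12 and 40,41 collapse into 2 superpage entries.
assert combined_entry_count([12, 10, 41, 11, 40]) == 2
```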

We find quite different behaviors across the applications. Applications like lucas, mcf, galgel and apsi may not benefit much from superpaging. These applications require at least around 100 TLB entries on average, without much scope for combining these entries. Provisioning superpage support may not improve their performance much, while adding hardware complexity. On the other hand, applications like twolf and vpr show the potential to benefit significantly from superpaging mechanisms.

4.4 Where are the Misses Occurring?

We move on to examining how the TLB misses are distributed over the execution of the program. This can help us identify hot spots in the execution that are worth a closer look for optimizations. Dynamically chosen/triggered optimizations based on different phases of the execution may be beneficial, if we do find well-defined phases.

Since we are working with the sim-cache version of SimpleScalar, we do not have time stamps for the memory references and the TLB misses. Instead, we plot the TLB misses as a function of the number of executed instructions (in thousands) in Figure 3. Note that the scale of the graph can make it look as though there are multiple y-values for a given x, while this is actually not the case.

[Figure 3: TLB misses as the application executes, for mcf, apsi, vpr, galgel, twolf and lucas. The x-axis denotes the number of instructions executed (in thousands), and the y-axis denotes the number of misses per thousand instructions.]

While mcf does show some phases of distinct TLB miss activity, we found that most of the other applications do not have such demarcated phases. The TLB misses are more or less uniformly distributed through the execution, suggesting that dynamic triggers in those cases may not be very helpful.

We would like to mention that we also have results showing which program counter values incur these misses. We usually find that just a handful of program counter values contribute to the bulk of the TLB misses, as is shown in Figure 4 for mcf, twolf and vpr.

4.5 How far apart are the misses temporally?

The temporal separation of misses is an important characteristic for examining possible optimizations. If most misses are fairly close to each other temporally, then mechanisms that prefetch a TLB entry based on the previous miss may not enjoy a large enough window of overlap to completely hide miss latencies. In such situations, we should explore techniques for bulk loading/prefetching TLB entries. A significantly larger window suggests that more sophisticated prefetching techniques (which may take more time) can be viable. To our knowledge, no previous work has looked at the temporal separation of TLB misses. Figure 5 shows the cumulative distribution of temporal miss separations (i.e., the y-axis shows the fraction of misses that are separated by at most a given value on the x-axis), where the separations are expressed in terms of instructions. A steep curve indicates that more misses have very short temporal separations.

[Figure 5: Cumulative distribution function (CDF) of miss separation, for apsi, galgel, lucas, mcf, twolf and vpr. The x-axis shows the number of instructions between successive misses (log scale), and the y-axis denotes the fraction of misses that have separations of at most that many instructions.]

For nearly all applications, a large portion of misses are separated by at most 30-50 instructions. For instance, galgel and apsi have over 90% of misses separated by fewer than 50 instructions. mcf also has small temporal separations, with around 70% of misses falling within 50 instructions. Only twolf and vpr have less steep curves. In some applications (lucas and galgel), a few miss distances dominate the execution, suggesting that the TLB misses occur very periodically/regularly.
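For concreteness, a small sketch (our own helper, not part of the authors' toolchain) of how such a separation distribution can be computed from the instruction indices at which misses occur:

```python
# Fraction of misses separated by at most each threshold (in instructions).
def separation_cdf(miss_instr_indices, thresholds=(10, 30, 50, 100, 1000)):
    seps = [b - a for a, b in zip(miss_instr_indices,
                                  miss_instr_indices[1:])]
    if not seps:
        return {}
    n = len(seps)
    return {t: sum(s <= t for s in seps) / n for t in thresholds}

# e.g. separation_cdf(indices)[50] -> fraction of misses within 50 instrs
```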

These results provide insight into the time constraints that prefetch mechanisms need to operate under. As CPIs get smaller, this window can shrink further. This suggests that prefetching techniques not only need to be accurate (note that better accuracy would increase the temporal separation between misses), but also nimble enough to make fast predictions without adding too much overhead.

[Figure 4: Program counter values incurring TLB misses, for mcf, twolf and vpr. The x-axis shows the program counter value and the y-axis the number of misses.]

4.6 How would software TLB management help?

TLB management essentially tries to (i) bring the entry that is needed into the TLB at some appropriate time, and (ii) evict an entry currently present to make room for this incoming entry. These two actions play a key role in determining performance. The first action determines the overhead that is incurred when a miss is encountered, while the latter affects the miss rate itself (the choice of replacement candidate will determine future misses). Hardware TLB management, though fast, hardwires these mechanisms, allowing little scope for flexibility. Typically, the missing entry is brought in on demand at the time of a miss (if we discount any prefetching), and replacement is usually determined by LRU (either globally, for a fully associative TLB, or within a set for a set-associative one). On the other hand, selecting the replacement candidate in software - either by the application or by the compiler - can potentially find better alternatives than a hardwired LRU. To our knowledge, this issue has not been investigated in prior research in the context of TLBs. Most current TLBs do not provide such capabilities, but strong supporting arguments in favor of this could influence future offerings. Prefetching TLB entries to tolerate/hide the miss latency is discussed later in Section 4.8.

A detailed investigation of software-directed replacement techniques is well beyond the scope of this paper. Rather, we would like to find out the best that we can ever do, so that we can get a lower bound on the miss rates achievable with any software-directed replacement scheme. Such a study can be conducted by simulating the OPT replacement policy, i.e., at any instant replace the entry whose next use is furthest in the future. It can be proved that no other replacement scheme can give lower miss rates than OPT (which is itself not practical/implementable). We simulate a TLB using OPT and compare the results with those of a fully associative TLB of the same size using LRU. We would like to mention that simulating OPT exactly is extremely difficult (as pointed out by others [29]), since it involves looking infinitely far into the future. Instead, at every miss, we look ahead one million references in our simulation (which we call OPT'). We have tried looking further into the future, and the results are not very different, leading us to believe that the results with OPT' are very close to those of OPT. Table 3 compares the miss rates for OPT' and the TLB using LRU for three different sizes (note particularly the columns giving the ratio of miss rates between the two approaches).
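The following is a hedged sketch of such a bounded-lookahead OPT' policy; the code is our own illustration, written for clarity rather than speed, since the paper does not specify the simulator internals beyond the one-million-reference lookahead.

```python
# Bounded-lookahead Belady replacement (OPT'): evict the resident page
# whose next use within the lookahead window is furthest away; pages
# unused in the window are preferred victims. O(window) per miss, so
# this is illustrative rather than efficient.
LOOKAHEAD = 1_000_000

def opt_prime_misses(trace, capacity):
    """trace: list of virtual page numbers; returns the miss count."""
    resident, misses = set(), 0
    for i, vpn in enumerate(trace):
        if vpn in resident:
            continue
        misses += 1
        if len(resident) >= capacity:
            window = trace[i + 1 : i + 1 + LOOKAHEAD]
            next_use = {p: LOOKAHEAD for p in resident}
            for j, q in enumerate(window):
                if q in next_use and next_use[q] == LOOKAHEAD:
                    next_use[q] = j          # first future use of page q
            victim = max(resident, key=lambda p: next_use[p])
            resident.remove(victim)
        resident.add(vpn)
    return misses
```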

When the TLB size is very small, either scheme incurs a lot of misses, and the differences between the schemes are less significant than at larger TLB sizes (in many cases). At the other end, when the TLB size gets large enough that the working set fits in it, the differences between the schemes again become less prominent. We find that for these applications, a 256-entry TLB is still small enough that the benefits of OPT' over LRU continue to increase with size. mcf is an exception at small TLB sizes.

There are some interesting observations/suggestions we can make from these results. First, we find that in many situations OPT' gives substantially lower miss rates than LRU, suggesting that software-directed replacement needs to be explored further. In some applications (such as ammp), the reference behavior is statically analyzable, making compiler/application-directed optimizations worthy of exploration. While the dynamic reference behavior of other applications may not be readily amenable to static optimizations, our results show that it is definitely worthwhile to explore a combination of static and runtime approaches to bring the miss rates closer to OPT'.

Second, despite the significant improvements with OPT', we find that there is still a reasonably large miss rate that no software-directed replacement scheme can reduce. This motivates the need for miss latency tolerance techniques such as prefetching (by software or hardware), which is explored in Section 4.8.

4.7 How far apart are the misses spatially?

Having seen the temporal separation of misses, we next move on to examining the spatial distribution of the misses in terms of the virtual addresses themselves. Any observable patterns/periodicity would be extremely useful for prefetching TLB entries, from both the architectural and compiler perspectives. Figure 6 shows which virtual addresses (on the y-axis) incur misses during the execution of the applications (expressed in terms of the misses on the x-axis). In consideration of the scale of the graph, the x-axis only shows a small representative window of the execution so that any observable patterns are more noticeable. Again, the scale of the graph may make it appear as though there are two y-values for a given x-value, though this is not really the case. It should be noted that there are earlier studies which have drawn such graphs [6] for application reference behaviors. The difference is that we are looking at the misses (not the references), since that is what TLB prefetch engines work with (as discussed next), and also that no one has looked at these graphs for the SPEC2000 applications.

Application   64-entry                     128-entry                    256-entry
              OPT'    LRU    OPT'/LRU     OPT'    LRU    OPT'/LRU     OPT'    LRU    OPT'/LRU
mcf           9.3%    9.73%  0.955        8.86%   8.99%  0.985        7.34%   8.1%   0.90
twolf         1.3%    2.08%  0.625        0.65%   1.29%  0.503        0.2%    0.53%  0.377
vpr           1.54%   2.58%  0.596        0.80%   1.64%  0.487        0.23%   0.66%  0.348
lucas         1.64%   1.64%  1.0          1.62%   1.63%  0.99         1.57%   1.61%  0.975
apsi          1.85%   2.34%  0.79         1.38%   1.96%  0.704        0.77%   1.79%  0.43
galgel        22.5%   22.8%  0.986        19.8%   22.8%  0.868        14.4%   22.7%  0.634

Table 3: Comparing miss rates for the OPT' and LRU replacement policies.

[Figure 6: The data (virtual) addresses that miss during program execution, with misses on the x-axis (time), for apsi, lucas, mcf, galgel, twolf and vpr.]

We observe that many applications exhibit patterns in the addresses that miss, which repeat over time. For applications such as galgel and apsi, and to a lesser extent mcf and lucas, the patterns are more discernible. In twolf and vpr, the patterns, if any, are not very clear. These patterns are a result of the iterative nature of many of these executions, which repeat their reference behavior after a while.

4.8 What is the benefit of prefetching?

As was pointed out in Section 4.6, despite all the optimizations we can apply to reduce miss rates (either with fancy TLBs or with fancy management schemes), it is still important to be able to reduce the latency incurred on a miss by hiding some or all of this cost. Prefetching is one such approach that has been explored in depth in the context of caches and I/O, but it drew little interest in the context of TLBs until recently [28, 3, 25, 20]. In a broad sense, prefetching mechanisms can be viewed in two classes: ones that capture strided reference patterns and ones that base their decisions on history. We consider three prefetching mechanisms in this paper to estimate the potential of prefetching TLB entries. The first technique, Markov prefetching, is based on limited history; it has been studied previously [18] in the context of caches, and we have adapted it for TLBs. The second technique, called Recency-based prefetching, is based on a much larger history (until the previous miss to a page) and is described in [28]. The third is a stride-based technique called Distance Prefetching that has also been proposed for TLBs [20]. The Markov and Distance implementations use a prediction table. Markov indexes this table with the current missing page, and the table gives two other pages that also missed immediately after this page, which are then prefetched. Distance Prefetching indexes the table by the current distance (the difference in page number between the current miss and the previous miss), and the table gives two other distances that occurred immediately after this distance was encountered. Recency prefetching keeps an LRU stack of TLB misses by maintaining two additional pointers in each page table entry (kept in memory). The reader is referred to [18, 28, 20] for further implementation details. Since TLB access is extremely time-critical (at least on the hit path), prefetching is usually attempted only at the time of a miss. As a result, all the prefetching techniques we consider observe only the miss reference stream to make decisions. We assume the existence of logic to observe these misses, make predictions, and bring those entries into a prefetch buffer. At the time of a miss, this buffer can be consulted concurrently with searching the TLB to find out if the entry has already been prefetched. Prefetching does not increase or reduce the number of misses, since it never directly loads the fetched entries into the TLB. The only drawbacks of wrong predictions are evicting entries already present in the prefetch buffer and the additional memory traffic that prefetching induces.
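As an illustration of the table-driven predictors, here is a hedged sketch of Distance Prefetching in the spirit of [20]; the class name and the two-successor replacement policy are our assumptions, not the paper's exact implementation.

```python
# Distance-based TLB prefetch predictor: the table is indexed by the
# distance between consecutive missing page numbers, and each entry
# remembers up to two distances seen to follow it.
class DistancePredictor:
    def __init__(self):
        self.table = {}        # distance -> up to two successor distances
        self.prev_vpn = None   # page number of the previous miss
        self.prev_dist = None  # distance between the last two misses

    def on_miss(self, vpn):
        """Called on each d-TLB miss; returns page numbers to prefetch."""
        predictions = []
        if self.prev_vpn is not None:
            dist = vpn - self.prev_vpn
            # Learn: record this distance as following the previous one.
            if self.prev_dist is not None:
                succ = self.table.setdefault(self.prev_dist, [])
                if dist in succ:
                    succ.remove(dist)
                succ.insert(0, dist)
                del succ[2:]               # keep at most two successors
            # Predict: pages at the distances that usually follow dist.
            predictions = [vpn + d for d in self.table.get(dist, [])]
            self.prev_dist = dist
        self.prev_vpn = vpn
        return predictions

# On the miss stream 1, 2, 3, 4 (all distance 1), the predictor learns
# that distance 1 follows distance 1 and starts prefetching page + 1.
p = DistancePredictor()
for page in [1, 2, 3, 4]:
    hints = p.on_miss(page)
print(hints)   # -> [5]
```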

We use prediction accuracy to compare the effectiveness of the techniques. Prediction accuracy is the fraction of misses whose addresses were accurately predicted, so that translations for them can be found in the prefetch buffer when these misses are actually encountered. Remember that mispredictions can only throw away entries from the prefetch buffer (not from the TLB). A 16-entry prefetch buffer is used in all cases for a uniform comparison.

We would like to mention that an in-depth investigation of these prefetching mechanisms, varying all the different parameters, is not conducted here; interested readers can refer to [20]. While we have investigated several parameters, Figure 7 shows the prediction accuracy for all the SPEC2000 applications for a fully associative TLB of 128 entries. For Markov and Distance prefetching, the number of entries maintained in the prediction table and the associativity of the table are also varied, and these parameters are indicated in the figure, where D stands for direct-mapped, F for fully associative, and 2 or 4 refers to the associativity. For example, Markov512D means a Markov predictor with a direct-mapped 512-entry table.

We observe that prefetching has the potential to reduce TLB misses significantly in 18 applications (gzip, gcc, crafty, parser, gap, vortex, wupwise, swim, applu, mgrid, mesa, galgel, art, facerec, ammp, equake, sixtrack and apsi), while applications like vpr, mcf, eon, bzip, twolf and fma3d hardly benefit from prefetching. For the others, the mechanisms provide moderate accuracy.

Distance Prefetching does better than Recency and Markov in 13 applications. It does extremely well compared to Markov in many applications (gzip, mgrid, applu, wupwise, ammp, lucas, swim), because these applications have very large working sets and hence require a large number of entries in the Markov prediction table. Since Distance prefetching requires fewer entries to capture such regular patterns, it does much better. In some applications, such as twolf and crafty, Markov does better than Distance because their working sets are much smaller (only 2-4 MB) and Markov captures past history better than Distance does. While it may appear that some bars are missing in this graph (the bars for eon), this is not really the case; the prediction accuracies there are simply very low. In many cases (gcc, gap, vortex, mesa, lucas, parser and sixtrack), though Distance does not outperform both Recency and Markov, it does as well as or better than at least one of them. Another characteristic we find is that Distance prefetching is not very sensitive to the choice of table parameters considered, while Markov shows more variability across these parameters. This is a consequence of the fact that Distance is able to capture the patterns in a much smaller space.

Overall, we find that Distance prefetching is either the best or shares first place in prediction accuracy for 18 of the 26 applications. It is thus a better alternative than Markov prefetching. Between Recency and Distance, the advantage of the former is that it does not maintain a prediction table. However, the additional page table size cost of Recency and the additional memory traffic that it induces (several pointers in the page table need to be updated to maintain the LRU ordering) can tilt the balance in favor of the latter. Remember that we do not have too large a window to accommodate such costs, as shown in Section 4.5. The poor performance of prefetching in some applications suggests that a lot more work is needed on hardware/runtime/compiler-directed prefetching. This is particularly true for mcf, twolf and vpr, where these prefetching mechanisms are not as effective and the miss rates are high.

5. DISCUSSION OF RESULTS AND CONCLUDING REMARKS

Examining several TLB characteristics of the SPEC2000 applications has revealed interesting results and insights. Our results reiterate what others have pointed out: TLB design and optimization for lowering miss rates and miss penalties is extremely critical. Around one-fourth (seven) of the applications from the suite have miss rates higher than 1% on a 128-entry fully associative TLB (which is representative of current structures). This can result in a 10% overhead in execution time, even if we optimistically figure a 30-cycle miss penalty and a CPI of 1.0. Some other studies have used miss penalties that are even higher [12, 11], and lower CPIs [21, 26], for these benchmarks. Further, as processor ILP and clock speeds continue to increase, the TLB miss handling bottleneck will become even more significant.

From the viewpoint of TLB organization, there are some applications where associativity is more important than capacity (beyond, say, a 128-entry TLB), and some applications where the reverse is true. We also find that the 2-level TLB, already available on some commercial processors, would benefit the SPEC2000 applications by reducing the access time in the common case without significantly affecting miss rates.

There are differences between the applications in the extent to which they can benefit from superpaging. Some applications access several contiguous virtual pages at the same time, requiring just a few superpage entries in the TLB. But we found others where superpaging does not have much potential. It should be noted that this study has only considered the potential for superpaging (at the virtual address level) and has not dwelled on other practical issues such as adjacency of the pages in physical memory, or how much of this can actually be exploited by the OS. On the other hand, these results motivate ongoing and future work on superpaging.

These applications usually have a small code region that is repeatedly executed (in a loop) and takes up the bulk of the execution time. This is also the region where most of the TLB misses occur. Consequently, except in a few applications, we do not find well-demarcated phases (discounting initialization) differing in the number of TLB misses. This suggests that it may not be worthwhile to trigger optimization techniques dynamically based on increases/decreases in the number or frequency of TLB misses. Any such technique is likely to be always dormant or always triggered, and the overhead of keeping track of changes in TLB behavior could offset any gains from such optimizations. Incidentally, we have also conducted some studies exploring the possible benefits of procedure-level TLB management (i.e., associating a separate virtual TLB with each procedure that is saved and restored upon exit and invocation, to prevent interference between procedures). Results of these studies can be found in [19], and they show that there is not really a strong case for doing this.

[Figure 7: Prediction accuracy of the prefetching mechanisms for all the SPEC2000 applications: Recency, and Markov/Distance predictors with 256-, 512- and 1024-entry tables that are direct-mapped (D), 2-way, 4-way, or fully associative (F).]

We find that the applications with high TLB miss rates include not just those with dynamic data reference patterns, but also some with quite regular reference patterns (such as ammp, galgel and apsi). This suggests that application-level optimizations (restructuring algorithms or code/data structures) and compiler-directed optimizations have an important role to play in reducing TLB misses or alleviating their cost. This is an area that has been little explored and holds a lot of promise for future research. Such techniques can give directives and/or hints to (a) prefetch TLB entries before they are needed (either one at a time or by bulk pre-loading), (b) pin entries so that they are not replaced until told to do so (to override the simple LRU-based approach that current TLBs use), or even (c) create superpages. In our simple experiment using an algorithm close to optimal for replacement (which indicates the best one can achieve by pinning entries), we observed over 50% improvement in miss rates in several cases. This again reiterates the possible benefits of software-directed TLB management. Until now, systems have only provided or employed software handling capabilities (i.e., a miss is handled in software, which in turn fetches the next entry and puts it into the TLB). Determining which entries to keep (pin) or replace is not really exploited in any current system, even though some processors such as the UltraSparc-IIi provide mechanisms for allowing this. Identifying what mechanisms should be exported by the TLB hardware (such as non-blocking load of an entry for prefetch, loading an entry into a particular slot, pinning an entry, etc.), what support needs to be provided by the OS (to ensure that we still provide protection), and exploiting these mechanisms at the application and compiler level is part of our ongoing work. We refer to all this as software TLB management, and differentiate it from software TLB handling, which is what some current systems provide.

Even with all the improvements that software TLB management can provide, we still find that a significant number of misses remain. In such cases, hardware and/or runtime techniques to tolerate or reduce the miss penalties by prefetching are important. Of the three prefetching mechanisms considered, we find that Distance prefetching gives good prediction accuracy for the 18 applications of the suite where there is some amount of regularity. When the access behavior is very chaotic, the prefetching mechanisms break down, as is to be expected. One should be careful in unequivocally advocating prefetching. While previous studies [28, 20] and ours show that we can get good prediction accuracies, there are a few issues to consider. First, our characterization results show that we do not have too large a time window between successive misses, and if the fetched entry cannot be loaded fast enough then we do not gain significantly from prefetching. This suggests that one may want to consider bulk prefetching, i.e., fetching several entries at once. This has been considered to a limited extent in [3], where it is specifically used for certain kernel functions (inter-process communication). We suggest exploring this for actual application executions as well. The regularity of the address patterns that miss in many applications suggests that we may be fairly successful in prefetching several entries (that may be needed) at a time. Second, one should keep in mind the possible increase in memory bandwidth imposed on the system, and how this affects the rest of the execution, as illustrated in [20].

We believe that this paper has made important contributions by (i) identifying application characteristics that are important for TLB design/optimization, (ii) providing detailed results for these characteristics over several SPEC2000 benchmarks within a unified framework, and (iii) suggesting directions for future research based on these results.

6. REFERENCES

[1] T. E. Anderson, H. M. Levy, B. N. Bershad, and E. D. Lazowska. The Interaction of Architecture and Operating System Design. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 108-120, Santa Clara, California, April 1991.

[2] T. M. Austin and G. S. Sohi. High Bandwidth Address Translation for Multiple Issue Processors. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, 1996.

[3] K. Bala, M. F. Kaashoek, and W. E. Weihl. Software Prefetching and Caching for Translation Lookaside Buffers. In Proceedings of the Usenix Symposium on Operating Systems Design and Implementation, pages 243-253, 1994.

[4] D. Burger and T. Austin. The SimpleScalar Toolset, Version 3.0. http://www.simplescalar.org.

[5] J. F. Cantin and M. D. Hill. Cache Performance for Selected SPEC CPU2000 Benchmarks. October 2001. http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/.

[6] J. B. Chen, A. Borg, and N. P. Jouppi. A Simulation Based Study of TLB Performance. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 114-123, 1992.

[7] D. W. Clark and J. S. Emer. Performance of the VAX-11/780 Translation Buffers: Simulation and Measurement. ACM Transactions on Computer Systems, 3(1), 1985.

[8] Standard Performance Evaluation Corporation. http://www.spec.org.

[9] D. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, 1998.

[10] C. Dougan, P. Mackerras, and V. Yodaiken. Optimizing the Idle Task and Other MMU Tricks. In Proceedings of the Usenix Symposium on Operating Systems Design and Implementation, pages 229-237, 1999.

[11] Z. Fang, L. Zhang, J. Carter, S. McKee, and W. Hsieh. Online Superpage Promotion Revisited. In Proceedings of the ACM SIGMETRICS 2000 Conference on Measurement and Modeling of Computer Systems, 2000.

[12] Z. Fang, L. Zhang, J. Carter, S. McKee, and W. Hsieh. Re-evaluating Online Superpage Promotion with Hardware Support. In Proceedings of the Seventh International Symposium on High Performance Computer Architecture, pages 63-72, January 2001.

[13] J. L. Henning. SPEC CPU2000: Measuring CPU Performance in the New Millennium. IEEE Computer, 33(7):28-35, July 2000.

[14] J. Huck and J. Hays. Architectural Support for Translation Table Management in Large Address Space Machines. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 39-50, May 1993.

[15] B. Jacob and T. Mudge. Virtual Memory in Contemporary Microprocessors. IEEE Micro, pages 60-75, July-August 1998.

[16] B. L. Jacob and T. N. Mudge. A Look at Several Memory Management Units, TLB-refill Mechanisms, and Page Table Organizations. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 295-306, 1998.

[17] L. K. John and A. M. Maynard. Workload Characterization for Computer System Design. Kluwer Academic Publishers, 2000.

[18] D. Joseph and D. Grunwald. Prefetching Using Markov Predictors. IEEE Transactions on Computers, 48(2):121-133, 1999.

[19] G. B. Kandiraju and A. Sivasubramaniam. Characterizing the d-TLB Behavior of the SPEC CPU2000 Benchmarks. Technical Report CSE-01-023, Dept. of Computer Science and Engineering, Penn State University, August 2001.

[20] G. B. Kandiraju and A. Sivasubramaniam. Going the Distance for TLB Prefetching: An Application-driven Study. In Proceedings of the 29th International Symposium on Computer Architecture, June 2002.

[21] W.-F. Lin, S. K. Reinhardt, and D. Burger. Reducing DRAM Latencies with an Integrated Memory Hierarchy Design. In Proceedings of the Seventh International Symposium on High Performance Computer Architecture, pages 301-312, 2001.

[22] G. McFarland. CMOS Technology Scaling and Its Impact on Cache Delay. PhD thesis, Computer Science Department, Stanford University, 1997.

[23] D. Nagle, R. Uhlig, T. Stanley, S. Sechrest, T. Mudge, and R. Brown. Design Tradeoffs for Software Managed TLBs. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 27-38, 1993.

[24] A. K. Osowski, J. Flynn, N. Meares, and D. J. Lilja. Adapting the SPEC2000 Benchmark Suite for Simulation-based Computer Architecture Research. Kluwer Academic Publishers, 2000. (Papers from the Workshop on Workload Characterization.)

[25] J. S. Park and G. S. Ahn. A Software-controlled Prefetching Mechanism for Software-managed TLBs. Microprocessors and Microprogramming, 41(2):121-136, May 1995.

[26] M. Postiff, D. Greene, S. Raasch, and T. Mudge. Integrating Superscalar Processor Components to Implement Register Caching. In Proceedings of the ACM International Conference on Supercomputing, 2001.

[27] M. Rosenblum, E. Bugnion, S. Devine, and S. Herrod. Using the SimOS Machine Simulator to Study Complex Computer Systems. ACM Transactions on Modeling and Computer Simulation, 7(1):78-103, January 1997.

[28] A. Saulsbury, F. Dahlgren, and P. Stenstrom. Recency-based TLB Preloading. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 117-127, June 2000.

[29] R. A. Sugumar and S. G. Abraham. Efficient Simulation of Caches under Optimal Replacement with Applications to Miss Characterization. In Proceedings of the ACM SIGMETRICS 1993 Conference on Measurement and Modeling of Computer Systems, pages 24-35, 1993.

[30] M. Swanson, L. Stoller, and J. Carter. Increasing TLB Reach Using Superpages Backed by Shadow Memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 204-213, 1998.

[31] M. Talluri. Use of Superpages and Subblocking in the Address Translation Hierarchy. PhD thesis, Dept. of Computer Sciences, Univ. of Wisconsin at Madison, 1995.

[32] M. Talluri and M. Hill. Surpassing the TLB Performance of Superpages with Less Operating System Support. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 171-182, 1994.