Improving Cache Locality for Thread-Level Speculation
Stanley Fung and J. Gregory Steffan


Page 1: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Improving Cache Locality for Thread-Level Speculation

Stanley Fung and J. Gregory Steffan
Electrical and Computer Engineering
University of Toronto

Page 2: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Chip Multiprocessors (CMPs) are Here!

IBM Power 5, AMD Opteron, Intel Yonah

Use CMPs to improve sequential program performance?

Page 3: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Exploiting CMPs: The Intuition

CMPs have lots of distributed resources
– Caches, branch predictors, processors

Somehow distribute sequential programs
– Use distributed resources to improve performance

Increasingly aggressive approaches:
1) Prefetching (e.g., helper threads)
2) Transactions and transactional memory
3) Thread-Level Speculation (TLS)

But distributing a sequential program is non-trivial…

Page 4: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Exploiting CMPs: The Tension

Distributed CMP Resources vs. Sequential Program
Parallelism vs. Locality

Our challenge: relaxing this tension

[Diagram: CMP with four processors (P), each with a private L1 cache, sharing an L2 cache]

Page 5: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Example: TLS Execution on 4 Processors

[Diagram: execution time on a 4-processor CMP, comparing sequential execution (one processor active, three inactive) with TLS execution (all four processors active)]

4X total cache capacity → 4X cache performance?

Page 6: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

TLS on 4 CPU CMP: % Increase in Cache Misses

Percentage increase in data cache miss rate, per benchmark:
bzip2_comp    93.4
crafty       129.8
gcc           72.4
go           209.8
ijpeg        306.9
li             9.3
m88ksim      808.9
mcf           37.9
parser        19.1
perlbmk      934.2
vortex        22.7
vpr_place    110.3
vpr_route    788.2
average      272.5   (~= 4X)

4X total cache capacity → 4X increase in cache misses

Page 7: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Opportunities for Improvement

1) Prefetching Effects
– TLS indirectly prefetches from off-chip into L2
– Orthogonal to the focus of this work

2) "Locality Misses"
– An L1 miss where the line is resident in another L1
– An indicator of both:
  • Broken locality
  • Opportunity to repair locality

What fraction of misses are locality misses?
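For concreteness, the sketch below (an illustration of the definition above, not the authors' measurement infrastructure) counts locality misses in a toy model with one private L1 per CPU: an access that misses in its own L1 is a locality miss if the line is currently resident in some other CPU's L1.

```python
# Toy model for counting locality misses (assumption: simplified, no eviction or
# coherence modeling; the real study uses a detailed simulator).

class LocalityMissCounter:
    def __init__(self, num_cpus, line_size=32):
        self.line_size = line_size
        # resident[cpu] = set of line addresses currently held in that CPU's private L1
        self.resident = [set() for _ in range(num_cpus)]
        self.l1_misses = 0
        self.locality_misses = 0

    def access(self, cpu, addr):
        line = addr // self.line_size
        if line in self.resident[cpu]:
            return                      # L1 hit
        self.l1_misses += 1
        # Locality miss: the missing line is already resident in another private L1.
        if any(line in other for i, other in enumerate(self.resident) if i != cpu):
            self.locality_misses += 1
        self.resident[cpu].add(line)    # fill this CPU's L1

    def locality_fraction(self):
        return self.locality_misses / max(1, self.l1_misses)
```

Fed with per-benchmark memory traces, locality_fraction() would correspond to the percentages reported on the next slide.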

Page 8: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

TLS on 4 CPU CMP: % Locality Misses

Percentage of data cache misses that are locality misses, per benchmark:
bzip2_comp   44.4
crafty       93.6
gcc          68.5
go           81.3
ijpeg        78.6
li           24.0
m88ksim      37.5
mcf           7.6
parser       43.5
perlbmk      88.7
vortex       92.2
vpr_place    65.8
vpr_route    69.1
average      61.1

significant locality misses: problem and opportunity

Page 9: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Outline

• Experimental Framework
• Classification of Misses
• Techniques for Reducing Misses
• Combining Techniques
• Impact on Scalability
• Conclusion

Page 10: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Support for TLS

Break programs into speculative threads
– We use the compiler

Track data dependences
– We extend invalidation-based cache coherence

Recover from failed speculation
– We extend L1 data caches to buffer speculative state

three key elements of every TLS system

Page 11: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Compiler Support for TLS

[Diagram: Sequential Source Code → Region Selection (guided by profile information: which loops?) → Transformation and Optimization (inserts TLS instructions) → MIPS Executable]

Page 12: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Hardware Support for TLS

[Diagram: CMP with per-processor private L1 caches and a shared L2; each L1 cache line holds the usual tag, state, and data plus SL (speculatively loaded) and SM (speculatively modified) bits]

extend generic CMP’s L1 caches and coherence

Page 13: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Experimental Framework

• CMP with 4 CPUs (or more)
  – 4-way issue, out-of-order superscalar
• Memory Hierarchy
  – Private L1 data caches: 32KB, 2-way
  – 2MB shared L2 cache
  – Bus interconnect
    • Not shown: results for crossbar interconnect
• Benchmarks: SPEC INT 95 and 2000
  – Speculatively parallelized

Page 14: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

TLS Cache Locality Problem: Our Investigation

Cache Locality Problem

Page 15: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

TLS Cache Locality Problem: Our Investigation

Cache Locality Problem

a shared cache solves locality problems (but slow)

Private Cache Architecture   Shared Cache Architecture

Page 16: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

TLS Cache Locality Problem: Our Investigation

Cache Locality Problem

Private Cache Architecture Shared Cache Architecture

Data Cache Instruction Cache

i-cache misses are insignificant; focus on d-cache

Page 17: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

TLS Cache Locality Problem: Our Investigation

Cache Locality Problem

Private Cache Architecture Shared Cache Architecture

Data Cache Instruction Cache

Parallel Regions Sequential Regions

miss patterns transitions

Page 18: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

TLS Execution Stages and Transitions

[Diagram: time flows down a 4-processor timeline, alternating sequential regions with a parallel region]
– Steady state: our main focus
– Startup: little impact
– Wind-down: has impact

wind-down transitions: scheduling the seq. region

Page 19: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Scheduling the Sequential Region

[Diagram: P0–P3 timelines contrasting a floating sequential processor (the sequential region runs on a different CPU each time) with a fixed sequential processor (the sequential region always runs on the same CPU), and the resulting potential cache locality]

which is better?

Page 20: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Performance of Fixed Relative to Floating

Normalized execution time (fixed relative to floating), per benchmark:
bzip2_comp    99.6
crafty        95.0
gcc           96.7
go            88.8
ijpeg         96.5
li           100.0
m88ksim       87.0
mcf           99.8
parser        99.3
perlbmk       97.3
vortex        99.9
vpr_place     94.8
vpr_route    101.6
average       96.6

Overall Program: 3.4% speedup

fixed sequential processor is superior, at no cost

Page 21: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

TLS Cache Locality Problem: Our Investigation

Cache Locality Problem

Private Cache Architecture Shared Cache Architecture

Data Cache Instruction Cache

Parallel Regions Sequential Regions

miss patterns transitions

Page 22: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Classifying Misses Within Parallel Regions

1) L2 Misses (ignore)
– These cannot be locality misses (inclusion enforced)

2) Read-based sharing
– Line is read by multiple processors

3) Write-based sharing
– Line is written (and possibly read) by multiple processors

4) Strided
– Addresses of missing lines progress by a cross-CPU stride

5) Other (ignore)
– No observable patterns; likely conflict and capacity misses

caveats: there is overlap; priority order; sliding window
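As an illustration of how such a classification might be applied (my own sketch, not the authors' tool; the exact overlap rules and sliding-window details are exactly the caveats noted above):

```python
# Illustrative classifier for parallel-region misses (assumption: not the paper's
# actual infrastructure). Checks run in priority order; overlaps take the first match.

def classify_miss(missed_in_l2, readers, writers, follows_cross_cpu_stride):
    """
    missed_in_l2             -- True if the line also missed in the shared L2
    readers, writers         -- sets of CPUs that recently read / wrote the line
                                (tracked over a sliding window of accesses)
    follows_cross_cpu_stride -- True if the miss address continues a stride
                                observed across CPUs
    """
    if missed_in_l2:
        return "L2 miss"              # cannot be a locality miss (inclusion enforced)
    sharers = readers | writers
    if len(sharers) > 1 and not writers:
        return "read-based sharing"   # read-only sharing by multiple processors
    if len(sharers) > 1:
        return "write-based sharing"  # written (and possibly read) by multiple processors
    if follows_cross_cpu_stride:
        return "strided"
    return "other"                    # no observable pattern; likely conflict/capacity
```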

Page 23: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Miss Patterns Observed

Miss Pattern          Percentage
L2 miss               15.7%
Read-based sharing    53.7%
Write-based sharing   11.4%
Strided                6.2%
Other                 13.0%

(read-based + write-based + strided = 71.3%)

investigate techniques targeting these three patterns

Page 24: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Exploiting Read-Only Sharing Patterns

• Read-only sharing misses dominate (53.7%)
  – Hence a given read miss predicts future read misses
  – i.e., other CPUs will likely read-miss that same line
• Broadcasting for all read misses
  – Any read miss results in that line being pushed to all caches
    • Provided lines in speculative state are not evicted
  – Trivial to implement in CMP with bus interconnect
    • No extra traffic

will such broadcasting result in cache pollution?
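The sketch below illustrates the RB idea under a simple bus-based model (names and structures are mine, not the paper's hardware description): when a read miss is serviced, the fill is visible on the shared bus, so every other L1 can also install the line, unless doing so would evict a line buffering speculative state.

```python
# Illustrative RB (read-broadcast) sketch on a toy bus-based CMP model.

class ToyL1:
    """Direct-mapped toy L1 cache; a real TLS L1 also tracks SL/SM speculative bits."""
    def __init__(self, num_sets=512):
        self.num_sets = num_sets
        self.lines = {}            # set index -> resident line address
        self.speculative = set()   # set indices whose resident line holds speculative state

    def victim_is_speculative(self, line):
        return (line % self.num_sets) in self.speculative

    def install(self, line):
        self.lines[line % self.num_sets] = line

def handle_read_miss(requesting_cpu, line, l1_caches):
    # The fill data returns over the shared bus, so pushing it into peer L1s is free.
    for cpu, cache in enumerate(l1_caches):
        if cpu == requesting_cpu or not cache.victim_is_speculative(line):
            cache.install(line)    # requester fill, or broadcast fill into a peer L1
        # else: never evict a line that is buffering speculative state
```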

Page 25: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Impact of Broadcasting All Read Misses (RB)

Data cache misses: 27.7% reduction
Execution time: 7.3% speedup

simple broadcasting is effective

• Attempts to throttle broadcasting reduced benefits
  – Hence resulting cache pollution is limited

Page 26: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Miss Patterns Observed

Miss Pattern          Percentage
L2 miss               15.7%
Read-based sharing    53.7%
Write-based sharing   11.4%
Strided                6.2%
Other                 13.0%

(read-based + write-based + strided = 71.3%)

Page 27: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Exploiting Write-Based Sharing Patterns

• Note: caches extended for TLS are write-back
  – Modifications are not propagated before thread commits
• Example: write-based sharing of a cache line
  – CPU0 writes then commits; then CPU1 reads
  – Read results in miss, read-request, write-back, then fill
• Aggressive approach:
  – On commit, broadcast all modified lines
  – Too much traffic, too many superfluous copies
• A more selective approach:
  – Predict lines involved in write-based sharing

more general: predict stores involved in WB sharing

Page 28: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Predicting Stores & Lines Involved in WB Sharing

[Diagram: the address splits into tag, index, and offset; the tag and index form an extended tag (etag), and an RST index is derived from the address]
– Recent Store Table (RST): recent store PCs
– Invalidation PC List (IPCL): store PCs for lines that are written back
– Push Required Buffer (PRB): etags of lines to push on commit
– 8 entries each is sufficient

Page 29: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Operation of Write-Based Sharing Technique

On a store:
– Add store PC to Recent Store Table (RST)
– If store PC is in Invalidation PC List (IPCL):
  • Add the line's etag to Push Required Buffer (PRB)

On a coherence request requiring writeback:
– Use RST index to look up PC in RST, add PC to IPCL

On commit:
– For each extended tag in PRB:
  • Write back, self-invalidate, push line to next cache

simple case: next cache is in round-robin order
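A sketch of this three-table operation is below (class and method names are mine, and the cache interface is hypothetical; the table sizes follow the "8 entries each" noted on the previous page):

```python
# Sketch of the WB predictor's operation (assumption: illustrative structure,
# not RTL). RST, IPCL, and PRB are small 8-entry tables with FIFO replacement.
from collections import OrderedDict

class WBPredictor:
    def __init__(self, entries=8):
        self.rst = OrderedDict()    # Recent Store Table: RST index -> store PC
        self.ipcl = OrderedDict()   # Invalidation PC List: store PCs of written-back lines
        self.prb = OrderedDict()    # Push Required Buffer: etags of lines to push on commit
        self.entries = entries

    def _bounded_add(self, table, key, value=True):
        table[key] = value
        if len(table) > self.entries:
            table.popitem(last=False)           # FIFO replacement

    def on_store(self, store_pc, rst_index, etag):
        self._bounded_add(self.rst, rst_index, store_pc)
        if store_pc in self.ipcl:               # this store has caused writebacks before
            self._bounded_add(self.prb, etag)   # so plan to push its line on commit

    def on_writeback_request(self, rst_index):
        store_pc = self.rst.get(rst_index)      # recover the PC that stored to this line
        if store_pc is not None:
            self._bounded_add(self.ipcl, store_pc)

    def on_commit(self, local_cache, next_cache):
        for etag in list(self.prb):
            local_cache.writeback(etag)         # write modified data back to the L2
            local_cache.self_invalidate(etag)
            next_cache.push(etag)               # push the line to the next cache (round-robin)
        self.prb.clear()
```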

Page 30: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Impact of Write-Based Technique (WB)

Data cache misses: 19.6% reduction
Execution time: 7.8% speedup

worth the cost of small additional hardware

Page 31: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Miss Patterns Observed

Miss Pattern          Percentage
L2 miss               15.7%
Read-based sharing    53.7%
Write-based sharing   11.4%
Strided                6.2%
Other                 13.0%

(read-based + write-based + strided = 71.3%)

Page 32: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Exploiting Strided Miss Patterns

• Hardware stride-prefetcher [Fu et al., Baer et al.]
  – Each CPU has its own aggressive prefetcher
  – Fully associative, 512 entries:
    • PC, miss address, stride distance, state
  – Issue 16 prefetches when stride is recognized
    • Prefetches are throttled to avoid burst of traffic
    • Prefetch from L2 to private caches
  – To be fair, prefetches do not go beyond L2
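For reference, a simplified per-CPU stride prefetcher of this kind might look like the sketch below (the confidence handling is my own simplification of the cited designs; the 512-entry and 16-prefetch parameters come from the slide):

```python
# Simplified per-CPU stride prefetcher sketch (assumption: a basic confidence
# scheme standing in for the cited designs).

class StridePrefetcher:
    ENTRIES = 512   # fully associative table, one entry per missing PC
    DEGREE = 16     # prefetches issued once a stride is recognized

    def __init__(self):
        self.table = {}   # PC -> {"addr": last miss address, "stride": int, "confident": bool}

    def on_miss(self, pc, addr):
        """Returns a (possibly empty) list of addresses to prefetch from L2 into the private cache."""
        entry = self.table.get(pc)
        if entry is None:
            if len(self.table) >= self.ENTRIES:
                self.table.pop(next(iter(self.table)))     # crude oldest-entry replacement
            self.table[pc] = {"addr": addr, "stride": 0, "confident": False}
            return []
        stride = addr - entry["addr"]
        entry["confident"] = (stride != 0 and stride == entry["stride"])
        entry["stride"], entry["addr"] = stride, addr
        if not entry["confident"]:
            return []
        # Stride recognized: issue up to 16 prefetches (throttling and the
        # "do not go beyond L2" filter are omitted here).
        return [addr + i * stride for i in range(1, self.DEGREE + 1)]
```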

Page 33: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Impact of Strided Prefetching (ST)

Data cache misses: 10.3% reduction
Execution time: no significant impact

no good alone; complementary with other techniques?

Page 34: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Combining Techniques: Parallel Region Perf.

Normalized to the baseline (%):
Configuration   Data cache misses   Execution time
WB/ST           72.4                92.7
RB/ST           65.9                93.6
RB/WB           61.8                87.2
RB/WB/ST        57.3                88.1

RB/WB/ST has fewest misses, but RB/WB performs best

Page 35: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Overall Program Speedup

Configuration   Program speedup (%)
Float            9.2
Baseline        13.4
RB              16.7
WB              16.2
ST              13.0
RB/WB           18.9
RB/WB/ST        18.1

RB/WB further improves program performance by 5.5%

Page 36: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Impact of RB/WB on Scalability

Normalized execution time (baseline / improved with RB/WB):
CPUs   bzip2_comp     vpr_place      Average (all benchmarks)
2      92.4 / 82.5    93.8 / 83.3    88.2 / 82.0
4      81.8 / 64.6    84.5 / 63.5    77.5 / 67.1
6      81.6 / 62.9    89.0 / 58.4    75.6 / 64.0
8      82.2 / 62.4    88.6 / 57.3    76.8 / 62.7

facilitates scaling

Page 37: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Summary

• Have a fixed processor for sequential regions
• Exploiting read-only sharing patterns (RB):
  – Simple broadcasting for all load misses is effective
    • No significant cache pollution
• Exploiting write-based sharing patterns (WB):
  – Write-back/self-invalidate/push technique is effective
• Exploiting strided miss patterns (ST):
  – Extra traffic overwhelms benefit of reduced misses
• RB/WB are complementary and perform best
  – And dramatically improve the scalability of TLS

Improving cache locality is key for effective TLS

Page 38: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Backups

Page 39: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Ideal Caches

Ideal Caches Model (Parallel Region Performance), normalized execution time:
Baseline                            100
Ideal instruction cache            99.9
Ideal data cache                   80.4
Ideal instruction and data cache   80.1

Page 40: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Parallel Region Cache Miss Breakdown

L2 Misses

Read-Based Sharing

Write-Based Sharing

Strided

Other