Improving Cache Locality for Thread-Level Speculation
Stanley Fung and J. Gregory Steffan


Page 1: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Improving Cache Locality for Thread-Level Speculation

Stanley Fung and J. Gregory Steffan
Electrical and Computer Engineering
University of Toronto

Page 2: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Chip Multiprocessors (CMPs) are Here!

IBM Power 5, AMD Opteron, Intel Yonah

Use CMPs to improve sequential program performance?

Page 3: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Exploiting CMPs: The Intuition

CMPs have lots of distributed resources
– Caches, branch predictors, processors

Somehow distribute sequential programs
– Use distributed resources to improve performance

Increasingly aggressive approaches:
1) Prefetching (e.g., helper threads)
2) Transactions and transactional memory
3) Thread-Level Speculation (TLS)

But distributing a sequential program is non-trivial…

Page 4: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Exploiting CMPs: The Tension

Distributed CMP Resources vs. Sequential Program
Parallelism vs. Locality

Our challenge: relaxing this tension

[Diagram: CMP with four processors (P), each with a private L1 cache, sharing an L2 cache]

Page 5: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Example: TLS Execution on 4 Processors

[Diagram: execution time on a 4-processor CMP, comparing sequential execution (one processor active, three inactive) with TLS execution (all four processors active)]

4X total cache capacity → 4X cache performance?

Page 6: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

TLS on 4 CPU CMP: % Increase in Cache Misses

Percentage increase in data cache miss rate, per benchmark:
bzip2_comp    93.4
crafty       129.8
gcc           72.4
go           209.8
ijpeg        306.9
li             9.3
m88ksim      808.9
mcf           37.9
parser        19.1
perlbmk      934.2
vortex        22.7
vpr_place    110.3
vpr_route    788.2
average      272.5   (~= 4X)

4X total cache capacity → 4X increase in cache misses

Page 7: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Opportunities for Improvement

1) Prefetching Effects
– TLS indirectly prefetches from off-chip into L2
– Orthogonal to the focus of this work

2) "Locality Misses"
– An L1 miss where the line is resident in another L1
– An indicator of both:
  • Broken locality
  • Opportunity to repair locality

What fraction of misses are locality misses?
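For concreteness, the sketch below (an illustration of the definition above, not the authors' measurement infrastructure) counts locality misses in a toy model with one private L1 per CPU: an access that misses in its own L1 is a locality miss if the line is currently resident in some other CPU's L1.

```python
# Toy model for counting locality misses (assumption: simplified, no eviction or
# coherence modeling; the real study uses a detailed simulator).

class LocalityMissCounter:
    def __init__(self, num_cpus, line_size=32):
        self.line_size = line_size
        # resident[cpu] = set of line addresses currently held in that CPU's private L1
        self.resident = [set() for _ in range(num_cpus)]
        self.l1_misses = 0
        self.locality_misses = 0

    def access(self, cpu, addr):
        line = addr // self.line_size
        if line in self.resident[cpu]:
            return                      # L1 hit
        self.l1_misses += 1
        # Locality miss: the missing line is already resident in another private L1.
        if any(line in other for i, other in enumerate(self.resident) if i != cpu):
            self.locality_misses += 1
        self.resident[cpu].add(line)    # fill this CPU's L1

    def locality_fraction(self):
        return self.locality_misses / max(1, self.l1_misses)
```

Fed with per-benchmark memory traces, locality_fraction() would correspond to the percentages reported on the next slide.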

Page 8: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

TLS on 4 CPU CMP: % Locality Misses

Percentage of data cache misses that are locality misses, per benchmark:
bzip2_comp   44.4
crafty       93.6
gcc          68.5
go           81.3
ijpeg        78.6
li           24.0
m88ksim      37.5
mcf           7.6
parser       43.5
perlbmk      88.7
vortex       92.2
vpr_place    65.8
vpr_route    69.1
average      61.1

significant locality misses: problem and opportunity

Page 9: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Outline

• Experimental Framework
• Classification of Misses
• Techniques for Reducing Misses
• Combining Techniques
• Impact on Scalability
• Conclusion

Page 10: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Support for TLS

Break programs into speculative threads
– We use the compiler

Track data dependences
– We extend invalidation-based cache coherence

Recover from failed speculation
– We extend L1 data caches to buffer speculative state

three key elements of every TLS system

Page 11: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Compiler Support for TLS

[Diagram: Sequential Source Code → Region Selection (guided by profile information: which loops?) → Transformation and Optimization (inserts TLS instructions) → MIPS Executable]

Page 12: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Hardware Support for TLS

[Diagram: CMP with per-processor private L1 caches and a shared L2; each L1 cache line holds the usual tag, state, and data plus SL (speculatively loaded) and SM (speculatively modified) bits]

extend generic CMP’s L1 caches and coherence

Page 13: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Experimental Framework

• CMP with 4 CPUs (or more)
  – 4-way issue, out-of-order superscalar
• Memory Hierarchy
  – Private L1 data caches: 32KB, 2-way
  – 2MB shared L2 cache
  – Bus interconnect
    • Not shown: results for crossbar interconnect
• Benchmarks: SPEC INT 95 and 2000
  – Speculatively parallelized

Page 14: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

TLS Cache Locality Problem: Our Investigation

Cache Locality Problem

Page 15: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

TLS Cache Locality Problem: Our Investigation

Cache Locality Problem

a shared cache solves locality problems (but slow)

Private Cache Architecture   Shared Cache Architecture

Page 16: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

TLS Cache Locality Problem: Our Investigation

Cache Locality Problem

Private Cache Architecture Shared Cache Architecture

Data Cache Instruction Cache

i-cache misses are insignificant; focus on d-cache

Page 17: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

TLS Cache Locality Problem: Our Investigation

Cache Locality Problem

Private Cache Architecture Shared Cache Architecture

Data Cache Instruction Cache

Parallel Regions Sequential Regions

miss patterns transitions

Page 18: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

TLS Execution Stages and Transitions

[Diagram: time flows down a 4-processor timeline, alternating sequential regions with a parallel region]
– Steady state: our main focus
– Startup: little impact
– Wind-down: has impact

wind-down transitions: scheduling the seq. region

Page 19: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Scheduling the Sequential Region

[Diagram: P0–P3 timelines contrasting a floating sequential processor (the sequential region runs on a different CPU each time) with a fixed sequential processor (the sequential region always runs on the same CPU), and the resulting potential cache locality]

which is better?

Page 20: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Performance of Fixed Relative to Floating

Normalized execution time (fixed relative to floating), per benchmark:
bzip2_comp    99.6
crafty        95.0
gcc           96.7
go            88.8
ijpeg         96.5
li           100.0
m88ksim       87.0
mcf           99.8
parser        99.3
perlbmk       97.3
vortex        99.9
vpr_place     94.8
vpr_route    101.6
average       96.6

Overall Program: 3.4% speedup

fixed sequential processor is superior, at no cost

Page 21: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

TLS Cache Locality Problem: Our Investigation

Cache Locality Problem

Private Cache Architecture Shared Cache Architecture

Data Cache Instruction Cache

Parallel Regions Sequential Regions

miss patterns transitions

Page 22: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Classifying Misses Within Parallel Regions

1) L2 Misses (ignore)
– These cannot be locality misses (inclusion enforced)

2) Read-based sharing
– Line is read by multiple processors

3) Write-based sharing
– Line is written (and possibly read) by multiple processors

4) Strided
– Addresses of missing lines progress by a cross-CPU stride

5) Other (ignore)
– No observable patterns; likely conflict and capacity misses

caveats: there is overlap; priority order; sliding window
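As an illustration of how such a classification might be applied (my own sketch, not the authors' tool; the exact overlap rules and sliding-window details are exactly the caveats noted above):

```python
# Illustrative classifier for parallel-region misses (assumption: not the paper's
# actual infrastructure). Checks run in priority order; overlaps take the first match.

def classify_miss(missed_in_l2, readers, writers, follows_cross_cpu_stride):
    """
    missed_in_l2             -- True if the line also missed in the shared L2
    readers, writers         -- sets of CPUs that recently read / wrote the line
                                (tracked over a sliding window of accesses)
    follows_cross_cpu_stride -- True if the miss address continues a stride
                                observed across CPUs
    """
    if missed_in_l2:
        return "L2 miss"              # cannot be a locality miss (inclusion enforced)
    sharers = readers | writers
    if len(sharers) > 1 and not writers:
        return "read-based sharing"   # read-only sharing by multiple processors
    if len(sharers) > 1:
        return "write-based sharing"  # written (and possibly read) by multiple processors
    if follows_cross_cpu_stride:
        return "strided"
    return "other"                    # no observable pattern; likely conflict/capacity
```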

Page 23: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Miss Patterns Observed

Miss Pattern          Percentage
L2 miss               15.7%
Read-based sharing    53.7%
Write-based sharing   11.4%
Strided                6.2%
Other                 13.0%

(read-based + write-based + strided = 71.3%)

investigate techniques targeting these three patterns

Page 24: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Exploiting Read-Only Sharing Patterns

• Read-only sharing misses dominate (53.7%)
  – Hence a given read miss predicts future read misses
  – i.e., other CPUs will likely read-miss that same line
• Broadcasting for all read misses
  – Any read miss results in that line being pushed to all caches
    • Provided lines in speculative state are not evicted
  – Trivial to implement in CMP with bus interconnect
    • No extra traffic

will such broadcasting result in cache pollution?
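The sketch below illustrates the RB idea under a simple bus-based model (names and structures are mine, not the paper's hardware description): when a read miss is serviced, the fill is visible on the shared bus, so every other L1 can also install the line, unless doing so would evict a line buffering speculative state.

```python
# Illustrative RB (read-broadcast) sketch on a toy bus-based CMP model.

class ToyL1:
    """Direct-mapped toy L1 cache; a real TLS L1 also tracks SL/SM speculative bits."""
    def __init__(self, num_sets=512):
        self.num_sets = num_sets
        self.lines = {}            # set index -> resident line address
        self.speculative = set()   # set indices whose resident line holds speculative state

    def victim_is_speculative(self, line):
        return (line % self.num_sets) in self.speculative

    def install(self, line):
        self.lines[line % self.num_sets] = line

def handle_read_miss(requesting_cpu, line, l1_caches):
    # The fill data returns over the shared bus, so pushing it into peer L1s is free.
    for cpu, cache in enumerate(l1_caches):
        if cpu == requesting_cpu or not cache.victim_is_speculative(line):
            cache.install(line)    # requester fill, or broadcast fill into a peer L1
        # else: never evict a line that is buffering speculative state
```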

Page 25: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Impact of Broadcasting All Read Misses (RB)

Data cache misses: 27.7% reduction
Execution time: 7.3% speedup

simple broadcasting is effective

• Attempts to throttle broadcasting reduced benefits
  – Hence resulting cache pollution is limited

Page 26: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Miss Patterns Observed

Miss Pattern          Percentage
L2 miss               15.7%
Read-based sharing    53.7%
Write-based sharing   11.4%
Strided                6.2%
Other                 13.0%

(read-based + write-based + strided = 71.3%)

Page 27: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Exploiting Write-Based Sharing Patterns

• Note: caches extended for TLS are write-back
  – Modifications are not propagated before thread commits
• Example: write-based sharing of a cache line
  – CPU0 writes then commits; then CPU1 reads
  – Read results in miss, read-request, write-back, then fill
• Aggressive approach:
  – On commit, broadcast all modified lines
  – Too much traffic, too many superfluous copies
• A more selective approach:
  – Predict lines involved in write-based sharing

more general: predict stores involved in WB sharing

Page 28: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Predicting Stores & Lines Involved in WB Sharing

[Diagram: the address splits into tag, index, and offset; the tag and index form an extended tag (etag), and an RST index is derived from the address]
– Recent Store Table (RST): recent store PCs
– Invalidation PC List (IPCL): store PCs for lines that are written back
– Push Required Buffer (PRB): etags of lines to push on commit
– 8 entries each is sufficient

Page 29: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Operation of Write-Based Sharing Technique

On a store:
– Add store PC to Recent Store Table (RST)
– If store PC is in Invalidation PC List (IPCL):
  • Add the line's etag to Push Required Buffer (PRB)

On a coherence request requiring writeback:
– Use RST index to look up PC in RST, add PC to IPCL

On commit:
– For each extended tag in PRB:
  • Write back, self-invalidate, push line to next cache

simple case: next cache is in round-robin order
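A sketch of this three-table operation is below (class and method names are mine, and the cache interface is hypothetical; the table sizes follow the "8 entries each" noted on the previous page):

```python
# Sketch of the WB predictor's operation (assumption: illustrative structure,
# not RTL). RST, IPCL, and PRB are small 8-entry tables with FIFO replacement.
from collections import OrderedDict

class WBPredictor:
    def __init__(self, entries=8):
        self.rst = OrderedDict()    # Recent Store Table: RST index -> store PC
        self.ipcl = OrderedDict()   # Invalidation PC List: store PCs of written-back lines
        self.prb = OrderedDict()    # Push Required Buffer: etags of lines to push on commit
        self.entries = entries

    def _bounded_add(self, table, key, value=True):
        table[key] = value
        if len(table) > self.entries:
            table.popitem(last=False)           # FIFO replacement

    def on_store(self, store_pc, rst_index, etag):
        self._bounded_add(self.rst, rst_index, store_pc)
        if store_pc in self.ipcl:               # this store has caused writebacks before
            self._bounded_add(self.prb, etag)   # so plan to push its line on commit

    def on_writeback_request(self, rst_index):
        store_pc = self.rst.get(rst_index)      # recover the PC that stored to this line
        if store_pc is not None:
            self._bounded_add(self.ipcl, store_pc)

    def on_commit(self, local_cache, next_cache):
        for etag in list(self.prb):
            local_cache.writeback(etag)         # write modified data back to the L2
            local_cache.self_invalidate(etag)
            next_cache.push(etag)               # push the line to the next cache (round-robin)
        self.prb.clear()
```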

Page 30: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Impact of Write-Based Technique (WB)

Data cache misses: 19.6% reduction
Execution time: 7.8% speedup

worth the cost of small additional hardware

Page 31: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Miss Patterns Observed

Miss Pattern          Percentage
L2 miss               15.7%
Read-based sharing    53.7%
Write-based sharing   11.4%
Strided                6.2%
Other                 13.0%

(read-based + write-based + strided = 71.3%)

Page 32: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Exploiting Strided Miss Patterns

• Hardware stride-prefetcher [Fu et al., Baer et al.]
  – Each CPU has its own aggressive prefetcher
  – Fully associative, 512 entries:
    • PC, miss address, stride distance, state
  – Issue 16 prefetches when stride is recognized
    • Prefetches are throttled to avoid burst of traffic
    • Prefetch from L2 to private caches
  – To be fair, prefetches do not go beyond L2
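For reference, a simplified per-CPU stride prefetcher of this kind might look like the sketch below (the confidence handling is my own simplification of the cited designs; the 512-entry and 16-prefetch parameters come from the slide):

```python
# Simplified per-CPU stride prefetcher sketch (assumption: a basic confidence
# scheme standing in for the cited designs).

class StridePrefetcher:
    ENTRIES = 512   # fully associative table, one entry per missing PC
    DEGREE = 16     # prefetches issued once a stride is recognized

    def __init__(self):
        self.table = {}   # PC -> {"addr": last miss address, "stride": int, "confident": bool}

    def on_miss(self, pc, addr):
        """Returns a (possibly empty) list of addresses to prefetch from L2 into the private cache."""
        entry = self.table.get(pc)
        if entry is None:
            if len(self.table) >= self.ENTRIES:
                self.table.pop(next(iter(self.table)))     # crude oldest-entry replacement
            self.table[pc] = {"addr": addr, "stride": 0, "confident": False}
            return []
        stride = addr - entry["addr"]
        entry["confident"] = (stride != 0 and stride == entry["stride"])
        entry["stride"], entry["addr"] = stride, addr
        if not entry["confident"]:
            return []
        # Stride recognized: issue up to 16 prefetches (throttling and the
        # "do not go beyond L2" filter are omitted here).
        return [addr + i * stride for i in range(1, self.DEGREE + 1)]
```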

Page 33: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Impact of Strided Prefetching (ST)

Data cache misses: 10.3% reduction
Execution time: no significant impact

no good alone; complementary with other techniques?

Page 34: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Combining Techniques: Parallel Region Perf.

Normalized to the baseline (%):
Configuration   Data cache misses   Execution time
WB/ST           72.4                92.7
RB/ST           65.9                93.6
RB/WB           61.8                87.2
RB/WB/ST        57.3                88.1

RB/WB/ST has fewest misses, but RB/WB performs best

Page 35: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Overall Program Speedup

Configuration   Program speedup (%)
Float            9.2
Baseline        13.4
RB              16.7
WB              16.2
ST              13.0
RB/WB           18.9
RB/WB/ST        18.1

RB/WB further improves program performance by 5.5%

Page 36: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Impact of RB/WB on Scalability

Normalized execution time (baseline / improved with RB/WB):
CPUs   bzip2_comp     vpr_place      Average (all benchmarks)
2      92.4 / 82.5    93.8 / 83.3    88.2 / 82.0
4      81.8 / 64.6    84.5 / 63.5    77.5 / 67.1
6      81.6 / 62.9    89.0 / 58.4    75.6 / 64.0
8      82.2 / 62.4    88.6 / 57.3    76.8 / 62.7

facilitates scaling

Page 37: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Summary

• Have a fixed processor for sequential regions
• Exploiting read-only sharing patterns (RB):
  – Simple broadcasting for all load misses is effective
    • No significant cache pollution
• Exploiting write-based sharing patterns (WB):
  – Write-back/self-invalidate/push technique is effective
• Exploiting strided miss patterns (ST):
  – Extra traffic overwhelms benefit of reduced misses
• RB/WB are complementary and perform best
  – And dramatically improve the scalability of TLS

Improving cache locality is key for effective TLS

Page 38: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Backups

Page 39: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Ideal Caches

Ideal Caches Model (Parallel Region Performance), normalized execution time:
Baseline                            100
Ideal instruction cache            99.9
Ideal data cache                   80.4
Ideal instruction and data cache   80.1

Page 40: Improving Cache Locality for  Thread-Level Speculation Stanley Fung and J. Gregory Steffan

Parallel Region Cache Miss Breakdown

L2 Misses

Read-Based Sharing

Write-Based Sharing

Strided

Other