
EECS 470 Lecture 18
NUCA & Prefetching
Fall 2020
Prof. Ronald Dreslinski
http://www.eecs.umich.edu/courses/eecs470

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, Vijaykumar

[Title-slide figure: a history table (latest entries A0, A1) and a correlating prediction table (entry A3 → A0, A1) triggering a prefetch of A3.]

Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lee, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Georgia Tech, Purdue University, University of Michigan, and University of Wisconsin.


Non-Uniform Cache Architecture

Proposed at ASPLOS 2002 by UT-Austin (Kim, Burger, Keckler)

Facts
❒ Large shared on-die L2
❒ Wire delay dominating on-die cache access

Projected access latency for a monolithic L2:
❒ 1 MB @ 180 nm (1999): 3 cycles
❒ 4 MB @ 90 nm (2004): 11 cycles
❒ 16 MB @ 50 nm (2010): 24 cycles


Multi-banked L2 cache

2 MB @ 130 nm, banks of 128 KB each: 11 cycles per access
(bank access time = 3 cycles, interconnect delay = 8 cycles)


Multi-banked L2 cache

16 MB @ 50 nm, banks of 64 KB each: 47 cycles per access
(bank access time = 3 cycles, interconnect delay = 44 cycles)


Static NUCA-1

Uses a private per-bank channel

Each bank has its distinct access latency

Data location statically decided for a given address

Average access latency = 34.2 cycles

Wire overhead = 20.9% → an issue

[Figure: address and data buses fanning out to banks; each bank contains sub-banks with a predecoder, tag array, wordline drivers and decoders, and sense amplifiers.]


Static NUCA-2

Uses a 2D switched network to alleviate wire area overhead

Average access latency = 24.2 cycles

Wire overhead = 5.9%

[Figure: banks connected through switches on a 2D network sharing the data bus; each bank contains a tag array, predecoder, and wordline drivers and decoders.]


Dynamic NUCA

Data can dynamically migrate

Move frequently used cache lines closer to the CPU


Dynamic NUCA

Simple Mapping

All 4 ways of each bank set need to be searched

Farther bank sets → longer access

[Figure: 8 bank sets × 4 ways; each column of banks forms one set, with its ways at increasing distance from the CPU.]


Dynamic NUCA

Fair Mapping

Average access time across all bank sets is equal

[Figure: 8 bank sets × 4 ways, with each set's banks placed so every set sees a similar mix of near and far banks.]


Dynamic NUCA

Shared Mapping

The closest banks are shared among the farther bank sets

[Figure: 8 bank sets × 4 ways; the banks nearest the CPU are shared by multiple sets.]


The memory wall

Today: 1 memory access ≈ 500 arithmetic ops

How to reduce memory stalls for existing SW?

[Figure: processor vs. memory performance, 1985-2010, log scale from 1 to 10,000; the processor curve pulls steadily away from the memory curve. Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 4th ed.]


Conventional approach #1: Avoid main memory accesses

Cache hierarchies:
❒ Trade off capacity for speed

Add more cache levels?
❒ Diminishing locality returns
❒ No help for shared data in MPs

[Figure: CPU backed by a 64 KB cache (2 clk), a 4 MB cache (20 clk), and main memory (200 clk); write data flows back down the hierarchy.]


Conventional approach #2: Hide memory latency

Out-of-order execution:
❒ Overlap compute & memory stalls

Expand the OoO instruction window?
❒ Issue & load-store logic hard to scale
❒ No help for dependent instructions

[Figure: execution timelines of compute vs. memory stall for in-order and OoO pipelines; OoO overlaps part of the stalls with compute.]


Challenges of server apps

Frequent sharing & synchronization

Many linked-data structures
❒ E.g., linked list, B+tree, …
❒ Dependent miss chains [Ranganathan 98]

Low memory-level parallelism [Chou 04]

50-66% of time stalled on memory [Trancoso 97][Barroso 98][Ailamaki 99]

Our goals:
Read misses: fetch earlier & in parallel
Write misses: never stall

[Figure: execution timeline of compute vs. memory stall, today vs. the goal.]


What is Prefetching?

•  Fetch memory ahead of time

•  Targets compulsory, capacity, & coherence misses

Big challenges:

1.  Knowing "what" to fetch
•  Fetching useless info wastes valuable resources

2.  Knowing "when" to fetch it
•  Fetching too early clutters storage
•  Fetching too late defeats the purpose of "pre"-fetching


Software Prefetching

Compiler/programmer places prefetch instructions
❒ requires ISA support
❒ why not use regular loads?
❒ found in recent ISAs such as SPARC V9

Prefetch into
❒ register (binding)
❒ caches (non-binding)


Software Prefetching (Cont.)

e.g.,

for (I = 1; I < rows; I++)
  for (J = 1; J < columns; J++) {
    prefetch(&x[I+1][J]);   /* fetch the next row's element ahead of its use */
    sum = sum + x[I][J];
  }


Hardware Prefetching

What to prefetch?
❒ one block spatially ahead?
❒ use address predictors → works for regular patterns (x, x+8, x+16, …)

When to prefetch?
❒ on every reference
❒ on every miss
❒ when prior prefetched data is referenced
❒ upon last processor reference
❒ use more complicated rate-matching mechanisms

Where to put prefetched data?
❒ auxiliary buffers
❒ caches


Generalized Access Pattern Prefetchers

How do you prefetch

1.  Heap data structures?

2.  Indirect array accesses?

3.  Generalized memory access patterns?

Taxonomy of approaches:

•  Spatial prefetchers

•  Address-correlating prefetchers

•  Precomputation prefetchers


Spatial Locality and Sequential Prefetching

Works well for the I-cache
❒ Instruction fetches tend to access memory sequentially

Doesn't work very well for the D-cache
❒ More irregular access patterns
❒ Regular patterns may have non-unit stride (e.g., matrix code)

Relatively easy to implement
❒ Large cache block sizes already have the effect of prefetching
❒ After loading one cache line, start loading the next line automatically if that line is not in the cache and the bus is not busy


PC-based Stride Prefetchers

Array stride correlated to a static load instruction [Baer'91]

Reference Prediction Table, indexed by the load instruction's PC:
  Load Inst. PC (tag) | Last Address Referenced | Last Stride | Flags

Record the load PC, last address, & stride between the last two addresses

On a load → compute the distance between the current & last address

•  if the new distance matches the old stride → found a pattern, go ahead and prefetch "current addr + stride"

•  update "last addr" and "last stride" for the next lookup


Stream Buffers [Jouppi]

Each stream buffer holds one stream of sequentially prefetched cache lines

On a load miss, check the head of all stream buffers for an address match
❒ if hit, pop the entry from the FIFO, update the cache with the data
❒ if not, allocate a new stream buffer to the new miss address (may have to recycle a stream buffer following an LRU policy)

Stream buffer FIFOs are continuously topped off with subsequent cache lines whenever there is room and the bus is not busy

Stream buffers can incorporate stride prediction mechanisms to support non-unit-stride streams

No cache pollution: prefetched lines wait in the buffers, not in the cache

But what about indirect array accesses (e.g., A[B[i]])?

[Figure: four FIFO stream buffers sitting between the D-cache and the memory interface.]


Global History Buffer (GHB) [Nesbit'04]

Holds miss address history in FIFO order

Linked lists within the GHB connect related addresses
❒ Same static load (PC/DC)
❒ Same global miss address (G/AC)

Linked-list walk is short compared with the L2 miss latency

[Figure: an index table, keyed by load PC, pointing into the FIFO Global History Buffer of miss addresses.]


GHB – Deltas

Miss address stream: 27 28 36 44 45 49 53 54 62 70 71 (current: 71)

Delta stream: 1 8 8 1 4 4 1 8 8 1

A Markov graph over the deltas 1, 4, 8 (edge likelihoods of roughly .3 and .7) drives the prediction; candidate prefetches after the current miss:

•  Width: issue the likely successors of the current delta: 71 + 8 => 79, 71 + 4 => 75

•  Depth: follow the most likely delta chain: 71 + 8 => 79, 79 + 8 => 87

•  Hybrid: combine both: 71 + 8 => 79, 79 + 8 => 87, 71 + 4 => 75


GHB – Stride Prefetching

GHB-Stride uses the PC to access the index table

The linked lists contain the local history of each load

Compare the last two local strides; if they are the same, prefetch n+s, n+2s, …, n+ds

[Figure: index table keyed by load PC; walking a PC's linked list through the GHB yields its local miss address history (e.g., the repeating stream A B C A B C), from which the strides are computed and compared.]


GHB – Delta Correlation (PC/DC)

Form delta correlations within each load's local history

For example, consider the local miss address stream:

Addresses: 0 1 2 64 65 66 128 129
Deltas:    1 1 62 1 1 62 1

Correlation pair → prefetch predictions:
(1, 1)  → 62, 1, 1
(1, 62) → 1, 1, 62
(62, 1) → 1, 62, 1

Best results among data prefetchers for SPEC2K [Gracia Pérez'04]


Spatial Correlation

Repetitive spatial relationships between accesses
•  Irregular layout → non-strided
•  Sparse → can't capture with cache blocks
•  But repetitive → predict, to improve memory-level parallelism

Not to be confused with spatial locality:
•  patterns may repeat over large (e.g., few-kB) regions


Example: Spatial Correlation [Somogyi'06]

Large-scale spatial access patterns; the pattern is a function of the program

[Figure: an 8 kB database page in memory, with page header, tuple data, and tuple slot index; the words an access touches form a repeating spatial pattern.]


SMS Operation Summary

1.  Observe: record which blocks of a spatial region are touched over time (e.g., cache hits from PC1 ld A+4, PC2 ld A, PC3 ld A+3)

2.  Store: when the region's generation ends (e.g., on evict of A+3), save its spatial pattern as a bit vector (1100000001101…) in a pattern history table

3.  Predict: on the trigger access to a new region (PC1 ld B+4, PC2 ld B, PC3 ld B+3), look up the stored pattern (1100001010001…) and prefetch the predicted blocks


Correlation-Based Prefetching [Charney 96]

Consider the following history of load addresses emitted by a processor:
A, B, C, D, C, E, A, C, F, F, E, A, A, B, C, D, E, A, B, C, D, C

After referencing a particular address (say A or E), are some addresses more likely to be referenced next?

[Figure: Markov model over addresses A-F, with transition probabilities estimated from the stream (edge weights between .2 and 1.0).]


Correlation-Based Prefetching (cont.)

Track the likely next addresses after seeing a particular address

Prefetch accuracy is generally low, so prefetch up to N next addresses to increase coverage (but this wastes bandwidth)

Prefetch accuracy can be improved by using longer history
❒ Decide which address to prefetch next by looking at the last K load addresses instead of just the current one
❒ e.g., index with the XOR of the data addresses from the last K loads
❒ Using a history of a couple of loads can increase accuracy dramatically

This technique can also be applied to just the load miss stream

Table entry: Load Data Addr (tag) | Prefetch Candidate 1, Confidence | … | Prefetch Candidate N, Confidence


Example: Markov Prefetchers [Joseph'07]

•  Correlate subsequent cache misses

•  Trigger a prefetch on a miss

•  Width prefetching: predict & prefetch four candidates
❒ predicting only one results in low coverage!

•  Prefetch into a buffer


Tag-Correlating Prefetchers [Kaxiras'04]

Correlation-based prefetching:

•  tables are too big

•  they grow with the data working-set size

Much similarity in block addresses mapping to sets

•  when marching through arrays, the tags across sets are identical!

•  save space in correlation tables by correlating tags only (not full addresses)


Revisit: Global History Buffer (GHB) [Nesbit'04]

Holds miss address history in FIFO order

Linked lists within the GHB connect related addresses
❒ Same static load (PC/DC)
❒ Same global miss address (G/AC)

Linked-list walk is short compared with the L2 miss latency

[Figure: the same GHB structure, but with the index table keyed by miss address instead of load PC.]


GHB (G/AC) - Example

Miss address stream: 27 28 29 27 28 29 28 (current: 28)

The index table entry for miss address 28 points at the most recent GHB entry for 28; following the head pointers locates the earlier occurrences, and the addresses that followed them (29, then 27 28 29) become the prefetch candidates.

[Figure: index table and FIFO GHB holding 27 28 29 27 28 29 28, with pointers chaining occurrences of the same miss address.]


Linked-Data Prefetchers

•  When traversing a linked structure:
•  Learn/record load-to-load dependences
•  Get ahead of the processor by traversing the structure in an FSM
•  The FSM gets ahead of the processor by skipping computation

❒ Similar proposals exist with SW help (e.g., helper/scout threads)

•  Example:
   while (p != NULL) {
     if (p->key == MATCH)
       p->val++;
     p = p->next;
   }


Linked Data Structure Access

[Figure: four linked nodes; each node has fields at offsets 0, 4, 8, 12, and its next pointer at offset 14, linking to the following node.]


Detecting Recursive Accesses

Example: p = p->next;

LOAD rdest, rsrc(14), with rsrc = a (producer of b)
LOAD rdest, rsrc(14), with rsrc = b (consumer of b / producer of c)

The value produced by one instance of the load and the source register of the next instance hold the same value.

[Figure: list nodes at addresses a, b, c; each node has fields at offsets +0, +4, +8, +12, with the next pointer at offset +14.]


Roth, Moshovos, Sohi (HW) [Roth'98]

PC-A: LOAD rdest, rsrc(14), with rsrc = a (producer of b)
PC-B: LOAD rdest, rsrc(14), with rsrc = b (consumer of b / producer of c)

The value loaded by PC-A and PC-B's source register hold the same value.

Potential Producer Window: maps a loaded memory value to the producer instruction address
❒ e.g., b → PC-A

Correlation Table: maps a producer instruction address to the consumer instruction address and a consumer instruction template
❒ e.g., PC-A → PC-B, LOAD r,r(14)

Runahead Execution [Mutlu'03]

Memory-level parallelism of a large window, without building it!

When the oldest instruction is an L2 miss:
❒ Checkpoint state and enter runahead mode

In runahead mode:
❒ Instructions are speculatively pre-executed
❒ To discover other L2 misses
❒ The processor continues to run

Runahead mode ends when the original L2 miss returns
❒ The checkpoint is restored and normal execution resumes


Runahead Example

Perfect caches:
  Compute → Load 1 hit → Compute → Load 2 hit → Compute

Small window:
  Compute → Load 1 miss (stall) → Compute → Load 2 miss (stall) → Compute

Runahead:
  Compute → Load 1 miss → runahead (discovers the Load 2 miss, so the two misses overlap) → Load 1 hit, Compute → Load 2 hit → Compute, with the overlap yielding saved cycles


Benefits of Runahead Execution

Avoids stalling during an L2 cache miss!

Pre-executed loads/stores generate accurate prefetches
❒ both regular and irregular access patterns

Instructions on the predicted path
❒ prefetched into the I-cache and L2

Hardware prefetcher and branch predictor
❒ are trained using future access information


Improving Cache Performance: Summary

Miss rate
❒ large block size
❒ higher associativity
❒ victim caches
❒ skewed-/pseudo-associativity
❒ hardware/software prefetching
❒ compiler optimizations

Miss penalty
❒ give priority to read misses over writes/writebacks
❒ subblock placement
❒ early restart and critical word first
❒ non-blocking caches
❒ multi-level caches

Hit time (difficult?)
❒ small and simple caches
❒ avoiding translation during L1 indexing (later)
❒ pipelining writes for fast write hits
❒ subblock placement for fast write hits in write-through caches