
2010 ACM Athena Lecture: Shared Caches in Multicores. Mary Jane Irwin, Computer Science & Engr., Penn State University, Summer 2010.


TRANSCRIPT

Page 1:

2010 ACM Athena Lecture
Shared Caches in Multicores
Mary Jane Irwin, Computer Science & Engr., Penn State University
Summer 2010

Page 2:

The ACM Athena Lectures

Athena is the Greek goddess of wisdom. The ACM Athena Lectures were designed by ACM-W to “celebrate women researchers who have made fundamental research contributions to computer science.” Lecturers are nominated by ACM SIGs.

There have been five awarded to date: Deborah Estrin (’06), Karen Spärck Jones (’07), Shafi Goldwasser (’08), Susan Eggers (’09), and me (’10).


Page 3:

The forces at work

                       2010  2012  2014  2016  2018
Tech node (nm)           32    22    16    11     8
Integ. capacity (BT)     16    32    64   128   256

The Technology
The Power Wall
Multicores

[Chart: power (Watts), 0 to 120, required to “keep on the performance curve” with a single core.]

Page 4:

The multicore revolution

Multiple cores on one chip (socket) are the norm

But … the other on-chip resources used by those cores must also scale:
- On-chip storage (e.g., caches)
- Core interconnect
- Off-chip memory bandwidth (e.g., memory controllers)

Performance also depends upon the design and effective management of these other resources


Improving the performance of on-chip caches in multicores

Page 5:

Multicore “A”

[Diagram: Core1–Core4, each with private L1 I/D and a private L2.]

Examples – AMD’s Athlon X2, IBM’s POWER6.
Good: fast interconnect; no L2 app thread contention. Good for multiprogrammed (single-threaded apps) workloads.
Bad: app threads can’t share L2 capacity . . . or data.

Page 6:

Multicore “B”

[Diagram: four cores with private L1 I/D; Core1 and Core2 share one L2, Core3 and Core4 share another.]

Again, many examples – e.g., Intel’s Core Duo.
Good: app threads can share L2 data … and capacity. Good for multi-threaded (parallel apps) workloads.
Bad: slower interconnect; L2 app thread contention.

Page 7:

When app contention is an issue

Apps can be characterized by their last-level cache (LLC) behavior.


Devils and rabbits – Xie and Loh (CMP-MSI’08)
- Devil apps do not “play well with others”: they access the LLC very frequently, but still have a high miss rate (low reuse).
- Rabbit apps need “more space to run around in”: they access the LLC fairly frequently and have a low miss rate if they have a sufficient number of LLC ways allocated to them; otherwise their performance degrades rapidly.
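To make this concrete, here is a toy classifier in this spirit; it is only a sketch, and the thresholds and the counter inputs (llc_accesses, llc_misses, instructions) are illustrative assumptions, not values from Xie and Loh:

```c
/* Toy devil/rabbit classifier for one app over one profiling epoch.
 * Thresholds are illustrative assumptions, not from the paper.      */
typedef enum { DEVIL, RABBIT, OTHER } llc_class_t;

llc_class_t classify_llc_behavior(unsigned long llc_accesses,
                                  unsigned long llc_misses,
                                  unsigned long instructions)
{
    if (instructions == 0 || llc_accesses == 0)
        return OTHER;

    /* accesses per kilo-instruction, and the LLC miss rate */
    double apki      = 1000.0 * (double)llc_accesses / (double)instructions;
    double miss_rate = (double)llc_misses / (double)llc_accesses;

    if (apki > 10.0 && miss_rate > 0.50)   /* frequent access, low reuse */
        return DEVIL;
    if (apki > 1.0 && miss_rate < 0.10)    /* reuses data if given space */
        return RABBIT;
    return OTHER;
}
```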

Page 8:

Architectural “solutions”

Keep the devil apps from harming the rabbit apps by dynamically “partitioning” the shared LLC


- Recency-position dynamic partitioning – Suh, et al. (HPCA’04)
- Utility-based cache partitioning (UCP) – Qureshi and Patt (MICRO’06)
- Cooperative cache partitioning (CCP) – Chang and Sohi (ICS’07)
- Thread-aware dynamic insertion (TADIP) – Jaleel, et al. (PACT’08)
- . . .
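To give a feel for how these partitioners work, here is a minimal greedy sketch in the spirit of UCP: each way of the shared LLC goes to whichever app’s miss curve predicts the biggest drop from receiving one more way. The misses[app][ways] curves would come from shadow-tag utility monitors; the greedy loop is a simplification of Qureshi and Patt’s lookahead algorithm and assumes the curves are non-increasing.

```c
/* Minimal utility-driven way partitioning (UCP-flavored sketch).
 * misses[a][w] = predicted misses of app a when given w ways,
 * w = 0..NWAYS, e.g. measured by per-app shadow tags.            */
#define NAPPS  4
#define NWAYS 16

void partition_ways(const unsigned long misses[NAPPS][NWAYS + 1],
                    int alloc[NAPPS])
{
    for (int a = 0; a < NAPPS; a++)
        alloc[a] = 0;

    /* hand out the NWAYS ways one at a time, by marginal gain */
    for (int w = 0; w < NWAYS; w++) {
        int best = -1;
        unsigned long best_gain = 0;
        for (int a = 0; a < NAPPS; a++) {
            unsigned long gain = misses[a][alloc[a]] - misses[a][alloc[a] + 1];
            if (best < 0 || gain > best_gain) {
                best = a;
                best_gain = gain;
            }
        }
        alloc[best]++;   /* this app benefits most from the next way */
    }
    /* a real controller would enforce alloc[] via replacement masks */
}
```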

Page 9:

More architectural “solutions”

Or dynamically “share” the capacity of private LLCs:
- Cooperative caching – Chang and Sohi (ISCA’06)
- Distributed cooperative caching – Herrero, et al. (PACT’08)
- . . .

Or try to have the best of both worlds with an LLC that can be private, shared, or some of both:
- Elastic cooperative caching – Herrero, Gonzalez, Canal (ISCA’10)
- . . .


Page 10:


Observations about multi-threaded applications

Page 11:

Multi-threaded apps (SPEComp). Simulation (Simics): 8 cores, 8 threads, 2MB shared L2.

[Chart: % of L2 misses, inter-core vs. intra-core, for wupwise, swim, mgrid, applu, galgel, equake, apsi, and art.]

On average, inter-core misses are double the intra-core misses.

Page 12:

Another key observation

[Chart: % of distinct L2 references leading to misses, inter-core vs. intra-core, for the same benchmarks.]

Most of these inter-core misses are from a few distinct memory addresses (hot blocks).

Page 13:

Yet another observation

[Chart: temporal locality (%) of inter-core misses, bucketed by reference distance (1 to 10, 10-100, 100-1K, 1K-10K, 10K-100K, >100K), for the same benchmarks.]

Most hot blocks (64.5%) are accessed over and over again within 100 references.

So “pin” these hot blocks in the LLC and have another place to put the references from other cores that map to those pinned sets.

Page 14:

Set pinning – Srikantaiah, et al. (ASPLOS’08)


Cores get replacement ownership of L2 sets by pinning them in place (the core_id becomes part of the cache tag). Inter-core misses from non-owner cores are stored in that core’s small (e.g., 16KB) POP (Processor Owned Private) cache.

[Diagram: Core1–Core4 sharing a pinned L2 cache, each core backed by its own small POP cache (POP1–POP4).]

Can improve performance by periodically relinquishing ownership (some threads are too greedy)
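Below is a minimal sketch of the access path just described, over an assumed toy model (round-robin replacement, per-core POP tag arrays, valid bits omitted); the ASPLOS’08 hardware differs in detail.

```c
#include <stdbool.h>

#define WAYS      4     /* associativity of one L2 set (toy value)  */
#define POP_LINES 16    /* lines per POP cache (e.g., a 16KB cache) */
#define MAXCORES  8

typedef struct {
    unsigned long tag[WAYS];
    int owner_core;       /* -1 while the set is unowned             */
    int next_victim;      /* round-robin victim, standing in for LRU */
} l2_set_t;

static unsigned long pop[MAXCORES][POP_LINES];  /* per-core POP tags */

static bool find_tag(const unsigned long *tags, int n, unsigned long t)
{
    for (int i = 0; i < n; i++)
        if (tags[i] == t) return true;
    return false;
}

/* Returns true on a hit (shared L2 or this core's own POP cache).
 * On a miss, only the set's owner may replace in the L2; any other
 * core places the block in its POP cache instead.                  */
bool l2_access(l2_set_t *set, int core, unsigned long tag)
{
    if (find_tag(set->tag, WAYS, tag))        return true;
    if (find_tag(pop[core], POP_LINES, tag))  return true;

    if (set->owner_core < 0)
        set->owner_core = core;               /* first miss pins the set */

    if (set->owner_core == core) {
        set->tag[set->next_victim] = tag;     /* owner replaces in L2    */
        set->next_victim = (set->next_victim + 1) % WAYS;
    } else {
        pop[core][tag % POP_LINES] = tag;     /* non-owner fills its POP */
    }
    return false;
}
```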

Page 15:

Performance benefits

[Chart: normalized performance (80–140) of traditional, set-pinned, and adaptively set-pinned L2s for the same benchmarks.]

Adaptive set pinning reduces the L2 miss rate by 48% on average and improves performance by 18% on average.

Page 16:

Multi-threaded app’s structure

[Diagram: fork-join structure. Threads T1–T4 run between “Start Parallel Section” and “End Parallel Section”; the longest-running thread is the critical path thread.]

Page 17:

Multi-threaded apps (SPEComp, NAS). Simulation (Simics): 4 cores, 4 threads, 1MB shared L2.

[Charts: per-thread normalized LLC misses and normalized performance (Threads 1–4) for wupwise, mgrid, swim, applu, art, cg, lu, bt, and ep.]

An app thread’s performance is largely determined by its LLC behavior.

So dynamically partition the LLC to give the critical path thread more ways
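As a rough sketch of that idea (an illustration only; Muralidhara et al. predict criticality with per-thread performance models rather than the simple progress counters assumed here): each repartitioning epoch, shift one way from the thread furthest ahead to the thread furthest behind.

```c
/* Illustrative critical-path repartitioning step, run once per epoch.
 * progress[t] is an assumed per-thread progress measure, e.g. loop
 * iterations completed; ways[t] is thread t's current LLC way quota. */
#define NTHREADS 4

void repartition(const double progress[NTHREADS], int ways[NTHREADS])
{
    int slow = 0, fast = 0;
    for (int t = 1; t < NTHREADS; t++) {
        if (progress[t] < progress[slow]) slow = t;
        if (progress[t] > progress[fast]) fast = t;
    }
    if (slow != fast && ways[fast] > 1) {
        ways[fast]--;   /* take a way from the thread that is ahead */
        ways[slow]++;   /* give it to the critical-path thread      */
    }
    /* ways[] would be enforced by way-masking in the cache controller */
}
```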

Page 18:

Critical path thread cache partitioning – Muralidhara, et al. (IPDPS’10)

[Chart: normalized performance (80%–115%) of throughput-based vs. critical-path-based partitioning for the same benchmarks.]

Improves performance by up to 23% (11% avg) over equal partitions, 15% (9% avg) over non-partitioned, and 20% (10% avg) over throughput-partitioned.

Page 19:

Lately multicore architectures have gotten even more interesting ...


Page 20:

More cores per socket: Multicore “H”


H is for Intel’s Harpertown – with 8 cores. Note the pair-shared L2s.

[Diagram: cores C1–C8, each with private L1 I/D; each adjacent pair of cores shares an L2.]

Page 21:

More cache levels: Multicore “N”


N is for Intel’s Nehalem – again with 8 cores. Three cache levels: private L2s, socket-shared L3s.

[Diagram: two four-core sockets (C1–C4, C5–C8); each core has private L1 I/D and a private L2; each socket shares an L3.]

Page 22:

More of both: Multicore “D”


D is for Intel’s Dunnington – with 12 cores. Three cache levels: pair-shared L2s, socket-shared L3s.

[Diagram: twelve cores (C1–C12), each with private L1 I/D; each pair of cores shares an L2; each six-core socket shares an L3.]

Page 23:

Multi-threaded apps

Consider running a single, multi-threaded application on N (or H or D)

[Diagram: the Nehalem (“N”) cache hierarchy from Page 21.]

Page 24:

Food for thought

Multi-threaded application (galgel) optimized for the cache hierarchy of each architecture


It is highly likely code optimized for N won’t run well on H (or D) and vice versa

[Chart: normalized execution time (0–1.4) of galgel binaries optimized for Harpertown, Nehalem, and Dunnington, each run on all three machines.]

Running code optimized for H on N gives a 26% performance hit

Page 25:

Iteration-to-core mapping – data sharing


[Diagram: iterations accessing block B mapped to cores 0–3 under two mappings.]

Iterations i and j, which both access B, are mapped to cores that do not share an L2: missed opportunity.
Iterations i and j, which both access B, are mapped to cores that share an L2: constructive data sharing.
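As a toy version of the mapping step (a sketch, not the PLDI’10 algorithm, which derives sharing from a polyhedral model), suppose each iteration’s dominant data block is known in advance; iterations touching the same block can then be steered to a pair of cores that share an L2, here the Harpertown-style pairs {0,1} and {2,3}.

```c
/* Toy iteration-to-core mapping: cores 0/1 share one L2 and cores 2/3
 * share another. Iterations that touch the same block land on the same
 * L2-sharing pair, so the block is fetched into that L2 only once.    */
#define NITER 1024

void map_iterations(const int block_of[NITER], int core_of[NITER])
{
    for (int i = 0; i < NITER; i++) {
        int pair = block_of[i] % 2;        /* choose an L2-sharing pair */
        core_of[i] = 2 * pair + (i & 1);   /* alternate within the pair */
    }
}
```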

Page 26:

Local iteration scheduling


[Diagram: per-core iteration streams over blocks A, B, and C on cores 0 and 1, before and after rescheduling.]

Iterations of core0 access A while core1 accesses B (core1 loads it first); the later access to B by core0 is a miss: missed opportunity.
Iterations of core0 are rescheduled so that the access to B by core0 is now a hit: constructive data sharing.
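And a matching toy for the local step: once a core has its iteration list, sorting it by the block each iteration touches makes repeat accesses adjacent in time, so the later ones hit. This sketch ignores loop-carried dependences, which the real (Omega-Library-based) scheduler must honor; block_of is the same assumed per-iteration block map as above.

```c
#include <stdlib.h>

static const int *g_block_of;   /* block touched by each iteration */

static int by_block(const void *a, const void *b)
{
    int ba = g_block_of[*(const int *)a];
    int bb = g_block_of[*(const int *)b];
    return (ba > bb) - (ba < bb);
}

/* Reorder one core's iterations so those touching the same block run
 * back to back, turning repeat block accesses into cache hits.       */
void schedule_local(int *iters, int n, const int *block_of)
{
    g_block_of = block_of;
    qsort(iters, n, sizeof iters[0], by_block);
}
```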

Page 27:

Compilation “solution” – Kandemir, et al. (PLDI’10)

1. Iteration-to-core mapping (assign iterations to cores)

2. Local (per core) iteration scheduling


Developed a compiler-based, cache-topology-aware thread mapper and scheduler: Microsoft’s Phoenix Compiler Infrastructure (code analyzer); a polyhedral framework over iteration and data sets (PSU) -> Omega Library -> iterations assigned to the cores; Intel compiler back end.

Page 28:

Performance improvements

[Chart: normalized execution time (0–1) of Base+ and topology-aware mappings for applu, galgel, H.264, and mesa on Harpertown, Nehalem, and Dunnington.]

Harpertown – 28% and 16% over Base and Base+.
Nehalem – 29% and 17% over Base and Base+.
Dunnington – 30% and 21% over Base and Base+.

Page 29:


But most multicores are likely to be running both multithreaded and single-threaded apps at the same time.

Page 30:

A laptop, desktop scenario

Have both thread contention and thread sharing

[Diagram: the Nehalem cache hierarchy again, hosting a mix of multi-threaded and single-threaded apps.]

Page 31:

And have to deal with thermal emergencies

[Diagram: the same cache hierarchy under a thermal emergency.]

Page 32:

And have to deal with faults

SEU, aging, … where cores have to be decommissioned (and possibly recommissioned)

[Diagram: the same cache hierarchy with a faulty core taken offline.]

Page 33:


Such scenarios require an approach that can reschedule threads at run time

Page 34:

Run-time “solutions”

The OS can optimize thread scheduling on multicores at runtime – for performance, for power, for reliability:
- Arch support for OS cache management – Rafique, et al. (PACT’06)
- OS page allocation – e.g., Cho, et al. (MICRO’06)
- Thread scheduling for constructive cache sharing – e.g., Chen, et al. (SPAA’07)
- Cache-fair scheduling – Fedorova, et al. (PACT’07)
- Cache contention-aware thread scheduling – e.g., Zhuravlev, Blagodurov, Fedorova (ASPLOS’10)
- . . .
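Whatever the policy, the mechanism all of these runtime approaches rely on is the ability to re-pin a thread onto a chosen core. On Linux that is one call; here is a minimal sketch, where the policy that picks target_cpu (contention, temperature, faults) is whatever scheduler sits above.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/* Re-pin the thread with kernel id `tid` onto `target_cpu`, e.g. to
 * move it off a hot core or next to a cache-sharing sibling.
 * Returns 0 on success, -1 (with errno set) on failure.             */
int remap_thread(pid_t tid, int target_cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(target_cpu, &set);
    return sched_setaffinity(tid, sizeof set, &set);
}
```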


Page 35:

REEact: A Virtual Execution Manager – with Soffa & Davidson (UVA), Childers (UPitt)


[Diagram: REEact architecture. A Global Execution Manager (multi-app mapping policy) sits above the OS and a VEM Communicator; per-application Execution Managers apply compiler thread mapping and critical thread mapping policies; per-core Execution Managers apply a thermal alarm policy; support services include a software profiler, core temp info, a thread monitor, and Perfmon.]

Page 36:

Mapping with REEact (PARSEC). Yorkfield: 4 cores, 4 threads/app, 6MB pair-shared L2.

Applications: StreamCluster (memory-bound), BodyTrack (I/O-bound), Swaptions (CPU-bound), CaNneal (memory-bound), FLuidanimate (memory-CPU-bound).

[Charts: throughput and ARTime for the app pairs SC_BT, SC_SW, CN_SW, and FL_BT under a static isolation mapping (app1 on core0 and core1, app2 on core2 and core3) and under dynamic mapping with core load balancing and utilization.]

Page 37:


In closing – it takes a village

- Explicitly parallel codes (MPI, Cilk, TBB, CUDA, pthreads, …)
- Architecture-aware parallelizing compilers: identify the parallel threads; initial thread mapping
- Mapping (scheduling) threads to cores; dynamically adapting the thread-to-core mapping during execution (run-time thread adaptation)
- Other issues: cache coherence, memory bandwidth mgmt, network-on-chip impacts, impacts of 3D and new technologies

© Intel

Page 38:

Thanks to SIGARCH (and SIGDA). And to ISCA and ACM-W and Google.

My career journey:
- Decade (or so) working in application-specific architectures (SIGARCH: ISCA; SIGARCH S/T)
- Decade (or so) working on EDA tools, from logic synthesis, to module layout … to power simulators (SimplePower) (SIGDA: DAC, ISLPED; SIGDA Board)
- Decade (or so) back in the architecture space … optimizing power, performance, and reliability (SIGARCH: ISCA, ASPLOS; SIGARCH Board)

Page 39:

Thanks to my research colleagues

The faculty: Bob Owens, Mahmut Kandemir, Vijay Narayanan, Padma Raghavan, and Yuan Xie (PSU), Jack Davidson and Mary Lou Soffa (UVA), Bruce Childers (UPitt)

(Some of) the students:


Page 40:


Thank you!

Questions?