
Page 1: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Organizing the Last Line of Defense before hitting the

Memory Wall for Chip-Multiprocessors (CMPs)

C. Liu, A. Sivasubramaniam, M. Kandemir

The Pennsylvania State [email protected]

Page 2: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Outline

• CMPs and L2 organization
• Shared Processor-based Split L2
• Evaluation using SpecOMP/Specjbb
• Summary of Results

Page 3: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Why CMPs?

• Can exploit coarser granularity of parallelism

• Better use of anticipated billion-transistor designs
  – Multiple and simpler cores

• Commercial and research prototypes
  – Sun MAJC
  – Piranha
  – IBM Power 4/5
  – Stanford Hydra
  – ….

Page 4: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Higher pressure on memory system

• Multiple active threads => larger working set

• Solution?
  – Bigger cache.
  – Faster interconnect.

• What if we have to go off-chip?
• The cores need to share the limited pins.
• The impact of off-chip accesses may be much worse than incurring a few extra cycles on-chip.
• On-chip caches therefore need close scrutiny.

Page 5: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

On-chip Cache Hierarchy

• Assume 2 levels
  – L1 (I/D) is private
  – What about L2?

• L2 is the last line of defense before going off-chip, and is the focus of this paper.

Page 6: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Private (P) L2

[Figure: Private L2 organization — each core has private L1 (I$/D$) and a private L2 split; the L2 splits connect through the interconnect and a coherence protocol to off-chip memory.]

Advantages:
  – Less interconnect traffic
  – Insulates L2 units

Disadvantages:
  – Duplication
  – Load imbalance

Page 7: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Shared-Interleaved (SI) L2

Disadvantages:
  – Interconnect traffic
  – Interference between cores

Advantages:
  – No duplication
  – Balances the load

[Figure: Shared-Interleaved L2 organization — per-core private L1 (I$/D$) connected through the interconnect and a coherence protocol to a single shared, interleaved L2.]

Page 8: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Desirables

– Approach the behavior of private L2 when sharing is not significant

– Approach the behavior of private L2 when the load is balanced or when there is interference

– Approach the behavior of shared L2 when there is significant sharing

– Approach the behavior of shared L2 when demands are uneven

Page 9: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Shared Processor-based Split L2

[Figure: Shared Processor-based Split L2 — per-core private L1 (I$/D$) connected through the interconnect to multiple L2 splits; a table and split-select logic route each request to the splits allocated to the requesting core.]

Processors/cores are allocated L2 splits

Page 10: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Lookup

• Look up all splits allocated to the requesting core simultaneously.

• If not found, then look in all other splits (extra latency).

• If found there, move the block over to one of the requesting core's splits (chosen randomly), and remove it from the other split.

• Else, go off-chip and place the block in one of the requesting core's splits (chosen randomly).
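As a concrete illustration, here is a minimal Python sketch of this lookup and placement policy. It assumes a simple dictionary-based model; the class and names (SplitL2Model, allocation, contents) are illustrative and not from the paper, and capacity/replacement within a split is not modeled.

import random

class SplitL2Model:
    """Toy model of the shared processor-based split L2 lookup policy.

    Illustrative sketch only: names and structure are not from the paper,
    and per-split capacity/replacement is not modeled.
    """

    def __init__(self, allocation):
        # allocation: core id -> list of split ids allocated to that core
        self.allocation = allocation
        all_splits = {s for splits in allocation.values() for s in splits}
        # contents: split id -> set of block addresses currently resident
        self.contents = {s: set() for s in all_splits}

    def access(self, core, block):
        own = self.allocation[core]
        # 1. Look up all splits allocated to the requesting core
        #    (done in parallel in hardware).
        if any(block in self.contents[s] for s in own):
            return "hit (own split)"
        # 2. Not found: look in all other splits, at extra on-chip latency.
        for s, blocks in self.contents.items():
            if s not in own and block in blocks:
                # Found in another core's split: migrate it into one of the
                # requester's splits (chosen randomly) and remove it from the
                # other split, keeping at most one copy in the L2.
                blocks.remove(block)
                self.contents[random.choice(own)].add(block)
                return "hit (other split, migrated)"
        # 3. Miss everywhere: fetch off-chip and place the block in one of
        #    the requester's splits (chosen randomly).
        self.contents[random.choice(own)].add(block)
        return "miss (off-chip)"

For example, with allocation {0: [0, 1], 1: [2, 3]}, a block first fetched by core 0 lands in split 0 or 1, and a later access by core 1 migrates it into split 2 or 3.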

Page 11: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Note …

• As in the Private case, a core cannot place blocks that evict blocks useful to another core.

• As in the Shared case, a core can look at (shared) blocks of other cores – at a slightly higher cost, though not as high as an off-chip access.

• There is at most 1 copy of a block in L2.

Page 12: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Shared Split Uniform (SSU)

[Figure: Shared Split Uniform (SSU) — each core is allocated an equal number of L2 splits via the table and split-select logic.]

Page 13: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Shared Split Non-Uniform (SSN)

[Figure: Shared Split Non-Uniform (SSN) — cores are allocated different numbers of L2 splits via the table and split-select logic.]

Page 14: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Split Table

[Figure: Example split table — rows P0–P3, columns the L2 splits; an X marks each split allocated to a processor, with different processors owning different numbers of splits.]

Page 15: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Evaluation

• Using the Simics complete-system simulator
• Benchmarks: SpecOMP2000 + Specjbb
• Reference dataset used
• Several billion instructions were simulated
• A bus interconnect was simulated, with MESI

Page 16: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Default configuration

# of procs      8            L2 Assoc          4-way
L1 Size         8KB          L2 Latency        10 cycles
L1 Line Size    32 Byte      # L2 Splits       8 (SI, SSU)
L1 Assoc        4-way        # L2 Splits       16 (SSN)
L1 Latency      1 cycle      MEM Access        120 cycles
L2 Size         2MB total    Bus Arbitration   5 cycles
L2 Line Size    64 Byte      Replacement       Strict LRU
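For convenience, the same defaults restated as a Python dictionary (a transcription of the table above, not an artifact of the study; the key names are illustrative):

DEFAULT_CONFIG = {
    "num_procs": 8,
    "l1": {"size_kb": 8, "line_bytes": 32, "assoc": 4, "latency_cycles": 1},
    "l2": {
        "total_size_mb": 2,
        "line_bytes": 64,
        "assoc": 4,
        "latency_cycles": 10,
        "num_splits": {"SI": 8, "SSU": 8, "SSN": 16},
        "replacement": "strict LRU",
    },
    "mem_access_cycles": 120,
    "bus_arbitration_cycles": 5,
}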

Page 17: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Benchmarks (SpecOMP + Specjbb)

Benchmark   L1 # Misses   L1 Miss Rate   L2 # Misses   L2 Miss Rate   # of Inst (m)
ammp        53.1m         0.007          2.1m          0.062          25,528
applu       111.2m        0.009          26.4m         0.168          21,519
apsi        378.9m        0.117          27.2m         0.083          15,713
art_m       66.1m         0.009          25.7m         0.507          22,967
fma3d       18.9m         0.002          6.2m          0.239          26,189
galgel      111.4m        0.014          10.7m         0.127          24,051
swim        261.6m        0.111          95.9m         0.296          7,761
mgrid       333.2m        0.153          68.3m         0.185          10,294
specjbb     828.5m        0.353          22.7m         0.083          9,413

Page 18: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

SSN Terminology

• With a total L2 of 2MB (16 splits of 128K each) to be allocated to 8 cores, SSN-152 refers to:
  – 512K (4 splits) allocated to 1 CPU
  – 256K (2 splits) allocated to each of 5 CPUs
  – 128K (1 split) allocated to each of 2 CPUs

• Determining how much to allocate to each CPU (and when) is postponed to future work.

• Here, we use a profile-based approach guided by L2 demands.
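A small sketch of how an SSN-xyz label can be expanded into a per-CPU allocation, assuming (as in the example above) that the digits count the CPUs receiving 4, 2, and 1 splits of 128K each; the function name and interface are illustrative, not from the paper.

def expand_ssn(label, total_splits=16, split_kb=128):
    """Expand an SSN-xyz label into (splits, KB) per CPU.

    Illustrative sketch: assumes the three digits count the CPUs that get
    4, 2 and 1 splits respectively (SSN-152 = 1 CPU x 4 splits,
    5 CPUs x 2 splits, 2 CPUs x 1 split).
    """
    n4, n2, n1 = (int(d) for d in label.split("-")[1])
    counts = [4] * n4 + [2] * n2 + [1] * n1
    assert sum(counts) == total_splits, "allocation must cover all splits"
    return [(c, c * split_kb) for c in counts]

# expand_ssn("SSN-152") ->
# [(4, 512), (2, 256), (2, 256), (2, 256), (2, 256), (2, 256), (1, 128), (1, 128)]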

Page 19: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Application behavior

• Intra-application heterogeneity
  – Spatial (among CPUs): allocate non-uniform splits to different CPUs.
  – Temporal (for each CPU): change the number of splits allocated to a CPU at different points in time.

• Inter-application heterogeneity
  – Different applications running at the same time can have different L2 demands.

Page 20: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Definition

• SHF (Spatial Heterogeneity Factor)

  SHF = Σ_epoch [ σ_cpu(L1 Misses) × L1 Accesses_epoch ] / Σ_epoch L1 Accesses_epoch

• THF (Temporal Heterogeneity Factor)

  THF = Σ_cpu [ σ_epoch(L1 Misses) × L1 Accesses_cpu ] / Σ_cpu L1 Accesses_cpu
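A sketch of how these factors could be computed from per-epoch, per-CPU L1 statistics; it follows the verbal definition above (standard deviation of L1 misses, weighted by L1 accesses), but the paper's exact normalization may differ, so treat the function and its names as assumptions.

import statistics

def weighted_heterogeneity(miss_matrix, weights):
    """Weighted average of the standard deviation of L1 misses along one axis.

    miss_matrix[i][j] : L1 misses of group j in slice i
    weights[i]        : L1 accesses of slice i (used as the weight)

    For SHF, each slice is an epoch and the groups are CPUs; for THF, each
    slice is a CPU and the groups are epochs. Sketch only; the paper's exact
    formula may differ.
    """
    total = sum(weights)
    return sum(statistics.pstdev(row) * w
               for row, w in zip(miss_matrix, weights)) / total

# SHF: std-dev across CPUs per epoch, weighted by that epoch's L1 accesses.
#   shf = weighted_heterogeneity(misses_by_epoch, accesses_by_epoch)
# THF: std-dev across epochs per CPU, weighted by that CPU's L1 accesses.
#   thf = weighted_heterogeneity(misses_by_cpu, accesses_by_cpu)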

Page 21: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Spatial Heterogeneity Factor

[Chart: Spatial Heterogeneity Factor (0 to 0.14) for ammp, applu, apsi, art_m, fma3d, galgel, swim, mgrid, and specjbb.]

Page 22: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Temporal Heterogeneity Factor

[Chart: Temporal Heterogeneity Factor (0 to 0.35) for ammp, applu, apsi, art_m, fma3d, galgel, swim, mgrid, and specjbb.]

Page 23: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Results: SI

[Chart: IPC increase over P (-50% to 50%) for each benchmark under P, SI, SSU, SSN-152, SSN-224, and SSN-304.]

Page 24: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Results: SSU

[Chart: IPC increase over P (-50% to 50%) for each benchmark under P, SI, SSU, SSN-152, SSN-224, and SSN-304.]

Page 25: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Results: SSN

[Chart: IPC increase over P (-40% to 60%) for each benchmark under P, SI, SSU, SSN-152, SSN-224, and SSN-304.]

Page 26: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Summary of Results

• When P does better than S (e.g., apsi), SSU/SSN does at least as well as P.

• When S does better than P (e.g., swim, mgrid, specjbb), SSU/SSN does at least as well as S.

• In nearly all cases (except applu), some configuration of SSU/SSN does the best.

• On average, we get over an 11% improvement in IPC over the best S/P configuration(s).

Page 27: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Inter-application Heterogeneity

• Different applications have different L2 demands

• These applications could even be running concurrently on different CPUs.

Page 28: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Inter-application results

• ammp+apsi: low + high.
• ammp+fma3d: both low.
• swim+apsi: both high, imbalanced + balanced.
• swim+mgrid: both high, imbalanced + imbalanced.

[Chart: IPC (0 to 6) for ammp+apsi, ammp+fma3d, swim+apsi, and swim+mgrid under P, SI, SSU, SSN-152, SSN-224, and SSN-304.]

Page 29: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Inter-application: ammp+apsi

• SSN-152: 1.25MB dynamically allocated to apsi, 0.75MB to ammp.

• The graph shows the roughly 5:3 allocation (1.25MB = 10 splits vs. 0.75MB = 6 splits).

• Better overall IPC value.

• Low miss rate for apsi, without affecting the miss rate of ammp.

Page 30: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Concluding Remarks

• Shared Processor-based Split L2 is a flexible way of approaching the behavior of shared or private L2 (based on what is preferable)

• It accommodates spatial and temporal heterogeneity in L2 demands both within an application and across applications.

• Becomes even more important as off-chip access costs grow.

Page 31: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Future Work

• How to configure the split sizes – statically, dynamically, or with a combination of the two?

Page 32: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Backup Slides

Page 33: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

[Chart: L1 and L2 miss rates (0 to 0.6) for ammp, applu, apsi, art_m, fma3d, galgel, swim, mgrid, and specjbb.]

Page 34: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Meaning

• SHF and THF capture the heterogeneity of the load imposed on the L2 structure between CPUs (spatial) or over epochs (temporal).

• Weighting by L1 accesses reflects the effect on the overall IPC: if the overall accesses are low, there will not be a significant impact on IPC even if the standard deviation is high.

Page 35: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Results: P

[Chart: IPC (0 to 7) for each benchmark under P, SI, SSU, SSN-152, SSN-224, and SSN-304.]

Page 36: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Results: SI

[Chart: IPC (0 to 7) for each benchmark under P, SI, SSU, SSN-152, SSN-224, and SSN-304.]

Page 37: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Results: SSU

[Chart: IPC (0 to 7) for each benchmark under P, SI, SSU, SSN-152, SSN-224, and SSN-304.]

Page 38: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Results

[Chart: IPC (0 to 7) for each benchmark under P, SI, SSU, SSN-152, SSN-224, and SSN-304.]

Except for applu, the shared split L2 configurations perform the best.

In swim, mgrid, and specjbb, the high L1 miss rate means higher pressure on the L2, which results in significant IPC improvements (30.9% to 42.5%).

Page 39: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Why does the private L2 do better in some cases?

• L2 performance depends on:
  – The degree of sharing
  – The imbalance of the load imposed on the L2

• For applu and swim+apsi:
  – Only 12% of the blocks are shared at any time, mainly between 2 CPUs.
  – There is not much spatial/temporal heterogeneity.

Page 40: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Why do we use IPC instead of execution time?

• We could not run any of the benchmarks to completion, since we are using the "reference" dataset.

• Another possible indicator is the number of iterations of a certain loop (for example, the dominating loop) executed per unit of time.

• We did this and found a direct correlation between the IPC value and the number of iterations.

                                        Private                    SSU
                                        Average time    IPC        Average time    IPC
apsi loop calling dctdx() (main loop)   3,349m cycles   3.44       3,048m cycles   3.79

Page 41: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Results

Page 42: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Closer look: specjbb

• SSU is over 31% better than the private L2.

• There is a direct correlation between the L2 misses and the IPC values.

• The IPC under P never exceeds 2.5, while SSU sometimes pushes over 3.0.

Page 43: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Sensitivity: Larger L2

• 2MB -> 4MB -> 8MB
  – Miss rates go down, and the differences arising from miss rates diminish; 'swim' still gets considerable savings.
  – If application sizes keep growing, the shared split L2 will still help.
  – More L2 splits -> finer granularity -> could help SSN.

Page 44: Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

Sensitivity: Longer memory access

[Chart: IPC increase over P (-10% to 60%) for swim, applu, specjbb, and ammp+fma3d under SI and SSU with 120-cycle and 240-cycle memory access.]

120 cycles -> 240 cycles: the benefits are amplified.