Utilizing Shared Data in Chip Multiprocessors with the Nahalal Architecture

Zvika Guz, Idit Keidar, Avinoam Kolodny, Uri C. Weiser

The Technion – Israel Institute of Technology

Post on 20-Dec-2015


Page 1: Utilizing Shared Data in Chip Multiprocessors with the Nahalal Architecture — Zvika Guz, Idit Keidar, Avinoam Kolodny, Uri C. Weiser, The Technion – Israel Institute of Technology

Utilizing Shared Data in Chip Multiprocessors

with the Nahalal Architecture

Zvika Guz, Idit Keidar, Avinoam Kolodny, Uri C. Weiser

The Technion – Israel Institute of Technology

Page 2

CMPs severely stress on-chip caches:

Capacity

Bandwidth

Latency

Data sharing further complicates matters:

Contention on shared data

Synchronization

Caches are a principal challenge in CMP

How to organize & handle data in CMP caches?

Page 3

Outline

Caches in CMP

Cache-in-the-Middle layout

Application characterization

Nahalal solution

Overview

Results

Putting Nahalal into practice

Line search

Scalability

Summary

Page 4

Tackling Cache Latency via NUCA

Due to the growing wire delay:

Hit time depends on physical location [Agarwal et al., ISCA 2000]

Non-uniform access times: closer data ⇒ smaller hit time

Aim for vicinity of reference: locate data lines closer to their clients

NUCA – Non-Uniform Cache Architecture [Kim et al., ASPLOS’02; Beckmann and Wood, MICRO’04]

[Figure: eight processors P0–P7 surrounding a distributed, banked L2 cache (banks 0–63)]

Migrate cache lines towards processors that access them

Dynamic NUCA (DNUCA) [Kim et al., ASPLOS’02; Beckmann and Wood, MICRO’04]

Source: [Keckler et al., ISSCC 2003]
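The migration idea behind DNUCA can be illustrated with a toy model. The bank geometry, latency numbers, and the one-hop-per-access migration policy below are illustrative assumptions, not the exact policy of the cited papers:

```python
# Toy DNUCA model: hit latency grows with a line's bank distance from
# the requesting processor, and each hit migrates the line one bank
# closer to the requester (all parameters are illustrative).

class ToyDNUCA:
    def __init__(self, num_banks=8, base_latency=4, hop_latency=2):
        self.num_banks = num_banks
        self.base_latency = base_latency
        self.hop_latency = hop_latency
        self.location = {}  # cache line -> bank index

    def access(self, cpu_bank, line):
        """Return the hit latency for `line` requested from `cpu_bank`."""
        bank = self.location.setdefault(line, cpu_bank)  # miss: fill near requester
        latency = self.base_latency + self.hop_latency * abs(bank - cpu_bank)
        # Gradual migration: move the line one bank toward the requester.
        if bank > cpu_bank:
            self.location[line] = bank - 1
        elif bank < cpu_bank:
            self.location[line] = bank + 1
        return latency

cache = ToyDNUCA()
cache.location["x"] = 7          # line initially sits in a far bank
first = cache.access(0, "x")     # 4 + 2*7 = 18 cycles
later = cache.access(0, "x")     # line has moved to bank 6: 4 + 2*6 = 16 cycles
```

Repeated accesses from the same processor keep pulling the line closer, which is exactly the "vicinity of reference" effect the slide describes.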

Page 5

Cache-In-the-Middle Layout (CIM)

Shared L2 cache

Higher capacity utilization

Single copy ⇒ no inter-cache coherence

Banked, DNUCA; interconnected using a Network-on-Chip (NoC)

[Figure: CIM floorplan — eight CPUs above and below a row of eight L2 banks (Bank0–Bank7), shown alongside the distributed-L2 layout from the previous slide]

[Beckmann et al., MICRO’06; Beckmann and Wood, MICRO’04]
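For a statically banked shared L2 like CIM, a common baseline (assumed here for illustration, not taken from the slides) is to interleave cache lines across banks using low-order line-address bits; DNUCA then relaxes this fixed home mapping by letting lines migrate:

```python
# Simple bank interleaving for a banked shared L2 (parameters assumed).
LINE_SIZE = 64      # bytes per cache line (assumption)
NUM_BANKS = 8       # one bank per CPU in the 8-core CIM layout

def home_bank(addr: int) -> int:
    """Low-order line-address bits select the bank (simple interleaving)."""
    return (addr // LINE_SIZE) % NUM_BANKS

# Consecutive lines spread evenly across all eight banks:
banks = [home_bank(a) for a in range(0, 8 * LINE_SIZE, LINE_SIZE)]
# banks == [0, 1, 2, 3, 4, 5, 6, 7]
```

Interleaving balances capacity and bandwidth across banks, but it is oblivious to which processor uses a line — the problem the next slides examine.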

Page 6

Remoteness of Shared Data

Shared data inevitably resides far from (some of) its clients

Long access times

[Figure: shared lines highlighted in both the distributed-L2 and CIM layouts, far from several of their client processors]

Page 7

Observations on Memory Accesses

For many parallel applications (Splash-2, SpecOMP, Apache, Specjbb, STM, …):

1. Access to shared lines is substantial

2. Shared lines are shared by many processors

3. A small number of lines account for a large fraction of the total accesses

A small number of lines, shared by many processors, is accessed numerous times

⇒ Shared hot lines effect
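These observations can be quantified on a memory-access trace. The sketch below (trace format is an assumption for illustration) counts per-line accesses and distinct sharers, and the fraction of all accesses that hit shared lines:

```python
from collections import Counter, defaultdict

def hot_line_stats(trace):
    """trace: iterable of (cpu_id, line_addr) pairs (format assumed).
    Returns per-line access counts, per-line sharer sets, and the
    fraction of all accesses that go to lines touched by >1 CPU."""
    accesses = Counter()
    sharers = defaultdict(set)
    for cpu, line in trace:
        accesses[line] += 1
        sharers[line].add(cpu)
    total = sum(accesses.values())
    shared_fraction = sum(n for l, n in accesses.items()
                          if len(sharers[l]) > 1) / total
    return accesses, sharers, shared_fraction

# Tiny synthetic trace: line "A" is a shared hot line, "B" is private.
trace = [(0, "A"), (1, "A"), (2, "A"), (0, "B"), (0, "A"), (3, "A")]
accesses, sharers, frac = hot_line_stats(trace)
# "A": 5 accesses by 4 distinct CPUs; shared_fraction == 5/6
```

On real traces, a distribution in which a handful of lines dominate both the access count and the sharer count is the shared-hot-lines effect named above.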

Page 8


Shared Data Hinders Cache Performance

What can be done better?

Bring shared data closer to all processors

Preserve vicinity of private data


Page 9

This Has Been Addressed Before

[Figure: aerial view of the Nahalal cooperative village, alongside an overview of the Nahalal cache organization — processors P0–P7 arranged around a shared center]

Page 10


Nahalal Layout

A new architectural differentiation of cache lines:

According to the way the data is used

According to the way the data is used

Private vs. Shared

Designated area for shared data lines in the center:

Small & fast structure

Close to all processors

Outer rings used for private data:

Preserves vicinity of private data

[Figure: schematic Nahalal layout (P0–P7 around a shared center) and a more realistic floorplan — eight CPUs surrounding a central SharedBank, with private banks Bank0–Bank7 on the periphery]

Page 11

Nahalal Cache Management

Where does the data go?

First access – go to private yard of requester

Accesses by additional cores – go to the middle

On eviction from over-crowded middle, can go to any sharer’s private yard

In typical workloads, virtually all accesses to shared data are satisfied from the middle

[Figure: Nahalal floorplan — eight CPUs around the central SharedBank, private banks Bank0–Bank7 on the periphery]
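The placement rules above can be sketched as a small state machine. Structure names and the random choice of sharer on demotion are assumptions for illustration, not details from the slides:

```python
import random

class NahalalL2:
    """Sketch of the Nahalal placement rules: a first access fills the
    requester's private yard; an access by a second core promotes the
    line to the shared middle; eviction from a crowded middle demotes
    the line to some sharer's private yard (choice policy assumed)."""

    def __init__(self):
        self.placement = {}   # line -> "middle" or a CPU's yard id
        self.sharers = {}     # line -> set of CPUs that accessed it

    def access(self, cpu, line):
        if line not in self.placement:
            self.placement[line] = cpu          # first access: private yard
            self.sharers[line] = {cpu}
        else:
            self.sharers[line].add(cpu)
            if len(self.sharers[line]) > 1:     # second core: promote
                self.placement[line] = "middle"
        return self.placement[line]

    def evict_from_middle(self, line):
        # On pressure in the middle, demote to any sharer's private yard.
        self.placement[line] = random.choice(sorted(self.sharers[line]))

l2 = NahalalL2()
l2.access(0, "x")          # placed in CPU0's private yard
l2.access(3, "x")          # second sharer: promoted to the middle
l2.evict_from_middle("x")  # demoted to the yard of CPU0 or CPU3
```

Since promotion is triggered by the second distinct sharer, hot shared lines reach the middle quickly, which is why nearly all shared-data accesses hit there in typical workloads.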

Page 12

Simulations

Full-system simulation via Simics

8-processor CMP

Private L1 for each processor (32KByte)

16MByte of shared L2

[Figure: the two simulated layouts — Nahalal (eight CPUs around a central SharedBank with peripheral banks Bank0–Bank7) and CIM (eight CPUs above and below a row of banks Bank0–Bank7)]

CIM (Cache In the Middle): 2MB near each processor

Nahalal: 1.875MB near each processor + 1MB in the middle

Page 13

Cache Performance

26.8% average improvement in cache hit time; 41.1% in apache

[Bar chart: average cache hit time in clock cycles, CIM vs. Nahalal, for equake, fma3d, barnes, water, apache, zeus, specjbb, RBTree, and HashTable; per-benchmark improvements labeled 3.9%, 8.57%, 40.53%, 41.1%, 29.06%, 29.35%, 39.4%, 29.1%, 24.2%]

Page 14

Average Distance – Shared vs. Private

Nahalal shortens the distance to shared data

Distance to private data remains roughly the same

[Bar chart: average relative distance, CIM vs. Nahalal, for equake, fma3d, barnes, water, apache, zeus, specjbb, RBTree, and HashTable]
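The geometric intuition can be made concrete with a toy floorplan (coordinates are assumptions for illustration): the average Manhattan distance from eight surrounding CPUs to a central shared bank is shorter than to any fixed peripheral bank, while each CPU's own yard stays adjacent to it.

```python
# Eight CPUs on a ring around a 3x3 grid (toy coordinates, assumed).
cpus = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
center = (1, 1)          # Nahalal: shared data placed in the middle
corner = (0, 0)          # CIM-like: shared line stuck near one CPU

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

avg_to_center = sum(manhattan(c, center) for c in cpus) / len(cpus)
avg_to_corner = sum(manhattan(c, corner) for c in cpus) / len(cpus)
# avg_to_center == 1.5, avg_to_corner == 2.0
```

The central placement is never the closest spot for any single CPU, but it minimizes the average distance over all sharers, which is what the chart above measures.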

Page 15

Putting Nahalal into Practice

Line search: how to find a line within the cache

Line migration: when and where to move a line between places in the cache

Scalability: how far can we take the Nahalal structure

“The difference between theory and practice is always larger in practice than it is in theory” [Peter H. Salus]

Page 16

Summary

Weakness of state-of-the-art caches:

Remoteness of shared data

Software behavior: Shared-hot-lines effect

Shared data hinders cache performance

Nahalal cache architecture:

Places shared lines closer to all processors

Preserves vicinity of private data

A new architectural differentiation of cache lines:

Not all data should be treated equally

Data-usage-aware design


Questions?

Page 17

Backup

Page 18

Scalability Issues

This has (also) been addressed before

[Figure: a cluster of Garden Cities (Ebenezer Howard, 1902) alongside a clustered Nahalal CMP design; villages labeled Nahalal and Kfar Yehoshua]