1 e. bolotin – the power of priority, nocs 2007 the power of priority : noc based distributed...

22
1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority: NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon, Ran Ginosar, Avinoam Kolodny QNoC Research Group Technion EE Department Technion, Haifa, Israel

Post on 21-Dec-2015

224 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

1 E. Bolotin – The Power of Priority, NoCs 2007

The Power of Priority:NoC based Distributed Cache

Coherency

Evgeny Bolotin, Zvika Guz, Israel Cidon, Ran Ginosar, Avinoam Kolodny

QNoC Research GroupTechnion

EE Department Technion, Haifa, Israel

Page 2: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

2 E. Bolotin – The Power of Priority, NoCs 2007

Chip Multi-Processor (CMP)

Dual-Core

Monolithic shared cache

0 7

56 63

P0 P1

P5 P4

P6

P7

P3

P2

Distributed L2

Multi-Core

Large cache

Shared cache

Distributed cache

NoC-based: How?

Page 3: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

3 E. Bolotin – The Power of Priority, NoCs 2007

• Global wires delayGlobal wire delay

100

1

10

0.1 250130 90 65 45 32180250250

Gate delay

Source: ITRS 2003

Global Wires Delay

Future Cache - Physics Perspective

• Large cache Large access time

Fraction of chip reachable in 1 clock cycleSource: Keckler et al. ISSCC 2003

• Distance reached in single cycle Today: ~25% of chip In 10 years: ~1% of chip

Large monolithic cache is not scalable

Page 4: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

4 E. Bolotin – The Power of Priority, NoCs 2007

NUCA - Non Uniform Cache Architecture

NUCA= Non uniform access times

Banked cache over NoC Smaller bank Smaller Access Time Multiple banks Multiple Ports Closer bank Smaller Access Time

Cache-line placement policy

• Static NUCA (SNUCA)

• Dynamic NUCA (DNUCA)

Sources:Kim et al. ASPLOS 2002Beckmann et al. MICRO 2004

Page 5: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

5 E. Bolotin – The Power of Priority, NoCs 2007

Issues in NUCA-based CMP

0 7

56 63

P0 P1

P5 P4

P6

P7

P3

P2

Distributed L2

• NoC performance CMP performance

• Cache coherency and transaction order (correctness)

• Search (in DNUCA)

• Different traffic types (e.g. fetch vs. prefetch)

• Synchronization (locks)

NoC Services for CMP?

Page 6: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

6 E. Bolotin – The Power of Priority, NoCs 2007

Cache Coherency over NoC

0 7

56 63

P0 P1

P5 P4

P6

P7

P3

P2

Distributed L2

How do we maintain coherency over NoC?

• Snooping

• Central directory

cache line status vec. D

cache line status vec. D

cache line status vec. D

cache line status vec. D

cache line status vec. D

cache line status vec. D

Cache lines Dist. Directory

Cache bank with distributed directory

• Distributed directory

Page 7: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

7 E. Bolotin – The Power of Priority, NoCs 2007

Distributed Cache Coherency

Example: Simple read transaction

L2

Dire

ctor

y

P0L1

1. READ REQ

2. READ RESP (data transfer)

NoC

P0-Shared

Cache access Multiple NoC transactions

Ctrl. packet

Data packet

Page 8: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

8 E. Bolotin – The Power of Priority, NoCs 2007

Read Transaction of Modified Block

L2

Dir

ect

ory

P2L1

P0L1

2. READ RESP (data transfer)

No

C

No

C

P2-MOD.

L2D

ire

cto

ryP2L1

P0L1

4. WR BACK REQ3. READ REQ

6. READ RESP (data transfer)

5. WR BACK RESP(data transfer)

No

C

No

CP0-SHARED

1. READ EXCL. REQ

Ctrl. packet

Data packet

Page 9: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

9 E. Bolotin – The Power of Priority, NoCs 2007

Read Exclusive of Shared Block

L2

Dire

ctor

y

NoC

NoC

NoC

P1L1

P2L1

P0L1

1. READ. REQ

1. R

EA

D R

EQ

P1-SharedP2-Shared

L2

Dire

ctor

y

NoCN

oC

NoC

P1L1

P2L1

P0L1

3. READ EXCL. REQ

6. Read EXCL. RESP (data transfer)

5. INVALID. ACK

5. IN

VA

LID

. AC

K

P0-MOD.

Ctrl. packet

Data packet

Page 10: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

10 E. Bolotin – The Power of Priority, NoCs 2007

• Smart interfaces

Basic NoC to Support CMP

Can We Do Better?

Off-the-shelf (Vanilla) NoC:

• Grid of wormhole routers

L2

Dire

ctor

y

NoC

NoC

NoC

P1L1

P2L1

P0L1

4. IN

VALID. R

EQ

3. READ EXCL. REQ

6. Read EXCL. RESP (data transfer)

5. INVALID. ACK

5. IN

VA

LID

. AC

K

P0-MOD.

• Unicast only

• Ordering in network Static routing No virtual channels

Vanilla NoC

Page 11: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

11 E. Bolotin – The Power of Priority, NoCs 2007

Observations: L2 Access

A) Delay = Queueing + NoC transactions B) All NoC transactions are equally important

C) NoC transactions consist of:• Short ctrl. packets• Long data packets

Idea: Differentiate between Ctrl. and Data

Solution: Preemptive Priority NoC Give priority to short ctrl. packets

L2

Dire

cto

ry

NoC

NoC

NoC

P1L1

P2L1

P0L1

4. IN

VALID. R

EQ

3. READ EXCL. REQ

6. Read EXCL. RESP (data transfer)

5. INVALID. ACK

5.

INV

AL

ID.

AC

K

P0-MOD.

Page 12: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

12 E. Bolotin – The Power of Priority, NoCs 2007

Preemptive Priority NoC: QNoC

Multiple SL link

QNoC

Input ports Output ports

BufSize

SL 0

SL 1

CR

OS

S-B

AR

Scheduler CREDITControlCREDIT

SL 2

SL 3

SL 0

SL 1

SL 2

SL 3

Physical Link

Output Input

SL 0

SL 1

SL 2

SL 3

SL 0

SL 1

SL 2

SL 3

Service Levels:

• Dedicated wormhole buffer

• Preemptive priority scheduling

Multiple SL Router

Page 13: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

13 E. Bolotin – The Power of Priority, NoCs 2007

Example: Vanilla NoC

Blue delay ~XRed delay ~ 2X+δAverage delay ~ 1.5X

Vanilla NoC example

A B

Without contention:X:Delay of long packetδ:Delay of short packetLong Data

Transaction 1

Short Req.

Long Resp.

Transaction 2

Page 14: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

14 E. Bolotin – The Power of Priority, NoCs 2007

Example: Priority NoC

Blue delay=XRed delay = 2X+δAverage delay ~ 1.5X

Without contention:X:Delay of long packetδ:Delay of short packet

Vanilla NoC example

A BBlue delay= X+δ Red delay = X+δAverage delay ~ X

Potential delay reduction ~ 0.5X

Priority NoC example

Long Data

Transaction 1

Short Req.

Long Resp.

Transaction 2

Page 15: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

15 E. Bolotin – The Power of Priority, NoCs 2007

Priority NoC: Different Destinations

Very important in wormhole • When ctrl. packet is blocked by other worms

Short Req.

Long Data

Page 16: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

16 E. Bolotin – The Power of Priority, NoCs 2007

Protocol Correctness

L2

Dir

ect

ory

1. Read Req.

2. Read Resp.

4. Invalidation Req.

P0L1

P1L1

3. Read Excl. Req.Legend:

High Priority (ctrl.)

Low Priority (data)

Need state-preserving serialization of transactions in

the processor interface

Page 17: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

17 E. Bolotin – The Power of Priority, NoCs 2007

Numerical Evaluation

• CMP simulator (SIMICS)

Simulate parallel benchmarks

Obtain L2-cache access traces

• QNoC simulator (OPNET)

Simulate distributed coherence protocol over NoC

Measure total RD/RX L2-access delay

Measure total program throughput

0 7

56 63

P0 P1

P5 P4

P6

P7

P3

P2

Distributed L2

Page 18: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

18 E. Bolotin – The Power of Priority, NoCs 2007

Priority NoC: Results

Av. Delay Reduction of L2-Transaction in Apache

0.00

5.00

10.00

15.00

20.00

25.00

30.00

1 4 16Link Capacity [gbps]

De

lay

Re

du

cti

on

[%

] Read

Read Exclusive

Av. Delay of L2-Read in Apache

234

5762

286

1301

994

0

200

400

600

800

1000

1200

1400

1 4 16Link Capacity[gbps]

De

lay

[c

yc

les

]

Vanilla NoC

Priority-based NoC

• Short ctrl. packet gets high priority• Long data packet gets low priority

Delay Reduction vs. Network Load

RD Delay - Apache RD/RX Delay Reduction - Apache

Page 19: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

19 E. Bolotin – The Power of Priority, NoCs 2007

Priority NoC: Several Benchmarks

L2 Access Delay Reduction by Priority-based NoC

22.6

31.8

19.6

28.4

13.5

25.3

18.3

32.9

22.3

28.0

0.0

5.0

10.0

15.0

20.0

25.0

30.0

35.0

apache zeus fft ocean radix

De

lay

Re

du

cti

on

[%

]

Read Read Exclusive

Delay Reduction Program Speedup

Total Program Speedup by Priority-based NoC

9.48.7 9.0

8.6

5.0

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

9.0

10.0

apache zeus fft ocean radix

Sp

ee

du

p [

%]

Page 20: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

20 E. Bolotin – The Power of Priority, NoCs 2007

So Far: The Power of Priority

• Simplicity - Almost for Free

• Significant CMP Speed-up

Good For:

• Coherency

• Traffic differentiation (e.g. Fetch vs. Pre-Fetch)

• Search in DNUCA

• Synchronization (Locks)

0 7

56 63

P0 P1

P5 P4

P6

P7

P3

P2

Distributed L2

Page 21: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

21 E. Bolotin – The Power of Priority, NoCs 2007

• Special Broadcast for Short Messages

Broadcast service (e.g. search in DNUCA)

Wormhole broadcast slow and expensive

S&F broadcast embedded in wormhole

• Virtual Ring

No Additional Cost

For Invalidation Multicast

Snooping or synchronization

Advanced Support Functions

S

Source

Replicating

Forwarding

0 7

56 63

P0 P1

P5 P4

P6

P7

P3

P2

Page 22: 1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

22 E. Bolotin – The Power of Priority, NoCs 2007

Summary

NoC at CMP Service!

• Shared cache over NoC

• Priority is powerful

• Built-in support functions