
Kilo-NOC: A Network-on-Chip Architecture for Scalability and Service Guarantees

Boris Grot, The University of Texas at Austin

[Figure: Technology Trends. Transistor count (log scale, 1,000 to 10,000,000,000) vs. year of introduction (1970-2010), from the 4004, 8086, 286, 386, and 486 through Pentium, Pentium 4, Pentium D, Core i7, and Xeon Nehalem-EX.]

2

Perf/$

Perf/Watt

3

Technology Applications

4

Networks-on-Chip (NOCs)

The backbone of highly integrated chips
Transport of memory, operand, and control traffic
Structured, packet-based, multi-hop networks
Increasing importance with greater levels of integration

Major impact on chip performance, energy, and area

TRIPS: 28% performance loss on SPEC 2K in the NOC

Intel Polaris: 28% of chip power consumption in NOC

Moving data is more expensive [energy-wise] than operating on it - William Dally, SC ‘10

5

On-chip vs Off-chip Interconnects

[Table: on-chip and off-chip interconnects compared on topology, routing, flow control, pins, bandwidth, power, and area.]

Future NOC Requirements
100s to 1000s of network clients: cores, caches, accelerators, I/O ports, …
Efficient topologies: high performance, small footprint
Intelligent routing: performance through better load balance
Light-weight flow control: high performance, low buffer requirements
Service guarantees: cloud computing and real-time apps demand QOS support

6

Contributions: HPCA '08, HPCA '09, MICRO '09, and two papers under submission.

7

Outline
Introduction
Service Guarantees in Networks-on-Chip: motivation; desiderata and prior work; Preemptive Virtual Clock; evaluation highlights
Efficient Topologies for On-chip Interconnects
Kilo-NOC: A Network for 1000+ Nodes
Summary and Future Work

8

Why On-chip Quality-of-Service?
Shared on-chip resources (memory controllers, accelerators, the network-on-chip) require QOS support: fairness, service differentiation, performance isolation.
End-point QOS solutions are insufficient: data has to traverse the on-chip network, so QOS support is needed at the interconnect level.
Goal: hard guarantees in NOCs.

9

NOC QOS Desiderata
Fairness
Isolation of flows
Bandwidth efficiency
Low overhead: delay, area, energy

10

Conventional QOS Disciplines
Fixed schedule. Pros: algorithmic and implementation simplicity. Cons: inefficient BW utilization; per-flow queuing. Example: Round Robin.
Rate-based. Pros: fine-grained scheduling; BW efficient. Cons: complex scheduling; per-flow queuing. Example: Weighted Fair Queuing (WFQ) [SIGCOMM '89].
Frame-based. Pros: good throughput at modest complexity. Cons: throughput-complexity trade-off; per-flow queuing. Example: Rotating Combined Queuing (RCQ) [ISCA '96].
Common to all: per-flow queuing, which brings area, energy, and delay overheads as well as scheduling complexity (see the sketch below).
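To make the per-flow bookkeeping cost concrete, here is a minimal sketch of what a rate-based discipline like WFQ tracks per packet; the class names and the simplified virtual-time handling are illustrative assumptions, not the exact WFQ algorithm.

```python
# Minimal sketch of rate-based (WFQ-style) scheduling bookkeeping.
# Illustrative only: real WFQ derives system virtual time from the
# set of backlogged flows; here it is passed in as a parameter.

class Flow:
    def __init__(self, weight):
        self.weight = weight      # provisioned share of link bandwidth
        self.last_finish = 0.0    # virtual finish time of previous packet

def virtual_finish(flow, pkt_len, virtual_time):
    """Assign a virtual finish time; smaller times are served first."""
    start = max(virtual_time, flow.last_finish)
    flow.last_finish = start + pkt_len / flow.weight
    return flow.last_finish

# The scheduler serves the queued packet with the smallest finish time,
# which is why WFQ needs per-flow queues plus a sorted queue of flows:
# exactly the area, energy, and delay overheads listed above.
```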

11

Preemptive Virtual Clock (PVC) [HPCA '09]
Goal: a high-performance, cost-effective mechanism for fairness and service differentiation in NOCs.
Full QOS support: fairness, prioritization, performance isolation.
Modest area and energy overhead: minimal buffering in routers and source nodes.
High performance: low latency, good BW efficiency.

12

PVC: Scheduling
Combines rate-based and frame-based features.
Rate-based: evolved from Virtual Clock [SIGCOMM '90]. Routers track each flow's bandwidth consumption, enabling cheap priority computation: priority = f(provisioned rate, consumed BW).
Problem: the history effect, where bandwidth a flow consumed long ago keeps depressing its priority.
Framing: PVC's solution to the history effect. Frames have a fixed duration, and a frame rollover resets all BW counters and, with them, all priorities. A sketch of this bookkeeping follows below.
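A minimal sketch of the scheduling state just described, under assumed names and a simplified priority function; the real f(provisioned rate, consumed BW) is cheap comparator logic in hardware, not Python.

```python
# Sketch of PVC-style priority tracking at a router (illustrative).
# Each flow has a provisioned rate; routers count consumed bandwidth.
# Priority favors flows furthest below their provisioned rate.

class FlowState:
    def __init__(self, provisioned_rate):
        self.rate = provisioned_rate  # e.g., flits per frame
        self.consumed = 0             # flits forwarded this frame

def priority(flow):
    # Lower consumed/rate ratio => higher priority; this stands in
    # for the hardware's cheap f(rate, consumed BW) comparison.
    return flow.consumed / flow.rate

def on_forward(flow, flits):
    flow.consumed += flits

def on_frame_rollover(flows):
    # Framing removes the history effect: all counters (and hence
    # all priorities) reset at the fixed frame boundary.
    for f in flows:
        f.consumed = 0
```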

15

PVC: Freedom from Priority Inversion
PVC uses simple routers without per-flow buffering and without bandwidth reservation.
Problem: high-priority packets may be blocked by lower-priority packets (priority inversion).
Solution: preemption, i.e., dropping, of lower-priority packets.

17

PVC: Preemption Recovery
Dropped packets are retransmitted: outstanding packets are buffered at the source node, and an ACK/NACK protocol runs over a dedicated network.
All packets are acknowledged; the dedicated network is narrow and low-complexity, and the overhead is lower than timeout-based recovery.
In a 64-node network, a 30-flit backup buffer per node suffices (sketched below).
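A sketch of the source-side recovery just described; the buffer structure, callback, and handler names are assumptions for illustration rather than PVC's exact protocol.

```python
# Sketch of PVC preemption recovery at a source node (illustrative).
# Outstanding packets stay in a small backup buffer until ACKed;
# a NACK (preemption notice) triggers retransmission.

from collections import OrderedDict

class SourceNode:
    def __init__(self, capacity=30):      # ~30 flits suffice at 64 nodes
        self.backup = OrderedDict()       # seq -> packet copy
        self.capacity = capacity

    def send(self, seq, packet, inject):
        assert len(self.backup) < self.capacity, "stall until ACKs drain"
        self.backup[seq] = packet         # keep a copy for recovery
        inject(packet)

    def on_ack(self, seq):
        self.backup.pop(seq, None)        # delivery confirmed: free the slot

    def on_nack(self, seq, inject):
        inject(self.backup[seq])          # preempted: retransmit the copy
```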

18

PVC: Preemption Throttling
Relaxed definition of priority inversion: reduces preemption frequency at a small fairness penalty.
Per-flow bandwidth reservation: flits within the reserved quota are non-preemptible; the reserved quota is a function of rate and frame size.
Coarsened priority classes: masking out the lower-order bits of each flow's BW counter induces coarser priority classes, enabling a fairness/throughput trade-off (see the sketch below).
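The coarsening step amounts to a mask over the bandwidth counter; a one-function sketch, with the mask width as an assumed parameter:

```python
def priority_class(bw_counter, mask_bits=4):
    # Masking the low-order bits coarsens priorities: flows whose
    # counters differ only in the masked bits now tie, so fewer
    # packets count as "inverted" and fewer preemptions fire.
    return bw_counter >> mask_bits
```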

19

PVC: Guarantees
Minimum bandwidth: based on the reserved quota.
Fairness: subject to BW counter resolution.
Worst-case latency: a packet that enters its source buffer in frame N is guaranteed delivery by the end of frame N+1.
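Stated as a bound, assuming a fixed frame duration of F cycles (notation ours):

```latex
% Worst case: the packet enters its source buffer at the very start of
% frame N and is delivered at the very end of frame N+1, i.e., after
% at most two full frame durations.
T_{\mathrm{worst}} \le 2F
```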

20

Performance Isolation

PARSEC + Stream

21

Performance Isolation
Baseline NOC: no QOS support.
Globally Synchronized Frames (GSF) [J. Lee et al., ISCA 2008]: a frame-based scheme adapted for on-chip implementation. Source nodes enforce bandwidth quotas via self-throttling, multiple frames are kept in flight for performance, and the network prioritizes packets by frame number.
Preemptive Virtual Clock (PVC): highest fairness setting (unmasked bandwidth counters).

22

Performance Isolation
[Figure: PARSEC network slowdown (y-axis 1x to 8x) under No QOS, GSF, and PVC; the off-scale No QOS bars are labeled 40, 50, 46, 46, 34, and 55.]

23

PVC Summary
Full QOS support: fairness and service differentiation; strong performance isolation.
High performance: inelaborate routers give low latency; good bandwidth efficiency.
Modest area and energy overhead: 3.4 KB of storage per node (1.8x a no-QOS router); 12-20% extra energy per packet.

24


Will it scale to 1000 nodes?

25

Outline
Introduction
Service Guarantees in Networks-on-Chip
Efficient Topologies for On-chip Interconnects: mesh-based networks; toward low-diameter topologies; Multidrop Express Channels
Kilo-NOC: A Network for 1000+ Nodes
Summary and Future Work

26

NOC Topologies
Topology is the principal determinant of network performance, cost, and energy efficiency.
Topology desiderata: rich connectivity reduces router traversals; high bandwidth reduces latency and contention; low router complexity reduces area and delay.
On-chip constraints: 2D substrates limit implementable topologies; logic area/energy constrains the use of wire resources; power constraints restrict routing choices.

2-D Mesh

27

Pros: low design and layout complexity; simple, fast routers.
Cons: large diameter, with energy and latency impact.
[Figure: 2-D mesh of tiles, each with a processor (P) and cache ($).]

28

Concentrated Mesh (Balfour & Dally, ICS '06)
Objectives: improve connectivity; exploit the wire budget.
Pros: multiple terminals at each node; fast nearest-neighbor communication via the crossbar; hop count reduction proportional to the concentration degree.
Cons: benefits limited by crossbar complexity.

30

Flattened Butterfly (Kim et al., MICRO '07)
Point-to-point links; nodes fully connected in each dimension.

31

Flattened Butterfly
Pros: excellent connectivity; low diameter (2 hops).
Cons: high channel count (k²/2 per row/column); low channel utilization; control complexity.

32

Flattened Butterfly

Objectives: connectivity; a more scalable channel count; better channel utilization.

33

[Grot et al., MICRO '09]

Multidrop Express Channels (MECS)

34

Multidrop Express Channels (MECS)

Point-to-multipoint channels: single source, multiple destinations.
At each drop point, a flit either propagates further or exits into a router.

35

Multidrop Express Channels (MECS)

36

Pros: one-to-many topology; low diameter (2 hops); k channels per row/column.
Cons: I/O asymmetry; control complexity.

Multidrop Express Channels (MECS)

MECS Summary

MECS: a novel one-to-many topology with excellent connectivity, effective wire utilization, and a good fit for planar substrates.
Results summary: MECS achieves the lowest latency and high energy efficiency; mesh-based topologies have the best throughput; the flattened butterfly has the smallest router area.
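A quick sanity check of these trade-offs, taking the slides' per-row channel counts at face value; the helper functions and printout are illustrative.

```python
# Rough comparison of k x k on-chip topologies (figures per the slides).

def mesh_diameter(k):
    return 2 * (k - 1)      # worst case: corner to corner, hop by hop

def low_diameter(k):
    return 2                # FBfly and MECS: one hop per dimension

def fbfly_channels_per_row(k):
    return k * k // 2       # ~k^2/2 per row/column (per the slides)

def mecs_channels_per_row(k):
    return k                # k per row/column: scalable in k

for k in (4, 8):
    print(f"k={k}: mesh diameter {mesh_diameter(k)} hops, "
          f"FBfly/MECS {low_diameter(k)} hops; "
          f"channels per row: FBfly {fbfly_channels_per_row(k)}, "
          f"MECS {mecs_channels_per_row(k)}")
```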

37

38

Outline
Introduction
Service Guarantees in Networks-on-Chip
Efficient Topologies for On-chip Interconnects
Kilo-NOC: A Network for 1000+ Nodes: requirements and obstacles; topology-centric Kilo-NOC architecture; evaluation highlights
Summary and Future Work

Summary and Future Work

39

Scaling to a Kilo-node NOC
Goal: a NOC architecture that scales to 1000+ clients with good efficiency and strong guarantees.
MECS scalability obstacles: buffer requirements grow (more ports, deeper buffers), bringing area, energy, and latency overheads.
PVC scalability obstacles: flow state and other storage bring area and energy overheads; preemptions bring energy and latency overheads; prioritization and arbitration bring latency overheads.

40


Kilo-NOC addresses both topology and QOS scalability bottlenecks.
This talk: reducing QOS overheads.

41

NOC QOS: Conventional Approach
Multiple virtual machines (VMs) share a die: shared resources (e.g., memory controllers) plus VM-private resources (cores, caches).
[Figure: 4×4 die map with a QOS-enabled router (Q) at every node; nodes are partitioned among VM #1, VM #2, and VM #3.]

42

NOC QOS: Conventional Approach
NOC contention scenarios: shared-resource accesses (e.g., memory access); intra-VM traffic (shared cache access); inter-VM traffic (VM page sharing).
[Figure: the same die map, with QOS-enabled routers at every node.]


Network-wide guarantees without network-wide QOS support

46

Kilo-NOC QOS: Topology-centric Approach
Dedicated, QOS-enabled regions; the rest of the die is QOS-free.
A richly-connected topology (MECS) provides traffic isolation.
Special routing rules ensure interference freedom.
[Figure: die map with QOS-enabled routers (Q) confined to one region; the remaining nodes, partitioned among VM #1, VM #2, and VM #3, are QOS-free.]
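A sketch of the kind of routing rule this approach implies; the coordinate scheme, the single QOS column, and the predicate are assumptions for illustration, not the exact Kilo-NOC routing rules.

```python
# Illustrative routing rule for a topology-centric QOS design:
# intra-VM traffic never needs QOS arbitration, while traffic that
# can interfere across VMs is steered through the QOS-enabled
# region, where PVC applies.

QOS_COLUMN = 3   # assumed x-coordinate of the QOS-enabled region

def route(src, dst, same_vm):
    """Return waypoints for a packet from src to dst ((x, y) tuples)."""
    if same_vm:
        # Intra-VM traffic: MECS connectivity reaches any node in the
        # VM's region in <= 2 hops without touching QOS routers.
        return [src, dst]
    # Inter-VM / shared-resource traffic: detour through the QOS
    # column so all cross-VM interference is arbitrated there.
    waypoint = (QOS_COLUMN, src[1])
    return [src, waypoint, dst]
```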

47


Performance Isolation

[Figure: grid of nodes (S) with four memory controllers (MC); a Stream aggressor runs on a PVC-enabled MECS topology.]

52

Performance Isolation

[Figure: the same grid of nodes and memory controllers on a MECS topology, evaluated with and without network-wide PVC QOS.]

53

Performance Isolation

[Figure: PARSEC network slowdown (y-axis 1.0x to 2.0x) for MECS, MECS + PVC, and K-MECS.]

54

Summary: Scaling NOCs to 1000+ Nodes
Objectives: good performance, high energy- and area-efficiency, service guarantees.
MECS topology: a point-to-multipoint interconnect fabric whose rich connectivity improves performance and efficiency.
PVC QOS scheme: a preemptive architecture reduces buffer requirements; strong guarantees and performance isolation.

55

Summary: Scaling NOCs to 1000+ Nodes
Topology-aware QOS architecture: limits the extent of QOS support to a fraction of the die; reduces network cost and improves performance; enables efficiency-boosting optimizations in the QOS-free regions of the chip.
Kilo-NOC compared to MECS+PVC: NOC area reduction of 47%; NOC energy reduction of 26-53%.

56

Acknowledgements
Faculty: Steve Keckler (advisor), Doug Burger, Onur Mutlu, Emmett Witchel.
Collaborators: Paul Gratz, Joel Hestness.
Special thanks: the awesome CART group.

57