Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks

Aniruddha N. Udipi

with Naveen Muralimanohar*,

Rajeev Balasubramonian

University of Utah and *HP Labs

University of Utah 2

Motivation - I

• Future CMPs are likely to be power-limited
– On-chip networks consume 20-36% of total chip power
– Network power dominated by routers

• Chip design and verification costs are tremendous
– Directory-based protocols are complicated and have the inherent problem of indirection
– Snooping-based protocols are well understood and simple to design

• Metal and wiring are cheap and plentiful

• We are no longer pin limited for the interconnection network


Motivation - II

• Future of multi-core computing likely to diverge into two separate tracks

– Mid-range multicore machines for home/office
  • 16-64 cores

– Many-core machines for scientific/server applications
  • 1000s of cores

• Even machines with large core counts are likely to be virtualized, with communication localized to small chunks of approx. 64 cores

• Design energy-efficient networks for moderate core-counts


Executive Summary

• Elimination of routers leads us back to bus-based networks

• Dramatic reduction in energy consumption, little or no loss in performance, reduction in design complexity

• Enhancing the life of buses for moderately sized CMPs
– Filtered segmented bus, low-swing wiring, address interleaved buses, page coloring


Outline

• Overview
• Proposal I - Filtered Segmented Bus
• Proposal II - Low-swing Wiring
• Proposal III - Address Interleaved Buses
• Proposal IV - Page Coloring
• Evaluation
• Conclusion

Baseline Chip and Interconnect Organization

[Figure: baseline chip organization. Labels: Core, L1, L2, Router]

• Simple mesh used for illustration here; other options are discussed in the paper

• Static-NUCA shared L2; each line has a “home” slice based on its address


Where does energy go in the network?

[Figure: network energy per access. Router: 1.39e-10 J/access; Link: 1.56e-11 J/access; annotation: 8X]

Energy estimates based on CACTI 6.0 and Orion 2.0


What is the solution?

• We are left with.. a bus!

• Could we really just use a bus?

• Not really
– Too many links activated on every transaction
– Energy gained by eliminating routers lost by activating more links
– Poor performance due to increased arbitration times and network contention


We can do better..

Useless snoop: Particular cache line not present in any other core

• Segment and filter snoop transactions at intermediate points

• Two types of filters
– Out-filter
– In-filter

• Reduces number of links activated

• Allows for safe parallelism (serialization happens at the central bus if required)

[Figure: filtered bus. Legend: bus link, filter]

Filters

• Each “filter” depicted in the figure is a combination of an “Out-filter” and an “In-filter”

• Each of these is a Counting Bloom Filter

– 2 arrays of 10-bit entries

– Subsets of the address bits hashed into each of these arrays, incremented to add entries, decremented to remove entries

– To test for membership, simply check if entries in both arrays are non-zero

– Compact representation, false positives possible
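The filter described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the array size (`index_bits`), the 64-byte line offset, and the choice of which address bits feed each array are hypothetical, and the counters are plain Python integers rather than bounded 10-bit entries.

```python
class CountingBloomFilter:
    """Counting Bloom filter with two counter arrays.

    Assumptions (not from the slides): 2**index_bits entries per array,
    64-byte cache lines, and two adjacent bit fields of the line address
    used as the two hash functions. Counters are unbounded here; the
    slide specifies 10-bit entries.
    """

    def __init__(self, index_bits=12, line_offset_bits=6):
        self.index_bits = index_bits
        self.line_offset_bits = line_offset_bits
        size = 1 << index_bits
        self.arrays = [[0] * size, [0] * size]

    def _indices(self, addr):
        line = addr >> self.line_offset_bits
        mask = (1 << self.index_bits) - 1
        # A different subset of the address bits indexes each array.
        return (line & mask, (line >> self.index_bits) & mask)

    def add(self, addr):
        for arr, i in zip(self.arrays, self._indices(addr)):
            arr[i] += 1

    def remove(self, addr):
        # Only remove addresses that were previously added.
        for arr, i in zip(self.arrays, self._indices(addr)):
            if arr[i] > 0:
                arr[i] -= 1

    def may_contain(self, addr):
        # Member only if the entries in BOTH arrays are non-zero.
        # False positives are possible; false negatives are not.
        return all(arr[i] > 0
                   for arr, i in zip(self.arrays, self._indices(addr)))
```

Two unrelated addresses can jointly make a third address look present, which is the false-positive case the slide mentions; removing an address that was added never hides one that is still present.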

[Figure legend: bus link, combined in + out filter]

Out-filter - Case 1

• Bloom filter in every segment keeps track of a superset of lines that call that segment “home” and have been sent “out” of that segment

• If a line has never left a segment, none of its transactions need to be seen outside

Energy Saved

• Completely localized transaction

• Only home segment activated

[Figure: out-filter, case 1. Labels: home segment, bus link, in-filter, out-filter, activated bus, activated filter]

R – Requested Address

Out-filter – Case 2


• If the line is being requested from outside its home segment, transaction has to go out on the central bus

• The out-filter of the home segment is updated appropriately

• The in-filter then takes over

[Figure: out-filter, case 2. Labels: home segment, update, bus link, in-filter, out-filter, activated bus, activated filter]

R – Requested Address

In-filter


• Bloom filters keep track of a superset of lines currently present in the segment

• Only broadcast within the local segment if required

Energy Saved

[Figure: in-filter. Legend: bus link, activated bus, activated filter, in-filter, out-filter]

R – Requested Address
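The out-filter and in-filter cases can be put together in a small sketch that reports which buses a snoop activates. It is a simplified model, not the authors' implementation: exact Python sets stand in for the Bloom filters (so there are no false positives), the function name `route_snoop` and the `CENTRAL` marker are hypothetical, and two behaviors are assumptions made here: a request from outside the home segment always activates the home segment, and the requester's in-filter is updated on every request.

```python
CENTRAL = "central"  # hypothetical name for the central bus


def route_snoop(addr, src, home, out_filters, in_filters):
    """Return the set of buses activated by a snoop for `addr`.

    src, home   : segment ids of the requester and of the line's home
    out_filters : dict segment -> set of home lines that have left it
    in_filters  : dict segment -> set of lines present in the segment
    Plain sets replace the Bloom filters (assumption for clarity).
    """
    activated = {src}            # local sub-bus broadcast always happens
    in_filters[src].add(addr)    # assumption: requester now holds the line

    # Out-filter, case 1: request from the home segment for a line that
    # has never left it. The transaction stays completely local.
    if src == home and addr not in out_filters[home]:
        return activated

    # Otherwise the transaction goes out on the central bus.
    activated.add(CENTRAL)
    if src != home:
        # Out-filter, case 2: the line is leaving its home segment.
        out_filters[home].add(addr)
        activated.add(home)      # assumption: home segment is always reached

    # In-filters: a remote segment broadcasts only if it may hold the line.
    for seg, lines in in_filters.items():
        if seg != src and addr in lines:
            activated.add(seg)
    return activated
```

With four segments, a first request from the home segment activates only that segment; once another segment has fetched the line, a later request from the home segment also activates the central bus and that one remote segment, while the remaining segments stay idle.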

Arbitration

• Global arbitration delay is non-trivial for a single bus connecting even 16 cores

• Multi-step arbitration, as required

• On every request

– arbitrate for local bus and broadcast

– if filter indicates that the transaction is complete, “validate” broadcast via wired-OR

– if not, arbitrate for central bus and hold broadcast in a single-entry buffer until the central bus is available

– at the remote sub-buses, priority is given to requests originating from the central bus
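The request flow on this slide can be sketched as a small per-segment arbiter. The class and method names are hypothetical, and several details are assumptions rather than statements from the slides: local requests are served in FIFO order, and no new local request is granted while the single-entry buffer is occupied.

```python
from collections import deque


class SubBusArbiter:
    """Arbiter for one sub-bus (segment) in a multi-step arbitration scheme."""

    def __init__(self):
        self.local = deque()     # cores in this segment waiting for the sub-bus
        self.incoming = deque()  # transactions arriving from the central bus
        self.outgoing = None     # single-entry buffer toward the central bus

    def grant(self):
        """Pick the next user of the sub-bus, or None if nobody can go."""
        # Priority is given to requests originating from the central bus.
        if self.incoming:
            return ("central", self.incoming.popleft())
        # Assumption: local grants stall while the buffer is occupied.
        if self.local and self.outgoing is None:
            return ("local", self.local.popleft())
        return None

    def complete_local(self, txn, filtered_locally):
        """Finish a local broadcast after the filter lookup."""
        if filtered_locally:
            return "validated"   # transaction complete within the segment
        self.outgoing = txn      # hold until the central bus is available
        return "waiting for central bus"

    def central_bus_granted(self):
        """Release the buffered transaction onto the central bus."""
        txn, self.outgoing = self.outgoing, None
        return txn
```

The sketch covers only the ordering decisions; the wired-OR validation signal and cycle-level timing are not modeled.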


Low-swing Wiring

• Differential low-swing wiring up to 10X more energy efficient than regular wiring

• These have less impact on packet-switched networks since routers are the bottleneck anyway

– Amdahl’s law!

• Slightly increased latency, more metal requirement


Address Interleaved Buses

• As core counts increase, increased pressure on the bus due to contention

• At 64 cores, even though bus-based networks continue to be highly energy efficient, performance begins to dip

• To shore up performance, increase the number of buses

– different buses handle mutually exclusive addresses

– increased metal requirement
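One simple way to make different buses handle mutually exclusive addresses is to select the bus from low-order bits of the cache-line address. The sketch below assumes this scheme; the function name, the 64-byte line size, and the power-of-two bus count are assumptions, since the slides do not say which bits are used.

```python
def bus_for_address(addr, num_buses=2, line_offset_bits=6):
    """Select the bus that handles `addr`.

    Assumptions (not from the slides): 64-byte cache lines, a power-of-two
    number of buses, and interleaving on the lowest line-address bits.
    """
    if num_buses < 1 or num_buses & (num_buses - 1):
        raise ValueError("num_buses must be a power of two")
    line = addr >> line_offset_bits
    return line & (num_buses - 1)
```

Because every address maps to exactly one bus, each bus can arbitrate and order its own transactions independently of the others.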


Page Coloring

• OS-assisted page-coloring for L2 cache

• We use a simple first-touch approach

• Improved locality helps any network, but is especially well-suited for our network because
– More flexibility in page placement
– Less negative impact by sub-optimal page placement
– Improves filter behavior
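A first-touch policy can be sketched as a table that remembers which core touched each page first and keeps the page's home there. This is a minimal sketch: the class name is hypothetical, one L2 slice per core is assumed, and a real OS would implement the choice by picking a physical page whose color bits map to that slice rather than by keeping a dictionary. The 4KB page size matches the evaluation setup.

```python
class FirstTouchColoring:
    """First-touch page placement (illustrative only).

    Assumptions: one L2 slice per core, and a page's home slice is the
    slice of the first core to touch it.
    """

    def __init__(self, page_size=4096):
        self.page_size = page_size
        self.page_home = {}  # page number -> home slice id

    def home_slice(self, addr, touching_core):
        page = addr // self.page_size
        # The first toucher wins; later accesses reuse the same home.
        return self.page_home.setdefault(page, touching_core)
```

For example, if core 3 touches address 0x1234 first, a later access to 0x1FFF by core 7 still resolves to slice 3, while the next page (starting at 0x2000) is placed with whichever core touches it first.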


Methodology

• Virtutech SIMICS full-system simulator
– “g-cache” significantly modified to add network models

• CACTI 6.0 and Orion 2.0 for router/link energy computation

• 16 cores for most experiments, sensitivity analysis for 32- and 64-core systems

• 32nm process, 3GHz clock

• 32K D-L1, 16K I-L1, 2MB/slice shared L2

• 200 cycle main memory latency

• 4KB page size

• PARSEC, NAS, SPLASH-2 benchmark suites – run for entire Region-Of-Interest/parallel section

• Baseline routers - 4 VCs, 8 buffers/VC

Energy Consumption – Address Network

[Figure: address-network energy consumption. Annotations: Ring – 20x, Grid – 27x, Fbfly – 31x]

Energy Consumption – Data Network

[Figure: data-network energy consumption. Annotations: Ring – 2x, Grid – 2.5x, Fbfly – 3x]

How does energy consumption reduce?

• Router : Link energy ratio is high enough to significantly impact energy characteristics

• Efficient bloom filters, at 16KB/filter

– Out-filters are 85% accurate (note that there are only false positives, no false negatives)

– In-filters are 90% accurate


Effect of Page Coloring

• More locality

• Better filtering

– Out filter accuracy increases from 85% to 97%

System Performance

[Figure: system performance. Annotations: Ring – 7%, Grid – 3%, Fbfly – 1%]

How does performance improve?

• Two basic reasons
– Inherent indirection in directory-based protocols
– Deep pipelines in routers increasing the no-load latency

• Avg. latency in bus-based network is 16.4 cycles
– Arbitration (3.7 cyc) + Contention (1 cyc) + Bloom filter (1.2 cyc) + Link latency (10.5 cyc)

• Even in the most connected FBFLY, average of 1.5 hops per message, bare minimum two messages per transaction – 3 hops – 15 cycles without contention

– Link (6 cyc) + Router (9 cyc)
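The two latency breakdowns on this slide can be checked with a few lines of arithmetic. The dictionary and variable names are illustrative; the cycle counts are the ones quoted above.

```python
# Average bus transaction latency, in cycles (figures from the slide).
bus_cycles = {
    "arbitration": 3.7,
    "contention": 1.0,
    "bloom_filter": 1.2,
    "link": 10.5,
}

# Flattened butterfly: 2 messages x 1.5 hops = 3 hops, no contention.
fbfly_cycles = {
    "link": 6,
    "router": 9,
}

# Round to one decimal place to hide floating-point noise in the sum.
bus_total = round(sum(bus_cycles.values()), 1)
fbfly_total = sum(fbfly_cycles.values())

print(f"bus: {bus_total} cycles, fbfly (no contention): {fbfly_total} cycles")
```

The bus total reproduces the 16.4 cycles quoted for the bus-based network and the flattened-butterfly total reproduces the 15-cycle figure, which excludes contention.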


Scaling – 32 Cores – Energy

Average energy reduction of 19X in address network, 3X in data network

32 Cores – Performance

Average 5% drop in performance


Scaling - 64 Cores – Energy

Average reduction of 13X in address network, 2.5X in data network

64 Core - Performance


Average 39% increase in execution time compared to fbfly, only 12% increase with just two interleaved buses

Router Optimizations

• For packet-switched networks to be as energy efficient as bus-based networks, Router : Link energy ratio should be less than
– 3.5X at 16 cores
– 4.5X at 32 cores
– 7X at 64 cores

• Current energy ratio is approx. 70X


Related Work

• Packet Switched Networks

– Dally/Towles (DAC ’01), Kim et al. (MICRO ’07), Grot et al. (HPCA ’09), TRIPS, TILERA

• Hierarchical Networks

– Muralimanohar et al. (ISCA ’07), Das et al. (HPCA ’09)

• Snoop Filtering

– Moshovos et al. (HPCA ’01), Strauss et al. (ISCA ’06), Salapura et al. (HPCA ’08)

• Bus applications in CMPs

– Manevich et al. (NOCS ’09)

Key Contributions

• For moderate core counts, buses just work!
– Dramatic energy reduction
– little or no loss in performance
– simple snooping protocols, reduction in design complexity

• Low-swing wiring

• Multiple Address Interleaved buses

• OS-assisted page coloring

• Potential for router optimization


Thank you..

• Questions?