NoC Bergman
Posted 27-Oct-2015

TRANSCRIPT

  • Silicon Photonic On-Chip Optical Interconnection Networks
    Keren Bergman, Columbia University

    *

    Acknowledgements
    Columbia: Prof. Luca Carloni; Dr. Assaf Shacham, Michele Petracca, Ben Lee, Caroline Lai, Howard Wang, Sasha Biberman
    IBM: Jeff Kash, Yurii Vlasov
    Cornell: Michal Lipson

    *

    Emerging Trend of Chip Multiprocessors (CMP)
    Niagara (Sun, 2004), CELL BE (IBM, 2005), Montecito (Intel, 2004), Terascale (Intel, 2007), Barcelona (AMD, 2007)

    *

    Networks on Chip (NoC)
    Shared, packet-switched, optimized for communications:
    Resource efficiency
    Design simplicity
    IP reusability
    High performance

    But no true relief in power dissipation (Kolodny, 2005)

    *

    Chip Multiprocessors: the IBM Cell

    Technology process | 90 nm SOI with low-κ dielectrics and 8 copper interconnect metal layers
    Chip area | 235 mm^2
    Number of transistors | ~234M
    Operating clock frequency | 4 GHz
    Power dissipation | ~100 W
    Power dissipation due to global interconnect | 30-50%
    Intra-chip, inter-core communication bandwidth | 1.024 Tb/s at 2 Gb/s/lane (four shared buses, 128 data bits + 64 address bits each)
    I/O communication bandwidth | 0.819 Tb/s (includes external memory)


    The Interconnection Challenge: Off-Chip Bandwidth
    Off-chip bandwidth is rising, driven by both pin count and signaling rate. Some examples:

    *

    Why Photonics for CMP NoC?

    OPTICS: modulate/receive an ultra-high-bandwidth data stream once per communication event
    Transparency: a broadband switch routes the entire multi-wavelength, high-bandwidth stream
    Low-power, scalable switch fabric
    Off-chip and on-chip can use essentially the same technology
    Off-chip BW = on-chip BW for nearly the same power

    ELECTRONICS: buffer, receive, and re-transmit at every switch
    Off-chip is pin-limited and power hungry

    Photonics changes the rules for bandwidth-per-Watt, on-chip AND off-chip

    *

    Recent advances in photonic integration: Infinera (2005), IBM (2007), Lipson, Cornell (2005), Luxtera (2005), Bowers, UCSB (2006)

    *

    3DI CMP System Concept
    Future CMP system at 22 nm; chip size ~625 mm^2
    3D layer stacking used to combine:
    Multi-core processing plane
    Several memory planes
    Photonic NoC
    Processor system stack: 22 nm scaling will enable 36 multithreaded cores similar to today's Cell
    Estimated on-chip local memory per complex core: ~0.5 GB

    *

    Optical NoC: Design Considerations
    Design to exploit optical advantages:
    Bit-rate transparency: transmission/switching power independent of bandwidth
    Low loss: power independent of distance
    Bandwidth: exploit WDM for maximum effective bandwidth across the network
    (Over)provision maximized bandwidth per port
    Maximize effective communications bandwidth
    Seamless optical I/O to external memory at the same bandwidth

    Design must address optical challenges:
    No optical buffering
    No optical signal processing
    Network routing and flow control managed in electronics (distributed vs. central)
    Electronic control-path provisioning latency

    Packaging constraints: CMP chip layout must avoid long electronic interfaces; network gateways must be in close proximity on the photonic plane
    Design for photonic building blocks: low switch radix

    *

    Photonic On-Chip Network
    Goal: design a NoC for a chip multiprocessor (CMP)
    Electronics: integration density, abundant buffering and processing; power dissipation grows with data rate
    Photonics: low loss, large bandwidth, bit-rate transparency; limited processing, no buffers
    Our solution, a hybrid approach:
    A dual-network design
    Data transmission in a photonic network
    Control in an electronic network
    Paths reserved before transmission
    No optical buffering

    *

    On-Chip Optical Network Architecture
    Bufferless, deflection-switch based
    Cell core (on processor plane)
    Gateway to photonic NoC (on processor and photonic planes)

    *

    Key Building Blocks
    Low-loss broadband nanowires: 5 cm SOI nanowire, 1.28 Tb/s (32 λ x 40 Gb/s) (IBM/Columbia)
    Broadband router switch (IBM; Cornell/Columbia)
    High-speed receiver (IBM)

    *

    4x4 Photonic Switch Element
    4 deflection switches grouped with electronic control
    4 waveguide-pair I/O links
    Electronic router: high-speed, simple logic
    Links optimized for high speed
    Nearly no power consumption in the OFF state

    *

    Non-Blocking 4x4 Switch Design
    Original switch is internally blocking
    Addressed by the routing algorithm in the original design, limiting topology choices
    New design:
    Strictly non-blocking*
    Same number of rings
    Negligible additional loss
    Larger area

    * U-turns not allowed

    *

    Design of a Nonblocking Network for the CMP NoC
    Begin with a crossbar, a strictly non-blocking architecture:
    Any unoccupied input can transmit to any unoccupied output without altering paths taken by other traffic in the network
    Connections from every input to every output
    Each node transmits and receives on independent paths in each dimension
    Unidirectional links
    1x2 switches
    Simple routing algorithm

    *

    Design of the photonic nonblocking mesh
    Bidirectionality provides for independent reception by two nodes from the output (Y) dimension
    Utilizing the nonblocking switch design, with increased functionality and bidirectionality, enables a novel network architecture

    *

    Mapping onto a Direct Network
    Internalizing the nodes in a crossbar (indirect network) produces a mesh/torus (direct network)

    *

    Nonblocking Torus Network
    Internalizing nodes maintains two nodes per dimension
    There is always an independent path available for a node to transmit or receive on in each dimension

    *


    Nonblocking Torus Network
    Each node injects into the network on the X dimension

    *

    Nonblocking Torus Network
    Each node ejects from the network on the Y dimension
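The inject-on-X / eject-on-Y rule on these slides amounts to XY dimension-order routing. A minimal illustrative sketch of that decision rule on a mesh of gateways (the coordinate scheme and the `route` helper are assumptions for illustration, not from the talk, and wrap-around torus links are ignored):

```python
# Hypothetical sketch of the "inject on X, eject on Y" rule: a message
# first travels along the X dimension to the destination column, then
# turns once onto the Y dimension to eject at the destination node.

def route(src, dst):
    """Return the hop-by-hop path from src to dst as (x, y) tuples,
    moving along X first, then Y (dimension-order routing)."""
    path = [src]
    x, y = src
    dx, dy = dst
    step = 1 if dx > x else -1
    while x != dx:                 # X-dimension traversal (injection side)
        x += step
        path.append((x, y))
    step = 1 if dy > y else -1
    while y != dy:                 # Y-dimension traversal (ejection side)
        y += step
        path.append((x, y))
    return path

# Example: all X hops happen before the single turn onto Y
print(route((0, 0), (3, 2)))
# [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2)]
```

Because each message makes at most one turn, the path-setup logic in the electronic control network stays simple, consistent with the "simple routing algorithm" noted earlier.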

    *

    Nonblocking Torus Network
    Folding the torus maintains equal path lengths
    Each switch point uses the non-blocking 4x4 photonic switch design

    *

    Power Analysis (strawman)

    Modulator power | 0.4 pJ/bit
    Receiver power | 0.4 pJ/bit
    On-chip broadband router power | 10 mW
    Power per laser | 10 mW
    Number of wavelength channels | 250

    Cores per supercore | 21
    Supercores per chip | 36
    Broadband switch routers | 36
    Number of lasers | 250
    Bandwidth per supercore | 1 Tb/s

    Optical Tx + Rx power (36 Tb/s) | 28.8 W
    On-chip broadband routers (36) | 360 mW
    External lasers (250) | 2.5 W
    Total photonic interconnect power | 31.7 W
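As a sanity check (not part of the talk), the strawman totals follow directly from the per-component parameters above:

```python
# Back-of-envelope check of the strawman power budget.
# All parameter values are taken from the slide's tables.

pj_per_bit_tx = 0.4          # modulator energy, pJ/bit
pj_per_bit_rx = 0.4          # receiver energy, pJ/bit
router_mw = 10               # per on-chip broadband router, mW
laser_mw = 10                # per external laser, mW

supercores = 36
bw_per_supercore_tbps = 1.0  # Tb/s
routers = 36
lasers = 250

total_bw_bps = supercores * bw_per_supercore_tbps * 1e12   # 36 Tb/s aggregate
tx_rx_w = total_bw_bps * (pj_per_bit_tx + pj_per_bit_rx) * 1e-12
router_w = routers * router_mw * 1e-3
laser_w = lasers * laser_mw * 1e-3
total_w = tx_rx_w + router_w + laser_w

print(tx_rx_w, router_w, laser_w, round(total_w, 1))
# 28.8 W + 0.36 W + 2.5 W = 31.7 W, matching the slide's total
```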

    *

    Performance Analysis
    Goal: evaluate the performance-per-Watt advantage of a CMP system with a photonic NoC
    Developed a network simulator using OMNeT++, a modular, open-source, event-driven simulation environment
    Modules for photonic building blocks, assembled into the network
    Multithreaded model for complex cores
    Evaluate NoC performance under a uniform random traffic distribution
    Performance-per-Watt gains of the photonic NoC on an FFT application

    *

    Multithreaded Complex Core Model
    Model each complex core as a multithreaded processor with many computational threads executing in parallel
    Each thread independently makes communication requests to any core
    Three main blocks:
    Traffic generator: simulates core threads' data-transfer requests; requests stored in a back-pressure FIFO queue
    Scheduler: extracts requests from the FIFO and generates path setup via the electronic interface; blocked requests are re-queued, avoiding head-of-line (HoL) blocking
    Gateway: photonic interface; sends/receives and reads/writes data to local memory

    *

    Throughput per Core
    Throughput-per-core = ratio of the time a core transmits photonic messages to total simulation time
    A metric of average path setup time, a function of message length and network topology
    Offered load is counted when a core is ready to transmit
    For an uncongested network: throughput-per-core = offered load
    Simulation system parameters:
    36 multithreaded cores
    DMA transfers of fixed-size 16 kB messages
    Line rate = 960 Gb/s; photonic message duration = 134 ns
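The quoted message duration follows from the message size and line rate; assuming 16 kB here means 16,000 bytes (decimal), the arithmetic reproduces the ~134 ns figure:

```python
# Sanity check of the photonic message duration quoted above.
# Assumption (not stated on the slide): 16 kB = 16,000 bytes.

message_bytes = 16_000
line_rate_bps = 960e9                     # 960 Gb/s line rate

duration_ns = message_bytes * 8 / line_rate_bps * 1e9
print(round(duration_ns, 1))              # 133.3 ns, i.e. the slide's ~134 ns
```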

    *

    Throughput per Core for a 36-Node Photonic NoC
    Multithreading enables better exploitation of the photonic NoC's high bandwidth: a gain of 26% over single-threaded
    The non-blocking mesh, with its shorter average path, improves throughput by 13% over the crossbar

    *

    FFT Computation Performance
    We consider execution of the Cooley-Tukey FFT algorithm using 32 of the 36 available cores
    First phase: each core processes k = m/M sample elements (m = input array size, M = number of cores)
    After the first phase, log M iterations of a computation step followed by a communication step in which cores exchange data in a butterfly pattern
    Time to perform the FFT computation depends on core architecture; time for data movement is a function of NoC line rate and topology
    Reported results for FFT on the Cell processor: a 2^24-sample FFT executes in ~43 ms using Bailey's algorithm
    We assume a Cell core with (2x) 256 MB local-store memory, double precision
    Use Bailey's algorithm to complete the first phase of Cooley-Tukey in 43 ms
    Cooley-Tukey requires 5k log k floating-point operations; each iteration after the first phase is ~1.8 ms for k = 2^24
    Assuming 960 Gb/s, the CMP non-blocking mesh NoC can execute a 2^29-point FFT in 66 ms
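The 66 ms figure can be roughly reconstructed from the quantities above. This is a hedged sketch, not the talk's own model: it assumes each butterfly exchange moves k double-precision complex samples (16 bytes each) per core at the full line rate, and that the total is first phase plus log2(M) compute-and-exchange steps.

```python
import math

# Rough reconstruction of the 2^29-point FFT timing (assumptions noted
# in the lead-in; the per-step exchange model is not from the slide).

M = 32                      # cores used
m = 2 ** 29                 # FFT size
k = m // M                  # samples per core = 2^24

first_phase_ms = 43         # Bailey's algorithm, from the slide
iter_compute_ms = 1.8       # per-iteration compute for k = 2^24, from the slide
line_rate_bps = 960e9

iterations = int(math.log2(M))                      # 5 butterfly steps
exchange_ms = k * 16 * 8 / line_rate_bps * 1e3      # ~2.24 ms per step

total_ms = first_phase_ms + iterations * (iter_compute_ms + exchange_ms)
print(iterations, round(exchange_ms, 2), round(total_ms, 1))
```

Under these assumptions the estimate lands at ~63 ms, within a few milliseconds of the 66 ms quoted on the slide; the gap is plausibly per-step overhead not modeled here.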

    *

    FFT Computation Power Analysis
    Photonic NoC:
    A hop between two switches is 2.78 mm, with an average path of 11 hops and 4 switch-element turns
    With 32 blocks of 256 MB and a line rate of 960 Gb/s, each connection dissipates 105.6 mW at the interfaces and 2 mW in switch turns
    Total power dissipation is 3.44 W
    Electronic NoC:
    Assume an equivalent electronic circuit-switched network
    Power is dissipated only along the length of optimally repeated wire at 22 nm: 0.26 pJ/bit/mm
    Summary: computation time is a function of the line rate, independent of the medium
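The per-connection figures check out arithmetically. A short verification using only the quantities quoted above (assuming the "2 mW in switch turns" is the total per connection):

```python
# Check of the per-connection power figures, photonic vs. electronic,
# from the slide's quantities: 105.6 mW interfaces + 2 mW switch turns
# per photonic connection; 0.26 pJ/bit/mm for the repeated wire.

connections = 32              # 32 blocks of 256 MB
line_rate_bps = 960e9

# Photonic: interface power plus switch-turn power, per connection
photonic_w = connections * (105.6e-3 + 2e-3)

# Electronic equivalent: repeated-wire energy over the same average path
path_mm = 11 * 2.78                                   # 11 hops of 2.78 mm each
electronic_w_per_conn = line_rate_bps * 0.26e-12 * path_mm
electronic_total_w = connections * electronic_w_per_conn

print(round(photonic_w, 2))             # ~3.44 W photonic total
print(round(electronic_w_per_conn, 1))  # ~7.6 W per electronic connection
print(round(electronic_total_w))        # ~244 W electronic total
```

The ~70x per-connection gap between 7.6 W and 107.6 mW is the source of the performance-per-Watt claim that follows.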

    *

    FFT Computation Performance Comparison
    FFT computation: time ratio and power ratio as a function of line rate

    *

    Performance-per-Watt
    To achieve the same execution time (time ratio = 1), the electronic NoC must operate at the same 960 Gb/s line rate, dissipating 7.6 W per connection, ~70x the photonic figure
    Total dissipated power is ~244 W
    To achieve the same power (power ratio