
Plasticine: A Reconfigurable Architecture For Parallel Patterns

Raghu Prabhakar

Granular Computing

June 8, 2017

Important Trends

● Moore’s law, Dennard scaling, Power Wall, Memory Wall

=> Use transistors efficiently to achieve better Performance / Watt

=> Exploit data locality, parallelism

● High NRE costs in fabricating ASICs

=> Build programmable hardware to amortize costs

● Availability of large amounts of data + algorithmic innovations

=> Build hardware with high compute density

=> Programmable Accelerator Architectures

Reconfigurable Accelerators

● Statically reprogrammable data path using configuration bits

● Power Efficiency: Avoids overheads of general purpose CPUs, GPGPUs

Instruction fetch, decode, register file access

40% of datapath energy on CPU [1]

30% of dynamic power on GPU [2]

● Flexibility: Amortizes NRE fabrication costs of ASIC

● FPGAs gaining traction as reconfigurable accelerators

[1] Hameed et al., “Understanding Sources of Inefficiency in General-Purpose Chips”, ISCA 2010

[2] Leng et al., “GPUWattch: Enabling Energy Optimizations in GPGPUs”, ISCA 2013

FPGA: The good and bad

● Bit-level reconfigurable logic elements + static interconnect

● Good

Flexibility, Performance / Watt

Commercially successful, mature toolchain support

● Bad

Architectural overheads: 60% area, power spent in the interconnect

Reduced compute density, slower clock rates

Long compile times, Low-level programming models

Design reconfigurable hardware with the right abstractions

Our Approach

● Parallel Patterns

High-level programming abstractions capturing parallelism and locality

Can express wide variety of applications

Previous work shows that FPGAs can be programmed from parallel patterns

Design reconfigurable primitives to accelerate parallel patterns

Example patterns: map, zip, reduce, groupBy (grouping elements under key1, key2, key3)
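To make the four patterns named above concrete, here is a minimal sketch using ordinary Scala collections. It is only an illustration of the pattern semantics, not the programming model used by Plasticine; the object name `PatternSketch`, the input values, and the parity key used for `groupBy` are assumptions chosen for the example.

```scala
// Plain Scala collections standing in for the four patterns (illustrative only).
object PatternSketch {
  val a: Vector[Int] = Vector(1, 2, 3, 4)
  val b: Vector[Int] = Vector(10, 20, 30, 40)

  // map: apply a function to every element independently
  val doubled: Vector[Int] = a.map(_ * 2)

  // zip: combine two collections element-wise
  val sums: Vector[Int] = a.zip(b).map { case (x, y) => x + y }

  // reduce: combine all elements with an associative operator
  val total: Int = a.reduce(_ + _)

  // groupBy: bucket elements by a key function (parity is a hypothetical key)
  val byParity: Map[Int, Vector[Int]] = a.groupBy(_ % 2)
}
```

Each element of a `map` or `zip` can be computed independently, which is the source of data parallelism, while `reduce` and `groupBy` additionally need a combining step across elements.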

Key Observations

● Nested Parallelism

Data and pipeline parallelism at innermost loop level

Coarse-grained pipelining and parallelism at outer levels

● Locality, on-chip bandwidth, and buffering

Large on-chip memories with parallel read/write ports to sustain compute throughput

On-chip memory access patterns can be different

Address partitioning to implement buffering for coarse-grained pipelining

● Dense and sparse memory accesses

Burst DRAM accesses for dense data structures e.g., matrices

Sparse / random DRAM access for sparse data structures e.g., graphs

● Communication

Patterns produce and consume scalar data and arrays

Plasticine

● New reconfigurable accelerator architecture

● Datapath

Hierarchical organization to exploit nested parallelism

● On-chip Memories

Large, banked scratchpad memories with configurable address decoding

Hardware support for generalized double buffering (N-buffering)

● Address generators and address coalescing

Efficient burst access generation for dense data

Scatter-gather support, large number of outstanding requests for sparse data

● Interconnect

Multi-level interconnect to enable scalar, vector, and control communication

Pipelined switches to avoid overheads, long wires
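As an illustration of what banked scratchpads with address decoding involve, the sketch below models one common scheme, interleaved (strided) banking, in plain Scala. This is an assumption made for the example: the slides do not state which decoding modes or bank counts the hardware uses, and the names `BankingSketch`, `decode`, and `conflictFree` are hypothetical.

```scala
// Software model of interleaved scratchpad banking (illustrative assumption).
object BankingSketch {
  final case class BankedAddr(bank: Int, offset: Int)

  // Interleaved decoding: consecutive words land in consecutive banks.
  def decode(addr: Int, numBanks: Int): BankedAddr =
    BankedAddr(addr % numBanks, addr / numBanks)

  // A vector of addresses can be served in parallel only if
  // no two of them decode to the same bank.
  def conflictFree(addrs: Seq[Int], numBanks: Int): Boolean = {
    val banks = addrs.map(a => decode(a, numBanks).bank)
    banks.distinct.size == banks.size
  }
}
```

Under this scheme a unit-stride access such as addresses 0, 1, 2, 3 hits four different banks, while a stride equal to the bank count (0, 4, 8, 12 with four banks) maps every access to the same bank. That is one reason a configurable decoder is useful when access patterns differ.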

Plasticine: Top-Level

Pattern Compute Unit (PCU)

PCU: Pipeline Network

PCU: Reduction Network

PCU: Shift Network

Pattern Memory Unit (PMU)

Address Generators, Coalescing Unit

● Reconfigurable integer data paths for DRAM address calculation logic

Optimizes for the common case of dense ‘burst’ DRAM access

Frees up PCUs for other computation, increases utilization

● Arbitration between multiple address streams

Coalescing unit arbitrates between address generators sharing same DRAM channel

● Scatter-gather support

Coalescing unit maintains sparse request metadata in a coalescing cache

Hardware combines requests belonging to same DRAM burst

Coalescing cache allows large number of outstanding requests
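The idea of combining requests that fall in the same DRAM burst can be sketched in software as a grouping step. The code below is a rough model under stated assumptions, not the hardware design: the 64-byte burst size and the names `CoalesceSketch`, `BurstBytes`, and `coalesce` are hypothetical.

```scala
// Software model of request coalescing (illustrative assumption).
object CoalesceSketch {
  // Hypothetical burst size in bytes; the slide does not state one.
  val BurstBytes: Int = 64

  // Map each byte address to the base address of the burst containing it,
  // then group requests so that each burst is requested from DRAM once.
  def coalesce(addrs: Seq[Long]): Map[Long, Seq[Long]] =
    addrs.groupBy(a => a / BurstBytes * BurstBytes)
}
```

For example, the five sparse addresses 0, 8, 64, 72, and 200 fall into three bursts (bases 0, 64, and 192), so three DRAM requests would suffice instead of five.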

Interconnect

● Three interconnects with different levels of granularity

Vector: Vector (multi-word) granularity

Scalar: Single word granularity

Control: Bit-level granularity

● Pipelined switches to avoid long wires

1 hop = 1 cycle

Enables faster clock rate

● Counters and Control within switches

Outer loop logic mostly involves loop indices and control only

Implementing outer loop logic in PCUs => underutilization

Execution Model and Control

● Scratchpad access decoupled from compute

PMU: Scratchpad read/write address calculation

PCU: Core computation

FIFOs at inputs ease routing constraints

● Decentralized control mechanism to orchestrate execution

Tokens: Feed-forward pulse signals indicating forward flow

Credits: Feedback pulse signals indicating backpressure

Generalizes over any arbitrary level of nested pipelining

● Tokens, Credits, and local FIFO state drive execution

Control blocks contain counters to manage tokens and credits

See paper for details
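The token and credit scheme can be pictured with a small cycle-by-cycle software model of one producer and one consumer sharing a fixed number of buffers. This is a simplified sketch under assumptions, not the control logic described in the paper: the function `simulate`, its parameters, and the fixed consumer period are hypothetical.

```scala
// Toy model of token/credit flow control between two pipeline stages.
object TokenCreditSketch {
  // Returns (maximum buffers occupied at once, cycles until all items are consumed).
  def simulate(numBuffers: Int, items: Int, consumerPeriod: Int): (Int, Int) = {
    var credits = numBuffers // free buffers the producer may still fill
    var tokens = 0           // filled buffers waiting for the consumer
    var produced = 0
    var consumed = 0
    var maxInFlight = 0
    var cycle = 0
    while (consumed < items) {
      // Consumer: every consumerPeriod cycles, take a token and return a credit.
      if (cycle % consumerPeriod == 0 && tokens > 0) {
        tokens -= 1
        credits += 1
        consumed += 1
      }
      // Producer: fires only while it holds a credit, then sends a token forward.
      if (produced < items && credits > 0) {
        credits -= 1
        tokens += 1
        produced += 1
      }
      maxInFlight = math.max(maxInFlight, tokens)
      cycle += 1
    }
    (maxInFlight, cycle)
  }
}
```

Because the producer stalls whenever it runs out of credits, the number of occupied buffers never exceeds `numBuffers`, however slow the consumer is. With `numBuffers = 2` the model behaves like double buffering; larger values correspond to deeper buffering.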

Application Mapping

Unrolling

Splitting

Virtual PCUs

Mapping

Resource Allocation

Routing

Bitstream generation

Plasticine Bitstream

Koeplinger et al., “Automatic Generation of Efficient Accelerators for Reconfigurable Hardware”, ISCA 2016

Example: Dot Product

    val out = Reg[Float]
    val vectorA = DRAM[Float](N)
    val vectorB = DRAM[Float](N)

    Reduce(N by B)(out) { i =>
      val tileA = SRAM[Float](B)
      val tileB = SRAM[Float](B)
      val acc = Reg[Float]

      tileA load vectorA(i :: i+B)
      tileB load vectorB(i :: i+B)

      Reduce(B by 1)(acc){ j =>
        tileA(j) * tileB(j)
      }{a, b => a + b}
    }{a, b => a + b}

[Figure: vectors A and B in DRAM, loaded into on-chip tiles TileA and TileB]
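For readers unfamiliar with the DSL, the tiled dot product can be mirrored in plain Scala. This is a software sketch of the same nested-reduce structure, not code that targets Plasticine: arrays stand in for DRAM, slices stand in for SRAM tiles, `tileSize` plays the role of B, and the name `DotProductSketch` is an assumption.

```scala
// Plain-Scala analogue of the tiled dot product (illustrative only).
object DotProductSketch {
  def dot(a: Array[Float], b: Array[Float], tileSize: Int): Float = {
    require(a.length == b.length, "vectors must have equal length")
    require(tileSize > 0, "tile size must be positive")
    // Outer reduce: one iteration per tile of the input vectors.
    (0 until a.length by tileSize).map { i =>
      val end = math.min(i + tileSize, a.length)
      val tileA = a.slice(i, end) // stands in for "tileA load vectorA(i :: i+B)"
      val tileB = b.slice(i, end)
      // Inner reduce: multiply-accumulate over one tile.
      (0 until tileA.length).map(j => tileA(j) * tileB(j)).foldLeft(0.0f)(_ + _)
    }.foldLeft(0.0f)(_ + _)
  }
}
```

Unlike the slide's version, this sketch also handles a final partial tile when the vector length is not a multiple of the tile size. The outer loop corresponds to the coarse-grained level and the inner multiply-accumulate to the innermost, data-parallel level.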

Evaluation

Sizing, Area, Power, Performance, Perf / W

Architecture Sizing

Plasticine Clock, Area, and Power

Technology Node: 28 nm

Clock Frequency: 1 GHz

Total Area: 112.77 mm²

Total Power: 49 W

Area Breakdown

Plasticine: PCU 48%, PMU 30%, Interconnect 17%

PCU: Regs 17%, FIFO 10%, Control 1%

PMU: Scratchpad 90%, FIFO 5%, Regs 4%

Legend: Scratchpad, FIFO, Regs, FU, Control

Experimental Setup

● Plasticine:

Implemented using Chisel, RTL synthesized with 28nm library

4 DDR3-1600 DRAM channels, peak memory bandwidth = 51.2 GB/s

1 GHz clock

● FPGA:

Altera Stratix V, 28 nm technology

6 DDR3-800 DRAM channels, peak memory bandwidth = 37.5 GB/s

150 MHz clock
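The Plasticine peak-bandwidth figure above follows from simple arithmetic: channels times transfer rate times bytes per transfer. The sketch below reproduces it, assuming the standard 64-bit (8-byte) DDR3 data bus per channel; the name `peakGBps` and its parameters are hypothetical.

```scala
// Peak DRAM bandwidth arithmetic (assumes an 8-byte DDR3 data bus per channel).
object BandwidthSketch {
  // GB/s = channels * mega-transfers per second * bytes per transfer / 1000
  def peakGBps(channels: Int, megaTransfersPerSec: Int, busBytes: Int = 8): Double =
    channels.toLong * megaTransfersPerSec * busBytes / 1000.0
}
```

For four DDR3-1600 channels this gives 4 × 1600 × 8 / 1000 = 51.2 GB/s, matching the number on the slide.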

Experimental Setup

● Plasticine:

Performance: Cycle-accurate simulation using VCS + DRAMSim2

Area: Synopsys DC after synthesis

Chip Power: RTL trace-driven simulation using PrimeTime

● FPGA:

Performance: Measured execution time on FPGA

Utilization: Reports from Altera logic synthesis tools

Chip Power: Altera PowerPlay tool

Plasticine v/s FPGA

Resource Utilization

[Figure: utilization of PCUs, PMUs, and AGs, and of FUs and registers]

Conclusion

● Co-designing reconfigurable architecture and programming models based on parallel patterns leads to efficient, programmable systems

● Plasticine accelerates dense and sparse applications composed of parallel patterns

● Design space exploration explores tradeoffs between architecture parameters and application characteristics

● Up to 95x improvement in Performance, 77x improvement in Perf/W over FPGA in similar process technology, with an area of 113 mm²

The Team

Christos Kozyrakis Kunle Olukotun

Yaqi Zhang David Koeplinger Matt Feldman

Tian Zhao Stefan Hadjis Ardavan Pedram
