Plasticine: A Reconfigurable Architecture For Parallel Patterns

Raghu Prabhakar

Granular Computing

June 8, 2017 Plasticine: A Reconfigurable Architecture for Parallel Patterns Slide 2

Important Trends

● Moore’s law, Dennard scaling, Power Wall, Memory Wall

=> Use transistors efficiently to achieve better Performance / Watt

=> Exploit data locality, parallelism

● High NRE costs in fabricating ASICs

=> Build programmable hardware to amortize costs

● Availability of large amounts of data + algorithmic innovations

=> Build hardware with high compute density

=> Programmable Accelerator Architectures


Reconfigurable Accelerators

● Statically reprogrammable data path using configuration bits

● Power Efficiency: Avoids overheads of general purpose CPUs, GPGPUs

Instruction fetch, decode, register file access

40% of datapath energy on CPU [1]

30% of dynamic power on GPU [2]

● Flexibility: Amortizes NRE fabrication costs of ASIC

● FPGAs gaining traction as reconfigurable accelerators

[1] Hameed et al., “Understanding Sources of Inefficiency in General-Purpose Chips”, ISCA 2010

[2] Leng et al., “GPUWattch: Enabling Energy Optimizations in GPGPUs”, ISCA 2013


FPGA: The good and bad

● Bit-level reconfigurable logic elements + static interconnect

● Good

Flexibility, Performance / Watt

Commercially successful, mature toolchain support

● Bad

Architectural overheads: 60% area, power spent in the interconnect

Reduced compute density, slower clock rates

Long compile times, Low-level programming models

Design reconfigurable hardware with the right abstractions


Our Approach

● Parallel Patterns

High-level programming abstractions capturing parallelism and locality

Can express wide variety of applications

Previous work shows that FPGAs can be programmed from parallel patterns

Design reconfigurable primitives to accelerate parallel patterns

[Figure: the four patterns map, zip, reduce, and groupBy; groupBy buckets elements under key1, key2, and key3]
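
As a rough software analogy, the four patterns named on this slide can be written in plain Python. This is only a sequential sketch of what each pattern computes, not how Plasticine executes it; the function names `pmap`, `pzip`, `preduce`, and `pgroup_by` are hypothetical and do not come from the talk.

```python
from collections import defaultdict
from functools import reduce


def pmap(f, xs):
    """map: apply f independently to every element."""
    return [f(x) for x in xs]


def pzip(f, xs, ys):
    """zip: combine two equal-length collections element by element."""
    if len(xs) != len(ys):
        raise ValueError("zip expects collections of equal length")
    return [f(x, y) for x, y in zip(xs, ys)]


def preduce(f, xs, init):
    """reduce: fold all elements into one value using a combining function f."""
    return reduce(f, xs, init)


def pgroup_by(key, xs):
    """groupBy: bucket elements by the value of a key function."""
    groups = defaultdict(list)
    for x in xs:
        groups[key(x)].append(x)
    return dict(groups)
```

The iterations of `pmap` and `pzip` are independent of one another, and `preduce` can be evaluated as a tree when `f` is associative, which is the parallelism the hardware primitives are meant to exploit.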


Key Observations

● Nested Parallelism

Data and pipeline parallelism at innermost loop level

Coarse-grained pipelining and parallelism at outer levels

● Locality, on-chip bandwidth, and buffering

Large on-chip memories with parallel read/write ports to sustain compute throughput

On-chip memory access patterns can be different

Address partitioning to implement buffering for coarse-grained pipelining

● Dense and sparse memory accesses

Burst DRAM accesses for dense data structures e.g., matrices

Sparse / random DRAM access for sparse data structures e.g., graphs

● Communication

Patterns produce and consume scalar data and arrays


Plasticine

● New reconfigurable accelerator architecture

● Datapath

Hierarchical organization to exploit nested parallelism

● On-chip Memories

Large, banked scratchpad memories with configurable address decoding

Hardware support for generalized double buffering (N-buffering)

● Address generators and address coalescing

Efficient burst access generation for dense data

Scatter-gather support, large number of outstanding requests for sparse data

● Interconnect

Multi-level interconnect to enable scalar, vector, and control communication

Pipelined switches to avoid overheads, long wires
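
The on-chip memory bullets above can be illustrated with a small software model. The sketch below assumes two address-decoding modes ("interleaved" and "blocked") and a ring of N buffers; these mode names, the `decode_address` function, and the `NBuffer` class are hypothetical illustrations, not the actual PMU configuration interface.

```python
def decode_address(addr, num_banks, mode="interleaved", bank_size=None):
    """Map a flat scratchpad address to a (bank, offset) pair.

    "interleaved" spreads consecutive addresses across banks, so a vector of
    num_banks consecutive words touches every bank once.
    "blocked" keeps consecutive addresses inside one bank of bank_size words.
    """
    if mode == "interleaved":
        return addr % num_banks, addr // num_banks
    if mode == "blocked":
        if bank_size is None:
            raise ValueError("blocked mode needs bank_size")
        bank = addr // bank_size
        if bank >= num_banks:
            raise IndexError("address outside the scratchpad")
        return bank, addr % bank_size
    raise ValueError(f"unknown mode: {mode}")


class NBuffer:
    """N logical buffers: a writer fills one while a reader drains another."""

    def __init__(self, n, size):
        self.n = n
        self.size = size
        self.buffers = [[0] * size for _ in range(n)]
        self.write_idx = 0
        self.read_idx = 0
        self.filled = 0  # buffers written but not yet read

    def can_write(self):
        return self.filled < self.n

    def can_read(self):
        return self.filled > 0

    def write_tile(self, data):
        if not self.can_write():
            raise RuntimeError("all buffers are full")
        if len(data) != self.size:
            raise ValueError("tile does not match buffer size")
        self.buffers[self.write_idx] = list(data)
        self.write_idx = (self.write_idx + 1) % self.n
        self.filled += 1

    def read_tile(self):
        if not self.can_read():
            raise RuntimeError("no filled buffer to read")
        data = self.buffers[self.read_idx]
        self.read_idx = (self.read_idx + 1) % self.n
        self.filled -= 1
        return data
```

With `n = 2` this is ordinary double buffering; a larger `n` lets a producer stage run further ahead of its consumer in a coarse-grained pipeline.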


Plasticine: Top-Level


Pattern Compute Unit (PCU)


PCU: Pipeline Network


PCU: Reduction Network


PCU: Shift Network


Pattern Memory Unit (PMU)


Address Generators, Coalescing Unit

● Reconfigurable integer data paths for DRAM address calculation logic

Optimizes for the common case of dense ‘burst’ DRAM access

Frees up PCUs for other computation, increases utilization

● Arbitration between multiple address streams

Coalescing unit arbitrates between address generators sharing same DRAM channel

● Scatter-gather support

Coalescing unit maintains sparse request metadata in a coalescing cache

Hardware combines requests belonging to same DRAM burst

Coalescing cache allows large number of outstanding requests
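
The idea of combining sparse requests that fall in the same DRAM burst can be sketched in a few lines of Python. The 64-byte burst size and the `coalesce` function are assumptions made for illustration; the talk does not give the burst size or the layout of the coalescing cache.

```python
def coalesce(addresses, burst_bytes=64):
    """Group byte addresses by the DRAM burst they fall in.

    Returns a dict mapping each burst's base address to the list of
    (request_id, offset_within_burst) pairs it serves. One DRAM burst is
    issued per key, and the recorded metadata says which request gets which
    word when the burst returns. burst_bytes=64 is an assumed value.
    """
    bursts = {}
    for request_id, addr in enumerate(addresses):
        base = addr - (addr % burst_bytes)
        bursts.setdefault(base, []).append((request_id, addr - base))
    return bursts
```

For example, the five addresses `[0, 4, 128, 60, 132]` collapse into two bursts (base addresses 0 and 128), so only two DRAM requests are issued instead of five.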


Interconnect

● Three interconnects with different levels of granularity

Vector: Vector (multi-word) granularity

Scalar: Single word granularity

Control: Bit-level granularity

● Pipelined switches to avoid long wires

1 hop = 1 cycle

Enables faster clock rate

● Counters and Control within switches

Outer loop logic mostly involves loop indices and control only

Implementing outer loop logic in PCUs => underutilization


Execution Model and Control

● Scratchpad access decoupled from compute

PMU: Scratchpad read/write address calculation

PCU: Core computation

FIFOs at inputs ease routing constraints

● Decentralized control mechanism to orchestrate execution

Tokens: Feed-forward pulse signals indicating forward flow

Credits: Feedback pulse signals indicating backpressure

Generalizes over any arbitrary level of nested pipelining

● Tokens, Credits, and local FIFO state drive execution

Control blocks contain counters to manage tokens and credits

See paper for details
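
A minimal software sketch of the token and credit idea, assuming a single producer and a slower consumer joined by a FIFO: the producer may send (a token) only while it holds a credit, and the consumer hands a credit back each time it drains an item. The function `run_pipeline` and its parameters are hypothetical; the actual control blocks are described in the paper.

```python
from collections import deque


def run_pipeline(num_items, buffer_depth, consumer_period=2):
    """Simulate a producer and a slower consumer linked by tokens and credits.

    Returns (items in the order consumed, maximum FIFO occupancy observed).
    """
    credits = buffer_depth  # one credit per free FIFO slot
    fifo = deque()          # items in flight; each push acts as a token
    produced = 0
    consumed = []
    max_occupancy = 0
    cycle = 0
    while len(consumed) < num_items:
        # Consumer: every consumer_period cycles, drain one item and
        # return a credit to the producer (the feedback signal).
        if fifo and cycle % consumer_period == 0:
            consumed.append(fifo.popleft())
            credits += 1
        # Producer: send an item only if a credit is available; with no
        # credits it stalls, which is the backpressure.
        if produced < num_items and credits > 0:
            fifo.append(produced)
            produced += 1
            credits -= 1
        max_occupancy = max(max_occupancy, len(fifo))
        cycle += 1
    return consumed, max_occupancy
```

Because `credits + len(fifo)` always equals `buffer_depth`, the FIFO can never overflow, and no central controller is needed to enforce that.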


Application Mapping

Unrolling

Splitting

DHDL

Virtual PCUs

Mapping

Resource Allocation

Routing

Bitstream generation

Plasticine Bitstream

Koeplinger et al., “Automatic Generation of Efficient Accelerators for Reconfigurable Hardware”, ISCA 2016


Example: Dot Product

    val out = Reg[Float]
    val vectorA = DRAM[Float](N)
    val vectorB = DRAM[Float](N)

    Reduce(N by B)(out) { i =>
      val tileA = SRAM[Float](B)
      val tileB = SRAM[Float](B)
      val acc = Reg[Float]

      tileA load vectorA(i :: i+B)
      tileB load vectorB(i :: i+B)

      Reduce(B by 1)(acc){ j =>
        tileA(j) * tileB(j)
      }{a, b => a + b}
    }{a, b => a + b}

[Figure: dataflow for the dot product: vectors A and B in DRAM are loaded into TileA and TileB, multiplied (×), summed (+) into acc, and summed (+) into out]
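
To make the nesting in the dot product example easier to follow, here is a plain Python model of the same computation. It is a sequential sketch only: the outer loop stands in for `Reduce(N by B)(out)` and the inner loop for `Reduce(B by 1)(acc)`, while the function name `tiled_dot` and the use of list slices as tiles are assumptions for illustration.

```python
def tiled_dot(vector_a, vector_b, tile_size):
    """Software model of the two nested Reduce patterns in the example."""
    if len(vector_a) != len(vector_b):
        raise ValueError("vectors must have the same length")
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    out = 0.0
    # Outer reduce: one iteration per tile of tile_size elements.
    for i in range(0, len(vector_a), tile_size):
        # Stand-ins for loading a tile from DRAM into on-chip SRAM.
        tile_a = vector_a[i:i + tile_size]
        tile_b = vector_b[i:i + tile_size]
        acc = 0.0
        # Inner reduce: multiply element pairs and sum them into acc.
        for j in range(len(tile_a)):
            acc += tile_a[j] * tile_b[j]
        # Combine the per-tile partial sum into the final result.
        out += acc
    return out
```

Unlike the example, which presumably assumes the vector length is a multiple of the tile size, the slices here also handle a shorter final tile. In hardware the outer iterations can overlap (loading the next tile while the current one is reduced), which this sequential model does not show.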


Example: DotProduct

[Figure: DotProduct mapping, showing A, B, tile A, and tile B]

Evaluation

Sizing, Area, Power, Performance, Perf / W


Architecture Sizing


Plasticine Clock, Area, and Power

Technology Node: 28 nm

Clock Frequency: 1 GHz

Total Area: 112.77 mm²

Total Power: 49 W


Area Breakdown

Plasticine: PCU 48%, PMU 30%, Interconnect 17%, MC 5%

PCU: FU 72%, Regs 17%, FIFO 10%, Control 1%

PMU: Scratchpad 90%, FIFO 5%, Regs 4%


Experimental Setup

● Plasticine:

Implemented using Chisel, RTL synthesized with 28nm library

4 DDR3-1600 DRAM channels, peak memory bandwidth = 51.2 GB/s

1 GHz clock

● FPGA:

Altera Stratix V, 28 nm technology

6 DDR3-800 DRAM channels, peak memory bandwidth = 37.5 GB/s

150 MHz clock
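
The peak bandwidth figures can be checked with simple arithmetic. The sketch below assumes a 64-bit (8-byte) data bus per DDR3 channel, which the slide does not state, and counts 1 GB as 10^9 bytes; the function name is hypothetical.

```python
def peak_bandwidth_gb_per_s(channels, mega_transfers_per_s, bus_bytes=8):
    """Peak DRAM bandwidth in GB/s, with 1 GB = 10**9 bytes.

    bus_bytes=8 assumes a 64-bit data bus per channel (an assumption,
    not a figure from the talk).
    """
    return channels * mega_transfers_per_s * bus_bytes / 1000


plasticine = peak_bandwidth_gb_per_s(4, 1600)  # 4 x DDR3-1600
fpga = peak_bandwidth_gb_per_s(6, 800)         # 6 x DDR3-800
```

Under these assumptions the Plasticine setup works out to 51.2 GB/s, matching the slide. The same formula gives 38.4 GB/s for the FPGA setup, slightly above the 37.5 GB/s quoted, so that figure may follow a different unit convention or derating that the talk does not spell out.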


Experimental Setup

● Plasticine:

Performance: Cycle-accurate simulation using VCS + DRAMSim2

Area: Synopsys DC after synthesis

Chip Power: RTL trace-driven simulation using PrimeTime

● FPGA:

Performance: Measured execution time on FPGA

Utilization: Reports from Altera logic synthesis tools

Chip Power: Altera PowerPlay tool


Plasticine vs. FPGA


Resource Utilization

[Figure: resource utilization chart on a 0 to 100 axis; categories PCU, PMU, AG; legend FU, Reg]


Conclusion

● Co-designing reconfigurable architecture and programming models based on parallel patterns leads to efficient, programmable systems

● Plasticine accelerates dense and sparse applications composed of parallel patterns

● Design space exploration explores tradeoffs between architecture parameters and application characteristics

● Up to 95x improvement in Performance, 77x improvement in Perf/W over FPGA in similar process technology, with an area of 113 mm²


The Team

Christos Kozyrakis Kunle Olukotun

Yaqi Zhang David Koeplinger Matt Feldman

Tian Zhao Stefan Hadjis Ardavan Pedram
