

A Massively Parallel, Hybrid Dataflow/von Neumann Architecture

Yoav Etsion

November 11, 2015


Massively Parallel Computing

• CUDA/OpenCL are gaining traction in high-performance computing (HPC)
  – Same code; different data

• GPUs deliver better FLOPS per Watt
  – Available in mobile systems and supercomputers

• But… GPGPUs still suffer from von-Neumann inefficiencies


von-Neumann inefficiencies

• Fetch/Decode/Issue each instruction
  – Even though most instructions come from loops
• Explicit storage needed for communicating values between instructions
  – Register file; stack
  – Data travels between execution units and storage


[Understanding Sources of Inefficiency in General-Purpose Chips, Hameed et al., ISCA10]

Component          Power [%]
Inst. fetch        33%
Pipeline registers 22%
Data cache         19%
Register file      10%
Control            10%
ALU                 6%


Quantifying inefficiencies: instruction pipeline
• Every instruction is fetched, decoded and issued
• Very wasteful: most of the execution time is spent in (tight) loops

• Avg. pipeline power consumption:
  – NVIDIA Tesla: >10% of processor power [Hong and Kim, ISCA'10]
  – NVIDIA Fermi: ~15% of processor power [Leng et al., ISCA'13]


Quantifying Inefficiencies: Register File
• Communication via a bulletin board

– 40% of values only read once [Gebhart et al. ISCA’11]

• Avg. register file power consumption:
  – NVIDIA Tesla: 5-10% of processor power [Hong and Kim, ISCA'10]
  – NVIDIA Fermi: >15% of processor power [Leng et al., ISCA'13]


Alternatives to von-Neumann: Dataflow/spatial computing
• Processor is a grid of functional units
• Computation graph is mapped to the grid
  – Statically, at compile time
• No energy wasted on the pipeline
  – Instructions are statically mapped to nodes
• No energy wasted on the RF and data transfers
  – No centralized register file needed
  – Saves static power and area (128KB on Fermi)


Spatial/Dataflow Computing


int temp1 = a[threadId] * b[threadId];
int temp2 = 5 * temp1;
if (temp2 > 255) {
    temp2 = temp2 >> 3;
    result[threadId] = temp2;
} else {
    result[threadId] = temp2;
}

[Dataflow graph of the kernel: S_LOAD1/S_LOAD2 read a[threadIdx] and b[threadIdx]; ALU1_mul and ALU2_mul (with IMM_5) compute the products; ALU3_icmp compares against IMM_256; ALU4_ashl shifts by IMM_3; the if_then/if_else paths merge at JOIN1; S_STORE3/S_STORE4 write result]


SGMF: A Massively Multithreaded Dataflow Architecture
• Every thread is a flow through the dataflow graph
• Many threads execute (flow) in parallel


Execution Overview: Dynamic Dataflow
• Each flow/thread is associated with a token
• Execute the operation when tokens match
• Parallelism is determined by the number of tokens in the system

[Figure: grid with out-of-order (OoO) LD/ST units; token matching at each functional unit]
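The token-matching step can be sketched in a few lines: an operation fires only once operands carrying the same thread tag have arrived on all of its inputs, regardless of arrival order. This is a hypothetical illustration, not the SGMF hardware's actual matching logic:

```python
from collections import defaultdict

class MatchingUnit:
    """Fires a two-input operation once both operands tagged with the
    same thread id have arrived (arrival order does not matter)."""
    def __init__(self, op):
        self.op = op
        self.pending = defaultdict(dict)  # tag -> {port: value}

    def receive(self, tag, port, value):
        self.pending[tag][port] = value
        if len(self.pending[tag]) == 2:          # tokens matched: fire
            operands = self.pending.pop(tag)
            return tag, self.op(operands[0], operands[1])
        return None                              # still waiting for a match

mul = MatchingUnit(lambda x, y: x * y)
assert mul.receive(7, 0, 10) is None      # thread 7: left operand only
mul.receive(3, 1, 4)                      # thread 3 interleaves freely
assert mul.receive(7, 1, 10) == (7, 100)  # thread 7 matches and fires
```

Parallelism falls out naturally: tokens for many thread tags can be in flight through the same unit at once.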


DESIGN ISSUES
A Massively Multithreaded Dataflow Processor


Multithreading Design Issues: Preventing Deadlocks
• Imbalanced out-of-order memory responses may trigger deadlocks

[Figure: deadlock due to limited buffer space at the OoO LD/ST units]

Solution: load-store units limit bypassing to the size of the token buffer
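The stated solution can be sketched as a load/store unit that releases responses out of order only within a window the size of the consumer's token buffer; anything further ahead is held back until older requests complete. A hypothetical Python model (class and method names are invented for illustration):

```python
class LoadStoreUnit:
    """Delivers memory responses out of order, but never lets a response
    bypass more than buffer_size outstanding older requests, so the
    consumer's token buffer cannot be overrun (the deadlock scenario)."""
    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.oldest = 0     # oldest request id not yet delivered
        self.held = {}      # responses waiting for older ones to finish
        self.done = set()   # request ids already delivered

    def respond(self, req_id, value):
        self.held[req_id] = value
        out = []
        # Only responses inside the token-buffer window may bypass.
        for rid in sorted(self.held):
            if rid < self.oldest + self.buffer_size:
                out.append((rid, self.held.pop(rid)))
                self.done.add(rid)
        while self.oldest in self.done:   # slide the window forward
            self.oldest += 1
        return out

u = LoadStoreUnit(buffer_size=2)
u.respond(2, "c")   # too far ahead of request 0: held back, returns []
u.respond(0, "a")   # delivered; window slides to request 1
u.respond(1, "b")   # delivers both 1 and the held-back 2
```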


Design issues: Variable path lengths
• Short paths must wait for long paths

[Figure: dataflow graph with inputs a, b, c feeding multiply (×) and add (+) nodes along paths of unequal length; the short path holds a bubble until the long path completes]

Solution: equalize paths’ lengths
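Path equalization can be sketched as a compile-time pass: compute each node's longest-path depth in the graph and, for every edge, count how many buffer stages ("bubbles") are needed so that all inputs of a node arrive after the same number of hops. A hypothetical sketch, assuming the computation graph is a DAG given as an adjacency map (not the SGMF compiler's actual algorithm):

```python
def equalize(graph):
    """For each edge, count the buffer stages needed so every path into
    a node has equal length.  graph maps node -> list of successors."""
    depth = {}
    def d(node):
        # Longest-path depth from the sources (memoized recursion).
        if node not in depth:
            preds = [u for u, succs in graph.items() if node in succs]
            depth[node] = 1 + max((d(u) for u in preds), default=-1)
        return depth[node]
    for node in graph:
        d(node)
    # An edge u -> v needs depth(v) - depth(u) - 1 bubbles.
    return {(u, v): d(v) - d(u) - 1
            for u, succs in graph.items() for v in succs}

# a*b + c: the direct edge c -> add is one hop shorter than a -> mul -> add,
# so it gets one bubble.
bubbles = equalize({"a": ["mul"], "b": ["mul"], "c": ["add"],
                    "mul": ["add"], "add": []})
```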


ARCHITECTURE
A Massively Multithreaded Dataflow Processor


Architecture overview

Heterogeneous grid of tiles:
1. Compute tiles: very similar to CUDA cores
2. LD/ST tiles: buffer and throttle data
3. Control tiles: pipeline buffering and join ops.
4. Special tiles: deal with non-pipelined operations

Reference point:
– A single grid is the equivalent of a single NVIDIA Streaming Multiprocessor (SM)
– Total buffering capacity in SGMF is less than 30% of that of an NVIDIA Fermi register file


EVALUATION
A Massively Multithreaded Dataflow Processor


Methodology

The main HW blocks were implemented in Verilog and synthesized to a 65nm process
– Validate timing and connectivity
– Estimate area and power consumption
– The size of one SGMF core synthesized in a 65nm process is 54.3mm2
– When scaled down to 40nm, each SGMF core would occupy 21.18mm2
– An NVIDIA Fermi GTX480 card (40nm) occupies 529mm2

Cycle-accurate simulations based on GPGPU-Sim
– We integrated synthesis results into the GPGPU-Sim/Wattch power model

Benchmarks from the Rodinia suite
– CUDA kernels, compiled for SGMF


Single core system: SGMF vs. Fermi – Performance

[Chart: speedup over Fermi [x] for BFS, BP, CFD-1, CFD-2, GE-1, GE-2, PF, NN and Average, with 1, 2, 4, 8, 16, 32 and 64 tokens; y-axis 0 to 8]


Single core system: Energy savings

[Chart: instructions per Joule relative to Fermi [x] for BFS, BP, CFD-1, CFD-2, GE-1, GE-2, PF, NN and Average, with 1, 2, 4, 8, 16, 32 and 64 tokens; y-axis 0 to 7]


Conclusions

• von-Neumann engines have inherent inefficiencies
  – Throughput computing can benefit from dataflow/spatial computing
• SGMF can potentially achieve much better performance/power than current GPGPUs
  – Almost 2x speedup (average) and 50% energy saving
  – Need to tune the memory system
• Greatly motivates further research
  – Compilation, place&route, connectivity, …


Thank you!

Questions?
