Accelerating Applications using FPGAs Satnam Singh, Microsoft Research, Cambridge UK



A Heterogeneous Future


Example Speedup: DNA Sequence Matching


Why are regular computers not fast enough?


FPGAs are the Lego of Hardware


• multiple independent multi-ported memories
• fine-grain parallelism and pipelining
• hard and soft embedded processors


The heart of an FPGA


LUT4 (OR)


LUT4 (AND)


LUTs are higher order functions

[Figure: LUT blocks lut1, lut2, lut3 and lut4 with inputs i0 to i3 and output o]

inv = lut1 not

and2 = lut2 (&&)

mux = lut3 (λ s d0 d1 . if s then d1 else d0)
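The definitions above treat a LUT as a function that takes a Boolean function and returns a configured cell. A minimal Python sketch of that idea, assuming a k-input LUT is modelled as a 2^k-entry truth table filled by evaluating the given function on every input combination (the `Lut` class and the `or4` example are hypothetical, not part of any FPGA tool):

```python
from itertools import product
from typing import Callable, List, Sequence


class Lut:
    """A k-input lookup table configured from a Boolean function (sketch)."""

    def __init__(self, k: int, f: Callable[..., bool]) -> None:
        self.k = k
        # Configuration bits: entry n holds f applied to the bits of n,
        # with input i0 as the least significant bit.
        self.table: List[bool] = [False] * (2 ** k)
        for bits in product([False, True], repeat=k):
            self.table[self._index(bits)] = bool(f(*bits))

    @staticmethod
    def _index(bits: Sequence[bool]) -> int:
        return sum(int(b) << i for i, b in enumerate(bits))

    def __call__(self, *inputs: bool) -> bool:
        if len(inputs) != self.k:
            raise ValueError(f"expected {self.k} inputs, got {len(inputs)}")
        # Evaluating the cell is only a table lookup, as in hardware.
        return self.table[self._index(inputs)]


def lut1(f: Callable[..., bool]) -> Lut:
    return Lut(1, f)


def lut2(f: Callable[..., bool]) -> Lut:
    return Lut(2, f)


def lut3(f: Callable[..., bool]) -> Lut:
    return Lut(3, f)


def lut4(f: Callable[..., bool]) -> Lut:
    return Lut(4, f)


inv = lut1(lambda a: not a)
and2 = lut2(lambda a, b: a and b)
mux = lut3(lambda s, d0, d1: d1 if s else d0)
or4 = lut4(lambda a, b, c, d: a or b or c or d)  # the LUT4 (OR) configuration
```

The point of the design is that the function is consumed once, at configuration time; after that only the table remains, which is why any function of k inputs fits in the same cell.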


FPGAs as Co-Processors

XD2000i FPGA in-socket accelerator for Intel FSB

XD2000F FPGA in-socket accelerator for AMD socket F

XD1000 FPGA co-processor module for socket 940


What kinds of problems fit well on FPGAs?


[Figure: opportunity (scientific computing, data mining, search, image processing, financial analytics) versus challenge]


Fibonacci Example

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, ...


entity fib is
  port (signal clk, rst : in bit ;
        signal fibnr : out natural) ;
end entity fib ;

architecture behavioural of fib is
  signal lastFib, currentFib : natural ;
begin

  compute_fibs : process
  begin
    wait until clk'event and clk='1' ;
    if rst = '1' then
      lastFib <= 0 ;
      currentFib <= 1 ;
    else
      -- Signal assignments take effect together after the clock edge,
      -- so lastFib receives the old value of currentFib.
      currentFib <= lastFib + currentFib ;
      lastFib <= currentFib ;
    end if ;
  end process compute_fibs ;

  fibnr <= currentFib ;
end architecture behavioural ;
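Because both assignments in compute_fibs are signal assignments, they take effect together at the clock edge, so lastFib always receives the old value of currentFib. A cycle-by-cycle Python sketch of the two registers, assuming rst is asserted on the first edge only and letting unbounded Python integers stand in for natural:

```python
from typing import List


def simulate_fib(cycles: int) -> List[int]:
    """Value on fibnr after each rising clock edge, with rst high on edge 0 only."""
    last_fib, current_fib = 0, 0  # overwritten by the reset on the first edge
    trace: List[int] = []
    for cycle in range(cycles):
        rst = cycle == 0
        if rst:
            last_fib, current_fib = 0, 1
        else:
            # The right-hand side is evaluated before either name is rebound,
            # matching the simultaneous update of two VHDL signal assignments.
            last_fib, current_fib = current_fib, last_fib + current_fib
        trace.append(current_fib)
    return trace


print(simulate_fib(10))  # prints [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```

Writing the update as two separate Python statements would model VHDL variables instead, and lastFib would pick up the already-updated sum.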


demonstration...


[Figure: data parallel descriptions mapped to FPGA hardware (VHDL), GPU code (Accelerator) and SMP C++]


The Accidental Semi-colon

;


Kiwi

[Figure: structural, imperative (C) and parallel imperative descriptions, with gate-level VHDL/Verilog, Kiwi and C-to-gates; illustrated by a gate and flip-flop schematic, a sequence of statements, and jpeg.c split into threads 1, 2 and 3]


[Figure: the Kiwi flow. JPEG.cs and the Kiwi library (Kiwi.cs) form a circuit model used in Visual Studio for multi-thread simulation, debugging and verification; Kiwi Synthesis produces the circuit implementation JPEG.v]


[Figure: a parallel C# program whose threads are each translated by C-to-gates into a circuit, the circuits together forming the Verilog for the system]


Our Implementation

• Use regular Visual Studio technology to generate a .NET IL assembly language file.
• Our system then processes this file to produce a circuit:
  – The .NET stack is analyzed and removed.
  – The control structure of the code is analyzed and broken into basic blocks, which are then composed.
  – The concurrency constructs used in the program are used to control the concurrency and clocking of the generated circuit.
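The first two analysis steps above can be sketched on a toy stack language. This is a minimal Python illustration, not Kiwi's implementation: the instruction names (`load`, `push`, `store`, `add`, `sub`, `br`, `brfalse`, `label`) are hypothetical stand-ins loosely modelled on .NET IL, and the sketch assumes the evaluation stack is empty at every block boundary. Blocks start at labels and end after branches; symbolic execution then replaces each stack slot with the expression that produced it.

```python
from typing import List, Tuple

Instr = Tuple  # hypothetical instruction: (opcode, *operands), e.g. ("load", "a")

BINARY_OPS = {"add": "+", "sub": "-", "mul": "*"}
BRANCHES = ("br", "brfalse")


def basic_blocks(code: List[Instr]) -> List[List[Instr]]:
    """Split code into basic blocks: a label starts a block, a branch ends one."""
    blocks: List[List[Instr]] = []
    current: List[Instr] = []
    for ins in code:
        if ins[0] == "label" and current:
            blocks.append(current)
            current = []
        current.append(ins)
        if ins[0] in BRANCHES:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def remove_stack(block: List[Instr]) -> List[str]:
    """Symbolically execute one block: stack slots become expressions and
    every store becomes a named assignment, so no stack is left."""
    stack: List[str] = []
    out: List[str] = []
    for op, *args in block:
        if op == "push":
            stack.append(str(args[0]))
        elif op == "load":
            stack.append(args[0])
        elif op in BINARY_OPS:
            rhs, lhs = stack.pop(), stack.pop()  # top of stack is the right operand
            stack.append(f"({lhs} {BINARY_OPS[op]} {rhs})")
        elif op == "store":
            out.append(f"{args[0]} = {stack.pop()}")
        elif op == "brfalse":
            out.append(f"if not {stack.pop()}: goto {args[0]}")
        elif op == "br":
            out.append(f"goto {args[0]}")
        elif op == "label":
            out.append(f"{args[0]}:")
        else:
            raise ValueError(f"unknown instruction {op}")
    if stack:
        raise ValueError("sketch assumes an empty stack at block exits")
    return out


program = [
    ("load", "a"), ("load", "b"), ("add",), ("store", "t"),
    ("load", "t"), ("brfalse", "done"),
    ("load", "t"), ("push", 1), ("sub",), ("store", "a"),
    ("label", "done"), ("load", "a"), ("store", "result"),
]

for block in basic_blocks(program):
    print(remove_stack(block))
```

Once the stack is gone, each block is a set of assignments between named values, which is the form a circuit generator can turn into wires and registers.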


System Composition

• We need a way to separately develop components and then compose them together.

• Don’t invent new language constructs: reuse existing concurrency machinery.

• Adopt single-place channels for the composition of components.

• Model channels with regular concurrency constructs (monitors).


Writing to a Channel

using System.Threading;

public class Channel<T>
{
    T datum;
    bool empty = true;

    public void Write(T v)
    {
        lock (this)
        {
            while (!empty)
                Monitor.Wait(this);   // wait until the previous value has been read
            datum = v;
            empty = false;
            Monitor.PulseAll(this);   // wake any waiting reader
        }
    }


Reading from a Channel

    public T Read()
    {
        T r;
        lock (this)
        {
            while (empty)
                Monitor.Wait(this);   // wait until a value has been written
            empty = true;
            r = datum;
            Monitor.PulseAll(this);   // wake any waiting writer
        }
        return r;
    }
}
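The Channel class is a textbook monitor, so it translates directly to other languages. A Python sketch, assuming `threading.Condition` in the role of C#'s `lock` plus `Monitor.Wait`/`Monitor.PulseAll` (a translation for illustration, not Kiwi code; `demo` is a hypothetical usage helper):

```python
import threading
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class Channel(Generic[T]):
    """Single-place channel: write blocks while full, read blocks while empty."""

    def __init__(self) -> None:
        self._cond = threading.Condition()  # lock plus condition variable, like a C# monitor
        self._datum: Optional[T] = None
        self._empty = True

    def write(self, v: T) -> None:
        with self._cond:                # lock (this)
            while not self._empty:      # while (!empty) Monitor.Wait(this);
                self._cond.wait()
            self._datum = v
            self._empty = False
            self._cond.notify_all()     # Monitor.PulseAll(this);

    def read(self) -> T:
        with self._cond:
            while self._empty:
                self._cond.wait()
            self._empty = True
            r = self._datum
            self._cond.notify_all()
        return r  # type: ignore[return-value]


def demo() -> List[int]:
    chan: Channel[int] = Channel()
    received: List[int] = []

    def consumer() -> None:
        for _ in range(5):
            received.append(chan.read())

    t = threading.Thread(target=consumer)
    t.start()
    for i in range(5):
        chan.write(i)  # blocks until the consumer has taken the previous value
    t.join()
    return received
```

Waking all waiters (notify_all, like PulseAll) matters because readers and writers wait on the same condition; waking just one thread could wake another writer while the reader that could make progress keeps sleeping.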


[Figure: layers of concurrency abstractions: system-level concurrency constructs (threads, events, monitors, condition variables); rendezvous, join patterns, transactional memory; data parallelism; domain specific languages; user applications]


class FIFO2
{
    [Kiwi.OutputWordPort("result", 31, 0)]
    public static int result;

    static Kiwi.Channel<int> chan1 = new Kiwi.Channel<int>();
    static Kiwi.Channel<int> chan2 = new Kiwi.Channel<int>();


    public static void Consumer()
    {
        while (true)
        {
            int i = chan1.Read();
            chan2.Write(2 * i);
            Kiwi.Pause();
        }
    }

    public static void Producer()
    {
        for (int i = 0; i < 10; i++)
        {
            chan1.Write(i);
            Kiwi.Pause();
        }
    }


    public static void Behaviour()
    {
        Thread ProducerThread = new Thread(new ThreadStart(Producer));
        ProducerThread.Start();

        Thread ConsumerThread = new Thread(new ThreadStart(Consumer));
        ConsumerThread.Start();
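The same producer/consumer pipeline can be exercised in plain Python. This sketch assumes a capacity-1 `queue.Queue` behaves as a single-place channel (put blocks while full, get blocks while empty); `Kiwi.Pause()` marks clock boundaries in Kiwi and has no counterpart here. The slide's Behaviour() is cut off before it uses chan2, so collecting ten results from chan2 is an assumption of the sketch.

```python
import queue
import threading
from typing import List


def producer(chan1: "queue.Queue[int]") -> None:
    for i in range(10):
        chan1.put(i)            # chan1.Write(i)


def consumer(chan1: "queue.Queue[int]", chan2: "queue.Queue[int]") -> None:
    while True:                 # runs forever, like the C# Consumer
        i = chan1.get()         # chan1.Read()
        chan2.put(2 * i)        # chan2.Write(2 * i)


def behaviour() -> List[int]:
    # Capacity-1 queues act as single-place channels.
    chan1: "queue.Queue[int]" = queue.Queue(maxsize=1)
    chan2: "queue.Queue[int]" = queue.Queue(maxsize=1)
    # Daemon threads, so the forever-looping consumer does not keep Python alive.
    threading.Thread(target=producer, args=(chan1,), daemon=True).start()
    threading.Thread(target=consumer, args=(chan1, chan2), daemon=True).start()
    # Assumption: the truncated Behaviour() drains chan2; here we collect 10 values.
    return [chan2.get() for _ in range(10)]
```

Because each channel holds at most one value, the producer can run at most one item ahead of the consumer, which is the back-pressure a hardware handshake provides.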


Filter Example

[Figure legend: thread; one-place channel]


public static int[] SequentialFIRFunction(int[] weights, int[] input)
{
    int size = weights.Length; // assumed: one tap per weight (the slide uses an outer size)
    int[] window = new int[size];
    int[] result = new int[input.Length];

    // Clear the window of x values to all zero (C# arrays already start zeroed).
    for (int w = 0; w < size; w++)
        window[w] = 0;

    // For each sample...
    for (int i = 0; i < input.Length; i++)
    {
        // Shift in the new x value
        for (int j = size - 1; j > 0; j--)
            window[j] = window[j - 1];
        window[0] = input[i];

        // Compute the result value
        int sum = 0;
        for (int z = 0; z < size; z++)
            sum += weights[z] * window[z];
        result[i] = sum;
    }
    return result;
}
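The function computes y[i] = sum over z of weights[z] * x[i - z], treating samples before the start as zero. A Python sketch of the same direct-form computation, useful as a reference model when checking a hardware version; it assumes size equals the number of weights:

```python
from typing import List


def sequential_fir(weights: List[int], samples: List[int]) -> List[int]:
    """Direct-form FIR: y[i] = sum_z weights[z] * x[i - z], with x[n] = 0 for n < 0."""
    window = [0] * len(weights)      # window[z] holds x[i - z]
    result: List[int] = []
    for x in samples:
        window = [x] + window[:-1]   # shift in the new sample, drop the oldest
        result.append(sum(w * v for w, v in zip(weights, window)))
    return result


print(sequential_fir([1, 2, 3], [1, 0, 0, 0]))  # prints [1, 2, 3, 0]
```

Feeding a single 1 followed by zeros returns the weights themselves, which is a quick sanity check for any FIR implementation.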


Transposed Filter


static void Tap(int i, byte w, Kiwi.Channel<byte> xIn, Kiwi.Channel<int> yIn, Kiwi.Channel<int> yout)
{
    byte x;
    int y;
    while (true)
    {
        y = yIn.Read();
        x = xIn.Read();
        yout.Write(x * w + y);
    }
}


Inter-thread Communication and Synchronization

// Create the channels to link together the taps
for (int c = 0; c < size; c++)
{
    Xchannels[c] = new Kiwi.Channel<byte>();
    Ychannels[c] = new Kiwi.Channel<int>();
    Ychannels[c].Write(0); // Pre-populate y-channel registers with zeros
}


// Connect up the taps for a transposed filter
for (int i = 0; i < size; i++)
{
    int j = i; // Quiz: why do we need the local j?
    Thread tapThread = new Thread(delegate()
    {
        Tap(j, weights[j], Xchannels[j], Ychannels[j], Ychannels[j + 1]);
    });
    tapThread.Start();
}
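A Python sketch of the transposed structure, assuming capacity-1 `queue.Queue` objects as single-place channels and a helper thread that feeds zeros into the first tap (the slides do not show how the first y-channel or the output channel are driven, so both are assumptions). Each internal y-channel is pre-loaded with 0 so it acts as a register. In this sketch tap j holds weights[size - 1 - j], which makes the output match the direct-form SequentialFIRFunction; the slide's own indexing may differ.

```python
import queue
import threading
from typing import List


def tap(w: int, x_in: queue.Queue, y_in: queue.Queue, y_out: queue.Queue) -> None:
    """One filter stage, mirroring the slide's Tap: y_out = x * w + y_in, forever."""
    while True:
        y = y_in.get()
        x = x_in.get()
        y_out.put(x * w + y)


def zeros(chan: queue.Queue) -> None:
    while True:
        chan.put(0)  # the first tap always adds onto 0


def transposed_fir(weights: List[int], samples: List[int]) -> List[int]:
    size = len(weights)
    x_chans = [queue.Queue(maxsize=1) for _ in range(size)]
    y_chans = [queue.Queue(maxsize=1) for _ in range(size + 1)]
    for c in range(1, size):
        y_chans[c].put(0)  # pre-populate the pipeline registers with zeros
    threading.Thread(target=zeros, args=(y_chans[0],), daemon=True).start()
    for j in range(size):
        # args are evaluated here, at thread creation, so each tap gets its own j.
        threading.Thread(
            target=tap,
            args=(weights[size - 1 - j], x_chans[j], y_chans[j], y_chans[j + 1]),
            daemon=True,
        ).start()
    result: List[int] = []
    for x in samples:
        for chan in x_chans:
            chan.put(x)                    # broadcast the sample to every tap
        result.append(y_chans[size].get())  # last tap's output is the filter output
    return result
```

Unlike the direct form, no stage has to sum all the products: each tap adds one product to the partial sum handed on by its neighbour, so the critical path stays one multiply-add long however many taps there are.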


using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.Research.DataParallelArrays;
using PA = Microsoft.Research.DataParallelArrays.ParallelArrays;
using IPA = Microsoft.Research.DataParallelArrays.IntParallelArray;

namespace ForOxford
{
    class Program
    {
        static void Main(string[] args)
        {
            PA.InitGPU();
            IPA is1 = new IPA(4, new int[] { 1, 2, 3, 4 });
            IPA is2 = new IPA(4, new int[] { 5, 6, 7, 8 });
            IPA is3 = new IPA(4, is1.Shape);
            is3 = PA.Add(is1, is2);
            IPA result = PA.Evaluate(is3);
            int[] ra1;
            PA.ToArray(result, out ra1);
            foreach (int i in ra1)
                Console.Write(i + " ");
            Console.WriteLine("");
        }
    }
}


Example: Bitmap Blur (Using Accelerator v1.1.1)

using PA = Microsoft.Research.DataParallelArrays.ParallelArrays;
using FPA = Microsoft.Research.DataParallelArrays.FloatParallelArray;

float[,] Blur(float[] kernel)
{
    FPA pa = new FPA(bitmap);

    // Convolve in X direction.
    FPA resultX = new FPA(0, pa.Shape);
    for (int i = 0; i < kernel.Length; i++)
    {
        resultX += PA.Shift(pa, 0, i) * kernel[i];
    }

    // Convolve in Y direction.
    FPA resultY = new FPA(0, pa.Shape);
    for (int i = 0; i < kernel.Length; i++)
    {
        resultY += PA.Shift(resultX, i, 0) * kernel[i];
    }

    float[,] result;
    PA.ToArray(resultY, out result);
    return result;
}
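Blur is separable: a 1-D pass along X followed by a 1-D pass along Y gives the same result as applying the 2-D kernel formed by the outer product of kernel with itself (with the boundary handling used here). A list-based Python sketch of the same structure; `shift` is a hypothetical stand-in for `PA.Shift`, its clamp-to-edge boundary rule and shift direction are assumptions rather than Accelerator's documented behaviour, and bitmap is passed in explicitly instead of being read from an enclosing field.

```python
from typing import List

Grid = List[List[float]]


def shift(a: Grid, dr: int, dc: int) -> Grid:
    """Hypothetical PA.Shift stand-in: out[r][c] = a[r + dr][c + dc], clamped at the edges."""
    rows, cols = len(a), len(a[0])

    def clamp(v: int, n: int) -> int:
        return min(max(v, 0), n - 1)

    return [[a[clamp(r + dr, rows)][clamp(c + dc, cols)] for c in range(cols)]
            for r in range(rows)]


def add(a: Grid, b: Grid) -> Grid:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a: Grid, k: float) -> Grid:
    return [[x * k for x in row] for row in a]


def blur(bitmap: Grid, kernel: List[float]) -> Grid:
    rows, cols = len(bitmap), len(bitmap[0])
    result_x = [[0.0] * cols for _ in range(rows)]
    for i, k in enumerate(kernel):       # convolve in X direction
        result_x = add(result_x, scale(shift(bitmap, 0, i), k))
    result_y = [[0.0] * cols for _ in range(rows)]
    for i, k in enumerate(kernel):       # convolve in Y direction
        result_y = add(result_y, scale(shift(result_x, i, 0), k))
    return result_y
```

For a kernel of length k the two passes cost about 2k multiply-adds per pixel instead of k² for the full 2-D kernel.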


Expression Graphs

FPA pa = new FPA(bitmap);

// Convolve in X direction
FPA rX = new FPA(0, pa.Shape);

for (int i = 0; i < kernel.Length; i++)
{
    rX += PA.Shift(pa, 0, i) * kernel[i];
}

[Figure: the expression graph built by the loop: rX + Shift(pa, 0, 0) * k[0] feeds the next rX + Shift(pa, 0, 1) * k[1], and so on]
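In the style the graph above suggests, each operation on an array creates a node rather than computing values, and evaluation happens only when the result is requested. A small Python sketch of that deferred style, with hypothetical node classes and a 1-D, clamp-to-edge `Shift` (both assumptions, not the Accelerator API):

```python
from dataclasses import dataclass
from typing import List, Tuple


class Expr:
    """Base class: arithmetic on expressions builds graph nodes, never values."""

    def __add__(self, other: "Expr | float") -> "Expr":
        return Node("+", (self, lift(other)))

    def __mul__(self, other: "Expr | float") -> "Expr":
        return Node("*", (self, lift(other)))


@dataclass(frozen=True, eq=False)
class Const(Expr):
    value: float


@dataclass(frozen=True, eq=False)
class Array(Expr):
    data: Tuple[float, ...]


@dataclass(frozen=True, eq=False)
class Shift(Expr):
    src: Expr
    offset: int


@dataclass(frozen=True, eq=False)
class Node(Expr):
    op: str
    args: Tuple[Expr, ...]


def lift(v: "Expr | float") -> Expr:
    return v if isinstance(v, Expr) else Const(float(v))


def evaluate(e: Expr, n: int) -> List[float]:
    """Walk the graph and compute an n-element result."""
    if isinstance(e, Const):
        return [e.value] * n
    if isinstance(e, Array):
        return list(e.data)
    if isinstance(e, Shift):
        src = evaluate(e.src, n)
        return [src[min(max(i + e.offset, 0), n - 1)] for i in range(n)]
    a, b = (evaluate(x, n) for x in e.args)
    if e.op == "+":
        return [x + y for x, y in zip(a, b)]
    return [x * y for x, y in zip(a, b)]


def convolve_x(pa: Expr, kernel: List[float]) -> Expr:
    """Mirror of the slide's loop: only builds graph nodes, computes nothing."""
    rx: Expr = Const(0.0)
    for i, k in enumerate(kernel):
        rx = rx + Shift(pa, i) * k
    return rx
```

Having the whole graph before anything runs is what lets a back end compile it in one piece, whether to a GPU or, as the later slides show, to VHDL.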


class Program
{
    static void Main(string[] args)
    {
        IPA.InitGPU();

        IPA ipa1 = new IPA(5, new int[] {1, 2, 3, 4, 5});
        IPA ipa2 = new IPA(5, new int[] {10, 20, 30, 40, 50});
        IPA ipa3 = new IPA(5, new int[] {21, 5, 7, 4, 8});
        IPA ipa4 = new IPA(5, new int[] {4, 1, 7, 2, 5});

        IPA ipa5 = new IPA(5, ipa1.Shape);
        ipa5 = PA.Add(ipa1, ipa2);
        IPA result = PA.Multiply(ipa4, PA.Subtract(ipa3, PA.Add(ipa1, ipa2)));
        int[] ra1;
        PA.ToArray(result, out ra1);
        foreach (int i in ra1)
            Console.Write(i + " ");
        Console.WriteLine("");
    }
}
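Element by element, the program computes ipa4 * (ipa3 - (ipa1 + ipa2)). Plain Python lists (not the Accelerator API) make the expected console output easy to check by hand; the FPGA variant of the program computes the same expression.

```python
ipa1 = [1, 2, 3, 4, 5]
ipa2 = [10, 20, 30, 40, 50]
ipa3 = [21, 5, 7, 4, 8]
ipa4 = [4, 1, 7, 2, 5]

# ipa4 * (ipa3 - (ipa1 + ipa2)), one element at a time
result = [d * (c - (a + b)) for a, b, c, d in zip(ipa1, ipa2, ipa3, ipa4)]
print(" ".join(str(i) for i in result))  # prints 40 -17 -182 -80 -235
```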


class Program
{
    static void Main(string[] args)
    {
        IPA.InitFPGA();

        IPA ipa1 = new IPA(5, new int[] {1, 2, 3, 4, 5});
        IPA ipa2 = new IPA(5, new int[] {10, 20, 30, 40, 50});
        IPA ipa3 = new IPA(5, new int[] {21, 5, 7, 4, 8});
        IPA ipa4 = new IPA(5, new int[] {4, 1, 7, 2, 5});

        IPA ipa5 = new IPA(5, ipa1.Shape);
        ipa5 = PA.Add(ipa1, ipa2);
        IPA result = PA.Multiply(ipa4, PA.Subtract(ipa3, PA.Add(ipa1, ipa2)));
        int[] ra1;
        PA.ToArray(result, out ra1);
        foreach (int i in ra1)
            Console.Write(i + " ");
        Console.WriteLine("");
    }
}


with addr select
  net_7 <= 10 when 0,
           20 when 1,
           30 when 2,
           40 when 3,
           50 when 4;

process
begin
  wait until clk'event and clk='1' ;
  net_5 <= net_6 + net_7 ;
end process ;

process
  type net_4_delay_type is array (0 to 1) of integer ;
  variable net_4_delayed : net_4_delay_type ;
begin
  wait until clk'event and clk='1' ;
  net_4_delayed(0) := net_4_delayed(1) ;
  net_4_delayed(1) := net_4 ;
  net_3 <= net_4_delayed(0) - net_5 ;
end process ;
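The generated VHDL registers the adder output and passes net_4 through a short delay line, so that both operands of the subtractor belong to the same array element. A cycle-level Python sketch of that delay balancing for the whole expression ipa4 * (ipa3 - (ipa1 + ipa2)), assuming a hypothetical three-stage schedule (add, subtract, multiply, each followed by a register) rather than the exact timing of the generated circuit:

```python
from typing import List, Sequence


def pipelined(a: Sequence[int], b: Sequence[int],
              c: Sequence[int], d: Sequence[int]) -> List[int]:
    """Compute d * (c - (a + b)) with one element entering per clock cycle.

    Registers hold None until valid data reaches them. Every register's next
    value is computed from the current values, then all are updated at once.
    """
    n = len(a)
    sum_r = c_r = d_r1 = diff_r = d_r2 = prod_r = None
    out: List[int] = []
    for t in range(n + 2):  # two extra cycles flush the pipeline
        valid_in = t < n
        # Stage 1: adder; c waits one cycle, d starts a two-cycle delay.
        next_sum = a[t] + b[t] if valid_in else None
        next_c = c[t] if valid_in else None
        next_d1 = d[t] if valid_in else None
        # Stage 2: the subtractor sees c and a + b of the same element.
        next_diff = c_r - sum_r if sum_r is not None else None
        next_d2 = d_r1
        # Stage 3: the multiplier sees d, delayed twice, with its matching difference.
        next_prod = d_r2 * diff_r if diff_r is not None else None
        # Clock edge: all registers update together.
        sum_r, c_r, d_r1 = next_sum, next_c, next_d1
        diff_r, d_r2, prod_r = next_diff, next_d2, next_prod
        if prod_r is not None:
            out.append(prod_r)
    return out
```

Without the extra registers on c and d, each stage would combine values from different array elements; balancing the delays keeps one new element entering every cycle while results emerge three cycles later.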


• 8.249 ns max delay
• 3 x DSP48Es
• 63 slice registers
• 24 slice LUTs


let rec bfly r n = match n with 1 -> r | n -> ilv (bfly r (n-1)) >-> evens r
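The bfly definition composes circuit combinators. A Python sketch with circuits as list-to-list functions, assuming the usual Lava meanings: `ilv f` unriffles its input, applies f to each half and riffles the result back; `evens f` applies a two-input f to each adjacent pair; `>->` is left-to-right serial composition. Instantiating r with a two-input comparator (our choice for illustration, not on the slide) gives a network that sorts any bitonic input of length 2^n.

```python
from typing import Callable, List

Circuit = Callable[[List[int]], List[int]]


def serial(f: Circuit, g: Circuit) -> Circuit:
    """f >-> g: feed the output of f into g."""
    return lambda xs: g(f(xs))


def unriffle(xs: List[int]) -> List[int]:
    """Even-indexed elements followed by odd-indexed ones."""
    return xs[0::2] + xs[1::2]


def riffle(xs: List[int]) -> List[int]:
    """Inverse of unriffle: interleave the two halves."""
    half = len(xs) // 2
    return [x for pair in zip(xs[:half], xs[half:]) for x in pair]


def two(f: Circuit) -> Circuit:
    """Apply f to each half of the input."""
    return lambda xs: f(xs[:len(xs) // 2]) + f(xs[len(xs) // 2:])


def ilv(f: Circuit) -> Circuit:
    return serial(serial(unriffle, two(f)), riffle)


def evens(f: Circuit) -> Circuit:
    """Apply a two-input f to each adjacent pair (0,1), (2,3), ..."""
    return lambda xs: [y for i in range(0, len(xs), 2) for y in f(xs[i:i + 2])]


def bfly(r: Circuit, n: int) -> Circuit:
    """let rec bfly r n = match n with 1 -> r | n -> ilv (bfly r (n-1)) >-> evens r"""
    return r if n == 1 else serial(ilv(bfly(r, n - 1)), evens(r))


def cmp2(pair: List[int]) -> List[int]:
    """Two-input comparator: smaller value first."""
    a, b = pair
    return [min(a, b), max(a, b)]


print(bfly(cmp2, 3)([3, 5, 8, 9, 7, 4, 2, 1]))  # prints [1, 2, 3, 4, 5, 7, 8, 9]
```

The same bfly description yields a different circuit for every choice of r, which is the benefit of writing the wiring pattern once as a higher-order function.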


Cryptol

as = [0x3F 0xE2 0x65 0xCA] # new;
new = [| a ^ b ^ c || a <- as || b <- drop(1,as) || c <- drop(3,as) |];

[Figure: the seed values 3F, E2, 65 and CA of as are XOR-ed (^) to produce new]
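Read as a recurrence, the definition says each element of new is the XOR of three elements of as at offsets 0, 1 and 3, while as is the four seed bytes followed by new itself, so the stream feeds back on itself. A Python sketch that unrolls the first few elements, assuming that reading of the (older) Cryptol syntax:

```python
from typing import List, Sequence


def new_stream(seed: Sequence[int], count: int) -> List[int]:
    """First `count` elements of new, where as = seed # new and
    new[i] = as[i] ^ as[i + 1] ^ as[i + 3]."""
    xs = list(seed)        # grows to hold as = seed # new
    new: List[int] = []
    for i in range(count):
        v = xs[i] ^ xs[i + 1] ^ xs[i + 3]
        xs.append(v)       # each new value immediately becomes part of as
        new.append(v)
    return new


print([hex(v) for v in new_stream([0x3F, 0xE2, 0x65, 0xCA], 4)])
# prints ['0x17', '0x90', '0x3f', '0xe2']
```

Only the last four elements of as are ever needed, so in hardware this is a four-register shift chain with two XOR gates per bit.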


Bluespec

rule enqueueSOFData (rx_src_rdy_n_input == 0 && rx_sof_n_input == 0 && recv_state == Ready_for_frame) ;
  fifo_in.enq (rx_data_input) ;
  recv_state <= Reading_frame ;
endrule
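A Bluespec rule is a guarded atomic action: in a clock cycle where its guard holds, all of its actions happen together. A Python sketch of enqueueSOFData, assuming a hypothetical `Receiver` class, a FIFO depth of 2 (the slide gives none), and the Bluespec convention that a rule also waits for the implicit conditions of the methods it calls, here that fifo_in has room for enq:

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque

READY_FOR_FRAME, READING_FRAME = "Ready_for_frame", "Reading_frame"


@dataclass
class Receiver:
    fifo_in: Deque[int] = field(default_factory=deque)
    fifo_capacity: int = 2            # assumed depth; not given on the slide
    recv_state: str = READY_FOR_FRAME

    def can_fire(self, rx_src_rdy_n: int, rx_sof_n: int) -> bool:
        # Explicit guard from the rule (the _n inputs are active low) ...
        explicit = (rx_src_rdy_n == 0 and rx_sof_n == 0
                    and self.recv_state == READY_FOR_FRAME)
        # ... plus the implicit condition of fifo_in.enq: the FIFO is not full.
        return explicit and len(self.fifo_in) < self.fifo_capacity

    def clock(self, rx_src_rdy_n: int, rx_sof_n: int, rx_data: int) -> bool:
        """One clock cycle: fire the rule atomically if its guard holds."""
        if not self.can_fire(rx_src_rdy_n, rx_sof_n):
            return False              # neither action happens
        self.fifo_in.append(rx_data)  # fifo_in.enq (rx_data_input)
        self.recv_state = READING_FRAME
        return True
```

Folding the FIFO's readiness into the guard is what spares the Bluespec designer from writing the handshake logic by hand.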


Esterel

[Figure: one Esterel design compiled to VHDL/Verilog for a hardware implementation and to C (uart.c, containing void uart_device_driver (){.....}) for a software implementation]


Some Challenges for Spatial Computing

• Language support:
  – Specifying resources.
  – Specifying memory organization.
  – Specifying timing.
  – Specifying control.
  – Models of computation.
• Co-design and verification.
• System integration (OS APIs).
• AWFUL AWFUL AWFUL vendor tools.


Some Challenges for Heterogeneous Systems

• A single model for programming very different kinds of computational elements?
• Giving up abstractions:
  – memory
• Constant failure:
  – dynamically re-mapping computations


Questions?