
Reconfigurable Computing

Roberto Innocente (inno@sissa), SISSA

May 10, 2014

Slides: people.sissa.it/~inno/pubs/reconfig-computing-16-9-tris.pdf


Flexibility

ASIC (Application Specific Integrated Circuit): very inflexible, designed to solve just 1 problem. Energy, space and time efficient.

GPP (General Purpose Processor): very flexible, can solve any problem. Energy, space and time inefficient.

Reconfigurable hardware: flexible, but still reasonably efficient in energy, time and space.


History


Gerald Estrin/1

Gerald Estrin is credited with the idea, in the 1960s, of the first reconfigurable computer: the fixed-plus-variable (F+V) structure computer.

Gerald Estrin. ACM 1960. Organization of computer systems: the fixed plus variable structure computer.


Gerald Estrin/2

He envisioned that important gains in performance could be achieved when many computations are executed on appropriate problem-oriented configurations.

F+V is made of:

- a high speed general purpose computer (the F part): initially an IBM 7090

- high speed special structures of various sizes (the V part), problem specific: trigonometric functions, logarithms, exponentials, n-th powers, complex arithmetic, …

V is made of a motherboard with 36 module positions, which can undergo:

- Function reconfiguration: physically changing some modules

- Routing reconfiguration: changing part of the back wiring

The Rammig machine (1977): investigation of a reconfigurable machine with no manual or mechanical intervention


Today's reconfigurable hardware

It was born out of the wish to replace assorted logic ICs (Integrated Circuits), and later to rapidly prototype large ASICs (Application Specific ICs) or implement SoCs (Systems on Chip).

In the following slides readers are assumed to be involved in scientific computing, not to be electrical engineers.


Basic digital circuits

Figure: AND, OR, inverter, MUX, D-type flip-flop (FF), shift register

Usually 0 = 0 V, 1 = some positive voltage


SSI 74xx IC


PLD

Inconveniences of standard discrete logic circuits:

- 14 pin packages of 4/6 logic functions

- often you had to traverse the PCB to find a free OR or inverter

- if you needed only a few gates, you still had to place an IC with 4/6 of them

Hence the idea of the PLD (Programmable Logic Device):

- SPLD (Simple: PAL/PLA)

- CPLD (Complex)

in which a simple interconnection network could be configured by blowing some internal fuses (fuse technology) to implement combinational logic.


Disjunctive normal form (aka sum of products)

Every boolean function of some boolean variables can be represented as a sum of minterms (a minterm is a product in which every variable appears, either plain or complemented).

With 3 boolean variables a, b, c there are 2^3 = 8 minterms, and

a b c , a b c

are 2 of them:

f (a, b, c) = a b c + a b c
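As a minimal sketch of the sum-of-minterms idea, the following Python snippet derives the disjunctive normal form of a boolean function from its truth table. The helper `minterms` and the example function `majority` are illustrative assumptions and do not come from the slides; a complemented variable is written with a leading `~`.

```python
from itertools import product

def minterms(f, names):
    """Return the DNF of f as a list of minterm strings."""
    terms = []
    # Walk the whole truth table: 2**len(names) input combinations.
    for values in product([0, 1], repeat=len(names)):
        if f(*values):
            # Every variable appears once, plain or complemented.
            literals = [n if v else "~" + n for n, v in zip(names, values)]
            terms.append(" ".join(literals))
    return terms

# Hypothetical example function: true when at least two inputs are 1.
def majority(a, b, c):
    return (a + b + c) >= 2

dnf = minterms(majority, ["a", "b", "c"])
print(" + ".join(dnf))
```

A function that is always true would list all 2^3 = 8 minterms, which is the upper bound quoted on the slide.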


PLA (Programmable Logic Array)

f1=p1+ p2+ p3=x1x2+x1 x3+ x1 x2 x3+x1 x3
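A PLA can be thought of as a programmable AND plane (which forms product terms) followed by a programmable OR plane (which sums selected products into outputs). The sketch below models that structure in Python. The function `pla_eval` and the specific product terms are assumptions chosen for illustration; they are not the programming shown in the slide's figure.

```python
def pla_eval(and_plane, or_plane, inputs):
    """Evaluate a PLA: the AND plane forms products, the OR plane sums them."""
    products = [
        all(inputs[var] == wanted for var, wanted in term.items())
        for term in and_plane
    ]
    return [int(any(products[i] for i in selected)) for selected in or_plane]

# Hypothetical programming of the AND plane: three product terms.
and_plane = [
    {"x1": 1, "x2": 1},           # p1 = x1 x2
    {"x1": 1, "x3": 0},           # p2 = x1 ~x3
    {"x1": 0, "x2": 0, "x3": 1},  # p3 = ~x1 ~x2 x3
]
# Hypothetical OR plane: f1 = p1 + p2 + p3, f2 = p1 + p3 (products are shared).
or_plane = [[0, 1, 2], [0, 2]]

print(pla_eval(and_plane, or_plane, {"x1": 1, "x2": 0, "x3": 0}))
```

Sharing product terms between outputs is what distinguishes this structure from wiring one gate network per function.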


FPGA

CPLDs also showed their limits, so between 1985 and 1990 Xilinx introduced a more flexible design, the

FPGA (Field Programmable Gate Array)

in which the interconnection network is much more flexible and onto which sequential circuits can also be easily mapped.


FPGA idea

1985, Xilinx, Ross Freeman (inventor of the FPGA): “What if we could develop the equivalent of a circuit board full of standard logic parts (like TTL and PAL devices) on a single high density programmable logic chip?”

- post-fabrication programmability by end users

- fabless semiconductor company


Today


FPGA market

Dominated by 2 players:

- Altera

- Xilinx

Up from 67% in 2010, together they now hold 90% of the market (4.5 billion USD revenue in 2012)

From sourcetech411 (2010)


An important question: are FPGAs green ?

Virtex-7 2000T (one of the top FPGAs): ~20 W

- Xilinx showed 3600 copies of its 8-bit processor nanoblaze running on a Virtex-7, consuming 20 W

CPU: ~100 W

- Core i7-4770K Haswell (22 nm), 3.5 GHz, 4 cores: 84 W

- Core i7-3930K Sandy Bridge-E (32 nm), 3.2 GHz, 6 cores: 130 W

- Xeon E7458 Dunnington (45 nm), 2.4 GHz: 90 W

- Xeon E7460 Dunnington (45 nm), 2.66 GHz: 130 W

GPU: ~220 W

- Nvidia Tesla M2090: 225 W

- Nvidia Tesla K20X: 235 W

This is a partial answer. We need to be able to estimate FPGA performance to give a more useful index.


FPGA architecture

From RF and Wireless World

Sea of gates : logic blocks are like islands in a sea of interconnections


Virtex family

Year  Family     Process  Clock     Size
1998  Virtex     250 nm   100 MHz   25k-60k cells
2000  Virtex-E   180 nm   300 MHz   1k-70k cells
2000  Virtex-II  150 nm   420 MHz   up to 93k 4-LUTs, up to 168 multipliers
2005  Virtex-4   90 nm    500 MHz   up to 200k cells
2007  Virtex-5   65 nm    550 MHz   up to 330k cells
      Virtex-6   40 nm              up to 500k 6-LUTs, 288-2k DSP
2010  Virtex-7   28 nm    ~500 MHz  up to 2000k cells
2014  Virtex-US  20 nm              up to 4400k cells

From L. Zhuo

Up to ~7 billion transistors. For comparison: Intel 2014 15-core Xeon Ivy Bridge-EX ~4.3 billion transistors; Nvidia 2012 GK110 Kepler ~7 billion transistors.


FPGA/CPU evolution


Virtex-7 is not monolithic

2.5D technology: 4 FPGA tiles on a silicon interposer that provides 10k interconnections between layers


Enabling technologies


Programming technology/1

Antifuse: OTP (one time programmable)

Disordered except at very low range

SRAM: pass transistor in switch block


Programming technology/2

Antifuse

- pros: cheap, small

- cons: requires special processing, one time programming

SRAM

- pros: can be deployed with a standard semiconductor process, can be easily reprogrammed

- cons: large area required (6 transistors)


Confware

The configuration of an FPGA (which is compiled to a stream of bits) is neither hardware nor software.

Someone coined the neologism

confware

for the configuration of reconfigurable hardware.


How do you configure an FPGA?

The SRAM cells form a long shift register, loaded serially by clocking in the confware.

Virtex-7 2000T = 440 Mbits of SRAM cells

(simplified: large FPGAs can also load the confware in parallel)
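The serial loading of configuration bits can be sketched as a plain shift register. The Python below is a toy model under stated assumptions: `load_confware` and `serial_load_time` are hypothetical helpers, and the 100 MHz configuration clock in the usage line is an assumed figure for illustration, not a Virtex-7 specification.

```python
def load_confware(bitstream, n_cells):
    """Clock a bitstream serially into a chain of n_cells SRAM cells."""
    cells = [0] * n_cells
    clocks = 0
    for bit in bitstream:
        # On each clock every cell takes its neighbour's value and the
        # new bit enters at the head of the chain.
        cells = [bit] + cells[:-1]
        clocks += 1
    return cells, clocks

def serial_load_time(n_bits, clock_hz):
    """Seconds needed to shift n_bits in at one bit per clock."""
    return n_bits / clock_hz

cells, clocks = load_confware([1, 0, 1, 1], 4)
print(cells, clocks)
# 440 Mbit at an assumed 100 MHz configuration clock:
print(serial_load_time(440e6, 100e6), "s")
```

One bit per clock means load time grows linearly with device size, which is one reason parallel configuration interfaces exist on large parts.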


Logic Blocks/Logic Cells


Fine/coarse grain logic blocks

From:

- a single transistor (Crosspoint: went bankrupt)

- a logic gate

To:

- a complete processor (FPNA: field programmable node arrays)

NB. FPNA also stands for field programmable neural array


CLB (Configurable Logic Blocks)

Homogeneous:

- Logic cells: 4-input LUT (LookUp Table) + flip-flop

Heterogeneous (modern development):

- Logic cells

- DSP (Digital Signal Processing)

- Memory blocks

- I/O blocks

This differentiation is necessary to allow things like multiplication/addition to be mapped in an efficient way.

The heterogeneous architecture is prevalent now. The blocks are configured by SRAM bits, usually loaded through serial ports as already pointed out.


Standard Logic Cell

- 4-input LUT: 16 bits of SRAM for configuration

- D-type flip-flop

- 2:1 mux: 1 bit of SRAM for configuration


Standard LUT (Look Up Table)

Dec  Bin   Out
0    0000  0
1    0001  1
2    0010  0
3    0011  0
4    0100  1
5    0101  0
6    0110  1
7    0111  1
..   ..    ..

- 16 x 1 memory

- any boolean function of 4 inputs (Bit 0 ... Bit 3 are x0 ... x3); for the rows shown:

$f = \bar{x}_3 \bar{x}_2 \bar{x}_1 x_0 + \bar{x}_3 x_2 \bar{x}_1 \bar{x}_0 + \bar{x}_3 x_2 x_1 \bar{x}_0 + \bar{x}_3 x_2 x_1 x_0$

NB. LUT rhymes with nut
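The "16 x 1 memory" view of a LUT can be made concrete in a few lines: the four inputs form an address, and the configuration bits stored at that address are the function's outputs. In this sketch `make_lut` is a hypothetical helper; rows 0-7 follow the table on this slide, while rows 8-15 are assumed to be 0 because the slide elides them.

```python
def make_lut(truth_bits):
    """Return a 4-input LUT backed by a 16 x 1 bit memory."""
    assert len(truth_bits) == 16
    def lut(x3, x2, x1, x0):
        # The inputs are simply the address lines of the memory.
        address = (x3 << 3) | (x2 << 2) | (x1 << 1) | x0
        return truth_bits[address]
    return lut

# Rows 0-7 as in the slide's table; rows 8-15 are an assumption (all 0).
f = make_lut([0, 1, 0, 0, 1, 0, 1, 1] + [0] * 8)
print([f(0, 0, 0, 1), f(0, 1, 0, 0), f(0, 0, 1, 0)])
```

Changing the function means rewriting the 16 stored bits; the lookup circuit itself never changes.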


Uses of Logic Cell

- 2^4 = 16 x 1 bit memory

- Any boolean function of 4 inputs

- 4:1 multiplexer


Virtex-7 Logic Block basics


Virtex-7 Logic slice

From Xilinx

4 x 32 = 128 bit shift register


Virtex-7 CLB slice

- 6-input LUT

- 2 5-input LUTs with the same inputs

- 2 arbitrary boolean functions of 3 inputs and of 2 inputs or fewer


Altera ALM


Interconnection network


Interconnection network

- Hierarchical routing

- Island type routing (predominant)

- Nearest neighbours

The interconnection network can consume 80% of the area of an FPGA!


Programmable switch


SRAM routing: coarse/fine grain

Figure: 5 bit SRAM, 1 bit SRAM


Details of island type routing


Disjoint/Wilton switch blocks

Disjoint: a wire can only exit on a wire with the same number, which creates routing domains

Wilton: a wire can change domain in at least one direction


Channel segments distribution


Columnar architecture

7 series Xilinx FPGA


DSP blocks & floating point


FPGAs floating point in 1994

B. Fagin and C. Renard. Field Programmable Gate Arrays and Floating Point Arithmetic. IEEE Transactions on VLSI Systems, 2(3), September 1994.

Fagin & Renard report that floating point operators can be implemented but are impractical: no FPGA in existence could contain a single multiplier circuit!


FPGA fp in 1995

Shirazi et al., following the line of Fagin & Renard, propose 2 custom floating point formats, 16 and 18 bits wide in total, and provide add, sub, mul and div operators for them.

N. Shirazi, A. Walters, and P. Athanas. Quantitative Analysis of Floating Point Arithmetic on FPGA Based Custom Computing Machines. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, April 1995.


FPGA fp in 2002

Belanovic & Leeser present a library of variable width parameterized floating point operators (a superset of the IEEE formats)

Pavle Belanovic and Miriam Leeser. A Library of Parameterized Floating-point Modules and Their Use. 2002.


What allowed the breakthrough?

The addition, by the major vendors, of hardware multipliers (called DSP blocks) to their FPGAs from 2000 on:

- first Xilinx, on Virtex-II

- soon after Altera, on Stratix

Over the last decade this also sparked the interest of the HPC community:

Cray XD1, Silicon Graphics RASC, Convey HC-1

HPRC = High Performance Reconfigurable Computing


FPGA MAC operation


Virtex-7 DSP48 high level

From Xilinx

1 bit 2 bit


DSP48E1 details


Altera Stratix V DSP block

4 multiplications (*) + 3 additions (+) = 7 flops


Data Flow Graphs (DFG)


Data flow

A representation of a program as a DG (Directed Graph) in which the nodes are the operations and the edges represent the data dependencies from one operation to the next.


Control flow/Data Flow

$x = -\frac{b}{2a} \pm \frac{\sqrt{b^2 - 4ac}}{2a}$

dis2=b**2-4*a*c

If dis2 < 0 complex!

dis=sqrt(dis2)

u1=-b/(2*a)

u2=dis/(2*a)

x1=u1+u2

x2=u1-u2
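To show what the data-flow view adds over the statement order, the sketch below encodes the quadratic-root computation as a dependency graph and assigns each node the earliest step at which it could fire. The node names, the grouping of `4*a*c` into a single node, and the helper `asap_levels` are simplifying assumptions made for this illustration.

```python
# Each node lists the nodes whose results it needs (its incoming edges).
dfg = {
    "b*b":   [],
    "4*a*c": [],
    "2*a":   [],
    "dis2":  ["b*b", "4*a*c"],
    "dis":   ["dis2"],
    "u1":    ["2*a"],
    "u2":    ["dis", "2*a"],
    "x1":    ["u1", "u2"],
    "x2":    ["u1", "u2"],
}

def asap_levels(graph):
    """Earliest step at which each node can fire (inputs ready at step 0)."""
    levels = {}
    def level(node):
        if node not in levels:
            levels[node] = 1 + max((level(d) for d in graph[node]), default=0)
        return levels[node]
    for node in graph:
        level(node)
    return levels

levels = asap_levels(dfg)
print(levels)
print("operations:", len(dfg), "critical path:", max(levels.values()))
```

Under these assumptions the 9 operations fit in 5 steps, because independent nodes such as `u1` and `dis2` can execute side by side in hardware.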


A scalar product

Fortran:

      acc=0.0
      do i=1,4
        acc=acc+a(i)*b(i)
      enddo

C:

    acc=0.0;
    for(i=0;i<4;i++){
      acc=acc+a[i]*b[i];
    }
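The loop above can be mapped to hardware in more than one way, which is the time/space tradeoff in miniature. The Python sketch below contrasts two hypothetical mappings: one multiplier-accumulator reused over time, and a fully unrolled version with one multiplier per element feeding a tree of adders. The step counts are an idealised model (one step per MAC or per tree level), not timings of any real device.

```python
def dot_sequential(a, b):
    """One MAC unit reused over time: n steps, minimal area."""
    acc, steps = 0.0, 0
    for x, y in zip(a, b):
        acc += x * y
        steps += 1
    return acc, steps

def dot_unrolled(a, b):
    """n multipliers in parallel feeding a tree of adders."""
    level = [x * y for x, y in zip(a, b)]   # step 1: all products at once
    steps = 1
    while len(level) > 1:                   # each pass is one adder level
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [sum(p) for p in pairs]
        steps += 1
    return level[0], steps

a, b = [1, 2, 3, 4], [5, 6, 7, 8]
print(dot_sequential(a, b))
print(dot_unrolled(a, b))
```

For 4 elements the unrolled form needs 4 multipliers and 3 adders to finish in 3 steps instead of 4; the gap widens with vector length, since the tree depth grows only logarithmically.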


Time/Space tradeoffs


Systolic array matrix mult

A(n,n) x B(n,n) requires:

- 2n-1 steps for the last elements to enter the array

- n-1 steps to compute the last c(n,n)

- n steps to move the result out

= 4n-2 steps
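The step count can be checked with a small cycle-level simulation. The sketch below assumes an output-stationary array in which processing element (i, j) accumulates c(i,j), row i of A is skewed by i cycles and column j of B by j cycles; that particular arrangement and the function name `systolic_matmul` are assumptions, since the slide's figure is not reproduced here.

```python
def systolic_matmul(A, B):
    """Simulate an n x n output-stationary systolic array cycle by cycle."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    last_active = 0
    for t in range(3 * n):  # generous upper bound on the cycle count
        for i in range(n):
            for j in range(n):
                # With skewed inputs, PE (i, j) sees A[i][k] and B[k][j]
                # at cycle t, where k = t - i - j.
                k = t - i - j
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
                    last_active = t + 1
    compute_steps = last_active           # (2n-1) + (n-1) = 3n-2
    total_steps = compute_steps + n       # plus n steps to drain results
    return C, compute_steps, total_steps

C, compute_steps, total_steps = systolic_matmul([[1, 2], [3, 4]],
                                                [[5, 6], [7, 8]])
print(C, compute_steps, total_steps)
```

For n = 2 the simulation reports 4 compute steps and 6 steps in total, in agreement with 3n-2 and 4n-2.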


Codesign

The implementation of algorithms on FPGAs requires a mix of hw and sw design :

Codesign = hw design + sw design


How to program FPGAs?

Mainly with an HDL (Hardware Description Language):

- Verilog (initially developed by Gateway Design Automation, now a standard)

- VHDL (born out of a standards committee)

But translators for OpenCL, ImpulseC, SystemC, C, Handel-C, ... are also available. Is this a good idea?

The problem is that those languages were not designed for describing hardware, and the translation usually ends up as an FSM (finite state machine) with 1 state for every statement, which then steps through the states.

This is not the way someone skilled would program the FPGA.

Figure: FSM (finite state machine): the input and the state register (D flip-flop, D/Q, clocked by clk) feed the next-state logic; the output logic produces Out.
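The three boxes of the FSM figure (next-state logic, state register, output logic) map naturally onto three pieces of code. The Python sketch below is a toy Moore machine that raises its output after two consecutive 1s on the input; the state names and the detected pattern are assumptions picked for illustration.

```python
def next_state_logic(state, inp):
    """Combinational: next state from current state and input."""
    if inp == 0:
        return "S0"
    return "S1" if state == "S0" else "S2"

def output_logic(state):
    """Combinational (Moore style): output depends on the state only."""
    return 1 if state == "S2" else 0

def run_fsm(inputs):
    state = "S0"                       # state register after reset
    outputs = []
    for inp in inputs:                 # one loop iteration = one clk edge
        state = next_state_logic(state, inp)   # D is latched into Q
        outputs.append(output_logic(state))
    return outputs

print(run_fsm([1, 1, 0, 1, 1, 1]))
```

A C-to-hardware translator that emits one state per source statement produces the same structure, only with many more states and one state transition per clock.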


FPGA will win

For many years FPGAs were just prototyping vehicles for ASICs

– Now they are replacing many ASICs & ASSPs

– Watch for the same Trojan effect with FPGAs in HPC


FPGA lingo


Core

A core, in FPGA lingo, is a function ready to be instantiated into your design as a “black box”. It can be supplied as HDL or as a schematic.

It supports design re-use.


Soft/hard cores

On FPGAs functional modules can be implemented:

- using standard FPGA resources (logic blocks, DSPs, memory blocks): soft cores

- as an ASIC on the FPGA die: hard cores

When the manufacturer puts a processor on the FPGA as a hard core, it sells the device as a SoC (System on Chip): dual ARM on the Zynq-7000 chip, PowerPC on Xilinx Virtex-II Pro FPGAs


IP/open cores

The soft attribute is implied.

Hardware designs in an HDL (possibly using vendor libraries):

- open source cores: http://opencores.org/

OpenRISC 1000 architecture from the OpenCores community, the Lattice Semiconductor LM32, the LEON3 from Aeroflex Gaisler and the OpenSPARC family from Oracle

- proprietary: IP (Intellectual Property) cores

Floating point operators, FFT, matrix computations


Commercial offers


Picocomputing

- SC6 1U: up to 16 FPGAs

- SC6 4U: up to 48 FPGAs

- EX-600, EX-800

From PICOCOMPUTING


Bittware Terabox

16 Altera Stratix V

From Bittware


DINIGROUP Cluster of 4 Virtex7

From DINIGROUP


DINIGROUP Cluster of 40 Kintex-7

From DINIGROUP


Maxeler MPC-X

Daresbury Lab, UK: the dataflow supercomputer will feature Maxeler-developed MPC-X nodes capable of an equivalent 8.52 TFLOPs per 1U and 8.97 GFLOPs/Watt.


Convey HC-2 , HC-2ex


Cray XT5h

“Cray introduces an hybrid supercomputer that can integrate multiple processor architectures into a single system and accelerate high performance computing (HPC) workflows. The Cray XT5h delivers higher sustained performance, by applying alternative processor architectures across selected applications within an HPC workflow. The Cray XT5h supports a variety of processor technologies, including scalar processors based on AMD Opteron™ dual and quad-core technologies, vector processors, and FPGA accelerators.”


CHREC

Center for High Performance Reconfigurable Computing

UF/BYU/GWU/VTECH


CHREC Novo-G 384 FPGAs

“Novo-G is the most powerful reconfigurable supercomputer in the known world. This unique machine features 192 top-end, 40nm FPGAs (Altera Stratix-IV E530) and 192 top-end, 65nm FPGAs (Stratix-III E260).”

http://www.chrec.org/

(pronounced "shreck")


CHREC/2

BLAST, like Smith-Waterman, computes the local alignment of 2 sequences:

- Novo-BLAST, the Novo-G/CHREC implementation: faster, same sensitivity

IPC (Isotope Pattern Calculator) of the Protein Identification Algorithm:

- speedup of 52-366x on a single FPGA, 1259x on 4 FPGAs, 3340x on a node (16 FPGAs)


References for Applications


Linear Algebra for RC

Juan Gonzalez and Rafael C. Núñez. LAPACKrc: Fast linear algebra kernels/solvers for FPGA accelerators (JP 2009). DOD funded.


DCT, FFT on FPGAs

U. Meyer-Baese. Digital Signal Processing with Field Programmable Gate Arrays, 3rd edition (2007). Springer Verlag.


MD on FPGA

There are many papers about porting Molecular Dynamics algorithms to FPGAs, with substantially positive conclusions from experiments on 1-2 FPGAs. But in recent years there has been an embarrassing comparison with ANTON (Shaw et al.).

We cannot forget that ANTON is a really huge machine consuming over 100 kW!

And it is made of 512 dedicated ASICs at 1 GHz!

The comparison with a few FPGAs consuming 40/60 W is improper.

M. A. Khan, M. Chiu, M. C. Herbordt. FPGA-Accelerated Molecular Dynamics (2013).


Neural networks on FPGAs

Omondi, Rajapakse (editors). FPGA Implementations of Neural Networks (2006).

An ANN (Artificial Neural Network) in integer arithmetic performs 40x better than on a GPP (on an old FPGA, 3 generations old)


Altera Arria 10


Arria10


Arria 10 variable precision DSP block

[Figure (Altera): variable precision DSP block with inputs A, B, C, D]

A + C*D = 2 flops


Arria 10 estimated single-precision FP performance

- 2 flops per cycle per DSP block
- 1688 single-precision FP DSP blocks (GX660)

1688 * 2 = 3376 flops per cycle

3376 * 0.5 GHz ~ 1.7 Teraflops in single precision
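The estimate above can be reproduced with a few lines of Python. This is only a sketch of the slide's arithmetic: it assumes every DSP block issues one multiply and one add on every cycle, and the constant and function names are illustrative.

```python
# Peak single-precision estimate from the Arria 10 GX660 figures quoted above.
FLOPS_PER_DSP_PER_CYCLE = 2   # one multiply plus one add per DSP block
DSP_BLOCKS = 1688             # single-precision capable DSP blocks (GX660)
CLOCK_GHZ = 0.5               # clock assumed on the slide


def peak_gflops(dsp_blocks, flops_per_dsp, clock_ghz):
    """Peak GFLOPS if every DSP block is busy on every cycle."""
    return dsp_blocks * flops_per_dsp * clock_ghz


flops_per_cycle = DSP_BLOCKS * FLOPS_PER_DSP_PER_CYCLE
peak = peak_gflops(DSP_BLOCKS, FLOPS_PER_DSP_PER_CYCLE, CLOCK_GHZ)
print(flops_per_cycle)   # 3376
print(peak / 1000)       # 1.688 (Teraflops)
```

This is a peak figure; a real design rarely keeps every DSP block busy on every cycle.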


Hard single prec FP on FPGA ?!?

For people who can live with single precision this seems a very attractive new feature.

But many think that it wastes too many generic resources, and claim that what was missing were simpler blocks!


Back of the envelope performance estimation


Back of the envelope performance estimation

Given number of

- LUTs

- FFs

- DSPs

offered by an FPGA,

and utilization of resources by operators, estimate the max number of operators that can be implemented on the FPGA

Today FPGA clocks are ~500 MHz = 0.5 GHz (the unavoidable price for flexibility).

2000 flops per cycle = 1 Teraflops
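A minimal Python sketch of this kind of estimate, for a single operator type. The resource budget is the Virtex-7 V2000T one used in the tables of this deck; the per-operator cost is a made-up, purely hypothetical example, and the function name is illustrative.

```python
def max_operators(available, cost):
    """Largest number of identical operators that fits in the device.

    available and cost map a resource name to a count; a resource the
    operator does not use (cost 0) never limits the result.
    """
    limits = [available[r] // c for r, c in cost.items() if c > 0]
    return min(limits)


# Budget: LUTs, FFs and DSPs of a Virtex-7 V2000T.
available = {"lut": 1_221_600, "ff": 2_443_200, "dsp": 2_160}
# Hypothetical cost of one operator (illustrative numbers only).
cost = {"lut": 500, "ff": 600, "dsp": 2}

n = max_operators(available, cost)
clock_ghz = 0.5
print(n)               # 1080: the DSP blocks run out first
print(n * clock_ghz)   # 540.0 GFLOPS at one flop per operator per cycle
```

The scarcest resource sets the limit, which is why real designs mix operator variants that use DSP blocks with variants built from LUTs only.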


Xilinx Virtex-7 family

Virtex-7 slices: 4 x 6-input LUTs, 8 FFs

Virtex-7 DSPs: 48 bits pre-adder, 25x18 multiplier, 48 bits accumulator

Virtex LUT ~ 1.6 standard LUT


Custom precision 17/24 bits floating

op   dsp  lut+f  lut  f    #     tot dsp  tot lut  tot f
*    2    103    90   112  1080  2160     208440   232200
*    1    113    97   104  0     0        0        0
*    0    377    336  376  0     0        0        0
+    0    369    301  393  1510  0        1011700  1150620
Tot                        2590  2160     1220140  1382820

Virtex-7 V2000T available resources

slices  LUT x slice  FF x slice  dsp   6-input LUT  ff
305400  4            8           2160  1221600      2443200

Equivalent standard LUTs (x 1.6): 1954560


IEEE single precision – 32 bits

op   dsp  lut+f  lut  f    #     tot dsp  tot lut  tot f
*    3    120    103  105  700   2100     156100   157500
*    2    160    128  160  0     0        0        0
*    1    331    283  331  0     0        0        0
*    0    665    629  669  0     0        0        0
+    2    293    225  327  25    50       12950    15500
+    0    500    407  541  1160  0        1052120  1207560
Tot                        1885  2150     1221170  1380560

Virtex-7 V2000T available resources

slices  LUT x slice  FF x slice  dsp   6-input LUT  ff
305400  4            8           2160  1221600      2443200

Equivalent standard LUTs (x 1.6): 1954560


IEEE double precision – 64 bits

op   dsp  lut+f  lut   f     #    tot dsp  tot lut  tot f
*    11   325    279   421   196  2156     118384   146216
*    10   371    299   456   0    0        0        0
*    9    439    356   510   0    0        0        0
*    0    2361   2317  2418  0    0        0        0
+    3    895    705   945   1    3        1600     1840
+    0    989    794   1029  617  0        1100111  1245106
Tot                          814  2159     1220095  1393162

Virtex-7 V2000T available resources

slices  LUT x slice  FF x slice  dsp   6-input LUT  ff
305400  4            8           2160  1221600      2443200

Equivalent standard LUTs (x 1.6): 1954560


Virtex UltraScale XCVU440 20nm - sampling out

IEEE double precision – 64 bits

op   dsp  lut+f  lut   f     #     tot dsp  tot lut  tot f
*    11   325    279   421   261   2871     157644   194706
*    10   371    299   456   0     0        0        0
*    9    439    356   510   0     0        0        0
*    0    2361   2317  2418  0     0        0        0
+    3    895    705   945   3     9        4800     5520
+    0    989    794   1029  1321  0        2355343  2665778
Tot                          1585  2880     2517787  2866004

Virtex UltraScale - available resources

slices  LUT x slice  FF x slice  dsp   6-input LUT  ff
314820  8            16          2880  2518560      5037120

Equivalent standard LUTs (x 1.6): 4029696
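The accounting behind these tables can be sketched in Python, here for the UltraScale double-precision mix. Two things are assumptions: the totals are taken to be tot lut = (lut+f + lut) x # and tot f = (lut+f + f) x #, which is what the rows' arithmetic suggests, and every operator is assumed fully pipelined, delivering one flop per cycle at 0.5 GHz. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    dsp: int     # "dsp" column
    pairs: int   # "lut+f" column
    lut: int     # "lut" column
    ff: int      # "f" column
    count: int   # "#" column: number of instances placed


def totals(ops):
    """Return (operators, dsp, lut, ff) summed over the whole mix."""
    n = sum(o.count for o in ops)
    dsp = sum(o.dsp * o.count for o in ops)
    lut = sum((o.pairs + o.lut) * o.count for o in ops)
    ff = sum((o.pairs + o.ff) * o.count for o in ops)
    return n, dsp, lut, ff


mix = [
    Operator("multiplier, 11 DSP", 11, 325, 279, 421, 261),
    Operator("adder, 3 DSP", 3, 895, 705, 945, 3),
    Operator("adder, logic only", 0, 989, 794, 1029, 1321),
]
available = {"dsp": 2880, "lut": 2_518_560, "ff": 5_037_120}

n, dsp, lut, ff = totals(mix)
fits = (dsp <= available["dsp"] and lut <= available["lut"]
        and ff <= available["ff"])
print(n, dsp, lut, ff, fits)   # 1585 2880 2517787 2866004 True
print(n * 0.5)                 # 792.5 GFLOPS at the assumed 0.5 GHz
```

The mix is limited by DSP blocks and LUTs at the same time, while about half of the flip-flops stay unused.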


Relative power dissipation/1

TDP / peak nominal double FP performance:

Intel Q6600 2.4 GHz             105 W /   38 Gflops = 2763 mW/Gflops
Intel Haswell i7-4770K 3.5 GHz   84 W /  112 Gflops =  750 mW/Gflops
Intel IvyBridge 3770K 3.5 GHz    77 W /  112 Gflops =  687 mW/Gflops
Nvidia Tesla M2090              225 W /  666 Gflops =  337 mW/Gflops
Nvidia Tesla K20X               235 W / 1310 Gflops =  179 mW/Gflops
Xilinx Virtex-US                 20 W /  800 Gflops =   25 mW/Gflops

The FPGA is ~10x better than the GPUs and ~30x better than the recent CPUs.

FPGA computing = green computing
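The two efficiency metrics used in this deck (mW per Gflops and Gflops per Watt) follow directly from the TDP and peak figures. The Python sketch below recomputes them from the numbers on the slide; the function names are illustrative, and the values are nominal peaks, not measurements.

```python
# (name, TDP in watts, peak double-precision Gflops) as listed on the slide
chips = [
    ("Intel Q6600 2.4 GHz", 105, 38),
    ("Intel Haswell i7-4770K 3.5 GHz", 84, 112),
    ("Intel IvyBridge 3770K 3.5 GHz", 77, 112),
    ("Nvidia Tesla M2090", 225, 666),
    ("Nvidia Tesla K20X", 235, 1310),
    ("Xilinx Virtex-US", 20, 800),
]


def mw_per_gflops(tdp_w, gflops):
    """Power cost of one Gflops of peak performance, in milliwatts."""
    return 1000.0 * tdp_w / gflops


def gflops_per_watt(tdp_w, gflops):
    """Peak performance delivered per watt of TDP."""
    return gflops / tdp_w


fpga = mw_per_gflops(20, 800)
for name, tdp_w, gflops in chips:
    cost = mw_per_gflops(tdp_w, gflops)
    print(f"{name:32s} {cost:7.0f} mW/Gflops "
          f"{gflops_per_watt(tdp_w, gflops):6.2f} Gflops/W "
          f"{cost / fpga:6.1f}x the FPGA figure")
```

The printed values are rounded, so the M2090 line can differ by one unit from the truncated value on the slide.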


Relative power dissipation/2

[Bar chart: mW/Gflops, axis from 0 to 3000, for Intel Q6600 2.4 GHz, Intel i7-4770K, Intel i7-3770K, Tesla M2090, Tesla K20X, Virtex-7]


Gflops per Watt

Peak nominal double FP performance / TDP:

Intel Q6600 2.4 GHz               38 Gflops / 105 W =  0.36 Gflops/W
Intel Haswell i7-4770K 3.5 GHz   112 Gflops /  84 W =  1.33 Gflops/W
Intel IvyBridge 3770K 3.5 GHz    112 Gflops /  77 W =  1.45 Gflops/W
Nvidia Tesla M2090               666 Gflops / 225 W =  2.96 Gflops/W
Nvidia Tesla K20X               1310 Gflops / 235 W =  5.57 Gflops/W
Xilinx Virtex-US                 800 Gflops /  20 W = 40    Gflops/W

The FPGA is ~10x better than the GPUs and ~30x better than the recent CPUs.

FPGA computing = green computing


Top green500 list

Rank  Power (kW)  Mflops/Watt  Year  Total Cores  Name               Manufacturer     Country
1     28          4,503        2013  2720         TSUBAME-KFC        NEC              Japan
2     53          3,632        2013  5120         Wilkes             Dell             United Kingdom
3     79          3,518        2013  4864         HA-PACS TCA        Cray Inc.        Japan
4     1,754       3,186        2012  115984       Piz Daint          Cray Inc.        Switzerland
5     81          3,131        2013  5720         romeo              Bull SA          France
6     923         3,069        2013  74358        TSUBAME 2.5        NEC/HP           Japan
7     54          2,702        2013  3080         -                  IBM              United States
8     270         2,629        2013  15840        -                  IBM              Germany
9     56          2,629        2013  3264         -                  IBM              United States
10    71          2,359        2010  4620         CSIRO GPU Cluster  Xenon Systems    Australia
11    179         2,351        2012  38400        SANAM              Adtech           Saudi Arabia
12    82          2,299        2011  16384        -                  IBM              United States
13    82          2,299        2012  16384        Cetus              IBM              United States
14    82          2,299        2012  16384        -                  IBM              Poland
15    82          2,299        2013  16384        -                  IBM              United States
16    82          2,299        2012  16384        Vesta              IBM              United States
17    82          2,299        2012  16384        -                  IBM              United States
18    237         2,243        2013  10920        HPCC               Hewlett-Packard  United States

Computer, by rank:

1. LX 1U-4GPU/104Re-1G Cluster, Intel Xeon E5-2620v2 6C 2.100GHz, Infiniband FDR, NVIDIA K20x
2. Dell T620 Cluster, Intel Xeon E5-2630v2 6C 2.600GHz, Infiniband FDR, NVIDIA K20
3. Cray 3623G4-SM Cluster, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband QDR, NVIDIA K20x
4. Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x
5. Bull R421-E3 Cluster, Intel Xeon E5-2650v2 8C 2.600GHz, Infiniband FDR, NVIDIA K20x
6. Cluster Platform SL390s G7, Xeon X5670 6C 2.930GHz, Infiniband QDR, NVIDIA K20x
7. iDataPlex DX360M4, Intel Xeon E5-2650v2 8C 2.600GHz, Infiniband FDR14, NVIDIA K20x
8. iDataPlex DX360M4, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband, NVIDIA K20x
9. iDataPlex DX360M4, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband, NVIDIA K20x
10. Nitro G16 3GPU, Xeon E5-2650 8C 2.000GHz, Infiniband FDR, Nvidia K20m
11. Adtech, ASUS ESC4000/FDR G2, Xeon E5-2650 8C 2.000GHz, Infiniband FDR, AMD FirePro S10000
12. BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
13. BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect
14. BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect
15. BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect
16. BlueGene/Q, Power BQC 16C 1.60GHz, Custom
17. BlueGene/Q, Power BQC 16C 1.60GHz, Custom
18. Cluster Platform SL250s Gen8, Xeon E5-2665 8C 2.400GHz, Infiniband FDR, Nvidia K20m


Power/Energy efficiency


Power Dissipation

A chip is made of millions of CMOS FETs. When an input switches, you need to charge a small capacitance, which costs the energy:

$$E_d = \frac{1}{2} C V^2$$

Doing this f times a second gives, together with some constant static dissipation $P_s$:

$$P_T = k C V^2 f + P_s$$

However, if you increase the frequency a lot, the chip becomes unstable unless you also increase the voltage (leakage). Therefore the power in fact grows superlinearly with f.
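A small Python sketch of why the dependence on f becomes superlinear. The model is the formula $P_T = k C V^2 f + P_s$ in arbitrary units; the second case is a hypothetical one in which the voltage has to rise in proportion to the frequency, which is an illustrative assumption and not a measured law.

```python
def total_power(f, v, k=1.0, c=1.0, p_static=0.0):
    """P_T = k*C*V^2*f + P_s, in arbitrary units."""
    return k * c * v ** 2 * f + p_static


base = total_power(f=1.0, v=1.0)

# Fixed voltage: doubling f doubles the dynamic power.
fixed_v = total_power(f=2.0, v=1.0) / base

# Hypothetical case: V must rise in proportion to f, so P grows as f^3.
scaled_v = total_power(f=2.0, v=2.0) / base

print(fixed_v, scaled_v)   # 2.0 8.0
```

Real chips sit somewhere between the two cases, since only part of a frequency increase has to be paid for with a higher voltage.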


Dennard scaling (1974)

With a linear scaling factor S = 1.4 per generation:

- S^2 = 2x more transistors
- S = 1.4x faster transistors
- S = 1.4x lower capacitance
- Scale Vdd by S => S^2 = 2x lower energy

Performance scales as S^3 = 2.8x while power density stays constant across generations.


Fred Pollack's (Intel) famous graph (1999)

Power density increases!

In 2004/2005 we hit the power wall => frequency increases stop

“New microarchitecture challenges in the coming generations of CMOS process technology” F.Pollack


End of Dennard scaling

- S^2 = 2x more transistors
- S = 1.4x faster transistors
- S = 1.4x lower capacitance

In submicron technology the supply voltage can no longer be scaled down. Power increases by S^2 = 2x per generation.
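The difference between classic Dennard scaling and the regime described here can be checked with a short Python sketch. It assumes the chip's dynamic power is proportional to N * C * V^2 * f (number of transistors, capacitance, voltage, frequency) on a die of constant area; the function name is illustrative.

```python
import math


def power_factor(s, voltage_scales):
    """Chip power after one generation, relative to the previous one."""
    transistors = s ** 2                         # S^2 more transistors
    capacitance = 1 / s                          # S lower capacitance each
    frequency = s                                # S faster transistors
    voltage = 1 / s if voltage_scales else 1.0   # Vdd scaled by S, or stuck
    # Dynamic power: N * C * V^2 * f
    return transistors * capacitance * voltage ** 2 * frequency


S = math.sqrt(2)   # about 1.4 per generation
dennard = power_factor(S, voltage_scales=True)
post_dennard = power_factor(S, voltage_scales=False)
print(round(dennard, 3), round(post_dennard, 3))   # 1.0 2.0
```

With voltage scaling the power of a constant-area chip stays flat; without it, the same chip burns S^2 = 2x more power every generation.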


MOS subthreshold current

When scaling down the geometry, you scale down the drain voltage to avoid high electric fields and to decrease the energy required to switch. You also have to scale down the threshold voltage to sustain the 30% decrease in gate delay. The small voltage swing that remains is not able to turn the transistor off completely. Subthreshold leakage, which was ignored in the past, can consume up to 1/2 of the total power on modern VLSI chips.


Subthreshold leakage


VT design tradeoff

[Figure: log IDS versus VGS]

- Low VT for high ON current: $I_{Dsat} \propto (V_{DD} - V_T)^2$
- High VT for low OFF current

Phenomenology: 60-200 mV of VGS swing decreases IDS by one order of magnitude. Today's VT of 0.2-0.5 V doesn't allow the VGS swing needed to shut off the transistor.

Low VT => high IDS, good for the ON condition

High VT => low leakage, good for the OFF condition
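The phenomenology above can be turned into a rough Python estimate of how well a transistor shuts off. This is a simplified sketch: it assumes the OFF state is VGS = 0, that the swing needed per decade of current is constant over the whole range, and the function name is illustrative.

```python
def off_current_ratio(vt_volts, swing_mv_per_decade):
    """I_off relative to the current at VGS = VT.

    Each swing_mv_per_decade of gate voltage below VT cuts IDS by 10x.
    """
    decades = vt_volts * 1000 / swing_mv_per_decade
    return 10 ** (-decades)


# With 100 mV per decade (inside the 60-200 mV range quoted above):
print(off_current_ratio(0.5, 100))   # 5 decades below the threshold current
print(off_current_ratio(0.2, 100))   # only 2 decades below
```

Lowering VT from 0.5 V to 0.2 V therefore raises the leakage by about three orders of magnitude in this model, which is the tradeoff described on the slide.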


Multicore scaling

65 nm 45 nm 32 nm

4-core 8-core 16-core

Every generation 2x cores, at same or slightly increasing frequency.


Multicore scaling at constant frequency

- S^2 = 2x more transistors
- S = 1.4x lower capacitance
- together => S = 1.4x lower utilization

We hit the utilization wall => dark silicon


End of multicore scaling

65 nm 45 nm 32 nm

4 cores 5.7 cores 8 cores

Every generation 1.4x cores, at same or slightly increasing frequency.

Dark or dim silicon (“uncore”)
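The core counts on this slide follow from the scaling factors. A minimal Python sketch, assuming a fixed power budget, constant frequency and voltage, and S = sqrt(2) per generation; the function names are illustrative.

```python
import math

S = math.sqrt(2)   # linear scaling factor per generation


def ideal_cores(start, generations):
    """Cores that fit on the die: transistor count doubles every generation."""
    return start * 2 ** generations


def powered_cores(start, generations):
    """Cores that can be powered: S^2 more transistors, each needing
    S less power (lower capacitance), so only S more fit in the budget."""
    return start * S ** generations


for i, node in enumerate(["65 nm", "45 nm", "32 nm"]):
    ideal = ideal_cores(4, i)
    powered = powered_cores(4, i)
    dark = 1 - powered / ideal   # share of the cores left dark or dim
    print(f"{node}: {ideal} cores fit, {powered:.1f} can be powered, dark {dark:.0%}")
```

After two generations half of the cores that fit on the die cannot be powered at full speed.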


Dark silicon and the end of multicore scaling

Doug Burger (Microsoft) at HiPEAC 2013:

- until 2004: each semiconductor generation gave us smaller, faster transistors that consume less power
- from 2004 to now: we still got smaller transistors, but we could not run them faster (power wall)
- in the future: we will still get smaller transistors, but we will not be able to use all of them together (dark silicon) or at max speed.


Scaling the utilization wall

G. Venkatesh, ASPLOS '10:

“while the area budget continues to increase exponentially, the power budget has become a first-order design constraint in current processors. In this regime, utilizing transistors to design specialized cores that optimize energy-per-computation becomes an effective approach to improve the system performance.”

The Utilization Wall: with each successive process generation, the percentage of a chip that can switch at full frequency drops exponentially due to power constraints. [Venkatesh, ASPLOS '10]

Single chip heterogeneous computer (E. Chung)

Greater energy efficiency by combining a GPP with unconventional cores (U-cores): GPU, FPGA, DSP, ASICs, ...


3D FinFET promise

Below 20 nm the roadmap is to use 3D FinFETs:

- Faster: +37%
- Dynamic power: -50%
- Static power: -90%

KAIST demonstrated a 3 nm FinFET in the lab.


The trouble with multicore

A famous article by David Patterson (of “Computer architecture: a quantitative approach” fame) in IEEE Spectrum, 2010:

“Chipmakers are busy designing microprocessors that most programmers can’t program”

“... the semiconductor industry threw the equivalent of a Hail Mary pass when it switched from making microprocessors run faster to putting more of them on a chip - doing so without any clear notion of how such devices would in general be programmed. The hope is that someone will be able to figure out how to do that, but at the moment, the ball is still in the air.”


Verilog


Using Verilog

You write a functional specification, (usually) split into modules, that documents the exact behaviour of the system.

Functional design: HDL (Verilog) -> Logic Synthesis -> Gate netlist

Physical design: Gate netlist -> Place & Route -> FPGA / ASIC

Simulated annealing is used in Place & Route!

NB: place and route of a large design can take 1 day on a fast CPU!


Verilog/1

Basic module:

// comments are written this way
module name(input x0, x1, input [3:0] y, output out);
  // x0, x1 are wires, y is a 4-wire bus
  // out is an output wire
  // (ports declared this way are wires by default, no separate wire declaration is needed)
  // combinational logic uses assign
endmodule


Verilog/2

Combinational circuit:

// computes y = (not a) b c + a (not b) (not c)
module dummy(input a, b, c, output y, z);
  // ports are wires by default
  assign y = (~a & b & c) | (a & ~b & ~c);
  assign z = ~c;
endmodule

This is not C! a, b, c, y, z are wires, and y, z change whenever a, b or c changes. To avoid this drama in complex circuits we use synchronous logic (everything is stepped through docking stations = flip-flops).
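To see what the dummy module computes, its two outputs can be tabulated for every input combination. The Python sketch below is only a software model of the Boolean equations, not a hardware simulation.

```python
from itertools import product


def dummy(a, b, c):
    """Boolean model of the module: returns (y, z) for inputs a, b, c."""
    y = (not a and b and c) or (a and not b and not c)
    z = not c
    return int(bool(y)), int(z)


print("a b c | y z")
for a, b, c in product((0, 1), repeat=3):
    y, z = dummy(a, b, c)
    print(a, b, c, "|", y, z)
```

y is 1 only for the inputs 011 and 100. In the hardware these values are not computed once: they follow the inputs continuously.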


Verilog/3


Verilog/4

A sequential circuit:

// a flip-flop described in Verilog
module ff(input d, clk, output reg q, qbar);
  // q and qbar hold state, so they are declared reg
  always @(posedge clk) begin
    q <= d;
    qbar <= ~d;
  end
endmodule

At a rising edge of the wire clk, copy the signal d to q and the inverse of d to qbar.


Verilog/5


Verilog/6

A more complicated sequential circuit:

// Verilog flip-flop with asynchronous clear/reset
module ff(input d, clk, clr, output reg q, qbar);
  always @(posedge clk, posedge clr)
    if (clr) begin
      q <= 0;
      qbar <= 1;
    end else begin
      q <= d;
      qbar <= ~d;
    end
endmodule

At a rising edge of the wire clr set q = 0 (and qbar = 1); at a rising edge of clk copy the signal d to q and the inverse of d to qbar.
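The behaviour of this flip-flop can be mimicked with a small Python model that reacts only to rising edges. This is a sketch for intuition, not a replacement for a Verilog simulator: the class and method names are illustrative, and the stored value is assumed to start at 0, while in real hardware it is undefined until the first clear or clock edge.

```python
class FlipFlop:
    """Software model of a D flip-flop with asynchronous clear."""

    def __init__(self):
        self.q = 0     # assumed initial state (undefined in real hardware)
        self.clk = 0   # previous value of the clock input
        self.clr = 0   # previous value of the clear input

    def update(self, d, clk, clr):
        """Apply new input values and return (q, qbar)."""
        rising_clk = clk == 1 and self.clk == 0
        rising_clr = clr == 1 and self.clr == 0
        if rising_clk or rising_clr:
            # Same decision as the always block: clear wins over data.
            self.q = 0 if clr else d
        self.clk, self.clr = clk, clr
        return self.q, 1 - self.q


ff = FlipFlop()
print(ff.update(d=1, clk=1, clr=0))   # rising clk: q takes d
print(ff.update(d=0, clk=1, clr=0))   # no edge: q keeps its value
print(ff.update(d=0, clk=0, clr=1))   # rising clr: q is cleared
```

Between edges the output does not move, whatever happens on d. That is the difference from the combinational circuit shown earlier.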


Verilog/7


BORPH: Berkeley Operating system for ReProgrammable Hardware

PETALINUX: Xilinx Linux for Zynq et al.


Borph: Berkeley Operating System

- Idea of a HW Unix process: it has a pid and can be killed like a normal Unix process, but it is in fact a HW instance on the FPGA
- ioreg virtual file system interface


Xilinx Petalinux

The PetaLinux Software Development Kit (SDK) is a development tool that contains everything necessary to build, develop, test and deploy Embedded Linux systems on: Zynq-7000, ZedBoard, Kintex-7 boards.

PetaLinux consists of : pre-configured binary bootable images, fully customizable Linux for the Xilinx device, and PetaLinux SDK which includes tools and utilities to automate complex tasks across configuration, build, and deployment.

PetaLinux is offered under two separate licenses: a no-charge Evaluation license or Commercial licenses.


END