apeNEXT


Page 1: apeNEXT

SciParC workshop, Paris, May 2003

apeNEXT

Piero Vicini ([email protected])

INFN Roma

Page 2: apeNEXT


APE keywords

• Parallel system: massively parallel 3D array of computing nodes with periodic boundary conditions
• Custom system: processor makes extensive use of VLSI; native implementation of the complex type and of the "normal" operation a*b + c on complex numbers; large register file; VLIW microcode
• Node interconnections: optimized for nearest-neighbor communication
• Software tools: Apese, TAO, OS, machine simulator
• Dense system: reliable and safe HW solution; custom mechanics for "wide" integration
• Cheap system: 0.5 €/MFlops; very low maintenance cost

Page 3: apeNEXT


The APE family: our line of home-made computers

                          APE (1988)    APE100 (1993)   APEmille (1999)   apeNEXT (2003)
  Architecture            SIMD          SIMD            SIMD              SIMD++
  # nodes                 16            2048            2048              4096
  Topology                flexible 1D   rigid 3D        flexible 3D       flexible 3D
  Memory                  256 MB        8 GB            64 GB             1 TB
  # registers (w. size)   64 (x32)      128 (x32)       512 (x32)         512 (x64)
  Clock speed             8 MHz         25 MHz          66 MHz            200 MHz
  Total computing power   ~1.5 GFlops   ~250 GFlops     ~2 TFlops         ~8-20 TFlops

Page 4: apeNEXT


APE ('88): 1 GFlops

Page 5: apeNEXT


APE100 (1993): 100 GFlops

• Processing Board (PB, 8 nodes): ~400 MFlops

Page 6: apeNEXT


APEmille (1999): 1 TFlops

• 2048 VLSI processing nodes (0.5 GFlops each)
• SIMD, synchronous communications
• fully integrated "host computer": 64 cPCI-based PCs
• Computing node hierarchy:
  - "Processing Board" (PB): 8 nodes, 4 GFlops
  - "Torre" (tower): 32 PBs, 128 GFlops

Page 7: apeNEXT


APEmille installations

• Bielefeld: 130 GF (2 crates)
• Zeuthen: 520 GF (8 crates)
• Milan: 130 GF (2 crates)
• Bari: 65 GF (1 crate)
• Trento: 65 GF (1 crate)
• Pisa: 325 GF (5 crates)
• Rome 1: 650 GF (10 crates)
• Rome 2: 130 GF (2 crates)
• Orsay: 16 GF (1/4 crate)
• Swansea: 65 GF (1 crate)

Grand total: ~1966 GF

Page 8: apeNEXT


The apeNEXT architecture: 3D mesh of computing nodes

• Custom VLSI processor (J&T), 200 MHz
• 1.6 GFlops per node (complex "normal" operation)
• 256 MB (up to 1 GB) of memory per node
• First-neighbor, "loosely synchronous" communication network: Y and Z links internal (backplane), X on cables
• r = 8/16 => 200 MB/s per channel
• Scalable from 25 GFlops to 6 TFlops (worked out in the sketch after the figure below):
  - Processing Board: 4 x 2 x 2 nodes, ~26 GFlops
  - crate (16 PBs): 4 x 8 x 8 nodes, ~0.5 TFlops
  - rack (32 PBs): 8 x 8 x 8 nodes, ~1 TFlops
  - large systems: (8*n) x 8 x 8 nodes
• Linux PCs as host system

[Figure: Processing Board node mesh. Nodes 0-15 in a 4 x 2 x 2 arrangement; Y+ and Z+ links routed on the backplane, X+ on cables; each node pairs a J&T processor with DDR memory.]
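The scaling figures above follow directly from the node count of each enclosure. The sketch below is illustrative only: the coordinate order, node numbering, and function names are my assumptions, not apeNEXT firmware. It maps 3D torus coordinates with periodic boundary conditions to node ranks and, at 1.6 GFlops per node, matches (up to rounding) the quoted PB, crate, and rack peak figures.

```c
#include <stdio.h>

/* Illustrative sketch: 3D torus indexing with periodic boundary
 * conditions, the topology used by apeNEXT's nearest-neighbor
 * network. Numbering convention is an assumption. */

typedef struct { int nx, ny, nz; } Torus;

/* Linear rank of node (x,y,z); coordinates wrap around (periodic). */
static int rank_of(Torus t, int x, int y, int z) {
    x = (x % t.nx + t.nx) % t.nx;   /* wrap, also for negative x */
    y = (y % t.ny + t.ny) % t.ny;
    z = (z % t.nz + t.nz) % t.nz;
    return (z * t.ny + y) * t.nx + x;
}

int main(void) {
    const double node_gflops = 1.6;      /* J&T peak per node */
    Torus pb    = {4, 2, 2};             /* Processing Board   */
    Torus crate = {4, 8, 8};             /* 16 PBs             */
    Torus rack  = {8, 8, 8};             /* 32 PBs             */
    Torus sizes[] = {pb, crate, rack};
    const char *names[] = {"PB", "crate", "rack"};

    for (int i = 0; i < 3; i++) {
        int n = sizes[i].nx * sizes[i].ny * sizes[i].nz;
        printf("%-5s %3d nodes, %6.1f GFlops peak\n",
               names[i], n, n * node_gflops);
    }

    /* Periodicity: the X- neighbor of node (0,1,1) on a PB wraps
     * around to x = 3, the opposite face of the mesh. */
    printf("X- neighbor of (0,1,1) on PB: rank %d\n",
           rank_of(pb, -1, 1, 1));
    return 0;
}
```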

Page 9: apeNEXT


Design methodology

• VHDL incremental model of (almost) the whole system
• Custom components (VLSI and/or FPGA) derived from VHDL synthesis tools
• Goal: first-time-right silicon
• Software design environment:
  - stand-alone simulation of component VHDL models plus simulation of the "global" VHDL model
  - powerful test bed for test-vector generation
  - simplified but complete model of HW-host interaction
  - test environment for developing the compilation chain and OS
  - performance (architecture) evaluation at design time

Page 10: apeNEXT


Assembling apeNEXT: J&T ASIC -> J&T module -> PB -> backplane -> rack

Page 11: apeNEXT


Overview of the J&T architecture

• Peak floating-point performance of about 1.6 GFlops, IEEE-compliant double precision
• Integer arithmetic performance of about 400 MIPS
• Link bandwidth of about 200 MB/s in each direction, full duplex; 7 links: X+, X-, Y+, Y-, Z+, Z- and the "7th" link (I/O)
• Support for current-generation DDR memory; memory bandwidth of 3.2 GB/s, i.e. 400 Mword/s for 64-bit words

Page 12: apeNEXT


J&T: Top Level Diagram

Page 13: apeNEXT


The J&T arithmetic box

• 4 multipliers, 4 adder/subtractors
• Pipelined complex "normal" operation a*b + c, i.e. 8 flops per cycle (expanded in the sketch below)
• At 200 MHz, fully pipelined: 8 flops x 200 MHz = 1.6 GFlops
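For reference, the complex "normal" written out in C (a sketch of the arithmetic, not the machine's microcode) decomposes into exactly 4 real multiplies plus 4 real additions/subtractions, matching the 4+4 functional units and the 8-flop count above:

```c
#include <complex.h>
#include <stdio.h>

/* The complex "normal" a*b + c, expanded into real arithmetic.
 * Real part: 2 multiplies, 1 subtraction, 1 addition.
 * Imag part: 2 multiplies, 2 additions.
 * Total: 8 flops; one result per 200 MHz cycle = 1.6 GFlops. */
static double complex cnormal(double complex a, double complex b,
                              double complex c) {
    double re = creal(a)*creal(b) - cimag(a)*cimag(b) + creal(c);
    double im = creal(a)*cimag(b) + cimag(a)*creal(b) + cimag(c);
    return re + im * I;
}

int main(void) {
    double complex r = cnormal(1.0 + 2.0*I, 3.0 + 4.0*I, 5.0 + 6.0*I);
    printf("%.1f %+.1fi\n", creal(r), cimag(r));  /* -> 0.0 +16.0i */
    return 0;
}
```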

Page 14: apeNEXT


The J&T remote I/O

• FIFO-based communication
• LVDS links: 8 bits @ 200 MHz = 1.6 Gb/s (200 MB/s) per link
• 6 (+1) independent links

Page 15: apeNEXT


J&T summary

• CMOS 0.18 um, 7 metal layers (ATMEL)
• 200 MHz
• double-precision complex "normal" operation
• 64-bit AGU
• 8 KW program cache
• 128-bit local memory channel
• 6+1 LVDS 200 MB/s links
• BGA package, 600 pins

Page 16: apeNEXT


PB (Processing Board)

• 16 nodes, 3D-interconnected in a 4 x 2 x 2 topology: 26 GFlops, 4.6 GB of memory
• Dominant technologies:
  - LVDS: 1728 differential signals at 200 Mb/s (= 16 nodes x 6 links x 2 directions x 9 signals); 144 routed via cables, 576 via the backplane, on 12 controlled-impedance layers
  - high-speed differential connectors: Samtec QTS (J&T module), ERNI ERMET ZD (backplane)
• "Light" system:
  - J&T module connectors
  - glue logic (10 MHz clock tree)
  - global signal interconnection (FPGA)
  - DC-DC converters (48 V to 3.3/2.5 V)
• Collaboration with NEURICAM SpA

Page 17: apeNEXT


J&T module

• 1 J&T processor
• 9 DDR-SDRAM chips, 256 Mbit (x16)
• 6 LVDS links, up to 400 MB/s
• host fast I/O link (7th link)
• I2C link (slow-control network)
• dual power supply, 2.5 V + 1.8 V; 7-10 W estimated
• Dominant technologies: SSTL-II (memory interface), LVDS (network interface + I/O)

Page 18: apeNEXT


apeNEXT backplane

• 16 PB slots + root slot; size 447 x 600 mm2
• 4600 LVDS differential signals, point-to-point, up to 600 Mb/s
• 16 controlled-impedance layers (32 in total)
• press-fit-only ERNI/Tyco ERMET ZD connectors; providers: APW (primary), ERNI (2nd source)
• connector kit cost: 7 kEuro (!); PB insertion force: 80-150 kg (!)

Page 19: apeNEXT


PB mechanics

• PB constraints:
  - power consumption: up to 340 W
  - PB-backplane insertion force: 80-150 kg (!)
  - fully populated PB weight: 4-5 kg
• Custom design of card frame and insertion tool
• Detailed study of airflow

[Figure: top view (local) of the apeNEXT PB in its card frame, showing the board-to-board connector, DC/DC converters, J&T modules, and three air-flow channels.]

Page 20: apeNEXT


Rack mechanics

• Problem:
  - PB weight 4-5 kg, PB consumption 340 W (est.)
  - 32 PBs + 2 root boards
  - power supply: < 48 V x 150 A per crate
  - integrated host PCs
  - forced-air cooling; robust, expandable/modular, CE and EMC compliance...
• Solution: 42U rack (h: 2.10 m):
  - EMC-proof, efficient cable routing
  - 19", 1U slots for 9 "host PCs" (rack-mounted)
  - hot-swap, modular power-supply cabinet
  - custom design of "card cage" and "tie bar"
  - custom design of cooling system

Page 21: apeNEXT


Page 22: apeNEXT


Host I/O architecture

• I2C: bootstrap & control
• 7th link: data I/O, 200 MB/s

Page 23: apeNEXT


Host I/O interface

• PCI board based on an Altera APEX II, with a PLDA PCI interface core
• PCI interface: 64-bit, 66 MHz
• PCI master mode for the 7th link
• PCI target mode for I2C
• quad-data-rate (QDR) memory bank (x32)
• 7th link: 1 (2) bidirectional channels
• I2C: 4 independent ports

[Figure: board block diagram: PLDA PCI interface with PCI master and target controllers, FIFOs, two 7th-link controllers, an I2C controller, and a QDR memory controller in front of the QDR memory bank.]
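To make the master/target split concrete, here is a hypothetical host-side view in C. The register offsets, names, and functions below are invented for illustration; the slides do not describe the actual driver interface. The point is only the division of labor: the host performs programmed I/O to the board for slow I2C control (board as PCI target), while bulk 7th-link data is DMAed by the board itself into host memory (board as PCI master).

```c
#include <stdint.h>

/* Target mode: the host drives board registers directly, e.g. to
 * operate one of the 4 I2C ports used for bootstrap and control.
 * (Hypothetical register map.) */
static void i2c_write(volatile uint32_t *bar0, int port, uint8_t byte) {
    bar0[0x10 + port] = byte;
}

/* Master mode: for 7th-link traffic the board becomes PCI master;
 * the host only posts a buffer address and length, then waits for
 * the board to complete the transfer. (Hypothetical registers.) */
static void post_dma(volatile uint32_t *bar0, uint64_t bus_addr,
                     uint32_t len) {
    bar0[0x20] = (uint32_t)bus_addr;         /* address low  */
    bar0[0x21] = (uint32_t)(bus_addr >> 32); /* address high */
    bar0[0x22] = len;                        /* starts the transfer */
}

int main(void) {
    static uint32_t fake_bar0[64];  /* stand-in for the mapped BAR */
    i2c_write(fake_bar0, 0, 0x5a);
    post_dma(fake_bar0, 0x1000, 4096);
    return 0;
}
```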

Page 24: apeNEXT


Status and expected schedule

• J&T ready to test in September '03
  - we will receive between 300 and 600 chips; we need 256 processors to assemble a crate!
  - we expect them to work: the same team designed 7 ASICs of comparable complexity, backed by impressive, fully detailed simulations of multi-J&T systems; the more one simulates, the less one has to test!
• PB, J&T module, backplane and mechanics have been built and tested
• Within days/weeks the first working apeNEXT computer should operate; mass production will follow ASAP
• Mass production will start at the end of 2003... INFN's requirement is 8-12 TFlops of computing power!

Page 25: apeNEXT


Software

• TAO compiler and linker: READY
  - all existing APE programs will run with no change
  - physics code has already been run on the simulator
  - kernels of physics codes are used to benchmark the efficiency of the FP unit (a sketch of such a kernel follows)
• C compiler: gcc (2.93) and lcc have been retargeted; lcc works (almost)
  http://www.cs.princeton.edu/software/lcc/
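The slides do not show the benchmark kernels themselves; as an illustrative sketch, a caxpy-style loop of this shape is the kind of code whose inner iteration maps onto one complex "normal" a*b + c, so a fully pipelined FP unit would retire one iteration (8 flops) per 200 MHz cycle:

```c
#include <complex.h>
#include <stdio.h>

#define N 1024

/* caxpy-style kernel: y[i] = a * x[i] + y[i] on complex doubles.
 * Each iteration is one complex "normal" (8 real flops). */
static void caxpy(double complex a, const double complex *x,
                  double complex *y, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    static double complex x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0 + 1.0*I; y[i] = i; }
    caxpy(2.0 + 0.5*I, x, y, N);
    printf("y[1] = %.1f %+.1fi\n", creal(y[1]), cimag(y[1]));
    return 0;
}
```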

Page 26: apeNEXT


Project costs

• Total development cost: 1700 k€
  - 1050 k€ for VLSI development
  - 550 k€ non-VLSI
• Manpower involved: 20 man-years
• Mass-production cost: ~0.5 €/MFlops

Page 27: apeNEXT


Future R&D activities

• Computing-node architecture:
  - adaptable/reconfigurable computing node
  - fat operators, short/custom FP data types, multiple-node integration
  - evaluation/integration of commercial processors in the APE system
• Interconnection architecture and technologies:
  - custom APE-like network
  - interface to host, PC interconnection
• Mechanical assemblies (performance/volume, reliability): racks, cables, power distribution, etc.
• Software:
  - full support for standard languages (C): compiler, linker...
  - distributed OS
  - APE system integration in a "GRID" environment

Page 28: apeNEXT


Conclusions

• J&T in fab, ready Summer '03 (300-600 chips)
• Everything else ready and tested!
• If tests are OK, mass production starts in 4Q03
• All components are over-dimensioned (cooling, LVDS tested @ 400 Mb/s, on-board power supply, ...), which makes a technology step possible with no extra design and relatively low test effort
• Installation plans:
  - INFN theoretical groups require 8-12 TFlops (10-15 cabinets), on delivery of a working machine
  - DESY is considering between 8 and 16 TFlops
  - Paris...

Page 29: apeNEXT


APE in SciParC

APE is the de facto European computing platform for large-volume LQCD applications. But...

"Interdisciplinarity" is on our pathway (i.e. APE is not only QCD):
• fluid dynamics (lattice Boltzmann, weather forecasting)
• complex systems (spin glasses, real glasses, protein folding)
• neural networks
• seismic migration
• plasma physics (astrophysics, thermonuclear engines)
• ...

So, in our opinion, it is strategic to build a "general purpose" massively parallel computing platform dedicated to large-scale computational problems coming from different fields of research.

The APE group can (and wants to) contribute to the development of such future machines.