apenext

Paris, May 2003 Piero Vicini - SciParC workshop 1

apeNEXT

Piero Vicini ([email protected])

INFN Roma


APE keywords Parallel system

Massively parallel 3D array of computing nodes with periodic boundary conditions

Custom system Processor: extensive use of VLSI

Native implementation of the complex type a x b + c (complex numbers) Large register file VLIW microcode

Node interconnections Optimized for nearest-neighbors communication

Software tools Apese, TAO, OS, Machine simulator

Dense system Reliable and safe HW solution Custom mechanics for “wide” integration

Cheap system 0.5 €/MFlops Very low cost maintenance


The APE familyOur line of Home Made Computers

APE(1988)

APE100(1993)

APEmille(1999)

apeNEXT(2003)

Architecture SIMD SIMD SIMD SIMD++

# nodes 16 2048 2048 4096

Topology flexible 1D rigid 3D flexible 3d flexible 3D

Memory 256 MB 8 GB 64 GB 1 TB

# registers (w.size)

64 (x32) 128 (x32) 512 (x32) 512 (x64)

clock speed 8 MHz 25 MHz 66 MHz 200 MHz

Total Computing Power

~1.5 GFlops ~ 250 GFlops

~ 2 TFlops ~ 8-20 TFlops


APE (‘88) 1 GFlops


APE100 (1993) - 100 GFlops

PB (8 nodes) ~ 400 MFlops


APEmille – 1 TFlops 2048 VLSI processing nodes (0.5 MFlops) SIMD, synchronous communications Fully integrated ”Host computer”, 64 PCs cPCI based

Computing node “Processing Board” (PB)

8 nodes, 4GFlops “Torre”

32 PB, 128GFlops


APEmille installations

Bielefeld 130 GF (2 crates) Zeuthen 520 GF (8 crates) Milan 130 GF (2 crates) Bari 65 GF (1 crates) Trento 65 GF (1 crates) Pisa 325 GF (5 crates) Rome 1 650 GF (10 crates) Rome 2 130 GF (2 crates) Orsay 16 GF (1/4 crates) Swansea 65 GF (1 crates)

Gr. Total ~1966 GF


The apeNEXT architecture3D mesh of computing nodes

Custom VLSI processor - 200 MHz (J&T)

1.6 GFlops per node (complex “normal”)

256 MB (1 GB) memory per node

First neighbor communication network “loosely synchronous”

YZ internal, X on cables

r = 8/16 => 200 MB/s per channel

Scalable 25 GFlops -> 6 Tflops Processing Board 4 x 2 x 2 ~ 26 GF Crate (16 PB) 4 x 8 x 8 ~ 0.5 TF Rack (32 PB) 8 x 8 x 8 ~ 1 TF Large systems (8*n) x 8 x 8

Linux PCs as Host system

Z+(bp)

Y+(bp)

X+(cables)

0 2

4 6

8 10

12 14

1 3

5 7

9 11

13 15

J&T

DDR-MEM

X+

……Z-


Design metodology VHDL incremental model of the (almost) whole system

Custom (VLSI and/or FPGA) components derived from VHDL synthesis tool

First-Time-Right-Silicon

Software design env.

Stand-alone simulation of components VHDL model + simulation of the “global” VHDL model

Powerful test-bed for test vectors generation

Simplified but complete model of HW-Host interaction

Test environment for development of compilation chain, OS

performance (architecture) evaluation at design time


J&T Asic

J&T module

PB

BackPlaneRack

Assembling apeNEXT…


Overview of the J&T Architecture

Peak floating point performance of about 1.6GFlops IEEE compliant double precision

Integer arithmetic performance of about 400 MIPS Link bandwidth of about 200 MByte/sec each direction

full duplex 7 links: X+,X-,Y+,Y-,Z+,Z-, “7th” (I/O)

Support for current generation DDR memory Memory bandwidth of 3.2 GByte/sec

400 Mword/sec


J&T: Top Level Diagram


The J&T Arithmetic BOX

4 multipliers

4 adder/sub

At 200 MHz (fully piped) = 1.6 GFlops

•Pipelined complex “normal” a*b+c (8 flops) per cycle


The J&T remote IO

fifo-based communication: LVDS 1.6 Gb/s per link

(8 bit @ 200MHz) 6 (+1) independent links


J&T summary

CMOS 0.18u, 7 metal (ATMEL)

200 MHz Double Precision Complex

Normal Operation 64 bit AGU 8 KW program cache 128 bit local memory

channel 6+1 LVDS 200 MB/s links BGA package, 600 pins


Dominant Technologies:

LVDS: 1728 (16*6*2*9) differential signals 200Mb/s, 144 routed via cables, 576 via backplane on 12 controlled-impedance layers

High-Speed differential connectors:

Samtec QTS (J&T Module)

Erni ERMET-ZD (Backplane)

16 Nodes 3D-Interconnected 4x2x2 Topology 26 Gflops, 4.6 GB Memory

Light System:

J&T Module connectors

Glue Logic (Clock tree 10Mhz)

Global signal interconnection (FPGA)

DC-DC converters (48V to 3.3/2.5)

• Collaboration with NEURICAM spaPB


J&T Module J&T 9 DDR-SDRAM, 256Mbit

(x16) 6 Link LVDS up to 400MB/s Host Fast I/O Link (7th Link) I2C Link (slow control

network) Dual Power 2.5V+1.8V, 7-

10W estimated Dominant technologies:

SSTL-II (memory interface) LVDS (network interface + I/O)


connector kit cost:7KEuro (!)PB Insertion force:80-150 Kg(!)

16 PB Slots + Root Slot Size 447x600 mm2 4600 LVDS differential signals, point-to-point up to 600 Mb/s

16 controlled-imp. layers (32)

Press-fit only Erni/Tyco connectors

ERMET-ZD Providers:

APW (primary)

ERNI (2nd src)

NEXT BackPlane


PB Mechanics

apeNEXT PB

Board-to-Board Connector

PB constraints:

Power consumption: up to 340W

PB-BP insertion force: 80-150 Kg (!)

Fully populated PB weight: 4-5 Kg

Custom design of card frame and insertion tool

Detailed study of airflow

DC/DC

J&T Module

J&T Module

AIR-FLOWCHANNEL

2

TOP VIEW ( local )

AIR-FLOW CHANNEL

1

AIR-FLOWCHANNEL

3

AIR-FLOWCHANNEL

3

Fra

me

a1

b1

b3

a3

b2

a2


Problem: PB weight: 4-5 Kg, PB consumption: 340W (est.) 32 PB + 2 Root Board Power supply: (<48Vx150A per crate) Integrated Host PCs Forced air cooling, Robust, expandable/modular, CE, EMC ....

Solution:42U rack (h: 2,10 m):

EMC proof, efficient cables routing

19”-1U slots per 9 “host PCs” (rack mounted) Hot-swap power supply cabinet (modular) Custom design of “card cage” and “tie bar”

Custom design of cooling system

Rack mechanics


Host I/O Architecture

I2C: bootstrap & control 7th-Link (200MB/s)


Fifo

Altera APEXIIPCI Interface PLDA

7Link Ctrl

I2C Ctrl

PCIMaster

Ctrl

PCITargetCtrl

7Link Ctrl

QDR Mem Ctrl

Fifo

Fifo

QDRMem Bank

PCI Interface 64bit, 66Mhz

PCI Master Mode for 7th Link

PCI Target Mode for I2C

QuadDataRateMemory (x32)

PCI Board, Altera APEX II based

7th Link: 1(2) bidir. Chan.

I2C: 4 independent ports

Host I/O Interface


Status and expected schedule J&T ready to test September 03

We will receive between 300 to 600 chips We need 256 processor to assemble a crate !!

We expect them to work !! The same team designed 7 ASIC of the same complexity Impressive full-detailed simulations of multiple J&T systems More one simulate less one has to test !!

PB,J&T Module, BackPlane, Mechanics were built and tested Within days/weeks the first working apeNEXT computer should

operate Mass production will follow asap

End 2003 mass production will start…. INFN requirements is 8-12 TFlops of computing power !!


Software

TAO compilers and linker ….. READY All existing APE program will run with no change Physical code already been run on the simulator

Kernel of PHYSICS codes used to benchmark the efficiencies of the FP unit

C COMPILER gcc (2.93) and lcc have be retargeted lcc WORKS (almost).

http://www.cs.princeton.edu/software/lcc/


Project Costs

Total development cost of 1700 k€uro

1050 k€uro for VLSI development 550 k€uro non VLSI

Manpower involved = 20 man/year Mass production cost ~ 0.5 €uro/MFlops


Future R&D activities Computing node architecture

Adaptable/reconfigurable computing node Fat operators, short/custom FP data, multiple node integration Evaluation/integration of commercial processor in APE system

Interconnection architecture and technologies Custom ape-like network Interface to host, PCs interconnection

Mechanics assemblies (Perf/Volume,reliability) Rack, Cables, Power distributions etc…

Software Standard languages (C) full support (compiler, linker…) Distributed OS APE system integration in “GRID” environment


Conclusions J&T in fab, ready Summer 03 (300….600 chips) Everything else ready and tested !!! If tests ok

mass production starting 4Q03 All components over-dimensioned

Cooling, LVDS tested @ 400 Mb/s, power supply on boards … Makes possible a technology step with no extra design and

relatively low test effort Installation plans

INFN theoretical group requires 8-12 TFlops (10-15 cabinets)(on delivering of a working machine…)

DESY considering between 8 TFlops to 16 Tflops Paris….


APE in SciParC APE is the actual (“de-facto”) European computing platform for big

volume LQCD applications. But….

“Interdisciplinarity” is on our pathway (i.e. APE is not only QCD): Fluid dynamics (lattice boltzman, weather forecast) Complex Systems (spin glasses, real glasses, protein folding) Neural networks Seismic migration Plasma physics (astrophysics, thermonuclear engines) ……

So, in our opinion, it’s strategic to build “general purpose” massively parallel computing platform dedicated to approach large-scale computational problem coming from different fields of research.

The APE group can (want) contribute in development of such future machines

apenext

Documents

crateszeuthen520 gf

cratesmilan130 gf

cratespisa325 gf

gf crate

x b c complex numberslarge

tflops processing board4

node complex normal

gflopspipelined complex