apenext
DESCRIPTION
apeNEXT. Piero Vicini ([email protected]) INFN Roma. APE keywords. Parallel system Massively parallel 3D array of computing nodes with periodic boundary conditions Custom system Processor: extensive use of VLSI - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/1.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 1
apeNEXT
Piero Vicini ([email protected])
INFN Roma
![Page 2: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/2.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 2
APE keywords Parallel system
Massively parallel 3D array of computing nodes with periodic boundary conditions
Custom system Processor: extensive use of VLSI
Native implementation of the complex type a x b + c (complex numbers) Large register file VLIW microcode
Node interconnections Optimized for nearest-neighbors communication
Software tools Apese, TAO, OS, Machine simulator
Dense system Reliable and safe HW solution Custom mechanics for “wide” integration
Cheap system 0.5 €/MFlops Very low cost maintenance
![Page 3: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/3.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 3
The APE familyOur line of Home Made Computers
APE(1988)
APE100(1993)
APEmille(1999)
apeNEXT(2003)
Architecture SIMD SIMD SIMD SIMD++
# nodes 16 2048 2048 4096
Topology flexible 1D rigid 3D flexible 3d flexible 3D
Memory 256 MB 8 GB 64 GB 1 TB
# registers (w.size)
64 (x32) 128 (x32) 512 (x32) 512 (x64)
clock speed 8 MHz 25 MHz 66 MHz 200 MHz
Total Computing Power
~1.5 GFlops ~ 250 GFlops
~ 2 TFlops ~ 8-20 TFlops
![Page 4: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/4.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 4
APE (‘88) 1 GFlops
![Page 5: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/5.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 5
APE100 (1993) - 100 GFlops
PB (8 nodes) ~ 400 MFlops
![Page 6: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/6.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 6
APEmille – 1 TFlops 2048 VLSI processing nodes (0.5 MFlops) SIMD, synchronous communications Fully integrated ”Host computer”, 64 PCs cPCI based
Computing node “Processing Board” (PB)
8 nodes, 4GFlops “Torre”
32 PB, 128GFlops
![Page 7: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/7.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 7
APEmille installations
Bielefeld 130 GF (2 crates) Zeuthen 520 GF (8 crates) Milan 130 GF (2 crates) Bari 65 GF (1 crates) Trento 65 GF (1 crates) Pisa 325 GF (5 crates) Rome 1 650 GF (10 crates) Rome 2 130 GF (2 crates) Orsay 16 GF (1/4 crates) Swansea 65 GF (1 crates)
Gr. Total ~1966 GF
![Page 8: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/8.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 8
The apeNEXT architecture3D mesh of computing nodes
Custom VLSI processor - 200 MHz (J&T)
1.6 GFlops per node (complex “normal”)
256 MB (1 GB) memory per node
First neighbor communication network “loosely synchronous”
YZ internal, X on cables
r = 8/16 => 200 MB/s per channel
Scalable 25 GFlops -> 6 Tflops Processing Board 4 x 2 x 2 ~ 26 GF Crate (16 PB) 4 x 8 x 8 ~ 0.5 TF Rack (32 PB) 8 x 8 x 8 ~ 1 TF Large systems (8*n) x 8 x 8
Linux PCs as Host system
Z+(bp)
Y+(bp)
X+(cables)
0 2
4 6
8 10
12 14
1 3
5 7
9 11
13 15
J&T
DDR-MEM
X+
……Z-
![Page 9: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/9.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 9
Design metodology VHDL incremental model of the (almost) whole system
Custom (VLSI and/or FPGA) components derived from VHDL synthesis tool
First-Time-Right-Silicon
Software design env.
Stand-alone simulation of components VHDL model + simulation of the “global” VHDL model
Powerful test-bed for test vectors generation
Simplified but complete model of HW-Host interaction
Test environment for development of compilation chain, OS
performance (architecture) evaluation at design time
![Page 10: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/10.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 10
J&T Asic
J&T module
PB
BackPlaneRack
Assembling apeNEXT…
![Page 11: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/11.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 11
Overview of the J&T Architecture
Peak floating point performance of about 1.6GFlops IEEE compliant double precision
Integer arithmetic performance of about 400 MIPS Link bandwidth of about 200 MByte/sec each direction
full duplex 7 links: X+,X-,Y+,Y-,Z+,Z-, “7th” (I/O)
Support for current generation DDR memory Memory bandwidth of 3.2 GByte/sec
400 Mword/sec
![Page 12: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/12.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 12
J&T: Top Level Diagram
![Page 13: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/13.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 13
The J&T Arithmetic BOX
4 multipliers
4 adder/sub
At 200 MHz (fully piped) = 1.6 GFlops
•Pipelined complex “normal” a*b+c (8 flops) per cycle
![Page 14: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/14.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 14
The J&T remote IO
fifo-based communication: LVDS 1.6 Gb/s per link
(8 bit @ 200MHz) 6 (+1) independent links
![Page 15: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/15.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 15
J&T summary
CMOS 0.18u, 7 metal (ATMEL)
200 MHz Double Precision Complex
Normal Operation 64 bit AGU 8 KW program cache 128 bit local memory
channel 6+1 LVDS 200 MB/s links BGA package, 600 pins
![Page 16: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/16.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 16
Dominant Technologies:
LVDS: 1728 (16*6*2*9) differential signals 200Mb/s, 144 routed via cables, 576 via backplane on 12 controlled-impedance layers
High-Speed differential connectors:
Samtec QTS (J&T Module)
Erni ERMET-ZD (Backplane)
16 Nodes 3D-Interconnected 4x2x2 Topology 26 Gflops, 4.6 GB Memory
Light System:
J&T Module connectors
Glue Logic (Clock tree 10Mhz)
Global signal interconnection (FPGA)
DC-DC converters (48V to 3.3/2.5)
• Collaboration with NEURICAM spaPB
![Page 17: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/17.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 17
J&T Module J&T 9 DDR-SDRAM, 256Mbit
(x16) 6 Link LVDS up to 400MB/s Host Fast I/O Link (7th Link) I2C Link (slow control
network) Dual Power 2.5V+1.8V, 7-
10W estimated Dominant technologies:
SSTL-II (memory interface) LVDS (network interface + I/O)
![Page 18: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/18.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 18
connector kit cost:7KEuro (!)PB Insertion force:80-150 Kg(!)
16 PB Slots + Root Slot Size 447x600 mm2 4600 LVDS differential signals, point-to-point up to 600 Mb/s
16 controlled-imp. layers (32)
Press-fit only Erni/Tyco connectors
ERMET-ZD Providers:
APW (primary)
ERNI (2nd src)
NEXT BackPlane
![Page 19: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/19.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 19
PB Mechanics
apeNEXT PB
Board-to-Board Connector
PB constraints:
Power consumption: up to 340W
PB-BP insertion force: 80-150 Kg (!)
Fully populated PB weight: 4-5 Kg
Custom design of card frame and insertion tool
Detailed study of airflow
DC/DC
J&T Module
J&T Module
AIR-FLOWCHANNEL
2
TOP VIEW ( local )
AIR-FLOW CHANNEL
1
AIR-FLOWCHANNEL
3
AIR-FLOWCHANNEL
3
Fra
me
a1
b1
b3
a3
b2
a2
![Page 20: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/20.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 20
Problem: PB weight: 4-5 Kg, PB consumption: 340W (est.) 32 PB + 2 Root Board Power supply: (<48Vx150A per crate) Integrated Host PCs Forced air cooling, Robust, expandable/modular, CE, EMC ....
Solution:42U rack (h: 2,10 m):
EMC proof, efficient cables routing
19”-1U slots per 9 “host PCs” (rack mounted) Hot-swap power supply cabinet (modular) Custom design of “card cage” and “tie bar”
Custom design of cooling system
Rack mechanics
![Page 21: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/21.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 21
![Page 22: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/22.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 22
Host I/O Architecture
I2C: bootstrap & control 7th-Link (200MB/s)
![Page 23: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/23.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 23
Fifo
Altera APEXIIPCI Interface PLDA
7Link Ctrl
I2C Ctrl
PCIMaster
Ctrl
PCITargetCtrl
7Link Ctrl
QDR Mem Ctrl
Fifo
Fifo
QDRMem Bank
PCI Interface 64bit, 66Mhz
PCI Master Mode for 7th Link
PCI Target Mode for I2C
QuadDataRateMemory (x32)
PCI Board, Altera APEX II based
7th Link: 1(2) bidir. Chan.
I2C: 4 independent ports
Host I/O Interface
![Page 24: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/24.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 24
Status and expected schedule J&T ready to test September 03
We will receive between 300 to 600 chips We need 256 processor to assemble a crate !!
We expect them to work !! The same team designed 7 ASIC of the same complexity Impressive full-detailed simulations of multiple J&T systems More one simulate less one has to test !!
PB,J&T Module, BackPlane, Mechanics were built and tested Within days/weeks the first working apeNEXT computer should
operate Mass production will follow asap
End 2003 mass production will start…. INFN requirements is 8-12 TFlops of computing power !!
![Page 25: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/25.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 25
Software
TAO compilers and linker ….. READY All existing APE program will run with no change Physical code already been run on the simulator
Kernel of PHYSICS codes used to benchmark the efficiencies of the FP unit
C COMPILER gcc (2.93) and lcc have be retargeted lcc WORKS (almost).
http://www.cs.princeton.edu/software/lcc/
![Page 26: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/26.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 26
Project Costs
Total development cost of 1700 k€uro
1050 k€uro for VLSI development 550 k€uro non VLSI
Manpower involved = 20 man/year Mass production cost ~ 0.5 €uro/MFlops
![Page 27: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/27.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 27
Future R&D activities Computing node architecture
Adaptable/reconfigurable computing node Fat operators, short/custom FP data, multiple node integration Evaluation/integration of commercial processor in APE system
Interconnection architecture and technologies Custom ape-like network Interface to host, PCs interconnection
Mechanics assemblies (Perf/Volume,reliability) Rack, Cables, Power distributions etc…
Software Standard languages (C) full support (compiler, linker…) Distributed OS APE system integration in “GRID” environment
![Page 28: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/28.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 28
Conclusions J&T in fab, ready Summer 03 (300….600 chips) Everything else ready and tested !!! If tests ok
mass production starting 4Q03 All components over-dimensioned
Cooling, LVDS tested @ 400 Mb/s, power supply on boards … Makes possible a technology step with no extra design and
relatively low test effort Installation plans
INFN theoretical group requires 8-12 TFlops (10-15 cabinets)(on delivering of a working machine…)
DESY considering between 8 TFlops to 16 Tflops Paris….
![Page 29: apeNEXT](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56814e41550346895dbbb26d/html5/thumbnails/29.jpg)
Paris, May 2003 Piero Vicini - SciParC workshop 29
APE in SciParC APE is the actual (“de-facto”) European computing platform for big
volume LQCD applications. But….
“Interdisciplinarity” is on our pathway (i.e. APE is not only QCD): Fluid dynamics (lattice boltzman, weather forecast) Complex Systems (spin glasses, real glasses, protein folding) Neural networks Seismic migration Plasma physics (astrophysics, thermonuclear engines) ……
So, in our opinion, it’s strategic to build “general purpose” massively parallel computing platform dedicated to approach large-scale computational problem coming from different fields of research.
The APE group can (want) contribute in development of such future machines