Some early results for SiCortex machines
John F. Mucci
Founder and CEO, SiCortex, Inc.
Lattice 2008
The Company
Computer Systems company building complete, high processor count, richly interconnected, low power Linux computers
Strong belief (and now some proof) that a more efficient HPC computer can be built from the silicon up.
Around 80 or so really bright people, plus me.
Venture funded, based in Maynard, Massachusetts, USA
http://www.sicortex.com, for whitepapers and tech. info
What We Are Building
A family of fully-integrated HPC Linux systems delivering best-of-breed:
Total Cost of Ownership
Delivered Performance:
Per Watt
Per Square Foot
Per Dollar
Per Byte of I/O
Usability and deployability
Reliability
SiCortex Product Family
105 Gflops, 48 GB memory, 6.5 GB/s I/O, 200 Watts
0.95 Teraflops, 864 GB memory, 30 GB/s I/O, 2.5 KWatts
2.14 Teraflops, 1.8 Terabytes memory, 68 GB/s I/O, 5+ KWatts
8.55 Teraflops, 7.8 Terabytes memory, 216 GB/s I/O, 20.5 KWatts, 2.3 m2 footprint
Make it Green, Don't Paint it Green
Through increasing component density and integration:
~ Performance
~ Reliability
~ 1/Power
Innovate where it counts!
Single-core performance and architecture
Scalability: memory, network, I/O
Reliability: network, on-chip and off-chip ECC, software to recover
Software usability
Buy or borrow the rest...
The SiCortex Node Chip
27-Node Module
27x node
54x DDR2 DIMM
3x 8-lane PCIe modules (Fibre Channel, 10 Gb Ethernet, InfiniBand)
2x Gigabit Ethernet
Compute: 236 GF/sec
Memory b/w: 345 GB/sec
Fabric b/w: 78 GB/sec
I/O b/w: 7.5 GB/sec
Fabric Interconnect
[Diagram: fabric interconnect among 36 nodes, numbered 0-35]
Network and DMA Engine
Network:
Unidirectional; 3 unique routes between any pair of nodes
3 in, 3 out, plus loopback, each 2 GB/s
Fully passive, reliable, in order, no switch
DMA Engine:
100% user level, no pinning
Send/Get/Put semantics
Remote initiation
MPI: 1.0 us latency, 1.5 GB/sec bandwidth
Standard Linux/MPI Environment
Integrated, tuned and tested.
Open Source:
Linux (Gentoo-based)
GNU C, C++, Fortran
Cluster file system (Lustre)
MPI libraries (MPICH2)
Math libraries
Performance tools
Management software
SiCortex:
Optimized compiler
Console, boot, diagnostics
Low-level communications libraries, device drivers
Management software
5-minute boot time
Licensed:
Debugger, Trace Visualizer
QCD: MILC and su3_rmd
A widely used Lattice Gauge Theory QCD simulation for:
Molecular dynamics evolution, hadron spectroscopy, matrix elements and charge studies
The ks_imp_dyn/su3_rmd case is a widely studied benchmark.
Run time tends to be dominated by the conjugate gradient solver.
http://physics.indiana.edu/~sg/milc.html
http://faculty.cs.tamu.edu/wuxf/research/tamu-MILC.pdf
MILC su3_rmd Scaling (Input-10/12/14)
[Chart: slowdown (0 to 2) vs. number of cores (1 to 10,000, log scale) for the v10, v12, and v14 inputs, with AMD64/IB and BGL reference curves]
Understanding what might be possible
The SiCortex system is new, compared to 10+ years of hacking and optimization elsewhere. So we took a look at a test suite and benchmark provided by Andrew Pochinsky @ MIT. It looks at the problem in three phases:
What performance do you get running from L1 cache?
What performance do you get running from L2?
And from main memory?
Very useful to see where cycles and time are spent, and it gives hints about what compilers might do and how to restructure codes.
So what did we see?
By selective hand coding of Andrew's code we have seen:
Out of L1 cache: 1097 Mflops
Out of L2 cache: 703 Mflops
Out of memory: 367 Mflops
The compiler is improving each time we dive deeper into the code. But we're not experts on QCD and could use some help.
What conclusions might we draw?
Good communications make for excellent scaling (MILC).
Working on single-node performance tuning (on the Pochinsky code) gives direction on performance and insight for the compiler.
DWF formulations have a higher computation/communications ratio, and we do quite well. We will do even better with formulations that have increased communications.
SiCortex and Really Green HPC
Come downstairs (at the foot of the stairs) and take a look and give it a try.
It's a 72-processor (100+ Gflop) desktop system using ~200 watts, totally compatible with its bigger family members, which go up to 5832 processors.
More delivered performance per square foot, per dollar, and per watt
Performance Criteria for the Tools Suite
• Work on unmodified codes
• Quick and easy characterization of:
– Hardware utilization (on- and off-core)
– Memory
– I/O
– Communication
– Thread/task load balance
• Detailed analysis using sampling
• Simple instrumentation
• Advanced instrumentation and tracing
• Trace-based visualization
• Expert access to PMU and perfmon2
Application Performance Tool Suite
• papiex - overall application performance
• mpipex - MPI profiling
• ioex - I/O profiling
• hpcex - source code profiling
• pfmon - highly focused instrumentation
• gptlex - dynamic call path generation
• tauex - automatic profiling and visualization
• vampir - parallel execution tracing
• gprof is there too (but is not MPI-aware)
For fun (and debugging)
Thanks