TOP500 Supercomputers



An overview of the TOP500 supercomputer lists, with details on the currently fastest computer in the world, Roadrunner, general trends in supercomputing, and a brief description of the Linpack benchmark.


Page 1: TOP500 Supercomputers

The TOP500 project ranks and details the 500 most powerful known computer systems in the world. The project was started in 1993 and publishes an updated list of the supercomputers twice a year. The project aims to provide a reliable basis for tracking and detecting trends in high-performance computing, using a portable implementation of the Linpack benchmark as its yardstick.

TOP500
Andrew Fecheyr Lippens
CAC Activity 1: Top 500 Supercomputer Sites

"Anyone can build a fast CPU. The trick is to build a fast system." -- Seymour Cray

In the early 1990s, a new definition of supercomputer was needed to produce meaningful statistics. After experimenting with metrics based on processor count in 1992, the idea was born at the University of Mannheim to compile a detailed listing of installed systems and use it as the basis for the ranking. This was the beginning of the TOP500 list. In early 1993 Jack Dongarra was convinced to join the project with his Linpack benchmark. A first test version was produced in May 1993, partially based on data available on the Internet. Since June 1993 the TOP500 has been produced twice a year, based on site and vendor submissions only.

Today, the list is compiled by Hans Meuer of the University of Mannheim, Germany; Jack Dongarra of the University of Tennessee, Knoxville; and Erich Strohmaier and Horst Simon of NERSC/Lawrence Berkeley National Laboratory. The list is updated twice a year: the first update always coincides with the International Supercomputing Conference in June, and the second is presented in November at the Supercomputing Conference (SC) in the USA.

With two lists compiled every year since 1993, the total now stands at 33 lists. Next we take a closer look at the latest iteration, the June 2009 edition.

Contents
Current TOP500 (page 2)
The World's Fastest Computer (page 3)
Previous #1 Machines (page 4)
Evolution of Supercomputers (page 4)
The Linpack Benchmark (page 5)

TMC CM-5 (1993): #1 on the first TOP500 list.

Page 2: TOP500 Supercomputers

The June 2009 TOP500 Ranking

The 33rd edition of the TOP500 list of the world’s most powerful supercomputers is still led by Roadrunner, but shows that two of the top 10 positions are now claimed by new systems in Germany. The latest listing also includes a brand-new player, an IBM BlueGene/P system at the King Abdullah University of Science and Technology (KAUST) in Saudi Arabia, ranked at No. 14.

Maintaining its hold on second place is the Cray XT5 Jaguar system installed at the DOE's Oak Ridge National Laboratory. Jaguar reached 1.059 petaflops shortly after its installation, but due to its heavy workload no further measurements were possible.

In third place, a new contender has emerged: a new IBM BlueGene/P system called JUGENE, installed at the Forschungszentrum Juelich (FZJ) in Germany. It achieved 825.5 teraflop/s on the Linpack benchmark and has a theoretical peak performance of just above 1 petaflop/s. FZJ is also home to the new No. 10 system, JUROPA, which is built from Bull NovaScale and Sun Blade x6048 servers and achieved 274.8 teraflop/s.

The U.S. is clearly the leading consumer of HPC systems with 291 of the 500 systems, followed by Europe (145 systems) and Asia (49 systems). The accompanying bubble chart represents the computing power of the most represented countries graphically.

Looking at the architectures used across the whole TOP500, we notice that most machines are classified as clusters (82%), a smaller set as MPP (17.6%) and a tiny remainder as the aging constellation architecture (0.4%). We will go into more detail on the trends in supercomputing after we have discussed Roadrunner.

MPP: Massively parallel processing (MPP) refers to a computer system with many independent arithmetic units or entire microprocessors that run in parallel. The term massive connotes hundreds if not thousands of such units. All of the processing elements are connected together to form one very large computer.

Cluster: A group of PCs connected through a switched network, working together closely to solve the same problems. Clusters are often built from commercial off-the-shelf components to produce a cost-effective alternative to an MPP supercomputer.

Constellation: A cluster of shared-memory multiprocessors. "If there are more microprocessors in a node than there are nodes in the commodity cluster, it is referred to as a constellation" -- Jack Dongarra et al. 2003
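Dongarra's rule is mechanical enough to express in code. A minimal Python sketch (the function name and the example numbers are illustrative, not taken from the list):

# Dongarra's rule: more processors per node than nodes -> constellation.
def classify(nodes: int, processors_per_node: int) -> str:
    """Classify a commodity system per Dongarra et al. 2003."""
    return "constellation" if processors_per_node > nodes else "cluster"

print(classify(nodes=3456, processors_per_node=4))   # cluster: many thin nodes
print(classify(nodes=16, processors_per_node=64))    # constellation: few fat SMP nodes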

# | Name | Vendor | Country | Cores | Rmax (Gflops) | Rpeak (Gflops) | Power (kW) | Processor | OS | Arch | Interconnect
1 | Roadrunner | IBM | U.S.A. | 129600 | 1105000 | 1456700 | 2483 | AMD64 Opteron + PowerXCell 8i | Linux | Cluster | Infiniband
2 | Jaguar | Cray Inc. | U.S.A. | 150152 | 1059000 | 1381400 | 6951 | AMD64 Opteron Quad Core | CNL | MPP | XT4 internal interconnect
3 | Jugene | IBM | Germany | 294912 | 825500 | 1002700 | 2268 | PowerPC 450 | CNK/SLES 9 | MPP | Proprietary
4 | Pleiades | SGI | U.S.A. | 51200 | 487005 | 608829 | 2090 | Intel EM64T Xeon E54xx | SLES10 + SGI ProPack 5 | MPP | Infiniband
5 | BlueGene/L | IBM | U.S.A. | 212992 | 478200 | 596378 | 2330 | PowerPC 440 | CNK/SLES 9 | MPP | Proprietary
6 | Kraken XT5 | Cray Inc. | U.S.A. | 66000 | 463300 | 607200 | n/a | AMD64 Opteron Quad Core | CNL | MPP | XT4 internal interconnect
7 | BlueGene/P | IBM | U.S.A. | 163840 | 458611 | 557056 | 1260 | PowerPC 450 | CNK/SLES 9 | MPP | Proprietary
8 | Ranger | Sun | U.S.A. | 62976 | 433200 | 579379 | 2000 | AMD64 Opteron Quad Core | Linux | Cluster | Infiniband
9 | Dawn | IBM | U.S.A. | 147456 | 415700 | 501350 | 1134 | PowerPC 450 | CNK/SLES 9 | MPP | Proprietary
10 | Juropa | Bull SA | Germany | 26304 | 274800 | 308283 | 1549 | Intel EM64T Xeon X55xx | SUSE Linux | Cluster | Infiniband QDR (Sun M9 / Mellanox / ParTec)

Page 3: TOP500 Supercomputers

International Business Machines Corporation, also known as "Big Blue", is a multinational computer technology and IT consulting corporation. IBM is the leader in supercomputing as provider of 35 of the world's 100 most powerful supercomputers in the TOP500 list, and also leads the energy-efficiency rankings with 19 of the 20 highest megaflops-per-watt systems on the Green500 list.
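The megaflops-per-watt metric can be reproduced from the Rmax and Power columns of the table on page 2; a small Python sketch (pure unit conversion, so the official Green500 figures may differ slightly):

# Mflops per watt from the June 2009 table (Rmax in Gflops, power in kW).
def mflops_per_watt(rmax_gflops: float, power_kw: float) -> float:
    return (rmax_gflops * 1000.0) / (power_kw * 1000.0)

print(round(mflops_per_watt(1105000, 2483)))   # Roadrunner (rank 1): ~445
print(round(mflops_per_watt(458611, 1260)))    # BlueGene/P (rank 7): ~364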

LANL has always been an early adopter of transformational high performance computing (HPC) technology. For example, in the 1970s, when HPC was scalar, LANL adopted vector processing (Cray-1). In the 1980s, when HPC was vector, LANL adopted data parallelism (TMC CM-1). In the 1990s, when HPC was data parallel, LANL adopted distributed memory (TMC CM-5). In the 2000s, when HPC was distributed memory, LANL adopted a hybrid design (Roadrunner).

The fastest computer in the world: IBM Roadrunner

The fastest computer in the world, built by IBM for the U.S. Department of Energy's Los Alamos National Laboratory, is called "Roadrunner". It achieved a performance of 1.026 petaflops in May 2008, becoming the first supercomputer ever to reach the petaflop milestone. It captured the top spot on the June 2008 TOP500 list, beating the IBM BlueGene/L at DOE's Lawrence Livermore National Laboratory. BlueGene/L, with a performance of 478.2 teraflop/s, is now ranked No. 5 after holding the top position from November 2004 until June 2008.

Roadrunner differs from many contemporary supercomputers in that it is a hybrid system, using two different processor architectures. The hybrid design consists of dual-core Opteron server processors manufactured by AMD using the standard AMD64 architecture. Attached to each Opteron core is a Cell processor manufactured by IBM using the Power architecture. As a supercomputer, Roadrunner is considered an Opteron cluster with Cell accelerators.

The machine takes up 560 square meters, uses 92 kilometers of fiber optic cable, weighs in at 270,000 kilograms and requires 2.9 megawatts of power. The cluster is composed of specially designed TriBlade servers connected by Infiniband.

Logically, a TriBlade consists of two dual-core Opterons and four PowerXCell 8i CPUs. Physically, a TriBlade consists of one LS21 Opteron blade, an expansion blade, and two QS22 Cell blades. The LS21 has two 1.8 GHz dual-core Opterons with 16 GB of memory. Each QS22 has two PowerXCell 8i CPUs running at 3.2 GHz and 8 GB of memory. The expansion blade connects the two QS22s to the LS21 via four PCIe x8 links, two links for each QS22, and also provides outside connectivity via an Infiniband 4x DDR adapter. This makes a total width of four slots for a single TriBlade; three TriBlades fit into one BladeCenter H chassis.

A Connected Unit (CU) is 60 BladeCenter H chassis full of TriBlades, that is, 180 TriBlades, all connected to a 288-port Voltaire ISR2012 Infiniband switch. Each CU also has access to the Panasas file system through twelve System x3755 servers. The final cluster is made up of 18 Connected Units, which are connected via eight additional (second-stage) Infiniband ISR2012 switches.
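This layout accounts exactly for the 129,600-core figure in the table on page 2, provided each PowerXCell 8i is counted as nine cores (1 PPE + 8 SPEs), which is an assumption about how the cores were tallied. A quick Python check:

# Tally Roadrunner's cores from the TriBlade layout described above.
triblades = 18 * 60 * 3                # 18 CUs x 60 chassis x 3 TriBlades each
opteron_cores = triblades * 2 * 2      # two dual-core Opterons per TriBlade
cell_cores = triblades * 4 * 9         # four PowerXCell 8i, 1 PPE + 8 SPEs each
print(opteron_cores + cell_cores)      # -> 129600, matching the TOP500 entry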

Page 4: TOP500 Supercomputers

Evolution of architecture types

In the first graph below, the different architecture types are plotted by share of total computing power (% of total Rmax) since the first list was compiled in June 1993. In the early years supercomputers were often SMP or MPP machines, with the latter architecture dominating the list for almost 16 years, only to be challenged by the recent and often more cost-efficient cluster architecture.

Clusters started to appear in the TOP500 list in 1997. The Berkeley NOW (Network of Workstations), a Myrinet cluster of 100 UltraSPARC-I computers, ranked #344 in the June 1997 TOP500 list with a LINPACK performance of 10.14 Gflops. The adoption of clusters, collections of workstations/PCs connected by a local network, has virtually exploded since the introduction of the first Beowulf cluster in 1994. The attraction lies in the (potentially) low cost of both hardware and software and the control that builders/users have over their system. Clusters represent 82% of the system count and 58.6% of the processing power in the latest TOP500 edition.

Evolution of interconnections

The choice of interconnect has a major impact on the efficiency (Rmax/Rpeak) of a supercomputer, even more so for clusters with a huge number of commodity nodes, as communication between them can easily become the bottleneck.
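To make the efficiency notion concrete, here is a small Python sketch computing Rmax/Rpeak for three systems from the June 2009 top-10 table on page 2 (values in Gflops):

# Efficiency = Rmax / Rpeak for a few June 2009 top-10 systems.
systems = {
    "Roadrunner": (1105000, 1456700),  # hybrid cluster, Infiniband
    "Jaguar":     (1059000, 1381400),  # Cray XT5, proprietary interconnect
    "Juropa":     (274800, 308283),    # commodity cluster, Infiniband QDR
}
for name, (rmax, rpeak) in systems.items():
    print(f"{name}: {rmax / rpeak:.1%}")
# Prints roughly: Roadrunner 75.9%, Jaguar 76.7%, Juropa 89.1%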

The first supercomputers to occupy the TOP500 lists made use of fat trees to communicate. During the '90s both the proprietary Cray interconnect and crossbars were hugely popular, followed by SP Switches around the turn of the millennium. Myrinet, the low-latency interconnect, was in high demand from 2001 until the higher-bandwidth Infiniband technology gained the upper hand in 2007. Nowadays most supercomputers are interconnected using conventional Gigabit Ethernet (56% of all systems) or Infiniband (30%), the latter often used on higher-ranked systems.

[Figure: Share of total Rmax by architecture type, June '93 to June '09. Legend: SMP, Constellations, Cluster, MPP, SIMD, Single Processor.]

[Figure: Interconnect families across the TOP500, June '93 to June '09. Legend: Cray Interconnect, NUMAlink, SP Switch, Infiniband, Quadrics, Myrinet, Gigabit Ethernet, Proprietary, Fat Tree, Crossbar.]

Evolution

Ten machines have occupied the first spot on the list since 1993:

1. TMC CM-5. The Thinking Machines Corporation CM-5 at LANL was #1 on the first list in '93. It used a MIMD architecture based on a fat tree network of SuperSPARC I 32 MHz processors.

2. Fujitsu Numerical Wind Tunnel. Took the #1 spot on the November '93 list and stayed on top until '95. The Numerical Wind Tunnel used a fully distributed crossbar to connect 140 Fujitsu 105 MHz vector processors.

3. Intel Paragon XP/S140. Briefly took the #1 spot in June '94. This MPP had an astonishing 3,680 Intel i860 50 MHz processors and used a 2D mesh interconnect.

4. Hitachi SR2201. In June '96 this machine captured the top spot with 220.4 Gflops, using 1,024 PA-RISC HARP-1E 150 MHz processors connected by a hyper-crossbar.

5. Hitachi CP-PACS. Top performer on the November '96 list. It doubled the number of PA-RISC HARP-1E 150 MHz processors of its predecessor.

6. Intel ASCI Red. ASCI Red was a mesh-based MIMD MPP machine initially consisting of 4,510 Intel Pentium Pro processors at 200 MHz. It was the first machine to break the teraflop barrier and stayed on top through the June 2000 list.

7. IBM ASCI White. Took the top spot in November 2000. Based on IBM's RS/6000 SP computer: 512 nodes of 16 POWER3 375 MHz processors each.

8. NEC Earth Simulator. #1 from June '02. NEC's ES runs global climate models at 35.8 teraflops using 640 nodes of 8 vector processors each.

9. IBM Blue Gene/L. #1 from November '04, and the last top-ranked MPP. It debuted at 70.72 teraflops with PowerPC cores.

10. IBM Roadrunner. The current #1 since June '08: a hybrid (Opteron + Cell) cluster with Infiniband interconnects.

Page 5: TOP500 Supercomputers

Running HPL

Machine

To put the numbers from the TOP500 list in perspective, I ran the HPL Linpack benchmark on a four-year-old Supermicro 4020C-T server. The server has been in use since 2005 as a web application server and contains two Opteron 246 processors running at 2.0 GHz. It has access to 2 GB of DDR400 ECC registered memory, which could not be fully used during the benchmark, as normal operations of the server could not be shut down.

The server runs the Gentoo Linux distribution. The OpenMPI, BLAS, LAPACK and HPL source code was compiled with the default settings. No time was spent trying to optimize the compilation of the binaries.

Commands

As the reference BLAS implementation is written in Fortran, I first had to recompile GCC with Fortran support.

USE="fortran" emerge -av gcc

In Gentoo, downloading and compiling HPL and its dependencies is done automatically with the emerge command.

emerge -av sys-cluster/hpl

Copy the /usr/share/hpl/HPL.dat configuration file into a working directory and edit it. To start the benchmark with 2 processes run this command:

mpirun -np 2 /usr/bin/xhpl
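The key tuning knobs live in HPL.dat. The sketch below shows the general shape of that file with illustrative values, not the exact configuration used for this run; each line carries a value followed by a label:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out         output file name (if any)
6               device out (6=stdout, 7=stderr, file)
1               # of problem sizes (N)
5000            Ns: problem size, limited by available memory
1               # of block sizes (NB)
128             NBs: block size for panel factorization
0               PMAP process mapping (0=Row-major, 1=Column-major)
1               # of process grids (P x Q)
1               Ps
2               Qs: P*Q must match the MPI process count (-np 2)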

Achieved results

To find a decent Rmax I tried several combinations of HPL's N (problem size) and NB (block size) parameters. With N=550 and NB=12 the system managed to achieve 1.509 Gflops. In 1993 this result would have put this dual-processor AMD Opteron system at position #180 of the TOP500 list.
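To see how HPL turns a run into a Gflops figure, here is a back-of-the-envelope check in Python; the run time shown is an assumed value, chosen here to reproduce the reported result:

# HPL reports (2/3*N^3 + 2*N^2) floating point operations per second.
N = 550
flops = (2 / 3) * N**3 + 2 * N**2     # operation count of the LU solve
seconds = 0.0739                      # assumed wall-clock time (hypothetical)
print(f"{flops / seconds / 1e9:.3f} Gflops")   # ~1.509 Gflops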


The Linpack Benchmark

As a yardstick of performance, the TOP500 list uses the 'best' performance as measured by the LINPACK benchmark. LINPACK is a software library for performing numerical linear algebra on digital computers. It was written in Fortran by Jack Dongarra, Jim Bunch, Cleve Moler, and Gilbert Stewart, and was intended for use on supercomputers in the 1970s and early 1980s. It makes use of the BLAS (Basic Linear Algebra Subprograms) libraries for performing basic vector and matrix operations.

The LINPACK Benchmark was introduced to the TOP500 list by Jack Dongarra and chosen because it is widely used and performance numbers are available for almost all relevant systems. The benchmark measures a system's floating point computing power: it calculates how fast a computer solves a dense N by N system of linear equations Ax = b, which is a common task in engineering. The solution is obtained by Gaussian elimination with partial pivoting, taking 2/3·N³ + O(N²) floating point operations. The result is reported in millions or billions of floating point operations per second (Mflops or Gflops).
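A toy version of this measurement fits in a few lines of Python (assuming NumPy is installed); numpy.linalg.solve calls LAPACK's LU factorization with partial pivoting underneath, the same algorithm family the benchmark mandates:

import time
import numpy as np

# Solve a dense N x N system Ax = b and convert the elapsed time to Gflops.
N = 2000
rng = np.random.default_rng(0)
A = rng.standard_normal((N, N))
b = rng.standard_normal(N)

start = time.perf_counter()
x = np.linalg.solve(A, b)              # LU with partial pivoting via LAPACK
elapsed = time.perf_counter() - start

flops = (2 / 3) * N**3 + 2 * N**2      # Linpack operation count
print(f"{flops / elapsed / 1e9:.2f} Gflops")
print("residual:", np.linalg.norm(A @ x - b))   # sanity check on the solution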

For the TOP500, a version of the benchmark is used that allows the user to scale the size of the problem and to optimize the software in order to achieve the best performance for a given machine. This performance does not reflect the overall performance of a given system, as no single number ever can. It does, however, reflect the performance of a dedicated system solving a dense system of linear equations. Since the problem is very regular, the achieved performance is quite high, and the numbers give a good indication of how much of peak performance is attainable. The TOP500 list uses the "Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers" (HPL), which can be found at http://www.netlib.org/benchmark/hpl/.

By measuring the actual performance for different problem sizes N, a user obtains not only the maximal achieved performance Rmax for the problem size Nmax, but also the problem size N1/2 at which half of Rmax is achieved. These numbers, together with the theoretical peak performance Rpeak, are the numbers given in the TOP500. In an attempt to obtain uniformity across all computers in performance reporting, the algorithm used to solve the system of equations in the benchmark procedure must conform to LU factorization with partial pivoting. In particular, the operation count for the algorithm must be 2/3·N³ + O(N²) double precision floating point operations. This excludes the use of a fast matrix multiply algorithm like Strassen's method, as well as algorithms that compute a solution in a precision lower than full precision (64-bit floating point arithmetic) and refine it using an iterative approach.

Page 6: TOP500 Supercomputers

Bibliography

http://www.top500.org/static/lists/1993/06/top500_199306.pdf

http://en.wikiquote.org/wiki/Seymour_Cray

http://en.wikipedia.org/wiki/Top500

http://en.wikipedia.org/wiki/IBM_Roadrunner

http://www.ibm.com/systems/deepcomputing/top500.html

http://www.top500.org/project/linpack

http://www.gentoo.org/proj/en/science/blas-lapack.xml

http://www.top500.org/lists/2008/11

http://en.wikipedia.org/wiki/LINPACK

http://www.top500.org/orsc/2006/clusters

http://now.cs.berkeley.edu

Jack Dongarra, Thomas Sterling, Horst Simon, Erich Strohmaier, "High-Performance Computing: Clusters, Constellations, MPPs, and Future Directions," Computing in Science and Engineering, vol. 7, no. 2, pp. 51-59, March/April, 2005.