
IBM Research

June 2003 | BlueGene/L © 2003 IBM Corporation

Full Circle: Simulating Linux Clusters on Linux Clusters

L. Ceze, K. Strauss, G. Almasi, P. Bohrer, J. Brunheroto, C. Cascaval, J. Castanos, D. Lieber, X. Martorell, J. Moreira, A. Sanomiya, E. Schenfeld


Outline

• Introduction
• The BlueGene/L supercomputer
• Single-node simulation
• Multi-node simulation
• Practical experiences
• Conclusions


Introduction

• The schedule of the BlueGene/L supercomputer project requires concurrent development of hardware and software
• We needed a tool to support software development in advance of hardware availability
• Solution: develop BGLsim, an architecturally accurate simulator of BlueGene/L at the machine-instruction level
• BGLsim is a full-system simulator:
  - Processors and floating-point units
  - Memory hierarchy
  - Ethernet devices
  - Interrupt controllers
  - Interconnection networks
• BGLsim simulates multi-node BlueGene/L systems
• BGLsim can execute exactly the same binary code that executes on real hardware
• BGLsim has been used to develop and test compilers, operating systems, run-time libraries, communication libraries, device drivers, benchmarks, and applications
• Modern Linux clusters provide the horsepower necessary to run large BGLsim instances: systems as large as 512 compute nodes have been simulated


Outline

• Introduction
• The BlueGene/L supercomputer
• Single-node simulation
• Multi-node simulation
• Practical experiences
• Conclusions


BlueGene/L

  Level                                          Peak performance   Memory
  Chip (2 processors)                            2.8/5.6 GF/s       4 MB
  Compute card (2 chips, 2x1x1)                  5.6/11.2 GF/s      0.5 GB DDR
  Node board (16 compute cards, 32 chips, 4x4x2) 90/180 GF/s        8 GB DDR
  Cabinet (32 node boards, 8x8x16)               2.9/5.7 TF/s       256 GB DDR
  System (64 cabinets, 64x32x32)                 180/360 TF/s       16 TB DDR


Compute Density

• 1U rack-mounted servers: 2 Pentium 4 per 1U, 42 1U units per rack = 84 processors/rack
• Blade servers: 2 Pentium 4 per blade, 14 blades per 7U chassis, 6 chassis per frame = 168 processors/rack
• BlueGene/L: 2 dual-processor chips per compute card, 16 compute cards per node card, 16 node cards per midplane, 2 midplanes per rack = 2,048 processors/rack (see the arithmetic below)
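The rack-level totals follow directly from multiplying the per-level counts; for BlueGene/L:

\[
\underbrace{2}_{\text{CPUs/chip}} \times
\underbrace{2}_{\text{chips/card}} \times
\underbrace{16}_{\text{cards/node card}} \times
\underbrace{16}_{\text{node cards/midplane}} \times
\underbrace{2}_{\text{midplanes/rack}}
= 2048\ \text{CPUs/rack}
\]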


BlueGene/L fundamentals

• A large number of nodes (65,536)
  - Low-power (20 W) nodes for density
  - High floating-point performance
  - System-on-a-chip technology
• Nodes interconnected as a 64x32x32 three-dimensional torus
  - Easy to build large systems, as each node connects only to its six nearest neighbors; full routing in hardware
  - Bisection bandwidth per node is proportional to n^2/n^3 = 1/n (an n x n x n torus has n^3 nodes but only on the order of n^2 links crossing any bisecting plane)
  - Auxiliary networks for I/O and global operations
• Applications consist of multiple processes with message passing
  - Strictly one process per node
  - Minimum OS involvement and overhead


BlueGene/L interconnection networks

Three-dimensional torus
  - Interconnects all compute nodes (65,536)
  - Virtual cut-through hardware routing
  - 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
  - Communications backbone for computations
  - 350/700 GB/s bisection bandwidth (see the arithmetic check below)

Global tree
  - One-to-all broadcast functionality
  - Reduction operations functionality
  - 2.8 Gb/s of bandwidth per link
  - Latency of a tree traversal on the order of 2 µs
  - Interconnects all compute and I/O nodes (1,024)

Ethernet
  - Incorporated into every node ASIC
  - Active in the I/O nodes (one I/O node per 64 compute nodes)
  - Carries all external communication (file I/O, control, user interaction, etc.)
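As a check on the bisection figure: cutting the 64x32x32 torus across its longest dimension severs 2 x 32 x 32 = 2,048 links (the factor of 2 is the torus wrap-around), each carrying 1.4 Gb/s = 0.175 GB/s:

\[
2 \times 32 \times 32 \times 0.175\ \mathrm{GB/s} \approx 358\ \mathrm{GB/s} \approx 350\ \mathrm{GB/s},
\]

with twice that (700 GB/s) when both directions of each bidirectional link are counted.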


BlueGene/L hardware architecture

• System-on-a-chip technology delivers high performance with high density and low power
  - Two processors (each with a dual-element floating-point unit) and 4 MB of shared L3 cache in one compute chip
  - Compute chip + external DRAM = one node
  - Compute nodes have 256 MB of memory, I/O nodes 512 MB
• Interconnection networks are built into the nodes: tree, torus, Ethernet, JTAG, global interrupts
• The main communication network is the three-dimensional torus, with nearest-neighbor links only
• The tree also requires nearest-neighbor links only
• Extremely large systems can be built: the BlueGene/L machine at Lawrence Livermore National Laboratory will have 65,536 compute nodes organized as a 64 x 32 x 32 torus


BlueGene/L system architecture overview

[Diagram: the machine is divided into 1,024 processor sets (Pset 0 through Pset 1023). Each Pset pairs one I/O node, running Linux and the ciod daemon, with 64 compute nodes (C-Node 0 through C-Node 63) running the compute node kernel (CNK); the nodes are connected by the torus and tree networks. A functional Ethernet links the I/O nodes to file servers and front-end nodes. A control Ethernet links the IDo chip, which drives JTAG and I2C, to the service node running MMCS, with its database, scheduler, and console.]


Outline

• Introduction
• The BlueGene/L supercomputer
• Single-node simulation
• Multi-node simulation
• Practical experiences
• Conclusions


Single-node simulation stack

[Diagram: the single-node simulation stack, bottom to top: X86 hardware → X86 Linux → BGLsim → (simulated) BG/L hardware → BG/L Linux → Application.]


The BlueGene/L node

[Block diagram: the BlueGene/L node. Two PPC440 CPUs, each with an FPU and an L2 cache, share an interrupt controller and connect through the PLB to on-chip SRAM, the L3 cache (backed by external memory), a lockbox, and the torus/tree network interfaces. The OPB hangs off the PLB and attaches the MAL and the EMAC Ethernet controller; a JTAG port provides external access.]


Main simulation loop

[Flow diagram: the main simulation loop. Fetch → Decode → Execute, where executing an instruction involves translating addresses, checking permissions, accessing memory, and updating registers; after each instruction the simulator updates the timer and polls the devices, then returns to Fetch. A sketch of this loop follows.]
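As a concrete illustration, here is a minimal interpreter loop in C with this structure. All names and details are illustrative, not BGLsim's actual code; the decode/execute stage is stubbed out.

    /* Minimal sketch of an instruction-level simulator main loop,
       following the diagram above. Illustrative only, not BGLsim code. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        uint32_t pc;          /* program counter               */
        uint32_t gpr[32];     /* general-purpose registers     */
        uint64_t timebase;    /* simulated time-base (timer)   */
        int      running;
    } Cpu;

    static uint8_t mem[1 << 20];   /* toy flat physical memory */

    /* Translate address and check permissions. Identity mapping here;
       a real simulator walks the MMU state and raises exceptions. */
    static uint32_t translate(uint32_t vaddr) { return vaddr & (sizeof mem - 1); }

    static uint32_t read_mem32(uint32_t paddr) {
        uint32_t w;
        memcpy(&w, &mem[paddr], sizeof w);
        return w;
    }

    /* Decode + execute. Stub: halt on the all-zero word, otherwise no-op.
       A real implementation dispatches on the opcode, repeats the
       translate/check/access steps for loads and stores, and updates
       the architected registers. */
    static void execute(Cpu *cpu, uint32_t insn) {
        if (insn == 0)
            cpu->running = 0;
    }

    /* Give the device models a turn: timers, Ethernet, torus/tree. */
    static void poll_devices(Cpu *cpu) { (void)cpu; }

    static void simulate(Cpu *cpu) {
        while (cpu->running) {
            uint32_t insn = read_mem32(translate(cpu->pc));  /* fetch        */
            cpu->pc += 4;
            execute(cpu, insn);                              /* decode + run */
            cpu->timebase++;                                 /* update timer */
            poll_devices(cpu);                               /* poll devices */
        }
    }

    int main(void) {
        Cpu cpu = { .running = 1 };
        simulate(&cpu);   /* memory is zeroed, so this halts immediately */
        printf("simulated %llu instructions\n",
               (unsigned long long)cpu.timebase);
        return 0;
    }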


The call-through services

• The call-through mechanism supports interaction between running code and the simulator itself:
  - Host real-time clock
  - Performance and instruction counters
  - Tracing and histograms
  - Stop/suspend simulation
  - Access to the host file system and environment variables
  - Task operations (context switching, creating/terminating processes)
• Interaction between the operating system in the simulated machine and the simulator enables tracing and counting on a per-process basis (a sketch of one interception scheme follows)
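The slides do not give BGLsim's actual call-through encoding or ABI. One common scheme for such hooks, shown here as a hypothetical C sketch, reserves an otherwise-unused instruction word and passes a service number in a register; the simulator recognizes the word at the execute stage and services the request directly:

    /* Hypothetical call-through handler: the opcode value, service
       numbers, and register ABI are invented for illustration. */
    #include <stdint.h>
    #include <time.h>

    #define CALLTHROUGH_INSN 0x7C0007CCu   /* reserved word (invented) */

    enum {
        CT_HOST_TIME  = 1,   /* read the host real-time clock    */
        CT_INSN_COUNT = 2,   /* read the instruction counter     */
        CT_STOP_SIM   = 3,   /* stop/suspend the simulation      */
    };

    typedef struct {
        uint32_t gpr[32];
        uint64_t insn_count;
        int      running;
    } Cpu;

    /* Invoked from the execute stage when CALLTHROUGH_INSN is decoded.
       Service number arrives in gpr[3]; the result replaces it. */
    static void do_callthrough(Cpu *cpu) {
        switch (cpu->gpr[3]) {
        case CT_HOST_TIME:
            cpu->gpr[3] = (uint32_t)time(NULL);
            break;
        case CT_INSN_COUNT:
            cpu->gpr[3] = (uint32_t)cpu->insn_count;
            break;
        case CT_STOP_SIM:
            cpu->running = 0;
            break;
        }
    }

    int main(void) {
        Cpu cpu = { .running = 1, .insn_count = 12345 };
        cpu.gpr[3] = CT_INSN_COUNT;
        do_callthrough(&cpu);
        return cpu.gpr[3] == 12345 ? 0 : 1;   /* smoke test */
    }

On the guest side, the operating system or run-time library would wrap the reserved word in a small assembly stub, which is how per-process hooks (for example, at context switch) can notify the simulator.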


Debugging support in BGLsim

• Single-instruction step
• Running a debugger on the simulated machine
• Running a debugger on a different machine
• Kernel debugging


Outline

• Introduction
• The BlueGene/L supercomputer
• Single-node simulation
• Multi-node simulation
• Practical experiences
• Conclusions


Multi-node simulation stack

[Diagram: the multi-node simulation stack on each host, bottom to top: X86 hardware → X86 Linux → X86 MPI → BGLsim → (simulated) BG/L hardware → BG/L Linux → BG/L MPI → Application.]


Multi-node simulation with BGLsim

• Three types of processes:
  - bglsim: simulates a single node of BlueGene/L
  - ethgw: gateway between simulated and real Ethernets
  - idochip: interface between the control system and the simulated machine
• Five networks are simulated:
  - Torus
  - Tree
  - Ethernet
  - Global interrupts
  - JTAG
• What is not simulated:
  - Control system, file servers, front-end nodes
(A sketch of how bglsim processes might exchange network traffic follows.)
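The slides name a CommFabric library that carries tree, torus, and JTAG traffic between processes, but not its transport. A minimal sketch of one plausible approach, with hypothetical names (not BGLsim's actual CommFabric API), forwards simulated torus packets between neighboring bglsim processes as MPI point-to-point messages:

    /* Hypothetical sketch: forwarding a simulated torus packet between
       two bglsim processes over MPI. Run with: mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdint.h>
    #include <string.h>

    #define TORUS_TAG 100

    typedef struct {
        int32_t dst[3];        /* destination torus coordinates (x, y, z) */
        uint8_t payload[256];  /* torus packet body                       */
    } TorusPacket;

    /* Send one packet to the process simulating a neighboring node. */
    static void torus_send(int neighbor_rank, const TorusPacket *p) {
        MPI_Send(p, (int)sizeof *p, MPI_BYTE, neighbor_rank,
                 TORUS_TAG, MPI_COMM_WORLD);
    }

    /* Non-blocking poll for an incoming packet, called from the
       device-polling step of the main loop; returns 1 on receipt. */
    static int torus_poll(TorusPacket *p) {
        int flag = 0;
        MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, TORUS_TAG, MPI_COMM_WORLD, &flag, &st);
        if (flag)
            MPI_Recv(p, (int)sizeof *p, MPI_BYTE, st.MPI_SOURCE,
                     TORUS_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        return flag;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            TorusPacket p = { .dst = {1, 0, 0} };
            memcpy(p.payload, "hello", 6);
            torus_send(1, &p);
        } else if (rank == 1) {
            TorusPacket p;
            while (!torus_poll(&p))
                ;  /* in bglsim this poll interleaves with simulation */
        }
        MPI_Finalize();
        return 0;
    }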


Simulating BlueGene/L with BGLsim

[Diagram: a simulated BlueGene/L system. Each node is one bglsim process running either Linux plus ciod (I/O nodes) or BLRTS (compute nodes); the processes exchange tree, torus, and JTAG traffic through the CommFabric library. An Ethernet gateway process with a tap daemon bridges the simulated Ethernet to the real one, reaching the file servers and front-end nodes. An IDo chip simulator connects the simulated control Ethernet to the service node, which runs MMCS with its database, scheduler, and cioman.]


Outline

• Introduction
• The BlueGene/L supercomputer
• Single-node simulation
• Multi-node simulation
• Practical experiences
• Conclusions


Practical experience with BGLsim

• Porting Linux to the BlueGene/L I/O nodes
• Development of the BlueGene/L compute node kernel
• Testing of BlueGene/L compilers (particularly double-FPU testing)
• Development of the MPI implementation for BlueGene/L
• Execution of MPI benchmarks and applications
• Porting LoadLeveler to BlueGene/L


NAS Parallel Benchmarks IS experiment

[Chart: average instructions per node, in millions (0 to 45), versus number of nodes (1, 2, 4, 8, 16, 32), with three series: Total, Computation, and Communication.]


Experience with real hardware

• All our hardware experience so far has been with single-node systems
• The compute node kernel (CNK) was extensively tested on BGLsim and had some testing on a VHDL simulator; it executed on real hardware with zero changes
• The same holds for LINPACK: BGLsim + VHDL, then straight to hardware
• All 8 NAS Parallel Benchmarks (serial and parallel versions) were extensively tested on BGLsim, and two of them (serial versions) were tested on VHDL; 7 of the 8 simply executed on real hardware with no changes!
• The control system required some tweaking: JTAG is the one component of BGLsim where we took shortcuts


NAS Parallel Benchmarks slowdown

  Benchmark / #nodes      2       4     8/9      16
  MG                    680     350     270     280
  EP                   1750    1670    1530    1250
  LU                    670     340       -       -
  FT                    790     570     480     230
  CG                    640     540     510     500
  IS                    260     220     170     150
  SP                      -     530     450     400
  BT                      -     660     570     490

  (The 8/9 column is 8 nodes for most benchmarks and 9 for SP and BT, which require square process counts.)


Outline

• Introduction
• The BlueGene/L supercomputer
• Single-node simulation
• Multi-node simulation
• Practical experiences
• Conclusions


Conclusions

• BGLsim is a complete parallel-system simulator; it is being used for software development and hardware analysis of BlueGene/L
• BGLsim runs exactly the same code that runs on real hardware
• Together with instrumented system software, BGLsim can collect additional performance information not available on real hardware
• BGLsim has been used in the development of compilers, operating systems, run-time libraries, communication libraries, control and monitoring systems, and job scheduling and management tools
• What we simulated with BGLsim works on real hardware!
• Linux clusters that deliver large amounts of computing power at low cost make this simulation-intensive approach feasible
• Current work: adding timing models to BGLsim to provide more performance information