
Page 1: LRZ SuperMUC One Year of Operation

© 2013 IBM Corporation

IBM Systems & Technology Group

LRZ SuperMUC One Year of Operation

IBM Deep Computing

13.03.2013 Klaus Gottschalk – IBM HPC Architect

Page 2: LRZ SuperMUC One Year of Operation

Leibniz Computing Center's new HPC system is now installed and operational


Page 3: LRZ SuperMUC One Year of Operation

SuperMUC Technical Highlights

• 3 PFLOP/s computer in Germany, part of the Gauß Centre for Supercomputing
  – 9414 nodes with 2 Intel Sandy Bridge EP processors each
  – 209 nodes with 4 Intel Westmere EX processors each
  – 3 PFLOP/s peak performance
  – 327 TB memory
  – InfiniBand interconnect
• Large file space for multiple purposes
  – 10 PByte file space based on IBM GPFS, with 200 GByte/s aggregated I/O bandwidth
  – 2 PByte NAS storage with 10 GByte/s aggregated I/O bandwidth
• No GPGPUs or other accelerator technology
• Innovative technology for energy-efficient computing
  – Hot water cooling
  – Energy-aware scheduling
• Most energy- and cooling-efficient high-end HPC system: PUE 1.1
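For context, PUE (Power Usage Effectiveness) is the ratio of total facility power to IT power, so PUE 1.1 means cooling and facility overhead add only 10% on top of the compute load. As a rough comparison (the 1.5 figure for a conventional air-cooled data center is an assumption for illustration, not from this deck):

\mathrm{PUE} = \frac{P_{\mathrm{total}}}{P_{\mathrm{IT}}}, \qquad
\mathrm{PUE} = 1.1 \;\Rightarrow\; P_{\mathrm{overhead}} = 0.1\,P_{\mathrm{IT}}, \qquad
\text{air-cooled } \mathrm{PUE} \approx 1.5 \;\Rightarrow\; P_{\mathrm{overhead}} = 0.5\,P_{\mathrm{IT}}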

Page 4: LRZ SuperMUC One Year of Operation

SuperMUC Energy Efficiency Goals

• Stable, highly scalable, efficient hardware based on standard x86 components
• Save 40% of energy compared to air-cooled HPC systems
  – Hot water cooling, allowing for free cooling all year round
  – Using standard components
  – Easily serviceable
• Frequency-controlled nodes (a cpufreq sketch follows below)
  – Optimize application energy consumption during use
  – Save energy when not in use
• Power-aware job scheduling
  – Run applications at the optimal clock rate, according to predefined policies
  – Deliver an energy report after the job run

[Figure: end of a LINPACK run on 240 nodes]
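Per-node frequency control of this kind maps naturally onto the standard Linux cpufreq sysfs interface. A minimal sketch (the sysfs paths are the generic Linux ones; the job classes and clock values are illustrative, and this is not the vendor tooling):

import glob

# Each CPU exposes its frequency limits under the generic Linux
# cpufreq sysfs interface: /sys/devices/system/cpu/cpu*/cpufreq/.
def set_max_freq_khz(khz):
    for path in glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq"):
        with open(path, "w") as f:   # writing requires root
            f.write(str(khz))

# Illustrative policy: memory-bound jobs gain little from higher clocks,
# so cap them at the default rate and save the power.
JOB_CLASS_TO_KHZ = {"memory_bound": 2200000, "compute_bound": 2700000}

def apply_policy(job_class):
    set_max_freq_khz(JOB_CLASS_TO_KHZ.get(job_class, 2200000))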

Page 5: LRZ SuperMUC One Year of Operation

Page 6: LRZ SuperMUC One Year of Operation

iDataPlex dx360 M4, water-cooled, with Intel Sandy Bridge CPUs

Page 7: LRZ SuperMUC One Year of Operation

Island Architecture – Multicluster GPFS

[Diagram: compute nodes N1–N72 of an island attach to the island's core switch; 126 spine switches interconnect the core switches of the compute islands (Island 2 … Island 18) with an I/O island, where the GPFS server cluster connects over point-to-point InfiniBand to two SFA 12k storage controllers.]
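Why islands: traffic inside an island crosses a fully non-blocking fabric, while links between islands are pruned to save switches and cables. A back-of-the-envelope sketch in Python (island size, link rate and the 4:1 pruning factor are assumptions for illustration, not figures from this deck):

# All numbers below are assumptions for illustration only.
NODES_PER_ISLAND = 512   # assumed compute nodes per island
LINK_GBITS = 40          # assumed InfiniBand link rate per node, Gbit/s
PRUNING_FACTOR = 4       # assumed inter-island pruning ratio (4:1)

intra_island_per_node = LINK_GBITS                   # non-blocking inside an island
inter_island_per_node = LINK_GBITS / PRUNING_FACTOR  # shared uplinks between islands

print(f"intra-island: {intra_island_per_node:.0f} Gbit/s per node")
print(f"inter-island: {inter_island_per_node:.0f} Gbit/s per node")

The placement consequence is that a tightly coupled job scheduled entirely within one island never sees the pruned links.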

Page 8: LRZ SuperMUC One Year of Operation

Option: Direct Water Cooling (DWC)

• Direct cooling with water at the heat source (95% of the heat), no media change
• Less noise in the machine room: no spinning fans
• Cooling of the system without need for chillers: PUE 1.05 – 1.10
• Node inlet temperature between 18 – 45°C
  – Inlet temperature can vary with the seasons, based on the achievable temperature
• Lower and more stable CPU core temperature (max. 70°C)
  – About 10% less leakage current compared to air-cooled systems
• Similar pipework requirements as rear-door heat exchangers*

Clear advantages:
• Less energy consumption for cooling the system (about 40%)
• Less energy consumption by the CPUs (10%)
• Enables use of Turbo Mode with all cores
• Better TCO and higher efficiency of compute power usage


(*) see ASHRAE Technical Committee 9.9 Whitepaper: 2011 Thermal Guidelines for Liquid Cooled Data Processing Environments

Page 9: LRZ SuperMUC One Year of Operation

Option: Energy Aware Scheduling (EAS)

• Policy-based steering of the node CPU clock for user batch jobs
  – The batch scheduler estimates application run time based on the clock rate
  – Admin-defined policies determine the node clock rate at application execution time
  – Currently unused nodes are powered down
• EAS is part of IBM LoadLeveler and xCAT and will be ported to LSF

Clear advantages:
• Less power consumption for applications that cannot gain performance from high clock rates
• Reduced power consumption of idle nodes
• Observation of operational limits

Example SuperMUC (a policy sketch follows after this list):
• Default clock rate is 2.2 GHz
• Higher rates up to 2.7 GHz (or Turbo Mode) for applications that will gain performance
• The LINPACK measurement was done with Intel Turbo Mode

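A minimal sketch of the policy idea, assuming illustrative names and thresholds (this is not the LoadLeveler implementation): estimate how run time scales with clock rate, and only raise the clock when the job actually profits.

# Illustrative sketch of energy-aware clock selection -- names and
# numbers are assumptions, not the LoadLeveler/xCAT implementation.
DEFAULT_GHZ = 2.2
POLICIES = {
    "minimize_energy": 2.2,        # stay at the default clock
    "maximize_performance": 2.7,   # allow the highest non-turbo clock
}

def estimate_runtime(base_runtime_s, base_ghz, target_ghz, cpu_bound_fraction):
    """Only the CPU-bound fraction of a job scales with frequency; the
    memory-, I/O- and communication-bound remainder does not."""
    scaled = cpu_bound_fraction * base_runtime_s * (base_ghz / target_ghz)
    flat = (1.0 - cpu_bound_fraction) * base_runtime_s
    return scaled + flat

def pick_clock(policy, base_runtime_s, cpu_bound_fraction, min_speedup=1.1):
    target = POLICIES.get(policy, DEFAULT_GHZ)
    t_default = base_runtime_s
    t_target = estimate_runtime(base_runtime_s, DEFAULT_GHZ, target, cpu_bound_fraction)
    # Raise the clock only if the job gains enough performance from it.
    return target if t_default / t_target >= min_speedup else DEFAULT_GHZ

For a 90% CPU-bound job, pick_clock("maximize_performance", 3600, 0.9) predicts a ~1.2x speedup and returns 2.7; a 30% CPU-bound job stays at the 2.2 GHz default.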

Page 10: LRZ SuperMUC One Year of Operation

LINPACK on May 31, 2012

0:================================================================================

0:T/V N NB P Q Time Gflops

0:--------------------------------------------------------------------------------

0:WR01C2R4 5201920 160 256 576 32387.59 2.897e+06

0:--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-

0:Max aggregated wall time rfact . . . : 7.38

0:+ Max aggregated wall time pfact . . : 6.52

0:+ Max aggregated wall time mxswp . . : 6.31

0:Max aggregated wall time update . . : 31901.91

0:+ Max aggregated wall time laswp . . : 4309.07

0:Max aggregated wall time up tr sv . : 10.35

0:--------------------------------------------------------------------------------

0:||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0006563 ...... PASSED

0:============================================================================

0:

0:Finished 1 tests with the following results:

0: 1 tests completed and passed residual checks,

0: 0 tests completed and failed residual checks,

0: 0 tests skipped because of illegal input values.

0:----------------------------------------------------------------------------

0:

0:End of Tests.

0:============================================================================

------- done running linpack at 05-31-12--06:43:59 --------
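The reported 2.897e+06 Gflop/s follows directly from the figures in the log, using the standard HPL operation count of 2/3·N³ + 2·N²:

# Recompute the HPL result from the log above.
N = 5201920           # problem size (column N of the T/V line)
time_s = 32387.59     # wall clock time in seconds

flops = (2.0 / 3.0) * N**3 + 2.0 * N**2        # standard HPL operation count
print(f"{flops / time_s / 1e9:.3e} Gflop/s")   # ~2.897e+06, matching the log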

Page 11: LRZ SuperMUC One Year of Operation

Value of the SuperMUC System

• SuperMUC represents a tightly integrated, innovative solution, with a value proposition that reduces the client's total cost of ownership and addresses the growth areas of x86 and green computing.
• The energy and cooling efficiency of the hardware and the HPC software stack provides a quantifiable cost reduction: PUE 1.1 (SuperMUC incl. cooling)
  – Holistic view of the supercomputer hardware, software and applications
  – Client running costs reduced by 40% compared to a standard HPC system of similar size
• Scalability, functionality and quality of hardware, software and service provide a qualitative cost advantage
  – Fewer problems, by leveraging experience from other platforms
  – Faster problem resolution, by integrating development and support
  – Client running costs reduced through less downtime
  – Client running costs reduced through less management effort

Page 12: LRZ SuperMUC One Year of Operation

One Year of Operation

• Direct water cooling is reliable and stable, in summer and winter
  – LRZ decision: the inlet temperature varies with the outdoor temperature between 18 – 45°C
• The energy-saving goal of LRZ and IBM has been achieved
• Hardware failures are below the expected range
• The island-based architecture for InfiniBand, xCAT, GPFS and LoadLeveler proves its scalability for large systems
• Power consumption metering based on iPDUs, down to outlet level (a sketch follows below)
• Hardware and service monitoring based on Icinga
• Automated call home on failure for all system parts
• Log file analysis based on Splunk

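What outlet-level metering can look like in practice, sketched with pysnmp (the OID is a placeholder, real iPDUs expose outlet power through vendor-specific MIBs, and this is not the tooling used at LRZ):

from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

# Placeholder OID -- look up the actual outlet-power OID in the
# vendor MIB of the PDU in question.
OUTLET_POWER_OID = "1.3.6.1.4.1.99999.1.1"

def read_outlet_watts(pdu_host, outlet):
    """Read the active power of one outlet from an iPDU via SNMPv2c."""
    err_indication, err_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData("public", mpModel=1),
        UdpTransportTarget((pdu_host, 161)),
        ContextData(),
        ObjectType(ObjectIdentity(f"{OUTLET_POWER_OID}.{outlet}")),
    ))
    if err_indication or err_status:
        raise RuntimeError(str(err_indication or err_status))
    return int(var_binds[0][1])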

Page 13: LRZ SuperMUC One Year of Operation