
  • LRZ SuperMUC – One Year of Operation

    IBM Systems & Technology Group – IBM Deep Computing

    13.03.2013 Klaus Gottschalk – IBM HPC Architect

    © 2013 IBM Corporation

  • Leibniz Computing Center's new HPC System is now installed and operational

  • SuperMUC Technical Highlights

    • 3 PFLOP/s Computer in Germany in the Gauß-Center

    • 9414 Nodes with 2 Intel Sandy Bridge EP each

    • 209 Nodes with 4 Intel Westmere EX each

    • 3 PFLOP/s Peak Performance (a rough cross-check follows below)

    • 327 TB Memory

    • InfiniBand Interconnect

    • Large File Space for multiple purposes

    • 10 PByte File Space based on IBM GPFS with 200 GByte/s aggregated I/O Bandwidth

    • 2 PByte NAS Storage with 10 GByte/s aggregated I/O Bandwidth

    • No GPGPUs or other Accelerator Technology

    • Innovative Technology for Energy Efficient Computing

    • Hot Water Cooling

    • Energy Aware Scheduling

    • Most Energy- and Cooling-Efficient high-end HPC System: PUE 1.1
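
    As a rough cross-check of the peak-performance figure referenced above, the sketch below assumes 8-core Sandy Bridge EP sockets, a 2.7 GHz AVX clock and 8 double-precision FLOPs per core per cycle. These per-socket parameters are assumptions for illustration, not values stated on the slide.

    # Back-of-the-envelope peak estimate for the Sandy Bridge partition.
    # ASSUMED per-socket parameters (not stated on the slide): 8 cores,
    # 2.7 GHz AVX clock, 8 double-precision FLOPs per core per cycle.
    nodes = 9414           # Sandy Bridge EP nodes from the slide
    sockets_per_node = 2   # 2 Intel Sandy Bridge EP per node, from the slide
    cores_per_socket = 8   # assumption
    flops_per_cycle = 8    # assumption: 4-wide DP add + 4-wide DP multiply (AVX)
    clock_ghz = 2.7        # assumption

    peak_gflops = nodes * sockets_per_node * cores_per_socket * flops_per_cycle * clock_ghz
    print(f"Estimated peak: {peak_gflops / 1e6:.2f} PFLOP/s")   # ~3.25 PFLOP/s

    Under these assumptions the estimate lands in the same ballpark as the 3 PFLOP/s quoted above.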

  • SuperMUC Energy Efficiency Goals

    • Stable, highly scalable, efficient hardware based on standard x86 components

    • Save 40% of energy compared to air-cooled HPC systems

    • Hot water cooling allowing for free cooling all year round

    • Using standard components

    • Easily serviceable

    • Frequency-controlled nodes

    • Optimize application energy consumption during use

    • Energy saving if not in use

    • Power Aware Job Scheduling

    • Run applications at the optimal clock rate – according to predefined policies

    • Deliver an energy report after the job run

    [Figure: End of a LINPACK run on 240 nodes]


  • iDataPlex dx360 M4 water cooled – with Intel Sandy Bridge CPU

  • Island Architecture – Multicluster GPFS

    Martin W Hiegl / Uwe Tron 09.09.2012

    [Diagram: compute islands (Island 2 … Island 18), each with its own core switch and compute nodes N1–N72, connected through spine switches 1–126 to an I/O island; the I/O island holds the GPFS server cluster and SFA 12k storage controllers, attached over point-to-point InfiniBand.]

  • Option: Direct Water Cooling (DWC)

    • Direct water cooling at the heat source (95%) – no media change

    • Less noise in the machine room – no spinning fans

    • Cooling of the system without need for chillers – PUE 1.05 – 1.10 (see the PUE sketch below)

    • Node inlet temperature between 18 – 45 °C

    • Inlet temperature can vary with the seasons based on the achievable temperature

    • Lower and more stable CPU core temperature (max 70 °C)

    • About 10% less leakage current compared to air-cooled systems

    • Similar pipework requirements as rear door heat exchangers*

    Clear Advantages:

    • Less energy consumption for cooling of the system (about 40%)

    • Less energy consumption of the CPU (10%)

    • Enables usage of Turbo Mode with all cores

    • Better TCO and higher efficiency of compute power usage

    (*) see ASHRAE Technical Committee 9.9 Whitepaper: 2011 Thermal Guidelines for Liquid Cooled Data Processing Environments
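
    To make the PUE figures above concrete, here is a minimal sketch of the standard PUE relation (total facility power divided by IT power). The 1 MW IT load and the PUE 1.5 comparison point are illustrative assumptions, not SuperMUC measurements.

    # PUE = total facility power / IT equipment power.
    # The 1 MW IT load and the PUE 1.5 reference are illustrative assumptions.
    def overhead_kw(it_load_kw: float, pue: float) -> float:
        """Power spent on cooling and other facility overhead for a given PUE."""
        return it_load_kw * (pue - 1.0)

    it_load_kw = 1000.0  # assumed 1 MW of IT load
    for pue in (1.05, 1.10, 1.50):
        print(f"PUE {pue:.2f}: {overhead_kw(it_load_kw, pue):.0f} kW overhead")

    At PUE 1.05–1.10 the overhead per assumed megawatt of IT load is 50–100 kW, compared with 500 kW for a hypothetical air-cooled site at PUE 1.5.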

  • Option: Energy Aware Scheduling (EAS)

    • Policy-based steering of the node CPU clock for user batch jobs (see the conceptual sketch below)

    • The batch scheduler estimates application run time based on the clock rate

    • Admin-defined policies determine the node clock rate at application execution time

    • Currently unused nodes are powered down

    • EAS is part of IBM LoadLeveler and xCAT and will be ported to LSF

    Clear Advantages:

    • Less power consumption for applications that cannot gain performance from high clock rates

    • Reduced power consumption of idle nodes

    • Observation of operational limits

    Example SuperMUC:

    • Default clock rate is 2.2 GHz

    • Higher rates up to 2.7 GHz (or Turbo Mode) for applications that gain performance from them

    • LINPACK measurement done with Intel Turbo Mode
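
    The policy logic itself lives in IBM LoadLeveler and xCAT; the sketch referenced in the first bullet above is only a conceptual Python illustration of how such a policy could trade predicted run time against node power to pick a clock rate. The frequency table, run-time model and power model are all illustrative assumptions.

    # Conceptual sketch of an energy-aware frequency choice (not LoadLeveler code).
    # For each allowed clock rate, estimate run time and node power, then pick the
    # frequency that minimises energy within an admin-defined slowdown limit.
    FREQS_GHZ = [2.2, 2.3, 2.5, 2.7]   # assumed admin-defined policy range
    BASE_FREQ = 2.7                    # reference clock for the run-time estimate

    def predicted_runtime(base_runtime_s: float, freq: float, cpu_bound: float) -> float:
        """Toy model: only the CPU-bound fraction of the job scales with the clock."""
        return base_runtime_s * (cpu_bound * (BASE_FREQ / freq) + (1.0 - cpu_bound))

    def node_power_w(freq: float) -> float:
        """Toy power model: power grows steeply with frequency (assumed curve)."""
        return 200.0 + 150.0 * (freq / BASE_FREQ) ** 3

    def pick_frequency(base_runtime_s: float, cpu_bound: float, max_slowdown: float = 1.10) -> float:
        """Choose the clock rate that minimises energy within the slowdown limit."""
        best = None
        for f in FREQS_GHZ:
            t = predicted_runtime(base_runtime_s, f, cpu_bound)
            if t > base_runtime_s * max_slowdown:
                continue
            energy_j = t * node_power_w(f)
            if best is None or energy_j < best[1]:
                best = (f, energy_j)
        return best[0] if best else BASE_FREQ

    # A largely memory-bound job (30% CPU-bound) ends up at the 2.2 GHz default.
    print(pick_frequency(base_runtime_s=3600.0, cpu_bound=0.3))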

  • LINPACK on May 31, 2012

    0:================================================================================

    0:T/V N NB P Q Time Gflops

    0:--------------------------------------------------------------------------------

    0:WR01C2R4 5201920 160 256 576 32387.59 2.897e+06

    0:--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-

    0:Max aggregated wall time rfact . . . : 7.38

    0:+ Max aggregated wall time pfact . . : 6.52

    0:+ Max aggregated wall time mxswp . . : 6.31

    0:Max aggregated wall time update . . : 31901.91

    0:+ Max aggregated wall time laswp . . : 4309.07

    0:Max aggregated wall time up tr sv . : 10.35

    0:--------------------------------------------------------------------------------

    0:||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0006563 ...... PASSED

    0:============================================================================

    0:

    0:Finished 1 tests with the following results:

    0: 1 tests completed and passed residual checks,

    0: 0 tests completed and failed residual checks,

    0: 0 tests skipped because of illegal input values.

    0:----------------------------------------------------------------------------

    0:

    0:End of Tests.

    0:============================================================================

    ------- done running linpack at 05-31-12--06:43:59 --------
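
    As a quick sanity check, the 2.897e+06 Gflops in the WR01C2R4 result line above can be reproduced from N and the wall time using the standard HPL operation count of roughly 2/3·N³ + 2·N²:

    # Recompute the Gflops figure from the HPL result line above.
    N = 5_201_920        # problem size from the result line
    time_s = 32387.59    # wall time in seconds from the result line

    flops = (2.0 / 3.0) * N**3 + 2.0 * N**2   # standard HPL operation count
    print(f"{flops / time_s / 1e9:.3e} Gflops")   # ~2.897e+06, matching the output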

  • Value of the SuperMUC System

    • SuperMUC represents a tightly integrated, innovative solution with a value proposition that reduces the client's total cost of ownership and addresses the growth areas of x86 and green computing.

    • Energy and cooling efficiency characteristics of the hardware and HPC software stack provide a quantifiable cost reduction – PUE 1.1 (SuperMUC incl. cooling)

    • Holistic view of the supercomputer hardware, software and applications

    • Client's running cost reduced by 40% compared to a standard HPC system of similar size

    • Scalability, functionality and quality of hardware, software and service provide a qualifiable cost advantage

    • Fewer problems by leveraging experience from other platforms

    • Faster problem resolution through integrated development and support

    • Client's running cost reduced by less downtime

    • Client's running cost reduced by less management effort

  • One Year of Operation

    • Direct water cooling is reliable and stable – summer and winter

    • LRZ decision: inlet temperature varies with the outdoor temperature between 18 – 45 °C

    • The energy saving goal of LRZ and IBM has been achieved

    • Hardware failures are below the expected range

    • The island-based architecture for InfiniBand, xCAT, GPFS and LoadLeveler proves its scalability for large systems

    • Power consumption metering based on iPDUs down to outlet level

    • Hardware and service monitoring based on Icinga

    • Automated call home on failure for all system parts

    • Log file analysis based on Splunk
