IBM LSF & HPC User Group @ SC18
© IBM Corporation 2018 1
Super Computing 18, MC04 Building your own mini-CORAL : Power Accelerated Computing Platform
IBM Systems Lab Services/ SC18 / November, 2018 / © 2018 IBM Corporation
Agenda
• IBM Power Accelerated Computing Platform requirements
• Structure of Power Accelerated Computing Platform
• Lessons learned deploying large CORAL HPC Clusters
• How to get started with Power Accelerated Computing Platform
• Discussion
IBM Power Accelerated Computing Platform
IBM Power ACP gives clients their own AI installation based upon the world’s most powerful and smartest scientific supercomputer
Supports
• High Performance Computing (HPC)
• Artificial Intelligence (AI)
• Machine Learning / Deep Learning
Based upon IBM CORAL!
Natural markets: Research Labs, Universities, Government Labs, Military Research, Industry
Questions?
Complete Solutions for AI and Modern HPC
– CORAL Servers (POWER9 – IBM Power System AC922)
– Management Servers/Head Nodes
– Networking : Ethernet and IB
– Elastic Storage Server
– Linux and Software Development tools
– Pre-Sales/Install expert review by IBM Systems Lab Services
– Hardware Configuration assembly in IBM facility
– Software Installation and Configuration by IBM before delivery
– Installation and connectivity support with IBM Systems Lab Services
– Software Flexibility: HPC and/or PowerAI base or PowerAI Enterprise, and/or H2O
How Do I Deploy AI at my Company?
I want to run Workloads and Experiments on Summit!
I want to explore Quantum Computing
Power AI Reference Architecture: https://ibm.ent.box.com/s/8w75cdh6s4smgix7ckoh4yisn06h93iw
CORAL and Summit & Sierra
CORAL = Collaboration of Oak Ridge, Argonne & Lawrence Livermore National Labs
Summit, Ascent and Peak are cluster names at Oak Ridge
Sierra, Lassen, Ansel and Butte are cluster names at Lawrence Livermore
IBM Power System AC922
• Designed for the AI Era: Architected for the modern analytics and AI workloads that fuel insights
• An Acceleration Superhighway: Unleash state of the art IO and accelerated computing potential in the post “CPU-only” era
• Delivering Enterprise-Class AI: Flatten the time to AI value curve by accelerating the journey to build, train, and infer deep neural networks
The POWER9 processor
• 17 levels of metal
• >15 miles of wire
• 8 billion transistors
• 4 GHz peak frequency
• >24 billion vias
• 7 TB/s on-chip bandwidth
• ~1 TB/s bandwidth into the chip
• 1st chip with PCIe Gen4
• 2x core performance vs x86
• 1.5x performance vs POWER8
• 2x more memory vs POWER8
• 1.4x more memory bandwidth vs x86
Watching Processors Evolve!
HPC analyst Addison Snell (CEO of Intersect360 Research) commented by email:
“One, Power9 has excellent memory bandwidth and performance.
Two, it is a great platform for attaching accelerators or co-processors. It’s an odd statement of direction, but maybe a visionary one,
essentially saying a processor isn’t about computation per se, but rather it’s about feeding data to other computational elements.”
High level System Overview
2-Socket, 2U Packaging
32, 40 (air) or 36,44 (water) P9 Processor cores
4 NVIDIA Volta V100 NVLink2 GPUs
2 TB Memory (16x - 128GB DIMMs)
4 PCIe Gen4 Slots
2x SFF (HDD/SSD), SATA, Up to 7.7 TB storage
Supports 1.6, 3.2 and 6.4TB NVMe Adapters
Redundant Hot Swap Power Supplies and Fans
Default 3 year 9x5 warranty, 100% CRU
IBM Power System AC922 - POWER9 with increased GPU and IO bandwidth for differentiation
Realize unprecedented performance and application gains with POWER9 and NVLink 2.0
• 2 POWER9 CPUs and up to 4 “Volta” NVLink 2.0 GPUs in a versatile 2U Linux server
• PCIe Gen4 bus has double I/O Bandwidth vs. PCIe Gen3
• CPU (Turbo) and GPU (Boost) modes are enabled to improve data center efficiency while keeping performance at high levels (3.3 GHz air cooled / 3.45 GHz water cooled).
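The "double I/O bandwidth" point for PCIe Gen4 can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the published per-lane signalling rates (8 GT/s for Gen3, 16 GT/s for Gen4) and 128b/130b line encoding, and the function names are made up for this example rather than taken from the slides.

```python
# Per-lane signalling rates in GT/s; both generations use 128b/130b encoding.
GT_PER_S = {"gen3": 8.0, "gen4": 16.0}
ENCODING_EFFICIENCY = 128 / 130


def lane_bandwidth_gbytes(gen: str) -> float:
    """Usable bandwidth of one lane in one direction, in GB/s."""
    return GT_PER_S[gen] * ENCODING_EFFICIENCY / 8  # 8 bits per byte


def link_bandwidth_gbytes(gen: str, lanes: int = 16) -> float:
    """Usable bandwidth of a multi-lane link in one direction, in GB/s."""
    return lane_bandwidth_gbytes(gen) * lanes


if __name__ == "__main__":
    for gen in ("gen3", "gen4"):
        print(f"{gen} x16: {link_bandwidth_gbytes(gen):.2f} GB/s per direction")
```

Under these assumptions an x16 slot moves roughly 15.75 GB/s per direction on Gen3 and roughly 31.5 GB/s on Gen4, which is where the factor of two comes from.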
IBM Spectrum LSF Suites: Powerful Workload Management
The suite delivers:
• Enhanced Utilization of assets through effective scheduling and sharing policies
• Enhanced User Productivity through ease of use, accessibility and simplification
• Operational Efficiency through insight of how the HPC environment is being used
Comprehensive GPU, Container and Hybrid Cloud Support
The LSF Suite for HPC is available at no charge via the IBM Academic Initiative
AI Changes Everything for Data
Diversity of Data
– Local, HDFS, NFS, POSIX, Cloud
Amount of Data
– A Petabyte is just a starting point
Delivery of Data
– Gigabytes/Sec/Server to feed GPU
The IBM ESS Family
IBM Spectrum Scale with Elastic Storage Server Family
The Storage Built for AI!
• Over 1000 ESS Installed
• Over 300 ESS customers
• Over 5,000 Spectrum Scale clients
IBM is the World Leader in Software Defined Storage Environments
Five 9’s Reliability!
ESS Installation at ORNL
77 ESS Systems delivering:
• Single Namespace up to 250 Petabytes
• 2.5 TB/s large block sequential IO performance
• 2.6M file creates/sec for 32KB files in unique directories
• 50K file creates/sec to single shared directory
• Spectrum Scale RAID with declustered erasure coding
• 16 GB/Second of Data I/O to a Single Server
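To put the aggregate ORNL figures in per-building-block terms, the totals can be divided across the 77 ESS systems. This is a rough sketch only: it assumes capacity and bandwidth are spread evenly across the systems and uses decimal units (1 TB/s = 1000 GB/s), neither of which the slide states.

```python
# Aggregate figures quoted on the slide.
ESS_SYSTEMS = 77
NAMESPACE_PB = 250.0   # single namespace, petabytes
SEQ_BW_TBPS = 2.5      # large block sequential I/O, TB/s


def per_system(total: float, systems: int = ESS_SYSTEMS) -> float:
    """Even share of an aggregate figure for one ESS system (assumption)."""
    return total / systems


capacity_pb = per_system(NAMESPACE_PB)
bandwidth_gbps = per_system(SEQ_BW_TBPS * 1000)  # convert TB/s to GB/s

if __name__ == "__main__":
    print(f"Capacity per ESS:  {capacity_pb:.2f} PB")
    print(f"Bandwidth per ESS: {bandwidth_gbps:.1f} GB/s")
```

On that even-split assumption each ESS contributes about 3.25 PB and about 32.5 GB/s of sequential bandwidth.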
IBM Elastic Storage Server (ESS) Family
Capacity models (ESS 5U84 storage enclosures):
• Model GL1S: 1 Enclosure, 9U; 82 NL-SAS, 2 SSD
• Model GL2S: 2 Enclosures, 12U; 166 NL-SAS, 2 SSD
• Model GL4S: 4 Enclosures, 20U; 334 NL-SAS, 2 SSD
• Model GL6S: 6 Enclosures, 28U; 502 NL-SAS, 2 SSD
Speed models:
• Model GS1S: 24 SSD
• Model GS2S: 48 SSD
• Model GS4S: 96 SSD
Hybrid models:
• Model GH14S: 1 2U24 SSD Enclosure, 4 5U84 HDD Enclosures; 334 NL-SAS, 24 SSD
• Model GH24S: 2 2U24 SSD Enclosures, 4 5U84 HDD Enclosures; 334 NL-SAS, 48 SSD
Throughput figures shown on the slide:
• Capacity models: 6 GB/s, 12 GB/s, 24 GB/s, 36 GB/s
• Speed models: 14 GB/s, 26 GB/s, 40 GB/s
• Hybrid models: 38 GB/s, 40 GB/s
New ESS C-Series Maximum Density with Room to Upgrade and Grow!
New C-Series models (4U106 storage enclosures):
• New! Model GL1C: 1 Enclosure, 8U; 104 NL-SAS, 2 SSD; 1.0 PB Disk
• New! Model GL2C: 2 Enclosures, 12U; 210 NL-SAS, 2 SSD; 2.0 PB Disk
• New! Model GL4C: 4 Enclosures, 16U; 432 NL-SAS, 2 SSD; 4.2 PB Disk
• New! Model GL6C: 6 Enclosures, 28U; 634 NL-SAS, 2 SSD; 6.3 PB Disk
Power Accelerated Computing Platform – Sample Building Block View
• Compute: AC922, 2 or 4 GPUs
• Management: L922 or AC922
• Storage: Elastic Storage Server (5147 & 5148)
• Networking: Mellanox Switches
• Racks: 1-4 S42 Racks
AC922: The World’s Premier AI Servers
• Featured in ORNL and LLNL CORAL Installs
• ExaOps of demonstrated AI Performance
• Able to Process more than 20 GB/s of Data
• Add Servers as Workloads Grow!
IBM Elastic Storage Server for AI Workloads
• Density meets Performance
• High Density Petabytes in Minimum Space
• Featured in ORNL and LLNL Installs
• Grow Performance by Scaling Up or Out!
• Supports IB and Ethernet!
PowerAI: Open-Source Based Enterprise AI Platform
• Open Source Frameworks: Supported Distribution (Caffe, SnapML)
• Developer Ease-of-Use Tools
• Faster Training Times via HW & SW Performance Optimizations
• Infrastructure: GPU-Accelerated Power Servers, Storage
Integrated & Supported AI Platform
• 3-4x Speedup for AI Training
• Ease of Use Tools for Data Scientists
5x Faster Data Communication with Unique CPU-GPU NVLink High-Speed Connection
IBM Power System AC922 Deep Learning Server (4-GPU Config)
[Diagram: two POWER9 CPUs, each with 1 TB of memory at 170 GB/s and two V100 GPUs attached over NVLink at 150 GB/s]
• Store Large Models in System Memory
• Operate on One Layer at a Time
• Fast Transfer via NVIDIA NVLink
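The idea of keeping a large model in system memory and moving one layer at a time to the GPU can be illustrated with a simple transfer-time estimate. This is a back-of-the-envelope sketch, not how PowerAI implements Large Model Support: the layer sizes are hypothetical, and the 32 GB/s baseline is an assumption (a PCIe Gen3 x16 link, both directions combined), since the slide does not say what "5x" is measured against.

```python
NVLINK_GBPS = 150.0     # CPU-GPU NVLink figure from the slide
BASELINE_GBPS = 32.0    # assumption: PCIe Gen3 x16, both directions combined


def transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Time to move one block of data over a link at its peak rate."""
    return size_gb / link_gbps


def stream_layers(layer_sizes_gb, link_gbps: float) -> float:
    """Total time to move every layer to the GPU once, one layer at a time."""
    return sum(transfer_seconds(size, link_gbps) for size in layer_sizes_gb)


# Hypothetical per-layer footprints in GB (9 GB in total).
layers = [1.5, 3.0, 3.0, 1.5]

nvlink_time = stream_layers(layers, NVLINK_GBPS)
baseline_time = stream_layers(layers, BASELINE_GBPS)

if __name__ == "__main__":
    print(f"NVLink:   {nvlink_time * 1000:.1f} ms")
    print(f"Baseline: {baseline_time * 1000:.1f} ms")
    print(f"Ratio:    {baseline_time / nvlink_time:.2f}x")
```

With these assumed numbers the ratio comes out a little under 5x. Real transfers rarely reach peak link rates, so treat the result as an upper bound on the benefit.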
PowerAI
• PowerAI Enterprise
• Deep Learning Impact (DLI) Module: Data & Model Management, ETL, Visualize, Advise
• PowerAI: Open Source ML Frameworks
• Large Model Support (LMS)
• Distributed Deep Learning (DDL)
• Auto-HyperParameter Tuning
• SnapML
• PowerAI Vision: Auto-ML for Images & Video
• Accelerated Infrastructure: Accelerated Servers, Storage
• Simplified Management
• Faster Time to Results
• Increased Resource Utilization
• Enterprise Solution
Power AI Enterprise Project Examples
Industry: Scenario
• Banking: Credit Scoring; Face Masking Detection; Stock Index Futures Prediction; Research Exploration; OCR recognition correction
• Securities: Company Logo and name auto matching; AI on cloud; Hand writing recognition
• Insurance: Work order auto clustering/handling
• Telecom: Network cabling detection; Service halt handling
• Manufacturing: LED Panel defect inspection; Steel quality classification; Wafer Flaw detection
• Energy: Power transmission line safety detection
• Healthcare: Pathologic analysis
• Retail: Retail market analysis via image recognition
• Public: Satellite photo fault reorganization
• Transportation: Train & subway defect inspection
PowerAI Vision: “Point-and-Click” AI for Images & Video
• Label Image or Video Data
• Auto-Train AI Model
• Package & Deploy AI Model
PowerAI Vision Project Examples
Defect Identification
• Wafer Fab Inspection – Electronics
• Cam Shaft Inspection – Automotive
• Seat Inspection – Automotive
• PCBA Inspection – Electronics
• Utility disk Inspection – Energy/Utilities
• Mainframe assembly inspection – Electronics
• Ceramic capacitor – Electronics
• Defective Components – Oil/Gas
Facial / Object Recognition
• Safety/Security – Transit, Banking, Gaming
• Building Infrastructure – Building/Construction
• Service – Retail, Food
• Traffic – Municipal
Power Accelerated Computing Platform – Building Blocks
Hardware Building Blocks
• Compute Nodes: AC922 (8335-GTG), 2 or 4 GPUs; 4 – 15 Compute Servers*
• Management / Login Servers: 9008-22L or 8335-GTG; 1 – 3 servers (1st rack), used as xCAT / Manager / Login node and as ESS mgmt. node or protocol nodes
• Storage: Elastic Storage Server (5147 & 5148); 0-1 ESS per cluster (optional, 1st rack)
• Switches: Mellanox IB and Ethernet switches (shared w/ESS); IB TOR switch and Enet TOR switch
• Racks: 1-4 S42 Racks
* 7 max in 1st rack, 15 max in 2nd - 4th
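The rack limits above lend themselves to a quick sizing check. The sketch below is one reading of the slide, not an official configurator rule: it assumes the first rack holds at most 7 compute servers (it also carries management, storage and switches), each additional rack holds at most 15, a cluster has at most 4 racks, and 4 is the minimum number of compute servers. All names are hypothetical.

```python
import math

FIRST_RACK_MAX = 7    # compute servers in the 1st rack
OTHER_RACK_MAX = 15   # compute servers in each of racks 2-4
MAX_RACKS = 4
MIN_COMPUTE = 4       # assumed cluster minimum, read from "4 - 15"


def max_compute_servers(racks: int = MAX_RACKS) -> int:
    """Largest number of compute servers that fits in the given rack count."""
    return FIRST_RACK_MAX + OTHER_RACK_MAX * (racks - 1)


def racks_needed(compute_servers: int) -> int:
    """Smallest number of S42 racks for the requested compute servers."""
    if not MIN_COMPUTE <= compute_servers <= max_compute_servers():
        raise ValueError(
            f"expected {MIN_COMPUTE}-{max_compute_servers()} compute servers"
        )
    if compute_servers <= FIRST_RACK_MAX:
        return 1
    extra = compute_servers - FIRST_RACK_MAX
    return 1 + math.ceil(extra / OTHER_RACK_MAX)


if __name__ == "__main__":
    for n in (4, 7, 8, 22, 23, 52):
        print(f"{n:2d} compute servers -> {racks_needed(n)} rack(s)")
```

Under this reading a four-rack configuration tops out at 52 compute servers. Larger requests fall outside the predefined building blocks.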
Power Accelerated Computing Platform
Configurable HW to simplify creation of “CORAL Like” scale out clusters
- Configurable to support HPC, Power AI, and in the future, Quantum Simulator stacks
- Simplifies ability to configure complex configs for scale out infrastructure
- Software customization & fully rack integrated in IBM manufacturing
- Determined in IBM Systems Lab Services Implementation Design Workshop
- Optional On-Site network Integration and knowledge transfer available
- Option to assemble in Rochester, MN Pre-build lab if customer wants to use their own switches or racks, or desires Water Cooled AC922 Compute processors
Building blocks:
• Storage: Elastic Storage Server (Optional)
• Compute: AC922 (2 or 4 GPU) 8335-GTG; Air Cooled Only; Same Processors as in CORAL Servers
• Management: L922+ and/or AC922 (0,2,4 GPU) 8335-GTG
• Switches: Mellanox; 100Gb InfiniBand; 40Gb Ethernet, 10Gb Ethernet, 1Gb Ethernet
• Rack: One to four 42U Racks (S42)
If you really need more, let us know!
Software that can be customized at IBM Manufacturing *
* Assuming customer has required licenses (design workshop)
Base
• Red Hat OS 7.5 (5639-RLE)
• IBM Spectrum Scale Client
• Mellanox OFED driver (Mellanox)
• NVIDIA CUDA Software (Nvidia)
AI
• PowerAI Base (5765-PAI)
• PowerAI Enterprise (5765-AIE): Spectrum Conductor, DL Impact, PowerAI
• PowerAI Vision (5737-H10)
• H2O Driverless AI (5639-AIH)
HPC
• IBM Spectrum LSF Suite (5737-F30)
• IBM Compilers – XLC/C++/Fortran, gcc
• ESSL (5765-L61)
• IBM Spectrum MPI (5725-G83)
• Performance Toolkit (5765-PD2)
• xCAT support (5771-CAT)
Optional Open Source for P9
• Optional frameworks/levels as identified in the Implementation Design Workshop: Anaconda, Caffe, IBM Advanced Toolchain, Jupyter Notebook, Keras, Python, PyTorch, TensorFlow, xCAT, XGBoost (latest git code)
How do I get started?
• What use cases in my company will have payback?
• Who can help my company customize the software?
• Who can provide knowledge transfer to my personnel?
Cognitive Discovery Workshop: Helping you identify the right cognitive use cases
Objective: To provide an overview of Cognitive technologies, explore potential use cases and how they can be deployed to provide business value. The key focus is to identify potential use cases for a Proof of Concept project.
How’s it delivered? A 4-6 hour face-to-face workshop at the customer location, delivered by an IBM Cognitive Workshop team.
What’s the output? Potential use cases and an action plan to help the team select an appropriate Cognitive project.
Who should attend? Key IT resources, Data Scientist/Customer Data Architect, LOB (Business Sponsor), and any others the customer team feels are important to the discussion.
Detailed abstract: This session typically includes discussions on:
• Overview of industry and cross-industry use cases
• Discussion of Open Source Cognitive technologies such as TensorFlow, Caffe, Theano, Torch
• Discussion on data layer technologies such as Hadoop, NoSQL, NewSQL and relational DB technologies, and the importance of the end-to-end process (governance and data management)
• Discussion of customer-specific use cases, including feasibility assessment
• Develop an action plan to assist the customer to identify and justify Cognitive use cases (ROI or ROI factors)
• Identify infrastructure actions necessary to support a Cognitive project
Email: [email protected]
Online Request: https://ibm.biz/BdFfcV
Discovery Workshop
Workshop phases: Executive Session, Use Case Discovery, Business Case Development
• 9:00 – 9:15 am: Introductions and Review Workshop Objectives (Speaker: All; Audience: Execs, LOB, IT Liaisons)
• 9:15 – 10:45 am: Executive Session (Speaker: IBM; Audience: Execs, LOB, IT Liaisons)
  - What is AI
  - Art of the Possible
  - Short Demo – H2O
• 10:45 – 11:00 am: Break
• 11:00 – 11:45 am: Introduce Use Case Workshop (Audience: LOB, IT Liaisons)
  - Answering lingering Q&A
  - Each LOB department mission overview & focus areas
• 11:45 am – 12:30 pm: Industry Examples of Applied AI (Speaker: IBM/Client; Audience: LOB, IT Liaisons)
  - Group Discussion on applicability to Customer
• 12:30 – 1:00 pm: Lunch
• 1:00 – 2:30 pm: Discussion and Identification of Use Cases by LOB (Speaker: IBM/Client; Audience: LOB, IT Liaisons)
  - Feasibility and Impact of Use Cases
  - Identify High Interest and Highest Value Use Cases for Customer
• 2:30 – 2:45 pm: Break
• 2:45 – 4:00 pm: Develop Action Plan for Creation of Exec Proposal for High Value Use Cases (Speaker: IBM/Client; Audience: LOB, IT Liaison for identified use cases)
  - Use Case Pay Back, Cognitive Work Flow, Timeline, Data Strategy
  - Cognitive Skill Set, Data Strategy, POC/Trial Implementation steps
Power ACP – IBM Systems Lab Services
i. Implementation Design Workshop
- Develops information to enable majority of system implementation and tailoring to occur in IBM Manufacturing
- Done on customer site
- Note: This step is mandatory for enabling manufacturing SW preload
ii. Manufacturing Install: Hardware Racking, Software Customization in Manufacturing
iii. Network Integration & Knowledge Transfer on site
- Install, Configure & Verify software
- Optional network integration
- Done on customer site
- Billable to customer
- Knowledge Transfer on solution configuration
Contact us today
• Email us: [email protected]
• On the Web: www.ibm.com/it-infrastructure/services/lab-services
• PartnerWorld: www.ibm.com/partnerworld/systems/services/lab-services
IBM Systems Lab Services Implementation Design Workshop
Onsite customer workshop to enable a fast time-to-benefit implementation
- Develops information to enable majority of system implementation and tailoring to occur in IBM Manufacturing
- Documents software and infrastructure required to enable customer use cases
- Includes:
  - Data Center personnel to ensure client data center is ready for the Power Accelerated Computing Platform implementation
  - Customer personnel to determine customization of software like PowerAI Enterprise or PowerAI Vision or H2O
  - Client networking team to document customization needed for networking (IPs, VLANs, Uplinks, etc.)
- Creation of the implementation documentation that will be used for customization at IBM Manufacturing and for solution knowledge transfer
End Result at the Data Center
[Figure: two images labeled “Not This” and “This”]
Lessons learned with Summit on deploying large HPC Clusters
Deployment of Large HPC Clusters Lessons Learned
Architecture for scale is important. In our case, the network architecture was quite successful, and service nodes were used to distribute provisioning workload across many nodes.
Most of the effort in deploying a large cluster is in the infrastructure racks
Switch-level discovery becomes critical for large-scale rapid deployment of racks. Cabling verification and double-checking node positions became important.
It's important to establish a good, complete set of node-level diagnostics to run on every node in the cluster, and to run this set of diagnostics on a continuous basis
Establish a process and mechanism to deploy updates continuously to the cluster, for both software and firmware. This includes both stateful and stateless nodes.
Expect issues at scale with most tools
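The cabling-verification lesson can be sketched as a comparison between the planned switch-port layout and what switch-level discovery actually reports. This is an illustrative sketch only: the data structures, switch names and node names are hypothetical, and in practice the discovered map would come from whatever discovery tooling the site uses.

```python
def cabling_mismatches(expected, discovered):
    """Compare planned (switch, port) -> node wiring with discovered wiring.

    Returns a list of (port, expected_node, problem) tuples.
    """
    problems = []
    for port, node in sorted(expected.items()):
        seen = discovered.get(port)
        if seen is None:
            problems.append((port, node, "nothing discovered"))
        elif seen != node:
            problems.append((port, node, f"found {seen}"))
    # Ports that report a node but are not in the plan at all.
    for port in sorted(set(discovered) - set(expected)):
        problems.append((port, None, f"unexpected {discovered[port]}"))
    return problems


# Hypothetical plan and discovery result: two cables are swapped.
expected = {("sw01", 1): "c01n01", ("sw01", 2): "c01n02", ("sw01", 3): "c01n03"}
discovered = {("sw01", 1): "c01n01", ("sw01", 2): "c01n03", ("sw01", 3): "c01n02"}

if __name__ == "__main__":
    for port, node, problem in cabling_mismatches(expected, discovered):
        print(f"{port[0]} port {port[1]}: expected {node}, {problem}")
```

Running a check like this per rack as it is deployed catches swapped cables and misplaced nodes before provisioning starts.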
Performance Testing as you go
One of the final objectives for the cluster deployment was a submission to the Top 500
For Sierra, HPL (Linpack) became an extraordinarily valuable tool for exercising a cluster, and finding and diagnosing performance issues
We started small at the node level, and worked up to the rack level, row level and cluster level. In this way, we could identify performance issues at the micro level, rather than the macro level. When tuned well, node level and rack level performance was remarkably similar.
Node level HPL identifies CPU, GPU and memory performance issues
Rack-level HPL identifies InfiniBand performance issues both at individual nodes and at the rack-level IB switches
Row-level HPL identifies performance issues in some core IB switches. For example, we saw performance issues in the eastern end of one row in Sierra
Cluster-level HPL identifies issues at very large scale, and provides opportunities for novel approaches to HPL
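The "start small and work up" approach depends on spotting underperforming nodes before moving to the next level. A minimal sketch of that step is shown below, assuming node-level HPL scores have already been collected. The node names, scores and the 5% tolerance are hypothetical values chosen for illustration, not figures from the Sierra deployment.

```python
from statistics import median


def slow_nodes(score_by_node, tolerance=0.05):
    """Return nodes whose HPL score is more than `tolerance` below the median."""
    baseline = median(score_by_node.values())
    floor = baseline * (1 - tolerance)
    return sorted(node for node, score in score_by_node.items() if score < floor)


# Hypothetical relative HPL scores for the nodes of one rack.
rack_scores = {
    "n01": 100.0,
    "n02": 101.0,
    "n03": 99.0,
    "n04": 80.0,   # clearly below its peers
    "n05": 100.5,
}

if __name__ == "__main__":
    print("Investigate before rack-level HPL:", slow_nodes(rack_scores))
```

Using the median as the baseline keeps one or two bad nodes from dragging down the reference value. The same comparison could be repeated with rack-level results before moving on to rows.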
Power Accelerated Computing Platform
Getting Started
• IBM Cognitive Systems Solution Center (CSSC): Optional Discovery Workshop to identify use cases
  • Email: [email protected]
  • Submit Online Request: https://ibm.biz/BdFfcV
• IBM Systems Lab Services Three-Stage Approach
  i. Implementation Design Workshop
  ii. Manufacturing Customization
  iii. Data Center Integration
  • Email: [email protected] or Fred Robinson [email protected]
• Configurator: eConfig -> Power -> Solutions -> Power ACP
IBM Systems Lab Services
Proven expertise to help leaders plan, design, and implement the essential IT infrastructure for what comes next
Our team of 1,000+ consultants engages worldwide in pre- and post-sales opportunities in:
• Power Systems
• Storage and Software Defined Infrastructure
• IBM Z and LinuxONE
• HPC & Deep Learning
• Systems Consulting
• Migration Factory
• Technical Training and Events
Contact: [email protected]
www.ibm.com/it-infrastructure/services/lab-services
Fred Robinson [email protected]
IBM Power Accelerated Computing Platform
IBM Power ACP gives clients their own AI installation based upon the world’s most powerful and smartest scientific supercomputer
Includes everything required for success!
• Networking
• Servers
• Storage
• Software
• Services
• Support
Leverage CORAL success TODAY!
Notices and disclaimers
© Copyright IBM Corporation 2018
• © 2018 International Business Machines Corporation. No part of this document may be reproduced or transmitted in any form without written permission from IBM.
• U.S. Government Users Restricted Rights — use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
• Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. This document is distributed “as is” without any warranty, either express or implied. In no event, shall IBM be liable for any damage arising from the use of this information, including but not limited to, loss of data, business interruption, loss of profit or loss of opportunity. IBM products and services are warranted per the terms and conditions of the agreements under which they are provided.
• IBM products are manufactured from new parts or new and used parts. In some cases, a product may not be new and may have been previously installed. Regardless, our warranty terms apply.
• Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
• Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.
• References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business.
• Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.
• It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer follows any law.
Notices and disclaimers continued
© Copyright IBM Corporation 2018
• Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM expressly disclaims all warranties, expressed or implied, including but not limited to, the implied warranties of merchantability and fitness for a purpose.
• The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.
• IBM, the IBM logo, ibm.com and [names of other referenced IBM products and services used in the presentation] are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.