Super Computing 18, MC04: Building your own mini-CORAL

Post on 01-Oct-2021

Page 1: Super Computing 18, MC04 Building your own mini-CORAL

IBM LSF & HPC User Group @ SC18

© IBM Corporation 2018 1

Super Computing 18, MC04 Building your own mini-CORAL : Power Accelerated Computing Platform

IBM Systems Lab Services/ SC18 / November, 2018 / © 2018 IBM Corporation

Page 2: Super Computing 18, MC04 Building your own mini-CORAL

Agenda

• IBM Power Accelerated Computing Platform requirements
• Structure of Power Accelerated Computing Platform
• Lessons learned deploying large CORAL HPC Clusters
• How to get started with Power Accelerated Computing Platform
• Discussion

Page 3: Super Computing 18, MC04 Building your own mini-CORAL

IBM Power Accelerated Computing Platform

IBM Power ACP gives clients their own AI installation based upon the world’s most powerful and smartest scientific supercomputer

Supports:
• High Performance Computing (HPC)
• Artificial Intelligence (AI)
• Machine Learning / Deep Learning

Based upon IBM CORAL!

Natural markets: Research Labs, Universities, Government Labs, Military Research, Industry

Page 4: Super Computing 18, MC04 Building your own mini-CORAL

Questions?

Complete Solutions for AI and Modern HPC

– CORAL Servers (POWER9 – IBM Power System AC922)

– Management Servers/Head Nodes

– Networking : Ethernet and IB

– Elastic Storage Server

– Linux and Software Development tools

– Pre-Sales/Install expert review by IBM Systems Lab Services

– Hardware Configuration assembly in IBM facility

– Software Installation and Configuration by IBM before delivery

– Installation and connectivity support with IBM Systems Lab Services

– Software Flexibility: HPC and/or PowerAI base or PowerAI Enterprise, and/or H2O

• How Do I Deploy AI at my Company?
• I want to run Workloads and Experiments on Summit!
• I want to explore Quantum Computing

Power AI Reference Architecture: https://ibm.ent.box.com/s/8w75cdh6s4smgix7ckoh4yisn06h93iw

Page 5: Super Computing 18, MC04 Building your own mini-CORAL

CORAL and Summit & Sierra

CORAL = Collaboration of Oak Ridge, Argonne & Lawrence Livermore National Labs

Summit, Ascent and Peak are cluster names at Oak Ridge

Sierra, Lassen, Ansel and Butte are cluster names at Lawrence Livermore

Page 6: Super Computing 18, MC04 Building your own mini-CORAL

Page 7: Super Computing 18, MC04 Building your own mini-CORAL

Page 8: Super Computing 18, MC04 Building your own mini-CORAL

Page 9: Super Computing 18, MC04 Building your own mini-CORAL

IBM Power System AC922

Designed for the AI Era
Architected for the modern analytics and AI workloads that fuel insights

An Acceleration Superhighway
Unleash state of the art IO and accelerated computing potential in the post “CPU-only” era

Delivering Enterprise-Class AI
Flatten the time to AI value curve by accelerating the journey to build, train, and infer deep neural networks

Page 10: Super Computing 18, MC04 Building your own mini-CORAL

The POWER9 processor

• 17 levels of metal
• >15 miles of wire
• 8 billion transistors
• 4 GHz peak frequency
• >24 billion vias
• 7 TB/s on-chip bandwidth
• ~1 TB/s bandwidth into chip
• 1st chip with PCIe4

Comparison callouts, in slide order: 2x, 1.5x, 2x, 1.4x
Callout labels, in slide order: core performance vs x86; performance vs POWER8; more memory vs POWER8; more memory bandwidth vs x86

Page 11: Super Computing 18, MC04 Building your own mini-CORAL

Watching Processors Evolve!

HPC analyst Addison Snell (CEO of Intersect360 Research) … commented by email:

“One, Power9 has excellent memory bandwidth and performance. Two, it is a great platform for attaching accelerators or co-processors. It’s an odd statement of direction, but maybe a visionary one, essentially saying a processor isn’t about computation per se, but rather it’s about feeding data to other computational elements.”

IBM and Business Partner Use Only

Page 12: Super Computing 18, MC04 Building your own mini-CORAL

High level System Overview

2-Socket, 2U Packaging

32 or 40 (air-cooled) or 36 or 44 (water-cooled) POWER9 processor cores

4 NVIDIA Volta V100 NVLink2 GPUs

2 TB Memory (16 x 128 GB DIMMs)

4 PCIe Gen4 Slots

2x SFF (HDD/SSD), SATA, Up to 7.7 TB storage

Supports 1.6, 3.2 and 6.4TB NVMe Adapters

Redundant Hot Swap Power Supplies and Fans

Default 3 year 9x5 warranty, 100% CRU

IBM Power System AC922 - POWER9 with increased GPU and IO bandwidth for differentiation

Realize unprecedented performance and application gains with POWER9 and NVLink 2.0

• 2 POWER9 CPUs and up to 4 “Volta” NVLink 2.0 GPUs in a versatile 2U Linux server

• PCIe Gen4 bus has double I/O Bandwidth vs. PCIe Gen3

• CPU Turbo and GPU Boost are enabled, improving data center efficiency and keeping performance at high levels (3.3 GHz air-cooled / 3.45 GHz water-cooled).
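
The PCIe Gen4 claim in the list above can be checked with a little arithmetic. The following is a minimal sketch, assuming the published per-lane signaling rates (8 GT/s for Gen3, 16 GT/s for Gen4) and 128b/130b line encoding; neither figure appears on the slide, and the function name is illustrative.

```python
# Rough per-direction PCIe bandwidth, ignoring protocol overhead above the
# line encoding.
# Assumptions (not from the slide): 8 GT/s per lane for Gen3, 16 GT/s per lane
# for Gen4, both using 128b/130b encoding.
TRANSFER_RATE_GT_S = {"gen3": 8.0, "gen4": 16.0}
ENCODING_EFFICIENCY = 128 / 130


def pcie_bandwidth_gb_s(generation: str, lanes: int = 16) -> float:
    """Return approximate usable bandwidth in GB/s for one direction."""
    bits_per_second = (
        TRANSFER_RATE_GT_S[generation] * 1e9 * ENCODING_EFFICIENCY * lanes
    )
    return bits_per_second / 8 / 1e9


gen3 = pcie_bandwidth_gb_s("gen3")
gen4 = pcie_bandwidth_gb_s("gen4")
print(f"Gen3 x16: {gen3:.2f} GB/s, Gen4 x16: {gen4:.2f} GB/s, ratio: {gen4 / gen3:.1f}x")
```

Because both generations use the same encoding, the ratio depends only on the signaling rate, which is why the doubling holds for any lane count.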

Page 13: Super Computing 18, MC04 Building your own mini-CORAL

IBM Spectrum LSF Suites

Powerful Workload Management

The suite delivers:

• Enhanced Utilization of assets through effective scheduling and sharing policies

• Enhanced User Productivity through ease of use, accessibility and simplification

• Operational Efficiency through insight into how the HPC environment is being used

Comprehensive GPU, Container and Hybrid Cloud Support

The LSF Suite for HPC is available at no charge via the IBM Academic Initiative

Page 14: Super Computing 18, MC04 Building your own mini-CORAL

AI Changes Everything for Data

Diversity of Data

– Local, HDFS, NFS, POSIX, Cloud

Amount of Data

– A Petabyte is just a starting point

Delivery of Data

– Gigabytes/Sec/Server to feed GPU

Page 15: Super Computing 18, MC04 Building your own mini-CORAL

The IBM ESS Family

The Storage Built for AI!

IBM Spectrum Scale with Elastic Storage Server Family

• Over 1000 ESS Installed

• Over 300 ESS customers

• Over 5,000 Spectrum Scale clients

IBM is the World Leader in Software Defined Storage Environments

Five 9’s Reliability!

Page 16: Super Computing 18, MC04 Building your own mini-CORAL

ESS Installation at ORNL

77 ESS Systems delivering:

• Single Namespace up to 250 Petabytes

• 2.5 TB/s large block sequential IO performance

• 2.6M file creates/sec for 32KB files in unique directories

• 50K file creates/sec to single shared directory

• Spectrum Scale RAID with declustered erasure coding

• 16 GB/Second of Data I/O to a Single Server
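
The headline figures above imply some useful per-system averages. The sketch below derives them, assuming decimal units (1 PB = 1000 TB, 1 TB = 1000 GB) and treating the "up to 250 Petabytes" namespace as full; the slide states neither, so the results are rough estimates only.

```python
# Back-of-the-envelope figures derived from the ORNL numbers quoted above.
# Assumptions (not from the slide): decimal units, and a fully populated
# 250 PB namespace.
ESS_SYSTEMS = 77
NAMESPACE_PB = 250
AGGREGATE_TB_S = 2.5

per_ess_gb_s = AGGREGATE_TB_S * 1000 / ESS_SYSTEMS
per_ess_pb = NAMESPACE_PB / ESS_SYSTEMS
full_scan_hours = NAMESPACE_PB * 1000 / AGGREGATE_TB_S / 3600

print(f"Average sequential bandwidth per ESS: {per_ess_gb_s:.1f} GB/s")
print(f"Average capacity per ESS: {per_ess_pb:.2f} PB")
print(f"Time to stream the full namespace once: {full_scan_hours:.1f} hours")
```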

Page 17: Super Computing 18, MC04 Building your own mini-CORAL

IBM Systems

IBM Elastic Storage Server (ESS) Family

Capacity models (ESS 5U84 storage enclosures):
• Model GL1S: 1 enclosure, 9U; 82 NL-SAS, 2 SSD; 6 GB/s
• Model GL2S: 2 enclosures, 12U; 166 NL-SAS, 2 SSD; 12 GB/s
• Model GL4S: 4 enclosures, 20U; 334 NL-SAS, 2 SSD; 24 GB/s
• Model GL6S: 6 enclosures, 28U; 502 NL-SAS, 2 SSD; 36 GB/s

Speed models:
• Model GS1S: 24 SSD; 14 GB/s
• Model GS2S: 48 SSD; 26 GB/s
• Model GS4S: 96 SSD; 40 GB/s

Hybrid models:
• Model GH14S: 1 2U24 SSD enclosure and 4 5U84 HDD enclosures; 334 NL-SAS, 24 SSD; 38 GB/s
• Model GH24S: 2 2U24 SSD enclosures and 4 5U84 HDD enclosures; 334 NL-SAS, 48 SSD; 40 GB/s

Page 18: Super Computing 18, MC04 Building your own mini-CORAL

New ESS C-Series: Maximum Density with Room to Upgrade and Grow!

Models (4U106 storage enclosures):
• New! Model GL1C: 1 enclosure, 8U; 104 NL-SAS, 2 SSD; 1.0 PB disk
• New! Model GL2C: 2 enclosures, 12U; 210 NL-SAS, 2 SSD; 2.0 PB disk
• New! Model GL4C: 4 enclosures, 16U; 432 NL-SAS, 2 SSD; 4.2 PB disk
• New! Model GL6C: 6 enclosures, 28U; 634 NL-SAS, 2 SSD; 6.3 PB disk

Page 19: Super Computing 18, MC04 Building your own mini-CORAL

IBM Systems

Power Accelerated Computing Platform – Sample Building Block View

• Compute: AC922, 2 or 4 GPUs
• Management: L922 or AC922
• Elastic Storage Server (5147 & 5148)
• Mellanox Switches
• 1-4 S42 Racks

AC922: The World’s Premier AI Servers
• Featured in ORNL and LLNL CORAL Installs
• ExaOps of demonstrated AI Performance
• Able to Process more than 20 GB/s of Data
• Add Servers as Workloads Grow!

IBM Elastic Storage Server for AI Workloads
• Density meets Performance
• High Density Petabytes in Minimum Space
• Featured in ORNL and LLNL Installs
• Grow Performance by Scaling Up or Out!
• Supports IB and Ethernet!

Page 20: Super Computing 18, MC04 Building your own mini-CORAL

PowerAI: Open-Source Based Enterprise AI Platform

• Open Source Frameworks: Supported Distribution
• Developer Ease-of-Use Tools
• Faster Training Times via HW & SW Performance Optimizations

Integrated & Supported AI Platform
3-4x Speedup for AI Training
Ease of Use Tools for Data Scientists

GPU-Accelerated Power Servers

Storage

Caffe

SnapML

Page 21: Super Computing 18, MC04 Building your own mini-CORAL

5x Faster Data Communication with Unique CPU-GPU NVLink High-Speed Connection

IBM Power System AC922 Deep Learning Server (4-GPU Config)

[Diagram: two POWER9 CPUs, each with 1 TB of memory at 170 GB/s and two V100 GPUs attached over NVLink at 150 GB/s]

• Store Large Models in System Memory
• Operate on One Layer at a Time
• Fast Transfer via NVIDIA NVLink
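
To see why link speed matters when a model lives in system memory and is moved to the GPU one layer at a time, the sketch below compares idealized transfer times. The 150 GB/s NVLink figure comes from the slide; the PCIe Gen3 x16 baseline of roughly 32 GB/s bidirectional is an assumption about what the "5x" headline compares against, and the 2 GB layer size is purely hypothetical.

```python
# NVLink figure from the slide: 150 GB/s between a POWER9 CPU and its GPUs.
# Assumption (not on the slide): the "5x" baseline is a PCIe Gen3 x16 link at
# roughly 32 GB/s bidirectional.
NVLINK_GB_S = 150.0
PCIE_GEN3_X16_GB_S = 32.0


def transfer_ms(size_gb: float, link_gb_s: float) -> float:
    """Idealized time in milliseconds to move size_gb over a link.

    Ignores latency, protocol overhead and contention.
    """
    return size_gb / link_gb_s * 1000


layer_gb = 2.0  # hypothetical size of one layer's weights and activations
print(f"Speedup: {NVLINK_GB_S / PCIE_GEN3_X16_GB_S:.1f}x")
print(f"NVLink: {transfer_ms(layer_gb, NVLINK_GB_S):.1f} ms per layer")
print(f"PCIe Gen3 x16: {transfer_ms(layer_gb, PCIE_GEN3_X16_GB_S):.1f} ms per layer")
```

Under these assumptions the ratio comes out a little under 5x, which is consistent with the rounded headline but does not confirm how the slide's figure was derived.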

Page 22: Super Computing 18, MC04 Building your own mini-CORAL

PowerAI

PowerAI Enterprise
• Deep Learning Impact (DLI) Module: Data & Model Management, ETL, Visualize, Advise
• PowerAI: Open Source ML Frameworks
• Large Model Support (LMS)
• Distributed Deep Learning (DDL)
• Auto-HyperParameter Tuning
• SnapML

PowerAI Vision
• Auto-ML for Images & Video

Accelerated Infrastructure
• Accelerated Servers
• Storage

Page 23: Super Computing 18, MC04 Building your own mini-CORAL

• Simplified Management
• Faster Time to Results
• Increased Resource Utilization
• Enterprise Solution

Page 24: Super Computing 18, MC04 Building your own mini-CORAL

Power AI Enterprise Project Examples

Industry: Scenario

• Banking: Credit Scoring; Face Masking Detection; Stock Index Futures Prediction; Research Exploration; OCR recognition correction
• Securities: Company Logo and name auto matching; AI on cloud; Hand writing recognition
• Insurance: Work order auto clustering/handling
• Telecom: Network cabling detection; Service halt handling
• Manufacturing: LED Panel defect inspection; Steel quality classification; Wafer Flaw detection
• Energy: Power transmission line safety detection
• Healthcare: Pathologic analysis
• Retail: Retail market analysis via image recognition
• Public: Satellite photo fault reorganization
• Transportation: Train & subway defect inspection

Page 25: Super Computing 18, MC04 Building your own mini-CORAL

PowerAI Vision: “Point-and-Click” AI for Images & Video

• Label Image or Video Data
• Auto-Train AI Model
• Package & Deploy AI Model

Page 26: Super Computing 18, MC04 Building your own mini-CORAL

PowerAI Vision Project Examples

Defect Identification
• Wafer Fab Inspection – Electronics
• Cam Shaft Inspection – Automotive
• Seat Inspection – Automotive
• PCBA Inspection – Electronics
• Utility disk Inspection – Energy/Utilities
• Mainframe assembly inspection – Electronics
• Ceramic capacitor – Electronics
• Defective Components – Oil/Gas

Facial / Object Recognition
• Safety/Security – Transit, Banking, Gaming
• Building Infrastructure – Building/Construction
• Service – Retail, Food
• Traffic – Municipal

Page 27: Super Computing 18, MC04 Building your own mini-CORAL

Power Accelerated Computing Platform – Building Blocks

Hardware Building Blocks (1-4 S42 Racks)

• Compute Nodes: AC922 (8335-GTG), 2 or 4 GPUs; 4 – 15 Compute Servers*
• Management / Login Servers: 9008-22L or 8335-GTG; 1 – 3 servers (1st rack); xCAT / Manager / Login node
• Elastic Storage Server (5147 & 5148): 0-1 ESS per cluster (optional, 1st rack); ESS mgmt. node or protocol nodes
• IB and Ethernet Switches (Mellanox): IB TOR switch and Enet TOR switch (shared w/ESS)

* 7 max in 1st rack, 15 max in 2nd - 4th
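
The rack limits in the footnote translate directly into a maximum cluster size. The sketch below is one way to tabulate it; the function name is illustrative, and it assumes the per-rack maxima apply independently to each rack.

```python
# Rack limits taken from the slide footnote: at most 7 compute servers in the
# first rack and at most 15 in each of racks 2 to 4.
FIRST_RACK_MAX = 7
OTHER_RACK_MAX = 15
MAX_RACKS = 4


def max_compute_nodes(racks: int) -> int:
    """Maximum number of AC922 compute servers for a given number of S42 racks."""
    if not 1 <= racks <= MAX_RACKS:
        raise ValueError(f"racks must be between 1 and {MAX_RACKS}")
    return FIRST_RACK_MAX + OTHER_RACK_MAX * (racks - 1)


for racks in range(1, MAX_RACKS + 1):
    nodes = max_compute_nodes(racks)
    # Each AC922 compute server carries 2 or 4 GPUs.
    print(f"{racks} rack(s): up to {nodes} nodes, {nodes * 2}-{nodes * 4} GPUs")
```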

Page 28: Super Computing 18, MC04 Building your own mini-CORAL

Power Accelerated Computing Platform

Configurable HW to simplify creation of “CORAL Like” scale out clusters

- Configurable to support HPC, Power AI, and in the future, Quantum Simulator stacks
- Simplifies configuring complex scale-out infrastructure
- Software customization and full rack integration in IBM manufacturing
- Determined in IBM Systems Lab Services Implementation Design Workshop
- Optional on-site network integration and knowledge transfer available
- Option to assemble in the Rochester, MN pre-build lab if the customer wants to use their own switches or racks, or wants water-cooled AC922 compute processors

• Storage: Elastic Storage Server (optional)
• Compute: AC922 (2 or 4 GPU), 8335-GTG; air cooled only; same processors as in CORAL servers
• Management: L922+ and/or AC922 (0, 2, 4 GPU), 8335-GTG
• Switches: Mellanox; 100Gb InfiniBand; 40Gb, 10Gb and 1Gb Ethernet
• Rack: One to four 42U Racks (S42)

If you really need more, let us know!

Page 29: Super Computing 18, MC04 Building your own mini-CORAL

Software that can be customized at IBM Manufacturing *

Base
• Red Hat OS 7.5 (5639-RLE)
• IBM Spectrum Scale Client
• Mellanox OFED driver (Mellanox)
• NVIDIA CUDA Software (Nvidia)

AI
• PowerAI Base (5765-PAI)
• PowerAI Enterprise (5765-AIE): Spectrum Conductor, DL Impact, PowerAI
• PowerAI Vision (5737-H10)
• H2O Driverless AI (5639-AIH)

HPC
• IBM Spectrum LSF Suite (5737-F30)
• IBM Compilers – XLC/C++/Fortran, gcc
• ESSL (5765-L61)
• IBM Spectrum MPI (5725-G83)
• Performance Toolkit (5765-PD2)
• xCAT support (5771-CAT)

Optional Open Source for P9 (frameworks/levels as identified in the Implementation Design Workshop)
• Anaconda
• Caffe
• IBM Advanced Toolchain
• Jupyter Notebook
• Keras
• Python
• PyTorch
• TensorFlow
• xCAT
• XGBOOST (latest git code)

* Assuming customer has required licenses (design workshop)

Page 30: Super Computing 18, MC04 Building your own mini-CORAL

How do I get started?

• What use cases in my company will have payback?
• Who can help my company customize the software?
• Who can provide knowledge transfer to my personnel?

Page 31: Super Computing 18, MC04 Building your own mini-CORAL

Cognitive Discovery Workshop: Helping you identify the right cognitive use cases

Objective: To provide an overview of Cognitive technologies, explore potential use cases and how they can be deployed to provide business value. The key focus is to identify potential use cases for a Proof of Concept project.

How’s it delivered? A 4-6 hour face-to-face workshop at the customer location, delivered by an IBM Cognitive Workshop team.

What’s the output? Potential use cases and an action plan to help the team select an appropriate Cognitive project.

Who should attend? Key IT resources, Data Scientist/Customer Data Architect, LOB (Business Sponsor), and any others the customer team feels are important to the discussion.

Detailed abstract: This session typically includes discussions on:
• Overview of industry and cross-industry use cases
• Discussion of Open Source Cognitive technologies such as TensorFlow, Caffe, Theano, Torch
• Discussion on data layer technologies such as Hadoop, NoSQL, NewSQL and relational DB technologies, and the importance of the end-to-end process (governance and data management)
• Discussion of customer-specific use cases, including feasibility assessment
• Develop an action plan to assist the customer to identify and justify Cognitive use cases (ROI or ROI factors)
• Identify infrastructure actions necessary to support a Cognitive project

Email: [email protected]
Online Request: https://ibm.biz/BdFfcV

Page 32: Super Computing 18, MC04 Building your own mini-CORAL

Discovery Workshop

Time / Topic / Speaker / Audience

• 9:00 – 9:15 am: Introductions and Review Workshop Objectives. Speaker: All. Audience: Execs, LOB, IT Liaisons.
• 9:15 – 10:45 am: Executive Session (What is AI; Art of the Possible; Short Demo – H2O). Speaker: IBM. Audience: Execs, LOB, IT Liaisons.
• 10:45 – 11:00 am: Break
• 11:00 – 11:45 am: Introduce Use Case Workshop (Answering lingering Q&A; Each LOB department mission overview & focus areas). Audience: LOB, IT Liaisons.
• 11:45 am – 12:30 pm: Industry Examples of Applied AI (Group Discussion on applicability to Customer). Speaker: IBM/Client. Audience: LOB, IT Liaisons.
• 12:30 – 1:00 pm: Lunch
• 1:00 – 2:30 pm: Discussion and Identification of Use Cases by LOB (Feasibility and Impact of Use Cases; Identify High Interest and Highest Value Use Cases for Customer). Speaker: IBM/Client. Audience: LOB, IT Liaisons.
• 2:30 – 2:45 pm: Break
• 2:45 – 4:00 pm: Develop Action Plan for Creation of Exec Proposal for High Value Use Cases (Use Case Pay Back, Cognitive Work Flow, Timeline, Data Strategy; Cognitive Skill Set, Data Strategy, POC/Trial Implementation steps). Speaker: IBM/Client. Audience: LOB, IT Liaison for identified use cases.

Workshop phases: Executive Session, Use Case Discovery, Business Case Development

Page 33: Super Computing 18, MC04 Building your own mini-CORAL

Power ACP – IBM Systems Lab Services

Implementation Design Workshop
- Develops information to enable majority of system implementation and tailoring to occur in IBM Manufacturing
- Done on customer site
Note: This step is mandatory for enabling manufacturing SW preload

Manufacturing Install
- Hardware Racking, Software Customization in Manufacturing

Network Integration & Knowledge Transfer on site
- Install, Configure & Verify software
- Optional network integration
- Done on customer site
- Billable to customer
- Knowledge Transfer on solution configuration

Contact us today: [email protected]
On the Web: www.ibm.com/it-infrastructure/services/lab-services
PartnerWorld: www.ibm.com/partnerworld/systems/services/lab-services
Email us: [email protected]

Page 34: Super Computing 18, MC04 Building your own mini-CORAL

IBM Systems Lab Services Implementation Design Workshop

Onsite customer workshop to enable a fast time-to-benefit implementation
- Develops information to enable majority of system implementation and tailoring to occur in IBM Manufacturing
- Documents software and infrastructure required to enable customer use cases
- Includes:
  - Data Center personnel to ensure client data center is ready for the Power Accelerated Computing Platform implementation
  - Customer personnel to determine customization of software like PowerAI Enterprise or PowerAI Vision or H2O
  - Client networking team to document customization needed for networking (IPs, VLANs, Uplinks, etc.)
- Creation of the implementation documentation that will be used for customization at IBM Manufacturing and for solution knowledge transfer
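
As an illustration of the kind of networking detail the workshop documents (IPs and VLANs), the sketch below builds a simple address plan with Python's standard `ipaddress` module. Everything specific in it is a placeholder: the supernet, the network names, the /24 sizing and the VLAN numbering are assumptions for the example, not values from the workshop or from IBM documentation.

```python
import ipaddress

# Hypothetical address plan: the supernet, network names and VLAN IDs below are
# placeholders chosen for illustration only.
SUPERNET = ipaddress.ip_network("10.10.0.0/16")
NETWORKS = ["management", "compute", "storage", "bmc"]
FIRST_VLAN_ID = 100


def build_plan(supernet, names, first_vlan):
    """Assign one /24 subnet and one VLAN ID to each named network."""
    subnets = supernet.subnets(new_prefix=24)
    plan = {}
    for offset, (name, subnet) in enumerate(zip(names, subnets)):
        plan[name] = {
            "vlan": first_vlan + offset,
            "subnet": str(subnet),
            "gateway": str(subnet.network_address + 1),
            "usable_hosts": subnet.num_addresses - 2,
        }
    return plan


plan = build_plan(SUPERNET, NETWORKS, FIRST_VLAN_ID)
for name, entry in plan.items():
    print(name, entry)
```

Keeping the plan as plain data makes it easy to review with the client networking team and to reuse when the nodes are later configured.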

Page 35: Super Computing 18, MC04 Building your own mini-CORAL

End Result at the Data Center

[Two images on the slide, captioned “Not This” and “This”]

Page 36: Super Computing 18, MC04 Building your own mini-CORAL

Lessons learned with Summit on deploying large HPC Clusters

Page 37: Super Computing 18, MC04 Building your own mini-CORAL

Page 38: Super Computing 18, MC04 Building your own mini-CORAL

Deployment of Large HPC Clusters: Lessons Learned

Architecture for scale is important. In our case, the network architecture was quite successful, and service nodes were used to distribute provisioning workload across many nodes.

Most of the effort in deploying a large cluster is in the infrastructure racks.

Switch-level discovery becomes critical for large-scale rapid deployment of racks. Cabling verification and double-checking node positions became important.

It's important to establish a good, complete set of node-level diagnostics to run on every node in the cluster, and to run this set of diagnostics on a continuous basis

Establish a process and mechanism to deploy updates continuously to the cluster, for both software and firmware. This includes both stateful and stateless nodes.

Expect issues at scale with most tools
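
The cabling verification mentioned above amounts to comparing what the plan says should be plugged into each switch port with what discovery actually finds. The sketch below shows that comparison on made-up data; the switch names, node names and dictionary layout are hypothetical, and a real deployment would populate `discovered` from whatever switch-level discovery tooling the site uses.

```python
# Hypothetical data: "expected" would come from the rack elevation plan and
# "discovered" from the site's switch-level discovery.
expected = {
    ("ib-sw01", 1): "c01n01",
    ("ib-sw01", 2): "c01n02",
    ("ib-sw01", 3): "c01n03",
}
discovered = {
    ("ib-sw01", 1): "c01n01",
    ("ib-sw01", 2): "c01n03",  # swapped cable
    ("ib-sw01", 3): "c01n02",  # swapped cable
}


def cabling_errors(expected, discovered):
    """Return (switch, port, expected_node, found_node) for every mismatch."""
    errors = []
    for (switch, port), node in sorted(expected.items()):
        found = discovered.get((switch, port))
        if found != node:
            errors.append((switch, port, node, found))
    return errors


for switch, port, want, got in cabling_errors(expected, discovered):
    print(f"{switch} port {port}: expected {want}, found {got}")
```

A port with nothing discovered shows up with `None` as the found node, so missing cables and swapped cables are reported by the same check.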

Page 39: Super Computing 18, MC04 Building your own mini-CORAL

Performance Testing as you go

One of the final objectives for the cluster deployment was a submission to the TOP500 list.

For Sierra, HPL (Linpack) became an extraordinarily valuable tool for exercising a cluster, and finding and diagnosing performance issues

We started small at the node level, and worked up to the rack level, row level and cluster level. In this way, we could identify performance issues at the micro level, rather than the macro level. When tuned well, node level and rack level performance was remarkably similar.

Node-level HPL identifies CPU, GPU and memory performance issues

Rack-level HPL identifies InfiniBand performance issues both at individual nodes and at the rack-level IB switches

Row-level HPL identifies performance issues in some core IB switches. For example, we saw performance issues in the eastern end of one row in Sierra

Cluster-level HPL identifies issues at very large scale, and provides opportunities for novel approaches to HPL
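
The "start small" approach above depends on spotting underperforming nodes before moving up to rack-level runs. The sketch below shows one simple way to do that, flagging any node whose score falls more than a chosen fraction below the median. The node names, scores and 5% tolerance are illustrative, not measurements or thresholds from Summit or Sierra.

```python
from statistics import median

# Hypothetical node-level HPL results in TFLOP/s; node names, values and the
# 5% tolerance are illustrative only.
node_tflops = {
    "c01n01": 21.4,
    "c01n02": 21.6,
    "c01n03": 18.9,  # underperforming node
    "c01n04": 21.5,
    "c01n05": 21.3,
}


def slow_nodes(results, tolerance=0.05):
    """Return nodes whose HPL score is more than `tolerance` below the median."""
    reference = median(results.values())
    floor = reference * (1 - tolerance)
    return sorted(name for name, score in results.items() if score < floor)


print("Median:", median(node_tflops.values()))
print("Nodes to investigate before rack-level runs:", slow_nodes(node_tflops))
```

The median is used rather than the mean so that a few bad nodes do not drag down the reference value they are compared against. The same comparison could be repeated with per-rack and per-row results as testing moves up the hierarchy.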

Page 40: Super Computing 18, MC04 Building your own mini-CORAL

Power Accelerated Computing Platform

Getting Started

• IBM Cognitive Systems Solution Center (CSSC): Optional Discovery Workshop to identify use cases
  • Email: [email protected]
  • Submit Online Request: https://ibm.biz/BdFfcV

• IBM Systems Lab Services three Stage Approach
  i. Implementation Design Workshop
  ii. Manufacturing Customization
  iii. Data Center Integration
  • Email: [email protected] or Fred Robinson [email protected]

• Configurator: eConfig -> Power -> Solutions -> Power ACP

Page 41: Super Computing 18, MC04 Building your own mini-CORAL

IBM Systems Lab Services

Proven expertise to help leaders plan, design, and implement the essential IT infrastructure for what comes next

Our team of 1,000+ consultants engages worldwide in pre- and post-sales opportunities in:

• Power Systems
• Storage and Software Defined Infrastructure
• IBM Z and LinuxONE
• HPC & Deep Learning
• Systems Consulting
• Migration Factory
• Technical Training and Events

[email protected]
www.ibm.com/it-infrastructure/services/lab-services
Fred Robinson [email protected]

Page 42: Super Computing 18, MC04 Building your own mini-CORAL

IBM Power Accelerated Computing Platform

IBM Power ACP gives clients their own AI installation based upon the world’s most powerful and smartest scientific supercomputer

Includes everything required for success!
• Networking
• Servers
• Storage
• Software
• Services
• Support

Leverage CORAL success TODAY!

Page 43: Super Computing 18, MC04 Building your own mini-CORAL

Notices and disclaimers

© Copyright IBM Corporation 2018

• © 2018 International Business Machines Corporation. No part of this document may be reproduced or transmitted in any form without written permission from IBM.

• U.S. Government Users Restricted Rights — use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.

• Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. This document is distributed “as is” without any warranty, either express or implied. In no event shall IBM be liable for any damage arising from the use of this information, including but not limited to, loss of data, business interruption, loss of profit or loss of opportunity. IBM products and services are warranted per the terms and conditions of the agreements under which they are provided.

• IBM products are manufactured from new parts or new and used parts. In some cases, a product may not be new and may have been previously installed. Regardless, our warranty terms apply.

• Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.

• Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.

• References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business.

• Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.

• It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer follows any law.

Page 44: Super Computing 18, MC04 Building your own mini-CORAL

Notices and disclaimers continued

© Copyright IBM Corporation 2018

• Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM expressly disclaims all warranties, expressed or implied, including but not limited to, the implied warranties of merchantability and fitness for a purpose.

• The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.

• IBM, the IBM logo, ibm.com and [names of other referenced IBM products and services used in the presentation] are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.