leverage ocp design…flexible & easy design for different applications –i 12 cpu 0 cpu 1...


Post on 22-Jul-2020



Leverage OCP Design Advantages on EIA 19” Accelerator Server

HPC & GPU/FPGA Technology

Gregary Liu, Product Director, Wiwynn Corporation


Agenda

• Brief System Overview

• High CFM/watt Thermal Efficiency

• Flexible & Easy Design for Different Applications

• Design for Serviceability

• Design Extension to ORv2

• Power Distribution Design for ORv2

Specifications

HPC


Brief System Overview – I

System Design Advantages

• High CFM/watt thermal design for large-scale simulation models and DL training at all workloads

• Different applications can be addressed by selecting different PCIe topologies and PCIe cards

OCP-Related Design Highlights

• Front I/O access

• Tool-less ME design to save labor

• Integrated, field-proven Mt. Olympus M/B for high quality assurance


Brief System Overview – II

EIA 19” Design Highlights

• Standard 4RU high-power server design

• Designed for 8 double-width PCIe Gen3 x16 slots, adapting to various accelerators for different workloads

• Dual-zone thermal/cooling design: cold air runs through the PCIe cards directly

• CRPS PSUs with 2+2 power redundancy

• Scalable design, easily migrated to ORv2

So, how do we achieve them?


Accelerator Server Basics

[Figure: system block layout. Labels: fan board (3+1 redundant fans); power board (2+2 redundant PSUs); PCIe card; switch board with SW ICs. Quantity marks in the figure: x1, x4, x4.]
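The "3+1" and "2+2" labels follow the usual N+M convention: N units can carry the full load and M more are redundant. The sketch below works through that arithmetic for the PSUs. It assumes the 2.8 kW maximum workload quoted elsewhere in this deck and an even load split across working units; the function name and the even-split model are illustrative and do not come from the slides.

```python
def per_unit_load(total_load, needed, spares, failed=0):
    """Load carried by each working unit in an N+M redundant group.

    needed -- units required to carry the full load (N)
    spares -- additional redundant units (M)
    failed -- units currently out of service
    Assumption: the load is shared evenly across all working units.
    """
    working = needed + spares - failed
    if working < needed:
        raise ValueError("more units failed than the redundancy allows")
    return total_load / working


# 2+2 PSUs at an assumed 2800 W system load.
print(per_unit_load(2800, needed=2, spares=2))            # all four PSUs share the load
print(per_unit_load(2800, needed=2, spares=2, failed=2))  # worst case the design tolerates
```

Under these assumptions each PSU has to be able to deliver half of the system load on its own, which is the sizing constraint a 2+2 arrangement implies.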


High CFM/watt Thermal Efficiency – I

Two isolated cooling zones let cold air run through the PCIe cards directly.

[Figure: side view and top view of the airflow. Labels: GPGPU Fan, PSU, cold air, hot air, PCIe cards cooling zone, server board cooling zone.]

• Thermal efficiency

0.135 CFM/watt at 30°C

0.117 CFM/watt at 25°C

Exceeds DC requirement
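To make the CFM/watt figures concrete, the sketch below converts them into a total airflow and into the bulk air temperature rise they imply, using the standard heat balance dT = P / (rho * cp * Q). The 2.8 kW load is taken from the later thermal slide; the air density and specific heat are textbook values assumed here, and the function names are illustrative.

```python
CFM_TO_M3_PER_S = 0.000471947  # one cubic foot per minute in m^3/s


def required_airflow_cfm(power_w, cfm_per_watt):
    """Total airflow needed for a given heat load and CFM/watt ratio."""
    return power_w * cfm_per_watt


def air_temperature_rise_c(power_w, airflow_cfm,
                           air_density=1.2, specific_heat=1005.0):
    """Bulk inlet-to-outlet air temperature rise in degrees C.

    Assumptions: all power ends up as heat in the airstream, air density
    is 1.2 kg/m^3 and specific heat is 1005 J/(kg*K).
    """
    flow_m3_per_s = airflow_cfm * CFM_TO_M3_PER_S
    return power_w / (air_density * specific_heat * flow_m3_per_s)


for ratio in (0.135, 0.117):
    airflow = required_airflow_cfm(2800, ratio)
    rise = air_temperature_rise_c(2800, airflow)
    print(f"{ratio} CFM/W -> {airflow:.0f} CFM, about {rise:.1f} C air temperature rise")
```

Because airflow scales with power, the temperature rise depends only on the CFM/watt ratio: a lower ratio means less air moved per watt and a warmer exhaust.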


High CFM/watt Thermal Efficiency – II

3+1 redundant system fan design for up to 2.8 kW workload at 35°C

[Figure: temperature and power readings across the chassis with a 35°C inlet; the location of the failed fan is marked.]

CPU: 82°C / 134.7 W; 84°C / 134.7 W

PCIe cards (8 readings): 73°C / 248 W; 73°C / 247 W; 74°C / 247 W; 73°C / 248 W; 76°C / 248 W; 76°C / 247 W; 77°C / 246 W; 77°C / 248 W

SSD: 40.1°C

DIMM: 56°C

GPU inlet: 35.3°C

PSU inlet: 41.8°C

GPU outlet: 52.5°C

System inlet: 35°C; outlet: 48°C
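The readings on this slide can be summarized with a few lines of arithmetic. The sketch below totals the per-component power and picks out the hottest device in each group. It assumes the eight roughly 248 W readings belong to the eight GPGPU cards, which the transcript does not state explicitly; the variable and function names are illustrative.

```python
# (temperature in C, power in W), in the order the readings appear on the slide.
cpu_readings = [(82, 134.7), (84, 134.7)]
card_readings = [(73, 248), (73, 247), (74, 247), (73, 248),
                 (76, 248), (76, 247), (77, 246), (77, 248)]  # assumed: 8 GPGPU cards


def summarize(readings):
    """Return the hottest temperature and the summed power of a group."""
    temps = [temp for temp, _ in readings]
    powers = [power for _, power in readings]
    return {"max_temp_c": max(temps), "total_power_w": sum(powers)}


cpu = summarize(cpu_readings)
cards = summarize(card_readings)
air_rise_c = 48 - 35  # system outlet minus system inlet

print("CPUs: ", cpu)
print("Cards:", cards)
print("CPU + card power (W):", round(cpu["total_power_w"] + cards["total_power_w"], 1))
print("System air temperature rise (C):", air_rise_c)
```

The total covers only the CPUs and cards listed on the slide, so it is lower than the 2.8 kW system figure, which also includes components the slide does not itemize.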


Flexible & Easy Design for Different Applications – I

CPU-PCIe Cards Topology 1 – Balance Mode, CPU:GPU = 1:4

Higher bandwidth between CPU and GPU.

[Figure: Project Olympus server board (2P Intel Xeon-SP; CPU 0 and CPU 1 linked by UPI) connected by PCIe x16 cable to the PCIe Gen3 switch board, which carries PCIe3 Switch 0 and PCIe3 Switch 1 and GPU0 to GPU7, grouped as GPU0-3 and GPU4-7. Link label: 1* PCIe x16.]
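The "higher bandwidth between CPU and GPU" claim can be illustrated with a rough per-GPU estimate. The sketch below uses the PCIe Gen3 signaling rate (8 GT/s per lane, 128b/130b encoding) and divides one x16 uplink among the GPUs behind it. The single shared x16 uplink per CPU and the all-GPUs-busy scenario are assumptions made for illustration; the slide does not spell out the exact wiring, and packet-level protocol overhead is ignored.

```python
def pcie_gen3_bandwidth_gb_s(lanes):
    """Approximate one-direction bandwidth of a PCIe Gen3 link in GB/s.

    8 GT/s per lane with 128b/130b line encoding; TLP headers and flow
    control overhead are not modeled.
    """
    transfers_per_s = 8e9
    encoding_efficiency = 128 / 130
    return lanes * transfers_per_s * encoding_efficiency / 8 / 1e9


def per_gpu_host_share_gb_s(gpus_per_cpu, uplink_lanes=16, uplinks=1):
    """Host bandwidth per GPU when every GPU behind the uplink is busy.

    Assumption: the GPUs share `uplinks` links of `uplink_lanes` lanes evenly.
    """
    return uplinks * pcie_gen3_bandwidth_gb_s(uplink_lanes) / gpus_per_cpu


print(f"x16 link:        {pcie_gen3_bandwidth_gb_s(16):.2f} GB/s")
print(f"CPU:GPU = 1:4 -> {per_gpu_host_share_gb_s(4):.2f} GB/s per GPU")
print(f"CPU:GPU = 1:8 -> {per_gpu_host_share_gb_s(8):.2f} GB/s per GPU")
```

With the same uplink width, halving the number of GPUs per CPU doubles each GPU's share of host bandwidth, which is the trade-off a 1:4 ratio makes against a 1:8 ratio.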


Flexible & Easy Design for Different Applications – II

CPU-PCIe Cards Topology 2 – Cascade Mode, CPU:GPU = 1:8

Peer-to-peer performance can be extended to 8 PCIe cards.

[Figure: Project Olympus server board (2P Intel Xeon-SP; CPU 0 and CPU 1 linked by UPI) connected by PCIe x16 cable to the PCIe Gen3 switch board, which carries PCIe3 Switch 0 and PCIe3 Switch 1 and GPU0 to GPU7, grouped as GPU0-3 and GPU4-7. Link label: 2* PCIe x16.]
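One way to see why cascade mode extends peer-to-peer traffic to all eight cards is to model each topology as a graph and trace the path between two GPUs on different switches. The wiring below is an assumption based on the diagram labels: in balance mode each CPU connects to one switch, and in cascade mode CPU 0 connects to Switch 0, which connects to Switch 1. Node names and helper functions are illustrative.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, src, dst):
    """Breadth-first search; returns the node list from src to dst, or None."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in sorted(graph[path[-1]]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# GPU0-3 sit behind Switch 0 and GPU4-7 behind Switch 1 in both modes.
gpu_edges = [(f"GPU{i}", "SW0" if i < 4 else "SW1") for i in range(8)]

# Assumed wiring for each mode (not confirmed by the slides).
balance = build_graph(gpu_edges + [("CPU0", "CPU1"), ("CPU0", "SW0"), ("CPU1", "SW1")])
cascade = build_graph(gpu_edges + [("CPU0", "CPU1"), ("CPU0", "SW0"), ("SW0", "SW1")])

for name, graph in (("balance", balance), ("cascade", cascade)):
    path = shortest_path(graph, "GPU0", "GPU4")
    crosses_cpu = any(node.startswith("CPU") for node in path)
    print(f"{name}: {' -> '.join(path)} (crosses CPU/UPI: {crosses_cpu})")
```

Under this model, GPU0-to-GPU4 traffic in balance mode has to go through both CPUs and the UPI link, while in cascade mode it stays on the two PCIe switches.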


Tool-less Design for PCIe Cards Maintenance – I

Modular and tool-less design for double-width (DW) PCIe card maintenance

• Modular SW tray for easy DW PCIe card swaps

• Quarter-turn fasteners for PCIe card replacement

[Figure: GPU tray for serviceability; quarter turn to fasten / release.]


Tool-less Design for PCIe Cards Maintenance – II

Rotatable SSD bracket for PCIe card maintenance

• Tool-less design

• Prevents interference with M/B serviceability

[Figure: front PCIe card maintenance; SSD module and rotational bracket.]


Serviceability Design for Fan and SSD Replacement

Modularized and labor-saving design

• Hot-plug fan module with labor-saving handle for fast replacement

• Hot-plug, front-access SSDs with tool-less design

[Figure: SSD serviceability (SSD carrier); fan serviceability (fan cage, labor-saving handle).]


Design Extension to ORv2

• Retrofit to a 4OU chassis to fit ORv2, supporting the 12V DC busbar

• Redesigned PTB for power transition to the server board and PCIe switch board

• Support for up to 8x SATA SSDs

Specifications

Processor: 2S Intel® Xeon® Processor Scalable Family

DIMM: 1.5TB DDR4; up to 2666 MT/s; 24 DIMM slots

Storage, drive support: 8 x 2.5” hot-plug SATA HDDs/SSDs

Storage, M.2 SSD module: 4 onboard M.2 modules

Accelerator, PCIe 3.0 slot: 8; GPU/FPGA/Flash add-in cards

Expansion slot, PCIe Gen3 (x16): 3 (1 or 2 reserved for GPU connection)

System dimensions (mm): 4OU; 188 (H) x 537 (W) x 879 (D)


Power Distribution Design for ORv2

• Dual busbar clips to support up to 2.8 kW

• Power transition board (PTB) for M/B, switch board, and fan board

[Figure: 12V power distribution. Labels: Bus Clip 1, Bus Clip 2, PTB, Fan Board, Mt. Olympus, PCIe Switch Board, GPGPU Cards x8, with 12V rails between them.]
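The reason for two busbar clips becomes clearer once the 2.8 kW figure is converted into current on a 12V bus. The sketch below does that conversion and splits the result across the clips. It assumes the clips share current evenly and ignores distribution losses; both assumptions and the function names are illustrative rather than taken from the slides.

```python
def busbar_current_a(power_w, bus_voltage_v=12.0):
    """Total current drawn from the busbar at a given power (I = P / V)."""
    return power_w / bus_voltage_v


def per_clip_current_a(power_w, clips=2, bus_voltage_v=12.0):
    """Current per busbar clip, assuming the clips share the load evenly."""
    return busbar_current_a(power_w, bus_voltage_v) / clips


total = busbar_current_a(2800)
per_clip = per_clip_current_a(2800, clips=2)
print(f"Total at 12 V: {total:.1f} A")
print(f"Per clip (2 clips, even split): {per_clip:.1f} A")
```

At 12V, a 2.8 kW load draws well over 200 A in total, so splitting it across two clips roughly halves what each contact has to carry.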


Q&A
