new technology in high performance computing bae wonsoung ph.d. tao computing inc....

New Technology in High Performance Computing

Bae WonSoung, Ph.D., TAO Computing Inc.

[email protected]

Contents

1. Introduction: hardware acceleration technology (hybrid computing)

2. CSX600 acceleration technology: parallelism based on SIMD architecture

3. CSX600 performance

4. Approach to blade systems, etc.

5. Application performance, research & development

6. Case study & prospects

7. Discussion

Introduction

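The rise of hybrid computing: limits of current CPU technology (limits of chip integration), multi-core technology

Declining relative share of arithmetic capability; development centered on general-purpose (personal) use; specialization

CPU operation + Accelerated operation = Hybrid computing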

Hybrid computing

Acceleration & Hybrid computing

Acceleration & Hybrid computing technology

1. FPGA (Field Programmable Gate Array)

Xilinx, Altera (SGI)

2. GPU (Graphics Processing Unit)

ATI (AMD) CTM project, NVIDIA (CUDA/Tesla)

3. Game processor

Cell, Mercury

4. SIMD processor (accelerated coprocessor)

CSX600

FPGAs work best for bit and integer data types

• Excellent for bit-twiddling, like cryptography
• Fast for integer manipulations, like genomic algorithms for pattern matching
• Marginal for 32-bit floating-point; have to do over 200 at once to compete with current general purpose chips
• Poor at 64-bit floating-point… don’t use them as a supplementary FLOPS unit
• “Programming” is really circuit design, though tools are making this easier.
• Compare speed against meticulously coded assembler on node, not casual coding on node.
• Where FPGAs make the most sense is in creating instructions very unlike those provided by the node instruction set.
• FPGAs can cost more than an entire server!

Where GPUs can help with HPC applications

• Single-precision calculations where answer quality is less important than raw speed
  – Seismic exploration
  – Some types of Quantum Chromodynamics
• Graphics-type calculations, obviously, for visualization and result display
• Only 32-bit, and non-IEEE rounding degrades accuracy cumulatively
• Can use over 200 watts, multiple slots
• Often only have a few megabytes of local store (frame buffer architecture)
• Cheap hardware but very expensive software (the kind you create yourself)
• Game processors don’t match endian-ness of node CPUs

Performance via limited power dissipation

CSX600 Acceleration Technology

• Dual ClearSpeed CSX600 coprocessors
• 96 GFLOPS “theoretical” peak
• R∞ ≈ 75 GFLOPS for 64-bit matrix multiply (DGEMM) calls
  – Hardware also supports 32-bit floating point and integer calculations
• 133 MHz PCI-X 2/3rds length (8”) form factor and PCIe (x8) form factor
• 1 GByte of memory on the board
• Drivers today for Linux (Red Hat and SUSE) and Windows (XP, Server 2003 (32/64-bit), WCCS 2003)
• Low power; 25 watts typical
• Multiple boards can be used together for greater performance

[Figures: X620 board, e620 board]

CSX600 Configuration

CSX600 Architecture

Introduction to PEs (processing Elements)

CSX600 Performance Data

[Figure: bar chart of 64-bit function operations per second (billions, axis 0.0 to 2.5) by function name (Sqrt, InvSqrt, Exp, Ln, Cos, Sin, SinCos, Inv) for a 2.6 GHz dual-core Opteron, a 3 GHz dual-core Woodcrest, and the ClearSpeed Advance card]

Typical speedup of ~8X over the fastest x86 processors, because math functions stay in the local memory on the card

CSX600 DGEMM Performance

[Figure: DGEMM GFLOPS (axis 10 to 70) versus matrix size (2000 to 10000), with one curve for PCI-X and one for PCIe]

Doubling host-to-card bandwidth has a minor effect because of I/O overlap. A zero-latency connection would not visibly affect either curve!

CSX600 LINPACK Performance

System specification                          Linpack result (GFLOPS)   Elapsed time
4 nodes (16 GB) w/o Advance boards            136.0                     48.4 minutes
4 nodes (16 GB) w/ 2 x Advance boards each    364.2                     18.4 minutes
1 node (16 GB) w/o Advance boards             34.0                      -
1 node (16 GB) w/ 2 x Advance boards          90.1                      -

Note: Previously published Linpack results for similar single-node systems were 34.9 GFLOPS for the standard node and 93 GFLOPS for an accelerated node with two ClearSpeed Advance boards. The variations are a result of small differences between system configurations and problem sizes used during the benchmark runs.

FFT performance

Approach to blade systems, etc.

Example of a possible 1U server

[Figure: 1U chassis at rack width (19 in.) holding a 6.1 in. card]

• 2.66 GHz dual-socket quad-core x86 plus 16 GB memory, 4x InfiniBand
• Two ClearSpeed PCIe cards on risers
• 1 or 2 GB DRAM/card
• Enclosure supplies 25 W, cooling per card
• Total power draw: 450 W
  – 277 peak GFLOPS (64-bit)
  – ~225 DGEMM GFLOPS sustained with MKL + new DGEMM
  – ~170 LINPACK GFLOPS sustained
  – 18 or 20 GB DRAM in total: 2 or 4 GB on CS cards, 16 GB on x86 host
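The 277 GFLOPS peak figure can be reproduced with a little arithmetic. The sketch below takes the 96 GFLOPS per-board peak from the CSX600 slide and assumes each x86 core retires 4 double-precision flops per cycle; that per-cycle rate is an assumption, not something stated on the slide, and the function and variable names are illustrative only.

```python
def host_peak_gflops(clock_ghz, sockets, cores_per_socket, flops_per_cycle):
    """Theoretical peak of the x86 host in GFLOPS."""
    return clock_ghz * sockets * cores_per_socket * flops_per_cycle


# Host values from the slide; flops_per_cycle=4 is an assumption.
host_peak = host_peak_gflops(2.66, sockets=2, cores_per_socket=4, flops_per_cycle=4)

# Two accelerator cards at the 96 GFLOPS per-board peak quoted earlier.
card_peak = 2 * 96.0

total_peak = host_peak + card_peak
peak_per_watt = total_peak / 450.0  # 450 W total power draw from the slide

print(f"host {host_peak:.2f} + cards {card_peak:.0f} = {total_peak:.2f} GFLOPS")
print(f"{peak_per_watt:.2f} peak GFLOPS per watt")
```

Under these assumptions the host contributes roughly 85 GFLOPS and the two cards 192 GFLOPS, which matches the quoted total.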

Server and workstation installations

• Can be installed in standard servers and workstations
  – E.g. HP DL380 takes 2 boards
• Some 4U servers could take 6-8 boards
• Potential for a PCI backplane chassis to take anything from 12 to 19 boards in 4U
  – A 19-board system could be under 750 W in 4U
  – Could put 9 of these in a 42U rack -> 171 boards
  – If each board is 10x a fast x86 core, equivalent to 1,710 cores in a single rack but for only 6.8 kW of power
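The rack-density estimate above is straightforward multiplication. A minimal sketch using only the figures quoted on the slide (the variable names are illustrative):

```python
# Figures quoted on the slide.
boards_per_chassis = 19      # boards in one 4U PCI backplane chassis
chassis_per_rack = 9         # 4U chassis placed in a 42U rack
chassis_power_w = 750        # upper bound on power per 19-board chassis
x86_cores_per_board = 10     # "each board 10X a fast x86 core"

boards_per_rack = boards_per_chassis * chassis_per_rack
equivalent_cores = boards_per_rack * x86_cores_per_board
rack_power_kw = chassis_per_rack * chassis_power_w / 1000.0

print(boards_per_rack, "boards,", equivalent_cores, "equivalent x86 cores,",
      rack_power_kw, "kW")
```

Nine 4U chassis occupy 36U, which leaves 6U of a 42U rack free; the slide does not say what that space is used for.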

Blade installations

• Can be installed in blades and via expansion units
  – E.g. IBM PEU2, compatible with HS21, takes 2 boards
  – HP blade case using a custom expansion unit

Case installation

A cluster of four nodes: 136 GFLOPS

Hardware configuration

• Intel® Xeon® 5160 (Woodcrest) dual-core processors, 4 nodes (16 CPUs)
• Consuming 1,940 watts
• Two Advance accelerator boards in each server
• Cluster performance increases to over 364 GFLOPS
• Adding only 200 watts

TOP500 (November 1996)

• Center for Computational Science at the University of Tsukuba in Japan
• 2048-CPU Hitachi system: 368.2 GFLOPS

HP DL380 takes 2 boards
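The efficiency gain for the four-node cluster follows directly from the quoted figures. A small sketch, assuming the 200 watts is the total added by all eight boards (the helper name is illustrative):

```python
def gflops_per_watt(gflops, watts):
    """Performance per unit power."""
    return gflops / watts


# Baseline cluster: 136 GFLOPS at 1,940 W.
baseline = gflops_per_watt(136.0, 1940.0)

# Accelerated cluster: 364 GFLOPS with 200 W added.
accelerated = gflops_per_watt(364.0, 1940.0 + 200.0)

improvement = accelerated / baseline
print(f"baseline    {baseline:.3f} GFLOPS/W")
print(f"accelerated {accelerated:.3f} GFLOPS/W")
print(f"improvement {improvement:.2f}x")
```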

Conventional Way

• 32 kW × 3 + α

• 10 × 3 + α sq. ft.

• ~$500,000 × 3 + α ≈ $1,500,000 + α

• 2.2 TFLOPS (lower)

• Facility reconstruction expense

ClearSpeed Way

Increasing capacity to 2.2 TFLOPS

Building a clustering system

Clustering system in CSX600

Titech 500.org ranking

• Announced on Monday 9th of October 2006: Tokyo Tech have accelerated their Linux supercomputer, TSUBAME, from 38 TFLOPS to 47 TFLOPS with 360 ClearSpeed Advance boards
  – An increase in performance of 24%, but for just a 1% increase in power consumption
  – 10,368 AMD Opteron cores with just 360 ClearSpeed Advance boards
  – #9 in November 2006 Top500
  – 1st accelerated system in the Top500

Professor Matsuoka standing beside TSUBAME at Tokyo Tech

Application performance & R&D

Structural Analysis

Electromagnetic Modeling

Radar Cross-Section

Ab initio Computational Chemistry

Global Illumination Graphics

LINPACK speed correlates with many real applications

• Dense matrix-matrix kernels: order N^3 ops on order N^2 data
  – CFD, CAE by boundary element and Green’s function methods
• N-body interactions: order N^2 ops on order N data
  – CFD, CAE with high mean-free-path
• Some sparse matrix operations: order NB^2 ops on order NB data, where B is the average matrix band size
  – CFD, CAE with finite element methods
• Time-space marching: order N^4 ops on order N^3 data
  – CFD, CAE with finite difference methods; data must reside on board
• Fourier transforms: order N log N ops on order N data, with other processing to increase data re-use
  – CFD, CAE by spectral methods or pseudo-spectral methods
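What the kernel classes above have in common is that operations grow faster than data, so each value moved to the accelerator is reused many times. The sketch below tabulates the ops-per-data-item ratio for each class. The function name, the sample sizes, and the choice of a base-2 logarithm for the Fourier transform are assumptions made for illustration.

```python
import math


def ops_per_data_item(n, band):
    """Ratio of operation count to data size for each kernel class."""
    return {
        "dense matrix-matrix": n**3 / n**2,             # grows like N
        "n-body": n**2 / n,                             # grows like N
        "banded sparse": (n * band**2) / (n * band),    # grows like B
        "time-space marching": n**4 / n**3,             # grows like N
        "fourier transform": (n * math.log2(n)) / n,    # grows like log N
    }


ratios = ops_per_data_item(n=1024, band=64)
for kernel, ratio in ratios.items():
    print(f"{kernel:22s} {ratio:8.1f} ops per data item")
```

The Fourier transform has by far the lowest ratio, which is consistent with the slide's remark that it needs other processing to increase data re-use.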

Application area (CAE CFD)

Ax = b

N equations N unknowns

N iterations

ClearSpeed accelerates the DGEMM kernel of equation solving that takes over 90% of the time.

[Figure: solution volume with four regions labeled DGEMM]

Volume = (1/3) N^3 multiply-adds

Solving N equations takes order N^3 work
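The one-third N cubed figure can be checked by counting the multiply-adds that Gaussian elimination performs on the trailing submatrix at each pivot step. This is a counting sketch only, not ClearSpeed's solver; the function name is illustrative.

```python
def elimination_multiply_adds(n):
    """Count multiply-adds spent updating the trailing submatrix
    during Gaussian elimination of an n x n system (no pivot search)."""
    count = 0
    for k in range(n):
        trailing = n - 1 - k          # rows and columns left after pivot k
        count += trailing * trailing  # one multiply-add per trailing entry
    return count


for n in (10, 100, 1000):
    exact = elimination_multiply_adds(n)
    ratio = exact / (n**3 / 3)
    print(f"N={n:5d}  multiply-adds={exact:12d}  ratio to N^3/3 = {ratio:.4f}")
```

The ratio approaches 1 as N grows. Blocked solvers group these trailing updates into matrix-matrix products, which is why DGEMM dominates the run time.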

Accelerating sparse solvers: ANSYS & LS-DYNA

• Potentially pure plug-and-play
• No added license fee
• Demands ClearSpeed’s 64-bit precision and speed
• Enabled by recent DGEMM improvements; still needs symmetric A^T A modification
• Could enable some Computational Fluid Dynamics acceleration (codes based on finite elements)

[Figure: 10 million degrees of freedom (sparse) becomes 50,000 dense equations, which the accelerator can solve at over 50 GFLOPS. Run-time bars (non-solver, solver setup, DGEMM) compare DGEMM on the x86 host with DGEMM with ClearSpeed: a 10x DGEMM speedup gives 3.6x net application acceleration.]
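The relationship between the 10x DGEMM speedup and the 3.6x net application speedup is an instance of Amdahl's law. The slide does not state what fraction of the run time DGEMM occupies, so the sketch below infers it from the two quoted numbers; the function names are illustrative.

```python
def net_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when only part of the run is accelerated."""
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup
    return 1.0 / remaining


def implied_fraction(net, kernel_speedup):
    """Invert Amdahl's law to find the accelerated fraction of run time."""
    return (1.0 - 1.0 / net) / (1.0 - 1.0 / kernel_speedup)


# Figures from the slide: 10x on DGEMM, 3.6x net.
fraction = implied_fraction(3.6, 10.0)
print(f"implied DGEMM share of run time: {fraction:.1%}")
print(f"check: net speedup = {net_speedup(fraction, 10.0):.2f}x")
```

Under this reading, DGEMM would account for about 80% of the original run time, with non-solver work and solver setup making up the rest.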

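MATLAB/Mathematica DGEMM performance

Using the new DGEMM: 62.16 GFLOPS in MATLAB 2006a on a Xeon system

• AMBER (porting complete): accelerated 4~10x. AMBER is a representative molecular dynamics analysis and simulation program. It is widely used for molecular-level dynamics simulation, and in biology it is also widely used for behavior analysis and simulation of protein structures. AMBER is commercial, while GROMACS is freeware.

Because these programs perform molecular dynamics analysis and simulation, they require many large-scale matrix calculations and, including the parallel computation parts, rely heavily on numerical computation.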

Amber acceleration

http://www.vigyaancd.org/images/gromacs.jpg

AMBER module    Host         Advance X620    Speedup
Gen. Born 1     83.5 min.    24.6 min.       3.4x
Gen. Born 2     84.6 min.    23.5 min.       3.6x
Gen. Born 6     37.9 min.    4.0 min.        9.4x

Host: 2.8 GHz Pentium 4 EM64T, OS: RHEL4-64, CSXL: version 2.50

Amber acceleration

Monte Carlo methods exploit high local bandwidth

• Monte Carlo methods are ideal for ClearSpeed acceleration:
  – High regularity and locality of the algorithm
  – Very high compute to I/O ratio
  – Very good scalability to high degrees of parallelism
  – Needs 64-bit
• Excellent results for parallelization
  – Achieving 10x performance per Advance card vs. highly optimized code on the fastest x86 CPUs available today
  – Maintains high precision required by the computations: true 64-bit IEEE 754 floating point throughout
  – 25 W per card typical when card is computing
• ClearSpeed has a Monte Carlo example code, available in source form for evaluation

CSX600 Monte Carlo Improvement I

Monte Carlo scales like the NAS “EP” benchmark

CSX600 Monte Carlo Improvement II

• 1 CPU, no acceleration: 400M samples, 60 seconds
• 1 Advance board: 400M samples, 2.9 seconds, 20x speedup
• 2 Advance boards: 400M samples, 1.5 seconds, 40x speedup
• 4 Advance boards: 400M samples, 0.8 seconds, 79x speedup
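The near-linear scaling reported above is typical of embarrassingly parallel Monte Carlo: each worker draws its own samples, and only one count per worker is combined at the end. The sketch below shows that structure with a textbook estimate of pi. It is not ClearSpeed's example code; the workers run one after another here, and the function names and seeds are illustrative.

```python
import random


def count_hits(samples, seed):
    """One worker: count random points that fall inside the unit quarter-circle."""
    rng = random.Random(seed)  # each worker has its own independent generator
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(total_samples, workers):
    """Split the samples across workers and combine one integer per worker."""
    per_worker = total_samples // workers
    hits = sum(count_hits(per_worker, seed=w) for w in range(workers))
    return 4.0 * hits / (per_worker * workers)


pi_estimate = estimate_pi(200_000, workers=4)
print(f"pi is approximately {pi_estimate:.4f}")
```

Because the workers never exchange data during sampling, adding boards divides the work without adding communication.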

Why do Monte Carlo apps need 64-bit?

• Accuracy increases as the square root of the number of trials, so five-decimal accuracy takes 10 billion trials.
• But, when you sum many similar values, you start to scrape off all the significant digits.
• 64-bit summation needed to get a single-precision result!

Single precision: 1.0000 × 10^8 + 1 = 1.0000 × 10^8

Double precision: 1.0000 × 10^8 + 1 = 1.00000001 × 10^8
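Both points on this slide can be checked in a few lines. The sketch below emulates 32-bit arithmetic by rounding through the standard `struct` module, and assumes the Monte Carlo error shrinks as one over the square root of the trial count with a constant of 1. The helper names are illustrative.

```python
import struct


def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_float32(values):
    """Accumulate values with every partial sum rounded to 32 bits."""
    total = 0.0
    for v in values:
        total = to_float32(total + to_float32(v))
    return total


def trials_for_accuracy(error):
    """Trials needed if error ~ 1/sqrt(trials), with the constant assumed to be 1."""
    return round(1.0 / error**2)


single = to_float32(to_float32(1.0e8) + 1.0)   # the added 1 is lost
double = 1.0e8 + 1.0                           # the added 1 is kept

print(f"single: {single:.1f}")
print(f"double: {double:.1f}")
print(f"32-bit sum of 1e8 plus ten ones: {sum_float32([1.0e8] + [1.0] * 10):.1f}")
print(f"trials for 1e-5 accuracy: {trials_for_accuracy(1e-5):,}")
```

Near 10^8, adjacent 32-bit floats are 8 apart, so each added 1 is rounded away no matter how many are accumulated.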

CSX600 Monte Carlo Improvement III

CSX600 Financial Application

Black-Scholes analytic pricing formula

Binomial method

Monte Carlo

Finite difference method

Broadie-Glasserman random tree method

Overall speedup ranges from 2x to 70x

Precision: double

Customizing

CSX600 supported software development environment

Windows series

CSX600 Summary & discussion

• For acceleration of numerically-intensive codes, such as
  – Math functions (RNGs, sin, cos, log, exp, sqrt etc.)
  – Standard libraries (Level 3 BLAS, LAPACK, FFTs)
• ~70 GFLOPS sustained from a ClearSpeed Advance™ board
• Accelerated functions are callable from C/C++ or Fortran
• ~25 watts per single-slot board
  – Performance measured in GFLOPS per watt rather than MFLOPS per watt
• Multiple ClearSpeed Advance™ boards can be used together for even higher performance and compute density
• The current ClearSpeed board is a mature product in production systems at a Top500 site
• ClearSpeed can deliver:
  – > 100 TFLOPS for June Top500 and
  – > 1 PFLOPS for November 2006 Top500
