TRANSCRIPT
ScaleMP – The High-End Virtualization Company
SERVER AGGREGATION – CREATING THE POWER OF ONE
Nir Paikowsky, VP Services
SDSC Summer Institute – ScaleMP Introduction
Agenda
• Introduction to ScaleMP and vSMP Foundation
• Customer Use Cases
• Example of using OpenMP and R
• Overview of vSMP Foundation Developer Productivity Tools
300+ Customers Worldwide
Israel, Qatar, India, China, Brazil, South Africa, Canada, US, Russia, Spain, France, Belgium, Denmark, UK, Korea, HK, Vietnam, Malaysia, Taiwan, Poland, Italy, Germany, NL, Australia, Norway, Japan
Creating the Power of One
N x Servers, N x OS → 1 VM, 1 OS
ScaleMP Fact Sheet
• Founded in 2003
• Software-based shared-memory system
• World's largest SMP: 32,768 CPUs & 256 TB RAM
• Processor- and interconnect-agnostic
• Product shipping since 2006
• Channel-only business model
• 300+ customers worldwide
Selected Customers and Applications
Life Sciences
• Computational Chemistry: AMBER, CFOUR, DOCK, GAMESS, Gaussian, GOLD, NWChem, Octopus, OpenEye FRED, OpenEye OMEGA, Schrödinger Jaguar, Schrödinger Glide, SCM ADF, VASP
• Molecular Dynamics: GROMACS, MOLPRO, NAMD, OpenEye ROCS, Schrödinger Desmond, Turbomole
Numerical Simulations
• Octave, R, MathWorks MATLAB, Wolfram Mathematica
Manufacturing
• Structural Mechanics: ABAQUS/Explicit, ABAQUS/Standard, Altair RADIOSS, ANSYS Mechanical, LSTC LS-DYNA, NASTRAN, TNO DIANA
• Fluid Dynamics: ANSYS CFX, AVL FIRE, EXA PowerFlow, EZNSS, FLUENT (+TGrid), GeoDict, MHD3D, NASA Cart3D, STAR-CCM+, STAR-CD, TGrid
• Other: COMSOL, inTrace OpenRT
Energy
• IMEX, Norsar 3D, Paradigm GeoDepth, Schlumberger ECLIPSE
EDA
• Cadence, HSPICE, Mentor, Quartus, Silvaco SmartSpice, Synopsys
Bio-informatics
• 454/Newbler, ABySS, Bowtie, CLC Bio, FASTA, HMMER, Illumina, mpiBLAST, SOAPdenovo, Velvet
Weather Forecasting
• MITgcm, MM5 (MPI & OpenMP), MOM4, WRF
Finance
• KX, Wombat
…and many more homegrown applications
Server Virtualization
• PARTITIONING: a hypervisor (VMM) divides a single physical server into multiple virtual machines, each running its own OS and applications, so every VM gets a subset of the physical resources.
• AGGREGATION: a single virtual machine, running one OS and its applications, spans multiple physical servers (each running the hypervisor/VMM), so the VM gets a concatenation of the physical resources.
Usage Models for Server Aggregation
• Large Memory • I/O Intensive • Compute Intensive • Consolidation
[Diagram: four boards, each with CPUs, memory, and I/O, aggregated into a single virtual SMP under one operating system, running either a single large application or many applications side by side]
Boot Process
[Screenshot: per-node view, 2 quad-core processors and 48 GB RAM; vSMP Foundation is loaded from network or flash]
Boot Process (cont.)
[Screenshot: 16 nodes found; 768 GB RAM in total, of which 66 GB is used by vSMP Foundation; the resulting VM will have 32 processors (128 cores)]
Loading the OS
[Screenshot: kernel boot messages, Intel 5000-series processors, device detection, the OS detecting the 126th processor; ready to rock'n'roll]
vSMP Foundation at Work
[Screenshot: the OS detects 691 GB RAM, all drives, and all PCI devices; the application uses >400 GB RAM]
STREAM-OMP (Triad)
[Chart: Triad bandwidth (MB/s) and efficiency vs. # cores, from 1 to 768. Bandwidth scales from 9,297 MB/s on 1 core to 2,259,709 MB/s on 768 cores; board-level efficiency stays between 100% and 87%, while per-CPU efficiency settles around 32-36%.]
More performance info at http://www.ScaleMP.com/performance
ABAQUS/STANDARD
LARGE MEMORY: ABOUT 180GB OF MEMORY + I/O
• Performance comparison of:
– A single board with growing amounts of memory
– The aggregated memory of 4 systems with vSMP Foundation
• Compute is limited to 8 cores on the first node; the workload requires 173GB RAM
• vSMP Foundation provides better performance than a system with the maximum amount of memory:
– ~6X compared to the standard single board
– 1.5X compared to the largest-memory configuration
• System configuration:
– 1 x dual-socket (Intel Xeon X5570 @ 2.93 GHz), 48GB up to 128GB RAM
– vSMP Foundation aggregated system: 4 x dual-socket servers (Intel Xeon X5570 @ 2.93 GHz), 48GB per board
[Chart: Runtime (sec) vs. total RAM: single board 232,665s (48GB), 127,744s (96GB), 59,964s (128GB); vSMP Foundation, 4 x 48GB = 192GB: 39,467s]
vSMP Foundation – Solutions
• vSMP Foundation for SMP: a single (large) system. CAPEX savings for compute- and memory-demanding applications.
• vSMP Foundation for Cluster: a single operating system. OPEX savings through cluster management and server consolidation.
• vSMP Foundation for Cloud: a single infrastructure. Flexibility (CAPEX and OPEX savings); a cloud enabler for on-demand infrastructure.
Legacy vs. On-demand Datacenter
Legacy datacenter:
• Mixed infrastructure: the best machine for each workload, because one size doesn't fit all
– 128-node cluster: 128 x 16 (2,048) cores, 4GB RAM per core
– Fat-node cluster: 4 x 512GB (2,048 GB) RAM
– Shared-memory system: 256 cores
• High maintenance
• Complicated provisioning
• Can't accommodate changes in demand over time
• A budget bet!
Legacy vs. On-demand Datacenter (cont.)
On-demand datacenter: a single 192-node cluster replaces all three legacy systems
• 192 x 16 (3,072) cores; 75% with 4GB RAM per core, 25% with 8GB RAM per core
• Provisioned on demand as, for example, a 256-core VM, a 756GB-RAM VM, or a 512GB-RAM VM
Benefits:
• 25% more compute resources
• Smaller rack-space footprint
• Lower power consumption
• Pay as you grow
On-demand Infrastructure
• More capacity per $: a single, cost-effective platform across the board
• Increased utilization: shape the resource to the application's usage model; larger, unified, less-fragmented run-queues
• Greater customer satisfaction: on-demand resource allocation
• Choice of provisioning system: Adaptive Computing MOAB, Bright Cluster Manager, HP Insight CMU, IBM xCAT, ROCKS, IBM Platform
• Workflow: Create → Add Nodes → Run
Products Comparison

                                     Flexibility     Scalability      Performance             Licensing
vSMP Foundation Advanced Platform    On-demand SMP   Up to 128 nodes  Multi-rail InfiniBand   Floating license
vSMP Foundation                      Static VM       Up to 32 nodes   Single-rail InfiniBand  Node-locked license

Detailed comparison: http://www.ScaleMP.com/compare
Customer Use Cases: SMB MANUFACTURING COMPANY
Large-memory system based on commodity hardware
• Customer: design & manufacturing of light- to heavy-duty trucks
• Previous platform: SGI Altix
• Challenges:
– Run a variety of commercial engineering applications
– Support distributed as well as large-memory workloads
– Grow as needed (and as budget allows)
• Applications: ABAQUS/Standard, STAR-CCM+, FLUENT, etc.
• Solution: blades with Intel dual processors
– 4 x Nehalem-EP + 6 x Westmere-EP, 768 GB RAM, 104 cores
• Benefits:
– Performance: 2.6X better than the SGI Altix
– Large-memory workloads run on the same cluster hardware
– Efficient use of software licenses
– Capacity can grow while leveraging the existing investment
ABAQUS/STANDARD – LARGE MEMORY: ABOUT 180GB OF MEMORY + I/O
(Same benchmark slide as shown earlier: vSMP Foundation aggregating 4 x dual-socket X5570 boards runs ~6X faster than the standard single board and 1.5X faster than the largest single-board memory configuration.)
ABAQUS/STANDARD ON SDSC/GORDON
LARGE MEMORY: ABOUT 180GB OF MEMORY + I/O
• Performance comparison of a native node vs. aggregated memory with vSMP Foundation; compute is limited to 8 cores on the first node, and the workload requires 173GB RAM
• vSMP Foundation outperforms the native compute node on the same workload
• System configuration:
– SDSC/Gordon with vSMP Foundation: 32 x dual-socket servers (Intel Xeon E5-2670 @ 2.60 GHz, 64 GB RAM), HT off, Turbo Boost not enabled
– SDSC/Gordon native compute node: 2 x dual-socket servers (Intel Xeon E5-2670 @ 2.60 GHz, 64 GB RAM), HT off, Turbo Boost not enabled
[Chart: Runtime (sec) at 8 and 16 cores; reported runtimes of 37,716s and 27,064s for Gordon with vSMP Foundation vs. 66,274s for the Gordon native node]
ABAQUS/STANDARD
MIXED WORKLOAD: 3 ABAQUS JOBS RUNNING CONCURRENTLY
• Performance comparison of 3 ABAQUS jobs running concurrently on:
– An SGI Altix system
– An aggregated system with vSMP Foundation
• vSMP Foundation provides superior performance to hardware-based SMP: up to 2.6X better
• System configuration:
– SGI Altix 350: 32 cores @ 1.6 GHz, 256 GB RAM
– vSMP Foundation aggregated system: 8 x dual-socket servers (Intel Xeon X5570 @ 2.93 GHz, 48GB per board); total 64 cores, 384GB RAM, SW-RAID
[Chart: Runtime (sec) for Model A (121GB RAM, 93GB I/O), Model B (60GB RAM, 113GB I/O), Model C (73GB RAM, 30GB I/O), and the total group (254GB RAM, 236GB I/O); SGI Altix: 36,240s / 40,320s / 11,618s / 40,320s vs. vSMP Foundation: 29,316s / 23,802s / 4,467s / 29,316s]
Customer Use Cases: NEXT GENERATION SEQUENCING
Homogeneous yet flexible environment, supporting large-memory and throughput jobs
• Customer: Roche / 454 Life Sciences
• Previous platform: large-scale clusters
• Challenges:
– Provide a large-memory system for genome/transcriptome assembly and mapping
– Support I/O-intensive workloads
– Rely on the existing IT standard of HP dual-socket blades
• Applications: 454/Newbler, signal processing, BLAST/MegaBLAST, Velvet, MATLAB, …
• Solution:
– Leveraging the IT-standard HP BL460 for a total of 96 cores, 768 GB RAM, 1 TB of internal storage, and 40 TB of external fast storage
– On-demand solution
• Benefits:
– Large memory: the feasibility of running large-memory workloads
– Performance: faster CPUs for serial workloads; significant performance improvement (5X)
– Simplicity: homogeneous yet flexible environment
454: Customer Experience & Quotes
• "One of our scientists is testing a dataset we had not been able to run on anything else before, as all the other large-RAM systems are constantly in use by other departments or don't have enough RAM (more than 256 GB needed)."
• "It does not make sense to purchase multiple large-RAM systems when the need for large genome assembly is not constant; the system would be idle for days, plus multiple systems require more administration."
• "The ideal situation for us is when the hardware can be abstracted and reconfigured on demand, which allows for a better core/RAM balance and better resource utilization."
454 NEWBLER – DNA SEQUENCING
• Performance comparison of a DNA-sequencing workload
• The workload requires large amounts of memory, performs extensive I/O, and uses no more than 4-8 cores during most of the run
• vSMP Foundation provides improved performance:
– Large sequencing inputs that are otherwise not feasible can be run
– The aggregated RAM is used for I/O
– 2 to 5X better performance
• System configuration:
– RAM-drive: vSMP Foundation, 16 x dual-socket servers (Intel Xeon X5570 @ 2.93 GHz, 48 GB RAM)
– HDD: dual-socket (X5570 @ 2.93 GHz, 24GB RAM)
[Chart: Runtime (hours) and RAM (GB) per dataset:
– Yeast (Saccharomyces cerevisiae), 12M BP (2GB): HDDs 0.5h, RAM-drive 0.3h, 6 GB RAM
– 10% of full human genome (1000 Genomes Project), 330M BP (50GB): HDDs 10.0h, RAM-drive 1.9h, 80 GB RAM
– 30% of full human genome (1000 Genomes Project), 990M BP (150GB): HDDs ran out of memory after 40 hours, RAM-drive 7.2h, 240 GB RAM]
Customer Use Cases: NEXT GENERATION SEQUENCING FOR PLANT BIOLOGY
Highly utilized large-memory infrastructure
• Customer: Samuel Roberts Noble Foundation
• Other platform: cluster
• Challenges: process large amounts of data, requiring >100 cores and 2TB-8TB RAM
• Applications: SOAPdenovo, Velvet, Bowtie, TopHat, etc.
• Solution: starting with 120 cores and 2TB RAM in a pay-as-you-grow model, based on 10 x HP DL380 G7
• Benefits:
– Provides large addressable memory
– Enables new research that wasn't possible before
– Reduced runtime by 2X-3X
– Ability to grow: the customer can add systems as needed, based on upcoming funding
– Cost-effective solution
Performance Evaluation

Assembly Task & Scale      FatBoy (192GB mem)                     ScaleMP
100,000 454 (MIRA)         19 mins                                18 mins
44.7M Solexa (Velvet)      65 mins                                44 mins
0.5M 454 (MIRA)            8 hrs 36 mins (192GB mem maxed out)    3 hrs 1 min
4.5M 454 (MIRA)            54 days (192GB mem maxed out)          33 days
Customer Use Cases: DNA SEQUENCING AND BIOINFORMATICS RESEARCH CENTER
Cost-effective SMP for large-memory problems
• Customer: University of Florida, Bioinformatics Research Center
• Previous platform: HPC cluster
• Problems:
– Processing DNA-sequencing data requires large amounts of memory (300GB and growing rapidly)
– The complete end-to-end processing flow involves steps that require, or can benefit from, large memory or fast I/O
– Would like the ability to use OpenMP for certain phases of the overall process
• Applications: Roche/454 Newbler, home-grown code, …
• Solution:
– 16 Intel dual-processor Xeon systems providing 768 GB RAM and 32 sockets (128 cores) as a single virtual system running Linux with vSMP Foundation
– The solution was later extended to 32 blades (32 sockets) in a "Cloud" configuration
• Benefits:
– Better performance: 2X-10X over the existing solution
– Ability to solve larger problems
– Versatility: the complete process flow runs on a single system
Example: Mapping Time
[Chart: Mapping time (hours, 0-10) for the Total, Linear, and Parallel phases, ScaleMP vs. the HPC cluster]
Customer Use Cases: LARGE MEMORY FOR GENOMICS AS WELL AS OTHER SCIENCE DOMAINS
A flexible and versatile computing platform, delivered cost-effectively
• Customer: Brown University, Center for Computation and Visualization
• Challenges:
– Support a wide and diverse community of computational scientists
– Provide a large-memory system for genome assembly (>1TB RAM)
– Support I/O-intensive workloads
– Support highly computational workloads
• Applications: Velvet (>900GB assemblies), MATLAB, home-grown
• Solution:
– 16 x Intel dual-socket servers, each with 96GB RAM and 8 cores (part of a 200+ node cluster)
– Total: 128 cores, 1.5 TB RAM
– GPFS for fast access to external storage
• Benefits:
– Provides large addressable memory
– Delivers the flexibility to perform a variety of high-performance computing tasks
– Enables new research that wasn't possible before
– Cost-effective solution
"… the first well-balanced high performance computing platform that Brown University has had to support a wide variety of applications."
—Sam Fulcomer, associate director for the Center for Computation and Visualization, Brown University
VELVET – PUBLIC DATASET, UNIVERSITY OF MARYLAND CENTER FOR BIOINFORMATICS
• Performance comparison of a Velvet job requiring ~220GB RAM:
– A single board with 256 GB of memory
– Aggregated systems with 288 GB of memory (3 boards x 96GB)
• vSMP Foundation reduces runtime by over 25%, thanks to the ability to combine faster dual-socket processors with large amounts of memory
• Systems configuration:
– 1 x quad-socket: X7560 @ 2.26 GHz, 256GB RAM
– vSMP Foundation aggregated system: 3 x dual-socket servers (Intel Xeon X5570 @ 2.93 GHz), 96GB per board
[Chart: Runtime (hours) vs. total RAM: Nehalem-EX X7560 @ 2.26GHz, 1 x 256GB: 7.1h; vSMP Foundation, 3 x 96GB = 288GB: 5.2h]
• Public dataset: http://cichlid.umd.edu/download/reads.shuffled.total_RNA_no_exp_cov_230_20.fasta.gz
Customer Use Cases: CHEMISTRY DEPARTMENT
Cost-effective SMP for large-memory and high-core-count jobs
• Customer: Southern Methodist University, Computational and Theoretical Chemistry Group: developing a new approach for investigating the mechanisms of chemical reactions (URVA: Unified Reaction Valley Approach)
• Previous platform: SGI Altix
• Challenges:
– Prof. Elfi Kraka, who joined SMU in 2009, was looking for an SMP system to replace the SGI Altix system she had worked with
– Ability to run large Gaussian jobs (both large core counts and large amounts of memory)
– Ability to fit into a limited budget
• Applications: Gaussian (03 & 09), CFOUR, others
• Solution:
– C7000 with 16 Intel dual-processor Nehalem-EP blades (BL460)
– Total: 128 cores @ 2.93 GHz, 768 GB RAM, >10TB of storage
• Benefits:
– Better performance: 5X-10X over the previous solution
– Ability to solve larger problems
– Simplification: easy to work with a single system
SMU: Customer Experience & Quotes
End user:
• For Gaussian jobs, our new system is 9 times faster than the previous SGI Altix: ~30 days for a job on the Altix vs. 3-5 days on the current system
• The system is used 100% of the time
• The biggest benefit is that it is a single system to work with
Sys-admin:
• The system is very easy to administer
• On a day-to-day basis there is basically no maintenance required beyond a standard Linux system
• We would like to expand with the Cloud version as soon as funding is available
GAUSSIAN: VERY LARGE MODEL (1) – CUSTOMER MOLECULE
• Comparison of a very large Gaussian workload running on single systems using HDDs vs. vSMP Foundation using a RAM-drive for I/O
• vSMP Foundation provides excellent performance:
– ~10X faster than the 8-core run
– ~4X faster than the quad-socket system
• System configuration:
– vSMP Foundation: 11 x dual-socket servers (Intel Xeon X5660 @ 2.80 GHz, 48 GB RAM), packed placement
– Dual-socket system: Intel E5640 @ 2.67 GHz, 48 GB RAM
– Quad-socket system: Intel X7550 @ 2.00 GHz, 128 GB RAM
[Chart: Runtime (sec) vs. # cores (8-128): dual-socket E5640, 8 cores: 248,257s; quad-socket X7550, 32 cores: 96,630s; vSMP Foundation at higher core counts: 49,187s, 30,158s, 25,483s]
GAUSSIAN: VERY LARGE MODEL (2) – CUSTOMER MOLECULE
• Comparison of a very large Gaussian workload
• vSMP Foundation provides excellent performance compared to a native 8-socket system
• System configuration:
– Native: 8-socket server, Intel Xeon E7-2850 @ 2.00GHz, 512 GB RAM
– SDSC/Gordon: 32 x dual-socket servers (Intel Xeon E5-2670 @ 2.60 GHz, 64 GB RAM), HT off, Turbo Boost not enabled
[Chart: Runtime (sec) vs. # cores (20-80); reported runtimes of 35,380s, 14,476s, 48,236s, 26,000s, and 23,101s across the Gordon/vSMP Foundation runs and the native 8-socket runs]
GAUSSIAN: LARGE MEMORY VS. I/O – CUSTOMER MOLECULE
• Comparison of a very large Gaussian workload:
– A single system using HDDs
– An aggregated system with vSMP Foundation, using a RAM-drive for I/O
• vSMP Foundation provides improved performance:
– Scales with the aggregated memory for I/O (using the RAM-drive)
– 2.0X faster than HDD performance (8 cores)
– 2.35X faster with a higher core count (16 cores)
• System configuration:
– vSMP Foundation: 16 x dual-socket servers (Intel Xeon X5570 @ 2.93 GHz, 48 GB RAM)
– Comparable system: dual-socket (Intel X5570 @ 2.93 GHz, 48 GB RAM)
[Chart: Runtime (sec) and performance relative to the 4-core HDD run: HDDs 4,258s (4 cores), 3,832s (8 cores); RAM-drive 3,038s (4 cores, 1.40X), 2,131s (8 cores, 2.00X), 1,812s (16 cores, 2.35X)]
Customer Use Cases: MEDICAL RESEARCH INSTITUTE
Large memory for multi-threaded programming
• Customer: medical research institute
• Previous platform: HP Superdome system
• Challenges:
– High-performance image processing on very large MRI scans
– Scanned data for a single run currently exceeds 200GB; memory requirements are expected to grow significantly with the introduction of full-body scans with more sensors
– Would like to use commercial tools for faster development
– Would like to standardize on the x86 architecture for lower costs and open standards
• Applications: Siemens CT processing, MATLAB, BLAS, home-grown code, …
• Solution: 16 Intel dual-processor Xeon systems providing 1TB RAM and 32 sockets (128 cores) as a single virtual system running Linux with vSMP Foundation
• Benefits:
– Better performance: the solution was evaluated and found to be faster than any alternative system
– Cost: significant savings compared to the alternative system (an order of magnitude)
– Versatility: also used for MPI jobs as part of a large cluster
MATLAB – THREADED (1): 16,000 x 16,000 MATRIX MULTIPLICATION
• Performance of matrix multiplication with the threaded version of MATLAB
• vSMP Foundation lets end users scale their MATLAB jobs while running them exactly as they would on a desktop
• System configuration: vSMP Foundation, 16 x dual-socket servers (Intel Xeon X5570 @ 2.93 GHz, 48 GB RAM)
[Chart: Runtime (sec) and speedup vs. # cores: 1 core 745s (1.00X); 4 cores 197s (3.77X); 8 cores 101s (7.37X); 16 cores 58s (12.86X); 32 cores 42s (17.91X); 64 cores 35s (21.54X)]
MATLAB – THREADED (3): SINGLE SIMULATION, SOLVING LARGE PROBLEMS
• Performance of a single MATLAB simulation on an increasing number of cores
• The simulation requires 60GB of RAM (40% over a single server's RAM capacity), using matrices of 65K
• System configuration: vSMP Foundation, 11 x dual-socket servers (Intel Xeon X5660 @ 2.80 GHz, 48 GB RAM @ 1066MHz)
[Chart: Compute time (sec) and speedup vs. # cores (matrix size 65K, 60GB RAM): 12 cores 1,870s (1.00X); 24 cores 1,226s (1.53X); 48 cores 897s (2.09X)]
The "Pseudo-Code"

/* run through rows of Matrix A */
for (i = 0; i < NRA; i++)
    /* run through columns of Matrix B */
    for (k = 0; k < NCB; k++)
        /* initialize cell content of Matrix C */
        c[i][k] = 0.0
        /* run through columns of Matrix A */
        for (j = 0; j < NCA; j++)
            /* calculate value of cell in Matrix C */
            c[i][k] += a[i][j] * b[j][k]
        next
    next
next
Serial Code

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h> /* gettimeofday() and struct timeval */
#define dt(start, end) ((end.tv_sec - start.tv_sec) + \
    1/1000000.0*(end.tv_usec - start.tv_usec))
#define NRA 9000 /* number of rows in matrix A */
#define NCA 1500 /* number of columns in matrix A */
#define NCB 700  /* number of columns in matrix B */
/* matrix A/B to be multiplied, C is the result */
double a[NRA][NCA], b[NCA][NCB], c[NRA][NCB];
int main(int argc, char **argv){
    int i, j, k;
    struct timeval srun, scalc, ecalc;
    gettimeofday(&srun, NULL);
    for (i = 0; i < NRA; i++) /* initialize matrix A */
        for (j = 0; j < NCA; j++)
            a[i][j] = (double)(i + j);
    for (i = 0; i < NCA; i++) /* initialize matrix B */
        for (j = 0; j < NCB; j++)
            b[i][j] = (double)(i * j);
    gettimeofday(&scalc, NULL);
    for (i = 0; i < NRA; i++) /* C = A * B */
        for (k = 0; k < NCB; k++) {
            c[i][k] = 0.0;
            for (j = 0; j < NCA; j++)
                c[i][k] += a[i][j] * b[j][k];
        }
    gettimeofday(&ecalc, NULL);
    printf("%s: Init time: %3.02f Calc time: %3.02f\n",
        argv[0], dt(srun, scalc), dt(scalc, ecalc));
}

"sed '/^$/d' serial.c | wc -l" = 31 lines of code
Serial Code vs. Parallel Code (OpenMP)

The OpenMP version is identical to the serial code on the previous slide, except for one added line before the multiplication loop:

    gettimeofday(&scalc, NULL);
#pragma omp parallel for shared(a,b,c) private(i,j,k) schedule(dynamic)
    for (i = 0; i < NRA; i++) /* C = A * B */
        for (k = 0; k < NCB; k++) {
            c[i][k] = 0.0;
            for (j = 0; j < NCA; j++)
                c[i][k] += a[i][j] * b[j][k];
        }
    gettimeofday(&ecalc, NULL);

"sed '/^$/d' serial.c | wc -l" = 31 lines of code; "sed '/^$/d' openmp.c | wc -l" = 32 lines of code
Sample OpenMP Run Script

Example run script for an OpenMP application (openmp-app):

# Make sure to use Intel compilers to build the application
# Intel compiler runtime environment settings
source ../intell.sh
# Set the stack size to unlimited
ulimit -s unlimited
# ScaleMP preload library that throttles down unnecessary system calls
export LD_PRELOAD=/opt/ScaleMP/libvsmpclib/lib64/libvsmpclib.so
np=16
# Alternatively, use dynamic placement via numabind:
# export PATH=/opt/ScaleMP/numabind/bin:${PATH}
# export LD_LIBRARY_PATH=/opt/ScaleMP/numabind/lib
# The cpu/offset to start from is calculated dynamically by numabind:
# export KMP_AFFINITY=compact,verbose,0,`numabind --offset $np 2>/dev/null`
export OMP_NUM_THREADS=$np
/usr/bin/time ./openmp-app > log-openmp-app-$np.txt
# See /opt/ScaleMP/examples/OpenMP for more information
Parallelization Programming Model – SUMMARY

                 Serial   OpenMP   MPI
Lines of Code    31       32       105

Calculation time (sec.):
CPUs    Serial    OpenMP    MPI
1       121.17    -         -
2       -         62.48     65.96
4       -         31.66     36.14
8       -         15.63     19.67
16      -         9.27      11.60
32      -         5.29      8.50
64      -         3.70      9.97

OpenMP keeps scaling through 64 CPUs, while the MPI runtime bottoms out at 32 CPUs (8.50s) and rises again at 64 (9.97s).
Parallel R
• lapply: apply a function to each element of a list (for example, every variable in a data frame)
• mclapply: lapply for a multi-core environment
Benefits:
- Workers are forked, so there is no need to export data to all nodes
- All parallel jobs share the same state
- As a result, parallel execution is more efficient
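
A minimal sketch of the idea (the toy workload and the core count are illustrative assumptions; mclapply comes from the multicore package of that era, or from the parallel package bundled with modern R):

library(parallel)                       # provides mclapply (fork-based)

payload <- matrix(rnorm(1e6), 1000)     # data lives once, in the parent process
col_mean <- function(i) mean(payload[, i])

# Forked workers inherit 'payload'; nothing is exported or copied up front
res <- mclapply(seq_len(ncol(payload)), col_mean, mc.cores = 16)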
R – PARALLEL (THREADED): BIO-INFORMATICS, PROCESSING OF SHORT READS
• Performance of parallel R in a multi-core environment, using a growing number of workers
• The simulation requires over 250GB RAM
• vSMP Foundation leverages R's simple mechanism for multi-core scaling
• System configuration:
– SDSC/Gordon: 32 x dual-socket servers (Intel Xeon E5-2670 @ 2.60 GHz, 64 GB RAM), HT off, Turbo Boost not enabled
– 8-socket server: Intel E7-2850 @ 2.00GHz, 512 GB RAM
[Chart: Runtime (hours) vs. # cores (8-256): 8-socket E7-2850 server 10.61h, 6.35h, 3.87h at the lower core counts; Gordon with vSMP Foundation 8.38h, 5.50h, 2.83h, 1.57h, 0.84h, 0.56h as cores scale up to 256]
Example 1

library(foreach)
library(doMC)

nWorkers = 16
registerDoMC(nWorkers)
getDoParWorkers()

check <- function(n) {
    for (i in 1:100) {
        sme <- matrix(rnorm(160000), 400, 400)
        solve(sme)
    }
}

times <- 2000
system.time(x <- foreach(j = 1:times) %dopar% check(j))
Example 2
See: http://trg.apbionet.org/euasiagrid/docs/parallelR.notes.pdf
• We will use the sample data set geneData from the Biobase package in Bioconductor. This data set contains expression data for 500 genes (rows) in 26 samples (columns).
• We want to calculate the correlation coefficients of gene-expression values between all pairs of genes across these 26 samples, in order to find out which genes are under shared regulation. To calculate the correlation between two genes, say genes 12 and 13:
> cor(geneData[12, ], geneData[13, ])
Example 2 – cont.
To calculate the correlation for all gene pairs, we will use the lapply function, which takes a list as its first argument and passes each element of the list to the function specified as the second argument. The return value of lapply is a list of the same length as the input list, where each output element corresponds to an input element processed by the user-specified function.
We first define our own function, which accepts one element from the variable "pair" (a precomputed list of gene-index pairs) and returns the correlation coefficient between the pair of genes specified in that element:
> geneCor <- function(x, gene = geneData) {
+     cor(gene[x[1], ], gene[x[2], ]) }
To calculate the correlation between gene 12 and gene 13:
> geneCor(c(12, 13))
[1] 0.548477
Now we calculate the correlation coefficients for the first 3 pairs of genes using lapply and our geneCor:
> out <- lapply(pair[1:3], geneCor)
Example 2 – cont.
To measure the time taken to compute the correlations between all pairs stored in our variable "pair":
> system.time(out <- lapply(pair, geneCor))
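
Putting the pieces together, a hedged end-to-end sketch; how "pair" is built here, and the mc.cores value, are illustrative assumptions not shown on the slides. Swapping lapply for mclapply is all the parallelization requires:

library(Biobase)     # Bioconductor package that ships the geneData sample set
library(parallel)    # provides mclapply

data(geneData)       # 500 genes (rows) x 26 samples (columns)

# All unordered pairs of gene indices, as a list of length-2 vectors
pair <- combn(nrow(geneData), 2, simplify = FALSE)

geneCor <- function(x, gene = geneData) cor(gene[x[1], ], gene[x[2], ])

system.time(out <- lapply(pair, geneCor))                   # serial
system.time(out <- mclapply(pair, geneCor, mc.cores = 16))  # parallel, forked workers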
Performance != Efficiency
• vSMP Foundation is designed to give the user information about system efficiency: efficiency data reveals the major application-scalability issues that reduce system performance
• No changes to the application code are required, and timing is not affected
• Used to pinpoint scalability issues as well as for software tuning:
– Commercial codes: ANSYS, ABAQUS, Intel (MKL), HP (MPI), Paradigm Geophysical, MOLPRO, Schrödinger
– Open-source codes: Linux kernel 2.6, glibc, MPICH1, MPICH2
vSMProfile: Performance Tuning – PINPOINTING CACHE-MISSES
[Charts: two cache-miss profiles over time (x-axis 0-520, y-axis 0-120,000), breaking activity down into OS vs. application]
vSMProfile: Performance Tuning
• Points developers to the line of code that is causing performance problems
• Significantly reduces application tuning time
• Eliminates performance-optimization guesswork
• Not available on other architectures
Developer Productivity = Optimized Application