TRANSCRIPT
ScaleMP – The High-End Virtualization Company
SERVER AGGREGATION – CREATING THE POWER OF ONE
Nir Paikowsky, VP Services
SDSC Summer Institute – ScaleMP Introduction
Agenda
• Introduction to ScaleMP and vSMP Foundation
• Customer Use Cases
• Example of using OpenMP and R
• Overview of vSMP Foundation Developer Productivity Tools
300+ Customers Worldwide
Israel, Qatar, India, China, Brazil, South Africa, Canada, US, Russia, Spain, France, Belgium, Denmark, UK, Korea, HK, Vietnam, Malaysia, Taiwan, Poland, Italy, Germany, NL, Australia, Norway, Japan
Creating the Power of One
N x Servers, N x OS → 1 VM, 1 OS
ScaleMP Fact Sheet
• Founded in 2003
• Software-based shared-memory system
• World's largest SMP: 32,768 CPUs & 256 TB RAM
• Processor- and interconnect-agnostic
• Product shipping since 2006
• Channel-only business model
• 300+ customers worldwide
Selected Customers and Applications
Life Sciences
• Computational Chemistry: AMBER, CFOUR, DOCK, GAMESS, Gaussian, GOLD, NWChem, Octopus, OpenEye FRED, OpenEye OMEGA, Schrödinger Jaguar, Schrödinger Glide, SCM ADF, VASP
• Molecular Dynamics: GROMACS, MOLPRO, NAMD, OpenEye ROCS, Schrödinger Desmond, Turbomole
Numerical Simulations
• Octave, R, MathWorks MATLAB, Wolfram Mathematica
Manufacturing
• Structural Mechanics: ABAQUS/Explicit, ABAQUS/Standard, Altair RADIOSS, ANSYS Mechanical, LSTC LS-DYNA, NASTRAN, TNO DIANA
• Fluid Dynamics: ANSYS CFX, AVL FIRE, EXA PowerFlow, EZNSS, FLUENT (+TGrid), GeoDict, MHD3D, NASA Cart3D, STAR-CCM+, STAR-CD, TGrid
• Other: COMSOL, inTrace OpenRT
Energy
• IMEX, Norsar 3D, Paradigm GeoDepth, Schlumberger ECLIPSE
EDA
• Cadence, HSPICE, Mentor, Quartus, Silvaco SmartSpice, Synopsys
Bio-informatics
• 454/Newbler, ABySS, Bowtie, CLC Bio, FASTA, HMMER, Illumina, mpiBLAST, SOAPdenovo, Velvet
Weather Forecasting
• MITgcm, MM5 (MPI & OpenMP), MOM4, WRF
Finance
• KX, Wombat
…and many more homegrown applications
Server Virtualization
• PARTITIONING: a hypervisor (VMM) divides a single physical server into multiple virtual machines, each running its own OS and applications, so every VM gets a subset of the physical resources.
• AGGREGATION: a single virtual machine, running one OS and its applications, spans multiple physical servers (each running the hypervisor/VMM), so the VM gets a concatenation of the physical resources.
Usage Models for Server Aggregation
• Large Memory • I/O Intensive • Compute Intensive • Consolidation
[Diagram: four boards, each with CPUs, memory, and I/O, aggregated into a single virtual SMP under one operating system, running either a single large application or many applications side by side]
Boot Process
[Screenshot: per-node view, 2 quad-core processors and 48 GB RAM; vSMP Foundation is loaded from network or flash]
Boot Process (cont.)
[Screenshot: 16 nodes found; 768 GB RAM in total, of which 66 GB is used by vSMP Foundation; the resulting VM will have 32 processors (128 cores)]
Loading the OS
[Screenshot: kernel boot messages, Intel 5000-series processors, device detection, the OS detecting the 126th processor; ready to rock'n'roll]
vSMP Foundation at Work
[Screenshot: the OS detects 691 GB RAM, all drives, and all PCI devices; the application uses >400 GB RAM]
STREAM-OMP (Triad)
[Chart: Triad bandwidth (MB/s) and efficiency vs. # cores, from 1 to 768. Bandwidth scales from 9,297 MB/s on 1 core to 2,259,709 MB/s on 768 cores; board-level efficiency stays between 100% and 87%, while per-CPU efficiency settles around 32-36%.]
More performance info at http://www.ScaleMP.com/performance
ABAQUS/STANDARD
LARGE MEMORY: ABOUT 180GB OF MEMORY + I/O
• Performance comparison of:
– A single board with growing amounts of memory
– The aggregated memory of 4 systems with vSMP Foundation
• Compute is limited to 8 cores on the first node; the workload requires 173GB RAM
• vSMP Foundation provides better performance than a system with the maximum amount of memory:
– ~6X compared to the standard single board
– 1.5X compared to the largest-memory configuration
• System configuration:
– 1 x dual-socket (Intel Xeon X5570 @ 2.93 GHz), 48GB up to 128GB RAM
– vSMP Foundation aggregated system: 4 x dual-socket servers (Intel Xeon X5570 @ 2.93 GHz), 48GB per board
[Chart: Runtime (sec) vs. total RAM: single board 232,665s (48GB), 127,744s (96GB), 59,964s (128GB); vSMP Foundation, 4 x 48GB = 192GB: 39,467s]
vSMP Foundation – Solutions
• vSMP Foundation for SMP: a single (large) system. CAPEX savings for compute- and memory-demanding applications.
• vSMP Foundation for Cluster: a single operating system. OPEX savings through cluster management and server consolidation.
• vSMP Foundation for Cloud: a single infrastructure. Flexibility (CAPEX and OPEX savings); a cloud enabler for on-demand infrastructure.
Legacy vs. On-demand Datacenter
Legacy datacenter:
• Mixed infrastructure: the best machine for each workload, because one size doesn't fit all
– 128-node cluster: 128 x 16 (2,048) cores, 4GB RAM per core
– Fat-node cluster: 4 x 512GB (2,048 GB) RAM
– Shared-memory system: 256 cores
• High maintenance
• Complicated provisioning
• Can't accommodate changes in demand over time
• A budget bet!
Legacy vs. On-demand Datacenter (cont.)
On-demand datacenter: a single 192-node cluster replaces all three legacy systems
• 192 x 16 (3,072) cores; 75% with 4GB RAM per core, 25% with 8GB RAM per core
• Provisioned on demand as, for example, a 256-core VM, a 756GB-RAM VM, or a 512GB-RAM VM
Benefits:
• 25% more compute resources
• Smaller rack-space footprint
• Lower power consumption
• Pay as you grow
On-demand Infrastructure
• More capacity per $: a single, cost-effective platform across the board
• Increased utilization: shape the resource to the application's usage model; larger, unified, less-fragmented run-queues
• Greater customer satisfaction: on-demand resource allocation
• Choice of provisioning system: Adaptive Computing MOAB, Bright Cluster Manager, HP Insight CMU, IBM xCAT, ROCKS, IBM Platform
• Workflow: Create → Add Nodes → Run
Products Comparison

                                     Flexibility     Scalability      Performance             Licensing
vSMP Foundation Advanced Platform    On-demand SMP   Up to 128 nodes  Multi-rail InfiniBand   Floating license
vSMP Foundation                      Static VM       Up to 32 nodes   Single-rail InfiniBand  Node-locked license

Detailed comparison: http://www.ScaleMP.com/compare
Customer Use Cases: SMB MANUFACTURING COMPANY
Large-memory system based on commodity hardware
• Customer: design & manufacturing of light- to heavy-duty trucks
• Previous platform: SGI Altix
• Challenges:
– Run a variety of commercial engineering applications
– Support distributed as well as large-memory workloads
– Grow as needed (and as budget allows)
• Applications: ABAQUS/Standard, STAR-CCM+, FLUENT, etc.
• Solution: blades with Intel dual processors
– 4 x Nehalem-EP + 6 x Westmere-EP, 768 GB RAM, 104 cores
• Benefits:
– Performance: 2.6X better than the SGI Altix
– Large-memory workloads run on the same cluster hardware
– Efficient use of software licenses
– Capacity can grow while leveraging the existing investment
ABAQUS/STANDARD – LARGE MEMORY: ABOUT 180GB OF MEMORY + I/O
(Same benchmark slide as shown earlier: vSMP Foundation aggregating 4 x dual-socket X5570 boards runs ~6X faster than the standard single board and 1.5X faster than the largest single-board memory configuration.)
ABAQUS/STANDARD ON SDSC/GORDON
LARGE MEMORY: ABOUT 180GB OF MEMORY + I/O
• Performance comparison of a native node vs. aggregated memory with vSMP Foundation; compute is limited to 8 cores on the first node, and the workload requires 173GB RAM
• vSMP Foundation outperforms the native compute node on the same workload
• System configuration:
– SDSC/Gordon with vSMP Foundation: 32 x dual-socket servers (Intel Xeon E5-2670 @ 2.60 GHz, 64 GB RAM), HT off, Turbo Boost not enabled
– SDSC/Gordon native compute node: 2 x dual-socket servers (Intel Xeon E5-2670 @ 2.60 GHz, 64 GB RAM), HT off, Turbo Boost not enabled
[Chart: Runtime (sec) at 8 and 16 cores; reported runtimes of 37,716s and 27,064s for Gordon with vSMP Foundation vs. 66,274s for the Gordon native node]
ABAQUS/STANDARD
MIXED WORKLOAD: 3 ABAQUS JOBS RUNNING CONCURRENTLY
• Performance comparison of 3 ABAQUS jobs running concurrently on:
– An SGI Altix system
– An aggregated system with vSMP Foundation
• vSMP Foundation provides superior performance to hardware-based SMP: up to 2.6X better
• System configuration:
– SGI Altix 350: 32 cores @ 1.6 GHz, 256 GB RAM
– vSMP Foundation aggregated system: 8 x dual-socket servers (Intel Xeon X5570 @ 2.93 GHz, 48GB per board); total 64 cores, 384GB RAM, SW-RAID
[Chart: Runtime (sec) for Model A (121GB RAM, 93GB I/O), Model B (60GB RAM, 113GB I/O), Model C (73GB RAM, 30GB I/O), and the total group (254GB RAM, 236GB I/O); SGI Altix: 36,240s / 40,320s / 11,618s / 40,320s vs. vSMP Foundation: 29,316s / 23,802s / 4,467s / 29,316s]
Customer Use Cases: NEXT GENERATION SEQUENCING
Homogeneous yet flexible environment, supporting large-memory and throughput jobs
• Customer: Roche / 454 Life Sciences
• Previous platform: large-scale clusters
• Challenges:
– Provide a large-memory system for genome/transcriptome assembly and mapping
– Support I/O-intensive workloads
– Rely on the existing IT standard of HP dual-socket blades
• Applications: 454/Newbler, signal processing, BLAST/MegaBLAST, Velvet, MATLAB, …
• Solution:
– Leveraging the IT-standard HP BL460 for a total of 96 cores, 768 GB RAM, 1 TB of internal storage, and 40 TB of external fast storage
– On-demand solution
• Benefits:
– Large memory: the feasibility of running large-memory workloads
– Performance: faster CPUs for serial workloads; significant performance improvement (5X)
– Simplicity: homogeneous yet flexible environment
454: Customer Experience & Quotes
• "One of our scientists is testing a dataset we had not been able to run on anything else before, as all the other large-RAM systems are constantly in use by other departments or don't have enough RAM (more than 256 GB needed)."
• "It does not make sense to purchase multiple large-RAM systems when the need for large genome assembly is not constant; the system would be idle for days, plus multiple systems require more administration."
• "The ideal situation for us is when the hardware can be abstracted and reconfigured on demand, which allows for a better core/RAM balance and better resource utilization."
454 NEWBLER – DNA SEQUENCING
• Performance comparison of a DNA-sequencing workload
• The workload requires large amounts of memory, performs extensive I/O, and uses no more than 4-8 cores during most of the run
• vSMP Foundation provides improved performance:
– Large sequencing inputs that are otherwise not feasible can be run
– The aggregated RAM is used for I/O
– 2 to 5X better performance
• System configuration:
– RAM-drive: vSMP Foundation, 16 x dual-socket servers (Intel Xeon X5570 @ 2.93 GHz, 48 GB RAM)
– HDD: dual-socket (X5570 @ 2.93 GHz, 24GB RAM)
[Chart: Runtime (hours) and RAM (GB) per dataset:
– Yeast (Saccharomyces cerevisiae), 12M BP (2GB): HDDs 0.5h, RAM-drive 0.3h, 6 GB RAM
– 10% of full human genome (1000 Genomes Project), 330M BP (50GB): HDDs 10.0h, RAM-drive 1.9h, 80 GB RAM
– 30% of full human genome (1000 Genomes Project), 990M BP (150GB): HDDs ran out of memory after 40 hours, RAM-drive 7.2h, 240 GB RAM]
Customer Use Cases: NEXT GENERATION SEQUENCING FOR PLANT BIOLOGY
Highly utilized large-memory infrastructure
• Customer: Samuel Roberts Noble Foundation
• Other platform: cluster
• Challenges: process large amounts of data, requiring >100 cores and 2TB-8TB RAM
• Applications: SOAPdenovo, Velvet, Bowtie, TopHat, etc.
• Solution: starting with 120 cores and 2TB RAM in a pay-as-you-grow model, based on 10 x HP DL380 G7
• Benefits:
– Provides large addressable memory
– Enables new research that wasn't possible before
– Reduced runtime by 2X-3X
– Ability to grow: the customer can add systems as needed, based on upcoming funding
– Cost-effective solution
Performance Evaluation

Assembly Task & Scale      FatBoy (192GB mem)                     ScaleMP
100,000 454 (MIRA)         19 mins                                18 mins
44.7M Solexa (Velvet)      65 mins                                44 mins
0.5M 454 (MIRA)            8 hrs 36 mins (192GB mem maxed out)    3 hrs 1 min
4.5M 454 (MIRA)            54 days (192GB mem maxed out)          33 days
Customer Use Cases: DNA SEQUENCING AND BIOINFORMATICS RESEARCH CENTER
Cost-effective SMP for large-memory problems
• Customer: University of Florida, Bioinformatics Research Center
• Previous platform: HPC cluster
• Problems:
– Processing DNA-sequencing data requires large amounts of memory (300GB and growing rapidly)
– The complete end-to-end processing flow involves steps that require, or can benefit from, large memory or fast I/O
– Would like the ability to use OpenMP for certain phases of the overall process
• Applications: Roche/454 Newbler, home-grown code, …
• Solution:
– 16 Intel dual-processor Xeon systems providing 768 GB RAM and 32 sockets (128 cores) as a single virtual system running Linux with vSMP Foundation
– The solution was later extended to 32 blades (32 sockets) in a "Cloud" configuration
• Benefits:
– Better performance: 2X-10X over the existing solution
– Ability to solve larger problems
– Versatility: the complete process flow runs on a single system
Example: Mapping Time
[Chart: Mapping time (hours, 0-10) for the Total, Linear, and Parallel phases, ScaleMP vs. the HPC cluster]
Customer Use Cases: LARGE MEMORY FOR GENOMICS AS WELL AS OTHER SCIENCE DOMAINS
A flexible and versatile computing platform, delivered cost-effectively
• Customer: Brown University, Center for Computation and Visualization
• Challenges:
– Support a wide and diverse community of computational scientists
– Provide a large-memory system for genome assembly (>1TB RAM)
– Support I/O-intensive workloads
– Support highly computational workloads
• Applications: Velvet (>900GB assemblies), MATLAB, home-grown
• Solution:
– 16 x Intel dual-socket servers, each with 96GB RAM and 8 cores (part of a 200+ node cluster)
– Total: 128 cores, 1.5 TB RAM
– GPFS for fast access to external storage
• Benefits:
– Provides large addressable memory
– Delivers the flexibility to perform a variety of high-performance computing tasks
– Enables new research that wasn't possible before
– Cost-effective solution
"… the first well-balanced high performance computing platform that Brown University has had to support a wide variety of applications."
—Sam Fulcomer, associate director for the Center for Computation and Visualization, Brown University
VELVET – PUBLIC DATASET, UNIVERSITY OF MARYLAND CENTER FOR BIOINFORMATICS
• Performance comparison of a Velvet job requiring ~220GB RAM:
– A single board with 256 GB of memory
– Aggregated systems with 288 GB of memory (3 boards x 96GB)
• vSMP Foundation reduces runtime by over 25%, thanks to the ability to combine faster dual-socket processors with large amounts of memory
• Systems configuration:
– 1 x quad-socket: X7560 @ 2.26 GHz, 256GB RAM
– vSMP Foundation aggregated system: 3 x dual-socket servers (Intel Xeon X5570 @ 2.93 GHz), 96GB per board
[Chart: Runtime (hours) vs. total RAM: Nehalem-EX X7560 @ 2.26GHz, 1 x 256GB: 7.1h; vSMP Foundation, 3 x 96GB = 288GB: 5.2h]
• Public dataset: http://cichlid.umd.edu/download/reads.shuffled.total_RNA_no_exp_cov_230_20.fasta.gz
Customer Use Cases: CHEMISTRY DEPARTMENT
Cost-effective SMP for large-memory and high-core-count jobs
• Customer: Southern Methodist University, Computational and Theoretical Chemistry Group: developing a new approach for investigating the mechanisms of chemical reactions (URVA: Unified Reaction Valley Approach)
• Previous platform: SGI Altix
• Challenges:
– Prof. Elfi Kraka, who joined SMU in 2009, was looking for an SMP system to replace the SGI Altix system she had worked with
– Ability to run large Gaussian jobs (both large core counts and large amounts of memory)
– Ability to fit into a limited budget
• Applications: Gaussian (03 & 09), CFOUR, others
• Solution:
– C7000 with 16 Intel dual-processor Nehalem-EP blades (BL460)
– Total: 128 cores @ 2.93 GHz, 768 GB RAM, >10TB of storage
• Benefits:
– Better performance: 5X-10X over the previous solution
– Ability to solve larger problems
– Simplification: easy to work with a single system
SMU: Customer Experience & Quotes
End user:
• For Gaussian jobs, our new system is 9 times faster than the previous SGI Altix: ~30 days for a job on the Altix vs. 3-5 days on the current system
• The system is used 100% of the time
• The biggest benefit is that it is a single system to work with
Sys-admin:
• The system is very easy to administer
• On a day-to-day basis there is basically no maintenance required beyond a standard Linux system
• We would like to expand with the Cloud version as soon as funding is available
GAUSSIAN: VERY LARGE MODEL (1) – CUSTOMER MOLECULE
• Comparison of a very large Gaussian workload running on single systems using HDDs vs. vSMP Foundation using a RAM-drive for I/O
• vSMP Foundation provides excellent performance:
– ~10X faster than the 8-core run
– ~4X faster than the quad-socket system
• System configuration:
– vSMP Foundation: 11 x dual-socket servers (Intel Xeon X5660 @ 2.80 GHz, 48 GB RAM), packed placement
– Dual-socket system: Intel E5640 @ 2.67 GHz, 48 GB RAM
– Quad-socket system: Intel X7550 @ 2.00 GHz, 128 GB RAM
[Chart: Runtime (sec) vs. # cores (8-128): dual-socket E5640, 8 cores: 248,257s; quad-socket X7550, 32 cores: 96,630s; vSMP Foundation at higher core counts: 49,187s, 30,158s, 25,483s]
GAUSSIAN: VERY LARGE MODEL (2) – CUSTOMER MOLECULE
• Comparison of a very large Gaussian workload
• vSMP Foundation provides excellent performance compared to a native 8-socket system
• System configuration:
– Native: 8-socket server, Intel Xeon E7-2850 @ 2.00GHz, 512 GB RAM
– SDSC/Gordon: 32 x dual-socket servers (Intel Xeon E5-2670 @ 2.60 GHz, 64 GB RAM), HT off, Turbo Boost not enabled
[Chart: Runtime (sec) vs. # cores (20-80); reported runtimes of 35,380s, 14,476s, 48,236s, 26,000s, and 23,101s across the Gordon/vSMP Foundation runs and the native 8-socket runs]
GAUSSIAN: LARGE MEMORY VS. I/O – CUSTOMER MOLECULE
• Comparison of a very large Gaussian workload:
– A single system using HDDs
– An aggregated system with vSMP Foundation, using a RAM-drive for I/O
• vSMP Foundation provides improved performance:
– Scales with the aggregated memory for I/O (using the RAM-drive)
– 2.0X faster than HDD performance (8 cores)
– 2.35X faster with a higher core count (16 cores)
• System configuration:
– vSMP Foundation: 16 x dual-socket servers (Intel Xeon X5570 @ 2.93 GHz, 48 GB RAM)
– Comparable system: dual-socket (Intel X5570 @ 2.93 GHz, 48 GB RAM)
[Chart: Runtime (sec) and performance relative to the 4-core HDD run: HDDs 4,258s (4 cores), 3,832s (8 cores); RAM-drive 3,038s (4 cores, 1.40X), 2,131s (8 cores, 2.00X), 1,812s (16 cores, 2.35X)]
Customer Use Cases: MEDICAL RESEARCH INSTITUTE
Large memory for multi-threaded programming
• Customer: medical research institute
• Previous platform: HP Superdome system
• Challenges:
– High-performance image processing on very large MRI scans
– Scanned data for a single run currently exceeds 200GB; memory requirements are expected to grow significantly with the introduction of full-body scans with more sensors
– Would like to use commercial tools for faster development
– Would like to standardize on the x86 architecture for lower costs and open standards
• Applications: Siemens CT processing, MATLAB, BLAS, home-grown code, …
• Solution: 16 Intel dual-processor Xeon systems providing 1TB RAM and 32 sockets (128 cores) as a single virtual system running Linux with vSMP Foundation
• Benefits:
– Better performance: the solution was evaluated and found to be faster than any alternative system
– Cost: significant savings compared to the alternative system (an order of magnitude)
– Versatility: also used for MPI jobs as part of a large cluster
MATLAB – THREADED (1): 16,000 x 16,000 MATRIX MULTIPLICATION
• Performance of matrix multiplication with the threaded version of MATLAB
• vSMP Foundation lets end users scale their MATLAB jobs while running them exactly as they would on a desktop
• System configuration: vSMP Foundation, 16 x dual-socket servers (Intel Xeon X5570 @ 2.93 GHz, 48 GB RAM)
[Chart: Runtime (sec) and speedup vs. # cores: 1 core 745s (1.00X); 4 cores 197s (3.77X); 8 cores 101s (7.37X); 16 cores 58s (12.86X); 32 cores 42s (17.91X); 64 cores 35s (21.54X)]
MATLAB – THREADED (3): SINGLE SIMULATION, SOLVING LARGE PROBLEMS
• Performance of a single MATLAB simulation on an increasing number of cores
• The simulation requires 60GB of RAM (40% over a single server's RAM capacity), using matrices of 65K
• System configuration: vSMP Foundation, 11 x dual-socket servers (Intel Xeon X5660 @ 2.80 GHz, 48 GB RAM @ 1066MHz)
[Chart: Compute time (sec) and speedup vs. # cores (matrix size 65K, 60GB RAM): 12 cores 1,870s (1.00X); 24 cores 1,226s (1.53X); 48 cores 897s (2.09X)]
The "Pseudo-Code"

/* run through rows of Matrix A */
for (i = 0; i < NRA; i++)
    /* run through columns of Matrix B */
    for (k = 0; k < NCB; k++)
        /* initialize cell content of Matrix C */
        c[i][k] = 0.0
        /* run through columns of Matrix A */
        for (j = 0; j < NCA; j++)
            /* calculate value of cell in Matrix C */
            c[i][k] += a[i][j] * b[j][k]
        next
    next
next
Serial Code

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h> /* gettimeofday() and struct timeval */
#define dt(start, end) ((end.tv_sec - start.tv_sec) + \
    1/1000000.0*(end.tv_usec - start.tv_usec))
#define NRA 9000 /* number of rows in matrix A */
#define NCA 1500 /* number of columns in matrix A */
#define NCB 700  /* number of columns in matrix B */
/* matrix A/B to be multiplied, C is the result */
double a[NRA][NCA], b[NCA][NCB], c[NRA][NCB];
int main(int argc, char **argv){
    int i, j, k;
    struct timeval srun, scalc, ecalc;
    gettimeofday(&srun, NULL);
    for (i = 0; i < NRA; i++) /* initialize matrix A */
        for (j = 0; j < NCA; j++)
            a[i][j] = (double)(i + j);
    for (i = 0; i < NCA; i++) /* initialize matrix B */
        for (j = 0; j < NCB; j++)
            b[i][j] = (double)(i * j);
    gettimeofday(&scalc, NULL);
    for (i = 0; i < NRA; i++) /* C = A * B */
        for (k = 0; k < NCB; k++) {
            c[i][k] = 0.0;
            for (j = 0; j < NCA; j++)
                c[i][k] += a[i][j] * b[j][k];
        }
    gettimeofday(&ecalc, NULL);
    printf("%s: Init time: %3.02f Calc time: %3.02f\n",
        argv[0], dt(srun, scalc), dt(scalc, ecalc));
}

"sed '/^$/d' serial.c | wc -l" = 31 lines of code
Serial Code vs. Parallel Code (OpenMP)

The OpenMP version is identical to the serial code on the previous slide, except for one added line before the multiplication loop:

    gettimeofday(&scalc, NULL);
#pragma omp parallel for shared(a,b,c) private(i,j,k) schedule(dynamic)
    for (i = 0; i < NRA; i++) /* C = A * B */
        for (k = 0; k < NCB; k++) {
            c[i][k] = 0.0;
            for (j = 0; j < NCA; j++)
                c[i][k] += a[i][j] * b[j][k];
        }
    gettimeofday(&ecalc, NULL);

"sed '/^$/d' serial.c | wc -l" = 31 lines of code; "sed '/^$/d' openmp.c | wc -l" = 32 lines of code
Sample OpenMP Run Script

Example run script for an OpenMP application (openmp-app):

# Make sure to use Intel compilers to build the application
# Intel compiler runtime environment settings
source ../intell.sh
# Set the stack size to unlimited
ulimit -s unlimited
# ScaleMP preload library that throttles down unnecessary system calls
export LD_PRELOAD=/opt/ScaleMP/libvsmpclib/lib64/libvsmpclib.so
np=16
# Alternatively, use dynamic placement via numabind:
# export PATH=/opt/ScaleMP/numabind/bin:${PATH}
# export LD_LIBRARY_PATH=/opt/ScaleMP/numabind/lib
# The cpu/offset to start from is calculated dynamically by numabind:
# export KMP_AFFINITY=compact,verbose,0,`numabind --offset $np 2>/dev/null`
export OMP_NUM_THREADS=$np
/usr/bin/time ./openmp-app > log-openmp-app-$np.txt
# See /opt/ScaleMP/examples/OpenMP for more information
Parallelization Programming Model – SUMMARY

                 Serial   OpenMP   MPI
Lines of Code    31       32       105

Calculation time (sec.):
CPUs    Serial    OpenMP    MPI
1       121.17    -         -
2       -         62.48     65.96
4       -         31.66     36.14
8       -         15.63     19.67
16      -         9.27      11.60
32      -         5.29      8.50
64      -         3.70      9.97

OpenMP keeps scaling through 64 CPUs, while the MPI runtime bottoms out at 32 CPUs (8.50s) and rises again at 64 (9.97s).
Parallel R
• lapply: apply a function to each element of a list (for example, every variable in a data frame)
• mclapply: lapply for a multi-core environment
Benefits:
- Workers are forked, so there is no need to export data to all nodes
- All parallel jobs share the same state
- As a result, parallel execution is more efficient
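
A minimal sketch of the idea (the toy workload and the core count are illustrative assumptions; mclapply comes from the multicore package of that era, or from the parallel package bundled with modern R):

library(parallel)                       # provides mclapply (fork-based)

payload <- matrix(rnorm(1e6), 1000)     # data lives once, in the parent process
col_mean <- function(i) mean(payload[, i])

# Forked workers inherit 'payload'; nothing is exported or copied up front
res <- mclapply(seq_len(ncol(payload)), col_mean, mc.cores = 16)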
R – PARALLEL (THREADED): BIO-INFORMATICS, PROCESSING OF SHORT READS
• Performance of parallel R in a multi-core environment, using a growing number of workers
• The simulation requires over 250GB RAM
• vSMP Foundation leverages R's simple mechanism for multi-core scaling
• System configuration:
– SDSC/Gordon: 32 x dual-socket servers (Intel Xeon E5-2670 @ 2.60 GHz, 64 GB RAM), HT off, Turbo Boost not enabled
– 8-socket server: Intel E7-2850 @ 2.00GHz, 512 GB RAM
[Chart: Runtime (hours) vs. # cores (8-256): 8-socket E7-2850 server 10.61h, 6.35h, 3.87h at the lower core counts; Gordon with vSMP Foundation 8.38h, 5.50h, 2.83h, 1.57h, 0.84h, 0.56h as cores scale up to 256]
Example 1

library(foreach)
library(doMC)

nWorkers = 16
registerDoMC(nWorkers)
getDoParWorkers()

check <- function(n) {
    for (i in 1:100) {
        sme <- matrix(rnorm(160000), 400, 400)
        solve(sme)
    }
}

times <- 2000
system.time(x <- foreach(j = 1:times) %dopar% check(j))
Example 2
See: http://trg.apbionet.org/euasiagrid/docs/parallelR.notes.pdf
• We will use the sample data set geneData from the Biobase package in Bioconductor. This data set contains expression data for 500 genes (rows) in 26 samples (columns).
• We want to calculate the correlation coefficients of gene-expression values between all pairs of genes across these 26 samples, in order to find out which genes are under shared regulation. To calculate the correlation between two genes, say genes 12 and 13:
> cor(geneData[12, ], geneData[13, ])
Example 2 – cont.
To calculate the correlation for all gene pairs, we will use the lapply function, which takes a list as its first argument and passes each element of the list to the function specified as the second argument. The return value of lapply is a list of the same length as the input list, where each output element corresponds to an input element processed by the user-specified function.
We first define our own function, which accepts one element from the variable "pair" (a precomputed list of gene-index pairs) and returns the correlation coefficient between the pair of genes specified in that element:
> geneCor <- function(x, gene = geneData) {
+     cor(gene[x[1], ], gene[x[2], ]) }
To calculate the correlation between gene 12 and gene 13:
> geneCor(c(12, 13))
[1] 0.548477
Now we calculate the correlation coefficients for the first 3 pairs of genes using lapply and our geneCor:
> out <- lapply(pair[1:3], geneCor)
Example 2 – cont.
To measure the time taken to compute the correlations between all pairs stored in our variable "pair":
> system.time(out <- lapply(pair, geneCor))
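
Putting the pieces together, a hedged end-to-end sketch; how "pair" is built here, and the mc.cores value, are illustrative assumptions not shown on the slides. Swapping lapply for mclapply is all the parallelization requires:

library(Biobase)     # Bioconductor package that ships the geneData sample set
library(parallel)    # provides mclapply

data(geneData)       # 500 genes (rows) x 26 samples (columns)

# All unordered pairs of gene indices, as a list of length-2 vectors
pair <- combn(nrow(geneData), 2, simplify = FALSE)

geneCor <- function(x, gene = geneData) cor(gene[x[1], ], gene[x[2], ])

system.time(out <- lapply(pair, geneCor))                   # serial
system.time(out <- mclapply(pair, geneCor, mc.cores = 16))  # parallel, forked workers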
Performance != Efficiency
• vSMP Foundation is designed to give the user information about system efficiency: efficiency data reveals the major application-scalability issues that reduce system performance
• No changes to the application code are required, and timing is not affected
• Used to pinpoint scalability issues as well as for software tuning:
– Commercial codes: ANSYS, ABAQUS, Intel (MKL), HP (MPI), Paradigm Geophysical, MOLPRO, Schrödinger
– Open-source codes: Linux kernel 2.6, glibc, MPICH1, MPICH2
vSMProfile: Performance Tuning – PINPOINTING CACHE-MISSES
[Charts: two cache-miss profiles over time (x-axis 0-520, y-axis 0-120,000), breaking activity down into OS vs. application]
vSMProfile: Performance Tuning
• Points developers to the line of code that is causing performance problems
• Significantly reduces application tuning time
• Eliminates performance-optimization guesswork
• Not available on other architectures
Developer Productivity = Optimized Application