scalemp - introduction

37
THE HIGH - END VIRTUALIZATION COMPANY SERVER AGGREGATION CREATING THE POWER OF ONE Aggregate. Scale. Simplify. Save. Short Introduction

Upload: ibm-india-smarter-computing

Post on 24-May-2015

491 views

Category:

Technology


3 download

TRANSCRIPT

Page 1: ScaleMP - Introduction

THE HIGH-END VIRTUALIZATION COMPANY SERVER AGGREGATION – CREATING THE POWER OF ONE

Aggregate. Scale. Simplify. Save.

Short Introduction

Page 2: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

COMPANY AND PRODUCTS

13-Sep-10 2

Page 3: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

Virtualization software for aggregating multiple off-the-shelf systems into a single virtual machine,

providing improved usability and higher performance

What We Do ?

16 x OS

16 x OS

16 x OS

16 x OS

16 x OS

16 x OS

16 x OS

16 x OS

16 x OS N x OS

N x Servers

1 OS

1 VM

13-Sep-10 3

Page 5: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

Virtualization Defined…

Partitioning Providing a virtual resource that is a

subset of the physical resource

Aggregation Providing a virtual resource that is a

concatenation of several physical resources “Utilization”

Single system only

Disk Partitioning

VLANs

Server Virtualization

Hardware

Array-based

Software

Volume Mgmt

Switch-based

Stack- based

Mainframe Hypervisor /

VMM

“Management, Capability”

Disk Concatenation

Link Aggregation

Server Aggregation

Volume Mgmt

Hardware

Array- based

OS- based

Switch- based

SMP, MPP

Software

…abstracting the physical characteristics of a computing resource…

13-Sep-10 5

Page 6: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

Server Virtualization

Hypervisor or VMM

Virtual Machines

App

OS

App

OS

App

OS

Virtual Machine

App

OS

Hypervisor or VMM

Hypervisor or VMM

Hypervisor or VMM

Hypervisor or VMM

AGGREGATION Concatenation of physical resources

PARTITIONING Subset of the physical resource

13-Sep-10 6

Page 7: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

vSMP Foundation Aggregation Platform

Cluster Manageability

SMP Cost Savings & Performance

Cloud Flexibility

Cost Savings

• Up to 5X cost savings

Performance

• Leveraging latest Intel processors

• Best x86 solution by SPEC CPU2006

• #7th best shared-memory by STREAM (memory bandwidth)

Reliability

• Fault detection and component isolation

• Redundant backplane support

Capabilities

• Up to 16,384 CPUs and 64TB RAM

Flexibility

• On-the-fly aggregated VM provisioning and tear-down

• Scaling memory or CPU

Utilization

• Resource fragmentation reduction

• Support any programming model (serial, throughput, multi-threaded, large-memory) without machine boundary

Integration

• Network installation provides seamless integration with any grid management system

Manageability

• Single Operating System (OS) for up to 128 nodes

• OS driven job scheduling and resource management

Performance

• InfiniBand performance with zero management and knowhow

Storage

• Built-in cluster file system

Installation

• Unboxing to production in less than 3 hours

13-Sep-10 7

Page 8: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

TECHNOLOGY

13-Sep-10 8

Page 9: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

How Does it Work ?

Multiple Computers with Multiple Operating Systems

Multiple Computers with a Single Operating System

13-Sep-10 9

Page 10: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

Bare Metal, Distributed Virtual Machine Monitor

• Loaded at boot time

– Supported boot devices: USB, IDE, CompactFlash or Network Image (PXE)

• Fabric probing and VM setup

• Loading the OS and maintaining I/O and memory coherency

– Software interception engine creates a uniform execution environment

– Creates the relevant BIOS environment to present the OS (and the SW stack above it) a single system

– Exposes all available CPU, Memory and I/O resources to the OS

– I/O resources unified into a single PCI hierarchy

Network Boot

Up to 16 servers (today), 128 servers (3Q2010)

Multiple Computers with a Single Operating System

How Does it Work ?

13-Sep-10 10

Page 11: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

• Systems configuration can be different

– Aggregating systems with different boards, I/O configurations, processors speed and memory configuration

– Only one type of CPU will be presented to the OS

• >10 different coherency mechanisms

• Aggregated hardware I/O compatibility list include devices from Intel, Broadcom, LSI, ATI, Emulex, Adaptec and others

1 x RAM

4 x RAM

1 x RAM

4 x RAM

1 x RAM

1 x RAM

1 x RAM

1 x RAM

Up to 4TB aggregated (today), 64TB aggregated (3Q2010)

14 x RAM

Aggregated System Multiple Computers

with a Single Operating System

How Does it Work ?

13-Sep-10 11

Page 12: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

• Decision criteria:

– Access pattern and historical knowledge on the reference (node level)

– Memory block association (system level)

– Code scanning

– Virtual address access pattern

• Transfer size:

– 4K native cache line

– 128KB transfers for pre-fetch

– Instruction size: byte size for PINNED memory

• Ownership:

– No backstore or persistent memory location. Board- local caching.

– Memory migration and replication - Shared / Exclusive / Invalid

– Pre-fetch with persistence ownership (motivated by code-scanning or I/O access pattern)

14 x RAM

Coherent Memory

Memory Coherency Multiple Computers

with a Single Operating System

How Does it Work ?

13-Sep-10 12

Page 13: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

• Hardware resiliency (by the hardware vendor):

– Redundant power

– Redundant cooling

– Memory mirroring

• vSMP Foundation architecture resiliency:

– Active-active backplane

– Fault isolation – immediate restart without failed components

– Partitioning / containers support

• Remote management (by hardware vendor + ScaleMP):

– Hardware chassis management

– vSMP Foundation native Serial-over-LAN (SOL) support

Resiliency Multiple Computers

with a Single Operating System

How Does it Work ?

13-Sep-10 13

Page 14: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

• vSMP Foundation Productivity Pack: Automatic full software stack installer and tuning tool

• Real time statistics counters and system profiler. Designed to provide the user information about system efficiency

– Efficiency information reveals the major application scalability issues that reduce system performance

– No changes required for the application code; does not affect timing

– Used for pinpoint scalability issues as well as for software tuning; not available on other architectures

• OS-level tuning: Enabled by single kernel compile option

• Tuned libraries: MPI (MPICH), OpenMP (Intel), MKL (Intel)

• Application execution guidelines

Optional Complete Software Stack

Project Start

+3 days

+7 days

+10 days

Actual Case

How Does it Work ?

13-Sep-10 14

Page 15: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

CUSTOMER USE CASES

13-Sep-10 15

Page 16: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

DEPLOYMENT SCENARIOS

3.1

13-Sep-10 16

Page 17: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

SMP (1)

13-Sep-10 17

Page 18: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

SMP (2)

vSMP Foundation DC2:

13-Sep-10 18

Page 19: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

SMP (3)

Ethernet Switch

13-Sep-10 19

Page 20: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

Cluster

13-Sep-10 20

Page 21: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

Cloud

10 STEPS in 10 MINUTES 1. User submits a job requiring 80cores

and 400GB RAM

2. Resource manager (RM) find it needs 10 systems (48GB RAM each)

3. RM identifies 10 available nodes from the environment

4. RM requests vSMP Foundation for Cloud for a 10-node image

5. RM points the 10 nodes to boot vSMP Foundation for Cloud image (using PXE)

6. All 10 nodes PXE-boot the image and aggregate to form a single virtual system

7. The aggregated Virtual Machine, boots Linux OS from either local drive or PXE-boot (using vendor tag ScaleMP)

8. The virtual system now operational and ready to run jobs (up 80-cores, ~400GB)

9. RM runs the job

10.Job finished – 10 nodes back to the pool and available for other jobs

13-Sep-10 21

Page 22: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

Storage

13-Sep-10 22

Page 23: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

SIMPLICITY

3.2

13-Sep-10 23

Page 24: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

Customer Use Cases

• Customer: Auburn University, School of Engineering

• Current platform: None. Just getting into HPC

• Problems: – Compute requirements were growing as number of users/students was growing

– No in-house skills to run x86 InfiniBand cluster

– Limited operational budget to hire additional sys-admin resources

• Applications: Commercial application (mostly Fluent and MATLAB)

• Solution: – 4 full blade chassis, each aggregated as a single system with 128 cores and 384 GB RAM and 5 TB of

internal storage

– Total: 64 physical nodes, 512 cores, 20TB storage - running as a cluster of 4 ‘Fat-Nodes’

• Benefits: – Low OPEX:

• No additional IT required for day-to-day operations

• The need to manage only 4 ‘Fat-Nodes’

• Internal storage is embedded in each ‘Fat-Node’

– Simplicity: InfiniBand performance without the complexity of managing such a solution

ENGINEERING FACULTY

LARGE SCALE DEPLOYMENT WITHOUT THE COMPLEXITY

13-Sep-10 24

Page 25: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

Customer Use Cases

• Customer: Mid-size Engineering Services Company

• Current platform: Multiple 2-socket workstations

• Problems: – Existing models (Abaqus) grow fast and can’t fit the engineers workstation

– Interested in running apps in batch at night

– No in-house skills to run x86 InfiniBand cluster (although the application runs nicely on InfiniBand cluster) . Can’t afford RISC systems

• Solution: – 4 Intel dual-processor Xeon systems to provide 128GB RAM, 8 sockets (16 cores) single virtual system

running Linux with vSMP Foundation

• Benefits: – Performance: Solution significantly faster than existing workstations. Performance is comparable to

cluster performance (using vendor benchmarks).

– Low OPEX: No IT required for day-to-day operation

– Versatility: Batch mode at night. Daytime jobs are executed on the system while using the workstation for display only. Multi-user environment with perfect scaling – and sharing without performance degradation.

– Investment protection: Expected to expand the system by adding additional 4 nodes (to a total of 256GB RAM, 32 cores)

ENGINEERING SERVICES COMPANY

INNOVATION WITHOUT

COMPLEXITY

13-Sep-10 25

Page 26: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

SIMPLIFYING INTER-PROCESS

COMMUNICATION

Customer Use Cases

• Customer: Hedge Fund

• Current platform: Multiple 4-Socket Servers

• Problems: – A single 4-socket server did not provide enough performance required for customer business targets

– Multiple 4-socket servers required complex decomposition and introduced challenges in transferring data between processes in a short and deterministic time (low latency and small jitters)

• Ethernet based solution could not provide this / IB solution is too complex to manage and program for

– Co-location at exchanges for a solution comprised of multiple systems is complicate

• Applications: KX, WOMBAT, home-grown code

• Solution: – 16 Intel dual-processor Xeon systems to provide 0.5TB RAM, 32 sockets (128 cores) single virtual

system running Linux with vSMP Foundation

• Benefits: – Reduced latency and latency variance

– Simpler solution: Deploy and management of a single system

– Better utilization: Single system reduces resources fragmentation

– Simpler programming model: No need for specific InfiniBand programming

FINANCIAL SERVICES

13-Sep-10 26

Page 27: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

FLEXIBILITY

3.3

13-Sep-10 27

Page 28: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

COST EFFECTIVE FLEXIBLE SOLUTION WITH HIGH

UTILIZATION

Customer Use Cases

• Customer: Hosted HPC resource provider

• Current platform: Clusters and SMP machines

• Problems: – Need to run MPI as well as OpenMP (shared memory) codes

– Large shared memory jobs require dedicated proprietary hardware

– Low utilization on shared memory systems

• Applications: A variety of commercial codes

• Solution: – Original: 4 systems, total of 8 sockets (32 cores) and 128GB RAM

– Solution was extended to 16 nodes

• Benefits: – Utilization: Rely on standard commodity hardware

– Flexibility: Using same system for both shared memory and cluster benchmarks, resulting in high utilization

HOSTED HPC RESOURCE PROVIDER

13-Sep-10 28

Page 29: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

ELASTIC VM SOLUTION AIMED FOR DATA

INTENSIVE COMPUTING

Customer Use Cases

• Customer: San Diego Supercomputer Center (SDSC)

• Current platform: 8-socket AMD systems

• Problems: – Require an infrastructure for data intensive computing

– Need large memory system (TBs in size), depending on job need

– Require the ability to access quickly large amounts of storage

• Applications: A variety of data intensive codes (Astronomy, Genomics, Data Mining, etc..)

• Solution: – Initial Deployment: 4 ‘Super Nodes’, each with 768GB RAM, 128 cores, 10TB internal storage

– Complete Deployment (2011): 1,024 servers with vSMP Foundation for Cloud. Could be aggregated up to 32 ‘Super Nodes’ each nodes is 32 servers, resulting in 2TB RAM and 8TB of SSDs

– On demand allocation using web-request and fast (<10 minutes) provisioning

• Benefits: – Flexibility: Provision multiple ’Super Nodes’ of various

sizes according to need

– Performance: Extremely fast hierarchical memory solution: RAM Aggregated RAM Aggregated SSDs

SUPER COMPUTER CENTER

13-Sep-10 29

Page 30: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

CAPABILITY

3.4

13-Sep-10 30

Page 31: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

Customer Use Cases

• Customer: Global Energy Company

• Current platform: x86 grid

• Problems: – Using in-house single-threaded simulation tools in throughput mode. Each simulation

memory footprint has grown over the years and sometimes (10%) exceeds 32GB.

– Application runs on x86 only

– Used to reschedule failed runs on large-memory systems

• Solution: – 6 Intel dual-processor Xeon systems to provide 192GB RAM, 12 sockets (48 cores) single virtual

system running Linux with vSMP Foundation

• Benefits: – Versatility: Both large and small workloads used concurrently on the same system

– Utilization: Higher utilization compared to grid due to lower infrastructure fragmentation

– Investment protection: Solution expanded by 100% since initial installation

GLOBAL ENERGY COMPANY

SINGLE INFRASTRUCTURE FOR HORIZONTAL AND VERTICAL APPLICATION SCALING – PLUG & PLAY

13-Sep-10 31

Page 32: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

SCALEUP AT SCALEOUT

PRICING

Customer Use Cases

• Customer: Formula1 team

• Current platform: Large-memory Itanium-based system

• Problems: – Need to generate large mesh as part of pre-processing of whole-car

simulation (FLUENT TGrid)

– Mesh requirements are ~200GB in size

– Expect to grow significantly within 12 months after initial deployment

– Would like to standardize on x86 architecture due to lower costs and open standards

• Solution: – 12 Intel dual-processor Xeon systems to provide 384GB RAM single virtual system running Linux with

vSMP Foundation

• Benefits: – Better performance: Solution evaluated and found to be faster than

alternative systems (x86 and non-x86)

– Cost: Significant savings compared to alternative system

– Versatility: Also being used to run FLUENT (MPI) as part of large cluster

– Investment protection: Solution can grow

FORMULA1 TEAM

13-Sep-10 32

Page 33: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

SIMPLE AND FLEXIBLE COST

EFFECTIVE SOLUTION

Customer Use Cases

• Customer: Weather forecasting service provider

• Current platform: SGI Altix with 32 cores

• Problems: – Need to run MPI as well as OpenMP codes

– System needs to be deployed remotely, and hence needs to be simple to manage

– Data processing flow is complex and requires transferring large amounts of data between steps

• Applications: MM5, WRF, MAWSIP, Home-grown code for data transformation

• Solution: – 4 Intel Nehalem dual socket blades, total of 8 sockets (32 cores) and 192GB RAM

– Internal storage

– Solution was extended to 8 blades, total of 16 sockets (64 cores) and 384GB RAM

• Benefits: – Performance: 2.5 X better performance on same # of cores (32)

– Simpler solution: Significantly reduced capital expense, allowed the customer to have a higher # of cores

– Simplicity: Simple to manage by domain experts (weather forecast scientists)

– Dataflow remains within the system, leveraging internal storage

WEATHER FORECASTING SERVICE PROVIDER

13-Sep-10 33

Page 34: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

Customer Use Cases

• Customer: Large European semiconductor manufacturer

• Current platform: Proprietary large shared memory system + compute grid

• Problems: – Dedicated and proprietary systems were expensive

– Utilization and efficiency of the existing large memory system was low for throughput jobs

– The mix of throughput and large memory jobs required maintaining 2 separate environments

• Solution: – 8 Intel dual-socket (quad-core) Xeon systems to provide 300GB RAM single virtual system running

Linux with vSMP Foundation

• Benefits: – The ability to run large memory jobs when required

– Flexibility: Switch to between large memory and throughput jobs in an efficient way on the fly

– Consistency: Underlying hardware is aligned with standard hardware used for the rest of the compute grid

– Better utilization: Having a single system reduces resources fragmentation

– Performance: Leverage most recent Intel CPUs for large-memory jobs

SEMICONDUCTOR MANUFACTURER

FLEXIBLE SYSTEM FOR THROUGHPUT AND LARGE MEMORY JOBS

13-Sep-10 34

Page 35: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

LARGE MEMORY FOR MULTI-THREADED PROGRAMMING

Customer Use Cases

• Customer: Medical Research Institute

• Current platform: HP Superdome System

• Problems: – Need to perform high performance image processing on very large MRI scans

– Scanned data for a single run is currently over 200GB. Memory requirements are expected to grow significantly with the introduction of full body scan with more sensors

– Would like to use and commercial tools for faster development

– Would like to standardize on x86 architecture due to lower costs and open standards

• Applications: Siemens CT processing, MATLAB, BLAS, Home-grown code, …

• Solution: – 16 Intel dual-processor Xeon systems to provide 1TB RAM, 32 sockets (128 cores) single virtual

system running Linux with vSMP Foundation

• Benefits: – Better performance: Solution evaluated and found to be faster

than any other alternative system

– Cost: Significant savings compared to alternative system (order of magnitude)

– Versatility: Also being used for MPI jobs as part of large cluster

MEDICAL RESEARCH INSTITUTE

13-Sep-10 35

Page 36: ScaleMP - Introduction

Aggregate. Scale. Simplify. Save. Confidential and Proprietary

SHARED MEMORY MULTI-THREADED PROGRAMMING

Customer Use Cases

• Customer: RWTH Aachen - Polytechnic University

• Current platform: 4 socket x86 server

• Problems: – Need to run auto-parallelized code (using OpenMP) with high core count

– Other solution found not to work: • Cannot afford proprietary SMP

• Cluster-OpenMP proved not to scale

– Has to use OpenMP for faster development (machine generated Fortran code)

• Applications: Home-grown codes (SHEMAT Suite, FIRE, …)

• Solution: – 13 Intel dual-processor Xeon systems to provide 26 sockets (104 cores)

– Single virtual system running Linux with vSMP Foundation

• Benefits: – Performance: Solution scales at over 95% efficiency

to 104 cores with OpenMP codes

– Cost: Significant savings compared to alternative solutions

– System scale: Largest x86 system available

ACADEMIC RESEARCH

13-Sep-10 36

Page 37: ScaleMP - Introduction

THE HIGH-END VIRTUALIZATION COMPANY SERVER AGGREGATION – CREATING THE POWER OF ONE

Aggregate. Scale. Simplify. Save.

Shai Fultheim Founder and President [email protected], +1 (408) 480 1612