TRANSCRIPT
High Performance and Productivity Computing with Windows HPC
George Yan, Group Manager, Windows HPC, Microsoft China
HPC at Microsoft
1997  NCSA deploys first Windows clusters on NT4
2000  Windows Server 2000 ships
2001  Microsoft Computational Clustering Preview kit and Beowulf Cluster Computing with Windows book released
2002  Cornell Theory Center migrates to all-Windows infrastructure, eventually reaching over 600 nodes and 1,200 user accounts; first Top500 appearance
2003  Argonne National Labs releases MPICH on Windows
2004  Windows HPC team established in both Redmond and Shanghai
2005  Microsoft launches HPC entry at SC'05 in Seattle with Bill Gates keynote
2006  Windows Compute Cluster Server 2003 ships
2007  Microsoft named one of the Top 5 companies to watch in HPC at SC'07
2008  Windows HPC Server 2008
Top500 results for Windows clusters (Linpack efficiency in parentheses):
- Winter 2005, Microsoft: 4 procs, 9.46 GFlops
- Spring 2006, NCSA, #130: 896 cores, 4.1 TF
- Spring 2007, Microsoft, #106: 2048 cores, 9 TF (58.8%)
- Fall 2007, Microsoft, #116: 2048 cores, 11.8 TF (77.1%)
- Spring 2008, Aachen, #100: 2096 cores, 18.8 TF (76.5%)
- Spring 2008, Umea, #40: 5376 cores, 46 TF (85.5%)
- Spring 2008, NCSA, #23: 9472 cores, 68.5 TF (77.7%)
The chart contrasts Windows Compute Cluster Server 2003 and Windows HPC Server 2008 runs: a 30% efficiency improvement.
HPC Clusters in Every Lab
[Diagram: x64 server clusters]
Parallelism Everywhere
"... we see a very significant shift in what architectures will look like in the future ... fundamentally the way we've begun to look at doing that is to move from instruction level concurrency to ... multiple cores per die. But we're going to continue to go beyond there. And that just won't be in our server lines in the future; this will permeate every architecture that we build. All will have massively multicore implementations."
Pat Gelsinger, Chief Technology Officer and Senior Vice President, Intel Corporation, Intel Developer Forum, Spring 2004 (February 19, 2004)
[Chart: power density (W/cm²) of Intel processors from the 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 to the Pentium family, 1970 to 2010, rising from about 1 toward 10,000 W/cm², passing the "hot plate" level and heading for "nuclear reactor", "rocket nozzle", and "sun's surface". Source: Pat Gelsinger, Intel Developer Forum, Spring 2004]
Today's architecture: heat is becoming an unmanageable problem. To grow, to keep up, we must embrace parallel computing.
[Chart: many-core peak parallel GOPS rising from 16 in 2004 to 32,768 by 2015, versus single-threaded performance growing 10% per year: an 80X parallelism opportunity]
Today's Environment
[Diagram: users (engineers, scientists, financial analysts, information workers); infrastructure (corporate infrastructure, storage, clusters/supercomputers, high-speed networking); tools (compilers, debuggers, specialized languages, mainstream technologies)]
High Productivity Computing
- Combined infrastructure
- Integrated desktop and HPC environment
- Unified development environment
Microsoft's Productivity Vision

Administrator:
- Integrated turnkey solution
- Simplified setup and deployment
- Built-in diagnostics
- Efficient cluster utilization
- Integrates with IT infrastructure and policies

Application Developer:
- Highly productive parallel programming frameworks
- Service-oriented HPC applications
- Support for key HPC development standards
- Unix application migration

End-User:
- Seamless integration with workstation applications
- Integrated collaboration and workflow solutions
- Secure job execution and data access
- World-class performance

Windows HPC allows you to accomplish more, in less time, with reduced effort, by leveraging users' existing skills and integrating with the tools they are already using.
Industry Focused Solutions
Automotive, Aerospace, Geo Services, Financial Services, Academia, Government, Life Sciences
Systems Management:
- Rapid large-scale deployment and built-in diagnostics suite
- Integrated monitoring, management and reporting
- Familiar UI and rich scripting interface
- Integrated security via Active Directory

Job Scheduling:
- Support for batch, interactive and service-oriented applications
- High-availability scheduling
- Interoperability via OGF's HPC Basic Profile

MPI (a managed-code sketch follows this list):
- MS-MPI stack based on the MPICH2 reference implementation
- Performance improvements for RDMA networking and multi-core shared memory
- MS-MPI integrated with Windows Event Tracing

Storage:
- Access to SQL, Windows and Unix file servers
- Key parallel file server vendor support (GPFS, Lustre, Panasas)
- In-memory caching options
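The MPI bullets above can be exercised from managed code through MPI.NET, the wrapper listed in the resources at the end of this deck. A minimal sketch, assuming MPI.NET is installed and the binary is launched under mpiexec; the partial-sum computation and names are purely illustrative:

    using System;
    using MPI;   // MPI.NET wrapper over MS-MPI

    class PartialSum
    {
        static void Main(string[] args)
        {
            // Initializes MPI on construction, finalizes it on Dispose.
            using (new MPI.Environment(ref args))
            {
                Intracommunicator comm = Communicator.world;

                // Each rank sums over its own strided slice of the range.
                double partial = 0.0;
                for (int i = comm.Rank; i < 1000000; i += comm.Size)
                    partial += 1.0 / (1.0 + i);

                // Combine the partial sums on rank 0.
                double total = comm.Reduce(partial, Operation<double>.Add, 0);
                if (comm.Rank == 0)
                    Console.WriteLine("total = {0}", total);
            }
        }
    }

Run it with, e.g., mpiexec -n 8 PartialSum.exe, or as the command line of a scheduler task.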
Windows HPC Server 2008
Ease of Deployment
Comprehensive Diagnostics Suite
Single Management Console
Integrated Monitoring
Built-in Reporting
Integrated Job Scheduling
[Diagram: users submit jobs, with their UDFs, through the Scheduler to the head node, which provides job management, resource management, cluster management and scheduling; compute nodes handle job execution, running the user app over MPI and returning results]
A minimal sketch of submitting a job through the scheduler API follows.
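This sketch assumes the HPC Pack SDK's Microsoft.Hpc.Scheduler API; the head-node name and command line are placeholders:

    using System;
    using Microsoft.Hpc.Scheduler;

    class SubmitJob
    {
        static void Main()
        {
            // Connect to the cluster's head node (placeholder name).
            IScheduler scheduler = new Scheduler();
            scheduler.Connect("HEADNODE");

            // Create a job with a single command-line task.
            ISchedulerJob job = scheduler.CreateJob();
            job.Name = "SampleJob";
            ISchedulerTask task = job.CreateTask();
            task.CommandLine = "myapp.exe input.dat";   // placeholder app
            job.AddTask(task);

            // Submit; null credentials fall back to cached or prompted ones.
            scheduler.SubmitJob(job, null, null);
            Console.WriteLine("Submitted job {0}", job.Id);
        }
    }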
Service Oriented HPC
[Diagram: many UDF/service requests fanned out across the cluster]
HPC SOA Programming Model

Sequential:

    for (i = 0; i < 100000000; i++)
    {
        r[i] = worker.DoWork(dataSet[i]);
    }
    reduce(r);

Parallel:

    Session session = new Session(startInfo);
    PricingClient client = new PricingClient(binding,
        session.EndpointAddress);
    for (i = 0; i < 100000000; i++)
    {
        client.BeginDoWork(dataSet[i], new AsyncCallback(callback), i);
    }

    void callback(IAsyncResult handle)
    {
        r = client.EndDoWork(handle);
        // aggregate results
        reduce(r);
    }
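The parallel version presumes a WCF service deployed on the cluster: PricingClient is a generated client proxy, and the session object corresponds to the HPC Pack SOA session API. A minimal sketch of what the service side might look like, with the contract name and body purely illustrative (they are not on the slides):

    using System.ServiceModel;

    // Hypothetical contract behind the PricingClient proxy above.
    [ServiceContract]
    public interface IPricing
    {
        [OperationContract]
        double DoWork(double[] record);
    }

    // The cluster-side implementation the SOA broker would host on
    // compute nodes; the body is a stand-in for the real pricing model.
    public class PricingService : IPricing
    {
        public double DoWork(double[] record)
        {
            double sum = 0.0;
            foreach (double x in record)
                sum += x * x;
            return sum;
        }
    }

Each BeginDoWork call becomes one service request that the broker fans out across the cluster.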
[Diagram: matching work to resources. A big model requires large-memory machines; an ISV application requires nodes where the application is installed; a multi-threaded application such as MATLAB requires a machine with many cores, and many MATLAB instances fan out across the cluster; a 4-way structural analysis MPI job is placed across quad-core machines (cores C0-C3 with memory) and a 32-core NUMA machine (processors P0-P3, each with memory and IO)]
The scheduler is NUMA aware, capacity aware, and application aware, with placement via job context: node grouping, job templates, filters. A sketch of expressing such constraints follows.
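A hedged sketch of expressing those constraints through the same scheduler API; "StructuralAnalysis" and "BigMemNodes" are hypothetical names an administrator would have defined beforehand:

    using Microsoft.Hpc.Scheduler;

    class SubmitConstrainedJob
    {
        static void Main()
        {
            IScheduler scheduler = new Scheduler();
            scheduler.Connect("HEADNODE");              // placeholder head node

            ISchedulerJob job = scheduler.CreateJob();
            // Job templates carry admin-defined policy and defaults.
            job.SetJobTemplate("StructuralAnalysis");   // hypothetical template
            // Node groups restrict which nodes are candidates, e.g. the
            // large-memory machines or nodes with an ISV app installed.
            job.NodeGroups.Add("BigMemNodes");          // hypothetical group

            ISchedulerTask task = job.CreateTask();
            task.CommandLine = "mpiexec fem.exe model.dat";   // placeholder
            job.AddTask(task);
            scheduler.SubmitJob(job, null, null);
        }
    }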
NetworkDirect
A new RDMA networking interface built for speed and stability:
- Verbs-based design for close fit with native, high-performance networking interfaces
- Equal to hardware-optimized stacks for MPI micro-benchmarks (2 usec latency, 2 GB/sec bandwidth on ConnectX)
- OpenFabrics driver for Windows includes support for NetworkDirect, Winsock Direct and IPoIB protocols
[Diagram: two paths through the Windows networking stack. A socket-based app calls Windows Sockets (Winsock + WSD); its traffic either descends the kernel-mode TCP/IP/NDIS path to an Ethernet miniport driver, or bypasses the kernel through a user-mode Winsock Direct provider and user-mode access layer onto RDMA networking hardware. An MPI app calls MS-MPI, which uses a NetworkDirect provider for kernel bypass straight to the RDMA hardware. Components are marked as OS, CCP, or IHV components, plus the (ISV) app]
Partnering for Performance
- Networking hardware vendors: NetworkDirect design review; NetworkDirect and Winsock Direct provider development; work with the Windows Core Networking team
- Commercial software vendors: Win64 best practices; MPI usage patterns; collaborative performance tuning
- 4 benchmarking centers online: IBM, HP, Dell, SGI, and now working with Cray!
Devs can't tune what they can't see
MS-MPI integrated with Event Tracing for Windows (ETW):
- Single, time-correlated log of OS, driver, MPI, and app events
- CCS-specific additions: high-precision CPU clock correction; log consolidation from multiple compute nodes into a single record of parallel app execution
- Dual purpose: performance analysis and application troubleshooting
- Trace data display: Visual Studio and Windows ETW tools, Intel Collector/Analyzer, Vampir, Jumpshot
HPC Storage Solutions
[Chart: aggregate bandwidth (Mb/s per core) versus number of cores in the cluster, comparing Windows Server 2003 and Windows Server 2008]
Parallel file system partners: HP PolyServe, Ibrix Fusion, Quantum StorNext, SANbolic Melio file system, IBM GPFS, Panasas Active Scale, Sun Lustre.
Windows Subsystem for Unix Applications
- Complete SVR-5 and BSD UNIX environment with 300 commands, utilities, shell scripts, compilers
- Visual Studio extensions for debugging POSIX applications
- Support for 32- and 64-bit applications
Unix Application Porting
Recent port of the WRF weather model:
- 350K lines, Fortran 90 and C using MPI, OpenMP
- Traditionally developed for Unix HPC systems
- Two dynamical cores, full range of physics options
Porting experience:
- Fewer than 750 lines of code changed, primarily in the build mechanism (Makefiles, scripts)
- Level of effort and nature of tasks not unlike porting to any new version of UNIX
- Performance on par with the Linux systems
F# is... a functional, object-oriented, imperative and explorative programming language for .NET
Example: Taming Asynchronous I/O

    using System;
    using System.IO;
    using System.Threading;

    public class BulkImageProcAsync
    {
        public const String ImageBaseName = "tmpImage-";
        public const int numImages = 200;
        public const int numPixels = 512 * 512;
        // Vary the ProcessImage repeat count to make the application
        // more CPU-bound or more IO-bound.
        public static int processImageRepeats = 20;
        // Threads must decrement NumImagesToFinish, and protect their
        // access to it through a mutex.
        public static int NumImagesToFinish = numImages;
        public static Object[] NumImagesMutex = new Object[0];
        // WaitObject is signalled when all image processing is done.
        public static Object[] WaitObject = new Object[0];

        public class ImageStateObject
        {
            public byte[] pixels;
            public int imageNum;
            public FileStream fs;
        }

        // The slide elides ProcessImage's simple O(N) loop; this
        // stand-in body just burns CPU over the pixel buffer.
        public static void ProcessImage(byte[] pixels, int imageNum)
        {
            for (int r = 0; r < processImageRepeats; r++)
                for (int i = 0; i < numPixels; i++)
                    pixels[i] ^= 0xFF;
        }

        public static void ReadInImageCallback(IAsyncResult asyncResult)
        {
            ImageStateObject state = (ImageStateObject)asyncResult.AsyncState;
            Stream stream = state.fs;
            int bytesRead = stream.EndRead(asyncResult);
            if (bytesRead != numPixels)
                throw new Exception(String.Format(
                    "In ReadInImageCallback, got the wrong number of " +
                    "bytes from the image: {0}.", bytesRead));
            ProcessImage(state.pixels, state.imageNum);
            stream.Close();
            // Now write out the image. Using asynchronous I/O here appears
            // not to be best practice: it ends up swamping the threadpool,
            // because the threadpool threads are blocked on I/O requests
            // that were just queued to the threadpool.
            FileStream fs = new FileStream(ImageBaseName + state.imageNum +
                ".done", FileMode.Create, FileAccess.Write, FileShare.None,
                4096, false);
            fs.Write(state.pixels, 0, numPixels);
            fs.Close();
            // This application model uses too much memory; release memory
            // (especially global state) as soon as possible.
            state.pixels = null;
            fs = null;
            // Record that an image is finished now.
            lock (NumImagesMutex)
            {
                NumImagesToFinish--;
                if (NumImagesToFinish == 0)
                {
                    Monitor.Enter(WaitObject);
                    Monitor.Pulse(WaitObject);
                    Monitor.Exit(WaitObject);
                }
            }
        }

        public static void ProcessImagesInBulk()
        {
            Console.WriteLine("Processing images... ");
            long t0 = Environment.TickCount;
            NumImagesToFinish = numImages;
            AsyncCallback readImageCallback =
                new AsyncCallback(ReadInImageCallback);
            for (int i = 0; i < numImages; i++)
            {
                ImageStateObject state = new ImageStateObject();
                state.pixels = new byte[numPixels];
                state.imageNum = i;
                // Very large items are read only once, so you can make the
                // buffer on the FileStream very small to save memory.
                FileStream fs = new FileStream(ImageBaseName + i + ".tmp",
                    FileMode.Open, FileAccess.Read, FileShare.Read, 1, true);
                state.fs = fs;
                fs.BeginRead(state.pixels, 0, numPixels, readImageCallback,
                    state);
            }
            // If any images are still being processed, block until all finish.
            bool mustBlock = false;
            lock (NumImagesMutex)
            {
                if (NumImagesToFinish > 0)
                    mustBlock = true;
            }
            if (mustBlock)
            {
                Console.WriteLine("All worker threads are queued. " +
                    "Blocking until they complete. numLeft: {0}",
                    NumImagesToFinish);
                Monitor.Enter(WaitObject);
                Monitor.Wait(WaitObject);
                Monitor.Exit(WaitObject);
            }
            long t1 = Environment.TickCount;
            Console.WriteLine("Total time processing images: {0}ms", t1 - t0);
        }
    }
Example: Taming Asynchronous I/O (equivalent F# code, same perf)
Processing 200 images in parallel:

    let ProcessImageAsync(i) =
        async { // Open the file, synchronously
                let inStream = File.OpenRead(sprintf "source%d.jpg" i)
                // Read from the file, asynchronously ("!" = "asynchronous")
                let! pixels = inStream.ReadAsync(numPixels)
                let pixels' = TransformImage(pixels, i)
                let outStream = File.OpenWrite(sprintf "result%d.jpg" i)
                // Write the result, asynchronously
                do! outStream.WriteAsync(pixels')
                do Console.WriteLine "done!" }

    let ProcessImagesAsync() =
        // Generate the tasks and queue them in parallel;
        // the async object coordinates completion.
        Async.Run (Async.Parallel
            [ for i in 1 .. numImages -> ProcessImageAsync(i) ])
Microsoft HPC++ Experience
- Application benefits: the most productive distributed application development environment
- System benefits: cost-effective, reliable and high-performance server operating system
- Cluster benefits: complete HPC cluster platform integrated with the enterprise infrastructure
Resources
www.microsoft.com/hpc
www.microsoft.com/science
www.microsoft.com/servers
www.microsoft.com/sql
www.microsoft.com/excel
research.microsoft.com/fsharp
www.osl.iu.edu/research/mpi.net
www.microsoft.com/msdn
www.microsoft.com/technet
Thank you!