TRANSCRIPT
High Performance and Productivity Computing with Windows HPC
George Yan, Group Manager, Windows HPC, Microsoft China
HPC at Microsoft
1997  NCSA deploys first Windows clusters on NT4
2000  Windows Server 2000 ships
2001  Microsoft Computational Clustering Preview kit and Beowulf Cluster Computing with Windows book released
2002  Cornell Theory Center migrates to all-Windows infrastructure, eventually reaching over 600 nodes and 1,200 user accounts; first Top500 appearance
2003  Argonne National Labs releases MPICH on Windows
2004  Windows HPC team established in both Redmond and Shanghai
2005  Microsoft launches HPC entry at SC'05 in Seattle with Bill Gates keynote
2006  Windows Compute Cluster Server 2003 ships
2007  Microsoft named one of the Top 5 companies to watch in HPC at SC'07
2008  Windows HPC Server 2008
Top500 results for Windows clusters (Linpack efficiency in parentheses):
- Winter 2005, Microsoft: 4 procs, 9.46 GFlops
- Spring 2006, NCSA, #130: 896 cores, 4.1 TF
- Spring 2007, Microsoft, #106: 2048 cores, 9 TF (58.8%)
- Fall 2007, Microsoft, #116: 2048 cores, 11.8 TF (77.1%)
- Spring 2008, Aachen, #100: 2096 cores, 18.8 TF (76.5%)
- Spring 2008, Umea, #40: 5376 cores, 46 TF (85.5%)
- Spring 2008, NCSA, #23: 9472 cores, 68.5 TF (77.7%)
The chart contrasts Windows Compute Cluster Server 2003 and Windows HPC Server 2008 runs: a 30% efficiency improvement.
HPC Clusters in Every Lab
[Diagram: x64 server clusters]
Parallelism Everywhere
"... we see a very significant shift in what architectures will look like in the future ... fundamentally the way we've begun to look at doing that is to move from instruction level concurrency to ... multiple cores per die. But we're going to continue to go beyond there. And that just won't be in our server lines in the future; this will permeate every architecture that we build. All will have massively multicore implementations."
Pat Gelsinger, Chief Technology Officer and Senior Vice President, Intel Corporation, Intel Developer Forum, Spring 2004 (February 19, 2004)
[Chart: power density (W/cm²) of Intel processors from the 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 to the Pentium family, 1970 to 2010, rising from about 1 toward 10,000 W/cm², passing the "hot plate" level and heading for "nuclear reactor", "rocket nozzle", and "sun's surface". Source: Pat Gelsinger, Intel Developer Forum, Spring 2004]
Today's architecture: heat is becoming an unmanageable problem. To grow, to keep up, we must embrace parallel computing.
[Chart: many-core peak parallel GOPS rising from 16 in 2004 to 32,768 by 2015, versus single-threaded performance growing 10% per year: an 80X parallelism opportunity]
Today's Environment
[Diagram: users (engineers, scientists, financial analysts, information workers); infrastructure (corporate infrastructure, storage, clusters/supercomputers, high-speed networking); tools (compilers, debuggers, specialized languages, mainstream technologies)]
High Productivity Computing
- Combined infrastructure
- Integrated desktop and HPC environment
- Unified development environment
Microsoft's Productivity Vision

Administrator:
- Integrated turnkey solution
- Simplified setup and deployment
- Built-in diagnostics
- Efficient cluster utilization
- Integrates with IT infrastructure and policies

Application Developer:
- Highly productive parallel programming frameworks
- Service-oriented HPC applications
- Support for key HPC development standards
- Unix application migration

End-User:
- Seamless integration with workstation applications
- Integrated collaboration and workflow solutions
- Secure job execution and data access
- World-class performance

Windows HPC allows you to accomplish more, in less time, with reduced effort, by leveraging users' existing skills and integrating with the tools they are already using.
Industry Focused Solutions
Automotive, Aerospace, Geo Services, Financial Services, Academia, Government, Life Sciences
Systems Management:
- Rapid large-scale deployment and built-in diagnostics suite
- Integrated monitoring, management and reporting
- Familiar UI and rich scripting interface
- Integrated security via Active Directory

Job Scheduling:
- Support for batch, interactive and service-oriented applications
- High-availability scheduling
- Interoperability via OGF's HPC Basic Profile

MPI (a managed-code sketch follows this list):
- MS-MPI stack based on the MPICH2 reference implementation
- Performance improvements for RDMA networking and multi-core shared memory
- MS-MPI integrated with Windows Event Tracing

Storage:
- Access to SQL, Windows and Unix file servers
- Key parallel file server vendor support (GPFS, Lustre, Panasas)
- In-memory caching options
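The MPI bullets above can be exercised from managed code through MPI.NET, the wrapper listed in the resources at the end of this deck. A minimal sketch, assuming MPI.NET is installed and the binary is launched under mpiexec; the partial-sum computation and names are purely illustrative:

    using System;
    using MPI;   // MPI.NET wrapper over MS-MPI

    class PartialSum
    {
        static void Main(string[] args)
        {
            // Initializes MPI on construction, finalizes it on Dispose.
            using (new MPI.Environment(ref args))
            {
                Intracommunicator comm = Communicator.world;

                // Each rank sums over its own strided slice of the range.
                double partial = 0.0;
                for (int i = comm.Rank; i < 1000000; i += comm.Size)
                    partial += 1.0 / (1.0 + i);

                // Combine the partial sums on rank 0.
                double total = comm.Reduce(partial, Operation<double>.Add, 0);
                if (comm.Rank == 0)
                    Console.WriteLine("total = {0}", total);
            }
        }
    }

Run it with, e.g., mpiexec -n 8 PartialSum.exe, or as the command line of a scheduler task.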
Windows HPC Server 2008
Ease of Deployment
Comprehensive Diagnostics Suite
Single Management Console
Integrated Monitoring
Built-in Reporting
Integrated Job Scheduling
[Diagram: users submit jobs, with their UDFs, through the Scheduler to the head node, which provides job management, resource management, cluster management and scheduling; compute nodes handle job execution, running the user app over MPI and returning results]
A minimal sketch of submitting a job through the scheduler API follows.
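This sketch assumes the HPC Pack SDK's Microsoft.Hpc.Scheduler API; the head-node name and command line are placeholders:

    using System;
    using Microsoft.Hpc.Scheduler;

    class SubmitJob
    {
        static void Main()
        {
            // Connect to the cluster's head node (placeholder name).
            IScheduler scheduler = new Scheduler();
            scheduler.Connect("HEADNODE");

            // Create a job with a single command-line task.
            ISchedulerJob job = scheduler.CreateJob();
            job.Name = "SampleJob";
            ISchedulerTask task = job.CreateTask();
            task.CommandLine = "myapp.exe input.dat";   // placeholder app
            job.AddTask(task);

            // Submit; null credentials fall back to cached or prompted ones.
            scheduler.SubmitJob(job, null, null);
            Console.WriteLine("Submitted job {0}", job.Id);
        }
    }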
Service Oriented HPC
[Diagram: many UDF/service requests fanned out across the cluster]
HPC SOA Programming Model

Sequential:

    for (i = 0; i < 100000000; i++)
    {
        r[i] = worker.DoWork(dataSet[i]);
    }
    reduce(r);

Parallel:

    Session session = new Session(startInfo);
    PricingClient client = new PricingClient(binding,
        session.EndpointAddress);
    for (i = 0; i < 100000000; i++)
    {
        client.BeginDoWork(dataSet[i], new AsyncCallback(callback), i);
    }

    void callback(IAsyncResult handle)
    {
        r = client.EndDoWork(handle);
        // aggregate results
        reduce(r);
    }
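The parallel version presumes a WCF service deployed on the cluster: PricingClient is a generated client proxy, and the session object corresponds to the HPC Pack SOA session API. A minimal sketch of what the service side might look like, with the contract name and body purely illustrative (they are not on the slides):

    using System.ServiceModel;

    // Hypothetical contract behind the PricingClient proxy above.
    [ServiceContract]
    public interface IPricing
    {
        [OperationContract]
        double DoWork(double[] record);
    }

    // The cluster-side implementation the SOA broker would host on
    // compute nodes; the body is a stand-in for the real pricing model.
    public class PricingService : IPricing
    {
        public double DoWork(double[] record)
        {
            double sum = 0.0;
            foreach (double x in record)
                sum += x * x;
            return sum;
        }
    }

Each BeginDoWork call becomes one service request that the broker fans out across the cluster.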
[Diagram: matching work to resources. A big model requires large-memory machines; an ISV application requires nodes where the application is installed; a multi-threaded application such as MATLAB requires a machine with many cores, and many MATLAB instances fan out across the cluster; a 4-way structural analysis MPI job is placed across quad-core machines (cores C0-C3 with memory) and a 32-core NUMA machine (processors P0-P3, each with memory and IO)]
The scheduler is NUMA aware, capacity aware, and application aware, with placement via job context: node grouping, job templates, filters. A sketch of expressing such constraints follows.
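A hedged sketch of expressing those constraints through the same scheduler API; "StructuralAnalysis" and "BigMemNodes" are hypothetical names an administrator would have defined beforehand:

    using Microsoft.Hpc.Scheduler;

    class SubmitConstrainedJob
    {
        static void Main()
        {
            IScheduler scheduler = new Scheduler();
            scheduler.Connect("HEADNODE");              // placeholder head node

            ISchedulerJob job = scheduler.CreateJob();
            // Job templates carry admin-defined policy and defaults.
            job.SetJobTemplate("StructuralAnalysis");   // hypothetical template
            // Node groups restrict which nodes are candidates, e.g. the
            // large-memory machines or nodes with an ISV app installed.
            job.NodeGroups.Add("BigMemNodes");          // hypothetical group

            ISchedulerTask task = job.CreateTask();
            task.CommandLine = "mpiexec fem.exe model.dat";   // placeholder
            job.AddTask(task);
            scheduler.SubmitJob(job, null, null);
        }
    }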
NetworkDirect
A new RDMA networking interface built for speed and stability:
- Verbs-based design for close fit with native, high-performance networking interfaces
- Equal to hardware-optimized stacks for MPI micro-benchmarks (2 usec latency, 2 GB/sec bandwidth on ConnectX)
- OpenFabrics driver for Windows includes support for NetworkDirect, Winsock Direct and IPoIB protocols
[Diagram: two paths through the Windows networking stack. A socket-based app calls Windows Sockets (Winsock + WSD); its traffic either descends the kernel-mode TCP/IP/NDIS path to an Ethernet miniport driver, or bypasses the kernel through a user-mode Winsock Direct provider and user-mode access layer onto RDMA networking hardware. An MPI app calls MS-MPI, which uses a NetworkDirect provider for kernel bypass straight to the RDMA hardware. Components are marked as OS, CCP, or IHV components, plus the (ISV) app]
Partnering for Performance
- Networking hardware vendors: NetworkDirect design review; NetworkDirect and Winsock Direct provider development; work with the Windows Core Networking team
- Commercial software vendors: Win64 best practices; MPI usage patterns; collaborative performance tuning
- 4 benchmarking centers online: IBM, HP, Dell, SGI, and now working with Cray!
Devs can't tune what they can't see
MS-MPI integrated with Event Tracing for Windows (ETW):
- Single, time-correlated log of OS, driver, MPI, and app events
- CCS-specific additions: high-precision CPU clock correction; log consolidation from multiple compute nodes into a single record of parallel app execution
- Dual purpose: performance analysis and application troubleshooting
- Trace data display: Visual Studio and Windows ETW tools, Intel Collector/Analyzer, Vampir, Jumpshot
HPC Storage Solutions
[Chart: aggregate bandwidth (Mb/s per core) versus number of cores in the cluster, comparing Windows Server 2003 and Windows Server 2008]
Parallel file system partners: HP PolyServe, Ibrix Fusion, Quantum StorNext, SANbolic Melio file system, IBM GPFS, Panasas Active Scale, Sun Lustre.
Windows Subsystem for Unix Applications
- Complete SVR-5 and BSD UNIX environment with 300 commands, utilities, shell scripts, compilers
- Visual Studio extensions for debugging POSIX applications
- Support for 32- and 64-bit applications
Unix Application Porting
Recent port of the WRF weather model:
- 350K lines, Fortran 90 and C using MPI, OpenMP
- Traditionally developed for Unix HPC systems
- Two dynamical cores, full range of physics options
Porting experience:
- Fewer than 750 lines of code changed, primarily in the build mechanism (Makefiles, scripts)
- Level of effort and nature of tasks not unlike porting to any new version of UNIX
- Performance on par with the Linux systems
F# is... a functional, object-oriented, imperative and explorative programming language for .NET
Example: Taming Asynchronous I/O

    using System;
    using System.IO;
    using System.Threading;

    public class BulkImageProcAsync
    {
        public const String ImageBaseName = "tmpImage-";
        public const int numImages = 200;
        public const int numPixels = 512 * 512;
        // Vary the ProcessImage repeat count to make the application
        // more CPU-bound or more IO-bound.
        public static int processImageRepeats = 20;
        // Threads must decrement NumImagesToFinish, and protect their
        // access to it through a mutex.
        public static int NumImagesToFinish = numImages;
        public static Object[] NumImagesMutex = new Object[0];
        // WaitObject is signalled when all image processing is done.
        public static Object[] WaitObject = new Object[0];

        public class ImageStateObject
        {
            public byte[] pixels;
            public int imageNum;
            public FileStream fs;
        }

        // The slide elides ProcessImage's simple O(N) loop; this
        // stand-in body just burns CPU over the pixel buffer.
        public static void ProcessImage(byte[] pixels, int imageNum)
        {
            for (int r = 0; r < processImageRepeats; r++)
                for (int i = 0; i < numPixels; i++)
                    pixels[i] ^= 0xFF;
        }

        public static void ReadInImageCallback(IAsyncResult asyncResult)
        {
            ImageStateObject state = (ImageStateObject)asyncResult.AsyncState;
            Stream stream = state.fs;
            int bytesRead = stream.EndRead(asyncResult);
            if (bytesRead != numPixels)
                throw new Exception(String.Format(
                    "In ReadInImageCallback, got the wrong number of " +
                    "bytes from the image: {0}.", bytesRead));
            ProcessImage(state.pixels, state.imageNum);
            stream.Close();
            // Now write out the image. Using asynchronous I/O here appears
            // not to be best practice: it ends up swamping the threadpool,
            // because the threadpool threads are blocked on I/O requests
            // that were just queued to the threadpool.
            FileStream fs = new FileStream(ImageBaseName + state.imageNum +
                ".done", FileMode.Create, FileAccess.Write, FileShare.None,
                4096, false);
            fs.Write(state.pixels, 0, numPixels);
            fs.Close();
            // This application model uses too much memory; release memory
            // (especially global state) as soon as possible.
            state.pixels = null;
            fs = null;
            // Record that an image is finished now.
            lock (NumImagesMutex)
            {
                NumImagesToFinish--;
                if (NumImagesToFinish == 0)
                {
                    Monitor.Enter(WaitObject);
                    Monitor.Pulse(WaitObject);
                    Monitor.Exit(WaitObject);
                }
            }
        }

        public static void ProcessImagesInBulk()
        {
            Console.WriteLine("Processing images... ");
            long t0 = Environment.TickCount;
            NumImagesToFinish = numImages;
            AsyncCallback readImageCallback =
                new AsyncCallback(ReadInImageCallback);
            for (int i = 0; i < numImages; i++)
            {
                ImageStateObject state = new ImageStateObject();
                state.pixels = new byte[numPixels];
                state.imageNum = i;
                // Very large items are read only once, so you can make the
                // buffer on the FileStream very small to save memory.
                FileStream fs = new FileStream(ImageBaseName + i + ".tmp",
                    FileMode.Open, FileAccess.Read, FileShare.Read, 1, true);
                state.fs = fs;
                fs.BeginRead(state.pixels, 0, numPixels, readImageCallback,
                    state);
            }
            // If any images are still being processed, block until all finish.
            bool mustBlock = false;
            lock (NumImagesMutex)
            {
                if (NumImagesToFinish > 0)
                    mustBlock = true;
            }
            if (mustBlock)
            {
                Console.WriteLine("All worker threads are queued. " +
                    "Blocking until they complete. numLeft: {0}",
                    NumImagesToFinish);
                Monitor.Enter(WaitObject);
                Monitor.Wait(WaitObject);
                Monitor.Exit(WaitObject);
            }
            long t1 = Environment.TickCount;
            Console.WriteLine("Total time processing images: {0}ms", t1 - t0);
        }
    }
Example: Taming Asynchronous I/O (equivalent F# code, same perf)
Processing 200 images in parallel:

    let ProcessImageAsync(i) =
        async { // Open the file, synchronously
                let inStream = File.OpenRead(sprintf "source%d.jpg" i)
                // Read from the file, asynchronously ("!" = "asynchronous")
                let! pixels = inStream.ReadAsync(numPixels)
                let pixels' = TransformImage(pixels, i)
                let outStream = File.OpenWrite(sprintf "result%d.jpg" i)
                // Write the result, asynchronously
                do! outStream.WriteAsync(pixels')
                do Console.WriteLine "done!" }

    let ProcessImagesAsync() =
        // Generate the tasks and queue them in parallel;
        // the async object coordinates completion.
        Async.Run (Async.Parallel
            [ for i in 1 .. numImages -> ProcessImageAsync(i) ])
Microsoft HPC++ Experience
- Application benefits: the most productive distributed application development environment
- System benefits: cost-effective, reliable and high-performance server operating system
- Cluster benefits: complete HPC cluster platform integrated with the enterprise infrastructure
Resources
www.microsoft.com/hpc
www.microsoft.com/science
www.microsoft.com/servers
www.microsoft.com/sql
www.microsoft.com/excel
research.microsoft.com/fsharp
www.osl.iu.edu/research/mpi.net
www.microsoft.com/msdn
www.microsoft.com/technet
Thank you!