Post on 15-Jan-2016
![Page 1: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/1.jpg)
NetSolve
Henri Casanova and Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory
http://www.cs.utk.edu/netsolve
![Page 2: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/2.jpg)
Objectives
Harnessing vast computational resources on the network
  Hardware
  Software
Convenient for scientific computing community
  Reducing installation and programming overhead
  Masking complexity related to distributed computing
![Page 3: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/3.jpg)
Computation-Sharing Models
Proxy Computing
[Diagram: Client and Server; labels Data, Code, Data, Code; computation on the server]
![Page 4: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/4.jpg)
Computation-Sharing Models
Code Shipping
[Diagram: Client and Server; labels Code, Data, Code; computation on the client]
![Page 5: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/5.jpg)
Computation-Sharing Models
Remote Computation
[Diagram: Client and Server; labels Data, Data, Code; computation on the server]
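The three models differ in which side supplies the code, which side supplies the data, and where the computation runs. The sketch below is one reading of the diagrams, not part of the original slides: the class `SharingModel`, its field names, and the assignment of code and data to client or server are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharingModel:
    name: str
    code_from: str  # side that initially holds the code (assumed from the diagrams)
    data_from: str  # side that initially holds the data (assumed from the diagrams)
    runs_on: str    # side where the computation executes (stated on the slides)

    def transfers(self):
        """Return what has to cross the network before the computation can start."""
        moved = []
        if self.code_from != self.runs_on:
            moved.append("code")
        if self.data_from != self.runs_on:
            moved.append("data")
        return moved


MODELS = [
    SharingModel("proxy computing", "client", "client", "server"),
    SharingModel("code shipping", "server", "client", "client"),
    SharingModel("remote computation", "server", "client", "server"),
]

for m in MODELS:
    print(f"{m.name}: runs on {m.runs_on}, ships {m.transfers()}")
```

Under these assumptions, remote computation is the only model in which nothing but data travels, with the code staying on the server.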
![Page 6: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/6.jpg)
Design issues
Platform independence to accommodate heterogeneity
User friendly
Extensibility
Load balancing
Fault tolerance
![Page 7: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/7.jpg)
NetSolve Architecture
[Diagram: NetSolve architecture; surviving labels: “OS”, Resources]
![Page 8: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/8.jpg)
NetSolve Organization and Operation
![Page 9: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/9.jpg)
NetSolve Client Interface
C, Fortran, Java, Matlab, and Mathematica
>> a = rand(100); b = rand(100,1);
>> x = netsolve('ax = b', a, b);

>> a = rand(100); b = rand(100,1);
>> request = netsolve_nb('send', 'ax = b', a, b);
>> x = netsolve_nb('probe', request);
Not ready
>> x = netsolve_nb('wait', request);
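The send / probe / wait pattern of the non-blocking interface can be mimicked locally. The sketch below is not the NetSolve API: `NonBlockingClient` and `solve_linear` are hypothetical names, and a thread pool stands in for the remote server so that the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_linear(a, b):
    """Local stand-in for a remote 'ax = b' solver.

    Gaussian elimination with partial pivoting on a small dense system.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


class NonBlockingClient:
    """Hypothetical client exposing send / probe / wait, as in the Matlab session."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=4)

    def send(self, a, b):
        # Returns immediately with a request handle.
        return self._pool.submit(solve_linear, a, b)

    def probe(self, request):
        # True once the result is available; never blocks.
        return request.done()

    def wait(self, request):
        # Blocks until the result is available, then returns it.
        return request.result()


client = NonBlockingClient()
request = client.send([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
ready = client.probe(request)  # may be False if the solver is still running
x = client.wait(request)
print(x)
```

Between `send` and `wait` the caller is free to do other work or to issue further requests, which is where the parallelism comes from.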
![Page 10: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/10.jpg)
NetSolve Wrappers
Problem description file for extensibility
@PROBLEM ipars
@INCLUDE "ipars.h"
@LIB /home/user/lib/libipars.a
@DESCRIPTION
Parallel Sub-Surface Flow Simulator
@INPUT 2
@OBJECT STRING CHAR model
@OBJECT FILE CHAR infile
Compiled into wrappers around scientific libraries
XDR for platform-independent data transfer
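To illustrate why XDR gives platform independence, the sketch below encodes a variable-length array of doubles the way the XDR standard (RFC 4506) lays it out: a 4-byte big-endian count followed by 8-byte big-endian IEEE 754 values. The function names are hypothetical, and how NetSolve itself arranges its objects on the wire is not described in the slides.

```python
import struct


def xdr_pack_double_array(values):
    """Encode a variable-length array of doubles in XDR layout.

    4-byte big-endian unsigned element count, then each element as an
    8-byte big-endian IEEE 754 double, regardless of the host byte order.
    """
    return struct.pack(">I", len(values)) + struct.pack(f">{len(values)}d", *values)


def xdr_unpack_double_array(payload):
    """Decode the layout produced by xdr_pack_double_array."""
    (count,) = struct.unpack_from(">I", payload, 0)
    return list(struct.unpack_from(f">{count}d", payload, 4))


wire = xdr_pack_double_array([1.0, -2.5, 3.25])
print(len(wire), xdr_unpack_double_array(wire))
```

Because the byte order is fixed by the format rather than by the sender, a little-endian client and a big-endian server decode the same bytes to the same values.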
![Page 11: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/11.jpg)
NetSolve Load Balancing
Assigning a task to the “best” machine
  Establishing a performance model
    Network delay, server properties, task properties
  Measuring and monitoring dynamic system states
Load balancing at a finer granularity
  Parallelism through non-blocking interface
  Task migration
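A performance model of this kind can be sketched as an estimated completion time per server, combining network delay, server properties, and task properties. The formula, the `Server` fields, and the sample numbers below are assumptions for illustration, not the model NetSolve actually uses.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    flops: float      # peak speed in floating-point operations per second
    load: float       # fraction of the CPU already busy, 0.0 <= load < 1.0
    bandwidth: float  # bytes per second between client and server
    latency: float    # seconds


def estimated_time(server, task_bytes, task_flops):
    """Assumed model: transfer time plus compute time on the unloaded share of the CPU."""
    network = server.latency + task_bytes / server.bandwidth
    compute = task_flops / (server.flops * (1.0 - server.load))
    return network + compute


def pick_best(servers, task_bytes, task_flops):
    """Return the server with the smallest estimated completion time."""
    return min(servers, key=lambda s: estimated_time(s, task_bytes, task_flops))


servers = [
    Server("fast-but-far", flops=1e9, load=0.0, bandwidth=1e6, latency=0.1),
    Server("slow-but-near", flops=1e8, load=0.0, bandwidth=1e8, latency=0.001),
    Server("fast-but-busy", flops=1e9, load=0.5, bandwidth=1e8, latency=0.001),
]
best = pick_best(servers, task_bytes=8e6, task_flops=1e9)
print(best.name)
```

The load and network figures are the dynamic state that has to be measured and refreshed; the ranking is only as good as those measurements.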
![Page 12: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/12.jpg)
NetSolve Fault Tolerance
Inter-server fault tolerance
  Fault tolerance among NetSolve servers
Intra-server fault tolerance
  Fault tolerance within a NetSolve server
![Page 13: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/13.jpg)
NetSolve Fault Tolerance
Inter-server Fault Tolerance
Performed by NetSolve agents
Basic approach
  Failure detection + task reallocation
  Overload detection + task migration
Introducing NetSolve storage servers
  Store checkpoints or any information related to fault tolerance (must be platform-independent)
  No reliance on failed or overloaded server for task migration
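The basic approach of failure detection followed by task reallocation can be sketched as a retry loop over a ranked list of servers. Everything named here (`ServerFailure`, `run_with_reallocation`, the simulated servers) is hypothetical; the slides do not say how the agent detects failures, and checkpoints and storage servers are left out.

```python
class ServerFailure(Exception):
    """Raised when a (simulated) server cannot complete a task."""


def run_with_reallocation(task, servers):
    """Try servers in ranked order; on a detected failure, reallocate to the next one.

    Returns (name of the server that succeeded, result, names of failed servers).
    """
    failed = []
    for name, execute in servers:
        try:
            return name, execute(task), failed
        except ServerFailure:
            failed.append(name)  # failure detected: record it and move on
    raise RuntimeError(f"all servers failed: {failed}")


def broken(task):
    raise ServerFailure("server unreachable")


def healthy(task):
    return sum(task)


servers = [("s1", broken), ("s2", healthy)]
used, result, failed = run_with_reallocation([1, 2, 3], servers)
print(used, result, failed)
```

Restarting from scratch on the next server is the simplest policy; a stored, platform-independent checkpoint would let the new server resume instead.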
![Page 14: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/14.jpg)
NetSolve Fault Tolerance
Intra-server Fault Tolerance
Not a new problem
Could be invisible to NetSolve
Can take advantage of platform-specific features for fault tolerance
Possible integration with inter-server fault tolerance
![Page 15: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/15.jpg)
Diskless Checkpointing
Checksums and Reverse Computation
Diskless checkpointing eliminates the need for stable storage
N servers + a checkpointing server
  At any point, consistent checkpoints taken at N servers (stored in memory)
  A checksum of checkpoints stored at the checkpointing server
Rollback using reverse computation
State recovery using the checksum
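State recovery from a checksum can be illustrated with bytewise XOR parity, one common choice of checksum for diskless checkpointing. The slides do not say which checksum is used, so the XOR encoding, the function names, and the equal-length toy checkpoints below are assumptions; rollback by reverse computation is not covered.

```python
from functools import reduce


def xor_bytes(x, y):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(p ^ q for p, q in zip(x, y))


def parity(checkpoints):
    """Checksum kept by the checkpointing server: XOR of all N in-memory checkpoints."""
    return reduce(xor_bytes, checkpoints)


def recover(surviving, checksum):
    """Rebuild the single lost checkpoint from the survivors and the checksum."""
    return reduce(xor_bytes, surviving, checksum)


checkpoints = [b"state-A!", b"state-B!", b"state-C!"]  # one per server, equal length
checksum = parity(checkpoints)

lost = 1  # pretend the second server failed and its memory is gone
surviving = [c for i, c in enumerate(checkpoints) if i != lost]
restored = recover(surviving, checksum)
print(restored)
```

With a single parity checksum, one failure among the N servers can be tolerated at a time; no disk is involved because every checkpoint and the checksum live in memory.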
![Page 16: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/16.jpg)
Applications
MCell with NetSolve
  Large code, small data
Matlab with NetSolve
  Tradeoffs between parallelism and overhead
IPARS with NetSolve
ImageVision with NetSolve
![Page 17: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/17.jpg)
![Page 18: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/18.jpg)
Integration with ScaLAPACK
![Page 19: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/19.jpg)
Integration with Condor
![Page 20: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/20.jpg)
Integration with Ninf
![Page 21: NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory](https://reader033.vdocuments.mx/reader033/viewer/2022051215/56649d535503460f94a2f222/html5/thumbnails/21.jpg)
Conclusion
An interesting infrastructure for sharing computational resources
  Both software and hardware
Convenience, performance, and reliability
Playground for fault tolerance
  Both general and specific