TRANSCRIPT
The NAREGI Resource Management Framework
Satoshi Matsuoka, Professor, Global Scientific Information and Computing Center; Deputy Director, NAREGI Project
Tokyo Institute of Technology / NII
National Research Grid Infrastructure (NAREGI) 2003-2007
• Petascale Grid Infrastructure R&D for Future Deployment
– $45 mil (US) + $16 mil x 5 (2003-2007) = $125 mil total
– Hosted by National Institute of Informatics (NII) and Institute of Molecular Science (IMS)
– PL: Ken Miura (Fujitsu → NII)
• Sekiguchi (AIST), Matsuoka (Titech), Shimojo (Osaka-U), Aoyagi (Kyushu-U), …
– Participation by multiple (>= 3) vendors: Fujitsu, NEC, Hitachi, NTT, etc.
– Follow and contribute to GGF standardization, esp. OGSA
[Diagram: National Research Grid Middleware R&D — grid middleware, SuperSINET, grid R&D infrastructure (15 TF–100 TF), grid and network management. Focused "Grand Challenge" grid apps areas: the "NanoGrid" (IMS, ~10 TF) for nanotech grid apps, (BioGrid / RIKEN) for biotech grid apps, and other institutes/apps. Participants include AIST, Titech, Fujitsu, NEC, Osaka-U, Kyushu-U, Hitachi, and U-Tokyo.]
[Diagram labels: Industry / Societal Feedback; International Infrastructural Collaboration]
Towards a Cyber-Science Infrastructure for R&D

[Diagram: Cyber-Science Infrastructure (CSI) — restructuring university IT research resources; extensive on-line publication of results; a management body with education & training; deployment of NAREGI middleware; virtual labs and live collaborations; UPKI, the national research PKI infrastructure; SuperSINET and beyond, a lambda-based academic networking backbone. Participating centers: Hokkaido-U, Tohoku-U, Tokyo-U, NII, Nagoya-U, Kyoto-U, Osaka-U, Kyushu-U (plus Titech, Waseda-U, KEK, etc.). NAREGI output feeds GeNii (Global Environment for Networked Intellectual Information) and NII-REO (Repository of Electronic Journals and Online Publications).]
Participating Organizations
• National Institute of Informatics (NII)
– HQ + Center for Grid Research & Development
• Institute for Molecular Science (IMS)
– Center for Computational Grid Nano-science Applications
• Universities and National Labs (Leadership & Deployment)
– AIST, Titech, Osaka-U, Kyushu-U, etc. => leadership
– National Centers => customers, test deployment
– Collaborating application labs (w/ IMS)
• Developer Vendors (Workforce)
– Fujitsu, NEC, Hitachi, Somu, NTT, etc. => ~150 FTEs
• Application Vendors
– "Consortium for Promotion of Grid Apps in Industry"
NAREGI Middleware Roadmap

FY2003 – FY2007:
• FY2003-04: UNICORE/Globus/Condor-based R&D framework; prototyping NAREGI middleware components; apply component technologies to nano apps and evaluation.
• FY2004-05: development and integration of the α release; evaluation of the α release in the NII-IMS testbed; α release (private).
• FY2005-06: transition to the OGSA/WSRF-based R&D framework; development and integration of the β release; evaluation of the β release by IMS and other collaborating institutes; β release (public, May 2006); project midpoint evaluation.
• FY2006-07: development of OGSA-based middleware; deployment of β; evaluation on the NAREGI wide-area testbed; verification & evaluation of Ver. 1; Version 1.0 release.
Testbed use moves from the NAREGI NII-IMS testbed to the NAREGI wide-area testbed.
NAREGI Software Stack (Beta Ver. 2006)

• Grid-Enabled Nano-Applications
• Grid PSE; Grid Programming (GridRPC, GridMPI); Grid Visualization
• Grid Workflow; Super Scheduler; Distributed Information Service; Grid VM; Data; Packaging
• (Globus, Condor, UNICORE → OGSA/WSRF)
• High-Performance & Secure Grid Networking (SuperSINET)
• Computing resources at NII, IMS, research organizations, etc.
R&D in Grid Software and Networking Area (Work Packages)

• Work Package Structure (~150 FTEs):
– Universities and national labs: technology leadership
– Vendors (Fujitsu, NEC, Hitachi, etc.): professional development
• WP-1: Resource Management — Matsuoka (Titech), Saga (NII), Nakada (AIST/Titech)
• WP-2: Programming Middleware — Sekiguchi (AIST), Ishikawa (U-Tokyo), Tanaka (AIST)
• WP-3: Application Grid Tools — Usami (NII), Kawata (Utsunomiya-U)
• WP-4: Data Management (new FY2005, beta) — Matsuda (Osaka-U)
• WP-5: Networking & Security — Shimojo (Osaka-U), Kanamori (NII), Oie (Kyushu Tech.)
• WP-6: Grid-Enabling Nanoscience Applications — Aoyagi (Kyushu-U)
NAREGI Programming Models

• High-Throughput Computing
– But with complex data exchange in between
– NAREGI Workflow or GridRPC
• Metacomputing (cross-machine parallel)
– Workflow (w/ co-scheduling) + GridMPI
– GridRPC (for task-parallel or task/data-parallel)
• Coupled Multi-Resolution Simulation
– Workflow (w/ co-scheduling) + GridMPI + coupling components
– Mediator (coupled simulation framework)
– GIANT (coupled simulation data exchange framework)
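The GridRPC style mentioned above (as standardized in GGF and implemented by Ninf-G) issues asynchronous remote procedure calls to grid-resident servers and waits on the resulting handles. A minimal sketch of that task-parallel pattern, emulated locally with Python's thread pool (all names here are illustrative, not the Ninf-G API):

```python
# Sketch of the GridRPC task-parallel model, emulated with a local
# thread pool standing in for asynchronous calls to remote grid servers.
from concurrent.futures import ThreadPoolExecutor

def remote_simulation(task_id, param):
    # Stand-in for a remote call; a real GridRPC client would marshal
    # arguments to a server chosen via the information service.
    return task_id, param * param

def run_task_parallel(params):
    # Fire one asynchronous "RPC" per parameter (cf. grid_call_async),
    # then wait for all handles (cf. grid_wait_all) and collect results.
    with ThreadPoolExecutor(max_workers=4) as pool:
        handles = [pool.submit(remote_simulation, i, p)
                   for i, p in enumerate(params)]
        return dict(h.result() for h in handles)

results = run_task_parallel([1.0, 2.0, 3.0])
```

The same shape covers the high-throughput case (many independent calls) and the task-parallel metacomputing case, with co-scheduled GridMPI reserved for tightly coupled runs.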
Nano-Science: Coupled Simulations on the Grid as the Sole Future for True Scalability

Coupling between continuum and quanta (10^-6 to 10^-9 m) is the only way to achieve true scalability:
• Material physics (infinite system): fluid dynamics, statistical physics, condensed matter theory, …
• Molecular science: quantum chemistry, molecular orbital methods, molecular dynamics, …
Multi-physics bridges the limit of computing capability and the limit of idealization. The Grid coordinates decoupled resources for meta-computing, high-throughput computing, and multi-physics simulation, with components and data from different groups within a VO composed in real time — in contrast to the old HPC environment of decoupled resources, limited users, and special software.
NAREGI Alpha Middleware (2004-5)

1. Resource Management (WP1)
• Unicore/Globus/Condor as "skeletons" and underlying service providers
• OGSA-EMS job management, brokering, scheduling — the first OGSA-EMS-based implementation in the world
• Coupled simulation on the Grid (co-allocation/scheduling)
• WS(RF)-based monitoring/accounting federation
2. Programming (WP2)
• High-performance, standards-based GridMPI (MPI-1/2, IMPI), highly tuned for the Grid environment (esp. collectives)
• Ninf-G: scientific RPC for Grids (GridRPC, a GGF programming standard)
3. Applications Environment (WP3)
• Application discovery and installation management (PSE)
• Workflow tool for coupled application interactions
• WSRF-based large-scale grid visualization (terabyte-scale interactive visualization on distributed servers)
4. Networking and Security (WP5)
• Production-grade CA (drop-in replacement for SimpleCA)
• Unicore and GSI security federation
• Out-of-band real-time network monitoring/control
5. Grid-Enabling Nano Simulations (WP6)
• Framework for coupled simulations (e.g., large-scale coupled simulation of proteins in a solvent on a Grid)
Lifecycle of Grid Apps and Infrastructure

[Diagram: VO application developers & managers register and deploy user apps through the application contents service; high-level workflows (NAREGI WFML) compose user apps with GridRPC/GridMPI for metacomputing; many VO users submit workflows and coupled apps through the SuperScheduler to GridVMs; the NAREGI distributed grid information service and the grid-wide data management service (GridFS, metadata, staging, etc.) place and register data (Data 1 … Data n) on the Grid and assign metadata to it.]
PSE: Registration & Deployment of Applications (OGSA-ACS)

① The application developer registers an application into the application pool: application summary, program source files, input files, resource requirements, etc.
② Select a compiling host, using resource info. from the NAREGI distributed information service.
③ Compile on the selected server (e.g., "Compiling OK!" on server #1).
④ Send back the compiled application environment.
⑤ Select deployment hosts by test runs (e.g., "Test Run NG!" on server #1, "Test Run OK!" on servers #2 and #3).
⑥ Deploy to the computing resources.
⑦ Register the deployment info.
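The numbered PSE steps above — compile once, test-run everywhere, deploy only where the test succeeds — can be sketched as a small driver function. Everything here (function names, host names) is illustrative, not the OGSA-ACS interface:

```python
# Sketch of the PSE deployment flow: compile an application on one host,
# then deploy it only to hosts where a test run succeeds.
def deploy_application(app, hosts, compile_on, test_run):
    compile_host = hosts[0]                  # (2) select compiling host
    binary = compile_on(compile_host, app)   # (3)-(4) compile, get binary back
    deployed = []
    for host in hosts:                       # (5) test-run on each candidate
        if test_run(host, binary):           # keep only "Test Run OK" hosts
            deployed.append(host)            # (6) deploy
    return deployed                          # (7) deployment info to register

# Illustrative usage with stand-in compile/test functions:
def fake_compile(host, app):
    return app + ".bin"

def fake_test(host, binary):
    return host != "server-1"   # server-1 fails its test run, as in the slide

deployed = deploy_application("nano-app",
                              ["server-1", "server-2", "server-3"],
                              fake_compile, fake_test)
```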
Dynamic Resource and Execution Management (OGSA-EMS + NAREGI Extensions)

[Diagram: the client submits a workflow (abstract JSDL) to the SuperScheduler (OGSA-RSS), which issues resource queries against the distributed information service (CIM, via DAI), performs reservation-based co-allocation, and produces concrete JSDL for the GridVM service containers (OGSA-SC) on each computing resource — reservation, submission, query, control. Accounting is based on CIM and UR/RUS; network info. comes from network monitoring & control.]
NAREGI Execution Management "Under the Hood"

Abbreviations:
SS: Super Scheduler
JSDL: Job Submission Description Language
JM: Job Manager
EPS: Execution Planning Service
CSG: Candidate Set Generator
RS: Reservation Service
IS: Information Service
SC: Service Container
AGG-SC: Aggregate SC
CES-SC: Co-allocation Execution SC
FTS-SC: File Transfer Service SC
BES: Basic Execution Service
CES: Co-allocation Execution Service
CIM: Common Information Model

[Diagram: the NAREGI-WP3 workflow tool passes a NAREGI-WFML workflow to the JM client (submit / status / delete / cancel). WFML2BPEL converts it to BPEL (including JSDL), which the JM — a BPEL engine inside the SS — executes (CreateActivity(FromBPEL), GetActivityStatus, ControlActivity). The JM invokes the EPS, which calls the CSG (GenerateCandidateSet); the CSG generates an SQL query from the JSDL against the IS (OGSA-DAI, CIM, PostgreSQL) and selects resources from the JSDL. The RS makes or cancels reservations. Concrete JSDL then flows to the service containers: the AGG-SC aggregates CES-SCs for co-allocation/MPI jobs; the FTS-SC handles file transfers (globus-url-copy, uber-ftp); GRAM4-specific SCs translate JSDL to RSL for GRAM4/GridVM over local schedulers (PBS, LoadLeveler); simple jobs go to a single SC. Usage data is reported to OGSA-RUS.]
JSDL: GGF Job Submission Description Language and NAREGI Extensions

■ Job submission with JSDL. Coupled job requirement — resource requirements for 2 OS types:
• Sub-job 1 — OS: Linux; # of CPUs (MPI tasks): 12; # of nodes: 6; physical memory: >= 32MB/CPU; walltime limit: 8 min; …
• Sub-job 2 — OS: AIX; # of CPUs (MPI tasks): 8; # of nodes: 1; physical memory: >= 256MB/CPU; walltime limit: 8 min; …

Available resources: Linux cluster (4 CPUs, 2 nodes); Linux PC (2 CPUs, 1 node); Linux cluster (4 CPUs, 2 nodes); Linux cluster (10 CPUs, 5 nodes); AIX server (32 CPUs, 1 node).

Attribute                               Description
/naregi-jsdl:SubJobID                   Subjob ID (NAREGI extension for co-allocation)
/naregi-jsdl:CPUCount                   # of total CPUs
/naregi-jsdl:TasksPerHost               # of tasks per host
/naregi-jsdl:TotalTasks                 # of total MPI tasks
/naregi-jsdl:NumberOfNodes              # of nodes
/naregi-jsdl:CheckpointablePeriod       Checkpoint period
/naregi-jsdl:JobStartTrigger            Job start time
/jsdl:PhysicalMemory                    Physical memory for each process
/jsdl:ProcessVirtualMemoryLimit         Virtual memory for each process
/jsdl:WallTimeLimit                     Wall time limit
/jsdl:CPUTimeLimit                      CPU time limit
/jsdl:FileSizeLimit                     Maximum file size
/jsdl:Queue                             Queue name

The /naregi-jsdl:* attributes are NAREGI extensions for parallel jobs and for co-allocation.
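A candidate set generator matches such JSDL requirements against the resource descriptions in the information service. A hedged sketch of that matchmaking, using the example resources from this slide (the dictionary attribute names are illustrative; a real CSG would generate an SQL query from the JSDL):

```python
# Matchmaking sketch: find resources satisfying a sub-job's requirements.
resources = [
    {"name": "linux-cluster-1", "os": "Linux", "cpus": 4,  "nodes": 2},
    {"name": "linux-pc",        "os": "Linux", "cpus": 2,  "nodes": 1},
    {"name": "linux-cluster-2", "os": "Linux", "cpus": 4,  "nodes": 2},
    {"name": "linux-cluster-3", "os": "Linux", "cpus": 10, "nodes": 5},
    {"name": "aix-server",      "os": "AIX",   "cpus": 32, "nodes": 1},
]

def candidates(req):
    # Keep resources whose OS matches and whose capacity covers the request.
    return [r["name"] for r in resources
            if r["os"] == req["os"]
            and r["cpus"] >= req["cpus"]
            and r["nodes"] >= req["nodes"]]

linux_req = {"os": "Linux", "cpus": 12, "nodes": 6}
aix_req   = {"os": "AIX",   "cpus": 8,  "nodes": 1}

aix_hosts = candidates(aix_req)       # one AIX server suffices
linux_hosts = candidates(linux_req)   # empty: no single Linux resource fits
```

Note that no single Linux resource satisfies the 12-CPU / 6-node sub-job, which is exactly why the SubJobID extension and reservation-based co-allocation (next slide) exist: the request must be split across several sites.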
3, 4: Co-allocation and Reservation

A meta-computing scheduler is required to allocate and execute jobs on multiple sites simultaneously. The super scheduler negotiates job execution times with the local reservation services (RSs) and reserves resources that can execute the jobs at the same time.
[Diagram: ① an abstract JSDL requesting 10 CPUs reaches the Super Scheduler's Execution Planning Services (with meta-scheduling). ② The Candidate Set Generator, querying the distributed information service, returns candidates: Local RS 1 EPR (8 CPUs) and Local RS 2 EPR (6 CPUs). ③-⑤ The Reservation Service splits the request into concrete JSDLs of 8 + 2 CPUs, negotiates times with the local reservation services of cluster (site) 1 and 2 (e.g., an offer of 14:00- for 3:00), and creates an abstract agreement instance (EPR). ⑥-⑪ Concrete JSDL (8) and concrete JSDL (2) are reserved for the common slot 15:00-18:00, agreement instances are created at each local RS, and the jobs are handed to the service containers / GridVMs at both sites. (Local RS #: Local Reservation Service #.)]
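The core of the negotiation above is finding one time window that every local reservation service can offer. A minimal sketch of that intersection step, with hypothetical hour-granularity offers (the real protocol exchanges WS-Agreement messages, not tuples):

```python
# Reservation-based co-allocation sketch: intersect the free windows
# offered by each local reservation service and take the earliest slot
# long enough for the job.
def co_allocate(window_lists, duration):
    # window_lists: one list of (start_hour, end_hour) offers per local RS.
    common = window_lists[0]
    for offers in window_lists[1:]:
        common = [(max(s1, s2), min(e1, e2))
                  for s1, e1 in common for s2, e2 in offers
                  if min(e1, e2) - max(s1, s2) >= duration]
    return sorted(common)[0] if common else None

rs1_offers = [(14, 17), (18, 22)]   # offers from local RS 1
rs2_offers = [(15, 19)]             # offers from local RS 2

slot = co_allocate([rs1_offers, rs2_offers], duration=2)
```

If no common window exists, the super scheduler must renegotiate (ask the local RSs for later offers) rather than submit the sub-jobs at different times.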
NAREGI Grid Information Service Overview

[Diagram: clients of the distributed information service (resource broker, SS/RB, GridVM, and user/admin viewers) access it through a client library and Java API. Information service nodes form a hierarchy — hierarchical filtered aggregation, distributed query — with redundancy for fault tolerance added in 2005. Each cell-domain information service aggregates CIM providers (OS, processor, file system, performance via Ganglia, gridmap/ACL, job queue) on nodes A/B/C through a lightweight CIMOM service, a decode service, and an aggregate service backed by an ORDB (OGSA-DAI). GridVM, as a chargeable service, inserts usage records through the RUS port (GWSDL, RUS::insert).]
Sample: CIM Schema and NAREGI Extensions for GGF Usage Record

[Class diagram: the standard CIM core and system classes — ManagedElement → ManagedSystemElement → LogicalElement → EnabledLogicalElement → Job (Owner, StartTime, ElapsedTime, …) → ConcreteJob (InstanceID {key}, Name) → BatchJob (JobID, MaxCPUTime, JobState, TimeCompleted, JobOrigination, Task, JobStatus, TimeSubmitted, JobRunTimes, UntilTime, Notify, Priority, PercentComplete, TimeOfLastStateChange, DeleteOnCompletion, ErrorCode, …); System → ComputerSystem → UnitaryComputerSystem; JobDestination → JobQueue (QueueStatus, DefaultJobPriority, MaxTimeOnQueue, MaxJobsOnQueue, MaxJobCPUTime); SystemPartition; plus the HostedJobDestination, JobDestinationJobs, ComponentCS, StatisticalData, and ElementStatisticalData associations — are extended with classes mapped from the GGF Usage Record schema (GGF UR-WG): JobUsageRecord and H_UsageRecord (KeyInfo, recordID {key}, createTime, storeSubjectName, storeTimeStamp) with subclasses for processors, disk, memory, swap, network, node count, CPU duration, time duration/instant, and property-differentiated records, and H_UsageRecordID carrying the global job ID and global user ID. A legend distinguishes CIM classes that already existed for use from classes added for usage recording (new in '04).]
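To make the mapping concrete, here is a minimal sketch of a usage-record document in the spirit of the GGF Usage Record properties listed above (global user/job IDs, CPU duration), built with the Python standard library. The element names follow those properties informally; the real UR schema's namespaces and attribute details are not reproduced:

```python
# Build a small usage-record-style XML document with the standard library.
import xml.etree.ElementTree as ET

def make_usage_record(record_id, global_user_id, global_job_id, cpu_seconds):
    ur = ET.Element("UsageRecord")
    ET.SubElement(ur, "RecordIdentity", recordId=record_id)
    ET.SubElement(ur, "GlobalUserId").text = global_user_id
    ET.SubElement(ur, "GlobalJobId").text = global_job_id
    ET.SubElement(ur, "CpuDuration").text = str(cpu_seconds)
    return ET.tostring(ur, encoding="unicode")

xml = make_usage_record("ur-001", "user-42", "job-7", 480)
```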
GridVM: OGSA-EMS Service Container for Local Job Submission & Mgmt

GridVM provides a virtual execution environment on each site: job execution services, resource management services, and a secure, isolated environment.

GridVM features:
• Platform independence as an OGSA-EMS SC
– WSRF OGSA-EMS service container interface for heterogeneous platforms and local schedulers
– "Extends" Globus 4 WS-GRAM
– Job submission using JSDL
– Job accounting using UR/RUS
– CIM provider for resource information
• Meta-computing and coupled applications
– Advance reservation for co-allocation
• Site autonomy
– WS-Agreement-based job execution
– XACML-based access control of resource usage
• Virtual Organization (VO) management
– Access control and job accounting based on VOs (VOMS & GGF-UR)

[Diagram: the workflow tool turns abstract JSDL into concrete JSDL for the SuperScheduler, which drives the GridVM WSRF service — reservation, submission, query, control; resource info. flows to the information service, and accounting to CIM and UR/RUS.]
GridVM Architecture (β Ver.)

[Diagram: an upper-layer M/W + BES API submits via globusrun (RSL + JSDL'); the GRAM services (with RFT file transfer and delegation) pass jobs through a GRAM adapter to the GridVM scheduler, which delegates transfer requests, generates events, and drives the GridVM job factory and GridVM job as an extension service over the local scheduler. The GridVM engine executes processes under SUDO with sandboxing, providing basic job management plus authentication and authorization at each site. GridVM is combined as an extension module to GT4 GRAM; future compliance with and extensions to OGSA-BES are planned.]
NAREGI Data Grid Beta (WP4)

Grid-wide data sharing service:
• Data access management — import data (Data 1 … Data n) into workflow jobs (Job 1 … Job n)
• Metadata management — assign metadata to data
• Data resource management — place & register data on the Grid and store it into distributed file-system nodes (grid-wide file system)

Data Grid components (WP4 details):
• Workflow (WFML => BPEL+JSDL) drives the Super Scheduler (SS, OGSA-RSS); data staging uses the OGSA-RSS FTS SC with GridFTP, and GGF-SRM in beta 2.
• Data access management: jobs on computational nodes import data from file-system nodes via the Grid workflow.
• Metadata construction: a data-specific metadata DB and a data resource information DB, built on OGSA-DAI WSRF 2.0, Globus Toolkit 4.0.1, Tomcat 5.0.28, PostgreSQL 8.0, and MySQL 5.0.
• Grid-wide file system: Gfarm 1.2 PL4.
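The "place & register data, assign metadata, import into workflow" steps above can be sketched with a toy catalog; a plain dictionary stands in for the OGSA-DAI-backed metadata databases, and all names and URLs are illustrative:

```python
# Toy data-grid catalog: register data with metadata, then resolve a
# logical name back to a replica for staging into a workflow job.
catalog = {}

def register_data(logical_name, replica_urls, **metadata):
    # Place & register data on the Grid, then assign metadata to it.
    catalog[logical_name] = {"replicas": list(replica_urls), "meta": metadata}

def stage_in(logical_name):
    # Import data into a workflow: resolve the logical name to a replica.
    # (A real FTS-SC would then transfer it, e.g. via GridFTP.)
    return catalog[logical_name]["replicas"][0]

register_data("nano/input-007",
              ["gsiftp://fs1.example.org/data/input-007"],
              owner="vo-nano", mesh="fine")
url = stage_in("nano/input-007")
```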
Highlights of NAREGI Beta (May 2006, GGF17/GridWorld)

• Professionally developed and tested
• "Full" OGSA-EMS incarnation
– Full C-based WSRF engine (Java → Globus 4)
– OGSA-EMS/RSS WSRF components
– Full WS-Agreement brokering and co-allocation
– GGF JSDL 1.0-based job submission, authorization, etc.
– Support for more OSes (AIX, Solaris, etc.) and batch queues
• Sophisticated VO support for identity/security/monitoring/accounting (extensions of VOMS/MyProxy, WS-* adoption)
• WS application deployment support via GGF-ACS
• Comprehensive data management w/ grid-wide file system
• Complex workflows (NAREGI-WFML) for various coupled simulations
• Overall stability/speed/functional improvements
• To be interoperable with EGEE, TeraGrid, etc. (beta 2)
NAREGI Phase 1 Testbed (~3000 CPUs, ~17 Tflops)

• Center for Grid R&D (NII): ~5 Tflops
• Computational Nano-science Center (IMS): ~10 Tflops
• Osaka Univ. BioGrid; Titech Campus Grid; AIST SuperCluster
• Small test application clusters at ISSP, Kyoto Univ., Tohoku Univ., KEK, Kyushu Univ., and AIST
• Interconnected by Super-SINET (10Gbps MPLS)
• A dedicated testbed — no "ballooning" w/ production resources; production deployment to reach > 100 Teraflops
Computer System for Grid Software Infrastructure R&D — Center for Grid Research and Development (5 Tflops, 700GB):
• File server (PRIMEPOWER 900 + ETERNUS3000 + ETERNUS LT160): 1 node / 8 CPUs (SPARC64 V 1.3GHz); memory 16GB, storage 10TB, back-up max. 36.4TB
• SMP-type compute servers:
– PRIMEPOWER HPC2500: 1 node (UNIX, SPARC64 V 1.3GHz / 64 CPUs); memory 128GB, storage 441GB
– SGI Altix3700: 1 node (Itanium2 1.3GHz / 32 CPUs); memory 32GB, storage 180GB
– IBM pSeries690: 1 node (Power4 1.3GHz / 32 CPUs); memory 64GB, storage 480GB
• High-performance distributed-memory compute servers (PRIMERGY RX200, InfiniBand 4X, 8Gbps): 2 units of 128 CPUs (Xeon 3.06GHz) + control node; memory 130GB and 65GB, storage 9.4TB per unit
• Distributed-memory compute servers (HPC LinuxNetworx and Express 5800, GbE 1Gbps): 4 units of 128 CPUs (Xeon 2.8GHz) + control node; memory 65GB, storage 4.7TB per unit
• L3 switches at 1Gbps (upgradable to 10Gbps) linking the external and intra-A/intra-B networks; VPN, firewall, CA/RA server, and front-end server; connected to SuperSINET.

Computer System for Nano-Application R&D — Computational Nano-science Center (10 Tflops, 5TB):
• SMP-type computer server: 16 ways x 50 nodes (POWER4+ 1.7GHz), 5.4 / 5.0 TFLOPS, multi-stage crossbar network; memory 3072GB, storage 2.2TB
• Distributed-memory computer server (4 units): 818 CPUs (Xeon 3.06GHz) + control nodes, Myrinet2000 (2Gbps); memory 1.6TB, storage 1.1TB/unit
• File server: 16 CPUs (SPARC64 GP 675MHz); memory 8GB, storage 30TB, back-up 25TB
• Front-end server; L3 switch at 1Gbps (upgradable to 10Gbps); connected to SuperSINET.
University Computer Centers (excl. National Labs), circa Spring 2006 — ~60 SC centers in Japan, with 10Gbps SuperSINET interconnecting the centers:

• Hokkaido University, Information Initiative Center: HITACHI SR11000, 5.6 Teraflops
• Tohoku University, Information Synergy Center: NEC SX-7, NEC TX7/AzusA
• University of Tokyo, Information Technology Center: HITACHI SR8000, HITACHI SR11000, 6 Teraflops; others (in institutes)
• University of Tsukuba: FUJITSU VPP5000, CP-PACS 2048 (SR8000 proto)
• Nagoya University, Information Technology Center: FUJITSU PrimePower2500, 11 Teraflops
• Kyoto University, Academic Center for Computing and Media Studies: FUJITSU PrimePower2500, 10 Teraflops
• Osaka University, CyberMedia Center: NEC SX-5/128M8, HP Exemplar V2500/N, 1.2 Teraflops
• Kyushu University, Computing and Communications Center: FUJITSU VPP5000/64, IBM Power5 p595, 5 Teraflops
• Tokyo Inst. of Technology, Global Scientific Information and Computing Center: NEC SX-5/16, Origin2K/256, HP GS320/64; 80~100 Teraflops in 2006
• National Inst. of Informatics: SuperSINET/NAREGI testbed, 17 Teraflops; external grid connectivity
• A 10-Petaflop center is targeted by 2011.
NEC/Sun Campus Supercomputing Grid: Core Supercomputer Infrastructure @ Titech GSIC — operational late Spring 2006:

• SunFire (TBA): 655 nodes, 16 CPUs/node — 10480 CPUs / 50 TFlops (peak); memory 21.4TB
• ClearSpeed CSX600: initially 360 nodes at 96 GFlops/node, 35 TFlops (peak); by the Spring 2006 deployment, a planned extension to 655 nodes (1 CSX600 per SunFire node); max config ~4500 cards, ~900,000 execution units, 500 Teraflops possible, still 655 nodes, ~1MW
• Plus a small SX-8i; > 100 TeraFlops total (50 scalar + 60 SIMD-vector), approx. 140,000 execution units, 70 cabinets
• InfiniBand network: Voltaire ISR 9288 x 6, 1400Gbps unified & redundant interconnect; 24+24Gbps bidirectional to the file servers; 200+200Gbps bidirectional switch fabric (42 units); external 10Gbps
• Storage Server A: Sun Storage (TBA), all HDD, physical capacity 1PB, 40GB/s (trays of 48 x 500GB disks)
• Storage B: NEC iStorage S1800AT, physical capacity 96TB RAID6, all HDD, ultra-reliable
• Total storage: 1.1PB
NAREGI is/has/will…
• Is THE National Research Grid in Japan
– Part of CSI and future petascale initiatives
– METI "Business Grid" counterpart 2003-2005
• Has extensive commitment to WS/GGF-OGSA
– Entirely WS/service-oriented architecture
– Sets industry standards, e.g. 1st impl. of OGSA-EMS
• Will work with EU/US/AP counterparts to realize a "global research grid"
– Various talks have started, incl. the SC05 interoperability meeting
• Will deliver its first public beta in May 2006
– To be distributed @ GGF17/GridWorld in Tokyo
NAREGI is not/doesn't/won't…
• Is NOT an academic research project
– All professional developers from Fujitsu, NEC, Hitachi, NTT, …
– No students involved in development
• Will NOT develop all software by itself
– Will rely on external components in some cases
– Must be WS and other industry standards compliant
• Will NOT deploy its own production Grid
– Although there is a 3000-CPU testbed
– Works with the National Centers for CSI deployment
• Will NOT hinder industry adoption at all costs
– Intricate open-source copyright and IP policies
– We want people to save/make money using NAREGI
Summary: NAREGI Middleware Objectives

The different work package deliverables implement OGSA Grid services (allocation, scheduling, etc.) that combine to provide the following:
• Allow users to execute complex jobs with various interactions on resources across multiple sites on the Grid — e.g., nano-science multi-physics coupled simulations with execution components & data spread across multiple groups within the nano-VO
• A stable set of middleware allowing scalable and sustainable operations of centers as resource and VO hosting providers
• Widely adopt and contribute to grid standards (GGF/OGSA), and provide open-source reference implementations
⇒ A sustainable research grid infrastructure.
The Ideal World: Ubiquitous VO & User Management for International e-Science

[Diagram: regional grid infrastructural efforts — Europe (EGEE, UK e-Science, …), the US (TeraGrid, OSG), Japan (NII CyberScience w/ NAREGI), and other Asian efforts (GFK, China Grid, etc.) — linked by collaborative talks on PMA, etc., supporting cross-cutting VOs such as an HEP Grid VO, a NEES-ED Grid VO, and an Astro IVO. Standardization and commonality in software platforms will realize this: different software stacks, but interoperable.]
Backup
Summary: NAREGI Middleware

Users describe:
• Component availability within the VO
• The workflow being executed, as an abstract requirement over sub-jobs

NAREGI M/W — a general-purpose research grid e-infrastructure:
• Discovers appropriate grid resources
• Allocates resources and executes jobs on virtualized environments (concretization of the abstract requirement)
• Serves as a reference implementation of OGSA-EMS and others
• Programming models: GridMPI & GridRPC
• User interfaces: deployment, visualization, workflow tool (WFT)

It will be evaluated on the NAREGI testbed: ~15TF, 3000 CPUs, over the SuperSINET 10Gbps national backbone.

• Users can execute complex jobs across resource sites, with multiple components and data from groups within a VO — such as coupled simulations in nano-science.
• NAREGI M/W allocates the jobs to grid resources appropriately.
The NAREGI WP6 RISM-FMO-Mediator Coupled Application

[Diagram: RISM (Reference Interaction Site Model, solvent distribution analysis) runs at Site A under MPICH-G2/Globus, and FMO (Fragment Molecular Orbital method, electronic structure analysis) runs at Site B, coupled over GridMPI through the "Mediator", which performs data transformation between the different meshes to obtain the electronic structure in solution.]
Run a Job: Nano-Application

■ Job: Grid-enabled FMO (Fragment Molecular Orbital method) — a method for calculating the electron density and energy of a large molecule by dividing it into small fragments, originally developed by Dr. Kitaura, AIST. A workflow-based FMO was developed by NAREGI: fragment DB search → monomer calculations (one per fragment) → dimer calculations (one per pair of monomers) → total energy → visualization. An example of high-throughput computing.
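The FMO fan-out described above — one independent job per fragment (monomer) and one per fragment pair (dimer) — is what makes it a natural high-throughput workload. A small sketch of the task generation (names illustrative):

```python
# Generate the FMO-style task list: one monomer job per fragment plus
# one dimer job per unordered fragment pair.
from itertools import combinations

def fmo_tasks(n_fragments):
    monomers = [("monomer", i) for i in range(n_fragments)]
    dimers = [("dimer", pair)
              for pair in combinations(range(n_fragments), 2)]
    return monomers + dimers

tasks = fmo_tasks(5)   # 5 monomer jobs + C(5,2) = 10 dimer jobs
```

Since the dimer count grows as n(n-1)/2, the workflow engine's ability to farm out many independent sub-jobs dominates the total throughput.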
9: Grid Accounting

The information service collects and maintains job usage records in the form of the GGF proposed standards (GGF RUS, UR). RUS: Resource Usage Service; UR: Usage Record.

[Diagram: on each computing node, the GridVM engine measures process usage; the GridVM scheduler at each cluster/site reports usage as UR-XML to the "cell domain" information service, which upper-layer information service nodes query via RUS.]

Users can search and summarize their URs by:
• Global user ID
• Global job ID
• Global resource ID
• Time range, …
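The RUS-style search-and-summarize capability above can be sketched as a filter over stored usage records; the record field names here are illustrative simplifications of the UR properties, not the actual schema:

```python
# Filter and summarize usage records by global user ID and time range,
# as an RUS query would.
records = [
    {"global_user": "alice", "global_job": "j1", "cpu_s": 300, "end": 10},
    {"global_user": "alice", "global_job": "j2", "cpu_s": 200, "end": 25},
    {"global_user": "bob",   "global_job": "j3", "cpu_s": 900, "end": 12},
]

def summarize(user, t_start, t_end):
    # Select this user's records whose end time falls in [t_start, t_end),
    # then aggregate job count and total CPU seconds.
    mine = [r for r in records
            if r["global_user"] == user and t_start <= r["end"] < t_end]
    return {"jobs": len(mine), "cpu_s": sum(r["cpu_s"] for r in mine)}

report = summarize("alice", 0, 20)
```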