HPC Open Forum for Researchers
Overview
• Received $1.8 million grant to expand Cardinal Research Cluster (CRC) and research computing infrastructure
• Identified weak links in CRC
• Identified needs for new hardware based on current usage and requests
• Developed recommendations
Cardinal Research Cluster - CRC
[Diagram: CRC architecture]
High Performance Computing Cluster: 304 nodes / 2,432 cores, 16 or 32 GB per node, 4x DDR InfiniBand interconnect
SMP IBM p570: 16 CPUs
Global storage: 100+ TB
Login nodes: crc.hpc.louisville.edu
Informatics server/storage, visualization server, statistical server, Visualization Cluster (CECS)
1/10 Gbps Ethernet to campus network
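For orientation, the per-node figures implied by the diagram's headline numbers (a quick derivation, not something stated on the slide):

```python
# Per-node figures implied by the CRC diagram:
# 304 nodes, 2,432 cores, 16 or 32 GB of memory per node.
nodes = 304
cores = 2432

cores_per_node = cores / nodes
print(f"cores per node: {cores_per_node:.0f}")  # 8

for mem_gb in (16, 32):
    # 16 GB nodes give 2 GB per core; 32 GB nodes give 4 GB per core.
    print(f"{mem_gb} GB node -> {mem_gb / cores_per_node:.0f} GB per core")
```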
CRC Limitations
• Network limitations
Network switch has no free ports; no room for expansion
Limited capacity to campus backbone and Internet2
• Storage limitations
Scratch space is already becoming full
Slow/unreliable performance of GPFS storage
Lack of good archiving system
• Single points of failure
No redundancy in storage servers; all must be online to function
No backup hardware for management, queue, and user nodes
Usage Trends
• Lots of serial or single-node jobs, very few massively parallel jobs (see the sketch after this list)
Bioinformatics jobs
Molecular dynamics jobs
Some Gaussian jobs are single-node; none should need more than ~4 nodes
• Current massively parallel jobs are well served by the existing InfiniBand nodes
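To make the serial-versus-parallel distinction concrete, here is a minimal, purely illustrative Python sketch (not from the original slides; the task function is a hypothetical stand-in): high-throughput serial work means many independent single-core tasks with no communication between them, as opposed to one tightly coupled job spread across many nodes.

```python
from concurrent.futures import ProcessPoolExecutor

def analyze(task_id):
    """Hypothetical stand-in for one independent job, e.g. one
    bioinformatics input file; it uses a single core and never
    communicates with the other tasks."""
    return task_id, sum(i * i for i in range(100_000 + task_id))

if __name__ == "__main__":
    # Throughput comes from running many independent tasks side by side,
    # not from splitting one computation across nodes.
    with ProcessPoolExecutor(max_workers=8) as pool:
        for task_id, result in pool.map(analyze, range(100)):
            print(task_id, result)
```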
Researcher Requests
• Expand storage capacity
• Provide ability to have larger quotas
• Provide data archiving and management
• Expand visualization servers
• Provide ability to quickly add applications servers
Other Considerations
• Need for a separate statistical server to free the shared-memory p570 system to focus on computation
• Need to implement second phase of Oracle RAC redundancy (extended RAC)
• Need for general purpose applications servers that can be allocated for dedicated research applications
• Need for local scratch disks on compute nodes
• Need for facilities upgrades (cooling and power)
Recommendation - Networking
• Remark: CRC network switch cannot be expanded and is a single point of failure
• Recommendation: Redesign networking for expansion of research computing infrastructure and improved connectivity
Add new core switch for shared resources including storage, user nodes, p570, viz, and stats server
Add switch for expansion of compute nodes and servers on the CRC
Expand connectivity to campus backbone network and Internet2
CRC – Network Redesign
Recommendation - Storage
• Remark: Address storage space expansion and performance issues
• Recommendation: Add storage space
Increase number of storage servers
Increase allocation of scratch space
Review quota structure with governance committee
Develop archiving systems
Continue to address GPFS tuning concerns
Recommendation - Computation
• Remark: Lots of serial or single-node jobs, very few massively parallel jobs
• Recommendation: Implement new cluster optimized for high-throughput serial processing
Utilize blade centers to provide a low-cost way to maximize the number of compute nodes
14 nodes per blade center (168 cores per blade center) allows most jobs to run within a single blade center, with a high-speed network among its nodes (see the arithmetic check below)
The network between blade centers is slower, so communication across blade centers is less optimal than within a single blade center
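A quick check of the blade-center arithmetic above (the per-node core count is implied by the slide's figures rather than stated):

```python
# Figures quoted on this slide.
nodes_per_blade_center = 14
cores_per_blade_center = 168

print(cores_per_blade_center / nodes_per_blade_center)  # 12.0 cores per node
# A job that fits in 14 nodes / 168 cores can therefore stay inside one
# blade center and keep all of its communication on the fast internal network.
```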
Recommendation – Computation
• Remark: Address requested and required capabilities
• Recommendations:
Add dedicated statistical server
Implement extended Oracle RAC
Add rack of general-purpose servers
Add visualization systems
Expand local scratch disk on compute nodes
Provide backup server(s) for queue and management nodes
Datacenter Requirements
• Proposed project to upgrade cooling & electrical in darkroom
Submitted ARI-R2 grant application - stimulus funding for renovation or expansion of a research facility
• $400,000 for datacenter renovation
• $450,000 for network expansion
Decision expected by January 2010
Software Needs
• First round of software acquired
• $85,000 committed to ongoing support
• $65,000 available for additional acquisitions
• Need to define needs and priorities for this year
Summary of Recommendations
• Redesign cluster network around core switch
• Expand storage and address performance issues
• Add compute cluster optimized for serial jobs
• Provide additional statistical, visualization, and general purpose application servers
• Upgrade datacenter facilities to accommodate cluster upgrades
CRC - before
[Diagram: the current CRC, as shown on the earlier Cardinal Research Cluster slide: 304-node/2,432-core HPC cluster (16 or 32 GB per node) on 4x DDR InfiniBand, SMP IBM p570 (16 CPUs), 100+ TB global storage, login nodes (crc.hpc.louisville.edu), informatics server/storage, visualization server, statistical server, Visualization Cluster (CECS), 1/10 Gbps Ethernet to campus network]
CRC - after
[Diagram: the redesigned CRC, adding a core switch plus CRC-1 and CRC-2 switches, a serial/small-job cluster, application servers, and visualization servers alongside the existing HPC cluster (304 nodes/2,432 cores, 4x DDR InfiniBand), SMP IBM p570 (16 CPUs), statistical server, informatics server/storage, 100+ TB global storage, login nodes (crc.hpc.louisville.edu), Visualization Cluster (CECS), and 1/10 Gbps Ethernet to campus network]
Comments and Questions