HPC Open Forum for Researchers


Page 1: HPC Open Forum for Researchers

HPC Open Forum for Researchers

Page 2: HPC Open Forum for Researchers


Overview

• Received $1.8 million grant to expand Cardinal Research Cluster (CRC) and research computing infrastructure

• Identified weak links in CRC

• Identified needs for new hardware based on current usage and requests

• Developed recommendations

Page 3: HPC Open Forum for Researchers


[Diagram: Cardinal Research Cluster (CRC). Labeled components: High Performance Computing Cluster (304 nodes / 2,432 cores, 16 or 32 GB per node), 4x DDR InfiniBand, 1/10 Gbps Ethernet, informatics server/storage, visualization server, global storage (100+ TB), login nodes (crc.hpc.louisville.edu), SMP IBM p570 (16 CPUs), statistical server, Visualization Cluster (CECS), and campus network.]

Page 4: HPC Open Forum for Researchers


CRC Limitations

• Network limitations

Network switch has no free ports, leaving no room for expansion

Limited capacity to campus backbone and Internet2

• Storage limitations

Scratch space is already becoming full

Slow/unreliable performance of GPFS storage

Lack of a good archiving system

• Single points of failure

No redundancy in storage servers; all must be online to function

No backup hardware for management, queue, and user nodes

Page 5: HPC Open Forum for Researchers


Usage Trends

• Lots of serial or single-node jobs, very few massively parallel jobs (a typical single-node submission is sketched after this list)

Bioinformatics jobs

Molecular dynamics jobs

Some Gaussian jobs are single-node; none should need more than ~4 nodes

• Current massively parallel jobs are well-served by existing InfiniBand nodes
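The slides do not name the CRC's batch scheduler, so the following is only a minimal sketch of what a single-node ("serial") submission looks like, assuming a PBS/TORQUE-style scheduler with qsub available on the login nodes; the job name, resource limits, and application command are hypothetical placeholders.

```python
# Minimal sketch of submitting a single-node ("serial") job from a login node,
# assuming a PBS/TORQUE-style scheduler with qsub. Job name, walltime, and the
# application command are hypothetical placeholders.
import subprocess
import tempfile

JOB_SCRIPT = """#!/bin/bash
# One node and all 8 of its cores (2,432 cores / 304 nodes on the current CRC).
#PBS -N serial_example
#PBS -l nodes=1:ppn=8
#PBS -l walltime=04:00:00
cd $PBS_O_WORKDIR
./my_serial_app input.dat
"""

def submit(script_text):
    """Write the job script to a temporary file and hand it to qsub."""
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as fh:
        fh.write(script_text)
        path = fh.name
    # On success, qsub prints the ID of the newly queued job.
    result = subprocess.run(["qsub", path], capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print("Submitted job:", submit(JOB_SCRIPT))
```

Jobs of this shape never cross the node interconnect, which is why a high-throughput serial cluster (recommended later in the deck) serves them better than the InfiniBand nodes.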

Page 6: HPC Open Forum for Researchers


Researcher Requests

• Expand storage capacity

• Provide ability to have larger quotas

• Provide data archiving and management

• Expand visualization servers

• Provide ability to quickly add application servers

Page 7: HPC Open Forum for Researchers


Other Considerations

• Need for a separate statistical server to free the shared-memory p570 system for computational work

• Need to implement second phase of Oracle RAC redundancy (extended RAC)

• Need for general-purpose application servers that can be allocated for dedicated research applications

• Need for local scratch disks on compute nodes (see the staging sketch after this list)

• Need for facilities upgrades (cooling and power)
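To make the local-scratch point concrete, here is a minimal sketch of the staging pattern a job could use once compute nodes have local scratch disks: copy input from global (GPFS) storage to the node-local disk, compute there, and copy results back. The paths, the TMPDIR convention, and the application name are assumptions for illustration, not CRC-specific settings.

```python
# Minimal sketch of the usual local-scratch staging pattern: stage input in,
# compute against local disk, stage results back to global (GPFS) storage.
# All paths and the TMPDIR convention are assumptions, not CRC settings.
import os
import shutil
import subprocess

GLOBAL_WORKDIR = "/gpfs/home/researcher/project"   # hypothetical global path
LOCAL_SCRATCH = os.environ.get("TMPDIR", "/tmp")    # node-local scratch, if provided

def run_with_local_scratch():
    workdir = os.path.join(LOCAL_SCRATCH, "job_scratch")
    os.makedirs(workdir, exist_ok=True)

    # Stage input from global storage onto the node-local disk.
    shutil.copy(os.path.join(GLOBAL_WORKDIR, "input.dat"), workdir)

    # Run the application against local files instead of hammering GPFS.
    subprocess.run([os.path.join(GLOBAL_WORKDIR, "my_app"), "input.dat"],
                   cwd=workdir, check=True)

    # Stage results back to global storage, then clean up local scratch.
    shutil.copy(os.path.join(workdir, "output.dat"), GLOBAL_WORKDIR)
    shutil.rmtree(workdir)

if __name__ == "__main__":
    run_with_local_scratch()
```

Keeping the heavy I/O on local disks also eases the GPFS performance problems noted on the limitations slide.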

Page 8: HPC Open Forum for Researchers


Recommendation - Networking

• Remark: CRC network switch cannot be expanded and is a single point of failure

• Recommendation: Redesign networking for expansion of research computing infrastructure and improved connectivity

Add new core switch for shared resources including storage, user nodes, p570, viz, and stats server

Add switch for expansion of compute nodes and servers on the CRC

Expand connectivity to campus backbone network and Internet2

Page 9: HPC Open Forum for Researchers


CRC – Network Redesign

Page 10: HPC Open Forum for Researchers


Recommendation - Storage

• Remark: Address storage space expansion and performance issues

• Recommendation: Add storage space

Increase number of storage servers

Increase allocation of scratch space

Review quota structure with governance committee

Develop archiving systems (a sketch of one possible archiving policy follows this list)

Continue to address GPFS tuning concerns
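As one concrete illustration of what an archiving system could automate, the sketch below (standard-library Python only) bundles scratch files that have not been accessed for 90 days into a dated tarball in an archive area. The paths, the 90-day threshold, and the choice to leave originals in place pending verification are assumptions, not the CRC's actual policy.

```python
# Sketch of one archiving policy: tar up scratch files untouched for 90 days
# into a dated archive. Paths and the idle threshold are illustrative only.
import os
import tarfile
import time

SCRATCH_DIR = "/gpfs/scratch/researcher"   # hypothetical scratch path
ARCHIVE_DIR = "/archive/researcher"        # hypothetical archive path
MAX_IDLE_DAYS = 90

def stale_files(root, max_idle_days):
    """Yield files whose last access time is older than the threshold."""
    cutoff = time.time() - max_idle_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getatime(path) < cutoff:
                yield path

def archive_stale(scratch, archive_dir):
    os.makedirs(archive_dir, exist_ok=True)
    archive_path = os.path.join(archive_dir,
                                f"scratch-{time.strftime('%Y%m%d')}.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in stale_files(scratch, MAX_IDLE_DAYS):
            tar.add(path)   # originals could be removed after verification
    return archive_path

if __name__ == "__main__":
    print("Wrote archive:", archive_stale(SCRATCH_DIR, ARCHIVE_DIR))
```

A policy like this would relieve pressure on the scratch space that is already becoming full, while the governance committee sets the actual thresholds.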

Page 11: HPC Open Forum for Researchers


Recommendation - Computation

• Remark: Lots of serial or single-node jobs, very few massively parallel jobs

• Recommendation: Implement new cluster optimized for high-throughput serial processing

Utilize blade centers to provide a low-cost way to maximize the number of compute nodes

14 nodes per blade center (168 cores per blade center) lets most jobs run within a single blade center, whose nodes share a high-speed network (see the capacity sketch after this list)

The network between blade centers offers slower communication than the network within a blade center
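A quick worked example of the capacity arithmetic behind this recommendation: 168 cores across 14 nodes implies 12 cores per node, and even the largest jobs mentioned on the usage slide (roughly 4-node Gaussian runs) fit comfortably inside one blade center. The job sizes below are illustrative, not measured CRC workloads.

```python
# Capacity arithmetic implied by the slide: 14 nodes and 168 cores per blade
# center gives 12 cores per node. Job sizes below are illustrative only.
NODES_PER_BLADE_CENTER = 14
CORES_PER_BLADE_CENTER = 168
CORES_PER_NODE = CORES_PER_BLADE_CENTER // NODES_PER_BLADE_CENTER   # 12

print(f"Cores per node: {CORES_PER_NODE}")
print(f"Serial (1-core) jobs per blade center: {CORES_PER_BLADE_CENTER}")
print(f"Single-node jobs per blade center: {NODES_PER_BLADE_CENTER}")
print(f"4-node jobs per blade center: {NODES_PER_BLADE_CENTER // 4}")
```

Because whole jobs stay inside one blade center, the slower network between blade centers rarely matters for this workload.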

Page 12: HPC Open Forum for Researchers


Recommendation – Computation

• Remark: Address requested and required capabilities

• Recommendations:

Add dedicated statistical server

Implement extended Oracle RAC

Add rack of general-purpose servers

Add visualization systems

Expand local scratch disk on compute nodes

Provide backup server(s) for queue and management nodes

Page 13: HPC Open Forum for Researchers


Datacenter Requirements

• Proposed project to upgrade cooling & electrical in darkroom

Submitted ARI-R2 grant application - stimulus funding for renovation or expansion of a research facility

• $400,000 for datacenter renovation

• $450,000 for network expansion

Decision expected by January 2010

Page 14: HPC Open Forum for Researchers


Software Needs

• First round of software acquired

• $85,000 committed to ongoing support

• $65,000 available for additional acquisitions

• Need to define needs and priorities for this year

Page 15: HPC Open Forum for Researchers


Summary of Recommendations

• Redesign cluster network around core switch

• Expand storage and address performance issues

• Add compute cluster optimized for serial jobs

• Provide additional statistical, visualization, and general purpose application servers

• Upgrade datacenter facilities to accommodate cluster upgrades

Page 16: HPC Open Forum for Researchers


[Diagram: CRC - before. Same configuration as the CRC diagram on page 3: HPC cluster (304 nodes / 2,432 cores, 16 or 32 GB per node), 4x DDR InfiniBand, 1/10 Gbps Ethernet, informatics server/storage, visualization server, global storage (100+ TB), login nodes (crc.hpc.louisville.edu), SMP IBM p570 (16 CPUs), statistical server, Visualization Cluster (CECS), and campus network.]

Page 17: HPC Open Forum for Researchers


[Diagram: CRC - after. The existing components (HPC cluster of 304 nodes / 2,432 cores with 16 or 32 GB per node, 4x DDR InfiniBand, informatics server/storage, global storage of 100+ TB, login nodes at crc.hpc.louisville.edu, SMP IBM p570 with 16 CPUs, Visualization Cluster (CECS), 1/10 Gbps Ethernet, campus network) are joined by a serial/small-job cluster, visualization servers, a statistical server, application servers, and a network redesigned around a core switch with CRC-1 and CRC-2 switches.]

Page 18: HPC Open Forum for Researchers


Comments and Questions