University of California, San Diego
SAN DIEGO SUPERCOMPUTER CENTER

San Diego Supercomputer Center: Best practices, policies

Giri Chukkapalli
Supercomputer best practices symposium
May 11, 05

Center’s Mission

• Computational science vs computer science research
• Computational science
• Supporting single code
• Supporting single field
• Supporting broad spectrum of fields
• Target existing users or grow new users
• Capacity vs capability computing
• Can’t be everything to everybody
• Mission statement and policy document

User awareness

• Publicize the center's existing and upcoming compute and data capabilities to the target user community
• This will enable the user community to plan the type of problems they want to solve and develop appropriate codes to take advantage of the resources
• Otherwise, only the people who happen to know about the resources will make use of them

More than just a large supercomputer

• To support a broad computational science research community
• Peripheral hardware, software and personnel with wide range of expertise are necessary
• A sizable shared memory machine to do pre and post processing
• Large compute farm to run embarrassingly parallel jobs
• Viz. engines
• SAN

Computing: One Size Doesn’t Fit All

[Figure: applications placed by data capability (vertical axis, increasing I/O and storage) against compute capability (horizontal axis, increasing FLOPS).]

• Regions: SDSC Data Science Env; Traditional HEC Env; Campus, Departmental and Desktop Computing
• Application labels: QCD, Protein Folding, CPMD, NVO, EOL, CIPRes, SCEC Visualization, ENZO Visualization, CFD, Climate, SCEC Simulation, ENZO simulation
• Other labels: Data Storage/Preservation; Extreme I/O (1. 3D + time simulation, 2. Out-of-Core)
• Annotations: Can’t be done on Grid (I/O exceeds WAN); Distributed I/O Capable

Data Movement

• Into and out of the center
• SAN File system
• SAN to/from compute platform’s parallel file system
• Movement of data between compute, viz. and pre/post processing engines
• Automatic migration of data to/from archive
• Bottleneck free data flow

Pushing the Data-Intensive Envelope

[Figure: data path from the computer through memory, parallel file system, data parking, and archival tape system.]

                      Today’s leading-edge   Tomorrow’s demands
Compute               15 TF                  100 TF
Memory                4 TB                   10 TB
Parallel file system  60 TB                  3 PB
Data parking          100 TB                 10 PB
Archival tape system  10 PB                  100 PB

Transfer rates labeled on the figure: today 2 TB/s, 1 GB/s, 100 MB/s, 1 GB/s; tomorrow 10 TB/s, 100 GB/s, 100 GB/s, 10 GB/s.

Various file systems

• Small backed up /home file system
• Periodically purged fast parallel file system
• Parking file system
• SAN file system with auto-migration to archive
• Possibly non-backed non-purged intermediate size file system
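
A periodic purge of the fast parallel file system could be driven by a scan like the one below. This is only a sketch: the 7-day window, the use of modification time (a site might prefer access time), and the function name `purge_candidates` are assumptions for illustration, not SDSC's actual policy or tooling.

```python
import os
import tempfile
import time


def purge_candidates(root, max_age_days, now=None):
    """Yield files under `root` whose last modification is older than `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            # lstat avoids following (possibly broken) symbolic links.
            if os.lstat(path).st_mtime < cutoff:
                yield path


# Demonstration in a throwaway directory rather than a real scratch file system.
with tempfile.TemporaryDirectory() as scratch:
    old_file = os.path.join(scratch, "old.dat")
    new_file = os.path.join(scratch, "new.dat")
    for path in (old_file, new_file):
        with open(path, "w") as handle:
            handle.write("data")
    # Backdate one file so it falls outside the hypothetical 7-day window.
    ten_days_ago = time.time() - 10 * 86400
    os.utime(old_file, (ten_days_ago, ten_days_ago))
    stale = [os.path.basename(p) for p in purge_candidates(scratch, max_age_days=7)]
    print(stale)
```

The scan only reports candidates; actually deleting them, and warning users beforehand, is left out on purpose.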

Cyber Infrastructure

[Figure: layered cyberinfrastructure diagram.]

• Domain specific: Life Sciences (Bioinformatics), Engineering (Automotive/Aircraft), Environmental (Climate/Weather), Astrophysics, etc.
• Problem Solving Environments: portals, UIs, web services
• Grid middleware: bridge software, schedulers etc.
• Tools, libraries
• GLOBUS LAYER
• Resource specific: Operating Systems, Compilers, Oracle, TOMCAT, A/D
• NETWORK / DATA TRANSPORT LAYER
• Hardware: Vector/SMP, MPPs, loosely coupled clusters, workstations, data engines, servers, web server, sensors, instruments
• Side labels: Hardware, Complex Systems, Domain Specific, Resource Specific, Cyber Infrastructure

SDSC DataStar

[Figure: DataStar configuration.]

• 187 total nodes: 11 p690, 176 p655
• Other figure labels: 1.7, 1.5, (5), (171), (7)

SANergy Data Movement

[Figure: SANergy data movement diagram.]

• Components: Orion, TeraGrid Network, SAM-QFS DISK, p690, Federation Switch, SAN Switch Infrastructure, SANergy MDC, SANergy client
• Link labels: 2Gb, 1Gb x 4, 1Gb x 4, 2Gb x 4
• Paths: Metadata operations, NFS; Data operations

[Figure: storage and SAN diagram, with the following labels.]

• ~400 Sun FC Disk Arrays (~4100 disks, 540 TB total)
• 32 FC Tape Drives
• Sun Fire 15K
• DataStar 176 P655s
• DataStar 11 P690s
• SAM-QFS, ETF DB, SAN-GPFS
• 5 x Brocade 12000 (1408 2Gb ports)
• SANergy Client, SANergy Server
• Force 10 - 12000
• HPSS

Compute platform: setup

• Small identical Test system
• Perform all the upgrades on test system first
• Shared interactive pool
• Batch pool
• Setting up common environment
• Copydefaults
• Softenv
• Setting up of third party tools, libraries, helper apps, community codes

Compute platform: setup

• Providing example code, scripts, configures
• /usr/local/apps/examples
• Providing user interface to allocation management

Compute platform: Allocations

• Compute and data allocations
• Understanding space-time resolution relationships
• Peer (rotating body) review process
• Online system
• I am currently part of NSF review committee
• Can provide more info if needed

Criteria for machine access

• Preliminary access for porting, benchmarking and optimizing user’s code
• Single CPU performance criteria (15%?)
• Scaling criteria (half the machine with 90%)
• If not met provide help, consulting
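
One way to read the two thresholds above is as a fraction of peak for a single CPU and as parallel efficiency at scale. The sketch below checks a code under that reading. The interpretation itself, the function names, and the sample numbers are all illustrative assumptions, not SDSC's actual procedure.

```python
def fraction_of_peak(achieved_gflops, peak_gflops):
    """Achieved single-CPU rate as a fraction of the theoretical peak."""
    return achieved_gflops / peak_gflops


def parallel_efficiency(base_procs, base_time, procs, time):
    """Strong-scaling efficiency relative to a baseline run.

    Ideal scaling gives 1.0: using k times more processors
    should cut the wall-clock time by a factor of k.
    """
    speedup = base_time / time
    ideal_speedup = procs / base_procs
    return speedup / ideal_speedup


def meets_access_criteria(single_cpu_fraction, efficiency,
                          min_fraction=0.15, min_efficiency=0.90):
    """Apply both thresholds (the 15% and 90% figures come from the slide)."""
    return single_cpu_fraction >= min_fraction and efficiency >= min_efficiency


# Hypothetical measurements: 1.1 GFLOPS achieved on a 6.0 GFLOPS-peak CPU,
# and a run that takes 1000 s on 64 processors and 70 s on 1024.
frac = fraction_of_peak(1.1, 6.0)
eff = parallel_efficiency(64, 1000.0, 1024, 70.0)
print(f"fraction of peak: {frac:.2f}, efficiency: {eff:.2f}")
print("meets criteria:", meets_access_criteria(frac, eff))
```

With these made-up numbers the code passes the single-CPU threshold but falls just short on scaling, which is the case where the slide suggests offering help and consulting.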

Compute platform: scheduling

• Higher priority to large PE jobs
• Allowing longer times to larger PE jobs
• Weighting based on allocation size
• Good API for users to probe and interact with the scheduler
• Prologue and epilogue scripts to bring the system to clean state
• Express, high, low and back fill queues
• Optimizing for maximum throughput vs quick turn around
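
The first three bullets and the queue list can be combined into a single priority score, as in the toy sketch below. The weights, queue boosts, and `Job` fields are hypothetical; a production scheduler would also account for wait time, wall-clock limits, and backfill windows.

```python
from dataclasses import dataclass

TOTAL_NODES = 176  # p655 node count quoted elsewhere in these slides

# Hypothetical queue boosts; the slide names the queues but not their weights.
QUEUE_BOOST = {"express": 3.0, "high": 2.0, "low": 1.0, "backfill": 0.0}


@dataclass
class Job:
    name: str
    nodes: int
    queue: str
    allocation_sus: float  # size of the project's allocation


def priority(job, max_allocation_sus, size_weight=2.0, alloc_weight=1.0):
    """Higher score runs first: favors large-PE jobs and large allocations."""
    size_term = size_weight * job.nodes / TOTAL_NODES
    alloc_term = alloc_weight * job.allocation_sus / max_allocation_sus
    return QUEUE_BOOST[job.queue] + size_term + alloc_term


def order_jobs(jobs):
    """Return jobs sorted from highest to lowest priority."""
    max_alloc = max(job.allocation_sus for job in jobs)
    return sorted(jobs, key=lambda j: priority(j, max_alloc), reverse=True)


jobs = [
    Job("small", nodes=4, queue="high", allocation_sus=50_000),
    Job("large", nodes=128, queue="high", allocation_sus=500_000),
    Job("filler", nodes=1, queue="backfill", allocation_sus=10_000),
]
for job in order_jobs(jobs):
    print(job.name)
```

Within the same queue the 128-node job outranks the 4-node job, and the backfill job comes last.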

Regression tests

• Well designed set of benchmarks and regression tests to monitor system correctness and performance
• Preventive maintenance
• Compiler/OS upgrades
• Provide access to login/interactive nodes during PM
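
A performance regression check after preventive maintenance or a compiler/OS upgrade can be as simple as comparing benchmark timings against a stored baseline. The sketch below assumes timings are kept as name-to-seconds dictionaries and uses a 5% tolerance. The benchmark names and timings are made up for illustration.

```python
def find_regressions(baseline, current, tolerance=0.05):
    """Return benchmarks that are missing or slower than baseline by more than `tolerance`.

    `baseline` and `current` map benchmark name -> wall-clock seconds.
    """
    regressions = {}
    for name, base_time in baseline.items():
        if name not in current:
            regressions[name] = "missing result"
            continue
        slowdown = current[name] / base_time - 1.0
        if slowdown > tolerance:
            regressions[name] = f"{slowdown:.1%} slower"
    return regressions


# Hypothetical timings recorded before and after a compiler/OS upgrade.
baseline = {"stream": 12.0, "fft": 30.0, "halo_exchange": 8.0}
after_upgrade = {"stream": 12.2, "fft": 34.5}
print(find_regressions(baseline, after_upgrade))
```

A missing result is reported as a regression as well, since a benchmark that no longer runs is a correctness problem rather than a performance one.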

Compute platform: life cycle

• Friendly user phase: a few expert users who can cope with instabilities
• Production phase: criteria for a machine to be in production are uptime, documentation, accounting, and stability
• Terminal phase: begins when the next system goes to production; 2 or 3 users who can use the whole machine

Communicating to Users

• User guide, FAQ
• Periodic articles on tools usage, example apps
• Yearly week long training
• Email, motd alerts

Consulting

• Ticketing system, phone consulting
• Quick analysis and optimization help
• TOPs (targeted optimization and porting) program
• Extended collaboration
• Strategic Applications Collaboration (SAC)
• Modern tools like IM

Listening to users

• Periodic well designed surveys
• User advisory committee
• Local internal users
• Listening while consulting
• Application space is moving from monolithic single component analysis codes to multi-scale multi-physics systems simulation codes

Usage Analysis

• To see how well we are following the policies set

DS p655 Usage by node count (4/1/04-5/1/05)

Nodes      Share
1          6%
2-3        4%
4          6%
5-7        2%
8          15%
9-15       5%
16         9%
17-31      15%
32         9%
33-63      8%
64         10%
65-123     4%
128        6%
129-176    1%

There have been recent increases in the # of 128-node jobs.
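
To relate this breakdown to the scheduling policy of favoring large jobs, the shares can be summed above a chosen node count. The snippet below copies the percentages from the chart. The 64-node cut-off is an arbitrary choice for illustration.

```python
# Shares of DataStar p655 usage by node count, copied from the chart above.
# Keys are (lowest, highest) node counts for each bucket; values are percentages.
usage_share = {
    (1, 1): 6, (2, 3): 4, (4, 4): 6, (5, 7): 2, (8, 8): 15,
    (9, 15): 5, (16, 16): 9, (17, 31): 15, (32, 32): 9, (33, 63): 8,
    (64, 64): 10, (65, 123): 4, (128, 128): 6, (129, 176): 1,
}


def share_at_or_above(min_nodes):
    """Total percentage of usage from buckets starting at `min_nodes` or more."""
    return sum(pct for (low, _high), pct in usage_share.items() if low >= min_nodes)


print(share_at_or_above(1))    # all buckets
print(share_at_or_above(64))   # jobs on 64 or more nodes
```

The buckets sum to 100%, and jobs on 64 or more nodes account for 21% of usage over this period.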

SDSC User Snapshot: 2004

• 286 active projects
• 90 institutions
• 7 million SUs consumed on DataStar
• PIs funded by NSF, NIH, DOE, NASA, DOD, DARPA, AFOSR, ONR

[Figure: Time Awarded, by Discipline]

PIs by Discipline

Time Awarded, by Discipline

Users Span the Nation

[Figure: States with SDSC-Allocated PIs]

SDSC Compute Resources

• DataStar: 1,628 Power4+ processors; IBM p655 and p690 nodes; 4 TB total memory; up to 2 GBps I/O to disk
• TeraGrid Cluster: 512 Itanium2 IA-64 processors; 1 TB total memory
• Intimidata: 2,048 PowerPC processors; 128 I/O nodes; half a petabyte of GPFS

[Figure: Intimidata Installation]

SDSC Data Resources

• 1 PB Storage-area Network (SAN)
• 6 PB StorageTek tape library
• DB2, Oracle, MySQL
• Storage Resource Broker
• HPSS
• 72-CPU Sun Fire 15K
• 96-CPU IBM p690s

SDSC Top 10 Users (SUs consumed in 2004)

• Marvin Cohen, UC Berkeley: DataStar, 846,397 SUs
• Michael Norman, UC San Diego: DataStar, 551,969
• Juri Toomre, U Colorado: DataStar, 361,633
• Richard Klein, UC Berkeley: DataStar, 315,240
• J Andrew McCammon, UCSD: DataStar, 310,909
• Klaus Schulten, UIUC: TeraGrid Cluster, 287,188
• George Karniadakis, Brown U: DataStar, 284,430
• Richard Klein, UC Berkeley: DataStar, 279,766
• Pui-Kuen Yeung, Ga Tech: DataStar, 220,172
• Parviz Moin, Stanford U: DataStar, 188,391

SAC: ENZO (Robert Harkness)

• “Reconstructing the first billion years”
• 3D cosmological hydrodynamics code
• Generates TBs of data now
• Stresses network and data movement limits
• Run anywhere, write data to SDSC with SRB

SAC: TeraShake (Yifeng Cui)

• Estimating the potential damage of a magnitude 7.7 Southern California earthquake
• Large-scale simulation of seismic wave propagation on the San Andreas Fault
• 1.8 billion gridpoints
• 240 DataStar processors
• 1 TB memory
• 5 days
• 2 GB/s continuous I/O
• 47 TB output

NVO Montage (Leesa Brieger)

• Compute-intensive service to deliver science-grade custom mosaics on demand, with requests made through existing portals
• 2MASS: 10-TB, three-band infrared frequency archive of the entire sky
• Compute-intensive generation of custom mosaics
• Possible to mosaic the whole sky into five-degree squares with ~1 week of TeraGrid time

Bluegene specific

• Better development environment
• Eliminate cross compilation need (pretty ancient)
• Run BGL kernel as a VM on the front end?
• BGL’s special need for packing jobs on contiguous chunk of nodes
• Special map files, mapping codes

Bluegene: experience

• Extremely reproducible times
• Extremely stable hardware
• Very poor single processor (compiler?) performance (double hummer, SIMD)
• Still not tested computation/communication overlap
• Would like to operate in single-boot, multi-user mode

Bluegene: experience

• Several SDSC codes ported:
• Mpcugles: LES turbulence code
• PK’s DNS turbulence code
• POP ocean model
• SPECFEM3D: seismic wave propagation
• Amber: MD chemistry code
• ENZO: Astrophysics code
• NAMD, CPMD came from IBM

Bluegene: latest

• Half a petabyte of SATA file system attached to BGL
• 64 IA64 server nodes
• 3.2GB/s reads and 2.8GB/s writing
• 700MB/s from a production code using 512 nodes