TRANSCRIPT
NNSA Advanced Simulation and Computing: An Overview of Data Management Issues
Presented at the March 16-18, 2004 DMW 2004 Workshop, Stanford Linear Accelerator Center
Steve Louis
Lawrence Livermore National Lab
LLNL ASC VIEWS Program Lead
TEL: 1-925-422-1550  FAX: 1-925-423-8715  E-mail: stlouis@llnl.gov
UCRL-PRES-202900
This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.
DISCLAIMER
This document was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the University of California nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial products, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or the University of California, and shall not be used for advertising or product endorsement purposes.
DMW 2004 Workshop, SLAC, 16-Mar-2004
Typical ASC application characterization
• ASC codes at LLNL are complex multi-physics codes. The codes integrate initial-value partial differential equations for the conservation of particles, momentum, and energy for important elements and constituents of the devices.
• Typical calculations use 10,000 to 1,000,000,000 mesh cells, depending on the problem and the desired resolution. The larger problems must be domain decomposed to fit on available memory sizes for distributed-memory systems.
• Partial differential equations are solved with a combination of explicit, implicit, and Monte Carlo techniques. Linear and non-linear solvers play an important role. The problems are integrated in time from an initial state to a final state.
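The mesh sizes above are what force domain decomposition: a billion-cell problem cannot fit in one node's memory. A back-of-the-envelope sketch of the node count required — the bytes-per-cell and node-memory figures are illustrative assumptions, not parameters of any ASC code:

```python
# Rough sizing for a domain-decomposed mesh on a distributed-memory machine.
# 1 KB of state per cell and 4 GB of memory per node are illustrative guesses.

def ranks_needed(total_cells, bytes_per_cell, mem_per_node, fill_fraction=0.8):
    """Minimum node count to hold the mesh, leaving some memory headroom."""
    usable = int(mem_per_node * fill_fraction)
    cells_per_node = usable // bytes_per_cell
    return -(-total_cells // cells_per_node)  # ceiling division

# A 1-billion-cell problem at 1 KB/cell on 4 GB nodes:
print(ranks_needed(1_000_000_000, 1024, 4 * 2**30))  # 299
```

At the low end of the slide's range (10,000 cells) the same arithmetic yields a single node, which is why only the larger problems must be decomposed.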
Typical ASC application characterization
• Typical phases:
  • Interactive problem set-up for a large simulation
  • Running one or more very large 2-D or 3-D calculations
  • Visualization, comparison, validation, archive of results
• Typically the large calculations need terascale computing
• ASC uses terascale computing and 1000’s of CPUs now
• As these codes run over many thousands of processors, huge data dumps must be frequently made to allow for restarts (a.k.a. defensive I/O). Visualization and/or physics files are also saved regularly for subsequent analyses (a.k.a. productive I/O).
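The defensive-I/O pattern described above can be sketched minimally: each process dumps its local state so the run can restart after a failure. The file naming and pickle format here are illustrative, not an ASC code's actual restart format, and a real code would issue these dumps concurrently from thousands of ranks:

```python
# Minimal sketch of "defensive I/O": one restart file per rank per
# checkpoint step (an N-to-N dump pattern). Layout is illustrative.
import os
import pickle

def write_restart_dump(step, rank, local_state, dump_dir="restart"):
    """Save one rank's local state for a given checkpoint step."""
    os.makedirs(dump_dir, exist_ok=True)
    path = os.path.join(dump_dir, f"dump_{step:06d}_rank{rank:05d}.pkl")
    with open(path, "wb") as f:
        pickle.dump({"step": step, "rank": rank, "state": local_state}, f)
    return path

def read_restart_dump(step, rank, dump_dir="restart"):
    """Reload a rank's state to resume the run at that step."""
    path = os.path.join(dump_dir, f"dump_{step:06d}_rank{rank:05d}.pkl")
    with open(path, "rb") as f:
        return pickle.load(f)

# e.g., rank 0 saving its slab of the mesh at step 1000:
write_restart_dump(1000, 0, {"cells": [0.0] * 8})
print(read_restart_dump(1000, 0)["step"])  # 1000
```

The productive-I/O files (visualization and physics dumps) would be written the same way, but kept for later analysis rather than overwritten at the next checkpoint.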
From current LLNL Compute & I/O Model
• Derived from peak platform and network bandwidth, historical usage patterns, user input, and projections
• A week-long run can generate up to 30 TB of data; moving that data to archive in one-tenth of the time it took to generate (i.e., at 10x the average generation rate) requires an I/O throughput rate of ~495 MB/s
• However, only about half the data typically needs to be stored, so the planned throughput rate to archive is lower, ~250 MB/s
• Some users tend to keep data on the platform for post-processing and visualization purposes, but some users may move the entire dataset to a separate visualization server
• A site-wide global file system would significantly shift this model and would reduce the need to explicitly move data
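The ~495 MB/s and ~250 MB/s figures above follow directly from the stated workload; a quick arithmetic check, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes):

```python
# Reproduce the slide's archive-bandwidth estimate.
data_bytes = 30 * 10**12          # 30 TB generated in one week
week_s = 7 * 24 * 3600            # 604,800 seconds
gen_rate = data_bytes / week_s    # ~49.6 MB/s average generation rate

# Archiving in one-tenth of the generation time means 10x the rate:
archive_rate = data_bytes / (week_s / 10)
print(round(archive_rate / 10**6))      # ~496 MB/s, the slide's ~495 MB/s

# If only half the data is archived, the requirement halves:
print(round(archive_rate / 2 / 10**6))  # ~248 MB/s, the planned ~250 MB/s
```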
Some relevant ancient history on data issues
• Joint DOE/NSF 1998 Workshop Series on Data and Visualization Corridors (DVCs):
  • Oxnard, CA, January 20-22 (Frameworks)
  • Santa Fe, NM, March 4-6 (User Requirements)
  • Bodega Bay, CA, April 6-8 (Technology Trends)
  • Duck, NC, May 19-21 (Writing Retreat I)
  • Wye River, MD, July 5-8 (Writing Retreat II)
Report published September 1998, Technical Report CACR-164, http://www.cacr.caltech.edu/publications/DVC
“There is a pressing need for new methods of handling truly massive datasets, of exploring and visualizing them, and of communicating them over geographic distances…”
(From Foreword of “Report on the 1998 DVC Workshop Series”)
Some 1998 workshop recommendations (most, if not all, are still relevant today…)
• Establish a vigorous, interdisciplinary program to improve the ability to see and understand output from large data sources
• Conduct new research and development focused on data management, graphics, and scientific visualization for large -scale data
• Increase the federal effort and annual investment in DVC R&D by $105-120M per year over current levels
• Develop and support a national strategy to incorporate results of DVC R&D into national laboratories, research centers, and national infrastructure programs
DVC system model, without archiving… (à la John van Rosendale, circa 1998)
[Diagram: a Simulation Engine feeds a Data Manipulation Engine over a SAN/LAN; Rendering Engines deliver images ("wow!") to users over a WAN.]
DVC model is still relevant for cluster-based hardware and software now deployed at LLNL
• New levels of graphics performance based on COTS technologies (with Lintel, Quadrics, nVidia cards)
• Tight coupling to the compute platform via Lustre and GigE links
• Distributed parallel software stack (open-source Chromium, DMX)
• Parallel, scalable end-user applications (e.g., VisIt, Blockbuster)
• Multiple display capabilities (PowerWalls, office high-resolution displays)
• Provides the blueprint for future Purple-related visualization and data deployment
• MCR: 1,116 P4 compute nodes
• PVC: 58 P4 render / 6 display nodes
• Common Lustre file system (90 TB)
[Diagram: LLNL Open Computing Facility (OCF) clusters, networks, and storage (BB/MKS Version 6, Dec 23, 2003). Key elements:
• MCR (B439): 1,152-port QsNet Elan3; 1,114 dual-P4 compute nodes; 4 login nodes with 4 Gb-Enet; 32 gateway nodes at 190 MB/s delivered Lustre I/O over 2x1GbE
• ALC (B439): 960-port QsNet Elan3; 924 dual-P4 compute nodes; 2 login nodes with 4 Gb-Enet; 32 gateway nodes at 190 MB/s delivered Lustre I/O over 2x1GbE
• Thunder (B451): 1,024-port QsNet Elan4; 1,004 quad-Itanium2 compute nodes; 4 login nodes with 6 Gb-Enet; 16 gateway nodes at 350 MB/s delivered Lustre I/O over 4x1GbE
• BG/L: 65,536 dual-PowerPC 440 compute nodes; 1,024 PPC440 I/O nodes; torus plus global tree/barrier interconnects
• PVC (B451): 128-port Elan3; 52 dual-P4 render nodes; 6 dual-P4 display nodes; 2 login nodes; drives analog desktop displays, a 3x2 PowerWall, and digital desktops
• OCF SGS File System Cluster (OFC, B113): groups of 64, 196, 208, and 64 OST heads (dual-P4 heads on 2Gb FC RAID with 146/73/36 GB disks) plus MDS pairs, joined by a GbE federated switch and federated Ethernet
• HPSS archive (400-600 TB), reached via PFTP and connected to the LLNL external backbone; links are MM/SM fiber and copper 1 GigE]
ASC Data Storage and I/O Roadmap (CY 2002 - CY 2007)
• ASC performance targets:
  • CY02-03: 30 TF; 1 PB archive; 7-20 GB/s parallel FS; 1 GB/s to archive tape
  • CY04-05: 70-100 TF; 7 PB archive; 100 GB/s parallel FS; 10 GB/s to archive tape
  • CY06-07: 200 TF; 25 PB archive; 200 GB/s parallel FS; 20 GB/s to archive tape
• SGSFS: Lustre Lite on Linux → Lustre Lite limited production → Lustre with OST striping → Lustre early production → Lustre stable production
• SIO libraries: limited application use → use by key applications → broad application use → performance tuned for Lustre
• Archive: HPSS 4.1 production → HPSS 4.5 production → HPSS 5.1 (metadata fixes) → HPSS 6.1 (replace DCE) → TBD
• DFS: DFS in production → pilot NFSv4 on Linux → deploy NFSv4 → integrate NFSv4 with Lustre
• COTS: 180 GB/disk, 30 MB/s single disk, 300 GB tape capacity, 70 MB/s max tape rate → 600 GB/disk, 80 MB/s, 600 GB tape, 120 MB/s → 1,200 GB/disk, 200 MB/s, 2 TB tape, 200 MB/s
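Reading the archive-rate and COTS tape-rate rows together: hitting the tape-bandwidth targets means striping across many drives. A rough sketch, assuming each drive sustains the listed maximum rate (real sustained rates would be lower, so real drive counts higher):

```python
# Tape drives needed to sustain each roadmap period's archive-rate target,
# assuming every drive runs at that period's max single-drive rate.
def drives_needed(archive_rate_mb_s, tape_rate_mb_s):
    return -(-archive_rate_mb_s // tape_rate_mb_s)  # ceiling division

# (archive target in GB/s, single-drive tape rate in MB/s) per period:
for gb_s, tape_mb_s in [(1, 70), (10, 120), (20, 200)]:
    print(f"{gb_s} GB/s needs {drives_needed(gb_s * 1000, tape_mb_s)} drives")
```

Even with the projected drive improvements, the aggregate targets imply on the order of a hundred concurrently streaming drives by CY06-07.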
LLNL HPSS storage slide from three years ago
Accomplishments
– A 20x performance increase in 15 months (faster nets and disks)
– PSE Milepost demonstrated 170 MB/s aggregate throughput White-to-HPSS
– Large single-file transfer rates of up to 80 MB/s White-to-HPSS
– Large single-file transfer rates of up to 150 MB/s White-to-SGI
Challenges
– Yearly doubling of throughput is needed for next machine
At 170 MB/s, 2 TB of data moves to storage in less than 4 hours. A year and a half ago it took two and a half days to move the same amount of data.
[Chart: Aggregate throughput to storage, FY96-FY01: 1, 4, 6, 9, 120, and 170 MB/s. Annotated transitions: moved to HPSS; moved to SP nodes; moved to jumbo GE and parallel striping; moved to faster disk on faster nodes with multi-node concurrency.]
Continued improvement in throughput needed to meet requirements of new ASC platforms
Note that this graph represents a 115x performance improvement in four years!
[Chart: Aggregate throughput to storage, FY96-FY03: 1, 4, 6, 9, 120, 170, 854, and 1,037 MB/s (the last being the 12/03 throughput). Annotated transitions: moved to HPSS; moved to SP nodes; moved to jumbo GE, parallel striping, and faster disk and nodes using multiple pftp sessions; moved to faster disk using multiple htar sessions on multiple nodes.]
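The chart's endpoints can be checked against the 115x claim and the earlier slide's "yearly doubling" requirement:

```python
# FY99 -> FY03: 9 MB/s to 1,037 MB/s, per the chart's data points.
start_mb_s, end_mb_s, years = 9, 1037, 4
factor = end_mb_s / start_mb_s
annual = factor ** (1 / years)
print(round(factor))     # 115 -- matches the "115x in four years" note
print(round(annual, 2))  # ~3.28x per year, well above yearly doubling
```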
A Tri-lab historical timeline for motivating improvement in scalable parallel file systems
• 1999: SGSFS workshop ("You're Crazy"); initial architecture proposed; PathForward activity for SGSFS proposed
• 2000: initial requirements document built; PathForward team formed to pursue an RFI/RFQ approach; RFI issued; RFQ process recommended
• 2001: RFQ and analysis; recommendation to fund open-source OBSD development and NFSv4 efforts; Tri-Lab joint requirements document completed
• 2002: partnering talks and negotiations begun for OBSD and NFSv4 PathForwards; PathForward proposal with OBSD vendor, Panasas, born; Lustre PathForward effort born; Alliance contracts placed with universities on OBSD, overlapped I/O, and NFSv4
• 2003: another workshop ("Are We Still Crazy?")
• 2004: workshop on re-inventing POSIX I/O; U. Minn Object Archive begins
From the June 2003 HECRTF workshop report
• For info: http://www.nitrd.gov/hecrtf-outreach/index.html
• NNSA Tri-labs (Lee Ward of SNL, Tyce McLarty of LLNL, Gary Grider of LANL) were the ASC I/O representatives at this workshop
• The overwhelming consensus was that POSIX I/O is inadequate
5.5. Data Management and File Systems
We believe legacy, POSIX I/O interfaces are incompatible with the full range of hardware architecture choices contemplated …
The interface does not fully support the needs for parallel support along the I/O path …
An alternative, appropriate operating system API should be developed for high-end computing systems …
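The core of the complaint is the N-to-1 shared-file pattern: many writers updating disjoint regions of one file through a byte-stream interface. A toy serial sketch of the layout (in a real code the writes are concurrent, typically via MPI-IO; file name and sizes here are illustrative):

```python
# Each of n "ranks" writes its own disjoint block of one shared file.
# POSIX offers no collective or strided write for this pattern; each
# writer must seek and write independently along the I/O path.
import os

def write_shared_file(path, n_ranks, block_size=16):
    with open(path, "wb") as f:
        f.truncate(n_ranks * block_size)   # pre-size the shared file
    for rank in range(n_ranks):            # concurrent in a real code
        with open(path, "r+b") as f:
            f.seek(rank * block_size)      # disjoint offset per rank
            f.write(bytes([rank]) * block_size)

write_shared_file("shared.dat", 4)
print(os.path.getsize("shared.dat"))  # 64
```

Even though the regions never overlap, the POSIX interface gives the file system no way to know that, which is part of what the workshop's alternative API was meant to fix.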
LLNL ASC SDM project organization areas emphasize many data management issues
• Metadata Infrastructure and Applications – Development effort creating a metadata-based environment for managing and simplifying data access (Metadata Tools Project)
• Data Access and Preparation – Research project helping scientists explore terabytes of scientific simulation data by permitting ad hoc queries over the data (Ad Hoc Query Project)
• Data Discovery – Research projects looking at various aspects of feature extraction, data mining, and pattern recognition (Sapphire Project)
• Data Models and Formats – Development effort generating models and file formats to ensure that ASC's scientific data can be freely exchanged (Limit Point Systems contract)
Scalable Visualization Tool Development for Interactive Exploitation of Large Data Sets
• VIEWS-developed tools (e.g., VisIt, TeraScale Browser) provide a vehicle for advanced research capabilities:
• Improved large surface handling
• Parallel distributed volume rendering
• Topological data representations
• View-dependent surface rendering
• Programmable HW graphics rendering
[Figure: an isosurface-area colormap, plotting isosurface area (colored from min to max) against isovalue and time (from t0, interval τ0).]
A take-home message: this five-year-old slide on issues is still as relevant as ever
• Traditional systems for archives and data management not necessarily suitable for the organization of ASC simulation data
• Traditional systems for “realistic” rendering and visualization not necessarily suitable for exploration of ASC simulation data
• ASC needs scalable, flexible methods for:
  • navigation / archive of massive data sets
  • efficient data subset selection / retrieval
  • time-step multivariate animation capability
  • interactive computational monitoring / steering
  • advanced application development and debugging
  • distance and distributed access to massive data
The Long-Term Challenge for ASC (yet another five-year-old, but still relevant, slide)
• Simple linear scaling of existing data management components won’t necessarily work
• A new use paradigm will be required
• Introduce users to new, innovative tools
• Motivate and enable vigorous research efforts
• Explore high -risk, high -reward technologies
• Identify technology shortfalls and barriers