TRANSCRIPT
NNSA Advanced Simulation and Computing: An Overview of Data Management Issues
Presented at the March 16-18, 2004 DMW 2004 Workshop, Stanford Linear Accelerator Center
Steve Louis
Lawrence Livermore National Lab
LLNL ASC VIEWS Program Lead
TEL: 1-925-422-1550  FAX: 1-925-423-8715  E-mail: stlouis@llnl.gov
UCRL-PRES-202900
This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.
DISCLAIMER
This document was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the University of California nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial products, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or the University of California, and shall not be used for advertising or product endorsement purposes.
DMW 2004 Workshop, SLAC, 16-Mar-2004
Typical ASC application characterization
• ASC codes at LLNL are complex multi-physics codes. The codes integrate initial-value partial differential equations for the conservation of particles, momentum, and energy for important elements and constituents of the devices.
• Typical calculations use 10,000 to 1,000,000,000 mesh cells, depending on the problem and the desired resolution. The larger problems must be domain decomposed to fit on available memory sizes for distributed-memory systems.
• Partial differential equations are solved with a combination of explicit, implicit, and Monte Carlo techniques. Linear and non-linear solvers play an important role. The problems are integrated in time from an initial state to a final state.
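The mesh sizes above are what force domain decomposition: a billion-cell problem cannot fit in one node's memory. A back-of-the-envelope sketch of the node count required — the bytes-per-cell and node-memory figures are illustrative assumptions, not parameters of any ASC code:

```python
# Rough sizing for a domain-decomposed mesh on a distributed-memory machine.
# 1 KB of state per cell and 4 GB of memory per node are illustrative guesses.

def ranks_needed(total_cells, bytes_per_cell, mem_per_node, fill_fraction=0.8):
    """Minimum node count to hold the mesh, leaving some memory headroom."""
    usable = int(mem_per_node * fill_fraction)
    cells_per_node = usable // bytes_per_cell
    return -(-total_cells // cells_per_node)  # ceiling division

# A 1-billion-cell problem at 1 KB/cell on 4 GB nodes:
print(ranks_needed(1_000_000_000, 1024, 4 * 2**30))  # 299
```

At the low end of the slide's range (10,000 cells) the same arithmetic yields a single node, which is why only the larger problems must be decomposed.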
Typical ASC application characterization
• Typical phases:
  • Interactive problem set-up for a large simulation
  • Running one or more very large 2-D or 3-D calculations
  • Visualization, comparison, validation, archive of results
• Typically the large calculations need terascale computing
• ASC uses terascale computing and 1000’s of CPUs now
• As these codes run over many thousands of processors, huge data dumps must be frequently made to allow for restarts (a.k.a. defensive I/O). Visualization and/or physics files are also saved regularly for subsequent analyses (a.k.a. productive I/O).
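The defensive-I/O pattern described above can be sketched minimally: each process dumps its local state so the run can restart after a failure. The file naming and pickle format here are illustrative, not an ASC code's actual restart format, and a real code would issue these dumps concurrently from thousands of ranks:

```python
# Minimal sketch of "defensive I/O": one restart file per rank per
# checkpoint step (an N-to-N dump pattern). Layout is illustrative.
import os
import pickle

def write_restart_dump(step, rank, local_state, dump_dir="restart"):
    """Save one rank's local state for a given checkpoint step."""
    os.makedirs(dump_dir, exist_ok=True)
    path = os.path.join(dump_dir, f"dump_{step:06d}_rank{rank:05d}.pkl")
    with open(path, "wb") as f:
        pickle.dump({"step": step, "rank": rank, "state": local_state}, f)
    return path

def read_restart_dump(step, rank, dump_dir="restart"):
    """Reload a rank's state to resume the run at that step."""
    path = os.path.join(dump_dir, f"dump_{step:06d}_rank{rank:05d}.pkl")
    with open(path, "rb") as f:
        return pickle.load(f)

# e.g., rank 0 saving its slab of the mesh at step 1000:
write_restart_dump(1000, 0, {"cells": [0.0] * 8})
print(read_restart_dump(1000, 0)["step"])  # 1000
```

The productive-I/O files (visualization and physics dumps) would be written the same way, but kept for later analysis rather than overwritten at the next checkpoint.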
From current LLNL Compute & I/O Model
• Derived from peak platform and network bandwidth, historical usage patterns, user input, and projections
• A week-long run can generate up to 30 TB of data; moving that data to archive in one-tenth of the time it took to generate (i.e., at 10x the average generation rate) requires an I/O throughput rate of ~495 MB/s
• However, only about half the data typically needs to be stored, so the planned throughput rate to archive is lower, ~250 MB/s
• Some users tend to keep data on the platform for post-processing and visualization purposes, but some users may move the entire dataset to a separate visualization server
• A site-wide global file system would significantly shift this model and would reduce the need to explicitly move data
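The ~495 MB/s and ~250 MB/s figures above follow directly from the stated workload; a quick arithmetic check, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes):

```python
# Reproduce the slide's archive-bandwidth estimate.
data_bytes = 30 * 10**12          # 30 TB generated in one week
week_s = 7 * 24 * 3600            # 604,800 seconds
gen_rate = data_bytes / week_s    # ~49.6 MB/s average generation rate

# Archiving in one-tenth of the generation time means 10x the rate:
archive_rate = data_bytes / (week_s / 10)
print(round(archive_rate / 10**6))      # ~496 MB/s, the slide's ~495 MB/s

# If only half the data is archived, the requirement halves:
print(round(archive_rate / 2 / 10**6))  # ~248 MB/s, the planned ~250 MB/s
```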
Some relevant ancient history on data issues
• Joint DOE/NSF 1998 Workshop Series on Data and Visualization Corridors (DVCs):
  • Oxnard, CA, January 20-22 (Frameworks)
  • Santa Fe, NM, March 4-6 (User Requirements)
  • Bodega Bay, CA, April 6-8 (Technology Trends)
  • Duck, NC, May 19-21 (Writing Retreat I)
  • Wye River, MD, July 5-8 (Writing Retreat II)
Report published September 1998, Technical Report CACR-164, http://www.cacr.caltech.edu/publications/DVC
“There is a pressing need for new methods of handling truly massive datasets, of exploring and visualizing them, and of communicating them over geographic distances…”
(From Foreword of “Report on the 1998 DVC Workshop Series”)
Some 1998 workshop recommendations (most, if not all, are still relevant today…)
• Establish a vigorous, interdisciplinary program to improve the ability to see and understand output from large data sources
• Conduct new research and development focused on data management, graphics, and scientific visualization for large -scale data
• Increase the federal effort and annual investment in DVC R&D by $105-120M per year over current levels
• Develop and support a national strategy to incorporate results of DVC R&D into national laboratories, research centers, and national infrastructure programs
DVC system model, without archiving… (à la John van Rosendale, circa 1998)
[Diagram: a Simulation Engine feeds a Data Manipulation Engine over a SAN/LAN; Rendering Engines deliver images ("wow!") to users over a WAN.]
DVC model is still relevant for cluster-based hardware and software now deployed at LLNL
• New levels of graphics performance based on COTS technologies (with Lintel, Quadrics, nVidia cards)
• Tight coupling to the compute platform via Lustre and GigE links
• Distributed parallel software stack (open-source Chromium, DMX)
• Parallel, scalable end-user applications (e.g., VisIt, Blockbuster)
• Multiple display capabilities (PowerWalls, office high-resolution displays)
• Provides the blueprint for future Purple-related visualization and data deployment
• MCR: 1,116 P4 compute nodes
• PVC: 58 P4 render / 6 display nodes
• Common Lustre file system (90 TB)
[Diagram: LLNL Open Computing Facility (OCF) clusters, networks, and storage (BB/MKS Version 6, Dec 23, 2003). Key elements:
• MCR (B439): 1,152-port QsNet Elan3; 1,114 dual-P4 compute nodes; 4 login nodes with 4 Gb-Enet; 32 gateway nodes at 190 MB/s delivered Lustre I/O over 2x1GbE
• ALC (B439): 960-port QsNet Elan3; 924 dual-P4 compute nodes; 2 login nodes with 4 Gb-Enet; 32 gateway nodes at 190 MB/s delivered Lustre I/O over 2x1GbE
• Thunder (B451): 1,024-port QsNet Elan4; 1,004 quad-Itanium2 compute nodes; 4 login nodes with 6 Gb-Enet; 16 gateway nodes at 350 MB/s delivered Lustre I/O over 4x1GbE
• BG/L: 65,536 dual-PowerPC 440 compute nodes; 1,024 PPC440 I/O nodes; torus plus global tree/barrier interconnects
• PVC (B451): 128-port Elan3; 52 dual-P4 render nodes; 6 dual-P4 display nodes; 2 login nodes; drives analog desktop displays, a 3x2 PowerWall, and digital desktops
• OCF SGS File System Cluster (OFC, B113): groups of 64, 196, 208, and 64 OST heads (dual-P4 heads on 2Gb FC RAID with 146/73/36 GB disks) plus MDS pairs, joined by a GbE federated switch and federated Ethernet
• HPSS archive (400-600 TB), reached via PFTP and connected to the LLNL external backbone; links are MM/SM fiber and copper 1 GigE]
ASC Data Storage and I/O Roadmap (CY 2002 - CY 2007)
• ASC performance targets:
  • CY02-03: 30 TF; 1 PB archive; 7-20 GB/s parallel FS; 1 GB/s to archive tape
  • CY04-05: 70-100 TF; 7 PB archive; 100 GB/s parallel FS; 10 GB/s to archive tape
  • CY06-07: 200 TF; 25 PB archive; 200 GB/s parallel FS; 20 GB/s to archive tape
• SGSFS: Lustre Lite on Linux → Lustre Lite limited production → Lustre with OST striping → Lustre early production → Lustre stable production
• SIO libraries: limited application use → use by key applications → broad application use → performance tuned for Lustre
• Archive: HPSS 4.1 production → HPSS 4.5 production → HPSS 5.1 (metadata fixes) → HPSS 6.1 (replace DCE) → TBD
• DFS: DFS in production → pilot NFSv4 on Linux → deploy NFSv4 → integrate NFSv4 with Lustre
• COTS: 180 GB/disk, 30 MB/s single disk, 300 GB tape capacity, 70 MB/s max tape rate → 600 GB/disk, 80 MB/s, 600 GB tape, 120 MB/s → 1,200 GB/disk, 200 MB/s, 2 TB tape, 200 MB/s
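Reading the archive-rate and COTS tape-rate rows together: hitting the tape-bandwidth targets means striping across many drives. A rough sketch, assuming each drive sustains the listed maximum rate (real sustained rates would be lower, so real drive counts higher):

```python
# Tape drives needed to sustain each roadmap period's archive-rate target,
# assuming every drive runs at that period's max single-drive rate.
def drives_needed(archive_rate_mb_s, tape_rate_mb_s):
    return -(-archive_rate_mb_s // tape_rate_mb_s)  # ceiling division

# (archive target in GB/s, single-drive tape rate in MB/s) per period:
for gb_s, tape_mb_s in [(1, 70), (10, 120), (20, 200)]:
    print(f"{gb_s} GB/s needs {drives_needed(gb_s * 1000, tape_mb_s)} drives")
```

Even with the projected drive improvements, the aggregate targets imply on the order of a hundred concurrently streaming drives by CY06-07.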
LLNL HPSS storage slide from three years ago
Accomplishments
– A 20x performance increase in 15 months (faster nets and disks)
– PSE Milepost demonstrated 170 MB/s aggregate throughput White-to-HPSS
– Large single-file transfer rates of up to 80 MB/s White-to-HPSS
– Large single-file transfer rates of up to 150 MB/s White-to-SGI
Challenges
– Yearly doubling of throughput is needed for next machine
At 170 MB/s, 2 TB of data moves to storage in less than 4 hours. A year and a half ago it took two and a half days to move the same amount of data.
[Chart: Aggregate throughput to storage, FY96-FY01: 1, 4, 6, 9, 120, and 170 MB/s. Annotated transitions: moved to HPSS; moved to SP nodes; moved to jumbo GE and parallel striping; moved to faster disk on faster nodes with multi-node concurrency.]
Continued improvement in throughput needed to meet requirements of new ASC platforms
Note that this graph represents a 115x performance improvement in four years!
[Chart: Aggregate throughput to storage, FY96-FY03: 1, 4, 6, 9, 120, 170, 854, and 1,037 MB/s (the last being the 12/03 throughput). Annotated transitions: moved to HPSS; moved to SP nodes; moved to jumbo GE, parallel striping, and faster disk and nodes using multiple pftp sessions; moved to faster disk using multiple htar sessions on multiple nodes.]
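The chart's endpoints can be checked against the 115x claim and the earlier slide's "yearly doubling" requirement:

```python
# FY99 -> FY03: 9 MB/s to 1,037 MB/s, per the chart's data points.
start_mb_s, end_mb_s, years = 9, 1037, 4
factor = end_mb_s / start_mb_s
annual = factor ** (1 / years)
print(round(factor))     # 115 -- matches the "115x in four years" note
print(round(annual, 2))  # ~3.28x per year, well above yearly doubling
```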
A Tri-lab historical timeline for motivating improvement in scalable parallel file systems
• 1999: SGSFS workshop ("You're Crazy"); initial architecture proposed; PathForward activity for SGSFS proposed
• 2000: initial requirements document built; PathForward team formed to pursue an RFI/RFQ approach; RFI issued; RFQ process recommended
• 2001: RFQ and analysis; recommendation to fund open-source OBSD development and NFSv4 efforts; Tri-Lab joint requirements document completed
• 2002: partnering talks and negotiations begun for OBSD and NFSv4 PathForwards; PathForward proposal with OBSD vendor, Panasas, born; Lustre PathForward effort born; Alliance contracts placed with universities on OBSD, overlapped I/O, and NFSv4
• 2003: another workshop ("Are We Still Crazy?")
• 2004: workshop on re-inventing POSIX I/O; U. Minn Object Archive begins
From the June 2003 HECRTF workshop report
• For info: http://www.nitrd.gov/hecrtf-outreach/index.html
• NNSA Tri-labs (Lee Ward of SNL, Tyce McLarty of LLNL, Gary Grider of LANL) were the ASC I/O representatives at this workshop
• The overwhelming consensus was that POSIX I/O is inadequate
5.5. Data Management and File Systems
We believe legacy, POSIX I/O interfaces are incompatible with the full range of hardware architecture choices contemplated …
The interface does not fully support the needs for parallel support along the I/O path …
An alternative, appropriate operating system API should be developed for high-end computing systems …
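The core of the complaint is the N-to-1 shared-file pattern: many writers updating disjoint regions of one file through a byte-stream interface. A toy serial sketch of the layout (in a real code the writes are concurrent, typically via MPI-IO; file name and sizes here are illustrative):

```python
# Each of n "ranks" writes its own disjoint block of one shared file.
# POSIX offers no collective or strided write for this pattern; each
# writer must seek and write independently along the I/O path.
import os

def write_shared_file(path, n_ranks, block_size=16):
    with open(path, "wb") as f:
        f.truncate(n_ranks * block_size)   # pre-size the shared file
    for rank in range(n_ranks):            # concurrent in a real code
        with open(path, "r+b") as f:
            f.seek(rank * block_size)      # disjoint offset per rank
            f.write(bytes([rank]) * block_size)

write_shared_file("shared.dat", 4)
print(os.path.getsize("shared.dat"))  # 64
```

Even though the regions never overlap, the POSIX interface gives the file system no way to know that, which is part of what the workshop's alternative API was meant to fix.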
LLNL ASC SDM project organization areas emphasize many data management issues
• Metadata Infrastructure and Applications – Development effort creating a metadata-based environment for managing and simplifying data access (Metadata Tools Project)
• Data Access and Preparation – Research project helping scientists explore terabytes of scientific simulation data by permitting ad hoc queries over the data (Ad Hoc Query Project)
• Data Discovery – Research projects looking at various aspects of feature extraction, data mining, and pattern recognition (Sapphire Project)
• Data Models and Formats – Development effort generating models and file formats to ensure that ASC's scientific data can be freely exchanged (Limit Point Systems contract)
Scalable Visualization Tool Development for Interactive Exploitation of Large Data Sets
• VIEWS-developed tools (e.g., VisIt, TeraScale Browser) provide a vehicle for advanced research capabilities:
• Improved large surface handling
• Parallel distributed volume rendering
• Topological data representations
• View-dependent surface rendering
• Programmable HW graphics rendering
[Figure: an isosurface-area colormap, plotting isosurface area (colored from min to max) against isovalue and time (from t0, interval τ0).]
A take-home message: this five-year-old slide on issues is still as relevant as ever
• Traditional systems for archives and data management not necessarily suitable for the organization of ASC simulation data
• Traditional systems for “realistic” rendering and visualization not necessarily suitable for exploration of ASC simulation data
• ASC needs scalable, flexible methods for:
  • navigation / archive of massive data sets
  • efficient data subset selection / retrieval
  • time-step multivariate animation capability
  • interactive computational monitoring / steering
  • advanced application development and debugging
  • distance and distributed access to massive data
The Long-Term Challenge for ASC (yet another five-year-old, but still relevant, slide)
• Simple linear scaling of existing data management components won’t necessarily work
• A new use paradigm will be required
• Introduce users to new, innovative tools
• Motivate and enable vigorous research efforts
• Explore high -risk, high -reward technologies
• Identify technology shortfalls and barriers