National Oceanic and Atmospheric Administration
Geophysical Fluid Dynamics Laboratory
Princeton, NJ 08542
http://www.gfdl.noaa.gov
Chandin Wilson, Engility
Daniel Gall, Engility
Jeffrey Durachta, Samara Technology Group
Tara McQueen, NOAA
Cloudy with a chance of transfer
Abstracting the Network for Data Movement
Presentation for the CrossConnects Workshop
Improving Data Mobility & Management for International Climate Science
July 15, 2014
GENERAL COPY PROGRAM
Implementing Cloud Transfer for HPC: Abstracting the Network for Scientific Data Movement
A Copy Wrapper
• “The copy tool” to use in the enclave
• Simple ‘scp’-like usage
• Hides all the details about actual copy program and transfer nodes used
• Reliable and high throughput using dedicated transfer nodes
• Centralized logging
Motivation
• Distributed Compute and Storage
• Filesystems
• Users
• Toolsets
• Connectivity
• Reliability
• Maintainability
NOAA RDHPCS Enclave
[Diagram: RDHPCS sites, with labels Bastion (three instances) and Nwave]
• GFDL HPCS (Princeton, NJ): Users, Post Processing, Storage
• Gaea HPCS (Oak Ridge, TN): Compute
• Zeus HPCS (Fairmont, WV): Compute, Storage
• Jet HPCS (Boulder, CO): Users, Compute
Enclave
• Logical distinction between HPCS and user workstations
• Clear security demarcation points
• Trusted inside
• Unified login for all HPCS access
• Redundant access
• Multiple private 10Gbit links
Compute and Storage
GCP Evolution
• Predates NOAA RDHPCS Enclave
  – External collaborators wanted simple way to pull or push data to GFDL
• Improving Cluster Filesystem copies
  – Lustre, CXFS, have specific “best tools”
  – HPCS installs and upgrades change filesystem characteristics
• Exploring WAN transfer options
GCP 1
• Perl, non-OO design
• Flexible, but not configurable
• Excellent proving ground for ideas
• Understood two or three sites
• Predates DTN infrastructure
• Mutated hydra after 24 months
Data Transfer Nodes: DTNs
• Dedicated systems for data transfers
• Redundant cluster of systems
• No / limited user logins
• GridFTP server with certificate auth
• Some DTNs managed by Batch Scheduler
Data Transfer Nodes, 1
• Cluster filesystems
  – Fast and expensive: not usually attached to batch nodes
  – NFS serving introduces a layer
  – FTP-like copies between batch nodes and cluster filesystem nodes perform fastest
Data Transfer Nodes, 2
• Workstation (local) filesystems
  – Isolated from HPCS, but transfer access needed
  – Served via NFS from workstations
• Remote filesystems
  – WAN transfers introduce latency and bandwidth considerations different than LAN transfers
DTN Graphic
GCP 2 Design
• Perl, Object Oriented concepts
  – Functionality organized into a Driver that manages the transfer, using a Method with an Agent and Transport to copy the data
  – Methods determined by invoking node
  – Agent selected by node connectivity
  – Transport selected by node and DTN capability
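The selection rules above can be sketched roughly as follows. This is an illustration in Python, not GCP's Perl code: the method, agent, and transport names, the preference order, and the node and DTN attributes are all hypothetical, since the slides do not list the real ones.

```python
from dataclasses import dataclass

# Hypothetical names throughout: GCP's real methods, agents, and
# transports are not listed in the slides.

@dataclass(frozen=True)
class Node:
    name: str
    node_type: str           # kind of invoking node, e.g. "batch"
    reaches_dtn: bool        # connectivity: can it reach a DTN directly?
    transports: frozenset    # transports available on this node

@dataclass(frozen=True)
class Dtn:
    name: str
    transports: frozenset    # transports this DTN can serve

METHOD_BY_NODE_TYPE = {"batch": "via_dtn", "workstation": "direct"}
TRANSPORT_PREFERENCE = ("gridftp", "scp", "cp")  # assumed ordering

def select_method(node):
    """Method is determined by the invoking node."""
    return METHOD_BY_NODE_TYPE.get(node.node_type, "direct")

def select_agent(node):
    """Agent is selected by node connectivity."""
    return "remote_agent" if node.reaches_dtn else "local_agent"

def select_transport(node, dtn):
    """Transport is selected by node and DTN capability."""
    shared = node.transports & dtn.transports
    for transport in TRANSPORT_PREFERENCE:
        if transport in shared:
            return transport
    raise LookupError(f"no shared transport between {node.name} and {dtn.name}")

class Driver:
    """Manages one transfer by combining a method, an agent, and a transport."""

    def __init__(self, node, dtn):
        self.node = node
        self.dtn = dtn

    def plan(self, source, destination):
        return {
            "source": source,
            "destination": destination,
            "method": select_method(self.node),
            "agent": select_agent(self.node),
            "transport": select_transport(self.node, self.dtn),
        }

batch = Node("batch01", "batch", True, frozenset({"gridftp", "scp"}))
dtn = Dtn("dtn01", frozenset({"gridftp"}))
plan = Driver(batch, dtn).plan("/scratch/a.nc", "/archive/a.nc")
```

Keeping the three choices in separate functions mirrors the modular split described on the slide: a new transport or agent only touches its own selection rule.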
GCP Modularity
GCP Configuration
• Configuration abstracted into Categories, Filesystems, and DTNs (Data Transfer Nodes)
  – Categories are sites and nodes, allowing for site-wide and node-specific designation of a particular transport and/or DTN for a filesystem
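One way to picture site-wide and node-specific designation is a layered lookup. The sketch below is in Python with made-up category keys, filesystem paths, and DTN names. It also assumes that a node-specific entry overrides the site-wide one, which the slides do not state.

```python
# Hypothetical configuration: category -> filesystem -> settings.
CONFIG = {
    "site:site-a": {
        "/archive": {"transport": "gridftp", "dtn": "dtn-a"},
    },
    "node:node-17": {
        "/archive": {"dtn": "dtn-b"},  # node-specific DTN for this filesystem
    },
}

def resolve(site, node, filesystem):
    """Merge site-wide settings, then node-specific ones (assumed to win)."""
    settings = {}
    for category in (f"site:{site}", f"node:{node}"):
        settings.update(CONFIG.get(category, {}).get(filesystem, {}))
    return settings

site_default = resolve("site-a", "node-01", "/archive")
node_override = resolve("site-a", "node-17", "/archive")
```

Here `node-01` inherits both site-wide values, while `node-17` keeps the site-wide transport but uses its own DTN.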
GCP Configuration
GCP Logging
• Logging centralized and flexible
  – Able to implement per-user trace logging
• Extensive tunable logging via Log4perl
• Syslog as the transport
• All transfers logged with unique session identifier
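The idea of tagging every log line of a transfer with one session identifier can be sketched as below. This uses Python's standard `logging` module as a stand-in for Log4perl. The logger name, field layout, and helper names are assumptions, and an in-memory stream stands in for the syslog transport.

```python
import io
import logging
import uuid

class SessionFilter(logging.Filter):
    """Attach the same session identifier to every record it sees."""

    def __init__(self, session_id):
        super().__init__()
        self.session_id = session_id

    def filter(self, record):
        record.uid = self.session_id
        return True

def make_session_logger(stream, session_id=None):
    """Build a logger whose lines all carry one unique session id."""
    session_id = session_id or str(uuid.uuid4())
    logger = logging.getLogger(f"transfer.{session_id}")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("uid=%(uid)s;p=%(levelname)s;%(message)s"))
    handler.addFilter(SessionFilter(session_id))
    logger.addHandler(handler)
    return logger, session_id

buffer = io.StringIO()
log, sid = make_session_logger(buffer)
log.info("event=start")
log.info("event=end;status=0")
lines = buffer.getvalue().splitlines()
```

To send records to a syslog daemon instead of a stream, `logging.handlers.SysLogHandler` could replace the `StreamHandler`. The host and port it would need are site-specific.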
GCP Log Flow
[Diagram: log flow from site nodes to the loghosts]
Diagram labels: Gaea (Oak Ridge); Zeus (Fairmont); GFDL Pan (Princeton); GFDL (Princeton) Workstations; Externals (ANL, ORNL); Head, Batch, LDTN, RDTN
Loghosts: Enclave Loghost, dtn-003; GFDL Loghost, nfs-arch; Enclave Loghost, edtn.PRNG
• Log traffic is SYSLOG protocol on ports 46514 (production) and 46515 (test). This is within the data-transfer port ranges allowed through the firewalls.
GCP Log example
ts=2014-05-29T23:57:56.561;uid=383b1edb-1459-4b5b-b8a5-69957a291e0d;p=INFO;where=GCP::Driver::log_end.1414;event=gov.noaa.rdhpcs.gcp.end; transfer_size=865184916;dtn_destination=pnfs06-ha.princeton.rdhpcs.noaa.gov; file_count=1;status=0;prog=/usr/local/gcp/2.2/gcp;gcp_version=2.2.1; node=pp045.princeton.rdhpcs.noaa.gov;transport_count=1;level=info; transfer_time=10;pid=10824;error=none;user=Bill.Hurlin;gcp_call=.....
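A record in this layout is easy to split into fields. The Python sketch below assumes every field is `key=value`, that fields are separated by `;`, and that values contain no `;` (this may not hold for the elided `gcp_call` field, which is omitted here). The sample reuses a subset of the values from the record above.

```python
def parse_record(line):
    """Split a 'key=value;key=value' log line into a dict of strings."""
    fields = {}
    for part in line.split(";"):
        part = part.strip()          # tolerate spaces after separators
        if not part:
            continue
        key, _, value = part.partition("=")
        fields[key] = value
    return fields

sample = (
    "ts=2014-05-29T23:57:56.561;"
    "uid=383b1edb-1459-4b5b-b8a5-69957a291e0d;p=INFO;"
    "event=gov.noaa.rdhpcs.gcp.end; transfer_size=865184916;"
    "file_count=1;status=0; transfer_time=10;error=none"
)

record = parse_record(sample)

# Average rate, assuming transfer_size is in bytes and transfer_time in seconds.
rate = int(record["transfer_size"]) / int(record["transfer_time"])
```

Under those unit assumptions the sample works out to about 86.5 million bytes per second.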
GCP 2 Features
• Supports five sites, inside and outside the enclave
• Handles a dozen filesystems across nine node types
• Easily expandable for additional sites, filesystems, nodes, transports, and agents.
GCP Testing and Production
• Unit and end-to-end tests run hourly via Jenkins
• Production test runs validate functionality of the entire transfer infrastructure and most of the filesystems
GCP Transfer volumes