TRANSCRIPT
1
Research and Development
2
R&D Agenda
• Security
• Bulk Data Movement
• Data Replication and Mirroring
• Monitoring
• Metrics
• Versioning
• Product Services
3
Security: Single Sign-On Solutions
Goal: Single Sign-On (SSO) across browsers and non-browser clients
Public Key Infrastructure (PKI) SSO
• SSO for non-browser applications, like GridFTP
• SSO through X.509 public key certificates issued by MyProxy
• Online Certification Authority (CA) with username/password
• Auto-provisioning of trust configuration
Web SSO
• SSO for http/https applications through OpenID
• OpenID Identity Provider (IdP) with username/password
Web SSO and PKI SSO share a username/password DB
• Single primary authentication mechanism for end user
4
Security: Integrated WebSSO & PKI-SSO
5
Security: MyProxy as Online CA
MyProxy: Open Source software from NCSA
Online CA is one of its many capabilities
Different primary authentication mechanisms through standardized Pluggable Authentication Module (PAM)
Shipped with Globus Toolkit, supported on various platforms
Client package as separate deployment, including Java clients and API
Earth System Grid Center for Enabling Technologies: (ESG-CET)
6
Security: Auto-Provisioning
PKI-SSO solutions require configuration of trust roots
• Identity providers (IdPs), Certification Authorities (CAs)
• Revocation lists
Up-to-date configuration required at servers and clients
• Scalability issues with large numbers of clients
MyProxy provides auto-provisioning option
• Integrated with login
• Transparently updates CAs and CRLs
• Being extended to cover server provisioning as well
7
Security: OpenID
OpenID provides SSO across multiple servers and can leverage multiple IdPs
OpenID satisfies ESG security requirements
OpenID uses standard HTTP/HTTPS protocol
Use ESG-specific OpenID profile to ensure safe deployment
• All communication with IdP requires SSL (Client to IdP and IdP to RP)
• Yadis IDs (URIs) for OpenID identifiers
• Resource Providers (RP) enforce a white list of IdPs
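The two profile rules above (SSL for all IdP communication, and a white list of IdPs enforced by each RP) can be illustrated with a minimal Python sketch. The host names and the function name are hypothetical and are not taken from ESG or OpenID4Java; the sketch only shows the shape of the check.

```python
from urllib.parse import urlparse

# Hypothetical white list of trusted identity provider hosts (assumption).
TRUSTED_IDP_HOSTS = {"idp.example-gateway.org", "idp.example-mirror.org"}


def is_acceptable_idp(endpoint_url, trusted_hosts=TRUSTED_IDP_HOSTS):
    """Return True only if the IdP endpoint uses https and is white-listed."""
    parsed = urlparse(endpoint_url)
    if parsed.scheme != "https":
        return False  # the profile requires SSL for all IdP communication
    host = (parsed.hostname or "").lower()
    return host in trusted_hosts
```

An RP would run such a check on the IdP endpoint it discovers for a user's identifier before redirecting the user to it.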
8
Security: OpenID4Java
OpenID4Java: Open Source software
• ESG developers contribute enhancements back
Deployable as independent package into standard application servers
• Integrates well with ESG’s application server software
Built-in support:
• SSL (encrypted communication)
• User attributes push
Java API to write authentication filters and identity providers
Extended to support attributes and multiple identity providers
9
Bulk Data Movement
Requirements
• Access all data holdings through uniform interfaces, including disk pools and mass storage systems on various nodes, using various security models
• Allocate space quotas to users dynamically on gateways in order to serve files to clients
• Manage file lifetimes in the allocated spaces, and automatically clean up spaces for reuse
• Provide easy-to-use user facilities to download many files
• Manage large-scale robust data movement for replication of core data between nodes
Storage Resource Management (SRM) tools support these requirements in ESG
10
Bulk Data Movement: SRM Technology, and BeStMan
Storage Resource Managers (SRM) are middleware components over shared distributed storage components that provide:
• Dynamic space allocation
• Dynamic file management in spaces
• Uniform interface to all storage systems
The Berkeley Storage Manager (BeStMan) is an implementation of the SRM standard
• The SRM specification is an OGF (Open Grid Forum) standard that was developed over the last 7 years
• BeStMan is used in ESG, several High-Energy-Physics (HEP) experiments, and other applications
BeStMan in ESG (see figure next slide)
• Used for coordinating space allocation and transparent access and file movement between ESG nodes and the gateway
• Currently interfaces to HPSS in NERSC and ORNL, to MSS at NCAR, and to disk systems at LLNL and LANL
• Also used to manage space on the NCAR gateway
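To make the ideas of dynamic space allocation and file lifetimes concrete, here is a toy in-memory model in Python. It is not the SRM interface or BeStMan code; the class and method names are invented for illustration, and real SRMs manage actual storage rather than a dictionary.

```python
import time


class SpaceManager:
    """Toy in-memory model of a quota-limited space with file lifetimes."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        self.files = {}  # name -> (size_bytes, expiry_timestamp)

    def used_bytes(self):
        return sum(size for size, _ in self.files.values())

    def put(self, name, size_bytes, lifetime_s, now=None):
        """Reserve room for a file; expired files are reclaimed first."""
        now = time.time() if now is None else now
        self.cleanup(now)
        previous = self.files.get(name, (0, 0))[0]
        if self.used_bytes() - previous + size_bytes > self.quota_bytes:
            return False  # request does not fit in the allocated quota
        self.files[name] = (size_bytes, now + lifetime_s)
        return True

    def cleanup(self, now=None):
        """Remove files whose lifetime has expired; return their names."""
        now = time.time() if now is None else now
        expired = [n for n, (_, exp) in self.files.items() if exp <= now]
        for name in expired:
            del self.files[name]
        return expired
```

The `now` parameter exists only so the behavior can be exercised deterministically; a second file that does not fit is refused until the first file's lifetime runs out and its space is reclaimed.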
11
Bulk Data Movement: Use of BeStMan in ESG
[Figure: BeStMan deployment across ESG sites. Sites: LBNL, LLNL, ISI, NCAR, ORNL, ANL. Components: Tomcat servlet engine; MCS (Metadata Cataloguing Services); RLS (Replica Location Services); MyProxy server; GRAM gatekeeper; CAS (Community Authorization Services); MCS, RLS, MyProxy, and CAS clients; BeStMan (Storage Resource Management) instances; disk, MSS (Mass Storage System), and HPSS (High Performance Storage System) storage; gridFTP servers, gridFTP striped servers, and openDAPg servers. Protocol labels: SOAP, RMI, gridFTP.]
BeStMan at the gateway accesses the BeStMan instances at all other nodes to get requested files (highlighted in purple)
12
DataMoverLite (DML): Simplifying Data Movement to Clients
Goal: automate pulling of files into user’s workstation
Using various transfer protocols (GridFTP, bbcp, https, …)
Have a GUI that shows transfer progress, or summary progress with command line
Supports entire directory transfers
Supports suspend/resume operations
DML available on Linux, PC, Mac
GUI shows info on completed, active, pending transfers
Also, file sizes, transfer times, transfer speed
13
Bulk Data Movement Service Requirements
• Move terabytes to petabytes (many thousands of files)
• Asynchronous long-lasting operation
• Recovery from transient failures and automatic restart
• Take advantage of (dynamic) network provisioning
• Use GridFTP, other protocols if necessary
• Space verification at target
• Support for data checksums
• On-demand transfer status information
• On-demand completion time estimates
• Statistics collection
• For security reasons bulk data movement needs to be done in “pull mode”
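Two of the requirements above, recovery from transient failures and support for data checksums, can be sketched together in Python. This is only an illustration under stated assumptions: `fetch` stands in for whatever transfer client pulls the file (GridFTP or otherwise), and `TransientError`, `pull_file`, and the choice of SHA-256 are hypothetical, not part of the ESG service.

```python
import hashlib


class TransientError(Exception):
    """Stands in for a recoverable failure such as a dropped connection."""


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


def pull_file(fetch, name, expected_sha256, max_attempts=3):
    """Pull one file, retrying on transient errors and checksum mismatch.

    `fetch` is a caller-supplied function (name -> bytes).
    Returns (data, attempts_used); raises RuntimeError if all attempts fail.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            data = fetch(name)
        except TransientError:
            continue  # transient failure: try again
        if sha256_hex(data) == expected_sha256:
            return data, attempt
        # checksum mismatch: treat as a corrupted transfer and retry
    raise RuntimeError(f"could not transfer {name} after {max_attempts} attempts")
```

Because the target side calls `fetch`, the sketch is consistent with the pull-mode requirement: the receiving site initiates every transfer.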
14
Workflow for Future Bulk Data Movement Service
[Figure: workflow of the future bulk data movement service. Phases: Initialization, Execution. Boxes: Multi-file request coordinator; Request submission; Initial request estimation; Verify storage at target; Replicate directory structure; Generate plan using statistics; File transfer client; Monitor and generate statistics; Checksum comparison; Recovery and restart; Suspend and resume; On-demand status; Dynamic progress estimation; Compose request for failed files.]
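The "dynamic progress estimation" step in the workflow can be approximated very simply from the average transfer rate observed so far. The Python sketch below is an assumption about one possible estimator, not the method used by the ESG service, which may weight recent throughput or use collected statistics.

```python
def estimate_remaining_seconds(total_bytes, done_bytes, elapsed_s):
    """Estimate remaining transfer time from the average rate so far.

    Returns None when no rate can be computed yet.
    """
    if done_bytes <= 0 or elapsed_s <= 0:
        return None
    rate = done_bytes / elapsed_s  # bytes per second
    return max(total_bytes - done_bytes, 0) / rate
```

For example, with 250 of 1000 bytes moved in 10 seconds, the average rate is 25 bytes per second, so the remaining 750 bytes are estimated to take 30 seconds.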
15
Data Replication and Mirroring
Requirement: several mirror sites around the world want to host key subsets (called a “core”) of ESG data sets
This is a new requirement for ESG
• Replication of climate data sets was not originally an ESG goal
• Originally considered impractical because of large size of climate data sets
With increasing importance of the IPCC data, international sites want to replicate or “mirror” key data sets
• Give scientists in a geographical region access to a “local” copy
• Reduce wide area latencies for data access
• Provide increased fault tolerance and disaster protection, since data sets are available at multiple sites
16
Impact of Data Replication/Mirroring
This work will make ESG data sets more accessible to climate scientists outside of the ESG-CET project
Initial planned mirror sites:
• UK’s British Atmospheric Data Centre (BADC)
• Germany’s Max Planck Institute for Meteorology (MPIM)
• Both have participated in design discussions for mirroring functionality
Other mirror sites likely (e.g., in Asia)
• Global network topology considerations
Impact will be to increase the use of ESG and CMIP5 data sets by scientists around the world, thus advancing climate science discoveries
17
Requirements for Data Mirroring
Newly published data set(s) are added to a common core produced at a gateway
A mirror site replicates some or all of the data sets from the common core published by a gateway
Changes to existing data sets (additions, deletions, replacements, modifications) are propagated from publishing gateway to mirror sites
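The third requirement, propagating additions, deletions, and replacements, amounts to comparing what the gateway publishes with what the mirror holds. A minimal Python sketch, assuming each side can be summarized as a map from file name to checksum (an assumption made for illustration; the slides do not specify the representation):

```python
def diff_catalogs(published, mirrored):
    """Compare {file_name: checksum} maps from the gateway and the mirror.

    Returns the file names a mirror must add, delete, or replace.
    """
    added = sorted(set(published) - set(mirrored))
    deleted = sorted(set(mirrored) - set(published))
    modified = sorted(
        name for name in set(published) & set(mirrored)
        if published[name] != mirrored[name]
    )
    return {"add": added, "delete": deleted, "replace": modified}
```

The resulting lists could then be handed to a transfer component to bring the mirror back in sync with the publishing gateway.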
18
Data Mirroring Plans Going Forward
Implementation plan (currently in progress) involves integration of several key ESG components and new functionality
• Choose among available source replicas for data and metadata
• Invoke the Bulk Data Movement component to copy data sets reliably to the mirror site’s data node
• Use existing ESG metadata API operations to query the relevant metadata at the publishing node
• Use a modified version of the ESG publication client to publish newly replicated data sets at the mirror site’s gateway
• Identify updates that need to be propagated to mirror sites using versioning functionality
Technical objectives
• In the next year: complete initial implementation and deployment; evaluate data mirroring at sites in ESG, Europe
• Add functionality, including support for automatic subscription and notification of mirrored data sets
19
Replication Architecture (1)
[Figure: Generating List of Files to Be Transferred]
Components: Replication Client (does source selection), any ESG Gateway, Bulk Data Movement Client at mirror site
(1) List of data sets to be mirrored
(2) Metadata query for locations
(3) Locations of data sets
(4) XML list of data transfers (sources and destinations)
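Steps (3) and (4) can be sketched in Python: given the replica locations returned by a gateway, pick one source per file and emit an XML list of transfers. The element and attribute names, the URLs, and the preference rule are all assumptions made for illustration; the slides do not define the actual XML format or the source selection policy.

```python
import xml.etree.ElementTree as ET


def build_transfer_list(replica_locations, destination_root, preferred_hosts=()):
    """Pick one source per file and return an XML transfer list as a string.

    replica_locations: {relative_path: [source_url, ...]}, each list non-empty.
    preferred_hosts: substrings tried in order; the first matching source wins,
    otherwise the first listed source is used.
    """
    root = ET.Element("transfers")  # hypothetical element name
    for path in sorted(replica_locations):
        sources = replica_locations[path]
        chosen = next(
            (s for host in preferred_hosts for s in sources if host in s),
            sources[0],
        )
        ET.SubElement(root, "transfer", source=chosen,
                      destination=f"{destination_root}/{path}")
    return ET.tostring(root, encoding="unicode")
```

A real replication client would likely base source selection on network topology or measured throughput rather than a static host preference.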
20
Replication Architecture (2)
[Figure: After Bulk Transfers Complete]
Components: Replication Client, Publishing Gateway, Metadata Database at mirror data node, ESG Publishing Client on mirror data node, mirror site’s gateway node
(1) Metadata query
(2) Metadata
(3) Invoke publishing client
(4) Store metadata
(5) Publish THREDDS database
(6) Metadata harvested and disseminated to all gateways
(7) Future: notification that new mirror site is available; subscribe to updates of mirrored data
21
Monitoring
Monitoring has contributed significantly to the robustness of the ESG infrastructure
Based on the Globus Monitoring and Discovery System (MDS)
ESG uses MDS to monitor the status of components in the distributed system
• GridFTP data transfer services
• Storage Resource Managers (SRMs)
• NCAR portal
• HTTP data services
• OpenDAP services
• Replica Location Services (RLSs)
22
Globus Monitoring and Discovery System
MDS Index Service
• Collects status information from information providers at each component
• Reports whether a particular service being monitored is currently working correctly
MDS Trigger Service
• Takes actions based on monitored conditions
• Sends emails to the Earth System Grid administrators’ mailing list when components fail
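The index-then-trigger pattern described above can be illustrated with a small Python sketch. It does not use the MDS API; the status values, function names, and message layout are hypothetical, and actually sending the email is left out.

```python
def failed_services(status_by_service):
    """Return sorted names of services whose latest status is not 'ok'."""
    return sorted(n for n, s in status_by_service.items() if s != "ok")


def build_alert(status_by_service, recipient):
    """Build an alert message dict, or None when everything is healthy."""
    failed = failed_services(status_by_service)
    if not failed:
        return None
    return {
        "to": recipient,
        "subject": f"{len(failed)} monitored service(s) failing",
        "body": "\n".join(f"{name}: {status_by_service[name]}" for name in failed),
    }
```

Here the status dictionary plays the role of the collected index information, and `build_alert` plays the role of a trigger that acts only when a monitored condition is met.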
23
Impact of Monitoring and Future Plans
Has resulted in much faster recovery of failed services in the distributed ESG infrastructure
Lower downtime of our infrastructure
The ESG team is quickly informed when components fail
• Allows the team to quickly restart failed services
• Often before failures are encountered by users
We plan to deploy yet more sophisticated monitoring
• ESG infrastructure increasingly distributed, federated
• Also want to monitor status of mirror sites worldwide
• Monitor service performance as well as availability
Investigating NetLogger, PerfSONAR
24
Metrics
Metrics are required to track and record users’ interactions with the ESG enterprise system
Reporting is required to show the benefits of the ESG enterprise system to the scientific community at large
ESG Gateway requests metric data from its Data Nodes
An ESG Gateway will periodically download metrics data (SRM, OPeNDAP, LAS, server hardware performance) gathered by a Data Node for a given interval of time
Returned metrics data will then be stored at the ESG Gateway for future metrics reports
25
Metrics Requirements
26
Metrics Requirements
27
Metrics progress
The gathering of important metrics for the ESG Gateway has been completed
• User registrations
• User logins
• File downloads
• User clickstreams
• Browser type usage
Report generation for key metrics has been completed
• Total users registered, including monthly trends
• Total files downloaded, including monthly trends
28
Metrics Plan Going Forward
Several improvements are required in the near term for Metrics
Design and development of the Data Node “black box” metrics gathering software
Design and development of auto generated report notifications via email
Design and development of a star schema for the metrics database
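As an illustration of what a star schema for download metrics might look like, the Python sketch below builds one fact table and three dimension tables in an in-memory SQLite database and computes a monthly download trend. The table names, column names, and sample rows are invented for this example; the actual ESG schema is still to be designed.

```python
import sqlite3

# Hypothetical star schema: one fact table referencing three dimensions.
SCHEMA = """
CREATE TABLE dim_user (user_id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE dim_file (file_id INTEGER PRIMARY KEY, path TEXT, size_bytes INTEGER);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_download (
    user_id INTEGER REFERENCES dim_user(user_id),
    file_id INTEGER REFERENCES dim_file(file_id),
    date_id INTEGER REFERENCES dim_date(date_id)
);
"""


def monthly_downloads(conn):
    """Return [(year, month, download_count)] ordered by date."""
    return conn.execute(
        """SELECT d.year, d.month, COUNT(*)
           FROM fact_download f JOIN dim_date d ON f.date_id = d.date_id
           GROUP BY d.year, d.month ORDER BY d.year, d.month"""
    ).fetchall()


# Made-up sample data, only to exercise the query.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO dim_user VALUES (1, 'alice')")
conn.execute("INSERT INTO dim_file VALUES (1, 'tas.nc', 1024)")
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [(1, 2009, 1), (2, 2009, 2)])
conn.executemany("INSERT INTO fact_download VALUES (?, ?, ?)",
                 [(1, 1, 1), (1, 1, 1), (1, 1, 2)])
```

Keeping each download as a narrow fact row with keys into the dimensions makes reports such as "total files downloaded, including monthly trends" a single join and group-by.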
29
Data Versioning
Data changes, even after publication
• Errors in simulation, processing, metadata, etc.
Critically important that data publishers and consumers can identify which version of data they are working with
• Changes to data may affect results of analyses
Versioning previously handled manually
• Adequate for moderate amounts of closely controlled data (current production archives)
• Insufficient for global scale, especially with replication (key driver)
Now putting versioning on formal footing
• In collaboration with BADC, MPIM
Initial focus on identification of key use cases, developing and evaluating preliminary software designs
30
Proposed Versioning Software Design
31
Product Services: Delivering Visualization and Analysis to Users
Product Services provide a web-based easy-to-use interface to a vast array of interactive, science-relevant information products
• Make plots in 1 and 2 dimensions along any axis or combination of two axes including animation along the time axis
• Control plot appearance
• Launch external tools either via scripts to access data in desktop tools or direct launch of Google Earth
• Compare different data sets and variables in specialized user interface
• Request server-side analysis and view the results
• Supports plots of curvilinear data grids and on-the-fly re-gridding to rectangular grids
Web-based administrative interface for cache management
32
Product Services Architecture
Designed to integrate many data types and products from many legacy applications into a unified user-controlled environment
Combines the incoming request with metadata to learn where the data are and what protocol is needed to read them, then instructs backend services to read the data and create products
33
Product Services Offer Diverse Capabilities (1/2)
Product Services provide a Web-based easy-to-use interface to a vast array of interactive, science-relevant information products
Compute on-the-fly analysis via efficient server-side functions and plot the result
Launch external tools like Google Earth, Matlab and others
34
Product Services Offer Diverse Capabilities (2/2)
Make comparisons along an axis and/or between data sets
Make comparisons along different cutting planes and/or between data sets