2013-06-21 Computing for Light Sources
DESCRIPTION
Presented at the Computing for Light and Neutron Sources Technical Forum. Discusses Globus Online transfer, sharing, and metadata management in the context of collaboration with the Advanced Photon Source.
TRANSCRIPT
Globus Online for Managing Tomography Data at APS
Rachana Ananthakrishnan, Francesco De Carlo
Argonne National Lab
We started with reliable, secure, high-performance file transfer …
1. User initiates transfer request
2. Globus Online moves and syncs files (data source → data destination)
3. Globus Online notifies user
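The three steps above can be sketched as a toy Python model. This is illustrative only, not the Globus Online API: the `transfer` function and dict-based "endpoints" are invented stand-ins.

```python
# Illustrative sketch only (not the real Globus Online API): modeling the
# three-step flow with in-memory "endpoints" as dicts of path -> bytes.
def transfer(source, dest, notify):
    """Step 1: the user initiates the request by calling transfer().
    Step 2: move and sync -- copy only files that are missing from the
    destination or whose contents differ.
    Step 3: notify the user with a summary of what was done."""
    moved = []
    for path, data in source.items():
        if dest.get(path) != data:   # sync: skip files already identical
            dest[path] = data
            moved.append(path)
    notify(f"Transfer complete: {len(moved)} file(s) updated")

src = {"scan1.h5": b"proj-a", "scan2.h5": b"proj-b"}
dst = {"scan1.h5": b"proj-a"}        # scan1 already in sync
inbox = []
transfer(src, dst, inbox.append)     # only scan2.h5 is actually copied
```

The sync step is what makes the service "fire and forget": re-running the same request after a failure copies only what is still missing.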
… and then made it simple to share big data off existing storage systems
1. User A selects file(s) to share, selects user or group, and sets permissions
2. Globus Online tracks shared files; no need to move files to cloud storage!
3. User B logs in to Globus Online and accesses shared file
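A minimal sketch of this sharing model, with invented class and method names: the point it illustrates is that the files stay on the existing storage system and only the access-control state is tracked.

```python
# Toy model of the sharing flow; names are illustrative, not the real API.
# Key idea: data stays in place on the source storage, and the service
# only tracks who may access what.
class SharedStorage:
    def __init__(self, files):
        self.files = files   # data stays in place; nothing moves to the cloud
        self.acl = {}        # path -> set of principals with read access

    def share(self, path, principal):
        # Step 1: User A grants access. Step 2: only the grant is recorded.
        self.acl.setdefault(path, set()).add(principal)

    def access(self, user, path):
        # Step 3: User B logs in and reads the shared file.
        if user not in self.acl.get(path, set()):
            raise PermissionError(f"{user} may not read {path}")
        return self.files[path]

store = SharedStorage({"tomo/scan42.h5": b"projections"})
store.share("tomo/scan42.h5", "userB")
store.access("userB", "tomo/scan42.h5")   # permitted; others are refused
```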
Transforming data acquisition: Current
• Experimental parameters optimized manually
• Collected data combined with visual inspection to confirm optimal condition
• Data reconstructed and sent to users via external drive
• User team starts data reduction at home institution
Transforming data acquisition: Envisaged
• Experimental parameters optimized automatically
• Collected data available to optimization programs
• Data are automatically reconstructed, reduced, and shared with local and remote participants
• User team leaves the APS with reduced data
Globus Online as enabler
Facility data acquisition → Globus Online transfer service → Reduced data → Analysis/Sharing (Globus Online sharing service, Globus Online dataset service*)
* In development
Credit: Kerstin Kleese-van Dam
Erin Miller (PNNL) collects data at Advanced Photon Source, renders at PNNL, and views at ANL
Looking at how researchers use data
• A single research question often requires the integration of many data elements that are:
– In different locations
– In different formats (Excel, text, CDF, HDF, …)
– Described in different ways
• Best grouping can vary during investigation
– Longitudinal, vertical, cross-cutting
• But always needs to be operated on as a unit
– Share, annotate, process, copy, archive, …
How do we manage data today?
• Often, a curious mix of ad hoc methods
– Organize in directories using file and directory naming conventions
– Capture status in README files, spreadsheets, notebooks
– Even PowerPoint!
• Time-consuming, complex, error-prone
Why can’t we manage our data like we manage our pictures and music?
Introducing the dataset
• Group data based on use, not location
– Logical grouping to organize, reorganize, search, and describe usage
• Tag with characteristics that reflect content …
– Capture as much existing information as we can
• … or to reflect current status in investigation
– Stage of processing, provenance, validation, …
• Share datasets for collaboration
– Control access to data and metadata
• Operate on datasets as units
– Copy, export, analyze, tag, archive, …
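The dataset abstraction above can be sketched as a small Python class. This is a hypothetical illustration, not the Globus dataset API: the class, its methods, and the example URLs are all invented.

```python
# Hypothetical sketch of the "dataset" abstraction: a logical group of
# data elements in different locations and formats, tagged and operated
# on as a unit. Not the Globus dataset API.
class Dataset:
    def __init__(self, name):
        self.name = name
        self.members = []    # references to data in different locations/formats
        self.tags = {}       # characteristics: content, status, provenance, ...
        self.readers = set() # collaborators with access to data and metadata

    def add(self, ref):
        self.members.append(ref)          # group by use, not location

    def tag(self, name, value):
        self.tags[name] = value           # e.g. stage of processing

    def share(self, user):
        self.readers.add(user)            # control access for collaboration

    def copy_to(self, dest):
        # operate on the whole dataset as one unit
        return [dest + "/" + ref.rsplit("/", 1)[-1] for ref in self.members]

ds = Dataset("june-run")
ds.add("gridftp://aps.anl.gov/data/scan1.h5")     # illustrative URLs
ds.add("file:///home/user/notes/scan1.xlsx")
ds.tag("stage", "reconstructed")
ds.copy_to("/archive/june-run")   # one operation covers every member
```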
Expanding Globus Online services
• Ingest and publication
– Imagine a Dropbox that not only replicates, but also extracts metadata, catalogs, and converts
• Cataloging
– Virtual views of data based on user-defined and/or automatically extracted metadata
• Integration with computation
– Associate computational procedures, orchestrate applications, catalog results, record provenance
Builds on catalog as a service
Approach:
• Hosted user-defined catalogs
• Based on tag model: <subject, name, value>
• Optional schema constraints
• Integrated with other Globus services
Three REST APIs:
• /query/: retrieve subjects
• /tags/: create, delete, retrieve tags
• /tagdef/: create, delete, retrieve tag definitions
Builds on USC Tagfiler project (C. Kesselman et al.)
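The tag model and the three REST APIs can be mirrored by a toy in-memory catalog. The Python method names below are invented stand-ins for the HTTP calls, and the example subjects and tags are illustrative.

```python
# In-memory toy mirroring the <subject, name, value> tag model and the
# three REST APIs (/tagdef/, /tags/, /query/). Names are illustrative.
class Catalog:
    def __init__(self):
        self.tagdefs = {}   # tag name -> expected value type (optional schema)
        self.tags = []      # <subject, name, value> triples

    def create_tagdef(self, name, value_type):       # stands in for POST /tagdef/
        self.tagdefs[name] = value_type

    def create_tag(self, subject, name, value):      # stands in for POST /tags/
        expected = self.tagdefs.get(name)
        if expected is not None and not isinstance(value, expected):
            raise TypeError(f"tag {name!r} expects {expected.__name__}")
        self.tags.append((subject, name, value))

    def query(self, name, value):                    # stands in for GET /query/
        return {s for s, n, v in self.tags if n == name and v == value}

cat = Catalog()
cat.create_tagdef("resolution_nm", int)              # optional schema constraint
cat.create_tag("scan42", "resolution_nm", 30)
cat.create_tag("scan42", "beamline", "32-ID")
cat.create_tag("scan43", "beamline", "2-BM")
cat.query("beamline", "32-ID")                       # subjects tagged 32-ID
```

Because tags are free-form triples, users can build virtual views of their data without deciding a schema up front; the optional tag definitions add type checking only where it is wanted.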
Exemplar: APS Beamlines 32-ID & 2-BM
X-ray imaging, tomography, ~few µm to 30 nm resolution
Currently can generate up to 100 TB per day
< 1 GB/s data rate; ~3-5 GB/s in 5-10 years
Multi-scale 3D imaging data fusion at APS
[Figure: two parallel pipelines. Beamline 2-BM (~1.5 µm resolution): storage → image processing (normalization, etc.) → tomographic reconstruction → visual inspection → selection. Beamline 32-ID-C (20-50 nm resolution): image processing (alignment, etc.) → tomographic reconstruction → visual inspection → selection. The two selections feed multi-scale image fusion, followed by a final visual inspection. Detector specs: up to 100 fps, 2K x 2K, 16 bits, 11 GB raw data; 1,500 fps, 2K x 2K, 16 bits, 1 min readout, 11 GB raw data.]
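As a sanity check, the quoted rates are consistent with the detector geometry, assuming "2K x 2K, 16 bits" means 2048 x 2048 pixels at 2 bytes each:

```python
# Back-of-envelope check of the detector numbers quoted above.
frame_bytes = 2048 * 2048 * 2            # 2K x 2K pixels, 16 bits = 2 bytes

rate_100fps_gbs = 100 * frame_bytes / 1e9
print(rate_100fps_gbs)                   # ~0.84 GB/s, matching "< 1 GB/s"

frames_per_scan = 11e9 / frame_bytes     # frames in an 11 GB raw scan
print(round(frames_per_scan))            # ~1311 frames

rate_1500fps_gbs = 1500 * frame_bytes / 1e9
print(rate_1500fps_gbs)                  # ~12.6 GB/s burst; the 1 min
                                         # readout lowers the sustained rate
```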
Argonne Collaborations
[Figure: collaboration map linking the APS Imaging Group, the APS Software Service Group, and Mathematics & Computer Science / Computation Institute across instrument & data collection, data management services, multi-scale image fusion, and system integration, supported by the Infrastructure LDRD and the Tao of Fusion LDRD. Results: Google Earth-style zoom-in data navigation.]
Timelines
• July: Alpha service available
• August: Pilot with two groups at APS
• Fall of this year: Pilot with a few other groups at APS; early beta
Thank You
• Interested in working with us on the dataset service? Email: [email protected]
• Contact: [email protected]
• Website: www.globusonline.org