Big data at experimental facilities
TRANSCRIPT
Data management at experimental facilities
Ian Foster
Joint work with Rachana Ananthakrishnan, Ben Blaiszik, Kyle Chard, Francesco de Carlo, Ray Osborn, Nicholas Schwarz, Hemant Sharma, Steve Tuecke, Mike Wilde, Justin Wozniak, and many others
Near field-HEDM workflow (Sharma, Almer)**
Then: Experimenting in the dark
• Feedback during each experiment was non-existent; required months to analyze data
Now: Working in the light
• Initial feedback over lunch, using Globus, Swift, and Catalog to manage and track data and leverage HPC
** Supported by Data Engines for Big Data LDRD (Wilde, Wozniak, Sharma, Almer, Blaiszik, Foster)
Architecture for APS experiments
[Architecture diagram: APS beamlines connected to external compute resources]
Globus Catalog for tracking data
• Automate metadata ingestion from instrumentation and acquisition machines
  – API/CLI integration
• Allow near real-time metadata-driven feedback to experiments
• Allow for insertion points in the workflow
  – Ingest at point of collection
  – Catalog metadata and provenance
  – Push to data store
  – Push to local or external HPC
• Allow building and sharing of typed metadata definitions
  – E.g., build a definition set that specifically fits X-ray scattering data at your beamline
  – Addresses the problem of T, temp, Temp, temperature, temperature_kelvin, ...
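The typed-definition idea can be sketched in a few lines of Python: a beamline-specific definition set maps the many spellings of a field (T, temp, Temp, temperature, ...) onto one canonical, typed key. The schema and field names below are illustrative only, not the actual Globus Catalog API.

```python
# Sketch: normalize free-form instrument metadata against a typed
# definition set. CANONICAL_FIELDS is hypothetical, not a Globus schema.
CANONICAL_FIELDS = {
    "temperature_kelvin": {"aliases": {"t", "temp", "temperature"}, "type": float},
    "sample_id": {"aliases": {"sample", "specimen"}, "type": str},
}

def normalize_record(raw: dict) -> dict:
    """Map raw metadata keys onto canonical, typed field names."""
    alias_map = {}
    for canon, spec in CANONICAL_FIELDS.items():
        alias_map[canon] = canon
        for alias in spec["aliases"]:
            alias_map[alias] = canon
    out = {}
    for key, value in raw.items():
        canon = alias_map.get(key.lower())
        if canon is None:
            continue  # unknown keys dropped; could instead be flagged for curation
        out[canon] = CANONICAL_FIELDS[canon]["type"](value)
    return out

print(normalize_record({"Temp": "295.4", "sample": "Cu-wire-01"}))
```

In practice such normalization would run at the ingest insertion point, so every record entering the catalog is already queryable under one field name.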
• Group data based on use and features, not location/filename
  – Logical grouping to organize, search, and describe
• Operate on datasets as units
• Tag datasets with characteristics that reflect content
• Share/move datasets for collaboration
• Interact via REST API, Python API, GUI, and CLI
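A minimal sketch of the "dataset as a unit" idea in Python: members are grouped logically rather than by path, tagged with content characteristics, and queried as a unit. This is a hypothetical abstraction, not the real Globus Catalog client.

```python
# Sketch of a logical dataset: membership by reference, tags describing
# content, matched as a unit. Names and URL scheme here are illustrative.
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    members: set = field(default_factory=set)   # file references, any location
    tags: dict = field(default_factory=dict)    # characteristic -> value

    def add(self, *refs):
        self.members.update(refs)

    def tag(self, **kv):
        self.tags.update(kv)

    def matches(self, **query):
        """True if every queried tag matches this dataset's tags."""
        return all(self.tags.get(k) == v for k, v in query.items())

scan = Dataset("hedm-run-42")
scan.add("globus://aps#data/run42/layer0.tif",
         "globus://aps#data/run42/layer1.tif")
scan.tag(beamline="1-ID", technique="NF-HEDM", temperature_kelvin=295.4)
print(scan.matches(beamline="1-ID"))
```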
[Globus Catalog data model: Catalog → Datasets → Members]
Globus Catalog web user interface
• View and search existing catalogs
• Search datasets by tag and/or text
• Add fine-grained ACLs by dataset
• Create catalog-specific tag definitions
Catalog-NexPy integration
Globus data publication – www.globus.org/data-publication
• Operated as a hosted service
• Designed for Big Data
• Bring your own (per-collection) storage
• Extensible metadata schemas and input forms
• Customizable publication and curation workflows
• Associate unique and persistent digital identifiers with datasets
• Rich discovery model (in dev)
Early applications include:
– Materials science (Materials Data Facility)
– Climate simulation (ACME)
– Genomics, medical imaging, astronomy, etc.
...all of this via SaaS and with your own (institutional or personal) resources or cloud resources
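The publication workflow described above (submission, curation, then a persistent identifier on the published dataset) can be sketched as a small state machine. The handle-style identifier scheme below is purely illustrative; the actual Globus publication service mints real persistent identifiers.

```python
# Sketch of a publication pipeline: submitted -> in_curation -> published,
# with a persistent-looking identifier minted at publish time.
# The "hdl:" scheme and collection name are hypothetical.
import hashlib

STATES = ["submitted", "in_curation", "published"]

def mint_identifier(collection: str, metadata: dict) -> str:
    """Derive a stable identifier from collection name plus metadata digest."""
    digest = hashlib.sha256(repr(sorted(metadata.items())).encode()).hexdigest()[:8]
    return f"hdl:{collection}/{digest}"

def advance(record: dict) -> dict:
    """Move a record one step along the curation workflow."""
    i = STATES.index(record["state"])
    record["state"] = STATES[min(i + 1, len(STATES) - 1)]
    if record["state"] == "published" and "identifier" not in record:
        record["identifier"] = mint_identifier(record["collection"],
                                               record["metadata"])
    return record

rec = {"state": "submitted", "collection": "mdf",
       "metadata": {"title": "shale porosity scans"}}
advance(rec)
advance(rec)
print(rec["state"], rec["identifier"])
```

Customizable curation would slot in between the two `advance` calls, e.g. a curator approving or rejecting the submission before an identifier is ever minted.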
Summary
Transfer • User Authentication • Groups • Sharing • Data Publication • Data Cataloging • Automation and Workflows
High-performance computing at experimental facilities
Ian Foster
Joint work with Rachana Ananthakrishnan, Ben Blaiszik, Kyle Chard, Francesco de Carlo, Ray Osborn, Nicholas Schwarz, Hemant Sharma, Steve Tuecke, Mike Wilde, Justin Wozniak, and many others
APS experimentalists use ALCF for data reconstruction and analysis — three examples:
Single-crystal diffuse scattering
Defect structure in disordered materials (Osborn, Wilde, Wozniak, et al.). Estimate structure via inverse modeling: many-simulation evolutionary optimization on 100K+ BG/Q cores (Swift+OpenMP).
Near-field high-energy X-ray diffraction microscopy
Microstructure in bulk materials (Almer, Sharma, et al.). Reconstruction on 10K+ BG/Q cores (Swift + MPI-IO) takes ~10 minutes, vs. >5 hours on the APS cluster or months if data is taken home. Used to detect errors in one run that would have resulted in total waste of beam time.
X-ray nano/microtomography
Bio, geo, and material science imaging (Bicer, Gursoy, Kettimuthu, De Carlo, et al.). Innovative in-slice parallelization method permits reconstruction of a 360×2048×1024 dataset in ~1 minute, using 32K BG/Q cores, vs. many days on a typical cluster: enables quasi-instant response.
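The in-slice parallelization idea can be illustrated with a small sketch: rather than assigning whole slices to workers, the rows of a single slice's reconstruction grid are partitioned so that many cores cooperate on one slice. The chunking scheme below is an assumption for illustration, not the published algorithm.

```python
# Sketch of in-slice partitioning: split one slice's rows into
# near-equal contiguous chunks, one per worker.
def split_slice(n_rows: int, n_workers: int):
    """Return (start, end) row ranges covering n_rows across n_workers."""
    base, extra = divmod(n_rows, n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)  # spread the remainder
        chunks.append((start, start + size))
        start += size
    return chunks

# e.g. one 2048-row slice spread over 6 workers
print(split_slice(2048, 6))
```

Because every chunk of a slice can be reconstructed as soon as that slice's projections arrive, this is what makes quasi-instant per-slice feedback possible.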
[Diagram: APS beamlines 2-BM, 1-ID, and 6-ID feeding a Populate → Sim → Select → Sim loop]
[Figure captions, Advanced Photon Source: microstructure of a copper wire, 0.2 mm diameter; experimental and simulated scattering from manganite; micrometer porosity structure of shale samples]
Near Field-HEDM using Mira via Swift
Single integrated cross-system script – 4 GB processed every 4-10 minutes
[Figure: Assess step; red indicates higher statistical confidence in data]
Boosting Light Source Productivity with Swift ALCF Data Analysis
H Sharma, J Almer (APS); J Wozniak, M Wilde, I Foster (MCS)
Impact and Approach
• HEDM imaging and analysis shows granular material structure non-destructively
• APS Sector 1 scientists use Mira to process data from live HEDM experiments, providing real-time feedback to correct or improve in-progress experiments
• Scientists working with the Discovery Engines LDRD developed new Swift analysis workflows to process APS data from Sectors 1, 6, and 11
Accomplishments
• Mira analyzes an experiment in 10 minutes vs. 5.2 hours on the APS cluster: >30X improvement
• Scaling up to ~128K cores (driven by data features)
• A cable flaw was found and fixed at the start of an experiment, saving an entire multi-day experiment and valuable user time and APS beam time
• In press: High-Energy Synchrotron X-ray Techniques for Studying Irradiated Materials, J-S Park et al, J. Mat. Res.
• Big data staging with MPI-IO for interactive X-ray science, J Wozniak et al, Big Data Conference, Dec 2014
ALCF Contributions
• Design, develop, support, and trial user engagement to make the Swift workflow solution on ALCF systems a reliable, secure, and supported production service
• Creation and support of the Petrel data server
• Reserved resources on Mira for APS HEDM experiment at Sector 1-ID beamline (8/10/2014 and future sessions in APS 2015 Run 1)
[Workflow diagram, steps 1-5: Analyze → Fix → Re-analyze → Valid Data!]
Swift provides four important transparencies
• Parallelism: implicitly parallel functional dataflow programming
• Location: runs your script on multiple distributed sites and diverse computing resources (desktop to petascale)
• Failure recovery: retries/relocates failing tasks; can restart failing runs from the point of failure
• Provenance capture: tasks have recordable inputs and outputs
swift-lang.org
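Two of these transparencies, implicit parallelism and failure recovery, can be illustrated in Python (not Swift) with futures and a retry wrapper. The retry policy and flaky task below are assumptions for demonstration, not Swift's actual scheduler logic.

```python
# Sketch (Python, not Swift): parallel tasks via futures, with a retry
# wrapper standing in for Swift's failure-recovery transparency.
from concurrent.futures import ThreadPoolExecutor

def with_retries(fn, *args, attempts=3):
    """Re-run a failing task up to `attempts` times before giving up."""
    for i in range(attempts):
        try:
            return fn(*args)
        except Exception:
            if i == attempts - 1:
                raise

calls = {"n": 0}

def flaky_task(x):
    """Squares x, but the very first invocation simulates a node failure."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated node failure")
    return x * x

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(with_retries, flaky_task, x) for x in range(4)]
    results = [f.result() for f in futures]
print(results)
```

In Swift this bookkeeping is invisible to the script author: the dataflow runtime parallelizes and retries on its own, which is the point of calling these "transparencies".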
Swift parallel scripting: LAMMPS
Tasks of varying sizes packed into big MPI run
[Utilization plot — black: compute, blue: message, white: idle]
filter = input_file(data_directory/"input.inp.filter");
foreach i in [0:20] {
  t = 300 + i;
  sed_command = sprintf("s/_TEMPERATURE_/%i/g", t);
  lammps_file_name = sprintf("input-%i.inp", t);
  lammps_args = "-i " + lammps_file_name;
  file lammps_input<lammps_file_name> =
    sed(filter, sed_command) =>
      @par=8 lammps(lammps_args);
}
Invoke LAMMPS via its C++ API
swift-lang.org
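For readers unfamiliar with Swift syntax, the sweep above can be mirrored in plain Python: 21 LAMMPS input files (Swift's `[0:20]` range is inclusive) are generated from one template by substituting `_TEMPERATURE_`, which is what the `sed` step does. The template text is a made-up stand-in, and the actual `@par=8 lammps(...)` launch is omitted.

```python
# Python sketch of the Swift parameter sweep: generate one LAMMPS input
# per temperature by substituting _TEMPERATURE_ in a template.
# The template content is illustrative, not a real LAMMPS deck.
template = "units real\nvariable T equal _TEMPERATURE_\nrun 1000\n"

def make_inputs(start=300, count=21):
    """Return {filename: contents} for `count` temperatures from `start`."""
    inputs = {}
    for i in range(count):
        t = start + i
        inputs[f"input-{t}.inp"] = template.replace("_TEMPERATURE_", str(t))
    return inputs

inputs = make_inputs()
print(len(inputs), sorted(inputs)[0])
```

What Swift adds over this sketch is the implicit parallelism: each of the 21 `lammps` invocations runs concurrently, packed into one big MPI job, with no explicit task management in the script.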
Powder diffraction workflow – derived from HEDM: Makes a notable difference to APS users
• Background removal extracted into a separate step for the Powder Diffraction beamline (Sector 1)
• Used over 200 times by 30 APS users to process 50 TB in the past 6 months
• Enables users to test data quality at beam time, and to leave APS with all their data, ready to analyze
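As a rough illustration of what a background-removal step does, the sketch below subtracts a per-pixel background frame from a detector frame and clips at zero. The real beamline code operates on large TIFF stacks with a calibrated background model; plain nested lists keep this sketch dependency-free.

```python
# Sketch of detector background removal: per-pixel subtraction of a
# background frame, clipped at zero. Pixel values are illustrative.
def remove_background(frame, background):
    """Subtract background from frame element-wise, flooring at 0."""
    return [[max(p - b, 0) for p, b in zip(row, brow)]
            for row, brow in zip(frame, background)]

frame      = [[10, 52, 11], [12, 90, 10]]
background = [[11, 12, 12], [11, 12, 11]]
print(remove_background(frame, background))
```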
Diffuse scattering workflow using ALCF
Determines the crystal configuration that produced a given scattering image, through simulation and an evolutionary algorithm
Crystal coordinate transformation for diffuse scattering workflow
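The inverse-modeling loop (populate a candidate population, simulate scattering for each, select the closest matches, mutate, repeat) can be sketched at toy scale. The quadratic `simulate` function is a stand-in for a real scattering simulation, and the population sizes and mutation scale are arbitrary; the production runs use many-simulation evolutionary optimization on 100K+ BG/Q cores.

```python
# Toy sketch of the populate -> simulate -> select -> mutate loop used
# for inverse modeling. simulate() is a stand-in forward model.
import random

def simulate(param):
    """Stand-in forward model: real code would compute a scattering image."""
    return param * param

def evolve(target, generations=60, pop_size=16, seed=1):
    rng = random.Random(seed)
    pop = [rng.uniform(0.0, 10.0) for _ in range(pop_size)]   # populate
    for _ in range(generations):
        pop.sort(key=lambda p: abs(simulate(p) - target))     # select
        parents = pop[: pop_size // 4]
        pop = parents + [p + rng.gauss(0, 0.1)                # mutate
                         for p in parents for _ in range(3)]
    return min(pop, key=lambda p: abs(simulate(p) - target))

best = evolve(target=25.0)
print(round(best, 2))
```

Each generation's simulations are independent, which is exactly what makes the real workflow embarrassingly parallel across BG/Q cores.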