Big data at experimental facilities


Page 1: Big data at experimental facilities

Data management at experimental facilities

Ian Foster

[email protected]

Joint work with Rachana Ananthakrishnan, Ben Blaiszik, Kyle Chard, Francesco de Carlo, Ray Osborn, Nicholas Schwarz, Hemant Sharma, Steve Tuecke, Mike Wilde, Justin Wozniak, and many others

Page 2: Big data at experimental facilities

Near-field HEDM workflow (Sharma, Almer)

Then: Experimenting in the dark
• Feedback during each experiment was non-existent; it required months to analyze data

Page 3: Big data at experimental facilities

Near-field HEDM workflow (Sharma, Almer)

** Supported by the Discovery Engines for Big Data LDRD (Wilde, Wozniak, Sharma, Almer, Blaiszik, Foster)

Then: Experimenting in the dark
• Feedback during each experiment was non-existent; it required months to analyze data

Now: Working in the light
• Initial feedback over lunch, using Globus, Swift, and Catalog to manage and track data and to leverage HPC

Page 4: Big data at experimental facilities

Architecture for APS experiments


[Architecture diagram; legible label: external compute resources]

Page 5: Big data at experimental facilities
Page 6: Big data at experimental facilities
Page 7: Big data at experimental facilities
Page 8: Big data at experimental facilities

Globus Catalog for tracking data

• Automate metadata ingestion from instrumentation and acquisition machines
  – API/CLI integration (see the ingestion sketch after this list)

• Allow near-real-time, metadata-driven feedback to experiments

• Allow for insertion points in the workflow
  – Ingest at point of collection
  – Catalog metadata and provenance
  – Push to data store
  – Push to local or external HPC

• Allow building and sharing of typed metadata definitions
  – E.g., build a definition set that specifically fits X-ray scattering data at your beamline
  – Addresses the problem of T, temp, Temp, temperature, temperature_kelvin, ...
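As a rough illustration of ingestion at the point of collection, here is a hedged Python sketch that registers a newly acquired scan's metadata with a catalog over HTTP. The endpoint URL, token, and field names are placeholders for illustration, not the actual Globus Catalog API.

import json
import requests

# Placeholder endpoint and credentials -- not the real Globus Catalog interface.
CATALOG_URL = "https://catalog.example.org/api/datasets"
HEADERS = {"Authorization": "Bearer REPLACE_WITH_TOKEN",
           "Content-Type": "application/json"}

def ingest_scan(scan_file, beamline, sample, temperature_kelvin):
    """Register one acquired scan with the catalog at the point of collection."""
    metadata = {
        "name": scan_file,
        "beamline": beamline,                       # e.g. "1-ID"
        "sample": sample,
        "temperature_kelvin": temperature_kelvin,   # one agreed, typed spelling of "temperature"
    }
    resp = requests.post(CATALOG_URL, headers=HEADERS,
                         data=json.dumps(metadata), timeout=30)
    resp.raise_for_status()
    return resp.json()   # the catalog's record for the new entry

if __name__ == "__main__":
    ingest_scan("scan_0001.h5", beamline="1-ID",
                sample="Cu wire, 0.2 mm", temperature_kelvin=300.0)

A script like this can be invoked by the acquisition machine after every scan, which is what makes near-real-time, metadata-driven feedback possible.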

Page 9: Big data at experimental facilities


• Group data based on use and features, not location/filename
  – Logical grouping to organize, search, and describe

• Operate on datasets as units

• Tag datasets with characteristics that reflect content

• Share/move datasets for collaboration

• Interact via REST API, Python API, GUI, and CLI (a tagging and search sketch follows below)

[Figure: plain files and directories vs. the Globus Catalog hierarchy of catalogs, datasets, and members]
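To make "operate on datasets as units" concrete, the hedged Python sketch below tags a dataset and then searches by tag and text through the same placeholder REST interface used above; the paths, parameters, and query syntax are illustrative, not the actual Globus Catalog API.

import requests

CATALOG_URL = "https://catalog.example.org/api"     # placeholder, not the real service
HEADERS = {"Authorization": "Bearer REPLACE_WITH_TOKEN"}

def tag_dataset(dataset_id, tags):
    """Attach content-describing tags to an existing dataset, treated as a unit."""
    resp = requests.post(f"{CATALOG_URL}/datasets/{dataset_id}/tags",
                         headers=HEADERS, json=tags, timeout=30)
    resp.raise_for_status()

def search_datasets(query):
    """Find datasets by tag and/or free text, independent of file location or name."""
    resp = requests.get(f"{CATALOG_URL}/datasets",
                        headers=HEADERS, params={"q": query}, timeout=30)
    resp.raise_for_status()
    return resp.json()

# Example: tag a scattering dataset, then find all manganite datasets collected above 290 K.
tag_dataset("ds-42", {"material": "manganite", "temperature_kelvin": 300.0})
print(search_datasets("material:manganite AND temperature_kelvin>290"))

The point of the sketch is that every operation names a dataset or a query over tags, never a directory path on a particular file system.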

Page 10: Big data at experimental facilities

Globus Catalog web user interface

Page 11: Big data at experimental facilities

View and search existing catalogs

Page 12: Big data at experimental facilities

Search datasets by tag and/or text

Page 13: Big data at experimental facilities

Add fine-grained ACLs by dataset

Page 14: Big data at experimental facilities

Create catalog-specific tag definitions

Page 15: Big data at experimental facilities

Catalog-NexPy integration

Page 16: Big data at experimental facilities

Catalog-NexPy integration

Page 17: Big data at experimental facilities

Globus data publication (www.globus.org/data-publication)

• Operated as a hosted service

• Designed for Big Data

• Bring your own (per-collection) storage

• Extensible metadata schemas and input forms

• Customizable publication and curation workflows

• Associate unique and persistent digital identifiers with datasets

• Rich discovery model (in dev)


Early applications include:
– Materials science (Materials Data Facility)
– Climate simulation (ACME)
– Genomics, medical imaging, astronomy, etc.

Page 18: Big data at experimental facilities

Summary

Transfer • User authentication • Groups • Sharing • Data publication • Data cataloging • Automation and workflows

...all of this via SaaS, and with your own (institutional or personal) resources or cloud resources

Page 19: Big data at experimental facilities

High-performance computing at experimental facilities

Ian Foster

[email protected]

Joint work with Rachana Ananthakrishnan, Ben Blaiszik, Kyle Chard, Francesco de Carlo, Ray Osborn, Nicholas Schwarz, Hemant Sharma, Steve Tuecke, Mike Wilde, Justin Wozniak, and many others

Page 20: Big data at experimental facilities

APS experimentalists use ALCF for data reconstruction and analysis; three examples:

Single-crystal diffuse scattering: defect structure in disordered materials (Osborn, Wilde, Wozniak, et al.). Estimate structure via inverse modeling: many-simulation evolutionary optimization on 100K+ BG/Q cores (Swift + OpenMP).

Near-field high-energy X-ray diffraction microscopy: microstructure in bulk materials (Almer, Sharma, et al.). Reconstruction on 10K+ BG/Q cores (Swift + MPI-IO) takes ~10 minutes, vs. >5 hours on the APS cluster or months if the data are taken home. Used to detect errors in one run that would otherwise have wasted the beamtime entirely.

X-ray nano/microtomography: bio, geo, and materials science imaging (Bicer, Gursoy, Kettimuthu, De Carlo, et al.). An innovative in-slice parallelization method permits reconstruction of a 360x2048x1024 dataset in ~1 minute using 32K BG/Q cores, vs. many days on a typical cluster, enabling quasi-instant response.

[Figure: Advanced Photon Source beamlines 2-BM, 1-ID, and 6-ID feeding ALCF analyses; populate/simulate/select loop; captions: microstructure of a copper wire, 0.2 mm diameter; experimental and simulated scattering from manganite; micrometer porosity structure of shale samples]

Page 21: Big data at experimental facilities

Near-field HEDM using Mira via Swift: a single integrated cross-system script; 4 GB processed every 4-10 minutes

Page 22: Big data at experimental facilities

Boosting Light Source Productivity with Swift ALCF Data Analysis
H. Sharma, J. Almer (APS); J. Wozniak, M. Wilde, I. Foster (MCS)

Impact and approach:
• HEDM imaging and analysis shows granular material structure non-destructively
• APS Sector 1 scientists use Mira to process data from live HEDM experiments, providing real-time feedback to correct or improve in-progress experiments
• Scientists working with the Discovery Engines LDRD developed new Swift analysis workflows to process APS data from Sectors 1, 6, and 11

Accomplishments:
• Mira analyzes an experiment in 10 minutes vs. 5.2 hours on the APS cluster: >30X improvement
• Scaling up to ~128K cores (driven by data features)
• A cable flaw was found and fixed at the start of one experiment, saving an entire multi-day experiment and valuable user time and APS beam time
• In press: High-Energy Synchrotron X-ray Techniques for Studying Irradiated Materials, J.-S. Park et al., J. Mat. Res.
• Big Data Staging with MPI-IO for Interactive X-ray Science, J. Wozniak et al., Big Data Conference, Dec 2014

ALCF contributions:
• Design, develop, support, and trial user engagement to make the Swift workflow solution on ALCF systems a reliable, secure, and supported production service
• Creation and support of the Petrel data server
• Reserved resources on Mira for the APS HEDM experiment at the Sector 1-ID beamline (8/10/2014 and future sessions in APS 2015 Run 1)

[Figure: five-step cycle of analyze, assess, fix, re-analyze, valid data; red indicates higher statistical confidence in the data]

Page 23: Big data at experimental facilities

Swift provides four important transparencies


Parallelism: implicitly parallel functional dataflow programming

Location: runs your script on multiple distributed sites and diverse computing resources (desktop to petascale)

Failure recovery: retries/relocates failing tasks; can restart failing runs from the point of failure

Provenance capture: tasks have recordable inputs and outputs

swift-lang.org

Page 24: Big data at experimental facilities

Swift parallel scripting: LAMMPS

Tasks of varying sizes are packed into one big MPI run. [Execution trace figure: black = compute, blue = message, white = idle]

filter = input_file(data_directory/"input.inp.filter");

foreach i in [0:20] {                                   // 21 runs, one per temperature, executed in parallel
  t = 300 + i;
  sed_command = sprintf("s/_TEMPERATURE_/%i/g", t);     // substitute the temperature into the input template
  lammps_file_name = sprintf("input-%i.inp", t);
  lammps_args = "-i " + lammps_file_name;
  file lammps_input<lammps_file_name> =
    sed(filter, sed_command) =>                         // generate the input file, then...
    @par=8 lammps(lammps_args);                         // ...run LAMMPS as an 8-process parallel task
}

Invoke LAMMPS via its C++ API

swift-lang.org

Page 25: Big data at experimental facilities

Powder diffraction workflow – derived from HEDM: Makes a notable difference to APS users

• Background-removal step extracted as a separate step for the powder diffraction beamline (Sector 1); a schematic sketch follows this list

• Used over 200 times by 30 APS users to process 50 TB in the past 6 months

• Enables users to test data quality at beam time, and to leave APS with all their data, ready to analyze
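Purely as an illustration of why extracting background removal into its own step is useful, here is a hedged Python sketch of one common form of background removal, dark-frame subtraction; the actual beamline step and its parameters may differ.

import numpy as np

def remove_background(frames, dark_frame):
    """Subtract a dark-field frame from each detector frame and clip negative values.

    frames: array of shape (n_frames, height, width); dark_frame: (height, width).
    """
    return np.clip(frames - dark_frame[None, :, :], 0, None)

# Because this runs as a separate step, users can inspect the cleaned frames at the
# beamline and judge data quality before committing to a full reconstruction.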

Page 26: Big data at experimental facilities

Diffuse scattering workflow using ALCF

Determines the crystal configuration that produced a given scattering image through simulation and an evolutionary algorithm (see the sketch below)
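A hedged, schematic Python sketch of that populate/simulate/select loop follows; simulate_scattering, the mismatch measure, and the mutation step are placeholders standing in for the workflow's real forward simulation and genetic-algorithm details.

import random

def mismatch(simulated, observed):
    """Placeholder fitness: sum of squared differences between the two images."""
    return sum((s - o) ** 2 for s, o in zip(simulated, observed))

def mutate(config):
    """Placeholder mutation: perturb one structural parameter slightly."""
    new = list(config)
    new[random.randrange(len(new))] += random.gauss(0.0, 0.01)
    return new

def evolve_structure(observed_image, populate, simulate_scattering,
                     generations=100, population_size=64, keep=16):
    """Propose crystal configurations, simulate their scattering, keep the best, repeat."""
    population = [populate() for _ in range(population_size)]
    for _ in range(generations):
        # Score every candidate by how well its simulated image matches the measurement.
        scored = sorted(population,
                        key=lambda cfg: mismatch(simulate_scattering(cfg), observed_image))
        survivors = scored[:keep]                                   # select
        population = survivors + [mutate(random.choice(survivors))  # repopulate
                                  for _ in range(population_size - keep)]
    return population[0]   # best configuration among those evaluated

In the actual workflow each simulate_scattering call is an independent task, which is why Swift can fan a generation's population out across 100K+ BG/Q cores.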

Page 27: Big data at experimental facilities

Crystal coordinate transformation for diffuse scattering workflow