Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
DESCRIPTION
Argonne’s Discovery Engines for Big Data project is working to enable new research modalities based on the integration of advanced computing with experiments at facilities such as the Advanced Photon Source (APS). I review science drivers and initial results in diffuse scattering, high energy diffraction microscopy, tomography, and ptychography. I also describe the computational methods and infrastructure that we leverage to support such applications, which include the Petrel online data store, ALCF supercomputers, Globus research data management services, and Swift parallel scripting. This work points to a future in which tight integration of DOE’s experimental and computational facilities enables both new science and more efficient and rapid discovery.
TRANSCRIPT
Discovery Engines for Big Data
Accelerating Discovery in Basic Energy Sciences
Ian Foster
Argonne National Laboratory
Joint work with Ray Osborn, Guy Jennings, Jon Almer, Hemant Sharma, Mike Wilde, Justin Wozniak, Rachana Ananthakrishnan, Ben Blaiszik, and many others
Work supported by Argonne LDRD
Motivating example: Disordered structures
“Most of materials science is bottlenecked by disordered structures”
Atomic disorder plays an important role in controlling the bulk properties of complex materials, for example:
Colossal magnetoresistance
Unconventional superconductivity
Ferroelectric relaxor behavior
Fast-ion conduction
And many, many others!
We want a systematic understanding of the relationships between material composition, temperature, structure, and other properties
A role for both experiment and simulation
Experiment: Observe (indirect) properties of real structures
E.g., single-crystal diffuse scattering at the Advanced Photon Source
Simulation: Compute properties of potential structures
E.g., DISCUS simulated diffuse scattering; molecular dynamics for structures
(Figure: a material composition, e.g. La 60% / Sr 40%, yields a simulated structure and simulated scattering; the sample yields experimental scattering.)
Opportunity: Integrate experiment & simulation
Experiments can explain and guide simulations
– E.g., guide simulations via evolutionary optimization
Simulations can explain and guide experiments
– E.g., identify temperature regimes in which more data is needed
(Figure: the discovery-engine loop. A sample yields experimental scattering; a material composition, e.g. La 60% / Sr 40%, yields a simulated structure and simulated scattering. Steps: detect errors (secs–mins); select experiments (mins–hours); simulations driven by experiments (mins–days); evolutionary optimization; knowledge-driven decision making; contribute to knowledge base. The knowledge base holds past experiments, simulations, literature, and expert knowledge.)
Opportunity: Link experiment, simulation, and data
analytics to create a discovery engine
Opportunities for discovery acceleration in energy
sciences are numerous and span DOE facilities
Single crystal diffuse scattering: defect structure in disordered materials (Osborn et al.) [beamline 6-ID]
High-energy x-ray diffraction microscopy: microstructure in bulk materials (Almer, Sharma, et al.) [beamline 1-ID]
Grazing incidence small angle x-ray scattering: directed self assembly (Nealey, Ferrier, De Pablo, et al.)
Common themes:
Large amounts of data
New mathematical and numerical methods
Statistical and machine learning methods
Rapid reconstruction and analysis
Large-scale parallel computation
End-to-end automation
Data management and provenance

More data plus new analysis methods enable new science processes. (Examples follow.)
Parallel pipeline enables real-time analysis of
diffuse scattering data, plus offline DIFFEV fitting
Use simulation and an evolutionary algorithm to determine the crystal configuration that can reproduce the observed scattering image (the DIFFEV step); the fitting idea is sketched below.
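As a hedged illustration of the fitting idea, and not the actual DIFFEV implementation, the following sketch uses SciPy's differential evolution to search for disorder-model parameters whose simulated pattern best matches a measured one; simulate_scattering is a hypothetical stand-in for a DISCUS run, and the measured image here is synthetic.

# Illustrative evolutionary fit of a disorder model to diffuse
# scattering; simulate_scattering is a placeholder for DISCUS.
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(0)
measured = rng.random((64, 64))   # stand-in for a measured detector image

def simulate_scattering(params):
    """Hypothetical stand-in for a DISCUS simulation run."""
    occupancy, displacement = params
    return np.full((64, 64), occupancy * displacement)

def chi_square(params):
    """Goodness-of-fit between simulated and measured images."""
    return float(np.sum((simulate_scattering(params) - measured) ** 2))

# Evolve disorder-model parameters (site occupancy, mean displacement)
# toward the best match with the measured pattern.
result = differential_evolution(chi_square,
                                bounds=[(0.0, 1.0), (0.0, 0.5)],
                                maxiter=50, polish=False, seed=0)
print("best-fit parameters:", result.x)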
Accelerating mapping of materials microstructure
with high energy diffraction microscopy (HEDM)
Top: grains in a 0.79 mm³ volume of a copper wire. Bottom: tensile deformation of the wire as it is pulled. (J. Almer)
(Figure: the HEDM analysis workflow, a single workflow running on Orthros and an IBM Blue Gene/Q with all data in NFS, driven manually by a Bash control script; it consumes up to 2.2 M CPU hours per week.
Detector → dataset: 360 files, 4 GB total.
1: Median calculation (MedianImage.c; 75 s, 90% I/O; uses Swift/K).
2: Peak search (ImageProcessing.c; 15 s per file; uses Swift/K), producing a reduced dataset: 360 files, 5 MB total.
3: Convert binary files to network-endian format (2 min for all files) and generate parameters (FOP.c; 50 tasks, 25 s/task, ~0.25 CPU hours; uses Swift/K).
4: Analysis pass (FitOrientation.c; 60 s/task, 1667 CPU hours, on PC or BG/Q; uses Swift/T), with Globus transfer and ssh moving data, and feedback to the experiment.
The Globus Catalog records scientific metadata and workflow progress; before/after images show the alignment improvement. A toy sketch of the staged parallelism appears after the credits below.)
Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer
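The deck names Swift/K and Swift/T as the engines for these stages. Purely as an illustration of the staged parallelism, and not the project's actual scripts, here is a minimal Python sketch in which placeholder stage functions stand in for MedianImage.c, ImageProcessing.c, and FitOrientation.c.

# Minimal sketch of the staged HEDM pipeline using a process pool.
# The stage bodies are placeholders for the C executables named above.
from concurrent.futures import ProcessPoolExecutor

FILES = [f"frame_{i:04d}.bin" for i in range(360)]  # the 360-file dataset

def median_calc(path):
    # stage 1: ~75 s, 90% I/O; would invoke the MedianImage executable
    return path

def peak_search(path):
    # stage 2: ~15 s per file; reduces 4 GB of frames to ~5 MB of peaks
    return path + ".peaks"

def fit_orientation(task_id):
    # stage 4: ~60 s per task on a PC or BG/Q node
    return task_id

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        cleaned = list(pool.map(median_calc, FILES))   # stage 1 in parallel
        peaks = list(pool.map(peak_search, cleaned))   # stage 2 in parallel
        # stage 3 (endian conversion, parameter generation) omitted
        done = list(pool.map(fit_orientation, range(50)))  # 50 fit tasks
    print(f"processed {len(peaks)} files, ran {len(done)} fit tasks")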
Parallel pipeline enables immediate assessment of
alignment quality in high-energy diffraction microscopy
Big data staging with MPI-IO enables interactive analysis on an IBM BG/Q supercomputer
Justin Wozniak
Integrate data movement, management, workflow, and computation to accelerate data-driven applications
New data, computational capabilities, and
methods create opportunities and challenges
Integrate statistics/machine learning to assess many models and calibrate them against all relevant data
New computer facilities enable on-demand computing and high-speed analysis of large quantities of data
(Diagram: applications, algorithms, environments, infrastructure, facilities.)
Towards a lab-wide (and DOE-wide) data architecture and facility

Researchers, system administrators, collaborators, students, …
Web interfaces, REST APIs, command line interfaces
Domain portals: PDACS, kBase, eMatter, FACE-IT
Services: registries of metadata and attributes; component & workflow repositories; workflow execution; data transfer, sync, and sharing; a utility compute system (“cloud”); data publication & discovery
Integration layer: remote access protocols, authentication, authorization
Resources: parallel file system, HPC compute, DISC, experimental facility, visualization system
Architecture realization for APS experiments
(Figure: the architecture realized for APS experiments, including external compute resources.)
The Petrel research data service
High-speed, high-capacity data store
Seamless integration with data fabric
Project-focused, self-managed
1.7 PB GPFS store
32 I/O nodes with GridFTP
100 TB allocations; user-managed access
Connected via globus.org to other sites, facilities, and colleagues
Managing the research data lifecycle with Globus services

1: PI initiates a transfer request, or the transfer is requested automatically by a script or science gateway.
2: Globus transfers files reliably, securely, and rapidly from the experimental facility to a compute facility.
3: PI selects files to share, selects a user or group, and sets access permissions.
4: Globus controls access to the shared files on existing storage; no need to move files to cloud storage!
5: Researcher logs in to Globus and accesses the shared files from a personal computer; no local account required; download via Globus.
6: Researcher assembles a data set and describes it using metadata (Dublin Core and domain-specific).
7: Curator reviews and approves; the data set is published on a campus or other system (publication repository).
8: Peers and collaborators search for and discover datasets, then transfer and share them using Globus.

• SaaS: only a web browser required
• Access using your campus credentials
• Globus monitors and notifies throughout

Capabilities: transfer, sharing, publication, discovery. www.globus.org
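Steps 1–2 can be scripted. As a hedged sketch (not from the talk), here is how a transfer might be automated with the Globus Python SDK; the endpoint UUIDs, paths, and token handling below are placeholders, and real use requires a Globus Auth login flow.

# Hedged sketch: automating a Globus transfer (steps 1-2) with the
# Globus Python SDK. Endpoints, paths, and the token are placeholders.
import globus_sdk

TOKEN = "...transfer access token..."          # obtained via Globus Auth
SRC = "ddb59aef-6d04-11e5-ba46-22000b92c6ec"   # placeholder: APS endpoint
DST = "ddb59af0-6d04-11e5-ba46-22000b92c6ec"   # placeholder: ALCF endpoint

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TOKEN))

tdata = globus_sdk.TransferData(tc, SRC, DST,
                                label="HEDM raw frames",
                                sync_level="checksum")
tdata.add_item("/aps/run42/", "/petrel/run42/", recursive=True)

task = tc.submit_transfer(tdata)   # Globus retries and verifies for us
print("task id:", task["task_id"])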
Endpoint aps#clutch has transfers to 119 other endpoints. (Figure: transfers from a single APS storage system to 119 destinations.)
Tying it all together: A basic energy sciences cyberinfrastructure

Components: storage locations; compute facilities; collaboration catalogs holding provenance, files & metadata, and script libraries; external collaborators.

0: Develop or reuse script
1: Run script (EL1.layer)
2: Look up file (name=EL1.layer, user=Anton, type=reconstruction)
3: Transfer inputs
4: Run app
5: Transfer results
6: Update catalogs

A hypothetical driver for these steps is sketched below.
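Purely as an illustration, and not the project's actual tooling, the numbered steps above could be driven by a script like the following; catalog_lookup, transfer, run_app, and catalog_update are hypothetical helpers standing in for Globus Catalog queries, Globus transfers, and job submission at a compute facility.

# Hypothetical driver for steps 0-6 above; the helper functions are
# assumptions, not real APIs, and print what they would do.

def catalog_lookup(**attrs):
    """Return storage locations of files matching the given attributes."""
    print("catalog lookup:", attrs)
    return ["petrel:/data/EL1.layer"]           # placeholder result

def transfer(src, dst):
    """Move data between endpoints, e.g. via a Globus transfer task."""
    print(f"transfer {src} -> {dst}")
    return dst + src.split("/")[-1]

def run_app(app, inputs):
    """Submit the analysis application to a compute facility."""
    print(f"run {app} on {inputs}")
    return "compute:/scratch/EL1.recon"         # placeholder output

def catalog_update(entry):
    """Record provenance and new file metadata in the catalogs."""
    print("catalog update:", entry)

# Steps 2-6 for the EL1.layer example in the slide:
files = catalog_lookup(name="EL1.layer", user="Anton",
                       type="reconstruction")               # step 2
staged = [transfer(f, "compute:/scratch/") for f in files]  # step 3
results = run_app("reconstruct", staged)                    # step 4
transfer(results, "petrel:/published/")                     # step 5
catalog_update({"inputs": files, "outputs": results})       # step 6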
Towards a science of workflow performance (“Robust Analytical Modeling for Science at Extreme Scales”)

Researchers develop, evaluate, and refine component and end-to-end models:
• Models from the literature
• Fluid models for network flows
• SKOPE modeling system

Develop and apply data-driven estimation methods:
• Differential regression
• Surrogate models
• Other methods from the literature

Develop easy-to-use tools that provide end users with actionable advice:
• Runtime advisor, integrated with Globus services

Automated experiments to test models and build a database:
• Experiment design
• Testbeds
Discovery engines for energy science

Scientific opportunities: probe material structure and function at unprecedented scales.

Technical challenges:
Many experimental modalities
Data rates and computation needs vary widely and are increasing
Knowledge management, integration, and synthesis
New methods demand rapid access to large amounts of data and computing

(Figure: simulation (0.001 to 100+ PFlops; batch and immediate) characterizes, predicts, assimilates, and steers data acquisition, and precomputes a material database. Data analysis reconstructs images, detects features, auto-correlates, and computes particle distributions, … Integration optimizes and fits; configure, check, guide. Science automation services provide scripting, security, storage, cataloging, and transfer. Data rates today: ~0.001–0.5 GB/s per flow, ~2 GB/s total burst, ~200 TB/month, ~10 concurrent flows; expected to grow 10× in 5 years. A quick consistency check on these rates follows.)
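As a back-of-envelope check (my arithmetic, not the slide's): 200 TB/month averages to under 0.1 GB/s sustained, so the ~2 GB/s burst figure means the infrastructure must absorb bursts roughly 25× above the mean rate.

# Back-of-envelope check on the quoted data rates (my arithmetic,
# not the slide's): sustained average versus burst capacity.
TB_PER_MONTH = 200
SECONDS_PER_MONTH = 30 * 24 * 3600

sustained_gb_s = TB_PER_MONTH * 1000 / SECONDS_PER_MONTH
burst_gb_s = 2.0

print(f"sustained average: {sustained_gb_s:.3f} GB/s")            # ~0.077
print(f"burst / sustained: {burst_gb_s / sustained_gb_s:.0f}x")   # ~26x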
Next steps
From six beamlines to 60 beamlines
From 60 facility users to 6000 facility users
From one lab to all labs
From data management and analysis to knowledge management, integration, and analysis
From per-user to per-discipline (and trans-discipline) data repositories, publication, and discovery
From terabytes to petabytes
From three months to three hours to build pipelines
From intuitive to analytical understanding of systems