Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences



DESCRIPTION

Argonne’s Discovery Engines for Big Data project is working to enable new research modalities based on the integration of advanced computing with experiments at facilities such as the Advanced Photon Source (APS). I review science drivers and initial results in diffuse scattering, high-energy diffraction microscopy, tomography, and ptychography. I also describe the computational methods and infrastructure that we leverage to support such applications, which include the Petrel online data store, ALCF supercomputers, Globus research data management services, and Swift parallel scripting. This work points to a future in which tight integration of DOE’s experimental and computational facilities enables both new science and more efficient and rapid discovery.

TRANSCRIPT

Page 1: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

Discovery Engines for Big Data

Accelerating Discovery in Basic Energy Sciences

Ian Foster

Argonne National Laboratory

Joint work with Ray Osborn, Guy Jennings, Jon Almer, Hemant Sharma, Mike Wilde, Justin Wozniak, Rachana Ananthakrishnan, Ben Blaiszik, and many others

Work supported by Argonne LDRD

Page 2: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

Motivating example: Disordered structures

“Most of materials science is bottlenecked by disordered structures”

Atomic disorder plays an important role in controlling the bulk properties of complex materials, for example:

Colossal magnetoresistance

Unconventional superconductivity

Ferroelectric relaxor behavior

Fast-ion conduction

And many, many others!

We want a systematic understanding of the relationships between material composition, temperature, structure, and other properties


Page 3: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

A role for both experiment and simulation

Experiment: Observe (indirect) properties of real structures

E.g., single crystal diffuse scattering at the Advanced Photon Source

Simulation: Compute properties of potential structures

E.g., DISCUS simulated diffuse scattering; molecular dynamics for structures

[Figure: Material composition (e.g., La 60% / Sr 40%) → Simulated structure → Simulated scattering; Sample → Experimental scattering.]

Page 4: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

Opportunity: Integrate experiment & simulation

Experiments can explain and guide simulations

– E.g., guide simulations via evolutionary optimization against measured scattering

Simulations can explain and guide experiments

– E.g., identify temperature regimes in which more data is needed

[Figure: Discovery-engine loop. Sample → Experimental scattering; Material composition (La 60% / Sr 40%) → Simulated structure → Simulated scattering. Detect errors (secs–mins); select experiments (mins–hours); simulations driven by experiments (mins–days); evolutionary optimization; knowledge-driven decision making; contribute to a knowledge base of past experiments, simulations, literature, and expert knowledge.]

Page 5: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences


Opportunity: Link experiment, simulation, and data analytics to create a discovery engine

[Figure: the same discovery-engine loop shown on Page 4.]

Page 6: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

Opportunities for discovery acceleration in energy sciences are numerous and span DOE facilities

Examples:

Single crystal diffuse scattering: defect structure in disordered materials (Osborn et al.), beamline 6-ID

High-energy x-ray diffraction microscopy: microstructure in bulk materials (Almer, Sharma, et al.), beamline 1-ID

Grazing incidence small angle x-ray scattering: directed self-assembly (Nealey, Ferrier, De Pablo, et al.)

Common themes: large amounts of data; new mathematical and numerical methods; statistical and machine learning methods; rapid reconstruction and analysis; large-scale parallel computation; end-to-end automation; data management and provenance

[Figure: more data + new analysis methods → new science processes.]

Page 7: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

Parallel pipeline enables real-time analysis of diffuse scattering data, plus offline DIFFEV fitting

Use simulation and an evolutionary algorithm to determine the crystal configuration that can reproduce the scattering image

DIFFEV step
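The DIFFEV step couples a forward scattering simulation to an evolutionary optimizer: candidate disorder parameters are scored by how well their simulated scattering matches the measured image, and the best candidates are mutated to form the next generation. The Python sketch below only illustrates that pattern; the toy Gaussian forward model, population size, and mutation scheme are assumptions for illustration, not the actual DISCUS/DIFFEV implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_scattering(params, q=np.linspace(0, 5, 200)):
    """Toy stand-in for a forward scattering simulation (the real step calls DISCUS)."""
    amp, width, offset = params
    return amp * np.exp(-(q - 2.5) ** 2 / (2 * width ** 2)) + offset

def fitness(params, observed):
    """Mismatch between simulated and observed scattering (lower is better)."""
    return float(np.mean((simulate_scattering(params) - observed) ** 2))

def evolve(observed, pop_size=40, generations=60, sigma=0.05):
    """Minimal evolutionary loop: keep the best half, mutate to refill the population."""
    pop = rng.random((pop_size, 3))
    for _ in range(generations):
        scores = np.array([fitness(p, observed) for p in pop])
        parents = pop[np.argsort(scores)[: pop_size // 2]]
        children = parents + sigma * rng.standard_normal(parents.shape)
        pop = np.vstack([parents, children])
    scores = np.array([fitness(p, observed) for p in pop])
    return pop[np.argmin(scores)]

observed = simulate_scattering([1.0, 0.4, 0.1])   # synthetic "experimental" pattern
best = evolve(observed)
print("recovered parameters:", np.round(best, 3))
```

In the real pipeline, each fitness evaluation is an expensive simulation, which is why the population is evaluated in parallel and the fitting runs offline while the fast reduction pipeline gives real-time feedback.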

Page 8: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

Accelerating mapping of materials microstructure with high energy diffraction microscopy (HEDM)

Top: Grains in a 0.79 mm³ volume of a copper wire. Bottom: Tensile deformation of a copper wire when the wire is pulled. (J. Almer)

Page 9: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

Parallel pipeline enables immediate assessment of alignment quality in high-energy diffraction microscopy (Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer)

[Workflow figure: a single workflow spanning the Orthros cluster and an IBM Blue Gene/Q, with all data in NFS and feedback returned to the experiment.]

Detector → dataset: 360 files, 4 GB total

1: Median calculation (MedianImage.c), 75 s (90% I/O), uses Swift/K

2: Peak search (ImageProcessing.c), 15 s per file, uses Swift/K → reduced dataset: 360 files, 5 MB total

3: Generate parameters (FOP.c), 50 tasks, 25 s/task, ¼ CPU hours, uses Swift/K; convert binary files to network-endian format ("bin L to N", 2 min for all files)

Globus (GO) transfer and ssh move data and control to the Blue Gene/Q

4: Analysis pass (FitOrientation.c), 60 s/task, 1667 CPU hours (PC or BG/Q), uses Swift/T

Up to 2.2 M CPU hours per week

Globus Catalog records scientific metadata and workflow progress; workflow control via script, Bash, or manual steps

Before/after images show alignment quality
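The production workflow above is expressed with Swift/K and Swift/T. The Python sketch below only illustrates the shape of the pipeline (median image, per-file peak search, parameter generation, per-file orientation fitting) using a local process pool; the executable names come from the .c programs on the slide, but their command-line arguments and the file pattern are placeholders.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
import subprocess

def run(cmd):
    """Run one external analysis executable (invocation details are placeholders)."""
    subprocess.run(cmd, check=True)

def hedm_pipeline(frames, workdir):
    workdir = Path(workdir)
    # 1: median image over all frames (single, I/O-bound task)
    run(["MedianImage", str(workdir), *map(str, frames)])
    # 2: peak search, one task per frame (reduces GBs of raw frames to MBs of peaks)
    with ProcessPoolExecutor() as pool:
        list(pool.map(run, (["ImageProcessing", str(f)] for f in frames)))
    # 3: generate fitting parameters from the reduced data
    run(["FOP", str(workdir / "reduced")])
    # 4: orientation fitting, the expensive step (Swift/T on BG/Q in the real workflow)
    with ProcessPoolExecutor() as pool:
        list(pool.map(run, (["FitOrientation", str(f)] for f in frames)))

hedm_pipeline(sorted(Path("scan_0001").glob("*.ge3")), "analysis")
```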

Page 10: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

Big data staging with MPI-IO enables interactive analysis on an IBM BG/Q supercomputer

Justin Wozniak
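A minimal mpi4py sketch of the staging pattern: every rank collectively reads its contiguous slice of a large detector file into memory with MPI-IO, after which the in-memory data can be analyzed interactively. The file name, 16-bit pixel type, and even division of the file across ranks are assumptions made for illustration.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Collective open and read: all ranks participate, each reading its own slice.
fh = MPI.File.Open(comm, "frames.raw", MPI.MODE_RDONLY)
nbytes = fh.Get_size()
count = nbytes // np.dtype(np.uint16).itemsize   # total 16-bit pixels in the file
per_rank = count // size                          # assumes pixels divide evenly across ranks

buf = np.empty(per_rank, dtype=np.uint16)
offset = rank * per_rank * buf.itemsize           # byte offset of this rank's slice
fh.Read_at_all(offset, buf)
fh.Close()

# Data is now staged in memory across ranks; e.g., a quick global statistic:
local_sum = float(buf.sum())
total = comm.allreduce(local_sum, op=MPI.SUM)
if rank == 0:
    print("mean pixel value:", total / (per_rank * size))
```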

Page 11: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

New data, computational capabilities, and methods create opportunities and challenges

Applications / Algorithms: Integrate statistics and machine learning to assess many models and calibrate them against "all" relevant data

Environments / Infrastructure: Integrate data movement, management, workflow, and computation to accelerate data-driven applications

Infrastructure / Facilities: New computer facilities enable on-demand computing and high-speed analysis of large quantities of data

Page 12: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

Towards a lab-wide (and DOE-wide) data architecture and facility

[Architecture figure:
Users: researchers, system administrators, collaborators, students, ... via web interfaces, REST APIs, and command line interfaces.
Domain portals: PDACS, kBase, eMatter, FACE-IT.
Services: registry (metadata, attributes); component & workflow repository; workflow execution; data transfer, sync, and sharing; data publication & discovery.
Integration layer: remote access protocols, authentication, authorization.
Resources: parallel file system, utility compute system ("cloud"), HPC compute, DISC, experimental facility, visualization system.]

Page 17: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

Architecture realization for APS experiments

[Figure: realization of the architecture for APS experiments, including external compute resources.]

Page 18: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

The Petrel research data service

High-speed, high-capacity data store

Seamless integration with data fabric

Project-focused, self-managed


1.7 PB GPFS store

32 I/O nodes with GridFTP

Other sites, facilities, colleagues

100 TB allocations; user-managed access

globus.org
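Data lands on and leaves Petrel via Globus transfers. A hedged globus-sdk sketch of submitting such a transfer is shown below; the endpoint UUIDs, paths, label, and the access token are placeholders, and in practice the transfer may equally well be driven from the Globus web app or an automated script at the beamline.

```python
import globus_sdk

# Placeholder: obtain a transfer-scoped token via your usual Globus auth flow.
authorizer = globus_sdk.AccessTokenAuthorizer("TRANSFER_TOKEN")
tc = globus_sdk.TransferClient(authorizer=authorizer)

SRC = "APS_ENDPOINT_UUID"      # e.g., beamline acquisition storage
DST = "PETREL_ENDPOINT_UUID"   # project allocation on Petrel

tdata = globus_sdk.TransferData(tc, SRC, DST,
                                label="diffuse scattering scan 42",
                                sync_level="checksum")
tdata.add_item("/detector/scan_0042/", "/project/osborn/scan_0042/", recursive=True)

task = tc.submit_transfer(tdata)
print("submitted Globus transfer:", task["task_id"])
```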

Page 19: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

Managing the research data lifecycle with Globus services

1. PI initiates a transfer request, or it is requested automatically by a script or science gateway.

2. Globus transfers files reliably, securely, and rapidly from the experimental facility to a compute facility.

3. PI selects files to share, selects a user or group, and sets access permissions. Globus controls access to shared files on existing storage; no need to move files to cloud storage!

4. Researcher logs in to Globus and accesses the shared files; no local account required; download via Globus to a personal computer.

5. Researcher assembles a data set and describes it using metadata (Dublin Core and domain-specific).

6. Curator reviews and approves; the data set is published on a campus or other system (publication repository).

7. Peers and collaborators search for and discover datasets; transfer and share using Globus.

Stages: transfer, sharing, publication, discovery.

• SaaS: only a web browser required

• Access using your campus credentials

• Globus monitors and notifies throughout

www.globus.org, Booth 3649
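Step 3 of the lifecycle (sharing in place on existing storage, rather than copying to cloud storage) corresponds to adding an access rule on a Globus shared endpoint. A hedged globus-sdk sketch follows; the shared endpoint UUID, the collaborator's Globus identity, the path, and the token are all placeholders.

```python
import globus_sdk

authorizer = globus_sdk.AccessTokenAuthorizer("TRANSFER_TOKEN")   # placeholder token
tc = globus_sdk.TransferClient(authorizer=authorizer)

SHARED_ENDPOINT = "SHARED_ENDPOINT_UUID"     # shared endpoint layered on existing storage
collaborator = "COLLABORATOR_IDENTITY_UUID"  # Globus identity to grant access to

rule = {
    "DATA_TYPE": "access",
    "principal_type": "identity",
    "principal": collaborator,
    "path": "/scan_0042/",        # share just this directory
    "permissions": "r",           # read-only access
}
result = tc.add_endpoint_acl_rule(SHARED_ENDPOINT, rule)
print("access rule created:", result["access_id"])
```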


Page 22: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

Endpoint aps#clutch has transfers to 119 other endpoints

[Figure: transfers from a single APS storage system to 119 destinations.]

Page 23: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences


Page 24: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

Tying it all together: A basic energy sciences cyberinfrastructure

[Figure: storage locations, compute facilities, collaboration catalogs (provenance, files & metadata), and script libraries.]


Page 31: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

Tying it all together: A basic energy sciences cyberinfrastructure

[Figure: researchers and external collaborators interact with script libraries, collaboration catalogs (provenance, files & metadata), storage locations, and compute facilities.]

0: Develop or reuse script

1: Run script (EL1.layer)

2: Look up file (name=EL1.layer, user=Anton, type=reconstruction) in the collaboration catalogs

3: Transfer inputs from storage locations

4: Run app at a compute facility

5: Transfer results

6: Update catalogs
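A rough Python sketch of how steps 1 through 6 compose is shown below. Every function here (catalog_lookup, transfer, run_app, catalog_update) is a hypothetical stand-in for the real catalog, Globus, and Swift services, included only to make the control flow concrete.

```python
# Illustrative end-to-end flow for steps 1-6; all helpers are hypothetical stubs.

def catalog_lookup(name, user, kind):
    """Stand-in for a Globus Catalog query; returns where the named dataset lives."""
    return {"endpoint": "STORAGE_ENDPOINT", "path": f"/{user}/{kind}/{name}"}

def transfer(src_endpoint, src_path, dst_endpoint, dst_path):
    """Stand-in for a Globus transfer request."""
    print(f"transfer {src_endpoint}:{src_path} -> {dst_endpoint}:{dst_path}")

def run_app(app, inputs):
    """Stand-in for launching an analysis application (e.g., via a Swift script)."""
    print(f"run {app} on {inputs}")
    return "/scratch/EL1.layer.reconstructed"

def catalog_update(name, **metadata):
    """Stand-in for recording provenance and metadata back into the catalogs."""
    print(f"record provenance for {name}: {metadata}")

# 1-2: run the script; look up the requested file in the collaboration catalog
entry = catalog_lookup(name="EL1.layer", user="Anton", kind="reconstruction")
# 3: transfer inputs from the storage location to the compute facility
transfer(entry["endpoint"], entry["path"], "COMPUTE_ENDPOINT", "/scratch/EL1.layer")
# 4: run the analysis application
result = run_app("reconstruct", "/scratch/EL1.layer")
# 5: transfer results back to project storage
transfer("COMPUTE_ENDPOINT", result, entry["endpoint"], entry["path"] + ".out")
# 6: update catalogs with the new output and its provenance
catalog_update("EL1.layer.out", derived_from="EL1.layer", app="reconstruct")
```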

Page 32: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

Towards a science of workflow performance ("Robust Analytical Modeling for Science at Extreme Scales")

Develop, evaluate, and refine component and end-to-end models

• Models from the literature

• Fluid models for network flows

• SKOPE modeling system

Develop and apply data-driven estimation methods

• Differential regression

• Surrogate models

• Other methods from the literature

Develop easy-to-use tools to provide end users with actionable advice

• Runtime advisor, integrated with Globus services

Automated experiments to test models & build a database

• Experiment design

• Testbeds
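As one concrete, illustrative instance of a data-driven surrogate model: fit observed transfer times against data size with least squares and use the result to predict a new run. The measurements below are synthetic and only stand in for the kind of data the automated experiments would collect.

```python
import numpy as np

# Synthetic measurements (illustrative only): data size in GB, observed transfer time in s.
size_gb = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
time_s  = np.array([12, 15, 22, 38, 70, 131, 255], dtype=float)

# Simple linear surrogate: time ≈ startup_cost + size / effective_bandwidth
A = np.vstack([np.ones_like(size_gb), size_gb]).T
(startup, sec_per_gb), *_ = np.linalg.lstsq(A, time_s, rcond=None)

print(f"startup ≈ {startup:.1f} s, effective bandwidth ≈ {1 / sec_per_gb:.2f} GB/s")
print(f"predicted time for 100 GB ≈ {startup + 100 * sec_per_gb:.0f} s")
```

A runtime advisor could use exactly this kind of fitted model to tell a user whether a planned transfer or analysis will finish before beam time ends.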

Page 33: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

Discovery engines for energy science

Scientific opportunities: probe material structure and function at unprecedented scales.

Technical challenges: many experimental modalities; data rates and computation needs vary widely and are increasing; knowledge management, integration, and synthesis; new methods demand rapid access to large amounts of data and computing.

[Figure: Data analysis (reconstruct, detect features, auto-correlate, particle distributions, ...) is coupled to simulation (characterize, predict, assimilate, steer data acquisition) and an integration step (optimize, fit, ...) that configures, checks, and guides the experiment, all built on science automation services (scripting, security, storage, cataloging, transfer). Workloads range from batch to immediate and from ~0.001 to 100+ PFlops; example tasks include precomputing a material database, image reconstruction, auto-correlation, and feature detection. Data rates today: ~0.001-0.5 GB/s per flow, ~2 GB/s total burst, ~200 TB/month, ~10 concurrent flows; expected to grow ~10x in 5 years.]
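A quick sanity check on the quoted rates (a rough, illustrative calculation assuming a 30-day month) shows why burst capacity and sustained throughput must be provisioned very differently:

```python
# Rough, illustrative arithmetic on the rates quoted above (assumes a 30-day month).
TB = 1e12                                   # bytes per terabyte
monthly_volume = 200 * TB                   # ~200 TB/month
seconds_per_month = 30 * 24 * 3600

avg_rate = monthly_volume / seconds_per_month   # sustained average, bytes/s
burst = 2e9                                      # ~2 GB/s total burst

print(f"average sustained rate ≈ {avg_rate / 1e9:.2f} GB/s")   # ≈ 0.08 GB/s
print(f"burst-to-average ratio ≈ {burst / avg_rate:.0f}x")      # ≈ 26x
```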

Page 34: Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences

Next steps

From six beamlines to 60 beamlines

From 60 facility users to 6000 facility users

From one lab to all labs

From data management and analysis to knowledge management, integration, and analysis

From per-user to per-discipline (and trans-discipline) data repositories, publication, and discovery

From terabytes to petabytes

From three months to three hours to build pipelines

From intuitive to analytical understanding of systems
