Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
DESCRIPTION
Argonne’s Discovery Engines for Big Data project is working to enable new research modalities based on the integration of advanced computing with experiments at facilities such as the Advanced Photon Source (APS). I review science drivers and initial results in diffuse scattering, high energy diffraction microscopy, tomography, and ptychography. I also describe the computational methods and infrastructure that we leverage to support such applications, which include the Petrel online data store, ALCF supercomputers, Globus research data management services, and Swift parallel scripting. This work points to a future in which tight integration of DOE’s experimental and computational facilities enables both new science and more efficient and rapid discovery.
TRANSCRIPT
Discovery Engines for Big Data
Accelerating Discovery in Basic Energy Sciences
Ian Foster
Argonne National Laboratory
Joint work with Ray Osborn, Guy Jennings, Jon Almer, Hemant Sharma, Mike Wilde, Justin Wozniak, Rachana Ananthakrishnan, Ben Blaiszik, and many others
Work supported by Argonne LDRD
Motivating example: Disordered structures
“Most of materials science is bottlenecked by disordered structures”
Atomic disorder plays an important role in controlling the bulk properties of complex materials, for example:
Colossal magnetoresistance
Unconventional superconductivity
Ferroelectric relaxor behavior
Fast-ion conduction
And many, many others!
We want a systematic understanding of the relationships between material composition, temperature, structure, and other properties
A role for both experiment and simulation
Experiment: Observe (indirect) properties of real structures
E.g., single-crystal diffuse scattering at the Advanced Photon Source
Simulation: Compute properties of potential structures
E.g., DISCUS simulated diffuse scattering; molecular dynamics for structures
(Figure: a material composition, e.g. La 60% / Sr 40%, yields a simulated structure and simulated scattering; the sample yields experimental scattering.)
Opportunity: Integrate experiment & simulation
Experiments can explain and guide simulations
– E.g., guide simulations via evolutionary optimization
Simulations can explain and guide experiments
– E.g., identify temperature regimes in which more data is needed
(Figure: the discovery-engine loop. A sample yields experimental scattering; a material composition, e.g. La 60% / Sr 40%, yields a simulated structure and simulated scattering. Steps: detect errors (secs–mins); select experiments (mins–hours); simulations driven by experiments (mins–days); evolutionary optimization; knowledge-driven decision making; contribute to knowledge base. The knowledge base holds past experiments, simulations, literature, and expert knowledge.)
Opportunity: Link experiment, simulation, and data
analytics to create a discovery engine
Opportunities for discovery acceleration in energy
sciences are numerous and span DOE facilities
Single crystal diffuse scattering: defect structure in disordered materials (Osborn et al.) [beamline 6-ID]
High-energy x-ray diffraction microscopy: microstructure in bulk materials (Almer, Sharma, et al.) [beamline 1-ID]
Grazing incidence small angle x-ray scattering: directed self assembly (Nealey, Ferrier, De Pablo, et al.)
Common themes:
Large amounts of data
New mathematical and numerical methods
Statistical and machine learning methods
Rapid reconstruction and analysis
Large-scale parallel computation
End-to-end automation
Data management and provenance

More data plus new analysis methods enable new science processes. (Examples follow.)
Parallel pipeline enables real-time analysis of
diffuse scattering data, plus offline DIFFEV fitting
Use simulation and an evolutionary algorithm to determine the crystal configuration that can reproduce the observed scattering image (the DIFFEV step); the fitting idea is sketched below.
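As a hedged illustration of the fitting idea, and not the actual DIFFEV implementation, the following sketch uses SciPy's differential evolution to search for disorder-model parameters whose simulated pattern best matches a measured one; simulate_scattering is a hypothetical stand-in for a DISCUS run, and the measured image here is synthetic.

# Illustrative evolutionary fit of a disorder model to diffuse
# scattering; simulate_scattering is a placeholder for DISCUS.
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(0)
measured = rng.random((64, 64))   # stand-in for a measured detector image

def simulate_scattering(params):
    """Hypothetical stand-in for a DISCUS simulation run."""
    occupancy, displacement = params
    return np.full((64, 64), occupancy * displacement)

def chi_square(params):
    """Goodness-of-fit between simulated and measured images."""
    return float(np.sum((simulate_scattering(params) - measured) ** 2))

# Evolve disorder-model parameters (site occupancy, mean displacement)
# toward the best match with the measured pattern.
result = differential_evolution(chi_square,
                                bounds=[(0.0, 1.0), (0.0, 0.5)],
                                maxiter=50, polish=False, seed=0)
print("best-fit parameters:", result.x)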
Accelerating mapping of materials microstructure
with high energy diffraction microscopy (HEDM)
Top: grains in a 0.79 mm³ volume of a copper wire. Bottom: tensile deformation of the wire as it is pulled. (J. Almer)
(Figure: the HEDM analysis workflow, a single workflow running on Orthros and an IBM Blue Gene/Q with all data in NFS, driven manually by a Bash control script; it consumes up to 2.2 M CPU hours per week.
Detector → dataset: 360 files, 4 GB total.
1: Median calculation (MedianImage.c; 75 s, 90% I/O; uses Swift/K).
2: Peak search (ImageProcessing.c; 15 s per file; uses Swift/K), producing a reduced dataset: 360 files, 5 MB total.
3: Convert binary files to network-endian format (2 min for all files) and generate parameters (FOP.c; 50 tasks, 25 s/task, ~0.25 CPU hours; uses Swift/K).
4: Analysis pass (FitOrientation.c; 60 s/task, 1667 CPU hours, on PC or BG/Q; uses Swift/T), with Globus transfer and ssh moving data, and feedback to the experiment.
The Globus Catalog records scientific metadata and workflow progress; before/after images show the alignment improvement. A toy sketch of the staged parallelism appears after the credits below.)
Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer
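The deck names Swift/K and Swift/T as the engines for these stages. Purely as an illustration of the staged parallelism, and not the project's actual scripts, here is a minimal Python sketch in which placeholder stage functions stand in for MedianImage.c, ImageProcessing.c, and FitOrientation.c.

# Minimal sketch of the staged HEDM pipeline using a process pool.
# The stage bodies are placeholders for the C executables named above.
from concurrent.futures import ProcessPoolExecutor

FILES = [f"frame_{i:04d}.bin" for i in range(360)]  # the 360-file dataset

def median_calc(path):
    # stage 1: ~75 s, 90% I/O; would invoke the MedianImage executable
    return path

def peak_search(path):
    # stage 2: ~15 s per file; reduces 4 GB of frames to ~5 MB of peaks
    return path + ".peaks"

def fit_orientation(task_id):
    # stage 4: ~60 s per task on a PC or BG/Q node
    return task_id

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        cleaned = list(pool.map(median_calc, FILES))   # stage 1 in parallel
        peaks = list(pool.map(peak_search, cleaned))   # stage 2 in parallel
        # stage 3 (endian conversion, parameter generation) omitted
        done = list(pool.map(fit_orientation, range(50)))  # 50 fit tasks
    print(f"processed {len(peaks)} files, ran {len(done)} fit tasks")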
Parallel pipeline enables immediate assessment of
alignment quality in high-energy diffraction microscopy
Big data staging with MPI-IO enables interactive analysis on an IBM BG/Q supercomputer
Justin Wozniak
Integrate data movement, management, workflow, and computation to accelerate data-driven applications
New data, computational capabilities, and
methods create opportunities and challenges
Integrate statistics/machine learning to assess many models and calibrate them against all relevant data
New computer facilities enable on-demand computing and high-speed analysis of large quantities of data
(Diagram: applications, algorithms, environments, infrastructure, facilities.)
Towards a lab-wide (and DOE-wide) data architecture and facility

Researchers, system administrators, collaborators, students, …
Web interfaces, REST APIs, command line interfaces
Domain portals: PDACS, kBase, eMatter, FACE-IT
Services: registries of metadata and attributes; component & workflow repositories; workflow execution; data transfer, sync, and sharing; a utility compute system (“cloud”); data publication & discovery
Integration layer: remote access protocols, authentication, authorization
Resources: parallel file system, HPC compute, DISC, experimental facility, visualization system
Architecture realization for APS experiments
(Figure: the architecture realized for APS experiments, including external compute resources.)
The Petrel research data service
High-speed, high-capacity data store
Seamless integration with data fabric
Project-focused, self-managed
1.7 PB GPFS store
32 I/O nodes with GridFTP
100 TB allocations; user-managed access
Connected via globus.org to other sites, facilities, and colleagues
Managing the research data lifecycle with Globus services

1: PI initiates a transfer request, or the transfer is requested automatically by a script or science gateway.
2: Globus transfers files reliably, securely, and rapidly from the experimental facility to a compute facility.
3: PI selects files to share, selects a user or group, and sets access permissions.
4: Globus controls access to the shared files on existing storage; no need to move files to cloud storage!
5: Researcher logs in to Globus and accesses the shared files from a personal computer; no local account required; download via Globus.
6: Researcher assembles a data set and describes it using metadata (Dublin Core and domain-specific).
7: Curator reviews and approves; the data set is published on a campus or other system (publication repository).
8: Peers and collaborators search for and discover datasets, then transfer and share them using Globus.

• SaaS: only a web browser required
• Access using your campus credentials
• Globus monitors and notifies throughout

Capabilities: transfer, sharing, publication, discovery. www.globus.org
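Steps 1–2 can be scripted. As a hedged sketch (not from the talk), here is how a transfer might be automated with the Globus Python SDK; the endpoint UUIDs, paths, and token handling below are placeholders, and real use requires a Globus Auth login flow.

# Hedged sketch: automating a Globus transfer (steps 1-2) with the
# Globus Python SDK. Endpoints, paths, and the token are placeholders.
import globus_sdk

TOKEN = "...transfer access token..."          # obtained via Globus Auth
SRC = "ddb59aef-6d04-11e5-ba46-22000b92c6ec"   # placeholder: APS endpoint
DST = "ddb59af0-6d04-11e5-ba46-22000b92c6ec"   # placeholder: ALCF endpoint

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TOKEN))

tdata = globus_sdk.TransferData(tc, SRC, DST,
                                label="HEDM raw frames",
                                sync_level="checksum")
tdata.add_item("/aps/run42/", "/petrel/run42/", recursive=True)

task = tc.submit_transfer(tdata)   # Globus retries and verifies for us
print("task id:", task["task_id"])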
Endpoint aps#clutch has transfers to 119 other endpoints. (Figure: transfers from a single APS storage system to 119 destinations.)
Tying it all together: A basic energy sciences cyberinfrastructure

Components: storage locations; compute facilities; collaboration catalogs holding provenance, files & metadata, and script libraries; external collaborators.

0: Develop or reuse script
1: Run script (EL1.layer)
2: Look up file (name=EL1.layer, user=Anton, type=reconstruction)
3: Transfer inputs
4: Run app
5: Transfer results
6: Update catalogs

A hypothetical driver for these steps is sketched below.
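Purely as an illustration, and not the project's actual tooling, the numbered steps above could be driven by a script like the following; catalog_lookup, transfer, run_app, and catalog_update are hypothetical helpers standing in for Globus Catalog queries, Globus transfers, and job submission at a compute facility.

# Hypothetical driver for steps 0-6 above; the helper functions are
# assumptions, not real APIs, and print what they would do.

def catalog_lookup(**attrs):
    """Return storage locations of files matching the given attributes."""
    print("catalog lookup:", attrs)
    return ["petrel:/data/EL1.layer"]           # placeholder result

def transfer(src, dst):
    """Move data between endpoints, e.g. via a Globus transfer task."""
    print(f"transfer {src} -> {dst}")
    return dst + src.split("/")[-1]

def run_app(app, inputs):
    """Submit the analysis application to a compute facility."""
    print(f"run {app} on {inputs}")
    return "compute:/scratch/EL1.recon"         # placeholder output

def catalog_update(entry):
    """Record provenance and new file metadata in the catalogs."""
    print("catalog update:", entry)

# Steps 2-6 for the EL1.layer example in the slide:
files = catalog_lookup(name="EL1.layer", user="Anton",
                       type="reconstruction")               # step 2
staged = [transfer(f, "compute:/scratch/") for f in files]  # step 3
results = run_app("reconstruct", staged)                    # step 4
transfer(results, "petrel:/published/")                     # step 5
catalog_update({"inputs": files, "outputs": results})       # step 6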
Towards a science of workflow performance (“Robust Analytical Modeling for Science at Extreme Scales”)

Researchers develop, evaluate, and refine component and end-to-end models:
• Models from the literature
• Fluid models for network flows
• SKOPE modeling system

Develop and apply data-driven estimation methods:
• Differential regression
• Surrogate models
• Other methods from the literature

Develop easy-to-use tools that provide end users with actionable advice:
• Runtime advisor, integrated with Globus services

Automated experiments to test models and build a database:
• Experiment design
• Testbeds
Discovery engines for energy science

Scientific opportunities: probe material structure and function at unprecedented scales.

Technical challenges:
Many experimental modalities
Data rates and computation needs vary widely and are increasing
Knowledge management, integration, and synthesis
New methods demand rapid access to large amounts of data and computing

(Figure: simulation (0.001 to 100+ PFlops; batch and immediate) characterizes, predicts, assimilates, and steers data acquisition, and precomputes a material database. Data analysis reconstructs images, detects features, auto-correlates, and computes particle distributions, … Integration optimizes and fits; configure, check, guide. Science automation services provide scripting, security, storage, cataloging, and transfer. Data rates today: ~0.001–0.5 GB/s per flow, ~2 GB/s total burst, ~200 TB/month, ~10 concurrent flows; expected to grow 10× in 5 years. A quick consistency check on these rates follows.)
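As a back-of-envelope check (my arithmetic, not the slide's): 200 TB/month averages to under 0.1 GB/s sustained, so the ~2 GB/s burst figure means the infrastructure must absorb bursts roughly 25× above the mean rate.

# Back-of-envelope check on the quoted data rates (my arithmetic,
# not the slide's): sustained average versus burst capacity.
TB_PER_MONTH = 200
SECONDS_PER_MONTH = 30 * 24 * 3600

sustained_gb_s = TB_PER_MONTH * 1000 / SECONDS_PER_MONTH
burst_gb_s = 2.0

print(f"sustained average: {sustained_gb_s:.3f} GB/s")            # ~0.077
print(f"burst / sustained: {burst_gb_s / sustained_gb_s:.0f}x")   # ~26x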
Next steps
From six beamlines to 60 beamlines
From 60 facility users to 6000 facility users
From one lab to all labs
From data management and analysis to knowledge management, integration, and analysis
From per-user to per-discipline (and trans-discipline) data repositories, publication, and discovery
From terabytes to petabytes
From three months to three hours to build pipelines
From intuitive to analytical understanding of systems