Summary of 3DPAS



DESCRIPTION

Presented at the D3Science workshop at the e-Science 2011 conference.

TRANSCRIPT

Page 1: Summary of 3DPAS


Review of 3DPAS Theme

Daniel S. Katz, University of Chicago & Argonne National Laboratory
Shantenu Jha, Rutgers University
Neil Chue Hong, University of Edinburgh
Simon Dobson, University of St. Andrews
Andre Luckow, Louisiana State University
Omer Rana, University of Cardiff
Yogesh Simmhan, University of Southern California

Page 2: Summary of 3DPAS


Outline

• e-SI
• DPA theme
• 3DPAS theme
• Report in progress
  – Application Scenarios
  – Understanding Distributed Dynamic Data
  – Vectors
  – Infrastructure
  – Programming Systems and Abstractions

• Future Steps

Page 3: Summary of 3DPAS


e-Science Institute (e-SI)

• A 10-year project (Aug 2001 – July 2011), located in Edinburgh
• Aimed at, but not limited to, the UK
• http://www.esi.ac.uk/
• Tagline – time & space to think
• Mission: to stimulate the creation of new insights in e-Science and computing science by bringing together international experts and enabling them to successfully address significant and diverse challenges
• Research themes formed the core of eSI's activity
  – Theme: a connected programme of visitors, workshops and events
  – Conceived and driven by a Theme Leader
  – Focusing on a specific issue in e-Science that crosses boundaries and raises new research questions
  – Goals:
    o Identify research issues
    o Rally a community of researchers
    o Map a path of future research that will make best progress towards new e-Science methods and capabilities

Page 4: Summary of 3DPAS


Context – Data and Science

• Data has always been important to science
• Some use the concept of paradigms

– First (thousand years ago) – empirical – describe natural phenomena

– Second (few hundred years ago) – theoretical – use models and generalizations

– Third (few decades ago) – computational – solve complex problems

– Fourth (few years ago) – data exploration – gain knowledge directly from data from experiment, theory, simulation

• Problem – we cannot keep declaring new paradigms at an exponentially increasing rate

• But it’s true that there is an emerging science of “listening to data”, as defined by Jim Gray, Google, etc.

Page 5: Summary of 3DPAS


Distributed Programming Abstractions

• DPA theme at eSI
  – http://wiki.esi.ac.uk/Distributed_Programming_Abstractions
• Series of workshops
• Led to book in progress: Shantenu Jha, Daniel S. Katz, Manish Parashar, Omer Rana, and Jon Weissman, "Abstractions for Distributed Applications and Systems," to be published by Wiley in 2012
• And multiple papers, including: S. Jha, D. S. Katz, M. Parashar, O. Rana, and J. Weissman, "Critical Perspectives on Large-Scale Distributed Applications and Production Grids" (Best Paper Award winner), Proceedings of the 10th IEEE/ACM International Conference on Grid Computing (Grid 2009), 2009
• Idea – start with distributed science and engineering applications – analyze them (determine "vectors"); examine interaction with infrastructures and tools; find abstractions
  – Tech report on infrastructures (much of Chapter 3) available now: http://www.ci.uchicago.edu/research/papers/CI-TR-7-0811
  – Vectors: Execution Unit, Coordination, Communication, Execution Environment
• In the process, we realized that data-intensive applications had some unique challenges and issues

Page 6: Summary of 3DPAS


Dynamic Distributed Data-intensive Programming Systems and Applications (3DPAS)

• This led to the 3DPAS theme at eSI
  – http://wiki.esi.ac.uk/3DPAS
• Similar idea to DPA
  – Start with science and engineering applications
  – See if the DPA vectors suffice or if new vectors are needed
  – Examine what is different with respect to infrastructures and programming systems
• Initially done through workshops at eSI
• Continuing through weekly teleconferences
• Driving towards a report/paper

Page 7: Summary of 3DPAS


D3 (data-intensive, distributed, dynamic)

• Data-intensive: the orders of magnitude of large data and large computing, e.g.,
  – Exascale data and petascale computing
  – Petascale data and exascale computing
  – Exascale data and exascale computing
• Distributed: the number, dispersion, and replication of distributed data or computation resources
  – Low in a cloud or cluster that resides in a single building
  – High in a grid that spans multiple geographically-separated administrative domains, or multiple data centers
• Dynamic: perhaps both data and computation
  – Data may emerge at runtime
  – Mechanisms to handle data during application execution, e.g., data transfer, scheduling
  – Application components may be launched at runtime in response to data, application, or environment dynamics
• All may vary in different stages of an application
• Most applications have data collection, storage, and analysis stages

Page 8: Summary of 3DPAS


Value/Impact

• Not all data-intensive applications have dynamic and distributed elements today
• However, as scales increase, applications will have to be distributed and dynamic
  – And these issues will be increasingly correlated
• Analyzing current D3 applications should impact many future applications
  – And lead to lessons about, and requirements on, future infrastructures and programming systems

Page 9: Summary of 3DPAS


Applications Process

• Asked questions about possible applications:
  1. What is the purpose of the application?
  2. How is the application used to do this?
  3. What infrastructure is used? (including compute, data, network, instruments, etc.)
  4. What dynamic data is used in the application?
     a. What are the types of data?
     b. What is the size of the data set(s)?
  5. How does the application get the data?
  6. What are the time (or quality) constraints on the application?
  7. How much diverse data integration is involved?
  8. How diverse is the data?
  9. Please feel free to also talk about the current state of the application, if it exists today, and any specific gaps that you know need to be overcome

Page 10: Summary of 3DPAS


Applications Process (2)

• In workshops, discussed current applications, and considered whether a new application "felt" the same as a previous application in terms of the answers to the questions
• Came to 14 applications
• Noted they fall into different categories
  – Traditional applications: a single program that is run by a user
  – Archetypical applications: a group of independent programs, written by different authors, possibly competing, usually not intended to run together
  – Infrastructural applications: a set of applications (or archetypical applications) that need to be run in series (perhaps in different phases), may be run by different groups that do not frequently interact

Page 11: Summary of 3DPAS


Applications

Application                      | Area             | Type            | Lead Person/Site
Metagenomics                     | Biosciences      | Archetypical    | Amsterdam Medical Centre, Netherlands
ATLAS experiment (WLCG)          | Particle Physics | Infrastructural | CERN & Daresbury Lab + RAL, UK
Large Synoptic Sky Survey (LSST) | Astrophysics     | Infrastructural | University of Edinburgh – Institute of Astronomy, UK
Virtual Astronomy                | Astrophysics     | Archetypical    | University of Edinburgh – Institute of Astronomy, UK
Cosmic Microwave Background      | Astrophysics     | Traditional     | Lawrence Berkeley National Laboratory, USA
Marine (Sea Mammal) Sensors      | Biosciences      | Infrastructural | University of St. Andrews, UK
Climate                          | Earth Science    | Infrastructural | National Center for Atmospheric Research, USA

Page 12: Summary of 3DPAS


Applications (2)

Application                                               | Area               | Type            | Lead Person/Site
Interactive Exploration of Environmental Data             | Earth Science      | Archetypical    | University of Reading, UK
Power Grids                                               | Energy Informatics | Infrastructural | University of Southern California, USA
Fusion (International Thermonuclear Experimental Reactor) | Chemistry/Physics  | Traditional     | Oak Ridge National Laboratory & Rutgers University, USA
Industrial Incident Notification and Response             | Emergency Response | Infrastructural | THALES, The Netherlands
MODIS Data Processing                                     | Earth Science      | Traditional     | Lawrence Berkeley National Laboratory, USA
Floating Sensors                                          | Earth Science      | Infrastructural | Lawrence Berkeley National Laboratory, USA
Distributed Network Intrusion Detection                   | Security           | Infrastructural | University of Minnesota, USA

Page 13: Summary of 3DPAS


Climate (infrastructural)

• The CMIP/IPCC process runs and analyses climate models in 3 stages
• Data are generated by distributed HPC centers
• Data are stored by distributed ESGF gateways and data nodes
• Data are analyzed by distributed researchers, who search for particular data, gather them to a site, and process them (see the sketch below)
• Resources for analysis can be dynamic, as can the data stored in data nodes

Thanks: Don Middleton
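
To make the search–gather–process pattern concrete, here is a minimal, hedged Python sketch. The catalog endpoint, query fields, and file URLs are hypothetical placeholders, not the actual ESGF API; only the three-step flow the slide describes is illustrated.

```python
# Hedged sketch: search a (hypothetical) catalog, gather files to a site, process them.
import json
import shutil
import urllib.parse
import urllib.request
from pathlib import Path

CATALOG_URL = "https://catalog.example.org/search"   # hypothetical catalog endpoint

def search_catalog(variable, experiment):
    """Query the catalog and return a list of file URLs (response format assumed)."""
    query = urllib.parse.urlencode({"variable": variable, "experiment": experiment})
    with urllib.request.urlopen(f"{CATALOG_URL}?{query}") as resp:
        return json.load(resp)["file_urls"]

def gather(file_urls, dest_dir):
    """Copy remote files to the analysis site (a local staging directory here)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    local_paths = []
    for url in file_urls:
        target = dest / Path(urllib.parse.urlparse(url).path).name
        with urllib.request.urlopen(url) as src, open(target, "wb") as out:
            shutil.copyfileobj(src, out)
        local_paths.append(target)
    return local_paths

def process(local_paths):
    """Stand-in for the researcher's analysis step."""
    for path in local_paths:
        print(f"analyzing {path} ({path.stat().st_size} bytes)")

if __name__ == "__main__":
    urls = search_catalog(variable="tas", experiment="historical")
    process(gather(urls, dest_dir="staging"))
```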

Page 14: Summary of 3DPAS


Fusion (traditional)

• ITER needs a variety of codes
• Codes run on a distributed set of leadership-class facilities, using advance reservations to co-schedule the simulations
• Codes read and write data files, using ADIOS and HDF5
• Files output by each code are transformed and transferred to be used as inputs by other codes, linking the codes into a single coupled simulation (see the sketch below)
• Data generated are too large to be written to disk for post-run analysis; in-situ analysis and visualization tools are being developed

Thanks: Scott Klasky
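
As an illustration of this kind of loose, file-based coupling, here is a minimal sketch using HDF5 via h5py (the slide names ADIOS and HDF5; this sketch uses only HDF5). The file names, dataset names, and the unit-conversion "transform" are hypothetical.

```python
# Minimal sketch of file-based coupling between two codes via HDF5.
# Requires: pip install numpy h5py
import numpy as np
import h5py

def producer_step(path):
    """Code A writes its field output to an HDF5 file."""
    field = np.random.rand(64, 64)              # stand-in for a simulation result
    with h5py.File(path, "w") as f:
        f.create_dataset("temperature", data=field)
        f["temperature"].attrs["units"] = "keV"

def transform(in_path, out_path):
    """Transform A's output into the form code B expects (hypothetical unit change)."""
    with h5py.File(in_path, "r") as f:
        temp_kev = f["temperature"][...]
    with h5py.File(out_path, "w") as f:
        f.create_dataset("temperature_ev", data=temp_kev * 1.0e3)

def consumer_step(path):
    """Code B reads the transformed file as its input."""
    with h5py.File(path, "r") as f:
        temp_ev = f["temperature_ev"][...]
    print("consumer read field with mean", temp_ev.mean(), "eV")

if __name__ == "__main__":
    producer_step("code_a_out.h5")
    transform("code_a_out.h5", "code_b_in.h5")  # in practice, also transferred between sites
    consumer_step("code_b_in.h5")
```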

Page 15: Summary of 3DPAS


Metagenomics (archetypical)

• Analysis of genome sequence data being produced by next-generation devices
• Sequencers are producing data at a rate increasing faster than computing capability
• Sequencers are distributed; the data produced cannot all be co-located
• Multiple analyses (using different software) by multiple users need to make best use of available computing resources, understanding location and access issues with respect to the datasets (a locality-aware selection sketch follows)
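
One simple way to reason about the "location and access" issue is to pick the compute site that minimizes estimated data movement for a given analysis. This is a hedged sketch with made-up site names, sizes, and bandwidths, not anything prescribed by the 3DPAS work.

```python
# Hedged sketch: choose a compute site so that estimated data-staging time is
# minimized. Sites, bandwidths (GB/s), and dataset sizes (GB) are hypothetical.

DATASET_LOCATION = {"run_001.fastq": "sequencer_site_A",
                    "run_002.fastq": "sequencer_site_B"}
DATASET_SIZE_GB = {"run_001.fastq": 120, "run_002.fastq": 80}

BANDWIDTH_GBPS = {  # effective bandwidth from data site to compute site
    ("sequencer_site_A", "cluster_A"): 10.0,
    ("sequencer_site_A", "cloud_X"): 0.5,
    ("sequencer_site_B", "cluster_A"): 0.8,
    ("sequencer_site_B", "cloud_X"): 5.0,
}

def transfer_time(dataset, compute_site):
    src = DATASET_LOCATION[dataset]
    return DATASET_SIZE_GB[dataset] / BANDWIDTH_GBPS[(src, compute_site)]

def choose_site(datasets, compute_sites):
    """Pick the compute site with the smallest total estimated staging time."""
    return min(compute_sites,
               key=lambda site: sum(transfer_time(d, site) for d in datasets))

if __name__ == "__main__":
    best = choose_site(["run_001.fastq", "run_002.fastq"], ["cluster_A", "cloud_X"])
    print("run analysis on:", best)
```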

Page 16: Summary of 3DPAS


CMB (traditional)

• Cosmic Microwave Background (CMB) work performs data simulation and analysis to understand the Universe 400,000 years after the Big Bang
  – Detectors take O(10^12 – 10^15) time-ordered sequences
  – Observations are reduced to a map of O(10^6 – 10^8) sky pixels
  – Pixels are reduced to O(10^3 – 10^4) angular power spectrum coefficients
  – Coefficients are reduced to O(10) cosmological parameters
• The computationally most expensive step is from map to angular power spectrum
  – Exact solution is O(pixels^3) – prohibitive
  – Approximate solution: sets of O(10^4) Monte Carlo realizations of the observed sky to remove biases and quantify uncertainties, each of which involves simulating and mapping the time-ordered data
  – Map-making is applied to both real and simulated data, but O(10^4) more times to simulated data (uses an on-the-fly simulation module – simulations are performed when requested)
• Currently uses a single HPC system, but would be faster with distributed systems
• A central system that builds the map would launch data simulations on available remote resources; output data from the simulations would be asynchronously delivered back to that central system as files and incorporated in the map as they are produced (see the sketch below)

Thanks: Julian Borrill
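
A minimal sketch of that asynchronous pattern, using Python's standard concurrent.futures as a local stand-in for launching simulations on remote resources. The simulate_and_map function and the way results are folded into a running map are hypothetical simplifications.

```python
# Hedged sketch: launch Monte Carlo sky realizations asynchronously and fold each
# resulting map into a central accumulation as soon as it arrives. A process pool
# stands in for remote resources; sizes and counts are toy values.
import numpy as np
from concurrent.futures import ProcessPoolExecutor, as_completed

N_PIXELS = 10_000        # toy map size
N_REALIZATIONS = 16      # the real analysis uses O(10^4) realizations

def simulate_and_map(seed):
    """Simulate one realization's time-ordered data and reduce it to a map (toy)."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=N_PIXELS)

def main():
    mean_map = np.zeros(N_PIXELS)
    n_done = 0
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(simulate_and_map, s) for s in range(N_REALIZATIONS)]
        for fut in as_completed(futures):        # incorporate maps as they arrive
            n_done += 1
            mean_map += (fut.result() - mean_map) / n_done   # running mean update
    print(f"accumulated {n_done} realizations; map rms = {mean_map.std():.4f}")

if __name__ == "__main__":
    main()
```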

Page 17: Summary of 3DPAS


Some Additional Applications

• ATLAS/WLCG (Infrastructural)
  – Hierarchy of systems; data are centrally stored and locally cached (and copied to where they will likely be used), perhaps at various levels of the hierarchy (a tiered-lookup sketch follows this list)
  – Processing is done by applications that are independent of each other
  – Processing of one data file is independent of processing of another file, but groups of processing results are collected to obtain statistical outputs about the data
• LSST (Infrastructural)
  – Data are taken by a telescope
  – Quick analysis is done at the telescope site for interesting (urgent) events (which may involve comparing new data with previous data)
  – The system can get more data from other observatories if needed; request other observatories to take more data; or call a human
  – Data are then transferred to an archive site, which may be at the observatory, where data are analyzed, reduced, and classified; some of this work may be farmed out to grid resources
  – Detailed analysis of new data vs. archived data is performed
  – Reanalysis of all data is done periodically
  – Data are stored in files and databases
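
The hierarchical caching idea can be illustrated with a small, hedged sketch: look for a dataset at the local tier, fall back to a regional cache, and finally to the central store, copying it into faster tiers as it is found. Tier names and contents are invented for illustration.

```python
# Hedged sketch of tiered data lookup: local cache -> regional cache -> central store.
# Tier names and contents are invented; only the lookup-and-fill pattern matters.

TIERS = [
    ("local",    {"file_b.root"}),
    ("regional", {"file_a.root", "file_b.root"}),
    ("central",  {"file_a.root", "file_b.root", "file_c.root"}),
]

def fetch(dataset):
    """Return the name of the tier the dataset was found in, filling faster tiers."""
    for level, (tier_name, contents) in enumerate(TIERS):
        if dataset in contents:
            # Copy the dataset into every faster tier so later reads are local.
            for _, faster_contents in TIERS[:level]:
                faster_contents.add(dataset)
            return tier_name
    raise FileNotFoundError(dataset)

if __name__ == "__main__":
    print(fetch("file_c.root"))   # found in the central store, then cached above it
    print(fetch("file_c.root"))   # now served from the local cache
```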

Page 18: Summary of 3DPAS


Some More Additional Applications

• Virtual Astronomy (Archetypical)
  – Services are orchestrated through a pipeline, including a data retrieval service that is used to share data across VO sites
  – Data are moved through the pipeline, and intermediate and final products can be stored in a Grid storage service
• Marine (Sea Mammal) Sensors (Infrastructural)
  – Data are brought to a central site when sensors periodically transmit
  – Stored data are analyzed using statistical techniques, then visualized with tools such as Google Earth
• Power Grids (Infrastructural)
  – Diverse streams arrive at a central utility private cloud at dynamic rates controlled by the application
  – A real-time event detection pipeline can trigger load curtailment operations (see the sketch below)
  – Data mining is performed on current and historical data for forecasting
  – Partial application execution on remote micro-grid sites is possible
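
As a concrete, hedged illustration of a real-time detection pipeline triggering curtailment, here is a small generator-based sketch; the threshold, reading format, and curtailment action are hypothetical.

```python
# Hedged sketch: consume a stream of meter readings and trigger a (hypothetical)
# load-curtailment action when demand exceeds a threshold.
import random
import time

CURTAILMENT_THRESHOLD_MW = 95.0

def meter_stream(n_readings=20):
    """Stand-in for diverse real-time streams arriving at the utility's cloud."""
    for _ in range(n_readings):
        yield {"timestamp": time.time(), "demand_mw": random.uniform(80.0, 110.0)}

def curtail_load(reading):
    """Placeholder for the real curtailment operation (e.g., signaling devices)."""
    print(f"CURTAIL: demand {reading['demand_mw']:.1f} MW exceeds threshold")

def detection_pipeline(stream):
    for reading in stream:
        if reading["demand_mw"] > CURTAILMENT_THRESHOLD_MW:
            curtail_load(reading)

if __name__ == "__main__":
    detection_pipeline(meter_stream())
```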

Page 19: Summary of 3DPAS


Even More Additional Applications

• Industrial Incident Notification and Response (Infrastructural)
  – Data are streamed from diverse sources, and sometimes manually entered into the system
  – Disaster detection causes additional information sources to be requested from that region and applications to be composed based on the available data
  – Some applications run on remote sites for data privacy
  – Escalation can bring more humans into the loop and trigger additional operations
• MODIS Data Processing (Traditional)
  – Data are brought into the system from various FTP servers (see the sketch below)
  – A pipeline of initial standardized processing steps is run on the data on clouds or HPC resources
  – Scientists can then submit executables that do further custom processing on subsets of the data, which likely includes some summarization processing (building graphs)
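
A hedged sketch of the ingest step: pull new files from an FTP server and hand them to a processing function. The host name, directory, and processing step are placeholders; the ftplib calls are from the Python standard library.

```python
# Hedged sketch: fetch new files from an FTP server and run a processing step on
# each. Host, directory, and the processing itself are placeholders.
from ftplib import FTP
from pathlib import Path

FTP_HOST = "ftp.example.org"        # hypothetical data server
REMOTE_DIR = "/modis/level1"        # hypothetical remote directory
STAGING = Path("staging")

def ingest():
    STAGING.mkdir(exist_ok=True)
    downloaded = []
    with FTP(FTP_HOST) as ftp:
        ftp.login()                              # anonymous login
        ftp.cwd(REMOTE_DIR)
        for name in ftp.nlst():
            local = STAGING / name
            if local.exists():
                continue                         # only fetch files we have not seen
            with open(local, "wb") as out:
                ftp.retrbinary(f"RETR {name}", out.write)
            downloaded.append(local)
    return downloaded

def standard_processing(path):
    """Stand-in for the standardized pipeline step (calibration, gridding, ...)."""
    print(f"processing {path} ({path.stat().st_size} bytes)")

if __name__ == "__main__":
    for f in ingest():
        standard_processing(f)
```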

Page 20: Summary of 3DPAS


3DPAS Vectors

• DPA vectors
  – Execution Unit
  – Communication
  – Coordination
  – Execution Environment
• What changes for D3 applications?
  – DPA already assumed distributed; data-intensive is somewhat orthogonal to the vectors; the last D is dynamic
• So, what can be dynamic?
  – Data (in value or type)
  – Application (for archetypical and infrastructural applications)
  – Execution Environment
• And how can the application respond?
  – All 3 vectors can change (under user control, or autonomically) – a characterization sketch follows this list
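
To show how an application might be characterized along these vectors, here is a small, hedged data-structure sketch; the field names and the example values for the climate application are illustrative only, not a formal 3DPAS classification.

```python
# Hedged sketch: a simple record for characterizing a D3 application along the
# DPA vectors plus its dynamic elements. Field values below are illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class D3Characterization:
    name: str
    execution_unit: str          # e.g., "independent analysis jobs"
    communication: str           # e.g., "files moved between sites"
    coordination: str            # e.g., "staged pipeline"
    execution_environment: str   # e.g., "HPC centers + data nodes"
    dynamic_elements: List[str] = field(default_factory=list)

climate = D3Characterization(
    name="Climate (CMIP)",
    execution_unit="per-model runs and per-researcher analyses",
    communication="files moved from data nodes to analysis sites",
    coordination="generation, storage, and analysis stages",
    execution_environment="distributed HPC centers and ESGF gateways/data nodes",
    dynamic_elements=["analysis resources", "data stored in data nodes"],
)

if __name__ == "__main__":
    print(climate)
```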

Page 21: Summary of 3DPAS


Infrastructure

• Software infrastructure to support D3 applications and users exists at three levels:
  – System-level software capabilities (e.g., notifications, file system consistency)
  – Middleware (e.g., databases, metadata servers)
  – Programming systems, services and tools (e.g., data-centric workflows)
• Strong connection between software infrastructure and execution units
  – Infrastructure supports the communication between, and coordination of, execution units, e.g., to allow co-scheduling
• What changes for D3 applications?
  – The boundary between infrastructure and application is often blurred
    o e.g., a catalog may be provided by the underlying infrastructure or implemented in the application
  – Sometimes the infrastructure requires knowledge of data models
    o e.g., to support semantic information integration, triggers, optimized data transport
• General need for infrastructure components to support (a notification sketch follows)
  – Data management: sources, storage, access, movement, discovery, notification, provenance
  – Data analysis: conversion, enrichment, analysis, workflow, calibration, integration
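
Notifications are one of the system-level capabilities named above; here is a minimal, hedged polling sketch that calls an analysis callback when new files appear in a watched directory. A real deployment would more likely use a proper notification service or OS-level file events; this only shows the "new data triggers analysis" pattern.

```python
# Hedged sketch: poll a directory and notify a callback about newly arrived data
# files. A production system would use a real notification service.
import time
from pathlib import Path

def watch(directory, on_new_file, interval_s=5.0, max_cycles=3):
    """Check a directory periodically and invoke on_new_file for unseen files."""
    seen = set()
    watched = Path(directory)
    watched.mkdir(exist_ok=True)
    for _ in range(max_cycles):                  # bounded loop so the sketch terminates
        for path in watched.iterdir():
            if path.is_file() and path not in seen:
                seen.add(path)
                on_new_file(path)                # notification delivered to the consumer
        time.sleep(interval_s)

def analyze(path):
    print(f"new data arrived: {path.name}")

if __name__ == "__main__":
    watch("incoming", analyze, interval_s=1.0)
```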

Page 22: Summary of 3DPAS


Programming Systems

• Pipelines/workflows are a key concept
• Loosely, there are 3 stages for many applications – data collection, data storage, data analysis
  – But the order varies: sometimes analysis is done during collection to reduce storage
• Some stages are built from legacy (heritage) applications
• Some applications don't include all stages (some stages happen elsewhere; data are just "there")
• Stream processing is also important to some applications (or some stages)
  – The complete data can never be stored, and can only be accessed once in time (a single-pass sketch follows this list)
• Issues that programming systems should address
  – Programming the provisioning of resources
  – Use of existing services, or building of new services
  – How to adapt to changes? Autonomics?
  – Recording provenance
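
To make the single-pass constraint concrete, here is a hedged sketch of a streaming statistic computed without storing the stream (Welford's online mean/variance); the stream source is a stand-in.

```python
# Hedged sketch: compute mean and variance over a stream in a single pass
# (Welford's algorithm), since the complete data can never be stored.
import random

def readings(n=1_000_000):
    """Stand-in for a data stream that is too large to retain."""
    for _ in range(n):
        yield random.gauss(0.0, 1.0)

def streaming_stats(stream):
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:                 # each value is seen exactly once
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance

if __name__ == "__main__":
    n, mu, var = streaming_stats(readings())
    print(f"n={n}, mean={mu:.4f}, variance={var:.4f}")
```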

Page 23: Summary of 3DPAS


Programming Systems (2)

• Possible change: replace ad hoc and scripted approaches with more formal workflow tools
  – Potential benefits: efficiency, productivity, reproducibility, increased software reuse, ability to add provenance tracking (a provenance sketch follows)
  – Potential issues: can application-specific knowledge be used by generic tools?
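
Provenance tracking is one of the benefits listed above; here is a hedged sketch of a decorator that records which step ran, with what inputs, and when. The record format is invented for illustration, not taken from any particular workflow tool.

```python
# Hedged sketch: a decorator that records simple provenance (step name, inputs,
# timestamps) for each workflow step; the record format is invented.
import functools
import time

PROVENANCE = []   # a real system would write to a database or provenance store

def tracked(step):
    @functools.wraps(step)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = step(*args, **kwargs)
        PROVENANCE.append({
            "step": step.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "started": start,
            "finished": time.time(),
        })
        return result
    return wrapper

@tracked
def calibrate(raw):
    return [x * 1.1 for x in raw]

@tracked
def summarize(calibrated):
    return sum(calibrated) / len(calibrated)

if __name__ == "__main__":
    print("summary:", summarize(calibrate([1.0, 2.0, 3.0])))
    for record in PROVENANCE:
        print(record["step"], "took",
              round(record["finished"] - record["started"], 6), "s")
```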

Page 24: Summary of 3DPAS


Conclusions

• D3 applications exist, and their number is increasing
• There are some similarities across some applications
  – Stages, streaming, dynamism, and adaptivity
  – This probably means there are generic abstractions that could be used
• Programming systems are somewhat ad hoc
• We want generic tools that
  – Allow applications to adapt to dynamism in various elements
    o e.g., developers can find and use available systems at runtime, and applications can run in the best location with respect to data sources
  – Provide good performance
• Further research is needed
  – How do we abstract the set of distributed systems to allow this?
  – What middleware and tools are needed?