www.inf.ed.ac.uk
Scrying the next generation of data-intensive research infrastructure
Research at the Data-Intensive Research Group of the University of Edinburgh
Paul Martin
OSDC-PIRE 2014, University of Amsterdam
Edinburgh
School of Informatics
Future research infrastructures…
• …must support a large range of different research interactions.
– Data collection, curation, processing and publication.
– Curation of models and methods.
– Community networks and cross-infrastructure interactions.
• …must support a diverse cast of research actors.
– Investigators, empiricists, theorists, librarians, engineers, etc.
• …must balance conflicting issues:
– Openness and accountability.
– Preservation and accessibility.
– Interoperability and efficacy.
– Oversight and autonomy.
The Data-Intensive Research Group
• Part of the Centre for Intelligent Systems and their Applications in the School of Informatics at the University of Edinburgh.
• Research agenda focuses on how best to address current and future data-intensive research problems:
– How to manage large volumes of data;
– How to process distributed data in different environments;
– How to manage the code and tools used to handle data.
• Recent emphasis has been on workflow-based systems: languages and tools for workflow composition, services for deploying workflows, workflow optimisation and provenance gathering, etc…
• …but also, infrastructure modelling, scientific gateways, commodity supercomputing and anything else that catches our interest.
Supporting Research Interactions
• Support a diverse range of interactions by domain experts at the high level…
• …by providing standard interchange formats…
• …that sit atop a heterogeneous array of execution platforms.
[Figure: layered architecture diagram. Actors: Domain Experts, Data-Analysis Experts, Data-Intensive Engineers, Data Curators. Levels: Tool level, Broker level, Enactment level. Dimensions: user and application diversity, system complexity. Other labels: user tools, execution platforms, registries, repositories, optimisation, logical schemata, gateways, component mappings, observations, virtualisation.]
VERCE
• Virtual Earthquake and Seismology Research Community e-Science Environment in Europe.
• Design, build and integrate components for data processing in the seismology domain.
– Streamline the process of configuring and conducting several standard types of computational task.
– Open facilities for the broader community.
– Focus on particular ‘data-intensive’ and ‘HPC’ use-cases.
• ‘Satellite’ project of EPOS (European Plate Observing System).
– Contribute to EPOS Core Services.
VERCE Overview
VERCE Principles
VERCE Technology Stack (c.2014)
[Figure: VERCE platform of data-intensive services and applications. Labels: VERCE scientific gateway; dissemination and training; catalogues & registries; integrated tools; portals; community & user support; component repositories; enactment layer of services and processing elements; grid infrastructure; HPC infrastructure; network infrastructure; data infrastructure; data archives.]
Technology stack
– Web portal: Liferay, gUse
– Workflow specification: WS-PGRADE, Dispel4Py
– Deployment: MPI, Storm
– Data infrastructure: ArcLink, GridFTP, iRODS, HDFS
– Grid/HPC infrastructure: Globus, UNICORE
Dispel4Py
• Python-based implementation of DISPEL (Data-Intensive Systems Process Engineering Language).
– Used to describe distributed data-streaming workflows at a logical level.
– Wraps Python code into Processing Elements (PEs; initial focus on seismology applications).
– Workflow graph can be deployed on various platforms (currently Storm and MPI).
• Principles of Dispel:
– Inline specification of new PEs as compositions of existing PEs.
– Strong typing for both language and dataflow with additional semantic (domain) annotation.
– Work in progress…
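The idea of wrapping Python code into Processing Elements and connecting them into a logical workflow graph can be illustrated with a minimal sketch. This is not the Dispel4Py API: the classes `ProcessingElement` and `WorkflowGraph`, the function `run_sequentially` and the three example PEs are all hypothetical names invented here, and the "enactment" is a single-process toy rather than a Storm or MPI deployment.

```python
from collections import defaultdict, deque


class ProcessingElement:
    """Hypothetical stand-in for a PE: wraps a plain Python function."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def process(self, item):
        # A PE may emit zero, one or many items per input item.
        return list(self.func(item))


class WorkflowGraph:
    """Logical (platform-independent) description of a streaming workflow."""

    def __init__(self):
        self.edges = defaultdict(list)

    def connect(self, source, target):
        self.edges[source].append(target)


def run_sequentially(graph, root, inputs):
    """Toy enactment: push each input through the graph in one process."""
    results = defaultdict(list)
    queue = deque((root, item) for item in inputs)
    while queue:
        pe, item = queue.popleft()
        for out in pe.process(item):
            targets = graph.edges.get(pe, [])
            if not targets:
                results[pe.name].append(out)  # sink PE: collect its output
            for target in targets:
                queue.append((target, out))
    return dict(results)


# Three illustrative PEs (names and functions are invented for this sketch).
scale = ProcessingElement("scale", lambda x: [x * 2])
keep_large = ProcessingElement("keep_large", lambda x: [x] if x > 4 else [])
label = ProcessingElement("label", lambda x: [f"value={x}"])

graph = WorkflowGraph()
graph.connect(scale, keep_large)
graph.connect(keep_large, label)

output = run_sequentially(graph, scale, [1, 2, 3, 4])
print(output)
```

Because the graph only records which PE feeds which, the same description could in principle be handed to a different enactment function, which is the separation between logical workflow and deployment platform that the slide describes.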
Dispel4Py workflow illustration
[Figure: example workflow graph. Node labels: TaskScheduler, DataGenerator, TupleBuild, CorroboratedQuery, TupleBurst, TupleSchema, TypeConverter, ForecastModeller, Warning, Results. String labels: "uk.org.UoE.dbA", "uk.org.UoE.dbB", "uk.org.UoE.dbC", "Forecast Results".]
Dispel4Py lifecycle
ENVRI
• Common Operations of Environmental Research Infrastructures.
• Initiative to promote interoperability between ESFRI projects in the Environmental Cluster.
– Model characteristics of environmental research infrastructures to identify commonalities and gaps.
– Provide tools and services for data discovery and integration.
– Improve social links between ESFRI and affiliated projects.
• Part of a general strategic effort to simplify the construction of bespoke infrastructure by pooling expertise and resources.
ENVRI Requirements
[Figure: ENVRI Reference Model functions, arranged around five subsystems: Data Acquisition, Data Curation, Data Access, Data Processing and Community Support. Function labels, in extraction order: data collection; instrument access; process control; instrument monitoring; instrument configuration; instrument integration; instrument calibration; configuration logging; parameter visualisation; realtime parameter visualisation; realtime data collection; data sampling; noise reduction; data transmission; realtime data transmission; data transmission monitoring; data quality checking; data quality verification; data identification; data cataloguing; data product generation; data versioning; workflow enactment; data storage & preservation; data replication; replica synchronisation; access control; resource annotation; data annotation; metadata harvesting; resource registration; metadata registration; identifier registration; sensor registration; data conversion; data compression; data publication; data citation; semantic harmonisation; data discovery and access; data visualisation; data assimilation; data analysis; data mining; data extraction; scientific modelling & simulation; scientific workflow enactment; scientific visualisation; service naming; data processing control; data process monitoring; authentication; authorisation; accounting; user registration; instant messaging; interactive visualisation; event notification.]
ENVRI Reference Model
• A standard abstract model for environmental research infrastructures.
• Founded on RM-ODP (Reference Model for Open Distributed Processing).
– Standard for modelling distributed systems.
– Viewpoint based: Enterprise, Information, Computational, Engineering and Technology.
– Support for UML-style design.
• Current model iteration based on core ‘data pipeline’ (acquisition, curation, access).
– Lightweight modelling of Enterprise (Science), Information and Computational Viewpoints.
– Main study cases: EISCAT_3D, EPOS and ICOS.
ENVRI Reference Model Example
• Example of raw data collection from the computational viewpoint:
[Figure: computational-viewpoint diagram spanning data acquisition, data curation and community support. Components: acquisition service, data transfer service, instrument controller, raw data collector, data store controller, PID service, catalogue service, security service, field laboratory. Interactions: prepare data transfer, configure instrument, deliver raw data, import data for curation, update records, new transporter, acquire identifier, retrieve data, update catalogues, update registry, authorise action.]
EFFORT
• Earthquake and Failure Forecasting in Real Time.
• Project to monitor rock failure experiments in real time.
– Rock samples are subjected to continued pressure in laboratory conditions.
– Stress leads to deformation, leading to sudden failure.
– Models of rock failure may apply to plate deformation and volcanic events.
• Need the ability to collect data from remote experiments continuously and reliably, relate it to proposed models and provide visualisations on demand.
• Project expanded to build a standard library for volcanology and rock physics analyses (VarPy).
EFFORT system overview
GeWWE
• Generic Web-based Workflow Editor.
• Project to build a multi-target workflow editing tool.
– Thesis is that workflows are always built from the same fundamental components (standard schema).
– Average user is not keen to learn any specific workflow programming language (like Dispel…).
– Can map workflows to a number of target languages / platforms.
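The claim that one standard schema can be mapped to several target languages can be sketched as follows. This is not the GeWWE schema or code: the `Workflow` class and the two exporters are hypothetical, and the two targets (Graphviz DOT text and a JSON document) are chosen only because they are simple, well-known formats.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """Hypothetical generic schema: named tasks plus directed data links."""
    tasks: list = field(default_factory=list)
    links: list = field(default_factory=list)  # (source, target) pairs

    def add_task(self, name):
        self.tasks.append(name)
        return self

    def add_link(self, source, target):
        if source not in self.tasks or target not in self.tasks:
            raise ValueError("both ends of a link must be known tasks")
        self.links.append((source, target))
        return self


def to_dot(workflow):
    """Target 1: render the workflow as Graphviz DOT text."""
    lines = ["digraph workflow {"]
    lines += [f'  "{t}";' for t in workflow.tasks]
    lines += [f'  "{s}" -> "{t}";' for s, t in workflow.links]
    lines.append("}")
    return "\n".join(lines)


def to_json(workflow):
    """Target 2: render the same workflow as a JSON document."""
    doc = {
        "tasks": workflow.tasks,
        "links": [{"from": s, "to": t} for s, t in workflow.links],
    }
    return json.dumps(doc, sort_keys=True)


# One workflow description, two target representations.
wf = Workflow()
wf.add_task("read").add_task("filter").add_task("plot")
wf.add_link("read", "filter").add_link("filter", "plot")

dot_text = to_dot(wf)
json_text = to_json(wf)
print(dot_text)
print(json_text)
```

The user edits only the generic structure; each exporter is a separate, replaceable function, so adding a new target does not change how workflows are built.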
GeWWE Schema
GeWWE screenshot
Other Projects
• EDIM1 – commodity data-brick computing.
• TerraCorrelator – doing data-intensive geoscience.
• DECIPHER – quasi-anonymous analysis of medical data.