Deliverable 11.1 – Slide Presentation, Version 1.0, 10 July 2001
www.eu-datagrid.org


The DataGrid Project

Edited by the DataGrid Dissemination Office, CNR

The European DataGrid is a project funded by the European Union to set up a computational and data-intensive grid of resources for the analysis of data from scientific exploration.

DataGrid Applications
• Provide production quality testbeds, using real-world applications with real data:
  – High Energy Physics: process the huge amount of data from the LHC experiments
  – Biology and Medical Imaging: sharing of genomic databases for the benefit of international cooperation; processing of medical images for medical collaborations
  – Earth Observation: access and analysis of atmospheric ozone data collected by satellites such as Envisat-1

High Energy Physics
[Map of the LHC near Geneva airport, showing the four experiments: Atlas, CMS, Alice, LHCb]
The LHC at CERN is an accelerator which brings protons and ions into head-on collisions at higher energies than ever achieved before. This will allow scientists to penetrate still further into the structure of matter and recreate the conditions prevailing in the early universe, just after the "Big Bang".

The LHC numbers
[Trigger/data-acquisition chain (ATLAS, Alice and LHCb shown in the figure):
• 40 MHz (40 TB/sec) → level 1 – special hardware
• 75 kHz (75 GB/sec) → level 2 – embedded processors
• 5 kHz (5 GB/sec) → level 3 – PCs
• 100 Hz (100 MB/sec) recorded]
• Storage: raw recording rate 0.1–1 GB/sec, accumulating at 5–8 PetaBytes/year; 10 PetaBytes of disk
• Processing: 7M SI95 units (~300 million MIPS)

CERN's Network in the World
• Europe: 267 institutes, 4603 users
• Elsewhere: 208 institutes, 1632 users

Biology Applications
The EU-DataGrid project's biology testbed explores strategies to facilitate the sharing of genomic databases for the benefit of international cooperation. It provides the right platform to test new grid-aware algorithms for comparative genomics.

Emerging needs of Medical Imaging
• Generalisation of the use of images for medical diagnosis and prognosis
• Huge amount of data produced by high-resolution imagers
• Remote processing of medical images
[Figure: remote processing of Magnetic Resonance Imaging]

Bio sites: accessing the data
[Map of the main biology sites: Barcelona, Lyon, Clermont-Ferrand, Montpellier, Padova, Stockholm, Uppsala, Helsinki, EBI]
• 8 main European sites
• 30,000 users

Earth Observations
ESA missions:
• about 100 GBytes of data per day (ERS 1/2)
• 500 GBytes for the next ENVISAT mission (2001)
DataGrid contributes to EO:
• enhance the ability to access high-level products
• allow reprocessing of large historical archives
• improve complex Earth-science applications (data fusion, data mining, modelling …)
Source: L. Fusco, June 2001

ENVISAT
• 3500 MEuro programme cost
• 10 instruments on board
• 200 Mbps data rate to ground
• 400 TBytes data archived/year
• ~100 "standard" products
• 10+ dedicated facilities in Europe
• ~700 approved science user projects

Envisat Facilities in Europe
[Diagram of the Envisat ground-segment facilities (PDHS-E, PDHS-K, PDCC, PDAS, LRAC, FMI and the S-, UK-, F-, I-, D- and E-PACs), covering acquisition, processing, dissemination and archiving of instrument data (ASAR, MERIS, AATSR, RA-2, MWR, DORIS, SCIAMACHY, GOMOS, MIPAS), plus mission planning, monitoring and control, user services co-ordination and product quality; ESA-procured and other PDS centres/stations are distinguished]
• 10+ European sites
• 4000+ researchers/users
• 15+ countries

DataGrid and Grid(s)
How DataGrid fits the Grid concept

The Grid
• Enable communities ("virtual organizations") to:
  – share geographically distributed resources as they pursue common goals
• in the absence of:
  – central control,
  – omniscience,
  – trust relationships
(Ian Foster & Carl Kesselman)

The Virtual Organisations
• A virtual organisation is a set of participants, resources and resource-sharing rules:
  – The participants: dynamic collections of individuals and institutions with a common problem to solve
  – The resources: computers, software, data and instrumentation to solve the problem
  – The resource-sharing rules: definition of what is shared, who is allowed to share, and the conditions under which sharing occurs
The Grid addresses the Virtual Organisations problem.

DataGrid vs Grid
The GRID metaphor:
• Unlimited, ubiquitous distributed computing
• Transparent access to multi-petabyte distributed databases
• Easy to plug in
• Hidden complexity of the infrastructure
• Analogy with the electrical power GRID
The DataGrid requirements:
• Enormous computing requirements
• Manage vast quantities of data
• Researchers spread all over the world
• User-friendly interfaces

Networked Computing Models
• Distributed Computing – synchronous processing
• High-Throughput Computing – asynchronous processing
• On-Demand Computing – dynamic resources
• Data-Intensive Computing – databases
• Collaborative Computing – science
Ian Foster and Carl Kesselman, editors, "The Grid: Blueprint for a New Computing Infrastructure," Morgan Kaufmann, 1999

Grid related initiatives
• Global Grid Forum: www.gridforum.org
• Europe
  – European Grid Forum: www.egrid.org
  – EuroGrid (EU project): uniform access to parallel supercomputing resources
  – DAMIEN: Distributed Applications and Middleware for Industrial use of European Networks
  – UNICORE: The UNiform Interface to Computer Resources
• America
  – GLOBUS: www.globus.org/
  – CONDOR: www.cs.wisc.edu/condor/
  – PPDG: www.cacr.caltech.edu/ppdg/
  – GRIPHYN: www.griphyn.org
  – NASA: www.nas.nasa.gov/ipg/
• Asia-Pacific
  – StarTAP
  – TransPAC

The DataGrid Project
In brief

The DataGrid numbers
• 6 main contractors, 15 assistant contractors
• 9.8 million € funded by the EU
• 150 Full Time Equivalents over 3 years
• Flagship project of the EU IST GRID programme
• Project started January 2001, duration 3 years

Main Partners

• CERN - France

• CNRS - France

• ESA/ESRIN - Italy

• INFN - Italy

• NIKHEF – The Netherlands

• PPARC - UK

Associated Partners
Research and Academic Institutes:
• CESNET (Czech Republic)
• Commissariat à l'énergie atomique (CEA) – France
• Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI)
• Consiglio Nazionale delle Ricerche (Italy)
• Helsinki Institute of Physics – Finland
• Institut de Fisica d'Altes Energies (IFAE) – Spain
• Istituto Trentino di Cultura (IRST) – Italy
• Konrad-Zuse-Zentrum für Informationstechnik Berlin – Germany
• Royal Netherlands Meteorological Institute (KNMI)
• Ruprecht-Karls-Universität Heidelberg – Germany
• Stichting Academisch Rekencentrum Amsterdam (SARA) – Netherlands
• Swedish Natural Science Research Council (NFR) – Sweden
Industry Partners:
• Datamat (Italy)
• IBM (UK)
• Compagnie des Signaux (France)

DataGrid Goals
• Develop open source middleware for fabric & grid management
• Deploy a large-scale testbed
• Set up production quality demonstrations
• Collaborate with and complement other European and US projects
• Involve industry to create the critical mass of interest needed for the success of the project

Technical challenges
• Transparent utilisation of heterogeneous computing and storage resources
  – geographically distributed
  – to process huge amounts of data
• Test the usability and effectiveness of novel computing models on a large-scale testbed
• Demonstrate production quality operations on real, large use-cases

The DataGrid Middleware
• Developed in the framework of the Global Grid Forum activity
• Relies on and integrates contributions from Globus, Condor and other grid-related projects
• Integrates and re-uses existing open source software
• DataGrid middleware is distributed as open source software

The Middleware Approach
• Toolkit and services addressing key technical problems
  – Modular "bag of services" model
  – Not a vertically integrated solution
  – Can be applied to many application domains
• Inter-domain issues, rather than clustering
  – Integration of intra-domain solutions

DataGrid Middleware Services
• Focus on workload, data, monitoring, fabric and mass storage management services
[Diagram: workload management, data management, monitoring services, fabric management and mass storage management, built on top of other Grid middleware services (information, security)]

Project Structure

Working Groups
[Diagram: Applications, Middleware, Infrastructure/Testbed and Management working groups]
• The DataGrid project is divided into 12 Work Packages distributed across four Working Groups.

WG Description
• The Middleware Working Group coordinates the development of the software modules, leveraging existing and long-tested open standard solutions. Five parallel development teams implement the software: job scheduling, data management, grid monitoring, fabric management and mass storage management.
• The Infrastructure Working Group focuses on the integration of the middleware software with systems and networks, to provide testbeds that demonstrate the effectiveness of DataGrid in production quality operations over high-performance networks.
• The Applications Working Group exploits the project developments to process the large amounts of data produced by experiments in the fields of High Energy Physics (HEP), Earth Observation (EO) and Biology.
• The Management Working Group is in charge of coordinating the entire project on a day-to-day basis and of disseminating the results among industries and research institutes.


WorkPackage List (I)

WorkPackage                    Chair(s)
Grid Workload Management       Francesco Prelz (INFN)
Grid Data Management           Ben Segal (CERN)
Grid Monitoring Services       Robin Middleton (RAL)
Fabric Management              Olof Barring (CERN)
Mass Storage Management        John Gordon (RL)

WorkPackage List (II)

WorkPackage                        Chair(s)
Integration TestBed                Francois Etienne (CNRS)
Network Services                   Pascale Primet (CNRS)
High Energy Physics Applications   Federico Carminati (CERN)
Earth Observation Applications     Julian Linford (ESA-ESRIN)
Biology Applications               Christian Micheau (CNRS)
Dissemination                      Maurizio Lancia (CNR)
Project Management                 Fabrizio Gagliardi (CERN)

Project management
• Project Manager (PM): Fabrizio Gagliardi (CERN)
• Project Office:
  – supports the PM in the day-to-day operational management of the project
• Project Management Board (PMB):
  – one representative of each main partner plus the PM
  – co-ordinates and manages items that affect the contractual terms of the project
• Project Technical Board (PTB):
  – one representative from each Work Package
  – takes decisions on technical issues
• Architecture Task Force (ATF):
  – issues guidelines about the global design

Overview of Data Management

[Architecture diagram, from local computing up to the Grid:
• Local computing: Local Application, Local Database
• Grid application layer: Job Management, Data Management, Metadata Management, Object to File Mapper
• Collective services: Information & Monitoring, Replica Manager, Grid Scheduler, Replica Optimization, Replica Catalog Interface
• Underlying Grid services: Computing Element Services, Storage Element Services, Replica Catalog, Authorisation/Authentication and Accounting, SQL Database Service, Service Index
• Fabric services: Configuration Management, Node Installation & Management, Monitoring and Fault Tolerance, Resource Management, Fabric Storage Management]

Data Management Tasks (current status in parentheses)
• Data Transfer – efficient, secure and reliable transfer of data between sites (GridFTP)
• Data Replication – replicate data consistently across sites (GDMP)
• Data Access Optimization – optimize data access using replication and remote open (complex, R&D)
• Data Access Control – authentication, ownership, access rights on data (design not final)
• Metadata Storage – Grid-wide persistent metadata store for all kinds of Grid information (SQL Database)
Data granularity: files, not objects.
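The slides do not define an API for these tasks, but the split can be illustrated with a short sketch. Everything below (the class `DataManagementClient` and its methods) is hypothetical and only mirrors the task list above; it is not the DataGrid middleware interface.

```python
# Hypothetical sketch of the task split above; none of these names are
# part of the actual DataGrid middleware.
from dataclasses import dataclass, field


@dataclass
class DataManagementClient:
    """Illustrates the five task areas; real implementations are separate services."""
    metadata: dict = field(default_factory=dict)   # Metadata Storage (SQL Database in practice)
    acl: dict = field(default_factory=dict)        # Data Access Control (design not final)

    def transfer(self, source_pfn: str, dest_pfn: str) -> None:
        # Data Transfer: efficient, secure, reliable movement between sites
        # (GridFTP in the project); here just a placeholder.
        print(f"copy {source_pfn} -> {dest_pfn}")

    def replicate(self, lfn: str, source_pfn: str, dest_se: str) -> str:
        # Data Replication: consistent copies across sites (GDMP in the project).
        dest_pfn = f"PFN://{dest_se}/{lfn.split('/', 3)[-1]}"
        self.transfer(source_pfn, dest_pfn)
        return dest_pfn

    def best_replica(self, pfns: list[str], local_se: str) -> str:
        # Data Access Optimization: prefer a replica on the local StorageElement.
        local = [p for p in pfns if f"//{local_se}/" in p]
        return local[0] if local else pfns[0]
```

Note that, consistent with the slide, the unit handled everywhere is a file, not an object.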

Data Grid Storage Model
[Diagram: three sites, each with a Grid Compute Element and Grid Storage (disk array); two of the sites are backed by mass-storage systems (HSM: HPSS and Castor); a Replication Service links the Grid Storage at the different sites]
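As a reading aid for the diagram, here is a minimal, hypothetical data model of the layout it shows; the names (`Site`, `GridStorage`, `hsm`) and the assignment of HSMs to particular sites are illustrative, not part of the DataGrid software.

```python
# Minimal, hypothetical model of the storage layout in the diagram above.
from dataclasses import dataclass
from typing import Optional


@dataclass
class GridStorage:
    disk_array: str              # front-end disk seen by the Grid
    hsm: Optional[str] = None    # optional mass-storage back end (e.g. HPSS, Castor)


@dataclass
class Site:
    name: str
    compute_element: str         # Grid Compute Element at the site
    storage: GridStorage


# Three sites as in the figure; which site uses which HSM is illustrative.
# A Replication Service (not modelled here) moves files between the
# GridStorage elements of the different sites.
sites = [
    Site("site 1", "CE1", GridStorage("disk array", hsm="HPSS")),
    Site("site 2", "CE2", GridStorage("disk array")),
    Site("site 3", "CE3", GridStorage("disk array", hsm="Castor")),
]
```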

Replication Service
• Replica Catalog – store logical-to-physical file mappings and metadata
• Replica Manager – interface to create/delete replicas on demand
• Replica Selection – find the closest replica based on certain criteria
• Access Optimization – automatic creation and selection of replicas for whole jobs
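The slide lists the components without defining them. The sketch below is a hypothetical illustration of Replica Selection only ("find the closest replica based on certain criteria"); the per-StorageElement cost table standing in for those criteria is an assumption.

```python
# Hypothetical replica-selection sketch: pick the "closest" physical copy of a
# logical file, where closeness is given by an assumed per-StorageElement cost.
def select_replica(pfns: list[str], cost_of_se: dict[str, float]) -> str:
    """Return the PFN whose StorageElement has the lowest access cost."""
    def se_host(pfn: str) -> str:
        # PFN://hostname/path  ->  hostname
        return pfn.split("//", 1)[1].split("/", 1)[0]

    return min(pfns, key=lambda pfn: cost_of_se.get(se_host(pfn), float("inf")))


replicas = [
    "PFN://se.cern.ch/data/run10/event24.dat",     # hypothetical SE hosts
    "PFN://se.nikhef.nl/data/run10/event24.dat",
]
# Assumed costs, e.g. derived from monitoring information.
costs = {"se.cern.ch": 0.2, "se.nikhef.nl": 5.0}
print(select_replica(replicas, costs))   # -> the CERN replica, the cheaper one
```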

Terminology
• StorageElement:
  – any storage system with a Grid interface
  – supports GridFTP
• Logical File Name (LFN):
  – globally unique
  – LFN://hostname/string
  – hostname = virtual organization id; use of the hostname guarantees uniqueness
  – e.g. LFN://cms.org/analysis/run10/event24.dat
• Physical File Name (PFN):
  – PFN://hostname/path
  – hostname = StorageElement host
• Transport File Name (TFN):
  – a URI, includes the protocol
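To make the naming scheme concrete, here is a small sketch that builds the three name types from their parts. The helper functions are illustrative only; the LFN example is the one from the slide, while the SE host names are hypothetical.

```python
# Illustrative helpers for the three name types described above.
def lfn(vo: str, path: str) -> str:
    # Logical File Name: the VO id used as "hostname" makes the name globally unique.
    return f"LFN://{vo}/{path}"

def pfn(se_host: str, path: str) -> str:
    # Physical File Name: names one concrete copy on a StorageElement.
    return f"PFN://{se_host}/{path}"

def tfn(protocol: str, se_host: str, path: str) -> str:
    # Transport File Name: a URI that also carries the access protocol.
    return f"{protocol}://{se_host}/{path}"

print(lfn("cms.org", "analysis/run10/event24.dat"))
# LFN://cms.org/analysis/run10/event24.dat   (example from the slide)
print(pfn("se01.cern.ch", "cms/run10/event24.dat"))             # hypothetical SE host
print(tfn("gsiftp", "se01.cern.ch", "cms/run10/event24.dat"))   # GridFTP transport
```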

Replica Catalog
• Replica Catalog – a database containing mappings of a logical filename to one or more physical filenames
• Design goals:
  – scalable: the CMS experiment estimates that by 2006 their replica catalog will contain 50 million files spread across dozens of institutions
  – decentralized: local data should always be in local catalogs! If an application wants to access a local replica, it should not have to query a replica catalog on the other side of the country/planet
  – fault tolerant

ReplicaCatalog Design
• There should be exactly one ReplicaCatalog for each StorageElement
• All Replica Catalogs for a given virtual organization are linked together, probably in a tree structure:
  – "leaf" catalogs contain a mapping of LFN to PFN
  – "non-leaf" catalogs contain only a pointer to another replica catalog
• All ReplicaCatalogs (leaf and non-leaf) have identical client APIs
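A minimal sketch of the design just described, under the stated assumptions: one client API at every level, leaf catalogs holding the LFN-to-PFN mappings, non-leaf catalogs only pointing at child catalogs. Class and method names are hypothetical, not the DataGrid API.

```python
# Hypothetical sketch of the leaf / non-leaf replica catalog design.
from abc import ABC, abstractmethod


class ReplicaCatalog(ABC):
    """Identical client API for every level of the tree."""

    @abstractmethod
    def lookup(self, lfn: str) -> list[str]:
        """Return all known PFNs for this LFN (possibly empty)."""


class LeafCatalog(ReplicaCatalog):
    """One per StorageElement: holds the actual LFN -> PFN mappings."""

    def __init__(self) -> None:
        self.mappings: dict[str, list[str]] = {}

    def add(self, lfn: str, pfn: str) -> None:
        self.mappings.setdefault(lfn, []).append(pfn)

    def lookup(self, lfn: str) -> list[str]:
        return list(self.mappings.get(lfn, []))


class NonLeafCatalog(ReplicaCatalog):
    """Site-level or top-level: only points to other catalogs."""

    def __init__(self, children: list[ReplicaCatalog]) -> None:
        self.children = children

    def lookup(self, lfn: str) -> list[str]:
        # Redirect the query down the tree and merge the results.
        return [pfn for child in self.children for pfn in child.lookup(lfn)]
```

Because leaf and non-leaf catalogs expose the same `lookup` call, an application can query any level of the hierarchy with the same code, as the next two slides describe.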

Sample Hierarchy
[Diagram: Grid applications query a hierarchy of replica catalogs.
• Top-level Replica Catalog – contains LFN -> site-RC mappings for this VO
• Site Replica Catalogs (e.g. Site 1 RC, Site 2 RC) – contain LFN -> SE-RC mappings for that site only
• StorageElement Replica Catalogs (e.g. Site 1 SE 1 RC, Site 1 SE 2 RC, Site 2 SE RC) – contain LFN -> PFN mappings for that SE only
Redirects and updates flow between the levels.]
Key points:
+ Any level in the hierarchy can be queried with the same API
+ All updates are performed at the SE-RC level, and automatically propagated up the tree
+ Queries are automatically redirected up the tree if the data is not at a given level

Sample Usage
• Queries can start at any catalog
  – queries can be redirected up and/or down the tree until all instances are found
• A common application query is: "Is there a replica on my 'local' StorageElement?"
  – only the local ReplicaCatalog needs to be accessed to answer this
  – this will reduce the load on the top-level RC
• ReplicaCatalog updates (i.e. createNewReplica()) are done at leaf nodes only, and then propagated up the tree in batches (e.g. every few minutes)
  – this will also reduce the load on the top-level server
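The fragment below is a hypothetical, self-contained illustration of the two usage patterns on this slide: answering "is there a local replica?" from the local catalog only, and batching leaf-level updates before propagating them up the tree. The function names, data structures and batching interval are assumptions.

```python
# Hypothetical sketch of the usage patterns described above.
local_se_catalog: dict[str, list[str]] = {}   # leaf catalog: LFN -> PFNs on this SE
pending_updates: list[tuple[str, str]] = []   # updates not yet propagated upwards


def has_local_replica(lfn: str) -> bool:
    # "Is there a replica on my local StorageElement?" -- answered without
    # touching the top-level catalog, which keeps the load off it.
    return bool(local_se_catalog.get(lfn))


def create_new_replica(lfn: str, pfn: str) -> None:
    # Updates happen at the leaf only; propagation is deferred and batched.
    local_se_catalog.setdefault(lfn, []).append(pfn)
    pending_updates.append((lfn, pfn))


def propagate_updates(send_to_parent) -> None:
    # Called periodically (e.g. every few minutes) so the parent catalogs
    # receive one batch instead of one message per new replica.
    if pending_updates:
        send_to_parent(list(pending_updates))
        pending_updates.clear()


create_new_replica("LFN://cms.org/analysis/run10/event24.dat",
                   "PFN://se01.cern.ch/cms/run10/event24.dat")
print(has_local_replica("LFN://cms.org/analysis/run10/event24.dat"))   # True
propagate_updates(lambda batch: print(f"propagating {len(batch)} update(s) up the tree"))
```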

SQLDatabase
• Convenient, scalable and efficient storage, retrieval and query of data held in any type of local or remote RDBMS
• May be used for metadata
• Core functionality is SQL insert, delete, update and query
• Can be invoked from a command line tool, a Web browser, and a programming language API
• A well-defined, language-, platform- and RDBMS-neutral network protocol is used between client and server
• Unified Grid-enabled front end to relational databases
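The slide describes the service's functionality, not its interface. The sketch below is therefore purely hypothetical: a generic client sending SQL-like metadata operations to a remote, RDBMS-neutral service over HTTP. The endpoint, class and method names are all assumptions, not the DataGrid SQLDatabase API or protocol.

```python
# Purely hypothetical client sketch for a Grid-enabled SQL metadata service;
# the real DataGrid service defines its own protocol and tools.
import json
import urllib.request


class MetadataClient:
    def __init__(self, endpoint: str) -> None:
        self.endpoint = endpoint   # assumed HTTP endpoint of the remote service

    def _post(self, payload: dict) -> dict:
        req = urllib.request.Request(
            self.endpoint,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def insert(self, table: str, row: dict) -> dict:
        # Core functionality: SQL insert, expressed in an RDBMS-neutral way.
        return self._post({"op": "insert", "table": table, "row": row})

    def query(self, table: str, where: dict) -> dict:
        # Core functionality: SQL query.
        return self._post({"op": "query", "table": table, "where": where})


# Example usage (hypothetical endpoint):
# client = MetadataClient("https://metadata.example.org/sqldb")
# client.insert("files", {"lfn": "LFN://cms.org/analysis/run10/event24.dat", "size": 123456})
# client.query("files", {"lfn": "LFN://cms.org/analysis/run10/event24.dat"})
```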

GDMP
• Grid Data Management Pilot; to be renamed Grid Data Mirroring Package
• Initial usage: mirror Objectivity DB files to remote sites using GSI and GridFTP
• Implements basic prototypes of all services or uses Globus equivalents (such as the replica catalog)
• Has been extended to mirror ROOT files and flat files

What next?
• Implement the Replica Catalog based on the design
• Design & implement the Replica Manager, Replica Selection and Cost Estimation
• Update the security framework
• More on databases and Object-to-File mapping
• R&D in Grid query optimization
Step-by-step evolution of services; research & development.

For More Information
• http://www.eu-datagrid.org/
• http://cern.ch/grid-data-management/