DataGrid Deliverable 11.1: Slide Presentation, Version 1.0, 10 July 2001 (www.eu-datagrid.org)
The DataGrid Project
Edited by the DataGrid Dissemination Office, CNR
The DataGrid Project
The European DataGrid is a project
funded by the European Union to set up
a computational and data-intensive grid
of resources for the analysis of data
from scientific exploration
DataGrid Applications
• Provide production-quality testbeds, using real-world applications with real data:
• High Energy Physics – process the huge amount of data from the LHC experiments
• Biology and Medical Imaging – sharing of genomic databases for the benefit of international cooperation; processing of medical images for medical collaborations
• Earth Observations – access and analysis of atmospheric ozone data collected by satellites such as Envisat-1
High Energy Physics
[Map of the LHC ring near Geneva airport, showing the four experiments: ATLAS, CMS, ALICE and LHCb]
The LHC at CERN is an accelerator which brings protons and ions into head-on collisions at higher energies than ever achieved before. This will allow scientists to penetrate still further into the structure of matter and recreate the conditions prevailing in the early universe, just after the "Big Bang".
The LHC numbers

Trigger and data acquisition chain:
• level 1 – special hardware: 40 MHz (40 TB/sec)
• level 2 – embedded processors: 75 kHz (75 GB/sec)
• level 3 – PCs: 5 kHz (5 GB/sec)
• data recording: 100 Hz (100 MB/sec)

Storage: raw recording rate 0.1–1 GB/sec, accumulating at 5–8 PetaBytes/year; 10 PetaBytes of disk
Processing: 7M SI95 units (~300 million MIPS)
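As a quick sanity check of how these figures fit together, the short sketch below assumes roughly 1 MB per recorded event and about 10^7 live seconds of data taking per year; both assumptions are ours, not from the slide.

```python
# Back-of-the-envelope check of the trigger rates and storage figures above.
# Assumed (not from the slide): ~1 MB per event, ~1e7 live seconds per year.

MB = 1e6            # bytes
event_size = 1 * MB

levels = [
    ("level 1 (special hardware)", 40e6),    # accepted rate in Hz
    ("level 2 (embedded processors)", 75e3),
    ("level 3 (PCs)", 5e3),
    ("data recording", 100),
]

for name, rate_hz in levels:
    gb_per_s = rate_hz * event_size / 1e9
    print(f"{name}: {rate_hz:g} Hz -> {gb_per_s:g} GB/s")

# Yearly accumulation at the 100 Hz recording stage, per experiment:
live_seconds = 1e7
pb_per_year = 100 * event_size * live_seconds / 1e15
print(f"~{pb_per_year:.0f} PB/year per experiment at 100 Hz")
# Several experiments, and raw rates up to 1 GB/s, give the quoted 5-8 PB/year.
```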
CERN’s Network in the World
Europe: 267 institutes, 4603 users. Elsewhere: 208 institutes, 1632 users.
Biology Applications
The EU-DataGrid project's biology testbed explores strategies to facilitate the sharing of genomic databases for the benefit of international cooperation. It provides the right platform to test new grid-aware algorithms for comparative genomics
Emerging needs of Medical Imaging
• Generalized use of images for medical diagnosis and prognosis
• Huge amount of data produced by high resolution imagers
• Remote processing of medical images
Remote processing of Magnetic Resonance Imaging
Bio sites: accessing the data
[Map of the main bio sites: Barcelona, Lyon, Clermont-Ferrand, Montpellier, Padova, Stockholm, Uppsala, Helsinki, EBI]
• 8 main European sites
• 30,000 users
Earth Observations
ESA missions:
• about 100 GBytes of data per day (ERS 1/2)
• 500 GBytes for the next ENVISAT mission (2001)

DataGrid contributes to EO:
• enhance the ability to access high-level products
• allow reprocessing of large historical archives
• improve Earth science complex applications (data fusion, data mining, modelling …)
Source: L. Fusco, June 2001
ENVISAT
• 3500 MEuro programme cost
• 10 instruments on board
• 200 Mbps data rate to ground
• 400 TBytes data archived/year
• ~100 “standard” products
• 10+ dedicated facilities in Europe
• ~700 approved science user projects
Envisat Facilities in Europe
[Diagram of the ENVISAT ground segment: acquisition, processing, dissemination and archiving centres (PDHS-E, PDHS-K, PDCC, PDAS, LRAC) plus the national Processing and Archiving Centres (UK-PAC, F-PAC, I-PAC, D-PAC, S-PAC, E-PAC) handling the instrument products (RA-2, MWR, DORIS, ASAR, MERIS, AATSR, SCIAMACHY, GOMOS, MIPAS)]
• 10+ European sites
• 4000+ researchers/users
• 15+ countries
DataGrid and Grid(s)
How DataGrid fits the Grid concept
The Grid
• Enable communities (“virtual organizations”) to:
– share geographically distributed resources as they pursue common goals
• in the absence of:
– central control
– omniscience
– trust relationships
(Ian Foster & Carl Kesselman)
The Virtual Organisations
• A virtual organisation is a set of participants, resources and resource-sharing rules:
– The participants: dynamic collections of individuals and institutions with a common problem to solve
– The resources: computers, software, data, instrumentation to solve the problem
– The resource-sharing rules: definition of what is shared, who is allowed to share, and the conditions under which sharing occurs
Grid addresses the Virtual Organisations problem
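To make the definition concrete, here is a minimal sketch of a virtual organisation as a data structure; the class and field names are ours, purely for illustration, and are not part of the DataGrid middleware.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SharingRule:
    resource: str         # what is shared, e.g. "cpu-farm" or "dataset:run10"
    allowed: List[str]    # who is allowed to share it
    conditions: str       # conditions under which sharing occurs

@dataclass
class VirtualOrganisation:
    name: str
    participants: List[str] = field(default_factory=list)   # individuals, institutions
    resources: List[str] = field(default_factory=list)      # computers, software, data, instruments
    rules: List[SharingRule] = field(default_factory=list)  # the sharing policy

# Hypothetical example, loosely modelled on an experiment collaboration:
cms = VirtualOrganisation(
    name="cms.org",
    participants=["CERN", "INFN", "PPARC"],
    resources=["cpu-farm", "dataset:run10"],
    rules=[SharingRule("dataset:run10", ["analysis-group"], "read-only access")],
)
print(cms.name, len(cms.participants), "participants")
```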
DataGrid vs Grid
The GRID metaphor
• Unlimited ubiquitous distributed computing
• Transparent access to multi-petabyte distributed data bases
• Easy to plug in
• Hidden complexity of the infrastructure
• Analogy with the electrical power GRID
The DataGrid requirements
• Enormous computing requirements
• Manage vast quantities of data
• Researchers spread all over the world
• User friendly interfaces
Networked Computing Models
• Distributed Computing – synchronous processing
• High-Throughput Computing – asynchronous processing
• On-Demand Computing – dynamic resources
• Data-Intensive Computing – databases
• Collaborative Computing – science
Ian Foster and Carl Kesselman, editors, “The Grid: Blueprint for a New Computing Infrastructure,” Morgan Kaufmann, 1999
Grid related initiatives
• Global Grid Forum: www.gridforum.org
• Europe
– European Grid Forum: www.egrid.org
– EuroGrid (EU project): uniform access to parallel supercomputing resources
– DAMIEN: Distributed Applications and Middleware for Industrial use of European Networks
– UNICORE: The UNiform Interface to Computer Resources
• America
– GLOBUS: www.globus.org/
– CONDOR: www.cs.wisc.edu/condor/
– PPDG: www.cacr.caltech.edu/ppdg/
– GriPhyN: www.griphyn.org
– NASA: www.nas.nasa.gov/ipg/
• Asia-Pacific
– StarTAP
– TransPAC
The DataGrid numbers
• 6 main contractors, 15 assistant contractors
• 9.8 million € funded by the EU
• 150 Full Time Equivalents over 3 years
• Flagship project of the EU IST GRID program
• Project started Jan 2001, duration 3 years
Main Partners
• CERN - France
• CNRS - France
• ESA/ESRIN - Italy
• INFN - Italy
• NIKHEF – The Netherlands
• PPARC - UK
Associated Partners

Research and Academic Institutes
• CESNET (Czech Republic)
• Commissariat à l'énergie atomique (CEA) – France
• Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI)
• Consiglio Nazionale delle Ricerche (Italy)
• Helsinki Institute of Physics – Finland
• Institut de Fisica d'Altes Energies (IFAE) – Spain
• Istituto Trentino di Cultura (IRST) – Italy
• Konrad-Zuse-Zentrum für Informationstechnik Berlin – Germany
• Royal Netherlands Meteorological Institute (KNMI)
• Ruprecht-Karls-Universität Heidelberg – Germany
• Stichting Academisch Rekencentrum Amsterdam (SARA) – Netherlands
• Swedish Natural Science Research Council (NFR) – Sweden

Industry Partners
• Datamat (Italy)
• IBM (UK)
• Compagnie des Signaux (France)
DataGrid Goals
• Develop open source middleware for fabric & grid management
• Deploy a large scale testbed
• Set up production-quality demonstrations
• Collaborate with and complement other European and US projects
• Involve industries to create the critical mass of interest for the success of the project
Technical challenges
• Transparent utilisation of heterogeneous computing and storage resources
– geographically distributed
– to process huge amounts of data
• Test the usability and effectiveness of novel computing models on a large-scale testbed
• Demonstrate production-quality operations on real, large use-cases
The DataGrid Middleware
• Developed in the framework of the Global Grid Forum activity
• Relies on and integrates Globus, Condor and contributions from other grid-related projects
• Integrates and re-uses existing open source software
• DataGrid middleware is distributed as open source software
The Middleware Approach
• Toolkit and services addressing key technical problems
– Modular “bag of services” model
– Not a vertically integrated solution
– can be applied to many application domains
• Inter-domain issues, rather than clustering
– Integration of intra-domain solutions
DataGrid Middleware Services
• Focus on Workload, Data, Monitoring, Fabric and Mass Storage management services
[Diagram: Workload management, Data management and Monitoring services built on Fabric management, Mass Storage management and other Grid middleware services (information, security)]
Working Groups
[Diagram: the Working Groups (Applications, Middleware, Infrastructure, Management) and the Testbed]
• The DataGrid project is divided into 12 Work Packages distributed over four Working Groups
WG Description
• The Middleware Working Group coordinates the development of the software modules, leveraging existing and long-tested open standard solutions. Five parallel development teams implement the software: job scheduling, data management, grid monitoring, fabric management and mass storage management.
• The Infrastructure Working Group is focused on the integration of middleware software with systems and networks to provide testbeds that demonstrate the effectiveness of DataGrid in production-quality operations over high-performance networks.
• The Applications Working Group exploits the project developments to process large amounts of data produced by experiments in the fields of High Energy Physics (HEP), Earth Observations (EO) and Biology.
• The Management Working Group is in charge of the coordination of the entire project on a day-to-day basis and of the dissemination of the results among industries and research institutes.
WorkPackage List (I)
WorkPackage – Chair(s)
• Grid Workload Management – Francesco Prelz (INFN)
• Grid Data Management – Ben Segal (CERN)
• Grid Monitoring Services – Robin Middleton (RAL)
• Fabric Management – Olof Barring (CERN)
• Mass Storage Management – John Gordon (RL)
WorkPackage List (II)
WorkPackage – Chair(s)
• Integration TestBed – Francois Etienne (CNRS)
• Network Services – Pascale Primet (CNRS)
• High Energy Physics Applications – Federico Carminati (CERN)
• Earth Observation Applications – Julian Linford (ESA-ESRIN)
• Biology Applications – Christian Micheau (CNRS)
• Dissemination – Maurizio Lancia (CNR)
• Project Management – Fabrizio Gagliardi (CERN)
Project management
• Project Manager (PM): Fabrizio Gagliardi (CERN)
• Project Office: supports the PM in the day-to-day operational management of the project
• Project Management Board (PMB): one representative of each main partner plus the PM; co-ordinates and manages items that affect the contractual terms of the project
• Project Technical Board (PTB): one representative from each Work Package; takes decisions on technical issues
• Architecture Task Force (ATF): issues guidelines about the global design
[Architecture schema:
• Local Computing: Local Application, Local Database
• Grid Application Layer: Job Management, Data Management, Metadata Management, Object to File Mapper
• Collective Services: Information & Monitoring, Replica Manager, Grid Scheduler, Replica Optimization, Replica Catalog Interface
• Underlying Grid Services: Computing Element Services, Storage Element Services, Replica Catalog, SQL Database Service, Service Index, Authorisation, Authentication and Accounting
• Fabric services: Configuration Management, Node Installation & Management, Monitoring and Fault Tolerance, Resource Management, Fabric Storage Management]
Data Management Tasks
• Data Transfer: efficient, secure and reliable transfer of data between sites (current status: GridFTP)
• Data Replication: replicate data consistently across sites (current status: GDMP)
• Data Access Optimization: optimize data access using replication and remote open (complex, R&D)
• Data Access Control: authentication, ownership, access rights on data (design not final)
• Metadata Storage: Grid-wide persistent metadata store for all kinds of Grid information (SQLDatabase)
• Data granularity: files, not objects
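To illustrate the replication-versus-remote-open trade-off behind Data Access Optimization, the sketch below compares the two strategies with a toy cost model; the function and its parameters are ours, not part of the DataGrid design.

```python
# Toy cost model (illustrative only): replicate a file to the local
# StorageElement, or open it remotely and pull only what is needed?

def access_plan(file_size_gb, reads, read_fraction, wan_gb_per_s, latency_s):
    """Compare copying the whole file once (then reading it locally) with
    remote-opening it and reading a fraction over the WAN on every access."""
    replicate_cost = file_size_gb / wan_gb_per_s
    remote_cost = reads * (latency_s + read_fraction * file_size_gb / wan_gb_per_s)
    return "replicate" if replicate_cost < remote_cost else "remote open"

# Many reads of most of a 2 GB file over a 100 MB/s link -> better to replicate:
print(access_plan(2.0, reads=10, read_fraction=0.8, wan_gb_per_s=0.1, latency_s=0.5))
# A single sparse read of the same file -> cheaper to open it remotely:
print(access_plan(2.0, reads=1, read_fraction=0.05, wan_gb_per_s=0.1, latency_s=0.5))
```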
Data Grid Storage Model
[Diagram: three sites, each with a Grid Compute Element and Grid Storage disk arrays; mass storage back-ends (HSM: HPSS, Castor) sit behind some of the storage, and a Replication Service moves data between the sites]
Replication Service
• Replica Catalog: store logical-to-physical file mappings and metadata
• Replica Manager: interface to create/delete replicas on demand
• Replica Selection: find the closest replica based on certain criteria
• Access Optimization: automatic creation and selection of replicas for whole jobs
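A minimal sketch of the "find the closest replica" idea: given the catalog's logical-to-physical mappings and a per-StorageElement cost estimate (bandwidth, load, and so on), pick the cheapest physical copy. The dictionary layout, host names and helper functions are ours, for illustration only.

```python
# Illustration only: select the cheapest replica of a logical file.
replica_catalog = {
    "LFN://cms.org/analysis/run10/event24.dat": [
        "PFN://se.cern.ch/data/run10/event24.dat",     # hypothetical StorageElements
        "PFN://se.infn.it/store/run10/event24.dat",
    ],
}

def storage_element(pfn):
    """Extract the StorageElement host from a PFN."""
    return pfn.split("//", 1)[1].split("/", 1)[0]

def select_replica(lfn, cost):
    """Return the physical file name with the lowest estimated access cost."""
    pfns = replica_catalog.get(lfn, [])
    return min(pfns, key=lambda p: cost.get(storage_element(p), float("inf")), default=None)

cost = {"se.cern.ch": 12.0, "se.infn.it": 3.5}   # e.g. estimated seconds per GB
print(select_replica("LFN://cms.org/analysis/run10/event24.dat", cost))
```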
Terminology
• StorageElement: any storage system with a Grid interface
– supports GridFTP
• Logical File Name (LFN)
– globally unique
– LFN://hostname/string
– hostname = virtual organization id; use of the hostname guarantees uniqueness
– e.g. LFN://cms.org/analysis/run10/event24.dat
• Physical File Name (PFN)
– PFN://hostname/path
– hostname = StorageElement host
• Transport File Name (TFN)
– a URI, including the protocol
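The three naming levels can be illustrated with a small helper (ours, not project code) that splits a name into scheme, host and path; note how the host part means something different at each level.

```python
def parse_name(name):
    """Split 'SCHEME://host/path' into its three parts."""
    scheme, rest = name.split("://", 1)
    host, _, path = rest.partition("/")
    return scheme, host, path

lfn = "LFN://cms.org/analysis/run10/event24.dat"      # host = virtual organisation id
pfn = "PFN://se.cern.ch/data/run10/event24.dat"       # host = StorageElement host (hypothetical SE)
tfn = "gsiftp://se.cern.ch/data/run10/event24.dat"    # full URI, transport protocol included

for n in (lfn, pfn, tfn):
    print(parse_name(n))
```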
Replica Catalog
• Replica Catalog: database containing mappings of a logical filename to one or more physical filenames
• Design goals
– scalable: the CMS experiment estimates that by 2006 its replica catalog will contain 50 million files spread across dozens of institutions
– decentralized: local data should always be in local catalogs! If an application wants to access a local replica, it should not have to query a replica catalog on the other side of the country/planet
– fault tolerant
ReplicaCatalog Design
• There should be exactly one ReplicaCatalog for each StorageElement
• All replica catalogs for a given virtual organization are linked together, probably in a tree structure
– “leaf” catalogs contain a mapping of LFN to PFN
– “non-leaf” catalogs contain only a pointer to another replica catalog
• All ReplicaCatalogs (leaf and non-leaf) have identical client APIs
Sample Hierarchy
[Diagram: Grid applications can query any level; a top-level Replica Catalog holds LFN -> site-RC mappings for the VO, each site Replica Catalog holds LFN -> SE-RC mappings for that site only, and each StorageElement Replica Catalog holds LFN -> PFN mappings for that SE only; redirects and updates flow between the levels]
Key points:
+ Can query any level in the hierarchy with the same API
+ All updates are performed at the SE-RC level, and automatically propagated up the tree
+ Queries are automatically redirected up the tree if data is not at a given level
Sample Usage
• Queries can start at any catalog
– queries can be redirected up and/or down the tree until all instances are found
• Common application query: “Is there a replica on my ‘local’ StorageElement?”
– only need to access the local ReplicaCatalog to answer this
– this will reduce the load on the top-level RC
• ReplicaCatalog updates (i.e. createNewReplica()) are done at leaf nodes only, and then propagated up the tree in batches (e.g. every few minutes)
– this will also reduce the load on the top-level server
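Putting the design and usage rules together, here is a compact sketch of the query and update flow; the class and method names are ours and the propagation is simplified (the real design batches updates every few minutes), so treat it as an illustration rather than the DataGrid API.

```python
# Illustrative sketch of the hierarchical ReplicaCatalog behaviour described above.
class ReplicaCatalog:
    def __init__(self, parent=None):
        self.parent = parent    # None for the top-level catalog of the VO
        self.local = {}         # leaf: LFN -> [PFN, ...]; non-leaf: LFN -> child catalog
        self.pending = []       # updates not yet propagated up the tree

    def query(self, lfn):
        """Identical client API at every level: follow a pointer down if we hold
        one, otherwise redirect the query up the tree."""
        target = self.local.get(lfn)
        if target is None:
            return self.parent.query(lfn) if self.parent else []
        return target.query(lfn) if isinstance(target, ReplicaCatalog) else target

    def create_new_replica(self, lfn, pfn):
        """Updates are performed at leaf (SE-level) catalogs only."""
        self.local.setdefault(lfn, []).append(pfn)
        self.pending.append(lfn)

    def propagate(self):
        """Meant to run in batches: tell the level above which LFNs we hold,
        so that non-leaf catalogs store only pointers, never PFNs."""
        if self.parent and self.pending:
            for lfn in self.pending:
                self.parent.local[lfn] = self
                self.parent.pending.append(lfn)
            self.parent.propagate()
        self.pending.clear()

top = ReplicaCatalog()
site1 = ReplicaCatalog(parent=top)
se1 = ReplicaCatalog(parent=site1)
se1.create_new_replica("LFN://cms.org/run10/event24.dat", "PFN://se1.example.org/run10/event24.dat")
se1.propagate()
print(se1.query("LFN://cms.org/run10/event24.dat"))   # answered locally at the leaf
print(top.query("LFN://cms.org/run10/event24.dat"))   # redirected down via the site catalog
```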
SQLDatabase
• Convenient, scalable and efficient storage, retrieval and query of data held in any type of local or remote RDBMS.
• May be used for metadata.
• Core functionality is SQL insert, delete, update and query.
• Can be invoked from a command line tool, a Web browser, and a programming language API.
• A well-defined, language-, platform- and RDBMS-neutral network protocol between client and server is used.
• Unified Grid-enabled front end to relational databases.
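As a stand-in illustration of the kind of insert and query the service exposes, the snippet below uses Python's built-in sqlite3 module in place of the Grid-enabled front end; the table and column names are invented for the example.

```python
import sqlite3

# Stand-in for the Grid-enabled RDBMS front end: a local in-memory database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE file_metadata (lfn TEXT PRIMARY KEY, owner TEXT, run INTEGER, size_mb REAL)")

# Core functionality: SQL insert ...
db.execute("INSERT INTO file_metadata VALUES (?, ?, ?, ?)",
           ("LFN://cms.org/analysis/run10/event24.dat", "analysis-group", 10, 1.0))

# ... and query.
for row in db.execute("SELECT lfn, size_mb FROM file_metadata WHERE run = ?", (10,)):
    print(row)
```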
GDMP
• Grid Data Management Pilot; to be renamed Grid Data Mirroring Package
• Initial usage: mirror Objectivity DB files to remote sites using GSI and GridFTP.
• Implements basic prototypes of all services or uses Globus equivalents (like replica catalog)
• Has been extended to mirror ROOT files and flat files
What next?
• Implement the Replica Catalog based on the design
• Design & implement the Replica Manager, Replica Selection and Cost Estimation
• Update the security framework
• More on databases and object-to-file mapping
• R&D in Grid query optimization

Step-by-step evolution of services
Research & Development