Data Management Technologies in Grids

Borja Bergua Guerra
Computer Architecture Group
Universidad Carlos III de Madrid

Congreso GUL UC3M 2008


Topics

1. Motivation and some definitions

2. Data Management in Globus

3. Expand solution for cluster and grid computing


Motivation of grid computing

• Many resources scattered across multiple organizations
  ◦ CPU
  ◦ Networks
  ◦ Storage
  ◦ Data
• Goals:
  ◦ To take advantage of these resources.
  ◦ To share geographically distributed computation and storage resources.
  ◦ To let users share data and resources reliably and consistently.


Some definitions

• A grid system is a collection of distributed resources, connected by a network and located in different administrative domains, that users and applications can access in order to share resources and improve performance.
• A grid application is an application that operates in a grid environment.
• Grid middleware is the software that facilitates writing grid applications and manages the grid infrastructure.


Other Definitions

• The Globus Project defines Grid as:
  ◦ Infrastructure that enables the integrated, collaborative use of high-end computers, networks, databases, and scientific instruments owned and managed by multiple organizations
• The Gridbus project defines Grid as:
  ◦ Type of parallel and distributed system that enables the sharing, selection, and aggregation of geographically distributed autonomous resources dynamically at runtime depending on their availability, capability, performance, cost and users' quality-of-service requirements.


Resources

• Processors
• Storage
• Networks
• Software
• Data
• Special devices


Grid Architectures

• Virtual organizations (VO)
  ◦ Different physical organizations that share resources and collaborate in order to achieve a common goal
  ◦ A VO defines the resources available and the rules for access
• Economic grid
  ◦ Based on economic principles
  ◦ Resource providers (owners) compete to provide the best service to resource consumers (users), who select appropriate resources based on their specific requirements


Main grid uses

• Distributed supercomputing
• High throughput computing
• On-demand computing
• Data-intensive computing
• Collaborative computing
• Multimedia computing


Grid components

[Figure: layered diagram with the labels Applications, Tools, Middleware, and Local resources.]


User point of view

[Figure: the user hands applications to the grid and gets results back. Inside the grid: brokers, schedulers, job management, data management, and the security model.]


Job execution model

[Figure: job execution model. Components: Resource Broker, Job Submission Service, Storage Element, Computing Element, Information Service, Replica Catalogue. Arrow labels: submit a job, broker info, input data, output data, job status.]


Applications appropriate for grids

• Perfect parallelism. The application can be divided into sets of processes that require little or no communication.
• Data parallelism. The same operation is performed on many data elements simultaneously.
• Functional parallelism. Multiple operations are performed simultaneously, with each operation addressing a particular part of the problem.
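As a rough illustration of data parallelism, the sketch below applies the same operation to every element of a data set using a pool of workers. The function `process_element` and the sample data are hypothetical, and threads stand in for what would typically be independent grid jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def process_element(x):
    """Hypothetical per-element operation; it does not depend on other elements."""
    return x * x


def run_data_parallel(data, workers=4):
    """Apply process_element to every item; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_element, data))


if __name__ == "__main__":
    print(run_data_parallel(range(8)))
```

Because no element depends on another, the work can be split across any number of workers without communication between them.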


Applications not appropriate for grids

• Parallel applications with heavy interprocess communication
• Transactions
• Applications with many interdependencies between jobs
• Applications that use non-standard network protocols


What do applications need?

• Applications need programs that process input data and produce output data in a reliable and efficient way
  ◦ Executable transfer
  ◦ Input data transfer
  ◦ Output data gathering
  ◦ Checkpoints
• Applications need transparent access to their input and output data
• A grid must provide transparent, secure, high-performance access to data located in different administrative domains and organizations
  ◦ File location services
  ◦ File transfer services


Data management techniques

• Distributed file systems
  ◦ NFS, AFS, ...
• Storage technologies
  ◦ NAS devices
  ◦ Storage Area Networks (SAN)
  ◦ iSCSI devices
• Parallel file systems
  ◦ PVFS, GPFS, Expand, ...
• Distributed storage (wide-area)
  ◦ File transfers
  ◦ Replication
  ◦ Grid file systems


Data Management in Globus


Globus Toolkit

• Software toolkit developed by the Globus Alliance: http://www.globus.org
• It can be used to program grid-based applications.
• It includes high-level services for building grid applications:
  ◦ Resource monitoring and discovery service
  ◦ A job submission infrastructure
  ◦ A security infrastructure
  ◦ Data management services


What the Globus Toolkit is not

• A user tool
• An application
• A scheduler


Globus Toolkit Components

[Figure: Globus Toolkit components, arranged in columns for Security, Data Management, Execution Management, Information Services, and Common Runtime, and split into Web Services components and non-WS components. Labelled components: WS Authentication Authorization, Pre-WS Authentication Authorization, Credential Management, Delegation Service, CAS, Reliable File Transfer (RFT), GridFTP, Replica Location Service (RLS), OGSA-DAI [Tech Preview], Grid Resource Allocation Management (WS GRAM), Grid Resource Allocation Management (Pre-WS GRAM), Community Scheduler Framework [contribution], Monitoring & Discovery System (MDS4), Monitoring & Discovery System (MDS2), Java WS Core, C WS Core, Python WS Core [contribution], C Common Libraries, XIO. Version labels: GT2, GT3, GT4.]


Security

• The Globus Toolkit proposed and implements the Grid Security Infrastructure (GSI)
  ◦ Single sign-on for grid services through user certificates
  ◦ Resource authentication through host certificates
  ◦ Data encryption
  ◦ Authorization


Execution management

• Grid Resource Allocation and Management (GRAM)
  ◦ GRAM lets users submit, monitor, and cancel jobs.
  ◦ GRAM is not a scheduler.
  ◦ Two implementations: WS GRAM and Pre-WS GRAM


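MetaSchedulers

[Figure: users submit jobs to a job submission service (Condor-G), which relies on a broker using MDS and the replica catalog for resource discovery. Jobs are dispatched to three sites (Site1, Site2, Site3), each running Globus GRAM in front of a local resource manager (Condor, LSF, PBS).]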


Monitoring and Discovery System (MDS)

• Suite of web services for monitoring and discovering resources and services on grids
• Examples of useful information
  ◦ Characteristics of a compute resource: software available, networks connected to, load, type of CPU, disk space
  ◦ Characteristics of the Globus infrastructure: hosts, resource managers, service availability


Data Management in Globus

• Data transfer
  ◦ GridFTP
  ◦ Reliable File Transfer (RFT) service
• Data replication
  ◦ Replica Location Service (RLS)


GridFTP

• GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks
• Based on the FTP protocol
  ◦ GridFTP: Protocol Extensions to FTP for the Grid
  ◦ Global Grid Forum Recommendation
  ◦ http://www.ggf.org/documents/GWD-R/GFD-R.020.pdf
• Globus provides
  ◦ Server implementation
  ◦ Client tools (command-line programs)
  ◦ Development libraries


globus-url-copy

• Command-line tool for multi-protocol data movement
• Mainly used for GridFTP, but it supports several protocols
  ◦ gsiftp:// (GridFTP)
  ◦ ftp://
  ◦ http://
  ◦ https://
  ◦ file://
• You must have a certificate to use it


globus-url-copy

• Command line
  ◦ globus-url-copy [options] srcURL dstURL
• Examples of URLs
  ◦ gsiftp://host.domain.com:2890/dir/file.dat
  ◦ http://host.domain.com/webpage/page.html
  ◦ file:///localdirectory/file.dat
• For GridFTP (gsiftp://) and FTP (ftp://), a user name and password may be specified
  ◦ gsiftp://name:[password]@host.domain.com/foo.dat
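The URL forms listed here can be checked before a transfer is launched. The following is a minimal sketch that uses Python's standard `urllib.parse`; the helper names `describe_url` and `build_command` are hypothetical, and the sketch only assembles the `globus-url-copy [options] srcURL dstURL` argument list without running it.

```python
from urllib.parse import urlsplit

# Schemes listed on the globus-url-copy slide.
SUPPORTED_SCHEMES = {"gsiftp", "ftp", "http", "https", "file"}


def describe_url(url):
    """Split a transfer URL into its parts and reject unsupported schemes."""
    parts = urlsplit(url)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,  # None for file:/// URLs
        "port": parts.port,      # None when no port is given
        "user": parts.username,  # None when no user name is given
        "path": parts.path,
    }


def build_command(src_url, dst_url, options=()):
    """Return the argument list for: globus-url-copy [options] srcURL dstURL."""
    describe_url(src_url)  # called only to validate the scheme
    describe_url(dst_url)
    return ["globus-url-copy", *options, src_url, dst_url]
```

A list of arguments, rather than a single string, can be passed to `subprocess.run` without any shell quoting.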


Server Architecture

• GridFTP uses two separate socket connections:
  ◦ A control channel for commands and responses
  ◦ A data channel for data transfer
• The control and data channels can run in separate processes
• A single control channel can have multiple data channels
  ◦ Used in striped operation


Same server for control and data

[Figure: globus-url-copy sends commands and transfers data to a single server that handles both the control and the data channel.]

globus-gridftp-server -p 7000


Different servers for control and data

• High level of security
  ◦ The frontend can be run as any user
  ◦ The backend is run as root, but configured to only allow connections from the frontend

[Figure: globus-url-copy sends commands to the control (frontend) server; the data transfer goes through the separate data (backend) server.]

Control (frontend): globus-gridftp-server -p 7000 -r localhost:7001
Data (backend): globus-gridftp-server -p 7001 -dn -allow-from 127.0.0.1


Third party transfers

globus-url-copy gsiftp://machineA/tmp/FileA gsiftp://machineB/tmp/FileB

(1) The client sends the data request to machine A
(2) Machine A initiates the transfer and the client is disconnected
(3) Machine B receives the file

[Figure: FileA is copied from Machine A to Machine B; the arrows are labelled (1), (2), and (3).]


Striped servers

[Figure: globus-url-copy -stripe sends commands to one control server, which drives parallel data transfers through three data servers on machineA, machineB, and machineC.]

Control:
globus-gridftp-server -p 7000 -r machineA:5000,machineB:6000,machineC:4000

Data:
machineA> globus-gridftp-server -p 5000 -dn
machineB> globus-gridftp-server -p 6000 -dn
machineC> globus-gridftp-server -p 4000 -dn
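The host:port list given to the control server is easy to mistype by hand. As a small sketch, the helpers below generate the command lines shown on this slide from a list of data nodes; the function names are hypothetical, and only the flags that appear on the slide (`-p`, `-r`, `-dn`) are used.

```python
def frontend_command(control_port, backends):
    """Build the control-server command line.

    backends is a list of (host, port) pairs, one per data node.
    """
    remote = ",".join(f"{host}:{port}" for host, port in backends)
    return f"globus-gridftp-server -p {control_port} -r {remote}"


def backend_command(port):
    """Build the command line for one data node."""
    return f"globus-gridftp-server -p {port} -dn"


if __name__ == "__main__":
    nodes = [("machineA", 5000), ("machineB", 6000), ("machineC", 4000)]
    print(frontend_command(7000, nodes))
    for host, port in nodes:
        print(f"{host}> {backend_command(port)}")
```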


Reliable File Transfer

• Uses standard SOAP messages over HTTP to submit and manage transfers
• Increases transfer reliability because the transfer state is stored in a database
• The client can submit the transfer request and then disconnect
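To show why keeping state in a database helps, here is a minimal sketch of the idea using Python's built-in `sqlite3`. It is not the RFT implementation or its schema: the table layout, status values, and function names are all assumptions made for illustration.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open (or create) the transfer-state database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS transfers ("
        " id INTEGER PRIMARY KEY, src TEXT, dst TEXT, status TEXT)"
    )
    return db


def submit(db, src, dst):
    """Record the request before any data moves, so it survives a restart."""
    cur = db.execute(
        "INSERT INTO transfers (src, dst, status) VALUES (?, ?, 'pending')",
        (src, dst),
    )
    db.commit()
    return cur.lastrowid


def mark_done(db, transfer_id):
    """Mark one transfer as finished."""
    db.execute("UPDATE transfers SET status = 'done' WHERE id = ?", (transfer_id,))
    db.commit()


def pending(db):
    """Return the transfers that still have to be started or restarted."""
    rows = db.execute(
        "SELECT id, src, dst FROM transfers WHERE status = 'pending' ORDER BY id"
    )
    return rows.fetchall()
```

Because the request is stored as soon as it is submitted, the client does not need to stay connected, and a restarted service can call `pending` to pick up unfinished work.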


RFT: Third party transfer

[Figure: an RFT client exchanges SOAP messages, and optionally notifications, with the RFT service, which drives two GridFTP servers; each GridFTP server is shown with a control channel and a data channel.]


Replica location and management

• This component supports multiple locations for the same file across the grid
• File names
  ◦ A logical file name (LFN) is the name that refers to the full set of replicas of a file
    ▪ Logical identifier
    ▪ Based on the LFN, all replicas can be looked up in a replica catalogue
  ◦ A physical file name (PFN) is the location of a copy of the file on a storage system
    ▪ Host + physical location
    ▪ May be the physical name of a file stored on disk


File name examples

Logical File Name:
  My-data
  logical-name

Physical Filename:
  gridftp://host1.domain/data/myfile
  gsiftp://host2.domain/data/myfile
  http://example.org/data/filename


RLS components

• Local Replica Catalogs (LRCs) store mappings between logical names for data items and the physical locations of replicas of those items.
• Replica Location Index (RLI) nodes aggregate information about one or more LRCs.
• LRCs use soft-state update mechanisms to inform RLIs about their state, which gives the index relaxed consistency.
• Optional compression of state updates reduces communication, CPU, and storage overheads.
• A membership service registers participating LRCs and RLIs and deals with changes in membership.
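The soft-state idea can be sketched as an index whose entries expire unless an LRC refreshes them. This is a toy model, not the RLS code: the class name, the time-to-live parameter, and the identifiers used for LRCs are assumptions, and compression of updates is left out.

```python
import time


class ReplicaLocationIndex:
    """Toy RLI: remembers which LRC reported which logical name, with expiry."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._entries = {}  # (logical_name, lrc_id) -> time of the last update

    def soft_state_update(self, lrc_id, logical_names, now=None):
        """Called periodically by each LRC with the names it currently holds."""
        now = time.time() if now is None else now
        for name in logical_names:
            self._entries[(name, lrc_id)] = now

    def lookup(self, logical_name, now=None):
        """Return the LRCs whose last report of this name has not expired."""
        now = time.time() if now is None else now
        return sorted(
            lrc
            for (name, lrc), seen in self._entries.items()
            if name == logical_name and now - seen <= self.ttl
        )
```

An LRC that stops sending updates simply ages out of the index, which is why the index may briefly lag behind the catalogs (relaxed consistency).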


Example 1

[Figure (source: Globus Alliance): three replica sites send soft-state communications of their LRC state to a global replica index node, which answers user queries of the form "Where is LFi replicated?"]

Replica site     Logical names   Physical names
Replica Site 1   LF1, LF7        PFx, PFy
Replica Site 2   LF8, LFz        PFu, PFv
Replica Site 3   LFz             PFv

Global Replica Index Node
Logical name   Site
LF1            Site 1
LF7            Site 1
LF8            Site 2
LFz            Site 2
LFz            Site 3

• The user submits a query (LF) to the RLI to obtain the list of LRCs that store the file
• The user selects the appropriate replica site and submits a query (LF) to obtain the physical name
• Finally, the user uses GridFTP with the physical name to transfer the replica
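The two-step lookup can be sketched with plain dictionaries filled in from the example figure. The pairing of each logical name with a physical name inside a site (for example LF1 with PFx) is an assumption read from the order of the labels, and the function names are hypothetical.

```python
# Index contents as listed for the global replica index node.
RLI = {
    "LF1": ["Site 1"],
    "LF7": ["Site 1"],
    "LF8": ["Site 2"],
    "LFz": ["Site 2", "Site 3"],
}

# Local replica catalogs; the LF -> PF pairing per site is assumed.
LRC = {
    "Site 1": {"LF1": "PFx", "LF7": "PFy"},
    "Site 2": {"LF8": "PFu", "LFz": "PFv"},
    "Site 3": {"LFz": "PFv"},
}


def find_sites(lfn):
    """Step 1: ask the RLI which sites hold a replica of the logical file."""
    return RLI.get(lfn, [])


def resolve(lfn, site):
    """Step 2: ask the chosen site's LRC for the physical name."""
    return LRC[site][lfn]


def locate(lfn, choose=lambda sites: sites[0]):
    """Run both steps; choose() stands in for the user's site selection."""
    sites = find_sites(lfn)
    if not sites:
        raise KeyError(f"no replica registered for {lfn}")
    site = choose(sites)
    return site, resolve(lfn, site)
```

The physical name returned by `locate` is what would then be handed to GridFTP for the actual transfer.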


Expand: Parallel File System for cluster and grid computing


Expand

• A new approach to the construction of parallel file systems for cluster and grid computing
• Solution for clusters:
  ◦ Use of standard servers
• Solution for grids:
  ◦ Use of GridFTP and the Globus Toolkit


Parallelism in data access

• Parallelism can be applied at multiple levels
  ◦ The goal is to exploit all levels

[Figure: stack of levels: parallel applications, parallel computers, parallel file systems, parallel devices.]


Typical architecture of a parallel file system

[Figure: client processes, linked with a client library, access several servers in parallel.]


Motivation

• Parallel file systems do not use standard servers
  ◦ It is very difficult to use these systems in heterogeneous environments
  ◦ There is no parallel file system (PFS) for grid computing
• However: why use proprietary, special-purpose servers for the parallel file system when most of the necessary functionality is already available in many standard servers (NFS, GridFTP)?


Expand Goals

• To propose a new approach to the construction of parallel file systems for cluster and grid computing
  ◦ Based on existing, standard servers
    ▪ NFS for cluster computing
    ▪ GridFTP for grid computing
• Advantages
  ◦ No changes to servers are required
  ◦ All operations are implemented on the clients
  ◦ Expand is independent of the operating system used on the clients
  ◦ All operations are based on the server protocol
  ◦ The construction of the parallel file system is greatly simplified
  ◦ Servers with different architectures and operating systems can be used


Architecture

[Figure: clients use Expand through POSIX and MPI-IO interfaces, with an NFI layer over back ends labelled NFS, CIFS, FTP, HTTP-WebDAV, GridFTP, and Local. Across the network, data and metadata servers (NFS, CIFS, HTTP-WebDAV, GridFTP, ...) form a distributed partition.]


File Structure

• A file consists of several subfiles, one for each server
• Data are distributed across all servers

[Figure: an Expand file with blocks 0 to 8 is split across three servers; the server files (subfiles) hold blocks (0, 3, 6), (1, 4, 7), and (2, 5, 8).]
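The block layout in the figure matches a round-robin distribution, which can be written down in a few lines. This is a sketch under that assumption: the function names are hypothetical, and numbering the servers from 0 with block 0 on the first server is a choice made here, not something the slide states.

```python
def locate_block(block, num_servers):
    """Map a block of the Expand file to (server index, block index in the subfile)."""
    return block % num_servers, block // num_servers


def subfiles(num_blocks, num_servers):
    """List the file blocks stored in each server's subfile."""
    layout = [[] for _ in range(num_servers)]
    for block in range(num_blocks):
        server, _ = locate_block(block, num_servers)
        layout[server].append(block)
    return layout


if __name__ == "__main__":
    for server, blocks in enumerate(subfiles(9, 3)):
        print(f"server {server}: blocks {blocks}")
```

With nine blocks and three servers, consecutive blocks land on different servers, so a sequential read can be served by all three at once.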


Directory Structure

[Figure: the logical user view is a tree under /Expand containing Dir1, Dir2, Dir3, Dir4, fileA, and fileB. The directory mapping reproduces the same tree under the exported directory of every server, from /export1 on Server1 to /exportN on ServerN.]
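The directory mapping can be sketched as a path translation from the logical tree to each server's export. The function name and the example path `/Expand/Dir1/fileA` are hypothetical, since the figure does not make clear in which directory each file sits.

```python
import posixpath


def server_paths(logical_path, exports, mount="/Expand"):
    """Map a logical Expand path to the matching path under every server export."""
    if logical_path != mount and not logical_path.startswith(mount + "/"):
        raise ValueError(f"{logical_path!r} is not under {mount}")
    relative = posixpath.relpath(logical_path, mount)
    return [
        posixpath.normpath(posixpath.join(export, relative))
        for export in exports
    ]


if __name__ == "__main__":
    for path in server_paths("/Expand/Dir1/fileA", ["/export1", "/exportN"]):
        print(path)
```

Each returned path names the subfile (or directory) that one server holds for the same logical entry.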


Expand version for grid computing

[Figure: clients use GridExpand through POSIX and MPI-IO interfaces, over back ends labelled GridFTP, RNS-WS, NFS, and Local. Across the Internet, secured with GSI, four sites each run GridFTP on top of NFS and are grouped into distributed partitions; RNS appears alongside the partitions.]


Grid file system requirements

• Logical hierarchical name space
  ◦ Resource Namespace Service
• Uniform storage interface and API
  ◦ POSIX and MPI-IO
• Data access/transfer
  ◦ GridFTP
• Security
  ◦ GSI
• Optimization and performance improvements
  ◦ Parallel I/O


Performance evaluation

• 500 jobs
• Each job accesses a random number of files (between 1 and 10), chosen from a set of 1000 files.
• The size of each file is 500 MB.
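A workload with these parameters could be generated as sketched below. The slide does not say how the random choices were made, so the uniform distribution, the rule that a job never picks the same file twice, and the fixed seed are all assumptions.

```python
import random

NUM_JOBS = 500
NUM_FILES = 1000
FILE_SIZE_MB = 500


def generate_workload(seed=0):
    """Return one list of distinct file ids per job."""
    rng = random.Random(seed)
    jobs = []
    for _ in range(NUM_JOBS):
        count = rng.randint(1, 10)  # inclusive on both ends
        jobs.append(rng.sample(range(NUM_FILES), count))
    return jobs


def total_volume_mb(jobs):
    """Total amount of data read by all jobs, in MB."""
    return sum(len(files) for files in jobs) * FILE_SIZE_MB


if __name__ == "__main__":
    print(total_volume_mb(generate_workload()), "MB")
```

Under the uniform assumption a job reads 5.5 files on average, that is 2750 MB per job and about 1.4 TB for the 500 jobs.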


Models evaluated

• 1 site
  ◦ globus-url-copy

[Figure: a single GridFTP server holding the files.]


Models evaluated

• 4 sites (distributed files, each server stores 250 files)
  ◦ globus-url-copy

[Figure: four GridFTP servers, each holding part of the files.]


Models evaluated

• 4 sites (full replication)
  ◦ globus-url-copy

[Figure: four GridFTP servers, each holding the files.]


Models evaluated

• 4 sites (full replication, parallel access)
  ◦ globus-url-copy -p 4

[Figure: four GridFTP servers holding the files, accessed in parallel.]


Models evaluated

• GridExpand with 1 and 4 servers

[Figure: the same GridExpand architecture diagram as under "Expand version for grid computing".]


Results

[Figure: bar chart of time (min), with the axis running from 0 to 120, for six configurations. globus-url-copy: 1 site; 4 sites (distributed replicas); 4 sites (full replication); 4 sites (full replication, parallel access). Expand: GridExpand 1 site; GridExpand 4 sites.]


Some Grid File Systems

• Storage Resource Broker (SRB)
• Grid Datafarm
• Distributed StorageTank
• SlashGrid by Andrew