metadata services on the grid

University of Coimbra Metadata Services on the GRID Nuno Santos ACAT’05 May 25th, 2005

Page 1: Metadata Services on the GRID

University of Coimbra

Metadata Services on the GRID

Nuno Santos

ACAT’05 May 25th, 2005

Page 2: Metadata Services on the GRID

Contents

- Metadata on the GRID
- ARDA-gLite Metadata Interface
- The ARDA Implementation
- Performance study: SOAP vs TCP Streaming

Page 3: Metadata Services on the GRID

Metadata on the GRID

- Metadata is data about data
- Metadata on the GRID
  - Mainly information about files
  - Other information necessary for running jobs
  - Usually living in DBs
- Need a simple interface for metadata access
- Advantages
  - Easier to use by clients: no SQL, only metadata concepts
  - Common interface: clients don’t have to reinvent the wheel
- Must be integrated in the File Catalogue
- Also suitable for storing information about other resources

Page 4: Metadata Services on the GRID

ARDA-gLite Metadata Interface

- ARDA proposed an interface for metadata access on the GRID
  - Designed jointly with the gLite/EGEE team
  - Incorporates feedback from GridPP
  - Endorsed by the EGEE standards committee (PTF)
  - Being implemented in gLite File Catalog (FiReMan)
- Interface concepts
  - Metadata: key-value pairs
  - Entry: entity to which metadata is attached
  - Attribute: holds information about an entry
    - Type: the type (int, float, string, …)
    - Name/Key: the name of the attribute
    - Value: value of an entry's attribute
  - Schema: a collection of attributes
- Entries are associated with schemas
- Think of schemas as tables, attributes as columns, entries as rows
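The table analogy can be made concrete with SQLite, one of the backends named later in the talk. This is a minimal sketch and not the ARDA or gLite code: the schema name `jobs`, the attribute names `events` and `owner`, and the sample values are all hypothetical.

```python
import sqlite3

# In-memory database standing in for a metadata backend.
db = sqlite3.connect(":memory:")

# Schema "jobs" with two typed attributes -> a table with two typed columns.
db.execute("CREATE TABLE jobs (entry TEXT PRIMARY KEY, events INTEGER, owner TEXT)")

# Two entries with their attribute values -> two rows.
db.executemany(
    "INSERT INTO jobs (entry, events, owner) VALUES (?, ?, ?)",
    [("job1", 1000, "alice"), ("job2", 250, "bob")],
)

# Reading the attributes of one entry yields key-value pairs.
cur = db.execute("SELECT events, owner FROM jobs WHERE entry = ?", ("job1",))
columns = [d[0] for d in cur.description]
metadata = dict(zip(columns, cur.fetchone()))
print(metadata)
```

A client of the metadata interface would never write this SQL itself; the point of the interface is that only the metadata concepts are exposed.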

Page 5: Metadata Services on the GRID

Interface Operations

Schema management

    void createSchema(String schemaName, Attribute[] attributes)
    void dropSchema(String schemaName)
    void removeSchemaAttributes(String schemaName, String[] attributeNames)
    void addSchemaAttributes(String schemaName, Attribute[] attributes)

Entry management

    void createEntry(MDEntry[] entries, String[] schemas)
    void removeEntry(String query)
    int setAttributes(String query, Attribute[] attributes)
    Attribute[] listAttributes(String entry)
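To show how these operations fit together, here is a toy in-memory stand-in written in Python. It is a sketch under simplifying assumptions, not the ARDA implementation: the class name `InMemoryCatalogue` is hypothetical, a "query" is treated as an exact entry name, entries are created one at a time, and the integer returned by `set_attributes` is assumed to be the number of entries updated (the slide does not say what it means).

```python
class InMemoryCatalogue:
    """Toy stand-in for a metadata service (illustrative only)."""

    def __init__(self):
        self.schemas = {}  # schema name -> {attribute name: type}
        self.entries = {}  # entry name -> (schema name, {attribute name: value})

    def create_schema(self, schema_name, attributes):
        # attributes: iterable of (name, type) pairs
        if schema_name in self.schemas:
            raise ValueError(f"schema {schema_name!r} already exists")
        self.schemas[schema_name] = dict(attributes)

    def add_schema_attributes(self, schema_name, attributes):
        self.schemas[schema_name].update(dict(attributes))

    def remove_schema_attributes(self, schema_name, attribute_names):
        for name in attribute_names:
            del self.schemas[schema_name][name]
            # Drop stored values of the removed attribute as well.
            for schema, values in self.entries.values():
                if schema == schema_name:
                    values.pop(name, None)

    def drop_schema(self, schema_name):
        del self.schemas[schema_name]
        self.entries = {
            e: (s, v) for e, (s, v) in self.entries.items() if s != schema_name
        }

    def create_entry(self, entry, schema_name):
        if schema_name not in self.schemas:
            raise KeyError(f"unknown schema {schema_name!r}")
        self.entries[entry] = (schema_name, {})

    def remove_entry(self, query):
        # Assumption: the query is an exact entry name.
        self.entries.pop(query, None)

    def set_attributes(self, query, attributes):
        # attributes: {name: value}. Returns the number of entries updated
        # (assumed meaning of the int in the interface).
        if query not in self.entries:
            return 0
        schema, values = self.entries[query]
        for name in attributes:
            if name not in self.schemas[schema]:
                raise KeyError(f"{name!r} is not an attribute of {schema!r}")
        values.update(attributes)
        return 1

    def list_attributes(self, entry):
        # Returns (name, type, value) triples; unset values are None.
        schema, values = self.entries[entry]
        return [(n, t, values.get(n)) for n, t in self.schemas[schema].items()]


cat = InMemoryCatalogue()
cat.create_schema("jobs", [("events", "int"), ("owner", "string")])
cat.create_entry("job1", "jobs")
updated = cat.set_attributes("job1", {"events": "1000"})
attrs = cat.list_attributes("job1")
```

Values are kept as strings here, mirroring the `String value` field of the interface's `Attribute` datatype.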

Page 6: Metadata Services on the GRID

Interface Operations

Searching and retrieving entries

    MDResult query(MDQuery query)
    MDResult nextQuery(String token, MDQuery query)
    void endQuery(String token)

Allows either stateful or stateless server implementations

Datatypes

    MDEntry {
        String entry
        Attribute[] attributes
    }

    MDResult {
        MDEntry[] entries
        String token
        Boolean done
    }

    MDQuery {
        String query
        String queryType
    }

    Attribute {
        String schema
        String name
        String type
        String value
    }
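The `query` / `nextQuery` / `endQuery` triple describes a chunked iteration driven by a token. The sketch below shows one way a stateful server and a client loop could interact. It is illustrative only: `StatefulServer`, `fetch_all`, the chunk size, and the use of a plain Python iterator in place of a database cursor are assumptions, and the query string is ignored.

```python
import itertools
import uuid
from dataclasses import dataclass


@dataclass
class MDResult:
    entries: list
    token: str
    done: bool


class StatefulServer:
    """Keeps one open iterator (a stand-in for a DB cursor) per session token."""

    def __init__(self, rows, chunk_size=2):
        self.rows = rows
        self.chunk_size = chunk_size
        self.sessions = {}  # token -> iterator over the remaining rows

    def _next_chunk(self, token):
        chunk = list(itertools.islice(self.sessions[token], self.chunk_size))
        # A short (possibly empty) chunk means the data is exhausted.
        done = len(chunk) < self.chunk_size
        if done:
            del self.sessions[token]
        return MDResult(chunk, token, done)

    def query(self, query):
        # A real server would run the query against the database here.
        token = uuid.uuid4().hex
        self.sessions[token] = iter(self.rows)
        return self._next_chunk(token)

    def next_query(self, token, query=None):
        # A stateful server only needs the token; the query is unused.
        return self._next_chunk(token)

    def end_query(self, token):
        self.sessions.pop(token, None)


def fetch_all(server, query):
    """Client loop: keep asking for chunks until the server reports done."""
    result = server.query(query)
    entries = list(result.entries)
    while not result.done:
        result = server.next_query(result.token, query)
        entries.extend(result.entries)
    return entries


server = StatefulServer(rows=["e1", "e2", "e3", "e4", "e5"], chunk_size=2)
all_entries = fetch_all(server, "*")
```

Because `nextQuery` also receives the original query, a stateless server could plausibly encode a position in the token and re-run the query on each call instead of keeping a session; that reading of "stateful or stateless" is an interpretation, not something the slide spells out.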

Page 7: Metadata Services on the GRID

ARDA Prototype

- Validate proposed interface
- Architecture:
  - Metadata organized in a hierarchy
  - Schemas can contain sub-schemas
    - Can inherit attributes
  - Analogy to file system: Schema ↔ Directory; Entry ↔ File
- Stability with large responses
  - Send large responses in chunks
  - Otherwise preparing large responses could crash the server
- Stateful server
  - DB → Server: data streamed using DB cursors
  - Server → Client: response sent in chunks
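The directory analogy suggests resolving a schema's attributes by walking up its path. The following Python sketch illustrates that idea only; the schema paths, the attribute names, and the rule that a sub-schema overrides a parent's attribute of the same name are assumptions, not documented behaviour of the prototype.

```python
import posixpath

# Hypothetical schema hierarchy: path -> attributes declared at that level.
schemas = {
    "/": {},
    "/lhcb": {"owner": "string"},
    "/lhcb/prod": {"events": "int"},
}


def effective_attributes(path):
    """Attributes of a schema, including those inherited from parent schemas."""
    if not path.startswith("/"):
        raise ValueError("schema paths must be absolute")
    chain = []
    while True:
        chain.append(path)
        if path == "/":
            break
        path = posixpath.dirname(path)
    merged = {}
    for p in reversed(chain):  # root first, so sub-schemas are applied last
        merged.update(schemas.get(p, {}))
    return merged


prod_attrs = effective_attributes("/lhcb/prod")
```

Here `/lhcb/prod` ends up with both its own `events` attribute and the `owner` attribute declared on `/lhcb`.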

Page 8: Metadata Services on the GRID

ARDA Implementation

- Backends
  - Currently: Oracle, PostgreSQL, SQLite
- Two frontends
  - TCP Streaming: chosen for performance
  - SOAP: formal requirement of EGEE
  - Compare SOAP with TCP Streaming
- Also implemented as standalone Python library
  - Data stored on filesystem

[Figure: architecture diagram. Labels: Python Interpreter, Metadata Python API, Client, filesystem; Metadata Server (MDServer) with SOAP and TCP Streaming frontends; PostgreSQL, Oracle and SQLite backends; two Clients.]

Page 9: Metadata Services on the GRID

TCP Streaming Frontend

- Text-based protocol (like SMTP, POP3, …)
- Data streamed to client in single connection
- Implementation
  - Server: C++, multiprocess
  - Clients: C++, Java, Python, Perl, Ruby

Example exchange:

    Client: listattr entry
    Server: 0 entry value1 value2 … <EOT>

[Figure: sequence diagram between Client, Server and Database. The client sends <operation>, the server creates a DB cursor, and [data] blocks flow from the database through the server to the client. Both legs are labelled "Streaming".]
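A client for a text protocol of this kind mostly needs to read until the end marker and split the response into fields. The Python sketch below illustrates that under stated assumptions: the exact wire format is not recoverable from the slide, so the newline separators, the use of the ASCII EOT byte (0x04) for `<EOT>`, and the reading of the leading `0` as a status code are guesses made for illustration. A local socket pair stands in for the network connection.

```python
import socket

EOT = b"\x04"  # assumption: <EOT> is the ASCII end-of-transmission byte


def read_response(sock):
    """Read one response: a status line, then data lines, terminated by EOT."""
    buf = b""
    while not buf.endswith(EOT):
        data = sock.recv(4096)
        if not data:
            raise ConnectionError("connection closed before <EOT>")
        buf += data
    lines = buf[:-1].decode("utf-8").splitlines()
    return int(lines[0]), lines[1:]


# A connected socket pair plays the role of the client/server TCP connection.
client, server = socket.socketpair()
client.sendall(b"listattr entry\n")
request = server.recv(4096)

# The server streams its answer piece by piece over the same connection.
for piece in (b"0\n", b"entry\n", b"value1\n", b"value2\n", EOT):
    server.sendall(piece)

status, payload = read_response(client)
client.close()
server.close()
```

The client does not need to know the response size in advance, which is what lets the server forward rows as they arrive from the database cursor.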

Page 10: Metadata Services on the GRID

SOAP Frontend

- Most operations in interface implemented as simple SOAP calls
- query(): based on iterators
  - Initial request: create session
    - Open cursor on DB
    - Return initial chunk of data and session token
  - Subsequent requests
    - Client calls nextQuery() using session token
  - Termination: session closed when
    - End of data
    - Client calls endQuery()
    - Client timeout
- Implementations
  - Server: gSOAP (C++)
  - Clients: tested WSDL with gSOAP, ZSI (Python), AXIS (Java)

[Figure: sequence diagram between Client, Server and Database. Labels: query, Create DB cursor, [data], nextQuery, "Streaming", "SOAP with iterators".]

Page 11: Metadata Services on the GRID

Current Uses of the ARDA prototype

- Evaluated by LHCb-bookkeeping
  - Migrated bookkeeping metadata to ARDA prototype: 20M entries, 15 GB
  - Feedback valuable in improving interface and fixing bugs
  - Interface found to be complete
  - ARDA prototype showing good scalability
- Ganga (LHCb, ATLAS)
  - User analysis job management system
  - Stores job status on ARDA prototype
  - Highly dynamic metadata

Page 12: Metadata Services on the GRID

Performance Study

- SOAP increasingly used as standard protocol for GRID computing
  - Promising web services standard: interoperability
- Some potential weaknesses
  - XML encoding increases message size (4x to 10x typical)
  - XML processing is compute and memory intensive
- How significant are these weaknesses? What is the cost of using SOAP?
- ARDA metadata implementation ideal for comparing SOAP with a traditional RPC protocol

Page 13: Metadata Services on the GRID

Benchmark Description

- Protocols
  - TCP-S: TCP Streaming
  - SOAP: clients with gSOAP (C++), Axis (Java) and ZSI (Python)
- Operations
  - ping: a null RPC
  - add: adds an entry
  - get: gets all attributes of an entry
  - get (bulk): gets all attributes of several entries in a single operation
- Entries
  - 60 attributes (ints, floats and strings)
  - 700 bytes on average
- HTTP Keepalive/persistent connections
  - HTTP Keepalive increases HTTP performance and should improve SOAP performance
  - gSOAP supports Keepalive; Axis and ZSI don’t
  - TCP-S uses persistent TCP connections to compare with HTTP Keepalive

Page 14: Metadata Services on the GRID

SOAP Data Overhead

- Measure size overhead of XML encoding
- Ping
  - 1000 requests
  - Minimal payload: less than 5 bytes per request
  - SOAP overhead around 8 times
- Get attributes in bulk
  - Retrieve 1000 entries
  - Around 800 KB of application data
  - Streaming in TCP
  - Iterators with SOAP: 4 KB average SOAP packet payload
  - With keepalive
  - SOAP overhead around 2.5 times

Total data transferred (in KB)

                           TCP-S    SOAP    Overhead
    Ping                     151    1200    7.9
    Get 1000 Attrs (bulk)    820    2128    2.6
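The overhead column is simply the ratio of SOAP bytes to TCP-S bytes for the same workload. The short Python check below recomputes it from the totals reported in the table; the dictionary layout and function name are illustrative.

```python
# Total data transferred in KB, as reported in the table above.
measurements = {
    "Ping": {"tcp_s": 151, "soap": 1200},
    "Get 1000 Attrs (bulk)": {"tcp_s": 820, "soap": 2128},
}


def overhead(tcp_s_kb, soap_kb):
    """How many times more data SOAP transferred than TCP-S."""
    return soap_kb / tcp_s_kb


ratios = {
    name: round(overhead(m["tcp_s"], m["soap"]), 1)
    for name, m in measurements.items()
}

# Extra data relative to TCP-S, in percent, for the bulk get.
bulk = measurements["Get 1000 Attrs (bulk)"]
extra_percent = round(100 * (bulk["soap"] - bulk["tcp_s"]) / bulk["tcp_s"])

print(ratios, extra_percent)
```

The bulk get transfers roughly 160% more data over SOAP than over TCP-S, which agrees with the "over 100%" overhead quoted in the conclusions.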

Page 15: Metadata Services on the GRID

SOAP Toolkits performance

- Test protocol performance
  - No work done on the backend
  - Switched 100 Mbit/s LAN
- Language comparison
  - TCP-S with similar performance in all languages
  - SOAP performance varies strongly with toolkit
- Protocols comparison
  - Keepalive improves performance significantly
  - On Java and Python, SOAP is several times slower than TCP-S

[Figure: bar chart "1000 pings". Y axis: Execution Time [s], 0 to 25. Groups: C++ (gSOAP), Java (Axis), Python (ZSI). Series: TCP-S no KA, TCP-S KA, SOAP no KA, SOAP KA.]

Page 16: Metadata Services on the GRID

Single client results (LAN)

- Compare performance of different operations
- C++ clients (gSOAP)
- When backend must do work, differences between gSOAP and TCP-S are small
- Bulk operations very important for performance
  - getBulk 4x faster than get

[Figure: bar chart "1000 pings/1000 Entries". Y axis: Execution Time [s], 0 to 25. Groups: ping, add, get, get Bulk. Series: TCP-S no KA, TCP-S KA, gSOAP no KA, gSOAP KA.]

Page 17: Metadata Services on the GRID

Single client results (WAN)

- Client at CERN, server in Taiwan
  - ≈300 ms latency
- Results dominated by latency
  - Execution time at server irrelevant
- Large performance boost from latency hiding techniques:
  - keepalive: fewer TCP handshakes
  - bulk operations: fewer client/server interactions

[Figure: bar chart "1000 pings/1000 Entries". Y axis: Execution Time [s], 0 to 1400. Groups: ping, add, get, get Bulk. Series: TCP-S no KA, TCP-S KA, gSOAP no KA, gSOAP KA. Annotation: x5.]
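A back-of-the-envelope model shows why keepalive and bulk operations matter so much over a WAN. The Python sketch below is a latency-only estimate and does not reproduce the measured values: it assumes the ≈300 ms figure is a round-trip time, that each operation costs one round trip, that opening a TCP connection costs one additional round trip, and that server and transfer times are negligible.

```python
RTT_S = 0.3  # assumption: the ≈300 ms latency is a round-trip time


def estimated_time(operations, keepalive, rtt=RTT_S):
    """Latency-only estimate in seconds for a sequence of request/response calls."""
    if keepalive:
        # One handshake for the persistent connection, then one round trip per call.
        return rtt + operations * rtt
    # Without keepalive every call pays a handshake plus the request itself.
    return operations * 2 * rtt


no_keepalive = estimated_time(1000, keepalive=False)  # 1000 separate connections
with_keepalive = estimated_time(1000, keepalive=True)  # one persistent connection
bulk = estimated_time(1, keepalive=True)               # all entries in one call

print(no_keepalive, with_keepalive, bulk)
```

Under these assumptions, 1000 individual calls take about 600 s without keepalive and about 300 s with it, while a single bulk call takes well under a second before data transfer is counted.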

Page 18: Metadata Services on the GRID

Scalability with Multiple Clients - Pings

- Measure scalability of protocols
  - Switched 100 Mbit/s LAN
- TCP-S 3x faster than gSOAP (with keepalive)
- Poor performance without keepalive
  - Around 1,000 ops/sec (both gSOAP and TCP-S)

[Figure: line chart "1000 pings". X axis: # clients (ticks at 1, 10, 100). Y axis: Average throughput [calls/sec] (ticks at 1000, 10000). Series: TCP-S no KA; TCP-S KA; gSOAP no KA; gSOAP KA. Annotation: Client ran out of sockets.]

Page 19: Metadata Services on the GRID

Scalability with Multiple Clients - getAttr

- Measure scalability with realistic payload
  - Switched 100 Mbit/s LAN
  - All tests with keepalive
- Smaller difference between gSOAP and TCP-S
  - TCP-S 2x faster (1000 vs 500 entries/sec)
- Poor performance of non-bulk operations
  - 100 entries/sec

[Figure: line chart "1000 entries". X axis: # clients (ticks at 1, 10, 100). Y axis: Average throughput [entries/sec] (ticks at 100, 1000). Series: TCP-S Single KA; TCP-S Bulk KA; gSOAP Single KA; gSOAP Bulk KA.]

Page 20: Metadata Services on the GRID

Conclusions

- A common Metadata Interface was developed by ARDA and gLite
  - Endorsed by the EGEE standards committee
- Interface validated by ARDA prototype
  - Prototype in use by LHCb (bookkeeping, Ganga) and ATLAS (Ganga)
- SOAP performance studied using ARDA implementation
  - Toolkit performance varies widely
  - Large SOAP overhead (over 100%)