1
Distributed Data Management
Miguel Branco
DQ2 discussion on future features
BNL workshop, October 4, 2007
2
DQ2 0.4.x
• Continue to optimize the DB schema to cope with higher load
  – channel allocation to follow the ‘Dataset Subscription policy’
• Hiro/Patrick also asking for a locally configurable, ordered list of preferred sources within the cloud (sketched below)
  – implications on channel allocation
    • How much to ‘prefer’ a T1 before going to a T2 for a replica? Right now, the shortest queue wins…
  – distinguishing files unlikely to ever have replicas (bad subscriptions)
    • particularly in the local monitoring
  – removing ‘holes’ in the system (growing backlogs)
• Reduce load (better GSI session reuse)
• Goal: O(100K) file transfers/day/site
  – or SRM/storage limitations
  – need a better understanding outside DQ2
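A minimal sketch of how an ordered preferred-source list could drive source selection, assuming the shortest-queue fallback mentioned above; the names (PREFERRED_SOURCES, pick_source) and site labels are hypothetical illustrations, not DQ2 code.

# Hypothetical locally configurable, ordered preferred-source list.
PREFERRED_SOURCES = {
    # destination site -> ordered list of preferred sources within its cloud
    "SOME_T2": ["THE_T1", "ANOTHER_T2"],
}

def pick_source(destination, candidates, queue_length):
    """Pick the first preferred source that holds a replica; otherwise
    fall back to the current behaviour, where the shortest queue wins."""
    for site in PREFERRED_SOURCES.get(destination, []):
        if site in candidates:
            return site
    return min(candidates, key=queue_length)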
3
Local monitoring of site services
4
Staging…
• Did not recognize this was a problem for OSG
• … it is very hard to do with remote storages without SRM
  – FTS 2 + SRMv2 move in the right direction, but are not there yet
• Could do a local mechanism for T1->T2 transfers in the same cloud
  – provided the site services for the T2 run “close” to the T1 storage
• … but not for cross-T1 transfers
5
Hierarchies
current thoughts, for discussion (a sketch of these rules follows below)
• Hierarchical datasets would be a special kind of dataset
• These would have only 2 states: open AND frozen
• These would not have versions
• The constituents of a hierarchical dataset could only be closed dataset versions or frozen datasets
• Not sure if the following commands should be provided explicitly:
  – list files in a hierarchical dataset directly?
    • or only list the datasets in the hierarchical dataset, forcing the user to loop over the results?
  – subscribe an open hierarchical dataset?
    • or only allow listing the datasets in an open hierarchical dataset, forcing the user to manually subscribe the sub-units?
    • the point is: having to loop over OPEN hierarchies (likely manageable)
  – locations of a hierarchical dataset?
    • or only allow listing the locations of the individual datasets in the hierarchical dataset?
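A minimal sketch of the rules above, assuming hypothetical class and attribute names (HierarchicalDataset, is_closed_version, is_frozen); nothing here is existing DQ2 API.

class HierarchicalDataset:
    """Two states only (open/frozen); no versions of its own."""

    def __init__(self, name):
        self.name = name
        self.state = "open"
        self.constituents = []

    def add(self, dataset):
        # Constituents may only be closed dataset versions or frozen datasets.
        if self.state != "open":
            raise ValueError("cannot modify a frozen hierarchy")
        if not (dataset.is_closed_version or dataset.is_frozen):
            raise ValueError("constituent must be a closed dataset "
                             "version or a frozen dataset")
        self.constituents.append(dataset)

    def list_files(self):
        # The open question above: offer this directly, or force the
        # user to loop over the constituent datasets like this?
        for ds in self.constituents:
            for f in ds.files:
                yield f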
6
Merging
• Not much to do from the DQ2 side here but to provide an attribute for each dataset (sketched below)
  – “merged” Y/N (or a protocol: zip, tar?)
• DQ2 does 3rd-party transfers only
  – it does not actually ‘see’ the data
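A minimal sketch of the proposed per-dataset attribute, assuming a plain metadata dictionary; the attribute name and the zip/tar values come from the slide, everything else is hypothetical.

MERGE_PROTOCOLS = (None, "zip", "tar")

def set_merged_attribute(metadata, protocol):
    """Record whether (and how) a dataset's files were merged. DQ2 only
    drives 3rd-party transfers, so it never inspects the data itself."""
    if protocol not in MERGE_PROTOCOLS:
        raise ValueError("unknown merge protocol: %r" % (protocol,))
    metadata["merged"] = protocol is not None
    metadata["merge_protocol"] = protocol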
7
Checksums
• Not much from DQ2 here but enforcing checksums in the central catalogues and in its protocol (sketched below)
  – ‘md5:’ for MD5
• adler32 is frequently discussed as a better checksum candidate
  – but not relevant to DQ2; rather to the sites and production people
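A minimal sketch of producing prefixed checksum strings for the central catalogues; the ‘md5:’ prefix is from the slide, while the ‘ad:’ prefix for adler32 is an assumption for illustration.

import hashlib
import zlib

def file_checksums(path, chunk_size=1024 * 1024):
    """Compute MD5 and adler32 of a file in one pass, returned in
    prefixed catalogue form."""
    md5 = hashlib.md5()
    adler = 1  # adler32 seed value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            adler = zlib.adler32(chunk, adler)
    return {
        "md5": "md5:%s" % md5.hexdigest(),
        "adler32": "ad:%08x" % (adler & 0xFFFFFFFF),
    }

adler32 appeals to sites mainly because it is much cheaper to compute than MD5 while data streams through the storage.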
8
Subscription lifetime
• Increasingly important…
  – would clean up what no one is cleaning up now… (some sites have O(100K) files in impossible situations)
• Discussion from yesterday (the proposed limits are sketched below):
  – allow waitForSources to be set only by users with the production role?
    • avoids creating looping subscriptions in the system
  – forbid subscriptions for datasets with more than X files, if not requested by a production user?
  – forbid more than Y subscriptions per user, if not a production user?
  – ignore a subscription, regardless of its state, after more than 3 months?
    • the subscription is marked as broken
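A minimal sketch of the proposed limits, assuming hypothetical values for X and Y and hypothetical user/subscription attributes; only the 3-month figure and the ‘broken’ state come from the slide.

from datetime import datetime, timedelta

MAX_FILES = 10000             # hypothetical value for X
MAX_SUBSCRIPTIONS = 100       # hypothetical value for Y
MAX_AGE = timedelta(days=90)  # "more than 3 months"

def validate_subscription(user, n_files, n_user_subscriptions,
                          wait_for_sources):
    """Apply the proposed limits; production-role users are exempt."""
    if user.has_production_role:  # hypothetical attribute
        return
    if wait_for_sources:
        raise ValueError("waitForSources restricted to the production role")
    if n_files > MAX_FILES:
        raise ValueError("dataset too large for a non-production subscription")
    if n_user_subscriptions >= MAX_SUBSCRIPTIONS:
        raise ValueError("too many subscriptions for this user")

def expire_subscription(subscription, now=None):
    """Mark a subscription as broken once it is older than 3 months,
    regardless of its state."""
    now = now or datetime.utcnow()
    if now - subscription.created > MAX_AGE:  # hypothetical attribute
        subscription.state = "BROKEN"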
9
Central catalogues
• [ as mentioned yesterday ]
• Main changes are:
  – for scalability only…
  – dropping VUIDs (becomes DUID + version number)
  – DUID becomes a timestamp-oriented UUID so that the backend is partitioned in time (sketched below)
    • and highly optimized UUID storage on ORACLE
      – meaning a shorter index
    • ORACLE partitioning, redirect service…
  – … but fully backward compatible with 0.3 clients
• Many queries become much faster
  – listing files in a dataset is a query by DUID, as opposed to a query by N VUIDs
  – ORACLE IOTs guarantee that listing the files of a dataset [version] reads close-to-sequential blocks on disk
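A minimal sketch of a timestamp-oriented UUID, assuming the point is that IDs generated later compare higher, so an ORACLE range partition over the leading bytes groups DUIDs by creation period; the exact bit layout here is an illustration, not the DQ2 scheme.

import os
import time
import uuid

def timestamp_duid():
    """Build a 128-bit UUID whose leading 48 bits are milliseconds since
    the epoch, so index order follows creation time."""
    millis = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    return uuid.UUID(int=(millis << 80) | rand)

Time-ordered DUIDs also keep recent rows in the most recent partition, so the hottest queries touch a small, well-cached part of the table.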
10
Location catalogue
• [ as mentioned yesterday ]
• The location catalogue will be populated asynchronously with:
  – information on missing files
  – (re)marking complete/incomplete locations for existing datasets (consistency)
  – missing files are extra information made available to users on a ‘best-effort’ basis
    • derived from a request by Ganga
• This is populated by the ‘tracker’ service (its core comparison is sketched below)
  – which was being reworked for the site services
  – the tracker service is a ‘stronger’ Fetcher (as exists on the site services), used to find the content on a site vs. the content missing on a site, one of the site services’ performance bottlenecks
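A minimal sketch of the tracker’s core comparison, assuming its inputs are simply the dataset’s file list and the site’s known replicas; the function name is hypothetical.

def track_dataset(dataset_files, site_files):
    """Split a dataset's files into those present on a site and those
    still missing, and decide whether the location is complete."""
    dataset_files = set(dataset_files)
    site_files = set(site_files)
    present = dataset_files & site_files
    missing = dataset_files - site_files
    return present, missing, not missing  # complete if nothing is missing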
11
Dashboard
• Relatively big update coming soon
  – distinguish source/destination errors
  – display messages on the dashboard for all sites
  – alarms supported
  – more overview of the site services state from a central place
    • e.g. states of files (based also on the new site services monitoring)
12
ToA
• More and more info there…
• Blacklist/whitelist
• Preferred site connections
• This is a cache file, in the same style as the ToA cache (sketched below)
  – but an independent file from the ToA cache, since it is more dynamic
• ToA renewal is much stronger
  – I’d claim it is the most reliable info system on the Grid so far :-)
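A minimal sketch of the separate blacklist/whitelist cache, assuming a dictionary-style cache file in the same spirit as the ToA cache; all field and site names are hypothetical.

site_policy_cache = {
    "blacklist": ["SITE_A"],
    "whitelist": [],  # empty whitelist means "no restriction"
    "preferred_connections": {
        # destination site -> ordered list of preferred sources
        "SITE_B": ["SITE_C", "SITE_D"],
    },
}

def site_allowed(site):
    """A non-empty whitelist wins; otherwise anything not blacklisted."""
    whitelist = site_policy_cache["whitelist"]
    if whitelist:
        return site in whitelist
    return site not in site_policy_cache["blacklist"]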
13
Communication…
• … still not working:
  – e.g. did not recognize staging as a problem
  – e.g. 0.3.2 apparently not deployed on the OSG T2s
    • quite bad, as 0.3.1 had a simple bug where agents could simply die whenever a glitch happened in the central catalogue connection
      – glitches are “common” at the central catalogue request rate, but harmless and OK to retry
• … what to do here?
• Jabber chatroom :-)
  – [email protected]
  – ask me ([email protected] or [email protected]) to be authorized