1
Distributed Data Management
Miguel Branco
DQ2 discussion on future features
BNL workshop, October 4, 2007
2
DQ2 0.4.x
• Continue to optimize the DB schema to cope with higher load
  – channel allocation to follow the ‘Dataset Subscription policy’
• Hiro/Patrick also asking for a locally configurable, ordered list of preferred sources within the cloud (sketched below)
  – implications on channel allocation
    • How much to ‘prefer’ a T1 before going to a T2 for a replica? Right now, the shortest queue wins…
  – distinguishing files unlikely to ever have replicas (bad subscriptions)
    • particularly in the local monitoring
  – removing ‘holes’ in the system (growing backlogs)
• Reduce load (better GSI session reuse)
• Goal: O(100K) file transfers/day/site
  – or SRM/storage limitations
  – need a better understanding outside DQ2
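A minimal sketch of how an ordered preferred-source list could drive source selection, assuming the shortest-queue fallback mentioned above; the names (PREFERRED_SOURCES, pick_source) and site labels are hypothetical illustrations, not DQ2 code.

# Hypothetical locally configurable, ordered preferred-source list.
PREFERRED_SOURCES = {
    # destination site -> ordered list of preferred sources within its cloud
    "SOME_T2": ["THE_T1", "ANOTHER_T2"],
}

def pick_source(destination, candidates, queue_length):
    """Pick the first preferred source that holds a replica; otherwise
    fall back to the current behaviour, where the shortest queue wins."""
    for site in PREFERRED_SOURCES.get(destination, []):
        if site in candidates:
            return site
    return min(candidates, key=queue_length)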
3
Local monitoring of site services
4
Staging…
• Did not recognize this was a problem for OSG
• … it is very hard to do with remote storages without SRM
  – FTS 2 + SRMv2 move in the right direction, but are not there yet
• Could do a local mechanism for T1->T2 transfers in the same cloud
  – provided the site services for the T2 run “close” to the T1 storage
• … but not for cross-T1 transfers
5
Hierarchies
current thoughts, for discussion (a sketch of these rules follows below)
• Hierarchical datasets would be a special kind of dataset
• These would have only 2 states: open AND frozen
• These would not have versions
• The constituents of a hierarchical dataset could only be closed dataset versions or frozen datasets
• Not sure if the following commands should be provided explicitly:
  – list files in a hierarchical dataset directly?
    • or only list the datasets in the hierarchical dataset, forcing the user to loop over the results?
  – subscribe an open hierarchical dataset?
    • or only allow listing the datasets in an open hierarchical dataset, forcing the user to manually subscribe the sub-units?
    • the point is: having to loop over OPEN hierarchies (likely manageable)
  – locations of a hierarchical dataset?
    • or only allow listing the locations of the individual datasets in the hierarchical dataset?
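A minimal sketch of the rules above, assuming hypothetical class and attribute names (HierarchicalDataset, is_closed_version, is_frozen); nothing here is existing DQ2 API.

class HierarchicalDataset:
    """Two states only (open/frozen); no versions of its own."""

    def __init__(self, name):
        self.name = name
        self.state = "open"
        self.constituents = []

    def add(self, dataset):
        # Constituents may only be closed dataset versions or frozen datasets.
        if self.state != "open":
            raise ValueError("cannot modify a frozen hierarchy")
        if not (dataset.is_closed_version or dataset.is_frozen):
            raise ValueError("constituent must be a closed dataset "
                             "version or a frozen dataset")
        self.constituents.append(dataset)

    def list_files(self):
        # The open question above: offer this directly, or force the
        # user to loop over the constituent datasets like this?
        for ds in self.constituents:
            for f in ds.files:
                yield f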
6
Merging
• Not much to do from the DQ2 side here but to provide an attribute for each dataset (sketched below)
  – “merged” Y/N (or a protocol: zip, tar?)
• DQ2 does 3rd-party transfers only
  – it does not actually ‘see’ the data
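A minimal sketch of the proposed per-dataset attribute, assuming a plain metadata dictionary; the attribute name and the zip/tar values come from the slide, everything else is hypothetical.

MERGE_PROTOCOLS = (None, "zip", "tar")

def set_merged_attribute(metadata, protocol):
    """Record whether (and how) a dataset's files were merged. DQ2 only
    drives 3rd-party transfers, so it never inspects the data itself."""
    if protocol not in MERGE_PROTOCOLS:
        raise ValueError("unknown merge protocol: %r" % (protocol,))
    metadata["merged"] = protocol is not None
    metadata["merge_protocol"] = protocol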
7
Checksums
• Not much from DQ2 here but enforcing checksums in the central catalogues and in its protocol (sketched below)
  – ‘md5:’ for MD5
• adler32 is frequently discussed as a better checksum candidate
  – but not relevant to DQ2; rather to the sites and production people
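A minimal sketch of producing prefixed checksum strings for the central catalogues; the ‘md5:’ prefix is from the slide, while the ‘ad:’ prefix for adler32 is an assumption for illustration.

import hashlib
import zlib

def file_checksums(path, chunk_size=1024 * 1024):
    """Compute MD5 and adler32 of a file in one pass, returned in
    prefixed catalogue form."""
    md5 = hashlib.md5()
    adler = 1  # adler32 seed value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            adler = zlib.adler32(chunk, adler)
    return {
        "md5": "md5:%s" % md5.hexdigest(),
        "adler32": "ad:%08x" % (adler & 0xFFFFFFFF),
    }

adler32 appeals to sites mainly because it is much cheaper to compute than MD5 while data streams through the storage.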
8
Subscription lifetime
• Increasingly important…
  – would clean up what no one is cleaning up now… (some sites have O(100K) files in impossible situations)
• Discussion from yesterday (the proposed limits are sketched below):
  – allow waitForSources to be set only by users with the production role?
    • avoids creating looping subscriptions in the system
  – forbid subscriptions for datasets with more than X files, if not requested by a production user?
  – forbid more than Y subscriptions per user, if not a production user?
  – ignore a subscription, regardless of its state, after more than 3 months?
    • the subscription is marked as broken
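A minimal sketch of the proposed limits, assuming hypothetical values for X and Y and hypothetical user/subscription attributes; only the 3-month figure and the ‘broken’ state come from the slide.

from datetime import datetime, timedelta

MAX_FILES = 10000             # hypothetical value for X
MAX_SUBSCRIPTIONS = 100       # hypothetical value for Y
MAX_AGE = timedelta(days=90)  # "more than 3 months"

def validate_subscription(user, n_files, n_user_subscriptions,
                          wait_for_sources):
    """Apply the proposed limits; production-role users are exempt."""
    if user.has_production_role:  # hypothetical attribute
        return
    if wait_for_sources:
        raise ValueError("waitForSources restricted to the production role")
    if n_files > MAX_FILES:
        raise ValueError("dataset too large for a non-production subscription")
    if n_user_subscriptions >= MAX_SUBSCRIPTIONS:
        raise ValueError("too many subscriptions for this user")

def expire_subscription(subscription, now=None):
    """Mark a subscription as broken once it is older than 3 months,
    regardless of its state."""
    now = now or datetime.utcnow()
    if now - subscription.created > MAX_AGE:  # hypothetical attribute
        subscription.state = "BROKEN"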
9
Central catalogues
• [ as mentioned yesterday ]
• Main changes are:
  – for scalability only…
  – dropping VUIDs (becomes DUID + version number)
  – DUID becomes a timestamp-oriented UUID so that the backend is partitioned in time (sketched below)
    • and highly optimized UUID storage on ORACLE
      – meaning a shorter index
    • ORACLE partitioning, redirect service…
  – … but fully backward compatible with 0.3 clients
• Many queries become much faster
  – listing files in a dataset is a query by DUID, as opposed to a query by N VUIDs
  – ORACLE IOTs guarantee that listing the files of a dataset [version] reads close-to-sequential blocks on disk
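A minimal sketch of a timestamp-oriented UUID, assuming the point is that IDs generated later compare higher, so an ORACLE range partition over the leading bytes groups DUIDs by creation period; the exact bit layout here is an illustration, not the DQ2 scheme.

import os
import time
import uuid

def timestamp_duid():
    """Build a 128-bit UUID whose leading 48 bits are milliseconds since
    the epoch, so index order follows creation time."""
    millis = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    return uuid.UUID(int=(millis << 80) | rand)

Time-ordered DUIDs also keep recent rows in the most recent partition, so the hottest queries touch a small, well-cached part of the table.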
10
Location catalogue
• [ as mentioned yesterday ]
• The location catalogue will be populated asynchronously with:
  – information on missing files
  – (re)marking complete/incomplete locations for existing datasets (consistency)
  – missing files are extra information made available to users on a ‘best-effort’ basis
    • derived from a request by Ganga
• This is populated by the ‘tracker’ service (its core comparison is sketched below)
  – which was being reworked for the site services
  – the tracker service is a ‘stronger’ Fetcher (as exists on the site services), used to find the content on a site vs. the content missing on a site, one of the site services’ performance bottlenecks
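A minimal sketch of the tracker’s core comparison, assuming its inputs are simply the dataset’s file list and the site’s known replicas; the function name is hypothetical.

def track_dataset(dataset_files, site_files):
    """Split a dataset's files into those present on a site and those
    still missing, and decide whether the location is complete."""
    dataset_files = set(dataset_files)
    site_files = set(site_files)
    present = dataset_files & site_files
    missing = dataset_files - site_files
    return present, missing, not missing  # complete if nothing is missing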
11
Dashboard
• Relatively big update coming soon
  – distinguish source/destination errors
  – display messages on the dashboard for all sites
  – alarms supported
  – more overview of the site services state from a central place
    • e.g. states of files (based also on the new site services monitoring)
12
ToA
• More and more info there…
• Blacklist/whitelist
• Preferred site connections
• This is a cache file, in the same style as the ToA cache (sketched below)
  – but an independent file from the ToA cache, since it is more dynamic
• ToA renewal is much stronger
  – I’d claim it is the most reliable info system on the Grid so far :-)
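A minimal sketch of the separate blacklist/whitelist cache, assuming a dictionary-style cache file in the same spirit as the ToA cache; all field and site names are hypothetical.

site_policy_cache = {
    "blacklist": ["SITE_A"],
    "whitelist": [],  # empty whitelist means "no restriction"
    "preferred_connections": {
        # destination site -> ordered list of preferred sources
        "SITE_B": ["SITE_C", "SITE_D"],
    },
}

def site_allowed(site):
    """A non-empty whitelist wins; otherwise anything not blacklisted."""
    whitelist = site_policy_cache["whitelist"]
    if whitelist:
        return site in whitelist
    return site not in site_policy_cache["blacklist"]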
13
Communication…
• … still not working:
  – e.g. did not recognize staging as a problem
  – e.g. 0.3.2 apparently not deployed on the OSG T2s
    • quite bad, as 0.3.1 had a simple bug where agents could simply die whenever a glitch happened in the central catalogue connection
      – glitches are “common” at the central catalogue request rate, but harmless and OK to retry
• … what to do here?
• Jabber chatroom :-)
  – [email protected]
  – ask me ([email protected] or [email protected]) to be authorized