CMS evolution on data access: xrootd remote
access and data federation
Giacinto DONVITO INFN-BARI
Outlook
• CMS computing model: first assumptions and results
• Technology trends and impact on HEP computing models
• Xrootd concepts and features
• CMS framework enhancements
• Xrootd Redirector use cases and basic ideas
• Xrootd Redirectors actual deployment
• Xrootd first tests and monitoring
CMS computing model: first assumptions

• The computing models of the LHC experiments were designed using the MONARC (Models of Networked Analysis at Regional Centres for LHC Experiments) model. Its assumptions:
  • Resources are located in distributed centres with a hierarchical structure:
    • Tier-0: master computing centre, full resource set (disks and tapes)
    • Tier-1: national centre, resource subset (disks and tapes)
    • Tier-2: major centre (only disks)
    • Tier-3: university department
  • The network will be a bottleneck
  • We will need a hierarchical mass storage system, because we cannot afford sufficient local disk space
  • The file catalogue will not scale
  • We will overload the source site if we ask for transfers to a large number of sites
  • We need to run jobs "close" to the data to achieve efficient CPU utilization
  • We need an a priori structured and predictable data utilization
CMS computing model: first assumptions
3 CCR !""# $ Gran Sasso% &" June "# D' Bonacorsi
Tiers architectureTiers architecture
[ Image courtesy of Harvey Newman, Caltech ]
Tier2 Centre~1 TIPS
Online System
Offline Processor Farm~20 TIPS
CERN Computer Centre
FermiLab ~4 TIPSFrance RegionalCentre
Italy RegionalCentre
Germany RegionalCentre
InstituteInstituteInstituteInstitute~0.25TIPS
Physicist workstations
~100 MBytes/sec
~100 MBytes/sec
~622 Mbits/sec
~1 MBytes/sec
There is a “bunch crossing” every 25 nsecs.There are 100 “triggers” per secondEach triggered event is ~1 MByte in size
Physicists work on analysis “channels”.Each institute will have ~10 physicists working on one or morechannels; data for these channels should be cached by theinstitute server
Physics data cache
~PBytes/sec
~622 Mbits/secor Air Freight (deprecated)
Tier2 Centre~1 TIPS
Tier2 Centre~1 TIPS
Tier2 Centre~1 TIPS
Caltech~1 TIPS
~622 Mbits/sec
Tier 0Tier 0
Tier 1Tier 1
Tier 2Tier 2
Tier 4Tier 4
1 TIPS is approximately 25,000SpecInt95 equivalents
(it applies to all LHC exps, though)
CMS computing model: first assumptions
[Slide courtesy of D. Bonacorsi, CCR 2008, Gran Sasso]

Other routes: T1↔T2 transfers (example: CMS)

[Figure: CMS data flows between tiers (NB: averaged sustained throughputs):
  • Tier-0 (4.6 MSI2K, 0.4 PB disk, 4.9 PB tape, 5 Gbps WAN): 225 MB/s of RAW into tape; 280 MB/s (RAW, RECO, AOD) out to the Tier-1s; up to 1 GB/s to the worker nodes for AOD analysis and calibration.
  • Tier-1 (2.5 MSI2K, 1.0 PB disk, 2.4 PB tape, ~10 Gbps WAN): receives 280 MB/s (RAW, RECO, AOD) from the Tier-0; ~800 MB/s to the worker nodes for AOD skimming and data reprocessing; 240 MB/s (skimmed AOD, some RAW+RECO) out to the Tier-2s; 48 MB/s of MC in from the Tier-2s; 40 MB/s (RAW, RECO, AOD) exchanged with the other Tier-1s.
  • Tier-2 (0.9 MSI2K, 0.2 PB disk, 1 Gbps WAN): receives 60 MB/s (skimmed AOD, some RAW+RECO) from its Tier-1; sends 12 MB/s of MC back.]

Take the CMS data flow as an example:
• CMS T1→T2 transfers are likely to be bursty and driven by analysis demands: ~10/100 MB/s for the worst/best connected T2's
• CMS T2→T1 transfers are almost entirely fairly continuous simulation transfers: ~10 MB/s (NOTE: the aggregate input rate into the T1's is comparable to the rate from the T0)
Courtesy: J.Hernandez (CMS)
CMS computing model: first results

• Data transfers among sites are much more reliable than expected
• WAN network performance is growing fast and the network infrastructure is reliable
• Retrieving a file from a remote site can be easier than retrieving it from a local hierarchical mass storage system
• It looks like we need a clearer partition between disk and tape, using tape only as an archive
• Geographically distributed job submission and matchmaking are working well
• The tools developed to help the final users are easy to use and make the grid transparent to the user
CMS computing model: first results

[Plots: aggregate WAN transfer rates from CERN and from the T1s]

It is "easy" to get 300-400 MB/s of sustained WAN transfer out of a single T0/T1 towards all the other CMS sites.
CMS computing model: first results
A very small number of datasets are used very heavily
Technology trends and impact on HEP computing models

• The available WAN bandwidth is comparable with the backbone bandwidth available at LAN level
• Network flows are already larger than foreseen at this point in the LHC program, even with lower luminosity
• Some T2's are very large
  • All US ATLAS and US CMS T2's have 10G capability
• Some T1-T2 flows are quite large (several Gbps, up to 10 Gbps)
• T2-T2 data flows are also starting to become significant
• The concept of transferring files only regionally is substantially broken!
Technology trends and impact on HEP computing models

• The vision progressively moves away from fully hierarchical models towards peer-to-peer
  • True for both CMS and ATLAS
  • For reasons of reduced latency and increased working efficiency
• The challenge in the near future will be the available IOPS on storage systems
  • It is clear we need to optimize I/O at the application level, as disks will not improve much in performance in the future
• The hierarchical mass storage system cannot cope with dynamic data stage-in requests
• The cost per TB of disk allows us to build huge disk-only facilities
Xrootd concepts and features

What is xrootd?
• A file access and data transfer protocol
  • Defines POSIX-style byte-level random access for:
    • Arbitrary data organized as files of any type
    • Identified by a hierarchical directory-like name
• A reference software implementation
  • Embodied in the xrootd and cmsd daemons
    • The xrootd daemon provides access to data
    • The cmsd daemon clusters xrootd daemons together
Xrootd concepts and features

What isn't xrootd?
• It is not a POSIX file system
  • There is a FUSE implementation called xrootdFS: an xrootd client simulating a mountable file system
  • It does not provide full POSIX file system semantics
• It is not a Storage Resource Manager (SRM)
  • It provides SRM functionality via BeStMan
• It is not aware of any file internals (e.g., ROOT files)
  • But it is distributed with the ROOT and PROOF frameworks, as it provides unique and efficient file access primitives
Primary xrootd access modes (a minimal example follows this list)
• The ROOT framework
  • Used by most HEP and many astro experiments (MacOS, Unix and Windows)
• POSIX preload library
  • Any POSIX-compliant application (Unix only, no recompilation needed)
• File system in User SpacE (FUSE)
  • A mounted xrootd data access system (Linux and MacOS only)
• SRM, globus-url-copy, gridFTP, etc.
  • General grid access (Unix only)
• xrdcp
  • The parallel-stream, multi-source copy command (MacOS, Unix and Windows)
• xrd
  • The command-line interface for meta-data operations (MacOS, Unix and Windows)
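
As a minimal sketch of the first access mode, a ROOT-based application opens an xrootd URL exactly like a local file (the host and file names below are hypothetical):

    // read_remote.C -- ROOT macro. TFile::Open dispatches on the URL scheme,
    // so the same call works for local files and for xrootd-served ones.
    #include "TFile.h"
    #include "TTree.h"

    void read_remote() {
      TFile *f = TFile::Open("root://redirector.example.org//store/user/test.root");
      if (!f || f->IsZombie()) return;        // open (or redirection) failed
      TTree *t = (TTree *) f->Get("Events");  // "Events" is the CMS event tree
      if (t) t->Print();                      // list branches; reads only metadata
      f->Close();
    }

The equivalent bulk copy with the command-line tool would be: xrdcp root://redirector.example.org//store/user/test.root /tmp/test.root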
Xrootd concepts and features
What makes xrootd unusual?
• A comprehensive plug-in architecture
  • Security, storage back-ends (e.g., tape), proxies, etc.
• It clusters widely disparate file systems
  • Practically any existing file system, from distributed (shared-everything) to JBODs (shared-nothing)
• A unified view at local, regional, and global levels
• Very low support requirements
  • Both hardware and human administration
Xrootd concepts and features

[Diagram: the xrootd plug-in stack. From the network down:
  • Protocol driver (XRD)
  • Protocol (1 of n): xroot, proof, etc.
  • Logical file system: ofs, sfs, alice, etc.
  • Authentication (gsi, krb5, etc.), authorization (dbms, voms, etc.) and clustering (cmsd)
  • lfn2pfn prefix encoding
  • Physical storage system: ufs, hdfs, hpss, etc.
Replaceable plug-ins accommodate any environment.]

Let's take a closer look at xrootd-style clustering.
Xrootd concepts and features

A simple xrootd cluster

[Diagram: a client, a manager (a.k.a. redirector) and data servers A, B and C; every node runs an xrootd/cmsd pair, and "/my/file" is stored on server A:
  1. Client: open("/my/file") at the manager
  2. Manager, via cmsd: who has "/my/file"?
  3. The servers holding the file answer
  4. Manager to client: try open() at A]

A configuration sketch for such a cluster follows.
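
A minimal sketch of the corresponding configuration (hypothetical host names; the actual directive set is much richer than shown):

    # -- manager (redirector) node --
    all.role manager
    all.export /my

    # -- each data server node --
    all.role server
    all.manager redirector.example.org:1213   # 1213 is the default cmsd port
    all.export /my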
Xrootd concepts and features

Exploiting stackability

[Diagram: distributed clusters at ANL, SLAC and UTA, federated under a meta-manager (a.k.a. global redirector). Each site runs a manager (a.k.a. local redirector) in front of data servers A, B and C; "/my/file" exists at some of the sites:
  1. Client: open("/my/file") at the meta-manager
  2. Meta-manager, via cmsd: who has "/my/file"?
  3. Each local redirector in turn asks its servers: who has "/my/file"?
  4. The answers propagate back up
  5. Meta-manager to client: try open() at ANL
  6. Client: open("/my/file") at the ANL redirector
  7. ANL redirector to client: try open() at A
  8. Client: open("/my/file") at server A
Data is uniformly available by federating three distinct sites. An exponentially parallel search! (i.e. O(2^n))]
Federated Distributed Clusters
• Unites multiple site-specific data repositories
  • Each site enforces its own access rules
  • Usable even in the presence of firewalls
• Scalability increases as more sites join
  • Essentially a real-time BitTorrent-like social model
  • Federations are fluid and changeable in real time
• Provides multiple data sources to achieve high transfer rates
• Increases the opportunities for data analysis
  • Based on what is actually available
(A configuration sketch for joining a federation follows this list.)
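
A minimal sketch of how a site's local redirector joins a federation (hypothetical hosts): besides managing its own cluster, it subscribes to the global redirector via the "meta" form of the manager directive.

    all.role manager
    all.manager meta global-redirector.example.org:1213
    all.export /my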
Xrootd concepts and features

Copy data access architecture
• The built-in File Residency Manager drives:
  • Copy on fault
    • Demand driven (fetch to restore a missing file)
  • Copy on request
    • Pre-driven (pre-fetch the files to be used for analysis)
Xrootd concepts and features

[Diagram: the same three-site federation (ANL, SLAC, UTA) under the meta-manager (a.k.a. global redirector); "/my/file" is present at two sites. The client runs

    xrdcp -x xroot://mm.org//my/file /my

and xrdcp copies the data using the two sources in parallel.]
Direct data access architecture
• Use the remote servers as if all of them were local
  • The normal and easiest way of doing this
• Latency may be an issue (it depends on the algorithms and on the CPU to I/O ratio)
  • Requires a cost-benefit analysis to see if it is acceptable
Xrootd concepts and features

[Diagram: the same three-site federation; the client simply issues open("/my/file") at the meta-manager and then reads directly over the WAN from whichever server it is redirected to.]
Cached data access architecture
• Front the servers with a caching proxy server
  • Clients access the proxy server for all data
  • The proxy can be central or local to the client (i.e., a laptop)
  • Data comes from the proxy's cache or from the other servers
Xrootd concepts and features

[Diagram: the same three-site federation, with a caching xrootd proxy between the client and the meta-manager; the client issues open("/my/file") at the proxy, which serves it from its cache or fetches it through the federation.]
CMS framework enhancements

• CMSSW 4_x contains ROOT 5.27/06b, which has significant I/O improvements over 3_x
• CMSSW 4_x also contains modest I/O improvements on top of out-of-the-box ROOT I/O
• Newly added in ROOT:
  • Auto-flushing: all buffers are flushed to disk periodically, guaranteeing some level of locality
  • Buffer resizing: buffers are resized so that approximately the same number of events fits in each buffer
  • Read coalescing: "nearby" (but non-consecutive) reads are combined into one
CMS framework enhancements

• Incremental improvements through 3_x and 4_x: only one event tree, improved event read ordering, TTreeCache became functional, caching of non-event trees
• Improved startup: while the TTreeCache is being trained, we now read all baskets for the first 10 events at once, so startup is typically one large read instead of many small ones (a sketch in plain ROOT follows the figure below)
[Diagram: improved startup latencies (simplified). A grid of events vs. branches, where each square is a basket and the braces mark the training period. CMSSW 3_X uses individual reads: 7 read calls, 21 squares read. CMSSW 4_2 uses vectored reads: 2 read calls, 27 squares read.]

We read all branches for the first 20 events; compared to the entire file size, the bandwidth cost is minimal. On high-latency links, this can give a 5-minute reduction in the time to read the first 20 events.
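
A sketch in plain ROOT of the mechanism described above (hypothetical file and tree names): with a TTreeCache attached, reads after the training period are coalesced into large vectored requests, which is what makes high-latency links usable.

    #include "TFile.h"
    #include "TTree.h"

    void cached_read() {
      TFile *f = TFile::Open("root://redirector.example.org//store/user/test.root");
      if (!f || f->IsZombie()) return;
      TTree *t = (TTree *) f->Get("Events");
      t->SetCacheSize(20 * 1024 * 1024);   // 20 MB TTreeCache
      t->SetCacheLearnEntries(10);         // train on the first 10 events
      for (Long64_t i = 0; i < t->GetEntries(); ++i)
        t->GetEntry(i);                    // post-training: few, large reads
      f->Close();
    }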
CMS framework enhancements

Upcoming enhancements
• "Real" asynchronous prefetch (using threads and double buffering). Brian's opinion: the current patch set is problematic and not usable by CMS; at least a year out.
• Additional reduction in un-streaming time; multi-threaded un-streaming.
• ROOT implementations of things currently done in CMSSW: multiple TTreeCaches per file, improved startup latencies.
Xrootd Redirector use cases and basic ideas

• The goal of this activity is to decrease the effort required for the end-user to access CMS data.
• We have built a global prototype, based on Xrootd, which allows a user to open (almost) any file in CMS, regardless of the user's location or the file's source.
• Goals:
  • Reliability: seamless failover, even between sites
  • Transparency: open the same filename regardless of the source site
  • Usability: must be native to ROOT
  • Global: cover as many CMS files as possible
Xrootd Redirector use cases and basic ideas

• The targets of this project:
  • End-users: event viewers, running on a few CMS files, sharing files with a group
  • T3s: running medium-scale ntuple analyses
• None of these users are well served by CMS tools right now.
  • So, a prototype is better than nothing…
Xrootd Redirector use cases and basic ideas

Xrootd prototype
• Have a global redirector that users can contact for all files.
• Any ROOT-based application can access the prototype infrastructure; each file has only one possible URL (see the sketch below).
• Each participating site deploys at least one xrootd server that acts as a proxy/door to the external world.
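
A sketch of what "one URL per file" means in practice (the global redirector host and the file name are hypothetical): the same call works from any ROOT-based application, wherever the file is actually hosted.

    #include "TFile.h"

    void open_via_federation() {
      // The global redirector locates a site holding the file and redirects us.
      TFile *f = TFile::Open(
          "root://global-redirector.example.org//store/data/Run2011A/example.root");
      if (f && !f->IsZombie()) {
        // ... analyze as if the file were local ...
        f->Close();
      }
    }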
Xrootd Redirector use cases and basic ideas

Illustration

[Diagram: a remote user contacts a site's xrootd door, which consults LCMAPS (the site mapping policy) and an Authfile:
  1. The user attempts login
  2. Xrootd requests a mapping for the user's DN / VOMS attributes from LCMAPS
  3. Result: user "brian"
  4. Login successful!
  5. The user opens /store/foo
  6. Xrootd asks the Authfile: can brian open /store/foo?
  7. Permit
  8. Open successful!]

• Authentication and authorization are always delegated to the site hosting the file (a hypothetical Authfile entry is sketched below).
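
For illustration only, an Authfile entry of the flavor implied above might grant the mapped user read access under /store (the exact privilege letters should be checked against the xrootd authorization documentation):

    u brian /store lr    # user "brian": lookup + read below /store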
Xrootd Redirectors actual deployment

Site integration
• We have yet to run into a storage system that cannot be integrated with Xrootd. Techniques:
  • POSIX: works out of the box. Example: Lustre
  • C API: requires a new XrdOss plugin. Examples: HDFS, dcap, REDDNet (?)
  • Native implementation: the latest dCache has built-in Xrootd support; combine it with the xrootd.org cmsd to join the federation
  • Direct access: let xrootd read the other system's files directly from disk (dCache @ FNAL)
Xroot over HDFS
• In this case we need a native xrootd door that reads files using the standard HDFS C library
• This is quite similar to the installation of a gridftp door
• Data path: Xrootd redirector → native Xrootd local door → local HDFS installation (a configuration sketch follows)
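
A minimal sketch of such a door (hypothetical hosts and paths; the plugin library is the one shipped by the OSG xrootd-hdfs package):

    all.role server
    all.manager redirector.example.org:1213
    all.export /store
    ofs.osslib /usr/lib64/libXrdHdfs.so   # storage back-end plugin reading HDFS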
Xroot over xroot
• In this case we need a native xrootd door that reads files using the xroot library
• This machine is configured in "proxy mode"
• This requires that the dCache instance already provides a dCache xrootd endpoint
• Data path: Xrootd EU redirector → native Xrootd local door (proxy) → dCache xrootd door → local dCache installation (a configuration sketch follows)
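
A minimal sketch of the proxy-mode door (hypothetical hosts): the proxy storage plugin forwards all I/O to the existing dCache xrootd endpoint.

    all.role server
    all.manager eu-redirector.example.org:1213
    all.export /store
    ofs.osslib libXrdPss.so                   # proxy storage system plugin
    pss.origin dcache-door.example.org:1094   # the dCache xrootd endpoint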
Xroot over posix
• In this case we need one or more native xrootd doors that are clients of the local parallel file system (GPFS/Lustre)
• Data path: Xrootd EU redirector → native Xrootd local door → local GPFS/Lustre installation (a configuration sketch follows)
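
A minimal sketch of the POSIX case (hypothetical hosts and mount point): no storage plugin is needed, since the door reads the mounted file system directly.

    all.role server
    all.manager eu-redirector.example.org:1213
    all.export /store
    oss.localroot /gpfs     # prepended to the exported namespace on disk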
Xrootd Redirectors actual deployment

CMSSW performance
• Note that the use cases all involve wide-area access to CMS data, whether interactively or through the batch system.
• The WAN performance of the software is key to this project's success.
• Current monitoring shows that 99% of the bytes CMS jobs request come via vectored reads; still, many jobs issue about 300 reads per file in total. At wide-area latencies of ~100 ms per round trip, that is at least 30 s of I/O that a locally running job wouldn't have.
• We plan to invest our time in I/O to make sure these numbers improve, or at least don't regress.
Xrootd Redirectors actual deployment

Xrootd deployment
• US Xrootd redirector:
  • @ Nebraska
  • Collecting data from the US sites: Nebraska, Purdue, Caltech, UCSD, FNAL
• EU Xrootd redirector:
  • @ INFN-Bari
  • Collecting data from the EU sites: INFN-Bari, INFN-Pisa, DESY, Imperial College (UK)
Xrootd Redirectors actual deployment

Xrootd monitoring
• Xrootd provides per-file and per-connection monitoring
  • I.e., a UDP packet per login, open, close and disconnect, plus a summary of every 5 s of activity
• This is forwarded to a central server, where it is summarized and forwarded to several consumers (monitoring flows into MonALISA; a configuration sketch follows)
• Xrootd also provides a per-server summary (connections, files open) that is fed directly to MonALISA
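
A sketch of the kind of directive that enables this stream (hypothetical collector host; the exact option set should be checked against the xrootd manual):

    xrootd.monitor all flush 30s window 5s dest files info user collector.example.org:9330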
Xrootd Redirectors actual deployment

Xrootd distributed performance
• Using CMSSW 4_1_3 and an I/O-bound analysis
• Jobs run at T2_IT_LNL while the data are hosted at T2_IT_Bari
  • ping time: 12 ms
• CPU efficiency drops from 68% to 50%
  • i.e., a ~30% performance drop
• This is close to being a worst case
Conclusion
• The Xrootd redirector infrastructure is fulfilling a few use cases that are not covered by official CMS tools
• It is already being used:
  • by final users
    • for debugging code, visualization, etc.
  • and by computing centres
    • that do not have enough manpower to fully support CMS locally
• Tuning CMSSW is very important in order to provide good performance when accessing data remotely