Fabrizio Furano - The Scalla suite and the Xrootd (July 2008)
Data access and Storage
And more
From the xrootd and Scalla perspective
Fabrizio Furano
CERN IT/GS
July 2008
South African National Compute Grid Training
Deployment and Strategy Meeting
University of Cape Town
http://savannah.cern.ch/projects/xrootd
http://xrootd.slac.stanford.edu

The historical problem: data access

Physics experiments rely on rare events and statistics
  ◦ Huge amount of data to get a significant number of events
The typical data store can reach 5-10 PB… now
Millions of files, thousands of concurrent clients
    Each one opening many files (about 100-150 in ALICE, up to 1000 in GLAST)
    Each one keeping many open files
  ◦ The transaction rate is very high
    Not uncommon: O(10^3) file opens/sec per cluster
    Average, not peak
    Traffic sources: local GRID site, local batch system, WAN
Need scalable high performance data access
  ◦ No imposed limits on performance and size, connectivity

What is Scalla?

The evolution of the BaBar-initiated xrootd project
Data access with HEP requirements in mind
  ◦ But a fully generic platform nonetheless
Structured Cluster Architecture for Low Latency Access
  ◦ Low Latency Access to data via xrootd servers
    POSIX-style byte-level random access
    By default, arbitrary data organized as files
    Hierarchical directory-like name space
    Protocol includes high performance features
    Exponentially scalable and self organizing
  ◦ Tools and methods to cluster, harmonize, connect, …

xrootd Plugin Architecture

[Diagram: stack of xrootd plugins]
lfn2pfn: prefix encoding
Storage System (oss, drm/srm, etc)
authentication (gsi, krb5, etc)
Clustering (cmsd)
authorization (name based)
File System (ofs, sfs, alice, etc)
Protocol (1 of n) (xrootd)
Protocol Driver (XRD)
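
The "lfn2pfn: prefix encoding" box maps a logical file name onto a physical path. The sketch below is only an illustration of that idea in Python. It is not the xrootd plugin interface, and the function name, the `local_root` parameter and the example paths are all assumptions made for the example.

```python
import posixpath


def lfn2pfn(lfn: str, local_root: str) -> str:
    """Map a logical file name to a physical path by prepending a local prefix.

    Hypothetical helper: it mimics the idea of prefix encoding only.
    """
    if not lfn.startswith("/"):
        raise ValueError("logical file names are expected to be absolute")
    # Normalise first so that '..' components cannot climb above the prefix.
    clean = posixpath.normpath(lfn)
    return posixpath.join(local_root, clean.lstrip("/"))


if __name__ == "__main__":
    # Example paths are made up for illustration.
    print(lfn2pfn("/alice/data/run1.root", "/data/xrd"))
```

Keeping the mapping in one small function is what lets the same logical name space sit on differently organised disks at each site.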

Different usages

Default set of plugins:
  ◦ Scalable file server functionalities
    Its primary historical function
  ◦ To be used in common data management schemes
The ROOT framework bundles it as it is
  ◦ And provides one more plugin: XrdProofdProtocol
  ◦ Plus several other ROOT-side classes
  ◦ The heart of PROOF: the Parallel ROOT Facility
    A completely different task by loading a different plugin
    Massive low latency parallel computing of independent items (events in physics)
    Using the characteristics of the xrootd framework

Most famous basic features

No weird configuration requirements
    Scale setup complexity with the requirements' complexity
Fault tolerance
High, scalable transaction rate
    Open many files per second. Double the system and double the rate.
    NO DBs! Would you put one in front of your laptop's file system?
    No known limitations in size and global throughput for the repo
Very low CPU usage
Happy with many clients per server
    Thousands. But check their bandwidth consumption vs the disk/net performance!
WAN friendly (client+protocol+server)
    Enable efficient remote POSIX-like data access
WAN friendly (server clusters)
    Can set up WAN-wide repositories by aggregating remote clusters

Basic working principle

[Diagram: a client and four nodes, each running a cmsd and xrootd pair]
A small 2-level cluster. Can hold up to 64 servers.
P2P-like

Simple LAN clusters

[Diagram: trees of nodes, each running a cmsd and xrootd pair]
Simple cluster
    Up to 64 data servers
    1-2 mgr redirectors
Advanced cluster
    Up to 4096 (2 lvls) or 262K (3 lvls) data servers
Everything can have hot spares
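
The cluster sizes quoted above follow from a fan-out of 64 per node: 64, 64^2 = 4096 and 64^3 = 262144 (the "262K"). A minimal Python sketch of that arithmetic follows. The function and parameter names are made up for illustration.

```python
FANOUT = 64  # servers that one manager can hold, as quoted on the slide


def max_data_servers(levels: int, fanout: int = FANOUT) -> int:
    """Upper bound on data servers reachable through `levels` layers of fan-out."""
    if levels < 1:
        raise ValueError("at least one level is needed")
    return fanout ** levels


if __name__ == "__main__":
    for levels in (1, 2, 3):
        print(levels, "level(s):", max_data_servers(levels), "data servers")
```

Each extra level multiplies the capacity by 64, which is why the suite is described as exponentially scalable.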

Single point performance

Very carefully crafted, heavily multithreaded
  ◦ Server side: promote speed and scalability
    High level of internal parallelism + stateless
    Exploits OS features (e.g. async i/o, polling, selecting)
    Many many speed+scalability oriented features
    Supports thousands of client connections per server
    No interactions with complicated things to do simple tasks
  ◦ Client: handles the state of the communication
    Reconstructs everything to present it as a simple interface
    Fast data path
    Network pipeline coordination + latency hiding
    Supports connection multiplexing + intelligent server cluster crawling
Server and client exploit multi core CPUs natively

Fault tolerance

Server side
  ◦ If servers go, the overall functionality can be fully preserved
    Redundancy, MSS staging of replicas, …
    "Can" means that weird deployments can give it up
      E.g. storing in a DB the physical endpoint addresses for each file. Generally a bad idea.
Client side (+protocol)
    The client crawls the server metacluster looking for data
  ◦ The application never notices errors
    Totally transparent, until they become fatal
      i.e. when it becomes really impossible to get to a working endpoint to resume the activity
  ◦ Typical tests (try it!)
    Disconnect/reconnect network cables
    Kill/restart servers
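
The client-side behaviour described above (errors stay invisible until no working endpoint is left) can be sketched as a retry loop. This is a toy Python model and not the XrdClient code. `EndpointDown`, `read_with_failover` and the fake endpoints are hypothetical names introduced for the example.

```python
class EndpointDown(Exception):
    """Raised by the (hypothetical) read function when a server is unreachable."""


def read_with_failover(endpoints, read_fn):
    """Try each endpoint in turn and raise only when none of them works."""
    last_error = None
    for endpoint in endpoints:
        try:
            return read_fn(endpoint)      # success: the application sees no error
        except EndpointDown as err:
            last_error = err              # remember the failure, try the next one
    # The error becomes fatal only here, when no endpoint is left.
    raise RuntimeError("no working endpoint left") from last_error


if __name__ == "__main__":
    alive = {"server-c"}                  # pretend only this server is up

    def fake_read(endpoint):
        if endpoint not in alive:
            raise EndpointDown(endpoint)
        return f"data from {endpoint}"

    print(read_with_failover(["server-a", "server-b", "server-c"], fake_read))
```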

Available auth protocols

Password-based (pwd)
  ◦ Either system or dedicated password file
    User account not needed
GSI (gsi)
  ◦ Handle GSI proxy certificates
  ◦ VOMS support should be OK now (Andreas, Gerri)
  ◦ No need of Globus libraries (and super-fast!)
Kerberos IV, V (krb4, krb5)
  ◦ Ticket forwarding supported for krb5
  ◦ Fast ID (unix, host) to be used w/ authorization
ALICE security tokens
  ◦ Emphasis on ease of setup and performance
Courtesy of Gerardo Ganis (CERN PH-SFT)

The “many” paradigm

Creating big clusters scales linearly
    The throughput and the size, keeping latency very low
We like the idea of disk-based cache
    The bigger (and faster), the better
So, why not use the disk of every WN?
    In a dedicated farm
    500 GB * 1000 WN → 500 TB
    The additional CPU usage is anyway quite low
Can be used to set up a huge cache in front of a MSS
    No need to buy a bigger MSS, just lower the miss rate!
Adopted at BNL for STAR (up to 6-7 PB online)
    See Pavel Jakl's (excellent) thesis work
    They also optimize MSS access to nearly double the staging performance
Quite similar to the PROOF approach to storage
    Only storage. PROOF is very different for the computing part.
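
The arithmetic behind the worker-node cache can be written down in a few lines of Python. The disk size and node count are the slide's figures. The open rate reuses the O(10^3) opens/sec order of magnitude mentioned earlier in the talk, and the 5% miss rate is an assumed value chosen only for illustration.

```python
def aggregate_cache_tb(disk_gb_per_node: float, nodes: int) -> float:
    """Total disk cache in TB (decimal units, 1 TB = 1000 GB)."""
    return disk_gb_per_node * nodes / 1000


def mss_request_rate(open_rate: float, miss_rate: float) -> float:
    """File requests per second that fall through the disk cache to the MSS."""
    return open_rate * miss_rate


if __name__ == "__main__":
    print(aggregate_cache_tb(500, 1000), "TB of aggregated cache")
    # Assumed numbers: 1000 opens/sec with a 5% miss rate.
    print(mss_request_rate(1000, 0.05), "requests/sec reaching the MSS")
```

The second function is the point of the slide: the load on the MSS scales with the miss rate, so a bigger disk cache relieves the MSS without replacing it.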

WAN direct access – Motivation

We want to make WAN data analysis convenient
  ◦ A process does not always read every byte in a file
  ◦ Even if it does… no problem
  ◦ The typical way in which HEP data is processed is (or can be) often known in advance
    TTreeCache in ROOT does an amazing job for this
  ◦ xrootd: fast and scalable server side
    Makes things run quite smoothly
    Gives room for improvement at the client side
      About WHEN to transfer the data
      There might be better moments to trigger a chunk xfer than the moment it is needed
      The app does not have to wait while it receives data… in parallel

WAN direct access – hiding latency

[Diagram: timelines of data access and data processing for three strategies]
Pre-xfer data "locally": overhead. Need for potentially useless replicas, and a huge bookkeeping!
Remote access: latency. Wasted CPU cycles, but easy to understand.
Remote access+: interesting! Efficient, practical.
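
A back-of-the-envelope model shows why overlapping data access with data processing pays off. The Python sketch below is a simplified model and not a measurement: it ignores bandwidth and assumes that `depth` requests can be in flight at once. The function names and the example numbers (other than the 164 ms round trip used later in the talk) are assumptions.

```python
def synchronous_time(n_chunks: int, latency: float, processing: float) -> float:
    """Each chunk is requested only when needed: the app waits for every round trip."""
    return n_chunks * (latency + processing)


def pipelined_time(n_chunks: int, latency: float, processing: float,
                   depth: int = 1) -> float:
    """Chunks are requested ahead of time, with `depth` requests in flight.

    In steady state a chunk completes every max(latency / depth, processing)
    seconds; only the first round trip is fully exposed.
    """
    if n_chunks == 0:
        return 0.0
    steady = max(latency / depth, processing)
    return latency + processing + (n_chunks - 1) * steady


if __name__ == "__main__":
    n, rtt, cpu = 100, 0.164, 0.05   # assumed: 100 chunks, 50 ms of CPU per chunk
    print("synchronous:", synchronous_time(n, rtt, cpu))
    for depth in (1, 4):
        print("pipelined, depth", depth, ":", pipelined_time(n, rtt, cpu, depth))
```

Once enough requests are in flight, the run time is dominated by processing rather than by the round-trip time, which is the "Remote access+" case.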

Dumb WAN Access*

Setup: client at CERN, data at SLAC
  ◦ 164 ms RTT, available bandwidth < 100 Mb/s
Smart features switched OFF
  ◦ Test 1: Read a large ROOT Tree (~300 MB, 200k interactions)
    Expected time: 38000 s (latency) + 750 s (data) + CPU: 10 hrs!
    ➙ No time to waste to precisely measure this!
  ◦ Test 2: Draw a histogram from that tree data (~6k interactions)
    Measured time: 20 min
    Using xrootd with WAN optimizations disabled
*Federico Carminati, The ALICE Computing Status and Readiness, LHCC, November 2007
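
The latency term can be estimated by charging one full round trip per interaction. The Python sketch below does only that. It is a rough model, not the method used for the slide's figures. With a 164 ms RTT it gives about 32,800 s for 200k interactions, the same order as the 38,000 s quoted for Test 1, and about 984 s (roughly 16 minutes) for 6k interactions, in line with the 20 minutes measured for Test 2.

```python
RTT_S = 0.164  # round-trip time CERN <-> SLAC, from the slide


def latency_cost_s(interactions: int, rtt_s: float = RTT_S) -> float:
    """Time spent waiting if every interaction costs one full round trip."""
    return interactions * rtt_s


if __name__ == "__main__":
    test1 = latency_cost_s(200_000)   # Test 1: read a large ROOT tree
    test2 = latency_cost_s(6_000)     # Test 2: draw a histogram
    print(f"Test 1: {test1:.0f} s, about {test1 / 3600:.1f} hours")
    print(f"Test 2: {test2:.0f} s, about {test2 / 60:.1f} minutes")
```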

Smart WAN Access*

Smart features switched ON
    ROOT TTreeCache + XrdClient Async mode + 15*multistreaming
  ◦ Test 1 actual time: 60-70 seconds
    Compared to 30 seconds using a Gb LAN
    Very favorable for sparsely used files
    … at the end, even much better than certain always-overloaded SEs…
  ◦ Test 2 actual time: 7-8 seconds
    Comparable to LAN performance (5-6 secs)
    100x improvement over dumb WAN access (was 20 minutes)
*Federico Carminati, The ALICE Computing Status and Readiness, LHCC, November 2007

Cluster globalization

Up to now, xrootd clusters could be populated
  ◦ With xrdcp from an external machine
  ◦ Writing to the backend store (e.g. CASTOR/DPM/HPSS etc.)
E.g. FTD in ALICE now uses the first. It "works"…
    Load and resources problems
    All the external traffic of the site goes through one machine
      Close to the dest cluster
If a file is missing or lost
  ◦ For disk and/or catalog screwup
  ◦ Job failure... manual intervention needed
With 10^7 online files, finding the source of a problem can be VERY tricky

Virtual MSS

Purpose:
  ◦ A request for a missing file comes at cluster X,
  ◦ X assumes that the file ought to be there
    And tries to get it from the collaborating clusters, from the fastest one
Note that X itself is part of the game
  ◦ And it's composed of many servers
The idea is that
  ◦ Each cluster considers the set of ALL the others like a very big online MSS
  ◦ This is much easier than it seems
Slowly going into production for ALICE

Cluster Globalization… an example

[Diagram: local clusters, each with a cmsd and xrootd manager, subscribed to a global redirector]
ALICE global redirector (alirdr), root://alirdr.cern.ch/
    all.role meta manager
    all.manager meta alirdr.cern.ch:1312
    Includes CERN, GSI, and other xroot clusters
Local clusters: CERN, GSI, Prague, NIHAM, … any other
    all.role manager
    all.manager meta alirdr.cern.ch:1312
Meta Managers can be geographically replicated
    Can have several in different places for region-aware load balancing

Many pieces

Global redirector acts as a WAN xrootd meta-manager
Local clusters subscribe to it
  ◦ And declare the path prefixes they export
  ◦ Local clusters (without local MSS) treat the globality as a very big MSS
  ◦ Coordinated by the Global redirector
    Load balancing, negligible load
    Priority to files which are online somewhere
    Priority to fast, least-loaded sites
    Fast file location
True, robust, realtime collaboration between storage elements!
  ◦ Very attractive for tier-2s

The Virtual MSS Realized

[Diagram: CERN, GSI, Prague, NIHAM and any other cluster, each with a cmsd and xrootd manager, subscribed to the ALICE global redirector]
ALICE global redirector
    all.role meta manager
    all.manager meta alirdr.cern.ch:1312
Local clusters
    all.role manager
    all.manager meta alirdr.cern.ch:1312
Local clients work normally
But missing a file?
    Ask the global metamgr
    Get it from any other collaborating cluster
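
The virtual MSS logic (serve locally when possible, otherwise ask the global redirector and fetch from the best cluster that has the file online) can be modelled in a few lines. This Python sketch is a toy model and not xrootd code. The class names, the `load` metric used to pick the "fastest" site and the example file path are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    load: float                               # hypothetical metric, lower is better
    online_files: set = field(default_factory=set)


class GlobalRedirector:
    """Toy meta-manager: knows which subscribed cluster has which file online."""

    def __init__(self):
        self.clusters = []

    def subscribe(self, cluster):
        self.clusters.append(cluster)

    def locate(self, path, exclude=None):
        candidates = [c for c in self.clusters
                      if c is not exclude and path in c.online_files]
        # Priority to the least-loaded site that has the file online.
        return min(candidates, key=lambda c: c.load) if candidates else None


def open_file(cluster, redirector, path):
    """Return the name of the cluster that ends up serving the file."""
    if path in cluster.online_files:
        return cluster.name                   # local clients work normally
    source = redirector.locate(path, exclude=cluster)
    if source is None:
        raise FileNotFoundError(path)
    cluster.online_files.add(path)            # the file is staged into the local cluster
    return source.name


if __name__ == "__main__":
    redirector = GlobalRedirector()
    cern = Cluster("CERN", load=0.8, online_files={"/alice/run1.root"})
    gsi = Cluster("GSI", load=0.2, online_files={"/alice/run1.root"})
    prague = Cluster("Prague", load=0.5)
    for site in (cern, gsi, prague):
        redirector.subscribe(site)
    print(open_file(prague, redirector, "/alice/run1.root"))   # fetched remotely
    print(open_file(prague, redirector, "/alice/run1.root"))   # now local
```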

Virtual MSS – The vision

Powerful mechanism to increase reliability
  ◦ Data replication load is widely distributed
  ◦ Multiple sites are available for recovery
Allows virtually unattended operation
  ◦ Automatic restore after a server failure
    Missing files in one cluster fetched from another
    Typically the fastest one which has the file really online
    No costly out of time (and sync!) DB lookups
  ◦ Practically no need to track file location
    But does not stop the need for metadata repositories

Virtual MSS

The mechanism is there, fully "boxed"
  ◦ The new setup does almost everything that's needed
A (good) side effect:
  ◦ Pointing an app to the "area" global redirector gives a complete, load-balanced, low latency view of the whole repository
  ◦ An app using the "smart" WAN mode can just run
    Probably now a full scale production/analysis won't
    But what about an interactive small analysis on a laptop?
    After all, HEP sometimes just copies everything, useful and not
    I cannot say that in some years we will not have a more powerful WAN infrastructure
      And using it to copy more useless data looks just ugly
    If a web browser can do it, why not a HEP app? Looks just a little more difficult.
Better if used with a clear design in mind

Data System vs File System

Scalla is a data access system
  ◦ Some users/applications want file system semantics
    More transparent but much less scalable (transactional namespace)
For years users have asked…
  ◦ Can Scalla create a file system experience?
The answer is…
  ◦ It can, to a degree that may be good enough
We relied on FUSE to show how
Users shall rely on themselves to decide
    If they actually need a huge multi-PB unique filesystem
    Probably there is something else which is "strange"

What is FUSE

Filesystem in Userspace
Used to implement a file system in a user space program
  ◦ Linux 2.4 and 2.6 only
  ◦ Refer to http://fuse.sourceforge.net/
Can use FUSE to provide xrootd access
    Looks like a mounted file system
Several people have xrootd-based versions of this
  ◦ Wei Yang at SLAC
    Tested and fully functional (used to provide SRM access for ATLAS)

XrootdFS (Linux/FUSE/Xrootd)

[Diagram: XrootdFS layers on the client host and the redirector host]
Client host
    User space: Appl, POSIX File System Interface, FUSE/Xroot Interface, xrootd POSIX Client
    Kernel: FUSE
Redirector host
    Redirector xrootd:1094
    Name Space xrootd:2094
    Operations shown: opendir, create, mkdir, mv, rm, rmdir
Should run cnsd on servers to capture non-FUSE events
And keep the FS namespace!

Why XrootdFS?

Makes some things much simpler
  ◦ Most SRM implementations run transparently
  ◦ Avoid pre-load library worries
But impacts other things
  ◦ Performance is limited
    Kernel-FUSE interactions are not cheap
    The implementation is OK but quite simple-minded
    Rapid file creation (e.g., tar) is limited
    Remember that the comparison is with a plain xrootd cluster, much faster
  ◦ FUSE must be administratively installed to be used
    Difficult if it involves many machines (e.g., batch workers)
    Easier if it involves an SE node (i.e., SRM gateway)
So, it's good for the SRM-side of a repo
  ◦ But not much for the job side

Conclusion

Many new ideas are reality or coming
Typically dealing with
  ◦ True realtime data storage distribution
  ◦ Interoperability (Grid, SRMs, file systems, WANs…)
  ◦ Enabling interactivity (and storage is not the only part of it)
The setup encapsulation + vMSS is ready
  ◦ In production at CERN for ALICE::CERN::SE
Trying to avoid common mistakes
    Both manual and automated setups are honourable and to be honoured!

Acknowledgements

Old and new software collaborators
  ◦ Andy Hanushevsky, Fabrizio Furano (client-side), Alvise Dorigo
  ◦ Root: Fons Rademakers, Gerri Ganis (security), Bertrand Bellenot (windows porting)
  ◦ Alice: Derek Feichtinger, Andreas Peters, Guenter Kickinger
  ◦ STAR/BNL: Pavel Jakl, Jerome Lauret
  ◦ GSI: Kilian Schwartz
  ◦ Cornell: Gregory Sharp
  ◦ SLAC: Jacek Becla, Tofigh Azemoon, Wilko Kroeger, Bill Weeks
  ◦ Peter Elmer
Operational collaborators
  ◦ BNL, CERN, CNAF, FZK, INFN, IN2P3, RAL, SLAC

Single Level Switch

[Diagram: a client, a redirector (head node) and data servers A, B, C forming a cluster]
Client to redirector: open file X
Redirector to data servers: Who has file X?
Data server C to redirector: I have
Redirector to client: go to C
Client to C: open file X
2nd open X: redirector answers "go to C" directly
Redirectors cache file location
Client sees all servers as xrootd data servers
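
The message flow of the single level switch can be replayed with a toy model: the redirector asks the data servers who has the file, remembers the answer, and redirects later opens without asking again. The Python sketch below is illustrative only. The class and attribute names are hypothetical and no real xrootd or cmsd protocol messages are modelled.

```python
class DataServer:
    def __init__(self, name, files):
        self.name = name
        self.files = set(files)

    def has(self, path):
        """Answer the redirector's "who has file X?" query."""
        return path in self.files


class Redirector:
    """Toy head node: locates files and caches where they are."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.location_cache = {}
        self.queries = 0                      # how many times the servers were asked

    def open(self, path):
        """Return the name of the data server the client should go to."""
        if path in self.location_cache:
            return self.location_cache[path]  # 2nd open: answered from the cache
        self.queries += 1
        for server in self.servers:
            if server.has(path):
                self.location_cache[path] = server.name
                return server.name
        raise FileNotFoundError(path)


if __name__ == "__main__":
    redirector = Redirector([DataServer("A", []), DataServer("B", []),
                             DataServer("C", ["X"])])
    print("go to", redirector.open("X"))      # servers are queried
    print("go to", redirector.open("X"))      # served from the location cache
    print("server queries:", redirector.queries)
```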