1 CHEP 2003 Arie Shoshani
Experience with Deploying Storage Resource Managers to Achieve Robust File Replication
Arie Shoshani
Alex Sim
Junmin Gu
Scientific Data Management Group
Lawrence Berkeley National Laboratory
http://sdm.lbl.gov/srm
Outline
• File replication problem: motivation
• What are Storage Resource Managers
• General Analysis Scenario and the use of SRMs
• SRM functionality
• SRMs use for file replication – robustness
• Advantages of using SRMs for file replication
• File monitoring tool
• Analysis of file replication
Motivation
• Multi-File Replication – why is it a problem?
  • Tedious task – many files, repetitious
  • Lengthy task – long transfer time, can take days
  • Error prone – need to monitor scripts
  • Error recovery – need to restart file transfers
  • Stage and archive from MSS – limited concurrency, down time, transient failures
  • Use of FTP – large windows, concurrent transfer
  • Security – both for local MSS and the network
  • Firewalls – transfer from/to MSS must be internal to the site
What are Storage Resource Managers?
• Grid architecture needs to include reservation & scheduling of:
  • Compute resources
  • Storage resources
  • Network resources
• Storage Resource Managers (SRMs) role in the data grid architecture
  • Shared storage resource allocation & scheduling
  • Especially important for data intensive applications
  • Often files are archived on a mass storage system (MSS)
  • Wide area networks – minimize transfers
    • Large scientific collaborations (100’s of nodes, 1000’s of clients) – opportunities for file sharing
  • File replication and caching may be used
  • Need to support non-blocking (asynchronous) requests
General Analysis Scenario
[Figure: architecture diagram of a general analysis scenario. Components shown: clients at the client’s site issuing a logical query; a Request Interpreter (request planning) that uses a metadata catalog, a replica catalog, and a Network Weather Service; a Request Executer that works from an execution plan and site-specific files (execution DAG); and Site 1, Site 2, … Site N on the network, with Storage Resource Managers, Compute Resource Managers, compute engines, disk caches, and an MSS. Other labels: “a set of logical files”, “requests for data placement and remote computation”, “result files”.]
SRM is a Service
• SRM functionality
  • Manage space
    • Negotiate and assign space to users
    • Manage “lifetime” of spaces
  • Manage files on behalf of a user
    • Pin files in storage till they are released
    • Manage “lifetime” of files
    • Manage action when pins expire (depends on file types)
  • Manage file sharing
    • Policies on what should reside on a storage resource at any one time
    • Policies on what to evict when space is needed
  • Get files from remote locations when necessary
    • Purpose: to simplify client’s task
  • Manage multi-file requests
    • A brokering function: queue file requests, pre-stage when possible
  • Provide grid access to/from mass storage systems
    • HPSS (LBNL, ORNL, BNL), Enstore (Fermi), JasMINE (Jlab), Castor (CERN), MSS (NCAR), …
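The pinning, lifetime, and eviction ideas listed above can be illustrated with a toy disk cache. This is only a sketch under assumptions: the class `PinnedCache`, its method names, and the least-recently-used eviction rule are hypothetical choices made for the example, not the SRM interface or its actual policies.

```python
import time


class PinnedCache:
    """Toy disk cache with pinned files (illustrative; not an SRM implementation)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        # name -> {"size": int, "pin_expires": float, "last_used": float}
        self.files = {}

    def used(self):
        return sum(f["size"] for f in self.files.values())

    def is_pinned(self, name, now):
        return self.files[name]["pin_expires"] > now

    def put(self, name, size, pin_lifetime, now=None):
        """Store a file and pin it for pin_lifetime seconds, evicting if needed."""
        now = time.time() if now is None else now
        # Evict unpinned files, least recently used first, until the file fits.
        while self.used() + size > self.capacity:
            victims = [n for n in self.files if not self.is_pinned(n, now)]
            if not victims:
                return False  # everything is pinned: the request has to wait
            oldest = min(victims, key=lambda n: self.files[n]["last_used"])
            del self.files[oldest]
        self.files[name] = {
            "size": size,
            "pin_expires": now + pin_lifetime,
            "last_used": now,
        }
        return True

    def release(self, name):
        """Release the pin early; the file stays cached until it is evicted."""
        self.files[name]["pin_expires"] = 0.0
```

A pinned file is never evicted, so a request that does not fit is refused until a pin is released or its lifetime expires. That is the property that lets a client rely on a file staying in place while it is being used.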
Types of SRMs
• Types of storage resource managers
  • Disk Resource Manager (DRM)
    • Manages one or more disk resources
  • Tape Resource Manager (TRM)
    • Manages access to a tertiary storage system (e.g. HPSS)
  • Hierarchical Resource Manager (HRM = TRM + DRM)
    • An SRM that stages files from tertiary storage into its disk cache
• SRMs and file transfers
  • SRMs DO NOT perform file transfer
  • SRMs DO invoke file transfer service if needed (GridFTP, FTP, HTTP, …)
  • SRMs DO monitor transfers and recover from failures
    • TRM: from/to MSS
    • DRM: from/to network
Uniformity of Interface
Compatibility of SRMs
[Figure: layered diagram. Client users/applications sit on top of grid middleware, which talks to a row of SRMs. The storage systems shown under the SRMs include Enstore, JASMine, DCache, CASTOR, and a disk cache.]
SRMs use in STAR for Robust Multi-file Replication
[Figure: replication of files from BNL to LBNL. An HRM-Client command-line interface, which can run anywhere, sends HRM-COPY (thousands of files) to the HRM at LBNL (performs writes). The list of files is obtained, and SRM-GET (one file at a time) is sent to the HRM at BNL (performs reads), which stages files into its disk cache. The network transfer uses GridFTP GET (pull mode) into the LBNL disk cache, from which files are archived.]
• Recovers from staging failures
• Recovers from file transfer failures
• Recovers from archiving failures
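Recovery from transient staging, transfer, and archiving failures can be pictured as a retry loop around each step. The sketch below assumes a hypothetical `TransientError` exception and an exponential backoff schedule; neither is taken from the HRM code, and the slides do not say which retry policy the HRMs use.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def with_retries(action, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and report the failure to the caller
            # Wait 1x, 2x, 4x, ... base_delay before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Any other exception type propagates immediately, so only failures marked as transient are retried.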
Detailed sequence of actions for each file being replicated
[Figure: sequence diagram with steps numbered 1 to 12, involving the HRM-Client command-line interface (anywhere), the LBNL HRM (performs writes) with its disk cache, the BNL HRM (performs reads) with its disk cache, and the web-based file monitoring tool. The client gets the list of files from the directory and requests files with
srmCopy {(sourceURL=hpss.bnl.gov/xyz/file_x, targetURL=hpss.lbnl.gov/uvw/file_y)}
Step labels, with the step number where it is clearly attached: allocate space; srmGet (sourceURL) (2); allocate space; stage file; file staged (BNL’s diskURL) (5); GridFTP GET (pull mode) (6); transfer complete (7); release space; call_back: file on disk; archive file; release space; call_back: file on tape.]
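One plausible reading of the per-file sequence in the diagram above is sketched here. The `ToyHRM` class, the action names, and the exact ordering of the steps are assumptions made for illustration; the real HRMs exchange these requests over the grid, and the two call-back steps are omitted for brevity.

```python
class ToyHRM:
    """Stand-in for an HRM that only records what it is asked to do."""

    def __init__(self, site, log):
        self.site = site
        self.log = log

    def do(self, action, url):
        self.log.append((self.site, action, url))


def replicate_one(source_url, target_url, source_hrm, target_hrm):
    """Replicate one file; the step order is an assumed reading of the diagram."""
    target_hrm.do("allocate_space", target_url)  # reserve disk for the incoming file
    source_hrm.do("srm_get", source_url)         # ask the source HRM for the file
    source_hrm.do("allocate_space", source_url)  # reserve disk at the source
    source_hrm.do("stage_file", source_url)      # tape -> source disk cache
    target_hrm.do("gridftp_get", source_url)     # pull the file over the network
    source_hrm.do("release_space", source_url)   # transfer complete at the source
    target_hrm.do("archive_file", target_url)    # target disk cache -> tape
    target_hrm.do("release_space", target_url)   # file is safely on tape
```

The point of the sketch is the division of labour: the target side drives the copy and pulls the file, while each HRM manages only its own disk cache and mass storage system.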
Web-Based File Monitoring Tool
Shows:
- Files already transferred
- Files during transfer
- Files to be transferred
Also shows for each file:
- Source URL
- Target URL
- Transfer rate
Tracking multi-file replication performance
[Figure: plot of process state against time (time axis from 20020103123100 to 20020103123800) for ten files: set287_07_10evts_h_dst.xdf.STAR.DB, set195_02_2evts_dst.xdf.STAR.DB, set162_01_28evts_dst.xdf.STAR.DB, set195_01_33evts_dst.xdf.STAR.DB, set193_01_17evts_h_dst.xdf.STAR.DB, set165_01_31evts_dst.xdf.STAR.DB, set165_02_30evts_dst.xdf.STAR.DB, set163_02_24evts_dst.xdf.STAR.DB, set163_01_32evts_dst.xdf.STAR.DB, set192_01_27evts_dst.xdf.STAR.DB. States on the process axis: File replication request start, Staging_requested_at_BNL, Staging_started_at BNL, Staging_finished_at_BNL, Transfered_to_PDSF_from_BNL, Migration_Requested, Migration_Finished, Notified_Client, FILE_REQUEST_FAILED.]
Helped discover hard-to-find bug
File tracking helps to identify bottlenecks
Shows that archiving is the bottleneck
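A bottleneck of this kind can be found by computing how long each file spends in each stage. The sketch below assumes a hypothetical event list of (file name, state, timestamp) tuples; the state names and the YYYYMMDDHHMMSS timestamp format follow the tracking plot, but the actual log format of the tool is not given in the slides, and the grouping of states into three stages is an illustrative choice.

```python
from collections import defaultdict
from datetime import datetime

# (stage name, state that starts it, state that ends it); grouping is an assumption.
STAGES = [
    ("staging", "Staging_requested_at_BNL", "Staging_finished_at_BNL"),
    ("transfer", "Staging_finished_at_BNL", "Transfered_to_PDSF_from_BNL"),
    ("archiving", "Migration_Requested", "Migration_Finished"),
]


def parse_ts(ts):
    """Parse a timestamp such as 20020103123100."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def mean_stage_seconds(events):
    """Return the mean duration in seconds of each stage over all files.

    events: iterable of (file_name, state, timestamp_string) tuples.
    """
    seen = defaultdict(dict)
    for name, state, ts in events:
        seen[name][state] = parse_ts(ts)
    totals, counts = defaultdict(float), defaultdict(int)
    for states in seen.values():
        for stage, start, end in STAGES:
            # Skip stages a file has not completed yet.
            if start in states and end in states:
                totals[stage] += (states[end] - states[start]).total_seconds()
                counts[stage] += 1
    return {stage: totals[stage] / counts[stage] for stage in totals}
```

Taking `max(result, key=result.get)` on the returned dictionary names the slowest stage, which is the kind of conclusion the tracking plot supports visually.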
File tracking shows recovery from transient failures
Total: 45 GB
File tracking shows network slowdown and recovery
Total: 53 GB
Conclusion: Key advantages of using SRMs for file replication
• All HRM communications are part of HRM functionality
  • No changes required to HRMs
• Can replicate files from multiple sites
  • In a single command to one target
• Recovers from transient failures
  • For staging and archiving from MSS
  • For network
• Uses disk buffers to keep multiple files
  • Pre-stage in case of slow network
  • Hold files in case of slow archiving
• Concurrent transfers
  • Concurrent staging, concurrent archiving from/to MSS
  • Concurrent transfers over the network
  • Concurrency limited by parameter setup
• Automatic cleanup of buffers (garbage collection)
• Can replicate files between different MSSs (Enstore, Jasmine, HPSS, Castor, …)
• On-line monitoring, summary generated
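The idea of concurrent transfers with a configurable limit can be sketched with a bounded worker pool. The function `replicate_all` and its `max_concurrent` parameter are hypothetical names for this example; the slides do not describe how the HRMs expose their concurrency setting.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def replicate_all(file_names, replicate_one, max_concurrent=4):
    """Run replicate_one(name) for every file, at most max_concurrent at a time.

    Returns two sorted lists: files that succeeded and files that failed.
    """
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {pool.submit(replicate_one, name): name for name in file_names}
        for future in as_completed(futures):
            name = futures[future]
            if future.exception() is None:
                done.append(name)
            else:
                failed.append(name)  # one bad file does not stop the others
    return sorted(done), sorted(failed)
```

Keeping the limit as a parameter mirrors the slide's point that concurrency is bounded by setup, for example to match the number of tape drives or the capacity of the network link.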
Final note
BNL–LBNL file replication for STAR has been in production for 9 months now (nearly daily use to replicate 1000s of files per day)
More on SRMs: Thursday, at 1:30 pm (Category 3)
http://sdm.lbl.gov/srm