
Wide Area Network Access to CMS Data Using the Lustre Filesystem

J. L. Rodriguez†, P. Avery*, T. Brody†, D. Bourilkov*, Y. Fu*, B. Kim*, C. Prescott*, Y. Wu*
†Florida International University (FIU), *University of Florida (UF)

Network: Florida Lambda Rail (FLR)
– FIU: Servers connected to the FLR via a dedicated Campus Research Network (CRN) at 1 Gbps; however, local hardware issues limit FIU’s actual bandwidth to ~600 Mbps
– UF: Servers connected to the FLR via their own dedicated CRN at 2x10 Gbps
– Flatech: Servers connected to the FLR at 1 Gbps
– Server TCP buffers set to a maximum of 16 MB (see the check sketched below)
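To make the TCP tuning above concrete, the short check below reads the standard Linux socket-buffer limits and reports whether each allows the 16 MB maximum used on the testbed servers. It is a minimal sketch: the sysctl keys are generic Linux ones, and the poster does not record the exact tuning commands used.

```python
# Verify that kernel socket-buffer limits permit the 16 MB TCP buffers
# described above. Minimal sketch for a generic Linux host; these are
# standard sysctl entries, not commands quoted from the poster.

TARGET_BYTES = 16 * 1024 * 1024  # 16 MB

SYSCTLS = {
    "net.core.rmem_max": "/proc/sys/net/core/rmem_max",
    "net.core.wmem_max": "/proc/sys/net/core/wmem_max",
    "net.ipv4.tcp_rmem": "/proc/sys/net/ipv4/tcp_rmem",  # "min default max"
    "net.ipv4.tcp_wmem": "/proc/sys/net/ipv4/tcp_wmem",
}

for name, path in SYSCTLS.items():
    with open(path) as f:
        max_bytes = int(f.read().split()[-1])  # last field is the maximum
    status = "OK" if max_bytes >= TARGET_BYTES else "below the 16 MB target"
    print(f"{name:20s} max = {max_bytes:>10d} bytes  ({status})")
```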

Lustre Fileserver at UF-HPC/Tier2 Center: Gainesville, FL
– Storage subsystem: six RAID Inc. Falcon III shelves, each with a redundant dual-port 4 Gbit FC RAID controller and 24x750 GB HDs, for 104 TB of raw storage
– Attached to: two dual quad-core Barcelona Opteron 2350 servers with 16 GB RAM, three FC cards, and 1x10 GigE Chelsio NIC
– Storage system clocked at greater than 1 GBps via TCP/IP large-block I/O

FIU Lustre Clients: Miami, FL
– CMS analysis server: medianoche.hep.fiu.edu, dual 4-core Intel X5355 with 16 GB RAM, dual 1 GigE
– FIU fileserver: fs1.local, dual 2-core Intel Xeon with 16 GB RAM, 3ware 9000-series RAID controller, NFS ver. 3.x, RAID 5 (7+1) with 16 TB raw disk
– OSG gatekeeper: dgt.hep.fiu.edu, dual 2-core Xeon with 2 GB RAM, single GigE; used in Lustre tests, experimented with NAT (it works, but was not tested further)
– System configuration: Lustre-patched kernel-2.6.9-55.EL_lustre1.6.4.2; both systems mount UF-HPC’s Lustre filesystem on a local mount point (see the mount sketch below)
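As an illustration of the client-side setup, here is a minimal sketch of mounting the remote filesystem. The mount source and mount point are the ones quoted in the IOzone plot title (uf-hpc@tcp1:/crn on /fiulocal/crn); the Python wrapper and the lack of error handling are illustrative, not the exact procedure used on the testbed.

```python
# Sketch of mounting the UF-HPC Lustre filesystem on an FIU client.
# The mount spec and mount point match those quoted on the poster;
# everything else (root privileges assumed, no retries) is illustrative.
import os
import subprocess

MGS_SPEC = "uf-hpc@tcp1:/crn"   # MGS NID and filesystem, as shown in the plot title
MOUNT_POINT = "/fiulocal/crn"   # local mount point on the FIU client

os.makedirs(MOUNT_POINT, exist_ok=True)

# Requires root and a Lustre-capable kernel (patched kernel or client modules).
subprocess.run(["mount", "-t", "lustre", MGS_SPEC, MOUNT_POINT], check=True)
print(f"Mounted {MGS_SPEC} on {MOUNT_POINT}")
```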

Flatech Lustre Client: Melbourne, FL
– CMS server: flatech-grid3.fit.edu, dual 4-core Intel E5410 with 8 GB RAM, GigE
– System configuration: unpatched SL4 kernel; Lustre enabled via runtime kernel modules

Site Configuration and Security
– All sites share common UID/GID domains
– Mount access is restricted to specific IPs via firewall (illustrated in the sketch below)
– ACLs and root_squash security features are not currently implemented in the testbed
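The sketch below illustrates the kind of IP-based mount restriction described above, using iptables rules on Lustre’s default LNET TCP port (988). The client addresses are placeholder documentation IPs and the poster does not specify the actual firewall rules, so treat this purely as an assumption-laden example.

```python
# Illustrative IP-based access control for Lustre traffic (LNET's default
# TCP port is 988). The client addresses are placeholders (RFC 5737 ranges),
# not the testbed's real IPs, and the rule set is a simplified assumption.
import subprocess

LNET_TCP_PORT = "988"
ALLOWED_CLIENTS = ["192.0.2.10", "198.51.100.20"]  # hypothetical FIU / Flatech hosts

rules = [
    ["iptables", "-A", "INPUT", "-p", "tcp", "--dport", LNET_TCP_PORT,
     "-s", client, "-j", "ACCEPT"]
    for client in ALLOWED_CLIENTS
]
# Drop Lustre traffic from everyone else
rules.append(["iptables", "-A", "INPUT", "-p", "tcp",
              "--dport", LNET_TCP_PORT, "-j", "DROP"])

for rule in rules:
    subprocess.run(rule, check=True)
```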

The Florida State Wide Lustre Testbed

Computing facilities in the distributed computing model for the CMS experiment at CERN: in the US, Tier2 sites are medium-size facilities with approximately 10^6 kSI2K of computing power and 200 TB of disk storage. These facilities are centrally managed, each with dedicated computing resources and manpower. Tier3 sites, on the other hand, range in size from a single interactive analysis computer or small cluster to large facilities that rival the Tier2s in resources. Tier3s are usually found at universities in close proximity to CMS researchers.

[Diagram: the CMS tiered computing model and the Florida testbed. The CMS Experiment online system feeds the CERN Computer Center (Tier 0) at 200-1500 MB/s; Tier 1 centers (FermiLab, Korea, Russia, UK) connect at 10 Gb/s; Tier 2 centers (U Florida, Caltech, UCSD) at 10-40 Gb/s; and Tier 3 sites, down to desktop or laptop PCs and Macs, at 1.0 Gb/s. The experiment involves roughly 3000 physicists in 60 countries, with tens of petabytes per year expected by 2010 and a CERN/outside ratio of 10-20%. A companion map shows the Florida sites (UF HPC/Tier2, FIU Tier3, FlaTech Tier3, FSU) connected to the OSG over the 10 Gbps Florida Lambda Rail (FLR).]

Introduction

We explore the use of the Lustre cluster filesystem over the WAN to access CMS (Compact Muon Solenoid) data stored on a storage system located hundreds of miles away. The Florida State Wide Lustre Testbed consists of two client sites located at CMS Tier3s, one in Miami, FL and one in Melbourne, FL, and a Lustre storage system located in Gainesville at the University of Florida’s HPC Center. In this paper we report on I/O rates between sites, using both the CMS application suite CMSSW and the I/O benchmark tool IOzone. We describe our configuration, outlining the procedures implemented, and conclude with suggestions on the feasibility of implementing distributed Lustre storage to facilitate CMS data access for users at remote Tier3 sites.

Lustre is a POSIX-compliant, network-aware, highly scalable, robust, and reliable cluster filesystem developed by Sun Microsystems Inc. The system can run over several different types of networking infrastructure, including Ethernet, InfiniBand, Myrinet, and others. It can be configured with redundant components to eliminate single points of failure. It has been tested with tens of thousands of nodes, providing petabytes of storage, and can move data at hundreds of GB/s. The system employs state-of-the-art security features, with GSS- and Kerberos-based security planned for future releases. The system is available as open source under the GNU General Public License. Lustre is deployed on a broad array of computing facilities, both large and small; commercial and public organizations, including some of the largest supercomputing centers in the world, currently use Lustre as their distributed filesystem.

IO Performance with CMSSW: FIU to UF

IO Performance with the IOzone benchmark tool: FIU to UF

[Plot: Lustre I/O performance, FIU to UF, mounted over the WAN (uf-hpc@tcp1:/crn on /fiulocal/crn). Sequential write, sequential read, random write, and random read rates in MBps are plotted as a function of record length from 64 KB to 16384 KB.]

The IOzone benchmark tool was used to establish the maximum possible I/O performance of Lustre over the WAN between FIU and UF and between Flatech and UF. Here we report on results between FIU and UF only.
– The Lustre filesystem at UF-HPC was mounted on a local mount point on medianoche.hep.fiu.edu, located in Miami
– File sizes were set to 2x RAM to avoid caching effects
– Measurements were made as a function of record length
– Checked in multi-processor mode: 1 through 8 concurrent processes
– Checked against dd read/write rates
– All tests were consistent with the IOzone results shown

With large-block IO we can saturate the network link between UF and FIU using the standard IO benchmark tool IOzone. An illustrative IOzone invocation is sketched below.
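As a concrete illustration of the measurements above, the snippet below assembles IOzone runs over the record lengths shown in the plot, with the file size set to twice the client’s 16 GB of RAM. The flag choices and temporary-file names are assumptions; the poster does not record the exact command lines used.

```python
# Illustrative IOzone sweep on the WAN-mounted Lustre filesystem.
# Assumes iozone is installed and /fiulocal/crn is the Lustre mount point;
# file size is 2x the 16 GB of RAM on medianoche to defeat caching.
import subprocess

MOUNT_POINT = "/fiulocal/crn"
FILE_SIZE = "32g"                    # 2x RAM
RECORD_LENGTHS_KB = [64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384]
PROCESSES = 1                        # the tests repeated this with 1 through 8

for reclen in RECORD_LENGTHS_KB:
    files = [f"{MOUNT_POINT}/iozone.tmp.{i}" for i in range(PROCESSES)]
    cmd = [
        "iozone",
        "-i", "0", "-i", "1", "-i", "2",  # write/rewrite, read/reread, random read/write
        "-r", f"{reclen}k",               # record length
        "-s", FILE_SIZE,                  # per-process file size
        "-t", str(PROCESSES),             # concurrent processes (throughput mode)
        "-F", *files,                     # one temp file per process
    ]
    subprocess.run(cmd, check=True)
```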

Using the CMSSW application, we tested the IO performance of the testbed between the FIU Tier3 and the UF-HPC Lustre storage. An IO-bound CMSSW application was used for the tests; its main function was to skim objects from data collected during the Cosmic Run at Four Tesla (CRAFT) in the Fall of 2008. The application is the same as that used by the Florida cosmic analysis group. The output data file was redirected to /dev/null; a configuration of this kind is sketched below.
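The sketch below shows the general shape of such an IO-bound CMSSW skim configuration, reading input from the WAN-mounted Lustre path and writing the output to /dev/null. The input file name, process label, and selection details are hypothetical; only the Lustre mount point and the /dev/null redirection come from the poster.

```python
# Minimal sketch of an IO-bound CMSSW job reading CRAFT-style data from the
# WAN-mounted Lustre filesystem. File names and the process label are
# hypothetical; only the mount point and /dev/null output are from the poster.
import FWCore.ParameterSet.Config as cms

process = cms.Process("SKIM")

process.source = cms.Source(
    "PoolSource",
    # Hypothetical input file on the Lustre mount
    fileNames=cms.untracked.vstring("file:/fiulocal/crn/craft/cosmics.root"),
)

process.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(-1))

# Send the skimmed output to /dev/null so the job is dominated by read I/O
process.out = cms.OutputModule(
    "PoolOutputModule",
    fileName=cms.untracked.string("/dev/null"),
)

process.ep = cms.EndPath(process.out)
```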

We report on aggregate and average read I/O rates:
– Aggregate IO rate is the total IO rate per node vs. the number of jobs concurrently running on a single node
– Average IO rate is the per-process IO rate per job vs. the number of jobs concurrently running on a single node
The sketch below shows how these two quantities are computed from per-job rates.
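To make the two quantities concrete, this small example computes the aggregate (per-node total) and average (per-process) read rates from a set of per-job measurements; the numbers are invented placeholders, not the testbed’s measured values.

```python
# Compute the two plotted quantities from per-job read rates on one node:
# aggregate rate = sum over jobs, average rate per process = mean over jobs.
# The rates below are invented placeholders, not measured testbed values.

per_job_rates_mbps = {
    # jobs per node -> per-job read rates [MBps]
    1: [11.8],
    2: [10.9, 11.2],
    4: [9.8, 10.1, 10.4, 9.9],
    8: [8.7, 9.0, 8.9, 9.2, 8.8, 9.1, 8.6, 9.0],
}

for njobs, rates in sorted(per_job_rates_mbps.items()):
    aggregate = sum(rates)              # total IO rate per node
    average = aggregate / len(rates)    # average IO rate per process
    print(f"{njobs} jobs/node: aggregate = {aggregate:5.1f} MBps, "
          f"average per process = {average:4.1f} MBps")
```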

We compare IO rates between Lustre, NFS, and local disk:
– NFS: fileserver with 16 TB on a 3ware 9000 controller, over NFS ver. 3
– Local: single 750 GB SATA II hard drive

Observations:
– For the NFS and Lustre filesystems, the IO rates scale linearly with the number of jobs; this is not so with local disk
– Average IO rates remain relatively constant as a function of jobs per node for the distributed filesystems
– The Lustre IO rates are significantly lower than seen with IOzone, and lower than obtained with NFS

We are now investigating the cause of the discrepancy between the Lustre CMSSW IO rates and the rates observed with IOzone.

Summary and Conclusion

Summary:
– Lustre is very easy to deploy, particularly so as a client installation
– Direct I/O operations show that the Lustre filesystem mounted over the WAN works reliably and with a high degree of performance. We have demonstrated that we can easily saturate a 1 Gbps link with I/O-bound applications
– CMSSW remote data access was observed to be slower than expected, both when compared to IO rates from IO benchmarks and when compared to other distributed filesystems
– We have demonstrated that the CMSSW application can access data located hundreds of miles away with the Lustre filesystem. Data can be accessed this way seamlessly, reliably, and with a reasonable degree of performance, even with all components “out of the box”
– Lustre clients are easy to deploy; mounts are easy to establish and are reliable and robust
– Security was established by restricting IPs and sharing UID/GID domains between all sites

Conclusion:

The Florida State Wide Lustre Testbed demonstrates an alternative method for accessing data stored at dedicated CMS computing facilities. This method has the potential to greatly simplify access to data sets, large, medium, or small, for remote experimenters with limited local computing resources.

Lustre version: 1.6.7

Caption for the IOzone plot above: IO performance of the testbed between FIU and UF. The plot shows sequential and random read/write performance, in MBytes per second, measured with IOzone as a function of record length.

[Plot: CMSSW aggregate IO rates vs. number of jobs per node. Total rate [MBps] for 1, 2, 4, and 8 jobs/node, comparing NFS over LAN, Lustre over WAN, and local disk.]

[Plot: CMSSW average IO rate per process vs. number of jobs per node. Mean I/O rate per process [MBps] for 1, 2, 4, and 8 jobs/node, comparing NFS over LAN, local disk, and Lustre over WAN.]