
Page 1:

BNL Service Challenge 3 Site Report

Xin Zhao, Zhenping Liu, Wensheng Deng,

Razvan Popescu, Dantong Yu and Bruce Gibbard

USATLAS Computing Facility

Brookhaven National Lab

Page 2:

Services at BNL

FTS (version 2.3.1) client + server and its backend Oracle and MyProxy servers. FTS does the job of reliable file transfer from CERN to BNL. Most functionalities were implemented. It became reliable in controlling data transfer after several rounds of redeployments for bug fixing: a short timeout value caused excessive failures, and there was an incompatibility with dCache/SRM.

It does not support DIRECT data transfer between CERN and the BNL dCache data pool servers (dCache SRM third-party data transfer). Data transfers actually go through a few dCache GridFTP door nodes at BNL, which presents a scalability issue. We had to move these door nodes to non-blocking network ports to distribute the traffic.

Both BNL and RAL discovered that the number of streams per file could not be more than 10 (a bug?).

Networking to CERN: the network for dCache was upgraded to 2*1 Gbps around June. It is a shared link with a long round-trip time: >140 ms, while the RTT from European sites to CERN is about 20 ms. Occasional packet losses were discovered along the BNL-CERN path. 1.5 Gbps aggregated bandwidth was observed by iperf with 160 TCP streams.
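
A back-of-the-envelope bandwidth-delay-product estimate shows why so many parallel TCP streams are needed on this long-RTT path. This is a rough sketch only; the 64 KB per-stream window is an assumed untuned default, not a measured BNL setting.

```python
# Rough estimate of how many parallel TCP streams are needed to fill the
# CERN-BNL pipe, given the long round-trip time quoted above.
RTT_S = 0.140                 # >140 ms CERN-BNL round-trip time
LINK_BPS = 2 * 1e9            # 2 * 1 Gbps dCache network uplink
WINDOW_BYTES = 64 * 1024      # assumed (untuned) per-stream TCP window

per_stream_bps = WINDOW_BYTES * 8 / RTT_S      # ~3.7 Mbit/s per stream
streams_needed = LINK_BPS / per_stream_bps     # ~530 streams

print(f"per-stream throughput ~ {per_stream_bps / 1e6:.1f} Mbit/s")
print(f"streams to fill the link ~ {streams_needed:.0f}")
# With tuned (larger) TCP windows, far fewer streams are needed, which is
# consistent with ~1.5 Gbps being reached by iperf with 160 streams.
```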

Page 3:

Services at BNL

dCache/SRM (V1.6.5-2, with the SRM 1.1 interface; a total of 332 nodes (3.06 GHz, 2 GB memory and 3 SCSI 140 GB drives each) with about 170 TB of disk; multiple GridFTP, SRM, and dCap doors): the USATLAS production dCache system. All nodes run Scientific Linux 3 with the XFS module compiled. We experienced high load on the write pool servers during large data transfers; this was fixed by replacing the EXT file systems with XFS. The core server crashed once; the reason was identified and fixed. Buffer space for data written into the dCache system is small (1.0 TB). dCache can now deliver up to 200 MB/second for input/output (limited by network speed).
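
The 1.0 TB write buffer is small relative to these rates; a quick illustrative calculation of how fast it fills at the quoted peak throughput:

```python
# How long the 1.0 TB dCache write buffer lasts at the quoted peak rate,
# assuming (purely for illustration) no migration to HPSS in the meantime.
BUFFER_MB = 1.0 * 1e6     # 1.0 TB in MB (decimal units)
RATE_MB_S = 200           # peak dCache input/output rate from the slide

hours_to_fill = BUFFER_MB / RATE_MB_S / 3600
print(f"buffer fills in ~{hours_to_fill:.1f} hours at {RATE_MB_S} MB/s")
# ~1.4 hours: sustained SC3-style writes must be drained to tape promptly.
```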

LFC (1.3.4) client and server were installed at BNL as the replica catalog server. The server was installed and the basic functionalities were tested: lfc-ls, lfc-mkdir, etc. We will populate LFC with the entries from our production Globus RLS server.
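
A minimal smoke test of the kind described above might look like the sketch below; the LFC host name and directory path are placeholders, not the real BNL values.

```python
# Minimal LFC smoke test: create a directory and list its parent using the
# lfc-* command-line tools mentioned above. Host and path are placeholders.
import os
import subprocess

os.environ["LFC_HOST"] = "lfc.example.bnl.gov"   # hypothetical catalog host

def run(cmd):
    """Run an LFC command-line tool and report success or failure."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    status = "OK" if result.returncode == 0 else f"FAILED ({result.returncode})"
    print(f"{' '.join(cmd)}: {status}")
    if result.stdout:
        print(result.stdout.rstrip())

run(["lfc-mkdir", "/grid/atlas/sc3test"])   # basic write test
run(["lfc-ls", "-l", "/grid/atlas"])        # basic read test
```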

An ATLAS VO Box (DDM + LCG VO box) was deployed at BNL.

Page 4:

BNL dCache Configuration

[Diagram: the dCache system, with dCap, SRM and GridFTP doors in front of separate read and write pools managed by the Pnfs Manager and Pool Manager; dCap, SRM and GridFTP clients and a batch system (labelled "Oak Ridge Batch system" in the figure) connect over control and data channels, with HPSS as the tape back end.]

Page 5:

CERN Storage System

Page 6:

Data Transfer from CERN to BNL (ATLAS Tier 1)

Page 7:

Transfer Plots

Castor2 LSF plugin problem

Page 8:

BNL SC3 data transfer

All data are actually routed through the GridFTP doors.

SC3 transfers as monitored at BNL.

Page 9:

Data Transfer Status

BNL stabilized FTS data transfer with a high successful-completion rate, as shown in the left image.

We have attained a 150 MB/second rate for about one hour with a large number (>50) of parallel file transfers. CERN FTS had a limit of 50 files per channel, which is not enough to fill up the CERN-BNL data channel.
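
A rough way to see why the 50-file cap matters, using only the figures quoted above and the 200 MB/second target quoted later in the talk (illustrative arithmetic, assuming the per-file rate stays roughly constant):

```python
# Estimate the number of concurrent FTS file transfers needed to fill the
# CERN-BNL channel, assuming the per-file rate stays roughly constant.
observed_rate_mb_s = 150.0    # attained with >50 parallel file transfers
concurrent_files = 50         # CERN FTS per-channel file limit
target_rate_mb_s = 200.0      # SC4 target rate quoted later in the talk

per_file_mb_s = observed_rate_mb_s / concurrent_files    # ~3 MB/s per file
files_needed = target_rate_mb_s / per_file_mb_s          # ~67 files

print(f"~{per_file_mb_s:.1f} MB/s per file transfer")
print(f"~{files_needed:.0f} concurrent files needed for {target_rate_mb_s:.0f} MB/s")
# More than the 50-file channel limit, so the limit caps the channel throughput.
```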

Page 10:

Final Data Transfer Reports

Page 11:

Lessons Learned From SC2

Four file transfer servers with a 1 Gigabit WAN network connection to CERN.

Met the performance/throughput challenges (70~80 MB/second disk to disk). Enabled data transfer between dCache/SRM and the CERN SRM at openlab.

Designed our own script to control SRM data transfer.

Enabled data transfer between BNL GridFTP servers and CERN openlab GridFTP servers, controlled by the Radiant software.

Many components need to be tuned: with a 250 ms RTT and a high packet-dropping rate, we have to use multiple TCP streams and multiple file transfers to fill up the network pipe.

Sluggish parallel file I/O with EXT2/EXT3: lots of processes in the "D" state, and the more file streams, the worse the file system performance.

Slight improvement with XFS. Still need to tune the file system parameters.
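
The "D"-state symptom mentioned above (writers stuck in uninterruptible I/O wait) is easy to spot on a pool node; a minimal sketch, assuming a Linux /proc filesystem:

```python
# Count processes in the "D" (uninterruptible sleep, usually disk I/O) state,
# the symptom observed on the EXT2/EXT3 write pools during heavy transfers.
import glob

def d_state_processes():
    stuck = []
    for stat_path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(stat_path) as f:
                data = f.read()
        except OSError:
            continue                                 # process exited mid-scan
        pid = data.split(" ", 1)[0]
        comm = data[data.index("(") + 1:data.rindex(")")]
        state = data.rsplit(")", 1)[1].split()[0]    # first field after comm
        if state == "D":
            stuck.append((pid, comm))
    return stuck

if __name__ == "__main__":
    stuck = d_state_processes()
    print(f"{len(stuck)} processes in D state")
    for pid, comm in stuck:
        print(f"  pid {pid}: {comm}")
```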

Page 12:

Some Issues

The Service Challenge also challenges resources:

Tuned network pipes, and optimized the configuration and performance of the BNL production dCache system and its associated OS and file systems.

Required more than one staff member's involvement to stabilize the newly deployed FTS, dCache and network infrastructure.

Staffing level decreased as services became stable.

Limited resources are shared by experiments and users. At CERN, the SC3 infrastructure is shared by multiple Tier 1 sites.

Due to the heterogeneous nature of Tier 1 sites, data transfer for each site should be optimized non-uniformly based on the site's characteristics: e.g. network RTT, packet loss rates, experiment requirements, etc.

At BNL, the network and dCache are also used by production users. We need to closely monitor SRM and the network to avoid impacting production activities.

At CERN, James Casey alone handles answering email, setting up the system, reporting problems and running data transfers; he provides 7/16 support by himself. How do we scale to 7/24 production support and a production center? How do we handle the time difference between the US and CERN? The CERN support phone was tried once, but the operator did not speak English.

Page 13:

What has been done

SC3 Tier 2 data transfer: data were transferred to three selected Tier 2 sites.

SC3 tape transfer: tape data transfer was stabilized at 60 MB/second with loaned tape resources.

Met the goal defined at the beginning of the Service Challenge.

The full chain of data transfer was exercised.

Page 14:

ATLAS SC3 Service Phase

Page 15:

ATLAS SC3 Service Phase goals

Exercise ATLAS data flow

Integration of the data flow with the ATLAS Production System

Tier-0 exercise

More information: https://uimon.cern.ch/twiki/bin/view/Atlas/DDMSc3

Page 16:

ATLAS-SC3 Tier-0

Quasi-RAW data are generated at CERN and reconstruction jobs run at CERN; no data are transferred from the pit to the computer centre.

"Raw data" and the reconstructed ESD and AOD data are replicated to Tier 1 sites using agents on the VO Boxes at each site.

Exercising use of CERN infrastructure … Castor 2, LSF

… and the LCG Grid middleware … FTS, LFC, VO Boxes

… and the Distributed Data Management (DDM) software

Page 17:

ATLAS Tier-0

[Diagram: ATLAS Tier-0 data flow from the Event Filter (EF) through CASTOR and the CPU farm, with RAW, ESD (2x), AOD and AODm (10x) replicated out to the Tier 1s.]

RAW: 1.6 GB/file, 0.2 Hz, 17K files/day, 320 MB/s, 27 TB/day
ESD: 0.5 GB/file, 0.2 Hz, 17K files/day, 100 MB/s, 8 TB/day
AOD: 10 MB/file, 2 Hz, 170K files/day, 20 MB/s, 1.6 TB/day
AODm: 500 MB/file, 0.04 Hz, 3.4K files/day, 20 MB/s, 1.6 TB/day

Aggregate flows shown in the diagram: 0.44 Hz, 37K files/day, 440 MB/s; 1 Hz, 85K files/day, 720 MB/s; 0.4 Hz, 190K files/day, 340 MB/s; 2.24 Hz, 170K files/day (temp), 20K files/day (perm), 140 MB/s.
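
The per-type rates in the table follow directly from file size × event rate; a quick check in decimal units (so the daily totals come out slightly above the rounded slide figures):

```python
# Reproduce the Tier-0 per-datatype rates from file size and frequency.
# Sizes in MB (decimal), frequencies in Hz, one day = 86400 s.
datatypes = {
    # name: (file size in MB, files per second)
    "RAW":  (1600.0, 0.2),
    "ESD":  (500.0,  0.2),
    "AOD":  (10.0,   2.0),
    "AODm": (500.0,  0.04),
}

DAY = 86400  # seconds

for name, (size_mb, hz) in datatypes.items():
    rate_mb_s = size_mb * hz
    files_day = hz * DAY
    tb_day = rate_mb_s * DAY / 1e6
    print(f"{name:5s}: {rate_mb_s:6.0f} MB/s, "
          f"{files_day / 1000:6.1f}K files/day, {tb_day:4.1f} TB/day")
# RAW  :    320 MB/s,   17.3K files/day, 27.6 TB/day  (slide: 320 MB/s, 17K, 27 TB)
# ESD  :    100 MB/s,   17.3K files/day,  8.6 TB/day  (slide: 100 MB/s, 17K,  8 TB)
# AOD  :     20 MB/s,  172.8K files/day,  1.7 TB/day  (slide:  20 MB/s, 170K, 1.6 TB)
# AODm :     20 MB/s,    3.5K files/day,  1.7 TB/day  (slide:  20 MB/s, 3.4K, 1.6 TB)
```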

Page 18:

ATLAS-SC3 Tier-0

The main goal is a 10% exercise: reconstruct "10%" of the number of events ATLAS will get in 2007, using "10%" of the full resources that will be needed at that time.

Tier-0: ~300 kSI2k. "EF" to CASTOR: 32 MB/s. Disk to tape: 44 MB/s (32 for raw and 12 for ESD+AOD). Disk to WN: 34 MB/s. T0 to each T1: 72 MB/s. 3.8 TB to "tape" per day.

Tier-1 (on average): ~8500 files per day, at a rate of ~72 MB/s.
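
These figures are internally consistent; a quick check of the daily tape volume and the implied average file size shipped to each Tier 1 (decimal units, illustrative only):

```python
# Consistency check of the 10%-exercise figures quoted above.
DAY = 86400  # seconds

disk_to_tape_mb_s = 44
tape_tb_per_day = disk_to_tape_mb_s * DAY / 1e6
print(f"tape volume: {tape_tb_per_day:.1f} TB/day")   # ~3.8 TB/day, as quoted

t0_to_t1_mb_s = 72
files_per_day = 8500
avg_file_mb = t0_to_t1_mb_s * DAY / files_per_day
print(f"average file size to each Tier 1: ~{avg_file_mb:.0f} MB")  # ~730 MB
```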

Page 19:

24 h before the 4-day intervention, 29/10 - 1/11.

We achieved quite a good rate in the testing phase: sustained 20-30 MB/s to three sites (PIC, BNL and CNAF).

ATLAS DDM Monitoring

Page 20:

Data Distribution

We used a generated "dataset" containing 6035 files (3 TB) and tried to replicate it to BNL, CNAF and PIC.

BNL data transfer is under way.

PIC: 3600 files copied and registered. 2195 'failed replication' after 5 retries by us x 3 FTS retries (problem under investigation); 205 'assigned', still waiting to be copied; 31 'validation failed' since the SE is down; 4 'no replicas found' due to an LFC connection error.

CNAF: 5932 files copied and registered. 89 'failed replication'; 14 'no replicas found'.
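
As a bookkeeping check, the per-site status counts add up to the 6035-file dataset:

```python
# Check that the per-site replication status counts add up to the dataset size.
DATASET_FILES = 6035

sites = {
    "PIC":  {"copied": 3600, "failed replication": 2195, "assigned": 205,
             "validation failed": 31, "no replicas found": 4},
    "CNAF": {"copied": 5932, "failed replication": 89, "no replicas found": 14},
}

for site, counts in sites.items():
    total = sum(counts.values())
    status = "OK" if total == DATASET_FILES else f"mismatch ({total})"
    print(f"{site}: {total} files accounted for -> {status}")
```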

Page 21:

General view of SC3

When everything is running smoothly, ATLAS gets good results.

The middleware (FTS) is stable but there were still lots of compatibility issues: FTS does not work with the new version of dCache/SRM (version 1.3).

ATLAS DDM software dependencies can also cause problems when sites upgrade middleware.

Not managed to exhaust anything (production s/w; LCG m/w).

Still far from concluding the exercise and not running stably in any way.

The exercise will continue, adding new sites.

Page 22:

BNL Service Challenge 4 Plan

Several steps are needed to set up each piece of hardware or service (e.g. choose, procure, start install, end install, make operational): LAN, tape system, computing farm, disk storage, dCache/SRM, FTS, LFC, DDM.

Continue to maintain and support the services with the defined SLA (Service Level Agreement).

December 2005: begin installation of the expanded LAN and the new tape system, and make the new installation operational.

January 2006: begin data transfer with the newly upgraded infrastructure; the target rate is 200 MB/second; deploy all required baseline software.

Page 23:

BNL Service Challenge 4 Plan

April 2006: establish stable data transfer at 200 MB/second to disk and 200 MB/second to tape.

May 2006: disk and computing farm upgrades.

June 1, 2006: stable data transfer driven by the ATLAS production system and ATLAS data management infrastructure between T0 and T1 (200 MB/second), and provide services to satisfy the SLA (Service Level Agreement).

Details of involving the Tier 2 sites are in planning too.