September 29, 1999: LCB Workshop Session on Distributed Computing & Regional Centres, Harvey Newman (CIT)
Third LCB Workshop
Distributed Computing and Regional Centres Session
Harvey B. Newman (CIT), Marseilles, September 29, 1999
http://l3www.cern.ch/~newman/marseillessep29.ppt
http://l3www.cern.ch/~newman/marseillessep29/index.htm
LHC Computing: Different from Previous Experiment Generations
- Geographical dispersion: of people and resources
- Complexity: the detector and the LHC environment
- Scale: Petabytes per year of data
1800 physicists, 150 institutes, 32 countries
Major challenges associated with:
- Coordinated use of distributed computing resources
- Remote software development and physics analysis
- Communication and collaboration at a distance
R&D: New forms of distributed systems
HEP Bandwidth Needs & Price Evolution
HEP GROWTH, 1989-1999:
- A factor of one to several hundred on principal transoceanic links
- A factor of up to 1000 in domestic academic and research nets
HEP NEEDS, 1999-2006: continued study by ICFA-SCIC;
1998 results of ICFA-NTF show a factor of one to several hundred (2X per year)
COSTS (to vendors):
- Optical fibers and WDM: a factor > 2/year reduction now?
- Limits of transmission speed, electronics, protocol speed
PRICE to HEP?
- Complex market, but increased budget likely to be needed
- Reference BW/price evolution: ~1.5 times/year
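The gap between the two growth rates quoted on this slide compounds quickly. A minimal sketch, using only the slide's figures (~1.5x/year affordable bandwidth per unit cost, ~2x/year demand growth) and an arbitrary 1999 baseline of 1 unit:

```python
# Compound the slide's reference bandwidth/price evolution (~1.5x per year)
# against HEP's observed traffic growth (~2x per year). Both rates are the
# slide's figures; the 1999 baseline of 1.0 is an arbitrary unit.

def compound(base, factor, years):
    """Capacity (or demand) after `years` of annual growth by `factor`."""
    return base * factor ** years

years = 7  # 1999 -> 2006
affordable = compound(1.0, 1.5, years)  # bandwidth per unit cost
needed = compound(1.0, 2.0, years)      # HEP needs at 2x/year

print(f"Affordable bandwidth grows ~{affordable:.0f}x by 2006")
print(f"Projected need grows ~{needed:.0f}x by 2006")
print(f"Shortfall at a flat budget: ~{needed / affordable:.1f}x")
```

At a flat budget the affordable capacity grows roughly 17x while the projected need grows ~128x, which is the arithmetic behind "increased budget likely to be needed".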
Cost Evolution: CMS 1996 Versus 1999 Technology Tracking Team
Compare CMS 1996 estimates to 1999 Technology Tracking Team projections for 2005:
- CPU: unit cost will be close to early prediction
- Disk: will be more expensive (by ~2) than early prediction
- Tape: currently zero to 10% annual cost decrease (potential problem)
LHC (and HENP) Computing and Software Challenges
- Software: modern languages, methods and tools; the key to manage complexity
- FORTRAN: the end of an era; OBJECTS: a coming of age
- "TRANSPARENT" access to data: location and storage medium independence
- Data Grids: a new generation of data-intensive, network-distributed systems for analysis
  - A deep, heterogeneous client/server hierarchy of up to 5 levels
  - An ensemble of tape and disk mass stores
  - LHC: object database federations
- Interaction of the software and data handling architectures: the emergence of new classes of operating systems
Four Experiments: The Petabyte to Exabyte Challenge
ATLAS, CMS, ALICE, LHCb
Higgs and new particles; quark-gluon plasma; CP violation
Data written to tape: ~5 Petabytes/year and up (1 PB = 10^15 bytes)
0.1 to 1 Exabyte (1 EB = 10^18 bytes) (~2010) (~2020?) total for the LHC experiments
To Solve: the HENP "Data Problem"
- While the proposed future computing and data handling facilities are large by present-day standards, they will not support FREE access, transport or reconstruction for more than a minute portion of the data.
- Need effective global strategies to handle and prioritise requests, based on both policies and marginal utility
- Strategies must be studied and prototyped, to ensure viability: acceptable turnaround times; efficient resource utilization
Problems to be explored; how to:
- Meet the demands of hundreds of users who need transparent access to local and remote data, in disk caches and tape stores
- Prioritise hundreds to thousands of requests from local and remote communities
- Ensure that the system is dimensioned "optimally", for the aggregate demand
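One way to picture "prioritise requests based on both policies and marginal utility" is a scored request queue. The scoring function below is invented purely for illustration (MONARC specified no such algorithm); it gives each request a policy weight and discounts users who have already been served a large volume:

```python
import heapq
import math

# Toy sketch only: order data-access requests by policy priority times a
# crude "marginal utility" term with diminishing returns as a requester's
# already-served volume grows. Both the score and the example numbers are
# hypothetical illustrations, not a MONARC design.

def score(policy_priority, gb_requested, gb_already_served):
    # Higher policy priority and higher marginal benefit -> served earlier.
    marginal_utility = math.log1p(gb_requested) / (1.0 + gb_already_served)
    return policy_priority * marginal_utility

def schedule(requests):
    """requests: list of (user, policy_priority, gb_requested, gb_served)."""
    heap = [(-score(p, req, served), user) for user, p, req, served in requests]
    heapq.heapify(heap)
    return [user for _, user in [heapq.heappop(heap) for _ in range(len(heap))]]

order = schedule([
    ("physics-group-A", 3.0, 500, 0),     # production pass, nothing served yet
    ("student-B",       1.0, 50, 0),      # small individual analysis
    ("physics-group-C", 3.0, 500, 2000),  # already consumed 2 TB today
])
print(order)  # group C drops to the back despite its high policy priority
```

The point of the sketch is the shape of the problem: a purely policy-based ordering would always serve group C before the student, while the marginal-utility discount lets small, fresh requests through.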
MONARC
Models Of Networked Analysis At Regional Centres
Caltech, CERN, Columbia, FNAL, Heidelberg, Helsinki, INFN, IN2P3, KEK, Marseilles, MPI, Munich, Orsay, Oxford, Tufts
GOALS:
- Specify the main parameters characterizing the Model's performance: throughputs, latencies
- Develop "Baseline Models" in the "feasible" category
- Verify resource requirement baselines (computing, data handling, networks)
COROLLARIES:
- Define the Analysis Process
- Define RC Architectures and Services
- Provide guidelines for the final Models
- Provide a Simulation Toolset for further Model studies
[Diagram: Model circa 2006. CERN (6x10^7 MIPS, 2000 TByte robot) connected by 622 Mbit/s links to Tier1 centres such as FNAL/BNL (4x10^6 MIPS, 200 TByte robot) and university centres (n x 10^6 MIPS, 100 TByte robot), each serving desktops.]
MONARC General Conclusions on LHC Computing
Following discussions of computing and network requirements, technology evolution and projected costs, support requirements etc.:
- The scale of LHC "Computing" is such that it requires a worldwide effort to accumulate the necessary technical and financial resources
- The uncertainty in the affordable network BW implies that several scenarios of computing resource-distribution must be developed
- A distributed hierarchy of computing centres will lead to better use of the financial and manpower resources of CERN, the Collaborations, and the nations involved, than a highly centralised model focused at CERN
- Hence: the distributed model also provides better use of physics opportunities at the LHC by physicists and students
- At the top of the hierarchy is the CERN Center, with the ability to perform all analysis-related functions, but not the ability to do them completely
- At the next step in the hierarchy is a collection of large, multi-service "Tier1 Regional Centres", each with 10-20% of the CERN capacity devoted to one experiment
- There will be Tier2 or smaller special purpose centers in many regions
Grid-Hierarchy Concept
Matched to the worldwide-distributed collaboration structure of LHC experiments
Best suited for the multifaceted balance between:
- Proximity of the data to centralized processing resources
- Proximity to end-users for frequently accessed data
- Efficient use of limited network bandwidth (especially transoceanic, and many world regions) through organized caching/mirroring/replication
- Appropriate use of (world-) regional and local computing and data handling resources
- Effective involvement of scientists and students in each world region, in the data analysis and the physics
MONARC Phase 1 and 2 Deliverables
- September 1999: Benchmark test validating the simulation (milestone completed)
- Fall 1999: A Baseline Model representing a possible (somewhat simplified) solution for LHC Computing:
  - Baseline numbers for a set of system and analysis process parameters: CPU times, data volumes, frequency and site of jobs and data...
  - Reasonable "ranges" of parameters
  - "Derivatives": how the effectiveness depends on some of the more sensitive parameters
  - Agreement of the experiments on the reasonableness of the Baseline Model
- Chapter on Computing Models in the CMS and ATLAS Computing Technical Progress Reports
MONARC and Regional Centres
- MONARC RC Representative Meetings in April and August
- Regional Centre planning well-advanced, with optimistic outlook, in the US (FNAL for CMS; BNL for ATLAS), France (CCIN2P3), Italy
  - Proposals to be submitted late this year or early next
- Active R&D and prototyping underway, especially in the US, Italy, Japan; and the UK (LHCb), Russia (MSU, ITEP), Finland (HIP/Tuovi)
- Discussions in the national communities also underway in Japan, Finland, Russia, UK, Germany
  - Varying situations, according to the funding structure and outlook
- Need for more active planning outside of US, Europe, Japan, Russia; important for R&D and overall planning
- There is a near-term need to understand the level and sharing of support for LHC computing between CERN and the outside institutes, to enable the planning in several countries to advance.
  - MONARC + CMS/SCB assumption: "traditional" 1/3:2/3 sharing
MONARC Working Groups & Chairs
- "Analysis Process Design": P. Capiluppi (Bologna, CMS)
- "Architectures": Joel Butler (FNAL, CMS)
- "Simulation": Krzysztof Sliwa (Tufts, ATLAS)
- "Testbeds": Lamberto Luminari (Rome, ATLAS)
- "Steering": Laura Perini (Milan, ATLAS), Harvey Newman (Caltech, CMS)
- plus the "Regional Centres Committee"
MONARC Architectures WG
Discussion and study of site requirements:
- Analysis task division between CERN and RCs
- Facilities required with different analysis scenarios, and network bandwidth
- Support required to (a) sustain the Centre, and (b) contribute effectively to the distributed system
Reports:
- Rough sizing estimates for a large LHC experiment facility
- Computing architectures of existing experiments: LEP, FNAL Run 2, CERN fixed target (NA45, NA48), FNAL fixed target (KTeV, FOCUS)
- Regional Centres for LHC Computing (functionality & services)
- Computing architectures of future experiments (in progress): BaBar, RHIC, COMPASS
- Conceptual designs, drawings and specifications for candidate site architecture
Comparisons with an LHC-sized experiment: CMS or ATLAS

Experiment   Onsite CPU (SI95)  Onsite disk (TB)  Onsite tape (TB)  LAN capacity     Data import/export     Box count
LHC (2006)   520,000 [*]        540               3000              46 GB/s          10 TB/day (sustained)  ~1400
CDF Run 2    12,000             20                800               1 Gb/s           18 MB/s                ~250
D0 Run 2     7,000              20                600               300 Mb/s         10 MB/s                ~250
BaBar        ~6,000             8                 ~300              100 + 1000 Mb/s  ~400 GB/day            ~400
D0           295                1.5               65                300 Mb/s         ?                      180
CDF          280                2                 100               100 Mb/s         ~100 GB/day            ?
ALEPH        300                1.8               30                1 Gb/s           ?                      70
DELPHI       515                1.2               60                1 Gb/s           ?                      80
L3           625                2                 40                1 Gb/s           ?                      160
OPAL         835                1.6               22                1 Gb/s           ?                      220
NA45         587                1.3               2                 1 Gb/s           5 GB/day               30

(1 SI95 = 40 MIPS)
[*] Total CPU: CMS or ATLAS ~1.5-2 million SI95 (current concepts; maybe for 10^33 luminosity)
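The table's figures can be checked against the scale claims made later in the talk. A quick sketch, using only the numbers above; note that the "~100x more CPU than CDF Run 2" lesson appears to refer to the total-CPU footnote figure rather than the onsite row:

```python
# Ratios implied by the comparison table (onsite figures for one LHC
# experiment in 2006 vs. CDF Run 2, plus the footnote's total-CPU estimate).

lhc_onsite_si95 = 520_000
lhc_total_si95 = (1_500_000, 2_000_000)  # footnote [*]: full CMS/ATLAS estimate
cdf2_si95 = 12_000

print(f"Onsite CPU ratio: {lhc_onsite_si95 / cdf2_si95:.0f}x")
for total in lhc_total_si95:
    print(f"Total CPU ratio:  {total / cdf2_si95:.0f}x")

lhc_tape_tb, cdf2_tape_tb = 3000, 800
print(f"Onsite tape ratio: {lhc_tape_tb / cdf2_tape_tb:.2f}x")
```

Onsite CPU alone gives ~43x; the total estimate gives ~125-167x, consistent with the "~100x" figure quoted in the lessons slide.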
Architectural Sketch: One Major LHC Experiment, At CERN (L. Robertson)
- Mass-market commodity PC farms
- LAN-SAN and LAN-WAN "stars" (switch/routers)
- Tapes (many drives for ALICE); an archival medium only?
MONARC Architectures WG: Lessons and Challenges for LHC
- SCALE: ~100 times more CPU and ~10 times more data than CDF at Run 2 (2000-2003)
- DISTRIBUTION: mostly achieved in HEP only for simulation. For analysis (and some re-processing), it will not happen without advance planning and commitments
- REGIONAL CENTRES: require coherent support, continuity, and the ability to keep the code base, calibrations and job parameters up to date
- HETEROGENEITY: of facility architecture and mode of use, and of operating systems, must be accommodated
- FINANCIAL PLANNING: analysis of the early planning for the LEP era showed a definite tendency to underestimate the requirements (by more than an order of magnitude), partly due to budgetary considerations
Regional Centre Architecture (example by I. Gaines)
[Diagram: a Regional Centre built around tape mass storage & disk servers and database servers, with support services (physics software development, R&D systems and testbeds, info servers, code servers, web servers, telepresence servers, training, consulting, help desk), fed by networks from CERN and from Tier2 and simulation centers, plus tape import/export. Three workflows: Production Reconstruction (Raw/Sim -> ESD; scheduled, predictable; experiment/physics groups), Production Analysis (ESD -> AOD, AOD -> DPD; scheduled; physics groups), and Individual Analysis (AOD -> DPD and plots; chaotic; physicists' desktops), serving Tier2 centres, local institutes and CERN.]
MONARC Architectures WG: Regional Centre Facilities & Services
Regional Centres should provide:
- All technical and data services required to do physics analysis
- All Physics Objects, Tags and Calibration data
- A significant fraction of the raw data
- Caching or mirroring of calibration constants
- Excellent network connectivity to CERN and the region's users
- Manpower to share in the development of common maintenance, validation and production software
- A fair share of post- and re-reconstruction processing
- Manpower to share in the work on Common R&D Projects
- Service to members of other regions on a (?) best effort basis
- Excellent support services for training, documentation, troubleshooting at the Centre or remote sites served by it
- A long term commitment for staffing, hardware evolution and support for R&D, as part of the distributed data analysis architecture
MONARC Analysis Process WG
"How much data is processed by how many people, how often, in how many places, with which priorities..."
Analysis Process Design, initial steps:
- Consider number and type of processing and analysis jobs: frequency, number of events, data volumes, CPU etc.
- Consider physics goals, triggers, signals and background rates
- Studies covered Reconstruction, Selection/Sample Reduction (one or more passes), Analysis, Simulation
- Lessons from existing experiments are limited: each case is tuned to the detector, run conditions, physics goals and technology of the time
- Limited studies so far, from the user rather than the system point of view; more as feedback from simulations is obtained
- Limitations on CPU dictate a largely "Physics Analysis Group" oriented approach to reprocessing of data, with Regional ("local") support for individual activities; this implies dependence on the RC Hierarchy
MONARC Analysis Process: Initial Sharing Assumptions
- Assume similar computing capacity available outside CERN for re-processing and data analysis [*]
- There is no allowance for event simulation and reconstruction of simulated data, which it is assumed will be performed entirely outside CERN [*]
- Investment, services and infrastructure should be optimised to reduce overall costs [TCO]
- Tape sharing makes sense if ALICE needs so much more at a different time of the year
[*] The first two assumptions would likely result in "at least" a 1/3:2/3 CERN:outside ratio of resources (i.e., likely to be larger outside).
MONARC Analysis Process Example
MONARC Analysis Process Baseline: Group-Oriented Analysis
ATLAS OR CMS Value Range
No. of Analysis Groups 20/Exp. 10-25/Exp.
No. of Members Per Group 25 15-35
No. of RCs including CERN 6/Exp. 4-12/Exp.
No. of Analysis Groups/RC 5 3-7
Active time of Members 8 Hrs/Day 2-14 Hrs/Day
Activity of Members Single RC > One RC
MONARC Baseline Analysis Process: ATLAS/CMS Reconstruction Step

Parameter               Value           Range
Frequency               4/year          2-6/year
Input data              1 PB            0.5-2 PB
CPU/event               350 SI95.sec    250-500 SI95.sec
Data input storage      Tape (HPSS?)    Disk-Tape
Output data             0.1 PB          0.05-0.2 PB
Data output storage     Disk            Disk-Tape
Triggered by            Collaboration   1/2 Collaboration; 1/2 Analysis Group
Data input residence    CERN            CERN + some RCs
Data output residence   CERN + some RCs CERN + RCs
Time response           1 month         10 days - 2 months
Priority (if possible)  High            -
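Taking the baseline at face value, a rough check of the CPU a single reconstruction pass implies if it is to finish within the quoted 1-month time response. The event count is an assumption derived from the 1 MB/event raw size quoted elsewhere in the talk:

```python
# Back-of-envelope: CPU needed for one reconstruction pass (1 PB input,
# 350 SI95.sec/event) to complete in 1 month at 100% efficiency.
# Assumes 1 MB/event raw data, as in the baseline event-size slide.

input_pb = 1.0
mb_per_event = 1.0
events = input_pb * 1e9 / mb_per_event   # 1 PB at 1 MB/event -> 1e9 events

si95_sec_per_event = 350
month_sec = 30 * 24 * 3600

required_si95 = events * si95_sec_per_event / month_sec
print(f"~{required_si95 / 1e3:.0f} KSI95 of fully-efficient CPU per pass")
```

This comes out near 135 KSI95, roughly a quarter of the 520 KSI95 CERN-centre baseline per pass, before any efficiency losses, which is why the frequency (4/year) and time response are coupled parameters.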
MONARC Analysis Model Baseline: Event Sizes and CPU Times

Sizes:
- Raw data: 1 MB/event
- ESD: 100 KB/event
- AOD: 10 KB/event
- TAG or DPD: 1 KB/event

CPU time in SI95 seconds (without ODBMS overhead; ~20%):
- Creating ESD (from Raw): 350
- Selecting ESD: 0.25
- Creating AOD (from ESD): 2.5
- Creating TAG (from AOD): 0.5
- Analyzing TAG or DPD: 3.0
- Analyzing AOD: 3.0
- Analyzing ESD: 3.0
- Analyzing RAW: 350
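These per-event sizes can be multiplied out into yearly volumes. The 1e9 events/year figure is an assumption chosen to be consistent with the 1-1.5 PB/year of raw data on the CERN-centre slide; the RAW and ESD products reproduce that slide's figures, while the AOD and TAG products come out larger than the stored 2 TB/yr and 200 GB/yr quoted there, presumably reflecting further selection:

```python
# Yearly data volumes implied by the baseline event sizes,
# assuming ~1e9 events/year (an assumption matching the 1-1.5 PB/yr
# raw-data figure at 1 MB/event).

events_per_year = 1e9
sizes_kb = {"RAW": 1000, "ESD": 100, "AOD": 10, "TAG/DPD": 1}

volumes_tb = {tier: events_per_year * kb * 1e3 / 1e12  # KB -> bytes -> TB
              for tier, kb in sizes_kb.items()}

for tier, tb in volumes_tb.items():
    print(f"{tier:8s} ~{tb:,.0f} TB/year")
```

Each tier is a factor of 10 smaller than the one above it, which is what makes the tiered Regional Centre replication policy (100% of ESD/AOD/TAG, ~1% of raw) workable at all.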
MONARC Analysis Model Baseline: ATLAS or CMS at CERN Center
- CPU power: 520 KSI95
- Disk space: 540 TB
- Tape capacity: 3 PB, 400 MB/sec
- Link speed to RC: 40 MB/sec (1/2 of 622 Mbps)
- Raw data: 100%, 1-1.5 PB/year
- ESD data: 100%, 100-150 TB/year
- Selected ESD: 100%, 20 TB/year [*]
- Revised ESD: 100%, 40 TB/year [*]
- AOD data: 100%, 2 TB/year [*]
- Revised AOD: 100%, 4 TB/year [*]
- TAG/DPD: 100%, 200 GB/year
- Simulated data: 100%, 100 TB/year (repository)
[*] Covering all Analysis Groups; each selecting ~1% of total ESD or AOD data for a typical analysis
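The quoted 40 MB/s link is consistent with half of a 622 Mbps circuit (622 / 8 / 2 is about 39 MB/s). A minimal sketch of what that rate means for replication, assuming (hypothetically) the full rate were available to ship one year's ESD to a single Regional Centre:

```python
# Transfer time for a year's ESD (100-150 TB) over the baseline
# 40 MB/s CERN -> RC link, assuming the whole rate is available.

link_mb_s = 40.0  # ~1/2 of a 622 Mbps link (622 / 8 / 2 ~ 39 MB/s)

days = {}
for esd_tb in (100, 150):
    seconds = esd_tb * 1e6 / link_mb_s   # TB -> MB, divided by MB/s
    days[esd_tb] = seconds / 86400
    print(f"{esd_tb} TB at {link_mb_s:.0f} MB/s: ~{days[esd_tb]:.0f} days")
```

Roughly a month per 100 TB on a dedicated half-link: this is the arithmetic behind the slide on efficient use of limited (especially transoceanic) bandwidth through organized caching, mirroring and replication.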
MONARC Analysis Model Baseline: ATLAS or CMS at CERN Center (annotated)
[Repeat of the previous slide's table.] Some of these basic numbers require further study.
LHCb (preliminary) figures shown for comparison: 300 KSI95?; 200 TB/yr; 140 TB/yr; ~1-10 TB/yr; ~70 TB/yr.
MONARC Analysis Model Baseline: ATLAS or CMS "Typical" Tier1 RC
- CPU power: ~100 KSI95
- Disk space: ~100 TB
- Tape capacity: 300 TB, 100 MB/sec
- Link speed to Tier2: 10 MB/sec (1/2 of 155 Mbps)
- Raw data: 1%, 10-15 TB/year
- ESD data: 100%, 100-150 TB/year
- Selected ESD: 25%, 5 TB/year [*]
- Revised ESD: 25%, 10 TB/year [*]
- AOD data: 100%, 2 TB/year [**]
- Revised AOD: 100%, 4 TB/year [**]
- TAG/DPD: 100%, 200 GB/year
- Simulated data: 25%, 25 TB/year (repository)
[*] Covering five Analysis Groups; each selecting ~1% of total ESD or AOD data for a typical analysis
[**] Covering all Analysis Groups
MONARC Analysis Process WG: A Short List of Upcoming Issues
- Priorities, schedules and policies
  - Production vs. Analysis Group vs. Individual activities
  - Allowed percentage of access to higher data tiers (TAG / Physics Objects / Reconstructed / RAW)
- Improved understanding of the Data Model, and ODBMS
  - Including MC production; simulated data storage and access
- Mapping the Analysis Process onto heterogeneous distributed resources
- Determining the role of Institutes' workgroup servers and desktops in the Regional Centre Hierarchy
- Understanding how to manage persistent data: e.g. storage / migration / transport / re-compute strategies
- Deriving a methodology for Model testing and optimisation
- Metrics for evaluating the global efficiency of a Model: cost vs. throughput; turnaround; reliability of data access
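One way to make the last point concrete is a scalar figure of merit over the listed metrics. The weighting below is purely illustrative (the function and its parameters are our invention, not a MONARC definition):

```python
# Illustrative only: fold cost, throughput, turnaround and data-access
# reliability into one number for comparing Models. The functional
# form and weights are invented for this sketch.
def model_figure_of_merit(cost_musd, throughput_jobs_per_day,
                          mean_turnaround_h, data_access_success):
    """Higher is better: reward throughput and reliable access,
    penalise cost and long turnaround."""
    if data_access_success <= 0:
        return 0.0
    return (throughput_jobs_per_day * data_access_success
            / (cost_musd * mean_turnaround_h))

# A cheaper model with slow turnaround vs. a costlier, faster one:
a = model_figure_of_merit(20, 5000, 12.0, 0.98)   # ~20.4
b = model_figure_of_merit(30, 5000, 4.0, 0.95)    # ~39.6
```

Under these (invented) weights the faster, more expensive model wins; real Phase 2 comparisons would have to calibrate such a metric against actual workloads.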
MONARC Testbeds WG
- Measurements of Key Parameters governing the behavior and scalability of the Models
- Simple testbed configuration defined and implemented
  - Sun Solaris 2.6, C++ compiler version 4.2
  - Objectivity 5.1 with /C++, /stl, /FTO, /Java options
- Set up at CNAF, FNAL, Genova, Milano, Padova, Roma, KEK, Tufts, CERN
- Four "Use Case" Applications Using Objectivity: ATLASFAST++, GIOD/JavaCMS, ATLAS 1 TB Milestone, CMS Test Beams
- System Performance Tests; Simulation Validation Milestone Carried Out: see I. Legrand talk
MONARC Testbed Systems
MONARC Testbeds WG: Isolation of Key Parameters
Some Parameters Measured, Installed in the MONARC Simulation Models, and Used in First Round Validation of Models:
- Objectivity AMS Response Time-Function, and its dependence on
  - Object clustering, page size, data class-hierarchy and access pattern
  - Mirroring and caching (e.g. with the Objectivity DRO option)
- Scalability of the System Under "Stress":
  - Performance as a function of the number of jobs, relative to the single-job performance
- Performance and Bottlenecks for a variety of data access patterns
  - Frequency of following TAG → AOD; AOD → ESD; ESD → RAW
  - Data volume accessed remotely: fraction on tape and on disk, as a function of net bandwidth; use of QoS
MONARC Simulation

A CPU- and code-efficient approach for the simulation of distributed systems has been developed for MONARC. It
- provides an easy way to map the distributed data processing, transport, and analysis tasks onto the simulation
- can handle dynamically any Model configuration, including very elaborate ones with hundreds of interacting complex Objects
- can run on real distributed computer systems, and may interact with real components

The Java (JDK 1.2) environment is well-suited for developing a flexible and distributed process-oriented simulation. The simulation program is still under development, and dedicated measurements to evaluate realistic parameters and "validate" it are in progress.
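The core idea of a process-oriented simulation can be sketched in a few lines. This is a toy Python analogue for illustration (the actual MONARC tool is a Java program): each job is a process that yields the duration of its next activity, and an event queue advances simulated time.

```python
# Toy discrete-event, process-oriented simulation (illustrative
# analogue of the approach, not the MONARC Java code). Jobs are
# generators; each yield hands back the duration of one activity.
import heapq

def job(name, transfer_time, cpu_time, trace):
    trace.append((name, "start"))
    yield transfer_time            # fetch input data
    trace.append((name, "data ready"))
    yield cpu_time                 # process it
    trace.append((name, "done"))

def run(jobs):
    clock, queue = 0.0, []
    for i, j in enumerate(jobs):   # all jobs start at t=0
        heapq.heappush(queue, (0.0, i, j))
    while queue:
        clock, i, proc = heapq.heappop(queue)
        try:
            dt = next(proc)        # resume job, get next duration
            heapq.heappush(queue, (clock + dt, i, proc))
        except StopIteration:
            pass                   # job finished
    return clock

trace = []
end = run([job("aod-scan", 2.0, 5.0, trace),
           job("esd-scan", 4.0, 9.0, trace)])
# end == 13.0: the longer job (4 + 9) finishes last
```

Mapping an analysis workload onto such a simulation then amounts to writing job generators whose yielded durations come from measured parameters (CPU power, link speed, AMS response time).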
Example: Physics Analysis at Regional Centres
- Similar data processing jobs are performed in several RCs
- Each Centre has "TAG" and "AOD" databases replicated
- The Main Centre provides "ESD" and "RAW" data
- Each job processes AOD data, and also a fraction of ESD and RAW
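The per-job input volume implied by this access pattern is easy to estimate. The event sizes and access fractions below are placeholders for illustration, not MONARC parameters:

```python
# Illustrative per-job input volume for the access pattern above:
# read all of the AOD, plus small fractions of ESD and RAW.
# Sizes and fractions are invented placeholders.
def job_input_gb(n_events, aod_kb=10, esd_kb=100, raw_kb=1000,
                 esd_fraction=0.05, raw_fraction=0.01):
    per_event_kb = (aod_kb
                    + esd_fraction * esd_kb
                    + raw_fraction * raw_kb)
    return n_events * per_event_kb / 1e6   # KB -> GB

vol = job_input_gb(1_000_000)   # 10 + 5 + 10 = 25 KB/event -> ~25 GB
```

Even small ESD/RAW fractions dominate quickly, which is why the Models restrict access to the higher data tiers.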
Example: Physics Analysis
Simple Validation Measurements: The AMS Data Access Case
[Figure: Mean time per job [ms] (0-180) vs. number of concurrent jobs (0-35), comparing Simulation with Measurements; Raw Data DB accessed over a LAN by a 4-CPU client]
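The shape of such a curve can be captured by a minimal contention model. This is an assumption for illustration (not the MONARC simulation, and not fitted to the measurement above): each job runs at full speed until the server's CPUs saturate, then slows down in proportion to the overcommitment.

```python
# Minimal CPU-contention model (illustrative assumption): with
# n concurrent jobs on a server with c CPUs, the mean time per
# job is flat up to n = c, then grows linearly as n / c.
def mean_time_per_job(n_jobs, single_job_time, n_cpus=4):
    slowdown = max(1.0, n_jobs / n_cpus)
    return single_job_time * slowdown

t1 = mean_time_per_job(1, 20.0)    # 20.0: no contention
t32 = mean_time_per_job(32, 20.0)  # 160.0: 8x slowdown on 4 CPUs
```

Deviations of measured curves from such an idealised model (e.g. extra per-job AMS overhead) are exactly what the validation exercise is meant to expose.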
MONARC Strategy and Tools for Phase 2
Strategy: Vary System Capacity and Network Performance Parameters Over a Wide Range
- Avoid complex, multi-step decision processes that could require protracted study; keep them for a possible Phase 3
- Majority of the workload satisfied in an acceptable time: up to minutes for interactive queries, up to hours for short jobs, up to a few days for the whole workload
- Determine requirements "baselines" and/or flaws in certain Analysis Processes in this way
- Perform a comparison of a CERN-centralised Model and suitable variations of Regional Centre Models

Tools and Operations to be Designed in Phase 2
- Query estimators
- Affinity evaluators, to determine proximity of multiple requests in space or time
- Strategic algorithms for caching, reclustering, mirroring, or pre-emptively moving data (or jobs or parts of jobs)
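The simplest instance of such a strategic algorithm is the "move the data or move the job" decision. The cost model below is invented for illustration (transfer time only; a real estimator would also weigh CPU availability and policy):

```python
# Sketch (invented cost model) of a data-vs-job placement decision:
# is it cheaper to ship the job and return only its results than
# to transfer the input data over the link?
def move_job_to_data(data_gb, result_gb, link_mb_per_s,
                     remote_queue_delay_s=0.0):
    transfer_data_s = data_gb * 1000 / link_mb_per_s
    transfer_result_s = result_gb * 1000 / link_mb_per_s
    return transfer_result_s + remote_queue_delay_s < transfer_data_s

# 50 GB of input vs. 0.5 GB of histograms back, over a 10 MB/s link:
decision = move_job_to_data(50, 0.5, 10)   # True: ship the job
```

With a long remote queue the decision flips, which is why such estimators need the system-state tracking discussed under Phase 3.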
MONARC Phase 2: Detailed Milestones
September '99   Reliable figures on Technologies and Costs from Technology Tracking work to be inserted in the Modelling.
September '99   First results on Model Validation available.
September '99   First results on Model Comparison available.
November '99    Completion of a simulation cycle achieving the Phase 2 goals.
November '99    Document on Guidelines for Regional Centres available.
December '99    Presentation to LCB of a proposal for the continuation of MONARC.

(July 1999: Complete Phase 1; Begin Second Cycle of Simulations with More Refined Models)
MONARC Possible Phase 3
TIMELINESS and USEFUL IMPACT
- Facilitate the efficient planning and design of mutually compatible site and network architectures, and services, among the experiments, the CERN Centre and Regional Centres
- Provide modelling consultancy and service to the experiments and Centres
- Provide a core of advanced R&D activities, aimed at LHC computing system optimisation and production prototyping
- Take advantage of work on distributed data-intensive computing for HENP this year in other "next generation" projects [*]
  For example in the US: "Particle Physics Data Grid" (PPDG) of DoE/NGI; "A Physics Optimized Grid Environment for Experiments" (APOGEE) to DoE/HENP; joint "GriPhyN" proposal to NSF by ATLAS/CMS/LIGO

[*] See H. Newman, http://www.cern.ch/MONARC/progress_report/longc7.html
MONARC Phase 3
Possible Technical Goal: System Optimisation (Maximise Throughput and/or Reduce Long Turnaround)
- Include long and potentially complex decision-processes in the studies and simulations
- Potential for substantial gains in the work performed or resources saved

Phase 3 System Design Elements
- RESILIENCE, resulting from flexible management of each data transaction, especially over WANs
- FAULT TOLERANCE, resulting from robust fall-back strategies to recover from abnormal conditions
- SYSTEM STATE & PERFORMANCE TRACKING, to match and co-schedule requests and resources, and to detect or predict faults

Synergy with PPDG and other Advanced R&D Projects.
Potential Importance for Scientific Research and Industry: Simulation of Distributed Systems for Data-Intensive Computing.
MONARC Status: Conclusions
- MONARC is well on its way to specifying baseline Models representing cost-effective solutions to LHC Computing.
- Initial discussions have shown that LHC computing has a new scale and level of complexity.
- A Regional Centre hierarchy of networked centres appears to be the most promising solution.
- A powerful simulation system has been developed, and we are confident of delivering a very useful toolset for further model studies by the end of the project.
- Synergy with other advanced R&D projects has been identified; this may be of considerable mutual benefit.
- We will deliver important information, and example Models:
  - very timely for the Hoffmann Review and discussions of LHC Computing over the next months
  - in time for the Computing Progress Reports of ATLAS and CMS
LHC Data Models: RD45
HEP data models are complex!
- Rich hierarchy of hundreds of complex data types (classes)
- Many relations between them
- Different access patterns (Multiple Viewpoints)

LHC experiments rely on OO technology
- OO applications deal with networks of objects (and containers)
- Pointers (or references) are used to describe relations

Existing solutions do not scale. Solution suggested by RD45: an ODBMS coupled to a Mass Storage System.
[Diagram: example event object hierarchy: an Event linked to Tracker and Calorimeter; a TrackList containing Tracks; a HitList containing Hits]
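The object hierarchy in the diagram can be rendered as a minimal class sketch. This is illustrative Python (the experiments used C++ classes persisted with Objectivity/DB); the point is that relations are plain object references, which an ODBMS stores as persistent links:

```python
# Illustrative Python rendering of the Event/Track/Hit hierarchy.
# In the real systems these are persistent C++ classes; here the
# "pointers used to describe relations" are ordinary references.
class Hit:
    def __init__(self, charge):
        self.charge = charge

class Track:
    def __init__(self, hits):
        self.hits = hits               # Track -> Hit relation

class Event:
    def __init__(self, tracks):
        self.track_list = tracks       # Event -> TrackList
        # HitList: every hit reachable through the tracks
        self.hit_list = [h for t in tracks for h in t.hits]

ev = Event([Track([Hit(1.2), Hit(0.7)]), Track([Hit(2.1)])])
n_hits = len(ev.hit_list)              # 3
```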
System View of Data Analysis by 2005
Multi-Petabyte Object Database Federation Backed by a Networked Set of Archival Stores
- High Availability and Immunity from Corruption
- "Seamless" response to database queries
  - Location independence; storage brokers; caching
- Clustering and Reclustering of Objects
- Transfer only "useful" data: tape/disk; across networks; disk/client
- Access and Processing Flexibility
  - Resource and application profiling, state tracking, co-scheduling
  - Continuous retrieval/recalculation/storage decisions
  - Trade off data storage, CPU and network capabilities to optimize performance and costs
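The "retrieval vs. recalculation" trade-off can be sketched as a time comparison. All parameters here are invented for illustration (a real broker would also price storage and account for contention):

```python
# Hedged sketch of the retrieval-vs-recalculation decision above:
# fetch a ready-made derived dataset over the network, or re-derive
# it locally from stored input? Cost parameters are invented.
def fetch_instead_of_recompute(aod_gb, net_mb_per_s,
                               cpu_si95_sec_needed, cpu_si95_available):
    fetch_s = aod_gb * 1000 / net_mb_per_s          # network time
    recompute_s = cpu_si95_sec_needed / cpu_si95_available
    return fetch_s < recompute_s

# 2 GB over a 5 MB/s share vs. a 10000 SI95-sec job on a 20 SI95 share:
choice = fetch_instead_of_recompute(2, 5, 10000, 20)   # True: fetch
```

Making this decision continuously, per request, is what turns a static storage hierarchy into the adaptive system envisaged above.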
CMS Analysis and Persistent Object Store
[Diagram: CMS online systems (L1, L2/L3 and "L4" triggers; Slow Control; Detector Monitoring) feed the Persistent Object Store (Object Database Management System) via Common Filters and Pre-Emptive Object Creation; On Demand Object Creation serves Filtering, Simulation, Calibrations, Group Analyses and User Analysis]
Data Organized In a(n Object) "Hierarchy"
- Raw, Reconstructed (ESD), Analysis Objects (AOD), Tags

Data Distribution
- All raw, reconstructed and master parameter DBs at CERN
- All event TAGs and AODs at all regional centers
- Selected reconstructed data sets at each regional center
- HOT data (frequently accessed) moved to RCs
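The distribution policy above can be written down as a small lookup. The site lists and helper are illustrative only (not a CMS interface); they simply encode the bullet points:

```python
# Illustrative encoding of the CMS data-distribution policy above.
# Site lists and the helper function are our sketch, not CMS code.
PLACEMENT = {
    "RAW":      ["CERN"],
    "ESD":      ["CERN"],          # selected sets also at RCs
    "AOD":      ["CERN", "RC"],
    "TAG":      ["CERN", "RC"],
    "PARAM_DB": ["CERN"],          # master parameter DBs
}

def hosted_at_rc(data_tier, is_hot=False):
    """Hot (frequently accessed) data is moved to the RCs
    regardless of its tier."""
    return is_hot or "RC" in PLACEMENT[data_tier]

raw_at_rc = hosted_at_rc("RAW")                 # False
hot_raw_at_rc = hosted_at_rc("RAW", is_hot=True)  # True
```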
GIOD Summary (Caltech/CERN/FNAL/HP/SDSC)
[Diagram: Java 3D visualization of Hit, Track and Detector objects]
GIOD has
- Constructed a Terabyte-scale set of fully simulated events and used these to create a large OO database
- Learned how to create large database federations
- Completed the "100" (to 170) MByte/sec CMS Milestone
- Developed prototype reconstruction and analysis codes, and Java 3D OO visualization prototypes, that work seamlessly with persistent objects over networks
- Deployed facilities and database federations as useful testbeds for Computing Model studies
BaBar OOFS: Putting The Pieces Together
[Diagram: OOFS stack as seen by the User: Database Protocol Layer (Objectivity), Filesystem Logical Layer, Filesystem Physical Layer, and Filesystem Implementation; components designed and developed at SLAC, alongside Veritas, IBM and DOE pieces]
Dynamic Load Balancing: Hierarchical Secure AMS
- Dynamic Selection: tapes are transparent; tape resources can reside anywhere

Defer Request Protocol
- Transparently delays the client while data is made available
- Accommodates high-latency storage systems (e.g., tape)

Request Redirect Protocol
- Redirects the client to an alternate AMS
- Provides for dynamic replication and real-time load balancing
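Combined, the two protocols give a server three possible answers to any request. The sketch below is a toy state machine (the protocol names come from the slide; the message shapes, thresholds and paths are invented):

```python
# Toy AMS-like request handler combining the two protocols above:
# defer when the file must be staged from tape, redirect when the
# server is overloaded. Thresholds and message formats are invented.
def handle_request(path, on_disk, load, peers,
                   max_load=0.8, stage_seconds=120):
    if load > max_load and peers:
        return ("REDIRECT", peers[0])     # Request Redirect Protocol
    if path not in on_disk:
        return ("DEFER", stage_seconds)   # Defer Request Protocol:
                                          # client waits for tape stage
    return ("SERVE", path)

disk = {"/store/run42/ev1.db"}
r1 = handle_request("/store/run42/ev1.db", disk, 0.30, ["ams2"])
r2 = handle_request("/store/run07/ev9.db", disk, 0.30, ["ams2"])
r3 = handle_request("/store/run42/ev1.db", disk, 0.95, ["ams2"])
# r1 serves, r2 defers, r3 redirects to ams2
```

Because the client only ever sees "wait" or "go there", tape latency and replica placement stay transparent, exactly the property the slide claims.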
Regional Centers Concept: A Data Grid Hierarchy
LHC Grid Hierarchy Example
- Tier0: CERN
- Tier1: National "Regional" Center
- Tier2: Regional Center
- Tier3: Institute Workgroup Server
- Tier4: Individual Desktop

Total: 5 Levels
Background: Why "Grids"?
I. Foster, ANL/Chicago: because the resources needed to solve complex problems are rarely colocated
- Advanced scientific instruments
- Large amounts of storage
- Large amounts of computing
- Groups of smart people

For a variety of reasons
- Resource allocations not optimized for one application
- Required resource configurations change
- Different views of priorities and truth

For transparent, rapid access and delivery of Petabyte-scale data (and Multi-TIPS computing resources)
Grid Services Architecture [*]
Applns:          A Rich Set of HEP Data-Analysis Related Applications
Appln Toolkits:  Remote viz toolkit; remote comp. toolkit; remote data toolkit; remote sensors toolkit; remote collab. toolkit; ...
Grid Services:   Protocols, authentication, policy, resource management, instrumentation, resource discovery, etc.
Grid Fabric:     Data stores, networks, computers, display devices, ...; associated local services

[*] Adapted from Ian Foster: there are computing grids, data grids, access (collaborative) grids, ...
Roles of HENP Projects for Distributed Analysis (→ Grids)
- RD45, GIOD: Networked Object Databases
- Clipper/GC: High speed access to Objects or File data; FNAL/SAM for processing and analysis
- SLAC/OOFS: Distributed File System + Objectivity Interface
- NILE, Condor: Fault Tolerant Distributed Computing with Heterogeneous CPU Resources
- MONARC: LHC Computing Models: Architecture, Simulation, Testbeds; Strategy, Politics
- PPDG: First Distributed Data Services and Grid System Prototype
- ALDAP: OO Database Structures and Access Methods for Astrophysics and HENP Data
- APOGEE: Full-Scale Grid Design; Instrumentation, System Modeling and Simulation, Evaluation/Optimization
- GriPhyN: Production Prototype Grid in Hardware and Software; then Production
ALDAP: Accessing Large Data Archives in Astronomy and Particle Physics

ALDAP (NSF/KDI) Project
- NSF Knowledge Discovery Initiative (KDI)
- CALTECH, Johns Hopkins, FNAL (SDSS)
- Explore advanced adaptive database structures and physical data storage hierarchies for archival storage of next generation astronomy and particle physics data
- Develop spatial indexes, novel data organizations, distribution and delivery strategies, for efficient and transparent access to data across networks
- Create prototype network-distributed data query execution systems using Autonomous Agent workers
- Explore commonalities and find effective common solutions for particle physics and astrophysics data
The China Clipper Project: A Data Intensive Grid
China Clipper Goal (ANL-SLAC-Berkeley): Develop and demonstrate middleware allowing applications transparent, high-speed access to large data sets distributed over wide-area networks.
- Builds on expertise and assets at ANL, LBNL & SLAC; NERSC, ESnet
- Builds on Globus Middleware and a high-performance distributed storage system (DPSS from LBNL)
- Initial focus on large DOE HENP applications: RHIC/STAR, BaBar
- Demonstrated data rates to 57 MBytes/sec.
HENP Grand Challenge/Clipper Testbed and Tasks
- High-Speed Testbed: computing and networking (NTON, ESnet) infrastructure
- Differentiated Network Services: traffic shaping on ESnet
- End-to-end Monitoring Architecture (QE, QM, CM): traffic analysis, event monitor agents to support traffic shaping and CPU scheduling
- Transparent Data Management Architecture: OOFS/HPSS, DPSS/ADSM
- Application Demonstration: Standard Analysis Framework (STAF); access data at SLAC, LBNL, or ANL (net and data quality)
The Particle Physics Data Grid (PPDG)
DoE/NGI Next Generation Internet Project
ANL, BNL, Caltech, FNAL, JLAB, LBNL, SDSC, SLAC, U.Wisc/CS
- Goal: to be able to query and partially retrieve data from PB data stores across Wide Area Networks within seconds
- Drive progress in the development of the necessary middleware, networks and fundamental computer science of distributed systems
- Deliver some of the infrastructure for widely distributed data analysis at multi-PetaByte scales by 100s to 1000s of physicists
PPDG First Year Deliverable: Site-to-Site Replication Service
- Network Protocols Tuned for High Throughput
  - Use of DiffServ for (1) predictable high-priority delivery of high-bandwidth data streams and (2) reliable background transfers
  - Use of integrated instrumentation to detect/diagnose/correct problems in long-lived high speed transfers [NetLogger + DoE/NGI developments]
  - Coordinated reservation/allocation techniques for storage-to-storage performance
- First Year Goal: optimized cached read access to 1-10 GBytes, drawn from a total data set of order one PetaByte

[Diagram: PRIMARY SITE (Data Acquisition; CPU, Disk, Tape Robot) replicating to SECONDARY SITE (CPU, Disk, Tape Robot)]
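The core of "optimized cached read access" is a local file cache in front of remote fetches from the primary site. The LRU policy and sizes below are an illustrative sketch, not the PPDG design:

```python
# Sketch of cached read access for site-to-site replication: a
# simple LRU file cache in front of fetches from the primary site.
# Capacity, eviction policy and naming are invented for illustration.
from collections import OrderedDict

class FileCache:
    def __init__(self, capacity_gb):
        self.capacity = capacity_gb
        self.files = OrderedDict()          # name -> size_gb
        self.hits = self.misses = 0

    def read(self, name, size_gb):
        if name in self.files:
            self.files.move_to_end(name)    # mark as recently used
            self.hits += 1
            return "cache"
        self.misses += 1                    # fetch from primary site
        while self.files and sum(self.files.values()) + size_gb > self.capacity:
            self.files.popitem(last=False)  # evict least recently used
        self.files[name] = size_gb
        return "remote"

c = FileCache(10)
first = c.read("aod.001", 4)    # "remote": fetched and cached
again = c.read("aod.001", 4)    # "cache": served locally
```

The interesting engineering is in what this sketch omits: coordinating reservations so the "remote" path itself runs at the tuned, high-throughput rates listed above.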
PPDG Multi-site Cached File Access System
[Diagram: PRIMARY SITE (Data Acquisition; Tape, CPU, Disk, Robot) serving cached files to several Satellite Sites (Tape, CPU, Disk, Robot) and Universities (CPU, Disk, Users)]
First Year PPDG "System" Components
Middleware Components (Initial Choice): see PPDG Proposal Page 15
- Object- and File-Based Application Services: Objectivity/DB (SLAC enhanced); GC Query Object, Event Iterator, Query Monitor; FNAL SAM System
- Resource Management: start with Human Intervention (but begin to deploy resource discovery & management tools)
- File Access Service: components of OOFS (SLAC)
- Cache Manager: GC Cache Manager (LBNL)
- Mass Storage Manager: HPSS, Enstore, OSM (site-dependent)
- Matchmaking Service: Condor (U. Wisconsin)
- File Replication Index: MCAT (SDSC)
- Transfer Cost Estimation Service: Globus (ANL)
- File Fetching Service: components of OOFS
- File Mover(s): SRB (SDSC); site specific
- End-to-end Network Services: Globus tools for QoS reservation
- Security and authentication: Globus (ANL)
PPDG Middleware Architecture for Reliable High-Speed Data Delivery
Object-based and File-based Application Services
Cache Manager
File Access Service
Matchmaking Service
Cost Estimation
File Fetching Service
File Replication Index
End-to-End Network Services
Mass Storage Manager
Resource Management
File Movers (across the Site Boundary / Security Domain)
PPDG Developments in 2000-2001
Co-Scheduling algorithms
Matchmaking and Prioritization: Dual-Metric Prioritization; Policy and Marginal Utility
DiffServ on Networks to segregate tasks: Performance Classes
Transaction Management: Cost Estimators; Application/TM Interaction; Checkpoint/Rollback
Autonomous Agent Hierarchy
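One way to picture dual-metric prioritization in matchmaking is a weighted score over two metrics per candidate site, with the weights acting as policy knobs. The metric names, weights, and site data below are assumptions for illustration, not the PPDG algorithms.

```python
# Hypothetical "dual-metric" matchmaking sketch: rank candidate sites by a
# policy-weighted combination of estimated transfer cost and current load.
def rank_sites(sites, cost_weight=0.7, load_weight=0.3):
    """sites: list of dicts with 'name', 'cost', 'load', each metric in [0, 1].
    Lower combined score = better match under the current policy."""
    def score(s):
        return cost_weight * s["cost"] + load_weight * s["load"]
    return sorted(sites, key=score)

candidates = [
    {"name": "cern",    "cost": 0.9, "load": 0.2},
    {"name": "fnal",    "cost": 0.3, "load": 0.8},
    {"name": "caltech", "cost": 0.4, "load": 0.3},
]
best = rank_sites(candidates)[0]["name"]   # -> "caltech"
```

Shifting the weights expresses policy: with `cost_weight=0`, only current load matters and the lightly loaded site wins regardless of transfer cost.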
Beyond Traditional Architectures: Mobile Agents (Java Aglets)
“Agents are objects with rules and legs” -- D. Taylor
Mobile Agents: Reactive, Autonomous, Goal-Driven, Adaptive
Execute asynchronously
Reduce network load: local conversations
Overcome network latency; some outages
Adaptive: robust, fault tolerant
Naturally heterogeneous
Extensible concept: Agent Hierarchies
[Diagram: an Application Service communicating with a hierarchy of Agents]
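The mobile-agent idea above can be made concrete with a toy sketch: an agent carries its goal and accumulated state, "migrates" from host to host along an itinerary, converses locally at each host, and returns a summary. This is a conceptual illustration in Python, not the Java Aglets API; the host catalogs and names are invented.

```python
# Toy mobile-agent sketch: the agent carries its goal, visits each host,
# queries the host's catalog locally (no chatty remote calls), and
# accumulates results as it goes. All names here are hypothetical.
class MonitorAgent:
    def __init__(self, goal_files):
        self.goal = set(goal_files)   # what the agent is looking for
        self.found = {}               # state the agent carries along

    def visit(self, host):
        # "Local conversation": inspect the host's catalog in place.
        for f in self.goal & set(host["catalog"]):
            self.found[f] = host["name"]
        return self

hosts = [
    {"name": "tier1", "catalog": ["a.db", "b.db"]},
    {"name": "tier2", "catalog": ["c.db"]},
]
agent = MonitorAgent(["a.db", "c.db"])
for h in hosts:                       # itinerary: agent migrates host to host
    agent.visit(h)
# agent.found -> {"a.db": "tier1", "c.db": "tier2"}
```

Because the agent carries its own state, a network outage between visits delays it but does not lose work, which is the fault-tolerance point made above.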
Distributed Data Delivery and LHC Software Architecture
LHC Software and/or Analysis Process must account for data and resource-related realities
Delay for data location, queueing, scheduling; sometimes for transport and reassembly
Allow for long transaction times, performance shifts, errors, out-of-order arrival of data
Software Architectural Choices
Traditional, single-threaded applications: allow for data arrival and reassembly, OR
Performance-Oriented (Complex): I/O requests up-front; multi-threaded; data driven; respond to ensemble of (changing) cost estimates; possible code movement as well as data movement
Loosely coupled, dynamic: e.g. agent-based implementation
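The "performance-oriented" choice above can be sketched as: issue all I/O requests up front on worker threads, then process results data-driven, in whatever order they arrive. The request names and simulated delays are invented for illustration.

```python
# Sketch of up-front, multi-threaded, data-driven I/O: requests are issued
# immediately; processing consumes results in arrival order, which may
# differ from request order (out-of-order arrival is tolerated by design).
import queue, random, threading, time

def fetch(name, results):
    time.sleep(random.uniform(0.01, 0.05))   # simulated transfer delay
    results.put((name, f"payload:{name}"))   # results may arrive out of order

def analyze(requests):
    results = queue.Queue()
    for name in requests:                    # all I/O requests issued up front
        threading.Thread(target=fetch, args=(name, results)).start()
    processed = []
    for _ in requests:                       # data-driven: consume as data lands
        name, _payload = results.get()
        processed.append(name)
    return processed

done = analyze(["evt1", "evt2", "evt3"])     # order may differ run to run
```

A single-threaded variant would instead block on each fetch in turn, paying the full latency of every transfer sequentially.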
GriPhyN: First Production-Scale “Grid Physics Network”
Develop a New Form of Integrated Distributed System, while Meeting Primary Goals of the US LIGO and LHC Programs
Single Unified Grid System Concept; Hierarchical Structure
(Sub-)Implementations for LIGO, SDSS, US CMS, US ATLAS
~20 Centers: a few each in the US for LIGO, CMS, ATLAS, SDSS
Aspects complementary to centralized DoE funding
University-based regional Tier2 Centers, partnering with the Tier1 Centers
Emphasis on training, mentoring and remote collaboration
Making the process of search and discovery accessible to students
GriPhyN Web Site: http://www.phys.ufl.edu/~avery/mre/
White Paper: http://www.phys.ufl.edu/~avery/mre/white_paper.html
APOGEE/GriPhyN Data Grid Implementation
An Integrated Distributed System of Tier1 and Tier2 Centers
Flexible, relatively low-cost (PC-based) Tier2 architectures
Medium-scale (for the LHC era) data storage and I/O capability
Well-adapted to local operation; modest system engineer support
Meet changing local and regional needs in the active, early phases of data analysis
Interlinked with Gbps network links: Internet2 and regional nets, circa 2001-2005
State-of-the-art QoS techniques to prioritise and shape traffic, and to manage bandwidth; preview transoceanic bandwidth within the US
A working production prototype (2001-2003) for Petabyte-scale distributed computing models
Focus on LIGO (+ BaBar and Run2) handling of real data, and LHC Mock Data Challenges with simulated data
Meet the needs, and learn from system performance under stress
VRVS: From Videoconferencing to Collaborative Environments
> 1400 registered hosts, 22 reflectors, 34 countries; running in the U.S., Europe and Asia
Switzerland: CERN (2); Italy: CNAF Bologna; UK: Rutherford Lab; France: IN2P3 Lyon, Marseilles; Germany: Heidelberg Univ.; Finland: FUNET; Spain: IFCA-Univ. Cantabria; Russia: Moscow State Univ., Tver Univ.
U.S.: Caltech, LBNL, SLAC, FNAL, ANL, BNL, Jefferson Lab, DoE HQ Germantown
Asia: Academia Sinica, Taiwan; South America: CeCalcula, Venezuela
Role of Simulation for Distributed Systems
Simulations are widely recognized and used as essential tools for the design, performance evaluation and optimisation of complex distributed systems
From battlefields to agriculture; from the factory floor to telecommunications systems
Discrete event simulations with an appropriate, high level of abstraction are powerful tools
“Time” intervals, interrupts and performance/load characteristics are the essentials
Not yet an integral part of the HENP culture, but some experience in trigger, DAQ and tightly coupled computing systems: CERN CS2 models
Simulation is a vital part of the study of site architectures, network behavior, and data access/processing/delivery strategies, for HENP Grid design and optimization
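The discrete-event idea described above (timestamped events popped in time order, with performance/load characteristics as the essentials) can be shown in a minimal kernel. The workload here, job arrivals at a single FIFO server, is a made-up example, not one of the CERN CS2 models.

```python
# Minimal discrete-event simulation kernel: events carry a timestamp and a
# kind; a heap pops them in time order. Models one FIFO server with a
# fixed service time and reports finish time and peak queue occupancy.
import heapq

def simulate(arrivals, service_time):
    """arrivals: job arrival times. Returns (finish_time, max_in_system)."""
    events = [(t, "arrive") for t in arrivals]
    heapq.heapify(events)
    in_system, max_in_system, busy_until = 0, 0, 0
    while events:
        clock, kind = heapq.heappop(events)       # advance simulated time
        if kind == "arrive":
            in_system += 1
            max_in_system = max(max_in_system, in_system)
            start = max(clock, busy_until)         # wait if server is busy
            busy_until = start + service_time
            heapq.heappush(events, (busy_until, "depart"))
        else:
            in_system -= 1
    return busy_until, max_in_system

# Three jobs arriving at t=0,1,2, each needing 2 time units of service:
# the last finishes at t=6, and at worst 3 jobs are in the system at once.
result = simulate([0, 1, 2], 2)   # -> (6, 3)
```

The same skeleton extends naturally to network links, tape drives, and caches by adding event kinds and per-resource `busy_until` state.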
Monitoring Architecture: Use of NetLogger as in CLIPPER
End-to-end monitoring of grid assets [*] is needed to:
Resolve network throughput problems
Dynamically schedule resources
Add precision-timed event monitor agents to: ATM switches; DPSS servers; testbed computational resources
Produce trend analysis modules for monitor agents
Make results available to applications
[*] See talk by B. Tierney
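The monitoring pipeline above, precision-timestamped events feeding a trend-analysis module, can be sketched as follows. The record format and trend rule here are invented stand-ins, not NetLogger's actual log format or API.

```python
# Sketch of timestamped event monitoring with a crude trend module.
# The event names, record layout, and trend heuristic are hypothetical.
import time

class EventMonitor:
    def __init__(self):
        self.events = []

    def log(self, event, value):
        # Precision timestamp attached at the moment of observation.
        self.events.append({"ts": time.time(), "event": event, "value": value})

    def trend(self, event):
        """Crude trend: last observed value minus the first."""
        vals = [e["value"] for e in self.events if e["event"] == event]
        return vals[-1] - vals[0] if len(vals) > 1 else 0.0

mon = EventMonitor()
for mbps in (85.0, 62.0, 40.0):          # falling throughput samples
    mon.log("net.throughput", mbps)
degradation = mon.trend("net.throughput")   # -> -45.0: throughput degrading
```

A negative trend on a throughput series is exactly the kind of result a scheduler could consume to re-route transfers or re-rank sites.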
Summary
The HENP/LHC Data Analysis Problem: worldwide-distributed Petabyte-scale compacted binary data, and computing resources
Development of a robust networked data access and analysis system is mission-critical
An aggressive R&D program is required to develop systems for reliable data access, processing and analysis across an ensemble of networks
An effective inter-field partnership is now developing through many R&D projects
HENP analysis is now one of the driving forces for the development of “Data Grids”
Solutions to this problem could be widely applicable in other scientific fields and industry, by LHC startup
LHC Computing: Upcoming Issues
Cost of Computing at CERN for the LHC Program: may exceed 100 MCHF at CERN; correspondingly more in total
Some ATLAS/CMS basic numbers (CPU, 100 kB reconstructed event) date from 1996; they require further study
We cannot “scale up” from previous generations (new methods are needed)
CERN/Outside sharing: MONARC and CMS/SCB use the 1/3 : 2/3 “rule”
Computing architecture and cost evaluation: integration and “Total Cost of Ownership”; possible role of central I/O servers
Manpower estimates: CERN versus scaled Regional Centre estimates; scope of services and support provided
Limits of CERN support and service, and the need for Regional Centres
Understanding that LHC Computing is “different”: a different scale, and worldwide distributed computing for the first time
Continuing system R&D is required