
Page 1: PROOF and Condor

ACAT'03, December 2003

PROOF and Condor

Fons Rademakers

http://root.cern.ch

Page 2: PROOF and Condor


PROOF – Parallel ROOT Facility

Collaboration between core ROOT group at CERN and MIT Heavy Ion Group

Part of and based on the ROOT framework

Makes heavy use of ROOT networking and other infrastructure classes

Page 3: PROOF and Condor


Main Motivation

Design a system for the interactive analysis of very large sets of ROOT data files on a cluster of computers

The main idea is to speed up the query processing by employing parallelism

In the GRID context, this model will be extended from a local cluster to a wide area “virtual cluster”. The emphasis in that case is not so much on interactive response as on transparency

With a single query, a user can analyze a globally distributed data set and get back a “single” result

The main design goals are: Transparency, scalability, adaptability

Page 4: PROOF and Condor


Parallel Chain Analysis

[Diagram: a local PC running ROOT connects via TNetFile to a remote PROOF cluster. A master server (proof) dispatches ana.C to slave servers (proof) on node1, node2, node3 and node4; each slave reads its local *.root files through TFile, and stdout/objects flow back to the client.]

ROOT session on the local PC (built up step by step on the slide):

$ root
root [0] tree.Process("ana.C")
root [1] gROOT->Proof("remote")
root [2] chain.Process("ana.C")

Cluster configuration (proof.conf):

# proof.conf
slave node1
slave node2
slave node3
slave node4
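For concreteness, the session above could be driven by a chain built explicitly from files on the cluster. This is a minimal sketch, not taken from the slides; the tree name "T" and the file URLs are hypothetical.

// Hypothetical sketch: build a chain of files served by the cluster nodes,
// open a PROOF session and let the cluster run ana.C on the chain.
{
   gROOT->Proof("remote");                     // connect to the PROOF master
   TChain chain("T");                          // chain of trees named "T"
   chain.Add("root://node1/data/file1.root");  // hypothetical file locations
   chain.Add("root://node2/data/file2.root");
   chain.Process("ana.C");                     // ana.C is shipped to the slaves
}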

Page 5: PROOF and Condor


PROOF - Architecture

Data Access Strategies
  Local data first; also rootd, rfio, dCache, SAN/NAS

Transparency
  Input objects copied from the client
  Output objects merged and returned to the client

Scalability and Adaptability
  Vary the packet size according to specific workload, slave performance and dynamic load (see the sketch below)
  Heterogeneous servers
  Support for multi-site configurations
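To make the adaptability point concrete, the sketch below shows one simple way a master could scale the packet size with a slave's measured event rate. It is illustrative only, not the PROOF packetizer; the function name, target time and bounds are assumptions.

#include <algorithm>

// Illustrative only: aim for packets worth about one second of work, so fast
// slaves get larger packets and slow or loaded slaves get smaller ones.
long AdaptPacketSize(double eventsPerSecond, double targetSeconds = 1.0,
                     long minSize = 100, long maxSize = 100000)
{
   long size = static_cast<long>(eventsPerSecond * targetSeconds);
   return std::max(minSize, std::min(size, maxSize));   // clamp to sane bounds
}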

Page 6: PROOF and Condor


Workflow For Tree Analysis – Pull Architecture

[Diagram: the master runs a packet generator; each slave (Slave 1 ... Slave N) receives Process("ana.C"), initializes, then repeatedly calls GetNextPacket() on the master and gets back event ranges (first entry, number of entries) such as (0,100), (100,100), (200,100), (300,40), (340,100), (440,50), (490,100), (590,60). When no packets remain, each slave returns its output with SendObject(histo); the master adds the histograms, displays them, and the slaves wait for the next command.]
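As an illustration of this pull loop, the self-contained sketch below hands out event ranges from a packet generator until the data set is exhausted. The Packet and PacketGenerator types are assumptions, not PROOF classes; the 9 million event total matches the scaling numbers later in the talk, and the packet size of 100 is arbitrary.

#include <cstdio>

struct Packet { long first, num; };            // an event range: first entry, count

struct PacketGenerator {                       // lives on the master
   long fNext  = 0;
   long fTotal = 9000000;                      // total number of events
   bool Next(Packet &p, long size) {           // size may vary per slave
      if (fNext >= fTotal) return false;       // no more work
      p.first = fNext;
      p.num   = (fNext + size <= fTotal) ? size : fTotal - fNext;
      fNext  += p.num;
      return true;
   }
};

int main()
{
   PacketGenerator gen;
   Packet p;
   long processed = 0;
   while (gen.Next(p, 100)) {                  // the slave's GetNextPacket() loop
      // here the slave would call the selector's Process() for each entry in p
      processed += p.num;
   }
   printf("processed %ld events\n", processed);
   return 0;
}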

Page 7: PROOF and Condor


Data Access Strategies

Each slave gets assigned, as much as possible, packets representing data in local files

If there is no (more) local data, get remote data via rootd, rfiod or dCache (needs a good LAN, e.g. Gigabit Ethernet)

In the case of SAN/NAS, just use a round-robin strategy

Page 8: PROOF and Condor


Additional Issues

Error handling
  Death of master and/or slaves
  Ctrl-C interrupt

Authentication
  Globus, ssh, kerb5, SRP, clear passwd, uid/gid matching

Sandbox and package manager
  Remote user environment

Page 9: PROOF and Condor


Running a PROOF Job

Specify a collection of TTrees or files with objects

root [0] gROOT->Proof("cluster.cern.ch");
root [1] TDSet *set = new TDSet("TTree", "AOD");
root [2] set->AddQuery("lfn:/alice/simulation/2003-04", "V0.6*.root");
root [10] set->Print("a");
root [11] set->Process("mySelector.C");

The data set is typically returned by a DB or file catalog query, etc.; use logical filenames (“lfn:…”)
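Files can also be listed explicitly instead of being resolved from a catalog query. The two calls below are a sketch, assuming TDSet::Add accepts individual file names; the paths are hypothetical.

set->Add("lfn:/alice/simulation/2003-04/file1.root");   // hypothetical path
set->Add("lfn:/alice/simulation/2003-04/file2.root");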

Page 10: PROOF and Condor


The Selector

Basic ROOT TSelector

Created via TTree::MakeSelector()

// Abbreviated version
class TSelector : public TObject {
protected:
   TList *fInput;     // list of objects sent from the client
   TList *fOutput;    // list of objects merged and returned to the client
public:
   void   Init(TTree *);
   void   Begin(TTree *);
   void   SlaveBegin(TTree *);
   Bool_t Process(int entry);
   void   SlaveTerminate();
   void   Terminate();
};
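For illustration, a user selector derived from this interface might look like the sketch below; the class name MySelector and the histogram are hypothetical, and in practice the skeleton would be generated with TTree::MakeSelector().

#include "TSelector.h"
#include "TH1F.h"
#include "TList.h"

// Hypothetical sketch: fill one histogram on each slave and let PROOF merge
// the per-slave copies into the output list returned to the client.
class MySelector : public TSelector {
public:
   TH1F *fHist;

   void SlaveBegin(TTree *) {
      fHist = new TH1F("h_pt", "p_{T}", 100, 0., 10.);
      fOutput->Add(fHist);                      // merged and sent back to the client
   }
   Bool_t Process(int entry) {
      // read the tree entry and fill fHist here
      return kTRUE;
   }
   void Terminate() {
      // runs on the client: draw the merged histogram from the output list
      TH1F *h = (TH1F *) fOutput->FindObject("h_pt");
      if (h) h->Draw();
   }
};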

Page 11: PROOF and Condor


PROOF Scalability

32 nodes: dual 1 GHz Itanium II CPUs, 2 GB RAM, 2x75 GB 15K SCSI disks, Fast Ethernet

Each node has one copy of the data set (4 files, 277 MB in total); across 32 nodes: 8.8 GB in 128 files, 9 million events

8.8 GB, 128 files; 1 node: 325 s

32 nodes in parallel: 12 s
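That corresponds to a speedup of roughly 325 s / 12 s ≈ 27 on 32 nodes, i.e. about 85% parallel efficiency.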

Page 12: PROOF and Condor


PROOF and Data Grids

Many services are a good fit:
  Authentication
  File catalog, replication services
  Resource brokers
  Job schedulers
  Monitoring

Use abstract interfaces

Page 13: PROOF and Condor


The Condor Batch System

Full-featured batch system
  Job queuing, scheduling policy, priority scheme, resource monitoring and management

Flexible, distributed architecture
  Dedicated clusters and/or idle desktops
  Transparent I/O and file transfer

Based on 15 years of advanced research
  Platform for ongoing CS research
  Production quality, in use around the world, pools with 100s to 1000s of nodes

See: http://www.cs.wisc.edu/condor

Page 14: PROOF and Condor


COD - Computing On Demand

Active, ongoing research and development

Share the batch resource with interactive use
  Most of the time, normal Condor batch use
  An interactive job “borrows” the resource for a short time
  Integrated into the Condor infrastructure

Benefits
  Large amount of resource for interactive bursts
  Efficient use of resources (100% use)

Page 15: PROOF and Condor


COD - Operations

[Diagram: timeline of COD operations on a batch node: normal batch running, request claim, activate claim (the COD job runs on the node), suspend claim, resume, deactivate, release; the node then returns to normal batch use.]

Page 16: PROOF and Condor


PROOF and COD

Integrate PROOF and Condor COD
  Great cooperation with the Condor team

Master starts slaves as COD jobs
  Standard connection from master to slave
  Master resumes and suspends slaves as needed around queries

Use Condor or an external resource manager to allocate nodes (VMs)

Page 17: PROOF and Condor


PROOF and COD

[Diagram: a client connects to the PROOF master; the master starts PROOF slaves as COD jobs on Condor nodes that also hold batch jobs, while other Condor nodes continue to run batch work only.]

Page 18: PROOF and Condor


PROOF and COD Status

Status
  Basic implementation finished
  Successfully demonstrated at SC’03 with 45 slaves as part of PEAC

TODO
  Further improve interface between PROOF and COD
  Implement resource accounting

Page 19: PROOF and Condor


PEAC – PROOF Enabled Analysis Cluster

Complete event analysis solution
  Data catalog and data management
  Resource broker
  PROOF

Components used: SAM catalog, dCache, new global resource broker, Condor+COD, PROOF

Multiple computing sites with independent storage systems

Page 20: PROOF and Condor


PEAC System Overview

Page 21: PROOF and Condor


PEAC Status

Successful demo at SC’03
  Four sites, up to 25 nodes
  Real CDF StNtuple-based analysis
  COD tested with 45 slaves

Doing a post mortem and planning the next design and implementation phases
  Available manpower will determine the timeline
  Plan to use a 250-node cluster at FNAL
  Another cluster at UCSD

Page 22: PROOF and Condor


Conclusions

PROOF is maturing
  Lots of interest from experiments with large data sets

COD is essential to share batch and interactive work on the same cluster
  Maximizes resource utilization

PROOF turns out to be a powerful application to use and show the power of Grid middleware to its full extent

See tomorrow’s talk by Andreas Peters on PROOF and AliEn