analysis software benchmark

ROOT analysis and implications for the analysis model in ATLAS

Akira Shibata, New York University @ ACAT 08 in Erice

Nov 05, 2008


Are we ready to face data from LHC collisions? Grid computing? Do we have enough CPU? Tape? Disks? RAM? Do we need T1? T2? T3? AF? Do we need backdoor access? Are the machines maintained? Is it scary? Are they online? Do we have enough bandwidth? Can we copy data across the world? Can we reach the data we need? Can we reduce the data size? ESD? AOD? D1PD? D2PD? D3PD? Can we download them? Do we need interactive access? How do we write an analysis? How fast do they run? Do we need to buy more disk? How big is my ntuple? Do we need to buy more CPU? Disks? RAM? Are we up to date? Do I look cool if I buy a Mac? Is a virtual machine useful? Why do we use ROOT? What is PROOF? Is Python fast enough? Is it easy to code? How often will I need to process my data? How fast will my analysis run? What can I do to get faster? What are the options? What is the future technology?




Analysis in the Era of Grid Computing

Tiered computing model. A leveled approach is needed to optimize the system. Above all, how well does it work from the physicists' point of view?

[Figure: data flow across the tiers (T1, T2, T3, Desktop). Formats shown: ESD, AOD, D1PD, D2PD, D3PD, user ntuple, histograms, grouped as ROOT + POOL and ROOT native. Processing stages shown: central AOD/DPD making, Grid analysis / DPD making, ROOT / ARA analysis at institute, local ROOT analysis. Rough size estimates shown: ~500 kB/evt, ~100 kB/evt, 10-50 kB/evt, 1-10 kB/evt, ~1 kB/evt, 30-80 kB/evt.]





Derived Physics Data

• DPDs are created using the following operations:

  • Skimming: selecting the events one needs.

  • Thinning: selecting the objects one needs.

  • Slimming: removing information from objects.

• ESDs hold the full information from reconstruction. AODs and DnPDs are derived from them, with n indicating an increasing level of derivation.

• The primary purpose of the D1PD is to give access to parts of the ESD information that are otherwise difficult to get to.

• D1PDs and D2PDs are in POOL format. D3PD refers to any DPD in ntuple format.

• ESD, AOD and D1PD contents are defined by groups. Several types of D1PD are defined by performance groups. D2PD and D3PD are defined by users.

• First-level analysis (variable calculation, object reconstruction etc.) may be done when D2/3PDs are created.
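
The three DPD operations can be illustrated with a minimal sketch in plain Python. The event layout (a dict holding a list of electron dicts), the field names and the 20 GeV cut are all hypothetical and chosen only for illustration; this is not how Athena implements DPD making.

```python
# Hypothetical event layout: each event is a dict holding a list of electron dicts.

def skim(events, keep_event):
    """Skimming: keep only the events that pass keep_event."""
    return [ev for ev in events if keep_event(ev)]

def thin(events, container, keep_object):
    """Thinning: within each event, keep only the objects that pass keep_object."""
    return [{**ev, container: [o for o in ev[container] if keep_object(o)]}
            for ev in events]

def slim(events, container, keep_fields):
    """Slimming: drop every object attribute not named in keep_fields."""
    return [{**ev, container: [{k: o[k] for k in keep_fields if k in o}
                               for o in ev[container]]}
            for ev in events]

events = [
    {"electrons": [{"pt": 45.0, "eta": 0.1, "cluster_cells": [1, 2, 3]},
                   {"pt": 5.0, "eta": 1.2, "cluster_cells": [4]}]},
    {"electrons": []},
]

dpd = skim(events, lambda ev: len(ev["electrons"]) > 0)   # drop empty events
dpd = thin(dpd, "electrons", lambda el: el["pt"] > 20.0)  # drop soft electrons
dpd = slim(dpd, "electrons", ("pt", "eta"))               # drop cell-level detail
```

Each step shrinks a different dimension of the data: the number of events, the number of objects per event, and the amount of information per object.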





Motivation for Profiling ROOT Analysis

• The primary use of the Grid is event reconstruction, storage and production of reduced data. This is done using the ATLAS software, Athena. Some analysis happens here too.

• However, post-Grid (non-Athena) ROOT analysis is the main stage for physics analysis.

• How to run it is mostly a user-level decision due to the private nature of physics analysis, but:

  • the situation is becoming more complex due to the availability of new technology;

  • no good summary exists comparing the available options;

  • it is an important ingredient for an efficient analysis model;

  • it is needed for estimating resource requirements.

• Technical discussions do not always answer practical questions. This study benchmarks analysis “modes” in realistic settings based on wall-time measurements.





“Flat” vs POOL Persistency

• Much of the complexity in the current situation is due to the POOL technology (an additional layer on top of the ROOT persistency technology) used in ATLAS. POOL supports:

  • Metadata lookup, used by TAG to access events in a large file without having to read the full contents.

  • More flexibility in writing out complex objects. It has its own way of handling T/P (transient/persistent) separation and schema evolution.

• When the decision was made, ROOT persistency was not as good as it is now:

  • Problems writing out STL objects.

  • Problems referring to objects in different trees/files.

• ROOT persistency has improved and now has fewer issues.

• ARA enables reading POOL objects from ROOT by calling POOL converters on demand (P->T conversion). This takes extra read time.





Summary of Existing Analysis Modes

Mode                  Draw         CINT         ACLiC     PyRoot       g++       Athena
Ntuple                ◎◎           ◎◎           ◎◎        ◎◎           ◎◎        ◎
POOL                  ◎            ◎            ╳         ◎◎           ◎◎        ◎◎◎
Compiled/Interpreted  Interpreted  Interpreted  Compiled  Interpreted  Compiled  Both
Interactive           ◎◎           ◎◎           ╳         ◎◎           ╳         ◎
Standard dev env      -            -            ╳         -            ◎◎        ◎◎
Athena components     ╳            ╳            ╳         ◎            ◎         ◎◎◎

Language (values as listed): C++, Python, (C++), --, C++, Python, C++, C++/Python

Additional packages (values as listed): -, MakeClass, MakeSelector, SPyroot, SFrame, AMA, -

The most common options were implemented. All code is available in ATLAS CVS: users/ashibata/RootBenchmark





Benchmark Analysis Contents

• A simple Zee reconstruction analysis was implemented for every mode:

  1. Access the electron container (POOL) / electron kinematics branches (Ntuple).

  2. Select electrons using isEM, pT and charge.

  3. Fill histograms with electron kinematics (pT and multiplicity).

  4. Combine electrons to reconstruct the Z.

  5. Fill a histogram with the Z mass.

  6. Write the histograms out in finalize.

• The above was repeated 10 times.

• Not complex enough for a real analysis, but not entirely trivial.

• For Draw, electrons are plotted after the isEM/pT/charge selection. No four-vector arithmetic.
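
The selection and Z reconstruction steps can be sketched in plain Python as follows. This is an illustrative sketch, not the benchmark code from CVS: the dict-based electron record, the field names, the 20 GeV threshold and the convention that `is_em == 0` means "passed identification" are all assumptions, and electrons are treated as massless.

```python
import math
from itertools import combinations

PT_MIN_GEV = 20.0  # hypothetical threshold; the slides do not state the cut value

def select_electrons(electrons):
    """Keep electrons passing the quality flag and the pT cut.

    Assumption: is_em == 0 means the electron passed identification.
    """
    return [e for e in electrons if e["is_em"] == 0 and e["pt"] > PT_MIN_GEV]

def pair_mass(e1, e2):
    """Invariant mass of two electrons, treating both as massless."""
    m2 = 2.0 * e1["pt"] * e2["pt"] * (
        math.cosh(e1["eta"] - e2["eta"]) - math.cos(e1["phi"] - e2["phi"])
    )
    return math.sqrt(max(m2, 0.0))

def z_candidates(electrons):
    """Masses of all opposite-charge pairs among the selected electrons."""
    selected = select_electrons(electrons)
    return [pair_mass(a, b) for a, b in combinations(selected, 2)
            if a["charge"] * b["charge"] < 0]

# Toy event: two back-to-back electrons plus one soft electron.
event = [
    {"pt": 45.6, "eta": 0.0, "phi": 0.0, "charge": +1, "is_em": 0},
    {"pt": 45.6, "eta": 0.0, "phi": math.pi, "charge": -1, "is_em": 0},
    {"pt": 8.0, "eta": 1.0, "phi": 1.0, "charge": -1, "is_em": 0},  # fails the pT cut
]
masses = z_candidates(event)
```

In a real mode the masses would be filled into a histogram per event; here they are simply collected in a list.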





Obtaining Reliable Results

• Use POSIX measurements as much as possible. Wall time is taken from the time module.

• Avoid the somewhat unstable measurements from TStopwatch.

• Measurements are affected by other activities on the machine. This is overcome by taking multiple measurements.

• Machine: Acas (BNL) node with normal load, 3.34 GB memory, 2-core Xeon @ 2.00 GHz, data on NFS.

• Disk cache leads to misleading results. CPU time = wall time once the data is in memory.

• Force disk reads by flushing RAM. Do not re-read a file until all other files have been read. Alternate between AOD and ntuple analyses.
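
A repeated wall-time measurement with Python's `time` module might look like the sketch below. The function name `measure_wall_time`, the `clock` parameter and the toy job are hypothetical; the sketch does not flush the disk cache, which has to be handled outside the timing loop as described above.

```python
import statistics
import time

def measure_wall_time(job, repeats=5, clock=time.time):
    """Run job() `repeats` times (at least 2) and return the mean and
    sample standard deviation of the wall time in seconds."""
    samples = []
    for _ in range(repeats):
        start = clock()
        job()
        samples.append(clock() - start)
    return statistics.mean(samples), statistics.stdev(samples)

def toy_job():
    # Stand-in for one analysis run over a fixed number of events.
    sum(i * i for i in range(100_000))

mean_s, std_s = measure_wall_time(toy_job, repeats=3)
```

Passing the clock in as a parameter keeps the timing logic separate from the job, so the same helper can wrap any of the analysis modes launched as a function call or subprocess.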





Methodology

[Figure: wall time (s) vs. number of events (0 to 50,000) for AOD input, with a straight-line fit per mode. Fitted values: AODgpp (init 66.4 s, rate 535 Hz); SFrame (init 36.2 s, rate 315 Hz); Draw (init 46.2 s, rate 125 Hz); PyAthena (init 27.4 s, rate 96.5 Hz); Athena (init 30.8 s, rate 68.6 Hz); CINT (init 52.5 s, rate 18.5 Hz); PyRoot (init 2.50 s, rate 12.4 Hz).]

1. Measure the time taken to process an increasing number of events.

2. Repeat the measurements and take the average for each point.

3. Fit a straight line to obtain the overhead (offset) and rate (evt/sec).

4. Calculate errors from the standard deviation.

Only the rate is used in comparing the modes. The overhead varies from a fraction of a second to tens of seconds.
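
The straight-line fit in step 3 can be sketched with an ordinary least-squares fit of t = overhead + n / rate. The function name is hypothetical and the fit is unweighted, which may differ from the fit actually used for the plots. The synthetic points are generated from the AODgpp values quoted in the figure legend (66.4 s, 535 Hz).

```python
def fit_overhead_and_rate(n_events, wall_times):
    """Unweighted least-squares fit of t = overhead + n / rate.

    Returns (overhead in seconds, rate in events per second).
    """
    n = len(n_events)
    mean_x = sum(n_events) / n
    mean_y = sum(wall_times) / n
    sxx = sum((x - mean_x) ** 2 for x in n_events)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(n_events, wall_times))
    slope = sxy / sxx                  # seconds per event
    overhead = mean_y - slope * mean_x
    return overhead, 1.0 / slope

# Synthetic points generated from a 66.4 s overhead and a 535 Hz rate.
points = [10_000, 20_000, 30_000, 40_000, 50_000]
times = [66.4 + n / 535.0 for n in points]
overhead_s, rate_hz = fit_overhead_and_rate(points, times)
```

The slope is the time per event, so its inverse is the rate; the intercept absorbs initialisation costs and is therefore excluded from the mode comparison.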





Data and Format

Contents                                   POOL                         Ntuple
Full contents                              AOD, 144.22 kB/evt           CBNT? (not tried)
DPD contents (Trigger/Jets/Leptons etc.)   TopD1PD, 31.42 kB/evt        TopD3PD, 4.87 kB/evt
Small DPD contents (Tracks + Electrons)    SmallD2PD, 18.74 kB/evt      SmallD3PD, 0.71 kB/evt
Very small DPD (Electrons)                 VerySmallD2PD, 1.06 kB/evt   VerySmallD3PD, 0.37 kB/evt

All samples are derived from FDR2 AODs and all were produced on PANDA (except AOD and D1PD). There are around 10,000 events per file. The total sample size for one data type ranges from 1 GB to 100 GB. This is a use-case driven comparison: input file sizes are different.





AOD Analysis Results

[Figure: execution rate (Hz) per mode for AOD input, shown as mode (rate, error): PyRoot (17 Hz, 18%); TSelector (19 Hz, 2%); CINT (21 Hz, 15%); PyAthena (95 Hz, 11%); Athena (98 Hz, 8%); Draw (138 Hz, 35%); SFrame (321 Hz, 13%); gpp (535 Hz, 3%).]

• Seems to be reading all containers in the files.

• Only a small difference between C++ and Python in Athena.

• Compiled non-framework analysis is the fastest.

• CINT is by far the slowest.





D1PD Level Comparison

[Figure: execution rate (Hz) per mode in two panels, shown as mode (rate, error).
Top D1PD input: CINT (26 Hz, 6%); PyRoot (43 Hz, 9%); Draw (298 Hz, 55%); Athena (313 Hz, 6%); SFrame (721 Hz, 17%); gpp (1130 Hz, 15%).
Top D3PD input: CINT (32 Hz, 2%); PyRoot (300 Hz, 21%); Athena (838 Hz, 1%); Draw (2343 Hz, 15%); SFrame (9453 Hz, 19%); TSelector_ACLiC (18551 Hz, 18%); gpp (45869 Hz, 21%); ACLiC (48494 Hz, 20%); ACLiC_Opt (58719 Hz, 16%).
Ntuple/POOL rate ratios annotated on the plot: 7.9, 13.1, 40.6, 2.7, 7.1, 1.2, 1.2, 1.8.]

An order of magnitude advantage for using ntuple for g++ analysis. Much less difference with non-compiled modes.
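
The Ntuple/POOL ratio is simply the D3PD rate divided by the D1PD rate for the same mode. A minimal sketch, using the rounded rates printed for the Top D1PD and Top D3PD panels and only the modes that appear in both:

```python
# Rates in Hz as printed for the Top D1PD (POOL) and Top D3PD (ntuple) panels.
pool_hz = {"CINT": 26, "PyRoot": 43, "Draw": 298,
           "Athena": 313, "SFrame": 721, "gpp": 1130}
ntuple_hz = {"CINT": 32, "PyRoot": 300, "Draw": 2343,
             "Athena": 838, "SFrame": 9453, "gpp": 45869}

# Ratio of ntuple rate to POOL rate for each mode present in both panels.
ratios = {mode: ntuple_hz[mode] / pool_hz[mode] for mode in pool_hz}

for mode, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{mode:8s} Ntuple/POOL = {ratio:.1f}")
```

Because the inputs are rounded rates, individual ratios may differ slightly in the last digit from the values annotated on the slide.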





D2PD Level Comparison

[Figure: execution rate (Hz) per mode in two panels, shown as mode (rate, error).
Small D2PD input: CINT (29 Hz, 4%); PyRoot (100 Hz, 10%); Draw (300 Hz, 29%); Athena (596 Hz, 5%); gpp (2132 Hz, 6%).
Small D3PD input: CINT (32 Hz, 1%); PyRoot (382 Hz, 22%); Athena (855 Hz, 3%); Draw (6358 Hz, 17%); SFrame (14597 Hz, 26%); gpp (71003 Hz, 7%).
Ntuple/POOL rate ratios annotated on the plot: 8.7, 1.1, 33.3, 21.2, 1.4, 3.8, 1.1, 1.7.]

POOL analysis is faster than with AOD input by x4. Larger difference between Athena and PyAthena with smaller input files. Why?





Very Small Input Comparison

[Figure: execution rate (Hz) per mode in two panels, shown as mode (rate, error).
Very Small D2PD input: CINT (31 Hz, 0%); Draw (294 Hz, 47%); PyRoot (416 Hz, 19%); Athena (667 Hz, 8%); gpp (2798 Hz, 5%).
Very Small D3PD input: CINT (32 Hz, 1%); PyRoot (331 Hz, 25%); Athena (854 Hz, 5%); Draw (6777 Hz, 16%); SFrame (13751 Hz, 28%); gpp (48516 Hz, 17%).
Ntuple/POOL rate ratios annotated on the plot: 5.5, 1.0, 17.3, 23.0, 1.3, 1.1, 0.8.]

D2PD nearing D3PD even more. A few thousand Hz possible with ARA. Ntuple mode still a factor of 5-10 faster in C++ modes.





I/O Dependency Comparison

[Figure: event size × execution rate (kB/s, log scale) vs. event size (kB, 0 to 160) for POOL analysis. Modes shown: Athena, PyRoot, PyAthena, Draw, gpp, CINT, SFrame, TSelector.]

[Figure: event size × execution rate (kB/s, log scale) vs. event size (kB, 0 to 5) for ntuple analysis. Modes shown: ACLiC, gpp, PyAthena, TSelector, ACLiC_Opt, CINT, TSelector_ACLiC, Athena, PyRoot, SFrame, Draw.]

Clear I/O constraint above 20 kB/evt in POOL analysis, coming from file size, NOT read-out size. Ntuples are usually smaller than 20 kB/evt.
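
The quantity on the vertical axis of these plots, event size × execution rate, can be computed directly from numbers quoted elsewhere in the slides. The sketch below pairs the POOL event sizes from the data-format table with the gpp rates from the result plots; treating that product as an effective file throughput is the interpretation suggested by the slide, not a separate measurement.

```python
# (event size in kB/evt, gpp rate in Hz) for each POOL input sample.
samples = {
    "AOD":           (144.22, 535),
    "TopD1PD":       (31.42, 1130),
    "SmallD2PD":     (18.74, 2132),
    "VerySmallD2PD": (1.06, 2798),
}

# Effective throughput: how much file content is traversed per second.
throughput_kb_s = {name: size_kb * rate_hz
                   for name, (size_kb, rate_hz) in samples.items()}

for name, value in throughput_kb_s.items():
    print(f"{name:14s} {value:10.0f} kB/s")
```

The same calculation applies to any mode: substitute that mode's rates for the gpp rates.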





Summary

• Very clear performance advantage for the ROOT native ntuple format: an order of magnitude difference. Ballpark figure: thousands of events/s vs. hundreds of events/s. These numbers should be taken as upper limits; real analyses would be more complex.

• Compiled mode is ~two orders of magnitude faster than non-compiled options.

• Use of frameworks, even quite simple ones, can slow things down, though any realistic analysis would require some infrastructure. Choose/write frameworks wisely!

• With Athena, the framework overhead seems large, though typical DPD jobs can be highly CPU intensive.

• File caching by the system ties the execution rate to the input file size (regardless of the actual read-out). Above 20 kB/evt, the analysis is bound by this effect. This is a very tight slimming/thinning requirement for D1/2PD. It may be possible to improve this with high-performance disk.





Acknowledgement

I have bothered a lot of people with this project, including (in random order): Scott Snyder, Wim Lavrijsen, Sebastien Binet, Emil Obrekov, David Quarrie, Kyle Cranmer, David Adams, Sven Menke, Shuwei Ye, Sergey Panitkin, Stephanie Majeski, Hong Ma, Tadashi Maeno, Attila Krasznahorkay, Jim Cochran, roottalk, Paolo Califiura

Many thanks.
