Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC


Page 1:

Resource Predictors in HEP Applications

John Huth, Harvard
Sebastian Grinstein, Harvard
Peter Hurst, Harvard
Jennifer M. Schopf, ANL/NeSC

Page 2:

The Problem

• Large data sets get recreated, and scientists want to know if they should:
– Fetch a copy of the data
– Recreate it locally

• This problem can be considered in the context of a virtual data system that tracks how data is created so recreation is feasible

Page 3:

To make this decision you need

• 1) Estimate of time to recreate data
– Info about data provenance, machine types, etc.

• 2) Estimate of data transfer time

• 3) Framework to allow you to take advantage of these choices by adapting the workflow accordingly

Page 4:

To make this decision you need

• 1) Estimate of time to recreate data
– Info about data provenance, machine types, etc.

• 2) Estimate of data transfer time

• 3) Framework to allow you to take advantage of these choices by adapting the workflow accordingly
– OUR AREA OF CONCENTRATION
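The decision the three ingredients feed into reduces to a simple comparison. A minimal sketch, in which the predictor functions and all numbers are illustrative stand-ins rather than part of the actual framework:

```python
# Hypothetical sketch of the core decision: regenerate locally or fetch a copy.
# estimate_runtime() and estimate_transfer() stand in for the predictors
# described in the slides; names and numbers are illustrative.

def estimate_runtime(n_events, secs_per_event):
    """End-to-end runtime estimate from input parameters (e.g. event count)."""
    return n_events * secs_per_event

def estimate_transfer(file_size_mb, bandwidth_mb_per_s):
    """Transfer-time estimate from average historical bandwidth."""
    return file_size_mb / bandwidth_mb_per_s

def choose_action(n_events, secs_per_event, file_size_mb, bandwidth_mb_per_s):
    """Pick whichever option the predictors say is cheaper."""
    t_regen = estimate_runtime(n_events, secs_per_event)
    t_xfer = estimate_transfer(file_size_mb, bandwidth_mb_per_s)
    return "transfer" if t_xfer < t_regen else "regenerate"

# Example: 1000 events at 1.2 s/event vs a 100 MB file at 6.7 MB/s
print(choose_action(1000, 1.2, 100, 6.7))  # transfer is far cheaper here
```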

Page 5:

Regeneration Time Estimates

• Previous work (CHEP 2004, “Resource Predictors in HEP Applications”)

• Estimate runtime of ATLAS application
– End-to-end estimation since no low-level application model is available
– Used data about input parameters (number of events, versioning, debug on/off, etc.) and benchmark data (using nbench)

• Estimates are accurate to within 10% for event generation and reconstruction, and within 25% for event simulation

Page 6:

Regeneration Time Estimate Accuracy

Page 7:

File Transfer Time Estimates

• Much previous work (e.g. Vazhkudai and Schopf, IJHPCA Vol. 17, No. 3, August 2003)

• We use simple end-to-end history data from GridFTP logs to estimate behavior
– Simple approach works well on our networks/machines
– Average bandwidth used with no file-size filtering
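The average-bandwidth estimator described above can be sketched as follows; the history tuples are a hypothetical simplification of real GridFTP transfer records, not an actual log format:

```python
# Sketch of the unfiltered average-bandwidth predictor: average the observed
# rate over all past transfers (no file-size filtering), then divide the new
# file's size by that average. The history values are illustrative.

def average_bandwidth(history):
    """history: list of (bytes_transferred, seconds) tuples from past transfers."""
    rates = [b / s for b, s in history]
    return sum(rates) / len(rates)

def predict_transfer_time(file_bytes, history):
    """Predicted seconds to move file_bytes at the historical average rate."""
    return file_bytes / average_bandwidth(history)

# Illustrative history: three past transfers of mixed sizes
history = [(100e6, 15.0), (250e6, 36.0), (1e9, 150.0)]
t = predict_transfer_time(500e6, history)  # roughly 74 s at ~6.8 MB/s
```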

Page 8:

Testbed

• Files transferred from BNL to Harvard and from CERN to Harvard
– BNL (aftpexp01.bnl.gov): 4x 3GHz Xeon, Linux 2.4.21-37.ELsmp, 2.0GB RAM, 1.0 Gbit/s NIC
– Harvard: 2x 3.4GHz P4, Linux 2.4.20-21.EL.cernsmp, 1.5GB RAM, 1.0 Gbit/s NIC

• Typical network routes:
– Harvard – NoX – ManLan – ESnet – BNL (typical latency 7.8 ms)
– Harvard – NoX – ManLan – Chicago (Abilene) – CERN (typical latency 148 ms)

• Bottlenecks are in machines at each end (e.g. disk access)

Page 9:

Network Routing

Page 10:

Transfer Benchmarking

• Transfer files from BNL to Harvard
– 20 files each of 25MB, 50MB, 100MB, 250MB, 500MB, 1GB

• Average file transfer times are linear with file size

• Initial tests on quiet machines and network
– Transfers of 100MB files have variance of ~5%
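The linear relationship between transfer time and file size can be captured with an ordinary least-squares line fit; the benchmark numbers below are illustrative, not the measured data from these tests:

```python
# Fit the linear model t = a + b*size reported in the benchmarks, using
# ordinary least squares. Sizes match the test files; times are illustrative.

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

sizes_mb = [25, 50, 100, 250, 500, 1000]
times_s = [5.0, 8.8, 16.2, 38.5, 75.9, 150.3]  # illustrative averages

a, b = fit_line(sizes_mb, times_s)
predicted_250 = a + b * 250  # model's prediction for a 250 MB file
```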

Page 11:

Time vs File Size, BNL (quiet network)

Page 12:

Transfer Variance, BNL (100 MB files, quiet network)

Page 13:

Transfer Benchmarking

• Some data taken during “Service Challenge 3”

• Average file transfer times are still linear with file size, but have larger variance

Page 14:

Time vs File Size, BNL (busy network)

Page 15:

Transfer Variance, BNL (100 MB files, busy network)

Page 16:

But our concentration was on the framework

• Given ways to estimate application run time and file transfer time, we want to plug them into an existing framework to make better resource management decisions

• Could be implemented as a post-processor to optimize DAGs produced by Chimera

Page 17:

Workflow Optimization

• A script parses the DAG, looking for I/O, binaries

• I/O files indexed in Replica Location Service (RLS)

• Client queries database for execution parameters, bandwidths

• Script evaluates execution and transfer times, and rewrites the fastest DAG
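The optimization pass described above can be sketched as follows. The DAG nodes, RLS lookup, and predictors are simplified stand-ins (plain dicts, a set, and callables), not the real Chimera or RLS interfaces:

```python
# Hypothetical sketch of the DAG optimizer: for each output file, keep the
# cheaper of executing the job that creates it or transferring a replica
# registered in the Replica Location Service (RLS).

def optimize_dag(nodes, rls_replicas, predict_exec, predict_transfer):
    """nodes: list of dicts like {"file": name}.
    rls_replicas: set of file names with a registered remote copy."""
    plan = []
    for node in nodes:
        f = node["file"]
        if f in rls_replicas and predict_transfer(f) < predict_exec(node):
            plan.append({"file": f, "action": "transfer"})
        else:
            plan.append({"file": f, "action": "execute"})
    return plan

# Illustrative usage: only the first file has a replica, and transferring
# it (5 s) beats regenerating it (20 s); the second must be executed.
nodes = [{"file": "evts_001.root"}, {"file": "evts_002.root"}]
plan = optimize_dag(
    nodes,
    rls_replicas={"evts_001.root"},
    predict_exec=lambda node: 20.0,   # estimated job runtime (s)
    predict_transfer=lambda f: 5.0,   # estimated transfer time (s)
)
```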

Page 18:

Our Strawman Application

• ATLAS event reconstruction jobs take ~20 minutes to produce a 100 MB file

• File transfer from Boston to BNL takes ~15 s per 100 MB file

• We created simplified jobs that would have average execution times equal to the file transfer times in order to have a situation closer to the one originally hypothesized

• Likely to be more common as data access becomes more contentious, and machines/calculations speed up

Page 19:

Framework Tests

• Generate “Non-optimized” DAGs: linear chains which use a random mixture of transfers and calculations to instantiate 10, 20, or 40 files

• Operate on these DAGs with our optimizer to produce “Optimized” DAGs

• Submit both “Non-optimized” and “Optimized” DAGs and compare processing times

• For our particular strawman we expect the “Optimized” DAGs to be 25% faster than the “Non-optimized” ones

Page 20:

Framework Tests

Page 21:

Comparison of Results

Page 22:

Optimized Results

Page 23:

Summary

• Implementation works

• A 28% time savings is seen

• Works with crude bandwidth predictions
– More sophisticated predictions for dynamic situations would be helpful

• Most useful when regeneration and transfer times are similar.