Posted on 29-Jan-2016

TRANSCRIPT
Interactive Data Analysis with PROOF: Bleeding Edge Physics with Bleeding Edge Computing
Fons Rademakers, CERN
[Figure: DVD stack with one year of LHC data (~20 km), compared to Mt. Blanc (4.8 km)]
LHC Data Challenge
• The LHC generates:
  ■ 40 million collisions per second
• Combined, the 4 experiments record:
  ■ After filtering, ~100 interesting collisions per second
  ■ From 1 to 12 MB per collision ⇒ from 0.1 to 1.2 GB/s
  ■ 10^10 collisions registered every year
  ■ ~10 PetaBytes (10^15 bytes) per year
  ■ LHC data correspond to 20 million DVDs per year!
  ■ Computing power equivalent to 100,000 of today's PCs
  ■ Disk space equivalent to 400,000 large PC disks
[Figure labels, same scale comparison: Balloon (30 km), Airplane (10 km)]
HEP Data Analysis
• Typical HEP analysis needs a continuous algorithm refinement cycle:

  implement algorithm → run over data set → make improvements → (repeat)
HEP Data Analysis
• Ranging from I/O bound to CPU bound
• Need many disks to get the needed I/O rate
• Need many CPUs for processing
• Need a lot of memory to cache as much as possible
Some ALICE Numbers
• 1.5 PB of raw data per year
• 360 TB of ESD+AOD per year (20% of raw)
• One pass using 400 disks at 15 MB/s will take 16 hours
• Using parallelism is the only way to analyze this amount of data in a reasonable amount of time
PROOF Design Goals
• System for running ROOT queries in parallel on a large number of distributed computers or multi-core machines
• PROOF is designed to be a transparent, scalable and adaptable extension of the local interactive ROOT analysis session
• Extends the interactive model to long-running "interactive batch" queries
Where to Use PROOF
• CERN Analysis Facility (CAF)
• Departmental workgroups (Tier-2's)
• Multi-core, multi-disk desktops (Tier-3/4's)
The Traditional Batch Approach

[Diagram: a query goes to the batch scheduler, which consults the file catalog; a job splitter splits the analysis job into N stand-alone sub-jobs that run on the batch cluster (queue, CPUs, storage); a job merger collects the sub-jobs and merges them into a single output]

• Split analysis task in N batch jobs
• Job submission is sequential
• Very hard to get feedback during processing
• Analysis is finished when the last sub-job finishes
The PROOF Approach

[Diagram: a PROOF query (data file list + mySelector.C) goes to the master, which uses the scheduler and file catalog to drive the PROOF cluster (CPUs, storage) and returns feedback and the merged final output]

• Cluster perceived as extension of the local PC
• Same macro and syntax as in a local session
• More dynamic use of resources
• Real-time feedback
• Automatic splitting and merging
Multi-Tier Architecture
• Adapts to wide-area virtual clusters: geographically separated domains, heterogeneous machines
• Network performance: less important between geographically separated domains, VERY important within a cluster
• Optimize for data locality or high-bandwidth data server access
The ROOT Data Model: Trees & Selectors

[Diagram: a chain of trees; each event has branches and leaves, and Process() reads only the needed parts while looping over events 1, 2, ..., n, ..., last]

Begin()
- Create histograms
- Define output list

Process()
- Preselection
- Analysis (if preselection is OK, fill the output list)

Terminate()
- Finalize analysis (fitting, ...)
TSelector - User Code

// Abbreviated version
class TSelector : public TObject {
protected:
   TList *fInput;
   TList *fOutput;
public:
   void   Notify(TTree*);
   void   Begin(TTree*);
   void   SlaveBegin(TTree*);
   Bool_t Process(int entry);
   void   SlaveTerminate();
   void   Terminate();
};
TSelector::Process()

   ...
   // select event
   b_nlhk->GetEntry(entry);
   if (nlhk[ik] <= 0.1) return kFALSE;
   b_nlhpi->GetEntry(entry);
   if (nlhpi[ipi] <= 0.1) return kFALSE;
   b_ipis->GetEntry(entry);
   ipis--;
   if (nlhpi[ipis] <= 0.1) return kFALSE;
   b_njets->GetEntry(entry);
   if (njets < 1) return kFALSE;

   // selection made, now analyze event
   b_dm_d->GetEntry(entry);    // read branch holding dm_d
   b_rpd0_t->GetEntry(entry);  // read branch holding rpd0_t
   b_ptd0_d->GetEntry(entry);  // read branch holding ptd0_d

   // fill some histograms
   hdmd->Fill(dm_d);
   h2->Fill(dm_d, rpd0_t/0.029979*1.8646/ptd0_d);
   ...
The Packetizer
• The packetizer is the heart of the system
• It runs on the master and hands out work to the workers
• Different packetizers allow for different data access policies:
  ■ All data on disk, allow network access
  ■ All data on disk, no network access
  ■ Data on mass storage, go file-by-file
  ■ Data on Grid, distribute per Storage Element
• Makes sure all workers end at the same time
• Pull architecture: workers ask for work, so there is no complex worker state in the master
PROOF Scalability on Multi-Core Machines
The current version of Mac OS X is fully 8-core capable; I have been running my Mac Pro with dual quad-core CPUs for 4 months.
Production Usage in Phobos
Interactive Batch
• Allow submission of long-running queries
• Allow client/master disconnect and reconnect
• Allow interaction and feedback at any time during processing
Analysis Scenario

Monday at 10:15, ROOT session on my laptop:
  AQ1: 1s query produces a local histogram
  AQ2: a 10m query submitted to PROOF1
  AQ3 - AQ7: short queries
  AQ8: a 10h query submitted to PROOF2

Monday at 16:25, ROOT session on my desktop:
  BQ1: browse results of AQ2
  BQ2: browse intermediate results of AQ8
  BQ3 - BQ6: submit 4 10m queries to PROOF1

Wednesday at 8:40, browse from any web browser:
  CQ1: browse results of AQ8, BQ3 - BQ6
Session Viewer GUI
• Open/close sessions
• Define a chain
• Submit a query, execute a command
• Query editor
• Online monitoring of feedback histograms
• Browse folders with query results
• Retrieve, archive and delete query results
Monitoring
• MonALISA based monitoring
  ■ Each host reports to MonALISA
  ■ Each worker reports to MonALISA
• Internal monitoring
  ■ File access rate, packet generation time and latency, processing time, etc.
  ■ Produces a tree for further analysis
Query Monitoring
[Plots] The same plots exist for: CPU usage, cluster usage, memory, event rate, local/remote MB/s and files/s
ALICE CAF Test Setup
• Evaluation of the CAF test setup since May
  ■ 33 machines, 2 CPUs each, 200 GB disk
• Tests performed
  ■ Usability tests
  ■ Simple speedup plot
  ■ Evaluation of different query types
  ■ Evaluation of the system when running a combination of query types
Query Type Cocktail
• 4 different query types:
  ■ 20% very short queries
  ■ 40% short queries
  ■ 20% medium queries
  ■ 20% long queries
• User mix:
  ■ 33 nodes
  ■ 10 users, 10 or 30 workers/user, max. average speedup = 6.6
  ■ 5 users, 20 workers/user
  ■ 15 users, 7 workers/user
Relative Speedup

[Speedup plots for the user-mix tests, each showing the measured speedup against the average expected speedup]
Cluster Efficiency
Multi-User Scheduling
• A scheduler is needed to control the use of available resources in multi-user environments
• Decisions are taken per query, based on the following metric:
  ■ Overall cluster load
  ■ Resources needed by the query
  ■ User quotas and priorities
• Requires dynamic cluster reconfiguration
• Generic interface to external schedulers planned (Condor, LSF, ...)
Conclusions
• The LHC will generate data on a scale not seen anywhere before
• LHC experiments will critically depend on parallel solutions to analyze their enormous amounts of data
• Grids will very likely not provide the stability and reliability we need for repeatable high-statistics analysis
• Wish us good luck!