CDF | William Badgett | CDF Control & Configuration | CHEP'06 2006.02.13
Design and Performance of the CDF Experiment Online Control and Configuration System
William Badgett, Fermilab, for the CDF Collaboration
2006 Computing in High Energy and Nuclear Physics Conference
Online Computing Session 2, OC-2, Id 363
February 13, 2006, Mumbai, India
Introduction
• CDF online configuration and control; CDF Run IIa & b status
• Brief overview of the CDF DAQ system; configuration and conditions
• Overview of the online databases: Hardware Database and API; RunControl; run configurations & conditions db
• Operational experience during data taking: performance, availability
• Conclusions; wish list…
Tevatron Upgrades, Run IIa and IIb
• Run IIa (2001-2005): goal 2 fb^-1
  New Main Injector accelerator gives a ×5 increase
  Recycler gives a further ~2-3× increase by preserving anti-protons; first physics use in 2004!
  Shorter bunch spacing of 396 ns gives 36 bunches
  Higher beam energy of ~980 GeV (up from 900 GeV)
  Peak luminosity goal: 2×10^32 cm^-2 sec^-1
• Run IIb (2005-2009? LHC?): goal 15 fb^-1
  Electron cooling, crossing angle, anti-proton intensity, and the electron lens give a ~2-3× increase
  Peak luminosity goal: 3.3×10^32 cm^-2 sec^-1
• Trigger and DAQ upgrades
Collecting Luminosity
Nominal "good" data taking started around March 2002.
Red: delivered by the Tevatron, 1.55 fb^-1
Blue: recorded (live) by CDF, 1.25 fb^-1
Data samples can be further reduced by detector malfunctions, depending on the event selection.
Data collection now greatly exceeds CDF Run I, with increased detector sensitivity as well.
Improving the Beam
Luminosity continues to improve…
Peak luminosity to date: L ~ 1.8×10^32 cm^-2 sec^-1
Compare to the Run I peak: L ~ 2×10^31 cm^-2 sec^-1
Planning for Run IIb: L ~ 3.3×10^32 cm^-2 sec^-1
Data Acquisition Overview
Front-end VME crates digitize, time, etc.; a subset of the data is split off and sent to the trigger, and monitoring and control messages are published over ethernet.
The Trigger Supervisor controls the entire operation: the communications hub between DAQ and trigger.
The Event Builder collects event fragments over optical fibres and forwards them to the Level 3 trigger farm for the final decision.
The Level 3 farms, commercial Linux boxes, apply offline-style processing and cuts.
The Data Logger sends data to the computing-center tape robots, and a fraction to disk and to the online monitors.
Operational Efficiency
Sources of down time: beam losses too high; high-voltage trips; detector malfunctions; beam-time calibrations; DAQ or trigger malfunctions (pipeline jumps losing sync, hardware failures, software crashes or system failures, database, RunControl); trigger/DAQ deadtime; human error.
The silicon tracker is particularly sensitive to beam losses, and has experienced damage from problematic beam aborts.
Efficiency improves with time, then becomes asymptotic, with the last percentage points becoming exponentially more difficult to recover…
Detecting & Fixing DAQ Errors
• FrontEnd crates: receive control and configuration messages and send data mini-banks; crate data consistency is checked at every L2 accept (fast); regular status and heartbeat messages are published.
• Event Builder: assembles the full data event record.
• Level 3 trigger: uses the full event to find data-acquisition errors, in addition to the physics triggers.
• ConsumerServer: logs data on disk and tape and fans out event samples to the DAQ consumers, including the DAQ Error Consumer.
• DAQ Error Consumer: converts errors from the offline to the online message format and forwards them to all online monitors.
• FrontEnd monitors: verify an error and determine its source; construct the error message.
• ErrorHandler: processes many error sources and sends a recommended reset or run-recovery action to RunControl.
• RunControl: fast recovery; crate reset access; starting and stopping runs.
Redundancy and constant cross-checking are built in; error recovery is normally completely automatic. "What goes around, comes around…"
Online Software
• User interfaces, control, and real-time monitoring: control and monitoring in Java (JDK 1.4.2, Sun); commercial PCs running FNAL Scientific Linux 3.0.5; not limited to any CPU, architecture, or operating system; Oracle database v9.2 running on a Sun 450.
• Readout crate controllers: front-end crates running VxWorks, in C, for simplicity and closeness to the hardware.
• Level 3 and data monitoring: Linux, with the C++ offline analysis-control framework ("giving physicists a dangerous weapon").
Online Database Schemas
• Hardware*: pseudo-static; slots, delays, basic timing; Δ-data-style history tables
• Run*: configurations for user selection in RunControl (INPUT tables); conditions for DAQ and trigger, rates, latencies, etc. (OUTPUT tables)
• Trigger: trigger thresholds and algorithms; immutable physics objects
• Calibration: detector characterization and correction constants
• SlowControls: records the environmental state of the detector; voltages, temperatures, etc.
*described in detail in what follows
Database Growth
Many application revisions were needed at first to control exponential growth; since then, growth has been steady except during extended shutdowns.
Database Availability
• The CDF data acquisition operates in close cooperation with the online production database.
• CDF runs 24 hours per day, 7 days per week, even during Tevatron shutdown periods.
• Unscheduled downtimes can lose data. Since March 2002: one db disk failure where the RAID failed to fail over (!); one db memory-card failure; one big db "human error"; RunControl online Java API bugs and crashes.
• Maintenance downtimes are necessary but painful to schedule: detector maintenance work requires the database and RunControl up and running.
DownTime Impact
Category | N Events | Minutes Lost | Luminosity Lost | Percent Luminosity
RunControl | 16 | 258 | 321 nb^-1 | 0.021%
Database | 10 | 325 | 515 nb^-1 | 0.034%
Downtime events directly attributable to database or RunControl pathologies only (does not include configuration time triggered by external failures). ΣL ~ 1.5 fb^-1.
CDF Database Replication
Use Oracle Streams replication:
• automatic propagation of DML and DDL, in a leap-frog style, to an unlimited number of database instances
• minimizes load on the online and offline production instances
• essentially instantaneous push of new data
Instances (the original diagram color-codes read+write vs. read-only):
• Online (read+write): RunControl, monitors, calibrations, consumers. Schemas: Run, Hardware, Trigger, Calibration, SlowControl
• First replica: L3 trigger, web servlets, shift-crew electronic logbook (!). Schemas: Run, Hardware, Trigger, Calibration, FileCatalog/SAM
• Offline production: offline production farms, luminosity calculations
• Offline user replica (read-only): user analysis farms, general database web browser. Schemas: Run, Hardware, Trigger, Calibration, FileCatalog/SAM
• …access for the rest of the world, direct or via additional instances: remote SAM stations, FroNTier cache
Hardware Database
• RunControl needs a complete image of the configuration data; ~30 seconds to load.
• Updates occur at a low rate, but are critical for operations.
• Core tables and Java classes describe all the electronics: crates, cards, etc.
• All updates to the core tables are logged in history tables automatically via database triggers; the tables grow steadily with time.
• Java classes read incremental updates before runs and use reflection methods to update the core data image on the fly, quickly and transparently, in under a few milliseconds.
• This flexible, unified design is used for all detector components at CDF!
• Every second counts when configuring a run!
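The reflection-based incremental update can be sketched roughly as follows; the class and field names here are invented for illustration and are not CDF's actual hdwdb schema.

```java
import java.lang.reflect.Field;
import java.util.Map;

// Hypothetical card record in the in-memory configuration image.
// Field names are invented; they are not CDF's real hdwdb columns.
class Card {
    public int slot;
    public int delay;
}

public class IncrementalUpdater {
    // Apply one history-table delta (column name -> new value) to a cached
    // object via reflection, so only the changed fields are rewritten.
    public static void apply(Object target, Map<String, Object> delta) {
        try {
            for (Map.Entry<String, Object> e : delta.entrySet()) {
                Field f = target.getClass().getField(e.getKey());
                f.set(target, e.getValue()); // update the image in place
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("bad delta", e);
        }
    }
}
```

The point of the design is that only rows newer than the last load need to be fetched and applied to the matching image objects, avoiding another full ~30-second load.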
Hardware Database Java API
Incremental updates from the history table are used together with Java reflection to dynamically update the Java data image in milliseconds.
Electronics card inheritance tree: hdwdb.Card → hdwdb.BankCard → hdwdb.AdMem → hdwdb.AdMemTof; hdwdb.Tracer (boards to configure vs. boards to read out)
Image object containment tree: hdwdb.Crate (static Hashtable) → Hashtable of hdwdb.Card → Hashtable of hdwdb.Channel
Hardware Database Web Interface
• Light-weight Apache+Tomcat servlets for browsing the hierarchical database structures
• Dynamic links point to other database objects
• Read-only policy on the web for security reasons
• Write access requires Kerberos authentication to get inside the firewall
Screenshots: real-time crate data-acquisition status; crate hardware database details with the contained cards' data.
CDF RunControl
RunControl: the central control program directing, configuring, and synchronizing the actions of ~150 clients.
• Real-time multi-threaded Java application, approximately ten threads at any one time
• SmartSockets™ commercial TCP/IP name services for communication to and from clients in a publish/subscribe model
• Provides the run configuration for the hardware and software clients
• Closely linked to the database describing the hardware, run options, calibration constants, trigger table, etc.
• Front-line monitoring and error reporting for the DAQ system
• Works with the ErrorHandler, an auxiliary process that logs errors and makes informed decisions about recovery procedures, both automatic and with human intervention
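SmartSockets is commercial and its API is proprietary, so as a rough, hypothetical stand-in, the publish/subscribe pattern RunControl relies on might look like this minimal in-process sketch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Minimal in-process publish/subscribe hub. This is only a stand-in for
// the commercial SmartSockets server; the real API and wire protocol differ.
public class MessageHub {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    // Register a client callback for one subject.
    public void subscribe(String subject, Consumer<String> client) {
        subscribers.computeIfAbsent(subject, k -> new ArrayList<>()).add(client);
    }

    // Deliver a message to every client subscribed to the subject.
    public void publish(String subject, String message) {
        for (Consumer<String> c : subscribers.getOrDefault(subject, new ArrayList<>())) {
            c.accept(message);
        }
    }
}
```

In this model, RunControl publishes transition commands on per-client subjects while monitors subscribe to status subjects, so neither side needs to know the other's network location.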
StateManager:
• The user initiates transitions between different states.
• The goal is to stay in the Active state until the run is complete, taking recovery actions as necessary.
The ideas for the transitions and state-flow diagrams follow the ZEUS experiment RunControl (Chris Youngman et al.).
Extensibility of the object-oriented design:
• Easy to implement any other diagram, e.g. TDC testing or source runs
• Ported for use in the FNAL fixed-target program with few changes
Transitions
• Partition: select the front-end crates and clients for the run; configure trigger and readout crosspoints
• Config/Setup: configure crates and clients with information that can change run by run, without adding or subtracting RC clients (the slowest transition). Most work is done here!
• Activate: final step enabling the system to take data; fast
• End: normal end of run; produces end-of-run summaries
• Abort: return to Idle when no other option is available
• Pause/Resume: briefly stop data taking (HV trips, flying wires, inhibits)
• Halt/Recover/Run: fast system error recovery, the first option to use when an error occurs during data taking; critical to maintaining operational efficiency
• Reset: return to the Start state from Idle, or when no other options are available
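A transition table of this kind can be sketched as a drastically simplified state machine; the state names and transition set below are an invented subset, far smaller than the real diagram:

```java
import java.util.HashMap;
import java.util.Map;

// Drastically simplified transition table: four states and four named
// transitions, a small invented subset of the real CDF state-flow diagram.
public class RunStateMachine {
    public enum State { IDLE, PARTITIONED, CONFIGURED, ACTIVE }

    private State state = State.IDLE;

    // transition name -> { required current state, resulting state }
    private static final Map<String, State[]> EDGES = new HashMap<>();
    static {
        EDGES.put("partition", new State[]{State.IDLE, State.PARTITIONED});
        EDGES.put("configure", new State[]{State.PARTITIONED, State.CONFIGURED});
        EDGES.put("activate",  new State[]{State.CONFIGURED, State.ACTIVE});
        EDGES.put("end",       new State[]{State.ACTIVE, State.IDLE});
    }

    public State getState() { return state; }

    // Take the transition if it is legal from the current state.
    public boolean request(String transition) {
        State[] edge = EDGES.get(transition);
        if (edge == null || edge[0] != state) return false; // illegal here
        state = edge[1];
        return true;
    }
}
```

Keeping the transition table as data rather than code is what makes it cheap to swap in a different diagram (e.g. TDC testing or source runs) in a subclass.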
Typical Transition Performance
Client reply times are plotted; RunControl's own setup time is < ~1 sec.
Slow pokes: L3 distribution, Silicon Vertex Trigger, muon-track trigger. Source: large L3 farm distribution and large trigger look-up tables (plots: L3 farm*, L3+SVT+μ).
Pathological tails: remote client software crashes, etc.; *the L3 Config tail appears when the calibration or trigger executable is not cached.
Each transition-time improvement requires social engineering.
RunConfiguration Selector: the Run Database, Visualization
Select from predefined run configurations, organized hierarchically in folders by function:
• Each entry represents a set of relational entries in several RunConfiguration database tables, mapped onto an object (Java and C++) using container objects to express the relations
• Contents change from run to run
• RunConfigurations are human-readable, selectable, flexible, and non-binding
• RunConditions contains a copy when a run is executed
Graphical representation of a RunConfiguration object:
• Global DAQ RunType, with the trigger table coupled to it
• Front-end crate selection: move to the left to include, to the right to exclude
• Tabbed panes contain detailed information about the RunConfiguration
Java Tomcat servlets provide a web-browsable version from anywhere.
The run database in turn points to entities in the trigger, calibration, and hardware databases.
Run Database Schema (subset of the whole run schema)
• Run Configurations ("input" tables): configure the DAQ according to the type of run, and record it for posterity
• Run Conditions ("output" tables): record settings, trigger rates, luminosity and background rates, run-quality status, etc.
Configuration Messages Structure
Message class hierarchy: rc.ConfigMess, rc.ReadoutRun, rc.ReadoutList, and detector-specific subclasses (rc.phys.COTReadoutList, rc.phys.CalReadoutList, rc.phys.CalSmxrReadoutList, rc.phys.MuonReadoutList).
• rc.ConfigMess: sent to every client, with the destination specified; contains global common variables (runNumber, runType, etc.)
• rc.ReadoutList: sent to every client with readout to perform; carries the list of banks
• rc.phys.* subclasses: detector-component-specific configuration details
The messages collate information from the Hardware, Run, Trigger, and Calibration databases:
• class inheritance as needed, according to the type of client (electronics crate, software server application, L3 trigger, etc.)
• the desired message is picked up dynamically from the Hardware database
• Java classes generate C code and headers automatically
• the unified system avoids much duplicated work!
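The inheritance idea can be sketched as follows; the field and bank names are invented for illustration, and the real rc.* classes carry far more information collated from the databases:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the configuration-message inheritance idea. Field and bank
// names are invented; the real rc.* classes carry far more detail.
class ConfigMess {
    public int runNumber;          // global common variables
    public String runType;
}

class ReadoutList extends ConfigMess {
    public List<String> banks = new ArrayList<>(); // banks to read out
}

// Detector-specific subclass adds its own configuration details.
class COTReadoutList extends ReadoutList {
    public int tdcTimeWindow;      // invented field, for illustration only
}

public class MessageDemo {
    // Collate a message for one hypothetical client.
    public static COTReadoutList build(int run) {
        COTReadoutList m = new COTReadoutList();
        m.runNumber = run;         // inherited from ConfigMess
        m.runType = "PHYSICS";
        m.banks.add("COTD");       // hypothetical bank name
        m.tdcTimeWindow = 396;
        return m;
    }
}
```

Each detector group only writes its own leaf class; the common fields and delivery machinery are inherited, which is the duplicated work the slide says the unified system avoids.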
Real-Time Monitoring (Java Swing)
Publish/subscribe-based monitoring allows easy-to-read monitor panels, arrayed around the control room:
• Rate Monitor and Dynamic Prescaler
• Status Summary: the monitor may be run anywhere, and also provides HTML web files
• Tevatron Loss Monitor
• Crate VxWorks Monitor
And, of course, panic situations give voice alarms, too.
Data Acquisition / Control Room
The primary Data Acquisition consoles: RunControl, online monitoring
Web-Based Monitoring: RunSummary
Run summary pages are produced dynamically, with almost every quantity hyper-linked; many of the links draw plots of the quantity of interest, and there are links to the error logs and all run settings. ROOT is used for plotting.
http://www-cdfonline.fnal.gov/ (follow RunSum and related links). Publicly accessible!
Freeware Experience
• Java experience has been quite positive:
  Easy to build complex programs without the headaches of C and C++
  Extensibility of Java classes has proven invaluable
  All CDF RunControl and monitoring applications can run anywhere, not reliant on any CPU or operating system
  100% availability so far
  JDK/Linux releases: Sun is phasing out 1.4.2 support
• Downsides, when you really push Java:
  It's not really platform-independent: various subtle differences (threads, look & feel)
  The Java Virtual Machine is a complicated creature, with sometimes mysterious and impossible-to-debug behaviour and crashes
Operating Systems
• Linux experience also positive:
  Linux disk and web servers are reliable
  Very difficult (impossible?) to get our programs to crash the operating system
  Perhaps Linux can replace Sun for the database system; testing in the offline realm has so far been positive
  But we miss that VMS system API (!)
• Still have not made the leap to Oracle on Linux for the critical servers:
  Cannot argue with success: unscheduled database downtime is extremely rare
  Offline replicas on Linux are in good shape
Commercial Software Experience
Oracle Database:
  Generally impervious to crashes; robust, reliable
  Fulfills our database and communications needs
  Oracle provides a nice support forum (but see below)
  Downsides: money $$, lots of it; many people fear it; can't see the source, but you probably wouldn't want to
SmartSockets (Talarian/TIBCO):
  Remarkably good performance for a centralized TCP communications server
  Features and support sometimes lacking
  Downsides: again, money $$$; the price of a single client license keeps going up; small company, short lifespan; in this case you probably would like to see the source code; crashes on VxWorks that we cannot debug
But beware false economies!
Wish List
• Cross-experiment and cross-lab development of software could be quite beneficial in some common areas: IP message passing (multi-platform, multi-language); database servers (!); …other software?
• Virtually every experiment needs such beasts, but the effort is often duplicated
• Avoid expensive licenses with no source-code access
• Should be tailored to HEP requirements, with continuing support (everything is always in development!)
• PAW, ROOT, and data handling have been successes in common tools
• Hearing murmurs… what's out there?
Conclusions
• CDF is running well, taking data during the Tevatron Run II, 2001 through 2009
• We have designed and implemented a set of database schemas and associated Java APIs to configure and control the CDF online data acquisition system in real time
• Through object oriented programming, we have created a powerful and flexible approach to run configurations that is used by all components of the experiment
• A suite of control and monitoring software, web interfaced, has been developed; shift crew’s job is now easier and more efficient
• Through replication, web interfaces and offline database hooks, we have an extensible database available to users world-wide
Resource Allocation
Multiple RunControls can run simultaneously: partitions. The Resource Manager controls ownership of front-end crates and other virtual resources. Allocation is recorded centrally in the Hardware Database, and real-time database event notifications keep all clients informed; a Java monitoring thread listens for the events and updates the object images.
A real-time, color-coded Java display represents the device allocation.
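The ownership rule itself can be sketched in-process as follows; in the real system the allocation lives centrally in the Hardware Database and is propagated by database event notifications, and the crate names here are invented:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the ownership rule only: each crate (a virtual resource) may
// be owned by at most one partition at a time. Crate names are invented.
public class ResourceManager {
    private final Map<String, Integer> owner = new HashMap<>(); // crate -> partition

    public synchronized boolean allocate(String crate, int partition) {
        if (owner.containsKey(crate)) return false; // already owned
        owner.put(crate, partition);
        return true;
    }

    public synchronized void release(String crate) {
        owner.remove(crate);
    }
}
```

Centralizing the check is what lets several RunControl partitions run simultaneously without two of them configuring the same crate.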
DAQ Performance
Trigger Level | Design Rate | Max Rate Achieved, IIa | Limitations and Improvements | Upgraded Goals, IIb
1 | 50,000 Hz | 27,000 Hz | Level 2 processing time of 40-60 μs; replace global L2; improve silicon-vertex-trigger timing | 35-40 kHz
2 | 300 Hz | 390 Hz | ATM network message passing, replaced with gigabit ethernet; TDC hit-processing time; compress and improve processing; add L3 processing power | >1 kHz, ~1.5 kHz
3 | 100 events/sec, 20 MB/sec | 135 events/sec, 20 MB/sec | Data-logging output bandwidth; upgrade to gigabit ethernet; upgrade logging CPUs | 40 MB/s (270 ev/sec) to 60 MB/s (410 ev/sec)
Bandwidth Usage Maximization
• Dynamic prescaling: as luminosity decreases, trigger rates also decrease.
• To maximize usage of the DAQ bandwidth, the prescales of Level 1 triggers are automatically lowered, within bounds, to increase the trigger rate during a data-acquisition run.
• Used for the Level 1 two-track trigger (for B physics), ~85% of the Level 1 bandwidth; heavily prescaled at the start of a run for safety.
Plots: Level 1 trigger rate (triggers per second) and L1 trigger cross section (trigger counts normalized by luminosity). Red arrows indicate changes of the prescale values: the run is paused, the hardware is set, and the run is resumed.
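The rate-matching logic behind dynamic prescaling can be sketched as follows; this is a hypothetical calculation, not CDF's actual algorithm:

```java
// Hypothetical rate-matching calculation, not CDF's actual algorithm:
// pick the smallest prescale that keeps the accepted rate at or below
// the bandwidth target, clamped to configured bounds.
public class DynamicPrescaler {
    public static int choosePrescale(double rawRateHz, double targetHz,
                                     int minPrescale, int maxPrescale) {
        int p = (int) Math.ceil(rawRateHz / targetHz);
        return Math.max(minPrescale, Math.min(maxPrescale, p));
    }
}
```

At the start of a store the raw rate is high, so the chosen prescale is large; as luminosity falls during the run, the same calculation returns smaller prescales, recovering bandwidth for the two-track trigger.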
Complete Operational Efficiency
• Efficiency factors:
  Intrinsic system limits: instantaneous deadtime, limited by system throughput performance, adjusted through physics choices via trigger cuts
  Accelerator beam quality: losses prevent detector operation (trips and tolerances); little or no experimental control
  Operational downtimes: starting and stopping runs; failures of services (e.g. the database server); detector malfunctions; data-acquisition and trigger electronics malfunctions; test runs, beam-time calibrations; human errors… and others
Efficiency Tabulation, to date
CDF DownTime Summary; stores 1038 ... 3745
Downtime events per category, by cumulative luminosity lost:
Category Group nEvents DownTime, minutes LostLumi, nb-1 LostLumi, %
TEVLOSS ACCEL 218 5,382.0 8,749.5 1.4
EVB DAQ 288 5,471.0 6,650.5 1.0
STARTUP DAQ 509 6,062.0 5,687.3 0.9
TRIGLVL2 TRIGGER 404 5,595.0 5,529.5 0.9
SVX HV HV 273 4,350.0 5,286.5 0.8
TRIGLVL3 TRIGGER 159 4,104.0 4,952.3 0.8
DAQOTHR DAQ 251 4,891.0 3,995.6 0.6
TDCs DAQ 183 2,234.0 2,943.6 0.5
SCRAPERS ACCEL 60 1,796.0 2,813.5 0.4
SVX DAQ DAQ 115 1,972.0 2,752.7 0.4
TRIGTABL TRIGGER 233 2,860.0 2,731.6 0.4
COT HV HV 114 2,007.0 2,408.8 0.4
SOLENOID MAGNETS 40 1,455.0 1,949.5 0.3
FEVME DAQ 108 1,217.0 1,876.2 0.3
NOCATEG MISC 166 2,833.0 1,743.8 0.3
HUMERR OPERATN 36 1,103.0 1,693.1 0.3
TRIGLVL1 TRIGGER 76 1,329.0 1,688.1 0.3
CSL DAQ 26 1,157.0 1,471.9 0.2
SMXR DAQ 43 776.0 1,436.8 0.2
TEVSTUD ACCEL 100 2,391.0 1,393.0 0.2
Total 4453 72,927.0 85,152.2 13.3%
...several smaller categories suppressed
Downtime occurrences are automatically tabulated and linked to the shift crew's electronic logbook, each DAQ run, and each Tevatron store. Browse and group by category, lost time, or lost luminosity. Over a time scale of years the category assignments proliferate; the operational utility is on the small time scale.
Efficiency Tabulation, intrinsic
CDF DownTime Summary; stores 1038 ... 3745
Category (Group) | DownTime, minutes | LostLumi, nb^-1 | LostLumi, %
Total (OPS; 4453 events; category totals from the previous page) | 72,927.0 | 85,152.2 | 13.3%
Intra-run downtime:
Pause | 5,406.0 | 9,307.9 | 1.5%
Halt | 18,345.2 | 35,786.9 | 5.6%
Inhibit | 1,862.7 | 6,929.2 | 1.1%
WaitBusy | 2,191.5 | 6,326.2 | 1.0%
L2orReadout | 2,091.6 | 4,989.1 | 0.8%
Readout | 2,414.9 | 6,917.6 | 1.1%
L1Done | 257.7 | 773.2 | 0.1%
Level2 | 2,916.5 | 10,688.1 | 1.7%
TS | 311.6 | 625.1 | 0.1%
Intrinsic DAQ&Trigger (intrinsic dead time during data-acquisition runs) | 10,149.8 | 30,319.4 | 4.7%
Intra-Run Total | 12,046.5 | 37,248.7 | 5.8%
SmallRunPenalty (runs too small to process) | 8,973.7 | 3,878.5 | 0.6%
TotalLiveLumi (net efficiency) | | 515,411.9 | 80.3%
TotalDeliveredLumi | | 641,666.2 | 100.0%
Client MicroManagement
This window indicates the transition status of the clients:
• Butter yellow: RC has not sent the transition
• Margarine yellow: RC has sent the transition and is waiting for acknowledgment
• Green: client sent a successful acknowledgment
• Red: client sent an error
Each client is monitored continuously for participation in the run and for possible errors.
Each client has its own individual control panel; complete resets and recovery are one-touch, and all configuration and response information is available there.
State Management
RunControl maintains synchronization of activities through the StateManager and its flow. Basic functionality is expressed in the base class StateManager; different run types require differing control flows, so specific StateManagers inherit from the base class and extend it as necessary. Configuration messages are also easily extensible according to the needs of individual detectors. Avoid duplicating lots of work!
Examples: the TDC testing diagram; calorimeter radioactive-source runs, which require source-motion-control transitions.
There's only one RunControl at CDF.