DAQ in Run 5 200 GeV/u Cu-Cu

John Haggerty, Brookhaven National Laboratory

March 9, 2005



Page 1: DAQ in Run 5 200 GeV/u Cu-Cu

John Haggerty, Brookhaven National Laboratory

Page 2: DAQ in Run 5 200 GeV/u Cu-Cu


Scorecard on New Stuff

• Linux Event Builder: up and running
• Multievent buffering: replaced MUTR and EMCAL firmware, still seems to work (judging from π0 and J/ψ)
• New raw data disks: four of five 6.4 Tbyte disk arrays operational
• Scientific Linux: more problems than one would think with NFS, and more kernel mysteries than one would like, but generally OK
• MUID and ERT LL1: next generation of LL1's operational
• JSEB: new firmware has reduced the rate of data corruption to nearly zero (it was the JSEB); the Jungo Linux driver seems to have no major issues
• Better logging control: the BBServer
• PostgreSQL run database: has all the functionality of the MySQL database and some enhancements
• Split GL1/LL1 DCM: didn't fix quite as much as we had hoped
• Split BB DCM: Jamie and I measured 8.5 kHz to the EvB—no longer the bottleneck

Page 3: DAQ in Run 5 200 GeV/u Cu-Cu


Run 4-Run 5 Comparison

I computed the livetime and event rates from the Run 4 (MySQL) and Run 5 (PostgreSQL) database entries

•The event rate is higher due to Event Builder improvements

•The livetime is greater due to the multievent buffering

Page 4: DAQ in Run 5 200 GeV/u Cu-Cu


Running Efficiency

• RHIC scalers used to determine when beams are colliding (52% of the day) and backgrounds are OK (78% of that)
• PostgreSQL run database used to calculate physics running time (50% of the time that RHIC beams are colliding, around 60% of the time that backgrounds are acceptable)
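The quoted fractions compound, so they can be cross-checked with a little arithmetic. The percentages are from the slide; the combination below is my own back-of-envelope sketch, not part of the talk:

```python
# Back-of-envelope check of the running-efficiency fractions quoted above.
beams_colliding = 0.52        # fraction of the day with colliding beams
backgrounds_ok = 0.78         # fraction of colliding time with acceptable backgrounds
physics_of_colliding = 0.50   # physics running as a fraction of colliding time

usable = beams_colliding * backgrounds_ok          # ~41% of the day is usable beam
physics = beams_colliding * physics_of_colliding   # ~26% of the day is physics running
physics_of_usable = physics_of_colliding / backgrounds_ok  # ~64%, near the quoted "around 60%"

print(f"usable beam time:  {usable:.0%} of the day")
print(f"physics running:   {physics:.0%} of the day")
print(f"physics / usable:  {physics_of_usable:.0%}")
```

The last ratio shows the slide's two physics-running figures (50% of colliding time, ~60% of acceptable-background time) are mutually consistent.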

Page 5: DAQ in Run 5 200 GeV/u Cu-Cu


The (Continuing) Need for Speed (some speculation)

• We still need some performance measurements to know exactly what limitations there are

– There are many measurements that indicate that many systems can run through the Event Builder at 5-6 kHz… but can all of them?
– There are probably system effects that limit that, too
– To get that, one must studiously make sure that packet lengths in the DCM's don't have any hot spots (i.e., a single fiber needs to have less data than the BB, about 1 kbyte, or less if it's multiplexed ×2)
– Jamie and Chun are working on new list memories for the known hot spots of the EMCAL Reference FEM's
– Note that 10 kHz of 10 kbyte events is 100 Mbyte/sec—the upper limit of JSEB speed (it should be doing at least 60 Mbyte/sec)—but that means we need each DCM group to be limited to 10 kbyte/event
– We don't actually know the performance limits of the Buffer Boxes except in the most general terms (data comes in on Gigabit Ethernet, which generally maxes out at around 100-110 Mbyte/sec, but might be substantially less due to the many socket connections to individual ATP's)
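The per-event budget in the bullets above is just link bandwidth divided by trigger rate. A small sketch of that arithmetic; the 10 kHz rate and the JSEB figures are from the slide, while the helper function is my own:

```python
# Event-size budget: at a given trigger rate, the average event size per link
# is capped by that link's sustained bandwidth.
def max_event_size_kb(rate_khz: float, link_limit_mb_s: float) -> float:
    """Largest average event size (kbyte) a link can carry at a given rate."""
    # Mbyte/sec -> kbyte/sec, divided by events/sec
    return link_limit_mb_s * 1000.0 / (rate_khz * 1000.0)

print(max_event_size_kb(10.0, 100.0))  # 10.0 kbyte/event at the 100 Mbyte/sec JSEB ceiling
print(max_event_size_kb(10.0, 60.0))   # 6.0 kbyte/event if the JSEB only sustains 60 Mbyte/sec
```

The second line shows why the hot spots matter: if the JSEB only sustains 60 Mbyte/sec, the per-DCM-group budget shrinks from 10 to 6 kbyte/event.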

Page 6: DAQ in Run 5 200 GeV/u Cu-Cu


Other Stuff

• The operating efficiency is not that great, but I can’t discern any small number of things that are doing us in; it seems more like random system problems at random times. MUTR may or may not be worse (it has 350 FEM’s, so it has a good chance of having a problem).

• Steve Boose, Chi, and I are looking into trying a circuit this year to change the clock more smoothly, basically by waiting for an oscillator to drift into phase with the RHIC clock. It might be possible to try something, though it probably won't be without side effects; it would be better to do this by controlling the clock change, but that would have to be done at the clock source.

• STAR scalers still need work and integration, but the STAR Scaler Interface has been plugged into the GL1 crate the whole run (to record LL1 rates in the RHIC scalers).

• We could use the additional disk space provided by the fifth 6.4 Tbyte disk array (and possibly the bandwidth). It’s possible we need bandwidth beyond that depending on the realizable performance if we try to triple the average data taking rate.

• I have had a hankering for better logging of the recorded data rate.
• We don't have the accepted event readout for the MUID and ERT LL1 data; maybe at this point we don't even need it?
• We need to put the GL1p back for spin running.
• I'm supposed to replace logbook and netmon with a super-server.

Page 7: DAQ in Run 5 200 GeV/u Cu-Cu


Problem Solving

• There will probably be new problems to work on this month, but some of the things that still need work are:

– Investigating the weird time delays in stopping some SEB's
– Why do we sometimes drop in data-taking speed to near zero, with events backed up in the SEB's?
– Why do some buffer boxes write faster than others?
– What's with this kernel? Does NFS work or what?
– Did distributing all those hosts files really make things faster?
– That last disk array

Page 8: DAQ in Run 5 200 GeV/u Cu-Cu


It’s not my talk, but…

• Copying data from the counting house to HPSS has been pretty irritating this run

• When we could copy data 50% faster than we took it, we didn’t really notice it

• We have 8 STK 9940B drives, each capable of 30 Mbyte/sec, but Tom Throwe reports the effective bandwidth as 25 Mbyte/sec, and we write data at around 200 Mbyte/sec after compression. On top of that, there has been a problem that limits many of the drives to 15 Mbyte/sec, and there have been significant outages this year… it all adds up to us taking data faster than we can archive it, so the disks fill up in 1008 and in the HPSS cache.

• There doesn’t seem to be any simple way we could write HPSS-compatible tapes in 1008.
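The archiving deficit described above can be tallied directly from the slide's numbers; the variable names below are my own:

```python
# Tape-archiving budget from the slide's figures: 8 drives at the reported
# effective 25 Mbyte/sec just match the ~200 Mbyte/sec written after
# compression, so any degradation or outage puts archiving behind.
drives = 8
effective = 25.0    # Mbyte/sec per drive, as reported (vs. 30 Mbyte/sec spec)
degraded = 15.0     # Mbyte/sec per drive when the rate-limiting problem hits
write_rate = 200.0  # Mbyte/sec written to disk after compression

print(drives * effective)               # 200.0: archiving exactly keeps up, zero headroom
print(drives * degraded)                # 120.0: with degraded drives
print(write_rate - drives * degraded)   # 80.0 Mbyte/sec deficit in the degraded case
```

With zero headroom even in the best case, it is no surprise the 1008 disks and the HPSS cache fill up whenever drives degrade or go down.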