

CMS Emu Meeting, Dec. 6, 2008 1

Electronics Long Term Operations What we learned from Electronics Commissioning

G. Rakness U.C.L.A. Dec 6, 2008


CMS Emu Meeting, Dec. 6, 2008 2

CMS CSC System is Huge

There is a tendency to forget the size of this system.

- ~400,000 channels
- >17,000 electronics boards
- 60 remote VME crates
- ~5,500 skew clear cables, over a million shielded conductors
- 1,400 gigabit optical fibers

This system has been cabled and commissioned in less than 11 months!


CMS Emu Meeting, Dec. 6, 2008 3

Turning on the Electronics

PCrate Sequential LV power up - Major improvement, late October (Sytnik)

This assures Proms properly load FPGAs.

1) Power up DMB/TMB
2) Power up VMECC
3) Power up CCB/MPC

It is essential that DCS monitoring is turned off during sequence. THERE IS NO AUTOMATIC WAY TO DO THIS IN DCS!

This works well but there are rare problems.
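A minimal sketch of how the sequence above could be scripted so that monitoring is always switched off first and back on afterwards. Only the three-step order comes from the slide. The callbacks `power_on` and `set_monitoring` are hypothetical placeholders, not part of any real DCS interface.

```python
from contextlib import contextmanager

# Power-up order taken from the slide; everything else here is hypothetical.
POWER_UP_ORDER = ["DMB/TMB", "VMECC", "CCB/MPC"]


@contextmanager
def monitoring_paused(set_monitoring):
    """Turn monitoring off for the duration of the block, then back on."""
    set_monitoring(False)
    try:
        yield
    finally:
        # Re-enable monitoring even if a power-up step raised an exception.
        set_monitoring(True)


def power_up_crate(power_on, set_monitoring):
    """Power up one peripheral crate in the order given on the slide.

    power_on(group) and set_monitoring(enabled) are placeholder callbacks
    standing in for whatever the real control system provides.
    """
    with monitoring_paused(set_monitoring):
        for group in POWER_UP_ORDER:
            power_on(group)
```

Wrapping the sequence in a context manager means a failed step cannot leave monitoring disabled, which is the part the slide says has to be done by hand today.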


CMS Emu Meeting, Dec. 6, 2008 4

Peripheral Crate Power-up Problems

1) Problem: VMECC fails to program
Solution:
a) Renegotiate gigabit link (shut down switch port via software PCSwitches)
b) Recycle power on slot (this presently takes ~five minutes using DCS GUI. THIS HAS TO BE AUTOMATED!)

2) Problem: Netgear Gigabit Switch CPU locks out VMECC
Solution:
a) Recycle switch power supply with new remote AC power switch (ssh)

3) Problem: TMB or DMB fails to program
Solution:
a) TTC hard reset (1/2 detector)
b) CCB hard reset (whole crate)
c) Worst case (rare): power cycle DMB/TMB slot (2 slots)
There is a run-around problem here: one would like to reset only the problem DMB or TMB.

4) Almost zero Prom programming loss observed
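The recovery steps in item 3 form an escalation ladder: try a step, check the board, move to the next step only if it still has not programmed. A minimal sketch of that logic follows. The step names are from the slide. `apply_step` and `is_programmed` are hypothetical callbacks, since the slides do not describe the actual control interface.

```python
# Recovery steps for a TMB or DMB that fails to program, in the order
# listed on the slide. The callbacks used below are hypothetical.
ESCALATION = [
    "TTC hard reset",
    "CCB hard reset",
    "power cycle DMB/TMB slot",
]


def recover_board(apply_step, is_programmed):
    """Apply each step in order until the board reports it is programmed.

    Returns the name of the step that fixed the board, or None if every
    step was tried without success.
    """
    for step in ESCALATION:
        apply_step(step)
        if is_programmed():
            return step
    return None
```

Note that none of these steps targets a single board, which is the limitation the slide points out.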


CMS Emu Meeting, Dec. 6, 2008 5

Front End Board Power-up Problems

FEBoard Power LV Powerup - Switch on LV individually through DMB using LVMB

Power-on problems are rare. Almost all are due to the infamous Erased Prom problem: CFEBs and ALCTs occasionally lose Prom data on power up. This is rare, typically fewer than 1 in 458 ALCTs and 2300 CFEBs per power-up.

Prom readback shows roughly equal numbers of Proms with one bit flip (1->0) and with no bit flips relative to the loaded data. (A typical Prom readback has millions of bits.) A 1->0 flip suggests charge loss on the gate.

Solution: Automatically detect problem proms and reload firmware. This was successfully implemented in late November.
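As an illustration of the detection step, a readback image can be compared bit by bit against the loaded image, counting flips in each direction. This is only a sketch of the idea. The function names are hypothetical and this is not the code that was deployed in November.

```python
def classify_bit_flips(loaded: bytes, readback: bytes):
    """Count bit flips between the loaded image and a Prom readback.

    Returns (one_to_zero, zero_to_one): bits that were 1 in the loaded
    data but read back as 0, and bits that were 0 but read back as 1.
    """
    if len(loaded) != len(readback):
        raise ValueError("images must have the same length")
    one_to_zero = 0
    zero_to_one = 0
    for want, got in zip(loaded, readback):
        diff = want ^ got  # bits that differ in this byte
        one_to_zero += bin(diff & want).count("1")  # differing bits that were 1
        zero_to_one += bin(diff & got).count("1")   # differing bits that are now 1
    return one_to_zero, zero_to_one


def needs_reload(loaded: bytes, readback: bytes) -> bool:
    """A Prom with any flipped bit is a candidate for a firmware reload."""
    return classify_bit_flips(loaded, readback) != (0, 0)
```

Keeping the two directions separate matters here because the slide's diagnosis (charge loss on the gate) rests on the flips being 1->0.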

CCB Initialization - resets TTC signal communications e.g. hard resets

This has been a bit problematic. Debugging possibly needed?


CMS Emu Meeting, Dec. 6, 2008 6

Problems during Global/Local Data Taking

Global/Local Data Taking

Electronics seems just to work on good boards. We have tested hard reset response (FPGA reload, reset, and Flash memory constant loads) and have never seen a problem.

Rarely VMECC loses gigabit communications.

Solution:
a) Renegotiate gigabit link (shut down switch port via software PCSwitches)
b) Recycle power on slot (this presently takes ~five minutes using DCS GUI. THIS HAS TO BE AUTOMATED!)

Rarely a DMB or TMB loses VME communications
- data/trigger operation unaffected
- long period with no DCS access
- this is under study, we have no explanation
- only fixed on hard reset for a new run


CMS Emu Meeting, Dec. 6, 2008 7

Problems during Global/Local Data Taking

Failures that Require Board Replacement

VMECC, DMB, TMB, CCB, and MPC failures are rare. They are easily accessible and are fixed within hours.

FED DDU and DCC failures are even rarer. They are swapped out within minutes if needed.

F.E. Board failures require access. Boards we discovered with problems last February have still not been replaced.

LVDB Fuses

Rarely ALCT and CFEB LVDB fuses blow. These are extremely difficult to replace.

It was found earlier this year that one can blow an LVDB fuse by programming the ALCT with bad firmware. This has been fixed in software and is believed to be impossible now.

There has been a random, unexplained source of blown fuses over the last six months.


CMS Emu Meeting, Dec. 6, 2008 8

Problems during Global/Local Data Taking

- ~4 ALCT fuses need replacing
- ~2 LVDB-CFEB fuses need replacing

Two of the ALCT fuses blew on separate chambers on the same night! We presently have no idea of the source of these failures.

Sudden LV Power Loss on Peripheral Crate

There are electronics problems that can only be explained by sudden short term power loss to peripheral crates:

- DDU has registered 9 FMM Errors instantaneously in one crate
- MPC has been observed to go into power up mode

These seem to have decreased in frequency since mid-summer.

There is no DCS voltage history available. This would help greatly in debugging/understanding this problem.

Solution: restart run
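To make the missing voltage history concrete, here is a sketch of the kind of record that would help: a bounded buffer of timestamped readings that can be searched afterwards for short dips. Everything here is an assumption for illustration (the class name, the 5% tolerance, the buffer size). It does not describe an existing DCS feature.

```python
from collections import deque


class VoltageHistory:
    """Rolling record of (timestamp, volts) samples for one supply line.

    Hypothetical sketch: nominal voltage, tolerance, and buffer length
    are illustrative values, not real crate parameters.
    """

    def __init__(self, nominal, tolerance=0.05, maxlen=10000):
        self.nominal = nominal
        self.tolerance = tolerance
        # deque with maxlen silently drops the oldest sample when full.
        self.samples = deque(maxlen=maxlen)

    def record(self, timestamp, volts):
        self.samples.append((timestamp, volts))

    def dips(self):
        """Return all samples more than `tolerance` below nominal."""
        floor = self.nominal * (1 - self.tolerance)
        return [(t, v) for t, v in self.samples if v < floor]
```

With such a record, the time of a burst of FMM errors could be checked against `dips()` for the same crate. Whether a dip is visible at all depends on the sampling rate relative to the length of the power loss.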


CMS Emu Meeting, Dec. 6, 2008 9

Failed Boards needing Replacement

Other Long-term Board Failures ME1/1

A third of the long-term board problems have occurred on ME1/1 CSCs. The ME1/1 group has shown data suggesting that many of these are skew clear cable related.

ME1/1 Skew Clear cables have patch panel. Damaged connectors suspected. ME1/1 Skew Clear cables are at length limit of technology.

4/72 ALCT problems, 9/360 CFEB problems

Other Chamber Board Failures (non-ME1/1)

- ~11/396 ME1/2,3 ME2, ME3, ME4/1 ALCT boards need replacement
- ~19/1908 ME1/2,3 ME2, ME3, ME4/1 CFEB boards need replacement, although some of these are skew clear cable related

Systematic repairs of boards replaced have shown no repeat problems. We have had few boards to autopsy with long term failures. Biggest problems still on chamber.
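The counts quoted above can be turned into failure rates and used to check the "a third" statement. The numbers below are copied from the slide; the non-ME1/1 counts are marked approximate there. The function names are only for this illustration.

```python
# (boards with problems, boards installed), copied from the slide.
# The non-ME1/1 counts are quoted as approximate ("~") on the slide.
COUNTS = {
    "ME1/1 ALCT": (4, 72),
    "ME1/1 CFEB": (9, 360),
    "non-ME1/1 ALCT": (11, 396),
    "non-ME1/1 CFEB": (19, 1908),
}


def failure_rates(counts):
    """Return the failure rate in percent for each board group."""
    return {name: 100.0 * bad / total for name, (bad, total) in counts.items()}


def me11_share(counts):
    """Return the percentage of all problem boards that sit on ME1/1 chambers."""
    me11 = sum(bad for name, (bad, _) in counts.items() if name.startswith("ME1/1"))
    all_bad = sum(bad for bad, _ in counts.values())
    return 100.0 * me11 / all_bad


if __name__ == "__main__":
    for name, rate in failure_rates(COUNTS).items():
        print(f"{name}: {rate:.1f}%")
    print(f"ME1/1 share of problem boards: {me11_share(COUNTS):.1f}%")
```

ME1/1 accounts for 13 of the 43 problem boards, about 30%, which matches the statement that roughly a third of the long-term problems are on ME1/1. Its ALCT problem rate (4/72) is about twice that of the other chambers (11/396).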


CMS Emu Meeting, Dec. 6, 2008 10

FED Crate Problems

Monster Event problem showed filtering problems on DCC and on global DAQ group's slink mezzanine boards. Through collaboration, the problem was eviscerated on both sides.

No single board DDU or DCC problems seen.

Software thread loading problem solved in September.

DDUs report problems from other boards. The problems are on the other boards. "Don't kill the messenger."

Online Computer Problems

The online software runs on 16 computers. Known problems:

1) Problem: On power-up, a random number of machines don't boot
Solution: Recycle power on the machines by hand. Although not optimal, ACPI cards are expensive and are reportedly flaky.


CMS Emu Meeting, Dec. 6, 2008 11

Computer Problems Encountered

2) Problem: Farm machine overheating alarms
Solution: fans with 3x air volume installed

3) Problem: Farm machine eth_hook drivers have problems after weeks of running
Solution: patches to gigabit driver seem to have removed problem

4) Problem: DCS machine drivers don't work after several days
Solution: XMAS monitoring seems to have solved problems

5) Problem: We do not manage the computers

A motherboard was recently swapped on a farm machine. Nine days and tens of emails later, an NFS mounting problem leaves the machine still unusable.

Solution: Eric Cano et al are overworked. This is their problem since we don't have root privileges on USA owned machines ???!?

2 spare machines live, configured and connected: $$$$. Space for one 2U machine in USC ???