Future experiment specific needs for LHCb
OpenFabrics/Infiniband Workshop at CERN
Monday June 26
Sai Suman Cherukuwada and Niko Neufeld, CERN/PH
Niko Neufeld, CERN, PH
LHCb Trigger-DAQ system: Today
• LHC crossing-rate: 40 MHz
• Visible events: 10 MHz
• Two stage trigger system
– Level-0: synchronous in hardware; 40 MHz → 1 MHz
– High Level Trigger (HLT): software on CPU-farm; 1 MHz → 2 kHz
• Front-end Electronics (FE): interface to Readout Network
• Readout network
– Gigabit Ethernet LAN
– Full readout at 1 MHz
• Event filter farm
– ~ 1800 to 2200 1 U servers
[Figure: block diagram showing FE boards, L0 trigger, Timing and Fast Control, readout network, CPUs and permanent storage]
LHCb DAQ system: features
• On average every 1 µs new data become available at each of ~ 300 sources (= custom electronics boards, “TELL1”)
• Data from several 1 µs cycles (= “triggers”) are concatenated into 1 IP packet → reduces message / packet rate
• IP packets are pushed over 1000 BaseT links → short distances allow using 1000 BaseT throughout
• Destination IP-address is synchronously assigned via a custom optical network (TTC) to all TELL1s
• For each trigger a PC-server must receive IP packets from all TELL1 boards (“event-building”).
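The event-building step above can be illustrated with a toy sketch. It is not the LHCb implementation: the class name `EventBuilder`, the integer trigger and source identifiers, and the byte payloads are all assumptions made for illustration. It only shows the bookkeeping idea that an event is complete once one fragment from every source has arrived.

```python
from collections import defaultdict


class EventBuilder:
    """Toy event builder: collects one fragment per source for each trigger."""

    def __init__(self, n_sources):
        self.n_sources = n_sources
        # trigger_id -> {source_id: payload}
        self.pending = defaultdict(dict)

    def add_fragment(self, trigger_id, source_id, payload):
        """Store a fragment; return the full event once all sources reported."""
        fragments = self.pending[trigger_id]
        fragments[source_id] = payload
        if len(fragments) == self.n_sources:
            del self.pending[trigger_id]
            # Return the fragments ordered by source id.
            return [fragments[s] for s in sorted(fragments)]
        return None


# Hypothetical usage with 3 sources instead of ~300.
builder = EventBuilder(n_sources=3)
first = builder.add_fragment(trigger_id=7, source_id=0, payload=b"velo")
second = builder.add_fragment(trigger_id=7, source_id=2, payload=b"muon")
event = builder.add_fragment(trigger_id=7, source_id=1, payload=b"rich")
```

Fragments may arrive in any order; the builder returns `None` until the last missing source has delivered its data.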
Terminology
• channel: elementary sensitive element = 1 ADC = 8 to 10 bits. The entire detector comprises millions of channels
• event: all data fragments (comprising several channels) created at the same discrete time together form an event. It is an electronic snap-shot of the detector response to the original physics reaction
• zero-suppression: send only channel-numbers of non-zero value channels (applying a suitable threshold)
• packing-factor: number of event-fragments (“triggers”) packed into a single packet/message
– reduces the message rate
– optimises bandwidth usage
– is limited by the number of CPU cores in the receiving CPU (to guarantee prompt processing and thus limit latency)
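Two of the terms above, zero-suppression and packing-factor, can be sketched in a few lines. This is a minimal illustration under assumed conventions: the function names, the list-of-ADC-values input and the threshold value are hypothetical and not taken from the TELL1 firmware.

```python
def zero_suppress(adc_values, threshold):
    """Return (channel number, value) pairs for channels above threshold."""
    return [(ch, v) for ch, v in enumerate(adc_values) if v > threshold]


def pack(fragments, packing_factor):
    """Group consecutive event fragments into messages of packing_factor triggers."""
    return [
        fragments[i:i + packing_factor]
        for i in range(0, len(fragments), packing_factor)
    ]


# Eight hypothetical channels; only two exceed the threshold.
raw = [0, 3, 0, 0, 57, 1, 0, 120]
hits = zero_suppress(raw, threshold=5)

# Twenty hypothetical fragments packed 8 per message give 3 messages.
messages = pack(list(range(20)), packing_factor=8)
```

With a packing factor of 8 the number of messages drops by roughly a factor of 8, at the price of a per-message processing burst on the receiving side.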
Following the data-flow
[Figure: data flow for one example event (B → φ Ks). Front-end electronics of VELO, TT, IT, OT, CALO, RICH, MUON and L0 feed the TELL1 / UKL1 boards; 400 links carrying 35 GByte/s lead through the readout network switches to the PCs of the event filter farm (50 subfarms). After an L0 “yes” the TFC system distributes the MEP destination (PC #876); the VELO and RICH fragments #1 and #2 are packed into MEPs and sent to PC #876, where the HLT process receives them; the PC issues a MEP request. A storage system is also shown.]
Data pre-processing: The LHCb common Readout Board TELL1
[Figure: TELL1 block diagram with A-Rx / O-Rx receiver cards (inputs from FE), four PP-FPGAs each with an L1B, the SyncLink-FPGA, TTCrx, ECS, throttle, and RO-Tx with 4 x 1000 BaseT output]
Receiver cards get data from detector via optical fibres
FPGAs do pre-processing, zero-suppression and data formatting (into IP packets)
FPGA attached to Ethernet Quad-MAC on SPI3 bus (simple FIFO protocol)
IP packets are pushed out to the Data Acquisition on a private LAN over 4 x 1000 BaseT links
Improving the LHCb trigger
• Triggering is filtering. The quality of the trigger is determined (using simulated data) by measuring how many of the possible good events are selected:
efficiency ε = N_good-selected / N_good-all
• Each stage has its own efficiency. LHCb loses mostly in the “L0” step: 40 MHz → 1 MHz
• Reason: only coarse information (“high pT”) used
• Solution: reconstruct secondary vertices at collision rate 40 MHz!
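The efficiency definition above can be written out as a short sketch. The function names and the numbers (0.5 for L0, 0.9 for the HLT) are purely hypothetical and are not LHCb figures. The sketch also assumes that each stage's efficiency is measured relative to the good events that passed the previous stage, so that the overall efficiency is the product of the stage efficiencies.

```python
def efficiency(n_good_selected, n_good_all):
    """Trigger efficiency: fraction of good events that are selected."""
    if n_good_all <= 0:
        raise ValueError("n_good_all must be positive")
    return n_good_selected / n_good_all


def chain_efficiency(stage_efficiencies):
    """Overall efficiency of consecutive trigger stages (product of stages)."""
    total = 1.0
    for eff in stage_efficiencies:
        total *= eff
    return total


# Hypothetical numbers, for illustration only.
eff_l0 = efficiency(n_good_selected=500, n_good_all=1000)
eff_total = chain_efficiency([eff_l0, 0.9])
```

Because the stage efficiencies multiply, the stage with the lowest efficiency dominates the overall loss.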
Upgrade
We want to have a DAQ and Event filter which:
• allows for vertex triggering at collision rate (40 MHz)
• fits within the existing infrastructure:
– 1 MW power and cooling
– 50 racks with a total space of 2200 Us
• preserves the main good features of the current LHCb DAQ
– simple, scalable, industry-standard technologies, as much as possible commodity items
• costs < 10^7 of a reasonable currency
Two Options
• Two stage readout:
– Readout ~ 10 kB @ 40 MHz. Data are buffered in the FL1 for a suitable amount of time: 40 ms (?)
– Algorithm on event-filter farm selects 1 MHz of “good” events and informs (how?) FL1 boards of its decision (yes/continue – no/discard)
– In case of “yes” the entire detector is read out: 35 kB @ 1 MHz
• Always read out entire detector: 35 kB @ 40 MHz (“brute force”)
Full read-out at 40 MHz
• At a collision rate of 40 MHz, the data rate for a full readout is ~ 1400 GB/s, or ~ 12 Tb/s
– network with ~ 2 x 1200 x 10 Gigabit ports
• Need several switches as building blocks → optimised topology highly desirable (non-Banyan)
• Advantages:
– No latency constraints
– Lower memory requirements on the FL1
• Disadvantages:
– Huge, expensive
– Almost all of the data shipped will never be looked at (physics algorithms do not change much)
– Requires zero-suppression and FPGA pre-processing for all detector data at 40 MHz (not obvious)
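The bandwidth figures above follow from simple arithmetic, sketched here with the 35 kB event size used elsewhere in this talk. The constant names are assumptions, and 1 kB is taken as 1000 bytes. The raw payload comes out at about 11.2 Tb/s and 1120 ports per side; the ~ 12 Tb/s and ~ 1200 ports quoted above are of the same order once some headroom is added (the headroom is an assumption here, not stated on the slide).

```python
EVENT_SIZE_BYTES = 35_000          # full event, 35 kB
COLLISION_RATE_HZ = 40_000_000     # 40 MHz
PORT_SPEED_BPS = 10_000_000_000    # one 10 Gigabit port

# Raw payload rate, without any protocol overhead.
rate_bytes_per_s = EVENT_SIZE_BYTES * COLLISION_RATE_HZ
rate_bits_per_s = rate_bytes_per_s * 8

# Minimum number of fully loaded ports on one side of the network
# (integer ceiling division).
min_ports_per_side = (rate_bits_per_s + PORT_SPEED_BPS - 1) // PORT_SPEED_BPS

print(f"{rate_bytes_per_s / 1e9:.0f} GB/s")
print(f"{rate_bits_per_s / 1e12:.1f} Tb/s")
print(f"{min_ports_per_side} ports per side")
```

The factor 2 in "2 x 1200" accounts for the input side (from the readout boards) and the output side (towards the farm).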
Parameters / Assumptions
• Vertex reconstruction requires only a subset of the total event of roughly 10 kB @ 40 MHz (essentially the VertexLocator of the future + some successor of TT)
• FE with full 40 MHz readout capability
• We have at our disposal the successor of the TELL1, the FL1*, which has several 10 Gigabit output links and can do pre-processing / zero-suppression at the required rate
• Several triggers are packed into an MTP. This reduces the message rate from each board. In this presentation we assume 8 triggers per message ⇒ R_TX-message = 5 MHz (per FL1)
(*) FL1 for Future L1 or Fast L1 or FormuLa 1
Data pre-processing: A new readout-board: FL1
[Figure: FL1 block diagram with O-Rx receiver cards (inputs from FE), four PP-FPGAs each with an L1B, the SyncLink-FPGA, TTC, ECS, throttle, host processor, and RO-Tx with sync info and 4 x CX4 output]
Receiver cards get data from detector via optical fibres
FPGAs do pre-processing, zero-suppression and data formatting
FPGA attached to HCA on ??? bus (are there alternatives to PCIe?)
Output to the Data Acquisition private LAN on (up to) 4 x CX4 cables
Host processor needed (??) to handle complex protocol stack
Event filter farm for upgraded LHCb
• We need an event-filter which can absorb 4 * 10^7 * 10 kB/s + 10^6 * 35 kB/s ~ 435 GB/s!
• Assume 2000 servers:
– A server is something which takes one U in space and has two processor sockets
– Each socket holds a chip, which comprises several CPU cores
• Each server must accept ~ 210 MB/s as 500 kHz of messages of ~ 400 Bytes
• Options for attaching servers to network:
– 3 Gigabit links as a trunk: not very practical because we would have to bring > 130 links into one rack!
– Use an (underused) 10 Gigabit link
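The farm input figures above can be checked with a short calculation. The constant names are assumptions and 1 kB is taken as 1000 bytes; the inputs are the rates and sizes quoted on this slide. The sketch also estimates how lightly a single 10 Gigabit link per server would be loaded, which is what "underused" refers to.

```python
VERTEX_RATE_HZ = 40_000_000      # first-stage readout rate
VERTEX_SIZE_BYTES = 10_000       # ~10 kB vertex data per event
FULL_RATE_HZ = 1_000_000         # accepted events read out in full
FULL_SIZE_BYTES = 35_000         # ~35 kB full event
N_SERVERS = 2000
LINK_SPEED_BPS = 10_000_000_000  # one 10 Gigabit link per server

total_bytes_per_s = (VERTEX_RATE_HZ * VERTEX_SIZE_BYTES
                     + FULL_RATE_HZ * FULL_SIZE_BYTES)
per_server_bytes_per_s = total_bytes_per_s // N_SERVERS
link_utilisation = per_server_bytes_per_s * 8 / LINK_SPEED_BPS

# Cross-check against the message view: 500 kHz of ~400 byte messages.
message_view_bytes_per_s = 500_000 * 400

print(f"total: {total_bytes_per_s / 1e9:.0f} GB/s")
print(f"per server: {per_server_bytes_per_s / 1e6:.1f} MB/s")
print(f"10 Gigabit link utilisation: {link_utilisation:.1%}")
```

Both views give roughly 200 MB/s per server, which is more than a single Gigabit link can carry but less than a fifth of a 10 Gigabit link.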
Server Horoscopes
• Quad-core processors from Intel and AMD will most likely be available in 2007
• Could we have “Octo-cores” by end of 2008?
• Can thus assume to have 8 cores running at 2 to 2.4 GHz (prob. not more!) in one U.
• Commitment by Intel and AMD: power consumption per processor < 100 W
• Reasonable rumors:
– 2007 will see first mainboards with 10 Gigabit interface on board: most likely CX4 for either 10 Gigabit Ethernet or Infiniband (?)
CPU power for triggering / latency / buffering
• Assuming 2000 servers / 16000 cores and 40 MHz of events, each core has on average 0.4 ms (16000 cores / 40 MHz, i.e. 2.5 kHz of events per core) to reach a decision when processing the ~ 10 kB of vertex-detector data
→ should have at least 40 ms buffering in the FL1s to cope with fluctuations in processing time (the processing time distribution is known to have long tails)
• Assuming 400 FL1 means that they have to have 12.5 GB buffer memory
[Figure: proposed two-stage readout. LHCb detector; front-end L1 boards 1 to 400 with 4 x 10 Gbps links; high-density switches 1 to 4 with 400 ports “in” per switch; fabric; farm racks 1 to 50, each with 1 x 32-port switch or 2 x 16-port switches, 65 Gbps per rack. Distance labels: 30 m+, 60 m, 20 m+?]
1. Read out 10 kB events @ 40 MHz, buffer on FL1
2. Send to farm for trigger decision
3. Send trigger decision to FL1
4. Receive trigger decision
5. If trigger decision is positive, read out 35 kB @ 1 MHz
Power Consumption
• Probably need 512 MB per core (trigger process)
• x 8 ==> 4 GB
• 4 GB of high-speed memory + onboard 10 Gigabit interface will also need power (assume conservatively 50 W)
• The 1 U box should stay below 300 W
• Total power for CPUs < 600 kW
• 10 Gigabit distribution switches also need power (should count at least 250 W)
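The power budget above can be summarised in a short sketch. The constant names are assumptions; the values are those quoted in this talk (2000 servers with two sockets each, < 100 W per processor, 50 W for memory and network interface, a 300 W ceiling per 1 U box, and the 1 MW site limit from the upgrade requirements).

```python
N_SERVERS = 2000
SOCKETS_PER_SERVER = 2
WATTS_PER_PROCESSOR = 100      # upper bound committed to by the vendors
WATTS_MEMORY_AND_NIC = 50      # conservative assumption from the slide
SERVER_CEILING_WATTS = 300     # the 1 U box should stay below this
SITE_BUDGET_WATTS = 1_000_000  # 1 MW power and cooling

CORES_PER_SERVER = 8
MEMORY_PER_CORE_MB = 512

memory_per_server_mb = CORES_PER_SERVER * MEMORY_PER_CORE_MB
server_estimate_watts = (SOCKETS_PER_SERVER * WATTS_PER_PROCESSOR
                         + WATTS_MEMORY_AND_NIC)
farm_ceiling_watts = N_SERVERS * SERVER_CEILING_WATTS
remaining_watts = SITE_BUDGET_WATTS - farm_ceiling_watts

print(f"memory per server: {memory_per_server_mb} MB")
print(f"estimated server power: {server_estimate_watts} W")
print(f"farm ceiling: {farm_ceiling_watts / 1000:.0f} kW")
print(f"left for switches and everything else: {remaining_watts / 1000:.0f} kW")
```

The estimate per server (250 W) leaves some margin below the 300 W ceiling, and the farm ceiling of 600 kW leaves 400 kW of the site budget for the switches (at least 250 W each) and everything else.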
Open questions
• Can an FPGA drive the HCA or do we need an embedded host-processor with an OS?
• It would be nice to centrally assign the next destination (server) to all FL1 boards. This means determining the Queue Pair number and DLID/DGID to send a message to. Can we use the Infiniband network for this as well?
• Almost the entire traffic is unidirectional (from the FL1s to the servers). Can we take advantage of this fact?
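The central destination assignment raised in the second question could look, in its simplest form, like the sketch below. Everything here is an assumption for illustration: the round-robin policy, the `Destination` class and the DLID and Queue Pair numbers are hypothetical, and no Infiniband library is used. The sketch only shows the idea that one central source hands the same next destination to all FL1 boards for each message.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Destination:
    dlid: int  # destination local identifier (hypothetical value)
    qpn: int   # Queue Pair number (hypothetical value)


def destination_stream(servers):
    """Endless round-robin sequence of destinations (assumed policy)."""
    return cycle(servers)


# Three hypothetical farm servers instead of ~2000.
servers = [Destination(dlid=0x10 + i, qpn=0x100 + i) for i in range(3)]
stream = destination_stream(servers)

# One destination per message; every FL1 board would use the same one.
assignments = [next(stream) for _ in range(5)]
```

A real assignment scheme would also have to skip busy or failed servers, which a plain round-robin does not do.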