
Future experiment specific needs for LHCb

OpenFabrics/Infiniband Workshop at CERN

Monday June 26

Sai Suman Cherukuwada and Niko Neufeld, CERN/PH








2

LHCb Trigger-DAQ system: Today

• LHC crossing-rate: 40 MHz

• Visible events: 10 MHz

• Two stage trigger system
– Level-0: synchronous in hardware; 40 MHz → 1 MHz
– High Level Trigger (HLT): software on CPU-farm; 1 MHz → 2 kHz

• Front-end Electronics (FE): interface to Readout Network

• Readout network
– Gigabit Ethernet LAN
– Full readout at 1 MHz

• Event filter farm
– ~ 1800 to 2200 1 U servers

[Figure: Trigger-DAQ block diagram. Labels: FE (front-end electronics), L0 trigger, Timing and Fast Control, Readout network, CPU (event filter farm), permanent storage.]


3

LHCb DAQ system: features

• On average every 1 µs new data become available at each of ~ 300 sources (= custom electronics boards, “TELL1”)

• Data from several 1 µs cycles (= “triggers”) are concatenated into 1 IP packet → reduces message / packet rate

• IP packets are pushed over 1000 BaseT links → short distances allow using 1000 BaseT throughout

• Destination IP-address is synchronously assigned via a custom optical network (TTC) to all TELL1s

• For each trigger a PC-server must receive IP packets from all TELL1 boards (“event-building”).
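The event-building step above can be illustrated with a minimal sketch. It assumes a toy setup of three sources and two triggers per packet; the names `Packet` and `build_events` are illustrative only and do not come from the LHCb software.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Packet:
    """Toy packet: one source's fragments for several triggers."""
    source_id: int
    fragments: dict  # trigger number -> payload bytes


def build_events(packets, n_sources):
    """Group fragments by trigger; keep only triggers seen from all sources."""
    by_trigger = defaultdict(dict)
    for packet in packets:
        for trigger, payload in packet.fragments.items():
            by_trigger[trigger][packet.source_id] = payload
    return {
        trigger: [frags[s] for s in sorted(frags)]
        for trigger, frags in by_trigger.items()
        if len(frags) == n_sources
    }


# Three hypothetical sources, each packing triggers 0 and 1 into one packet.
packets = [
    Packet(s, {0: f"s{s}t0".encode(), 1: f"s{s}t1".encode()})
    for s in range(3)
]
events = build_events(packets, n_sources=3)
incomplete = build_events(packets[:2], n_sources=3)  # one source missing
```

An event is only complete once every source has contributed, which is why `incomplete` stays empty when one of the three packets is missing.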


4

Terminology

• channel: elementary sensitive element = 1 ADC = 8 to 10 bits. The entire detector comprises millions of channels

• event: all data fragments (comprising several channels) created at the same discrete time together form an event. It is an electronic snap-shot of the detector response to the original physics reaction

• zero-suppression: send only channel-numbers of non-zero value channels (applying a suitable threshold)

• packing-factor: number of event-fragments (“triggers”) packed into a single packet/message
– reduces the message rate
– optimises bandwidth usage
– is limited by the number of CPU cores in the receiving CPU (to guarantee prompt processing and thus limit latency)
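As a small illustration of the zero-suppression entry above, the following sketch keeps only channels whose value exceeds a threshold. The ADC values and the threshold are made up, and keeping the ADC value next to the channel number is an assumption of this sketch rather than something the slides specify.

```python
def zero_suppress(adc_values, threshold):
    """Return (channel number, ADC value) pairs for channels above threshold."""
    return [(ch, value) for ch, value in enumerate(adc_values) if value > threshold]


# Eight hypothetical channels; most carry only noise below the threshold.
raw = [0, 3, 0, 0, 17, 1, 0, 42]
hits = zero_suppress(raw, threshold=2)
```

Only three of the eight channels survive here, which is the data reduction that zero-suppression is meant to provide.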


5

Following the data-flow

[Figure: data-flow animation. Labels: sub-detectors VELO, TT, IT, OT, CALO, RICH, MUON, L0; Front-end Electronics (TELL1 and UKL1 boards); 400 links, 35 GByte/s; TFC System; Storage System; Readout Network; switches and PCs of the Event Filter Farm (50 subfarms). Animation steps: L0 “yes”; MEP destination PC #876; VELO and RICH fragments #1 and #2 packed into a VELO MEP and a RICH MEP addressed to PC #876; HLT process on PC #876; MEP request from PC #876.]


6

Data pre-processing: The LHCb common Readout Board TELL1

[Figure: TELL1 block diagram. Labels: FE inputs (x 4), A-RxCard and O-RxCard receiver cards, four PP-FPGAs each with an L1B, SyncLink-FPGA, RO-Tx, TTCrx, ECS, TTC, Throttle, 4 x 1000 BaseT output.]

Receiver cards get data from detector via optical fibres

FPGAs do pre-processing, zero-suppression and data formatting (into IP packets)

FPGA attached to Ethernet Quad-MAC on SPI3 bus (simple FIFO protocol)

IP packets are pushed out to the Data Acquisition on a private LAN over 4 x 1000 BaseT links


7

Improving the LHCb trigger

• Triggering is filtering. The quality of the trigger is determined (using simulated data) by measuring how many good events of the possible good events are selected:

efficiency ε = N_good-selected / N_good-all

• Each stage has its own efficiency. LHCb loses mostly in the “L0” step: 40 MHz → 1 MHz

• Reason: only coarse information (“high pT”) used

• Solution: reconstruct secondary vertices at collision rate 40 MHz!
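The efficiency definition above can be written out as a short sketch. The event counts are invented for illustration, and the product rule assumes that each stage's efficiency is measured on the good events that survived the previous stage.

```python
def efficiency(n_good_selected, n_good_all):
    """Fraction of the good events that the trigger stage selects."""
    if n_good_all <= 0:
        raise ValueError("need at least one good event")
    return n_good_selected / n_good_all


def combined_efficiency(stage_efficiencies):
    """Overall efficiency of sequential stages (product of per-stage values)."""
    total = 1.0
    for eff in stage_efficiencies:
        total *= eff
    return total


# Made-up simulated sample: 1000 good events, 500 pass L0, 400 of those pass HLT.
eff_l0 = efficiency(500, 1000)
eff_hlt = efficiency(400, 500)
eff_total = combined_efficiency([eff_l0, eff_hlt])
```

With these invented counts the first stage dominates the loss, which mirrors the qualitative statement on the slide.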


8

Upgrade

We want to have a DAQ and Event filter which:

• allows for vertex triggering at collision rate (40 MHz)

• fits within the existing infrastructure:
– 1 MW power and cooling
– 50 racks with a total space of 2200 Us

• preserves the main good features of the current LHCb DAQ
– simple, scalable, industry-standard technologies, as much as possible commodity items

• costs < 10^7 of a reasonable currency


9

Two Options

• Two stage readout:
– Readout ~ 10 kB @ 40 MHz. Data are buffered in the FL1 for a suitable amount of time: 40 ms (?)
– Algorithm on event-filter farm selects 1 MHz of “good” events and informs (how?) FL1 boards of its decision (yes/continue – no/discard)
– In case of “yes” the entire detector is read out: 35 kB @ 1 MHz

• Always read out entire detector 35 kB @ 40 MHz (“brute force”)
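The two-stage option can be sketched as a toy buffer that holds the full event until the farm's decision arrives. Everything here is hypothetical: the class `Fl1Buffer`, its methods, and the placeholder `farm_decision` are illustrative names, and the selection rule is not a physics algorithm.

```python
class Fl1Buffer:
    """Toy stand-in for an FL1 board's event buffer (hypothetical class)."""

    def __init__(self):
        self._pending = {}  # event id -> full detector data awaiting a decision

    def first_stage(self, event_id, vertex_data, full_data):
        """Buffer the full event and return the subset shipped to the farm."""
        self._pending[event_id] = full_data
        return vertex_data

    def second_stage(self, event_id, accept):
        """Apply the farm's decision: release the full event or discard it."""
        full_data = self._pending.pop(event_id)
        return full_data if accept else None

    def occupancy(self):
        """Number of events still waiting for a decision."""
        return len(self._pending)


def farm_decision(vertex_data):
    """Placeholder selection, NOT a physics algorithm: accept non-empty data."""
    return len(vertex_data) > 0


board = Fl1Buffer()
results = {}
for event_id, vertex, full in [(1, b"vtx", b"full-1"), (2, b"", b"full-2")]:
    shipped = board.first_stage(event_id, vertex, full)
    results[event_id] = board.second_stage(event_id, farm_decision(shipped))
```

The buffer occupancy between `first_stage` and `second_stage` is what drives the memory requirement on the FL1 in this option.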


10

Full read-out at 40 MHz

• At a collision rate of 40 MHz, the data rate for a full readout is ~ 1400 GB/s, or ~ 12 Tb/s
– network with ~ 2 x 1200 x 10 Gigabit ports

• Need several switches as building blocks → optimised topology highly desirable (non-Banyan)

• Advantages:
– No latency constraints
– Less memory requirements on the FL1

• Disadvantages:
– Huge, expensive
– Almost all of the data shipped will never be looked at (physics algorithms do not change much)
– Requires zero-suppression and FPGA pre-processing for all detector data at 40 MHz (not obvious)
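The bandwidth figures for the full readout follow from the 35 kB event size and the 40 MHz collision rate. The sketch below redoes that arithmetic with decimal units (1 kB = 1000 bytes), which is an assumption; the slide rounds the results up to ~ 12 Tb/s and ~ 1200 ports.

```python
EVENT_SIZE_BYTES = 35_000          # 35 kB per full event (decimal kB assumed)
COLLISION_RATE_HZ = 40_000_000     # 40 MHz
PORT_SPEED_BPS = 10_000_000_000    # one 10 Gigabit port

bytes_per_s = EVENT_SIZE_BYTES * COLLISION_RATE_HZ   # 1.4e12 B/s = 1400 GB/s
bits_per_s = 8 * bytes_per_s                         # 1.12e13 b/s = 11.2 Tb/s

# Ceiling division: minimum number of fully loaded ports on one side of the network.
min_ports_per_side = -(-bits_per_s // PORT_SPEED_BPS)
```

The minimum of 1120 ports assumes every link runs at full line rate, so the rounded-up 1200 leaves a small margin.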


11

Parameters / Assumptions

• Vertex reconstruction requires only a subset of the total event of roughly 10 kB @ 40 MHz (essentially the VertexLocator of the future + some successor of TT)

• FE with full 40 MHz readout capability

• We have at our disposal the successor of the TELL1, the FL1*, which has several 10 Gigabit output links and can do pre-processing / zero-suppression at the required rate

• Several triggers are packed into an MTP. This reduces the message rate from each board. In this presentation we assume 8 triggers per message ⇒ RTX-message rate = 5 MHz (per FL1)

(*) FL1 for Future L1 or Fast L1 or FormuLa 1
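The 5 MHz message rate per FL1 is simply the trigger rate divided by the packing factor. A minimal sketch of that relation, with an illustrative helper name:

```python
def message_rate_hz(trigger_rate_hz, packing_factor):
    """Message rate per board when several triggers share one message."""
    if packing_factor < 1:
        raise ValueError("packing factor must be at least 1")
    return trigger_rate_hz / packing_factor


rate_per_fl1 = message_rate_hz(40_000_000, 8)  # 8 triggers per message

# How the rate falls as more triggers are packed into each message.
rates = {n: message_rate_hz(40_000_000, n) for n in (1, 2, 4, 8, 16)}
```

A larger packing factor lowers the message rate further, at the price of more events waiting in the receiving server.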


12

Data pre-processing: A new readout-board: FL1

[Figure: FL1 block diagram. Labels: FE inputs (x 4), O-RxCard receiver cards, four PP-FPGAs each with an L1B, SyncLink-FPGA, RO-Tx, SyncInfo, Host Processor, ECS, TTC, Throttle, 4 x CX4 output.]

Receiver cards get data from detector via optical fibres

FPGAs do pre-processing, zero-suppression and data formatting

FPGA attached to HCA on ??? bus (are there alternatives to PCIe?)

Output to the Data Acquisition private LAN on (up to) 4 x CX4 cables

Host processor needed (??) to handle complex protocol stack


13

Event filter farm for upgraded LHCb

• We need an event-filter which can absorb 4 * 10^7 * 10 kB/s + 10^6 * 35 kB/s ~ 435 GB/s!

• Assume 2000 servers:
– A server is something which takes one U in space and has two processor sockets
– Each socket holds a chip, which comprises several CPU cores

• Each server must accept ~ 210 MB/s as 500 kHz of messages of ~ 400 Bytes

• Options for attaching servers to network:
– 3 Gigabit links as a trunk: not very practical because we would have to bring > 130 links into one rack!
– Use an (underused) 10 Gigabit link
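The farm input rate and the per-server load can be checked with a few lines of arithmetic. The sketch uses decimal units and, for the link count, assumes racks fully populated with one-U servers (2200 U spread over 50 racks, taken from the infrastructure limits on the Upgrade slide).

```python
VERTEX_STREAM = 40_000_000 * 10_000   # 40 MHz x 10 kB, bytes per second
FULL_STREAM = 1_000_000 * 35_000      # 1 MHz x 35 kB, bytes per second
N_SERVERS = 2000

total_bytes_per_s = VERTEX_STREAM + FULL_STREAM      # 4.35e11 B/s = 435 GB/s
per_server_bytes_per_s = total_bytes_per_s // N_SERVERS

# Assumption: 2200 U over 50 racks, one server per U, 3 Gigabit links each.
servers_per_rack = 2200 // 50
links_per_rack = 3 * servers_per_rack
```

The per-server figure of about 217 MB/s is close to the ~ 210 MB/s quoted on the slide, and 132 links per rack agrees with the "> 130 links" remark.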


14

Server Horoscopes

• Quad-core processors from Intel and AMD will most likely be available in 2007

• Could we have “Octo-cores” by end of 2008?

• Can thus assume to have 8 cores running at 2 to 2.4 GHz (prob. not more!) in one U.

• Commitment by Intel and AMD: power consumption per processor < 100 W

• Reasonable rumors:
– 2007 will see first mainboards with 10 Gigabit interface on board: most likely CX4 for either 10 Gigabit Ethernet or Infiniband (?)


15

CPU power for triggering / latency / buffering

• Assuming 2000 servers / 16000 cores and 40 MHz of events, each core handles on average 2.5 kHz of events and so has about 0.4 ms to reach a decision when processing the ~ 10 kB of vertex-detector data → should have at least 40 ms buffering in the FL1s to cope with fluctuations in processing time (the processing time distribution is known to have long tails)

• Assuming 400 FL1 means that they have to have 12.5 GB buffer memory


16

[Figure: two-stage readout topology. Labels: LHCb Detector; 30 m+; Front-End L1 Boards 1 2 3 … 400; 4 x 10 Gbps links; High Density Switches 1 2 3 4; Fabric; 400 ports “in” per switch; 60 m; 20 m+ (?); Rack1 … Rack50; 65 Gbps per rack; farm racks with 1 x 32-port switch or 2 x 16-port switches.]

1. Read out 10 kB events @ 40 MHz, buffer on FL1

2. Send to farm for trigger decision

3. Send trigger decision to FL1

4. Receive trigger decision

5. If trigger decision positive, read out 35 kB @ 1 MHz


17

Power Consumption

• Probably need 512 MB per core (trigger process)

• x 8 ==> 4 GB

• 4 GB of high-speed memory + onboard 10 Gigabit interface will also need power (assume conservatively 50 W)

• The 1 U box should stay below 300 W

• Total power for CPUs < 600 kW

• 10 Gigabit distribution switches also need power (should count at least 250 W)
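The power budget above can be tallied in a short sketch. It combines the per-processor limit and the two-socket server assumed on earlier slides with the 50 W allowance for memory and network interface; switch power is left out because the slides do not give a switch count.

```python
CORES_PER_SERVER = 8
MEMORY_PER_CORE_MB = 512
SOCKETS_PER_SERVER = 2          # from the server definition on an earlier slide
WATTS_PER_PROCESSOR = 100       # upper limit per processor
WATTS_MEMORY_AND_NIC = 50       # conservative allowance from this slide
BOX_LIMIT_W = 300
N_SERVERS = 2000
SITE_LIMIT_W = 1_000_000        # 1 MW power and cooling

memory_per_server_mb = CORES_PER_SERVER * MEMORY_PER_CORE_MB   # 4096 MB = 4 GB
box_estimate_w = SOCKETS_PER_SERVER * WATTS_PER_PROCESSOR + WATTS_MEMORY_AND_NIC
farm_limit_w = N_SERVERS * BOX_LIMIT_W                         # 600 kW
headroom_w = SITE_LIMIT_W - farm_limit_w                       # left for switches etc.
```

The 250 W estimate per box stays under the 300 W target, and 2000 boxes at that target reproduce the 600 kW total quoted on the slide.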


18

Open questions

• Can an FPGA drive the HCA or do we need an embedded host-processor with an OS?

• It would be nice to centrally assign the next destination (server) to all FL1 boards. This means determining the Queue Pair number and DLID/DGID to send a message to. Can we use the Infiniband network for this as well?

• Almost the entire traffic is unidirectional (from the FL1s to the servers). Can we take advantage of this fact?
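One way to picture the central destination assignment raised above is a table of (DLID, Queue Pair number) entries that is cycled through, one entry per message. The round-robin policy, the `Destination` type and all numeric values are assumptions made for this sketch; the slides do not specify how destinations would be chosen, and no Infiniband API is used.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Destination:
    """Hypothetical address of one farm server as seen by an FL1 board."""
    dlid: int  # destination local identifier
    qpn: int   # queue pair number


def destination_sequence(servers):
    """Endless round-robin over the server table (assumed policy)."""
    return cycle(servers)


# Three made-up servers; a central source would broadcast the chosen entry
# so that all FL1 boards send the same trigger to the same server.
servers = [Destination(dlid=0x10 + i, qpn=0x100 + i) for i in range(3)]
sequence = destination_sequence(servers)
first_four = [next(sequence) for _ in range(4)]
```

After the last server the sequence wraps around to the first one, so every board that follows the same sequence stays in step.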
