
Page 1: Graph Data Structures and Graph Neural Networks in High Energy Physics

TRACKING WITH GRAPHS

XIANGYANG JU, DANIEL MURNANE

ON BEHALF OF THE EXA.TRKX COLLABORATION

Page 2

OVERVIEW

1. Why ML/GNNs for High Energy Physics?
2. GNN applications in HEP / Exa.TrkX
3. ML Tracking Pipeline
4. Metric Learning
   Code Walkthrough (PyTorch)
5. Doublet & Triplet GNN
6. DBSCAN for TrackML Score
   Code Walkthrough (TensorFlow)

Page 3

WHY MACHINE LEARNING FOR TRACKING?

The high-luminosity scaling problem means we need something to complement traditional tracking algorithms. But why graphs?

In other words…

[Plot: computing power vs. time, energy, and number of collisions. Predicted capacity grows far more slowly than the cost of traditional methods, which scale quadratically. HL-LHC, 14 TeV, 2024–2026: 6 billion events/second.]

Page 4

WHY GRAPHS SPECIFICALLY?

Graphs can capture the inherent sparsity of much physics data.

[Figure: hits to graphs]

Page 5

WHY GRAPHS?

Graphs can capture the inherent sparsity of much physics data.

Graphs can capture the manifold and relational structure of much physics data.

Conversion to and from graphs allows manipulation of dimensionality.

Graph Neural Networks are booming (i.e. we wouldn't be talking about graphs if there weren't a wealth of classic algorithms and NN models for graph data).

Industry research and investment mean a good outlook for software and hardware optimised for graphs.

Page 6

APPLICATIONS

TrackML dataset ~ HL-LHC silicon: https://indico.cern.ch/event/831165/contributions/3717124/

High Granularity Calorimeter data: https://arxiv.org/abs/2003.11603

LArTPC data ~ DUNE experiment: https://indico.cern.ch/event/852553/contributions/4059542/

Quantum GNN for Particle Track Reconstruction: https://indico.cern.ch/event/852553/contributions/4057625/

GNNs on FPGAs for Level-1 Trigger: https://indico.cern.ch/event/831165/contributions/3758961/


Page 8

THE EXATRKX PROJECT

MISSION

Optimization, performance and validation studies of ML approaches to the Exascale tracking problem, to enable production-level tracking on next-generation detector systems.

PEOPLE

• Caltech: Joosep Pata, Maria Spiropulu, Jean-Roch Vlimant, Alexander Zlokapa
• Cincinnati: Adam Aurisano, Jeremy Hewes
• FNAL: Giuseppe Cerati, Lindsey Gray, Thomas Klijnsma, Jim Kowalkowski, Gabriel Perdue, Panagiotis Spentzouris
• LBNL: Paolo Calafiura (PI), Nicholas Choma, Sean Conlon, Steve Farrell, Xiangyang Ju, Daniel Murnane, Prabhat
• ORNL: Aristeidis Tsaris
• Princeton: Isobel Ojalvo, Savannah Thais
• SLAC: Pierre Cote De Soux, Francois Drielsma, Kazuhiro Terao, Tracy Usher

https://exatrkx.github.io/

Page 9

THE PHYSICAL PROBLEM

• "TrackML Kaggle Competition" dataset
• Generated by an HL-LHC-like tracking (ACTS) simulation
• 9000 events to train on
• Each event has up to 100,000 layer hits from around 10,000 particles
• Layers can be hit multiple times by the same particle ("duplicates")
• Non-particle hits present ("noise")


Page 11

THE PHYSICAL PROBLEM

• Need to construct hit data into graph data, i.e. nodes and edges
• Can use geometric heuristics (used in the past: ~45% efficiency, 5% purity)
• To improve performance, use learned embedding construction
• The ideal final result is a "TrackML score" $S \in [0, 1]$
• All hits belonging to the same track labelled with the same unique label ⇒ $S = 1$

A scoring sketch follows below.
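For concreteness, a minimal sketch of computing this score with the TrackML utility library released for the Kaggle competition (assuming its score_event helper and the competition's hit_id / particle_id / track_id / weight column conventions):

```python
import pandas as pd
from trackml.score import score_event  # Kaggle TrackML utility library

# truth: one row per hit, with the true particle assignment and a per-hit weight
truth = pd.DataFrame({
    "hit_id":      [1, 2, 3, 4],
    "particle_id": [7, 7, 7, 9],
    "weight":      [0.25, 0.25, 0.25, 0.25],
})
# submission: our reconstructed labels; all hits of a track share one unique label
submission = pd.DataFrame({
    "hit_id":   [1, 2, 3, 4],
    "track_id": [1, 1, 1, 2],
})

score = score_event(truth, submission)  # S in [0, 1]; perfect labelling gives S = 1
print(score)
```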

Page 12

THE PHYSICAL PROBLEM

• We will follow a particular event, let's say event #4692
• We will follow a particular particle, let's say particle #1058360480961134592
• It's a mouthful, so let's call her Diane

[Event display: Diane's hits in the detector, with x, y, z axes]

Page 13

TRACKING PIPELINE

1. Metric Learning
2. Doublet GNN
3. (Optional) Triplet GNN
4. DBSCAN → TrackML score

Pipeline: raw hit data embedded → filter likely, adjacent doublets → train/classify doublets in GNN → filter, convert to triplets → train/classify triplets in GNN → apply cut for seeds → DBSCAN for track labels. (A DBSCAN sketch follows below.)
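As a hedged illustration of the final step, a minimal scikit-learn DBSCAN sketch that turns embedded hit positions into track labels (the eps and min_samples values are placeholders, not the pipeline's tuned settings):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# embedded: (num_hits, N) array of hit positions in the learned embedding space
embedded = np.random.rand(1000, 8).astype(np.float32)  # stand-in data

# Hits whose embeddings sit within eps of each other chain into one cluster;
# each cluster becomes one track candidate. Label -1 marks unclustered noise hits.
clustering = DBSCAN(eps=0.25, min_samples=3).fit(embedded)
track_labels = clustering.labels_  # one integer track_id per hit
```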

Page 14

DATASET
OUR PREVIOUS STATE-OF-THE-ART

[Figure: detector layout in r–z, with pseudorapidity labels $\eta = -3 \dots 3$]

Page 15

DATASET
OUR PREVIOUS STATE-OF-THE-ART

Barrel only.

Page 16

DATASET
OUR PREVIOUS STATE-OF-THE-ART

Barrel only. Adjacency condition.

Adjacency allows a clear ordering of hits in a track (layers 1, 2, 3, …), and therefore a ground truth. It can also be used to prune false positives in the pipeline.

Page 17

DATASET
OUR PREVIOUS STATE-OF-THE-ART

But what about skipped layers?

Page 18

DATASET
OUR PREVIOUS STATE-OF-THE-ART

But what about skipped layers? And what about the endcaps?

Page 19

DATASET
OUR PREVIOUS STATE-OF-THE-ART

Let's define a clear ground truth and see if an embedding space can be learned without knowledge of detector geometry.

Page 20

DATASET
GEOMETRY-FREE TRUTH GRAPH

1. For each particle, order hits by increasing distance from the creation vertex, $R = \sqrt{x^2 + y^2 + z^2}$
2. Group by shared layers
3. Connect all combinations from layer $L_i$ to $L_{i+1}$, where $R_{i-1} < R_i < R_{i+1}$

(A construction sketch follows below.)

[Diagram: a particle's hits ordered by distance R from the creation vertex]
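A minimal pandas sketch of this construction (the column names hit_id, particle_id, layer, x, y, z are assumptions modelled on the TrackML file format, and the creation vertex is taken as the origin for brevity):

```python
import numpy as np
import pandas as pd

def truth_edges(hits: pd.DataFrame) -> list[tuple[int, int]]:
    """Geometry-free truth edges: order each particle's hits by R, group hits
    that share a layer, and connect all combinations of consecutive groups."""
    hits = hits.copy()
    hits["R"] = np.sqrt(hits.x**2 + hits.y**2 + hits.z**2)  # distance from vertex
    edges = []
    for _, track in hits.groupby("particle_id"):
        track = track.sort_values("R")
        # layer groups appear in order of first occurrence, i.e. increasing R
        groups = [g.hit_id.to_list() for _, g in track.groupby("layer", sort=False)]
        for a, b in zip(groups[:-1], groups[1:]):
            edges += [(i, j) for i in a for j in b]  # all combinations L_i -> L_{i+1}
    return edges
```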

Page 21

DATASET
GEOMETRY-FREE TRUTH GRAPH

The real question is: can such a specific definition of truth be learned in an embedded space?

Page 22

METRIC LEARNING
OUR PREVIOUS STATE OF THE ART

1. For all hits in the barrel, embed features (coordinates, cell direction data, etc.) into an N-dimensional space

Page 23

METRIC LEARNING
OUR PREVIOUS STATE OF THE ART

2. Associate hits from the same tracks as close in N-dimensional distance (close = within Euclidean distance r)


Page 25

METRIC LEARNING
OUR PREVIOUS STATE OF THE ART

3. Score each "target" hit within the embedding neighbourhood against the "seed" hit at the centre, using Euclidean distance

Page 26

METRIC LEARNING IN EMBEDDING SPACE
OUR PREVIOUS STATE OF THE ART

Architecture specifics: a "comparative" hinge loss.

• Negatives are punished for being inside the margin radius
• Positives are punished for being outside the margin radius (Δ)
• Margin = training radius

Training with random pairs on only (r, φ, z) gives 0.3–0.5% purity @ 96% efficiency. (A hinge-loss sketch follows below.)
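A minimal PyTorch sketch of such a pairwise hinge loss (the specific margin value and the summed reduction are assumptions; the slide only fixes margin = training radius):

```python
import torch

def pairwise_hinge_loss(emb_a, emb_b, is_positive, margin=1.0):
    """emb_a, emb_b: (num_pairs, N) embeddings of the two hits in each pair.
    is_positive: (num_pairs,) bool, True if both hits come from the same track."""
    d = torch.norm(emb_a - emb_b, dim=1)
    # positives punished for sitting outside the margin radius
    pos_loss = torch.relu(d - margin)[is_positive].sum()
    # negatives punished for intruding inside the margin radius
    neg_loss = torch.relu(margin - d)[~is_positive].sum()
    return pos_loss + neg_loss
```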

Page 27

METRIC LEARNING IN EMBEDDING SPACE
TOWARDS REALISTIC TRACKING

Where do we lose efficiency/purity?

• Most random-pair negatives are easy: the Curse of Easy Negatives
• Cell shape gives directional information to each hit
• Battle against GPU memory: we can only train on a subset of pairs

Page 28

METRIC LEARNING IN EMBEDDING SPACE
TOWARDS REALISTIC TRACKING

Where do we lose efficiency/purity? Most negatives are easy.

Solution: Hard Negative Mining (HNM)

1. Run the event through the embedding
2. Build a radius graph
3. Train the loss on all hits within radius = margin = 1 of each hit
4. For speed, use a sparse custom-built technique plus Facebook's GPU-powered FAISS KNN library

(A neighbourhood-search sketch follows below.)
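A minimal sketch of step 2 with FAISS (an exact flat L2 index and its range_search call; using squared distances as the radius threshold follows FAISS's convention):

```python
import faiss
import numpy as np

emb = np.random.rand(100_000, 8).astype(np.float32)  # stand-in embedded hits

index = faiss.IndexFlatL2(emb.shape[1])  # exact L2 search; GPU variants exist
index.add(emb)

# All neighbours within radius = margin = 1 of each hit (distances are squared L2).
# lims[i]:lims[i+1] slices out the neighbours of hit i in ids/dists.
lims, dists, ids = index.range_search(emb, 1.0**2)

for i in range(5):  # the first few hits' neighbourhoods
    print(i, ids[lims[i]:lims[i + 1]])
```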

Page 29

METRIC LEARNING IN EMBEDDING SPACE
EFFECTIVE TRAINING

Performance improves with:

• Random Pairs (RP)
• Hard Negative Mining (HNM)
• Positive Weighting (PW)
• Cell Information (CI), i.e. shape
• Warm-up (WU)

Altogether: HPRCW3

Page 30

FILTERING
OUR PREVIOUS STATE OF THE ART

The pipeline so far. We want to push purity (1.3%) up further, so let's filter these huge graphs.

[Architecture diagram: normalised (r, φ, z) and cell inputs pass through 512-unit layers into an 8-dimensional embedding trained with a hinge loss; a radius graph is built, and a further stack of 512-unit layers filters its edges with a cross-entropy loss]

Page 31

FILTERING
TOWARDS REALISTIC TRACKING

Just as in embedding, training the filter on random pairs gives poor performance, since the selection mostly yields easy negatives.

Again, the solution is HNM: run the event through the model (without tracking gradients), get the hard negatives, then run these through the model (now tracking gradients), training only on them.

[Plots: distribution in z of all edges out of metric learning vs. distribution of true edges]


Page 33

FILTERING
TOWARDS REALISTIC TRACKING

Regime                      | Phys. Truth purity @ 99% eff. | PID Truth purity @ 99% eff.
Vanilla                     | 6.3%                          | 7.8%
Cell info                   | 8.3%                          | 12.8%
Cell info, layer+batch norm | 14.0%                         | 17.4%
Graph size                  | O(1 million edges)            | O(3 million edges)

Remember:
Physical Truth: truth is defined as edges between the closest hits in a track, on different layers.
PID Truth: truth is defined as any edge connecting hits with the same Particle ID (PID).

Page 34

FILTERING
TOWARDS REALISTIC TRACKING

Does it work? Let's check Diane. Pretty good!

[Event display: true positives and false positives marked; no false negatives]

Page 35

FILTERING
TOWARDS REALISTIC TRACKING

Does it work? Let's check Diane. Not quite as good…

[Event display: true positives, false positives, and false negatives marked]

This is where the GNN comes in.


Page 37

FILTERING
ROBUSTNESS

Noise level is a proxy for low-pT hits.

Robustness of the embedding to noise, not re-trained; that is, the out-of-the-box penalty is around 20%.

(In collaboration with the University of Washington: Aditi Chauhan, Alex Schuy, Ami Oka, David Ho, Shih-Chieh Hsu)

Page 38

METRIC LEARNING IN EMBEDDED SPACE
CODE WALKTHROUGH

A sketch of the embedding model follows below.
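The walkthrough itself is external, but as a hedged stand-in, here is a minimal PyTorch embedding network of the kind described in this section (the input width, 512-unit hidden layers, and 8-dimensional output echo the diagram on Page 30; the final normalisation is an assumption):

```python
import torch
import torch.nn as nn

class HitEmbedding(nn.Module):
    """MLP that maps per-hit features (coordinates plus cell information)
    into an N-dimensional metric space."""
    def __init__(self, in_dim=12, hidden=512, emb_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, emb_dim),
        )

    def forward(self, x):
        # normalise so neighbourhoods are compared at a fixed scale (one common choice)
        return nn.functional.normalize(self.net(x), dim=-1)

emb = HitEmbedding()(torch.randn(100, 12))  # (100 hits, 8-dim embeddings)
```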

Page 39

GRAPH NEURAL NETWORKS
FOR HIGH ENERGY PHYSICS

GNNs can approximate the geometry of the physics problem.

GNNs are a generalisation of many other machine learning techniques: e.g. message-passing convolution generalises the CNN from flat to arbitrary geometry.

GNNs can learn node (i.e. hit/spacepoint) features and embeddings, as well as edge (i.e. relational) features and embeddings.

In practice, for an LHC-like detector environment: join hits into a graph, and iterate through message passing of hidden features.

Page 40

GRAPH NEURAL NETWORKS
THE AIM

[Pipeline diagram, as on Page 13]

• Doublet GNN architecture
• Performance
• Triplet construction + performance

Page 41

GRAPH NEURAL NETWORKS
ARCHITECTURES

MESSAGE PASSING

$v_0^{k+1} = \phi(e_{0j}^k, v_j^k, v_0^k)$

where $v_i^k$ are node features and $e_{ij}^k$ are edge features at iteration $k$.

[Diagram: central node $v_0^k$ with neighbours $v_1^k \dots v_4^k$ connected by edges $e_{01}^k \dots e_{04}^k$]
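A minimal sketch of one such message-passing update in plain PyTorch tensor ops (summed aggregation over neighbours and an MLP standing in for φ are assumptions; libraries like PyTorch Geometric package the same pattern):

```python
import torch
import torch.nn as nn

def message_passing_step(v, e, edge_index, phi: nn.Module):
    """One update v_0^{k+1} = phi(e_0j^k, v_j^k, v_0^k), summed over neighbours j.
    v: (num_nodes, F) node features; e: (num_edges, G) edge features;
    edge_index: (2, num_edges) with rows (source j, target 0)."""
    src, dst = edge_index
    # per-edge message built from edge features, sender and receiver features
    msg = phi(torch.cat([e, v[src], v[dst]], dim=-1))
    # sum each node's incoming messages
    out = torch.zeros(v.size(0), msg.size(1))
    return out.index_add_(0, dst, msg)

phi = nn.Sequential(nn.Linear(3 + 8 + 8, 8), nn.ReLU(), nn.Linear(8, 8))
v_next = message_passing_step(torch.randn(5, 8), torch.randn(4, 3),
                              torch.tensor([[1, 2, 3, 4], [0, 0, 0, 0]]), phi)
```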

Page 42

GRAPH NEURAL NETWORKS
ARCHITECTURES

ATTENTION GNN

$v_0^{k+1} = \mathrm{MLP}(\Sigma_j\, e_{0j}^k v_j^k,\ v_0^k)$, with edge scores $e_{0j}^k = \mathrm{MLP}([v_0^k, v_j^k])$

where $v_i^k$ are node features and $e_{ij}^k$ is an edge score in $[0, 1]$ at iteration $k$.

Veličković, Petar, et al. "Graph attention networks." arXiv preprint arXiv:1710.10903 (2017).
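A corresponding sketch of one attention-GNN update (the sigmoid that squashes edge scores into [0, 1] and the two-layer MLPs are assumptions):

```python
import torch
import torch.nn as nn

class AttentionStep(nn.Module):
    """v_0^{k+1} = MLP(sum_j e_0j^k * v_j^k, v_0^k), e_0j^k = MLP([v_0^k, v_j^k])."""
    def __init__(self, dim=8):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 1), nn.Sigmoid())
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))

    def forward(self, v, edge_index):
        src, dst = edge_index
        score = self.edge_mlp(torch.cat([v[dst], v[src]], dim=-1))  # in [0, 1]
        # sum score-weighted neighbour features into each receiving node
        agg = torch.zeros_like(v).index_add_(0, dst, score * v[src])
        return self.node_mlp(torch.cat([agg, v], dim=-1))

step = AttentionStep()
v = step(torch.randn(5, 8), torch.tensor([[1, 2, 3, 4], [0, 0, 0, 0]]))
```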

Page 43

GRAPH NEURAL NETWORKS
ARCHITECTURES

INTERACTION NETWORK

$e_{0j}^{k+1} = \phi_e(v_0^k, v_j^k, e_{0j}^k)$, then $v_0^{k+1} = \phi_v(v_0^k, \Sigma_j\, e_{0j}^{k+1})$

where $v_i^k$ are node features and $e_{ij}^k$ are edge features at iteration $k$.

Battaglia, Peter, et al. "Interaction networks for learning about objects, relations and physics." Advances in Neural Information Processing Systems. 2016.
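And a matching interaction-network sketch (separate MLPs stand in for the two φ's; widths are placeholders):

```python
import torch
import torch.nn as nn

class InteractionStep(nn.Module):
    """e_0j^{k+1} = phi_e(v_0, v_j, e_0j); v_0^{k+1} = phi_v(v_0, sum_j e_0j^{k+1})."""
    def __init__(self, dim=8):
        super().__init__()
        self.phi_e = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(),
                                   nn.Linear(dim, dim))
        self.phi_v = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                   nn.Linear(dim, dim))

    def forward(self, v, e, edge_index):
        src, dst = edge_index
        e_new = self.phi_e(torch.cat([v[dst], v[src], e], dim=-1))
        agg = torch.zeros_like(v).index_add_(0, dst, e_new)
        v_new = self.phi_v(torch.cat([v, agg], dim=-1))
        return v_new, e_new  # both node and edge features update, unlike attention

step = InteractionStep()
v, e = step(torch.randn(5, 8), torch.randn(4, 8),
            torch.tensor([[1, 2, 3, 4], [0, 0, 0, 0]]))
```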

Page 44

GRAPH NEURAL NETWORKS
PERFORMANCE

• We convert from a doublet graph to a triplet graph: triplet edges have direct access to curvature information, so we hypothesise the accuracy should be even better
• Doublets are associated to nodes, triplets are associated to edges

[Diagram: doublet edge scores (e.g. 0.99, 0.87, 0.84) carried onto the nodes of the triplet graph]

(A conversion sketch follows below.)
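A minimal sketch of the doublet-to-triplet conversion (the rule that two doublets sharing a middle hit form a triplet edge is the natural reading of the slide, but its exact form here is an assumption):

```python
from collections import defaultdict

def doublets_to_triplets(doublets: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Each doublet (hit a, hit b) becomes a node of the triplet graph; doublets
    d1 = (a, b) and d2 = (b, c) that share hit b are joined by a triplet edge."""
    by_first_hit = defaultdict(list)
    for idx, (a, b) in enumerate(doublets):
        by_first_hit[a].append(idx)
    triplet_edges = []
    for idx, (a, b) in enumerate(doublets):
        for jdx in by_first_hit[b]:  # doublets starting where this one ends
            triplet_edges.append((idx, jdx))
    return triplet_edges

print(doublets_to_triplets([(0, 1), (1, 2), (1, 3), (2, 4)]))
# [(0, 1), (0, 2), (1, 3)] -- e.g. doublets (0,1) and (1,2) form triplet 0-1-2
```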


Page 46

GRAPH NEURAL NETWORKS
PERFORMANCE

Barrel-only, adjacent-layers.

Page 47

GRAPH NEURAL NETWORKS
TRACK SEEDING PERFORMANCE

Purity: 99.1% ± 0.07%

Inference time: ~5 seconds per event per GPU, split between:
• ~3 ± 1 seconds for embedding construction
• ~2 ± 1 seconds for the two GNN steps and processing

[Plot: seed efficiency (PU = 200); total efficiency rising from ~0.82 to ~0.94 vs. pT from 0 to 4 GeV]

Page 48

GRAPH NEURAL NETWORKS
TRACK LABELLING PERFORMANCE

• 0.84 TrackML score in the barrel, no noise
• The upper-limit score out of metric learning was 0.90, with only the barrel, adjacent layers, etc.
• Now able to run the full detector, with geometry-free embedding
• The upper-limit score out of metric learning is now at least 0.935
• Can keep pushing this value at the expense of purity = GPU memory

[Plot: TrackML score; barrel-only, adjacent-layers]

Page 49

GRAPH NEURAL NETWORKS
CHALLENGES

Going to the full detector, does the GNN understand our more epistemologically motivated truth?

Going to the full detector, can we fit a whole model on a GPU?

Page 50

GRAPH NEURAL NETWORKS
CHALLENGES

Going to the full detector, does the GNN understand our more epistemologically motivated truth? We will see performance results in the code walkthrough.

Going to the full detector, can we fit a whole model on a GPU? No. We need some memory hacks: checkpointing, a larger GPU, mixed precision.

Page 51

GRAPH NEURAL NETWORK
MEMORY MANAGEMENT

Half precision:
• Performance unaffected
• No speed improvements yet, pending a PyTorch (PyTorch Geometric) GitHub pull request

[Chart: peak GPU usage (GB, up to ~16) in full precision (FP) vs. mixed precision (MP), and the MP/FP ratio (~0.45–0.50), across configurations of (hidden features, message-passing iterations, training batch size): (32, 8, 1), (32, 8, 2), (64, 6, 1), (128, 8, 1)]

(A mixed-precision sketch follows below.)
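A minimal sketch of half-precision training with PyTorch's automatic mixed precision utilities (the model and data are placeholders):

```python
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(8, 1).cuda()          # placeholder for the GNN
opt = torch.optim.Adam(model.parameters())
scaler = GradScaler()                          # rescales grads to avoid fp16 underflow

for x, y in [(torch.randn(64, 8).cuda(), torch.randn(64, 1).cuda())]:
    opt.zero_grad()
    with autocast():                           # forward pass runs in half precision
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()              # backward on the scaled loss
    scaler.step(opt)
    scaler.update()
```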

Page 52

GRAPH NEURAL NETWORK
MEMORY MANAGEMENT

Checkpointing (PyTorch):
• Memory improvement
• Minimal speed penalty (~1.2x longer training)
• We checkpoint each iteration (partial checkpointing), so there is still a scaling with iterations
• Can checkpoint all iterations (maximal checkpointing)

[Plot: peak memory, no checkpointing vs. maximal checkpointing]

(A checkpointing sketch follows below.)
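A minimal sketch of per-iteration gradient checkpointing with torch.utils.checkpoint (the linear step module is a placeholder for one message-passing iteration):

```python
import torch
from torch.utils.checkpoint import checkpoint

step = torch.nn.Linear(8, 8)       # placeholder for one message-passing iteration
v = torch.randn(100, 8, requires_grad=True)

for _ in range(8):
    # activations inside `step` are not stored; they are recomputed on backward,
    # trading ~1.2x training time for a large memory saving
    v = checkpoint(step, v)

v.sum().backward()
```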

Page 53

GRAPH NEURAL NETWORK
CODE WALKTHROUGH

Page 54

GRAPH NEURAL NETWORK
PERFORMANCE

Full detector, geometry-free construction.

Page 55

METRIC LEARNING & GRAPH NEURAL NETWORK PIPELINE
SUMMARY

We handle the full detector, noise, geometry-free inference, and distributed training, with care.

We can learn an embedding space without layer information, provided we equip training with hard negative mining, cell information, and warm-up.

We can run the GNN on a full event, provided we equip training with gradient checkpointing and mixed precision.

We can include noise without re-training, at a small (~20%) penalty to purity.

Page 56

BACKUP

Page 57

OUTLOOK

Converging on better architectures: attention, gated RNNs, and generalising dense, flat methods to sparse, graph structure (not that the two are mutually exclusive: there is increasing interest in sparse CNN techniques, for example).

Dwivedi, Vijay Prakash, et al. "Benchmarking graph neural networks." arXiv preprint arXiv:2003.00982 (2020).

Page 58

OUTLOOK

Converging on better methods: sparse operations, triplet graph structure, fast clustering, approximate nearest neighbours, and piggy-backing off big-tech methods, e.g. Facebook FAISS.

[Figures: Google Trends of "Graph Neural Networks"; doublet graph to higher-order classification]

Page 59

OUTLOOK

Converging on better hardware: mixed-precision handling on new GPUs/TPUs, sparse handling in IPUs, and compilability of graph-structure ML libraries for FPGA ports, e.g. the IEEE HPEC GraphChallenge.