TRANSCRIPT
Deeptails Seminar #1
July 9th, 2020
How To Train Your Deep Multi-Object Tracker
Presenters: Yihong Xu¹ and Xavier Alameda-Pineda¹
Joint work with: Aljosa Osep², Yutong Ban¹,³, Radu Horaud¹ and Laura Leal-Taixe²
¹Inria, LJK, MIAI, Univ. Grenoble Alpes, France; ²Technical University of Munich, Germany; ³Distributed Robotics Lab, CSAIL, MIT, USA
Research Page | Download Code
Deeptails
http://project.inria.fr/ml3ri/deeptails – [email protected] 1/26
Deeptails Seminars?
Rationale: deep learning requires engineering.
Aim: discuss best engineering practices together with methodology.
Format: a not-so-young researcher (methodology) paired with a young researcher (deep details, or "deeptails") – around 30 min.
"The Devil is in the Deeptails": A Series of Seminars on the Engineering behind Science
https://project.inria.fr/ml3ri/deeptails/
Motivation
[Figure: the standard pipeline. A deep multi-object tracker predicts boxes from RGB input images. Training: predictions are matched to ground-truth bounding boxes by the Hungarian algorithm (optimal assignment) and trained with an L2 loss. Evaluation: MOT metrics.]
Motivation
[Figure: the proposed pipeline. At training time, the Hungarian algorithm and the L2 loss are replaced by a Deep Hungarian Network (differentiable assignment) and a MOT loss, so that training optimises a differentiable proxy of the MOT metrics used at evaluation.]
Methodology: Table of Contents
◮ Standard Practice: HA and MOT Metrics
◮ DHN: Deep Hungarian Net
◮ DeepMOT Loss: MOTA and MOTP Approximations
◮ Some Results
Standard practice
1. Compute the distance matrix D_t between predicted and ground-truth bboxes.
2. Apply the Hungarian algorithm to obtain the optimal assignment matrix A_t.
At train time: compute the L2 distance, and back-propagate this error.
At test time: compute
MOTA = 1 − Σ_t (FP_t + FN_t + IDS_t) / Σ_t TP_t,
MOTP = Σ_t Σ_{n,m} d_t^{nm} a*_t^{nm} / Σ_t |TP_t|.
MOTA is the classification accuracy.
MOTP is the estimated bbox precision.
Two issues: neither the HA nor MOTA/P are differentiable procedures.
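Step 2 above can be sketched with SciPy's implementation of the Hungarian algorithm; the distance values below are made up for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy distance matrix D_t: rows = predicted boxes, columns = ground-truth boxes.
D = np.array([[0.1, 0.9, 0.8],
              [0.7, 0.2, 0.9],
              [0.8, 0.9, 0.3]])

# Hungarian algorithm: optimal assignment minimising the total distance.
rows, cols = linear_sum_assignment(D)

# Binary assignment matrix A_t.
A = np.zeros_like(D)
A[rows, cols] = 1.0
print(A)                    # here prediction i is matched to ground truth i
print(D[rows, cols].sum())  # total matching cost: 0.1 + 0.2 + 0.3
```

The assignment step itself is a discrete, non-differentiable operation, which is exactly the issue the next slides address.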
DHN: approximating the HA
We propose the Deep Hungarian Net (DHN) to approximate the HA:
◮ DHN must accept inputs of different sizes.
◮ DHN must have a global receptive field.
◮ ⇒ DHN is designed by combining flattening and Bi-RNNs.
[Figure: DHN architecture. The M × N distance matrix D (tracks to ground truth) is flattened row-wise and fed to a seq-to-seq Bi-RNN (2 × hidden units), reshaped into the first-stage hidden representation, flattened column-wise and fed to a second seq-to-seq Bi-RNN, then passed through FC layers and a sigmoid and reshaped into the M × N soft assignment matrix Ã.]
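A minimal PyTorch sketch of this two-stage flatten + Bi-RNN design; the class name, hidden size, GRU cells and FC widths are illustrative choices, not the exact seminar configuration:

```python
import torch
import torch.nn as nn

class DHNSketch(nn.Module):
    """Two-stage Bi-GRU over row-wise and column-wise flattenings of D."""
    def __init__(self, hidden=64):
        super().__init__()
        self.rnn_row = nn.GRU(1, hidden, bidirectional=True, batch_first=True)
        self.rnn_col = nn.GRU(2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1))

    def forward(self, D):                      # D: (M, N), any size
        M, N = D.shape
        # First stage: row-wise flatten -> seq-to-seq Bi-RNN.
        x = D.reshape(1, M * N, 1)
        h1, _ = self.rnn_row(x)                # (1, M*N, 2*hidden)
        # Reshape, then column-wise flatten for the second stage.
        h1 = h1.reshape(M, N, -1).transpose(0, 1).reshape(1, N * M, -1)
        h2, _ = self.rnn_col(h1)               # (1, N*M, 2*hidden)
        # FC layers + sigmoid -> soft assignment matrix A~ of shape (M, N).
        return torch.sigmoid(self.fc(h2)).reshape(N, M).t()

D = torch.rand(3, 5)
A_soft = DHNSketch()(D)
print(A_soft.shape)   # (3, 5), entries in (0, 1)
```

Because the Bi-RNNs run over the flattened sequences, every output entry can depend on every input distance, which is how the sketch gets the global receptive field required above.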
DeepMOT loss: approximating the MOTA and MOTP
[Figure: the distance matrix D (with ∞ marking forbidden entries) is fed to the Deep Hungarian Net, which outputs the soft assignment matrix Ã; a) a constant δ column/row plus a row-/column-wise softmax yields soft FP and FN counts; b) the FN assignments are compared against those of frame t−1 to count ID switches; c) a TP mask applied element-wise to D gives the soft MOTP numerator dMOTP = ‖B^TP‖₀.]
D and à have estimated objects as rows and ground-truth objects as columns.
To compute the FP count: complete with a constant column + row-wise softmax.
Analogous for the FN count.
ID switches are computed by masking the FN-matrix assignments at the previous frame.
We have MOTA. To obtain MOTP, we mask the distance matrix D.
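The constant-δ completion plus softmax described here can be sketched in NumPy; δ = 0.5 and the low temperature are taken from the slides, while the soft assignment matrix is hand-made for illustration:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_fp_fn(A_soft, delta=0.5, T=0.01):
    """Soft FP/FN counts: rows = estimated objects, columns = ground truth."""
    M, N = A_soft.shape
    # FP: append a constant delta column, then row-wise softmax; a predicted
    # object with no strong match puts its mass on the delta column.
    Cr = softmax(np.hstack([A_soft, np.full((M, 1), delta)]) / T, axis=1)
    # FN: append a constant delta row, then column-wise softmax; a ground-truth
    # object with no strong match puts its mass on the delta row.
    Cc = softmax(np.vstack([A_soft, np.full((1, N), delta)]) / T, axis=0)
    return Cr[:, -1].sum(), Cc[-1, :].sum()

# Two predictions, three ground-truth objects: the third ground truth is missed.
A = np.array([[0.9, 0.0, 0.1],
              [0.0, 0.8, 0.2]])
fp, fn = soft_fp_fn(A)
print(round(fp, 2), round(fn, 2))  # ~0.0 and ~1.0
```

Both counts are smooth functions of Ã, so gradients can flow back through the DHN to the tracker.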
Using the approximation
[Figure: the deep multi-object tracker takes RGB input images; at training time, its predictions are matched to ground-truth bounding boxes by the Deep Hungarian Network (differentiable assignment) and scored with the MOT loss; at evaluation time, the MOT metrics are computed.]
Thanks to the DHN and the (Deep)MOT loss, we optimise a proxy of the evaluation metrics.
Tracking Results Visualization
[Figure: qualitative comparisons over frames t to t + 3.]
Table: Original vs. Ours: on IDS (top), FN (bottom-left) and FP (bottom-right).
Quantitative comparison

Method            MOTA↑ MOTP↑ IDF1↑ MT↑  ML↓  FP↓    FN↓     IDS↓

MOT17
DeepMOT-Tracktor  53.7  77.2  53.8  19.4 36.6 11731  247447  1947
Tracktor          53.5  78.0  52.3  19.5 36.6 12201  248047  2072
eHAF              51.8  77.0  54.7  23.4 37.9 33212  236772  1834
FWT               51.3  77.0  47.6  21.4 35.2 24101  247921  2648
jCC               51.2  75.9  54.5  20.9 37.0 25937  247822  1802
MOTDT17           50.9  76.6  52.7  17.5 35.7 24069  250768  2474
MHT DAM           50.7  77.5  47.2  20.8 36.9 22875  252889  2314

MOT16
DeepMOT-Tracktor  54.8  77.5  53.4  19.1 37.0 2955   78765   645
Tracktor          54.4  78.2  52.5  19.0 36.9 3280   79149   682
HCC               49.3  79.0  50.7  17.8 39.9 5333   86795   391
LMP               48.8  79.0  51.3  18.2 40.1 6654   86245   481
GCRA              48.2  77.5  48.6  12.9 41.1 5104   88586   821
FWT               47.8  75.5  44.3  19.1 38.2 8886   85487   852
MOTDT             47.6  74.8  50.9  15.2 38.3 9253   85431   792
Deeptails: Table of Contents
◮ Distance Matrix Calculation
◮ Deep Hungarian Network (DHN)
  ◮ Data Augmentation
  ◮ Training Strategy
  ◮ Ablation Study
◮ DeepMOT Training
  ◮ Soft Discretization
  ◮ Data Augmentation
  ◮ Training Strategy
  ◮ Ablation Study
[Figure: DeepMOT overview. RGB images feed the deep multi-object tracker; its bounding boxes are matched by the Deep Hungarian Net and scored with the DeepMOT loss, whose gradients flow back to the tracker.]
Distance Matrix Calculation 1/2
[Figure: two pairs of non-overlapping boxes, both with IoU = 0.]
D_IoU = 1 − IoU = 1 if there is no overlap → zero gradient.
⇒ We use D = (D_L2 + D_IoU)/2, with
D_L2 = 1 − exp(−5 D_norm),
D_norm = [(x − x̄)² + (y − ȳ)²]/d², d² = h² + w², with h, w the height/width;
(x, y) and (x̄, ȳ) are the centers of the predicted/ground-truth bbox.
[Figure: 1 − IoU, D_norm, D_L2 and 0.5·(D_L2 + 1 − IoU) plotted against the center distance (0–600).]
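The combined distance can be sketched as follows for axis-aligned boxes given as (x1, y1, x2, y2); taking d² as the squared image diagonal is an assumption of this sketch:

```python
import numpy as np

def iou(a, b):
    # Boxes as (x1, y1, x2, y2).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter)

def pair_distance(pred, gt, img_h, img_w):
    """D = (D_L2 + D_IoU) / 2, with D_L2 = 1 - exp(-5 * D_norm)."""
    cx = lambda t: ((t[0] + t[2]) / 2, (t[1] + t[3]) / 2)
    (xp, yp), (xg, yg) = cx(pred), cx(gt)
    d2 = img_h ** 2 + img_w ** 2                # normalising squared diagonal
    d_norm = ((xp - xg) ** 2 + (yp - yg) ** 2) / d2
    d_l2 = 1.0 - np.exp(-5.0 * d_norm)
    d_iou = 1.0 - iou(pred, gt)
    return 0.5 * (d_l2 + d_iou)

box = (10, 10, 50, 80)
print(pair_distance(box, box, 480, 640))   # 0.0 for identical boxes
far = (400, 300, 440, 370)
print(pair_distance(box, far, 480, 640))   # > 0.5: D_IoU saturates at 1
```

The D_L2 term is what keeps the gradient alive once the boxes stop overlapping and D_IoU flattens out at 1.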
Distance Matrix Calculation 2/2
[Figure: a Faster R-CNN-style network (conv1–conv5, RPN, ROI pooling, FC layers, softmax/bbox heads) extended with a re-identification branch (1×1 conv, FC, L2 normalize, concat). Source: image modified from Hoang Ngan Le, T., et al. 2016.]
Appearance vectors (F): ROI pooling + reid branch.
Cosine distance:
D_cos = 0.5 (1 − (F_gt · F_pred) / (‖F_gt‖ ‖F_pred‖))
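The appearance term can be sketched directly from the formula; the vectors below are toy examples, not real appearance features:

```python
import numpy as np

def cosine_distance(f_gt, f_pred):
    """D_cos = 0.5 * (1 - cos(F_gt, F_pred)), mapped into [0, 1]."""
    cos = np.dot(f_gt, f_pred) / (np.linalg.norm(f_gt) * np.linalg.norm(f_pred))
    return 0.5 * (1.0 - cos)

same = np.array([1.0, 2.0, 3.0])
print(cosine_distance(same, same))                                   # 0.0: identical
print(cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # 0.5: orthogonal
```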
DHN-Data Augmentation
[Figure: top, a distance matrix D is thresholded at three different values (e.g. 0.65, 0.55, 0.42; entries above the threshold are set to ∞), producing D1, D2, D3; bottom, random row and column permutations of D produce further training matrices.]
◮ We randomly threshold the DHN input distance matrix D with three different thresholds, constructing a dataset of 114,483 training and 17,880 testing instances.
◮ During DHN training, with a probability of 0.5 (uniform distribution), we randomly permute the rows/columns of the distance matrix and of its corresponding target assignment matrix.
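The permutation augmentation can be sketched as follows; the key point is that D and its target assignment matrix must be permuted with the same row/column orders (the matrices below are toy examples):

```python
import numpy as np

rng = np.random.default_rng(0)

def permute_pair(D, A, p=0.5):
    """With probability p, permute rows/columns of D and its target A consistently."""
    if rng.random() < p:
        r = rng.permutation(D.shape[0])
        c = rng.permutation(D.shape[1])
        D, A = D[r][:, c], A[r][:, c]
    return D, A

D = np.array([[0.5, 0.3, 0.2],
              [0.7, 0.1, 0.6],
              [0.3, 0.7, 0.4]])
A = np.eye(3)  # toy target assignment
D2, A2 = permute_pair(D, A, p=1.0)
# The permuted pair still encodes the same matching: the distances selected by
# A2 in D2 are the same multiset as those selected by A in D.
print(sorted(D2[A2 == 1]), sorted(D[A == 1]))
```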
DHN-Training Strategy
◮ We train DHN as a 2D classification task.
◮ The RMSprop optimizer is used with a learning rate of 0.0003, gradually decreased by 5% every 20,000 iterations, for 20 epochs (6 hours on a Titan XP GPU).
◮ For unbalanced labels (too many zeros in the target matrices), we weight the zero class by w0 = n1/(n0 + n1) and the one class by w1 = 1 − w0, where n0 is the number of zeros and n1 the number of ones in the target matrix.
◮ The loss function is the focal loss with a modulating factor of γ = 2.
◮ Once trained, DHN weights are fixed during the DeepMOT training.
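A binary focal loss with the class weights defined on this slide can be sketched in NumPy; this is an illustrative sketch, not the exact training code:

```python
import numpy as np

def weighted_focal_loss(pred, target, gamma=2.0, eps=1e-7):
    """Binary focal loss with class weights w0 = n1/(n0+n1), w1 = 1 - w0."""
    n1 = target.sum()
    n0 = target.size - n1
    w0 = n1 / (n0 + n1)            # small when ones are rare: down-weights zeros
    w1 = 1.0 - w0
    p_t = np.where(target == 1, pred, 1.0 - pred)   # probability of the true class
    w = np.where(target == 1, w1, w0)
    # (1 - p_t)^gamma focuses the loss on hard, misclassified entries.
    loss = -w * (1.0 - p_t) ** gamma * np.log(p_t + eps)
    return loss.mean()

# Target assignment matrix: mostly zeros, as in DHN training.
target = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
pred = np.array([[0.9, 0.1, 0.1],
                 [0.2, 0.8, 0.1]])
print(weighted_focal_loss(pred, target))  # small positive scalar
```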
DHN-Ablation Study 1/2
We ablate the architectures of DHN and compare their performance using the metrics:
- MA (Missing Assignment),
- SA (Several Assignment),
- WA (Weighted Accuracy).
[Figure: the soft assignment matrix à predicted by DHN from D is discretized by taking the row-wise and the column-wise maximum with a threshold of 0.5, giving hard-assigned prediction matrices A̅r and A̅c, which are compared against the ground-truth assignment matrix A*.]
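The hard discretization used for this comparison can be sketched as follows (row-wise maximum with threshold 0.5; the column-wise case is analogous along axis 0). The soft matrix below reuses the values shown in the figure:

```python
import numpy as np

def discretize_rowwise(A_soft, th=0.5):
    """Per row, keep the maximum entry if it exceeds th; zero elsewhere."""
    A = np.zeros_like(A_soft)
    for i, row in enumerate(A_soft):
        j = row.argmax()
        if row[j] > th:
            A[i, j] = 1.0
    return A

A_soft = np.array([[0.1, 0.1, 0.9],
                   [0.2, 0.8, 0.2],
                   [0.3, 0.3, 0.2]])   # third row: no confident assignment
print(discretize_rowwise(A_soft))
```

A row whose maximum stays below the threshold yields no assignment (a missing assignment when the ground truth has one); the metrics above count such errors.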
DHN-Ablation Study 2/2
[Figure: proposed sequential (seq) Bi-RNN DHN: row-wise flatten → seq-to-seq Bi-RNN → reshape → column-wise flatten → seq-to-seq Bi-RNN → FC layers → sigmoid → reshape into the soft assignment matrix Ã.]
[Figure: parallel (paral) Bi-RNN DHN: the row-wise and column-wise flattenings are fed to two seq-to-seq Bi-RNNs in parallel, concatenated, then passed through FC layers, a sigmoid and a reshape.]
[Figure: 1D-convolutional (1d conv) DHN: a U-Net-style stack of Conv1D layers with pooling, upsampling and skip concatenations over the row-wise flattening, ending in a Conv1D(25,1,1) + sigmoid + reshape.]

Discretization: row-wise maximum
Network             WA% (↑)  MA% (↓)  SA% (↓)
seq gru (proposed)  92.71    13.17    9.70
seq lstm            91.64    14.55    10.37
paral gru           86.84    23.50    17.15
paral lstm          71.58    42.48    22.62
1d conv             83.12    32.73    5.73

Discretization: column-wise maximum
Network             WA% (↑)  MA% (↓)  SA% (↓)
seq gru (proposed)  92.36    12.21    3.69
seq lstm            91.93    13.15    4.71
paral gru           87.24    20.56    16.67
paral lstm          72.58    39.55    23.16
1d conv             82.74    32.94    1.11

Ablation study of DHN architectures on 252,355 matrices collected during the DeepMOT training process.
DeepMOT Training-Soft Discretization
A soft discretization process replaces the argmax operation: a threshold column (row) with value δ = 0.5 is appended, followed by a row (column)-wise softmax.
[Figure: the vector a = (0.50, 0.56, 0.60); with T = 1 the softmax is nearly uniform, ≈ (0.31, 0.34, 0.35); with T = 0.01 it is close to argmax, ≈ (0.00, 0.02, 0.98).]
softmax(a_i) = exp(a_i/T) / Σ_j exp(a_j/T)    (1)
A low temperature T = 0.01 is used in the softmax activation function.
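Equation (1) with the two temperatures from this slide, applied to the example vector in the figure:

```python
import numpy as np

def softmax_T(a, T):
    e = np.exp((a - a.max()) / T)   # shift by the max for numerical stability
    return e / e.sum()

a = np.array([0.50, 0.56, 0.60])
print(softmax_T(a, T=1.0).round(2))    # nearly uniform
print(softmax_T(a, T=0.01).round(2))   # close to one-hot argmax
```

At T = 0.01 the output is almost one-hot yet still differentiable, which is what makes the discretization usable inside the loss.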
DeepMOT Training-Data Augmentation
[Figure: a ground-truth box is randomly shifted, randomly scaled up (scale > 1), or randomly scaled down (scale < 1).]
During track initialization, to mimic noisy detections in the real world, noise is added to the ground-truth (GT) bounding boxes during DeepMOT training:
◮ with a probability of 0.4 (uniform distribution), a GT box is randomly shifted by k × box height or k × box width, k ∈ [0.025, 0.05);
◮ with a probability of 0.4 (uniform distribution), a GT box is randomly scaled by λ × box height or λ × box width, λ ∈ [0.8, 1.5).
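The jitter above can be sketched as follows, with boxes as (x, y, w, h); the probabilities and ranges come from this slide, while shifting both axes at once and the random sign are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def jitter_gt_box(x, y, w, h):
    """Add shift/scale noise to a GT box to mimic noisy detections."""
    if rng.random() < 0.4:                       # random shift
        k = rng.uniform(0.025, 0.05)
        x += rng.choice([-1, 1]) * k * w
        y += rng.choice([-1, 1]) * k * h
    if rng.random() < 0.4:                       # random scale
        lam = rng.uniform(0.8, 1.5)
        w, h = lam * w, lam * h
    return x, y, w, h

boxes = [jitter_gt_box(100.0, 50.0, 40.0, 80.0) for _ in range(1000)]
ws = [b[2] for b in boxes]
print(min(ws) >= 0.8 * 40.0 and max(ws) <= 1.5 * 40.0)  # True: scale stays in range
```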
DeepMOT Training-Training Strategy
◮ We initialise tracks with noisy GT boxes at t = 0 and keep regressing next-frame bounding boxes. The tracking lasts for 10 frames.
◮ We calculate the DeepMOT loss and update the weights of the deep multi-object tracker with its gradients at each time step.
◮ We use the Adam optimizer with a learning rate of 0.0001.
◮ We train the baseline SOTs (single-object trackers) for 15 epochs (72 h), and we train Tracktor (regression head and ReID head) for 18 epochs (12 h) on a Titan XP GPU.
DeepMOT Training-Ablation Study

Training loss      MOTA↑ MOTP↑ IDF1↑ MT↑   ML↓   FP↓ FN↓   IDS↓
Vanilla            60.20 89.50 71.15 35.13 27.80 276 31827 152
Smooth L1          60.38 91.81 71.27 34.99 27.25 294 31649 164
dMOTP              60.51 91.74 71.75 35.41 26.83 291 31574 142
dMOTA              60.52 88.31 71.92 35.41 27.39 254 31597 142
dMOTA+dMOTP−ĨDS    60.61 92.03 72.10 35.41 27.25 222 31579 124
dMOTA+dMOTP        60.66 91.82 72.32 35.41 27.25 218 31545 118

Table: The effect of different components of the DeepMOT loss.

Training               MOTA↑ MOTP↑ IDF1↑ MT↑   ML↓   FP↓  FN↓   IDS↓
GOTURN   Pre-trained   45.99 85.87 49.83 22.27 36.51 2927 39271 1577
         Smooth L1     52.28 90.56 63.53 29.46 34.58 2026 36180 472
         DeepMOT       54.09 90.95 66.09 28.63 35.13 927  36019 261
SiamRPN  Pre-trained   55.35 87.15 66.95 33.61 31.81 1907 33925 356
         Smooth L1     56.51 90.88 68.38 33.75 32.64 925  34151 167
         DeepMOT       57.16 89.32 69.49 33.47 32.78 889  33667 161
Tracktor Vanilla       60.20 89.50 71.15 35.13 27.80 276  31827 152
         Smooth L1     60.38 91.81 71.27 34.99 27.25 294  31649 164
         DeepMOT       60.66 91.82 72.32 35.41 27.25 218  31545 118

Table: DeepMOT vs. Smooth L1 on the validation set.

◮ We demonstrate the merit of the different components of the DeepMOT loss.
◮ We obtain better performance with our DeepMOT loss than with a Smooth-L1-based loss ⇒ explicitly establishing the matching is important!
◮ DeepMOT can jointly train the bounding box regressor and an internal re-identification module!
Conclusion
(i) We propose a novel framework to train deep multi-object trackers.
  ◮ DHN as a differentiable alternative to the HA.
  ◮ DeepMOT loss as a proxy to the MOT metrics [2].
(ii) Detailed description of the design of the DHN and the DeepMOT loss.
(iii) Provided the training and data augmentation strategies for the DHN.
(iv) Once the DHN is learned, we describe the training of a deep multi-object tracker using the DHN and the DeepMOT loss.
(v) This allows back-propagating "through the assignment problem"!
(vi) We use the framework to train Tracktor [1] and establish a new state of the art on the MOT Challenge [4, 3].
References
[1] Philipp Bergmann, Tim Meinhardt, and Laura Leal-Taixe. Tracking without bells and whistles. ICCV, 2019.
[2] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: The CLEAR MOT metrics. JIVP, 2008:1:1–1:10, 2008.
[3] Laura Leal-Taixe, Anton Milan, Ian Reid, Stefan Roth, and Konrad Schindler. MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942, 2015.
[4] Anton Milan, Laura Leal-Taixe, Ian Reid, Stefan Roth, and Konrad Schindler. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.