learning graph representations for video...
TRANSCRIPT
![Page 1: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/1.jpg)
Learning Graph Representations for Video Understanding
Xiaolong WangCarnegie Mellon University
![Page 2: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/2.jpg)
Computer Vision
Dog
He et al. Mask R-CNN. ICCV 2017.Güler et al. DensePose: Dense Human Pose Estimation In The Wild. CVPR 2018.
![Page 3: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/3.jpg)
Deep Learning
Mushroom Dog Ant Jelly Fungus Nest
ImageNet
Image
Mushroom
Dog
Ant
Jelly Fungus
Nest
Train a Convolutional Neural Network
Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. 2014.
![Page 4: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/4.jpg)
Convolutional Neural Networks
Figure credit: Van Den Oord et al.
• Convolution is local
• Long-range Pairwise relations are not modeled
![Page 5: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/5.jpg)
Related Work: Relation Networks
[Santoro et al, 2017]
![Page 6: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/6.jpg)
Related Work: Self-Attention
[Vaswani et al, 2017]
![Page 7: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/7.jpg)
Related Work: Graph Convolution Networks
[Kipf et al, 2017]
![Page 8: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/8.jpg)
This Tutorial
• Perform connections on different graph/relation networks
• Under the application of video understanding
• Both supervised and self-supervised methods
![Page 9: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/9.jpg)
Video Recognition
3D Conv
3D Conv
3D Conv
Playing Soccer
![Page 10: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/10.jpg)
Reasoning for Action Recognition
X. Wang , R. Girshick , A. Gupta, and K. He. Non-local Neural Networks. CVPR 2018.
Long-rang explicit reasoning
![Page 11: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/11.jpg)
Non-local Means
𝑝
𝑞2
𝑞1
𝑞3
Buades et al. A non-local algorithm for image denoising. CVPR, 2005.
![Page 12: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/12.jpg)
Non-local Operator
Operation in feature space
Can be embedded into any ConvNets
𝑥𝑖 𝑥𝑗
![Page 13: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/13.jpg)
Non-local Operator
𝑥𝑖 𝑥𝑗
Affinity Features
𝑦𝑖 =1
𝐶(𝑥)
∀𝑗
𝑓 𝑥𝑖 , 𝑥𝑗 𝑔(𝑥𝑗)
![Page 14: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/14.jpg)
Non-local Operator
14𝑥
𝜃: 1 × 1× 1
𝜙: 1 × 1× 1
𝑇 × 𝐻 × 𝑊 × 512 𝑇 × 𝐻 × 𝑊 × 512
𝑇𝐻𝑊 × 512 512 × 𝑇𝐻𝑊
𝑇𝐻𝑊 × 𝑇𝐻𝑊
𝑇𝐻𝑊
512𝑇𝐻𝑊
512
𝑇𝐻𝑊
𝑇𝐻𝑊× =
𝑦𝑖 =1
𝐶(𝑥)
∀𝑗
𝑓 𝑥𝑖 , 𝑥𝑗 𝑔(𝑥𝑗)
𝑇 × 𝐻 × 𝑊 × 512
![Page 15: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/15.jpg)
Non-local Operator
15𝑥
𝜃: 1 × 1× 1
𝜙: 1 × 1× 1
𝑇 × 𝐻 × 𝑊 × 512 𝑇 × 𝐻 × 𝑊 × 512
𝑇𝐻𝑊 × 512 512 × 𝑇𝐻𝑊
𝑇𝐻𝑊 × 𝑇𝐻𝑊
𝑇𝐻𝑊
512𝑇𝐻𝑊
512
𝑇𝐻𝑊
𝑇𝐻𝑊× =
𝑦𝑖 =1
𝐶(𝑥)
∀𝑗
𝑓 𝑥𝑖 , 𝑥𝑗 𝑔(𝑥𝑗)
𝑇 × 𝐻 × 𝑊 × 512
![Page 16: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/16.jpg)
Non-local Operator
16𝑥
𝜃: 1 × 1× 1
𝜙: 1 × 1× 1
𝑇 × 𝐻 × 𝑊 × 512 𝑇 × 𝐻 × 𝑊 × 512
𝑇𝐻𝑊 × 512 512 × 𝑇𝐻𝑊
𝑇𝐻𝑊 × 𝑇𝐻𝑊
normalize
𝑦𝑖 =1
𝐶(𝑥)
∀𝑗
𝑓 𝑥𝑖 , 𝑥𝑗 𝑔(𝑥𝑗)
𝑇 × 𝐻 × 𝑊 × 512
![Page 17: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/17.jpg)
Non-local Operator
17𝑥
𝜃: 1 × 1× 1
𝜙: 1 × 1× 1
𝑇 × 𝐻 × 𝑊 × 512 𝑇 × 𝐻 × 𝑊 × 512
𝑇𝐻𝑊 × 512 512 × 𝑇𝐻𝑊
𝑇𝐻𝑊 × 𝑇𝐻𝑊
normalize
𝑦𝑖 =1
𝐶(𝑥)
∀𝑗
𝑓 𝑥𝑖 , 𝑥𝑗 𝑔(𝑥𝑗)
𝑇 × 𝐻 × 𝑊 × 512
𝑓 𝑥𝑖 , 𝑥𝑗 = exp(𝑥𝑖𝑇𝑥𝑗)
𝐶(𝑥) = ∀𝑗
𝑓 𝑥𝑖 , 𝑥𝑗
𝑓 𝑥𝑖 , 𝑥𝑗
𝐶(𝑥)=
exp(𝑥𝑖𝑇𝑥𝑗)
∀𝑗 exp(𝑥𝑖𝑇𝑥𝑗)
![Page 18: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/18.jpg)
Non-local Operator
18𝑥
𝜃: 1 × 1× 1
𝜙: 1 × 1× 1
𝑇 × 𝐻 × 𝑊 × 512 𝑇 × 𝐻 × 𝑊 × 512
𝑔: 1 × 1× 1
𝑇 × 𝐻 × 𝑊 × 512
𝑇𝐻𝑊 × 512 512 × 𝑇𝐻𝑊
𝑇𝐻𝑊 × 𝑇𝐻𝑊
𝑇𝐻𝑊 × 512
𝑇𝐻𝑊 × 512normalize
𝑦𝑖 =1
𝐶(𝑥)
∀𝑗
𝑓 𝑥𝑖 , 𝑥𝑗 𝑔(𝑥𝑗)
𝑇 × 𝐻 × 𝑊 × 512
![Page 19: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/19.jpg)
Non-local Operator
19𝑥
𝜃: 1 × 1× 1
𝜙: 1 × 1× 1
𝑇 × 𝐻 × 𝑊 × 512 𝑇 × 𝐻 × 𝑊 × 512
𝑔: 1 × 1× 1
𝑇 × 𝐻 × 𝑊 × 512
𝑇𝐻𝑊 × 512 512 × 𝑇𝐻𝑊
𝑇𝐻𝑊 × 𝑇𝐻𝑊
𝑇𝐻𝑊 × 512
𝑇𝐻𝑊 × 512normalize
𝑦𝑖 =1
𝐶(𝑥)
∀𝑗
𝑓 𝑥𝑖 , 𝑥𝑗 𝑔(𝑥𝑗)
𝑇 × 𝐻 × 𝑊 × 512
![Page 20: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/20.jpg)
Non-local Operator as A Residual Block
3D Conv
3DConv
Non-local3D
ConvNon-local
VideoAction Class
𝑧𝑖 = 𝑦𝑖𝑊 + 𝑥𝑖
![Page 21: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/21.jpg)
Examples
![Page 22: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/22.jpg)
Action Recognition in Daily Lives
Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ivan Laptev, Ali Farhadi, Abhinav Gupta. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. ECCV 2016.
Charades Dataset: 157 classes, 9.8k videos, 30s per video
We let the people upload their own videos!
![Page 23: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/23.jpg)
Action Recognition on Charades
Method mAP
3D Conv 31.8%
3D Conv + Non-local 33.5%
![Page 24: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/24.jpg)
Opening A Book
24
![Page 25: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/25.jpg)
Opening A Book
25
The Non-local Block
![Page 26: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/26.jpg)
Opening A Book
Object states changes over time
Human-object, object-object interactions
X. Wang and A. Gupta. Video as Space-Time Region Graphs. ECCV 2018.
![Page 27: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/27.jpg)
Opening A Book
27
A1
B1
A2
B2
A3
B3
A4
B4
Highly Correlated
![Page 28: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/28.jpg)
Relations between Regions
![Page 29: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/29.jpg)
Relations between Regions
𝑓 𝑥𝑖 , 𝑥𝑗 = 𝜙 𝑥𝑖𝑇 𝜙′(𝑥𝑗)
𝐺𝑖𝑗 =exp 𝑓 𝑥𝑖, 𝑥𝑗
∀𝑗 exp 𝑓 𝑥𝑖, 𝑥𝑗
![Page 30: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/30.jpg)
Graph Convolutional Network
𝑍 = 𝐺𝑋𝑊
Kipf. Semi-Supervised Classification with Graph Convolutional Networks. 2017
𝑁
𝐺𝑁 × 𝑋𝑁
𝑑
𝑑
𝑑
𝑊×𝑍𝑁
𝑑
=
![Page 31: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/31.jpg)
Graph Convolutional Network
31
Propagation
![Page 32: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/32.jpg)
Connecting Non-local and GCN
𝑧𝑖 = 𝑦𝑖𝑊 + 𝑥𝑖𝑦𝑖 =1
𝐶(𝑥)
∀𝑗
𝑓 𝑥𝑖 , 𝑥𝑗 𝑔(𝑥𝑗)
=
∀𝑗
𝑓 𝑥𝑖 , 𝑥𝑗
∀𝑗 𝑓 𝑥𝑖 , 𝑥𝑗𝑔(𝑥𝑗)
=
∀𝑗
𝐺𝑖𝑗 𝑔(𝑥𝑗)
=
∀𝑗
𝐺𝑖𝑗 𝑔(𝑥𝑗)𝑊 + 𝑥𝑖
𝑍 = 𝐺 𝑔 𝑋 𝑊 + 𝑋
The Non-local Operator:
The Graph Convolution
![Page 33: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/33.jpg)
Action Recognition on Charades
33
Method mean AP
3D Conv 31.8%
3D Conv + Non-local 33.5%
3D Conv + Region Graph 36.2%
+4.4%
![Page 34: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/34.jpg)
Action Recognition on Charades
34
30%
35%
40%
45%
Involves Objects ?
No Yes
3D Conv
3D Conv + Graph
![Page 35: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/35.jpg)
Action Recognition on Charades
35
30%
35%
40%
45%
Pose Variances
3D Conv
3D Conv + Graph
![Page 36: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/36.jpg)
Connection to Mean-Shift
𝑦𝑖 =
∀𝑗
𝑓 𝑥𝑖 , 𝑥𝑗
∀𝑗 𝑓 𝑥𝑖 , 𝑥𝑗𝑔(𝑥𝑗)
The Non-local Operator:
The Mean-Shift Clustering:
𝑚(𝑥) =
𝑥𝑗∈𝑁(𝑥)
𝐾 𝑥, 𝑥𝑗
𝑥𝑗∈𝑁(𝑥) 𝐾 𝑥, 𝑥𝑗𝑥𝑗
Converging to the same mean?
https://tw.rpi.edu/web/project/JeffersonProjectAtLakeGeorge/Clustering
![Page 37: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/37.jpg)
Recent Related Work
Actor-Centric Relation Network[Sun et al, 2018]
Video Action Transformer Network[Girdhar et al, 2019]
Long-Term Feature Banks for Detailed Video Understanding[Wu et al, 2019]
![Page 38: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/38.jpg)
Learning Affinity with Semantic Supervision
![Page 39: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/39.jpg)
Learn Correspondencewithout Human Supervision
Goal:
![Page 40: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/40.jpg)
The visual world exhibits continuity
![Page 41: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/41.jpg)
Prior Work: Learning from Time
Predict Color in Time [Vondrick et al, 2018]
Inputs
Outputs
Predict Pixel in Time [Mathieu et al, 2015]
Predict Arrow of Time [Wei et al, 2018]
![Page 42: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/42.jpg)
Using Tracking to Learn Features
CNN CNN
Similarity
Tracking → Similarity[Wang et al, 2015]
![Page 43: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/43.jpg)
Using Tracking to Learn Features
CNN CNN
Similarity
Tracking → Similarity[Wang et al, 2015]
Limited by Off-the-shelf Trackers
![Page 44: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/44.jpg)
Similarity requires tracking Tracking requires similarity
Let’s jointly learn both!
![Page 45: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/45.jpg)
Learning to Track
How to obtain supervision?
ℱℱℱ
ℱ: a deep tracker
![Page 46: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/46.jpg)
Supervision: Cycle-Consistency in Time
Track backwards
Track forwards, back to the future
ℱℱ
ℱ
ℱℱℱ
![Page 47: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/47.jpg)
Backpropagation through time along the cycle
Supervision: Cycle-Consistency in Time
ℱℱ
ℱ
ℱℱℱ
![Page 48: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/48.jpg)
Differentiable Tracking
48
Encoder
𝜙
Encoder
𝜙
transpose
Patch feature in time 𝑡: 𝑥𝑡𝑝
Image feature in time 𝑡 − 1: 𝑥𝑡−1𝐼
100
900× =900
𝑐100
𝑐
𝑥𝑡−1𝐼 𝑥𝑡
𝑝
![Page 49: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/49.jpg)
Spatial
Transformer
𝜃Cropping
Differentiable Tracking
49
Encoder
𝜙
Encoder
𝜙
transpose
Patch feature in time 𝑡: 𝑥𝑡𝑝
Image feature in time 𝑡 − 1: 𝑥𝑡−1𝐼
Patch feature in time 𝑡 − 1: 𝑥𝑡−1𝑝
![Page 50: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/50.jpg)
Spatial
Transformer
𝜃Cropping
Differentiable Tracking
50
Encoder
𝜙
Encoder
𝜙
transpose
𝑥𝑡−1𝑝
= ℱ(𝑥𝑡−1𝐼 , 𝑥𝑡
𝑝)
ℱ
![Page 51: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/51.jpg)
Recurrent Tracking
51
𝑥𝑡𝑝
𝑥𝑡𝑝
𝑡 − 1
ℱ
𝑡
ℱ
𝑡 − 1
ℱ
𝑡 − 2
ℱ
𝑡 − 2
ℱ
𝑡 − 3
ℱ
ℒ𝑐𝑦𝑐𝑙𝑒
![Page 52: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/52.jpg)
Cycle-Consistency Loss Function
ℒ𝑐𝑦𝑐𝑙𝑒 = ||𝐿𝑜𝑐 𝑥𝑡𝑝
− 𝐿𝑜𝑐 𝑥𝑡𝑝
||22
𝑥𝑡𝑝
𝑥𝑡𝑝
ℱ
ℱℱℱ
ℱℱ
![Page 53: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/53.jpg)
Multiple Cycles
Sub-cycles: a natural curriculum
![Page 54: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/54.jpg)
Skip Cycles
Skip-cycles: skipping occlusions
![Page 55: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/55.jpg)
Visualization of Training
![Page 56: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/56.jpg)
Test Time: Nearest Neighbors in Feature Space
𝑡 − 1 𝑡
![Page 57: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/57.jpg)
𝑡 − 1 𝑡
Test Time: Nearest Neighbors in Feature Space
![Page 58: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/58.jpg)
Instance Mask Tracking
DAVIS Dataset
DAVIS Dataset: Pont-Tuset et al. The 2017 DAVIS Challenge on Video Object Segmentation. 2017.
![Page 59: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/59.jpg)
Pose Keypoint Tracking
JHMDB Dataset
![Page 60: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/60.jpg)
Comparison
Our Correspondence Optical Flow
![Page 61: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/61.jpg)
Pose Keypoint Tracking
JHMDB Dataset
Method PCK @.1
Optical Flow 45%
Vondrick et al. 45%
Ours 58%
Vondrick et al. Tracking Emerges by Colorizing Videos. ECCV 2018.
![Page 62: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/62.jpg)
Texture Tracking
DAVIS Dataset
DAVIS Dataset: Pont-Tuset et al. The 2017 DAVIS Challenge on Video Object Segmentation. 2017.
![Page 63: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/63.jpg)
Semantic Masks Tracking
Video Instance Parsing Dataset
Zhou et al. Adaptive Temporal Encoding Network for Video Instance-level Human Parsing. ACM MM 2018.
![Page 64: Learning Graph Representations for Video Understandingvalser.org/webinar/slide/slides/20190703/graph_tutorial...2019/07/03 · Action Recognition in Daily Lives Gunnar A. Sigurdsson,](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff2950e958ea445075a400e/html5/thumbnails/64.jpg)
Questions?