Roland Memisevic at AI Frontiers: Common Sense Video Understanding at TwentyBN
TRANSCRIPT
Twenty Billion Neurons
Berlin & Toronto based Video Understanding Company
DOMESTIC COMPANIONS (85M smart cameras)
AUGMENTED REALITY (6M AR glasses)
AUTOMOTIVE (10M cars)
COLLABORATIVE ROBOTICS (150M cobots)
SMARTPHONE APPS (3 BN phones)
All figures are estimated number of devices in 2020
By 2020:
CONSUMER VIDEOS (80% of Internet Traffic)
Sources: KPCB, Barclays
Dog, Cat
15 people, 3 street signs
1986: “Neural networks don’t work”
2012: “Neural networks can’t do image classification”
2014: “Neural networks can’t translate text”
2016: “Neural networks can’t play Go”
2017: “Neural networks don’t have common sense”
?
At TwentyBN we build the brain that allows cameras to see
Prof. Yoshua Bengio
Scientific Advisor
Professor at MILA Montréal; noted for his pioneering work
on deep learning
Valentin Haenel
VP Engineering
Co-initiator of PyData Berlin; contributor in more than 50
open source projects
Nathan Benaich
Advisor
VC investor, technologist, former scientist; Organizer of
London.ai and RAAIS
+ 13 full-time staff, including AI researchers, engineers and product people
Roland Memisevic
CEO & Chief Scientist
15+ years of experience in DL as Professor (MILA Montreal) & PhD student of Geoff Hinton
Moritz Müller-Freitag
COO & Head of Product
Experience as data scientist (Eleven) & country manager (Savedo/HitFox Group)
Ingo Bax
CTO
Experience as Professor (FH Münster) & principal software architect (XING AG)
Christian Thurau
CBDO
Experience as Co-founder, CTO (Game Analytics, exit) & researcher (Fraunhofer)
Integrated technology stack
● Research & engineering
● Data platform
● Embedded real-time net
● Solutions
Camera based gesture control
Existing solutions
● Require depth sensor devices
● ~5 gestures
● Low accuracy
● Never gained traction
TwentyBN solution
● RGB (for example, cheap, built-in laptop camera)
● Recognizes 25 hand gestures
● Very high accuracy
● Runs in real-time on a laptop using RGB camera input
Variations
Camera angles and scene layouts
Multi-person actions and localization
Interactivity
Complex object interactions
Indoor activity monitoring
Output: “Person picking [something] up”
Output: “[Something] falling like a feather or paper”
Output: “Person leaving through a door”
Output: “Bending [something] until it breaks”
Output: “Trying to bend [something unbendable] so nothing happens”
Output: “[gesture] Zooming Out With Two Fingers”
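The outputs above read as caption templates with bracketed placeholders such as [something]. As a minimal sketch of how such a template could be turned into a concrete caption, the snippet below substitutes object names into the placeholders. The function name `fill_template` and the example object names are hypothetical and are not taken from the talk.

```python
import re

def fill_template(template, objects):
    """Replace bracketed placeholders, left to right, with the given object names.

    Hypothetical helper for illustration; placeholders are kept as-is
    when there are fewer object names than placeholders.
    """
    remaining = iter(objects)

    def substitute(match):
        # Use the next object name, or keep the placeholder if none is left.
        return next(remaining, match.group(0))

    return re.sub(r"\[[^\]]+\]", substitute, template)

# Templates copied from the outputs above; the object names are made up.
caption_a = fill_template("Person picking [something] up", ["a mug"])
caption_b = fill_template(
    "Trying to bend [something unbendable] so nothing happens", ["a spoon"]
)
caption_c = fill_template("Bending [something] until it breaks", [])

print(caption_a)
print(caption_b)
print(caption_c)
```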
We support all stages of our clients’ product cycles
Product: Description
Software licensing: Software that adds video capabilities to your product
Data licensing: High-quality labeled videos customized to support your video applications
Hardware licensing: Softcore IP
20BN-JESTER
A crowd-acted dataset of generic human hand gestures.
Number of Videos: 148,094
License: Free for academic use
(Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license, CC BY-NC-ND 4.0)
https://www.twentybn.com/datasets/jester
20BN-SOMETHING-SOMETHING
A crowd-acted dataset of basic interactions with everyday objects.
Number of Videos: 108,499
License: Free for academic use
(Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license, CC BY-NC-ND 4.0)
https://www.twentybn.com/datasets/something-something
Contrastive classes make learning harder and networks stronger
Tearing [something] into two pieces VS Tearing [something] just a little bit 0.74 (0.52)
Pretending to pick [something] up VS Picking [something] up 0.86 (0.75)
Pretending to pour VS Pouring 0.82 (0.64)
Pouring with overflow VS Pouring without 0.76 (0.54)
Pretending to put [something] onto VS Putting [something] onto [something] 0.82 (0.64)
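The slide does not say how the scores next to each class pair were computed. As an illustrative assumption only, one way to measure how well a model separates two contrastive classes is to restrict evaluation to clips whose ground truth is one of the two classes and count how often the prediction is right. The function name `pairwise_accuracy` and the toy labels below are invented for this sketch and do not reproduce the slide's numbers.

```python
def pairwise_accuracy(ground_truth, predictions, class_a, class_b):
    """Accuracy over the samples whose ground truth is class_a or class_b.

    Assumed metric for illustration; the talk does not define its scores.
    """
    pair = {class_a, class_b}
    relevant = [(t, p) for t, p in zip(ground_truth, predictions) if t in pair]
    if not relevant:
        raise ValueError("no samples from either class")
    correct = sum(1 for t, p in relevant if t == p)
    return correct / len(relevant)

# Toy labels, made up for this example (not results from the talk).
truth = ["Pouring", "Pretending to pour", "Pouring", "Pretending to pour", "Opening"]
preds = ["Pouring", "Pouring", "Pouring", "Pretending to pour", "Opening"]

score = pairwise_accuracy(truth, preds, "Pretending to pour", "Pouring")
print(score)
```

Restricting to the pair keeps unrelated classes (here, "Opening") from inflating the score, which is why the toy example evaluates four clips rather than five.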
Mistaken “opening” predictions
Ground truth: Moving [part] of [something]
Prediction: Opening [something]
Ground truth: Unfolding [something]
Prediction: Opening [something]
Ground truth: Putting [something] on a flat surface without letting it roll
Prediction: Opening [something]
Mistaken “covering” predictions
Ground truth: Putting [something] in front of [something]
Prediction: Covering [something]
Ground truth: Turning [something] upside down
Prediction: Covering [something]
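The error examples above are grouped by the class the network wrongly predicted. A minimal sketch of that kind of error analysis follows, using only the ground truth and prediction pairs listed on these slides. The helper name `group_by_prediction` is hypothetical.

```python
from collections import defaultdict

# (ground truth, prediction) pairs copied from the error examples above.
MISTAKES = [
    ("Moving [part] of [something]", "Opening [something]"),
    ("Unfolding [something]", "Opening [something]"),
    ("Putting [something] on a flat surface without letting it roll",
     "Opening [something]"),
    ("Putting [something] in front of [something]", "Covering [something]"),
    ("Turning [something] upside down", "Covering [something]"),
]

def group_by_prediction(pairs):
    """Map each wrongly predicted class to the ground-truth classes it replaced."""
    grouped = defaultdict(list)
    for truth, predicted in pairs:
        if truth != predicted:  # keep only actual mistakes
            grouped[predicted].append(truth)
    return dict(grouped)

confusions = group_by_prediction(MISTAKES)
for predicted, truths in confusions.items():
    print(f"{predicted}: {len(truths)} mistaken predictions")
    for truth in truths:
        print(f"  ground truth: {truth}")
```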
Transfer learning