Deep Visual Understanding from Deep Learning, by Prof. Jitendra Malik

Deep Visual Understanding from Deep Learning Jitendra Malik UC Berkeley & Google

Upload: pashu-dewailly-christensen

Post on 16-Apr-2017


TRANSCRIPT

Deep Learning for Perception and Action

Deep Visual Understanding from Deep Learning
Jitendra Malik, UC Berkeley & Google

Moravec's argument (1998), ROBOT: Mere Machine to Transcendent Mind: 1 neuron = 1000 instructions/sec; 1 synapse = 1 byte of information. The human brain then processes 10^14 IPS and has 10^14 bytes of storage. In 2000, we had 10^9 IPS and 10^9 bytes on a desktop machine. Assuming Moore's law, we obtain human-level computing power in 2025, or with a cluster of 100 nodes in 2015.
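As a sanity check on the arithmetic, Moravec's extrapolation can be reproduced in a few lines, assuming the commonly quoted 18-month doubling time for Moore's law:

```python
import math

def years_to_reach(target_ips, current_ips, doubling_years=1.5):
    """Years until current_ips grows to target_ips under exponential doubling."""
    doublings = math.log2(target_ips / current_ips)
    return doublings * doubling_years

# Brain estimate: 10^14 IPS; a desktop machine in 2000: 10^9 IPS.
single_machine = 2000 + years_to_reach(1e14, 1e9)        # lands around 2025
cluster_of_100 = 2000 + years_to_reach(1e14, 100 * 1e9)  # around 2015
```

The factor of 10^5 between desktop and brain is about 16.6 doublings, i.e. roughly 25 years, which is where the 2025 figure comes from.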

Embodied Cognition

Vision (broadly, perception), Motor Control (broadly, planning), Language, Semantic Reasoning

Ontogeny of Intelligence: The Cambrian period (543-490 million years ago) led to the emergence of a wide variety of animal life. These animals had vision and locomotion capabilities. Sensory systems provide great benefits only when accompanied by the ability to move: to find food, avoid predators, etc.

If you don't need to move, you don't need an eye or a brain! https://goodheartextremescience.wordpress.com/2010/01/27/meet-the-creature-that-eats-its-own-brain/

Hominid evolution in the last 5 million years: Bipedalism freed the hand for tool making. Dexterous hands coevolved with larger brains.

Anaxagoras: "It is because of his being armed with hands that man is the most intelligent animal."

Origins of Language (from Trask)


The evolutionary progression: Vision and Locomotion → Manipulation → Language. Successes in AI seem to follow the same order!

Is Object Detection nearly solved?

Hubel and Wiesel (1962) discovered orientation sensitive neurons in V1

Convolutional Neural Networks (LeCun et al.): Used backpropagation to train the weights in this architecture. First demonstrated by LeCun et al. for handwritten digit recognition (1989). Applied in the sliding-window paradigm for tasks such as face detection in the 1990s. However, it was not competitive on standard computer vision object detection benchmarks in the 2000s. And then ImageNet and AlexNet happened...


R-CNN: Regions with CNN featuresGirshick, Donahue, Darrell & Malik (CVPR 2014)

Input image → Extract region proposals (~2k/image) → Compute CNN features → Classify regions (linear SVM). This and the MultiBox work from Google showed how to apply these architectures for object detection.
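The three-stage pipeline can be sketched schematically. Everything below (the proposal, feature, and classifier functions) is a toy stand-in for illustration, not the authors' implementation; in the paper, proposals come from selective search and classification uses per-class linear SVMs over CNN features:

```python
def propose_regions(image, n=2000):
    # Stage 1: class-agnostic region proposals (~2k per image).
    return [("box", i) for i in range(n)]

def cnn_features(image, region):
    # Stage 2: warp the region to a fixed size and run the CNN.
    # Toy stand-in: a one-element "feature vector".
    return [hash(region) % 7]

def classify(features):
    # Stage 3: score the feature vector with one linear SVM per class.
    # Toy stand-in: fixed scores.
    return {"person": 0.1, "car": 0.9}

def rcnn_detect(image):
    detections = []
    for region in propose_regions(image):
        scores = classify(cnn_features(image, region))
        label, score = max(scores.items(), key=lambda kv: kv[1])
        detections.append((region, label, score))
    return detections
```

The key structural point is that the expensive CNN runs once per proposal, which is exactly the inefficiency that Fast R-CNN later removed.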

Fast R-CNN (Girshick, 2015): R-CNN with SPP features; no need to warp individual windows.

There is also Faster R-CNN, which doesn't require external proposals.

The 3Rs of Vision: Recognition, Reconstruction & Reorganization

Talk at POCV Workshop, CVPR 2012

Fifty years of computer vision, 1965-2015:
1960s: Beginnings in artificial intelligence, image processing and pattern recognition.
1970s: Foundational work on image formation: Horn, Koenderink, Longuet-Higgins.
1980s: Vision as applied mathematics: geometry, multi-scale analysis, control theory, optimization.
1990s: Multiple view reconstruction well understood.
2000s: Learning approaches to recognition problems in full swing. Large datasets are collected and annotated, e.g. ImageNet.
2010s: Deep learning becomes popular, building off the availability of GPUs and annotated datasets.


Reconstructing the world

automatically from huge collections of photos downloaded from the Internet

Over the past 10 years, 3D modeling from images has made huge advances in scale, quality, and generality. We can reconstruct scenes. (Snavely, Seitz, Szeliski. Reconstructing the World from Internet Photo Collections. IJCV 2007.)

Reconstructing the world: Over the past 10 years, 3D modeling from images has made huge advances in scale, quality, and generality. We can reconstruct scenes that vary over time. (Matzen & Snavely. Scene Chronology. ECCV 2014.)



Martin-Brualla, Gallup, Seitz. Time-lapse Mining from Internet Photos. SIGGRAPH 2015.


Reconstructing the great indoors

Choi, Zhou, Koltun. Robust Reconstruction of Indoor Scenes. CVPR 2015. (Using depth cameras.)

Ikehata, Yan, Furukawa. Structured Indoor Modeling. ICCV 2015. (Semantic reconstruction of rooms and objects: point cloud → 3D mesh → rendering.)


ShapeNet (Stanford & Princeton)

Some problems that we can solve

Block Diagram of the Primate Visual System

Neuroscience & Computer Vision: A feed-forward view of processing in the ventral stream, with layers of simple and complex cells, led to the neocognitron and subsequently to convolutional networks. We now know that the ventral stream is much more complicated, with bidirectional as well as feedback connections. I am interested in computer vision tasks where feedback is key to the solution. This is a very natural way to capture context, helpful in pose recovery, instance segmentation, etc.

IEF (Iterative Error Feedback): Carreira, Agrawal, Fragkiadaki & Malik
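The feedback idea behind IEF can be illustrated with a toy loop: rather than regressing the answer in one feed-forward pass, a learned function repeatedly predicts a correction to the current estimate and the updated estimate is fed back in. The `predict_correction` below is a hypothetical linear stand-in for the learned network, not the paper's model:

```python
def predict_correction(image, current_estimate, target=5.0):
    # Toy stand-in for the learned corrector: step halfway toward a
    # fixed "true pose" value (here a single scalar for simplicity).
    return 0.5 * (target - current_estimate)

def ief(image, init_estimate=0.0, steps=4):
    """Iteratively refine an estimate by applying predicted corrections."""
    estimate = init_estimate
    for _ in range(steps):
        estimate += predict_correction(image, estimate)
    return estimate
```

Each pass conditions on the current estimate, which is how the loop captures context that a single feed-forward pass cannot.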

Social Perception: Computers today have pitifully low social intelligence. We need to understand the internal state of humans as they interact with each other and the external world. Examples: emotional state, body language, current goals.

What we would like to infer: Will person B put some money into person C's tip bag?

Visual Semantic Role LabelingGupta & Malik (2015)

What we can't do (yet)

The hierarchical structure of human behavior: movement, goals, actions and events. ACTION = MOVEMENT + GOAL

Events, e.g. a meal at a restaurant. Classical AI/cognitive science solution: schemas (frames, scripts, etc.). To have a robust, visually grounded solution we need to learn the equivalent from video plus knowledge-graph-like structures. Perhaps best tackled in particular domains, e.g. team sports, instructional videos.

What has been responsible for recent AI successes? Big Computing. Big Data.

What has been responsible for recent AI successes? Big Computing. Big Data. Big Annotation. Big Simulation.

Game scenarios can be simulated, but it's not so easy in other settings.

Consider infants.


External teacher signal → external supervision; internal teacher signal → self-supervision.

The Development of Embodied Cognition: Six Lessons from Babies. Linda Smith & Michael Gasser.

The Six Lessons: Be multimodal. Be incremental. Be physical. Explore. Be social. Use language.

An example: Learning to see by moving, P. Agrawal, J. Carreira, J. Malik (ICCV 2015)

Consider Poking

Same Poke, Different Outcomes

The knowledge of object class explains it!

Different objects change in different ways.

On Mental Models

"If the organism carries a 'small-scale model' of external reality and of its own possible actions within its head, it is able to try out various alternatives, conclude which is the best of them, react to future situations before they arise, utilize the knowledge of past events in dealing with the present and the future, and in every way to react in a much fuller, safer, and more competent manner to the emergencies which face it." (Craik, 1943, Ch. 5, p. 61.) Modern control theory (Kalman et al.) uses a state-space formalism to achieve this.
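Craik's "small-scale model" maps directly onto the state-space formalism: given a model x_{t+1} = A x_t + B u_t, the agent can roll candidate action sequences forward in its head before committing to one. A minimal scalar sketch (the 1-D linear dynamics here are an illustrative assumption, not any particular system):

```python
def simulate(A, B, x0, actions):
    """Roll a linear state-space model x' = A*x + B*u forward
    under a sequence of actions, returning the full trajectory."""
    x = x0
    trajectory = [x]
    for u in actions:
        x = A * x + B * u
        trajectory.append(x)
    return trajectory

# "Trying out an alternative" is just simulating it:
trajectory = simulate(A=1.0, B=1.0, x0=0.0, actions=[1.0, 1.0, 1.0])
```

Comparing the simulated end states of different action sequences is exactly the "try out various alternatives, conclude which is the best of them" step.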

For acting in novel situations, we need:

Model of the Agent

Model of the Environment: How will the environment look in the future when the agent interacts with it?


Consider the specific case of billiards: what force to apply?

Model of the environment

Even if a person has received no supervision for playing billiards, he can still predict the direction in which the blue ball needs to be hit to strike the yellow ball. How can this person perform this prediction?



Model of the environment. How to apply this force?

Model of the agent


Learning Visual Predictive Models of Physics for Playing Billiards. Katerina Fragkiadaki*, Pulkit Agrawal*, Sergey Levine, Jitendra Malik. ICLR 2016.

*Equal contribution

Moving Balls World

Factors of change

Table Geometry

Number of balls

Ball Size

Color of Balls/Walls. Like the real world, the moving balls world provides constantly changing environments.

Visual Predictive Model of Physics

Neural network


Prediction Module: a model that can predict future visual states (i.e., visual imagination).

One idea is that the person has a visual model of physics, i.e. he can predict how the world would look in the future when he applies forces to some objects.


Key Idea: Use object-centric predictions




World-Centric Prediction vs. Object-Centric Prediction

We decompose the problem of predicting the world dynamics into the problem of predicting the dynamics of individual objects. This makes our approach scalable to worlds with multiple objects.
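The decomposition can be sketched as follows. Here `predict_object` is a toy constant-velocity stand-in for the learned per-object predictor; the paper's model is a neural network operating on object-centred glimpses:

```python
def predict_object(position, velocity):
    # Per-object dynamics in the object's own frame.
    # Toy stand-in: constant-velocity motion.
    return (position[0] + velocity[0], position[1] + velocity[1])

def predict_world(objects):
    """Predict the next world state by predicting each object
    independently and composing the results. Because the predictor
    is shared across objects, this scales to any object count."""
    return [{"pos": predict_object(o["pos"], o["vel"]), "vel": o["vel"]}
            for o in objects]

balls = [{"pos": (0, 0), "vel": (1, 0)},
         {"pos": (2, 2), "vel": (0, -1)}]
next_state = predict_world(balls)
```

Training one predictor per object rather than one predictor per world is what lets a model trained on 2-3 balls generalize to 6-ball worlds.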

Our Model

Model's Imagination

Only Inputs:

Visual of the first frame

Applied forces

We show that our model can learn ball-wall collisions directly from visual inputs.


Model's Imagination: Multiple Balls. Only Inputs:

Visual of the first frame

Applied forces

The model can also predict how the world would evolve with multiple objects. We can use our models to perform novel actions, such as hitting a second moving ball, for which no training data was provided.

Model's Imagination vs. Ground Truth: We successfully model collisions. Input to the model: visual of the first frame and the applied force (yellow arrow). Dark → light blue: progression of time.


Object-centric vs. frame-centric: train on 2/3-ball worlds, generalize to 6-ball worlds.

Now that we have learned predictive models,

We can plan Actions!

Planning Actions: Visually imagine the effect of different forces! Then choose the optimal force which in imagination (simulation) leads to the desired goal state.

For illustration, suppose the agent has to strike the ball so that it hits the green ball. The agent can use the visual model of physics to imagine the different configurations of the world that would result from applying different forces. He can then choose the force that would lead to the world configuration closest to the goal state. Using this method of visual imagination, the agent can perform novel tasks without requiring any supervision.
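This imagination-based planning amounts to sampling candidate forces, rolling each through the forward model, and keeping the one whose imagined outcome lands closest to the goal. A minimal sketch, with a toy forward model standing in for the learned visual predictor:

```python
def forward_model(state, force):
    # Toy "imagination": the ball's position shifts by the force vector.
    # The learned model in the paper predicts this from pixels instead.
    return (state[0] + force[0], state[1] + force[1])

def plan_force(state, goal, candidate_forces):
    """Pick the candidate force whose imagined outcome is nearest the goal."""
    def goal_distance(force):
        imagined = forward_model(state, force)
        return (imagined[0] - goal[0]) ** 2 + (imagined[1] - goal[1]) ** 2
    return min(candidate_forces, key=goal_distance)

best = plan_force(state=(0, 0), goal=(3, 4),
                  candidate_forces=[(1, 1), (3, 4), (0, 0)])
```

No action supervision is needed: the only learned component is the forward model, and planning is search over imagined outcomes.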

Accuracy of pushing a ball to a desired position

Learning to Poke by Poking: Experiential Learning of Intuitive Physics. Pulkit Agrawal*, Ashvin Nair*, Pieter Abbeel, Jitendra Malik, Sergey Levine.

NIPS 2016

*Equal contribution


Data Collection

Forward Model

[Diagram: Image (x_t) → Image Encoder → z_t; Action (u_t) → Action Encoder; Predictor → z_{t+1}; Decoder → v_t]

We do not want to predict pixels!

Inverse Model: multimodality in the output space.


Our Method: Jointly Learn Forward and Inverse Models

[Diagram: Image (x_t) → Image Encoder → z_t; Action (u_t) → Action Encoder; Predictor → z_{t+1}]

Simultaneously learn and predict in an abstract feature space (z_t). The forward model regularizes the inverse model!
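The joint objective can be sketched with toy scalar stand-ins for the encoder and both models (none of these are the paper's networks). The point the sketch makes is structural: the forward loss is computed in feature space, not pixel space, so the features are pushed to retain whatever the action can change:

```python
def encode(image):
    # Toy encoder: a scalar summary of the "image" (a list of numbers).
    return sum(image) / len(image)

def inverse_model(z_t, z_next):
    # Predict the action that took z_t to z_next (toy: their difference).
    return z_next - z_t

def forward_model(z_t, u_t):
    # Predict the next feature from the current feature and the action.
    return z_t + u_t

def joint_loss(img_t, img_next, u_t, lam=0.5):
    """Inverse-model error plus a forward-model error in feature space.
    The forward term regularizes the features learned for the inverse task."""
    z_t, z_next = encode(img_t), encode(img_next)
    inverse_err = (inverse_model(z_t, z_next) - u_t) ** 2
    forward_err = (forward_model(z_t, u_t) - z_next) ** 2
    return inverse_err + lam * forward_err
```

With real networks, both terms are backpropagated into a shared encoder, which is what "jointly learn" means on the slide.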


Joint learning of forward and inverse models!

Results


Results


Results

Related and Contemporary Work

Pinto, L., & Gupta, A. Supersizing Self-Supervision: Learning to Grasp from 50k Tries and 700 Robot Hours. ICRA 2016.

Levine, S., Pastor, P., Krizhevsky, A., & Quillen, D. Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. arXiv:1603.02199, 2016.

Pinto, L., Gandhi, D., Han, Y., Park, Y. L., & Gupta, A. The Curious Robot: Learning Visual Representations via Physical Interactions. arXiv:1604.01360, 2016.

Embodied Cognition

Vision (broadly, perception), Motor Control (broadly, planning), Language, Semantic Reasoning