
How Machines Learn to Talk

Amitabha Mukerjee IIT Kanpur

Work done with:

Computer Vision: Profs. C. Venkatesh, Pabitra Mitra

Prithvijit Guha, A. Ramakrishna Rao, Pradeep Vaghela

Natural Language: Prof. Achla Raina, V. Shreeniwas

Robotics

Collaborations: IGCAR Kalpakkam

Sanjay Gandhi PG Medical Hospital

Visual Robot Navigation

Time-to-Collision-based Robot Navigation

Hyper-Redundant Manipulators

• Reconfigurable Workspaces / Emergency Access

• Optimal Design of Hyper-Redundant Systems – SCARA and 3D

The same manipulator can work in changing workspaces

Planar Hyper-Redundancy

4-link Planar Robot

Motion Planning

Micro-Robots

• Micro Soccer Robots (1999-)

• 8cm Smart Surveillance Robot – 1m/s

• Autonomous Flying Robot (2004)

• Omni-directional platform (2002)

Omni-Directional Robot Sponsor: [email protected]

Flying Robot

heli-flight.wmv

Test Flight of UAV. Inertial Measurement Unit (IMU) under commercial production

Start-Up at IIT Kanpur: Whirligig Robotics

Tracheal Intubation Device

Device for Intubation during general Anesthesia

Aperture for Fibre optic video cable

Endotracheal tube Aperture

Aperture for Oxygenation tube

Hole for suction tube

Control cables Attachment Points

Ball & Socket joint

Assists surgeon while inserting breathing tube during general anaesthesia

Sponsor: DST / [email protected]

Draupadi’s Swayamvar

Can the Arrow hit the rotating mark? Sponsor: Media Lab Asia

High DOF Motion Planning

• Accessing Hard to Reach spaces

• Design of Hyper-Redundant Systems

• Parallel Manipulators

Sponsor: BRNS / [email protected]

10-link 3D Robot – Optimal Design

Multimodal Language Acquisition

Consider a child observing a scene together with adults talking about it

Grounded Language: Symbols are grounded in perceptual signals

Use of simple videos with boxes and simple shapes, as commonly used in social psychology

Objective

To develop a computational framework for Multimodal Language Acquisition:

• acquiring the perceptual structure corresponding to verbs

• using Recurrent Neural Networks as a biologically plausible model for temporal abstraction

• adapting the learned model to interpret activities in real videos

Visually Grounded Corpus

Two psychological research films, one based on the classic Heider & Simmel (1944) film and the other based on Hide & Seek

These animations portray the motion paths of geometric figures (Big Square, Small Square and Circle)

Chase Alt

Cognate Clustering

• Similarity Clustering: Different expressions for the same action, e.g. “move away from center” vs “go to a corner”

• Frequency: Remove infrequent lexical units

• Synonymy: Set of lexical units being used consistently in the same intervals, to mark the same action, for the same set of agents.

Perceptual Process

[Block diagram of the VICES pipeline. Block labels: Multi Modal Input, Video, Descriptions, Feature Extraction, Features, Cognate Clustering, Trained Simple Recurrent Network, Events]

Design of Feature Set

• The features selected here are related to spatial aspects of conceptual primitives in children, such as position, relative pose, velocity etc.

• Use features that are kinematic in nature: temporal derivatives or simple transforms of the basic ones.

Monadic Features

Dyadic Predicates
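As a rough illustration of kinematic features of this kind, the sketch below derives per-object (monadic) and pairwise (dyadic) quantities from tracked object centres. The function names and the specific features (speed, inter-object distance and its rate of change) are assumptions chosen for illustration; the slides do not list the actual VICES feature set.

```python
import math

def monadic_features(track, dt=1.0):
    """Per-object features from a list of (x, y) centres, one per frame.

    Returns one dict per frame (from the second frame on) holding the
    position, the finite-difference velocity and the speed.
    """
    feats = []
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
        feats.append({"pos": (x1, y1), "vel": (vx, vy),
                      "speed": math.hypot(vx, vy)})
    return feats

def dyadic_features(track_a, track_b, dt=1.0):
    """Pairwise features: distance between A and B and its rate of change."""
    dists = [math.dist(a, b) for a, b in zip(track_a, track_b)]
    return [{"dist": d1, "dist_rate": (d1 - d0) / dt}
            for d0, d1 in zip(dists, dists[1:])]

# Hypothetical tracks: A moves right towards a stationary B.
a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
b = [(5.0, 0.0), (5.0, 0.0), (5.0, 0.0)]
mono = monadic_features(a)
dyad = dyadic_features(a, b)
```

A negative `dist_rate` over consecutive frames is the kind of signal a verb such as "come closer" could be expected to correlate with.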

Video and Commentary for Event Structures [VICES]

[Block diagram of the VICES pipeline. Block labels: Multi Modal Input, Video, Descriptions, Feature Extraction, Features, Cognate Clustering, Trained Simple Recurrent Network, Events]

The classification problem

The problem is one of time series classification

Possible methodologies include:

• Logic based methods

• Hidden Markov Models

• Recurrent Neural Networks

Elman Network

• Commonly a two-layer network with feedback from the first-layer output to the first-layer input

• Elman Networks detect and generate time-varying patterns

• They are also able to learn spatial patterns
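The feedback loop of an Elman network can be sketched as a forward pass in which the hidden layer receives the current input together with its own previous output (the context units). This is a minimal illustration with random, untrained weights; the class name, layer sizes and activation functions are assumptions, and training is omitted.

```python
import math
import random

def dot(w, v):
    return sum(a * b for a, b in zip(w, v))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class ElmanNetwork:
    """Two-layer recurrent net: hidden output is fed back as context input."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.w_in = rand(n_hidden, n_in)       # input   -> hidden
        self.w_ctx = rand(n_hidden, n_hidden)  # context -> hidden
        self.w_out = rand(n_out, n_hidden)     # hidden  -> output
        self.n_hidden = n_hidden

    def run(self, sequence):
        """Return one output vector per time step of the input sequence."""
        context = [0.0] * self.n_hidden
        outputs = []
        for x in sequence:
            hidden = [math.tanh(dot(wi, x) + dot(wc, context))
                      for wi, wc in zip(self.w_in, self.w_ctx)]
            outputs.append([sigmoid(dot(wo, hidden)) for wo in self.w_out])
            context = hidden  # feedback: becomes next step's context
        return outputs

net = ElmanNetwork(n_in=2, n_hidden=3, n_out=1)
outputs = net.run([[1.0, 0.0]] * 3)
```

Because of the context units, the same input presented at successive time steps generally yields different outputs, which is what lets the network respond to temporal patterns rather than single frames.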

Feature Extraction in Abstract Videos

• Each image is read into a 2D matrix

• Connected Component Analysis is performed

• A bounding box is computed for each such connected component

• Dynamic tracking is used to keep track of each object
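The connected-component and bounding-box steps can be sketched as follows, assuming the frame has already been binarised into a 2D matrix of 0/1 values and that 4-connectivity is used (both are assumptions; the slides do not specify them). The tracking step is not covered here.

```python
from collections import deque

def connected_components(image):
    """Find 4-connected foreground regions in a 2D 0/1 matrix.

    Returns one bounding box per region as
    (min_row, min_col, max_row, max_col), in row-major discovery order.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not image[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes

# Hypothetical frame containing two shapes.
frame = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
boxes = connected_components(frame)
```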

Working with Real Videos

Challenges

• Noise in real world videos

• Illumination changes

• Occlusions

• Extracting depth information

Our Setup

• Camera is fixed at head height.

• Angle of depression is approximately 0 degrees.

Video

Background Subtraction

• Learn on still background images

• Find pixel intensity distributions

• Classify each pixel as background if $P(x,y) - \mu(x,y) < k\,\sigma(x,y)$

Remove Shadows

• Special case of reduced illumination: $S = k \cdot P$, where $k < 1.0$

2
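A minimal sketch of this kind of per-pixel background model is given below. It assumes a single Gaussian per pixel, tests the absolute deviation from the mean against k standard deviations, and treats a pixel as shadow when its intensity is a scaled-down copy of the background mean. The threshold values, the `min_sigma` floor and the `k_min` bound are illustrative assumptions, not values from the slides.

```python
import statistics

def learn_background(frames):
    """Per-pixel mean and standard deviation over still background frames."""
    rows, cols = len(frames[0]), len(frames[0][0])
    mean = [[statistics.fmean([f[r][c] for f in frames]) for c in range(cols)]
            for r in range(rows)]
    std = [[statistics.pstdev([f[r][c] for f in frames]) for c in range(cols)]
           for r in range(rows)]
    return mean, std

def is_background(p, mu, sigma, k=3.0, min_sigma=1.0):
    """Pixel is background if it lies within k sigma of the learned mean.

    min_sigma (assumed) stops perfectly static pixels from getting a
    zero-width acceptance band.
    """
    return abs(p - mu) < k * max(sigma, min_sigma)

def is_shadow(p, mu, k_min=0.5):
    """Shadow = darker, scaled copy of the background: p = k * mu, k_min <= k < 1."""
    return mu > 0 and k_min <= p / mu < 1.0

def foreground_mask(frame, mean, std):
    """True where a pixel is neither background nor shadow."""
    return [[not is_background(p, m, s) and not is_shadow(p, m)
             for p, m, s in zip(fr, mr, sr)]
            for fr, mr, sr in zip(frame, mean, std)]

# Hypothetical 2x2 grey-level frames of an empty scene.
still = [[[100, 100], [100, 100]],
         [[102, 98], [100, 100]],
         [[98, 102], [100, 100]]]
mean, std = learn_background(still)
# New frame: slight noise, a bright object, a shadowed pixel, unchanged pixel.
mask = foreground_mask([[101, 200], [70, 100]], mean, std)
```

In this example only the bright pixel is kept as foreground; the darkened pixel (70 against a mean of 100) falls inside the assumed shadow band and is discarded.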

Contd..

Extract Human Blobs

• By Connected Component Analysis

• A bounding box is computed for each person

Track Human Blobs

• Each object is tracked using a mean-shift tracking algorithm.
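The core mean-shift iteration can be sketched as repeatedly moving a search window to the centroid of the weights it covers until it stops moving. Mean-shift trackers typically derive these weights from a colour-histogram back-projection; here a binary foreground mask is used instead, which is a simplifying assumption, as are the function name, the square window and the iteration limit.

```python
def mean_shift(weights, centre, half_size, max_iter=20):
    """Shift a square window to the local centroid of `weights`.

    weights   : 2D list of non-negative values (e.g. a foreground mask)
    centre    : (row, col) starting position of the window centre
    half_size : window covers centre +/- half_size in both directions
    Returns the converged (row, col) window centre.
    """
    rows, cols = len(weights), len(weights[0])
    cy, cx = centre
    for _ in range(max_iter):
        total = sum_y = sum_x = 0.0
        for y in range(max(0, cy - half_size), min(rows, cy + half_size + 1)):
            for x in range(max(0, cx - half_size), min(cols, cx + half_size + 1)):
                w = weights[y][x]
                total += w
                sum_y += w * y
                sum_x += w * x
        if total == 0:
            break  # nothing under the window; keep the last position
        ny, nx = round(sum_y / total), round(sum_x / total)
        if (ny, nx) == (cy, cx):
            break  # converged
        cy, cx = ny, nx
    return cy, cx

# Hypothetical 9x9 mask with a 3x3 blob centred at (5, 5).
mask = [[1 if 4 <= r <= 6 and 4 <= c <= 6 else 0 for c in range(9)]
        for r in range(9)]
# Start from the blob's previous position, slightly off-centre.
position = mean_shift(mask, centre=(3, 3), half_size=3)
```

In a tracker, the converged position from one frame would serve as the starting centre for the next frame.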

Contd..

Depth Estimation

Two approximations

• Using Gibson’s affordances

• Camera geometry

Affordances: Visual Clues

• The action of a human is triggered by the environment itself, e.g. a floor offers walk-on ability

• Every object affords certain actions to perceive, along with anticipated effects, e.g. a cup’s handle affords grasping-lifting-drinking

Contd..

Gibson’s model

• Horizon is fixed at the head height of the observer.

Monocular Depth Cues

• Interposition: An object that occludes another is closer.

• Height in the visual field: The higher an object is, the further away it is.

Depth Estimation

Pinhole Camera Model

• Mapping $(X, Y, Z)$ to $(x, y)$: $x = X f / Z$, $y = Y f / Z$

• For the point of contact with the ground: $Z \propto 1 / y$, $X \propto x / y$
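The ground-contact relations follow from the pinhole equations once Y is fixed: for a point on the floor, Y equals the camera height H, so y = H f / Z gives Z = f H / y, and then X = x Z / f = x H / y. The sketch below applies this, assuming a horizontal optical axis, image coordinates measured from the principal point with y pointing downward from the horizon, and a known focal length and camera height. The function name and the numeric values are hypothetical.

```python
import math

def ground_point_depth(x, y, focal_length, camera_height):
    """Lateral offset X and depth Z of a ground-contact point.

    (x, y) are image coordinates relative to the principal point, with y
    measured downward from the horizon line (so y > 0 for floor points).
    focal_length is in pixels; camera_height sets the units of X and Z.
    """
    if y <= 0:
        raise ValueError("ground-contact point must lie below the horizon")
    z = focal_length * camera_height / y   # Z = f * H / y
    big_x = x * z / focal_length           # X = x * Z / f = x * H / y
    return big_x, z

# Hypothetical values: f = 500 px, camera 1.6 m above the floor,
# feet of a person seen at (100, 200) px relative to the principal point.
X, Z = ground_point_depth(100, 200, focal_length=500, camera_height=1.6)
```

Applying this to the bottom edge of each person's bounding box in every frame yields a top-view (Z-X plane) trajectory.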

Depth plot for A chase B Top view (Z-X plane)

Results (contd..)

Results

Separate-SRN-for-each-action

• Trained & tested on different parts of the abstract video

• Trained on abstract video and tested on real video

Single-SRN-for-all-actions

• Trained on synthetic video and tested on real video

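Basis for Comparison

Let the total time of the visual sequence for each verb be $t$ time units.

• $E$: intervals when subjects describe an event as occurring

• $E'$: intervals when VICES describes an event as occurring

• $\bar{E} = t - E$, $\bar{E}' = t - E'$

• $FM$: intervals classified as Focus Mismatches

$\text{True Positives} = |E \cap E'| \,/\, |E|$

$\text{False Positives} = |E' - E| \,/\, |E|$

$\text{False Negatives} = |E - E'| \,/\, |E|$

$\text{Focus Mismatches} = |FM| \,/\, |E|$

$\text{Accuracy} = \left(|E \cap E'| + |\bar{E} \cap \bar{E}'|\right) / t$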
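One way to read these measures is as set operations on time units: E is the set of time units in which subjects describe the event, E' the set in which VICES does, and the complements are taken with respect to the t time units of the sequence. The sketch below computes the measures under that reading, with true positives, false positives and false negatives normalised by the size of E and accuracy taken as the fraction of time units on which the two agree. This normalisation is an interpretation of the slide, and the function name and example intervals are hypothetical.

```python
def compare_intervals(subject, vices, total, focus_mismatch=()):
    """Compare human and system event intervals over `total` time units.

    subject, vices, focus_mismatch: iterables of time-unit indices.
    """
    e, e_sys = set(subject), set(vices)
    all_units = set(range(total))
    not_e, not_e_sys = all_units - e, all_units - e_sys
    return {
        "true_positives": len(e & e_sys) / len(e),
        "false_positives": len(e_sys - e) / len(e),
        "false_negatives": len(e - e_sys) / len(e),
        "focus_mismatches": len(set(focus_mismatch)) / len(e),
        # Agreement on both "occurring" and "not occurring" time units.
        "accuracy": (len(e & e_sys) + len(not_e & not_e_sys)) / total,
    }

# Hypothetical 10-unit sequence: subjects mark units 2-5, the system 4-7.
scores = compare_intervals(range(2, 6), range(4, 8), total=10)
```

Under this reading, true positives and false negatives always sum to 1 for a given verb.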

Separate SRN for each action

Framework: Abstract video

| Verb | True Positives | False Positives | False Negatives | Focus Mismatches | Accuracy |
|---|---|---|---|---|---|
| hit | 46.02% | 3.06% | 53.98% | 2.4% | 92.37% |
| chase | 24.44% | 0% | 75.24% | 0.72% | 93.71% |
| come Closer | 25.87% | 14.61% | 73.26% | 16.77% | 63.66% |
| move Away | 46.34% | 7.21% | 52.33% | 15.95% | 73.37% |
| spins | 82.54% | 0% | 16.51% | 24.7% | 97.03% |
| moves | 68.24% | 0.12% | 31.76% | 1.97% | 77.33% |

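| Verb | True Positives | False Positives | False Negatives | Focus Mismatches |
|---|---|---|---|---|
| hit | 3 | 3 | 1 | 1 |
| chase | 6 | 0 | 3 | 4 |
| come Closer | 6 | 20 | 7 | 24 |
| move Away | 8 | 3 | 0 | 14 |
| spins | 22 | 0 | 1 | 9 |
| moves | 5 | 1 | 2 | 7 |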

Time Line comparison for Chase

Separate SRN for each action

Framework: Real video (action recognition only)

| Verb | Retrieved | Relevant | True Positives | False Positives | False Negatives | Precision | Recall |
|---|---|---|---|---|---|---|---|
| A Chase B | 237 | 140 | 135 | 96 | 5 | 58.4% | 96.4% |
| B Chase A | 76 | 130 | 76 | 0 | 56 | 100% | 58.4% |

Single SRN for all actions

Framework: Real video

| Verb | Retrieved | Relevant | True Positives | False Positives | False Negatives | Precision | Recall |
|---|---|---|---|---|---|---|---|
| Chase | 239 | 270 | 217 | 23 | 5 | 91.2% | 80.7% |
| Going Away | 21 | 44 | 13 | 8 | 31 | 61.9% | 29.5% |
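The Precision and Recall columns are consistent with the usual definitions, precision = true positives / retrieved and recall = true positives / relevant. This is an assumption about how the figures were computed; it does reproduce the Going Away row, as the short check below shows. The function name is hypothetical.

```python
def precision_recall(retrieved, relevant, true_positives):
    """Standard retrieval measures, returned as fractions in [0, 1]."""
    precision = true_positives / retrieved
    recall = true_positives / relevant
    return precision, recall

# Going Away row: Retrieved 21, Relevant 44, True Positives 13.
precision, recall = precision_recall(retrieved=21, relevant=44, true_positives=13)
precision_pct = round(precision * 100, 1)
recall_pct = round(recall * 100, 1)
```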

Conclusions & Future Work

• Sparse nature of video provides for ease of visual analysis

• Directly learning event structures from the perceptual stream.

Extensions:

• Learn fine nuances between event structures of related action words.

• Learn the morphological variations.

• Extend the work towards using Long Short Term Memory (LSTM).

• Hierarchical acquisition of higher level action verbs.
