Making robust computer vision in games
Presented by: Diarmid Campbell

Introduction

Who I am: Diarmid Campbell

What I do: Run the Vision R&D group

Where we do it: Sony’s London development studio

What we do: Research computer vision for camera based games

This talk: Making robust computer vision in games

Contents

What we do and why
The development process
Testing and videos
Computer Vision Concepts
A robust head tracker
Marker based Augmented Reality
The problems we faced
A demo of EyePet

Camera based games

Camera mounted on the TV
You see yourself on the TV
Game is overlaid on top of you

Past games on PS2

Computer Vision is hard

“Computer vision makes you want to kill yourself”

-Dr Nick Lord 2009

Why is it hard?

Humans manage it effortlessly
An image is a 2D array of numbers
Take 5 images and plot them as a height map

Pick the odd one out

Pick the odd one out

Odd one out

Factors affecting the pixels

Background objects in scene
Orientation/position of objects
Lighting/Shadows
Occlusion

George is in the pixels

Not interested in those
George was hidden in the pixels
“Here is an image, what is it of?”
The general computer vision problem is hard
If we constrain the problem, it is much easier (but still hard)

Robust Inputs

We can use computer vision as an input mechanism
Motion detection in EyeToy games

Robustness is how consistently an input mechanism does what the player is expecting

An input mechanism must be robust

Importance of robustness

If your fire button only worked 9 times out of 10, you would chuck your controller out.


There are ways around it


Imagine your gun is a champagne bottle


Each button click shakes it
Eventually the top blows off
The lack of robustness is hidden


Perhaps you need to now fight tortoises instead of warriors


The mechanic is now “robust”
But it is laggy and unresponsive
Cannot rely on split-second timing


Illustrates a general point
If the game copes well with non-robust inputs, it will also cope well with someone not playing it well
It creates a skill ceiling
Manifests itself as lack of gameplay depth


If you want game mechanics with a deep skill base, robust input is essential

The Development Process

Computer Vision Researcher: “Tell me what the game mechanic is and I’ll make you a state of the art solution”

Game designer: “Give me something that works and I’ll see what we can make that’s fun”

The chicken and the egg

You cannot do one before the other
Both development timelines happen in parallel
We are still figuring it out
Here are some guidelines

Timeline (months 1 to 24)
Game development: Concepting, Prototyping, Production (month 9), Alpha (month 19), Beta (month 21), Master (month 23)
Computer Vision research: Survey state of the art, 1st pass prototype, Extend methods/analyse, Production (month 9), Alpha (month 15), Beta (month 17), Master (month 19)

Research timeline

Convinced we can create the technology

Something up and running

Time

Vision tech beta before game reaches alpha

Required infrastructure

Prototyping environment: Matlab, Octave
Be able to capture videos
Runtime algorithms: OpenCV, VXL

Videos and testing

Videos and testing

Computer vision is hard because many variables affect the images:
The lighting
The player’s clothes
The wallpaper
Spectators

3D cameras have their own pros and cons

Representative videos

Videos allow us to capture these variables and test

Videos MUST be representative
Works in 99% of cases
Useless if that 1% appears in 50% of living rooms
Make videos early in development
Demo: head tracker capturing

Head detection videos

We run it through different algorithms
Cell SDK face detector
Show failure modes

When it fails we can find the frame it failed in and debug

Regression testing

Automated testing
Run through a load of videos
Compare with expected results
Expected results could be as simple as: is the head visible?
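The regression test described above can be sketched roughly as follows. This is only an illustration, not the studio's actual harness: the names `run_regression` and `toy_head_detector`, the idea of one boolean "is the head visible?" label per frame, and the brightness rule in the toy detector are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Assumed types: a "video" is a list of greyscale frames, and the expected
# result for each frame is a bool answering "is the head visible?".
Frame = List[List[int]]
Detector = Callable[[Frame], bool]


def run_regression(
    detector: Detector,
    videos: Dict[str, List[Frame]],
    expected: Dict[str, List[bool]],
) -> Dict[str, List[int]]:
    """Return, per video, the indices of frames where the detector
    disagrees with the expected result."""
    failures: Dict[str, List[int]] = {}
    for name, frames in videos.items():
        bad = [
            i for i, frame in enumerate(frames)
            if detector(frame) != expected[name][i]
        ]
        if bad:
            failures[name] = bad
    return failures


def toy_head_detector(frame: Frame) -> bool:
    """Stand-in detector (hypothetical): reports a head if any pixel
    is brighter than 128."""
    return any(p > 128 for row in frame for p in row)
```

Returning the failing frame indices, rather than a single pass/fail flag, matches the earlier point that when it fails we want to find the frame it failed in and debug.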

When videos aren’t enough

SCEA R&D labs invented the forthcoming PlayStation®Move controller

Uses a camera and other sensors to track the controller

Videos were good early on
But cannot change a video:
Lighting
Backgrounds
Camera settings

Solution

Video1 Video2

Reasons to buy a robot arm (as if you really need persuading)

Can test the same motion under many different conditions

Can try special hard cases

Computer Vision Concepts

Computer Vision Concepts

Videos tell us when it fails
How do we fix it?
This is the field of computer vision
I cannot go into details of techniques
Instead I will explain:
The common concepts
How they link together

This should help if you:
Read papers
Talk to experts

Feature extraction

Images contain a lot of information
This one is 900K

Feature extraction

Instead of using pixels directly, extract high-level properties of groups of pixels

This results in less data, which is more relevant to the problem at hand

Image -> Feature Extraction -> Features

Feature extraction

PS3 Demo: Basic image
PS3 Demo: Canny edge detector
Invariant to lighting changes
Store additional gradient info
PS3 Demo: Motion
Used in all our camera games
PS3 Demo: Feature points
Store image patch for each one
Can match them frame to frame
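As a small illustration of feature extraction, here is a sketch of a motion feature. The slides do not say how the motion demo is computed; simple frame differencing with a fixed threshold is assumed here, and the names `motion_mask` and `motion_amount` are made up for the example.

```python
from typing import List

Image = List[List[int]]  # greyscale rows of 0..255 values


def motion_mask(prev: Image, curr: Image, threshold: int = 20) -> List[List[bool]]:
    """Mark pixels whose brightness changed by more than `threshold`
    between two consecutive frames."""
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def motion_amount(mask: List[List[bool]]) -> float:
    """Fraction of pixels that moved: one number instead of a whole image."""
    total = sum(len(row) for row in mask)
    return sum(cell for row in mask for cell in row) / total if total else 0.0
```

The point is the reduction in data: two full frames go in, and a small mask (or a single number) that is more relevant to "did the player wave here?" comes out.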

Likelihood functions

“Given that we have observed these features, what is the probability that we are observing what we modelled?”

Conditional probability

Bayesian statistics underpins most vision algorithms

Model + Features -> Likelihood Function -> P(F | M)

Cost functions = Likelihood functions
Some terminology
Sometimes you will hear about “Cost functions”
They are the same concept:
Likelihood goes up with a good match
Cost goes down
One is (conceptually) the inverse of the other

Cost functions

Sum of Squared Differences (SSD)

SSD = 1532: High cost = bad match
SSD = 12: Low cost = good match
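For reference, Sum of Squared Differences is only a few lines of code. This is a minimal sketch that treats each image patch as a flat list of pixel values; the function name `ssd` and the example numbers are illustrative and are not taken from the demo.

```python
from typing import Sequence


def ssd(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of Squared Differences between two equally sized patches,
    each given as a flat sequence of pixel values."""
    if len(a) != len(b):
        raise ValueError("patches must have the same number of pixels")
    return sum((x - y) ** 2 for x, y in zip(a, b))


template = [10, 20, 30, 40]
good_match = [11, 19, 30, 42]   # nearly the same patch
bad_match = [40, 30, 20, 10]    # very different patch

low_cost = ssd(template, good_match)    # 6
high_cost = ssd(template, bad_match)    # 2000
```

A perfect match scores 0, and the cost grows as the patches diverge.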


Model(1), Model(2), . . . Model(n) + Features -> Classifier -> Most likely model

Classifiers

Compares observed features to a number of models
Tells you which model fits the features best

Which model fits best
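In its simplest form, a classifier of this kind just picks the model with the lowest cost. The sketch below assumes each model is a single reference feature vector and uses SSD as the cost; real classifiers (such as the boosted cascade mentioned later) are far more sophisticated, and the model names and numbers here are invented.

```python
from typing import Dict, Sequence


def ssd(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of Squared Differences: low cost means a good match."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(features: Sequence[float], models: Dict[str, Sequence[float]]) -> str:
    """Return the name of the model whose reference features
    have the lowest cost against the observed features."""
    return min(models, key=lambda name: ssd(features, models[name]))


# Hypothetical reference vectors, purely for illustration.
models = {"face": [200, 50, 200], "non_face": [100, 100, 100]}
best = classify([190, 60, 210], models)   # "face"
```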

Classifiers: Face example

Is this a face?


Classic detector (Viola-Jones)

Image -> Feature Extraction -> Haar Wavelet Features

Haar Wavelet Features + Face model + Non-face model -> Boosted cascade Classifier -> Is it a face?

Models are trained on example images

Classifier


PS3 demo

Detectors

We have a model (with associated state)
Given some observed features
Detector returns:
Is the object present?
What is its state? (X, Y position / rotation / human pose)

Model + Features -> Detector -> State, Is object present?

Detectors: Faces again

Viola-Jones face detector:
Scans a box over the image
Different positions and sizes
Runs the classifier and returns any positives
Recall face detection demo
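The scanning loop of such a detector can be sketched as below. This shows only the "box over the image at different positions and sizes" part; the classifier is passed in as a function, and `toy_classifier` is a hypothetical stand-in, not a Viola-Jones cascade. The step size and the list of box sizes are assumptions.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Image = List[List[int]]
Box = Tuple[int, int, int]  # (x, y, size) of a square window


def sliding_windows(width: int, height: int, sizes: Sequence[int], step: int) -> Iterator[Box]:
    """Yield every square window of each size that fits inside the image."""
    for size in sizes:
        for y in range(0, height - size + 1, step):
            for x in range(0, width - size + 1, step):
                yield (x, y, size)


def detect(image: Image, classifier: Callable[[Image], bool],
           sizes: Sequence[int], step: int = 1) -> List[Box]:
    """Run the classifier on every window and return the positives."""
    height = len(image)
    width = len(image[0]) if image else 0
    hits: List[Box] = []
    for x, y, size in sliding_windows(width, height, sizes, step):
        patch = [row[x:x + size] for row in image[y:y + size]]
        if classifier(patch):
            hits.append((x, y, size))
    return hits


def toy_classifier(patch: Image) -> bool:
    """Hypothetical classifier: 'present' if every pixel is bright."""
    return all(p > 128 for row in patch for p in row)
```

The returned boxes double as the detector's state: position from (x, y) and scale from the window size.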

Trackers

We have a model, some observed features and the previous state

Tracker returns the next state

Model + Features + Previous State -> Tracker -> Next State

Trackers: Face example

PS3 Demo: SSD tracker
PS3 Demo: Wand game
If we move quickly the tracker gets stuck in a local minimum
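A rough sketch of why an SSD tracker behaves this way: it only searches near the previous state, so if the target moves further than the search radius in one frame, the best nearby patch wins even though it is the wrong one. The search strategy shown (exhaustive search in a small window) and all names are assumptions; the demo tracker may work differently.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]


def ssd(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def patch_at(image: Image, x: int, y: int, size: int) -> List[int]:
    """Flatten the size x size patch whose top-left corner is (x, y)."""
    return [v for row in image[y:y + size] for v in row[x:x + size]]


def track(image: Image, template: Sequence[int], size: int,
          prev_xy: Tuple[int, int], radius: int) -> Tuple[int, int]:
    """Return the lowest-SSD position within `radius` pixels of prev_xy."""
    px, py = prev_xy
    height, width = len(image), len(image[0])
    best_xy, best_cost = prev_xy, None
    for y in range(max(0, py - radius), min(height - size, py + radius) + 1):
        for x in range(max(0, px - radius), min(width - size, px + radius) + 1):
            cost = ssd(patch_at(image, x, y, size), template)
            if best_cost is None or cost < best_cost:
                best_xy, best_cost = (x, y), cost
    return best_xy


# One-row toy image: a weak look-alike at x=1, the real target at x=6.
frame = [[0, 5, 0, 0, 0, 0, 9, 0]]
stuck = track(frame, template=[9], size=1, prev_xy=(1, 0), radius=1)   # (1, 0)
found = track(frame, template=[9], size=1, prev_xy=(1, 0), radius=5)   # (6, 0)
```

Widening the search radius fixes this toy case but costs more computation per frame, which is one reason a detector is often used to re-initialise a tracker.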

Learning more

Computer Vision Conferences: ICCV, CVPR, ECCV

Read papers accepted by conferences

Get friendly with an academic
Or hire one!

Robust Head Tracking

Track rotation and scale

The SSD based tracker did not track rotation and scale

Next iteration of tracker does:
X, Y position
Scale
θ: in-plane rotation

PS3 Demo: Hager tracker (swap demo)


Tracked more types of movement
But very fragile
Problem:

A 2D image patch is not a good model of a head


Does not deal with out-of-plane rotation


Even in-plane rotation is not right

Colour histograms

Let’s move away from comparing pixels and think about features

Consider these images of the same objects

Colour histograms

If we compared them pixel for pixel they would seem very different

But look at a histogram of the colours that appear in them and they look the same

Colour histograms

Histograms are a feature that throws away all spatial information
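A minimal sketch of this idea: two images with the same pixels in a different arrangement produce identical colour histograms. The bin count of 4 per channel and the use of a histogram intersection score to compare histograms are assumptions for illustration; the slides do not say how the histograms are binned or compared.

```python
from collections import Counter
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (r, g, b), each 0..255


def colour_histogram(pixels: List[Pixel], bins_per_channel: int = 4) -> Counter:
    """Count how many pixels fall into each quantised colour bin.
    Pixel positions are ignored entirely."""
    step = 256 // bins_per_channel
    return Counter((r // step, g // step, b // step) for r, g, b in pixels)


def histogram_intersection(h1: Counter, h2: Counter) -> float:
    """Similarity in [0, 1]: fraction of h1's pixels that h2 also accounts for."""
    total = sum(h1.values())
    overlap = sum(min(count, h2[colour]) for colour, count in h1.items())
    return overlap / total if total else 0.0


image_a = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 0, 0)]
image_b = list(reversed(image_a))       # same pixels, different layout
image_c = [(0, 0, 0)] * 4               # completely different colours

same = histogram_intersection(colour_histogram(image_a), colour_histogram(image_b))       # 1.0
different = histogram_intersection(colour_histogram(image_a), colour_histogram(image_c))  # 0.0
```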

Where we are now

Current system uses:
Colour histograms
Keeps approximate spatial information

Where we are now

It has a foreground and a background model – each with its own histograms

Where we are now

PS3 Demo

Marker based Augmented Reality (AR)

Marker based AR

Marker based AR is in a published game: EyePet

Camera setup

Topics to discuss

Camera based games
What is EyePet?
Improving the tech
Future research

What the player sees on the TV

Virtual

Real


Marker based AR

We shipped a “magic card” with the game

Allows the players to manipulate virtual objects in 3D

Finding the marker: Input image


Finding the marker: Threshold


Finding the marker: Trace outlines


Finding the marker: Test for quad shapes


Finding the marker: Actually, just keep pairs of quads


Finding the marker

Take corner positions
Calculate a 2D transform
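One common way to get a 2D transform from four corner positions is to solve for a homography (a 3x3 projective transform with the last entry fixed to 1). The sketch below does this with plain Gaussian elimination. It is an assumption that this is the kind of transform meant here, and the function names and corner coordinates are invented for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_corners(src: Sequence[Point], dst: Sequence[Point]) -> List[float]:
    """Return h[0..8] (row-major 3x3, h[8] = 1) mapping 4 src corners to 4 dst corners."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    return solve_linear(rows, rhs) + [1.0]


def apply_homography(h: Sequence[float], point: Point) -> Point:
    """Map a point through the homography, including the perspective divide."""
    x, y = point
    w = h[6] * x + h[7] * y + h[8]
    return ((h[0] * x + h[1] * y + h[2]) / w, (h[3] * x + h[4] * y + h[5]) / w)


# Marker corners in its own unit square, and where they were seen in the image.
marker = [(0, 0), (1, 0), (1, 1), (0, 1)]
seen = [(10, 20), (30, 22), (28, 45), (8, 40)]
h = homography_from_corners(marker, seen)
```

With the transform in hand, any point on the marker (for example each cell of its pattern) can be mapped into the camera image, which is what the pattern matching step needs.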



Finding the marker: Match the pattern


Finding the marker: Match the pattern (Yes!)


Finding the marker

Decompose the 2D transform:
Camera projection
Model view matrix

Use a Kalman filter
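To give a feel for what a Kalman filter does to a jittery measurement, here is a very small one-dimensional sketch. The slides do not say which quantities are filtered or which motion model is used; a single scalar with a "stays roughly where it was" model is assumed, and the variance values are arbitrary.

```python
from typing import List, Sequence


def kalman_1d(measurements: Sequence[float],
              process_var: float = 1e-2,
              measurement_var: float = 1.0) -> List[float]:
    """Smooth a noisy scalar signal with a constant-position Kalman filter."""
    if not measurements:
        return []
    estimate, variance = measurements[0], measurement_var
    smoothed = [estimate]
    for z in measurements[1:]:
        # Predict: the state is assumed unchanged, but we become less certain.
        variance += process_var
        # Update: blend prediction and measurement according to their variances.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1 - gain)
        smoothed.append(estimate)
    return smoothed


noisy = [10.0, 12.0, 8.0, 11.0, 9.0]
smooth = kalman_1d(noisy)
```

A larger `process_var` makes the filter follow the measurements more closely (less lag, more jitter); a smaller one does the opposite.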



http://en.wikipedia.org/wiki/File:Homography-transl.svg

Finding the marker: And we’re done…


Problems we faced

Picking the right threshold

Threshold to find black and white regions

But which one?
Many clever solutions didn’t work
Brute force approach
Try lots (around 60) thresholds
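The brute force approach might look something like the sketch below. How the thresholds are spaced and how results from different thresholds are combined is not stated in the talk; evenly spaced levels and "stop at the first one that works" are assumptions, and `looks_like_marker` is a hypothetical stand-in for the rest of the marker-finding pipeline.

```python
from typing import Callable, List, Optional

Image = List[List[int]]


def threshold(image: Image, level: int) -> Image:
    """Binarise: 1 where the pixel is brighter than `level`, else 0."""
    return [[1 if p > level else 0 for p in row] for row in image]


def candidate_thresholds(count: int = 60, lo: int = 0, hi: int = 255) -> List[int]:
    """`count` evenly spaced levels strictly between lo and hi."""
    step = (hi - lo) / (count + 1)
    return [int(lo + step * (i + 1)) for i in range(count)]


def first_working_threshold(image: Image,
                            looks_like_marker: Callable[[Image], bool],
                            count: int = 60) -> Optional[int]:
    """Try each threshold in turn; return the first one for which the
    (hypothetical) marker test succeeds, or None if none does."""
    for level in candidate_thresholds(count):
        if looks_like_marker(threshold(image, level)):
            return level
    return None
```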

Picking the right threshold

PS3 Demo: Thresholds
PS3 Demo: AR Thresholds

Light sensitive matching

Pattern matching used Sum of Squared Differences (SSD)

SSD = 2242 SSD = 14 SSD = 874

Brightness of image affected the score


Use Normalised Cross Correlation (NCC) instead

NCC = 0.8 NCC = 0.9 NCC = 0.9
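The difference between the two scores under a brightness change can be shown in a few lines. This sketch models "brighter" as multiplying every pixel by a constant and uses the plain (no mean subtraction) form of normalised cross correlation; a zero-mean variant is also common, and which form the game used is not stated. The pixel values are made up.

```python
import math
from typing import Sequence


def ssd(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of Squared Differences: 0 for identical patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ncc(a: Sequence[float], b: Sequence[float]) -> float:
    """Normalised cross correlation: 1.0 for patches that differ only by a
    positive brightness scale factor."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


pattern = [10, 200, 30, 220]
brighter = [2 * p for p in pattern]     # same pattern, twice as bright

ssd_score = ssd(pattern, brighter)      # large, even though the pattern is the same
ncc_score = ncc(pattern, brighter)      # approximately 1.0
```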


New way to look at images
An image is an array of numbers
We can list out every number and it becomes a vector

[Figure: a 100 x 100 grid of pixel values listed out as a single column vector of 10,000 numbers]


This is a co-ordinate vector in “image space”
Every 100x100 image corresponds to a single unique point in image space
Brightening an image corresponds to scaling the position vector


When comparing two images:
SSD corresponds to the distance between them in image space
NCC corresponds to the angle between them

[Figure: two image vectors, showing SSD as the distance between them and θ as the angle between them]
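Written out, with the two images as vectors a and b in image space (symbols chosen here for illustration) and NCC taken in its plain form without mean subtraction, the relationship is:

```latex
\[
\mathrm{SSD}(a,b) = \lVert a-b\rVert^{2}
  = \lVert a\rVert^{2} + \lVert b\rVert^{2}
    - 2\,\lVert a\rVert\,\lVert b\rVert\cos\theta ,
\qquad
\mathrm{NCC}(a,b) = \frac{a\cdot b}{\lVert a\rVert\,\lVert b\rVert} = \cos\theta .
\]
```

Scaling a by a positive brightness factor changes its length, and therefore the SSD, but leaves θ and so the NCC unchanged. Strictly, SSD is the squared distance, and NCC is the cosine of the angle rather than the angle itself.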


Linear algebra is the other pillar of computer vision

Feature extraction is just a transformation from one space to another: Image space -> Feature space

Classifiers are often just planes which divide up the space (e.g. into a region that contains faces and a region that doesn’t)
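A plane that divides feature space is described by a normal vector and an offset, and classifying a point is a single dot product. The numbers below are invented for illustration; in practice the plane would be learned from training examples.

```python
from typing import Sequence


def side_of_plane(features: Sequence[float],
                  normal: Sequence[float],
                  offset: float) -> bool:
    """Return True if the feature vector lies on the positive side of the
    plane normal . x + offset = 0 (e.g. the 'face' side)."""
    score = sum(w * f for w, f in zip(normal, features)) + offset
    return score > 0


# Hypothetical 2D feature space split by the line x = y.
normal, offset = [1.0, -1.0], 0.0
in_region = side_of_plane([3.0, 1.0], normal, offset)    # True
out_region = side_of_plane([1.0, 3.0], normal, offset)   # False
```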

Occlusion

It is easy to occlude the marker with your fingers

Occlusion

Put a big red handle on and instruct the player to hold it

Also put a handle on the back

Occlusion: Another approach (still in research phase)

Edge based tracking
Uses AR Marker to initialise
Then tracks using edge features
PS3 Demo (load EyePet)

False positives

When not occluded, we find the marker (almost) all the time

Our home videos showed this
False positives were a problem
Not represented in our videos
Added some Hollywood films to the video tests
We knew that no markers were present

False positives

Saved out all spurious frames

False positives

Made a number of tweaks to the algorithm
E.g. pattern matching the whole marker, not just the centre pattern
20 times fewer false detections

EyePet Demo

EyePet Demo

Use motion detection for normal interaction
Call
Jump
Stroke
Use AR card for health monitor
Screen-facing case
Needs stimulation
Trampoline
Finally
Give him a shower

Summary

What we do and why
The development process
Testing and videos
Computer Vision Concepts
A robust head tracker
Marker based Augmented Reality
The problems we faced
A demo of EyePet

The End (please fill out your questionnaires)
