Content-based image and video analysis
Tools and Libraries
11.07.2011
Lecture overview
Labeled data is needed in almost all of the approaches discussed
But: labeling data is tedious and expensive work
We will discuss different approaches to this problem
Software libraries for standard tasks can drastically reduce development time
We will discuss software libraries for image processing and machine learning
Labels
Diverse types of data annotations are needed:
Face recognition: face bounding box, facial landmarks, identity
Object recognition: bounding box, object class
High-level features: images depicting a concept
Genre classification: videos of a certain genre
…
The only way to get the labels is to have people label the data manually
It is a very boring task, and people need to be compensated somehow
Most common way: just pay them!
This can be quite expensive for large datasets
LabelMe
Collaborative labeling
To download the database, a certain number of images has to be annotated
Label quality is probably good, since the annotators have to use the labels themselves
Public (everyone can join)
A large database (460,000 labeled objects of all kinds)
http://labelme.csail.mit.edu/
Credit: Franziska Kraus
LabelMe
Annotation Tool
Draw polygons and name the labeled object
Annotation Tool
Quality of polygons and names is not ensured by supervision → still sufficient
A lot of images have more than 80% of their pixels labeled
Several different object categories in many images
Annotation Tool
Objects are mostly labeled completely despite partial occlusions
Object-part hierarchies
Depth ordering
Browse Database
Query Database
Query Database
Label names are up to the user → no consistency
Use WordNet (dictionary) to group categories
Object-part hierarchies
Polygons with high overlap indicate either an object-part relation (e.g. a head inside a body) or an occlusion
For a given query (e.g. “car”), check for polygons that often have a high overlap with it (e.g. “wheel”)
Compute a score as the percentage of images in which the part (“wheel”) has a high overlap with the object (“car”), as in the sketch below
→ List of object-part candidates
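To make the scoring step concrete, here is a minimal C++ sketch. It assumes the polygons are approximated by axis-aligned boxes and that “high overlap” means the intersection covers more than half of the part’s area; the Box type, the example boxes, and the 0.5 threshold are illustrative assumptions, not the values used by LabelMe, which works on the actual polygons.

// Object-part candidate scoring: fraction of images in which the part box
// overlaps the object box strongly. Boxes and threshold are illustrative.
#include <algorithm>
#include <iostream>
#include <vector>

struct Box { double x1, y1, x2, y2; };

double intersectionArea(const Box& a, const Box& b) {
    double w = std::max(0.0, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    double h = std::max(0.0, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    return w * h;
}

double area(const Box& b) { return (b.x2 - b.x1) * (b.y2 - b.y1); }

double objectPartScore(const std::vector<Box>& objectPerImage,
                       const std::vector<Box>& partPerImage,
                       double threshold = 0.5) {
    if (objectPerImage.empty()) return 0.0;
    int hits = 0;
    for (size_t i = 0; i < objectPerImage.size(); ++i) {
        double overlap = intersectionArea(objectPerImage[i], partPerImage[i])
                         / area(partPerImage[i]);
        if (overlap > threshold) ++hits;
    }
    return static_cast<double>(hits) / objectPerImage.size();
}

int main() {
    // Hypothetical “car” and “wheel” boxes from two images of the query.
    std::vector<Box> car   = {{0, 0, 100, 60}, {10, 10, 90, 70}};
    std::vector<Box> wheel = {{5, 40, 25, 60}, {70, 50, 90, 70}};
    std::cout << "object-part score: " << objectPartScore(car, wheel) << "\n";
    return 0;
}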
Occlusion / Depth-ordering
Simple heuristics:
Some things can never occlude other objects (e.g. the sky)
An object completely contained in another one is on top (may be wrong if the containing object is transparent, etc.)
The polygon with more control points in the intersecting area is on top
Use color histograms: compare the histogram of the overlapping region to those of the two non-overlapping regions; the more similar region is on top (see the sketch below)
The combined heuristic achieves a 2.9% error rate
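The color-histogram cue can be sketched as follows. This is a simplification that uses 8-bit grayscale histograms and histogram intersection as the similarity measure; the heuristic used for LabelMe may use full color histograms and a different comparison function.

// Depth-ordering cue: whichever object's non-overlapping region looks more
// like the overlap region is assumed to be on top. Grayscale pixels and
// histogram intersection are simplifying assumptions.
#include <algorithm>
#include <array>
#include <iostream>
#include <string>
#include <vector>

using Histogram = std::array<double, 256>;

Histogram normalizedHistogram(const std::vector<unsigned char>& pixels) {
    Histogram h{};
    for (unsigned char p : pixels) h[p] += 1.0;
    if (!pixels.empty())
        for (double& bin : h) bin /= pixels.size();
    return h;
}

double intersection(const Histogram& a, const Histogram& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += std::min(a[i], b[i]);
    return s;  // 1.0 = identical distributions, 0.0 = disjoint
}

// Returns "A" if object A is judged to be in front, otherwise "B".
std::string onTop(const std::vector<unsigned char>& overlapRegion,
                  const std::vector<unsigned char>& regionOnlyA,
                  const std::vector<unsigned char>& regionOnlyB) {
    Histogram o = normalizedHistogram(overlapRegion);
    double simA = intersection(o, normalizedHistogram(regionOnlyA));
    double simB = intersection(o, normalizedHistogram(regionOnlyB));
    return simA >= simB ? "A" : "B";
}

int main() {
    // Toy pixel values: the overlap region resembles object A.
    std::vector<unsigned char> overlap = {10, 12, 11}, onlyA = {10, 11, 13}, onlyB = {200, 210, 190};
    std::cout << "object " << onTop(overlap, onlyA, onlyB) << " is on top\n";
    return 0;
}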
Semi-automatic labeling
Use the available labels to train detectors
Run the detectors on unlabeled data
Let the user verify the generated labels
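A minimal sketch of this verification loop follows. The Detector interface, the image paths, the confidence threshold, and the console-based yes/no check are all illustrative assumptions; in practice the detector would be, for example, a cascade trained on the existing annotations.

// Semi-automatic labeling loop: propose confident detections on unlabeled
// images and let a human accept or reject them.
#include <iostream>
#include <string>
#include <vector>

struct Detection { int x, y, w, h; double confidence; };

struct Detector {
    // Placeholder: a real implementation would run a trained detector here.
    std::vector<Detection> detect(const std::string& /*imagePath*/) const {
        return {};
    }
};

int main() {
    Detector detector;                                              // trained on the labeled data
    std::vector<std::string> unlabeled = {"img1.jpg", "img2.jpg"};  // hypothetical paths
    for (const std::string& path : unlabeled) {
        for (const Detection& d : detector.detect(path)) {
            if (d.confidence < 0.8) continue;                       // only propose confident hits
            std::cout << path << ": box (" << d.x << "," << d.y << ","
                      << d.w << "," << d.h << ") keep? [y/n] ";
            char answer;
            std::cin >> answer;
            if (answer == 'y') {
                // an accepted detection becomes a new training label
            }
        }
    }
    return 0;
}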
Summary of LabelMe
Advantages:
Motivation to label images is given
To download the database, a user first has to label a certain number of images → data quality is probably good
Disadvantages:
Only a few people are interested in the labels; however, they are the only ones providing them
Is it possible to get people with no interest in the labels to annotate the images (without paying them)?
Human Computation
Humans can solve problems that computers cannot solve yet (simple example: CAPTCHAs)
Humans are far better than computers at labeling and tagging images
How do we get them to do that? Do people on the Internet have enough time?
Human Computation
“People all over the world spent 9 billion hours playing solitaire in 2003” (Luis von Ahn, 2006)
Constructing the Empire State Building took 7 million human-hours ≈ 6.8 hours of worldwide solitaire playing
Building the Panama Canal took 20 million human-hours ≈ less than one day of solitaire playing
Human Computation
Humans have a lot of time
Humans can easily label and tag images
BUT you would have to pay them to do that for you
Solution: create a game that encourages people to label images (or to do any other task you want them to do)
Games With A Purpose
http://www.gwap.com/
ESP-Game
Game for tagging images from the Web
Two paired users see the same picture and have to agree on a tag for what the picture contains
→ A lot of image-tag pairs, but no information about location or size of objects in the image
ESP-Game
User statistics
In the first four months after release, 13,630 people played the game
80% of them played more than once
1,271,451 labels for 293,760 images were generated
33 people played more than 1,000 games (> 50 hours)
Extrapolating: 5,000 people playing 24 hours a day could label all images in Google image search in one month!
On popular online gaming sites, many more players are online at a time (> 100,000 players)
Label quality
Search for labels:
For 10 random labels, the images carrying this label were displayed
For all returned images, the label made sense → very high precision
Manually labeled images:
For all images, at least 83% of the ESP game tags were also used by the manual annotators
For all images, the three most common tags used by the manual annotators were also among the ESP game tags
Manual quality assessment:
People would use 85% of the ESP game tags to describe the corresponding image
Only 1.7% of the tags do not make sense as a description of the image
Peekaboom
A game for locating objects in images
Improves the data collected by the ESP Game
Two random players are paired up:
Boom (Player 1) reveals parts of the image
Peek (Player 2) guesses the associated word
Roles switch on a successful guess
“Bots” play when there is an uneven number of players or when someone quits the game
Game Overview
Pings
Boom can “ping” parts of the revealed object to point out particular parts
Word – Image Relation
Boom can give hints about how the word relates to the image
Image Metadata
For each image-word pair, metadata is collected:
How the word relates to the image (from the hints)
The pixels necessary to guess the word (from the area that is revealed)
The pixels inside the specified object (from the pings)
The most salient aspects of objects in the image (from the sequence of Boom’s clicks)
Elimination of poor image-word pairs (many pairs of players click “pass”)
Cheating
People could try to cheat by logging in at the same time and telling each other which words to type
→ reveal the wrong parts and still type the right word
Multiple anti-cheating mechanisms are in place
Anti-Cheating Mechanisms
Player queue: a player has to wait n seconds until being paired up
IP address checks
Seed images: images with hand-verified metadata are mixed into the game
Limited freedom to enter guesses: the guess field could be used for communication → only letters, only words in the dictionary
Aggregating data from multiple players
Implementation
Spelling check: incorrectly spelled words are displayed in a different color
Inappropriate word replacement: substituted with words like “ILuvPeekaboom”
Top scores list and ranks:
Top scores of the day and of all time
Users get a rank based on their total number of points
Additional Applications
Improving image-search results: Peekaboom gives an estimate of the fraction of the image that is related to the word → use this fraction to order image results (see the sketch below)
Use ping data for pointing to objects
Object bounding boxes
Image search engine with highlighted results
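The re-ranking idea can be sketched in a few lines, assuming every result already carries the Peekaboom-derived fraction of the image that relates to the query word; the Result type and the example values are hypothetical.

// Order image-search results so that images in which the query word covers a
// larger fraction of the image come first.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Result {
    std::string url;
    double relatedFraction;  // fraction of the image related to the query word
};

int main() {
    std::vector<Result> results = {
        {"http://example.com/a.jpg", 0.15},   // hypothetical entries
        {"http://example.com/b.jpg", 0.72},
        {"http://example.com/c.jpg", 0.40},
    };
    std::sort(results.begin(), results.end(),
              [](const Result& a, const Result& b) {
                  return a.relatedFraction > b.relatedFraction;
              });
    for (const Result& r : results)
        std::cout << r.relatedFraction << "  " << r.url << "\n";
    return 0;
}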
Ping Accuracy
Use ping data for pointing: arrows pointing to the objects
Pings selected at random
100% accuracy was shown in an experiment
Bounding Box Accuracy
Bounding boxes created with Peekaboom
The lowest overlap with a user-created box was 50%
User Statistics
How enjoyable is the game? 14,000 different people generated 1.1 million pieces of data during the first month
User comments:
“One unfortunate side effect of playing so much in such a short time was a mild case of carpal tunnel syndrome in my right hand and forearm, but that dissipated quickly.”
“This game is like crack. I've been Peekaboom-free for 32 hours. Unlike other games, Peekaboom is cooperative.”
“[...] I would say that it gives the same gut feeling as combining gambling with charades while riding on a roller coaster. The good points are that you increase and stimulate your intelligence, you don't lose all your money and you don't fall off the ride. The bad point is that you look at your watch and eight hours have just disappeared!”
Summary of Peekaboom
Peekaboom works because people like to play games
Experiments show that results are sufficiently accurate
In addition to the ESP data, information about location and size of objects in the image is retrieved
Other “Games with a purpose”
Tag a Tune: a piece of music is played to you and your partner; you must describe the music with words; based on your partner’s description you have to decide whether you two are listening to the same song
Verbosity: describe a word to your partner using other words; the partner must guess the secret word
Squigl: trace the outline of an object in the same way as your partner; very limited time (5-10 seconds)
Matchin: decide which one of two images you like best; points when your partner agrees
Popvideo: you and your partner are shown a video clip; you have to enter tags describing the video (and audio!); points when the tags match
Publicly available labeled datasets
Some datasets are available for free in order to advance research: Caltech 101/256, AR, …
Some evaluation campaigns generate a lot of labeled data and provide it:
for everybody: PASCAL VOC challenge
for participants only: TRECVID, ImageCLEF
Caltech 101/256
Freely available http://vision.caltech.edu/Image_Datasets/Caltech101 http://vision.caltech.edu/Image_Datasets/Caltech256
Pictures of 101/256 object categories
Labels and outlines of the objects
PASCAL
Collection of object recognition databases with ground truth
VOC challenge uses it Freely available
http://pascallin.ecs.soton.ac.uk/challenges/VOC/databases.html
Software libraries
Many systems use standard algorithms:
Linear algebra
Image processing: image filters, image features, etc.
Machine learning: SVMs, PCA, LDA, etc.
Advantages of using standard libraries:
Avoid errors in your own implementation
Leverage the know-how of other people (the implementation details of even simple algorithms can be quite tricky!)
Save a lot of development time → more time to work on your own algorithm
OpenCV
Open Computer Vision library
http://sourceforge.net/projects/opencvlibrary/
http://opencv.willowgarage.com/
Features:
C-like interface (the upcoming 2.0 is more C++)
Linear algebra
Image processing functions: filters, FFT, DCT, etc.
Image features: SURF, HOG, Haar-like features (mainly in 2.0/SVN)
Detectors: Haar cascades (training & detection)
Machine learning: SVMs, neural nets, decision trees, GMMs, …
Much, much more…
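As an illustration of the detector functionality listed above, here is a minimal face-detection sketch using the OpenCV C++ interface of the 2.x era; the cascade file name and the image path are placeholders, and some constant names differ slightly in later OpenCV versions.

// Haar-cascade face detection with OpenCV's C++ API (2.x era).
#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main() {
    cv::CascadeClassifier faces;
    if (!faces.load("haarcascade_frontalface_alt.xml")) {   // cascade file shipped with OpenCV
        std::cerr << "could not load cascade\n";
        return 1;
    }
    cv::Mat image = cv::imread("input.jpg");                 // placeholder path
    if (image.empty()) return 1;

    cv::Mat gray;
    cv::cvtColor(image, gray, CV_BGR2GRAY);
    cv::equalizeHist(gray, gray);

    std::vector<cv::Rect> detections;
    faces.detectMultiScale(gray, detections, 1.1, 3, 0, cv::Size(30, 30));

    for (const cv::Rect& r : detections)                     // draw the hits
        cv::rectangle(image, r, cv::Scalar(0, 255, 0), 2);
    cv::imwrite("detections.jpg", image);
    return 0;
}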
OKAPI
Open Karlsruhe library for image processing
http://isl.ira.uka.de/msmmi/okapi/doc/
Not public at the moment: if you want to use it, come work with us
Features:
C++
Cameras (V4L, FireWire, VfW)
Videos (frame-accurate random access)
Image features (DCT, Gabor, LBP, MCT, …)
Linear projections (PCA, LDA, RCA)
Detector (using MCT features, very fast)
SVMs (libsvm, liblinear)
3D geometry functions
Simple GUI for prototypes
Many utility functions (timing, XML, in-memory image I/O, …)
SVMs
libsvm: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (simple and standard)
liblinear: http://www.csie.ntu.edu.tw/~cjlin/liblinear/ (much faster for linear SVMs)
SVMlight: http://svmlight.joachims.org/ (also very popular)
Shogun toolbox: http://www.shogun-toolbox.org/ (many kernel functions and Multiple Kernel Learning (MKL); uses libsvm or SVMlight in the background)
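A minimal training and prediction sketch with the libsvm C API (linking against libsvm and its header svm.h, API as of libsvm 3.x); the toy data and the parameter values are illustrative assumptions only.

// Train an RBF-kernel C-SVC on two toy examples and classify a query vector.
#include <cstdio>
#include "svm.h"

int main() {
    // Sparse feature vectors; each one is terminated by an entry with index -1.
    svm_node x0[] = {{1, 0.0}, {2, 0.0}, {-1, 0.0}};
    svm_node x1[] = {{1, 1.0}, {2, 1.0}, {-1, 0.0}};
    svm_node* examples[] = {x0, x1};
    double labels[] = {-1.0, +1.0};

    svm_problem problem;
    problem.l = 2;
    problem.x = examples;
    problem.y = labels;

    svm_parameter param = {};        // zero-initialize, then set what is needed
    param.svm_type = C_SVC;
    param.kernel_type = RBF;
    param.C = 1.0;
    param.gamma = 0.5;
    param.cache_size = 100;          // MB
    param.eps = 1e-3;

    if (const char* err = svm_check_parameter(&problem, &param)) {
        std::fprintf(stderr, "bad parameters: %s\n", err);
        return 1;
    }
    svm_model* model = svm_train(&problem, &param);

    svm_node query[] = {{1, 0.9}, {2, 0.8}, {-1, 0.0}};
    std::printf("predicted label: %f\n", svm_predict(model, query));

    svm_free_and_destroy_model(&model);
    return 0;
}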
Machine Learning
Weka (Java): http://www.cs.waikato.ac.nz/ml/weka/
Java-ML (Java): http://java-ml.sourceforge.net/
Torch (C/Lua): http://torch5.sourceforge.net/
MLC++ (C++): http://www.sgi.com/tech/mlc/
MLPACK / FASTlib: http://mloss.org/software/view/152/ and http://fastlib.analytics1305.com/
Spider (Matlab): http://www.kyb.tuebingen.mpg.de/bs/people/spider/
FLANN (approximate k-nearest neighbors): http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
List of open source machine learning software: http://mloss.org/
References
Russell, Torralba, Murphy: "LabelMe: A Database and Web-Based Tool for Image Annotation", International Journal of Computer Vision, vol. 77, issue 1, May 2008
von Ahn, Dabbish: "Labeling Images with a Computer Game", Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2004
von Ahn, Liu, Blum: "Peekaboom: A Game for Locating Objects in Images", Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2006
Lecture Overview
Introduction
Visual Descriptors
Image Segmentation
Classification
Shot Boundary Detection & Genre Classification
High-level Feature Detection
High-level Feature Detection II
Person Identification
Copy Detection
Semantic Search
Tools and Libraries
Topic groups (diagram): Computer Vision & Machine Learning Overview; Indexing & Matching; High-level Topics; Introduction to Video Processing
Development of superpixel-based features for object recognition (SA/Bachelor, DA/Master)
Superpixels compute an oversegmentation of an image
An object is segmented into multiple superpixels that can be used as atomic building blocks to describe the object
Develop a descriptor based on single superpixels and apply it in an object recognition setting
Evaluate the object recognition performance and compare it to state-of-the-art algorithms
Contact: Alexander <[email protected]>
Foreground-background segmentation with superpixels (SA/Bachelor)
Superpixels significantly reduce the number of image elements from hundreds of thousands of pixels to a couple of hundred superpixels
Apply existing foreground-background segmentation algorithms to superpixel segmentations of video streams
Compare and evaluate superpixel-based segmentation against pixel-based approaches with respect to quality and speed-up
Contact: Alexander <[email protected]>
Part-based person detection using the Modified Census Transform (MCT)
Bachelor Thesis: extend an existing holistic MCT person detector to a part-based detector.
Tasks:
Determine suitable person parts
Train part detectors
Fuse part detections
Evaluate detection performance
Requirements:
C++ programming experience
Ability to work independently
Part-based person detections.
Contact: Martin Bäuml <[email protected]>, Arne Schumann <[email protected]>
Hiwi: Face Tracking Dataset
There is no standard benchmark for face tracking; we are working on one
Tasks:
Search for suitable videos (TV series, YouTube, news, …)
Label the data
Contact: Martin Bäuml <[email protected]>
Building 50.20, Room 228
Computer Vision for the Blind and Human-Robot Interaction (MA/BA Theses and HiWi Jobs)
Help blind people and robots to visually explore and investigate their environment
Various topics available, for example:
Who is looking at me, at whom, or at what?
Describe objects, e.g. their color, or read text, e.g. street and door signs
Allow following spoken path descriptions, e.g. “go to the crossroads, turn left, and after the church go right”
How to present the information to the blind?
Integrated in SFB 588 “Humanoide Roboter”; collaboration with Fraunhofer IOSB possible
Contact: B. Schauerte <[email protected]>
Situation Recognition in the SmartControlRoom
Task:
Situation recognition in image sequences, using a SmartControlRoom as an example
Modeling of situations and further development of novel logic-based methods
Fusion of tracking, head-rotation estimation, gesture recognition, speech recognition, and room description
Example situation labels (from a figure): “Thomas and Alex are editing the map together”, “S1 and S2 are editing the map”, “M and S3 are listening to EL”, “S4 is doing individual work”
Contact:
Student projects at Fraunhofer IOSB: Analysis of Work Processes in the Command Staff Room
Task:
Analyze the work processes in the command staff room based on multiple camera streams with an audio track and message traffic
Further develop a corresponding analysis tool
Provide the basis for automatic situation recognition by means of computer vision
Contact: [email protected]
Example data: (figure)
Hiwi: Web Admin
Tasks:
Taking care of our website
Uploading lecture slides!
Requirements:
Good knowledge of Joomla
Reliable, timely work
Contact: Martin Bäuml <[email protected]>
Building 50.20, Room 228
Studienarbeit
• Previous system (open-set face recognition for a visitor interface)
– Challenges:
• Mismatch in pose and illumination
• Unbalanced unknown and known samples for training
• Personalize the TV program for different family members according to their identity
– New challenges:
• Multi-resolution due to various distances
• Goal:
– Improve the previous system
– Evaluate the performance at different person-camera distances
– Make the system scale (distance) invariant
Contact: Hua <[email protected]>
Open Hiwi position
Interested in event recognition? We are looking for a student to teach ARMAR how to recognize events within rooms (e.g. cooking, cleaning)
Motivated by Laptev et al., “Learning realistic human actions from movies”
Options for SA/DA!
Required skills:
Interest in computer vision
C++ programming experience under Linux
Email [email protected] for more details
Pipeline (figure): STIP detection → HOG/HOF descriptors → bag-of-features → classification
Studienarbeit/Hiwi: Multiview ISM for Localization and Segmentation of Humans
Task:
• Extend the probabilistic formulation of the standard Implicit Shape Model to 3D/multiview
• Annotate existing smart room data for training the model
• Implement and evaluate your approach
Requirements:
• Knowledge of computer vision and machine learning
• Programming experience in C++ and Python
Contact: Martin Bäuml <[email protected]>
Studienarbeit/Hiwi: Clothing-based Person Recognition on ISM-Segmented Data
Task:
• Use ISM to localize and segment people entering through the door
• Identify persons based on their clothing using the automatically segmented data
• Compare your approach against a discriminative body-detector-based approach
Requirements:
• Knowledge of computer vision and machine learning
• Programming experience in C++ and Python
Contact: Martin Bäuml <[email protected]>
Person Detection in a Bird's-Eye View
• Person detection works well in most perspectives, but is not well explored for bird's-eye views
– Difficulties arise from different appearance distortions, depending on the person's position
• Nevertheless, bird's-eye views offer a great deal of information we could and should use!
• Goal: build a person detector that also responds to different body orientations seen from above
Contact: Florian van de Camp, [email protected], phone: 6091-449
Prof. Dr.-Ing. Rainer Stiefelhagen, Institut für Anthropomatik, Forschungsbereich Maschinensehen für Mensch-Maschine Interaktion, Fakultät für Informatik, Universität Karlsruhe (TH)
Head Pose in Active Camera Views
• Head pose allows deducing a coarse gaze approximation
– Passive cameras that are set up far away only deliver low-resolution captures
– A lot of work tries to cope with this by using multiple cameras, and hence different views, for a more stable estimate
• But: active cameras can zoom in on a person in order to deliver a high-resolution capture
• Goal: build a system that uses an active camera to zoom in on a person and outputs a detailed head orientation estimate
Contact: Michael Voit, [email protected], phone: 6091-449
Prof. Dr.-Ing. Rainer Stiefelhagen, Institut für Anthropomatik, Forschungsbereich Maschinensehen für Mensch-Maschine Interaktion, Fakultät für Informatik, Universität Karlsruhe (TH)
3D Voxel Coloring (Studienarbeit)
Current situation:
• 3D reconstruction of multiple people in a SmartControlRoom using voxel carving
• Multiple camera views of the same scene from different angles are available
• Voxel coloring is challenging due to noise in 3D data
Goal:
• Develop a voxel coloring algorithm that computes the correct color for each voxel, considering all camera images
Contact: Alexander Schick, 0721-6091-348, [email protected]
Prof. Dr.-Ing. Rainer Stiefelhagen, Institut für Anthropomatik, Forschungsbereich Maschinensehen für Mensch-Maschine Interaktion, Fakultät für Informatik, Universität Karlsruhe (TH)
Visualization for Human-Machine Interaction (Hiwi)
Current situation:
• Multiple computer vision components extract information about people working in our SmartControlRoom:
– Tracking and identification
– Body model and gestures
– Head pose and focus of attention
• Each component has its own visualization
Goal:
• Build a framework that uses all provided information to create one integrated visual representation of the whole scene
Requirements:
• Experience with visualization using, for example, OpenGL or VTK (we are open to suggestions)
• Very good C++ or Python skills under Linux
Contact: Alexander Schick, 0721-6091-348, [email protected]
Prof. Dr.-Ing. Rainer Stiefelhagen, Institut für Anthropomatik, Forschungsbereich Maschinensehen für Mensch-Maschine Interaktion, Fakultät für Informatik, Universität Karlsruhe (TH)
MA/DA/Hiwi: Person Retrieval in a camera network
Person retrieval using faces only works quite well in well-defined cases. For more general settings, the incorporation of more features and global knowledge is required.
Tasks:
Extend the existing face retrieval to a camera network (~10%)
Incorporate non-biometric features into the retrieval (~40%)
Use global model knowledge to improve the retrieval (~30%)
Evaluate your approach on an appropriate dataset (~20%)
Requirements:
Knowledge of computer vision and machine learning
Very good programming experience in C++ & Python
Experience with distributed systems is a plus
Contact: Martin Bäuml <[email protected]>
Building 50.20, Room 228