Content-based image and video analysis
Tools and Libraries
11.07.2011
Lecture overview
Labeled data is needed in almost all of the approaches discussed
But: labeling data is tedious and expensive work
We will discuss different approaches to this problem
Software libraries for standard tasks can drastically reduce development time
We will discuss software libraries for image processing and machine learning
Labels
Diverse types of data annotations are needed:
Face recognition: face bounding box, facial landmarks, identity
Object recognition: bounding box, object class
High-level features: images depicting a concept
Genre classification: videos of a certain genre
…
The only way to get the labels is to have people label the data manually
It is a very boring task, and people need to be compensated somehow
Most common way: just pay them!
This can be quite expensive for large datasets
LabelMe
Collaborative labeling
To download the database, a certain number of images has to be annotated
Label quality is probably good, since the annotators have to use the labels themselves
Public (everyone can join)
A large database (460,000 labeled objects of all kinds)
http://labelme.csail.mit.edu/
Credit: Franziska Kraus
LabelMe
Annotation Tool
Draw polygons and name the labeled object
Annotation Tool
Quality of polygons and names is not ensured by supervision → still sufficient
A lot of images have more than 80% of their pixels labeled
Several different object categories in many images
Annotation Tool
Objects are mostly labeled completely despite partial occlusions
Object-part hierarchies
Depth ordering
Browse Database
Query Database
Query Database
Label names are up to the user → no consistency
Use WordNet (dictionary) to group categories
Object-part hierarchies
Polygons with high overlap indicate either an object-part relation (e.g. a head inside a body) or an occlusion
For a given query (e.g. “car”), check for polygons that often have a high overlap with it (e.g. “wheel”)
Compute a score as the percentage of images in which the part (“wheel”) has a high overlap with the object (“car”), as in the sketch below
→ List of object-part candidates
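To make the scoring step concrete, here is a minimal C++ sketch. It assumes the polygons are approximated by axis-aligned boxes and that “high overlap” means the intersection covers more than half of the part’s area; the Box type, the example boxes, and the 0.5 threshold are illustrative assumptions, not the values used by LabelMe, which works on the actual polygons.

// Object-part candidate scoring: fraction of images in which the part box
// overlaps the object box strongly. Boxes and threshold are illustrative.
#include <algorithm>
#include <iostream>
#include <vector>

struct Box { double x1, y1, x2, y2; };

double intersectionArea(const Box& a, const Box& b) {
    double w = std::max(0.0, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    double h = std::max(0.0, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    return w * h;
}

double area(const Box& b) { return (b.x2 - b.x1) * (b.y2 - b.y1); }

double objectPartScore(const std::vector<Box>& objectPerImage,
                       const std::vector<Box>& partPerImage,
                       double threshold = 0.5) {
    if (objectPerImage.empty()) return 0.0;
    int hits = 0;
    for (size_t i = 0; i < objectPerImage.size(); ++i) {
        double overlap = intersectionArea(objectPerImage[i], partPerImage[i])
                         / area(partPerImage[i]);
        if (overlap > threshold) ++hits;
    }
    return static_cast<double>(hits) / objectPerImage.size();
}

int main() {
    // Hypothetical “car” and “wheel” boxes from two images of the query.
    std::vector<Box> car   = {{0, 0, 100, 60}, {10, 10, 90, 70}};
    std::vector<Box> wheel = {{5, 40, 25, 60}, {70, 50, 90, 70}};
    std::cout << "object-part score: " << objectPartScore(car, wheel) << "\n";
    return 0;
}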
Occlusion / Depth-ordering
Simple heuristics:
Some things can never occlude other objects (e.g. the sky)
An object completely contained in another one is on top (may be wrong if the containing object is transparent, etc.)
The polygon with more control points in the intersecting area is on top
Use color histograms: compare the histogram of the overlapping region to those of the two non-overlapping regions; the more similar region is on top (see the sketch below)
The combined heuristic achieves a 2.9% error rate
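The color-histogram cue can be sketched as follows. This is a simplification that uses 8-bit grayscale histograms and histogram intersection as the similarity measure; the heuristic used for LabelMe may use full color histograms and a different comparison function.

// Depth-ordering cue: whichever object's non-overlapping region looks more
// like the overlap region is assumed to be on top. Grayscale pixels and
// histogram intersection are simplifying assumptions.
#include <algorithm>
#include <array>
#include <iostream>
#include <string>
#include <vector>

using Histogram = std::array<double, 256>;

Histogram normalizedHistogram(const std::vector<unsigned char>& pixels) {
    Histogram h{};
    for (unsigned char p : pixels) h[p] += 1.0;
    if (!pixels.empty())
        for (double& bin : h) bin /= pixels.size();
    return h;
}

double intersection(const Histogram& a, const Histogram& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += std::min(a[i], b[i]);
    return s;  // 1.0 = identical distributions, 0.0 = disjoint
}

// Returns "A" if object A is judged to be in front, otherwise "B".
std::string onTop(const std::vector<unsigned char>& overlapRegion,
                  const std::vector<unsigned char>& regionOnlyA,
                  const std::vector<unsigned char>& regionOnlyB) {
    Histogram o = normalizedHistogram(overlapRegion);
    double simA = intersection(o, normalizedHistogram(regionOnlyA));
    double simB = intersection(o, normalizedHistogram(regionOnlyB));
    return simA >= simB ? "A" : "B";
}

int main() {
    // Toy pixel values: the overlap region resembles object A.
    std::vector<unsigned char> overlap = {10, 12, 11}, onlyA = {10, 11, 13}, onlyB = {200, 210, 190};
    std::cout << "object " << onTop(overlap, onlyA, onlyB) << " is on top\n";
    return 0;
}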
Semi-automatic labeling
Use the available labels to train detectors
Run the detectors on unlabeled data
Let the user verify the generated labels
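A minimal sketch of this verification loop follows. The Detector interface, the image paths, the confidence threshold, and the console-based yes/no check are all illustrative assumptions; in practice the detector would be, for example, a cascade trained on the existing annotations.

// Semi-automatic labeling loop: propose confident detections on unlabeled
// images and let a human accept or reject them.
#include <iostream>
#include <string>
#include <vector>

struct Detection { int x, y, w, h; double confidence; };

struct Detector {
    // Placeholder: a real implementation would run a trained detector here.
    std::vector<Detection> detect(const std::string& /*imagePath*/) const {
        return {};
    }
};

int main() {
    Detector detector;                                              // trained on the labeled data
    std::vector<std::string> unlabeled = {"img1.jpg", "img2.jpg"};  // hypothetical paths
    for (const std::string& path : unlabeled) {
        for (const Detection& d : detector.detect(path)) {
            if (d.confidence < 0.8) continue;                       // only propose confident hits
            std::cout << path << ": box (" << d.x << "," << d.y << ","
                      << d.w << "," << d.h << ") keep? [y/n] ";
            char answer;
            std::cin >> answer;
            if (answer == 'y') {
                // an accepted detection becomes a new training label
            }
        }
    }
    return 0;
}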
Summary of LabelMe
Advantages:
Motivation to label images is given
To download the database, a user first has to label a certain number of images → data quality is probably good
Disadvantages:
Only a few people are interested in the labels; however, they are the only ones providing them
Is it possible to get people with no interest in the labels to annotate the images (without paying them)?
Human Computation
Humans can solve problems that computers cannot solve yet (simple example: CAPTCHAs)
Humans are far better than computers at labeling and tagging images
How do we get them to do that? Do people on the Internet have enough time?
Human Computation
“People all over the world spent 9 billion hours playing solitaire in 2003” (Luis von Ahn, 2006)
Constructing the Empire State Building took 7 million human-hours ≈ 6.8 hours of worldwide solitaire playing
Building the Panama Canal took 20 million human-hours ≈ less than one day of solitaire playing
Human Computation
Humans have a lot of time
Humans can easily label and tag images
BUT you would have to pay them to do that for you
Solution: create a game that encourages people to label images (or to do any other task you want them to do)
Games With A Purpose
http://www.gwap.com/
ESP-Game
Game for tagging images from the Web
Two paired users see the same picture and have to agree on a tag for what the picture contains
→ A lot of image-tag pairs, but no information about location or size of objects in the image
ESP-Game
User statistics
In the first four months after release, 13,630 people played the game
80% of them played more than once
1,271,451 labels for 293,760 images were generated
33 people played more than 1,000 games (> 50 hours)
Extrapolating: 5,000 people playing 24 hours a day could label all images in Google image search in one month!
On popular online gaming sites, many more players are online at a time (> 100,000 players)
Label quality
Search for labels:
For 10 random labels, the images carrying this label were displayed
For all returned images, the label made sense → very high precision
Manually labeled images:
For all images, at least 83% of the ESP game tags were also used by the manual annotators
For all images, the three most common tags used by the manual annotators were also among the ESP game tags
Manual quality assessment:
People would use 85% of the ESP game tags to describe the corresponding image
Only 1.7% of the tags do not make sense as a description of the image
Peekaboom
A game for locating objects in images
Improves the data collected by the ESP Game
Two random players are paired up:
Boom (Player 1) reveals parts of the image
Peek (Player 2) guesses the associated word
Roles switch on a successful guess
“Bots” play when there is an uneven number of players or when someone quits the game
Game Overview
Pings
Boom can “ping” parts of the revealed object to point out particular parts
Word – Image Relation
Boom can give hints about how the word relates to the image
Image Metadata
For each image-word pair, metadata is collected:
How the word relates to the image (from the hints)
The pixels necessary to guess the word (from the area that is revealed)
The pixels inside the specified object (from the pings)
The most salient aspects of objects in the image (from the sequence of Boom’s clicks)
Elimination of poor image-word pairs (many pairs of players click “pass”)
Cheating
People could try to cheat by logging in at the same time and telling each other which words to type
→ reveal the wrong parts and still type the right word
Multiple anti-cheating mechanisms are in place
Anti-Cheating Mechanisms
Player queue: a player has to wait n seconds until being paired up
IP address checks
Seed images: images with hand-verified metadata are mixed into the game
Limited freedom to enter guesses: the guess field could be used for communication → only letters, only words in the dictionary
Aggregating data from multiple players
Implementation
Spelling check: incorrectly spelled words are displayed in a different color
Inappropriate word replacement: substituted with words like “ILuvPeekaboom”
Top scores list and ranks:
Top scores of the day and of all time
Users get a rank based on their total number of points
Additional Applications
Improving image-search results: Peekaboom gives an estimate of the fraction of the image that is related to the word → use this fraction to order image results (see the sketch below)
Use ping data for pointing to objects
Object bounding boxes
Image search engine with highlighted results
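The re-ranking idea can be sketched in a few lines, assuming every result already carries the Peekaboom-derived fraction of the image that relates to the query word; the Result type and the example values are hypothetical.

// Order image-search results so that images in which the query word covers a
// larger fraction of the image come first.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Result {
    std::string url;
    double relatedFraction;  // fraction of the image related to the query word
};

int main() {
    std::vector<Result> results = {
        {"http://example.com/a.jpg", 0.15},   // hypothetical entries
        {"http://example.com/b.jpg", 0.72},
        {"http://example.com/c.jpg", 0.40},
    };
    std::sort(results.begin(), results.end(),
              [](const Result& a, const Result& b) {
                  return a.relatedFraction > b.relatedFraction;
              });
    for (const Result& r : results)
        std::cout << r.relatedFraction << "  " << r.url << "\n";
    return 0;
}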
Ping Accuracy
Use ping data for pointing: arrows pointing to the objects
Pings selected at random
100% accuracy was shown in an experiment
Bounding Box Accuracy
Bounding boxes created with Peekaboom
The lowest overlap with a user-created box was 50%
User Statistics
How enjoyable is the game? 14,000 different people generated 1.1 million pieces of data during the first month
User comments:
“One unfortunate side effect of playing so much in such a short time was a mild case of carpal tunnel syndrome in my right hand and forearm, but that dissipated quickly.”
“This game is like crack. I've been Peekaboom-free for 32 hours. Unlike other games, Peekaboom is cooperative.”
“[...] I would say that it gives the same gut feeling as combining gambling with charades while riding on a roller coaster. The good points are that you increase and stimulate your intelligence, you don't lose all your money and you don't fall off the ride. The bad point is that you look at your watch and eight hours have just disappeared!”
Summary of Peekaboom
Peekaboom works because people like to play games
Experiments show that results are sufficiently accurate
In addition to the ESP data, information about location and size of objects in the image is retrieved
Other “Games with a purpose”
Tag a Tune: a piece of music is played to you and your partner; you must describe the music with words; based on your partner’s description you have to decide whether you two are listening to the same song
Verbosity: describe a word to your partner using other words; the partner must guess the secret word
Squigl: trace the outline of an object in the same way as your partner; very limited time (5-10 seconds)
Matchin: decide which one of two images you like best; points when your partner agrees
Popvideo: you and your partner are shown a video clip; you have to enter tags describing the video (and audio!); points when the tags match
Publicly available labeled datasets
Some datasets are available for free in order to advance research: Caltech 101/256, AR, …
Some evaluation campaigns generate a lot of labeled data and provide it:
for everybody: PASCAL VOC challenge
for participants only: TRECVID, ImageCLEF
Caltech 101/256
Freely available http://vision.caltech.edu/Image_Datasets/Caltech101 http://vision.caltech.edu/Image_Datasets/Caltech256
Pictures of 101/256 object categories
Labels and outlines of the objects
PASCAL
Collection of object recognition databases with ground truth
VOC challenge uses it Freely available
http://pascallin.ecs.soton.ac.uk/challenges/VOC/databases.html
Software libraries
Many systems use standard algorithms:
Linear algebra
Image processing: image filters, image features, etc.
Machine learning: SVMs, PCA, LDA, etc.
Advantages of using standard libraries:
Avoid errors in your own implementation
Leverage the know-how of other people (the implementation details of even simple algorithms can be quite tricky!)
Save a lot of development time → more time to work on your own algorithm
OpenCV
Open Computer Vision library
http://sourceforge.net/projects/opencvlibrary/
http://opencv.willowgarage.com/
Features:
C-like interface (the upcoming 2.0 is more C++)
Linear algebra
Image processing functions: filters, FFT, DCT, etc.
Image features: SURF, HOG, Haar-like features (mainly in 2.0/SVN)
Detectors: Haar cascades (training & detection)
Machine learning: SVMs, neural nets, decision trees, GMMs, …
Much, much more…
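As an illustration of the detector functionality listed above, here is a minimal face-detection sketch using the OpenCV C++ interface of the 2.x era; the cascade file name and the image path are placeholders, and some constant names differ slightly in later OpenCV versions.

// Haar-cascade face detection with OpenCV's C++ API (2.x era).
#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main() {
    cv::CascadeClassifier faces;
    if (!faces.load("haarcascade_frontalface_alt.xml")) {   // cascade file shipped with OpenCV
        std::cerr << "could not load cascade\n";
        return 1;
    }
    cv::Mat image = cv::imread("input.jpg");                 // placeholder path
    if (image.empty()) return 1;

    cv::Mat gray;
    cv::cvtColor(image, gray, CV_BGR2GRAY);
    cv::equalizeHist(gray, gray);

    std::vector<cv::Rect> detections;
    faces.detectMultiScale(gray, detections, 1.1, 3, 0, cv::Size(30, 30));

    for (const cv::Rect& r : detections)                     // draw the hits
        cv::rectangle(image, r, cv::Scalar(0, 255, 0), 2);
    cv::imwrite("detections.jpg", image);
    return 0;
}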
OKAPI
Open Karlsruhe library for image processing
http://isl.ira.uka.de/msmmi/okapi/doc/
Not public at the moment: if you want to use it, come work with us
Features:
C++
Cameras (V4L, FireWire, VfW)
Videos (frame-accurate random access)
Image features (DCT, Gabor, LBP, MCT, …)
Linear projections (PCA, LDA, RCA)
Detector (using MCT features, very fast)
SVMs (libsvm, liblinear)
3D geometry functions
Simple GUI for prototypes
Many utility functions (timing, XML, in-memory image I/O, …)
SVMs
libsvm: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (simple and standard)
liblinear: http://www.csie.ntu.edu.tw/~cjlin/liblinear/ (much faster for linear SVMs)
SVMlight: http://svmlight.joachims.org/ (also very popular)
Shogun toolbox: http://www.shogun-toolbox.org/ (many kernel functions and Multiple Kernel Learning (MKL); uses libsvm or SVMlight in the background)
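A minimal training and prediction sketch with the libsvm C API (linking against libsvm and its header svm.h, API as of libsvm 3.x); the toy data and the parameter values are illustrative assumptions only.

// Train an RBF-kernel C-SVC on two toy examples and classify a query vector.
#include <cstdio>
#include "svm.h"

int main() {
    // Sparse feature vectors; each one is terminated by an entry with index -1.
    svm_node x0[] = {{1, 0.0}, {2, 0.0}, {-1, 0.0}};
    svm_node x1[] = {{1, 1.0}, {2, 1.0}, {-1, 0.0}};
    svm_node* examples[] = {x0, x1};
    double labels[] = {-1.0, +1.0};

    svm_problem problem;
    problem.l = 2;
    problem.x = examples;
    problem.y = labels;

    svm_parameter param = {};        // zero-initialize, then set what is needed
    param.svm_type = C_SVC;
    param.kernel_type = RBF;
    param.C = 1.0;
    param.gamma = 0.5;
    param.cache_size = 100;          // MB
    param.eps = 1e-3;

    if (const char* err = svm_check_parameter(&problem, &param)) {
        std::fprintf(stderr, "bad parameters: %s\n", err);
        return 1;
    }
    svm_model* model = svm_train(&problem, &param);

    svm_node query[] = {{1, 0.9}, {2, 0.8}, {-1, 0.0}};
    std::printf("predicted label: %f\n", svm_predict(model, query));

    svm_free_and_destroy_model(&model);
    return 0;
}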
Machine Learning
Weka (Java): http://www.cs.waikato.ac.nz/ml/weka/
Java-ML (Java): http://java-ml.sourceforge.net/
Torch (C/Lua): http://torch5.sourceforge.net/
MLC++ (C++): http://www.sgi.com/tech/mlc/
MLPACK / FASTlib: http://mloss.org/software/view/152/ and http://fastlib.analytics1305.com/
Spider (Matlab): http://www.kyb.tuebingen.mpg.de/bs/people/spider/
FLANN (approximate k-nearest neighbors): http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
List of open source machine learning software: http://mloss.org/
References
Russell, Torralba, Murphy: "LabelMe: A Database and Web-Based Tool for Image Annotation", International Journal of Computer Vision, vol. 77, issue 1, May 2008
von Ahn, Dabbish: "Labeling Images with a Computer Game", Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2004
von Ahn, Liu, Blum: "Peekaboom: A Game for Locating Objects in Images", Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2006
Lecture Overview
Introduction
Visual Descriptors
Image Segmentation
Classification
Shot Boundary Detection & Genre Classification
High-level Feature Detection
High-level Feature Detection II
Person Identification
Copy Detection
Semantic Search
Tools and Libraries
Topic groups (diagram): Computer Vision & Machine Learning Overview; Indexing & Matching; High-level Topics; Introduction to Video Processing
Development of superpixel-based features for object recognition (SA/Bachelor, DA/Master)
Superpixels compute an oversegmentation of an image
An object is segmented into multiple superpixels that can be used as atomic building blocks to describe the object
Develop a descriptor based on single superpixels and apply it in an object recognition setting
Evaluate the object recognition performance and compare it to state-of-the-art algorithms
Contact: Alexander <[email protected]>
Foreground-background segmentation with superpixels (SA/Bachelor)
Superpixels significantly reduce the number of image elements from hundreds of thousands of pixels to a couple of hundred superpixels
Apply existing foreground-background segmentation algorithms to superpixel segmentations of video streams
Compare and evaluate superpixel-based segmentation against pixel-based approaches with respect to quality and speed-up
Contact: Alexander <[email protected]>
Part-based person detection using the Modified Census Transform (MCT)
Bachelor Thesis: extend an existing holistic MCT person detector to a part-based detector.
Tasks:
Determine suitable person parts
Train part detectors
Fuse part detections
Evaluate detection performance
Requirements:
C++ programming experience
Ability to work independently
Part-based person detections.
Contact: Martin Bäuml <[email protected]>, Arne Schumann <[email protected]>
Hiwi: Face Tracking Dataset
There is no standard benchmark for face tracking; we are working on one
Tasks:
Search for suitable videos (TV series, YouTube, news, …)
Label the data
Contact: Martin Bäuml <[email protected]>
Building 50.20, Room 228
Computer Vision for the Blind and Human-Robot Interaction (MA/BA Theses and HiWi Jobs)
Help blind people and robots to visually explore and investigate their environment
Various topics available, for example:
Who is looking at me, at whom, or at what?
Describe objects, e.g. their color, or read text, e.g. street and door signs
Allow following spoken path descriptions, e.g. “go to the crossroads, turn left, and after the church go right”
How to present the information to the blind?
Integrated in SFB 588 “Humanoide Roboter”; collaboration with Fraunhofer IOSB possible
Contact: B. Schauerte <[email protected]>
Situation Recognition in the SmartControlRoom
Task:
Situation recognition in image sequences, using a SmartControlRoom as an example
Modeling of situations and further development of novel logic-based methods
Fusion of tracking, head-rotation estimation, gesture recognition, speech recognition, and room description
Example situation labels (from a figure): “Thomas and Alex are editing the map together”, “S1 and S2 are editing the map”, “M and S3 are listening to EL”, “S4 is doing individual work”
Contact:
Student projects at Fraunhofer IOSB: Analysis of Work Processes in the Command Staff Room
Task:
Analyze the work processes in the command staff room based on multiple camera streams with an audio track and message traffic
Further develop a corresponding analysis tool
Provide the basis for automatic situation recognition by means of computer vision
Contact: [email protected]
Example data: (figure)
Hiwi: Web Admin
Tasks:
Taking care of our website
Uploading lecture slides!
Requirements:
Good knowledge of Joomla
Reliable, timely work
Contact: Martin Bäuml <[email protected]>
Building 50.20, Room 228
Studienarbeit
• Previous system (open-set face recognition for a visitor interface)
– Challenges:
• Mismatch in pose and illumination
• Unbalanced unknown and known samples for training
• Personalize the TV program for different family members according to their identity
– New challenges:
• Multi-resolution due to various distances
• Goal:
– Improve the previous system
– Evaluate the performance at different person-camera distances
– Make the system scale (distance) invariant
Contact: Hua <[email protected]>
Open Hiwi position
Interested in event recognition? We are looking for a student to teach ARMAR how to recognize events within rooms (e.g. cooking, cleaning)
Motivated by Laptev et al., “Learning realistic human actions from movies”
Options for SA/DA!
Required skills:
Interest in computer vision
C++ programming experience under Linux
Email [email protected] for more details
Pipeline (figure): STIP detection → HOG/HOF descriptors → bag-of-features → classification
Studienarbeit/Hiwi: Multiview ISM for Localization and Segmentation of Humans
Task:
• Extend the probabilistic formulation of the standard Implicit Shape Model to 3D/multiview
• Annotate existing smart room data for training the model
• Implement and evaluate your approach
Requirements:
• Knowledge of computer vision and machine learning
• Programming experience in C++ and Python
Contact: Martin Bäuml <[email protected]>
Studienarbeit/Hiwi: Clothing-based Person Recognition on ISM-Segmented Data
Task:
• Use ISM to localize and segment people entering through the door
• Identify persons based on their clothing using the automatically segmented data
• Compare your approach against a discriminative body-detector-based approach
Requirements:
• Knowledge of computer vision and machine learning
• Programming experience in C++ and Python
Contact: Martin Bäuml <[email protected]>
Person Detection in a Bird's-Eye View
• Person detection works well in most perspectives, but is not well explored for bird's-eye views
– Difficulties arise from different appearance distortions, depending on the person's position
• Nevertheless, bird's-eye views offer a great deal of information we could and should use!
• Goal: build a person detector that also responds to different body orientations seen from above
Contact: Florian van de Camp, [email protected], phone: 6091-449
Prof. Dr.-Ing. Rainer Stiefelhagen, Institut für Anthropomatik, Forschungsbereich Maschinensehen für Mensch-Maschine Interaktion, Fakultät für Informatik, Universität Karlsruhe (TH)
Head Pose in Active Camera Views
• Head pose allows deducing a coarse gaze approximation
– Passive cameras that are set up far away only deliver low-resolution captures
– A lot of work tries to cope with this by using multiple cameras, and hence different views, for a more stable estimate
• But: active cameras can zoom in on a person in order to deliver a high-resolution capture
• Goal: build a system that uses an active camera to zoom in on a person and outputs a detailed head orientation estimate
Contact: Michael Voit, [email protected], phone: 6091-449
Prof. Dr.-Ing. Rainer Stiefelhagen, Institut für Anthropomatik, Forschungsbereich Maschinensehen für Mensch-Maschine Interaktion, Fakultät für Informatik, Universität Karlsruhe (TH)
3D Voxel Coloring (Studienarbeit)
Current situation:
• 3D reconstruction of multiple people in a SmartControlRoom using voxel carving
• Multiple camera views of the same scene from different angles are available
• Voxel coloring is challenging due to noise in 3D data
Goal:
• Develop a voxel coloring algorithm that computes the correct color for each voxel, considering all camera images
Contact: Alexander Schick, 0721-6091-348, [email protected]
Prof. Dr.-Ing. Rainer Stiefelhagen, Institut für Anthropomatik, Forschungsbereich Maschinensehen für Mensch-Maschine Interaktion, Fakultät für Informatik, Universität Karlsruhe (TH)
Visualization for Human-Machine Interaction (Hiwi)
Current situation:
• Multiple computer vision components extract information about people working in our SmartControlRoom:
– Tracking and identification
– Body model and gestures
– Head pose and focus of attention
• Each component has its own visualization
Goal:
• Build a framework that uses all provided information to create one integrated visual representation of the whole scene
Requirements:
• Experience with visualization using, for example, OpenGL or VTK (we are open to suggestions)
• Very good C++ or Python skills under Linux
Contact: Alexander Schick, 0721-6091-348, [email protected]
Prof. Dr.-Ing. Rainer Stiefelhagen, Institut für Anthropomatik, Forschungsbereich Maschinensehen für Mensch-Maschine Interaktion, Fakultät für Informatik, Universität Karlsruhe (TH)
MA/DA/Hiwi: Person Retrieval in a camera network
Person retrieval using faces only works quite well in well-defined cases. For more general settings, the incorporation of more features and global knowledge is required.
Tasks:
Extend the existing face retrieval to a camera network (~10%)
Incorporate non-biometric features into the retrieval (~40%)
Use global model knowledge to improve the retrieval (~30%)
Evaluate your approach on an appropriate dataset (~20%)
Requirements:
Knowledge of computer vision and machine learning
Very good programming experience in C++ & Python
Experience with distributed systems is a plus
Contact: Martin Bäuml <[email protected]>
Building 50.20, Room 228