Multimedia search: From Lab to Web
prof. dr. L. Schomaker
KI/RuG
Invited lecture, presented at the 4e Colloque International sur le Document Electronique, 24-26 October 2001.
Schomaker, L.R.B. (2001). Image Search and Annotation: From Lab to Web. Proceedings of CIDE 2001, pp. 373-375, ISBN 2-909285-17-0.
©2001 LRB Schomaker - KI/RuG
Overview
Methods in content-based image search
The user’s perspective: ergonomics, cognition and perception
Feeding the data-starved machine
Researchers
L. Schomaker, L. Vuurpijl, E. Deleau, E. Hoenkamp, A. Baris
A definition
In content-based image retrieval systems, the goal is to provide the user with a set of images, based on a query which consists, partly or completely, of pictorial information
Excluded: point-and-click navigation in pre-organized image bases
Image-based queries on WWW: existing methods and their problems
IBIR - image-based information retrieval
CBIR - content-based image retrieval
QBIC - queries based on image content
PBIR - pen-based image retrieval
Existing systems & prototypes
QBIC (IBM), VisualSEEk (Columbia), FourEyes (MIT Media Lab)
… and many more: WebSeek, Excalibur, ImageRover, Chabot, Piction
Research: IMEDIA (INRIA), Viper/GIFT (Marchand-Maillet)
Query Methods
Query | Matched with | Algorithm
Keywords | Manual text annotation | String search, Information Retrieval
Keywords | Textual context of image | String search, IR
Exemplar image | Complete image | Template matching, feature-vector matching
Rectangular sub-image | Complete image | Feature- and texture-based matching
Layout structure | Complete image | Texture and color
Object outline | Partial image | Outlines, edges
Object sketch | Partial image | Features, edges
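Several of the matching schemes in this table reduce an image to a feature vector and rank candidate images by distance to the query's vector. A minimal sketch of that idea, assuming a global colour-histogram feature and Euclidean ranking (illustrative code, not any of the cited systems):

```python
import numpy as np

def colour_histogram(image, bins=8):
    """Quantise an RGB image (H x W x 3, values 0-255) into a
    normalised joint colour histogram used as a feature vector."""
    pixels = image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def rank_by_similarity(query, candidates):
    """Return candidate indices sorted by histogram distance to the query."""
    q = colour_histogram(query)
    dists = [np.linalg.norm(q - colour_histogram(c)) for c in candidates]
    return np.argsort(dists)
```

Because the histogram discards all spatial layout, two images with similar overall colours rank as near neighbours even when they depict entirely different objects, which previews the "counter-intuitive results" discussed below.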
Example 1. QBIC (IBM)
Features: colors, textures, edges, shape
Matching: layout, full-image templates, shape
Upper-left picture is the query: “boy in yellow raincoat”
… yields very counter-intuitive results
What was the user’s intention?
Example 2. VisualSEEk
Features: colors, textures, edges, bitmap shape
Matching: layout, full-image templates
Layout- and feature-based query construction
Requires detailed user knowledge of pattern-recognition issues!
VisualSEEk (Columbia Univ.)
Example 3. FourEyes (MIT Media Lab)
Imposed block segmentation
Textual annotation per block
Labels are propagated on the basis of texture matching
FourEyes…
Imposed block segmentation is unrelated to object placement
Object details are lost: only global and textural information remains
Interesting: a role for the user
Problems
Full-image template matching yields bad retrieval results
Feature-based matching requires substantial input and knowledge from the user
Layout-based search only suits a subset of image needs
Grid-based partitioning misses details and breaks up meaningful objects
Problems…
Reasons behind a retrieved image list are unclear (Picard, 1995)
Features and matching scheme are not easily explainable to the user
An intelligent system should learn from previous queries of the user(s)
A statement
In content-based image retrieval systems, just as in text-based Information Retrieval, the performance of current systems is limited by their incomplete and weak modeling of the user’s:
Needs
Goals
Perception
Cognition (semantics)
User-Interfacing aspects
Computer users continuously evaluate the value of system responses as a function of the effort spent on input actions (cost/benefit evaluation)
Consequence: after formulating a query with many keystrokes, slider adjustments and mouse clicks, the quality of an image hit list is expected to be very high…
Conversely, user expectations are low when the effort consists of only a single mouse click
Pragmatic aspects
A survey on the WWW revealed that users are interested in objects (71%), not in layout, texture or abstract features.
The preferred image type is photographs (68%)
Cognitive & Perceptual aspects
Objects are best recognized from 'canonical views' (Blanz et al., 1999)
Photographers know and utilize this phenomenon by manipulating camera attitude or objects
Photographs and paintings imply communication
[Diagram: a photographer or painter observes the world and communicates with a user/viewer; in contrast, a surveillance camera observes the world and feeds computer vision]
Problems of geometrical invariance are less extreme
More cognition: Basic-level object categories
In a hierarchy of object classes (an ontology), a node of the 'Basic Level' type (Rosch et al., 1976) adds many structural features to its description compared with the level above, whereas the number of unique additional features is small when going down to a more specific node.
Basic-level categories, example
“furniture” [virtually no geometrical features]
“chair” [many clearly-defined structural features]
“kitchen chair” [only a few additional features].
Basic-level object categories and mental imagery
A basic level is the highest level for which clear mental imagery exists in an object ontology
A basic-level object elicits almost the same feature description whether it is named or shown visually
Basic-level object descriptions often contain references to structural components (parts)
In verbally describing the contents of a picture, people tend to use 'basic-level' words
Rosch, E., Mervis, C.B., Gray, W.E., Johnson, E.M. and Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, pp. 382-439.
Implication of the ‘basic level’ category
The basic level forms a natural bridge between textual and pictorial information
It is likely to determine both annotation and search behavior of the users
It is an ideal starting point for developing computer vision systems which generate text on the basis of a photograph (ultimately)
Misconception about Perception and Cognition
“A picture is worth a thousand words”?
True or False?
Assumptions
In image retrieval, the media type of photographs is preferred
There is a predominant interest in objects (in the broad sense: including humans and animals)
The most likely level of description in real-world images is the “basic-level” category (Rosch et al.)
Goal: object-based image search
Object recognition in an open domain?
Not possible yet.
Extensive annotation is needed in any case: for indexed access and for machine learning (MPEG-7 allows for sophisticated annotation)
But who is going to do the annotation, the content provider or the user, and how?
How to realize object-based image search?
Bootstrap process for pattern recognition
cf. the Cyc project (Lenat) and openMind (Stork)
Collaborative, opportunistic annotation and object labeling (browser side)
Background learning process (server side)
Design considerations
Focus on object-based representations and queries
Material: photographs with identifiable objects for which a verbal description can be given
Exploit human perceptual abilities
Allow for incremental annotation to obtain a growing training set
Outline-based queries
In order to bridge the gap between what is currently possible and the ultimate goal of automatic object detection and classification, a closed curve drawn around a known object is used as a bootstrap representation: an outline.
This closed curve contains shape information itself (XY, dXdY, curvature) and makes it possible to separate the visual object characteristics, represented by the pixels it encloses, from the background
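The shape signals named here (XY, dXdY, curvature) can be derived directly from the sampled closed curve. A minimal sketch, with illustrative names, assuming the outline arrives as two equal-length coordinate arrays:

```python
import numpy as np

def outline_features(x, y):
    """x, y: arrays sampling a closed outline curve.
    Returns dx, dy and a discrete curvature estimate per point."""
    # Central differences with wrap-around, since the curve is closed.
    dx = (np.roll(x, -1) - np.roll(x, 1)) / 2.0
    dy = (np.roll(y, -1) - np.roll(y, 1)) / 2.0
    ddx = np.roll(x, -1) - 2.0 * x + np.roll(x, 1)
    ddy = np.roll(y, -1) - 2.0 * y + np.roll(y, 1)
    # Signed curvature of a parametric curve.
    curvature = (dx * ddy - dy * ddx) / (dx ** 2 + dy ** 2) ** 1.5
    return dx, dy, curvature
```

A convenient sanity check: for a circle of radius r, the estimated curvature approximates 1/r at every sample point.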
More outline-based features
Lengths of radii from center of gravity
Curvature
Curvature scale space
Bitmap of an outline
Absolute Fourier transform |FFT|
Others (not tried yet): wavelets, Freeman coding
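To illustrate how two of the listed features combine: the lengths of the radii from the center of gravity form a one-dimensional signal, and the absolute Fourier transform of that signal is independent of where along the outline sampling starts. A hedged sketch (the function name and the normalisation are illustrative choices, not the original implementation):

```python
import numpy as np

def radii_spectrum(x, y, n_coeff=16):
    """Radii from the centre of gravity, then |FFT|.
    A circular shift of the outline's start point leaves |FFT| unchanged."""
    cx, cy = x.mean(), y.mean()
    radii = np.hypot(x - cx, y - cy)   # lengths of radii from the centroid
    radii = radii / radii.mean()       # crude scale normalisation
    return np.abs(np.fft.fft(radii))[:n_coeff]
```

The start-point invariance matters in practice, because two users outlining the same object will rarely begin drawing at the same place.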
Annotation
After the user has produced an outline (by pen or mouse), it is fruitful to ask for a text label (keyboard, speech, handwriting)
Knowledge of semantics can be exploited to guide the user (e.g., with menus)
Problems in performance measurement
The systems usually aim to return a list of similar-looking images
What is good? What is bad?
There is no clear-cut definition of ‘class’, unlike in speech and handwriting recognition
Performance measurement is borrowed from Information Retrieval: Precision & Recall
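The two borrowed measures are simple set statistics over a hit list; a minimal sketch (variable names are illustrative):

```python
def precision_recall(retrieved, relevant):
    """retrieved: ids returned by the system; relevant: ground-truth ids.
    Precision = fraction of retrieved items that are relevant;
    recall    = fraction of relevant items that were retrieved."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Note that both measures presuppose a ground-truth relevance judgement per image, which is exactly what the missing definition of ‘class’ makes hard to obtain for image retrieval.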
Intermediate summary
Outline-based search yields promising results
Many questions remain:
Can users do it? Do they like to perform outlining + annotation?
Is the ‘bootstrap’ idea valid: can the outlines be used for matching with unseen images?
Can users produce outlines?
Object classes: locomotive, Christmas tree, atomic explosion, jukebox, 4-wheel-drive car, brain, motor bike, pistol, Buddha, stop sign
User (N=33) differences in outline production
Multistable outlining behavior?
Locomotive: with or without smoke?
Accurate or sloppy curvature followers
Observations
Outline vs Edge search results
Caveat: no translation, orientation or scale invariance (early results)
More use for outlines: class-specific edge detectors
Generic edge detection
Edge detector (MLP): trained with outline points from the motor-bicycle base as targets for the output neuron
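The class-specific detector idea can be sketched as follows. This is a deliberately simplified stand-in: a single logistic output neuron in place of the multilayer perceptron, trained so that it fires on image patches whose centre pixel lies on an annotated outline. All names and the training setup are illustrative, not the original implementation.

```python
import numpy as np

def train_edge_neuron(X, y, lr=0.5, epochs=200):
    """X: rows are flattened grey-value patches; y: 1 if the patch centre
    lies on a labelled outline, else 0. Returns weights and bias of a
    single logistic output neuron (stand-in for the slide's MLP)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid activation
        grad = p - y                             # cross-entropy gradient
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b
```

After training, thresholding the neuron's response over all patch positions yields an edge map tuned to one object class, rather than a generic gradient-based edge image.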
User input is highly valuable
Labeled outlines are needed to train classifiers/matchers
Labeled outlines are needed to develop benchmark sets (like “BenchAthlon”)
Examples from other fields:
Unipen: 5 million characters
NIST: millions of characters
LDC: thousands of hours of labeled speech
User input is highly valuable
openMind arguments (Stork, 1999; 2001)
The best teams have the largest labeled training sets
Differences between algorithms vanish when huge training sets are used (Ho & Baird, 1997)
Processor speed can be exploited if sufficient amounts of data are used (free ride on Moore’s Law)
The Vind(X) site (Schomaker & Vuurpijl)
The experience collected thus far has been integrated into a functional Web site for image search and collaborative annotation
In collaboration with Amsterdam Rijksmuseum: a large image base of paintings and their descriptions in a text base
The Vind(X) site (Schomaker & Vuurpijl)
Site: http://kepler.cogsci.kun.nl/vindx/
The site will become part of the openMind initiative: http://www.openmind.org
System consisting of Java/Javascript WWW pages, server-side pattern recognition in C
Vind(X) has extensive search and rendering functions
[Screenshot: the Vind(X) system with the paintings database of the Amsterdam Rijksmuseum; query at upper left: “sitting man” (Schomaker & Vuurpijl, 1999)]
More questions: open user access
How to detect non-cooperative outlining and annotation?
How to merge ‘identical’ outlines?
How to merge ‘identical’ textual annotations?
How to detect valuable expert input?
More questions: semantics and geometry
How to achieve ‘explainable’ image hit list results?
make sure the underlying features are based on human perception
Hypothesis: “The construction of ontologies based on both semantics and feature- space characteristics will help in producing ‘explainable’ hit list results”
[Figure: example of an ontology created from all collected object annotations]
Summary
Existing systems have problems in usability
Knowledge about the user (ergonomics, perception, cognition) may help substantially
Objects are a preferred search criterion
Object-based approaches have a strong connection to semantics
Summary (continued)
An outline-based object search system was presented
The prototype was converted to a Web site with real content: Dutch paintings (> 80)
The site is used for collecting human annotations to this image base (> 1000)
The resulting data are very useful for future research in a number of areas: IR, outline matching, pixel matching, dedicated preprocessing