lecture 03 internet video search

6 Location and context

What makes a cow a cow?

Google knows because other people know

We think we know

“Because it has four legs.” But in fact, not all cows show four legs, nor are they all brown … not all …

How do you know?

What is the object in the middle?

No segmentation … Not even the pixel values of the object …

Where is evidence for an object?

Uijlings IJCV 2011

What is the visual extent of an object?

Uijlings IJCV 2012

Where: exhaustive search

Look everywhere for the object window. This imposes computational constraints on:
• the number of locations and windows (hence a coarse grid and a fixed aspect ratio);
• the evaluation cost per location (hence weak features and classifiers).
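To get a feel for why exhaustive search forces a coarse grid and a fixed aspect ratio, the following minimal Python sketch counts the windows a sliding-window detector would have to evaluate. The function name `count_windows`, the image size, the window sizes, and the strides are illustrative assumptions, not values from the lecture.

```python
def count_windows(img_w, img_h, win_sizes, stride):
    """Count sliding-window positions over all (width, height) window sizes."""
    total = 0
    for w, h in win_sizes:
        if w > img_w or h > img_h:
            continue  # window does not fit in the image
        nx = (img_w - w) // stride + 1  # horizontal positions
        ny = (img_h - h) // stride + 1  # vertical positions
        total += nx * ny
    return total


# Hypothetical setting: a 500x375 image, square and 2:1 windows at a few scales.
sizes = [(s, s) for s in (32, 64, 128, 256)] + [(2 * s, s) for s in (32, 64, 128)]
dense = count_windows(500, 375, sizes, stride=1)   # every pixel position
coarse = count_windows(500, 375, sizes, stride=8)  # coarse grid
print(dense, coarse)
```

The dense count multiplied by the per-window classifier cost is what makes exhaustive search slow; the coarse grid trades localization accuracy for fewer evaluations.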

Impressive but takes long.

Viola IJCV 2004, Dalal CVPR 2005, Felzenszwalb PAMI 2010, Vedaldi ICCV 2009

Where: the need for a hierarchy

An image is intrinsically hierarchical.

Gu CVPR 2009

Selective search

Van de Sande ICCV 2011

Windows are formed by hierarchical grouping: adjacent regions are merged based on color, texture, and shape cues. Felzenszwalb 2004
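Hierarchical grouping can be sketched as a greedy loop that repeatedly merges the most similar pair of adjacent regions and records the bounding box of every region it creates. The Python sketch below is a simplified illustration, not the published algorithm: it assumes integer region ids, a precomputed adjacency set, and a single scalar "color" per region as a stand-in for the color, texture, and shape cues. All names are hypothetical.

```python
def merge_boxes(a, b):
    """Smallest box (x0, y0, x1, y1) that encloses boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def hierarchical_grouping(regions, neighbours):
    """regions: {id: {"box": (x0, y0, x1, y1), "size": pixels, "color": float}}
    neighbours: set of frozenset({id_a, id_b}) for adjacent regions.
    Returns the boxes of the initial regions and of every merged region."""
    regions = dict(regions)
    neighbours = set(neighbours)
    proposals = [r["box"] for r in regions.values()]
    next_id = max(regions) + 1
    while neighbours:
        # Most similar adjacent pair; similarity here is only color difference.
        pair = min(neighbours,
                   key=lambda p: abs(regions[min(p)]["color"] - regions[max(p)]["color"]))
        a, b = tuple(pair)
        ra, rb = regions[a], regions[b]
        size = ra["size"] + rb["size"]
        merged = {
            "box": merge_boxes(ra["box"], rb["box"]),
            "size": size,
            "color": (ra["color"] * ra["size"] + rb["color"] * rb["size"]) / size,
        }
        # Rewire adjacency: neighbours of a or b become neighbours of the new region.
        rewired = set()
        for p in neighbours:
            if p == pair:
                continue
            if a in p or b in p:
                other = next(iter(p - {a, b}))
                rewired.add(frozenset({next_id, other}))
            else:
                rewired.add(p)
        neighbours = rewired
        del regions[a], regions[b]
        regions[next_id] = merged
        proposals.append(merged["box"])
        next_id += 1
    return proposals


# Three regions in a row; the two similar ones merge first.
toy = {
    0: {"box": (0, 0, 10, 10), "size": 100, "color": 0.1},
    1: {"box": (10, 0, 20, 10), "size": 100, "color": 0.2},
    2: {"box": (20, 0, 30, 10), "size": 100, "color": 0.9},
}
boxes = hierarchical_grouping(toy, {frozenset({0, 1}), frozenset({1, 2})})
```

Because every level of the hierarchy contributes a box, the result covers objects at all scales with far fewer windows than an exhaustive scan.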

Selective search example


Selective search example

Average best overlap ~88%
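One common way to compute an average best overlap is to take, for every ground-truth box, the best intersection-over-union with any proposed window, and then average over the ground-truth boxes. The Python sketch below assumes that definition and a box format of (x0, y0, x1, y1); the function names are illustrative.

```python
def overlap(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_best_overlap(ground_truth, proposals):
    """Mean over ground-truth boxes of the best overlap with any proposal."""
    best = [max(overlap(g, p) for p in proposals) for g in ground_truth]
    return sum(best) / len(best)


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
windows = [(0, 0, 10, 5), (0, 0, 10, 10), (20, 20, 30, 25)]
abo = average_best_overlap(gt, windows)  # (1.0 + 0.5) / 2
```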

… looks like this

High recall cat

Pairs of concepts

Uijlings ICCV demo 2012

6. Conclusion

Selective search gives good localization. Localization is needed to understand pairs of concepts.

7 Data and metadata

http://bit.ly/visualsearchengines

How many concepts?

Slide: Fei-Fei Li. Biederman, Psychological Review 1987

How many examples?

Once you have more than 100 to 1,000 examples, success follows.

Russell IJCV 2008

LabelMe 290,000 object annotations

Amateur labeling

Xirong Li, TMM 2009

Tag relevance by social annotation

Consistency in tagging between users on similar images.
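The idea of tagging consistency on similar images can be sketched as neighbor voting: a tag is relevant for an image if it occurs more often among the image's visual neighbors than its overall frequency would predict. The Python sketch below is a simplified illustration under that assumption; the feature vectors, tag sets, `k`, and the function name are hypothetical, and details such as limiting each user to one vote are left out.

```python
import numpy as np


def tag_relevance(features, tags, query_idx, tag, k=2):
    """Votes for `tag` among the k visual nearest neighbours of one image,
    minus the number of votes expected from the tag's overall frequency."""
    dist = np.linalg.norm(features - features[query_idx], axis=1)
    dist[query_idx] = np.inf                      # exclude the image itself
    neighbours = np.argsort(dist)[:k]
    votes = sum(tag in tags[i] for i in neighbours)
    prior = k * sum(tag in t for t in tags) / len(tags)
    return votes - prior


# Toy data: two visual clusters with user tags attached to each image.
features = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
tags = [{"snow"}, {"snow"}, {"snow", "rainbow"},
        {"beach"}, {"beach", "snow"}, {"beach"}]

consistent = tag_relevance(features, tags, query_idx=0, tag="snow")    # neighbours agree
inconsistent = tag_relevance(features, tags, query_idx=4, tag="snow")  # neighbours disagree
```

A positive score means the neighbors confirm the tag; a negative score suggests the tag is subjective or noisy for that image.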

Tag relevance by social annotation

Pretty good for snow, not so good for rainbow.

Social negative bootstrapping

Xirong Li ACM MM 2009

Negative images are as important as positive images to learn. Not just random negative images, but close ones.
• We want to learn positive examples from an expert, and obtain as many negative samples as we like for free from the web.
• We iteratively aim for the hardest negatives.
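Iteratively aiming for the hardest negatives can be sketched as a loop: train on the positives plus the current negatives, score a large pool of candidate negatives, add the candidates the classifier finds most positive-looking, and retrain. The Python sketch below illustrates that loop under stated assumptions: a least-squares linear scorer stands in for whatever classifier is actually used, the pool is assumed to contain only negatives, and all names and parameters are hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares linear scorer with a bias term (stand-in for an SVM)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w


def score(w, X):
    """Classifier score; higher means more positive-looking."""
    return np.hstack([X, np.ones((len(X), 1))]) @ w


def negative_bootstrap(pos, pool, rounds=3, per_round=5, seed=0):
    """Start from random negatives, then repeatedly add the highest-scoring
    candidates from the pool as new negatives and retrain."""
    rng = np.random.default_rng(seed)
    remaining = np.arange(len(pool))
    chosen = rng.choice(remaining, size=per_round, replace=False)
    remaining = np.setdiff1d(remaining, chosen)

    def train(neg_idx):
        X = np.vstack([pos, pool[neg_idx]])
        y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg_idx))])
        return fit_linear(X, y)

    w = train(chosen)
    for _ in range(rounds):
        if len(remaining) == 0:
            break
        s = score(w, pool[remaining])
        hardest = remaining[np.argsort(-s)[:per_round]]  # most misleading negatives
        chosen = np.concatenate([chosen, hardest])
        remaining = np.setdiff1d(remaining, hardest)
        w = train(chosen)
    return w, chosen
```

Calling `negative_bootstrap(pos, pool)` with a feature matrix of expert-labeled positives and a matrix of web-harvested candidates returns the final weights and the indices of the negatives that were selected.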

Social negative bootstrapping

Xirong Li ICMR 2011

Knowledge ontology ImageNet

Acknowledgement: WordNet friends
Christiane Fellbaum, Dan Osherson, Kai Li (Princeton)
Alex Berg (Columbia)
Jia Deng (Princeton/Stanford)
Hao Su (Stanford)

PASCAL VOC

The PASCAL Visual Object Classes (VOC).
500,000 images downloaded from Flickr.
Queries like “car”, “vehicle”, “street”, “downtown”.
10,000 objects, 25,000 labels.
Mark Everingham, Luc Van Gool, Chris Williams, John Winn, Andrew Zisserman

7. Conclusion

Data is king. The data are beginning to reflect the human cognition capacity [at a basic level]. Harvesting social data requires advanced computer vision control.

8 Performance

PASCAL 2010

Aeroplane, Bicycle, Bird, Boat, Bottle, Bus, Car, Cat, Chair, Cow

True Positives - Person UOCTTI_LSVM_MDPM

NLPR_HOGLBP_MC_LCEGCHLC

NUS_HOGLBP_CTX_CLS_RESCORE_V2

False Positives - Person UOCTTI_LSVM_MDPM



Non-birds & non-boats

Non-bird images: Highest ranked

Non-boat images: Highest ranked

Water texture and scene composition?

Non-chair

True Positives - Motorbike MITUCLA_HIERARCHY



False Positives - Motorbike MITUCLA_HIERARCHY



Object localization 2008-2010

Results on 2008 data improve for 2010 methods for all categories, by over 100% for some categories.

[Figure: bar chart of Max AP (%), axis from 0 to 60, per category (aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor) for 2008, 2009, and 2010.]

TRECvid evaluation standard

Concept detection

Aircraft

Beach

Mountain

People marching

Police/Security

Flower

Measuring performance

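• Precision: the number of relevant retrieved items divided by the number of retrieved items.
• Recall: the number of relevant retrieved items divided by the number of relevant items.

Precision and recall have an inverse relationship.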
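Precision and recall follow directly from the three sets named above, and a ranked result list can be summarized by average precision. The Python sketch below uses the plain, non-interpolated definition of average precision; benchmark-specific variants may differ. The function names and the toy ranking are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall from the sets of retrieved and relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant retrieved items
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where a relevant item appears,
    divided by the total number of relevant items."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


ranked = ["a", "b", "c", "d", "e"]   # system output, best first
relevant = {"a", "c", "f"}           # ground truth; "f" is never retrieved
p, r = precision_recall(ranked, relevant)   # 2/5 and 2/3
ap = average_precision(ranked, relevant)    # (1/1 + 2/3) / 3
```

Averaging `average_precision` over all concepts gives the mean average precision (MAP) reported for the detectors.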

Results

UvA-MediaMill@TRECVID

• other systems

Snoek et al, TRECVID 04-10

Performance doubled in just 3 years

• 36 concept detectors

Snoek & Smeulders, IEEE Computer 2010

Even when using training data of different origin, great progress. But the number of concepts is still limited.

8. Conclusion

Results are impressive and improve quickly each year. The competition is very valuable. The highest-ranked non-class images start to make sense!

9 Speed

SURF based on integral images

Integral images were introduced by Viola & Jones in the context of face detection: sliding windows, left to right and top to bottom, over integral images.
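An integral image stores, at each position, the sum of all pixels above and to the left, so the sum over any rectangle takes four lookups regardless of its size. The following NumPy sketch illustrates the idea; the function names and the zero-padded layout are choices made here for convenience.

```python
import numpy as np


def integral_image(img):
    """Integral image with an extra zero row and column at the top and left."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii


def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] using four lookups in the integral image."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]


img = np.arange(12).reshape(3, 4)
ii = integral_image(img)
total = box_sum(ii, 1, 1, 3, 3)   # same as img[1:3, 1:3].sum()
```

This constant cost per rectangle is why box-filter responses do not get more expensive as the filter grows.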


SURF principle

LREC 2004, 26 May 2004, Lisbon

Approximate the Gaussian second-order derivatives $L_{xx}$, $L_{yy}$, $L_{xy}$ with box filters.

SURF speed


Computation time: 6 times faster than DoG (~100 ms). Independent of filter scale.

Dense descriptor extraction

Pixel-wise responses are aggregated into the final descriptor.

A factor of 16 speed improvement, and another factor of 2 from the use of matrix libraries.

Projection: Random Forest

Binary decision trees

Moosmann et al. 2008
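Projecting a descriptor with binary decision trees costs only a handful of comparisons per tree, instead of a distance computation to every codebook entry. The NumPy sketch below illustrates this with randomly chosen splits; it is an assumption-laden stand-in, since the trees in the cited work are learned from data. It also assumes descriptors scaled to [0, 1], and all names and sizes are hypothetical.

```python
import numpy as np


class RandomTree:
    """Complete binary tree with random axis-aligned splits (not learned)."""

    def __init__(self, dim, depth, rng):
        n_internal = 2 ** depth - 1
        self.depth = depth
        self.feat = rng.integers(0, dim, size=n_internal)     # split dimension per node
        self.thresh = rng.uniform(0.0, 1.0, size=n_internal)  # split threshold per node

    def leaf(self, X):
        """Leaf index (0 .. 2**depth - 1) reached by each row of X."""
        rows = np.arange(len(X))
        node = np.zeros(len(X), dtype=int)
        for _ in range(self.depth):
            go_right = X[rows, self.feat[node]] > self.thresh[node]
            node = 2 * node + 1 + go_right
        return node - (2 ** self.depth - 1)


def forest_histogram(X, trees):
    """Concatenate per-tree leaf histograms into one bag-of-words vector."""
    parts = [np.bincount(t.leaf(X), minlength=2 ** t.depth) for t in trees]
    return np.concatenate(parts)


rng = np.random.default_rng(0)
trees = [RandomTree(dim=8, depth=4, rng=rng) for _ in range(3)]
descriptors = rng.uniform(size=(50, 8))   # stand-in for dense descriptors
hist = forest_histogram(descriptors, trees)
```

Each leaf acts as a visual word, so the concatenated histogram plays the role of the bag-of-words vector that is passed on to the classifier.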

Real-time bag of words

Descriptor extraction: D-SURF 2x2
Projection, pre-projection: <empty>
Projection, actual projection: Random Forest
Classification, SVM kernel: RBF

MAP: 0.370

Total computation time is 38 milliseconds per image

26 frames per second on a normal PC for any 20 concepts.

Time per stage: 15, 10, and 13 ms (38 ms in total).
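The frame rate follows from the total time per image. The short Python check below assumes that the three numbers 15, 10, and 13 are per-stage times in milliseconds, in the order descriptor extraction, projection, classification; that mapping is an inference from their sum, not something the slide states.

```python
# Assumed per-stage times in milliseconds (the stage mapping is an assumption).
stage_ms = {"descriptor extraction": 15, "projection": 10, "classification": 13}

total_ms = sum(stage_ms.values())   # time per image
fps = 1000 / total_ms               # images per second

print(f"{total_ms} ms per image, about {int(fps)} frames per second")
```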

9. Conclusion

SURF is scale and rotation invariant.
Fast due to the use of integral images.
Download: http://www.vision.ee.ethz.ch/~surf/
D-SURF extraction is 6x faster than Dense-SIFT.
Projection using Random Forest is 50x faster than NN.

Internet Video Search: the beginning

[Diagram: measuring features, lexicon learning, concept detection, telling stories, browsing video]
