Sparse Kernel Learning for Image Annotation

Sean Moran and Victor Lavrenko

Institute of Language, Cognition and Computation

School of Informatics

University of Edinburgh

ICMR’14 Glasgow, April 2014

Overview

SKL-CRM

Evaluation

Conclusion

Assigning words to pictures

Feature extraction: GIST, SIFT, LAB, HAAR

Training dataset (images with known annotations):

- Tiger, Grass, Whiskers
- City, Castle, Smoke
- Tiger, Tree, Leaves
- Eagle, Sky

Annotation model: given multiple features of a testing image, produce a ranked list of words:

P(Tiger | image) = 0.15
P(Grass | image) = 0.12
P(Whiskers | image) = 0.12
P(Leaves | image) = 0.10
P(Tree | image) = 0.10
P(Sky | image) = 0.08
P(Waterfall | image) = 0.05
P(City | image) = 0.03
P(Castle | image) = 0.03
P(Eagle | image) = 0.02
P(Smoke | image) = 0.01

Top 5 words as annotation: Tiger, Grass, Tree, Leaves, Whiskers

This talk: How best to combine features?
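The ranking step on this slide can be sketched in a few lines of Python. The probabilities are copied from the slide; the function name `annotate` and the tie-breaking behaviour (ties keep input order) are assumptions made for illustration only.

```python
# Word probabilities as listed on the slide for one testing image.
word_probs = {
    "Tiger": 0.15, "Grass": 0.12, "Whiskers": 0.12, "Leaves": 0.10,
    "Tree": 0.10, "Sky": 0.08, "Waterfall": 0.05, "City": 0.03,
    "Castle": 0.03, "Eagle": 0.02, "Smoke": 0.01,
}


def annotate(probs, k=5):
    """Return the k most probable words (hypothetical helper).

    Python's sort is stable, so words with equal probability keep
    the order in which they appear in the dictionary.
    """
    ranked = sorted(probs, key=lambda w: probs[w], reverse=True)
    return ranked[:k]


annotation = annotate(word_probs)
print(annotation)
```

The five returned words are the same set the slide shows as the annotation.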

Previous work

- Topic models: latent Dirichlet allocation (LDA) [Barnard et al. ’03], Machine Translation [Duygulu et al. ’02]

- Mixture models: Continuous Relevance Model (CRM) [Lavrenko et al. ’03], Multiple Bernoulli Relevance Model (MBRM) [Feng ’04]

- Discriminative models: Support Vector Machine (SVM) [Verma and Jawahar ’13], Passive Aggressive Classifier [Grangier ’08]

- Local learning models: Joint Equal Contribution (JEC) [Makadia ’08], Tag Propagation (Tagprop) [Guillaumin et al. ’09], Two-pass KNN (2PKNN) [Verma et al. ’12]

Combining different feature types

- Previous work: linear combination of feature distances in a weighted summation with “default” kernels:

[Figure: three kernels GG(x; p) plotted against x: Laplacian (p = 1), Gaussian (p = 2), Uniform (p = 15)]

- Standard kernel assignment: Gaussian for GIST, Laplacian for colour features, χ2 for SIFT
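A weighted summation of per-feature distances with fixed default choices might look like the sketch below. Pairing the Gaussian kernel with a Euclidean (L2) distance and the Laplacian kernel with an L1 distance is an assumption of this sketch, as are the names `DEFAULT_DISTANCE` and `combined_distance` and the example weights; the slides do not give these details.

```python
import numpy as np


def l2(a, b):
    """Euclidean distance (assumed here to go with a Gaussian kernel)."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


def l1(a, b):
    """Manhattan distance (assumed here to go with a Laplacian kernel)."""
    return float(np.sum(np.abs(a - b)))


def chi2(a, b, eps=1e-12):
    """Chi-squared distance between two histograms."""
    return float(0.5 * np.sum((a - b) ** 2 / (a + b + eps)))


# Fixed "default" assignment: one distance per feature type.
DEFAULT_DISTANCE = {"gist": l2, "lab": l1, "sift": chi2}


def combined_distance(x, y, weights):
    """Weighted sum of per-feature distances between images x and y."""
    return sum(
        weights[name] * DEFAULT_DISTANCE[name](x[name], y[name])
        for name in weights
    )


x = {"gist": np.array([0.0, 0.0]), "lab": np.array([0.5, 0.5]), "sift": np.array([1.0, 0.0])}
y = {"gist": np.array([3.0, 4.0]), "lab": np.array([1.0, 0.0]), "sift": np.array([0.0, 1.0])}
weights = {"gist": 1 / 3, "lab": 1 / 3, "sift": 1 / 3}
print(combined_distance(x, y, weights))
```

In this setting only the weights change from dataset to dataset, while the distance attached to each feature stays fixed.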

Data-adaptive visual kernels

- Our contribution: permit the visual kernels themselves to adapt to the data:

[Figure: kernels GG(x; p) for p = 1, p = 15 and p = 2, shown for Corel 5K]

- Hypothesis: optimal kernels for GIST, SIFT etc. depend on the image dataset itself

[Figure: the same kernels GG(x; p) for p = 1, p = 15 and p = 2, shown for IAPR TC12]

Sparse Kernel Continuous Relevance Model (SKL-CRM)

Continuous Relevance Model (CRM)

- CRM estimates the joint distribution of image features ($f$) and words ($w$) [Lavrenko et al. 2003]:

$$P(w, f) = \sum_{J \in T} P(J) \prod_{j=1}^{N} P(w_j \mid J) \prod_{i=1}^{M} P(\vec{f}_i \mid J)$$

  - $P(J)$: uniform prior for training image $J$
  - $P(\vec{f}_i \mid J)$: Gaussian non-parametric kernel density estimate
  - $P(w_j \mid J)$: multinomial for word smoothing

- Estimate marginal probability distribution over individual tags:

$$P(w \mid f) = \frac{P(w, f)}{\sum_{w} P(w, f)}$$

- The top words (e.g. 5) with the highest $P(w \mid f)$ are used as the annotation
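The CRM computation above can be sketched for single-word queries as follows. This is a minimal illustration, not the authors' implementation: the isotropic Gaussian density, the interpolation weight `mu` used to smooth the word distribution, and all function and variable names are assumptions.

```python
import numpy as np


def gaussian_kde(f, regions, beta):
    """Gaussian kernel density of vector f under one training image's regions."""
    d = regions.shape[1]
    sq_dist = np.sum((regions - f) ** 2, axis=1)
    norm = (2.0 * np.pi * beta ** 2) ** (d / 2.0)
    return float(np.mean(np.exp(-sq_dist / (2.0 * beta ** 2)) / norm))


def crm_word_posterior(test_regions, train_regions, train_words, vocab,
                       beta=1.0, mu=0.9):
    """Return P(w | f) for every word in vocab."""
    n_train = len(train_regions)
    # Binary word indicators per training image (each image needs >= 1 word).
    counts = np.array([[1.0 if w in words else 0.0 for w in vocab]
                       for words in train_words])
    # Background word frequencies, used only for smoothing (assumed scheme).
    background = counts.sum(axis=0) / counts.sum()
    joint = np.zeros(len(vocab))
    for J in range(n_train):
        prior = 1.0 / n_train  # uniform P(J)
        p_f = np.prod([gaussian_kde(f, train_regions[J], beta)
                       for f in test_regions])
        p_w = mu * counts[J] / counts[J].sum() + (1.0 - mu) * background
        joint += prior * p_w * p_f
    return dict(zip(vocab, joint / joint.sum()))


vocab = ["tiger", "grass", "sky"]
train_regions = [np.array([[0.0, 0.0], [0.1, 0.0]]),
                 np.array([[5.0, 5.0], [5.1, 5.0]])]
train_words = [{"tiger", "grass"}, {"sky"}]
test_regions = np.array([[0.05, 0.0]])
posterior = crm_word_posterior(test_regions, train_regions, train_words, vocab)
print(posterior)
```

With many regions the product of densities can underflow, so a practical version would likely work with log-probabilities.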

Sparse Kernel Learning CRM (SKL-CRM)

- Introduce binary kernel-feature alignment matrix $\Psi_{u,v}$:

$$P(I \mid J) = \prod_{i=1}^{M} \sum_{j=1}^{R} \exp\left\{ -\frac{1}{\beta} \sum_{u,v} \Psi_{u,v} \, k^{v}(\vec{f}^{u}_i, \vec{f}^{u}_j) \right\}$$

- $k^{v}(\vec{f}^{u}_i, \vec{f}^{u}_j)$: $v$-th kernel function on the $u$-th feature type

- $\beta$: kernel bandwidth parameter

- Goal: learn $\Psi_{u,v}$ by directly maximising annotation F1 score on a held-out validation dataset
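To make the role of the alignment matrix concrete, the sketch below evaluates the expression for $P(I \mid J)$ with Ψ stored as a dictionary of 0/1 entries. The data layout, the function name `skl_crm_likelihood`, and the squared-difference function standing in for $k^{v}$ are all assumptions chosen for illustration.

```python
import math

import numpy as np


def skl_crm_likelihood(test_feats, train_feats, psi, kernels, beta=1.0):
    """Evaluate the slide's expression for P(I | J).

    test_feats / train_feats: dict feature name -> array of shape (regions, dim)
    psi: dict (feature name, kernel name) -> 0 or 1
    kernels: dict kernel name -> function(a, b) returning a float
    """
    first = next(iter(test_feats))
    M = test_feats[first].shape[0]   # regions in the test image I
    R = train_feats[first].shape[0]  # regions in the training image J
    likelihood = 1.0
    for i in range(M):
        total = 0.0
        for j in range(R):
            # Only pairs switched on in Psi contribute to the exponent.
            energy = sum(
                kernels[v](test_feats[u][i], train_feats[u][j])
                for (u, v), on in psi.items() if on
            )
            total += math.exp(-energy / beta)
        likelihood *= total
    return likelihood


# Hypothetical kernel term: squared difference between two vectors.
kernels = {"gaussian": lambda a, b: float(np.sum((a - b) ** 2))}
test_feats = {"gist": np.array([[0.0, 0.0]])}
train_feats = {"gist": np.array([[0.0, 0.0], [1.0, 0.0]])}

off = skl_crm_likelihood(test_feats, train_feats, {("gist", "gaussian"): 0}, kernels)
on = skl_crm_likelihood(test_feats, train_feats, {("gist", "gaussian"): 1}, kernels)
print(off, on)
```

When every entry of Ψ is zero the exponent vanishes and every training region contributes equally, which is why iteration 0 of a greedy search carries no visual information.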

Generalised Gaussian Kernel

- Shape factor $p$: traces out an infinite family of kernels

$$P(\vec{f}_i \mid \vec{f}_j) = \frac{p^{1 - 1/p}}{2\beta\,\Gamma(1/p)} \exp\left[ -\frac{1}{p} \frac{|\vec{f}_i - \vec{f}_j|^{p}}{\beta^{p}} \right]$$

  - $\Gamma$: Gamma function
  - $\beta$: kernel bandwidth parameter

[Figure: GG(x; p) plotted against x for p = 2 (Gaussian), p = 1 (Laplacian) and p = 15 (Uniform)]
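The formula can be checked numerically with a one-dimensional sketch. Treating the argument as a scalar difference `x` is a simplification of this example, and the function name `generalised_gaussian` is an assumption.

```python
import math


def generalised_gaussian(x, p, beta=1.0):
    """Generalised Gaussian density of a scalar difference x."""
    coeff = p ** (1.0 - 1.0 / p) / (2.0 * beta * math.gamma(1.0 / p))
    return coeff * math.exp(-(abs(x) ** p) / (p * beta ** p))


x = 0.5
gaussian_pdf = math.exp(-x ** 2 / 2.0) / math.sqrt(2.0 * math.pi)
laplacian_pdf = 0.5 * math.exp(-abs(x))

print(generalised_gaussian(x, p=2), gaussian_pdf)   # p = 2 recovers the Gaussian
print(generalised_gaussian(x, p=1), laplacian_pdf)  # p = 1 recovers the Laplacian

# Riemann sum over [-5, 5] to check that the p = 15 kernel still integrates to ~1.
step = 0.001
area_p15 = sum(generalised_gaussian(-5.0 + k * step, p=15) * step
               for k in range(10001))
print(area_p15)
```

Large values of `p` flatten the top of the density and sharpen its edges, which is why the slides label p = 15 as the uniform kernel.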

Multinomial Kernel

- Multinomial kernel optimised for count-based features:

$$P(\vec{f}_i \mid \vec{f}_j) = \frac{\left(\sum_{d} f_{i,d}\right)!}{\prod_{d} \left(f_{i,d}!\right)} \prod_{d} (p_{j,d})^{f_{i,d}}$$

  - $f_{i,d}$: count for bin $d$ in the unlabelled image $i$
  - $f_{j,d}$: count for the training image $j$

- Jelinek-Mercer smoothing used to estimate $p_{j,d}$:

$$p_{j,d} = \lambda \frac{f_{j,d}}{\sum_{d} f_{j,d}} + (1 - \lambda) \frac{\sum_{j} f_{j,d}}{\sum_{j,d} f_{j,d}}$$

- We also consider standard χ2 and Hellinger kernels
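A small sketch of the multinomial kernel with Jelinek-Mercer smoothing follows. Working in log space via `math.lgamma` is a design choice of this example rather than something the slides state, and the function names are assumptions. The sketch also assumes every bin has a non-zero count somewhere in the training set, so that no smoothed probability is exactly zero.

```python
import math

import numpy as np


def jelinek_mercer(train_counts, j, lam):
    """Smoothed bin probabilities p_{j,d} for training image j."""
    own = train_counts[j] / train_counts[j].sum()
    background = train_counts.sum(axis=0) / train_counts.sum()
    return lam * own + (1.0 - lam) * background


def log_multinomial_kernel(test_counts, p):
    """log P(f_i | f_j) for integer bin counts f_i under bin probabilities p."""
    n = int(test_counts.sum())
    # log of the multinomial coefficient n! / prod_d f_{i,d}!
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(int(c) + 1)
                                         for c in test_counts)
    return log_coeff + float(np.sum(test_counts * np.log(p)))


train_counts = np.array([[2.0, 1.0, 1.0],
                         [0.0, 3.0, 1.0]])
p = jelinek_mercer(train_counts, j=0, lam=0.5)
test_counts = np.array([1.0, 1.0, 0.0])
log_prob = log_multinomial_kernel(test_counts, p)
print(p, math.exp(log_prob))
```

A value of λ close to 1, such as the 0.99 reported later for HSV, keeps the estimate close to the training image's own histogram and uses the background only to avoid zero probabilities.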

Greedy kernel-feature alignment

[Animation: the kernel-feature alignment is built up greedily, one pair per iteration. Features: GIST, SIFT, LAB, HAAR. Kernels: Laplacian (p = 1), Gaussian (p = 2), Uniform (p = 15). Each frame compares a testing image with a training image and shows the alignment matrix Ψ (rows: kernels v, columns: features u) together with the resulting F1.]

Iteration 0: F1 = 0.0

            GIST  SIFT  LAB  HAAR
Laplacian    0     0     0    0
Gaussian     0     0     0    0
Uniform      0     0     0    0

Iteration 1: F1 = 0.25

            GIST  SIFT  LAB  HAAR
Laplacian    0     0     0    0
Gaussian     1     0     0    0
Uniform      0     0     0    0

Iteration 2: F1 = 0.34

            GIST  SIFT  LAB  HAAR
Laplacian    0     0     0    0
Gaussian     1     0     0    0
Uniform      0     0     0    1

Iteration 3: F1 = 0.38

            GIST  SIFT  LAB  HAAR
Laplacian    0     0     0    0
Gaussian     1     1     0    0
Uniform      0     0     0    1

Iteration 4: F1 = 0.42

            GIST  SIFT  LAB  HAAR
Laplacian    0     0     1    0
Gaussian     1     1     0    0
Uniform      0     0     0    1
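The greedy procedure shown in the animation can be sketched as forward selection over (feature, kernel) pairs. The callback `validation_f1`, the restriction to one kernel per feature, and the stopping rule are assumptions of this sketch. The toy scoring function below is made up so that the example runs; it is not an evaluation result.

```python
def greedy_alignment(features, kernels, validation_f1, max_pairs=None):
    """Greedily add (feature, kernel) pairs while validation F1 improves."""
    selected = {}  # feature -> kernel, i.e. the non-zero entries of Psi
    best_f1 = 0.0
    limit = max_pairs if max_pairs is not None else len(features)
    while len(selected) < limit:
        best_candidate = None
        for u in features:
            if u in selected:
                continue  # at most one kernel per feature in this sketch
            for v in kernels:
                trial = dict(selected)
                trial[u] = v
                score = validation_f1(trial)
                if score > best_f1:
                    best_f1, best_candidate = score, (u, v)
        if best_candidate is None:
            break  # no remaining pair improves validation F1
        selected[best_candidate[0]] = best_candidate[1]
    return selected, best_f1


# Toy, additive scoring function (illustrative numbers only).
toy_gains = {
    ("GIST", "Gaussian"): 0.25, ("HAAR", "Uniform"): 0.09,
    ("SIFT", "Gaussian"): 0.04, ("LAB", "Laplacian"): 0.04,
}


def toy_f1(alignment):
    return sum(toy_gains.get(pair, 0.0) for pair in alignment.items())


features = ["GIST", "SIFT", "LAB", "HAAR"]
kernels = ["Laplacian", "Gaussian", "Uniform"]
alignment, f1 = greedy_alignment(features, kernels, toy_f1)
print(alignment, f1)
```

Each outer iteration scores every unused feature against every kernel, so the number of validation runs grows with the product of the two set sizes.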

Evaluation

Datasets/Features

- Standard evaluation datasets:
  - Corel 5K: 5,000 images (landscapes, cities), 260 keywords
  - IAPR TC12: 19,627 images (tourism, sports), 291 keywords
  - ESP Game: 20,768 images (drawings, graphs), 268 keywords

- Standard “Tagprop” feature set [Guillaumin et al. ’09]:
  - Bag-of-words histograms: SIFT [Lowe ’04] and Hue [van de Weijer & Schmid ’06]
  - Global colour histograms: RGB, HSV, LAB
  - Global GIST descriptor [Oliva & Torralba ’01]
  - Descriptors, except GIST, also computed in a 3x1 spatial arrangement [Lazebnik et al. ’06]

Evaluation Metrics

- Standard evaluation metrics [Guillaumin et al. ’09]:
  - Mean per word Recall (R)
  - Mean per word Precision (P)
  - F1 Measure
  - Number of words with recall > 0 (N+)

- Fixed annotation length of 5 keywords
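The four metrics can be computed as in the sketch below. Taking F1 as the harmonic mean of the mean per-word precision and recall is an assumption here (the slide only says "F1 Measure"), as are the function name and the convention of scoring a word as 0 when it is never predicted or never present.

```python
def annotation_metrics(predicted, truth, vocab):
    """Return (mean precision, mean recall, F1, N+) over the vocabulary.

    predicted / truth: lists of word sets, one per test image.
    """
    precisions, recalls, n_plus = [], [], 0
    for w in vocab:
        pred_imgs = {i for i, words in enumerate(predicted) if w in words}
        true_imgs = {i for i, words in enumerate(truth) if w in words}
        correct = len(pred_imgs & true_imgs)
        precisions.append(correct / len(pred_imgs) if pred_imgs else 0.0)
        recalls.append(correct / len(true_imgs) if true_imgs else 0.0)
        if correct > 0:
            n_plus += 1  # word recalled at least once
    mean_p = sum(precisions) / len(vocab)
    mean_r = sum(recalls) / len(vocab)
    f1 = 2 * mean_p * mean_r / (mean_p + mean_r) if mean_p + mean_r > 0 else 0.0
    return mean_p, mean_r, f1, n_plus


vocab = ["tiger", "grass", "sky"]
truth = [{"tiger", "grass"}, {"sky"}]
predicted = [{"tiger", "sky"}, {"sky"}]
mean_p, mean_r, f1, n_plus = annotation_metrics(predicted, truth, vocab)
print(mean_p, mean_r, f1, n_plus)
```

Because the averages are taken per word rather than per image, rare words weigh as much as frequent ones.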

F1 score of CRM model variants

[Figure: bar chart of F1 (axis 0.00 to 0.45) for CRM, CRM 15 and SKL-CRM on Corel 5K, IAPR TC12 and ESP Game. Callouts: original CRM with Duygulu et al. features; original CRM with 15 Tagprop features, +71%; SKL-CRM with 15 Tagprop features, +45%.]

F1 score of SKL-CRM on Corel 5K

[Figure: F1 (axis 0.31 to 0.45) against feature type for SKL-CRM (Valid F1), SKL-CRM (Test F1) and Tagprop (Test F1). Feature types on the x-axis: DS, DS_V3H1, DH, DH_V3H1, GIST, HH, HH_V3H1, HS, HS_V3H1, HSV, HSV_V3H1, LAB, LAB_V3H1, RGB, RGB_V3H1.]

Optimal kernel-feature alignments on Corel 5K

- Optimal alignments¹:
  - HSV: Multinomial (λ = 0.99)
  - HSV V3H1: Generalised Gaussian (p = 0.9)
  - Harris Hue (HH V3H1): Generalised Gaussian (p = 0.1) ≈ Dirac spike!
  - Harris SIFT (HS): Gaussian
  - HS V3H1: Generalised Gaussian (p = 0.7)
  - DenseSift (DS): Laplacian

- Our data-driven kernels are more effective than standard kernels

- No alignment agrees with the literature default assignment, i.e. Gaussian for GIST, Laplacian for colour histogram, χ2 for SIFT

¹ V3H1 denotes descriptors computed in a spatial arrangement

SKL-CRM Results vs. Literature (Precision & Recall)

[Figure: bar chart of recall (R) and precision (P), axis 0.20 to 0.50, for MBRM, JEC, Tagprop, GS and SKL-CRM on Corel 5K and IAPR TC12]

SKL-CRM Results vs. Literature (N+)

[Figure: bar chart of N+ (axis 0 to 300) for MBRM, JEC, Tagprop, GS and SKL-CRM on Corel 5K and IAPR TC12]

Conclusion

Conclusions and Future Work

- Proposed a sparse kernel model for image annotation

- Key experimental findings:
  - Default kernel-feature alignment is suboptimal
  - Data-adaptive kernels are superior to standard kernels
  - A sparse set of features is just as effective as a much larger set
  - Greedy forward selection is as effective as gradient ascent

- Future work: superposition of kernels per feature type

Thank you for your attention

Sean Moran

[email protected]
