
1

Transfer Learning Algorithms for Image Classification

Ariadna Quattoni, MIT CSAIL

Advisors: Michael Collins, Trevor Darrell

2

Motivation

Goal:

We want to be able to build classifiers for thousands of visual categories, and we want to exploit rich and complex feature representations.

Problem:

We might only have a few labeled samples per category.

(Example images labeled with the words: president, actress, team.)

3

Thesis Contributions

We study efficient transfer algorithms for image classification which can exploit supervised training data from a set of related tasks.

Learn an image representation using supervised data from auxiliary tasks automatically derived from unlabeled images + meta-data.

A feature sharing transfer algorithm based on joint regularization.

An efficient algorithm for training jointly sparse classifiers in high dimensional feature spaces.

4

Outline

A joint sparse approximation model for transfer learning.

Asymmetric transfer experiments.

An efficient training algorithm.

Symmetric transfer image annotation experiments.

5

Transfer Learning: A brief overview

The goal of transfer learning is to use labeled data from related tasks to make learning easier. Two settings:

Asymmetric transfer: Resource: Large amounts of supervised data for a set of related tasks. Goal: Improve performance on a target task for which training data is scarce.

Symmetric transfer: Resource: Small amount of training data for a large number of related tasks. Goal: Improve average performance over all classifiers.

6

Transfer Learning: A brief overview

Three main approaches:

Learning intermediate latent representations: [Thrun 1996, Baxter 1997, Caruana 1997, Argyriou 2006, Amit 2007]

Learning priors over parameters: [Raina 2006, Lawrence et al. 2004]

Learning relevant shared features via joint sparse regularization: [Torralba 2004, Obozinski 2006]

7

Feature Sharing Framework:

Work with a rich representation: complex features in a high dimensional space. Some of them will be very discriminative (hopefully); most will be irrelevant.

Related problems may share relevant features.

If we knew the relevant features we could learn from fewer examples and build more efficient classifiers.

We can train classifiers from related problems together using a regularization penalty designed to promote joint sparsity.

Presenter Notes: Here is our framework for image recognition tasks. Something that works well is to use unlabeled data to generate a large set of features, some of which might be complex. The advantage is that in such a representation we expect to find features that are discriminative. However, since the features were generated without regard to the classification tasks at hand, many of them will be irrelevant.

8

Church Airport Grocery Store Flower-Shop

9

Church Airport Grocery Store Flower-Shop

10

Related Formulations of Joint Sparse Approximation

Torralba et al. [2004] developed a joint boosting algorithm based on the idea of learning additive models for each class that share weak learners.

Obozinski et al. [2006] proposed L1-2 joint penalty and developed a blockwise boosting scheme based on Boosted-Lasso.


11

Our Contribution

Previous approaches to joint sparse approximation (Torralba et al. 2004; Obozinski et al. 2006) have relied on greedy coordinate descent methods.

A new model and optimization algorithm for training jointly sparse classifiers in high dimensional feature spaces:

We propose a simple and efficient global optimization algorithm with a guaranteed convergence rate of $O\!\left(\tfrac{1}{\epsilon^{2}}\right)$.

We show that our algorithm can successfully recover jointly sparse solutions.

Our algorithm can scale to large problems involving hundreds of problems and thousands of examples and features.

We test our model on real image classification tasks, where we observe improvements in both asymmetric and symmetric transfer settings.

12

Notation

Collection of tasks: $\mathcal{D} = \{D_1, D_2, \ldots, D_m\}$, where each task $k$ has a training set $D_k = \{(x_1^k, y_1^k), \ldots, (x_{n_k}^k, y_{n_k}^k)\}$ with $x \in \mathbb{R}^d$ and $y \in \{+1, -1\}$.

Joint sparse approximation: collect the task parameters into a matrix

$$W = \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,m} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,m} \\ \vdots & & \ddots & \vdots \\ w_{d,1} & w_{d,2} & \cdots & w_{d,m} \end{bmatrix}$$

where column $k$ holds the parameters of classifier $k$ and row $i$ holds the coefficients of feature $i$ across tasks.

13

Single Task Sparse Approximation

Consider learning a single sparse linear classifier of the form:

$$f(x) = w \cdot x$$

We want a few features with non-zero coefficients.

Recent work suggests using L1 regularization:

$$\arg\min_{w} \sum_{(x,y) \in D} l(f(x), y) + Q \sum_{j=1}^{d} |w_j|$$

The first term is the classification error; the L1 term penalizes non-sparse solutions.

Donoho [2004] proved (in a regression setting) that the solution with smallest L1 norm is also the sparsest solution.
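As a concrete illustration, here is a minimal NumPy sketch of this single-task objective (hinge loss plus an L1 penalty). The function name and array shapes are our own choices for illustration, not part of the talk:

```python
import numpy as np

def l1_objective(w, X, y, Q):
    """Single-task objective from the slide: total hinge loss plus Q * ||w||_1.

    X: (n, d) feature matrix, y: (n,) labels in {-1, +1}, Q: regularization weight.
    """
    margins = y * (X @ w)                         # y_j * f(x_j) for every example
    hinge = np.maximum(0.0, 1.0 - margins).sum()  # classification error term
    return hinge + Q * np.abs(w).sum()            # L1 penalty promotes sparsity
```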

14

Joint Sparse Approximation

Setting: joint sparse approximation with one linear classifier per task, $f_k(x) = w_k \cdot x$.

$$\arg\min_{w_1, \ldots, w_m} \; \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{(x,y) \in D_k} l(f_k(x), y) \; + \; Q \, R(w_1, w_2, \ldots, w_m)$$

The first term is the average loss on training set $k$; the regularizer $R$ penalizes solutions that utilize too many features.

15

Joint Regularization Penalty

How do we penalize solutions that use too many features?

$$R(W) = \#\,\text{non-zero rows of } W$$

where, as before, column $k$ of $W$ holds the coefficients of classifier $k$ and row $i$ holds the coefficients of feature $i$ across tasks.

This row-counting penalty would lead to a hard combinatorial problem.

16

Joint Regularization Penalty

We will use an L1-∞ norm [Tropp 2006]:

$$R(W) = \sum_{i=1}^{d} \max_{k} |W_{i,k}|$$

This norm combines:

An L1 norm on the maximum absolute values of the coefficients across tasks, which promotes sparsity: use few features.

An L∞ norm on each row, which promotes non-sparsity within the rows: share features.

The combination of the two norms results in a solution where only a few features are used, but the features that are used contribute to solving many classification problems.

17

Joint Sparse Approximation

Using the L1-∞ norm we can rewrite our objective function as:

$$\min_{W} \; \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{(x,y) \in D_k} l(f_k(x), y) \; + \; Q \sum_{i=1}^{d} \max_{k} |W_{i,k}|$$

For any convex loss this is a convex objective.

For the hinge loss, $l(f(x), y) = \max(0, 1 - y f(x))$, the optimization problem can be expressed as a linear program.
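A minimal NumPy sketch of this joint objective (W is features x tasks; `tasks` is a list of per-task (X, y) pairs; these names and layouts are ours, chosen for illustration):

```python
import numpy as np

def joint_objective(W, tasks, Q):
    """Sum of per-task average hinge losses plus Q times the L1-inf penalty on W."""
    total = 0.0
    for k, (X, y) in enumerate(tasks):
        margins = y * (X @ W[:, k])
        total += np.maximum(0.0, 1.0 - margins).mean()  # average hinge loss on task k
    return total + Q * np.abs(W).max(axis=1).sum()      # L1-inf regularization term
```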

18

Joint Sparse Approximation

Linear program formulation (hinge loss):

Objective:

$$\min_{W, \varepsilon, t} \; \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{j=1}^{|D_k|} \varepsilon_j^k \; + \; Q \sum_{i=1}^{d} t_i$$

Max value constraints:

$$-t_i \le w_{i,k} \le t_i \quad \text{for } k = 1{:}m \text{ and } i = 1{:}d$$

Slack variable constraints:

$$y_j^k f_k(x_j^k) \ge 1 - \varepsilon_j^k \quad \text{and} \quad \varepsilon_j^k \ge 0 \quad \text{for } k = 1{:}m \text{ and } j = 1{:}|D_k|$$
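To make the LP concrete, here is a sketch that assembles and solves it with scipy.optimize.linprog. The variable layout and helper name are our own, and (as the later slide notes) this dense formulation is only practical at toy scale:

```python
import numpy as np
from scipy.optimize import linprog

def train_joint_sparse_lp(tasks, Q):
    """tasks: list of (X_k, y_k) with X_k of shape (n_k, d) and y_k in {-1, +1}."""
    m = len(tasks)
    d = tasks[0][0].shape[1]
    ns = [len(y) for _, y in tasks]
    n_eps = sum(ns)
    # Variable layout: [W (d*m, W[i,k] at index i*m+k), t (d), eps (n_eps)].
    n_var = d * m + d + n_eps
    c = np.zeros(n_var)
    c[d * m : d * m + d] = Q                      # Q * sum_i t_i
    off = d * m + d
    for n_k in ns:
        c[off:off + n_k] = 1.0 / n_k              # average slack per task
        off += n_k

    rows, rhs = [], []
    # Max value constraints: W[i,k] - t_i <= 0 and -W[i,k] - t_i <= 0.
    for i in range(d):
        for k in range(m):
            for s in (1.0, -1.0):
                r = np.zeros(n_var)
                r[i * m + k] = s
                r[d * m + i] = -1.0
                rows.append(r); rhs.append(0.0)
    # Slack constraints: -y_j * (w_k . x_j) - eps_j <= -1.
    off = d * m + d
    for k, (X, y) in enumerate(tasks):
        for j in range(len(y)):
            r = np.zeros(n_var)
            r[[i * m + k for i in range(d)]] = -y[j] * X[j]
            r[off + j] = -1.0
            rows.append(r); rhs.append(-1.0)
        off += len(y)

    bounds = [(None, None)] * (d * m) + [(0, None)] * d + [(0, None)] * n_eps
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=bounds, method="highs")
    return res.x[:d * m].reshape(d, m)            # recovered W, features x tasks
```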

19

Outline

A joint sparse approximation model for transfer learning.

Asymmetric transfer experiments.

An efficient training algorithm.

Symmetric transfer image annotation experiments.

20

Setting: Asymmetric Transfer

Topics: SuperBowl, Danish Cartoons, Sharon, Australian Open, Trapped Miners, Golden Globes, Grammys, Figure Skating, Academy Awards, Iraq.

Learn a representation using labeled data from 9 topics:

Learn the matrix W using our transfer algorithm.

Define the set of relevant features to be: $R = \{\, r : \max_k |w_{r,k}| > 0 \,\}$

Train a classifier for the 10th held-out topic using the relevant features in R only.
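A one-line NumPy version of this relevant-feature selection (our own illustration; a small tolerance stands in for exact zero):

```python
import numpy as np

def relevant_features(W, tol=1e-8):
    """Indices r with max_k |W[r, k]| > 0: features used by at least one auxiliary task."""
    return np.flatnonzero(np.abs(W).max(axis=1) > tol)
```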

21

Results

(Figure: average AUC vs. number of training samples, 4 to 140, for the asymmetric transfer experiment, comparing the baseline representation with the transferred representation.)

22

An efficient training algorithm

The LP formulation can be optimized using standard LP solvers.

The LP formulation is feasible for small problems but becomes intractable for larger data-sets with thousands of examples and dimensions.

We might want a more general optimization algorithm that can handle arbitrary convex losses.

23

Outline

A joint sparse approximation model for transfer learning.

Asymmetric transfer experiments.

An efficient training algorithm.

Symmetric transfer image annotation experiments.

24

L1-∞ Regularization: Constrained Convex Optimization Formulation

$$\arg\min_{W} \; \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{(x,y) \in D_k} l(f_k(x), y) \quad \text{s.t.} \quad \sum_{i=1}^{d} \max_{k} |W_{i,k}| \le C$$

A convex function minimized subject to convex constraints.

We will use a Projected SubGradient method. Main advantages: simple, scalable, guaranteed convergence rates.

Projected subgradient methods have recently been proposed for: L2 regularization, i.e. SVM [Shalev-Shwartz et al. 2007]; L1 regularization [Duchi et al. 2008].
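A minimal sketch of the projected subgradient loop for this constrained problem, assuming a projection routine like the one sketched after the complexity slide. The step-size schedule, iteration count, and data layout are our own choices:

```python
import numpy as np

def projected_subgradient(tasks, C, project, T=500, eta0=1.0):
    """Minimize the sum of per-task average hinge losses s.t. the L1-inf constraint."""
    d = tasks[0][0].shape[1]
    m = len(tasks)
    W = np.zeros((d, m))
    for t in range(1, T + 1):
        G = np.zeros_like(W)
        for k, (X, y) in enumerate(tasks):
            margins = y * (X @ W[:, k])
            viol = margins < 1.0                         # examples with non-zero hinge loss
            if viol.any():
                G[:, k] = -(X[viol] * y[viol, None]).sum(axis=0) / len(y)
        W = project(W - (eta0 / np.sqrt(t)) * G, C)      # subgradient step, then projection
    return W
```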

25

Euclidean Projection into the L1-∞ ball

Snapshot of the idea:

We map the projection to a simpler problem which involves finding new maximums for each feature across tasks and using them to truncate the original matrix.

The total mass removed from a feature across tasks should be the same for all features whose coefficients don't become zero.

26

Euclidean Projection into the L1-∞ ball

27

Characterization of the solution


28

Characterization of the solution

(Figure: per-feature coefficient magnitudes with new maximums $\mu_2$, $\mu_3$, $\mu_4$ for Features I to IV.)

For feature 2, the mass removed across tasks is

$$\theta = \sum_{j \,:\, A_{2,j} > \mu_2} \left( A_{2,j} - \mu_2 \right)$$

29

Mapping to a simpler problem

We can map the projection problem to the following problem, which finds the optimal maximums μ:

30

Efficient Algorithm, in pictures

Example: 4 features, 6 problems, $C = 14$, and $\sum_{i=1}^{d} \max_{k} |A_{i,k}| = 29$.

Presenter Notes: It turns out that given a θ we can compute the new maximums of each group of parameters. For example, if θ = 8, this would be the new maximum for the first group, because it cuts three units from the first parameter, two units from the second and third, and one unit from the fourth; it sums to eight. A similar thing holds for the second group. For the third group, the parameters sum to 7, so the new maximum goes to 0 (the special condition). If we sum these functions (which map θ to the maximums), we get the L1-∞ norm of the new matrix B, so we have a function that maps a θ to the new norm. The algorithm is as follows: it starts with θ at zero (μ at the original maximums) and keeps increasing θ, decreasing the maximums, until they sum to C. It turns out that μ as a function of θ is a piecewise linear function, and so is the sum. To compute this function, the dominant computational cost is sorting the entries of A to obtain the intervals of the function.

31

Complexity

The total cost of the algorithm is dominated by sorting the entries of A.

The total cost is in the order of $O\big(dm \log(dm)\big)$.

Notice that we only need to consider the non-zero entries of A, so the computational cost is dominated by the number of non-zero entries.
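The sorting-based algorithm above is the efficient one; as a readable stand-in, here is a bisection sketch of the same projection, built directly on the earlier characterization (a common removed mass θ per surviving feature, new per-feature maximums μ). It is slower than the talk's O(dm log(dm)) algorithm but easy to check:

```python
import numpy as np

def project_l1_inf(W, C, tol=1e-6):
    """Euclidean projection of W (features x tasks) onto {B : sum_i max_k |B_ik| <= C}.

    Bisection sketch: truncate each feature's coefficients at a new maximum mu_i,
    where the mass removed from every surviving feature equals a common theta.
    """
    A = np.abs(W)
    if A.max(axis=1).sum() <= C:
        return W                                    # already inside the ball
    row_max = A.max(axis=1)
    row_mass = A.sum(axis=1)

    def mu_of_theta(theta):
        # For each feature, find mu_i with sum_k max(A_ik - mu_i, 0) = theta.
        mu = np.zeros(A.shape[0])
        for i in range(A.shape[0]):
            if row_mass[i] <= theta:
                mu[i] = 0.0                         # feature is zeroed out entirely
                continue
            lo, hi = 0.0, row_max[i]
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                removed = np.maximum(A[i] - mid, 0.0).sum()
                if removed > theta:
                    lo = mid                        # need a larger maximum
                else:
                    hi = mid
            mu[i] = 0.5 * (lo + hi)
        return mu

    lo, hi = 0.0, row_mass.max()
    while hi - lo > tol:
        theta = 0.5 * (lo + hi)
        if mu_of_theta(theta).sum() > C:
            lo = theta                              # remove more mass
        else:
            hi = theta
    mu = mu_of_theta(0.5 * (lo + hi))
    return np.sign(W) * np.minimum(np.abs(W), mu[:, None])
```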

32

Outline

A joint sparse approximation model for transfer learning.

Asymmetric transfer experiments.

An efficient training algorithm.

Symmetric transfer image annotation experiments.

33

Synthetic Experiments

Generate a jointly sparse parameter matrix W.

For every task we generate pairs $(x_i^k, y_i^k)$, where $y_i^k = \mathrm{sign}(w_k^{\top} x_i^k)$.

We compared three different types of regularization (i.e. projections):

L1-∞ projection

L2 projection

L1 projection
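A sketch of this synthetic setup in NumPy (sizes follow the next slide's experiment: 60 problems, 200 features, 10% relevant; the per-task sample count and the noise-free labels are our simplifying choices):

```python
import numpy as np

def make_jointly_sparse_data(m=60, d=200, relevant_frac=0.10, n_per_task=80, seed=0):
    """Jointly sparse W (only a shared subset of rows is non-zero) and per-task data."""
    rng = np.random.default_rng(seed)
    n_rel = int(relevant_frac * d)
    relevant = rng.choice(d, size=n_rel, replace=False)
    W = np.zeros((d, m))
    W[relevant] = rng.normal(size=(n_rel, m))       # only the shared rows are non-zero
    tasks = []
    for k in range(m):
        X = rng.normal(size=(n_per_task, d))
        y = np.sign(X @ W[:, k])                    # y = sign(w_k . x)
        y[y == 0] = 1.0
        tasks.append((X, y))
    return W, tasks
```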

34

Synthetic Experiments

Synthetic Experiments Results: 60 problems, 200 features, 10% relevant.

(Figure, left: test error vs. number of training examples per task, 10 to 640, for L2, L1, and L1-∞ regularization. Figure, right: feature selection performance, precision and recall of the recovered relevant features for L1 and L1-∞.)

35

Dataset: Image Annotation

40 top content words.

Raw image representation: Vocabulary Tree (Grauman and Darrell 2005; Nister and Stewenius 2006), 11000 dimensions.

(Example images labeled with the words: president, actress, team.)

36

Experiments: Vocabulary Tree representation

Perform hierarchical k-means on patch descriptors to build the vocabulary tree.

To compute a representation for an image:

Find patches.

Map each patch to a feature vector.

Map each patch to its closest cluster in each level.

$$x = [\,\#c_{1,1}, \ldots, \#c_{1,p_1}, \ldots, \#c_{l,1}, \ldots, \#c_{l,p_l}\,]$$

i.e. the concatenation, over levels, of the counts of patches assigned to each cluster.
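A toy sketch of a two-level vocabulary-tree histogram using scikit-learn's KMeans (branch factor, shapes, and function name are our own; the representation used in the talk is the full 11000-dimensional tree of Nister and Stewenius):

```python
import numpy as np
from sklearn.cluster import KMeans

def vocab_tree_histogram(train_patches, image_patches, branch=10):
    """Two-level hierarchical k-means codebook; returns one image's per-level counts.

    Assumes every level-1 node receives at least `branch` training patches.
    """
    # Level 1: cluster all training patch descriptors.
    km1 = KMeans(n_clusters=branch, n_init=10, random_state=0).fit(train_patches)
    # Level 2: cluster the training patches falling into each level-1 node.
    km2 = [KMeans(n_clusters=branch, n_init=10, random_state=0)
           .fit(train_patches[km1.labels_ == c]) for c in range(branch)]
    # Quantize the image's patches at both levels and count cluster hits.
    a1 = km1.predict(image_patches)
    h1 = np.bincount(a1, minlength=branch).astype(float)
    h2 = np.zeros(branch * branch)
    for c in range(branch):
        sel = image_patches[a1 == c]
        if len(sel):
            a2 = km2[c].predict(sel)
            h2[c * branch:(c + 1) * branch] += np.bincount(a2, minlength=branch)
    return np.concatenate([h1, h2])
```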

37

Results

38

Results


39

Results:

40

41

Summary of Thesis Contributions

We presented a method that learns efficient image representations using unlabeled images + meta-data.

We developed a feature sharing transfer algorithm based on performing a joint loss minimization over the training sets of related tasks with a shared regularization.

Previous approaches to joint sparse approximation have relied on greedy coordinate descent methods.

We propose a simple and efficient global optimization algorithm for training joint models with L1-∞ constraints.

We show the performance of our transfer algorithm on real image classification tasks in both asymmetric and symmetric transfer settings.

We provide a tool that makes implementing a joint sparsity regularization penalty as easy and almost as efficient as implementing the standard L1 and L2 penalties.

42

Future Work

Task Clustering.

Online Optimization.

Generalization properties of L1−∞ regularized models.

Combining feature representations.

43

Thanks!