
Optimal Reverse Prediction: Linli Xu, Martha White and Dale Schuurmans (ICML 2009, Best Overall Paper Honorable Mention)
A Unified Perspective on Supervised, Unsupervised and Semi-supervised Learning
Discussion led by Chunping Wang, ECE, Duke University
October 23, 2009


TRANSCRIPT

Page 1:

Optimal Reverse Prediction:
Linli Xu, Martha White and Dale Schuurmans
ICML 2009, Best Overall Paper Honorable Mention

A Unified Perspective on Supervised, Unsupervised and Semi-supervised Learning

Discussion led by Chunping Wang, ECE, Duke University

October 23, 2009

Page 2: Outline

• Motivations
• Preliminary Foundations
• Reverse Supervised Least Squares
• Relationship between Unsupervised Least Squares and PCA, K-means, and Normalized Graph-cut
• Semi-supervised Least Squares
• Experiments
• Conclusions

Page 3: Motivations

• Lack of a foundational connection between supervised and unsupervised learning

Supervised learning: minimizing prediction error

Unsupervised learning: re-representing the input data

• For semi-supervised learning, one needs to consider both together

• The semi-supervised learning literature relies on intuitions: the “cluster assumption” and the “manifold assumption”

• A unification demonstrated in this paper leads to a novel semi-supervised principle

Page 4: Preliminary Foundations: Forward Supervised Least Squares

• Data:
  - an input matrix $X \in \mathbb{R}^{t \times n}$ and an output matrix $Y$
  - $t$ instances, $n$ features, $k$ responses
  - regression: $Y \in \mathbb{R}^{t \times k}$
  - classification: $Y \in \{0,1\}^{t \times k}$ with $Y\mathbf{1} = \mathbf{1}$ (one indicator per row)
  - assumption: $X$ and $Y$ are full rank
• Problem:
  - find the parameters $W \in \mathbb{R}^{n \times k}$ minimizing the least squares loss for a linear model $f_W: X \rightarrow Y$

Page 5: Preliminary Foundations (continued)

• Linear: $\min_W \mathrm{tr}[(XW - Y)'(XW - Y)]$, with solution $W = (X'X)^{-1}X'Y$
• Ridge regularization: $\min_W \mathrm{tr}[(XW - Y)'(XW - Y)] + \beta\,\mathrm{tr}[W'W]$, with solution $W = (X'X + \beta I)^{-1}X'Y$
• Kernelization (with $K = XX'$ and $W = X'A$): $\min_A \mathrm{tr}[(KA - Y)'(KA - Y)] + \beta\,\mathrm{tr}[A'KA]$, with solution $A = (K + \beta I)^{-1}Y$
• Instance weighting (with weight matrix $\Lambda$): $\min_A \mathrm{tr}[\Lambda(KA - Y)(KA - Y)'] + \beta\,\mathrm{tr}[A'KA]$, with solution $A = (K + \beta\Lambda^{-1})^{-1}Y$
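The closed forms above can be checked directly. Below is a minimal numpy sketch (illustrative only, not from the paper or the slides; the data shapes and the ridge parameter beta are arbitrary) computing the linear, ridge, and kernelized solutions and confirming that the kernel (dual) solution matches the primal ridge solution.

```python
import numpy as np

# Illustrative sketch of the forward least squares closed forms (synthetic data).
rng = np.random.default_rng(0)
t, n, k, beta = 50, 5, 3, 0.1
X = rng.normal(size=(t, n))
Y = rng.normal(size=(t, k))

# Plain linear least squares: W = (X'X)^{-1} X'Y
W_ls = np.linalg.solve(X.T @ X, X.T @ Y)

# Ridge regression: W = (X'X + beta*I)^{-1} X'Y
W_ridge = np.linalg.solve(X.T @ X + beta * np.eye(n), X.T @ Y)

# Kernelized form with a linear kernel K = XX': A = (K + beta*I)^{-1} Y, W = X'A
K = X @ X.T
A = np.linalg.solve(K + beta * np.eye(t), Y)
print(np.allclose(X.T @ A, W_ridge))   # dual and primal ridge solutions agree
```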

Page 6: Preliminary Foundations: PCA, k-means, Normalized Graph-cut

• Principal Components Analysis (dimensionality reduction):
  $Z = XW$, where $W = Q_k^{\max}(X'X)$ is the matrix of top-$k$ eigenvectors of $X'X$
• k-means (clustering):
  $S^* = \arg\min_S \sum_{i=1}^{k} \sum_{x_j \in S_i} \| x_j - \frac{1}{|S_i|}\sum_{x_l \in S_i} x_l \|^2$
• Normalized Graph-cut (clustering):
  Weighted undirected graph $G = (V, E, A)$: nodes $V$, edges $E$, affinity matrix $A$.
  Graph partition problem: find a partition minimizing the total weight of edges connecting nodes in distinct subsets.
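As a quick illustration of the first two objects above, here is a small numpy sketch (not from the paper; the data and the partition are synthetic) that forms the PCA re-representation $Z = XW$ from the top-$k$ eigenvectors of $X'X$ and evaluates the k-means objective for a given hard partition.

```python
import numpy as np

# Illustrative sketch: PCA re-representation and the k-means objective.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
k = 2

# PCA: Z = X W, with W the top-k eigenvectors of X'X (no centering, as on the slide)
evals, evecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
W = evecs[:, -k:]                        # top-k eigenvectors
Z = X @ W

# k-means objective for a given hard partition (assumes every cluster is non-empty)
assign = rng.integers(0, k, size=X.shape[0])
obj = sum(np.sum((X[assign == j] - X[assign == j].mean(axis=0)) ** 2)
          for j in range(k))
print(Z.shape, obj)
```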

Page 7: Preliminary Foundations: Normalized Graph-cut

• Partition indicator matrix $Z$ (constraint): $Z_{ij} = 1$ if $i \in S_j$, $Z_{ij} = 0$ otherwise, for $i = 1,\dots,t$ and $j = 1,\dots,k$
• Weighted degree matrix: $\Lambda = \mathrm{diag}(A\mathbf{1})$
• Total cut (objective): $C(Z) = \frac{1}{2}\sum_{j=1}^{k} z_j' A(\mathbf{1} - z_j) = \frac{1}{2}\mathrm{tr}[Z'LZ]$, with $L = \Lambda - A$
• Normalized cut (objective): $NC(Z) = \sum_{j=1}^{k} \frac{z_j' A(\mathbf{1} - z_j)}{z_j' \Lambda z_j} = \mathrm{tr}[(Z'\Lambda Z)^{-1} Z'LZ]$

From Xing & Jordan, 2003
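The two cut objectives can be evaluated directly in the matrix forms above. The sketch below is illustrative only (random affinity matrix, random partition); it assumes every cluster is non-empty so that $Z'\Lambda Z$ is invertible.

```python
import numpy as np

# Illustrative sketch: total cut and normalized cut of a hard partition.
rng = np.random.default_rng(2)
t, k = 30, 3
A = rng.random((t, t)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)   # affinity matrix

Z = np.zeros((t, k))
Z[np.arange(t), rng.integers(0, k, size=t)] = 1   # partition indicator, Z1 = 1

Lam = np.diag(A.sum(axis=1))   # weighted degree matrix, diag(A 1)
L = Lam - A                    # graph Laplacian

total_cut = 0.5 * np.trace(Z.T @ L @ Z)
norm_cut = np.trace(np.linalg.inv(Z.T @ Lam @ Z) @ (Z.T @ L @ Z))
print(total_cut, norm_cut)
```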

Page 8: First contribution

[Diagram, "In literature": the known relationships among supervised least squares regression, least squares classification, and the unsupervised methods Principal Component Analysis, k-means, and normalized graph-cut.]

Page 9: First contribution

[Diagram, "This paper": the unification proposed in the paper, connecting supervised least squares regression and classification with Principal Component Analysis, k-means, and normalized graph-cut.]

Page 10: Reverse Supervised Least Squares

• Traditional forward least squares: predict the outputs from the inputs
  $\min_W \mathrm{tr}[(XW - Y)'(XW - Y)]$
• Reverse least squares: predict the inputs from the outputs
  $\min_U \mathrm{tr}[(X - YU)'(X - YU)]$

Given the reverse solution $U$, the corresponding forward solution $W$ can be recovered exactly, provided $X$ is full rank: $W = (X'X)^{-1}U'Y'Y$.
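The exact recovery claim is easy to verify numerically. Below is a minimal sketch (illustrative, not the authors' code): it fits the reverse model, recovers the forward solution from it via $W = (X'X)^{-1}U'Y'Y$, and compares against the directly fitted forward solution.

```python
import numpy as np

# Illustrative sketch: recover the forward solution from the reverse solution.
rng = np.random.default_rng(3)
t, n, k = 40, 6, 3
X = rng.normal(size=(t, n))
Y = rng.normal(size=(t, k))

W_forward = np.linalg.solve(X.T @ X, X.T @ Y)    # argmin_W ||XW - Y||^2
U_reverse = np.linalg.solve(Y.T @ Y, Y.T @ X)    # argmin_U ||X - YU||^2

W_recovered = np.linalg.solve(X.T @ X, U_reverse.T @ Y.T @ Y)   # (X'X)^{-1} U'Y'Y
print(np.allclose(W_forward, W_recovered))        # True
```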

Page 11: Reverse Supervised Least Squares (continued)

• Ridge regularization
  Reverse problem: $\min_U \mathrm{tr}[(X - YU)'(X - YU)]$
  Recover: $W = (X'X + \beta I)^{-1} U'Y'Y$
• Kernelization (with $K = XX'$)
  Reverse problem: $\min_B \mathrm{tr}[(I - YB)K(I - YB)']$
  Recover: $A = (K + \beta I)^{-1} B'Y'Y$
• Instance weighting
  Reverse problem: a weighted version of the kernelized reverse problem with weighting matrix $\Lambda$
  Recover: the dual parameters $A$ are recovered analogously
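The kernelized recovery works the same way. A small sketch (illustrative; linear kernel, arbitrary beta): the reverse solution $B = (Y'Y)^{-1}Y'$ satisfies $B'Y'Y = Y$, so the recovered dual solution $A = (K + \beta I)^{-1}B'Y'Y$ coincides with the forward kernel ridge solution $A = (K + \beta I)^{-1}Y$.

```python
import numpy as np

# Illustrative sketch: kernelized reverse prediction and recovery of A.
rng = np.random.default_rng(4)
t, n, k, beta = 40, 6, 3, 0.1
X = rng.normal(size=(t, n))
Y = rng.normal(size=(t, k))
K = X @ X.T                                   # linear kernel

B = np.linalg.solve(Y.T @ Y, Y.T)             # a minimizer of tr[(I - YB) K (I - YB)']

A_recovered = np.linalg.solve(K + beta * np.eye(t), B.T @ Y.T @ Y)
A_direct = np.linalg.solve(K + beta * np.eye(t), Y)
print(np.allclose(A_recovered, A_direct))      # True
```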

Page 12: Reverse Supervised Least Squares

For supervised learning with least squares loss

forward and reverse perspectives are equivalent

each can be recovered exactly from the other

the forward and reverse losses are not identical since they are measured in different units – it is not principled to combine them directly!

Page 13: Unsupervised Least Squares

• Unsupervised learning: no training labels Y are given
• Principle: optimize over guessed labels $Z$, with $X \in \mathbb{R}^{t \times n}$, $Z \in \mathbb{R}^{t \times k}$ (or $Z \in \{0,1\}^{t \times k}$), $U \in \mathbb{R}^{k \times n}$

• Forward: $\min_Z \min_W \mathrm{tr}[(XW - Z)'(XW - Z)]$
  For any $W$ we can choose $Z = XW$ to achieve zero loss, so it only gives trivial solutions. It does not work!
• Reverse: $\min_Z \min_U \mathrm{tr}[(X - ZU)'(X - ZU)]$
  It gives non-trivial solutions.
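A tiny numerical illustration of the asymmetry (synthetic data; not from the paper): the forward unsupervised loss can always be driven to zero by choosing $Z = XW$, whereas the reverse loss with a rank-$k$ guessed label matrix generally cannot.

```python
import numpy as np

# Illustrative sketch: trivial forward solutions vs. non-trivial reverse solutions.
rng = np.random.default_rng(5)
t, n, k = 50, 8, 2
X = rng.normal(size=(t, n))

# Forward: pick any W, set Z = XW, and the loss is exactly zero (trivial).
W = rng.normal(size=(n, k))
Z_trivial = X @ W
print(np.sum((X @ W - Z_trivial) ** 2))       # 0.0

# Reverse: for a fixed Z, the best U leaves the residual of projecting X onto
# the k-dimensional column space of Z, which is non-zero in general (k < n).
Z = rng.normal(size=(t, k))
U = np.linalg.lstsq(Z, X, rcond=None)[0]      # argmin_U ||X - ZU||^2
print(np.sum((X - Z @ U) ** 2))               # > 0
```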

Page 14: Unsupervised Least Squares: PCA

Proposition 1. Unconstrained reverse prediction
$\min_Z \min_U \mathrm{tr}[(X - ZU)'(X - ZU)]$
is equivalent to principal components analysis.

This connection was made in Jong & Kotz, 1999; the authors extend it to the kernelized case.

Corollary 1. Kernelized reverse prediction
$\min_Z \min_B \mathrm{tr}[(I - ZB)K(I - ZB)']$
is equivalent to kernel principal components analysis.

Page 15: Unsupervised Least Squares: PCA

Proposition 1. Unconstrained reverse prediction $\min_Z \min_U \mathrm{tr}[(X - ZU)'(X - ZU)]$ is equivalent to principal components analysis.

Proof. For fixed $Z$, the inner minimizer is $U^* = \arg\min_U \mathrm{tr}[(X - ZU)'(X - ZU)] = (Z'Z)^{-1}Z'X$, so
$\min_Z \mathrm{tr}[(X - ZU^*)'(X - ZU^*)] = \min_Z \mathrm{tr}[(I - Z(Z'Z)^{-1}Z')XX']$.

Page 16: Unsupervised Least Squares: PCA (proof, continued)

$\min_Z \mathrm{tr}[(I - Z(Z'Z)^{-1}Z')XX'] = \max_Z \mathrm{tr}[Z(Z'Z)^{-1}Z'XX']$

Recall that $R(Z) = Z(Z'Z)^{-1}Z'$. The solution for $Z$ is not unique:
$R(ZT) = ZT(T'Z'ZT)^{-1}T'Z' = Z(Z'Z)^{-1}Z' = R(Z)$ for any invertible $T$.

Page 17: Unsupervised Least Squares: PCA (proof, continued)

Consider the SVD of $Z$: $Z = P\Sigma Q'$, with $P'P = I_k$, $Q'Q = I_k$, and $\Sigma$ diagonal. Then $R(Z) = Z(Z'Z)^{-1}Z' = PP'$.

The objective becomes $\max_{P: P'P = I_k} \mathrm{tr}[PP'XX'] = \max_{P: P'P = I_k} \mathrm{tr}[P'XX'P]$.

Solution: $P = Q_k^{\max}(XX')$, the top-$k$ eigenvectors of $XX'$, which recovers PCA.
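The equivalence can also be checked numerically: the optimal value of the unconstrained reverse objective is the best rank-$k$ approximation error of $X$, which equals the PCA reconstruction error (no centering, matching the slide's formulation). A minimal sketch, illustrative only:

```python
import numpy as np

# Illustrative sketch: min_{Z,U} ||X - ZU||_F^2 equals the rank-k PCA reconstruction error.
rng = np.random.default_rng(6)
t, n, k = 60, 10, 3
X = rng.normal(size=(t, n))

# Best rank-k factorization via SVD (Eckart-Young)
P, s, Vt = np.linalg.svd(X, full_matrices=False)
reverse_loss = np.sum(s[k:] ** 2)             # optimal ||X - ZU||_F^2

# PCA reconstruction error with the top-k principal directions (eigenvectors of X'X)
W = Vt[:k].T
pca_loss = np.sum((X - X @ W @ W.T) ** 2)
print(np.isclose(reverse_loss, pca_loss))      # True
```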

Page 18: Unsupervised Least Squares: k-means

Proposition 2. Constrained reverse prediction
$\min_{Z: Z \in \{0,1\}^{t \times k},\ Z\mathbf{1} = \mathbf{1}}\ \min_U \mathrm{tr}[(X - ZU)'(X - ZU)]$
is equivalent to k-means clustering.

The connection between PCA and k-means clustering has been made in Ding & He, 2004, but the authors show the connection of both to supervised (reverse) least squares.

Corollary 2. Constrained kernelized reverse prediction
$\min_{Z: Z \in \{0,1\}^{t \times k},\ Z\mathbf{1} = \mathbf{1}}\ \min_B \mathrm{tr}[(I - ZB)K(I - ZB)']$
is equivalent to kernel k-means.

Page 19: Unsupervised Least Squares: k-means (proof)

Proof. Substituting the inner solution $U^* = (Z'Z)^{-1}Z'X$ gives the equivalent problem
$\min_{Z: Z \in \{0,1\}^{t \times k},\ Z\mathbf{1} = \mathbf{1}} \mathrm{tr}[(X - Z(Z'Z)^{-1}Z'X)'(X - Z(Z'Z)^{-1}Z'X)]$

Consider the factors of $Z(Z'Z)^{-1}Z'X$:
• $Z'Z$ is a diagonal matrix whose entries count the data points in each class
• $Z'X$ is a $k \times n$ matrix whose rows are the sums of the data points in each class

Page 20: Unsupervised Least Squares: k-means (proof, continued)

• $(Z'Z)^{-1}Z'X$ is a $k \times n$ matrix of class means: row $j$ is the mean of class $j$ (the "means")
• $Z(Z'Z)^{-1}Z'X$ is a $t \times n$ matrix in which row $i$ is the mean of the class containing instance $i$ (the "encoding")

Page 21: Unsupervised Least Squares: k-means (proof, continued)

Therefore
$\min_{Z: Z \in \{0,1\}^{t \times k},\ Z\mathbf{1} = \mathbf{1}} \mathrm{tr}[(X - Z(Z'Z)^{-1}Z'X)'(X - Z(Z'Z)^{-1}Z'X)] = \min_S \sum_{i=1}^{k} \sum_{x_j \in S_i} \| x_j - \frac{1}{|S_i|}\sum_{x_l \in S_i} x_l \|^2$,
which is exactly the k-means objective.
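The identity underlying the proof is easy to confirm numerically: with a class-indicator $Z$, the optimal reverse reconstruction $Z(Z'Z)^{-1}Z'X$ replaces each row of $X$ by its class mean, so the residual is exactly the k-means objective. A minimal sketch (illustrative; assumes every class is non-empty):

```python
import numpy as np

# Illustrative sketch: constrained reverse loss equals the k-means objective.
rng = np.random.default_rng(7)
t, n, k = 50, 4, 3
X = rng.normal(size=(t, n))
assign = rng.integers(0, k, size=t)
Z = np.zeros((t, k)); Z[np.arange(t), assign] = 1   # Z in {0,1}^{t x k}, Z1 = 1

means = np.linalg.solve(Z.T @ Z, Z.T @ X)           # (Z'Z)^{-1} Z'X: class means
reverse_loss = np.sum((X - Z @ means) ** 2)

kmeans_obj = sum(np.sum((X[assign == j] - X[assign == j].mean(axis=0)) ** 2)
                 for j in range(k))
print(np.isclose(reverse_loss, kmeans_obj))          # True
```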

Page 22: Unsupervised Least Squares: Normalized Cut

Proposition 3. For a doubly nonnegative matrix $K$ and weighting $\Lambda = \mathrm{diag}(K\mathbf{1})$, weighted reverse prediction
$\min_{Z: Z \in \{0,1\}^{t \times k},\ Z\mathbf{1} = \mathbf{1}}\ \min_B \mathrm{tr}[\Lambda(\Lambda^{-1}K - ZB)K^{-1}(\Lambda^{-1}K - ZB)']$
is equivalent to normalized graph-cut.

Proof. For any $Z$, the solution to the inner minimization is $B^* = (Z'\Lambda Z)^{-1}Z'K$.

Substituting $B^*$ gives the reduced objective
$\min_{Z: Z \in \{0,1\}^{t \times k},\ Z\mathbf{1} = \mathbf{1}} \mathrm{tr}[\Lambda^{-1}K] - \mathrm{tr}[(Z'\Lambda Z)^{-1}Z'KZ]$,
and since the first term does not depend on $Z$, this is equivalent to
$\min_{Z: Z \in \{0,1\}^{t \times k},\ Z\mathbf{1} = \mathbf{1}} -\mathrm{tr}[(Z'\Lambda Z)^{-1}Z'KZ]$.

Page 23: Unsupervised Least Squares: Normalized Cut (proof, continued)

Recall the normalized cut (from Xing & Jordan, 2003):
$NC(Z) = k - \mathrm{tr}[(Z'\Lambda Z)^{-1}Z'KZ]$

Since $K$ is doubly nonnegative, it can serve as an affinity matrix, with weighted degree matrix $\Lambda = \mathrm{diag}(K\mathbf{1})$.

The reduced objective above is therefore equivalent to normalized graph-cut.

Page 24: Unsupervised Least Squares: Normalized Cut

Corollary 3. The weighted least squares problem
$\min_{Z: Z \in \{0,1\}^{t \times k},\ Z\mathbf{1} = \mathbf{1}}\ \min_U \mathrm{tr}[\Lambda(\Lambda^{-1}X - ZU)(\Lambda^{-1}X - ZU)']$
is equivalent to normalized graph-cut on $K = XX'$ if $K \geq 0$.

With this specific $K$, normalized graph-cut is related to reverse least squares.

Page 25: Second contribution

[Diagram, taken from Xu's slides: reverse prediction unifies supervised least squares learning with the unsupervised methods Principal Component Analysis, k-means, and normalized graph-cut.]

Page 26: Second contribution

[Diagram, taken from Xu's slides: the same unification, extended with the new semi-supervised principle.]

Pages 27-30: Semi-supervised Least Squares

A principled approach: reverse loss decomposition.

[Figure, taken from Xu's slides (shown in several animation steps): data points $x_1, \dots, x_4$ and their reconstructions from the true labels, illustrating the supervised reverse losses $\mathrm{tr}[(X - YU)'(X - YU)]$.]

Page 31: Semi-supervised Least Squares

A principled approach: reverse loss decomposition.

[Figure, taken from Xu's slides: for the point $x_3$, the reconstruction $\hat{x}_3$ from its true label and the reconstruction $x_3^*$ from the optimally guessed label decompose the loss.]

Supervised reverse losses: $\mathrm{tr}[(X - YU)'(X - YU)]$
Unsupervised reverse losses: $\mathrm{tr}[(X - ZU)'(X - ZU)]$
Per-point decomposition: $\|x_3 - \hat{x}_3\|^2 = \|x_3 - x_3^*\|^2 + \|x_3^* - \hat{x}_3\|^2$

Page 32: Semi-supervised Least Squares

Proposition 4. For any $X$, $Y$, and $U$,
$\mathrm{tr}[(X - YU)'(X - YU)] = \mathrm{tr}[(X - Z^*U)'(X - Z^*U)] + \mathrm{tr}[(Z^*U - YU)'(Z^*U - YU)]$
(supervised loss = unsupervised loss + squared distance), where $Z^* = \arg\min_Z \mathrm{tr}[(X - ZU)'(X - ZU)]$.

The unsupervised loss depends only on the input data $X$; the squared distance depends on both $X$ and $Y$.

Note: we cannot compute the true supervised loss since we do not have all the labels $Y$. We may estimate it using only the labeled data, or also using auxiliary unlabeled data.
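Proposition 4 is a Pythagorean identity and holds for arbitrary matrices, so it can be checked directly. A minimal numerical sketch (illustrative only):

```python
import numpy as np

# Illustrative sketch: check the reverse loss decomposition of Proposition 4.
rng = np.random.default_rng(8)
t, n, k = 30, 5, 2
X = rng.normal(size=(t, n))
Y = rng.normal(size=(t, k))
U = rng.normal(size=(k, n))

# Z* = argmin_Z ||X - ZU||^2 (unconstrained): Z* = X U'(UU')^{-1}
Z_star = X @ U.T @ np.linalg.inv(U @ U.T)

supervised = np.sum((X - Y @ U) ** 2)
unsupervised = np.sum((X - Z_star @ U) ** 2)
sq_distance = np.sum((Z_star @ U - Y @ U) ** 2)
print(np.isclose(supervised, unsupervised + sq_distance))   # True
```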

Page 33: Semi-supervised Least Squares

Corollary 4. For any $U$,
$\mathrm{E}\big[\frac{1}{T_L}\|X_L - Y_L U\|_F^2\big] = \mathrm{E}\big[\frac{1}{T_U}\|X_U - Z_U^* U\|_F^2\big] + \mathrm{E}\big[\frac{1}{T_L}\|Z_L^* U - Y_L U\|_F^2\big]$
(supervised loss estimate = unsupervised loss estimate + squared distance estimate), where $Z^* = \arg\min_Z \mathrm{tr}[(X - ZU)'(X - ZU)]$.

Labeled data are scarce, but plenty of unlabeled data are available. The variance of the supervised loss estimate is strictly reduced by introducing the second term, i.e., by using the unlabeled data to obtain a better unbiased estimate of the unsupervised loss.

Page 34: Semi-supervised Least Squares

A naive approach:
$\min_Z \min_U \big[ \frac{1}{T_L}\|X_L - Y_L U\|_F^2 + \frac{1}{T_U}\|X_U - Z U\|_F^2 \big]$
(loss on labeled data + loss on unlabeled data)

Advantages:
• The authors combine supervised and unsupervised reverse losses, whereas previous approaches combine an unsupervised (reverse) loss with a supervised (forward) loss, which are not measured in the same units.
• Compared to the principled approach, it admits more straightforward optimization procedures (alternating between $U$ and $Z$).

Page 35: Regression Experiments: Least Squares + PCA

Basic formulation:
$\min_Z \min_U \big[ \frac{1}{T_L}\|X_L - Y_L U\|_F^2 + \frac{1}{T_U}\|X_U - Z U\|_F^2 \big]$

• The two terms are not jointly convex, so there is no closed-form solution
• Learning method: alternating minimization (with an initial $U$ obtained from the supervised solution)
• Recovered forward solution: $W = (X'X + \beta I)^{-1}U'Y'Y$
• Testing: given a new $x$, predict $\hat{y} = W'x$
• Can be kernelized
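A minimal sketch of the alternating procedure for the unconstrained (least squares + PCA) case follows. It is illustrative only, not the authors' code: the synthetic data, the supervised initialization of $U$, and the fixed number of iterations are all assumptions.

```python
import numpy as np

# Illustrative sketch: alternating minimization for least squares + PCA.
rng = np.random.default_rng(9)
t_l, t_u, n, k = 20, 200, 8, 2
XL, YL = rng.normal(size=(t_l, n)), rng.normal(size=(t_l, k))
XU = rng.normal(size=(t_u, n))

U = np.linalg.solve(YL.T @ YL, YL.T @ XL)       # initial U from the supervised (reverse) fit
for _ in range(50):
    # Z-step: optimal unconstrained guessed labels for the unlabeled inputs
    Z = XU @ U.T @ np.linalg.inv(U @ U.T)
    # U-step: least squares over both terms of the objective
    lhs = YL.T @ YL / t_l + Z.T @ Z / t_u
    rhs = YL.T @ XL / t_l + Z.T @ XU / t_u
    U = np.linalg.solve(lhs, rhs)

obj = np.sum((XL - YL @ U) ** 2) / t_l + np.sum((XU - Z @ U) ** 2) / t_u
print(obj)
```

Each step solves its subproblem in closed form, so the objective is non-increasing; in practice one would monitor its change to decide when to stop.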

Page 36: Regression Experiments: Least Squares + PCA (results)

[Table, taken from Xu's paper: forward root mean squared error (mean ± standard deviation over 10 random splits of the data). The values of (k, n; T_L, T_U) are indicated for each data set.]

Page 37: Classification Experiments: Least Squares + k-means

$\min_{Z: Z \in \{0,1\}^{T_U \times k},\ Z\mathbf{1} = \mathbf{1}}\ \min_U \big[ \frac{1}{T_L}\|X_L - Y_L U\|_F^2 + \frac{1}{T_U}\|X_U - Z U\|_F^2 \big]$

• Recovered forward solution: $W = (X'X + \beta I)^{-1}U'Y'Y$
• Testing: given a new $x$, compute $\hat{y} = W'x$ and predict the maximum response

Least Squares + Norm-cut:
$\min_{Z: Z \in \{0,1\}^{T_U \times k},\ Z\mathbf{1} = \mathbf{1}}\ \min_U \big[ \frac{1}{T_L}\|\Lambda_L^{1/2}(\Lambda_L^{-1}X_L - Y_L U)\|_F^2 + \frac{1}{T_U}\|\Lambda_U^{1/2}(\Lambda_U^{-1}X_U - Z U)\|_F^2 \big]$
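For the classification case the guessed labels are constrained to a hard indicator matrix, so the Z-step becomes a nearest-mean assignment. The sketch below is illustrative only (synthetic data; assumes each class appears among the labeled examples), not the authors' implementation.

```python
import numpy as np

# Illustrative sketch: alternating minimization for least squares + k-means.
rng = np.random.default_rng(10)
t_l, t_u, n, k = 20, 200, 8, 3
XL = rng.normal(size=(t_l, n))
YL = np.eye(k)[rng.integers(0, k, size=t_l)]    # one-of-k labels, Y1 = 1
XU = rng.normal(size=(t_u, n))

U = np.linalg.solve(YL.T @ YL, YL.T @ XL)       # initial U from the supervised fit
for _ in range(20):
    # Z-step: assign each unlabeled point to the nearest row of U (its class "mean")
    dists = ((XU[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)   # t_u x k
    Z = np.eye(k)[dists.argmin(axis=1)]
    # U-step: least squares over the labeled and unlabeled terms
    lhs = YL.T @ YL / t_l + Z.T @ Z / t_u
    rhs = YL.T @ XL / t_l + Z.T @ XU / t_u
    U = np.linalg.solve(lhs, rhs)
```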

Page 38: Classification Experiments: Least Squares + k-means (results)

[Table, taken from Xu's paper: forward error (mean ± standard deviation over 10 random splits of the data). The values of (k, n; T_L, T_U) are indicated for each data set.]

Page 39: Conclusions

Two main contributions:

1. A unified framework based on reverse least squares loss is proposed for several existing supervised and unsupervised algorithms;

2. In the unified framework, a novel semi-supervised principle is proposed.