TRANSCRIPT
Learning with Limited Supervision by Input and
Output Coding
Yi Zhang
Machine Learning Department, Carnegie Mellon University
April 30th, 2012
Thesis Committee
Jeff Schneider (Chair), Geoff Gordon, Tom Mitchell, Xiaojin (Jerry) Zhu (University of Wisconsin-Madison)
Introduction
Learning a prediction system, usually based on examples
Training examples are usually limited:
Cost of obtaining high-quality examples
Complexity of the prediction problem
[Diagram: training examples (x1,y1), …, (xn,yn) from input space X and output space Y feed the learner]
Introduction
Solution: exploit extra information about the input and output space
Improve the prediction performance
Reduce the cost of collecting training examples
Introduction
Solution: exploit extra information about the input and output space
Representation and discovery? Incorporation?
[Diagram: the learning setup with question marks over the input space X and output space Y]
Outline
Part I: Encoding Input Information by Regularization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Learn compressible models
Projection penalties
Part II: Encoding Output Information by Output Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
Learning with unlabeled text
For a text classification task: plenty of unlabeled text on the Web, seemingly unrelated to the task. What can we gain from such unlabeled text?
Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text. NIPS 2008
A motivating example for text learning
Humans learn text classification effectively! Two training examples:
+: [gasoline, truck] -: [vote, election]
Query: [gallon, vehicle]
Seems very easy! But why? Gasoline ~ gallon, truck ~ vehicle
A covariance operator for regularization
Covariance structure of model coefficients: usually unknown. Learn it from unlabeled text?
Learning with unlabeled text
Infer the covariance operator:
Extract latent topics from unlabeled text (with resampling)
Observe the contribution of words in each topic, e.g., [gas: 0.3, gallon: 0.2, truck: 0.2, safety: 0.2, …]
Estimate the correlation (covariance) of words
For a new task, we learn with regularization (a sketch follows below)
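A minimal sketch of the resulting objective, assuming a convex loss ℓ and letting Σ be the word covariance estimated from the unlabeled text (notation mine, not the slide's):

\[ \min_{w}\; \sum_{i=1}^{n} \ell(w^\top x_i,\, y_i) \;+\; \lambda\, w^\top \Sigma^{-1} w \]

The penalty encourages correlated words (gasoline/gallon, truck/vehicle) to receive similar coefficients, which is how two training examples can generalize to unseen words.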
Experiments
Empirical results on 20 newsgroups: 190 1-vs-1 classification tasks, 2% labeled examples. For any task, the majority of the unlabeled text (18 of the 20 newsgroups) is irrelevant
Similar results on logistic regression and least squares
[1] V. Sindhwani and S. Keerthi. Large scale semi-supervised linear SVMs. SIGIR, 2006
Outline
Part I: Encoding Input Information by Regularization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Multi-task generalization
Learn compressible models
Projection penalties
Part II: Encoding Output Information by Output Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
Multi-task learning
Different but related prediction tasks. An example:
Landmine detection using radar images
Multiple tasks: different landmine fields (geographic conditions, landmine types)
Goal: information sharing among tasks
Regularization for multi-task learning
Our approach: view MTL as estimating a parameter matrix W (rows: tasks; columns: feature dimensions)
A covariance operator for regularizing a matrix?
Vector w: a Gaussian prior, i.e., a covariance-weighted quadratic penalty
Matrix W: ?
Yi Zhang and Jeff Schneider. Learning Multiple Tasks with a Sparse Matrix-Normal Penalty. NIPS 2010
Matrix-normal distributions
Consider a 2 by 3 matrix W:
The full covariance of vec(W) = the Kronecker product of the row covariance and the column covariance
The matrix-normal density offers a compact form for this full covariance (a standard form is sketched below)
[Diagram: full covariance ≈ row covariance ⊗ column covariance]
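For reference, a standard statement of the matrix-normal form (a sketch; symbols mine): if the m-by-n matrix W has row covariance Σ (m-by-m) and column covariance Ω (n-by-n), then

\[ \mathrm{vec}(W) \sim \mathcal{N}\big(0,\; \Omega \otimes \Sigma\big), \qquad p(W) \propto \exp\!\Big(-\tfrac{1}{2}\,\mathrm{tr}\big(\Omega^{-1} W^\top \Sigma^{-1} W\big)\Big) \]

so the mn-by-mn full covariance never has to be formed explicitly.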
Joint learning of multiple tasks
Learning with a matrix-normal penalty (a matrix-normal prior on W), via alternating optimization (a sketch follows below)
Other recent work as variants of special cases:
Multi-task feature learning [Argyriou et al, NIPS 06]: learning with the feature covariance
Clustered multi-task learning [Jacob et al, NIPS 08]: learning with the task covariance and spectral constraints
Multi-task relationship learning [Zhang et al, UAI 10]: learning with the task covariance
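A sketch of the alternating scheme under this prior (notation as above; the exact updates follow the paper):

\[ W \leftarrow \arg\min_{W}\; \sum_{t} L_t(W_{t,:}) \;+\; \lambda\, \mathrm{tr}\big(\Omega^{-1} W^\top \Sigma^{-1} W\big) \]
\[ \Sigma \leftarrow \tfrac{1}{n}\, W\, \Omega^{-1} W^\top, \qquad \Omega \leftarrow \tfrac{1}{m}\, W^\top \Sigma^{-1} W \]

The second step is the standard "flip-flop" maximum-likelihood update for matrix-normal covariances.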
Sparse covariance selection
Sparse covariance selection in matrix-normal penalties:
Sparsity of the inverse covariances ↔ conditional independence among the rows (tasks) and among the columns (feature dimensions) of W
Alternating optimization:
Estimating W: same as before
Estimating the row and column covariances: L1-penalized covariance estimation (sketched below)
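The covariance-updating step then takes a graphical-lasso form (a sketch; S denotes the sample scatter computed from the current W):

\[ \widehat{\Theta} = \arg\min_{\Theta \succ 0}\; \mathrm{tr}(S\,\Theta) \;-\; \log\det\Theta \;+\; \rho\,\|\Theta\|_1 \]

Zeros in the estimated precision matrix Θ encode the conditional independencies among tasks (rows) or features (columns).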
Results on multi-task learning
Landmine detection: multiple landmine fields
Face recognition: multiple 1-vs-1 tasks
[1] Jacob, Bach, and Vert. Clustered multi-task learning: A convex formulation. NIPS, 2008
[2] Argyriou, Evgeniou, and Pontil. Multi-task feature learning. NIPS, 2006
[Bar chart: 1-AUC with 30 samples per task, comparing STL, MTL_Clust [1], MTL_Feat [2], and the proposed method]
[Bar chart: classification error with 5 samples per subject, comparing STL, MTL_Clust [1], MTL_Feat [2], and the proposed method]
Outline
Part I: Encoding Input Information by Regularization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Multi-task generalization
Go beyond covariance and correlation structures
Learn compressible models
Projection penalties
Part II: Encoding Output Information by Output Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
Learning compressible models
A compression operator P, instead of the identity. Bias: model compressibility (a sketch of the penalized objective follows below)
Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning Compressible Models. SDM 2010
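A minimal sketch of the objective, assuming an ℓ1 penalty on the compressed coefficients (P a fixed compression operator such as a DCT; P = I would recover the lasso):

\[ \min_{w}\; \sum_{i=1}^{n} \ell(w^\top x_i,\, y_i) \;+\; \lambda\, \|P w\|_1 \]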
Energy compaction
Image energy is concentrated at a few frequencies; models need to operate at the relevant frequencies (see the small example below)
[Figure: JPEG (2D-DCT) at 46:1 compression, illustrating the 2D-DCT]
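A small Python illustration of energy compaction (a sketch: the 28-by-28 array is a hypothetical stand-in for a model's coefficient image, and scipy's dctn plays the role of the 2D-DCT operator P):

    import numpy as np
    from scipy.fft import dctn, idctn

    # Hypothetical 28x28 "coefficient image": a stand-in for a linear
    # model's weights on digit pixels, reshaped to 2D.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((28, 28))

    # 2D-DCT acts as the compression operator P (the JPEG transform).
    Pw = dctn(w, norm="ortho")

    # Energy compaction: keep only the largest 5% of frequency coefficients.
    k = max(1, int(0.05 * Pw.size))
    thresh = np.partition(np.abs(Pw).ravel(), -k)[-k]
    Pw_sparse = np.where(np.abs(Pw) >= thresh, Pw, 0.0)

    # Invert the transform; for natural, smooth w most energy survives
    # (random noise, as here, compresses poorly: it has no compaction).
    w_hat = idctn(Pw_sparse, norm="ortho")
    print("relative reconstruction error:",
          np.linalg.norm(w - w_hat) / np.linalg.norm(w))

On smooth, image-like coefficients the reconstruction error stays small even at aggressive compression ratios; on pure noise, as in this stand-in, it does not. That contrast is exactly the energy-compaction property.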
Sparse vs. compressible
Model coefficients w, digit recognition:
[Figure: coefficients w, compressed coefficients Pw, and coefficients w displayed as an image; each panel contrasts sparse vs. compressible models]
Outline
Part I: Encoding Input Information by Regularization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Multi-task generalization
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Part II: Encoding Output Information by Output Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
Dimension reduction
Dimension reduction conveys information about the input space:
Feature selection → importance
Feature clustering → granularity
Feature extraction → more general structures
How to use a dimension reduction?
However, any reduction loses certain information, which may be relevant to a prediction task
Goal of projection penalties: encode useful information from a dimension reduction, while controlling the risk of potential information loss
Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010
Projection penalties: the basic idea
The basic idea:
Observation: reducing the feature space ↔ restricting the model search to a model subspace MP
Solution: still search in the full model space M, and penalize the projection distance to the model subspace MP
Projection penalties: linear cases
Learn with projection penalties: the optimization penalizes the projection distance (a sketch follows below)
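A hedged sketch of the linear case, assuming the reduction is a matrix P mapping R^p to R^d, with v a model in the reduced space (notation mine):

\[ \min_{w,\,v}\; \sum_{i=1}^{n} \ell(w^\top x_i,\, y_i) \;+\; \lambda\, \|\, w - P^\top v \,\|_2^2 \]

Letting λ → ∞ forces w into the model subspace (pure dimension reduction); a finite λ keeps the full space reachable and thereby controls the loss of information.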
Projection penalties: nonlinear cases
Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010
[Diagram: a feature mapping between R^p and R^d with projection P; the full model space M, the model subspace MP, and the models w and wP]
Empirical results
Text classification (20 newsgroups), using logistic regression. Dimension reduction: latent Dirichlet allocation
[Bar charts: classification error for Original, Reduction, and Projection Penalty at 2%, 5%, and 10% training data]
Similar results on face recognition, using SVM (poly-2). Dimension reduction: KPCA, KDA, OLaplacianFace
Similar results on house price prediction, using regression. Dimension reduction: PCA and partial least squares
Outline
Part II: Encoding Output Information by Output Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
Part I: Encoding Input Information by Regularization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Multi-task generalization
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Multi-label classification
Existence of certain label dependency. Example: classify an image into scenes (deserts, river, forest, etc). The multi-class problem is a special case: only one class is true
[Diagram: learn to predict labels y1, y2, …, yq from x, with dependency among the labels]
Output coding
d < q: compression, i.e., source coding
d > q: error-correcting codes, i.e., channel coding. Use the redundancy to correct prediction ("transmission") errors
[Diagram: encode labels y = (y1, …, yq) into a code z = (z1, …, zd); learn to predict z from x; decode the predictions back to y]
Error-correcting output codes (ECOCs)
Multi-class ECOCs [Dietterich & Bakiri, 1994] [Allwein, Schapire & Singer 2001]:
Encode into a (redundant) set of binary problems
Learn to predict the code
Decode the predictions
Our goal: design ECOCs for multi-label classification
[Diagram: encode y1, …, yq into binary subproblems such as "y1", "{y3, y4} vs. y7", "y2 vs. y3"; learn to predict them from x; decode]
Outline
Part II: Encoding Output Information by Output Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
Part I: Encoding Input Information by Regularization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Multi-task generalization
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Composite likelihood
The composite likelihood (CL): a partial specification of the likelihood as the product of simple component likelihoods, e.g., the pairwise likelihood or the full conditional likelihood (sketched below)
Estimation using composite likelihoods: computational and statistical efficiency; robustness under model misspecification
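Standard examples in one line (a sketch in the usual CL notation):

\[ \mathrm{CL}_{\text{pairwise}}(\theta) \;=\; \prod_{i<j} p_\theta(y_i, y_j), \qquad \mathrm{CL}_{\text{cond}}(\theta) \;=\; \prod_{i} p_\theta(y_i \mid y_{-i}) \]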
Multi-label problem decomposition
Problem decomposition methods:
Decomposition into subproblems (encoding)
Decision making by combining subproblem predictions (decoding)
Examples: 1-vs-all, 1-vs-1, 1-vs-1 + 1-vs-all, etc
[Diagram: learn subproblem predictors from x; combine their predictions into y1, …, yq]
1-vs-All (Binary Relevance)
Classify each label independently
The composite likelihood view (sketched below)
[Diagram: learn to predict each of y1, …, yq from x independently]
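In CL notation, binary relevance amounts to the component product (conditioning on x)

\[ \mathrm{CL}(\theta) \;=\; \prod_{k=1}^{q} p_\theta(y_k \mid x), \]

which matches the true likelihood only when the labels are conditionally independent given x.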
Pairwise label ranking [1]
1-vs-1 method (a.k.a. pairwise label ranking)
Subproblems: pairwise label comparisons
Decision making: label ranking by counting the number of winning comparisons, and thresholding
The composite likelihood view
[Diagram: learn to predict y1 vs. y2, y1 vs. y3, …, yq-1 vs. yq from x]
[1] Hullermeier et. al. Artif. Intell., 2008
Calibrated label ranking [2]
1-vs-1 + 1-vs-all (a.k.a. calibrated label ranking)
Subproblems: 1-vs-1 + 1-vs-all
Decision making: label ranking, and a smart thresholding based on the 1-vs-1 and 1-vs-all predictions
The composite likelihood view
[Diagram: learn to predict the pairwise comparisons and the individual labels from x]
[2] Furnkranz et. al. MLJ, 2008
A composite likelihood view
A composite likelihood view for problem decomposition:
Choice of subproblems ↔ specification of a composite likelihood?
Decision making ↔ inference on the composite likelihood?
[Diagram: learn subproblem predictors from x; combine their predictions into y1, …, yq]
A composite pairwise coding
Subproblems: individual and pairwise label densities. A pairwise label density conveys more information than a pairwise comparison (see the sketch below)
Yi Zhang and Jeff Schneider. A Composite Likelihood View for Multi-Label Classification. AISTATS 2012
[Table: the four pairwise label configurations (yi=0, yj=0), (yi=0, yj=1), (yi=1, yj=0), (yi=1, yj=1)]
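One plausible form of the composite likelihood behind this coding, combining singleton and pairwise densities (a sketch; the paper's exact component weighting may differ):

\[ \mathrm{CL}(\theta) \;=\; \prod_{k=1}^{q} p_\theta(y_k \mid x)\; \prod_{i<j} p_\theta(y_i, y_j \mid x) \]

A pairwise density distinguishes all four configurations in the table above, whereas a pairwise comparison only orders y_i against y_j.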
A composite pairwise coding
Decision making: a robust mean-field approximation
The plain mean-field objective is not robust to underestimation of the label densities
A composite divergence: robust and efficient to optimize
Yi Zhang and Jeff Schneider. A Composite Likelihood View for Multi-Label Classification. AISTATS 2012
Data sets
The Scene data: image scenes (beach, sunset, fall foliage, field, mountain and urban) [Boutell et. al., Pattern Recognition 2004]
[Example image labeled: beach, urban]
Data sets
The Emotion data: music emotions (amazed, happy, relaxed, sad, etc)
The Medical data: clinical text → medical categories (ICD-9-CM codes)
The Yeast data: gene functional categories
The Enron data: email tags on topics, attachment types, and emotional tones
Empirical results
Similar results on other data sets (emotions, medical, etc)
[1] Hullermeier et. al. Label ranking by learning pairwise preferences. Artif. Intell., 2008
[2] Furnkranz et. al. Multi-label classification via calibrated label ranking. MLJ, 2008
[3] Read et. al. Classifier chains for multi-label classification. ECML, 2009
[4] Tsoumakas et. al. Random k-labelsets: an ensemble method for multilabel classification. ECML, 2007
[5] Zhang et. al. Multi-label learning by exploiting label dependency. KDD, 2010
Outline
Part II: Encoding Output Information by Output Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
problem-dependent coding and code predictability
Part I: Encoding Input Information by Regularization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Multi-task generalization
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Multi-label output coding
Design output coding for multi-label problems:
Problem-dependent encodings to exploit label dependency
Code predictability
Propose: multi-label ECOCs via CCA
[Diagram: encode y1, …, yq into z1, …, zd (encoding to be designed); learn to predict z from x; decode]
Canonical correlation analysis
Given paired inputs and labels, CCA finds pairs of projection directions with maximum correlation (objective sketched below)
Also known as "the most predictable criterion": CCA finds the most predictable directions v in the label space
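The standard CCA objective (a sketch, with X and Y the stacked input and label matrices):

\[ \max_{u,\,v}\; \frac{u^\top X^\top Y\, v}{\sqrt{u^\top X^\top X\, u}\;\sqrt{v^\top Y^\top Y\, v}} \]

Maximizing the correlation of Xu with Yv is exactly what makes the label directions v "most predictable" from the input.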
Multi-label ECOCs using CCA
Encoding and learning:
Perform CCA; the code includes both the original labels and the label projections
Learn classifiers for the original labels; learn regressions for the label projections
Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011
[Diagram: encode y1, …, yq into (y1, …, yq, z1, …, zd); learn to predict the code from x; decode]
Multi-label ECOCs using CCA
Decoding:
Classifiers: Bernoulli on the q original labels
Regression: Gaussian on the d label projections
Mean-field approximation (a sketch of the decoding objective follows below)
Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011
[Diagram: decode the predicted code back to y1, …, yq]
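A sketch of the decoding objective that the mean-field step approximates (p_k from the label classifiers, ẑ_j from the projection regressions, v_j the CCA directions; the exact parameterization follows the paper):

\[ \hat{y} \;=\; \arg\max_{y \in \{0,1\}^q}\; \prod_{k=1}^{q} p_k(x)^{y_k}\big(1 - p_k(x)\big)^{1-y_k} \;\prod_{j=1}^{d} \mathcal{N}\big(\hat{z}_j(x);\; v_j^\top y,\; \sigma_j^2\big) \]

The mean-field approximation makes this maximization over the 2^q label vectors tractable.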
Empirical results
Similar results on other criteria (macro/micro F-1 scores)
Similar results on other data (emotions)
Similar results on other base learners (decision trees, SVMs)
[1] Furnkranz et. al. Multi-label classification via calibrated label ranking. MLJ, 2008
[2] D. Hsu, et. al. Multi-label prediction via compressed sensing. NIPS, 2009
[3] Zhang and Schneider. A composite likelihood view for multi-label classification. AISTATS 2012
Outline
Part II: Encoding Output Information by Output Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
problem-dependent coding and code predictability
Discriminative and predictable codes
Part I: Encoding Input Information by Regularization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Multi-task generalization
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Recall: coding with CCA
CCA finds label projections z that are most predictable: low "transmission errors" in channel coding
[Diagram: encode y1, …, yq into (y1, …, yq, z1, …, zd); predict z from x; decode]
A recent paper [1]: coding with PCA
Label projections z obtained by PCA: z has maximum sample variance, i.e., codewords are far away from each other. Minimum code distance?
[1] Tai and Lin, 2010
[Diagram: encode y1, …, yq into (y1, …, yq, z1, …, zd) via PCA; predict z from x; decode]
Goal: predictable and discriminative codes
Predictable: the prediction is close to the correct codeword
Discriminative: the prediction is far away from incorrect codewords
[Diagram: predict the code from x; decoding compares the prediction to the codewords of candidate label vectors]
Maximum margin output coding
A max-margin formulation:
Assume M is the best linear predictor (in closed form of X, Y, V)
Reformulate using metric learning
Deal with the exponentially large number of constraints: the cutting plane method, with overgenerating
Maximum margin output coding
A max-margin formulation, reformulated via metric learning: define a Mahalanobis metric on the code space and the corresponding notation (a hedged sketch of the resulting constraints follows below)
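A hedged sketch of the resulting margin constraints (d_A the Mahalanobis distance with learned matrix A, h(x) the code prediction, Δ a label-dependent margin; details per the formulation above):

\[ \forall\, y' \neq y_i:\quad d_A\big(z(y'),\, h(x_i)\big) \;-\; d_A\big(z(y_i),\, h(x_i)\big) \;\geq\; \Delta(y_i, y') \;-\; \xi_i \]

That is, the prediction must be closer to the correct codeword than to any incorrect one, by a margin.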
Maximum margin output coding
The metric learning problem: an exponentially large number of constraints. Cutting plane method? No polynomial-time separation oracle!
Cutting plane method with overgenerating (relaxation):
Relax the binary label domain into its convex hull
Linearize for the relaxed domain
New separation oracle: a box-constrained QP (sketched below)
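A sketch of the relaxed separation oracle: the most violated constraint for example i is found by maximizing

\[ \max_{y' \in [0,1]^q}\; \Delta(y_i, y') \;-\; d_A\big(z(y'),\, h(x_i)\big) \;+\; d_A\big(z(y_i),\, h(x_i)\big), \]

which, with a quadratic d_A and the linearization above, is a box-constrained QP over the relaxed label domain.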
Empirical results
Similar results on other data (emotions and medical)
[1] Furnkranz et. al. Multi-label classification via calibrated label ranking. MLJ, 2008
[2] Zhang et. al. Multi-label learning by exploiting label dependency. KDD, 2010
[3] D. Hsu, et. al. Multi-label prediction via compressed sensing. NIPS, 2009
[4] Tai and Lin. Multi-label classification with principal label space transformation. Neur. Comp.
[5] Zhang and Schneider. Multi-label output codes via canonical correlation analysis. AISTATS, 2011
Conclusion
Regularization to exploit input information:
Semi-supervised learning with word correlation
Multi-task learning with a matrix-normal penalty
Learning compressible models
Projection penalties for dimension reduction
Output coding to exploit output information:
Composite pairwise coding
Coding via CCA
Coding via a max-margin formulation
Future directions
Thank you! Questions?
Local smoothness
Smoothness of model coefficients. Key property: certain orders of derivatives are sparse
Differentiation operator (a sketch of the penalty follows below)
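A sketch of the corresponding penalty, assuming D is a first-order differencing matrix (higher orders via repeated differencing D^k):

\[ \min_{w}\; \sum_{i=1}^{n} \ell(w^\top x_i,\, y_i) \;+\; \lambda\, \|D w\|_1 \]

Sparse differences yield piecewise-constant (more generally, piecewise-smooth) coefficient profiles.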
Brain-computer interaction
Classify electroencephalography (EEG) signals: sparse models vs. piecewise smooth models
Projection penalties: linear cases
Learn a linear model with projection penalties, penalizing the projection distance
Projection penalties: RKHS cases
Learning in an RKHS with projection penalties:
Primal: solve for the RKHS model in the dual (see the next page); solve for v and b in the primal
[Diagram: feature map, projection P, model space M and subspace MP with models w and wP]
Projection penalties: nonlinear cases
Learning linear models; learning RKHS models
[Diagram: feature mappings between R^p and R^d, with the projection P applied to Φ(xi)]
Empirical results
Face recognition (Yale), using SVM (poly-2). Dimension reduction: KPCA, KDA, OLaplacianFace
[Bar charts, one slide per reduction method: classification error for Orig, Red, and Proj with 3, 5, and 7 training samples per class]
Empirical results
Price forecasting (Boston housing), ridge regression. Dimension reduction: partial least squares
[Bar chart: 1-R² for Orig, Red, and Proj with 50 training samples]
Binary relevance
Binary relevance (a.k.a. 1-vs-all)
Subproblems: classify each label independently
Decision making: same
Assumes no label dependency
The composite likelihood view
[Diagram: learn to predict each of y1, …, yq from x independently]