TRANSCRIPT
Learning with Implicit/Explicit Structures
James Kwok
Department of Computer Science and Engineering, Hong Kong University of Science and Technology
Hong Kong
Chinese Workshop on Machine Learning & Applications
Structure is Everywhere
Data have structure
Example (structured data)
DNA (strings), natural language (trees), molecules (graphs)
various kernels have been defined for these structured data
structure is about the input
Example (structured sparsity)
prefers certain sparsity patterns of the parameter, not just a small number of nonzero coefficients
Structure is Everywhere...
Outputs can also have structure
Example (hierarchical classification)
instance labels reside in a known tree- or DAG-structured hierarchy
structure is explicit (e.g., yahoo hierarchy)
Structure can be implicit
Example (multitask learning)
tasks may have an underlying clustering structure
may have to be discovered by the learning algorithm
Hierarchical Multilabel Classification
Multiclass vs Multilabel Classification
Example (multiclass classification)
an instance can have only one label
Example (multilabel classification)
an instance can have more than one label
image tagging
tags: elephant, jungle and africa
Multilabel Classification
Example
video tagging
text categorization
gene functions analysis in bioinformatics
Are these labels independent?
NO! Labels often have structure!
Hierarchical Classification
Example (text classification)
labels may be organized in a known tree-structured hierarchy
Yahoo! taxonomy (as of 2004): 16-level hierarchy
an instance associated with the label of a node is also associated with the label of the node's parent
Hierarchical Classification...
More generally, label hierarchy is a directed acyclic graph (DAG)
Example (bioinformatics: Gene Ontology (GO))
Genome annotation
a node can have multiple parents
if a node is positive, all its parents must be positive
Consider this label hierarchy information in making predictions
Training
train estimators for p(yi = 1 | ypa(i) = 1, x) at each node i
many possible estimation methods
Example
at each node i: train a binary SVM using those training examples s.t. the parent of i is labeled positive
convert SVM output to a probability estimate using Platt's procedure
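As a rough sketch of this per-node training (illustrative only, not the talk's actual code; X, Y and parent are hypothetical names, and scikit-learn's SVC(probability=True) applies Platt scaling internally):

```python
# Hedged sketch: one probabilistic binary classifier per node of the hierarchy.
# X: (n x d) inputs; Y: (n x |T|) 0/1 label matrix; parent[i]: parent node id (None for the root).
from sklearn.svm import SVC

def train_node_estimators(X, Y, parent):
    estimators = {}
    for i in range(Y.shape[1]):
        p = parent[i]
        # keep only the training examples whose parent node is labeled positive
        mask = (Y[:, p] == 1) if p is not None else slice(None)
        X_i, y_i = X[mask], Y[mask, i]
        if len(set(y_i)) < 2:                            # degenerate node: no estimator
            continue
        clf = SVC(kernel="linear", probability=True)     # probability=True -> Platt scaling
        clf.fit(X_i, y_i)
        estimators[i] = clf                              # estimates p(y_i = 1 | y_pa(i) = 1, x)
    return estimators
```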
Potentially Large Number of Labels
Example
Flickr (as of 2010): > 20 million unique tags
humans can recognize 10,000-100,000 unique object classes
Yahoo! taxonomy (as of 2004): nearly 300,000 categories
How to reduce the number of learning problems?
Projection
1 project the long label vector to a low-dimensional vector (e.g., PCA, randomized matrix)
2 learn a mapping from input to each projected dimension
3 predict and reconstruct the label vector in the original space
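A minimal sketch of steps 1-3 (purely illustrative; the random Gaussian projection and the simple pseudo-inverse decoder are my own choices here, and all names below are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, L, m = 200, 30, 100, 10                   # m: projected label dimension, m << L
X = rng.standard_normal((n, d))                 # inputs
Y = (rng.random((n, L)) < 0.05).astype(float)   # long, sparse 0/1 label vectors

P = rng.standard_normal((m, L)) / np.sqrt(m)    # 1. random projection matrix
Z = Y @ P.T                                     #    project the labels: n x m

W, *_ = np.linalg.lstsq(X, Z, rcond=None)       # 2. one linear regressor per projected dimension

x_test = rng.standard_normal(d)
z_hat = x_test @ W                              #    predict in the projected space
y_hat = z_hat @ np.linalg.pinv(P).T             # 3. reconstruct in the original label space
top_labels = np.argsort(y_hat)[-5:]             #    e.g. report the highest-scoring labels
```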
Advantages
number of learning problems = dimension in projected space
flexible: any regressor can be used in the learning step
Prediction: Simple Case
Suppose it is known that the test sample has k labels and the labels are unstructured; how do we obtain the prediction from h?
Simply pick the k largest entries in h!
$$\max_{\psi}\ \sum_i h_i \psi_i \quad \text{s.t.}\quad \psi_i \in \{0,1\}\ \forall i \ \text{(indicator)},\quad \sum_i \psi_i = k$$
$\psi_i = 1$: node $i$ is selected; 0 otherwise
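For the unstructured case, the program above reduces to a top-k selection; a tiny illustrative sketch:

```python
import numpy as np

h = np.array([0.1, 0.7, 0.3, 0.9, 0.2])    # per-label scores for one test sample
k = 2
psi = np.zeros_like(h)
psi[np.argpartition(h, -k)[-k:]] = 1        # pick the k largest entries of h
# psi is now an optimal indicator vector when the labels have no structure
```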
What if the Labels are Structured?
can no longer simply pick the k largest entries
Mandatory vs Non-mandatory Leaf Node Prediction
Mandatory leaf node prediction (MLNP)
labels must end at leaf nodes
Example
leaf nodes are objects of interest (have stronger semantic/biological meanings)
e.g., taxonomies of musical signals and genes
when the label hierarchy is learned from the data, internal nodes are only artificial
Non-mandatory leaf node prediction (NMLNP)
labels can end at an internal node
NMLNP on Tree-structured Hierarchies
$$\max_{\psi}\ \sum_{i\in T} h_i \psi_i \quad \text{s.t.}\quad \psi_i \in \{0,1\}\ \forall i \in T,\quad \sum_{i\in T} \psi_i = k,\quad \psi_{\mathrm{pa}(i)} \ge \psi_i$$
parent's $\psi$ value must be $\ge$ that of the child: if a node is labeled positive, its parent must also be labeled positive
Relax the binary constraint to $0 \le \psi_i \le 1$
Efficient Algorithm
Condensing Sort and Select Algorithm (CSSA) [Baraniuk & Jones, TSP 1994]: originally used in signal processing
time complexity: O(N log N)
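A rough, unoptimized sketch of the condensing-sort-and-select idea (my own paraphrase under stated assumptions, not the authors' implementation; h, parent and the supernode bookkeeping below are illustrative, and the naive loop ignores the heap needed for the O(N log N) bound):

```python
def cssa_topk(h, parent, k):
    """Greedily select ~k nodes with large scores h[i] such that a node is
    selected only if its parent is. parent[i]: parent index of node i (None for the root)."""
    n = len(h)
    super_of = list(range(n))              # supernode id that each node belongs to
    members = {i: [i] for i in range(n)}   # nodes inside each active supernode
    top = {i: i for i in range(n)}         # member closest to the root
    selected, active = set(), set(range(n))
    while len(selected) < k and active:
        # active supernode with the largest average score
        s = max(active, key=lambda j: sum(h[i] for i in members[j]) / len(members[j]))
        p = parent[top[s]]
        if p is None or p in selected:     # its parent is already selected (or it holds the root)
            selected.update(members[s])
            active.remove(s)
        else:                              # condense: merge into the parent's supernode
            t = super_of[p]
            members[t] += members[s]
            for i in members[s]:
                super_of[i] = t
            active.remove(s)
            del members[s]
    return selected                        # may overshoot k by the size of the last supernode

# toy tree: 0 is the root with children 1, 2; node 2 has children 3, 4
print(cssa_topk([0.0, 0.3, 0.2, 0.9, 0.1], [None, 0, 0, 2, 2], k=3))   # -> {0, 2, 3}
```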
Example
(figures omitted)
NMLNP on DAG-Structured Hierarchies
Idea: merge the supernode with its unassigned parent that has the smallest supernode value
Example...
(figures omitted)
Mandatory Leaf Node Prediction
Non-mandatory leaf node prediction (NMLNP)
labels can end at an internal node
Mandatory leaf node prediction (MLNP)
labels must end at a leaf node
Existing MLNP methods are for hierarchical multiclass classification
train a classifier at each node; for a positive parent, recursively label its child having the largest prediction as positive
at each node, exactly one subtree is to be pursued
Extension to hierarchical multilabel classification is not easy
at each node, how many and which subtrees to pursue?
Probabilistic Approach
Notations (consider tree hierarchy first)
training examples $(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})$
$\mathbf{x}^{(n)}$: input; $\mathbf{y}^{(n)} = [y^{(n)}_1, \dots, y^{(n)}_{|T|}]' \in \{0,1\}^{|T|}$: multilabel vector
denoting memberships of $\mathbf{x}^{(n)}$ to each of the nodes
tree structure: $y_i = 1 \Rightarrow y_{\mathrm{pa}(i)} = 1$ for any non-root node $i$
Assume: labels for any group of siblings $i_1, i_2, \dots, i_m$ are conditionally independent, given the label of their parent and $\mathbf{x}$:
$$p(y_{i_1}, y_{i_2}, \dots, y_{i_m} \mid y_{\mathrm{pa}(i_1)}, \mathbf{x}) = \prod_{j=1}^m p(y_{i_j} \mid y_{\mathrm{pa}(i_1)}, \mathbf{x})$$
popularly used in Bayesian networks and hierarchical multilabel classification
Maximum a Posteriori MLNP
represent $\mathbf{y}$ by a set $\Omega \subseteq T$: $y_i = 1$ if $i \in \Omega$, and 0 otherwise
MAP MLNP: find $\Omega^*$ that (1) maximizes the posterior probability $p(y_0, \dots, y_{|T|} \mid \mathbf{x})$ and (2) respects $T$:
$$\Omega^* = \arg\max_{\Omega}\ p(y_\Omega = 1, y_{\Omega^c} = 0 \mid \mathbf{x}) \quad \text{s.t.}\quad y_0 = 1$$
$\Omega$ contains no partial path (MLNP)
the $y_i$'s respect the label hierarchy
Note:
$p(y_\Omega = 1, y_{\Omega^c} = 0 \mid \mathbf{x})$ considers all the node labels in the hierarchy simultaneously
cf. existing MLNP (multiclass) methods: consider the hierarchy information only locally at each node
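To make the objective concrete, here is a small illustrative helper (my own sketch for the tree case, not from the talk) that evaluates $\log p(y_\Omega = 1, y_{\Omega^c} = 0 \mid \mathbf{x})$ from the per-node estimates $p_i = p(y_i = 1 \mid y_{\mathrm{pa}(i)} = 1, \mathbf{x})$; nodes whose parent is negative are negative with probability 1 and contribute nothing:

```python
import math

def log_posterior(omega, p, parent):
    """log p(y_omega = 1, y_rest = 0 | x) for a tree hierarchy.
    omega: selected node set (contains the root 0 and is closed under parents);
    p[i]: estimate of p(y_i = 1 | y_pa(i) = 1, x); parent[i]: parent of i (None for the root)."""
    logp = 0.0
    for i, pa in enumerate(parent):
        if pa is None:
            continue                  # root: p(y_0 = 1 | x) assumed to be 1
        if pa not in omega:
            continue                  # parent negative => child negative with probability 1
        logp += math.log(p[i]) if i in omega else math.log(1.0 - p[i])
    return logp

# toy example: root 0 with children 1, 2; node 2 has child 3
parent = [None, 0, 0, 2]
p = [1.0, 0.8, 0.6, 0.7]
print(log_posterior({0, 2, 3}, p, parent))   # log(1-0.8) + log(0.6) + log(0.7)
```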
Maximum a Posteriori MLNP on Label DAGs
Using the conditional independence simplification
$$p(y_1, \dots, y_{|G|} \mid \mathbf{x}) = p(y_0 \mid \mathbf{x}) \prod_{i \in G\setminus\{0\}} p(y_i \mid y_{\mathrm{Pa}(i)}, \mathbf{x})$$
$\mathrm{Pa}(i)$: the set of (possibly multiple) parents of node $i$
direct maximization of $p(y_1, \dots, y_{|G|} \mid \mathbf{x})$ is difficult
Assume
$$p(y_1, \dots, y_{|G|} \mid \mathbf{x}) \propto p(y_0 \mid \mathbf{x}) \prod_{i \in G\setminus\{0\}} \prod_{j \in \mathrm{Pa}(i)} p(y_i \mid y_j, \mathbf{x})$$
composite likelihood (or pseudolikelihood): replace a difficult pdf by a set of marginals or conditionals that are easier to evaluate
pairwise conditional likelihood
MAP MLNP on Label DAGs...
for each node $i$, define
$$w_i = \begin{cases} \sum_{l \in \mathrm{child}(0)} \log(1 - p_{l0}) & i = 0 \\ \sum_{j \in \mathrm{Pa}(i)} \left(\log p_{ij} - \log(1 - p_{ij})\right) & \text{leaf } i \\ \sum_{j \in \mathrm{Pa}(i)} \left(\log p_{ij} - \log(1 - p_{ij})\right) + \sum_{l \in \mathrm{child}(i)} \log(1 - p_{li}) & \text{otherwise} \end{cases}$$
$p_{ij} \equiv p(y_i = 1 \mid y_j = 1, \mathbf{x})$ for $j \in \mathrm{Pa}(i)$
If we knew that the prediction of $\mathbf{x}$ has $k$ leaf labels:
$$\max_{\psi}\ \sum_{i\in G} w_i \psi_i$$
$$\text{s.t.}\quad \sum_{\text{leaf node } i} \psi_i = k,\quad \psi_0 = 1,\quad \psi_i \in \{0,1\}\ \forall i \in G$$
$$\sum_{j\in \mathrm{child}(i)} \psi_j \ge 1\ \ \forall \text{ internal node } i \text{ with } \psi_i = 1,\qquad \psi_i \le \psi_j\ \ \forall j \in \mathrm{Pa}(i),\ \forall i \in G\setminus\{0\}$$
$\binom{\#\text{leaf nodes}}{k}$ candidate solutions → expensive
Nested Approximation Property (NAP)
Definition (k-leaf-sparse)
A multilabel y is k-leaf-sparse if k of the leaf nodes are labeled one
Example
(figures: a 1-leaf-sparse and a 2-leaf-sparse multilabel)
Nested Approximation Property (NAP)...
Definition (Nested Approximation Property (NAP))
For a pattern $\mathbf{x}$, let its optimal $k$-leaf-sparse multilabel be $\Omega_k$. The NAP is satisfied if
$$\{i : i \in \Omega_k\} \subset \{i : i \in \Omega_{k'}\} \quad \text{for all } k < k'$$
NAP is often implicitly assumed in many hierarchical classification algorithms
e.g., classification based on threshold tuning: a higher threshold gives a 1-leaf-sparse prediction; a lower threshold gives a 2-leaf-sparse prediction
Solve
$$\max_{\psi}\ \sum_{i\in G} w_i \psi_i$$
$$\text{s.t.}\quad \sum_{\text{leaf node } i} \psi_i = k,\quad \psi_0 = 1,\quad \psi_i \in \{0,1\}\ \forall i \in G$$
$$\sum_{j\in \mathrm{child}(i)} \psi_j \ge 1\ \ \forall \text{ internal node } i \text{ with } \psi_i = 1,\qquad \psi_i \le \psi_j\ \ \forall j \in \mathrm{Pa}(i),\ \forall i \in G\setminus\{0\}$$
Example: Find Optimal 1-leaf-sparse Solution
Example: Find Optimal 2-leaf-sparse Solution
update takes $O(|L| N \log N)$ time
repeat this process until $k$ leaf nodes are selected
total: $O(k |L| N \log N)$ time
Unknown Number of Labels
We assumed that we know the prediction of x has k leaf labels
What if k is not known?
Straightforward approach
run the algorithm with $k = 1, \dots, |L|$ ($|L|$: number of leaf nodes)
find the $\Omega_k \in \{\Omega_1, \dots, \Omega_{|L|}\}$ that maximizes the posterior probability (composite likelihood for DAGs)
Recall that Ωk ⊂ Ωk+1 under the NAP assumption
can simply set $k = |L|$, and $\Omega_i$ is immediately obtained as the $\Omega$ in iteration $i$
Experiments
12 functional genomics data sets with 2 sets of labels
tree-structured labels: FunCat hierarchy
DAG-structured labels: GO hierarchy
          #class   max depth   #labels on average
FunCat    500      5           8.8
GO        4134     14          35
Evaluation
precision-recall (PR) curve
$$\mathrm{Prec} = \frac{\sum_i \mathrm{TP}_i}{\sum_i \mathrm{TP}_i + \sum_i \mathrm{FP}_i}, \qquad \mathrm{Rec} = \frac{\sum_i \mathrm{TP}_i}{\sum_i \mathrm{TP}_i + \sum_i \mathrm{FN}_i}$$
$\mathrm{TP}_i$ / $\mathrm{FP}_i$ / $\mathrm{FN}_i$: number of true positives / false positives / false negatives for label $i$
AUPRC (area under the PR curve): the larger the better
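For concreteness, a tiny sketch of these micro-averaged quantities (illustrative only; the AUPRC is the area under the curve traced out as the decision threshold varies):

```python
import numpy as np

def micro_prec_rec(Y_true, Y_pred):
    """Micro-averaged precision/recall over all labels.
    Y_true, Y_pred: (n_samples x n_labels) 0/1 arrays."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1))
    fp = np.sum((Y_true == 0) & (Y_pred == 1))
    fn = np.sum((Y_true == 1) & (Y_pred == 0))
    return tp / (tp + fp), tp / (tp + fn)
```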
NMLNP on Label Trees (FunCat)
CLUS-HMC (Vens et al., MLJ 2008): decision-tree-based (state-of-the-art)
AUPRC values
data set     CSSA    CLUS-HMC (state-of-the-art)
seq          0.226   0.218
pheno        0.167   0.166
struc        0.194   0.189
hom          0.257   0.254
cellcycle    0.196   0.180
church       0.179   0.178
derisi       0.194   0.183
eisen        0.220   0.212
gasch1       0.216   0.212
gasch2       0.218   0.203
spo          0.216   0.195
expr         0.228   0.218

CSSA outperforms CLUS-HMC on all data sets!
NMLNP on Label DAGs (GO)
AUPRC values
data set     CSSAG   CLUS-HMC
seq          0.478   0.469
pheno        0.426   0.425
struc        0.455   0.446
hom          0.493   0.481
cellcycle    0.454   0.443
church       0.442   0.436
derisi       0.442   0.440
eisen        0.479   0.454
gasch1       0.468   0.453
gasch2       0.454   0.449
spo          0.442   0.440
expr         0.473   0.453
CSSA outperforms CLUS-HMC on all data sets again!
MLNP on Label Trees
Compare with
HMC-LP [Cerri et al, IDA 2011]
the only existing algorithm that can perform MLNP on trees (but not on DAGs)
other NMLNP methods (converted for MLNP)
first use the MetaLabeler to predict the number of leaf labels (k) that the test pattern has
then pick the k leaf labels using different NMLNP methods
hierarchical SVM (H-SVM) [Cesa-Bianchi et al., JMLR 2006]
Bayesian classifier chain (BCC) [Zaragoza et al., IJCAI 2011]
CLUS-HMC
MLNP on Label Trees (H-SVM, BCC and CLUS-HMC are used with the MetaLabeler)

data set            MAT        HMC-LP     H-SVM      BCC        CLUS-HMC
rcv1v2 subset1      0.81 [1]   0.05 [5]   0.80 [2]   0.60 [4]   0.62 [3]
rcv1v2 subset2      0.83 [1]   0.05 [5]   0.81 [2]   0.62 [3]   0.61 [4]
rcv1v2 subset3      0.82 [1]   0.05 [5]   0.81 [2]   0.41 [4]   0.58 [3]
rcv1v2 subset4      0.82 [1]   0.05 [5]   0.81 [2]   0.59 [3]   0.59 [3]
rcv1v2 subset5      0.81 [1]   0.05 [5]   0.80 [2]   0.57 [4]   0.61 [3]
imageclef07a        0.89 [1]   0.01 [5]   0.89 [1]   0.77 [3]   0.65 [4]
imageclef07d        0.86 [1]   0.19 [5]   0.86 [1]   0.82 [3]   0.65 [4]
delicious           0.47 [3]   0.08 [5]   0.50 [2]   0.34 [4]   0.53 [1]
enron               0.72 [1]   0.37 [5]   0.72 [1]   0.67 [4]   0.68 [3]
wipo                0.78 [1]   0.36 [5]   0.78 [1]   0.52 [4]   0.71 [3]
caltech-101         0.77 [2]   0.73 [3]   0.78 [1]   0.52 [4]   -
seq (funcat)        0.29 [1]   0.00 [5]   0.29 [1]   0.16 [4]   0.27 [3]
pheno (funcat)      0.24 [1]   0.02 [5]   0.20 [2]   0.09 [4]   0.18 [3]
struc (funcat)      0.23 [1]   0.00 [5]   0.23 [1]   0.03 [4]   0.20 [3]
hom (funcat)        0.31 [2]   0.01 [5]   0.33 [1]   0.04 [4]   0.23 [3]
cellcycle (funcat)  0.24 [2]   0.01 [5]   0.26 [1]   0.13 [4]   0.19 [3]
church (funcat)     0.19 [2]   0.01 [5]   0.19 [2]   0.06 [4]   0.21 [1]
derisi (funcat)     0.21 [2]   0.01 [5]   0.21 [2]   0.13 [4]   0.22 [1]
eisen (funcat)      0.32 [2]   0.01 [5]   0.35 [1]   0.20 [4]   0.29 [3]
gasch1 (funcat)     0.30 [2]   0.01 [5]   0.33 [1]   0.17 [4]   0.29 [3]
gasch2 (funcat)     0.28 [1]   0.01 [5]   0.28 [1]   0.11 [4]   0.24 [3]
MLNP on Label DAGs
(H-SVM, BCC and CLUS-HMC are used with the MetaLabeler)

data set         MAS        H-SVM      BCC        CLUS-HMC
seq (GO)         0.61 [1]   0.55 [3]   0.48 [4]   0.57 [2]
pheno (GO)       0.61 [1]   0.60 [2]   0.58 [3]   0.55 [4]
struc (GO)       0.53 [2]   0.47 [3]   0.45 [4]   0.60 [1]
hom (GO)         0.63 [1]   0.56 [3]   0.52 [4]   0.62 [2]
cellcycle (GO)   0.55 [1]   0.50 [3]   0.21 [5]   0.50 [3]
church (GO)      0.55 [1]   0.49 [3]   0.26 [4]   0.54 [2]
derisi (GO)      0.53 [1]   0.49 [2]   0.36 [4]   0.47 [3]
eisen (GO)       0.60 [1]   0.53 [2]   0.49 [4]   0.53 [2]
gasch1 (GO)      0.62 [1]   0.54 [2]   0.49 [3]   0.46 [4]
gasch2 (GO)      0.54 [1]   0.49 [3]   0.41 [4]   0.50 [2]
spo (GO)         0.50 [1]   0.47 [3]   0.48 [2]   0.46 [4]
expr (GO)        0.59 [1]   0.54 [4]   0.50 [5]   0.55 [3]
Validating the NAP Assumption
use brute-force search to find the best k-leaf-sparse prediction
check if it includes the best (k − 1)-leaf-sparse prediction
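A brute-force check of this kind could look as follows (an illustrative sketch only; `score` stands for the posterior/composite-likelihood value of the multilabel induced by a candidate leaf set, and is a hypothetical helper):

```python
from itertools import combinations

def best_k_leaf_set(leaves, score, k):
    """Brute force: the k-subset of leaf nodes whose induced multilabel scores highest."""
    return set(max(combinations(leaves, k), key=lambda c: score(frozenset(c))))

def satisfies_nap(leaves, score, k):
    """NAP check: does the best k-leaf-sparse prediction contain the best (k-1)-leaf-sparse one?
    (For MLNP, the node set is the ancestor closure of the leaf set, so containment of the
    leaf sets implies containment of the full predictions.)"""
    return best_k_leaf_set(leaves, score, k - 1) <= best_k_leaf_set(leaves, score, k)
```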
% test patterns satisfying NAP at different k:
(plots: % of test instances satisfying NAP vs. k, for pheno (FunCat), pheno (GO) and church (GO); y-axis range 90-100%)
NAP holds almost 100% of the time
Multitask Learning
Multitask Learning (MTL)
Example
rating of products: each customer is a task
handwritten digit recognition: each digit is a task
Learn all tasks simultaneously, instead of separately
different tasks share some information
better to learn them together, especially when there are insufficient training samples
Notations
T tasks
task $t$: training samples $(\mathbf{x}^{(t)}_1, y^{(t)}_1), \dots, (\mathbf{x}^{(t)}_{n_t}, y^{(t)}_{n_t})$
$$\mathbf{y}^{(t)} = \mathbf{X}^{(t)} \mathbf{w}_t + \mathbf{e}$$
$\mathbf{X}^{(t)} \equiv [\mathbf{x}^{(t)}_1, \dots, \mathbf{x}^{(t)}_{n_t}]^T$, $\mathbf{y}^{(t)} \equiv [y^{(t)}_1, \dots, y^{(t)}_{n_t}]^T$
$\mathbf{w}_t$: weight vector of task $t$
Given a set of T tasks, how to learn them together?
how to estimate $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_T]$?
Popular Approaches
Pooling
pool all the tasks together, and treat them as one single task
w1 = w2 = · · · = wT
Regularized MTL
all tasks are close to some model w0
$$\mathbf{w}_t = \underbrace{\mathbf{w}_0}_{\text{common}} + \underbrace{\mathbf{v}_t}_{\text{task-specific}}$$
Learning with outlier tasks: robust MTL
assume: there are some outlying wt ’s
Task Clustering Structure
Tasks have structure!
Learning with clustered tasks: clustered MTL
assume: wt ’s form several clusters
(figures: clustered MTL, pooling/regularized MTL, robust MTL)
how many clusters?
all features have the same task clustering structure
Good?
More Complicated Task Clustering Structure?
Example (movie recommendation)
different features may have different task clustering structures
Flexible Task Clustering Structure
decompose wt as ut + vt
$\mathbf{u}_t$: clustering center at each feature
$\mathbf{v}_t$: variations specific to each task
$\mathbf{U} = [\mathbf{u}_1, \dots, \mathbf{u}_T]$, $\mathbf{V} = [\mathbf{v}_1, \dots, \mathbf{v}_T]$
Flexible Task-Clustered MTL (FlexTClus)
$$\min_{\mathbf{U},\mathbf{V}}\ \underbrace{\sum_{t=1}^T \|\mathbf{y}^{(t)} - \mathbf{X}^{(t)}(\mathbf{u}_t + \mathbf{v}_t)\|^2}_{\text{square loss}} + \lambda_1\|\mathbf{U}\|_{\text{clus}} + \underbrace{\lambda_2\|\mathbf{U}\|_F^2 + \lambda_3\|\mathbf{V}\|_F^2}_{\text{ridge regularizers}}$$
$$\|\mathbf{U}\|_{\text{clus}} = \sum_{d=1}^D \sum_{i<j} |U_{di} - U_{dj}|$$
for each feature: pairwise difference in the parameters of tasks $i$ and $j$
tries to group the parameters of tasks $i$ and $j$ together (cf. k-means clustering)
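A tiny numeric sketch of this clustered regularizer (illustrative only):

```python
import numpy as np

def clus_norm(U):
    """||U||_clus = sum over features d, and over task pairs i<j, of |U[d,i] - U[d,j]|."""
    D, T = U.shape
    total = 0.0
    for d in range(D):
        for i in range(T):
            for j in range(i + 1, T):
                total += abs(U[d, i] - U[d, j])
    return total

U = np.array([[1.0, 1.0, 3.0],      # feature 1: tasks 1 and 2 share a value
              [0.5, 2.0, 2.0]])     # feature 2: tasks 2 and 3 share a value
print(clus_norm(U))                 # 0 + 2 + 2 + 1.5 + 1.5 + 0 = 7.0
```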
Special Cases
$$\min_{\mathbf{U},\mathbf{V}}\ \sum_{t=1}^T \|\mathbf{y}^{(t)} - \mathbf{X}^{(t)}(\mathbf{u}_t + \mathbf{v}_t)\|^2 + \lambda_1\|\mathbf{U}\|_{\text{clus}} + \lambda_2\|\mathbf{U}\|_F^2 + \lambda_3\|\mathbf{V}\|_F^2$$
Special cases
$\lambda_1 = \lambda_2 = \lambda_3 = 0$: independent least squares regression on each task
λ1 =∞: regularized MTL
λ1 = 0: independent ridge regression on each task
$\lambda_2 \ne 0$, $\lambda_3 = 0$: independent least squares regression on each task
Properties
Convergence to ground truth
With high probability,
$$\max_{d=1,\dots,D}\ \max_{t=1,\dots,T} |W_{dt} - W^*_{dt}| \le O\!\left(\frac{1}{\sqrt{n}}\right)$$
$\mathbf{W}^*$: ground truth; $n$: number of samples
U captures the clustering structure at feature level
With high probability, and for sufficiently large n,
($i, j$ in the same cluster) $W^*_{di} = W^*_{dj} \leftrightarrow U_{di} = U_{dj}$
($i, j$ in different clusters) $|W^*_{di} - W^*_{dj}| \ge \rho \leftrightarrow U_{di} \ne U_{dj}$
Optimization
$$\min_{\mathbf{U},\mathbf{V}}\ \sum_{t=1}^T \|\mathbf{y}^{(t)} - \mathbf{X}^{(t)}(\mathbf{u}_t + \mathbf{v}_t)\|^2 + \lambda_1\|\mathbf{U}\|_{\text{clus}} + \lambda_2\|\mathbf{U}\|_F^2 + \lambda_3\|\mathbf{V}\|_F^2$$
How to solve this optimization problem?
Back to basics! Gradient descent
$$\frac{1}{m}\sum_{i=1}^m \ell(\mathbf{w}; \mathbf{x}_i, y_i) + \lambda\Omega(\mathbf{w})$$
LOOP:
1 find descent direction
2 choose stepsize
3 descent
Gradient Descent
Advantages
easy to implement
low per-iteration complexity → good scalability (big data)
Disadvantage
uses first-order (gradient) information
slow convergence rate → may require a large number of iterations
Accelerated Gradient Methods
First developed by Nesterov in 1983
for smooth optimization
minβ f (β) (f is smooth in β)
Problem: ` and/or Ω may be nonsmooth
SVM: hinge loss (nonsmooth) + $\|\mathbf{w}\|_2^2$ (smooth)
lasso: square loss (smooth) + $\|\mathbf{w}\|_1$ (nonsmooth)
Extension to composite optimization
objective has both smooth and nonsmooth components
$$\min_\beta\ \underbrace{f(\beta)}_{\text{smooth}} + \underbrace{r(\beta)}_{\text{nonsmooth}}$$
recently popular in machine learning
Accelerated Gradient Descent
Gradient descent
1 find descent direction
2 choose stepsize
3 descent
FISTA [Beck & Teboulle, 2009]
$$Q(\beta, \beta_t) \equiv (\beta - \beta_t)^T \nabla f(\beta_t) + \frac{L}{2}\|\beta - \beta_t\|^2 + r(\beta)$$
$L$: Lipschitz constant of $\nabla f$: $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \le L\|\mathbf{x} - \mathbf{y}\|$
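For illustration, a minimal generic FISTA sketch for such composite problems (my own sketch, not the FlexTClus-specific solver shown later; the lasso example at the end is purely illustrative):

```python
import numpy as np

def fista(grad_f, prox_r, L, beta0, n_iter=100):
    """min_beta f(beta) + r(beta): grad_f gives the gradient of the smooth part f,
    prox_r(z, step) solves argmin_b 0.5*||b - z||^2/step + r(b), L is a Lipschitz
    constant of grad_f."""
    beta = beta0.copy()
    y = beta0.copy()                               # extrapolation point
    tau = 1.0
    for _ in range(n_iter):
        beta_new = prox_r(y - grad_f(y) / L, 1.0 / L)            # proximal gradient step at y
        tau_new = (1.0 + np.sqrt(1.0 + 4.0 * tau ** 2)) / 2.0
        y = beta_new + ((tau - 1.0) / tau_new) * (beta_new - beta)   # momentum
        beta, tau = beta_new, tau_new
    return beta

# illustrative use: lasso, f = 0.5*||X b - y||^2 (smooth), r = lam*||b||_1 (nonsmooth)
rng = np.random.default_rng(0)
X, yv = rng.standard_normal((50, 20)), rng.standard_normal(50)
lam = 0.1
L = np.linalg.norm(X, 2) ** 2                                    # spectral norm squared
grad = lambda b: X.T @ (X @ b - yv)
prox = lambda z, step: np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft-threshold
beta_hat = fista(grad, prox, L, np.zeros(20))
```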
Convergence
Let $F(\beta) = f(\beta) + r(\beta)$
Fast convergence
$$F(\beta_t) - \underbrace{F(\beta^*)}_{\text{optimal obj value}} \le \frac{2L\|\beta_0 - \beta^*\|^2}{(t+1)^2}$$
optimal convergence rate
to obtain an $\epsilon$-optimal solution $F(\beta_t) - F(\beta^*) \le \epsilon$, needs $O(1/\sqrt{\epsilon})$ iterations
cf. gradient descent: convergence rate is $O(1/t)$ for smooth objectives
Optimization via FISTA
$$\min_{\mathbf{U},\mathbf{V}}\ \underbrace{\sum_{t=1}^T \|\mathbf{y}^{(t)} - \mathbf{X}^{(t)}(\mathbf{u}_t + \mathbf{v}_t)\|^2}_{f(\mathbf{U},\mathbf{V})} + \underbrace{\lambda_1\|\mathbf{U}\|_{\text{clus}} + \lambda_2\|\mathbf{U}\|_F^2 + \lambda_3\|\mathbf{V}\|_F^2}_{r(\mathbf{U},\mathbf{V})}$$
1: Initialize $\mathbf{U}_1$, $\mathbf{V}_1$, $\tau_1 \leftarrow 1$.
2: for $k = 1, 2, \dots, N-1$ do
3: Compute $\bar{\mathbf{U}} = \mathbf{U}_k - \frac{1}{L_k}\partial_{\mathbf{U}} f(\mathbf{U}_k, \mathbf{V}_k)$, $\bar{\mathbf{V}} = \mathbf{V}_k - \frac{1}{L_k}\partial_{\mathbf{V}} f(\mathbf{U}_k, \mathbf{V}_k)$.
4: $\mathbf{U}_k \leftarrow \arg\min_{\mathbf{U}} \|\mathbf{U} - \bar{\mathbf{U}}\|_F^2 + \frac{2\lambda_1}{L_k}\|\mathbf{U}\|_{\text{clus}} + \frac{2\lambda_2}{L_k}\|\mathbf{U}\|_F^2$ (can be solved in $O(T\log T)$ time)
5: $\mathbf{V}_k \leftarrow \left[\frac{\bar{v}_{ij}}{1 + 2\lambda_3/L_k}\right]$.
6: $\tau_{k+1} \leftarrow \frac{1 + \sqrt{1 + 4\tau_k^2}}{2}$.
7: $\begin{bmatrix}\mathbf{U}_{k+1}\\ \mathbf{V}_{k+1}\end{bmatrix} \leftarrow \begin{bmatrix}\mathbf{U}_k\\ \mathbf{V}_k\end{bmatrix} + \frac{\tau_k - 1}{\tau_{k+1}}\left(\begin{bmatrix}\mathbf{U}_k\\ \mathbf{V}_k\end{bmatrix} - \begin{bmatrix}\mathbf{U}_{k-1}\\ \mathbf{V}_{k-1}\end{bmatrix}\right)$.
8: end for
9: Output $\mathbf{U}_N$.
complexity: $O\!\left(\frac{1}{\sqrt{\epsilon}}(TDn + DT\log T)\right)$
Experiment: Synthetic Data Sets
(C1) all tasks are independent
(C2) all tasks are from the same cluster
(C3) same as C2, but with corrupted features
(C4) a main task cluster plus a few outlier tasks
(C5) tasks in overlapping groups
(C6) all but the last two features are from a common cluster
clusters obtained are close to the ground truth
NMSE on Synthetic Data Sets
the proposed model outperforms existing methods when the task clusters are not well formed (C3, C5 and C6)
Examination Score Prediction
consists of examination records from 139 secondary schools in years 1985, 1986 and 1987
each task is to predict exam scores for students in one school
inputs: year of the exam, gender, school gender, school denomination, etc.
(plots: NMSE with 10% and with 20% of samples for training)
FlexTClus has very competitive NMSE
task clustering structure: only one underlying task cluster(consistent with previous studies)
Rating of Products
ratings of 201 students (tasks) on 20 different personal computers, each described by 13 attributes
(plot: root mean squared error)
Task Clustering Structure
one main cluster for the first 12 attributes (related to performance & service)
lots of varied opinions on price
Handwritten Digit Recognition
10-class classification problem → 10 one-vs-rest problems
PCA to reduce the dimensionality to 64
FlexTClus consistently has the lowest classification error
task clustering structures for the leading PCA features are very different from those of the trailing PCA features
Conclusion
Structure is everywhere
Hierarchical multilabel classification
(label) structure is known
considered both mandatory and non-mandatory leaf node predictions
algorithms are efficient
can be used on label hierarchies of both trees and DAGs
Multitask learning
(task clustering) structure is unknown
captures task structures at the feature level
can be solved efficiently by an accelerated proximal method
better accuracy; and the obtained task structures agree with the known/plausible properties of the data