
Efficient Multitask Feature and Relationship Learning

Han Zhao, Otilia Stretcu, Renato Negrinho, †Alex Smola, and Geoff Gordon
{han.zhao, ostretcu, negrinho, ggordon}@cs.cmu.edu, †[email protected]
Machine Learning Department, Carnegie Mellon University; †Amazon

Motivation

Multitask learning: jointly learn several related linear prediction tasks from shared inputs, e.g., predicting both {human, dog} and {male, female} labels from the same image, with one weight vector wᵢ per task.

• Multiple linear regression models, one per task
• Weight matrix W: rows = features, columns = tasks (consistent with W ∈ ℝ^{d×m} below)
• Goals:
  – Learn multiple tasks jointly
  – Achieve better generalization with less data
  – Discover correlations between tasks and between features

(Diagram: the weight matrix W, with one axis indexed by tasks and the other by features.)

Formulation

Empirical Bayes with a matrix-variate normal prior:

    W | ξ, Ω₁, Ω₂ ∼ ∏_{i=1}^m N(wᵢ | 0, ξᵢ I_d) · MN_{d×m}(W | 0_{d×m}, Ω₁, Ω₂)

• MN_{d×m}(W | 0_{d×m}, Ω₁, Ω₂) is the matrix-variate normal distribution
• Ω₁ ∈ S^d₊₊: covariance matrix over features
• Ω₂ ∈ S^m₊₊: covariance matrix over tasks
• W ∈ ℝ^{d×m}: weight matrix

Maximum marginal likelihood with empirical estimators:

    minimize_{W, Σ₁, Σ₂}  ‖Y − XW‖²_F + η‖W‖²_F + ρ‖Σ₁^{1/2} W Σ₂^{1/2}‖²_F − ρ(m log|Σ₁| + d log|Σ₂|)
    subject to  l·I_d ⪯ Σ₁ ⪯ u·I_d,  l·I_m ⪯ Σ₂ ⪯ u·I_m

• Σ₁ := Ω₁⁻¹ and Σ₂ := Ω₂⁻¹ (the precision matrices)
• The objective is multi-convex in W, Σ₁, Σ₂: convex in each block when the other two are held fixed
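To make the objective concrete, here is a minimal NumPy sketch that evaluates it (the function name fetr_objective and the argument layout are our own, not from the poster; the value is up to additive constants of the marginal likelihood):

```python
import numpy as np

def fetr_objective(W, S1, S2, X, Y, eta, rho):
    """Regularized objective in (W, Sigma_1, Sigma_2), up to additive constants.

    W  : (d, m) weight matrix      S1 : (d, d) feature precision Sigma_1
    S2 : (m, m) task precision     X  : (n, d) inputs, Y : (n, m) targets
    """
    fit = np.linalg.norm(Y - X @ W, "fro") ** 2
    ridge = eta * np.linalg.norm(W, "fro") ** 2
    # ||S1^{1/2} W S2^{1/2}||_F^2 = tr(S1 W S2 W^T); avoids matrix square roots
    couple = rho * np.trace(S1 @ W @ S2 @ W.T)
    logdet = rho * (W.shape[1] * np.linalg.slogdet(S1)[1]
                    + W.shape[0] * np.linalg.slogdet(S2)[1])
    return fit + ridge + couple - logdet
```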

Optimization Algorithm

Solvers for W when Σ₁, Σ₂ are fixed:

    minimize_W  h(W) := ‖Y − XW‖²_F + η‖W‖²_F + ρ‖Σ₁^{1/2} W Σ₂^{1/2}‖²_F

Three different solvers (a runnable sketch of all three follows the list):

• A closed-form solution with O(m³d³ + mnd²) complexity:

    vec(W*) = (I_m ⊗ (XᵀX) + η·I_{md} + ρ·Σ₂ ⊗ Σ₁)⁻¹ vec(XᵀY)

• Conjugate gradient descent, using the gradient

    ∇_W h(W) = 2(Xᵀ(XW − Y) + ηW + ρΣ₁WΣ₂),

  with O(√κ log(1/ε)(m²d + md²)) complexity, where κ is the condition number and ε the approximation accuracy.

• A Sylvester equation AW + WB = C, solved by the Bartels–Stewart algorithm. The first-order optimality condition

    Σ₁⁻¹(XᵀX + η·I_d)W + W(ρΣ₂) = Σ₁⁻¹XᵀY

  has exactly this form, so an exact solution for W is computable in O(m³ + d³ + nd²) time.
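The poster names the solvers but ships no code; the following NumPy/SciPy sketch implements all three (function names are ours). Conjugate gradient is applicable because the system matrix I_m ⊗ XᵀX + ηI + ρΣ₂ ⊗ Σ₁ is symmetric positive definite whenever η > 0 and Σ₁, Σ₂ are positive definite.

```python
import numpy as np
from scipy.linalg import solve_sylvester
from scipy.sparse.linalg import LinearOperator, cg

def solve_w_closed_form(X, Y, S1, S2, eta, rho):
    # vec(W*) = (I_m (x) X^T X + eta I_md + rho S2 (x) S1)^{-1} vec(X^T Y).
    # Column-major vec, so (S2 (x) S1) vec(W) = vec(S1 W S2). Forming the
    # md x md system is the O(m^3 d^3) bottleneck.
    d, m = X.shape[1], Y.shape[1]
    A = np.kron(np.eye(m), X.T @ X) + eta * np.eye(m * d) + rho * np.kron(S2, S1)
    w = np.linalg.solve(A, (X.T @ Y).reshape(-1, order="F"))
    return w.reshape((d, m), order="F")

def solve_w_cg(X, Y, S1, S2, eta, rho):
    # Matrix-free CG: never forms the md x md system, only applies
    # W |-> X^T X W + eta W + rho S1 W S2 at O(m^2 d + m d^2) per step.
    d, m = X.shape[1], Y.shape[1]
    XtX = X.T @ X
    def matvec(w):
        W = w.reshape((d, m), order="F")
        return (XtX @ W + eta * W + rho * S1 @ W @ S2).reshape(-1, order="F")
    op = LinearOperator((d * m, d * m), matvec=matvec)
    w, _ = cg(op, (X.T @ Y).reshape(-1, order="F"))
    return w.reshape((d, m), order="F")

def solve_w_sylvester(X, Y, S1, S2, eta, rho):
    # Bartels-Stewart on  S1^{-1}(X^T X + eta I) W + W (rho S2) = S1^{-1} X^T Y.
    d = X.shape[1]
    A = np.linalg.solve(S1, X.T @ X + eta * np.eye(d))
    return solve_sylvester(A, rho * S2, np.linalg.solve(S1, X.T @ Y))
```

On small random problems the three functions return the same W up to numerical tolerance, so they can be sanity-checked against one another.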

Solvers for Σ₁ and Σ₂ when W is fixed:

    minimize_{Σ₁}  tr(Σ₁WΣ₂Wᵀ) − m log|Σ₁|,  subject to  l·I_d ⪯ Σ₁ ⪯ u·I_d
    minimize_{Σ₂}  tr(Σ₁WΣ₂Wᵀ) − d log|Σ₂|,  subject to  l·I_m ⪯ Σ₂ ⪯ u·I_m

Both admit an exact solution via a reduction to minimum-weight perfect matching between the eigenvalues of the variable and those of the fixed quadratic term:

(Diagram: candidate bipartite perfect matchings between eigenvalues λ₁, λ₂, …, λ_d and ν₁, ν₂, …, ν_d.)

Algorithms (see the sketch after this list):

Update for Σ₁ — Input: W, Σ₂, and l, u.
  1: [V, ν] ← SVD(WΣ₂Wᵀ)
  2: λ ← T_{[l,u]}(m/ν)
  3: Σ₁ ← V diag(λ) Vᵀ

Update for Σ₂ — Input: W, Σ₁, and l, u.
  1: [V, ν] ← SVD(WᵀΣ₁W)
  2: λ ← T_{[l,u]}(d/ν)
  3: Σ₂ ← V diag(λ) Vᵀ

• The exact solution requires only a single SVD
• Time complexity: O(max{dm², md²})
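A minimal NumPy sketch of the two updates. It assumes T_{[l,u]} is entrywise truncation (clipping) to [l, u], and uses a symmetric eigendecomposition in place of the SVD, which is equivalent here since WΣ₂Wᵀ and WᵀΣ₁W are symmetric PSD; the guard against ν ≈ 0 is our own addition (clipping then maps the blow-up to u).

```python
import numpy as np

def update_sigma1(W, S2, l, u):
    # One eigendecomposition of W S2 W^T, then lambda_i = T_[l,u](m / nu_i).
    m = W.shape[1]
    nu, V = np.linalg.eigh(W @ S2 @ W.T)
    lam = np.clip(m / np.maximum(nu, 1e-12), l, u)  # guard nu ~ 0
    return (V * lam) @ V.T                          # V diag(lam) V^T

def update_sigma2(W, S1, l, u):
    # Symmetric counterpart: eigendecomposition of W^T S1 W, lambda_i = T_[l,u](d / nu_i).
    d = W.shape[0]
    nu, V = np.linalg.eigh(W.T @ S1 @ W)
    lam = np.clip(d / np.maximum(nu, 1e-12), l, u)
    return (V * lam) @ V.T
```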

Experiments

Convergence analysis:

(Figure: running times of the Closed, Gradient, and Sylvester solvers for W on synthetic problems across the grid d, m ∈ {10, 100, 1000}.)

(Figure: Coordinate Minimization vs. Projected Gradient Descent; log function value against time in seconds, over 0–50 s.)
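The coordinate-minimization curve corresponds to alternating the exact block updates. A minimal outer loop, reusing the helper functions sketched above (identity initialization assumes l ≤ 1 ≤ u so the starting point is feasible):

```python
import numpy as np

def fetr_coordinate_minimization(X, Y, eta, rho, l, u, n_iters=50):
    # Alternate exact minimization over W, Sigma_1, Sigma_2. Each subproblem
    # is convex, so the multi-convex objective decreases monotonically.
    d, m = X.shape[1], Y.shape[1]
    S1, S2 = np.eye(d), np.eye(m)   # feasible when l <= 1 <= u
    for _ in range(n_iters):
        W = solve_w_sylvester(X, Y, S1, S2, eta, rho)  # cheapest W-solver above
        S1 = update_sigma1(W, S2, l, u)
        S2 = update_sigma2(W, S1, l, u)
    return W, S1, S2
```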

• Synthetic data:
  – The closed-form solution does not scale once md ≥ 10⁴.
• Robot (SARCOS) data:
  – d = 21 (7 joint positions, 7 joint velocities, 7 joint accelerations), m = 7 (7 joint torques).
  – #Train/#Test = 44,484/4,449 instances.
• School data:
  – d = 27, m = 139, n = 15,362 instances.
  – Goal: predict students' exam scores.

Feature covariance matrix and task covariance matrix learned by FETR:

(a) Covariance matrix over features — heatmap titled "FETR feature correlation matrix" (21 × 21, color scale −0.8 to 0.8).
(b) Covariance matrix over tasks — heatmap titled "FETR task correlation matrix" (7 × 7, color scale −0.8 to 0.8).

Results on SARCOS (test error on each of the 7 torque-prediction tasks) and School (MNMSE ± std. dev.):

                     SARCOS                                        School
Method    1st     2nd     3rd    4th     5th    6th    7th        MNMSE
STL       31.40   22.90   9.13   10.30   0.14   0.84   0.46   0.9882 ± 0.0196
MTFL      31.41   22.91   9.13   10.33   0.14   0.83   0.45   0.8891 ± 0.0380
MTRL      31.09   22.69   9.08    9.74   0.14   0.83   0.44   0.9007 ± 0.0407
SPARSE    31.13   22.60   9.10    9.74   0.13   0.83   0.45   0.8451 ± 0.0197
FETR      31.08   22.68   9.08    9.73   0.13   0.83   0.43   0.8134 ± 0.0253
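For reference, MNMSE here reads as mean normalized mean squared error. A small sketch under the usual convention in the multitask-learning literature (per-task MSE divided by the variance of that task's ground-truth targets, averaged over tasks) — the convention is our assumption, not stated on the poster:

```python
import numpy as np

def mean_normalized_mse(Y_true, Y_pred):
    # Per-task (per-column) MSE normalized by the target variance, then averaged.
    mse = np.mean((Y_true - Y_pred) ** 2, axis=0)
    return float(np.mean(mse / np.var(Y_true, axis=0)))
```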