Outline - sr3.t.u-tokyo.ac.jp

† ‡ (‡ JST)

December 24, 2015

1 / 95

Outline

1

2

3

n ≫ p
n ≪ p

4

5

2 / 95

d = 10000, n = 1000

3 / 95

A brief history:

1992: Donoho and Johnstone — wavelet shrinkage (soft-thresholding)
1996: Tibshirani — Lasso
2000: Knight and Fu — asymptotics of the Lasso (n ≫ p)
2006: Candes and Tao, Donoho — compressed sensing (p ≫ n)
2009: Bickel et al., Zhang — theoretical analysis of the Lasso (p ≫ n)
2013: van de Geer et al., Lockhart et al. — statistical inference for the Lasso (p ≫ n)

See also the survey on L1 regularization in IEICE Fundamentals Review (2010).

4 / 95

Outline

1

2

3

n ≫ p
n ≪ p

4

5

5 / 95

6 / 95


Lasso: R. Tibshirani (1996). Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc. B, Vol. 58, No. 1, pages 267–288.

14,728 citations (as of December 23, 2015)

7 / 95

Data: X = (X_ij) ∈ R^{n×p}, Y = (Y_i) ∈ R^n.

n: sample size; p (number of features) ≫ n. β∗ ∈ R^p: true coefficient vector, with only d nonzero components (sparse).

Model: Y = Xβ∗ + ε, i.e. Y_i = Σ_{j=1}^p X_ij β∗_j + ε_i (i = 1, …, n).

Goal: estimate the d-sparse β∗ from the observations (Y, X).

8 / 95

Mallows' Cp, AIC:

β̂_MC = argmin_{β∈R^p} ‖Y − Xβ‖² + 2σ²‖β‖₀,

where ‖β‖₀ = |{j | β_j ≠ 0}|. The search runs over 2^p support patterns and is NP-hard in general.

8 / 95
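As a concrete illustration of why the ℓ0-penalized objective is combinatorial, the following minimal Python sketch enumerates all 2^p support patterns for a tiny p and picks the one minimizing the Mallows' Cp-type criterion ‖Y − Xβ‖² + 2σ²‖β‖₀ (function and variable names are my own, not from the slides):

```python
import itertools
import numpy as np

def best_subset_cp(X, Y, sigma2):
    """Exhaustive l0-penalized least squares (feasible only for tiny p)."""
    n, p = X.shape
    best_score, best_beta = np.inf, np.zeros(p)
    for k in range(p + 1):
        for support in itertools.combinations(range(p), k):
            beta = np.zeros(p)
            if support:
                S = list(support)
                # least squares restricted to the candidate support
                beta[S] = np.linalg.lstsq(X[:, S], Y, rcond=None)[0]
            score = np.sum((Y - X @ beta) ** 2) + 2 * sigma2 * k
            if score < best_score:
                best_score, best_beta = score, beta
    return best_beta

rng = np.random.default_rng(0)
n, p, sigma = 50, 8, 0.5          # 2**8 = 256 candidate supports already
X = rng.standard_normal((n, p))
beta_star = np.array([1.0, -2.0, 0, 0, 0, 0, 0, 0])
Y = X @ beta_star + sigma * rng.standard_normal(n)
print(best_subset_cp(X, Y, sigma**2).round(2))
```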

http://www.astroml.org/sklearn_tutorial/practical.html

y = b + β₁x + β₂x² + ⋯ + β_d x^d + ε

9 / 95

Lasso

Mallows' Cp: β̂_MC = argmin_{β∈R^p} ‖Y − Xβ‖² + 2σ²‖β‖₀.

Difficulty: the ‖β‖₀ term is non-convex and combinatorial.

Lasso [L1 regularization]:

β̂_Lasso = argmin_{β∈R^p} ‖Y − Xβ‖² + λ‖β‖₁,   where ‖β‖₁ = Σ_{j=1}^p |β_j|.

The L1 norm is the convex envelope of the L0 penalty on [−1, 1]^p (tightest convex relaxation).
The L1 norm is also related to the Lovász extension (of the cardinality function).

10 / 95

Lasso

Special case p = n, X = I:

β̂_Lasso = argmin_{β∈R^p} (1/2)‖Y − β‖² + C‖β‖₁

⇒ β̂_Lasso,i = argmin_{b∈R} (1/2)(y_i − b)² + C|b|
            = sign(y_i)(|y_i| − C)   if |y_i| > C,
              0                      if |y_i| ≤ C.

Small observations are shrunk exactly to 0.

11 / 95
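The closed form above is the soft-thresholding operator. A minimal NumPy sketch (names are mine, not from the slides):

```python
import numpy as np

def soft_threshold(y, C):
    """Elementwise soft-thresholding: sign(y) * max(|y| - C, 0)."""
    return np.sign(y) * np.maximum(np.abs(y) - C, 0.0)

y = np.array([3.0, -0.5, 1.2, -2.0])
print(soft_threshold(y, 1.0))   # [ 2.  -0.   0.2 -1. ]
```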

Lasso

β̂ = argmin_{β∈R^p} (1/n)‖Xβ − Y‖₂² + λ_n Σ_{j=1}^p |β_j|.

12 / 95

β̂ = argmin_{β∈R^p} (1/n)‖Xβ − Y‖₂² + λ_n Σ_{j=1}^p |β_j|.

Theorem (estimation error of the Lasso)

There is a constant C such that, with high probability (under appropriate conditions),

‖β̂ − β∗‖₂² ≤ C d log(p)/n.

The dimension p enters only through log(p); the rate is driven by the sparsity d.

13 / 95

Y = Xβ + ε,   n = 1,000, p = 10,000, d = 500.

[Figure: estimated coefficients plotted against the coefficient index (1 to 10,000); the panels successively show the true coefficients ("True"), the Lasso estimate ("Lasso"), and the least-squares estimate ("LeastSquares").]

14 / 95
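A sketch of how such an experiment can be reproduced (the use of scikit-learn, the choice of λ, and all names are my own assumptions, not prescribed by the slides):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, d = 1000, 10000, 500                 # dimensions as on the slide
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[rng.choice(p, d, replace=False)] = 3 * rng.standard_normal(d)
Y = X @ beta_star + rng.standard_normal(n)

# Lasso with a lambda of order sqrt(log(p)/n)
lam = 2.0 * np.sqrt(np.log(p) / n)
beta_lasso = Lasso(alpha=lam, max_iter=5000).fit(X, Y).coef_

# Minimum-norm least squares (p >> n, so the system is underdetermined)
beta_ls = np.linalg.lstsq(X, Y, rcond=None)[0]

for name, b in [("Lasso", beta_lasso), ("LeastSquares", beta_ls)]:
    print(name, "||b - beta*||^2 =", round(float(np.sum((b - beta_star) ** 2)), 2))
```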


Outline

1

2

3

n ≫ p
n ≪ p

4

5

15 / 95

Lasso

Lasso:

min_{β∈R^p} (1/n) Σ_{i=1}^n (y_i − x_i^⊤β)² + ‖β‖₁  (the last term is the regularization).

More generally, regularized empirical risk minimization:

min_{β∈R^p} (1/n) Σ_{i=1}^n ℓ(z_i, β) + ψ(β).

Various structured regularizers beyond L1 can be used for ψ.

16 / 95

L1

17 / 95

Group regularization (group Lasso penalty):

ψ(β) = C Σ_{g∈G} ‖β_g‖

(whole groups of coefficients are set to 0 together)

18 / 95
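For non-overlapping groups the proximal operator of this penalty is blockwise soft-thresholding; a minimal sketch (the group layout and names are my own choices):

```python
import numpy as np

def prox_group_lasso(q, groups, C):
    """Blockwise soft-thresholding: shrink each group's norm by C, or zero it out."""
    out = np.zeros_like(q)
    for g in groups:                       # each g is a list of indices
        v = q[g]
        norm = np.linalg.norm(v)
        if norm > C:
            out[g] = (1.0 - C / norm) * v
    return out

q = np.array([3.0, 4.0, 0.1, -0.2, 1.0])
groups = [[0, 1], [2, 3], [4]]
print(prox_group_lasso(q, groups, C=1.0))   # second and third groups vanish
```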

Multi-task learning (Lounici et al. (2009))

T regression tasks:

y_i^(t) ≈ x_i^(t)⊤ β^(t)   (i = 1, …, n^(t), t = 1, …, T).

min_{β^(t)} Σ_{t=1}^T Σ_{i=1}^{n^(t)} (y_i^(t) − x_i^(t)⊤ β^(t))² + C Σ_{k=1}^p ‖(β_k^(1), …, β_k^(T))‖

(the group penalty couples the k-th coefficient across the tasks β^(1), β^(2), …, β^(T))

19 / 95


Trace norm (nuclear norm)

For an M × N matrix W,

‖W‖_Tr = Tr[(WW^⊤)^{1/2}] = Σ_{j=1}^{min{M,N}} σ_j(W),

where σ_j(W) is the j-th singular value of W.

The trace norm is the "L1 norm of the singular values", so penalizing it induces low-rank solutions.

20 / 95

Example: collaborative filtering (matrix completion)

Observed ratings (rows/columns index users and items; * = missing):

      A  B  C  ···  X
  1   4  8  *  ···  2
  2   2  *  2  ···  *
  3   2  4  *  ···  *
  ...

Completed by a low-rank model:

      A  B  C  ···  X
  1   4  8  4  ···  2
  2   2  4  2  ···  1
  3   2  4  2  ···  1
  ...

(e.g., Srebro et al. (2005), NetFlix: Bennett and Lanning (2007))

Trace-norm regularized estimator:

min_W Σ_{(i,j)∈T} (Y_{i,j} − W_{i,j})² + λ‖W‖_Tr

21 / 95
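One standard way to solve this objective is proximal gradient descent, whose prox step is singular-value soft-thresholding; a minimal sketch (step size, problem sizes, and names are my own assumptions):

```python
import numpy as np

def svt(W, tau):
    """Prox of tau*||.||_Tr: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def matrix_completion(Y, mask, lam, step=0.4, iters=200):
    """Minimize sum over observed entries of (Y_ij - W_ij)^2 + lam*||W||_Tr."""
    W = np.zeros_like(Y)
    for _ in range(iters):
        grad = 2.0 * mask * (W - Y)          # gradient of the loss on observed entries
        W = svt(W - step * grad, step * lam)
    return W

rng = np.random.default_rng(0)
M, N, r = 30, 20, 2
truth = rng.standard_normal((M, r)) @ rng.standard_normal((r, N))
mask = rng.random((M, N)) < 0.5              # observe about half of the entries
Y = mask * truth
W_hat = matrix_completion(Y, mask, lam=1.0)
print("relative error:", np.linalg.norm(W_hat - truth) / np.linalg.norm(truth))
```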

[Figure: W∗ is a low-rank M × N matrix indexed by movies and users.]

Theoretical analyses:

Rademacher complexity: Srebro et al. (2005).
Compressed sensing approach: Candes and Tao (2009), Candes and Recht (2009).

22 / 95

Example: reduced-rank regression / multi-task structure learning

(Anderson, 1951, Burket, 1964, Izenman, 1975), (Argyriou et al., 2008)

[Figure: Y = X W∗ + noise, with Y of size n × N, X of size n × M, and W∗ a low-rank M × N coefficient matrix.]

23 / 95

Sparse inverse covariance estimation (graphical Lasso)

x_k ∼ N(0, Σ) (i.i.d., Σ ∈ R^{p×p}),   Σ̂ = (1/n) Σ_{k=1}^n x_k x_k^⊤.

Ŝ = argmin_{S ≻ O} { − log(det(S)) + Tr[SΣ̂] + λ Σ_{i,j=1}^p |S_{i,j}| }.

(Meinshausen and Bühlmann, 2006, Yuan and Lin, 2007, Banerjee et al., 2008)

Ŝ estimates the inverse covariance (precision) matrix Σ^{-1}.

Ŝ_{i,j} = 0 ⇔ X^(i) and X^(j) are conditionally independent given the remaining variables (for Gaussian data).

24 / 95
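This estimator is available, for instance, in scikit-learn; a minimal usage sketch (the chain-graph example and parameter values are my own choices):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
# A sparse precision matrix: chain graph on p variables
p, n = 10, 500
prec = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
cov = np.linalg.inv(prec)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

model = GraphicalLasso(alpha=0.05).fit(X)
S_hat = model.precision_
print("estimated nonzero pattern:\n", (np.abs(S_hat) > 1e-3).astype(int))
```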

Example: graph estimated from NASDAQ top-50 stock data.

(Data period: January 4, 2011 – December 31, 2014) (Lie Michael, Bachelor thesis)

25 / 95

Fused Lasso

ψ(β) = C Σ_{(i,j)∈E} |β_i − β_j|,   where E is the edge set of a graph.

(Tibshirani et al. (2005), Jacob et al. (2009))

[Figures: fused lasso on a graph (Tibshirani and Taylor '11); total-variation denoising (Chambolle '04)]

26 / 95

Non-convex penalties:

SCAD (Smoothly Clipped Absolute Deviation) (Fan and Li, 2001)

MCP (Minimax Concave Penalty) (Zhang, 2010)

Lq (q < 1), Bridge (Frank and Friedman, 1993)

27 / 95

Variants of L1 regularization

Adaptive Lasso (Zou, 2006)

With an initial estimator β̃,

ψ(β) = C Σ_{j=1}^p |β_j| / |β̃_j|^γ

(a reweighted Lasso; see the asymptotics below)

Sparse additive models (Hastie and Tibshirani, 1999, Ravikumar et al., 2009)

f(x) = Σ_{j=1}^p f_j(x_j),   f_j ∈ H_j (H_j: a reproducing kernel Hilbert space),

ψ(f) = C Σ_{j=1}^p ‖f_j‖_{H_j}

→ Group Lasso / Multiple Kernel Learning

28 / 95

Outline

1

2

3

n ≫ p
n ≪ p

4

5

29 / 95

Y = Xβ∗ + ε.

Y ∈ R^n: response vector, X ∈ R^{n×p}: design matrix, ε = [ε₁, …, ε_n]^⊤ ∈ R^n: noise vector.

30 / 95

n≫ p

31 / 95

Asymptotics of the Lasso (fixed p)

Assume p is fixed, n → ∞, and (1/n)X^⊤X →_p C ≻ O.
The noise ε_i has mean 0 and variance σ².

Theorem (asymptotic distribution of the Lasso (Knight and Fu, 2000))

If λ_n √n → λ₀ ≥ 0, then

√n(β̂ − β∗) →_d argmin_u V(u),

V(u) = u^⊤Cu − 2u^⊤W + λ₀ Σ_{j=1}^p [ u_j sign(β∗_j) 1(β∗_j ≠ 0) + |u_j| 1(β∗_j = 0) ],

where W ∼ N(0, σ²C).

Hence β̂ is √n-consistent.
For coordinates with β∗_j = 0, the estimate β̂_j equals 0 with positive probability in the limit.

Here β̂ = argmin_β (1/n)‖Y − Xβ‖² + λ_n Σ_{j=1}^p |β_j|.

32 / 95

Adaptive Lasso

With an initial (e.g. least squares) estimator β̃,

β̂ = argmin_β (1/n)‖Y − Xβ‖² + λ_n Σ_{j=1}^p |β_j| / |β̃_j|^γ.

Theorem (oracle property of the Adaptive Lasso (Zou, 2006))

If λ_n √n → 0 and λ_n n^{(1+γ)/2} → ∞, then

1. lim_{n→∞} P(Ĵ = J) = 1   (Ĵ := {j | β̂_j ≠ 0}, J := {j | β∗_j ≠ 0}),
2. √n(β̂_J − β∗_J) →_d N(0, σ² C⁻¹_{JJ}).

(Caveat: the result does not hold uniformly over β∗; small coefficients of order β∗_j = O(1/√n) are problematic.)

33 / 95
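A two-stage implementation sketch: a pilot estimate gives the weights, and the weighted problem is solved by rescaling the columns of X so that a standard Lasso solver can be reused (constants and names are my own choices):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def adaptive_lasso(X, Y, lam, gamma=1.0, eps=1e-8):
    """Two-stage adaptive Lasso: weights from a pilot OLS fit, then weighted L1."""
    beta_init = LinearRegression(fit_intercept=False).fit(X, Y).coef_
    w = 1.0 / (np.abs(beta_init) ** gamma + eps)     # per-coordinate penalty weights
    Xw = X / w                                       # column rescaling absorbs the weights
    beta_w = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(Xw, Y).coef_
    return beta_w / w                                # undo the rescaling

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:3] = [2.0, -1.5, 1.0]
Y = X @ beta_star + 0.5 * rng.standard_normal(n)
print(np.round(adaptive_lasso(X, Y, lam=0.1), 2))
```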

n≪ p

34 / 95

Lasso in the high-dimensional regime

β̂ = argmin_{β∈R^p} (1/n)‖Xβ − Y‖₂² + λ_n Σ_{j=1}^p |β_j|.

Theorem (estimation error of the Lasso (Bickel et al., 2009, Zhang, 2009))

Assume the restricted eigenvalue condition (Bickel et al., 2009),
max_{i,j} |X_{ij}| ≤ 1, and sub-Gaussian noise E[e^{τε_i}] ≤ e^{σ²τ²/2} (∀τ > 0).
Then, for an appropriate λ_n, with probability at least 1 − δ,

‖β̂ − β∗‖₂² ≤ C d log(p/δ)/n.

The dimension p enters only through log(p); the effective dimension is the sparsity d.

35 / 95

The Lasso is near minimax optimal

Theorem (minimax lower bound (Raskutti and Wainwright, 2011))

With probability at least 1/2,

min_{β̂: estimator} max_{β∗: d-sparse} ‖β̂ − β∗‖² ≥ C d log(p/d)/n.

Hence the Lasso attains the minimax rate up to the gap between log(p) and log(p/d) (of order d log(d)/n).

Minimax rates for Multiple Kernel Learning: Raskutti et al. (2012), Suzuki and Sugiyama (2013).

36 / 95

Restricted eigenvalue condition

Let A = (1/n)X^⊤X.

Definition (restricted eigenvalue (RE(k′, C)))

φ_RE(k′, C) = φ_RE(k′, C, A) := inf { v^⊤Av / ‖v_J‖₂² : J ⊆ {1, …, p}, |J| ≤ k′, v ∈ R^p, C‖v_J‖₁ ≥ ‖v_{J^c}‖₁ }

The condition requires φ_RE > 0.

(A must be positive definite along directions dominated by a small index set J.)

37 / 95

Compatibility condition

Let A = (1/n)X^⊤X and k = |J|.

Definition (compatibility (COM(J, C)))

φ_COM(J, C) = φ_COM(J, C, A) := inf { k v^⊤Av / ‖v_J‖₁² : v ∈ R^p, C‖v_J‖₁ ≥ ‖v_{J^c}‖₁ }

The condition requires φ_COM > 0.

It is weaker than RE: since ‖v_J‖₁² ≤ |J| ‖v_J‖₂², RE(k′, C) with |J| ≤ k′ implies COM(J, C).

38 / 95

Restricted isometry condition

Definition (restricted isometry (RI(k′, δ)) (Candes and Tao, 2005))

For some 1 > δ > 0,

(1 − δ)‖β‖² ≤ ‖Xβ‖² ≤ (1 + δ)‖β‖²

for every k′-sparse β ∈ R^p.

Random matrices satisfy this with high probability, in the spirit of the Johnson-Lindenstrauss lemma.

39 / 95

β̂: the Lasso estimator. J := {j | β∗_j ≠ 0}, d := |J|.

Rates implied by each condition (up to constants):

  condition       (1/n)‖X(β̂ − β∗)‖₂²   ‖β̂ − β∗‖₂²     ‖β̂ − β∗‖₁²
  RI(Cd, δ)
    ⇓
  RE(2d, 3)       d log(p)/n            d log(p)/n      d² log(p)/n
    ⇓
  COM(J, 3)       d log(p)/n            d² log(p)/n     d² log(p)/n

See Bühlmann and van de Geer (2011); (Candes and Tao, 2005), (Candes, 2008).
For the relation between RI and RE, see Rudelson and Zhou (2013).

40 / 95

When does the RE condition hold? (random design)

Let Z be a p-dimensional isotropic random vector: E[⟨Z, z⟩²] = ‖z‖₂² (∀z ∈ R^p).

Sub-Gaussian norm ‖Z‖_{ψ2}:
‖Z‖_{ψ2} = sup_{z∈R^p, ‖z‖=1} inf_t { t | E[exp(⟨Z, z⟩²/t²)] ≤ 2 }.

1. Z = [Z₁, Z₂, …, Z_n]^⊤ ∈ R^{n×p} with independent isotropic rows Z_i ∈ R^p.
2. For Σ ∈ R^{p×p}, the design is X = ZΣ^{1/2}.

Theorem (Rudelson and Zhou (2013))

Suppose ‖Z_i‖_{ψ2} ≤ κ (∀i). With a universal constant c₀ and m = c₀ max_i(Σ_{i,i})² / φ²_RE(k, 9, Σ), if

n ≥ 4 c₀ m κ⁴ log(60 e p/(m κ)),

then

P( φ_RE(k, 3, Σ̂) ≥ (1/2) φ_RE(k, 9, Σ) ) ≥ 1 − 2 exp(−n/(4c₀κ⁴)).

⇒ For sub-Gaussian designs, the RE property of the sample covariance is inherited from the population Σ with high probability.

41 / 95

Related estimators: Massart (2003), Bunea et al. (2007), Rigollet and Tsybakov (2011):

min_{β∈R^p} ‖Y − Xβ‖² + Cσ²‖β‖₀ {1 + log(p/‖β‖₀)}.

Bayesian / exponential-weighting approaches: Dalalyan and Tsybakov (2008), Alquier and Lounici (2011), Suzuki (2012).

Prediction error bound (with no condition on X):

(1/n)‖Xβ∗ − Xβ̂‖² ≤ Cσ² (d/n) log(1 + p/d).

42 / 95

43 / 95

Extension to additive models

y_i = Σ_{j=1}^p x_i^(j) b_j + ε_i   −→   y_i = Σ_{j=1}^p f_j(x_i^(j)) + ε_i

44 / 95

Multiple Kernel Learning

Multiple Kernel Learning (MKL) (Lanckriet et al., 2004, Bach et al., 2004):

f̂ = Σ_{m=1}^M f̂_m ← min_{f_m∈H_m} Σ_{i=1}^n ( y_i − Σ_{m=1}^M f_m(x_i) )² + C Σ_{m=1}^M ‖f_m‖_{H_m}

(H_m: a reproducing kernel Hilbert space)

The group-Lasso-type penalty makes many components f_m exactly 0 (kernel selection).

Efficient optimization methods have been studied (Sonnenburg et al., 2006, Rakotomamonjy et al., 2008, Suzuki and Tomioka, 2009).

Generalizations replace the penalty:

Σ_{m=1}^M ‖f_m‖_{H_m}  −→  ψ((‖f_m‖_{H_m})_{m=1}^M),

e.g., ℓp-norm MKL (Micchelli and Pontil, 2005, Kloft et al., 2009),
other variants (Shawe-Taylor, 2008, Tomioka and Suzuki, 2009), Variable Sparsity Kernel Learning (VSKL) (Aflalo et al., 2011).

45 / 95

Applications: computer vision (Gehler & Nowozin, CVPR 2009), bioinformatics (Widmer et al., BMC Bioinformatics 2010).

Time-varying coefficient model: f_(t)(x) = β_(t)^⊤ x (Lu et al., 2015).

46 / 95

Convergence rates.

1. (Suzuki, 2011):

‖f̂ − f∗‖²_{L2(Π)} = O_p( M^{1−2s/(1+s)} n^{−1/(1+s)} (‖1‖_{ψ∗}‖f∗‖_ψ)^{2s/(1+s)} + M log(M)/n ).

2. Elastic-net MKL (Suzuki and Sugiyama, 2013):

(L1)      ‖f̂ − f∗‖²_{L2(Π)} = O_p( d^{(1−s)/(1+s)} n^{−1/(1+s)} R_{1,f∗}^{2s/(1+s)} + d log(M)/n ),
(Elastic) ‖f̂ − f∗‖²_{L2(Π)} = O_p( d^{(1+q)/(1+q+s)} n^{−(1+q)/(1+q+s)} R_{2,g∗}^{2s/(1+q+s)} + d log(M)/n ).

3. Bayesian approach (Suzuki, 2012):

E_{Y_{1:n}|x_{1:n}}[ ‖f̂ − f^o‖²_n ] = O_p( Σ_{m∈I₀} n^{−1/(1+s_m)} + (|I₀|/n) log( Me/(κ|I₀|) ) ).

The bound in 3 does not require a Restricted Eigenvalue Condition.

47 / 95

Spectral assumption (s)

0 < s < 1: complexity parameter of the RKHS.

Kernel expansion (cf. Mercer's theorem):

k_m(x, x′) = Σ_{ℓ=1}^∞ μ_{ℓ,m} φ_{ℓ,m}(x) φ_{ℓ,m}(x′),

where {φ_{ℓ,m}}_{ℓ=1}^∞ is an orthonormal system in L2(P).

Assumption (s): for some 0 < s < 1,

μ_{ℓ,m} ≤ C ℓ^{−1/s}   (∀ℓ, m).

Smaller s means faster eigenvalue decay, i.e. a simpler RKHS.
The corresponding optimal rate for a single kernel is O_p(n^{−1/(1+s)}).

Proposition (Steinwart et al. (2009))

μ_{ℓ,m} ∼ ℓ^{−1/s}  ⇔  log N(B(H_m), ε, L2(P)) ∼ ε^{−2s}

48 / 95

From matrices to tensors

[Figure: a movie × user rating matrix extended to a movie × user × context tensor.]

Low-rank models:

X_{ij} = Σ_{r=1}^d u_{r,i}^(1) u_{r,j}^(2)   −→   X_{ijk} = Σ_{r=1}^d u_{r,i}^(1) u_{r,j}^(2) u_{r,k}^(3)

49 / 95

[Figure: a partially observed User × Item × Context rating tensor, with the prediction targets (Task type 1, Task type 2) and side features indicated.]

50 / 95

Tensor regression model:

Y_i = ⟨X_i, A∗⟩ + W_i.

A∗, X_i ∈ R^{M1×···×MK}: K-way tensors.

⟨X_i, A∗⟩ := Σ_{j1,…,jK} X_{i,(j1,…,jK)} A∗_{j1,…,jK}.

W_i ∼ N(0, σ²): observational noise. E.g., X_i = e_{j1} ⊗ e_{j2} ⊗ ··· ⊗ e_{jK} (tensor completion).

Assumption: A∗ is "low rank".

Estimator (regularized least squares):

min_{A∈R^{M1×M2×···×MK}} Σ_{i=1}^n (Y_i − ⟨X_i, A⟩)² + pen(A).

Possible penalties pen(A): CP-decomposition based (non-convex); overlapped Schatten-1 norm (Tomioka et al., 2011); latent Schatten-1 norm (Tomioka and Suzuki, 2013).

51 / 95

Tensor rank: CP decomposition

CP decomposition (Canonical Polyadic Decomp.) (Hitchcock, 1927a,b) (figure is from (Kolda and Bader, 2009)):

X_{ijk} = Σ_{r=1}^d a_{ir} b_{jr} c_{kr} =: [[A, B, C]].

The CP rank is the smallest d admitting such a decomposition.
Computing the CP rank is NP-hard in general, so estimation based directly on the CP rank leads to non-convex (NP-hard) problems.

52 / 95

Overlapped Schatten-1 norm

|||A|||_{S1/1} := Σ_{k=1}^K ‖A_(k)‖_Tr

(A_(k): the mode-k unfolding of A into a matrix)

Overlapped Schatten-1 norm regularization:

Â = argmin_A |||Y − A|||²_F + λ_n |||A|||_{S1/1}.

This induces a low Tucker rank.

[Figure: a tensor A and its mode-wise unfoldings A_(1), A_(2), A_(3).]

53 / 95

Latent Schatten-1 norm

|||A|||_{S1/1} := inf_{A = A1 + A2 + ··· + AK} Σ_{k=1}^K ‖A_k^(k)‖_Tr

Latent Schatten-1 norm regularization:

Â = argmin_A |||Y − A|||²_F + λ_n |||A|||_{S1/1},
s.t. A = Σ_{k=1}^K A_k,   ‖A_k^(k′)‖_{S∞} ≤ (α/K)√(N/n_{k′})   (∀k′ ≠ k).

[Figure: the decomposition A = A1 + A2 + A3, with each component A_k unfolded along its own mode: A1^(1), A2^(2), A3^(3).]

54 / 95

min_{A∈R^{M1×M2×···×MK}} (1/n) Σ_{i=1}^n (Y_i − ⟨X_i, A⟩)² + pen(A).

Overlapped Schatten-1 norm (Tomioka et al., 2011):

(1/N)|||Â − A∗|||²_F ≤ (C/n) ( (1/K) Σ_{k=1}^K (√M_k + √(N/M_k)) )² ( (1/K) Σ_{k=1}^K √r_k )²

Latent Schatten-1 norm (Tomioka and Suzuki, 2013):

(1/N)|||Â − A∗|||²_F ≤ (C/n) ( max_k (√M_k + √(N/M_k)) )² min_k r_k

Square deal (Mu et al., 2014):

(1/M)‖Â − A∗‖₂² ≤ C ( min{Π_{k∈I1} r_k, Π_{k∈I2} r_k} / n ) ( Π_{k∈I1} M_k + Π_{k∈I2} M_k ),

where I1 and I2 partition {1, …, K}, N = Π_k M_k, and r_k is the rank of the mode-k unfolding.

Note: these convex relaxations are computationally tractable; their statistical rates are compared with the Bayesian approach on the following slides.

55 / 95

.

56 / 95

Bayesian tensor estimation

Assume ‖A∗‖_{max,2} ≤ R σ_p.

Theorem

For all n and {M_k}_k, with a universal constant C:

(in-sample error)
E_{Y_{1:n}|X_{1:n}}[ (1/(2σ²)) ∫ ‖A − A∗‖²_n dΠ(A|X_{1:n}, Y_{1:n}) ]
  ≤ C (d(Σ_{k=1}^K M_k)/n) (1 ∨ R²) log( K σ_p^K √n R^K / ξ ).

(L2 error)
E_{Y_{1:n}|X_{1:n}}[ (1/(2σ²)) ∫ ‖A − A∗‖²_{L2} dΠ(A|X_{1:n}, Y_{1:n}) ]
  ≤ C (d(Σ_{k=1}^K M_k)/n) (1 ∨ R^{2(K+1)}) log( K σ_p^K √n R^K / ξ ).

57 / 95

Minimax lower bound

H_d(R) := {A ∈ R^{M1×···×MK} | CP-rank at most d, ‖A‖_{max,2} ≤ R}.

(the class of tensors with CP-rank at most d)

Theorem

For the class H_d(R):

min_Â max_{A∗∈H_d(R)} E[‖Â − A∗‖²_{L2}] ≳ d(M1 + ··· + MK)/n.

Hence the Bayesian estimator above is minimax optimal up to a log factor.

58 / 95

Figure: numerical comparison (with the Schatten-1 norm based methods).

59 / 95

Figure: The scaled predictive accuracy (left) and the actual predictive accuracy (right) against the number of samples.

scaled accuracy = actual accuracy / ( d(Σ_{k=1}^K M_k)/n )

60 / 95

Outline

1

2

3

n ≫ p
n ≪ p

4

5

61 / 95

Debiasing the Lasso estimator β̂ (van de Geer et al., 2014, Javanmard and Montanari, 2014):

β̌ = β̂ + M X^⊤(Y − X β̂)

where M plays the role of an (approximate) inverse of the Gram matrix X^⊤X.

Motivation: if X^⊤X were invertible, the least-squares estimator would satisfy β̂_LS = β∗ + (X^⊤X)^{-1} X^⊤ε (unbiased, asymptotically normal).

Difficulty: when p ≫ n, X^⊤X is not invertible, so M must be constructed differently.

62 / 95

Choice of M:

min_{M∈R^{p×p}} |Σ̂ M^⊤ − I|_∞,   Σ̂ = X^⊤X/n

(|·|_∞: elementwise maximum norm).

Theorem (Javanmard and Montanari (2014))

Assume ε_i ∼ N(0, σ²) (i.i.d.). Then

√n(β̌ − β∗) = Z + Δ,   Z ∼ N(0, σ² M Σ̂ M^⊤),   Δ = √n(M Σ̂ − I)(β∗ − β̂).

For a suitable random design X and λ_n = cσ√(log(p)/n),

‖Δ‖_∞ = O_p( d log(p)/√n ).

Hence, if n ≫ d² log²(p), the bias term Δ ≈ 0 and √n(β̌ − β∗) is approximately Gaussian, which yields coordinate-wise confidence intervals and p-values.

63 / 95
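A simplified numerical sketch of the debiasing idea; here M is built from a graphical-lasso precision estimate instead of the paper's coordinate-wise program, the noise level is treated as known, and all names and constants are my own assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.covariance import GraphicalLasso
from scipy import stats

rng = np.random.default_rng(0)
n, p, d, sigma = 200, 50, 5, 1.0
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:d] = 1.0
Y = X @ beta_star + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(np.log(p) / n)
beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, Y).coef_

Sigma_hat = X.T @ X / n
Theta_hat = GraphicalLasso(alpha=0.05).fit(X).precision_   # crude surrogate for Sigma_hat^{-1}
beta_deb = beta_hat + Theta_hat @ X.T @ (Y - X @ beta_hat) / n

# Approximate 95% confidence intervals from the Gaussian limit
se = sigma * np.sqrt(np.diag(Theta_hat @ Sigma_hat @ Theta_hat.T) / n)
z = stats.norm.ppf(0.975)
covered = np.abs(beta_deb - beta_star) <= z * se
print("empirical coverage over the p coordinates:", covered.mean())
```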

(a) 95% confidence intervals. (n, p, d) = (1000, 600, 10).
(b) CDF of the p-values. (n, p, d) = (1000, 600, 10).

(Figures from Javanmard and Montanari (2014).)

64 / 95

Significance test for the Lasso (Lockhart et al., 2014)

[Figure: Lasso coefficient path plotted against the L1 norm of the coefficients.]

J = supp(β̂(λ_k)), J∗ = supp(β∗),

β̃(λ_{k+1}) := argmin_{β: β_J∈R^{|J|}, β_{J^c}=0} ‖Y − X_J β_J‖² + λ_{k+1}‖β_J‖₁.

If J∗ ⊆ J (i.e. the next variable to enter satisfies β∗_j = 0), then

T_k = ( ⟨Y, Xβ̂(λ_{k+1})⟩ − ⟨Y, Xβ̃(λ_{k+1})⟩ ) / σ²  →_d  Exp(1)   (n, p → ∞).

65 / 95

Outline

1

2

3

n ≫ p
n ≪ p

4

5

66 / 95

Optimization of regularized objectives

R(β) = Σ_{i=1}^n ℓ(y_i, x_i^⊤β) + ψ(β) = f(β) + ψ(β)

f(β): smooth loss term; ψ(β): (possibly non-smooth) regularization term.

Even when ψ is non-differentiable, its structure can be exploited so that R is minimized efficiently.

For the L1 penalty ψ(β) = C Σ_{j=1}^p |β_j|, the key building block is the one-dimensional problem min_b {(b − y)² + C|b|}, which has the closed-form soft-thresholding solution.

67 / 95

Coordinate descent

1. Pick a coordinate j ∈ {1, …, p} (cyclically or at random).
2. Update the j-th coordinate β_j:

β_j^(k+1) ← argmin_{β_j} R([β_1^(k), …, β_j, …, β_p^(k)]),

or, using the partial gradient g_j = ∂f(β^(k))/∂β_j, the proximal variant

β_j^(k+1) ← argmin_{β_j} ⟨g_j, β_j⟩ + ψ_j(β_j) + (η_k/2)‖β_j − β_j^(k)‖².

Each one-dimensional subproblem is cheap (for L1, a soft-threshold); a code sketch is given below.

68 / 95
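A minimal coordinate-descent sketch for the Lasso objective (1/(2n))‖Y − Xβ‖² + λ‖β‖₁ (the 1/(2n) scaling and all names are my own choices):

```python
import numpy as np

def lasso_cd(X, Y, lam, iters=100):
    """Cyclic coordinate descent for (1/(2n))||Y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n           # per-coordinate curvature
    resid = Y.copy()                            # residual Y - X beta
    for _ in range(iters):
        for j in range(p):
            # partial inner product including coordinate j's current contribution
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)  # keep the residual up to date
            beta[j] = new
    return beta

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:3] = [2.0, -1.0, 0.5]
Y = X @ beta_star + 0.1 * rng.standard_normal(n)
print(np.round(lasso_cd(X, Y, lam=0.05), 2)[:6])
```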

[Figure: failure and success cases of coordinate descent.]

Caution: for a general non-smooth objective, coordinate descent can get stuck at a non-optimal point; it is safe when the non-smooth part is separable, i.e. of the form f(x) = Σ_{j=1}^p f_j(x_j).

69 / 95

Convergence rates of coordinate descent

min_x P(x) = min_x {f(x) + ψ(x)} = min_x {f(x) + Σ_{j=1}^p ψ_j(x_j)}.

Assumption: f is coordinate-wise γ-smooth (|∂_{x_j} f(x) − ∂_{x_j} f(x + a e_j)| ≤ γa).

Cyclic coordinate descent (Saha and Tewari, 2013, Beck and Tetruashvili, 2013):

P(x^(t)) − P(x∗) ≤ γ p ‖x^(0) − x∗‖² / (2t) = O(pγ/t).

Randomized coordinate descent (Nesterov, 2012, Richtarik and Takac, 2014):

Expected objective gap: O(pγ/t).
With Nesterov acceleration: O(γ(p/t)²) (Fercoq and Richtarik, 2013).
If f is α-strongly convex: O(exp(−C(α/γ)t/p)).
If f is α-strongly convex, with acceleration: O(exp(−C√(α/γ) t/p)) (Lin et al., 2014).

Survey: Wright (2015).

70 / 95

Parallel / distributed coordinate descent. Hydra: (Richtarik and Takac, 2013, Fercoq et al., 2014).

Example: a huge Lasso problem (p = 5 × 10^8, n = 10^9) solved by Hydra (Richtarik and Takac, 2013) on 128 nodes (4,096 cores).

71 / 95

Proximal gradient and dual averaging

Objective: f(β) (smooth) + ψ(β) (non-smooth).

Let g_k ∈ ∂f(β^(k)) and ḡ_k = (1/k) Σ_{τ=1}^k g_τ.

Proximal gradient method:

β^(k+1) = argmin_{β∈R^p} { g_k^⊤β + ψ(β) + (η_k/2)‖β − β^(k)‖² }.

(Regularized) dual averaging (Xiao, 2009, Nesterov, 2009):

β^(k+1) = argmin_{β∈R^p} { ḡ_k^⊤β + ψ(β) + (η_k/2)‖β‖² }.

Both updates reduce to a proximal mapping: prox(q|ψ) := argmin_x { ψ(x) + (1/2)‖x − q‖² }.

For L1 the proximal mapping is soft-thresholding (closed form).
Proximal mappings remain tractable for many structured ψ (e.g. penalties arising from Lovász extensions of submodular functions).

72 / 95

Example: the L1 proximal mapping

prox(q | C‖·‖₁) = argmin_x { C‖x‖₁ + (1/2)‖x − q‖² } = ( sign(q_j) max(|q_j| − C, 0) )_j.

→ This is exactly the soft-thresholding operator; it can be computed in closed form.

73 / 95
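Plugging this prox into the proximal gradient update gives ISTA for the Lasso; a minimal sketch (step size 1/L with L the largest eigenvalue of X^⊤X/n; names are mine):

```python
import numpy as np

def soft_threshold(q, C):
    return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

def ista(X, Y, lam, iters=500):
    """Proximal gradient (ISTA) for (1/(2n))||Y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n).max()   # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ beta - Y) / n
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
beta_star = np.zeros(50); beta_star[:3] = [1.5, -1.0, 0.5]
Y = X @ beta_star + 0.1 * rng.standard_normal(200)
print(np.round(ista(X, Y, lam=0.05), 2)[:6])
```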

Convergence rates of first-order methods, depending on the properties of f:

  α-strongly convex and L-smooth:  exp(−√(α/L) k) with acceleration¹ (exp(−(α/L) k) without²)
  L-smooth:                        1/k² with acceleration¹ (1/k without²)
  non-smooth:                      1/√k

1. Nesterov's acceleration (Nesterov, 2007, Zhang et al., 2010).
2. Plain (proximal) gradient descent.
3. These are the attainable rates for first-order methods.

74 / 95

Augmented Lagrangian (method of multipliers)

min_β f(β) + ψ(β)  ⇔  min_{x,y} f(x) + ψ(y)  s.t.  x = y.

Augmented Lagrangian: L(x, y, λ) = f(x) + ψ(y) − λ^⊤(y − x) + (ρ/2)‖y − x‖².

Method of multipliers (Hestenes, 1969, Powell, 1969, Rockafellar, 1976):

1. (x^(k+1), y^(k+1)) = argmin_{x,y} L(x, y, λ^(k)).
2. λ^(k+1) = λ^(k) − ρ(y^(k+1) − x^(k+1)).

Drawback: the joint minimization over (x, y) in step 1 can be as hard as the original problem.

75 / 95

ADMM (alternating direction method of multipliers)

min_{x,y} f(x) + ψ(y)  s.t.  x = y.

L(x, y, λ) = f(x) + ψ(y) − λ^⊤(y − x) + (ρ/2)‖y − x‖².

ADMM (Gabay and Mercier, 1976):

x^(k+1) = argmin_x f(x) + λ^(k)⊤x + (ρ/2)‖y^(k) − x‖²,
y^(k+1) = argmin_y ψ(y) − λ^(k)⊤y + (ρ/2)‖y − x^(k+1)‖²   (= prox(x^(k+1) + λ^(k)/ρ | ψ/ρ)),
λ^(k+1) = λ^(k) − ρ(y^(k+1) − x^(k+1)).

The x- and y-updates are carried out alternately, so each step is easy.
For L1 regularization, the y-update is soft-thresholding.

Convergence: O(1/k) in general (He and Yuan, 2012); linear convergence under additional conditions (Deng and Yin, 2012, Hong and Luo, 2012).

76 / 95
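A minimal ADMM sketch for the Lasso in this x = y splitting, with f(x) = (1/(2n))‖Y − Xx‖² and ψ(y) = λ‖y‖₁ (scalings, ρ, and names are my own choices):

```python
import numpy as np

def soft_threshold(q, C):
    return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

def admm_lasso(X, Y, lam, rho=1.0, iters=200):
    """ADMM for (1/(2n))||Y - X x||^2 + lam*||y||_1 subject to x = y."""
    n, p = X.shape
    # The x-update solves a ridge-type linear system; factor it once.
    A_inv = np.linalg.inv(X.T @ X / n + rho * np.eye(p))
    XtY = X.T @ Y / n
    x = y = lam_dual = np.zeros(p)
    for _ in range(iters):
        x = A_inv @ (XtY + rho * y - lam_dual)              # x-update (quadratic)
        y = soft_threshold(x + lam_dual / rho, lam / rho)   # y-update (prox of L1)
        lam_dual = lam_dual + rho * (x - y)                 # dual update
    return y

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
beta_star = np.zeros(50); beta_star[:3] = [1.5, -1.0, 0.5]
Y = X @ beta_star + 0.1 * rng.standard_normal(200)
print(np.round(admm_lasso(X, Y, lam=0.05), 2)[:6])
```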

Stochastic (online) optimization methods.

FOBOS (Duchi and Singer, 2009)

RDA (Xiao, 2009)

SVRG (Stochastic Variance Reduced Gradient) (Johnson and Zhang, 2013)

SDCA (Stochastic Dual Coordinate Ascent) (Shalev-Shwartz and Zhang, 2013)

SAG (Stochastic Averaging Gradient) (Le Roux et al., 2013)

Stochastic ADMM variants: Suzuki (2013), Ouyang et al. (2013), Suzuki (2014).

77 / 95

Stochastic optimization of the regularized empirical risk

R(w) = (1/n) Σ_{i=1}^n ℓ(z_i, w) + ψ(w)

Given samples {z_i}_{i=1}^n, update w using one (or a few) samples at a time.

Each update costs O(1) in n (typically O(p) overall).

Convergence of the objective:
  SGD, SDA: O(1/√T)
  strongly convex case: O(1/T)
  SVRG, SAG, SDCA: exp(−T/(n + γ/λ)) — linear convergence

78 / 95

SGD and SDA

min_w E_{Z∼P(Z)}[ℓ(Z, w)] + ψ(w)

1. Sample z_t ∼ P(Z).
2. Compute g_t ∈ ∂_w ℓ(z_t, w^(t−1)) and the average ḡ_t = (1/t) Σ_{τ=1}^t g_τ.
3. Update:

SGD (Stochastic Gradient Descent):
w^(t) = argmin_{w∈R^p} { g_t^⊤w + ψ(w) + (1/(2η_t))‖w − w^(t−1)‖² }.

SDA (Stochastic Dual Averaging):
w^(t) = argmin_{w∈R^p} { ḡ_t^⊤w + ψ(w) + (1/(2η_t))‖w‖² }.

[Convergence] Assume E[‖g_t‖²] ≤ G² and E[‖w_t − w∗‖²] ≤ D² (∀t) (for SDA the bound on the iterates is not needed).

Convex case (with the Polyak-Ruppert average w̄^(T) = Σ_t w^(t)/T):

E[R(w̄^(T))] − R(w∗) ≤ 2GD/√T.

μ-strongly convex case (with the weighted average w̄^(T) = Σ_t t w^(t) / (T(T+1)/2)):

E[R(w̄^(T))] − R(w∗) ≤ 2G²/(μT).

79 / 95
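A minimal proximal-SGD sketch for L1-regularized least squares, following the SGD update above with the L1 prox (the step-size schedule and names are my own assumptions):

```python
import numpy as np

def soft_threshold(q, C):
    return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

def prox_sgd(X, Y, lam, T=20000, eta0=0.05):
    """Proximal SGD for E[(y - x^T w)^2 / 2] + lam*||w||_1 (one sample per step)."""
    n, p = X.shape
    w = np.zeros(p)
    w_avg = np.zeros(p)
    rng = np.random.default_rng(1)
    for t in range(1, T + 1):
        i = rng.integers(n)
        g = (X[i] @ w - Y[i]) * X[i]                 # stochastic gradient of the squared loss
        eta = eta0 / np.sqrt(t)                      # decaying step size
        w = soft_threshold(w - eta * g, eta * lam)   # prox step = soft-thresholding
        w_avg += (w - w_avg) / t                     # Polyak-Ruppert averaging
    return w_avg

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
beta_star = np.zeros(20); beta_star[:3] = [1.0, -1.0, 0.5]
Y = X @ beta_star + 0.1 * rng.standard_normal(500)
print(np.round(prox_sgd(X, Y, lam=0.01), 2)[:6])
```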

Variance-reduced stochastic methods (finite sums)

(1/n) Σ_{i=1}^n ℓ(z_i, w) + ψ(w)

Stochastic Average Gradient descent, SAG (Le Roux et al., 2012, Schmidt et al., 2013, Defazio et al., 2014)

Stochastic Variance Reduced Gradient descent, SVRG (Johnson and Zhang, 2013, Xiao and Zhang, 2014)

Stochastic Dual Coordinate Ascent, SDCA (Shalev-Shwartz and Zhang, 2013)

- SAG and SVRG work on the primal problem.
- SDCA works on the dual problem.

A prox-SVRG sketch is given below.

80 / 95
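A minimal prox-SVRG sketch for the same L1-regularized least-squares objective, loosely following Xiao and Zhang (2014) (epoch length, step size, and names are my own assumptions):

```python
import numpy as np

def soft_threshold(q, C):
    return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

def prox_svrg(X, Y, lam, epochs=30):
    """Prox-SVRG for (1/(2n)) sum_i (y_i - x_i^T w)^2 + lam*||w||_1."""
    n, p = X.shape
    L_max = np.max(np.sum(X ** 2, axis=1))     # smoothness constant of the worst component
    eta = 1.0 / (3.0 * L_max)
    w = np.zeros(p)
    rng = np.random.default_rng(1)
    for _ in range(epochs):
        w_snap = w.copy()
        full_grad = X.T @ (X @ w_snap - Y) / n            # full gradient at the snapshot
        for _ in range(n):                                 # one pass of inner iterations
            i = rng.integers(n)
            # variance-reduced gradient estimate
            v = (X[i] @ w - Y[i]) * X[i] - (X[i] @ w_snap - Y[i]) * X[i] + full_grad
            w = soft_threshold(w - eta * v, eta * lam)     # proximal step
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
beta_star = np.zeros(20); beta_star[:3] = [1.0, -1.0, 0.5]
Y = X @ beta_star + 0.1 * rng.standard_normal(500)
print(np.round(prox_svrg(X, Y, lam=0.01), 2)[:6])
```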

Assumptions

P(x) = (1/n) Σ_{i=1}^n ℓ_i(x) + ψ(x)

Assumptions:
ℓ_i: γ-smooth (‖∇ℓ_i(x) − ∇ℓ_i(x′)‖ ≤ γ‖x − x′‖),
ψ: λ-strongly convex (ψ(x) − (λ/2)‖x‖² is convex).

Typically λ = O(1/n) or O(1/√n).

Note: a non-strongly-convex regularizer such as L1 can be handled by adding a small L2 term, i.e. using ψ(x) + λ‖x‖².

81 / 95

Iteration complexity

Variance-reduced stochastic methods (SVRG, SAG, SDCA):

T > (n + γ/λ) log(1/ε)

iterations suffice to reach accuracy ε for a γ-smooth loss and λ-strongly convex regularizer.

Batch (full) gradient methods:

T > n (γ/λ) log(1/ε)

(counting per-sample gradient evaluations).

Summary: [batch] n(γ/λ) log(1/ε)  vs.  [variance-reduced stochastic] (n + γ/λ) log(1/ε).

82 / 95

SDCA = stochastic coordinate ascent on the dual

Iterations to reach accuracy ε:

T ≥ C (n + γ/λ) log( (n + γ/λ)/ε ).

This matches the SVRG-type complexity.

Duality of the assumptions:
ℓ(z_i, ·) is γ-smooth   (↔ ℓ∗(z_i, ·) is 1/γ-strongly convex),
ψ is λ-strongly convex  (↔ ψ∗ is 1/λ-smooth).

83 / 95

Convex conjugate (Legendre transform).

Definition (convex conjugate)

For a (proper) function f: R^p → R, the convex conjugate is

f∗(y) := sup_{x∈R^p} {⟨x, y⟩ − f(x)}.

If f is a closed convex function, then (f∗)∗ = f.

[Figure: geometric picture of f(x), f∗(y), and the supporting line with gradient y touching the graph at x∗.]

84 / 95

Examples of convex conjugates:

  f(x)                                 f∗(y)
  (1/2)x²                              (1/2)y²
  hinge: max{1 − x, 0}                 y for −1 ≤ y ≤ 0;  ∞ otherwise
  logistic: log(1 + exp(−x))           (−y)log(−y) + (1 + y)log(1 + y) for −1 ≤ y ≤ 0;  ∞ otherwise
  L1: ‖x‖₁                             0 if max_j |y_j| ≤ 1;  ∞ otherwise
  Lp: Σ_{j=1}^d |x_j|^p  (p > 1)       Σ_{j=1}^d (p−1) p^{−p/(p−1)} |y_j|^{p/(p−1)}

[Figure: the hinge loss and its conjugate; the L1 norm and its conjugate (the indicator of the ℓ∞ ball).]

85 / 95

Dual problem

Assume there exist f_i: R → R such that ℓ(z_i, x) = f_i(a_i^⊤x), and let A = [a₁, …, a_n].

(Primal)  inf_{x∈R^p} { (1/n) Σ_{i=1}^n f_i(a_i^⊤x) + ψ(x) }

[Fenchel duality]

inf_{x∈R^p} { f̄(A^⊤x) + n ψ(x) } = − inf_{y∈R^n} { f̄∗(y) + n ψ∗(−Ay/n) }

(Dual)  inf_{y∈R^n} { (1/n) Σ_{i=1}^n f_i∗(y_i) + ψ∗( −(1/n) A y ) }

Notation:
f̄(α) = Σ_{i=1}^n f_i(α_i),  so  f̄∗(β) = Σ_{i=1}^n f_i∗(β_i);
ψ̄(x) = n ψ(x),  so  ψ̄∗(y) = n ψ∗(y/n).

86 / 95

SDCA (Shalev-Shwartz and Zhang, 2013)

Iterate the following for t = 1, 2, …:

1. Pick an index i ∈ {1, …, n} uniformly at random.
2. Calculate x^(t−1) = ∇ψ∗(−A y^(t−1)/n).
3. Update the i-th coordinate y_i so that the dual objective is decreased:

   y_i^(t) ∈ argmin_{y_i∈R} { f_i∗(y_i) − ⟨x^(t−1), a_i y_i⟩ + (1/(2η))‖y_i − y_i^(t−1)‖² }
          = prox( y_i^(t−1) + η a_i^⊤ x^(t−1) | η f_i∗ ),

   y_j^(t) = y_j^(t−1)   (for j ≠ i).

The primal iterate x^(t) is recovered from the dual variables at any time.

87 / 95

Comparison

  Method:   SDCA   SVRG   SAG/SAGA
             ✓      ✓      △

The dual-based SDCA assumes losses of the form ℓ_i(β) = f_i(x_i^⊤β).

Iterations to reach accuracy ε:

(n + γ/λ) log(1/ε).

With acceleration (Catalyst (Lin et al., 2015), Acc-SDCA (Lin et al., 2014)):

(n + √(nγ/λ)) log(1/ε).

88 / 95

Structured regularization: stochastic ADMM

Let A = [a₁, a₂, …, a_n] ∈ R^{p×n}.

(Primal)  min_w { (1/n) Σ_{i=1}^n f_i(a_i^⊤w) + ψ(B^⊤w) }

⇔ (Dual)  min_{x∈R^n, y∈R^d} { (1/n) Σ_{i=1}^n f_i∗(x_i) + ψ∗(y/n) | Ax + By = 0 }

Here the regularizer is composed with a linear map B^⊤ (structured regularization).

Stochastic ADMM (online type): Suzuki (2013), Ouyang et al. (2013).
Stochastic ADMM (variance-reduced type): SDCA-ADMM (Suzuki, 2014), with

T = O( (n + √(nγ/λ)) log(1/ε) ),

i.e. linear convergence.

89 / 95

Examples of structured regularization handled this way:

Fused Lasso (Tibshirani et al., 2005, Jacob et al., 2009).

Low-rank matrix/tensor regularization (Signoretto et al., 2010; Tomioka et al., 2011).

Robust PCA (Candes et al., 2009).

When ψ itself is simple, take B = I so that ψ(B^⊤x) = ψ(x).

90 / 95

Example: Fused Lasso

ψ(β) = C Σ_{(i,j)∈E} |β_i − β_j|,   where E is the edge set of a graph.

(Tibshirani et al. (2005), Jacob et al. (2009))

[Figures: fused lasso on a graph (Tibshirani and Taylor '11); total-variation denoising (Chambolle '04)]

91 / 95

SDCA-ADMM (Suzuki, 2014)

Split the index set {1, …, n} into K groups (I₁, I₂, …, I_K).
For each t = 1, 2, …:

Choose k ∈ {1, …, K} uniformly at random, and set I = I_k.

y^(t) ← argmin_y { nψ∗(y/n) − ⟨w^(t−1), Ax^(t−1) + By⟩ + (ρ/2)‖Ax^(t−1) + By‖² + (1/2)‖y − y^(t−1)‖²_Q },

x_I^(t) ← argmin_{x_I} { Σ_{i∈I} f_i∗(x_i) − ⟨w^(t−1), A_I x_I + By^(t)⟩ + (ρ/2)‖A_I x_I + A_{\I} x_{\I}^(t−1) + By^(t)‖² + (1/2)‖x_I − x_I^(t−1)‖²_{G_{I,I}} },

w^(t) ← w^(t−1) − γρ{ n(Ax^(t) + By^(t)) − (n − n/K)(Ax^(t−1) + By^(t−1)) }.

where Q, G are some appropriate positive semidefinite matrices.

92 / 95

Implementation with proximal mappings

Choose Q = ρ(η_B I_d − B^⊤B), G_{I,I} = ρ(η_{Z,I} I_{|I|} − Z_I^⊤ Z_I), and use Moreau's identity prox(q|ψ) + prox(q|ψ∗) = q.

Then SDCA-ADMM becomes:

For q^(t) = y^(t−1) + (B^⊤/(ρη_B)){w^(t−1) − ρ(Z x^(t−1) + B y^(t−1))}, let

y^(t) ← q^(t) − prox( q^(t) | nψ(ρη_B · )/(ρη_B) ).

For p_I^(t) = x_I^(t−1) + (Z_I^⊤/(ρη_{Z,I})){w^(t−1) − ρ(Z x^(t−1) + B y^(t))}, let

x_i^(t) ← prox( p_i^(t) | f_i∗/(ρη_{Z,I}) )   (∀i ∈ I).

⋆ Each update of x and y reduces to proximal mappings of ψ and f_i∗ (available in closed form for many regularizers).

93 / 95

[Figure: empirical risk versus CPU time (s); compared methods include RDA.]

94 / 95

Summary

L1 regularization: Lasso, Adaptive Lasso.

High-dimensional error bound: ‖β̂ − β∗‖² = O_p(d log(p)/n).

95 / 95


J. Aflalo, A. Ben-Tal, C. Bhattacharyya, J. S. Nath, and S. Raman. Variablesparsity kernel learning. Journal of Machine Learning Research, 12:565–592,2011.

P. Alquier and K. Lounici. PAC-Bayesian bounds for sparse regression estimationwith exponential weights. Electronic Journal of Statistics, 5:127–145, 2011.

T. Anderson. Estimating linear restrictions on regression coefficients formultivariate normal distributions. Annals of Mathematical Statistics, 22:327–351, 1951.

A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying. A spectral regularizationframework for multi-task structure learning. In Y. S. J.C. Platt, D. Koller andS. Roweis, editors, Advances in Neural Information Processing Systems 20,pages 25–32, Cambridge, MA, 2008. MIT Press.

F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, andthe SMO algorithm. In the 21st International Conference on Machine Learning,pages 41–48, 2004.

O. Banerjee, L. E. Ghaoui, and A. d’Aspremont. Model selection through sparsemaximum likelihood estimation for multivariate gaussian or binary data. Journalof Machine Learning Research, 9:485–516, 2008.

A. Beck and L. Tetruashvili. On the convergence of block coordinate descent typemethods. SIAM Journal on Optimization, 23(4):2037–2060, 2013.

95 / 95

J. Bennett and S. Lanning. The netflix prize. In Proceedings of KDD Cup andWorkshop 2007, 2007.

P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso andDantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

P. Bühlmann and S. van de Geer. Statistics for high-dimensional data. Springer, 2011.

F. Bunea, A. Tsybakov, and M. Wegkamp. Aggregation for gaussian regression.The Annals of Statistics, 35(4):1674–1697, 2007.

G. R. Burket. A study of reduced-rank models for multiple prediction, volume 12of Psychometric monographs. Psychometric Society, 1964.

E. Candes. The restricted isometry property and its implications for compressedsensing. Compte Rendus de l’Academie des Sciences, Paris, Serie I, 346:589–592, 2008.

E. Candes and T. Tao. The power of convex relaxations: Near-optimal matrixcompletion. IEEE Transactions on Information Theory, 56:2053–2080, 2009.

E. J. Candes and B. Recht. Exact matrix completion via convex optimization.Foundations of Computational Mathematics, 9(6):717–772, 2009.

E. J. Candes and T. Tao. Decoding by linear programming. Information Theory,IEEE Transactions on, 51(12):4203–4215, 2005.

95 / 95

E. J. Candes and T. Tao. Near-optimal signal recovery from random projections:Universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.

A. Dalalyan and A. B. Tsybakov. Aggregation by exponential weighting sharpPAC-Bayesian bounds and sparsity. Machine Learning, 72:39–61, 2008.

A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradientmethod with support for non-strongly convex composite objectives. InZ. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger,editors, Advances in Neural Information Processing Systems 27, pages1646–1654. Curran Associates, Inc., 2014.

W. Deng and W. Yin. On the global and linear convergence of the generalizedalternating direction method of multipliers. Technical report, Rice UniversityCAAM TR12-14, 2012.

D. Donoho. Compressed sensing. IEEE Transactions of Information Theory, 52(4):1289–1306, 2006.

D. L. Donoho and J. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage.Biometrika, 81(3):425–455, 1994.

J. Duchi and Y. Singer. Efficient online and batch learning using forwardbackward splitting. Journal of Machine Learning Research, 10:2873–2908, 2009.

J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and itsoracle properties. Journal of the American Statistical Association, 96(456),2001.

95 / 95

O. Fercoq and P. Richtarik. Accelerated, parallel and proximal coordinate descent.Technical report, 2013. arXiv:1312.5799.

O. Fercoq, Z. Qu, P. Richtarik, and M. Takac. Fast distributed coordinate descentfor non-strongly convex losses. In Proceedings of MLSP2014: IEEEInternational Workshop on Machine Learning for Signal Processing, 2014.

I. E. Frank and J. H. Friedman. A statistical view of some chemometricsregression tools. Technometrics, 35(2):109–135, 1993.

D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinearvariational problems via finite-element approximations. Computers &Mathematics with Applications, 2:17–40, 1976.

T. Hastie and R. Tibshirani. Generalized additive models. Chapman & Hall Ltd,1999.

B. He and X. Yuan. On the O(1/n) convergence rate of the Douglas-Rachfordalternating direction method. SIAM J. Numerical Analisis, 50(2):700–709, 2012.

M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory &Applications, 4:303–320, 1969.

F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products.Journal of Mathematics and Physics, 6:164–189, 1927a.

F. L. Hitchcock. Multilple invariants and generalized rank of a p-way matrix ortensor. Journal of Mathematics and Physics, 7:39–79, 1927b.

95 / 95

M. Hong and Z.-Q. Luo. On the linear convergence of the alternating directionmethod of multipliers. Technical report, 2012. arXiv:1208.3922.

A. J. Izenman. Reduced-rank regression for the multivariate linear model. Journalof Multivariate Analysis, pages 248–264, 1975.

L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso.In Proceedings of the 26th International Conference on Machine Learning, 2009.

A. Javanmard and A. Montanari. Confidence intervals and hypothesis testing forhigh-dimensional regression. Journal of Machine Learning, page to appear,2014.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 315–323. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf.

M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. Muller, and A. Zien.Efficient and accurate ℓp-norm multiple kernel learning. In Advances in NeuralInformation Processing Systems 22, pages 997–1005, Cambridge, MA, 2009.MIT Press.

95 / 95

K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals ofStatistics, 28(5):1356–1378, 2000.

T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAMReview, 51(3):455–500, 2009.

G. Lanckriet, N. Cristianini, L. E. Ghaoui, P. Bartlett, and M. Jordan. Learningthe kernel matrix with semi-definite programming. Journal of Machine LearningResearch, 5:27–72, 2004.

N. Le Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with anexponential convergence rate for finite training sets. In F. Pereira, C. Burges,L. Bottou, and K. Weinberger, editors, Advances in Neural InformationProcessing Systems 25, pages 2663–2671. Curran Associates, Inc., 2012.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with anexponential convergence rate for strongly-convex optimization with finitetraining sets. In Advances in Neural Information Processing Systems 25, 2013.

H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-orderoptimization. Technical report, 2015. arXiv:1506.02186.

Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient methodand its application to regularized empirical risk minimization. Technical report,2014. arXiv:1407.1296.

R. Lockhart, J. Taylor, R. J. Tibshirani, and R. Tibshirani. A significance test forthe lasso. The Annals of Statistics, 42(2):413–468, 2014.

95 / 95


K. Lounici, A. Tsybakov, M. Pontil, and S. van de Geer. Taking advantage ofsparsity in multi-task learning. 2009.

J. Lu, M. Kolar, and H. Liu. Post-regularization confidence bands for highdimensional nonparametric models with local sparsity, 2015. arXiv:1503.02978.

P. Massart. Concentration Inequalities and Model Selection: Ecole d’ete deProbabilites de Saint-Flour 23. Springer, 2003.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

C. A. Micchelli and M. Pontil. Learning the kernel function via regularization.Journal of Machine Learning Research, 6:1099–1125, 2005.

C. Mu, B. Huang, J. Wright, and D. Goldfarb. Square deal: Lower bounds andimproved relaxations for tensor recovery. In Proceedings of the 31thInternational Conference on Machine Learning, pages 73–81, 2014.

Y. Nesterov. Gradient methods for minimizing composite objective function.Technical Report 76, Center for Operations Research and Econometrics(CORE), Catholic University of Louvain (UCL), 2007.

Y. Nesterov. Primal-dual subgradient methods for convex problems. MathematicalProgramming, Series B, 120:221–259, 2009.

Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimizationproblems. SIAM Journal on Optimization, 22(2):341–362, 2012.

95 / 95

H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternating directionmethod of multipliers. In Proceedings of the 30th International Conference onMachine Learning, 2013.

M. Powell. A method for nonlinear constraints in minimization problems. InR. Fletcher, editor, Optimization, pages 283–298. Academic Press, London,New York, 1969.

A. Rakotomamonjy, F. Bach, S. Canu, and G. Y. SimpleMKL. Journal of MachineLearning Research, 9:2491–2521, 2008.

G. Raskutti and M. J. Wainwright. Minimax rates of estimation forhigh-dimensional linear regression over ℓq-balls. IEEE Transactions onInformation Theory, 57(10):6976–6994, 2011.

G. Raskutti, M. Wainwright, and B. Yu. Minimax-optimal rates for sparse additivemodels over kernel classes via convex programming. Journal of MachineLearning Research, 13:389–427, 2012.

P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models.Journal of the Royal Statistical Society: Series B, 71(5):1009–1030, 2009.

P. Richtarik and M. Takac. Distributed coordinate descent method for learningwith big data. Technical report, 2013. arXiv:1310.2059.

P. Richtarik and M. Takac. Iteration complexity of randomized block-coordinatedescent methods for minimizing a composite function. MathematicalProgramming, 144:1–38, 2014.

95 / 95

P. Rigollet and A. Tsybakov. Exponential screening and optimal rates of sparseestimation. The Annals of Statistics, 39(2):731–771, 2011.

R. T. Rockafellar. Augmented Lagrangians and applications of the proximal pointalgorithm in convex programming. Mathematics of Operations Research, 1:97–116, 1976.

M. Rudelson and S. Zhou. Reconstruction from anisotropic randommeasurements. IEEE Transactions of Information Theory, 39, 2013.

A. Saha and A. Tewari. On the non-asymptotic convergence of cyclic coordinatedescent methods. SIAM Journal on Optimization, 23(1):576–601, 2013.

M. Schmidt, N. Le Roux, and F. R. Bach. Minimizing finite sums with thestochastic average gradient, 2013. hal-00860051.

S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods forregularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.

J. Shawe-Taylor. Kernel learning for novelty detection. In NIPS 2008 Workshopon Kernel Learning: Automatic Selection of Optimal Kernels, Whistler, 2008.

S. Sonnenburg, G. Ratsch, C. Schafer, and B. Scholkopf. Large scale multiplekernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

N. Srebro, N. Alon, and T. Jaakkola. Generalization error bounds for collaborativeprediction with low-rank matrices. In Advances in Neural InformationProcessing Systems (NIPS) 17, 2005.

95 / 95

I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squaresregression. In Proceedings of the Annual Conference on Learning Theory, pages79–93, 2009.

T. Suzuki. Unifying framework for fast learning rate of non-sparse multiple kernellearning. In Advances in Neural Information Processing Systems 24, pages1575–1583, 2011. NIPS2011.

T. Suzuki. Pac-bayesian bound for gaussian process regression and multiple kerneladditive model. In JMLR Workshop and Conference Proceedings, volume 23,pages 8.1–8.20, 2012. Conference on Learning Theory (COLT2012).

T. Suzuki. Dual averaging and proximal gradient descent for online alternatingdirection multiplier method. In Proceedings of the 30th InternationalConference on Machine Learning, pages 392–400, 2013.

T. Suzuki. Stochastic dual coordinate ascent with alternating direction method ofmultipliers. In Proceedings of the 31th International Conference on MachineLearning, pages 736–744, 2014.

T. Suzuki and M. Sugiyama. Fast learning rate of multiple kernel learning:trade-off between sparsity and smoothness. The Annals of Statistics, 41(3):1381–1405, 2013.

T. Suzuki and R. Tomioka. SpicyMKL, 2009. arXiv:0909.5026.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of theRoyal Statistical Society, Series B, 58(1):267–288, 1996.

95 / 95

R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity andsmoothness via the fused lasso. 67(1):91–108, 2005.

R. Tomioka and T. Suzuki. Sparsity-accuracy trade-off in MKL. In NIPS 2009Workshop: Understanding Multiple Kernel Learning Methods, Whistler, 2009.

R. Tomioka and T. Suzuki. Convex tensor decomposition via structured schattennorm regularization. In Advances in Neural Information Processing Systems 26,page accepted, 2013. NIPS2013.

R. Tomioka, T. Suzuki, K. Hayashi, and H. Kashima. Statistical performance ofconvex tensor decomposition. In Advances in Neural Information ProcessingSystems 24, pages 972–980, 2011. NIPS2011.

S. van de Geer, P. Bühlmann, Y. Ritov, and R. Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.

S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.

L. Xiao. Dual averaging methods for regularized stochastic learning and onlineoptimization. In Advances in Neural Information Processing Systems 23, 2009.

L. Xiao and T. Zhang. A proximal stochastic gradient method with progressivevariance reduction. SIAM Journal on Optimization, 24:2057–2075, 2014.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphicalmodel. Biometrika, 94(1):19–35, 2007.

95 / 95

C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty.The Annals of Statist, 38(2):894–942, 2010.

P. Zhang, A. Saha, and S. V. N. Vishwanathan. Regularized risk minimization bynesterov’s accelerated gradient methods: Algorithmic extensions and empiricalstudies. CoRR, abs/1011.0472, 2010.

T. Zhang. Some sharp performance bounds for least squares regression with l1regularization. The Annals of Statistics, 37(5):2109–2144, 2009.

H. Zou. The adaptive lasso and its oracle properties. Journal of the AmericanStatistical Association, 101(476):1418–1429, 2006.

. . IEICE Fundamentals Review, 4(1):39–47, 2010. (In Japanese.)

95 / 95