AdaNet: Adaptive Structural Learning of Artificial Neural Networks
Corinna Cortes1, Xavier Gonzalvo1, Vitaly Kuznetsov1, Mehryar Mohri2,1, Scott Yang2
1Google Research
2Courant Institute
ICML, 2017 / Presenter: Anant Kharkar
Outline
1 Introduction: Motivation & Contributions
2 Theoretical Background: Regularizer Derivation; Generalization Bounds
3 AdaNet: Basics; Algorithm; Results; Variations
Motivation & Contributions
Motivation
Lack of theory behind model architecture design
Usually requires domain knowledge
Contributions
New regularizer that learns parameters & architecture simultaneously
Additive model
Context
Strongly convex optimization problems
Generic network structure (not just MLP)
Regularizer Derivation
Function families:

\[ \mathcal{H}_1 = \left\{ x \mapsto u \cdot \Psi(x) \,:\, u \in \mathbb{R}^{n_0},\ \|u\|_p \le \Lambda_{1,0} \right\} \]

\[ \mathcal{H}_k = \left\{ x \mapsto \sum_{s=1}^{k-1} u_s \cdot (\varphi_s \circ h_s)(x) \,:\, u_s \in \mathbb{R}^{n_s},\ \|u_s\|_p \le \Lambda_{k,s},\ h_s \in \mathcal{H}_s \right\} \]

Definitions:
x: input
u: connection weights to previous layers
Ψ: feature vector
φ: activation function
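To make the nested structure concrete, here is a minimal numpy sketch (dimensions, norm bounds, and activation are illustrative assumptions; Ψ is taken as the identity feature map and p = 2): a unit in H_2 mixes the activated outputs of norm-bounded units in H_1.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1 = 4, 3        # input dim, number of layer-1 units (assumed values)
Lam = 1.0            # norm bound Lambda
phi = np.tanh        # activation function

def project(u, lam, p=2):
    """Project u onto the l_p ball of radius lam (p = 2 for simplicity)."""
    nrm = np.linalg.norm(u, ord=p)
    return u if nrm <= lam else u * (lam / nrm)

# Units in H_1: x -> u . Psi(x), with Psi = identity and ||u||_p <= Lam
U1 = np.stack([project(rng.normal(size=n0), Lam) for _ in range(n1)])

def h1(x):
    return U1 @ x            # outputs of the n1 layer-1 units

# A unit in H_2: x -> u . (phi o h1)(x), i.e. norm-bounded weights over
# the activated outputs of lower-layer units
u2 = project(rng.normal(size=n1), Lam)

def h2(x):
    return u2 @ phi(h1(x))

x = rng.normal(size=n0)
out = h2(x)
```

Because |tanh| ≤ 1 and ‖u2‖_2 ≤ Λ, the unit's output is bounded by Λ√n1, which is the kind of control the norm constraints buy.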
Generalization Bounds
Rademacher Complexity:

\[ \widehat{\mathfrak{R}}_S(G) = \frac{1}{m}\, \mathbb{E}_{\sigma} \left[ \sup_{h \in G} \sum_{i=1}^{m} \sigma_i h(x_i) \right] \]

Definitions:
x: input
h: function expressed by a single unit
σ_i: i.i.d. Rademacher random variables, uniform on {−1, +1}
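As an illustration (not from the slides), the empirical Rademacher complexity can be estimated by Monte Carlo over draws of σ. For the norm-bounded linear class G = {x ↦ u·x : ‖u‖_2 ≤ Λ}, the inner supremum has the closed form Λ‖Σ_i σ_i x_i‖_2 by duality, so only the expectation needs sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, Lam = 200, 5, 1.0
X = rng.normal(size=(m, d))      # sample S = (x_1, ..., x_m)

def empirical_rademacher(X, Lam, n_draws=2000, rng=rng):
    """Monte Carlo estimate of R_S(G) for G = {x -> u.x : ||u||_2 <= Lam}.

    For this class, sup_{||u||_2 <= Lam} sum_i s_i (u . x_i)
    = Lam * ||sum_i s_i x_i||_2 (dual norm).
    """
    m = X.shape[0]
    total = 0.0
    for _ in range(n_draws):
        s = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
        total += Lam * np.linalg.norm(s @ X)  # closed-form supremum
    return total / (n_draws * m)

est = empirical_rademacher(X, Lam)
```

For standard normal inputs the estimate comes out near Λ√(d/m), matching the familiar O(1/√m) decay of linear-class complexity.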
Theorem 1:

\[ R(f) \le \widehat{R}_{S,\rho}(f) + \frac{4}{\rho} \sum_{k=1}^{l} \|w_k\|_1\, \mathfrak{R}_m(\widetilde{\mathcal{H}}_k) + \frac{2}{\rho} \sqrt{\frac{\log l}{m}} + C(\rho, l, m, \delta) \]

\[ C(\rho, l, m, \delta) = \sqrt{ \left\lceil \frac{4}{\rho^2} \log\!\left(\frac{\rho^2 m}{\log l}\right) \right\rceil \frac{\log l}{m} + \frac{\log\frac{2}{\delta}}{2m} } \]

The term \(\frac{4}{\rho} \sum_{k=1}^{l} \|w_k\|_1\, \mathfrak{R}_m(\widetilde{\mathcal{H}}_k)\) is a weighted sum of Rademacher complexities.
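To get a feel for the bound's additive penalties, C(ρ, l, m, δ) can be evaluated numerically (a quick illustration with made-up values of ρ, l, m, δ; not from the slides). It decays roughly like √(log l / m) as the sample size grows:

```python
import math

def C(rho, l, m, delta):
    """C(rho, l, m, delta) from Theorem 1, as transcribed above (requires l >= 2)."""
    ceil_term = math.ceil((4.0 / rho**2) * math.log(rho**2 * m / math.log(l)))
    return math.sqrt(ceil_term * math.log(l) / m
                     + math.log(2.0 / delta) / (2.0 * m))

# The penalty shrinks as the sample size m grows (depth l = 3, rho = 0.5 fixed)
values = [C(rho=0.5, l=3, m=m, delta=0.05) for m in (1_000, 10_000, 100_000)]
```

With these values the penalty drops from roughly 0.31 at m = 1,000 to about 0.04 at m = 100,000, so for realistic sample sizes the complexity term Σ_k ‖w_k‖_1 R_m(H̃_k) dominates the bound.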
Corollary 1:

\[ R(f) \le \widehat{R}_{S,\rho}(f) + \frac{2}{\rho} \sum_{k=1}^{l} \|w_k\|_1 \left[ \bar{r}_\infty \Lambda_k N_k^{1/q} \sqrt{\frac{2\log(2 n_0)}{m}} \right] + \frac{2}{\rho}\sqrt{\frac{\log l}{m}} + C(\rho, l, m, \delta) \]

where \(C(\rho, l, m, \delta)\) is as in Theorem 1.
Basics
Adaptively grows network structure
Objective function:

\[ F(w) = \frac{1}{m} \sum_{i=1}^{m} \Phi\!\left( 1 - y_i \sum_{j=1}^{N} w_j h_j(x_i) \right) + \sum_{j=1}^{N} \Gamma_j |w_j| \]

Definitions:
Φ: nondecreasing convex function (e.g. \(\Phi(x) = e^x\))
Γ_j = λ r_j + β
r_j = \(\mathfrak{R}_m(\mathcal{H}_{k_j})\) (Rademacher complexity of the family containing h_j)

This objective is convex by construction: Φ is convex and its argument is affine in w, and the ℓ1 penalty is convex.
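A direct numpy transcription of F(w), with Φ = exp and made-up subnetwork outputs h_j(x_i), complexities r_j, and hyperparameters λ, β (all placeholder values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 100, 3
y = rng.choice([-1.0, 1.0], size=m)     # labels y_i
H = rng.normal(size=(m, N))             # H[i, j] = h_j(x_i), subnetwork outputs
r = np.array([0.1, 0.2, 0.3])           # r_j = R_m(H_{k_j}) (placeholder values)
lam, beta = 0.1, 0.01                   # lambda, beta hyperparameters
Gamma = lam * r + beta                  # Gamma_j = lambda * r_j + beta

def F(w, Phi=np.exp):
    margins = y * (H @ w)                         # y_i * sum_j w_j h_j(x_i)
    return Phi(1.0 - margins).mean() + Gamma @ np.abs(w)

w1 = np.array([0.5, -0.2, 0.1])
w2 = np.array([-0.3, 0.4, 0.0])
```

Convexity can be sanity-checked at a midpoint: F((w1 + w2)/2) never exceeds the average of F(w1) and F(w2).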
Algorithm
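The algorithm figure on this slide did not survive extraction. At a high level, AdaNet grows the network greedily: each round trains candidate subnetworks, keeps the one that most decreases the objective F, and stops when no candidate improves it. Below is a heavily simplified, hypothetical sketch of that loop only: random tanh units stand in for candidate subnetworks, and weight fitting is faked with constants, so this illustrates the control flow rather than the paper's actual candidate generation or optimization.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 5
X = rng.normal(size=(m, d))
y = np.sign(X @ rng.normal(size=d))     # toy labels

def F(w, outs, gammas):
    # regularized objective from the Basics slide, with Phi = exp
    margins = y * (outs @ w)
    return np.exp(1.0 - margins).mean() + gammas @ np.abs(w)

ensemble = []                           # outputs h_j(x_i) of accepted subnetworks
best = np.inf
for t in range(10):
    # candidate subnetworks: random tanh units (a crude stand-in)
    candidates = [np.tanh(X @ rng.normal(size=d)) for _ in range(5)]
    scored = []
    for h in candidates:
        outs = np.column_stack(ensemble + [h])
        k = outs.shape[1]
        w = np.full(k, 0.1)             # fake "trained" weights
        g = np.full(k, 0.06)            # fake Gamma_j penalties
        scored.append((F(w, outs, g), h))
    f_new, h_new = min(scored, key=lambda s: s[0])
    if f_new >= best:                   # no candidate improves F: stop growing
        break
    best = f_new
    ensemble.append(h_new)
```

The stopping rule is what makes the structure adaptive: the network only deepens or widens while doing so lowers the complexity-regularized objective.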
Results
Variations
AdaNet.R: adds the regularization term R(w, h) = Γ_h ‖w‖_1 to the objective
AdaNet.P: subnetworks connect only to the previous subnetwork
AdaNet.D: applies dropout to subnetwork connections
AdaNet.SD: uses the standard deviation of the final-layer outputs instead of the Rademacher complexity