Robust Deep Learning Based on Meta-learning


Page 1: Robust Deep Learning Based on Meta-learning

Deyu Meng

Xi’an Jiaotong University, [email protected]

http://gr.xjtu.edu.cn/web/dymeng

Robust Deep Learning Based on Meta-learning

Page 2: Robust Deep Learning Based on Meta-learning

• Deep Learning

• Robust

• Meta-learning

Page 3: Robust Deep Learning Based on Meta-learning

The success of deep learning relies on well-annotated & big data sets (e.g., LFW).

Page 4: Robust Deep Learning Based on Meta-learning

What we think we have vs. what we really have in practice:

Page 5: Robust Deep Learning Based on Meta-learning

Commonly Encountered Data Bias (low quality data)

Label noise Data noise Class imbalance

Page 6: Robust Deep Learning Based on Meta-learning

• Deep Learning

• Robust

• Meta-learning

Page 7: Robust Deep Learning Based on Meta-learning

Robust Machine Learning for Data Bias

Design specific optimization objective (especially, robust loss)

to make it robust to certain data bias:

Label noise  Data noise  Class imbalance

Lin et al., TPAMI, 2018; Yong et al., TPAMI, 2018; Meng et al., Information Sciences, 2017

Page 8: Robust Deep Learning Based on Meta-learning

Two Critical Issues

Generalized Cross Entropy (Zhang et al., NeurIPS, 2018)

Symmetric Cross Entropy (Wang et al., ICCV, 2019)

Bi-Tempered Logistic Loss (Amid et al., NeurIPS, 2019)

Polynomial Soft Weighting Loss (Zhao et al., AAAI, 2015)

Focal Loss (Lin et al., TPAMI, 2018)

CT Loss (Xie et al., TMI, 2018)

Hyperparameter tuning

Non-convexity

Page 9: Robust Deep Learning Based on Meta-learning

• Deep Learning

• Robust

• Meta-learning

Page 10: Robust Deep Learning Based on Meta-learning

Training Data vs. Validation Data

Hyper-parameter tuning: by validation data

Training loss:  $\boldsymbol{w}^{*}(\Theta) = \arg\min_{\boldsymbol{w}} \frac{1}{N}\sum_{i=1}^{N} L_i^{\mathrm{train}}(\boldsymbol{w};\Theta)$

Validation loss:  $\Theta^{*} \approx \arg\min_{\Theta \in \{\Theta_1,\Theta_2,\cdots,\Theta_s\}} \frac{1}{M}\sum_{i=1}^{M} L_i^{\mathrm{val}}(\boldsymbol{w}^{*}(\Theta))$

Page 11: Robust Deep Learning Based on Meta-learning

Training Data vs. Validation Data

Hyper-parameter tuning: by validation data

Training loss:  $\boldsymbol{w}^{*}(\Theta) = \arg\min_{\boldsymbol{w}} \frac{1}{N}\sum_{i=1}^{N} L_i^{\mathrm{train}}(\boldsymbol{w};\Theta)$

Validation loss:  $\Theta^{*} \approx \arg\min_{\Theta \in \{\Theta_1,\Theta_2,\cdots,\Theta_s\}} \frac{1}{M}\sum_{i=1}^{M} L_i^{\mathrm{val}}(\boldsymbol{w}^{*}(\Theta))$

✓ Low efficiency
✓ Low accuracy
✓ Search instead of optimization
✓ Heuristic instead of intelligent
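To make the search above concrete, here is a minimal, self-contained sketch (not from the slides; ridge regression on synthetic data stands in for the learner, and the regularization strength λ plays the role of Θ) of picking a hyper-parameter from a candidate set by the validation loss of the inner solution w*(Θ):

```python
# Hyper-parameter selection by search over a candidate set, as in the formula above:
# Theta* ~= argmin over {Theta_1,...,Theta_s} of the validation loss of w*(Theta).
import numpy as np

rng = np.random.default_rng(0)
X_tr, X_val = rng.normal(size=(80, 10)), rng.normal(size=(20, 10))
w_true = rng.normal(size=10)
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=80)   # noisy training labels
y_val = X_val @ w_true                              # cleaner held-out labels

def fit_ridge(lam):
    """Inner problem: w*(Theta) = argmin_w training loss + lam * ||w||^2."""
    d = X_tr.shape[1]
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)

def val_loss(w):
    """Outer objective: (1/M) sum_i L_i^val(w*(Theta)) on held-out data."""
    return np.mean((X_val @ w - y_val) ** 2)

candidates = [0.01, 0.1, 1.0, 10.0, 100.0]          # {Theta_1, ..., Theta_s}
best_lam = min(candidates, key=lambda lam: val_loss(fit_ridge(lam)))
print("selected lambda:", best_lam)
```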

Page 12: Robust Deep Learning Based on Meta-learning

• The function of validation data is at a higher level than that of training data
➢ Hyper-parameter tuning vs. classifier parameter learning
➢ It makes the model adaptable to the data to fit (general to specific)

• Validation data is different from training data!
➢ Teacher vs. student
➢ Ideal vs. real
➢ High quality vs. low quality
➢ Small scale vs. large scale
➢ Fixed vs. dynamic (relatively)

• What should we do?
➢ Lower the threshold for training data collection; raise the threshold for validation data selection

Intrinsic Functions of Validation Data

Page 13: Robust Deep Learning Based on Meta-learning

✓ Optimization instead of search

✓ Intelligent instead of heuristic (partially)

From Validation Loss Searching to Meta Loss Training

Hyper-parameter tuning: by meta data

Training loss:  $\boldsymbol{w}^{*}(\Theta) = \arg\min_{\boldsymbol{w}} \frac{1}{N}\sum_{i=1}^{N} L_i^{\mathrm{train}}(\boldsymbol{w};\Theta)$

Meta loss:  $\Theta^{*} = \arg\min_{\Theta \in \mathcal{G}} \frac{1}{M}\sum_{i=1}^{M} L_i^{\mathrm{meta}}(\boldsymbol{w}^{*}(\Theta))$
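By contrast with grid search, the meta-loss view treats Θ as a variable of an outer optimization. A minimal sketch, under the assumptions of a continuous hyper-parameter and a finite-difference hyper-gradient (the gradient-based methods cited on the next pages use exact or approximate analytic hyper-gradients instead):

```python
# "Optimization instead of search": update the hyper-parameter by gradient
# descent on the meta objective, rather than picking it from a grid.
import numpy as np

rng = np.random.default_rng(0)
X_tr, X_meta = rng.normal(size=(80, 10)), rng.normal(size=(20, 10))
w_true = rng.normal(size=10)
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=80)   # noisy training labels
y_meta = X_meta @ w_true                            # clean meta labels

def w_star(log_lam):
    """Inner solution w*(Theta) in closed form; log-parameterization keeps lambda > 0."""
    lam = np.exp(log_lam)
    d = X_tr.shape[1]
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)

def meta_loss(log_lam):
    """Outer objective: mean squared error of w*(Theta) on the meta set."""
    w = w_star(log_lam)
    return np.mean((X_meta @ w - y_meta) ** 2)

log_lam, lr, eps = 0.0, 0.3, 1e-4
for _ in range(50):
    # finite-difference hyper-gradient of the meta loss w.r.t. log(lambda)
    grad = (meta_loss(log_lam + eps) - meta_loss(log_lam - eps)) / (2 * eps)
    log_lam -= lr * grad
print("learned lambda:", np.exp(log_lam))
```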

Page 14: Robust Deep Learning Based on Meta-learning

Many Recent Attempts

◆ Loss function.

Wu L, Tian F, Xia Y, et al. Learning to teach with dynamic loss functions. In NeurIPS, 2018: 6466-6477.
Huang C, Zhai S, Talbott W, et al. Addressing the Loss-Metric Mismatch with Adaptive Loss Alignment. In ICML, 2019: 2891-2900.
Xu H, Zhang H, Hu Z, et al. AutoLoss: Learning Discrete Schedule for Alternate Optimization. In ICLR, 2019.
Li C, Yuan X, Lin C, et al. AM-LFS: AutoML for Loss Function Search. In ICCV, 2019: 8410-8419.
Grabocka J, Scholz R, Schmidt-Thieme L. Learning Surrogate Losses. arXiv preprint arXiv:1905.10108, 2019.

◆ Regularization.

Feng J, Simon N. Gradient-based regularization parameter selection for problems with nonsmooth penalty functions. Journal of Computational and Graphical Statistics, 2018, 27(2): 426-435.
Frecon J, Salzo S, Pontil M. Bilevel learning of the group lasso structure. In NeurIPS, 2018: 8301-8311.
Streeter M. Learning Optimal Linear Regularizers. In ICML, 2019: 5996-6004.

◆ Learner (NAS).

Zoph B, Le Q V. Neural architecture search with reinforcement learning. In ICLR, 2017.
Baker B, Gupta O, Naik N, et al. Designing neural network architectures using reinforcement learning. In ICLR, 2017.
Pham H, Guan M, Zoph B, et al. Efficient Neural Architecture Search via Parameter Sharing. In ICML, 2018: 4092-4101.
Zoph B, Vasudevan V, Shlens J, et al. Learning transferable architectures for scalable image recognition. In CVPR, 2018: 8697-8710.
Liu H, Simonyan K, Yang Y. DARTS: Differentiable architecture search. In ICLR, 2019.
Xie S, Zheng H, Liu C, et al. SNAS: Stochastic neural architecture search. In ICLR, 2019.
Liu C, Zoph B, Neumann M, et al. Progressive neural architecture search. In ECCV, 2018: 19-34.

Page 15: Robust Deep Learning Based on Meta-learning

Many Recent Attempts

◆ Hyper-parameter learning.

Maclaurin D, Duvenaud D, Adams R. Gradient-based hyperparameter optimization through reversible learning. In ICML, 2015: 2113-2122.
Pedregosa F. Hyperparameter optimization with approximate gradient. In ICML, 2016: 737-746.
Luketina J, Berglund M, Greff K, et al. Scalable gradient-based tuning of continuous regularization hyperparameters. In ICML, 2016: 2952-2960.
Franceschi L, Donini M, Frasconi P, et al. Forward and reverse gradient-based hyperparameter optimization. In ICML, 2017: 1165-1173.
Franceschi L, Frasconi P, Salzo S, et al. Bilevel Programming for Hyperparameter Optimization and Meta-Learning. In ICML, 2018: 1563-1572.

◆ Gradients and learning rate.

Andrychowicz M, Denil M, Gomez S, et al. Learning to learn by gradient descent by gradient descent. In NeurIPS, 2016.
Baydin A G, Cornish R, Rubio D M, et al. Online learning rate adaptation with hypergradient descent. In ICLR, 2018.
Jacobsen A, Schlegel M, Linke C, et al. Meta-descent for Online, Continual Prediction. In AAAI, 2019.
Metz L, et al. Understanding and correcting pathologies in the training of learned optimizers. In ICML, 2019: 4556-4565.
Xu Z, Dai A M, Kemp J, et al. Learning an Adaptive Learning Rate Schedule. arXiv preprint arXiv:1909.09712, 2019.

◆ Sample reweighting.

Jiang L, Zhou Z, Leung T, et al. MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels. In ICML, 2018: 2309-2318.
Ren M, Zeng W, Yang B, et al. Learning to Reweight Examples for Robust Deep Learning. In ICML, 2018: 4331-4340.
Shu J, Xie Q, Yi L, et al. Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting. In NeurIPS, 2019.
Zhao S, Fard M M, Narasimhan H, et al. Metric-Optimized Example Weights. In ICML, 2019: 7533-7542.

Page 16: Robust Deep Learning Based on Meta-learning

• Deep Learning

• Robust

• Meta-learning

Page 17: Robust Deep Learning Based on Meta-learning

Generalized Cross Entropy (Zhang et al., NeurIPS, 2018)

Symmetric Cross Entropy (Wang et al., ICCV, 2019)

Bi-Tempered Logistic Loss (Amid et al., NeurIPS, 2019)

Polynomial Soft Weighting Loss (Zhao et al., AAAI, 2015)

Adaptively Learning the Robust Loss

Page 18: Robust Deep Learning Based on Meta-learning

Training loss Meta loss

Hyperparameter Learning by Meta Learning

Shu, et al., submitted, 2019
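For illustration, the sketch below takes the q parameter of the Generalized Cross Entropy loss, $L_q = (1 - p_y^q)/q$, as the hyper-parameter and updates it by a one-step-lookahead meta gradient on a small clean meta set. This is a hedged sketch of the general idea, not the exact algorithm of Shu et al.; the toy linear model, synthetic data, and step sizes are assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, C, N, M = 10, 3, 300, 30
X_tr, y_tr = torch.randn(N, d), torch.randint(0, C, (N,))       # (possibly noisy) training set
X_meta, y_meta = torch.randn(M, d), torch.randint(0, C, (M,))   # small clean meta set

W = torch.zeros(d, C, requires_grad=True)      # classifier parameters
q = torch.tensor(0.5, requires_grad=True)      # robust-loss hyper-parameter
alpha, beta = 0.1, 0.01                         # inner / outer step sizes

def gce_loss(logits, y, q):
    # Generalized Cross Entropy: L_q = (1 - p_y^q) / q  (CE as q -> 0, MAE-like at q = 1)
    p_y = torch.softmax(logits, dim=1).gather(1, y[:, None]).squeeze(1)
    return ((1.0 - p_y ** q) / q).mean()

for _ in range(100):
    # virtual one-step update of W under the current q (graph kept for the hyper-gradient)
    g_W = torch.autograd.grad(gce_loss(X_tr @ W, y_tr, q), W, create_graph=True)[0]
    W_virtual = W - alpha * g_W
    # meta loss on the clean meta set drives the update of q
    g_q = torch.autograd.grad(F.cross_entropy(X_meta @ W_virtual, y_meta), q)[0]
    with torch.no_grad():
        q -= beta * g_q
        q.clamp_(0.05, 1.0)                     # keep q inside the valid GCE range
    # actual update of W with the refreshed q
    g_W = torch.autograd.grad(gce_loss(X_tr @ W, y_tr, q.detach()), W)[0]
    with torch.no_grad():
        W -= alpha * g_W
```

Clamping q keeps the loss inside the GCE family while it is being adapted alongside the classifier.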

Page 19: Robust Deep Learning Based on Meta-learning

Experimental Results

Shu, et al., submitted, 2019

Page 20: Robust Deep Learning Based on Meta-learning

Experimental Results

✓ The hyper-parameter adaptively learned by meta-learning is actually not the optimal one for the original loss trained with a fixed hyper-parameter throughout its iterations.

✓ Meta-learning adaptively finds a proper hyper-parameter and simultaneously explores good network parameters under the current hyper-parameter, in a dynamic way.

✓ Such an adaptive learning manner is better suited to obtaining good values for both simultaneously, rather than updating one while the other is held fixed.

Shu, et al., submitted, 2019

Page 21: Robust Deep Learning Based on Meta-learning

What If the Model Contains a Large Number of Hyperparameters?

➢ The overfitting issue easily occurs (similar to conventional machine learning)
➢ How to alleviate this issue?
➢ Build a parametric prior representation (neither too large nor too small) for the hyperparameters (similar to conventional machine learning)
➢ Learner vs. meta-learner
➢ Need to deeply understand the data as well as the learning problem!

✓ Multi-view learning, multi-task learning (parameter - similar)

✓ Subspace learning (matrix – low rank)

Training loss Meta loss

Page 22: Robust Deep Learning Based on Meta-learning

What If the Model Contains a Large Number of Hyperparameters?

Page 23: Robust Deep Learning Based on Meta-learning

• Deep Learning

• Robust

• Meta-learning

Page 24: Robust Deep Learning Based on Meta-learning

Deep Learning with Training Data Bias

Problem: big data often comes with noisy labels or class imbalance.

Page 25: Robust Deep Learning Based on Meta-learning

Deep Networks tend to overfit to Training Data!

Deep neural networks easily fit (memorize) random labels.

Zhang C, Bengio S, Hardt M, et al. Understanding deep learning requires rethinking generalization. In ICLR, 2017 (Best Paper).

Zhang et al. (2017) found that:

Page 26: Robust Deep Learning Based on Meta-learning

How can we robustly train deep networks on biased training data to improve their generalization performance?

Page 27: Robust Deep Learning Based on Meta-learning

Related Work: Learning with Training Data Bias

◆ Sample weighting methods

✓ Dataset resampling (Chawla et al., 2002)
✓ Instance re-weighting (Zadrozny, 2004)
✓ AdaBoost (Freund & Schapire, 1997)
✓ Hard example mining (Malisiewicz et al., 2011)
✓ Focal loss (Lin et al., 2018)
✓ Self-paced learning (Kumar et al., 2010)
✓ Iterative reweighting (De la Torre & Black, 2003; Zhang & Sabuncu, 2018)
✓ Prediction variance (Chang et al., 2017)

◆ Meta learning methods
✓ FWL (Dehghani et al., 2018)
✓ Learning to teach (Fan et al., 2018; Wu et al., 2018)
✓ MentorNet (Jiang et al., 2018)
✓ L2RW (Ren et al., 2018)

◆ Other methods
✓ GLC (Hendrycks et al., 2018)
✓ Reed (Reed et al., 2015)
✓ Co-teaching (Han et al., 2018)
✓ D2L (Ma et al., 2018)
✓ S-Model (Goldberger & Ben-Reuven, 2017)

Page 28: Robust Deep Learning Based on Meta-learning

Sample weighting methods

Existing studies define the curriculum as a hand-designed weighting function for specific tasks, with extra hyper-parameters that must be set.

Strategy | Regularizer G | Weight v*
Self-paced [Kumar et al., NIPS 2010] | −λ‖v‖₁ | v* = 𝕀(l_i ≤ λ)
Linear weighting [Jiang et al., AAAI 2015] | (λ/2) Σ_{i=1}^{n} (v_i² − 2v_i) | v* = max(0, 1 − l_i/λ)
Focal loss [Lin et al., ICCV 2017] | − | v* = (1 − exp(−l_i))^α
Hard example mining [Malisiewicz et al., ICCV 2011] | − | v* = 𝕀(l_i > λ(1 − y_i))
Prediction variance [Chang et al., NIPS 2017] | − | v* = (1/Z)(Var(l_i) + Var(l_i)/|l_i|)

Page 29: Robust Deep Learning Based on Meta-learning

Strategy | Regularizer G | Weight v*
Self-paced [Kumar et al., NIPS 2010] | −λ‖v‖₁ | v* = 𝕀(l_i ≤ λ)
Linear weighting [Jiang et al., AAAI 2015] | (λ/2) Σ_{i=1}^{n} (v_i² − 2v_i) | v* = max(0, 1 − l_i/λ)
Focal loss [Lin et al., ICCV 2017] | − | v* = (1 − exp(−l_i))^α
Hard example mining [Malisiewicz et al., ICCV 2011] | − | v* = 𝕀(l_i > λ(1 − y_i))
Prediction variance [Chang et al., NIPS 2017] | − | v* = (1/Z)(Var(l_i) + Var(l_i)/|l_i|)

⚫ Need to pre-specify the form of the weighting function

⚫ Need to manually set hyper-parameters

Sample weighting methods
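Written as functions of the per-sample loss, the hand-designed rules from the table above look as follows; the λ and α arguments are exactly the hyper-parameters that must be set manually. A minimal numpy sketch with illustrative values:

```python
import numpy as np

def self_paced_weight(losses, lam):
    """Self-paced: v* = 1 if l_i <= lambda else 0 (hard selection of easy samples)."""
    return (losses <= lam).astype(float)

def linear_weight(losses, lam):
    """Linear (soft) self-paced weighting: v* = max(0, 1 - l_i / lambda)."""
    return np.maximum(0.0, 1.0 - losses / lam)

def focal_style_weight(losses, alpha):
    """Focal-style: larger loss gets larger weight, v* = (1 - exp(-l_i))^alpha."""
    return (1.0 - np.exp(-losses)) ** alpha

losses = np.array([0.1, 0.5, 1.0, 3.0])
print(self_paced_weight(losses, lam=1.0))      # emphasizes easy samples
print(focal_style_weight(losses, alpha=2.0))   # emphasizes hard samples
```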

Page 30: Robust Deep Learning Based on Meta-learning

Meta Data and Meta Loss

Meta data / Training data

Page 31: Robust Deep Learning Based on Meta-learning

L2RW [Ren et al., ICML 2018]

Directly learning example weights from training and meta data
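A condensed PyTorch sketch of the L2RW idea: per-batch example weights are obtained from the gradient of the meta loss with respect to zero-initialized perturbation weights, with no extra learnable meta parameters. The toy linear classifier, batch sizes, and step size below are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, C, alpha = 10, 3, 0.1
W = torch.zeros(d, C, requires_grad=True)                      # classifier parameters
x_tr, y_tr = torch.randn(32, d), torch.randint(0, C, (32,))    # (possibly noisy) training batch
x_meta, y_meta = torch.randn(8, d), torch.randint(0, C, (8,))  # small clean meta batch

# zero-initialized perturbation weights on the per-sample training losses
eps = torch.zeros(x_tr.size(0), requires_grad=True)
losses = F.cross_entropy(x_tr @ W, y_tr, reduction="none")
g = torch.autograd.grad((eps * losses).sum(), W, create_graph=True)[0]
W_virtual = W - alpha * g                                      # one-step lookahead

# example weights = rectified negative meta-gradient w.r.t. eps, then normalized
meta_loss = F.cross_entropy(x_meta @ W_virtual, y_meta)
v = torch.clamp(-torch.autograd.grad(meta_loss, eps)[0], min=0.0)
v = v / (v.sum() + 1e-8)

# actual update of W with the per-sample weights v
weighted_loss = (v.detach() * F.cross_entropy(x_tr @ W, y_tr, reduction="none")).sum()
g_W = torch.autograd.grad(weighted_loss, W)[0]
with torch.no_grad():
    W -= alpha * g_W
```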

Page 32: Robust Deep Learning Based on Meta-learning

Meta Data and Meta Loss

Meta data / Training data

Training Loss

Input Structure

Meta Loss

Page 33: Robust Deep Learning Based on Meta-learning

MentorNet [Jiang et al., ICML 2018]

The meta-learner is complex and hard to reproduce.

Very complex input; very complex Θ.

Page 34: Robust Deep Learning Based on Meta-learning

Our work

Meta-Weight-Net

Input: loss. Θ: MLP.
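A minimal PyTorch sketch of this meta-learner: a one-hidden-layer MLP that maps each per-sample training loss to a weight in [0, 1]. The hidden width (100) and activations follow the usual description of Meta-Weight-Net, but treat them here as illustrative assumptions.

```python
import torch
import torch.nn as nn

class MetaWeightNet(nn.Module):
    """Maps a per-sample training loss l_i to a weight v_i in [0, 1]."""
    def __init__(self, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),     # input: scalar loss
            nn.Linear(hidden, 1), nn.Sigmoid(),  # output: weight in [0, 1]
        )

    def forward(self, losses):                   # losses: tensor of shape (batch,)
        return self.net(losses.unsqueeze(1)).squeeze(1)

weights = MetaWeightNet()(torch.tensor([0.2, 1.5, 4.0]))  # example usage
```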

Page 35: Robust Deep Learning Based on Meta-learning

Our work

Inner loop:

Outer loop:

Notation:

◆ Θ: parameters of the teacher
◆ 𝑤: parameters of the student

Meta-Weight-Net (Shu et al., NeurIPS, 2019)

Page 36: Robust Deep Learning Based on Meta-learning

Our work

[Steps 5–7 of the Meta-Weight-Net algorithm; the update equations are shown as figures on this slide.]

Shu, et al., NeurIPS, 2019
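A hedged, self-contained sketch of one training iteration in the spirit of Steps 5–7 above: a virtual one-step update of the student w under the current weight net Θ, a meta update of Θ on clean meta data, and then the actual update of w. The toy linear classifier, synthetic data, and step sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d, C, alpha, beta = 10, 3, 0.1, 0.01
W = torch.zeros(d, C, requires_grad=True)                        # student parameters w
vnet = nn.Sequential(nn.Linear(1, 100), nn.ReLU(),
                     nn.Linear(100, 1), nn.Sigmoid())            # teacher Theta: loss -> weight
opt_vnet = torch.optim.SGD(vnet.parameters(), lr=beta)
x_tr, y_tr = torch.randn(64, d), torch.randint(0, C, (64,))      # noisy training batch
x_meta, y_meta = torch.randn(16, d), torch.randint(0, C, (16,))  # clean meta batch

# (Step 5 analog) virtual one-step update of W with weights from the current vnet
losses = F.cross_entropy(x_tr @ W, y_tr, reduction="none")
v = vnet(losses.detach().unsqueeze(1)).squeeze(1)
g = torch.autograd.grad((v * losses).mean(), W, create_graph=True)[0]
W_virtual = W - alpha * g

# (Step 6 analog) update vnet (Theta) by the meta loss of the virtually updated student
meta_loss = F.cross_entropy(x_meta @ W_virtual, y_meta)
opt_vnet.zero_grad()
meta_loss.backward()        # second-order hyper-gradient flows back into vnet
opt_vnet.step()

# (Step 7 analog) actual update of W using weights from the updated vnet
losses = F.cross_entropy(x_tr @ W, y_tr, reduction="none")
with torch.no_grad():
    v = vnet(losses.detach().unsqueeze(1)).squeeze(1)
g_W = torch.autograd.grad((v * losses).mean(), W)[0]
with torch.no_grad():
    W -= alpha * g_W
```

Freezing the weight net in the final step keeps the student update a plain weighted SGD step, which is what makes this style of method a drop-in wrapper around ordinary training.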

Page 37: Robust Deep Learning Based on Meta-learning

Our work

Shu, et al., NeurIPS, 2019

Page 38: Robust Deep Learning Based on Meta-learning

Experiments

Page 39: Robust Deep Learning Based on Meta-learning

Experimental Setup: Class Imbalance

Datasets: CIFAR-10 & CIFAR-100

Shu, et al., NeurIPS, 2019

Page 40: Robust Deep Learning Based on Meta-learning

Experimental Setup: Noisy Label

Datasets: CIFAR-10 & CIFAR-100

Shu, et al., NeurIPS, 2019

Page 41: Robust Deep Learning Based on Meta-learning

Stability analysis of Meta-Weight-Net

Shu, et al., NeurIPS, 2019

Page 42: Robust Deep Learning Based on Meta-learning

Real Data Experiment

Shu, et al., NeurIPS, 2019

Page 43: Robust Deep Learning Based on Meta-learning

Insight: Adaptively Learn the Weight Function

Shu, et al., NeurIPS, 2019

Page 44: Robust Deep Learning Based on Meta-learning

Future research

◆ Extension to other semi-/weakly-supervised learning problems

◆ Further improvements to Meta-Weight-Net

◆ Multi-view learning, ensemble learning, domain adaptation

◆ General hyper-parameter learning (meta-learner design)

Page 45: Robust Deep Learning Based on Meta-learning

Jun Shu, Qian Zhao, Keyu Chen, Zongben Xu, Deyu Meng. Learning Adaptive Loss for Robust Learning with Noisy Labels. arXiv:2002.06482, 2020.

Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, Deyu Meng. Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting. NeurIPS, 2019.

Page 46: Robust Deep Learning Based on Meta-learning