
Learning from Noisy Labels with Deep Neural Networks

Sainbayar Sukhbaatar
Dept. of Computer Science, Courant Institute, New York University
[email protected]

Rob Fergus
Facebook AI Research &
Dept. of Computer Science, Courant Institute, New York University
[email protected]

Abstract

We propose several simple approaches to training deep neural networks on data with noisy labels. We introduce an extra noise layer into the network which adapts the network outputs to match the noisy label distribution. The parameters of this noise layer can be estimated as part of the training process and involve simple modifications to current training infrastructures for deep networks. We demonstrate the approaches on several datasets, including large scale experiments on the ImageNet classification benchmark, showing how additional noisy data can improve state-of-the-art recognition models.

1 Introduction

In recent years, deep learning methods have shown impressive results on image classification tasks. However, this achievement is only possible because of the large amount of labeled images. Labeling images by hand is a laborious task that takes a lot of time and money. An alternative approach is to generate labels automatically, for example from user tags on social web sites or keywords returned by image search engines. Considering the abundance of such noisy labels, it is important to find a way to utilize them in deep learning. Unfortunately, these labels are very noisy and are unlikely to help in training deep networks without additional tricks.

Our goal is to study the effect of label noise on deep networks and to explore simple ways of improving robustness. We focus on the robustness of deep networks rather than on data cleaning methods, which are well studied and can be used together with robust models directly. Although many noise-robust classifiers have been proposed, there is little work on training deep networks on noisily labeled data, especially on large scale datasets.

Our contribution in this paper is a novel way of modifying deep learning models so that they can be effectively trained on data with a high level of label noise. The modification simply adds a linear layer on top of the softmax layer, which makes it easy to implement. This additional layer changes the network output to better match the noisy labels. Moreover, the noise distribution can be learned directly from the noisy data. Using real-world image classification tasks, we demonstrate that the model works very well in practice. We even show that random images without labels (complete noise) can improve classification performance.

2 Related Work

In any classification model, degradation of performance is inevitable when there is noise in the training labels [13, 15]. A simple approach to handling noisy labels is a data preprocessing stage, where labels suspected to be incorrect are removed or corrected [1, 3]. However, a weakness of this approach is the difficulty of distinguishing informative hard samples from harmful mislabeled ones [6]. Instead, in this paper, we focus on models that are robust to the presence of label noise.


The effect of label noise is well studied for common classifiers (e.g., SVMs, kNN, logistic regression), and label-noise-robust variants of them have been proposed; see [5] for a comprehensive review. A more recent work [2] proposed a generic unbiased estimator for binary classification with noisy labels. They employ a surrogate cost function that can be expressed as a weighted sum of the original cost functions, and give theoretical bounds on the performance. In this paper, we also consider this idea and extend it to the multiclass setting.

A cost function similar to ours is proposed in [2] to make logistic regression robust to label noise, along with a learning algorithm for the noise parameters. However, we consider deep networks, a more powerful and complex classifier than logistic regression, and propose a different learning algorithm for the noise parameters that is better suited to back-propagation training.

Despite the recent success of deep learning [8, 17, 16], there are very few works on deep learning from noisy labels. In [11, 9], noise modeling is incorporated into a neural network in the same way as in our proposed model. However, only binary classification is considered in [11], and [9] assumes symmetric label noise (noise independent of the true label). In those settings there is only a single noise parameter, which can be tuned by cross-validation. In this paper, we consider multiclass classification and assume more realistic asymmetric label noise, which makes it impossible to adjust the noise parameters by cross-validation (there can be millions of them).

3 Approach

In this paper, we consider two approaches to making an existing classification model, which we call the base model, robust against noisy labels: bottom-up and top-down noise models. In the bottom-up model, we add an extra layer to the model that changes the label probabilities output by the base model so that they better match the noisy labels. The top-down model, on the other hand, changes the given noisy labels before feeding them to the base model. Both models require a noise model for training, so we give a simple way to estimate noise levels using clean data. In the bottom-up model, it is also possible to learn the noise distribution from the noisy data itself. Although only deep neural networks are used in our experiments, both approaches can be applied to any classification model trained with a cross-entropy cost.

3.1 Bottom-up Noise Model

We assume that label noise is random conditioned on the true class, but independent of the input x (see [10] for more detail about this type of noise). For example, an image may have a higher probability of being incorrectly labeled "jaguar" if its true class is "sport car", because of the car manufacturer named Jaguar. Based on this assumption, we add an extra layer to a deep network (see Figure 1) that changes its output so that it better matches the noisy labels. The weights of this layer correspond to the probabilities of a certain class being mislabeled as another class. Because those probabilities are often unknown, we show how to estimate them from additional clean data, or from the noisy data itself.

Let D be the true data distribution generating correctly labeled samples (x, y∗), where x is an input vector and y∗ is the corresponding label. However, we only observe noisy labeled samples (x, y) generated from a noisy distribution D̃. We assume that the label noise is random conditioned on the true labels. The noise distribution can then be parameterized by a matrix Q = {q_ji} with q_ji := p(y = j | y∗ = i). Q is a probability matrix because its elements are non-negative and each column sums to one. The probability of input x being labeled as j in D̃ is given by

p(y = j | x, θ) = ∑_i p(y = j | y∗ = i) p(y∗ = i | x) = ∑_i q_ji p(y∗ = i | x, θ),    (1)

where p(y∗ = i | x, θ) is the probabilistic output of the base model with parameters θ. If the true noise distribution is known, this modification lets us train the model on noisy labeled data. During training, Q acts as an adapter that transforms the model's output to better match the noisy labels.

[Figure 1 diagram: deep network (base model) → softmax → linear noise layer → NLL cost on the noisy label, trained by back-propagation through the noise layer.]

Figure 1: In the bottom-up noise model, we add a noise layer between the softmax and cost layers. The noise layer is a special linear layer whose weights equal the noise distribution. It changes the output probabilities from the base model into a distribution that better matches the noisy labels.


If the base model is a neural network with a softmax output layer, this modification can simply be done by adding a noise layer on top of the softmax layer, as shown in Figure 1. The role of the noise layer is to perform the operation in Eqn. 1. The noise layer is therefore an ordinary linear layer, except that it has no bias and its weights are constrained to lie between 0 and 1, because they represent the conditional probabilities q_ji. Furthermore, the weights fanning out of a single softmax node should sum to one, because ∑_j q_ji = 1.

When the training labels are clean, we set the weight matrix of the noise layer to the identity, which is equivalent to not having the noise layer. For noisy labels, we use the noise distribution Q instead of the identity matrix. Because the noise layer is linear, we can pass gradients through it during training and perform back-propagation to train the rest of the network. As before, the learning objective is to maximize the log likelihood over the N training samples, but now the objective incorporates the noise matrix Q:

L(θ) = (1/N) ∑_{n=1}^{N} log p(y = y_n | x_n, θ) = (1/N) ∑_{n=1}^{N} log ( ∑_i q_{y_n i} p(y∗ = i | x_n, θ) ).    (2)
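As a concrete illustration, the noise layer and the objective of Eqn. 2 take only a few lines in a modern framework. The sketch below is ours rather than the authors' implementation; it assumes a PyTorch classifier base_model whose logits are turned into p(y∗ | x, θ) by a softmax, and the names NoiseLayer, noisy_nll, and num_classes are illustrative.

```python
# Minimal sketch (our illustration, not the paper's code) of the bottom-up
# noise layer and the log-likelihood objective of Eqn. 2.
import torch
import torch.nn as nn

class NoiseLayer(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Q[j, i] = p(y = j | y* = i); initialized to the identity (no noise).
        self.Q = nn.Parameter(torch.eye(num_classes))

    def forward(self, clean_probs):
        # p(y = j | x) = sum_i Q[j, i] * p(y* = i | x)   (Eqn. 1)
        return clean_probs @ self.Q.t()

def noisy_nll(base_model, noise_layer, x, noisy_labels, eps=1e-8):
    clean_probs = torch.softmax(base_model(x), dim=1)   # p(y* = i | x, theta)
    noisy_probs = noise_layer(clean_probs)              # p(y = j | x, theta)
    # Negative of the per-sample log likelihood in Eqn. 2, averaged over the batch.
    return -torch.log(noisy_probs.gather(1, noisy_labels[:, None]) + eps).mean()
```

At test time the noise layer is dropped and predictions are read directly from the base model's softmax output.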

3.2 Estimating Noise Distribution Using Clean Data

In the bottom-up noise model, we need to know the noise distribution Q in order to train the network. Unfortunately, Q is often unknown for real-world tasks. Because the number of free parameters in Q is the square of the number of classes, it is impossible to find Q by cross-validation for large-scale datasets. However, we can obtain an unbiased estimate of Q if some clean data is available in addition to the noisy data, which is often the case. The idea is to measure the confusion matrix of a pre-trained model on the clean data, and also on the noisy data. The discrepancy between these two confusion matrices then reveals the noise distribution Q. This is because the model's mistakes on the noisy data are a combination of two types of mistakes: (1) mistakes of the model itself and (2) mistakes in the noisy labels. The statistics of the first type can be measured on the clean data, and the second type corresponds to the noise distribution.

Let us formulate this more precisely. We have clean data D∗ and noisy data D̃. The goal is to estimate the noise distribution of D̃. We also need a pre-trained model M, which could have been trained on either noisy or clean data (but it should be separate from D∗ and D̃). Let C∗ and C be the confusion matrices of M, whose predicted label we denote ŷ, on the two data sets:

c∗_ij = p(ŷ = i | y∗ = j, M)  measured on D∗,   and   c_ij = p(ŷ = i | y = j, M)  measured on D̃.

Then the relation between the two is ∑_i c∗_ki r_ij = c_kj for all j, k, where r_ij denotes p(y∗ = i | y = j). In matrix form,

C∗R = C  ⟹  R = C∗⁻¹C,    (3)

which holds when C∗ is an invertible matrix. If the computed R has negative values, it should be projected back onto the set of probability matrices. Finally, we can compute Q from R using Bayes' rule:

p(y = j | y∗ = i) = p(y∗ = i | y = j) p(y = j) / p(y∗ = i)  ⟹  q_ji = r_ij p(y = j) / p(y∗ = i).    (4)

We can measure p(y = j) directly from the data, and p(y∗ = i) can be computed from the fact that ∑_j q_ji = 1.
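A small numpy sketch of this closed-form estimate follows. It is our illustration under the assumptions above (C_star and C are the measured confusion matrices, p_noisy is the empirical noisy-label marginal p(y = j)), not code from the paper.

```python
# Sketch of Eqns. 3-4: recover R from the two confusion matrices, project it
# back to a probability matrix, then convert it to Q with Bayes' rule.
import numpy as np

def project_columns_to_simplex(M):
    """Euclidean projection of each column of M onto the probability simplex."""
    K = M.shape[0]
    out = np.empty_like(M)
    for j in range(M.shape[1]):
        v = M[:, j]
        u = np.sort(v)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, K + 1) > (css - 1.0))[0][-1]
        theta = (css[rho] - 1.0) / (rho + 1.0)
        out[:, j] = np.maximum(v - theta, 0.0)
    return out

def estimate_Q(C_star, C, p_noisy):
    # R[i, j] ~ p(y* = i | y = j); Eqn. 3 gives R = C*^{-1} C.
    R = np.linalg.solve(C_star, C)
    R = project_columns_to_simplex(R)        # fix any negative entries
    # Bayes' rule (Eqn. 4): q_ji is proportional to r_ij * p(y = j); normalizing
    # each column enforces sum_j q_ji = 1, which also recovers p(y* = i).
    Q = R.T * p_noisy[:, None]
    return Q / Q.sum(axis=0, keepdims=True)
```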

When the number of clean samples is small relative to the number of elements in R, using the inverse of C∗ is not a good idea because the matrix inverse is unstable in the presence of noise. Instead, we can put a sparsity prior on R and solve the following optimization problem

min_R ( 0.5 ‖C∗R − C‖² + λ|R| )    (5)

under the constraint that R is a probability matrix. Here λ is a hyper-parameter controlling the L1 sparsity of R, which can be determined by cross-validation. This optimization can be solved effectively with a simple gradient descent method. Such a sparse prior over R and Q is useful because, in real-world data, a class is likely to be mislabeled as only a small set of other classes.
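One possible way to solve Eqn. 5 is projected (sub)gradient descent. The sketch below is our illustration and reuses project_columns_to_simplex from the previous sketch; the step size, λ, and iteration count are placeholder values, not settings from the paper.

```python
# Sketch of solving Eqn. 5: min_R 0.5*||C*R - C||^2 + lambda*|R|_1
# subject to R being a probability matrix (each column on the simplex).
import numpy as np

def estimate_R_sparse(C_star, C, lam=0.01, lr=0.1, iters=500):
    K = C_star.shape[0]
    R = np.eye(K)                                  # start from "no noise"
    for _ in range(iters):
        grad = C_star.T @ (C_star @ R - C)         # gradient of the quadratic term
        grad += lam * np.sign(R)                   # subgradient of the L1 term
        R = project_columns_to_simplex(R - lr * grad)
    return R
```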


3.3 Learning Noise Distribution From Noisy Data

In practice, the noise distribution Q is often unknown and we might not have clean data from which to estimate it. If we only have noisy training data, then we have to learn Q̂, an approximation of Q, from the noisy data itself. Since the elements of Q̂ correspond to the weights of the noise layer in our model, they can be learned in the same way as the other weights, using back-propagation. However, the weight matrix Q̂ has to be projected back onto the set of probability matrices after each update.

A problem with this approach is that there is no guarantee that Q̂ will converge to the true Q. The combined model matches the noisy labels via the product of Q̂ and C, the confusion matrix of the base model: Q̂C = Q. Thus, if the base model is powerful enough, it can absorb the noise distribution itself, in which case Q̂ will converge to the identity matrix. To prevent this, we add the regularizer tr(Q̂) to the objective, which pushes Q̂ to be more noisy and in turn makes the base model less noisy, encouraging Q̂ to converge to Q; see Theorem 1.

Theorem 1. Let Q̂, Q, and C be probability matrices. In the following optimization problem, the only global minimum is Q̂ = Q and C = I:

minimize_{Q̂, C}  tr(Q̂)   subject to   Q̂C = Q,   q_ii > q_ij  and  q̂_ii > q̂_ij   for all i and all j ≠ i.

Proof. Let Q̂∗ (together with some C) be a global solution. Since the pair (Q̂, C) = (Q, I) is feasible, tr(Q̂∗) ≤ tr(Q). Conversely,

tr(Q) = tr(Q̂∗C) = ∑_i ∑_j q̂∗_ij c_ji ≤ ∑_i ∑_j q̂∗_ii c_ji = ∑_i q̂∗_ii ( ∑_j c_ji ) = ∑_i q̂∗_ii = tr(Q̂∗).

Hence tr(Q) = tr(Q̂∗). Because q̂∗_ii > q̂∗_ij strictly, equality in the middle step forces c_ji = 0 for all j ≠ i, so C = I and therefore Q̂∗ = Q̂∗C = Q.

Before the model has converged, Q̂C may only approximately equal Q, but we show empirically that the tr(Q̂) regularization works well. In practice, we use weight decay on Q̂ instead, since it is already implemented in most deep learning packages and has a similar effect.
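Continuing the earlier sketches (base_model and noise_layer are assumed instances from Section 3.1, and project_columns_to_simplex and noisy_nll come from the sketches above), one training step when Q̂ is learned from noisy data alone might look as follows; the weight decay value is a placeholder standing in for the trace regularizer.

```python
# Sketch of a Section 3.3 training step: weight decay on Q-hat stands in for
# the trace regularizer, and Q-hat is projected back to a probability matrix
# after each parameter update.
import torch

optimizer = torch.optim.SGD([
    {"params": base_model.parameters()},
    {"params": noise_layer.parameters(), "weight_decay": 1e-3},  # placeholder value
], lr=0.01)

def train_step(x, noisy_labels):
    optimizer.zero_grad()
    loss = noisy_nll(base_model, noise_layer, x, noisy_labels)
    loss.backward()
    optimizer.step()
    with torch.no_grad():   # keep Q-hat a valid probability matrix
        Q_np = project_columns_to_simplex(noise_layer.Q.detach().cpu().numpy())
        noise_layer.Q.copy_(torch.from_numpy(Q_np).to(noise_layer.Q))
    return loss.item()
```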

3.4 Training a Bottom-up Model


We now describe the training procedure for a bottom-up model. If we only have noisy data for training:

1. Initialize the bottom-up model with Q̂ set to the identity matrix.
2. Train for several epochs while keeping Q̂ fixed to the identity. Q̂ should not be updated until the confusion matrix of the base model has large values on its diagonal.
3. Continue training, but now allow Q̂ to be updated. Put a small weight decay on Q̂ if necessary.
4. At test time, discard the noise layer, or equivalently use an identity matrix instead of Q̂.

(A code sketch of this schedule is given at the end of this subsection.)

The training procedure is a little more involved when we have both clean and noisy training data:

1. Split the training data into three sets: clean data D∗, a pre-training set D_m, and noisy data D̃. If clean data is limited and the noise level is not high, noisy data can be used for D_m.
2. Train a normal deep network M on D_m in the usual way.
3. Measure M's confusion matrices on D∗ and D̃, and compute the estimate Q̂ of the noise distribution Q from those confusion matrices. Use the sparsity constraint of Eqn. 5 if the clean data is limited and Q is high-dimensional.
4. Add a noise layer to M and initialize its weights with the estimated Q̂.
5. Train M on all of the training data. If the current batch comes from the noisy data, use the estimated Q̂ in the noise layer; if it comes from the clean data, use an identity matrix instead. Q̂ can also be allowed to update during training. Since the clean data is more reliable than the noisy data, put a smaller weight on the noisy data.
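For concreteness, the noisy-data-only schedule above might be coded as follows, reusing train_step from the sketch in Section 3.3; the epoch counts are placeholders and noisy_loader is an assumed DataLoader yielding (image, noisy label) batches.

```python
# Sketch of the two-phase schedule: Q-hat frozen at the identity first, then
# learned jointly with the base model; dropped entirely at test time.
noise_layer.Q.requires_grad_(False)       # phase 1: Q-hat fixed to identity
for epoch in range(5):                    # placeholder epoch count
    for x, y_noisy in noisy_loader:
        train_step(x, y_noisy)

noise_layer.Q.requires_grad_(True)        # phase 2: learn Q-hat as well
for epoch in range(20):                   # placeholder epoch count
    for x, y_noisy in noisy_loader:
        train_step(x, y_noisy)

# Test time: discard the noise layer (equivalently, set Q-hat to the identity)
# and read predictions from the base model's softmax.
```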

3.5 Top-down Noise Model

Here we consider the alternative approach of transforming the given noisy labels prior to training.

Let c_k(x) be the original cost function for label k. In the case of maximum likelihood estimation, this cost function is

c_k(x) = − log p(y∗ = k | x, θ),

where θ are the parameters of the base model. For noisy labeled data D̃, we instead use surrogate cost functions c′_k(x) that satisfy the unbiasedness condition

c_i(x) = ∑_j p(y = j | y∗ = i) c′_j(x) = ∑_j q_ji c′_j(x)   for i = 1, ..., K,

i.e., the expected surrogate cost under the noise process equals the original cost. In vector form, with c(x) = [c_1(x), ..., c_K(x)]ᵀ and c′(x) = [c′_1(x), ..., c′_K(x)]ᵀ, this reads Qᵀc′(x) = c(x). When Q is invertible, the surrogate costs are

c′(x) = (Qᵀ)⁻¹ c(x).

As with the bottom-up approach, when Q is unknown we replace (Qᵀ)⁻¹ with an estimated matrix R = {r_ij}, giving c′(x) = Rc(x). The final cost function becomes

C(R, θ) = − ∑_{n=1}^{N} ∑_{i=1}^{K} r_{y_n i} log p(y∗ = i | x_n, θ).

This is the same as replacing the noisy label y_n = i with the surrogate vector label r_i = [r_i1, ..., r_iK]ᵀ, the i-th row of R. Unlike the bottom-up approach, we cannot learn the parameters R by minimizing this cost function.
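As an illustration (ours, not the authors' code), the surrogate-label cost above can be written as a soft-target cross entropy; R is assumed to be a pre-computed K × K tensor whose i-th row is the surrogate label for noisy class i.

```python
# Sketch of the top-down cost C(R, theta): each noisy label is replaced by the
# corresponding row of R and scored against the model's log-probabilities.
import torch
import torch.nn.functional as F

def top_down_loss(logits, noisy_labels, R):
    targets = R[noisy_labels]                 # surrogate vector labels r_{y_n}
    log_probs = F.log_softmax(logits, dim=1)  # log p(y* = i | x, theta)
    return -(targets * log_probs).sum(dim=1).mean()
```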

3.6 Learning from Multiple Sources

In some cases, the training data may come from several sources with different noise levels. For example, we might have a clean dataset with 100% correct labels and a noisy dataset with many incorrect labels. It is sensible to put more importance on the sources with less noise. This can be done by weighting the sources differently in the cost function:

(1 / (|C| + ∑_k |N_k|)) ( ∑_{n∈C} l(y^(n), f(x^(n))) + ∑_k λ_k ∑_{n∈N_k} l(y^(n), f(x^(n))) ),

where C is the clean set, the N_k are the noisy sets, l is the per-sample loss, f is the model, and λ_k controls the relative weight of the k-th noisy source.
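A sketch of this weighted cost, given per-sample losses that have already been computed for the clean set and for each noisy source (our illustration; the λ_k values are hyper-parameters):

```python
# Sketch of the multiple-source cost: clean losses are weighted by 1 and each
# noisy source N_k by lambda_k, normalized by the total number of samples.
import torch

def multi_source_cost(clean_losses, noisy_losses_per_source, lambdas):
    # clean_losses: 1-D tensor of l(y, f(x)) over the clean set C
    # noisy_losses_per_source: list of 1-D tensors, one per noisy source N_k
    # lambdas: list of floats, one weight per noisy source
    total = clean_losses.numel() + sum(l.numel() for l in noisy_losses_per_source)
    weighted = clean_losses.sum()
    for lam, losses in zip(lambdas, noisy_losses_per_source):
        weighted = weighted + lam * losses.sum()
    return weighted / total
```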


Then, the relation between those two isX

j

p(y = k|y⇤ = j, M)p(y⇤ = j|y = i) = p(y = k|y = j, M) =)X

j

ckjrji = cki for 8i, k,

where rij denotes p(y⇤ = i|y = j). If we rewrite this in a matrix form

C⇤R = C =) R = C⇤�1C,

when C⇤ is an invertible matrix. Finally, we can compute Q from R using Bayes’ rule.

In case when the number of clean samples is small relative to the number of elements in R, usingthe inverse of C⇤ is not good idea because the inverse operation is unstable in presence of noise.Instead we can put a sparsity prior on R and solve the following optimization problem

minR

1

2||C⇤R � C||2 + �|R|

under constraint that R should be a probability matrix. Here � is a hyper-parameter controlling theL1 sparsity of R. Such sparse prior over R and Q is useful because in real-world data it is unlikelythat a certain class being mislabeled to all other classes.

2.2 Learning noise distribution from noisy data

In practice, the noise distribution Q is often unknown to us and we might not have clear data toestimate it. If we only have noisy training data, then we have to learn Q approximate of Q from thenoisy data itself. Since the elements of Q correspond the weights of the noise layer in our model,it can be learned in the same way as other weights using back-propagation. In other words, we willmaximize the log likelihood with respect to both ✓ and Q at the same time using stochastic gradientdescent. However, weight matrix Q have to be projected back to the subspace of probability matricesafter each update.

A problem with this learning is that there is no guarantee Q would converge true Q. Actually, if thebase model is powerful enough, it can learn the noise distribution and Q can be an identity matrix.However, it is possible to prevent this problem by forcing Q to be more noisy, which in turn wouldmake the base model less noisy. The confusion matrix of the modified model can be written as

Cm = QCb,

where Cb is the confusion matrix of the base model. Training of model would bring Cm close to thenoise distribution Q of training data because the loss function is defined as cross-entropy betweenthem. Ideally, we want Cb to be identity and Q to be equal to true Q. Unfortunately, we cannotdirectly force Cb to be identity because we can only measure it with clean data. However, we canput any constraint on Q because it is a weight matrix in the model. We argue that making Q morenoisy would be make Cb more close to identity. The intuition for this comes from an ideal casewhere the learning always reaches the optimal state where the following holds true

QC = Q.

In that case, more noisy Q will correspond to less noisy C, because the product of Q and C isconstant.Theorom 1. Let us consider the following optimization problem for probability matrices Q, Q andC.

minimizeQ,C

tr(Q) subject to QC = Q, qii > qij , qii > qij for 8i, j 6= i.

Then, Q = Q and C = I is the only global minimum.

Trace has an inverse relation to the noise level of Q because it represents the correct labels. Then,this theorem shows that C will become identity if we manage to make Q the most noisy. Note thatwe presented this theorem to support our intuition. In practice, we cannot guarantee that minimiz-ing tr(Q) will make C identity because QC = Q is too an ideal assumption. However, we willempirically show that this actually works well in practice. In experiments, we used L2 cost on Qrather than the trace cost because it has a similar effect, but it is already implemented in most ofdeep learning packages.

3

the  base  model  

noise  layer  

Then, the relation between those two isX

j

p(y = k|y⇤ = j, M)p(y⇤ = j|y = i) = p(y = k|y = j, M) =)X

j

ckjrji = cki for 8i, k,

where rij denotes p(y⇤ = i|y = j). If we rewrite this in a matrix form

C⇤R = C =) R = C⇤�1C,

when C⇤ is an invertible matrix. Finally, we can compute Q from R using Bayes’ rule.

In case when the number of clean samples is small relative to the number of elements in R, usingthe inverse of C⇤ is not good idea because the inverse operation is unstable in presence of noise.Instead we can put a sparsity prior on R and solve the following optimization problem

minR

1

2||C⇤R � C||2 + �|R|

under constraint that R should be a probability matrix. Here � is a hyper-parameter controlling theL1 sparsity of R. Such sparse prior over R and Q is useful because in real-world data it is unlikelythat a certain class being mislabeled to all other classes.

2.2 Learning noise distribution from noisy data

In practice, the noise distribution Q is often unknown to us and we might not have clear data toestimate it. If we only have noisy training data, then we have to learn Q approximate of Q from thenoisy data itself. Since the elements of Q correspond the weights of the noise layer in our model,it can be learned in the same way as other weights using back-propagation. In other words, we willmaximize the log likelihood with respect to both ✓ and Q at the same time using stochastic gradientdescent. However, weight matrix Q have to be projected back to the subspace of probability matricesafter each update.

A problem with this learning is that there is no guarantee Q would converge true Q. Actually, if thebase model is powerful enough, it can learn the noise distribution and Q can be an identity matrix.However, it is possible to prevent this problem by forcing Q to be more noisy, which in turn wouldmake the base model less noisy. The confusion matrix of the modified model can be written as

Cm = QCb,

where Cb is the confusion matrix of the base model. Training of model would bring Cm close to thenoise distribution Q of training data because the loss function is defined as cross-entropy betweenthem. Ideally, we want Cb to be identity and Q to be equal to true Q. Unfortunately, we cannotdirectly force Cb to be identity because we can only measure it with clean data. However, we canput any constraint on Q because it is a weight matrix in the model. We argue that making Q morenoisy would be make Cb more close to identity. The intuition for this comes from an ideal casewhere the learning always reaches the optimal state where the following holds true

QC = Q.

In that case, more noisy Q will correspond to less noisy C, because the product of Q and C isconstant.Theorom 1. Let us consider the following optimization problem for probability matrices Q, Q andC.

minimizeQ,C

tr(Q) subject to QC = Q, qii > qij , qii > qij for 8i, j 6= i.

Then, Q = Q and C = I is the only global minimum.

Trace has an inverse relation to the noise level of Q because it represents the correct labels. Then,this theorem shows that C will become identity if we manage to make Q the most noisy. Note thatwe presented this theorem to support our intuition. In practice, we cannot guarantee that minimiz-ing tr(Q) will make C identity because QC = Q is too an ideal assumption. However, we willempirically show that this actually works well in practice. In experiments, we used L2 cost on Qrather than the trace cost because it has a similar effect, but it is already implemented in most ofdeep learning packages.

3

Then, the relation between those two isX

j

p(y = k|y⇤ = j, M)p(y⇤ = j|y = i) = p(y = k|y = j, M) =)X

j

ckjrji = cki for 8i, k,

where rij denotes p(y⇤ = i|y = j). If we rewrite this in a matrix form

C⇤R = C =) R = C⇤�1C,

when C⇤ is an invertible matrix. Finally, we can compute Q from R using Bayes’ rule.

In case when the number of clean samples is small relative to the number of elements in R, usingthe inverse of C⇤ is not good idea because the inverse operation is unstable in presence of noise.Instead we can put a sparsity prior on R and solve the following optimization problem

minR

1

2||C⇤R � C||2 + �|R|

under constraint that R should be a probability matrix. Here � is a hyper-parameter controlling theL1 sparsity of R. Such sparse prior over R and Q is useful because in real-world data it is unlikelythat a certain class being mislabeled to all other classes.

2.2 Learning noise distribution from noisy data

In practice, the noise distribution Q is often unknown to us and we might not have clear data toestimate it. If we only have noisy training data, then we have to learn Q approximate of Q from thenoisy data itself. Since the elements of Q correspond the weights of the noise layer in our model,it can be learned in the same way as other weights using back-propagation. In other words, we willmaximize the log likelihood with respect to both ✓ and Q at the same time using stochastic gradientdescent. However, weight matrix Q have to be projected back to the subspace of probability matricesafter each update.

A problem with this learning is that there is no guarantee Q would converge true Q. Actually, if thebase model is powerful enough, it can learn the noise distribution and Q can be an identity matrix.However, it is possible to prevent this problem by forcing Q to be more noisy, which in turn wouldmake the base model less noisy. The confusion matrix of the modified model can be written as

Cm = QCb,

where Cb is the confusion matrix of the base model. Training of model would bring Cm close to thenoise distribution Q of training data because the loss function is defined as cross-entropy betweenthem. Ideally, we want Cb to be identity and Q to be equal to true Q. Unfortunately, we cannotdirectly force Cb to be identity because we can only measure it with clean data. However, we canput any constraint on Q because it is a weight matrix in the model. We argue that making Q morenoisy would be make Cb more close to identity. The intuition for this comes from an ideal casewhere the learning always reaches the optimal state where the following holds true

QC = Q.

In that case, more noisy Q will correspond to less noisy C, because the product of Q and C isconstant.Theorom 1. Let us consider the following optimization problem for probability matrices Q, Q andC.

minimizeQ,C

tr(Q) subject to QC = Q, qii > qij , qii > qij for 8i, j 6= i.

Then, Q = Q and C = I is the only global minimum.

Trace has an inverse relation to the noise level of Q because it represents the correct labels. Then,this theorem shows that C will become identity if we manage to make Q the most noisy. Note thatwe presented this theorem to support our intuition. In practice, we cannot guarantee that minimiz-ing tr(Q) will make C identity because QC = Q is too an ideal assumption. However, we willempirically show that this actually works well in practice. In experiments, we used L2 cost on Qrather than the trace cost because it has a similar effect, but it is already implemented in most ofdeep learning packages.

3

3. compute

Then, the relation between those two isX

j

p(y = k|y⇤ = j, M)p(y⇤ = j|y = i) = p(y = k|y = j, M) =)X

j

ckjrji = cki for 8i, k,

where rij denotes p(y⇤ = i|y = j). If we rewrite this in a matrix form

C⇤R = C =) R = C⇤�1C,

when C⇤ is an invertible matrix. Finally, we can compute Q from R using Bayes’ rule.

In case when the number of clean samples is small relative to the number of elements in R, usingthe inverse of C⇤ is not good idea because the inverse operation is unstable in presence of noise.Instead we can put a sparsity prior on R and solve the following optimization problem

minR

1

2||C⇤R � C||2 + �|R|

under constraint that R should be a probability matrix. Here � is a hyper-parameter controlling theL1 sparsity of R. Such sparse prior over R and Q is useful because in real-world data it is unlikelythat a certain class being mislabeled to all other classes.

2.2 Learning noise distribution from noisy data

In practice, the noise distribution Q is often unknown to us and we might not have clear data toestimate it. If we only have noisy training data, then we have to learn Q approximate of Q from thenoisy data itself. Since the elements of Q correspond the weights of the noise layer in our model,it can be learned in the same way as other weights using back-propagation. In other words, we willmaximize the log likelihood with respect to both ✓ and Q at the same time using stochastic gradientdescent. However, weight matrix Q have to be projected back to the subspace of probability matricesafter each update.

A problem with this learning is that there is no guarantee Q would converge true Q. Actually, if thebase model is powerful enough, it can learn the noise distribution and Q can be an identity matrix.However, it is possible to prevent this problem by forcing Q to be more noisy, which in turn wouldmake the base model less noisy. The confusion matrix of the modified model can be written as

Cm = QCb,

where Cb is the confusion matrix of the base model. Training of model would bring Cm close to thenoise distribution Q of training data because the loss function is defined as cross-entropy betweenthem. Ideally, we want Cb to be identity and Q to be equal to true Q. Unfortunately, we cannotdirectly force Cb to be identity because we can only measure it with clean data. However, we canput any constraint on Q because it is a weight matrix in the model. We argue that making Q morenoisy would be make Cb more close to identity. The intuition for this comes from an ideal casewhere the learning always reaches the optimal state where the following holds true

QC = Q.

In that case, more noisy Q will correspond to less noisy C, because the product of Q and C isconstant.Theorom 1. Let us consider the following optimization problem for probability matrices Q, Q andC.

minimizeQ,C

tr(Q) subject to QC = Q, qii > qij , qii > qij for 8i, j 6= i.

Then, Q = Q and C = I is the only global minimum.

Trace has an inverse relation to the noise level of Q because it represents the correct labels. Then,this theorem shows that C will become identity if we manage to make Q the most noisy. Note thatwe presented this theorem to support our intuition. In practice, we cannot guarantee that minimiz-ing tr(Q) will make C identity because QC = Q is too an ideal assumption. However, we willempirically show that this actually works well in practice. In experiments, we used L2 cost on Qrather than the trace cost because it has a similar effect, but it is already implemented in most ofdeep learning packages.

3

4. train

noisy labels

clean labels

(a) (b)

Figure 2: (a) The training sequence when learning from noisy data alone. The noise matrix Q (red) is initially set to the identity while the base model (green) is trained, inadvertently learning the noise in the data. Then we start updating Q as well (with regularization), which captures the noise properties of the data and leaves the base model to make "clean" predictions. (b) Training when both noisy and clean data are available. 1: We train a model on the clean data. 2 & 3: Comparing the confusion matrices on clean/noisy data, we compute Q. 4: We train a new model with the noise layer Q fixed.

Noisy labels only: We first consider the case where only noisy labeled training data is available. We start by initializing the base model in the usual way (there are no special requirements). Initially, we fix Q = I during training until the validation error stops decreasing. At this point the base model may have started to learn the noise in the training data, which is not what we want. Therefore, we then update Q along with the rest of the network, using weight decay on Q to push it toward the true noise distribution Q*. As Q becomes noisier, it should absorb the noise from the base model, making the base model more accurate. However, too large a weight decay (or trace cost) would make Q noisier than the true Q*, which hurts performance, so the right amount of weight decay has to be found by cross-validation. We continue training until the validation error stops decreasing, to prevent overfitting. This procedure is illustrated in Figure 2(a). If we want to make predictions or test the model on clean data, the noise layer should be removed (or set to the identity I), but for noisy labeled validation data we use the learned Q.
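To make the procedure concrete, the following PyTorch-style sketch shows one way to implement the bottom-up model: a base classifier whose softmax output is multiplied by a column-stochastic noise matrix Q, together with a projection step that keeps Q a probability matrix after each update. This is our own minimal illustration, not the implementation used for the experiments; the class and function names, the choice of Euclidean simplex projection, and all defaults are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


def project_to_simplex(v):
    """Euclidean projection of a 1-D tensor onto the probability simplex (one standard choice)."""
    u, _ = torch.sort(v, descending=True)
    css = torch.cumsum(u, dim=0)
    k = torch.arange(1, v.numel() + 1, dtype=v.dtype, device=v.device)
    rho = int((u - (css - 1.0) / k > 0).nonzero().max())
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return torch.clamp(v - theta, min=0.0)


class NoisyLabelModel(nn.Module):
    """Base classifier followed by a linear noise layer Q (a sketch, not the paper's code)."""

    def __init__(self, base_model, num_classes):
        super().__init__()
        self.base = base_model  # any network producing class logits
        # Q[j, i] models p(noisy label = j | true label = i); start at the identity.
        self.Q = nn.Parameter(torch.eye(num_classes), requires_grad=False)

    def forward(self, x):
        p_true = F.softmax(self.base(x), dim=1)   # base model's posterior over true labels
        return p_true @ self.Q.t()                # posterior over noisy labels

    def project_Q(self):
        """Project each column of Q back onto the probability simplex after a gradient step."""
        with torch.no_grad():
            for i in range(self.Q.shape[1]):
                self.Q[:, i] = project_to_simplex(self.Q[:, i])


def noisy_nll(p_noisy, noisy_labels):
    """Negative log-likelihood of the noisy labels under the adapted output."""
    return F.nll_loss(torch.log(p_noisy + 1e-12), noisy_labels)

In this reading, training first keeps Q frozen at the identity until the validation error plateaus, then enables gradients on Q, applies weight decay to Q only, and calls project_Q() after every optimizer step; at test time the noise layer is dropped and the base model's softmax output is used directly.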


Noisy and clean data: If some clean training data is available, we can use it to estimate the noise distribution Q of the noisy data, using the method described in Section 3.2. We first train a model (which can be a normal deep network) on a subset of the clean data. Then we measure confusion matrices on both the clean data (excluding the subset used for training, to avoid bias) and the noisy data. From the difference between these two matrices, we compute an estimate of the noise distribution Q using Eqns. 3 & 4. This estimate can then be used to train a new model on both the clean and the noisy dataset. Figure 2(b) shows the training scheme. A variant of this procedure involves one further stage in which the initial estimate of Q is refined using the procedure of Section 3.3.
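As a concrete illustration of this estimation step, the NumPy sketch below computes R from the two confusion matrices, either by direct inversion of the clean confusion matrix or with an L1-regularized least-squares fit (useful when the clean set is small), and then converts R into Q with Bayes' rule using the empirical noisy-label frequencies. The function names, the crude projected (sub)gradient solver, and its step size are our own choices, not the paper's; the index convention assumed here is r_ij ~ p(true = i | noisy = j) and q_ji = p(noisy = j | true = i).

import numpy as np


def normalize_columns(M):
    """Clip to non-negative values and renormalize each column to sum to one (a crude projection)."""
    M = np.clip(M, 0.0, None)
    s = M.sum(axis=0, keepdims=True)
    s[s == 0] = 1.0
    return M / s


def estimate_R(C_clean, C_noisy, lam=0.0, n_iters=500, lr=0.1):
    """Estimate R with r_ij ~ p(true = i | noisy = j) from two confusion matrices.

    C_clean: confusion matrix of an initial model on held-out clean data.
    C_noisy: confusion matrix of the same model on the noisy data.
    With lam == 0 this is the direct solution R = inv(C_clean) @ C_noisy; with lam > 0 it runs
    a small projected (sub)gradient loop for 0.5 * ||C_clean R - C_noisy||^2 + lam * |R|_1.
    """
    if lam == 0.0:
        return normalize_columns(np.linalg.inv(C_clean) @ C_noisy)
    K = C_clean.shape[0]
    R = np.eye(K)
    for _ in range(n_iters):
        grad = C_clean.T @ (C_clean @ R - C_noisy) + lam * np.sign(R)
        R = normalize_columns(R - lr * grad)
    return R


def R_to_Q(R, noisy_label_freq):
    """Bayes' rule: q_ji = p(noisy = j | true = i) is proportional to r_ij * p(noisy = j)."""
    Q = R.T * noisy_label_freq[:, None]   # entry (j, i) = r_ij * p(noisy = j)
    return Q / Q.sum(axis=0, keepdims=True)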

3.5 Top-down Noise Model



Figure 3: In the top-down noise model, noisy labels are converted by matrix S before being used by the base model for training.

Instead of modifying the model to match the noisy labels, we can change the noisy labels so that training produces an unbiased classifier. When the noisy label is i, we replace it with the vector label s_i (see Figure 3). Let S be the conversion matrix whose columns are the vectors s_i. The learning objective is still to maximize the log likelihood, but now the noisy labels are converted by the matrix S.

L(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{K} s_{i \tilde{y}_n} \log p(y = i \mid x_n, \theta), \qquad (6)

where K is the number of classes. Note that the only difference between Eqn. 6 and Eqn. 2 is whether the sum over classes is inside or outside the log operator. Unfortunately, we cannot optimize this objective with respect to S, because there is a degenerate solution in which S converts all labels to one class and the model always predicts that class. Therefore, we cannot learn S directly from the noisy data.

An unbiased estimator for binary classification is introduced in [12], where the cost function is replaced by a surrogate objective that combines the two costs (for classes +1 and -1) with coefficients that depend on the noise level. We can view this as replacing noisy labels with different label vectors. If we generalize this surrogate to multiple classes, we can see that S should be the inverse of the noise distribution Q (or at least QS should equal the identity plus a constant). However, the inverse operation is unstable with respect to noise, and it usually contains negative elements when Q is a probability matrix, which can make Eqn. 6 diverge to infinity.

As there is no practical way to learn S reliably, in our experiments we fix it to α · I + (1 − α)/K · 1, where 1 is a K × K matrix of ones. The hyperparameter α is selected by cross-validation.
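For illustration, here is a minimal sketch of this fixed conversion and of the objective in Eqn. 6, written for PyTorch tensors; the names and structure are ours, not the original code.

import torch
import torch.nn.functional as F


def make_S(num_classes, alpha):
    """S = alpha * I + (1 - alpha) / K * 1, the fixed top-down conversion matrix."""
    K = num_classes
    return alpha * torch.eye(K) + (1.0 - alpha) / K * torch.ones(K, K)


def top_down_loss(logits, noisy_labels, S):
    """Negative of Eqn. 6: each noisy label is replaced by the corresponding column of S as a soft target."""
    log_p = F.log_softmax(logits, dim=1)     # (N, K) log p(y = i | x_n)
    soft_targets = S[:, noisy_labels].t()    # (N, K): row n holds s_{i, y_n} over i
    return -(soft_targets * log_p).sum(dim=1).mean()

With α = 1 this reduces to ordinary training on the noisy labels, while smaller α spreads each noisy label's mass more uniformly over the classes; as stated above, α is chosen by cross-validation.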

3.6 Reweighting of Noisy Data

In some experiments in this paper, we have clean data in addition to noisy data. If we just mix them, we lose the valuable information about which labels are trustworthy. A simple way to express the relative confidence between the two sets of data is to down-weight the noisy examples, relative to the clean ones, in the loss function, as shown in Eqn. 7. This trick can be combined with both the bottom-up and top-down noise models described above. The objective is to maximize

L(\theta) = \frac{1}{N_c + N_n}\left(\sum_{n=1}^{N_c} \log p(y = y_n \mid x_n, \theta) + \gamma \sum_{n=1}^{N_n} \log p(y = \tilde{y}_n \mid x_n, \theta)\right) \qquad (7)

where N_c and N_n are the number of clean and noisy samples respectively. The hyper-parameter γ is the weight on the noisy labels and is set by cross-validation.
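A minimal sketch of the reweighted objective in Eqn. 7, assuming one mini-batch of clean examples and one of noisy examples are available at each step (the batching scheme and names are our own):

import torch
import torch.nn.functional as F


def reweighted_loss(model, clean_x, clean_y, noisy_x, noisy_y, gamma):
    """Negative of Eqn. 7: clean log-likelihood plus gamma-weighted noisy log-likelihood."""
    loss_clean = F.cross_entropy(model(clean_x), clean_y, reduction='sum')
    loss_noisy = F.cross_entropy(model(noisy_x), noisy_y, reduction='sum')
    n_total = clean_x.shape[0] + noisy_x.shape[0]
    return (loss_clean + gamma * loss_noisy) / n_total

Here model(x) is assumed to return logits from a plain network; when the trick is combined with the bottom-up model, the noisy batch would instead be fed through the noise layer, as described in Section 4.2.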

4 Experiments

In this section, we empirically examine the robustness of deep networks with and without noise modeling. We experiment on several different image classification datasets with label noise.


As the base model, we use convolutional deep networks because they produce state-of-the-art performance on many image classification tasks.

This section consists of two parts. In the first part, we perform controlled experiments by deliberately adding label noise to clean datasets. This is done by randomly changing some labels in the training data according to a known noise distribution, which allows us to check whether the learned noise distribution is close to the ground-truth one. First, we experiment on the Google street-view house numbers dataset (SVHN) [14], which consists of 32x32 images of house-number digits captured from Google Street View. It has about 600k images for training and 26k images for testing. Next, we experiment on CIFAR-10 [7], a more challenging dataset consisting of 60k small images of 10 object categories. Both datasets are hand-labeled, so all labels are clean.

In the second part, we show more realistic experiments using two datasets with inherent label noise. The first consists of clean images from the CIFAR-10 dataset and noisy images from the Tiny Images dataset [18]. The second consists of clean images from ImageNet [4] and noisy images downloaded from web search engines. In both datasets, the true noise distribution of the noisy labels is unknown.

4.1 Deliberate Label Noise

We synthesize noisy data from clean data by deliberately changing some of the labels: an original label i is randomly changed to j with a fixed probability q_ji. An example of a noise distribution Q = {q_ji} that we use is shown in Figure 4(c). By changing the probability on the diagonal, we can generate datasets with different noise levels. The labels of the test images are left unperturbed.
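The corruption step itself is just sampling from the columns of Q. The following NumPy snippet is our own helper for generating such labels, shown with an illustrative symmetric Q rather than the structured Q of Figure 4(c):

import numpy as np


def corrupt_labels(labels, Q, rng=None):
    """Replace each true label i by a sample from column i of Q, where q_ji = p(noisy = j | true = i)."""
    rng = np.random.default_rng() if rng is None else rng
    K = Q.shape[0]
    return np.array([rng.choice(K, p=Q[:, i]) for i in labels])


# Illustrative Q for 10 classes: keep the label with probability 0.5, otherwise flip it
# uniformly to one of the other classes (each column sums to one).
K = 10
Q = np.full((K, K), 0.5 / (K - 1))
np.fill_diagonal(Q, 0.5)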

We use publicly available fast GPU code¹ for training deep networks. As the base model, we use its "18% model" with three convolutional layers (layers-18pct.cfg) for both the SVHN and CIFAR-10 experiments. No data augmentation is done. The only data preprocessing is contrast normalization for the SVHN images.


Figure 4: (a) Test errors on the SVHN dataset when the noise level is 50%. (b) Test errors when trained on 100k samples. (c) The true noise distribution Q for 50% noise, compared with the learned and estimated Q.

SVHN noisy only: When training a bottom-up model on SVHN data, we fix Q to the identity for the first five epochs. Then Q is updated with weight decay 0.05 for 100 epochs. Figure 4(a) and (b) show the test errors for different training data sizes and different noise levels. Compared with a normal deep network, the bottom-up model always achieves better accuracy. However, the top-down model does not work well for any value of α (even using the true Q does not improve it).
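This two-phase schedule can be written down roughly as follows, reusing the NoisyLabelModel, noisy_nll, and project_Q helpers sketched in the training-procedures section above. Only the epoch counts and the weight decay on Q come from the text; the optimizer choice and learning rates are our own placeholders.

import torch


def train_bottom_up_svhn(model, train_loader, epochs_frozen=5, epochs_noise=100, q_decay=0.05):
    """Two-phase schedule: Q frozen at the identity, then Q updated with weight decay (a sketch)."""
    base_opt = torch.optim.SGD(model.base.parameters(), lr=0.01, momentum=0.9)
    q_opt = torch.optim.SGD([model.Q], lr=0.01, weight_decay=q_decay)

    for epoch in range(epochs_frozen + epochs_noise):
        if epoch == epochs_frozen:
            model.Q.requires_grad_(True)       # start learning the noise distribution
        for x, y_noisy in train_loader:
            base_opt.zero_grad()
            q_opt.zero_grad()
            loss = noisy_nll(model(x), y_noisy)
            loss.backward()
            base_opt.step()
            if model.Q.requires_grad:
                q_opt.step()
                model.project_Q()              # keep Q a probability matrix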

In Figure 4(a) and (b), we also plot error rates for a bottom-up model trained using the true noise distribution Q. We see that it performs as well as the learned Q, showing that our proposed method for learning the noise distribution from data is effective. Figure 4(c) shows one such learned Q alongside the ground-truth Q used to generate the noisy data. We can see that the difference between them is negligible.

Figure 5(a) shows the effect of label noise on performance in more detail for the SVHN data. For a normal deep network, the performance drops quickly as the number of incorrect labels increases. In contrast, the bottom-up model shows more robustness against incorrect labels.

¹ https://code.google.com/p/cuda-convnet/



Figure 5: The effect of incorrect labels on the test error (%) for an unmodified network ("normal") and one with our bottom-up noise layer appended. (a) SVHN; (b) CIFAR-10.

For example, with the normal model, training on 50k correct + 40k incorrect labels gives the same test error as training on 10k correct labels. However, the bottom-up model achieves the same performance using 30k correct and 50k incorrect labels.

CIFAR-10 noisy only: We perform the same experiments as for SVHN on the CIFAR-10 dataset, varying the training data size and the noise level. We fix Q to the identity for the first 50 epochs of training and then run for another 70 epochs updating Q with a weight decay of 0.05 or 0.1 (selected on a validation set). The results are shown in Figure 5(b). Again, the bottom-up model is more robust to label noise, compared to the unmodified model. The difference is especially large for high noise levels and large training sets, which shows the scalability of the bottom-up model.

CIFAR-10 clean + noisy: Next, we consider the scenario where both noisy and clean data are available. We are now able to use the approach from Section 3.2 to estimate the noise distribution Q from clean/noisy data, instead of learning it. We prepare two subsets of data: 20k clean images for estimating Q, and 30k noisy images for training the final model. First, we train a normal network on 10k of the clean data, which gives 30% test error. Then, we measure two confusion matrices, one on the other 10k clean images and the other on the 30k noisy images. From these we compute an estimated Q, which is used to train a new bottom-up model using only noisy data. An estimated Q for 50% noise is shown in Figure 4(c), where it is very close to the ground truth. Table 1 compares the classification performance using the learned and estimated Q for two different noise levels within the 30k noisy set. The estimated Q is as effective as the true Q, even at high noise levels.

Model                     normal   true Q   learned Q   estimated Q
Test error (50% noise)    38%      28%      30%         29%
Test error (70% noise)    60%      35%      40%         35%

Table 1: Test errors when trained on 30k images from CIFAR-10 with different noise levels.
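The confusion matrices used in this procedure are simply normalized counts of (prediction, given label) pairs, with columns indexed by the given label so that C[k, j] estimates p(model predicts k | label j), matching the convention assumed in the estimate_R sketch earlier. The helper below is our own, not the paper's code.

import numpy as np


def confusion_matrix(predictions, labels, num_classes):
    """C[k, j] estimates p(model predicts k | label is j) from counts."""
    C = np.zeros((num_classes, num_classes))
    for p, j in zip(predictions, labels):
        C[p, j] += 1.0
    s = C.sum(axis=0, keepdims=True)
    s[s == 0] = 1.0
    return C / s


# C_clean would be measured on the held-out 10k clean images and C_noisy on the 30k noisy
# images; estimate_R and R_to_Q from the earlier sketch then give the estimated Q that is
# fixed in the noise layer of the new model.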

4.2 CIFAR-10 + Tiny Images

CIFAR-10 was originally created by cleaning up a subset of Tiny Images, a dataset of 80M loosely labeled 32x32 images (more than half of the labels are estimated to be incorrect). In the process of hand-picking the CIFAR-10 images, some images were excluded because they did not show the object clearly or were falsely labeled. Our training data consists of two subsets: 50k clean labeled images from the CIFAR-10 training data and 150k noisy labeled images from the excluded set of Tiny Images. As shown in Figure 6, those extra training images have very noisy labels and often contain objects outside the 10 categories.

Figure 6: Sample images from the extra training data (labeled airplane, cat, and horse).

We use a model architecture similar to Krizhevsky's "18% model", but with more feature maps in the three layers (32, 32, 64 → 64, 128, 128) to accommodate the larger training set. No data augmentation is used.


During training, we alternate mini-batches from the two data sources (when the current mini-batch is from the clean data, the noise modeling is removed from the network). The noise matrix is fixed to the identity during the first 50 epochs of training the bottom-up model; after that it is updated simultaneously with no weight decay. In the top-down model, however, the noise matrix is always fixed to α = 0.5. The network is evaluated on the original CIFAR-10 test set (which has clean labels).
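A rough sketch of this alternating scheme, assuming the NoisyLabelModel sketch from earlier (so that calling model.base directly bypasses the noise layer); the loop structure and names are ours:

import torch
import torch.nn.functional as F


def mixed_training_step(model, clean_batch, noisy_batch, optimizer):
    """One update on a clean mini-batch (noise layer bypassed) and one on a noisy mini-batch."""
    for (x, y), is_noisy in ((clean_batch, False), (noisy_batch, True)):
        optimizer.zero_grad()
        if is_noisy:
            loss = F.nll_loss(torch.log(model(x) + 1e-12), y)   # predictions go through Q
        else:
            loss = F.cross_entropy(model.base(x), y)            # clean data bypasses Q
        loss.backward()
        optimizer.step()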

Table 2 shows test errors for several noise modeling approaches. In most cases, weighting the noisy labels with γ = 0.2 helps performance, as shown in Figure 7 and row 4 of Table 2. The bottom-up model's performance is mediocre, most likely due to the large fraction of outside images (i.e., showing objects not in the 10 categories) present in the Tiny Images, which violate the noise model. By contrast, the top-down model imposes a more uniform label distribution on these outside images, and so works well in practice. We test this hypothesis by training with a new 150k image set, randomly drawn from the entire Tiny Images dataset (not just the excluded set) and given uniform labels (equivalent to top-down with α = 0.1); these also give good test performance. Although we could treat outside images as a separate class, assigning them uniform labels works better in practice. This can be considered a novel way of regularizing deep networks with random images, which are cheap to obtain. However, it is important that the random images are not constrained to the training categories.


Figure 7: Test error dependency on noise weight γ.

Model       Extra data     Noisy weight γ   Test error
Conv. net   -              -                16.1%
Conv. net   150k noisy     1                15.4%
Conv. net   150k noisy     0.2              13.2%
Bottom-up   150k noisy     0.2              13.2%
Top-down    150k noisy     0.4              12.5%
Conv. net   150k random    0.2              13.8%

Table 2: Test error on CIFAR-10 + Tiny Images.

4.3 ImageNet + Web Image Search

We perform a large-scale experiment using the ImageNet 2012 dataset, which has 1.3M images with clean labels over 1000 classes. We obtain a further noisy set of 1.4M images, scraped from Internet image search engines using the 1k ImageNet keywords. Overlapping images between the two sets were removed from the noisy set using cross-correlation. We trained the model of Krizhevsky et al. [8] on the clean and combined datasets, with several types of noise modeling. Table 3 shows that training with the combined dataset does not by itself improve performance compared to training on the clean dataset. But simply weighting the noisy labels (using γ = 0.1) gives an improvement of 1.4%. This can be improved slightly with the bottom-up noise model. Notably, the absolute performance achieved with our techniques and the 1.4M additional noisy images equals the performance when training on an additional 15M clean images from ImageNet 2011 (row 2 of Table 3). This demonstrates that noisy data can be very beneficial for training. Note that no model averaging is done in these experiments.

Model                   Extra data                        Noisy weight γ   Top-5 val. error
Krizhevsky et al. [8]   -                                 -                18.2%
Krizhevsky et al. [8]   15M full ImageNet                 -                16.6%
Conv. net               -                                 -                18.0%
Conv. net               1.4M noisy images from Internet   1                18.1%
Conv. net               1.4M noisy images from Internet   0.1              16.7%
Bottom-up (learned)     1.4M noisy images from Internet   0.1              16.5%
Bottom-up (estimated)   1.4M noisy images from Internet   0.2              16.6%

Table 3: Error rates of different models on the validation images of ImageNet.

5 Conclusion

In this paper, we proposed two models for learning from noisy labeled data with deep networks, both of which can be implemented with minimal effort in existing deep learning implementations. Our experiments show that the model can reliably learn the noise distribution from data, under reasonable conditions.


In addition, we proposed a simple technique to estimate the noise distribution using clean data. We also found that, if noisy and clean training sets exist, simply down-weighting the noisy set can help significantly. Both of these techniques were demonstrated in large-scale experiments, showing significant gains over training on clean data alone. Another surprising finding was that random images can be used to regularize deep networks and improve performance significantly.

References

[1] R. Barandela and E. Gasca. Decontamination of training samples for supervised pattern recognition methods. In Advances in Pattern Recognition, volume 1876 of Lecture Notes in Computer Science, pages 621–630. Springer, 2000.

[2] J. Bootkrajang and A. Kabán. Label-noise robust logistic regression and its applications. In Machine Learning and Knowledge Discovery in Databases, volume 7523 of Lecture Notes in Computer Science, pages 143–158. Springer, 2012.

[3] C. E. Brodley and M. A. Friedl. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11:131–167, 1999.

[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255, June 2009.

[5] B. Frénay and M. Verleysen. Classification in the presence of label noise: A survey. Neural Networks and Learning Systems, IEEE Transactions on, 25(5):845–869, May 2014.

[6] I. Guyon, N. Matic, and V. Vapnik. Discovering informative patterns and data cleaning. In Advances in Knowledge Discovery and Data Mining, pages 181–203. 1996.

[7] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep., 2009.

[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. 2012.

[9] J. Larsen, L. Nonboe, M. Hintz-Madsen, and L. Hansen. Design of robust neural network classifiers. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, volume 2, pages 1205–1208, May 1998.

[10] N. D. Lawrence and B. Schölkopf. Estimating a kernel Fisher discriminant in the presence of label noise. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 306–313, 2001.

[11] V. Mnih and G. Hinton. Learning to label aerial images from noisy data. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 567–574, 2012.

[12] N. Natarajan, I. Dhillon, P. Ravikumar, and A. Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems 26, pages 1196–1204. 2013.

[13] D. Nettleton, A. Orriols-Puig, and A. Fornells. A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial Intelligence Review, 33(4):275–306, 2010.

[14] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[15] M. Pechenizkiy, A. Tsymbal, S. Puuronen, and O. Pechenizkiy. Class noise and supervised learning in medical domains: The effect of feature extraction. In Computer-Based Medical Systems, 2006. CBMS 2006. 19th IEEE International Symposium on, pages 708–713, 2006.

[16] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations (ICLR 2014), April 2014.

[17] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. Conference on Computer Vision and Pattern Recognition (CVPR), 2014.


[18] A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: A large data set for non-parametric object and scene recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(11):1958–1970, Nov 2008.
