Measuring Catastrophic Forgetting In Neural Networks
Ronald Kemker1, Marc McClure1, Angelina Abitino2, Tyler Hayes1, Christopher Kanan1
1. Rochester Institute of Technology, Rochester NY
{rmk6217, mcm5756, tlh6792, kanan}@rit.edu
2. Swarthmore College, Swarthmore, PA
Mechanisms for Mitigating Catastrophic Forgetting
𝐿(𝜃) = 𝐿𝑡(𝜃) + Σ𝑖 (𝜆/2) 𝐹𝑖 (𝜃𝑖 − 𝜃∗𝐴,𝑖)²
#1 Regularization
Model adds constraints to the weight updates to protect previously learned
knowledge. Google DeepMind’s Elastic Weight Consolidation (EWC) model
uses a Fisher Information Matrix 𝐹𝑖 to re-direct plasticity toward the weights
that are least important to retaining old information [1].
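The EWC loss above can be sketched in a few lines of NumPy. This is an illustrative sketch only; the function and variable names are ours, and a real implementation would compute the Fisher values from gradients of the old task's log-likelihood.

```python
import numpy as np

def ewc_loss(task_loss, theta, theta_star, fisher, lam=1.0):
    """Total EWC loss: the new task's loss plus a quadratic penalty that
    pulls each weight theta_i back toward its old-task value theta*_{A,i},
    scaled by the Fisher information F_i (that weight's importance to the
    old task) and the hyperparameter lambda."""
    penalty = (lam / 2.0) * np.sum(fisher * (theta - theta_star) ** 2)
    return task_loss + penalty

# Toy example: weights with high Fisher values are penalized more for drifting.
theta      = np.array([1.0, 2.0, 3.0])   # current weights
theta_star = np.array([0.0, 0.0, 0.0])   # weights after old task A
fisher     = np.array([0.0, 1.0, 10.0])  # per-weight importance (assumed given)
print(ewc_loss(0.5, theta, theta_star, fisher, lam=2.0))
```

Note how the weight with 𝐹𝑖 = 0 drifts for free while the weight with 𝐹𝑖 = 10 dominates the penalty, which is exactly how EWC redirects plasticity toward unimportant weights.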
#2 Ensembling
Model implicitly or explicitly trains multiple classifiers and combines them to
make a prediction. Google’s PathNet model uses a genetic algorithm to find the
optimal path through a large DCNN and then locks that path to preserve it [2].
#3 Rehearsal
Model revisits previous training examples to prevent forgetting of previously
trained knowledge. GeppNet stores all previous training examples and then
replays them during incremental learning stages [3].
#4 Dual-Memory
Model has separate processing centers for the fast acquisition of new
information and long-term storage of pre-trained knowledge. GeppNet+STM
uses a short-term memory buffer to store and recall previous training examples,
and then consolidates these samples during sleep phases.
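The dual-memory flow can be sketched as a pair of stores with a consolidation step; this is a heavily simplified illustration (class and method names are ours, and no actual training happens during "sleep"):

```python
class DualMemory:
    """Sketch of a GeppNet+STM-style dual memory: new examples land in a
    small short-term buffer for fast acquisition, then transfer to
    long-term storage during a 'sleep' (consolidation) phase."""
    def __init__(self, stm_capacity):
        self.stm_capacity = stm_capacity
        self.stm = []   # short-term memory buffer
        self.ltm = []   # long-term storage

    def observe(self, example):
        self.stm.append(example)
        if len(self.stm) >= self.stm_capacity:
            self.sleep()

    def sleep(self):
        # Consolidation: replay STM contents into long-term memory,
        # then clear the buffer for the next batch of new examples.
        self.ltm.extend(self.stm)
        self.stm.clear()

mem = DualMemory(stm_capacity=2)
for ex in [("x1", 0), ("x2", 0), ("x3", 1)]:
    mem.observe(ex)
# The buffer flushed once when it hit capacity; x3 is still short-term.
```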
#5 Sparse-Coding
Model makes sparse updates to the network to prevent the disruption of pre-
trained knowledge. The fixed expansion layer (FEL) model uses a large hidden
layer that is sparsely populated with excitatory and inhibitory weights [4].
Incremental Learning Paradigms
1. Data Permutation Experiment
This experiment measures how well a model can
incrementally learn datasets with similar feature
representations. We randomly permute the pixel
locations of each image.
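The pixel-permutation step can be sketched with NumPy; the function name and seed handling are our assumptions, with each seed producing one fixed permutation shared by every image in that "dataset":

```python
import numpy as np

def permute_pixels(images, seed):
    """Apply one fixed random pixel permutation to every image, producing
    a dataset with identical per-pixel statistics but scrambled spatial
    structure (the permuted-image protocol)."""
    flat = images.reshape(len(images), -1)      # (N, H*W)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(flat.shape[1])       # same permutation for all images
    return flat[:, perm].reshape(images.shape)

# Tiny stand-in for an image batch (2 images of 4x4 pixels):
imgs = np.arange(2 * 4 * 4).reshape(2, 4, 4).astype(float)
task2 = permute_pixels(imgs, seed=0)
# Same pixel values in each image, different locations.
```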
2. Incremental Class Learning Experiment
First, we train the model on some base knowledge (half of the classes). Then,
we train the remaining classes one-by-one.
3. Multi-Modal Experiment
We measure how well a model can incrementally learn datasets with dissimilar
feature representations. First, we train the model on image classification, and
then we train that model on audio classification (and vice versa).
Catastrophic Forgetting
Neural networks are incapable of learning new information without disturbing
the weights important for retaining existing memories, a phenomenon known as
catastrophic forgetting. Although many mitigation techniques have been
proposed, the only reliable way to prevent it is to combine the old and new
data and retrain the model from scratch. State-of-the-art frameworks can
take weeks or months to train, so retraining is extremely inefficient.
Motivation
Researchers have proposed many strategies for mitigating catastrophic
forgetting, but these methods have all failed to scale up to real-world
problems.
Kirkpatrick et al. (2017) claimed to solve catastrophic forgetting, but they
only evaluated their framework on a toy dataset with only a few object classes
(i.e., MNIST).
We scaled up some of these mechanisms to large scale image and audio
classification datasets with 100-200 object classes. We evaluated their
performance on three different incremental learning paradigms using new
metrics that we established.
Datasets
Metrics
We established three metrics designed to measure
how well a model retains existing memories (Ω𝑏𝑎𝑠𝑒),
assimilates new data (Ω𝑛𝑒𝑤), and performs both
tasks at once (Ω𝑎𝑙𝑙). We track mean-class test
accuracy for the base knowledge (𝛼𝑏𝑎𝑠𝑒), the most
recently learned class (𝛼𝑛𝑒𝑤), and all classes seen
to that point (𝛼𝑎𝑙𝑙). We normalize the results by the
accuracy obtained by training the model offline
(𝛼𝑖𝑑𝑒𝑎𝑙) so that we can have a fair comparison
between datasets.
Ω𝑏𝑎𝑠𝑒 = (1/(𝑇 − 1)) Σ𝑖=2…𝑇 (𝛼𝑏𝑎𝑠𝑒,𝑖 / 𝛼𝑖𝑑𝑒𝑎𝑙)
Ω𝑛𝑒𝑤 = (1/(𝑇 − 1)) Σ𝑖=2…𝑇 𝛼𝑛𝑒𝑤,𝑖
Ω𝑎𝑙𝑙 = (1/(𝑇 − 1)) Σ𝑖=2…𝑇 (𝛼𝑎𝑙𝑙,𝑖 / 𝛼𝑖𝑑𝑒𝑎𝑙)
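The three metrics can be computed directly from the per-session accuracies; a plain-Python sketch (the function name and the example accuracy values are ours, chosen for illustration):

```python
def omega_metrics(alpha_base, alpha_new, alpha_all, alpha_ideal):
    """Compute the three Omega metrics from mean-class test accuracies.
    Each list is indexed by training session i = 1..T; the first session
    (base-knowledge training) is excluded, sessions 2..T are averaged,
    and the base/all terms are normalized by the offline accuracy."""
    T = len(alpha_base)
    omega_base = sum(alpha_base[i] / alpha_ideal for i in range(1, T)) / (T - 1)
    omega_new  = sum(alpha_new[i] for i in range(1, T)) / (T - 1)
    omega_all  = sum(alpha_all[i] / alpha_ideal for i in range(1, T)) / (T - 1)
    return omega_base, omega_new, omega_all

# Hypothetical accuracies for T = 3 training sessions:
ob, on, oa = omega_metrics(alpha_base=[0.9, 0.8, 0.6],
                           alpha_new=[0.0, 0.9, 0.7],
                           alpha_all=[0.9, 0.8, 0.5],
                           alpha_ideal=0.9)
```

Because Ω𝑏𝑎𝑠𝑒 and Ω𝑎𝑙𝑙 are normalized by 𝛼𝑖𝑑𝑒𝑎𝑙, a value near 1 means the model performs about as well as the same model trained offline on all data.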
Discussion/Conclusion
• No mechanism works well on
every paradigm and data type.
• The regularization and ensembling
mechanisms work well for
incrementally learning datasets with
similar feature representations.
• Incremental class learning models
benefit from the rehearsal and
dual-memory mechanisms; however,
storing past training examples is
memory inefficient.
• Models that employ sparsity as
their mitigation strategy are too
memory inefficient to deploy in
real-world scenarios.
Mean of 𝛀𝒂𝒍𝒍 across datasets
Summary of Experimental Results
Acknowledgements
Angelina Abitino was supported by NSF Research Experiences for
Undergraduates (REU) award #1359361 to Roger Dube. We also thank
NVIDIA for the generous donation of a Titan X GPU.
References
1. Kirkpatrick, James, et al. "Overcoming catastrophic forgetting in neural networks." Proceedings of the National Academy of Sciences (2017): 201611835.
2. Fernando, Chrisantha, et al. "PathNet: Evolution channels gradient descent in super neural networks." arXiv preprint arXiv:1701.08734 (2017).
3. Gepperth, Alexander, and Cem Karaoguz. "A bio-inspired incremental learning architecture for applied perceptual problems." Cognitive Computation 8.5 (2016): 924-934.
4. Coop, Robert, Aaron Mishtal, and Itamar Arel. "Ensemble learning in fixed expansion layer networks for mitigating catastrophic forgetting." IEEE Transactions on Neural Networks and Learning Systems 24.10 (2013): 1623-1634.
Fig 1. Permuted MNIST Image
Experimental Results
Fig 2. Mean-class test accuracy of incremental class learning experiment