Analyzing Representations inside Convolutional Neural Networks

Uday Singh Saini
University of California Riverside

[email protected]

Evangelos E. Papalexakis
University of California Riverside

[email protected]

Abstract

How can we discover and succinctly summarize the concepts that a neural network has learned? Such a task is of great importance in applications of networks to inference problems that involve classification, like medical diagnosis based on fMRI or x-ray images. In this work, we propose a framework that categorizes the concepts a network learns based on the way it clusters a set of input examples, clusters neurons based on the examples they activate for, and clusters input features, all in the same latent space. This framework is unsupervised and can work without any labels for input features; it only needs access to the internal activations of the network for each input example, thereby making it widely applicable. We extensively evaluate the proposed method and demonstrate that it produces human-understandable and coherent concepts that a ResNet-18 has learned on the CIFAR-100 dataset.

1 Introduction

With the advent of deep neural network architectures as the prominent machine learning paradigm [13] for human-centric applications, a common issue that has plagued their adoption is the lack of interpretability of these models. As the spectrum of domains where deep learning replaces traditional methods expands, and deep learning methods percolate to areas of immediate applicability to daily life, like self-driving cars [3], understanding what networks do takes on a more central role than chasing performance gains. Future challenges that machine learning engineers face are not limited to improving model accuracy, but also include debugging [24] and training networks so that they conform to ever-evolving regulations concerning ethics [17] and privacy [18].

Most literature in the area of explainable AI focuses on providing explanations for pre-trained networks [20], [5], while some methods instead design models that have explainability as part of their design philosophy [1]. Our work belongs to the former category and focuses on providing explanations for already trained models, or what is colloquially called post-hoc explanation. Within the strata of post-hoc explanations there exist multiple evolutionary branches: some focus on interpreting the features [7], [27] interprets the network by breaking down an input prediction into semantically interpretable components, and works like [26] focus on interpreting neurons based on their behaviour when they activate for entities like different textures, colours and images. We focus on unsupervised discovery of concepts learned by the network by clustering the neurons, the input features and the inputs themselves in the same latent space. The motivation for doing so comes from works like [4], where it has been conjectured that natural images usually lie on a manifold and that a neural network embeds this manifold as a subspace in its feature space. The work most similar in spirit to ours is ACE [5], where the goal is to explain the prediction of neural networks not in terms of individual neurons but by learning the concepts utilized by the network to which a successful prediction is most sensitive, and the learning of such concepts is a supervised process. Unlike our work, ACE [5] utilizes existing algorithms or manual annotation to curate a set of concepts, feeds it to the network and measures the sensitivity of the network to those concepts using TCAV [9]. This solution, though elegant, relies heavily on domain-expert annotators or supervised tools, while we learn these concepts from the activations of the network and try to determine the concepts learned by the network by probing it with input examples. Another line of work, [9], focuses on learning vectors which, when measured for their effect on class prediction, align with high-sensitivity directions in the latent space of the network. We also utilize [9] as a means to validate our approach in Section 5.

Our approach aims to find a latent representation for neurons, input features and examples in a common subspace, where clustering them elicits meaningful insights about the network's ability to discern between examples. Using such a tri-factor clustering, we can analyze intersections between groups of neurons which fire for different classes, focus on which input features provide a basic structure upon which the model correctly classifies its inputs, and analyze an individual example based on its similarities and differences to other examples, as determined by the network's embeddings of them. We model our problem as a coupled matrix factorization, where the model is subject to appropriate constraints such as non-negativity, which aids interpretability [14], and admits regularizations like group sparsity, orthogonality etc. to encode meaningful priors into the model. We conduct our analysis by observing the behaviour of a network on a set of images it has not previously seen; for the purpose of this study, we experiment with CIFAR-10 and CIFAR-100 as our datasets of choice. Our raison d'être is to approach the problem of concept discovery in an unsupervised manner, in order to bridge a gap left unfulfilled by [5] and [26], and in doing so develop a methodology which can seed or supplement other interpretability methods.

2 Related Work

In our work we aim to interpret a learned model using a set of images which may or may not have been part of the set of training classes of the network. Our work stands in stark contrast with most existing literature, since the goal in this work is not to evaluate the network on a feature-by-feature or sample-by-sample basis as in [25], [11], [23], [21]. Additionally, there are other works, such as [25], which visualize a network based on images that maximize the activation of hidden units, or works like [16] which use back-propagation to generate salient features of an image. Works like [1] focus on explaining a network by proposing a new framework where the network is forced to learn concepts and demonstrate their relevance towards a prediction; this framework relies on prior constraining and encoding of what is thought to be a concept. In [27] the focus is on explaining each prediction made by the network by decomposing the activations of a layer in the network into a basis of pre-defined concepts, where each explanation is a weighted sum of these concepts and the weights determine the impact each concept has on the prediction. Our work shares a philosophy with the previous two works, but unlike [1] we do not focus on learning an interpretable model; instead we focus on unsupervised explanation of an already trained network. And unlike [26] we do not have a pre-made notion of concepts; instead we let the model learn the underlying concepts based on the set of examples fed into the analysis. This way our approach is application agnostic. Recent work on Network Dissection [2] tries to provide a framework that ties a neuron in the network to a particular concept for which the neuron activates. These concepts range from simple elements like colour to compound entities like texture. They accomplish this through a range of curated and labeled semantic concepts, whereas our work does not need user-labeled data. Another work which relies on interpreting the network through the lens of abstract concepts is TCAV [8]. This work tries to provide an interpretation of the network's workings in terms of human-interpretable concepts. Like our work, they too rely on the internal representation of the network to determine the network's behaviour, but unlike us they utilize manual or pre-defined concepts and test the network's sensitivity towards them. The work presented in [19] uses a variant of canonical correlation analysis and focuses on measuring the complexity of the representations learned by the network to determine the dynamics of learning; our work differs in that we use the structure of the learned representation as a guideline for our factorization framework and do not comment on its inherent complexity. The work most in line with our goals is [5], where the authors seek to automatically discover concepts learned by the network which are of high predictive value, as measured by their TCAV score [8]. In Figure 1 we compare our work to other works in the area, some of which are relevant and others more tangential to our approach. While the axioms of interpretable machine learning are an ever-evolving set of principles, we list a few features that help highlight the differences between our work and its closest neighbours in this space. Our work is the only unsupervised method in this space of model interpretability which discovers concepts learned by the network in terms of the examples clustered by the network. ACE [5] and SeNN [1] learn concepts, but either by utilizing explicit supervision or by employing pre-existing trained models, whereas works like [26] require detailed human labeling of neurons, image pixels and patches, thus making the process slow to adapt to a new domain. LIME [20], on the other hand, tries to visualize a linear decision boundary around an input, which we approximate by the input's K-nearest neighbours, but unlike our work it cannot discover abstract concepts learned by the network without significant modifications.

3 Proposed Method

In this section we begin by outlining the motivation for our methodology; we then outline the implementation schema and the optimization problem for our model. Subsequently we present the model details and lay the groundwork for evaluation protocols suited to this method.

Figure 1: Relevant Work Comparison

Model Features                    | ACE[5] | LIME[20] | SeNN[1] | Net Dissection[26] | Our work
Post-Hoc Interpretability         | ✓      | ✓        | ✗       | ✓                  | ✓
Unsupervised                      | ✗      | ✗        | Partial | ✗                  | ✓
Insights on Inputs                | ✓      | ✓        | ✓       | ✓                  | ✓
Insights on Neurons               | ✗      | ✗        | ✗       | ✓                  | ✓
Insights on Features              | ✗      | ✓        | ✓       | ✓                  | ✓
Collective Analysis of Inputs     | ✗      | ✗        | ✗       | ✗                  | ✓
Individual Analysis of Inputs     | ✓      | ✓        | ✓       | ✓                  | Partial
Analysis of Representation Space  | ✗      | ✗        | ✗       | ✗                  | ✓

3.1 Motivation Our goal is to visualize the latent representation space learned by a neural network by comparing and contrasting the behaviour of the network on different types of inputs. We want to accomplish this in a framework where we can explain the concepts learned by the network in terms of the inputs that are used to probe the network. In doing so we can assess the generalization ability of the network, both to familiar and to unseen datasets, thus providing insights to human evaluators about the health of the trained network and its suitability to a particular domain. This is possible because there are no restrictions on what qualifies as a legitimate dataset for evaluating network behaviour; thus, in theory, we can evaluate a network on a dataset which is different from its training dataset and assess the ability of the architecture to learn atomic concepts (which may be valid across domains) from the training data instead of learning its idiosyncrasies.

3.2 Proposed Model With these goals in mind, we lay down the model principles aligned with our objectives. Our approach relies on a coupled matrix factorization framework in which we compute embeddings of test examples and of individual neurons in the probed layers in a shared latent space. Our method relies only on having access to the activations of internal layers of a network for a given input. Additionally, for ease of modelling, we assume that these activations are non-negative, which holds for instance when ReLU or Sigmoid non-linearities are used in the network. In doing so our model does not introduce any external constraints on how the network is trained, thus lending it universality. We probe various layers of a network with a set of test examples, and for each test example we store the network's response across all observed layers. We do so with the aim of breaking the process of interpretation down into finding common local structures across the test/evaluation examples, where each feature in the latent representation hopefully captures a latent semantic concept. Thus, through the lens of our model, we can hopefully view a test example as an amalgam of individual concepts. In the following subsections we describe the model construction and provide the mathematical details of the implementation.

3.2.1 Model Construction For our analysis we need to construct a set of matrices where each matrix A_i ∈ R^{a_i×N}_+, where a_i is the number of neurons in layer i of the network and N is the number of examples on which our analysis is conducted. Each column k of matrix A_i, denoted by A_i[:, k], is the vectorized activation of layer i of the network when the k-th test example is passed as an input. Along similar lines we construct another set of matrices where each matrix D_i ∈ R^{S_i×N}_+, where S_i is the number of pixels in the i-th channel of the input images and N is the same as before. As before, each column k of matrix D_i is the k-th test sample's i-th channel, vectorized.
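To make this construction concrete, the following is a minimal sketch, assuming a PyTorch model, a data loader whose image tensors are in [0, 1] (e.g. ToTensor without mean-std normalization, so the D_i stay non-negative), and illustrative helper names of our own that do not come from the released code. It collects the pixel matrices D_i and the activation matrices A_j with forward hooks.

```python
import torch

def build_factor_inputs(model, loader, layer_modules, device="cpu"):
    """Collect D_i (pixels-by-examples, one per channel) and A_j
    (neurons-by-examples, one per analyzed layer) as non-negative matrices."""
    model.eval().to(device)
    captured = {}      # analyzed-layer index -> activation of the current batch
    hooks = [m.register_forward_hook(
                 lambda _m, _in, out, j=j: captured.__setitem__(j, out.detach()))
             for j, m in enumerate(layer_modules)]

    D_cols, A_cols = [], [[] for _ in layer_modules]
    with torch.no_grad():
        for images, _ in loader:
            images = images.to(device)
            model(images)                                  # triggers the hooks
            B, C = images.shape[0], images.shape[1]
            D_cols.append(images.reshape(B, C, -1).cpu())  # per-channel pixels
            for j in range(len(layer_modules)):
                A_cols[j].append(captured[j].reshape(B, -1).cpu())

    for h in hooks:
        h.remove()
    D_all = torch.cat(D_cols)                              # N x C x S
    D = [D_all[:, i, :].T.numpy() for i in range(D_all.shape[1])]    # each S_i x N
    A = [torch.cat(cols).T.clamp(min=0).numpy() for cols in A_cols]  # each a_j x N
    return D, A
```

For a torchvision-style ResNet-18, layer_modules could be, e.g., [model.layer2, model.layer3, model.layer4]; a higher index means a deeper layer, matching the convention used later in the text. The clamp is only a safeguard, since ReLU activations are already non-negative.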

3.2.2 Model The objective function for our proposed method is as follows:

J(P, F, O) = ∑_{i=0}^{C−1} ‖D_i − P_i F‖²_F + ∑_{j=0}^{L−1} ‖A_j − O_j F‖²_F
           + ∑_{i=0}^{C−1} λ_P ‖P_i‖²_p + ∑_{j=0}^{L−1} λ_O ‖O_j‖²_p + λ_F ‖F‖²_p    (3.1)

subject to P_i ∈ R^{S_i×d}_+, O_j ∈ R^{N_j×d}_+, F ∈ R^{d×N}_+, and ‖P[:, i]‖²_2 = 1, ‖O[:, i]‖²_2 = 1 ∀ i.

In Equation 3.1, C is the number of channels in the input data and L is the number of layers of the network that are part of the analysis; we can select which non-negative layers we want to analyze and are not obligated to include all the layers of an architecture. p is usually 2, giving squared 2-norm regularization, although for some experiments we instead set the column norms of the pixel and neuron factor matrices to 1.

For each matrix D_i in Equation 3.1, the k-th column is channel i of the k-th input, vectorized. Thus, for image number j of the test set, D_0[:, j] is the vectorized 0-th channel of the j-th image, and so on. As mentioned earlier, each D_i is thus a pixel-by-example matrix. Each P_i in the first term of Equation 3.1 is a latent representation matrix for the pixels: each row of P_i, for instance P_i[k, :], is the latent representation of the k-th pixel of channel i in the input space. For each matrix A_j in Equation 3.1, the k-th column is the activation of analyzed layer j of the network for the k-th test input. Thus, for image j of the test set, A_0[:, j] is the activation of layer 0 for image j, A_1[:, j] the activation of layer 1, A_2[:, j] the activation of layer 2, and so on. As a point of caution, A_0, A_1, A_2 do not necessarily correspond to layers 0, 1, 2 of the network; they correspond to the 0-th, 1st and 2nd analyzed layers, as our model offers the ability to skip layers of the network. In our convention, the higher the index of the layer, the deeper we are in the network. Each matrix A_j encodes the activity of the neurons of layer j for a given test example. Therefore, each O_j in the factorization encodes the latent representations of the neurons of layer j in its rows; that is, O_j[k, :] is the latent representation of the k-th neuron of layer j. Similarly, the matrix F encodes in its columns the latent representation of each test example fed to the network; that is, F[:, k] is a d-dimensional latent representation of test sample k. Each factor matrix in the objective function obeys non-negativity constraints, and we use multiplicative update rules as described in [15] to solve for the factor matrices.
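Before stating the update rules, a small sketch of how the factors can be initialized and the objective of Equation 3.1 evaluated for monitoring convergence (with p = 2); the uniform random initialization and the helper names are our own choices, not prescribed by the paper.

```python
import numpy as np

def init_factors(D, A, d, seed=0):
    """Non-negative random initialization of P_i, O_j and F for Equation 3.1."""
    rng = np.random.default_rng(seed)
    N = D[0].shape[1]
    P = [rng.uniform(size=(Di.shape[0], d)) for Di in D]   # each S_i x d
    O = [rng.uniform(size=(Aj.shape[0], d)) for Aj in A]   # each N_j x d
    F = rng.uniform(size=(d, N))                           # d x N
    return P, O, F

def objective(D, A, P, O, F, lam_P=0.1, lam_O=0.1, lam_F=0.1):
    """Value of J(P, F, O) in Equation 3.1 with squared Frobenius regularizers."""
    J = sum(np.linalg.norm(Di - Pi @ F, "fro") ** 2 for Di, Pi in zip(D, P))
    J += sum(np.linalg.norm(Aj - Oj @ F, "fro") ** 2 for Aj, Oj in zip(A, O))
    J += lam_P * sum(np.linalg.norm(Pi, "fro") ** 2 for Pi in P)
    J += lam_O * sum(np.linalg.norm(Oj, "fro") ** 2 for Oj in O)
    J += lam_F * np.linalg.norm(F, "fro") ** 2
    return J
```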

Update steps for solving the factor matrices in Equation 3.1 are presented in Equation 3.2, where ∗ and the division are element-wise:

F ← F ∗ (∑_i P_i^T D_i + ∑_j O_j^T A_j) / (∑_i P_i^T P_i F + ∑_j O_j^T O_j F + λ_F F)

P_i ← P_i ∗ (D_i F^T) / (P_i F F^T + λ_P P_i)    (3.2)

O_j ← O_j ∗ (A_j F^T) / (O_j F F^T + λ_O O_j)
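A minimal NumPy sketch of these multiplicative updates, in the style of [15], is given below; the small epsilon in the denominators is a standard numerical safeguard we add, not part of the stated rules.

```python
import numpy as np

def multiplicative_updates(D, A, P, O, F, lam_P=0.1, lam_O=0.1, lam_F=0.1,
                           n_iter=200, eps=1e-9):
    """Alternating multiplicative updates for Equation 3.2.
    All products marked with * and all divisions are element-wise."""
    for _ in range(n_iter):
        # Update F using both the pixel and the activation couplings.
        num = sum(Pi.T @ Di for Pi, Di in zip(P, D)) \
            + sum(Oj.T @ Aj for Oj, Aj in zip(O, A))
        den = sum(Pi.T @ Pi @ F for Pi in P) \
            + sum(Oj.T @ Oj @ F for Oj in O) + lam_F * F
        F *= num / (den + eps)
        # Update each pixel factor P_i.
        for i, (Di, Pi) in enumerate(zip(D, P)):
            P[i] = Pi * (Di @ F.T) / (Pi @ F @ F.T + lam_P * Pi + eps)
        # Update each neuron factor O_j.
        for j, (Aj, Oj) in enumerate(zip(A, O)):
            O[j] = Oj * (Aj @ F.T) / (Oj @ F @ F.T + lam_O * Oj + eps)
    return P, O, F
```

In practice one would monitor the objective from the previous sketch and stop when it plateaus; when the unit-column-norm constraint on P and O is used instead of the norm penalty, the columns can be renormalized after each sweep.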

3.2.3 Model Intuition We now provide some intuition for our modeling choices. Our goal is to identify hidden patterns or concepts that the network learns in order to classify data. To achieve this, our model clusters the test examples, neurons and pixels in the same inner product space. We achieve this clustering by incorporating a coupled non-negative matrix factorization framework. In the learned vector representations of these three types of objects, a high value along a latent dimension indicates that the corresponding latent concept participates in explaining the behaviour of the object. By constraining the model to a non-negative framework, we encourage an interpretable sum-of-concepts representation [14].

Further elaborating on the learned factor matrices: each column j of matrix P_i ∈ R^{S_i×d}_+ is the activation of the pixels of channel i for the concept discovered in latent factor j. Collecting such information over all input channels i for a given j in the respective factor matrices, we can uncover the average activation of pixels across channels for a given concept. This representation can be thought of as a channel-wise mask over features in the input, similar to LIME [20], but we discover a concept-level mask as opposed to an input-level mask. Matrix F ∈ R^{d×N}_+ is the input representation matrix, where each column k of F is a vector in R^d_+ representing the k-th example in the same latent space as the pixels and neurons. For any input k, a high value along a component j of its d-dimensional representation indicates a high affinity of this input towards the latent concept encoded in dimension j, and P_0[:, j], P_1[:, j], P_2[:, j] together help us visualize the pixel activation mask for this latent concept j, as discussed earlier. Collecting the highest-affinity inputs for each latent factor, we obtain a visual approximation of the concept learned in that latent dimension. Given the unsupervised nature of this model, it is extremely well suited for concept discovery for neural networks, akin to the role played by ACE [5] for TCAV [9]. The matrices O_j embed the neurons of layer j in the same latent space as inputs and features and help us visualize which neurons in a layer activate for which concept; we do this by demonstrating the similarity of latent concepts when measured w.r.t. the neurons of a layer. We can also look at the behaviour of neurons across layers by observing the cohesiveness of the latent space as the neurons go deeper in the network.

4 Experimental Evaluation

In the following subsections we present the analysis of the latent space learned by a ResNet-18 [6] trained on CIFAR-100 images [12]. Our analysis touches all the modalities captured by our model, i.e. analysis of pixels, analysis of neurons and analysis of examples. We present this analysis in three subsections for a given network. We have also released the code¹ for verification.

¹ Code: https://github.com/23Uday/Project1CodeSDM2021

4.1 Analysis of a ResNet-18 on the CIFAR-100 Dataset: In the following subsections we analyze the behaviour of a ResNet-18² trained and analyzed on CIFAR-100. Each subsection represents a modality of analysis, namely inputs, neurons, and input features (pixels).

4.1.1 Analysis of Input representations: In this section we present the analysis of the representations learned in the input representation matrix F. For each latent dimension i we compute the total class-wise activation score of the inputs in the row F[i, :] and present, in Table 1, the top 3-4 activated classes along that latent dimension, the images which had the highest affinity in this latent dimension, and the most activated super-class. The motivation behind analyzing super-class labels is to validate the assertion that each latent factor captures an abstract concept that is predominantly present in the member images. We reiterate that these super-class labels were not used in training the network; they are only used as a means to assign a pseudonym to each latent factor, the validity of which can be verified by looking at the topmost activated images and the group of topmost activated classes in Table 1.
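A short sketch of the per-factor class scores that underlie Table 1, under the assumption that the class (or super-class) labels of the test images, which are never used in training, are available as an integer array aligned with the columns of F.

```python
import numpy as np

def classwise_factor_scores(F, labels, n_classes):
    """For each latent factor, the total activation mass contributed by each class."""
    scores = np.zeros((F.shape[0], n_classes))
    for c in range(n_classes):
        scores[:, c] = F[:, labels == c].sum(axis=1)
    return scores

def top_classes_per_factor(F, labels, n_classes, k=4):
    """Top-k activated classes along each latent dimension (the rows of Table 1)."""
    scores = classwise_factor_scores(F, labels, n_classes)
    return np.argsort(-scores, axis=1)[:, :k]
```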

4.1.2 Layer-wise Analysis of Neuron representations: In this section we try to quantify the behaviour of neurons as a cluster and across layers. We utilize the neuron embedding matrix for a given layer j, denoted by O_j ∈ R^{N_j×d}_+, where N_j is the number of neurons in layer j and d is the number of latent factors in the factorization. Next we compute the pairwise cosine similarity between the columns of a matrix O_j, and we do this ∀ j, as shown in Figures 2a, 2b and 2c. Here Layers 0, 1, 2 refer to the three layers analyzed in the ResNet-18 in increasing order of depth and are not necessarily the first, second and third layers of the network. In these plots a high value at an entry (i, j) indicates a higher overlap between the sets of neurons which fire for inputs belonging to the two super-classes best approximated by latent factors i and j. As indicated in Figures 2a, 2b and 2c, the activations tend to become more intra-super-class, i.e. more concentrated along the diagonal of the similarity matrix, as we go deeper down the layers, a result similar in nature to one observed by SVCCA [19]. This is also borne out by the eigenvalues of these similarity matrices: the closer a matrix gets to the identity, the lower the mean of its first K eigenvalues, as shown in Figure 3a.
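A compact sketch of the two quantities plotted in Figures 2 and 3: the pairwise cosine similarity of the latent-factor columns of a neuron factor O_j, and the sorted eigenvalues of that similarity matrix.

```python
import numpy as np

def factor_cosine_similarity(Oj, eps=1e-12):
    """d x d cosine similarity between the columns (latent factors) of O_j."""
    cols = Oj / (np.linalg.norm(Oj, axis=0, keepdims=True) + eps)
    return cols.T @ cols

def similarity_spectrum(Oj, k=None):
    """Eigenvalues of the similarity matrix, largest first; the closer they are
    to 1, the closer the matrix is to the identity (better-separated concepts)."""
    eigvals = np.linalg.eigvalsh(factor_cosine_similarity(Oj))[::-1]
    return eigvals if k is None else eigvals[:k]
```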

We also show similar results for a VGG-11 [22] trained and analyzed on CIFAR-100 in Figure 3b.

² https://github.com/kuangliu/pytorch-cifar

4.1.3 Co-Analysis of Pixels and Inputs: In this section we analyze the pixel space along with the inputs. The matrices P_i ∈ R^{S_i×d}_+ hold the representations of the pixels of input channel i, where S_i is the number of pixels in input channel i, i.e. the vectorized size of the channel. Each column of a matrix P_i holds a feature activation score for all the pixels in channel i for the given latent factor. Therefore, by collecting column 2 of P_0, P_1 and P_2 and resizing appropriately, we get an average pattern of activation across the pixel space for all the images that belong to Latent Factor 2, as shown in Figure 4a, and likewise for Latent Factor 3 in Figure 5a. This functionality is very similar to LIME [20], but instead of individual images we operate on pixel representations which represent learned concepts. We then take these latent-factor images and create a mask, assigning a value of 1 at a pixel location if its activation value is above the median activation value of the latent image and 0 otherwise, and overlay it on the topmost images of the latent factor as found in our analysis of matrix F in Table 1. We also take around 30 nearest neighbours of the image, as determined by the latent space of matrix F, and give a distribution of the latent concepts those neighbours have their highest affinity for, thereby helping us achieve interpretability on an input-by-input basis by being able to say that a given image is close to another. Next, via two examples, we present a per-example case study of the interpretability made possible by this model. In Figures 4a, 4b, 4c and 4d, for Latent Factor 2 we present the latent representation of the pixels, the topmost image in that latent factor, the top 50% activated pixels superimposed on the original image, and the latent-concept distribution of the top-30 nearest neighbours of the image, respectively. As noted previously in Table 1, Latent Factor 2 represents classes like mountain, bridge, castle, skyscraper etc., leading to its topmost super-class being "large man-made outdoor things". On average, the most activated pixels for images belonging to this super-class tend to be blue pixels towards the top, green towards the middle and red towards the bottom. The set of top-30 nearest neighbours for this particular image of a mountain also has members belonging to Latent Factors 3 and 7, two concepts which have a high affinity for inputs belonging to the super-class of trees. In Figures 5a, 5b, 5c and 5d we present a similar analysis for Latent Factor 3. As shown in Table 1, Latent Factor 3 represents classes like willow tree, maple tree, oak tree, pine tree etc., leading to its topmost super-class being "trees".
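A minimal sketch of the two per-example operations described above: thresholding a latent-factor pixel map at its median to overlay a mask on a chosen input, and collecting the dominant latent factors of the input's 30 nearest neighbours in the space of F. The per-channel median is our own choice; the text leaves open whether the median is taken per channel or over the whole latent image.

```python
import numpy as np

def median_mask_overlay(concept_map, image):
    """Keep only pixels whose concept activation exceeds the (per-channel) median.
    Both concept_map and image are C x H x W arrays."""
    mask = concept_map > np.median(concept_map, axis=(1, 2), keepdims=True)
    return image * mask

def neighbour_concept_histogram(F, query_idx, k=30):
    """Histogram of dominant latent factors among the k nearest neighbours of
    example `query_idx`, with neighbours found by cosine similarity in F-space."""
    cols = F / (np.linalg.norm(F, axis=0, keepdims=True) + 1e-12)
    sims = cols.T @ cols[:, query_idx]
    neighbours = np.argsort(-sims)[1:k + 1]        # skip the query itself
    dominant = F[:, neighbours].argmax(axis=0)     # highest-affinity factor each
    return np.bincount(dominant, minlength=F.shape[0])
```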

Table 1: Matrix-F Latent Factor Analysis for ResNet-18 (the Top Images column of the original table shows the highest-affinity example images and is not reproduced here)

Factor | Top Classes                  | Top 1-2 Super Classes
0      | bed, television, wardrobe    | household furniture
1      | kangaroo, beaver, bear       | large omnivores and herbivores
2      | mountain, castle, bridge     | large man-made outdoor things
3      | willow, maple, pine, oak     | trees
4      | shark, dolphin, whale        | fish and aquatic mammals
5      | bee, beetle, spider          | insects
6      | tulip, rose, poppy           | flowers
7      | oak, willow, maple, pine     | trees
8      | telephone, cockroach, cup    | household electrical devices
9      | hamster, cockroach, mouse    | small mammals
10     | boy, woman, girl, baby       | people
11     | aquarium fish, trout         | fish
12     | lawn mower, camel            | large omnivores and herbivores
13     | motorcycle, bicycle, tiger   | vehicles 1
14     | sea, plain, cloud, mountain  | large natural outdoor scenes
15     | caterpillar, skunk, worm     | reptiles
16     | apple, orange, pear          | fruit and vegetables
17     | hamster, raccoon, wolf       | medium mammals
18     | castle, house, wardrobe      | large man-made outdoor things
19     | plate, cup, bowl             | food containers

Figure 2: Plots of cosine similarity of latent factors in Layers 0, 1 and 2 of ResNet-18 (panels (a), (b) and (c) respectively). This highlights the layer-wise learning dynamics of the network and helps us visualize which concepts and classes occupy similar neural regions in a given layer of the network and how they evolve as we go deeper. In fact, we observe that as we go deeper into the network, the similarity matrix becomes more diagonal, showing higher separation of the latent concepts.

In this case, the most activated set of pixels is on the right half of the pixel space, with a higher affinity in the green and red channels of the image, as shown in Figure 5a. In Figure 5c we see the effect of applying this latent image as a filter to the image of a willow tree (Figure 5b): the right half of the image is redundant and the left half captures basic underlying features of the image like contours, shapes, colours etc.

5 Extensions and Applications of the Model

We modify the model in Equation 3.1 by imposing a group-sparse regularization [10] on the factor matrix F and only including the neural activations in the objective function. In Table 2 we present a case study where a ResNet-18 is trained on CIFAR-10 [12] and evaluated on CIFAR-100; we omit some latent factors for brevity.
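For concreteness, below is a sketch of the resulting activations-only objective with an ℓ2,1 group-sparse penalty on the rows of F, which encourages entire latent factors to switch off. The exact penalty form and solver follow [10]; this is only our shorthand for the objective value, not the optimization procedure.

```python
import numpy as np

def group_sparse_objective(A, O, F, lam_O=0.1, lam_group=1.0):
    """Activations-only variant of Equation 3.1 with an l2,1 penalty on the rows
    of F (sum over factors of the 2-norm of each row), so that unused latent
    factors shrink towards zero."""
    J = sum(np.linalg.norm(Aj - Oj @ F, "fro") ** 2 for Aj, Oj in zip(A, O))
    J += lam_O * sum(np.linalg.norm(Oj, "fro") ** 2 for Oj in O)
    J += lam_group * np.linalg.norm(F, axis=1).sum()
    return J
```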

Figure 3: Plots of eigenvalues of the similarity matrices for ResNet-18 (panel a) and VGG-11 (panel b). These plots show the increasing independence of the learned latent concepts w.r.t. neurons as we go deeper in the non-classification layers of the network. The closer a matrix is to the identity, the closer the average of its eigenvalues is to 1, and vice versa. The last layer in each of the two figures is the pre-log-softmax output of the network, which is usually a much lower-dimensional space than the previous layers.

Figure 4: Analysis of the topmost image from Latent Factor 2. Panels: (a) Latent Factor 2 pixel representation, (b) Image: Mountain, (c) Filtered image, (d) Top-30 nearest neighbours of this image.

The goal here is to visualize the generalization ability of the network to cluster and distinguish between natural images after having observed a similar but out-of-sample data distribution during training. In order to validate the result, we then take the latent concepts learned by the model and evaluate TCAV [9] scores³ for each combination of latent concept and input class; the results are shown in Figure 6.
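To sketch this validation step, below is a condensed version of the TCAV score of [8, 9], written by us for illustration (it is not the toolkit referenced in the footnote): learn a concept activation vector separating layer activations of a latent concept's top images from those of random images, then measure the fraction of a class's inputs whose class logit increases along that direction.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def compute_cav(concept_acts, random_acts):
    """CAV = normalized weight vector of a linear classifier separating flattened
    layer activations of concept examples from those of random examples."""
    X = np.vstack([concept_acts, random_acts])
    y = np.hstack([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    w = LogisticRegression(max_iter=1000).fit(X, y).coef_.ravel()
    return w / np.linalg.norm(w)

def tcav_score(model, layer, images, class_idx, cav, device="cpu"):
    """Fraction of `images` (all from one class) whose logit for `class_idx`
    increases along the CAV direction at `layer`."""
    model.eval().to(device)
    store = {}
    def hook_fn(_module, _inputs, output):
        output.retain_grad()           # keep the gradient of this activation
        store["act"] = output
    handle = layer.register_forward_hook(hook_fn)
    positives = 0
    for x in images:
        model.zero_grad()
        logits = model(x.unsqueeze(0).to(device))
        logits[0, class_idx].backward()
        grad = store["act"].grad.flatten().cpu().numpy()
        positives += float(grad @ cav > 0)
    handle.remove()
    return positives / len(images)
```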

6 Conclusions

In this paper, we introduced an unsupervised framework for exploration of the representations learned by a CNN, based on constrained and regularized coupled matrix factorization.

³ https://github.com/rakhimovv/tcav

Figure 5: Analysis of the topmost image from Latent Factor 3. Panels: (a) Latent Factor 3 pixel representation, (b) Image: Willow tree, (c) Filtered image, (d) Top-30 nearest neighbours of this image.

Our proposed method is unique and novel in that it is the first such framework to allow for joint exploration of the representations that a CNN has learned across features (pixels), activations, and data instances. This is in stark contrast to existing state-of-the-art works, which are typically restricted to one of those three modalities, as summarized in Figure 1. Furthermore, owing to the simplicity of the factorization model, our method can provide easily interpretable insights. As a result, our proposed framework offers maximum flexibility and bridges the gap between existing works, while producing results comparable to the state-of-the-art when used for the same (albeit limited) purposes as existing work.

Table 2: Latent Factor Analysis with Group Sparsity, Activations Only (the Top Images column of the original table shows the highest-affinity example images and is not reproduced here)

Factor | Top Classes
1      | kangaroo, rabbit, fox, squirrel
2      | woman, boy, girl, baby, man
4      | pickup truck, motorcycle, bus, tank
5      | cattle, elephant, camel, chimpanzee
6      | porcupine, possum, squirrel, raccoon
7      | willow, maple, oak, palm (all trees)
9      | apple, sweet pepper, rose, orange
10     | plate, bowl, can, clock
12     | whale, rocket, dolphin, sea
13     | girl, snail, boy, spider, crab
16     | lion, hamster, wolf, mouse
18     | streetcar, train, bridge, bus
19     | oak, maple, poppy, sunflower

Figure 6: TCAV scores for Group Sparsity based model

Case in point, in this paper we demonstrate a number of applications of our framework, drawing parallels to what existing work can offer compared to our results, including the extraction of instance-based interpretable concepts (Sec. 4.1.1), insights, based on those concepts, into the behavior of neurons in different layers (Sec. 4.1.2), and instance-level pixel-based insights (Sec. 4.1.3). In future work, we will investigate the adaptation of our framework to different architectures (e.g., RNNs and GCNs) and different applications (e.g., NLP, graph mining, and recommendation systems).

References

[1] David Alvarez-Melis and Tommi S. Jaakkola. Towards robust interpretability with self-explaining neural networks, 2018.

[2] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. CoRR, abs/1704.05796, 2017.

[3] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to end learning for self-driving cars. CoRR, abs/1604.07316, 2016.

[4] Jacob R. Gardner, Matt J. Kusner, Yixuan Li, Paul Upchurch, Kilian Q. Weinberger, and John E. Hopcroft. Deep manifold traversal: Changing labels with convolutional features. CoRR, abs/1511.06421, 2015.

[5] Amirata Ghorbani, James Wexler, James Zou, and Been Kim. Towards automatic concept-based explanations, 2019.

[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[7] Julius Adebayo, Justin Gilmer, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, 2018.

[8] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning, pages 2673–2682, 2018.

[9] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV), 2017.

[10] Jingu Kim, Renato D. C. Monteiro, and Haesun Park. Group sparsity in nonnegative matrix factorization, pages 851–862.

[11] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions, 2017.

[12] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[14] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, October 1999.

[15] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Proceedings of the 13th International Conference on Neural Information Processing Systems, NIPS'00, pages 535–541, Cambridge, MA, USA, 2000. MIT Press.

[16] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them, 2014.

[17] Deirdre K. Mulligan and Kenneth A. Bamberger. Saving governance-by-design. California Law Review, 106:697.

[18] Deirdre K. Mulligan, Colin Koopman, and Nick Doty. Privacy is an essentially contested concept: a multi-dimensional analytic for mapping privacy. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2083):20160118, 2016.

[19] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pages 6078–6087, 2017.

[20] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135–1144, 2016.

[21] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[22] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2014.

[23] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viegas, and Martin Wattenberg. SmoothGrad: removing noise by adding noise, 2017.

[24] Sarah Tan, Rich Caruana, Giles Hooker, and Yin Lou. Distill-and-compare. Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, Feb 2018.

[25] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks, 2013.

[26] B. Zhou, D. Bau, A. Oliva, and A. Torralba. Interpreting deep visual representations via network dissection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9):2131–2145, Sep. 2019.

[27] Bolei Zhou, Yiyou Sun, David Bau, and Antonio Torralba. Interpretable basis decomposition for visual explanation. In ECCV, 2018.