TRANSCRIPT
Data Analytics at Texas A&M Lab
Mengnan Du, Ninghao Liu, Fan Yang, Shuiwang Ji, Xia Hu
Texas A&M University
On Attribution of Recurrent Neural Network
Predictions via Additive Decomposition
RNNs are regarded as black boxes
“Interpreting Machine Learning Models”. Medium.com, 2017.
RNNs have made great progress in accuracy, but their interpretability remains low
[Figure: accuracy vs. interpretability trade-off for RNNs]
Applications: text classification, machine translation
Four types of RNNs
Interpretation is beneficial both to researchers/developers and to end-users
Our goal --- Provide post-hoc interpretations behind individual predictions
• Increase the interpretability of RNNs
• Keep prediction performance unchanged
[Figure: the RNN produces an explanation; users gain trust in it, researchers use it to refine the model]
Key factors ---
• A pre-trained RNN and an input text
• The prediction of the RNN
Post-hoc interpretation ---
• A contribution score for each feature in the input
• Deeper color in the heatmap means higher contribution
Interpretation heatmap
[Figure: unrolled RNN; inputs x_1 … x_T, hidden states h_1 … h_T, logit z, prediction y]
“Teaching Machines to Read and Comprehend”. NIPS, 2015.
Challenge 1: How to guarantee that the post-hoc interpretations are indeed faithful to the original model. Local-approximation-based methods may not be faithful to the original prediction.
Challenge 2: It is challenging to develop an attribution method that can generate phrase-level explanations.
Example: the phrase “Used to be my favorite” expresses negative sentiment, even though the word “favorite” alone looks positive; a phrase-level explanation is needed to capture this.
“Why Should I Trust You?: Explaining the Predictions of Any Classifier”. KDD, 2016.
• Faithful interpretation: the method should investigate internal neurons
• Phrase-level interpretation: the interpretation method should be flexible
Can we utilize decomposition-based methods to derive interpretations?
• Symbol α_t: how much of the previously accumulated evidence is carried over to time step t
• Symbol g(x_t): the new evidence that the RNN obtains at time step t
• Some RNNs follow this rule exactly (e.g., GRU); some approximately (e.g., LSTM)
• Abstracted RNN updating rule: h_t = α_t ⊙ h_{t−1} + g(x_t)
Abstracted RNN updating rule: h_t = α_t ⊙ h_{t−1} + g(x_t)
• RNN logit value: z = w^T h_T, where w is the weight vector of the predicted class
• RNN prediction decomposition: z = Σ_{t=1}^{T} w^T [(∏_{j=t+1}^{T} α_j) ⊙ (h_t − α_t ⊙ h_{t−1})]
Two essential elements: the hidden state vectors h_t and the updating vectors α_t
• From decomposition to word-level explanation
• Contribution score for x_t consists of two parts:
  Evidence updating from t − 1 to t: h_t − α_t ⊙ h_{t−1}
  Evidence forgetting from t + 1 to T: ∏_{j=t+1}^{T} α_j
• S(x_t) = w^T [(∏_{j=t+1}^{T} α_j) ⊙ (h_t − α_t ⊙ h_{t−1})]
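The word-level contribution score can be sketched in plain Python (a minimal sketch, not the authors' implementation; it assumes the abstracted rule h_t = α_t ⊙ h_{t−1} + g(x_t), with vectors represented as lists of floats):

```python
def reat_word_scores(h, alpha, w):
    """Word-level contribution scores via additive decomposition.

    Assumes h_t = alpha_t * h_{t-1} + g(x_t) elementwise, so the evidence
    at step t is h_t - alpha_t * h_{t-1}.  The logit w . h_T then splits
    additively over time steps, each evidence term damped by the product
    of later updating vectors (the forgetting it survives).

    h:     list of T+1 hidden-state vectors; h[0] is the initial state.
    alpha: list of T updating vectors alpha_1 .. alpha_T.
    w:     logit weight vector of the predicted class.
    """
    T, d = len(alpha), len(w)
    scores = [0.0] * T
    keep = [1.0] * d                    # running product alpha_{t+1} .. alpha_T
    for t in range(T - 1, -1, -1):      # walk backwards, accumulating forgetting
        evidence = [h[t + 1][i] - alpha[t][i] * h[t][i] for i in range(d)]
        scores[t] = sum(w[i] * keep[i] * evidence[i] for i in range(d))
        keep = [keep[i] * alpha[t][i] for i in range(d)]
    return scores
```

With a zero initial state, the scores sum exactly to the logit w^T h_T, which is the additivity the decomposition promises.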
• Phrase-level explanation
• Contribution score for a phrase x_A, A = {q, …, r}:
  S(x_A) = w^T [(∏_{j=r+1}^{T} α_j) ⊙ (h_r − (∏_{j=q}^{r} α_j) ⊙ h_{q−1})]
Two parts: evidence updating from q − 1 to r, and evidence forgetting from r + 1 to T
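The phrase score can likewise be sketched in plain Python (a minimal sketch assuming the abstracted rule h_t = α_t ⊙ h_{t−1} + g(x_t); vectors are lists of floats and word positions are 1-indexed):

```python
def reat_phrase_score(h, alpha, w, q, r):
    """Contribution score for the phrase x_q .. x_r (1-indexed, inclusive):

        S(x_A) = w . [ prod_{j=r+1..T} alpha_j
                       * (h_r - prod_{j=q..r} alpha_j * h_{q-1}) ]

    h is a list of T+1 hidden-state vectors (h[0] is the initial state),
    alpha a list of T updating vectors, w the logit weight vector.
    """
    T, d = len(alpha), len(w)
    forget = [1.0] * d                  # evidence forgetting, steps r+1 .. T
    for j in range(r, T):
        forget = [forget[i] * alpha[j][i] for i in range(d)]
    update = [1.0] * d                  # evidence updating, steps q .. r
    for j in range(q - 1, r):
        update = [update[i] * alpha[j][i] for i in range(d)]
    return sum(w[i] * forget[i] * (h[r][i] - update[i] * h[q - 1][i])
               for i in range(d))
```

Phrase scores are consistent with the decomposition: the scores of phrases that partition the sentence add up to the logit.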
• Hidden state vector updating rule for GRU: h_t = u_t ⊙ h_{t−1} + (1 − u_t) ⊙ h̃_t, where u_t is the updating gate
“Understanding LSTM Networks”. Colah’s blog, 2015.
• REAT updating rule: h_t = α_t ⊙ h_{t−1} + g(x_t)
• GRU contribution score for a phrase x_A, A = {q, …, r}:
  Only need to replace the REAT updating vector α_t with the GRU updating gate vector u_t
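A scalar GRU step makes the correspondence concrete (a sketch with hypothetical scalar weights; it assumes the convention implied by the slide, in which the updating gate u_t multiplies h_{t−1}):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(h_prev, x, p):
    """One GRU step with a scalar hidden state, written so the REAT
    updating vector is explicit: h_t = u_t * h_{t-1} + (1 - u_t) * htilde_t,
    so alpha_t = u_t exactly.  `p` is a dict of hypothetical weights.
    """
    u = sigmoid(p["wu_x"] * x + p["wu_h"] * h_prev)   # update gate = alpha_t
    r = sigmoid(p["wr_x"] * x + p["wr_h"] * h_prev)   # reset gate
    htilde = math.tanh(p["wc_x"] * x + p["wc_h"] * (r * h_prev))
    h = u * h_prev + (1.0 - u) * htilde               # exact REAT form
    return h, u
```

Because this is exactly the abstracted form with g(x_t) = (1 − u_t) ⊙ h̃_t, the additive decomposition holds exactly for GRU.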
• Hidden state vector updating rule for LSTM: c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t, h_t = o_t ⊙ tanh(c_t)
“Understanding LSTM Networks”. Colah’s blog, 2015.
• Approximate REAT updating rule: the cell state follows the additive form exactly, but the output gate and tanh make h_t = α_t ⊙ h_{t−1} + g(x_t) hold only approximately
• LSTM contribution score for a phrase x_A, A = {q, …, r}: computed with this approximate updating vector
• BiGRU: concatenation of a forward GRU and a reverse GRU
• Phrase-level attribution for BiGRU: a combination of two terms, the forward-GRU decomposition and the reverse-GRU decomposition
1. Attribution Faithfulness Evaluation
2. Qualitative Evaluation via Case Studies
3. Applying REAT for Linguistic Patterns Analysis
4. Applying REAT for Model Misbehavior Analysis
• Once the most important sentence is deleted, it causes the largest accuracy drop for the target class.
“ It is ridiculous , but of course it is also refreshing”.
REAT explanations are highly faithful to the original RNN
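The deletion-based faithfulness check can be sketched as follows (a single-instance sketch, not the paper's full evaluation protocol; `predict` is a hypothetical callable mapping a token list to the target-class probability):

```python
def deletion_faithfulness(predict, tokens, scores):
    """Remove the highest-scored token and return the resulting drop in the
    target-class probability.  If the attribution is faithful, deleting the
    most important token should produce a large drop.

    predict: hypothetical callable, token list -> target-class probability.
    tokens:  list of input tokens.
    scores:  attribution score per token, same length as `tokens`.
    """
    base = predict(tokens)
    top = max(range(len(tokens)), key=lambda i: scores[i])
    ablated = tokens[:top] + tokens[top + 1:]   # delete the top-scored token
    return base - predict(ablated)
```

Averaging this drop over a test set, and comparing against deleting random or lowest-scored tokens, gives the faithfulness comparison reported on the slide.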
REAT accurately reflects the prediction scores of different architectures
• Visualizations Under Different RNN Architectures
GRU, positive prediction (51.6% confidence): The fight scenes are fun but it grows tedious
LSTM, positive prediction (96.2% confidence): The fight scenes are fun but it grows tedious
BiGRU, negative prediction (62.7% confidence): The fight scenes are fun but it grows tedious
Green: positive contribution, red: negative contribution
• Hierarchical Attribution: LSTM negative prediction with 99.46% confidence
The same sentence is attributed at three levels of granularity (word, phrase, and clause):
“The story may be new but the movie does n’t serve up lots of laughs”
Green: positive contribution, red: negative contribution
• In general, the first part of the text has a negative contribution, and the second part has a positive contribution
• This hierarchical attribution represents the contributions at different levels of granularity
• Apply REAT to analyze linguistic patterns for an LSTM over the SST2 test set.
POS category score distributions
• RBS (superlative adverbs, e.g., “best”, “most”): highest scores
• JJ (adjectives): ranks relatively high
• NN (nouns): near-zero median score
REAT unveils useful linguistic knowledge captured by the LSTM
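The linguistic-pattern analysis boils down to grouping word-level scores by POS tag; a minimal sketch (the `tagged_scores` pairs are assumed to come from running a POS tagger over the attributed test set):

```python
from collections import defaultdict
from statistics import median

def pos_score_profile(tagged_scores):
    """Group word-level attribution scores by POS tag and report the
    per-tag median, mirroring the score-distribution analysis.

    tagged_scores: iterable of (pos_tag, score) pairs.
    """
    buckets = defaultdict(list)
    for tag, score in tagged_scores:
        buckets[tag].append(score)
    return {tag: median(vals) for tag, vals in buckets.items()}
```

Sorting the resulting medians reproduces the ranking on the slide: RBS highest, JJ relatively high, NN near zero.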
• The LSTM wrongly gives a 99.97% negative sentiment prediction.
• Attribution score distribution for the two words “terrible” and “terribly”
“Schweiger is talented and terribly charismatic, qualities essential to both movie stars and social anarchists”.
• REAT tells us that the LSTM captures only the meaning related to “terrible”
• while ignoring the other meanings of “terribly”, such as “extremely”
• This LSTM fails to model the polysemy of words
• Interpretable adversarial attack for an LSTM classifier
“Schweiger is talented and terribly charismatic, qualities essential to both movie stars and social anarchists”.
Negative prediction, 99.97% confidence
Replacing “terribly” with “extremely”:
“Schweiger is talented and extremely charismatic, qualities essential to both movie stars and social anarchists”.
Positive prediction, 81.29% confidence
Replacing “terribly” with “very”:
“Schweiger is talented and very charismatic, qualities essential to both movie stars and social anarchists”.
Positive prediction, 99.53% confidence
• This adversarial attack generalizes to other instances
“Occasionally melodramatic, it ’s also extremely effective.” (positive prediction, 99.53%)
“Occasionally melodramatic, it ’s also terribly effective.” (negative prediction, 99.0%)
“Extremely well acted by the four primary actors, this is a seriously intended movie that is not easily forgotten.” (positive prediction, 99.98%)
“Terribly well acted by the four primary actors, this is a seriously intended movie that is not easily forgotten.” (negative prediction, 87.7%)
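The attack itself can be sketched as attribution-guided word substitution (a sketch; `synonyms` is a hypothetical map of same-meaning replacements such as {"terribly": "extremely"}, and scores are contributions toward the positive class):

```python
def attribution_guided_substitute(tokens, scores, synonyms):
    """Swap out the token with the most negative contribution, if a
    same-meaning replacement is available, to probe whether the model's
    prediction flips on a semantically equivalent input.

    tokens:   list of input tokens.
    scores:   attribution score per token (toward the positive class).
    synonyms: hypothetical dict mapping a word to a same-meaning word.
    """
    worst = min(range(len(tokens)), key=lambda i: scores[i])
    word = tokens[worst]
    if word in synonyms:
        return tokens[:worst] + [synonyms[word]] + tokens[worst + 1:]
    return list(tokens)   # no replacement available; leave input unchanged
```

Because the substitution preserves meaning, a flipped prediction on the new input exposes a model weakness rather than a genuine change in sentiment.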
REAT: A post-hoc interpretation method for predictions made by RNNs ---
• Highly faithful and interpretable explanations
• Useful debugging tool to examine RNNs
Future work ---
• “Techniques for Interpretable Machine Learning”. Mengnan Du, Ninghao Liu, Xia Hu. Communications of the ACM, 2019.
[Figure: taxonomy of interpretation techniques: intrinsic explanation (global or local), e.g., new layers with interpretable constraints; post-hoc global explanation; post-hoc local explanation]