TRANSCRIPT
Data Analytics at Texas A&M Lab
Mengnan Du, Ninghao Liu, Fan Yang, Shuiwang Ji, Xia Hu
Texas A&M University
On Attribution of Recurrent Neural Network
Predictions via Additive Decomposition
RNNs are regarded as black boxes
“Interpreting Machine Learning Models”. Medium.com, 2017.
RNNs have made great progress in accuracy, but their interpretability remains low
[Figure: accuracy vs. interpretability trade-off for RNNs]
Applications: text classification, machine translation
Four types of RNNs
Interpretation is beneficial both to researchers/developers and to end-users
Our goal --- Provide post-hoc interpretations behind individual predictions
• Increase the interpretability of RNNs
• Keep prediction performance unchanged
[Figure: the RNN produces an explanation; users gain trust in it, researchers use it to refine the model]
Key factors ---
• A pre-trained RNN and an input text
• The prediction of the RNN
Post-hoc interpretation ---
• A contribution score for each feature in the input
• Deeper color in the heatmap means higher contribution
Interpretation heatmap
[Figure: unrolled RNN; inputs x_1 … x_T, hidden states h_1 … h_T, logit z, prediction y]
“Teaching Machines to Read and Comprehend”. NIPS, 2015.
Challenge 1: How to guarantee that the post-hoc interpretations are indeed faithful to the original model. Local-approximation-based methods may not be faithful to the original prediction.
Challenge 2: It is challenging to develop an attribution method that can generate phrase-level explanations.
Example: the phrase “Used to be my favorite” expresses negative sentiment, even though the word “favorite” alone looks positive; a phrase-level explanation is needed to capture this.
“Why Should I Trust You?: Explaining the Predictions of Any Classifier”. KDD, 2016.
• Faithful interpretation: the method should investigate internal neurons
• Phrase-level interpretation: the interpretation method should be flexible
Can we utilize decomposition-based methods to derive interpretations?
• Symbol α_t: how much of the previously accumulated evidence is carried over to time step t
• Symbol g(x_t): the new evidence that the RNN obtains at time step t
• Some RNNs follow this rule exactly (e.g., GRU); some approximately (e.g., LSTM)
• Abstracted RNN updating rule: h_t = α_t ⊙ h_{t−1} + g(x_t)
Abstracted RNN updating rule: h_t = α_t ⊙ h_{t−1} + g(x_t)
• RNN logit value: z = w^T h_T, where w is the weight vector of the predicted class
• RNN prediction decomposition: z = Σ_{t=1}^{T} w^T [(∏_{j=t+1}^{T} α_j) ⊙ (h_t − α_t ⊙ h_{t−1})]
Two essential elements: the hidden state vectors h_t and the updating vectors α_t
• From decomposition to word-level explanation
• Contribution score for x_t consists of two parts:
  Evidence updating from t − 1 to t: h_t − α_t ⊙ h_{t−1}
  Evidence forgetting from t + 1 to T: ∏_{j=t+1}^{T} α_j
• S(x_t) = w^T [(∏_{j=t+1}^{T} α_j) ⊙ (h_t − α_t ⊙ h_{t−1})]
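The word-level contribution score can be sketched in plain Python (a minimal sketch, not the authors' implementation; it assumes the abstracted rule h_t = α_t ⊙ h_{t−1} + g(x_t), with vectors represented as lists of floats):

```python
def reat_word_scores(h, alpha, w):
    """Word-level contribution scores via additive decomposition.

    Assumes h_t = alpha_t * h_{t-1} + g(x_t) elementwise, so the evidence
    at step t is h_t - alpha_t * h_{t-1}.  The logit w . h_T then splits
    additively over time steps, each evidence term damped by the product
    of later updating vectors (the forgetting it survives).

    h:     list of T+1 hidden-state vectors; h[0] is the initial state.
    alpha: list of T updating vectors alpha_1 .. alpha_T.
    w:     logit weight vector of the predicted class.
    """
    T, d = len(alpha), len(w)
    scores = [0.0] * T
    keep = [1.0] * d                    # running product alpha_{t+1} .. alpha_T
    for t in range(T - 1, -1, -1):      # walk backwards, accumulating forgetting
        evidence = [h[t + 1][i] - alpha[t][i] * h[t][i] for i in range(d)]
        scores[t] = sum(w[i] * keep[i] * evidence[i] for i in range(d))
        keep = [keep[i] * alpha[t][i] for i in range(d)]
    return scores
```

With a zero initial state, the scores sum exactly to the logit w^T h_T, which is the additivity the decomposition promises.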
• Phrase-level explanation
• Contribution score for a phrase x_A, A = {q, …, r}:
  S(x_A) = w^T [(∏_{j=r+1}^{T} α_j) ⊙ (h_r − (∏_{j=q}^{r} α_j) ⊙ h_{q−1})]
Two parts: evidence updating from q − 1 to r, and evidence forgetting from r + 1 to T
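The phrase score can likewise be sketched in plain Python (a minimal sketch assuming the abstracted rule h_t = α_t ⊙ h_{t−1} + g(x_t); vectors are lists of floats and word positions are 1-indexed):

```python
def reat_phrase_score(h, alpha, w, q, r):
    """Contribution score for the phrase x_q .. x_r (1-indexed, inclusive):

        S(x_A) = w . [ prod_{j=r+1..T} alpha_j
                       * (h_r - prod_{j=q..r} alpha_j * h_{q-1}) ]

    h is a list of T+1 hidden-state vectors (h[0] is the initial state),
    alpha a list of T updating vectors, w the logit weight vector.
    """
    T, d = len(alpha), len(w)
    forget = [1.0] * d                  # evidence forgetting, steps r+1 .. T
    for j in range(r, T):
        forget = [forget[i] * alpha[j][i] for i in range(d)]
    update = [1.0] * d                  # evidence updating, steps q .. r
    for j in range(q - 1, r):
        update = [update[i] * alpha[j][i] for i in range(d)]
    return sum(w[i] * forget[i] * (h[r][i] - update[i] * h[q - 1][i])
               for i in range(d))
```

Phrase scores are consistent with the decomposition: the scores of phrases that partition the sentence add up to the logit.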
• Hidden state vector updating rule for GRU: h_t = u_t ⊙ h_{t−1} + (1 − u_t) ⊙ h̃_t, where u_t is the updating gate
“Understanding LSTM Networks”. Colah’s blog, 2015.
• REAT updating rule: h_t = α_t ⊙ h_{t−1} + g(x_t)
• GRU contribution score for a phrase x_A, A = {q, …, r}:
  Only need to replace the REAT updating vector α_t with the GRU updating gate vector u_t
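A scalar GRU step makes the correspondence concrete (a sketch with hypothetical scalar weights; it assumes the convention implied by the slide, in which the updating gate u_t multiplies h_{t−1}):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(h_prev, x, p):
    """One GRU step with a scalar hidden state, written so the REAT
    updating vector is explicit: h_t = u_t * h_{t-1} + (1 - u_t) * htilde_t,
    so alpha_t = u_t exactly.  `p` is a dict of hypothetical weights.
    """
    u = sigmoid(p["wu_x"] * x + p["wu_h"] * h_prev)   # update gate = alpha_t
    r = sigmoid(p["wr_x"] * x + p["wr_h"] * h_prev)   # reset gate
    htilde = math.tanh(p["wc_x"] * x + p["wc_h"] * (r * h_prev))
    h = u * h_prev + (1.0 - u) * htilde               # exact REAT form
    return h, u
```

Because this is exactly the abstracted form with g(x_t) = (1 − u_t) ⊙ h̃_t, the additive decomposition holds exactly for GRU.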
• Hidden state vector updating rule for LSTM: c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t, h_t = o_t ⊙ tanh(c_t)
“Understanding LSTM Networks”. Colah’s blog, 2015.
• Approximate REAT updating rule: the cell state follows the additive form exactly, but the output gate and tanh make h_t = α_t ⊙ h_{t−1} + g(x_t) hold only approximately
• LSTM contribution score for a phrase x_A, A = {q, …, r}: computed with this approximate updating vector
• BiGRU: concatenation of a forward GRU and a reverse GRU
• Phrase-level attribution for BiGRU: a combination of two terms, the forward-GRU decomposition and the reverse-GRU decomposition
1. Attribution Faithfulness Evaluation
2. Qualitative Evaluation via Case Studies
3. Applying REAT for Linguistic Patterns Analysis
4. Applying REAT for Model Misbehavior Analysis
• Once the most important sentence is deleted, it causes the largest accuracy drop for the target class.
“ It is ridiculous , but of course it is also refreshing”.
REAT explanations are highly faithful to the original RNN
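The deletion-based faithfulness check can be sketched as follows (a single-instance sketch, not the paper's full evaluation protocol; `predict` is a hypothetical callable mapping a token list to the target-class probability):

```python
def deletion_faithfulness(predict, tokens, scores):
    """Remove the highest-scored token and return the resulting drop in the
    target-class probability.  If the attribution is faithful, deleting the
    most important token should produce a large drop.

    predict: hypothetical callable, token list -> target-class probability.
    tokens:  list of input tokens.
    scores:  attribution score per token, same length as `tokens`.
    """
    base = predict(tokens)
    top = max(range(len(tokens)), key=lambda i: scores[i])
    ablated = tokens[:top] + tokens[top + 1:]   # delete the top-scored token
    return base - predict(ablated)
```

Averaging this drop over a test set, and comparing against deleting random or lowest-scored tokens, gives the faithfulness comparison reported on the slide.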
REAT accurately reflects the prediction scores of different architectures
• Visualizations Under Different RNN Architectures
GRU, positive prediction (51.6% confidence): The fight scenes are fun but it grows tedious
LSTM, positive prediction (96.2% confidence): The fight scenes are fun but it grows tedious
BiGRU, negative prediction (62.7% confidence): The fight scenes are fun but it grows tedious
Green: positive contribution, red: negative contribution
• Hierarchical Attribution: LSTM negative prediction with 99.46% confidence
The same sentence is attributed at three levels of granularity (word, phrase, and clause):
“The story may be new but the movie does n’t serve up lots of laughs”
Green: positive contribution, red: negative contribution
• In general, the first part of the text has a negative contribution, and the second part has a positive contribution
• This hierarchical attribution represents the contributions at different levels of granularity
• Apply REAT to analyze linguistic patterns for an LSTM over the SST2 test set.
POS category score distributions
• RBS (superlative adverbs, e.g., “best”, “most”): highest scores
• JJ (adjectives): ranks relatively high
• NN (nouns): near-zero median score
REAT unveils useful linguistic knowledge captured by the LSTM
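The linguistic-pattern analysis boils down to grouping word-level scores by POS tag; a minimal sketch (the `tagged_scores` pairs are assumed to come from running a POS tagger over the attributed test set):

```python
from collections import defaultdict
from statistics import median

def pos_score_profile(tagged_scores):
    """Group word-level attribution scores by POS tag and report the
    per-tag median, mirroring the score-distribution analysis.

    tagged_scores: iterable of (pos_tag, score) pairs.
    """
    buckets = defaultdict(list)
    for tag, score in tagged_scores:
        buckets[tag].append(score)
    return {tag: median(vals) for tag, vals in buckets.items()}
```

Sorting the resulting medians reproduces the ranking on the slide: RBS highest, JJ relatively high, NN near zero.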
• The LSTM wrongly gives a 99.97% negative sentiment prediction.
• Attribution score distribution for the two words “terrible” and “terribly”
“Schweiger is talented and terribly charismatic, qualities essential to both movie stars and social anarchists”.
• REAT tells us that the LSTM captures only the meaning related to “terrible”
• while ignoring the other meanings of “terribly”, such as “extremely”
• This LSTM fails to model the polysemy of words
• Interpretable adversarial attack for an LSTM classifier
“Schweiger is talented and terribly charismatic, qualities essential to both movie stars and social anarchists”.
Negative prediction, 99.97% confidence
Replacing “terribly” with “extremely”:
“Schweiger is talented and extremely charismatic, qualities essential to both movie stars and social anarchists”.
Positive prediction, 81.29% confidence
Replacing “terribly” with “very”:
“Schweiger is talented and very charismatic, qualities essential to both movie stars and social anarchists”.
Positive prediction, 99.53% confidence
• This adversarial attack generalizes to other instances
“Occasionally melodramatic, it ’s also extremely effective.” (positive prediction, 99.53%)
“Occasionally melodramatic, it ’s also terribly effective.” (negative prediction, 99.0%)
“Extremely well acted by the four primary actors, this is a seriously intended movie that is not easily forgotten.” (positive prediction, 99.98%)
“Terribly well acted by the four primary actors, this is a seriously intended movie that is not easily forgotten.” (negative prediction, 87.7%)
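The attack itself can be sketched as attribution-guided word substitution (a sketch; `synonyms` is a hypothetical map of same-meaning replacements such as {"terribly": "extremely"}, and scores are contributions toward the positive class):

```python
def attribution_guided_substitute(tokens, scores, synonyms):
    """Swap out the token with the most negative contribution, if a
    same-meaning replacement is available, to probe whether the model's
    prediction flips on a semantically equivalent input.

    tokens:   list of input tokens.
    scores:   attribution score per token (toward the positive class).
    synonyms: hypothetical dict mapping a word to a same-meaning word.
    """
    worst = min(range(len(tokens)), key=lambda i: scores[i])
    word = tokens[worst]
    if word in synonyms:
        return tokens[:worst] + [synonyms[word]] + tokens[worst + 1:]
    return list(tokens)   # no replacement available; leave input unchanged
```

Because the substitution preserves meaning, a flipped prediction on the new input exposes a model weakness rather than a genuine change in sentiment.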
REAT: A post-hoc interpretation method for predictions made by RNNs ---
• Highly faithful and interpretable explanations
• Useful debugging tool to examine RNNs
Future work ---
• “Techniques for Interpretable Machine Learning”. Mengnan Du, Ninghao Liu, Xia Hu. Communications of the ACM, 2019.
[Figure: taxonomy of interpretation techniques: intrinsic explanation (global or local), e.g., new layers with interpretable constraints; post-hoc global explanation; post-hoc local explanation]