SDNet: Contextualized Attention-based Deep Network for Conversational Question Answering
Jan. 31, 2019
Chenguang Zhu
Microsoft Speech and Dialogue Research Group
Motivation
• Traditional MRC models target the single-turn QA scenario
• No correlation between different questions/answers
• Humans interact by conversation, with rich context
• Models with conversational AI better match realistic settings, such as customer support chatbots and social chatbots
CoQA Dataset
• From Stanford NLP Lab
• The first conversational question answering dataset
• Context can be carried over, requiring coreference resolution
• Model should comprehend both the passage and questions/answers from previous rounds
Example
Our formulation
• Passage (context)
• Previous rounds of questions and answers
• Current question
• We prepend the latest rounds of QAs to the current question
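A minimal sketch of this reformulation, assuming hypothetical `<Q>`/`<A>` marker symbols and a history kept as (question, answer) pairs; the exact markers and number of rounds used by SDNet are not stated on this slide:

```python
def build_input(history, current_question, n_rounds=2):
    """Prepend the latest n_rounds of (question, answer) pairs to the
    current question, marking each utterance with <Q>/<A> symbols
    (the marker tokens here are illustrative assumptions)."""
    parts = []
    for q, a in history[-n_rounds:]:
        parts += ["<Q> " + q, "<A> " + a]
    parts.append("<Q> " + current_question)
    return " ".join(parts)
```

The concatenated string then plays the role of the "question" fed to the rest of the network.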
Encoding Layer
• For each word in the passage and question:
• 300-dim GloVe word embedding
• 1024-dim BERT contextual embedding
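Illustrative only: the two embeddings per word, presumably concatenated into a single input vector (the slide lists both but does not spell out how they are combined):

```python
import numpy as np

# Stand-ins for a GloVe lookup and a BERT contextual embedding of one word.
glove_vec = np.random.rand(300)
bert_vec = np.random.rand(1024)

# Concatenation gives a 1324-dim per-word input representation.
word_rep = np.concatenate([glove_vec, bert_vec])
```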
Using BERT
• We utilize BERT as a contextual embedder, with some adaptation
• BERT tokenizes text into byte pair encoding (BPE) units
• These may not align with canonical tokenizers like spaCy
Using BERT--Alignment
• Tokenize text into word sequence
• Then use BERT to tokenize each word into BPEs
• After obtaining contextual embedding for each BPE, we average the BERT embeddings for all the BPEs corresponding to a word
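The averaging step above can be sketched as follows, assuming the alignment step has produced, for each word, the list of its BPE indices (the `word_to_bpes` layout is a hypothetical representation of that alignment):

```python
import numpy as np

def word_level_embeddings(bpe_embeddings, word_to_bpes):
    """Average the BERT embeddings of all BPEs belonging to each word.

    bpe_embeddings: (num_bpes, dim) array of per-BPE contextual embeddings.
    word_to_bpes: list where entry i holds the BPE indices of word i.
    Returns a (num_words, dim) array of word-level embeddings.
    """
    return np.stack([bpe_embeddings[idx].mean(axis=0) for idx in word_to_bpes])
```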
Using BERT—Linear combination
• BERT has many layers, and the BERT paper recommends using only the last layer
• We linearly combine all layers of output
• This can greatly improve the results
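A sketch of the idea; the softmax normalization of the per-layer weights is an assumption here, since the slide only states that the layer outputs are linearly combined with learned coefficients:

```python
import numpy as np

def combine_bert_layers(layer_outputs, alpha):
    """Linearly combine the outputs of all L BERT layers.

    layer_outputs: (L, seq_len, dim) per-layer word embeddings.
    alpha: (L,) learnable raw layer scores, normalized with a softmax.
    Returns a (seq_len, dim) combined embedding.
    """
    w = np.exp(alpha - alpha.max())
    w = w / w.sum()
    return np.tensordot(w, layer_outputs, axes=1)
```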
Using BERT—Linear combination
Linear combination: h_i = Σ_l α_l · BERT_l(w_i), where BERT_l(w_i) is the averaged BPE embedding for word w_i at layer l
Using BERT– Lock internal weights
• We lock the internal weights of BERT (no backpropagation into BERT)
• In our case, fine-tuning BERT does not work well
Representation
• Each word in the passage is represented by
• Each word in the question is represented by
Integration Layer
• Word-level inter-attention
• Attend from the question to the passage, i.e.:
• Passage words have several more features:
• POS and NER embeddings
• 3-dim exact-matching vector
• A normalized term frequency
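A rough illustration of two of these features: an exact-match indicator for the original and lowercase word forms, plus the normalized term frequency. The third matching dimension (lemma-form matching) and the POS/NER embeddings are omitted here for brevity:

```python
from collections import Counter

def passage_word_features(passage, question):
    """Per passage word: (match in original form, match in lowercase form,
    normalized term frequency). Inputs are lists of token strings."""
    q = set(question)
    q_lower = {w.lower() for w in question}
    tf = Counter(passage)
    n = len(passage)
    return [(float(w in q), float(w.lower() in q_lower), tf[w] / n)
            for w in passage]
```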
RNN
• Passage (context) and question words both go through K layers of Bi-LSTMs
• Question words go through one more RNN layer
Question self-attention
• The question contains the conversation history
• We need to establish the relationship between the history and the current question
• Self-attention:
• The self-attention output is the final representation of the question words
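A bare-bones sketch of self-attention over the question word vectors; SDNet uses a parameterized attention score, so the plain dot product below is a simplifying assumption:

```python
import numpy as np

def self_attend(U):
    """U: (n, d) question word vectors. Each output row is an
    attention-weighted mixture of all rows of U."""
    scores = U @ U.T                              # (n, n) pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)             # row-wise softmax
    return A @ U
```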
Multi-level inter-attention
• Adapted from FusionNet (Huang et al. 2018)
• After multi-level inter-attention, we apply an RNN, self-attention, and another RNN to obtain the final representation of the context:
Generate answer – positional probability
• First, we condense the question into one vector
• Then we generate the probability that the answer starts and ends at each position:
Start probability: P_i^S ∝ exp((u^Q)^T W_S u_i^C)
GRU: t^Q = GRU(u^Q, Σ_i P_i^S u_i^C)
End probability: P_i^E ∝ exp((t^Q)^T W_E u_i^C)
• We generate the probability of “yes”, “no”, “no answer” similarly
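The start/end pipeline can be sketched end to end as below. The GRU cell is passed in as a function, and the bilinear weight matrices `W_s`/`W_e` stand in for the model's learned parameters; all names here are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer_span_probs(C, u, W_s, W_e, gru_step):
    """C: (n, d) context word vectors; u: (d,) condensed question vector.
    Returns (start_probs, end_probs) over the n context positions."""
    p_start = softmax(C @ (W_s.T @ u))   # P^S_i from a bilinear score
    expected = p_start @ C               # attention-weighted context vector
    t = gru_step(u, expected)            # one GRU step updates the question vector
    p_end = softmax(C @ (W_e.T @ t))     # P^E_i scored against the updated vector
    return p_start, p_end
```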
Overall structure -- SDNet
Training and prediction
• Loss: cross-entropy
• Batch size: 1 dialogue
• 30 epochs
• ~2 hours per epoch on one Volta-100 GPU
• The span is predicted by choosing the (start, end) pair with maximum start × end probability, subject to span length ≤ 15
• The ensemble model consists of the 12 best single models trained with different random seeds
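The span-picking rule above can be sketched as a brute-force search, which is cheap since each start position only needs to scan at most 15 end positions:

```python
def best_span(p_start, p_end, max_len=15):
    """Return (i, j) maximizing p_start[i] * p_end[j] with j - i + 1 <= max_len."""
    best_score, best_pair = -1.0, (0, 0)
    n = len(p_start)
    for i in range(n):
        for j in range(i, min(i + max_len, n)):
            score = p_start[i] * p_end[j]
            if score > best_score:
                best_score, best_pair = score, (i, j)
    return best_pair
```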
Experiments
• CoQA dataset
• Dev set: 100 passages per in-domain topic
• Test set: 100 passages per in-domain and out-of-domain topic
Leaderboard (As of 12/11/2018)
Ablation Studies
Effect of history
Training
Summary
• SDNet leverages MRC, BERT and conversational history
• BERT is important, but adaptation is required to use it as a module
• The next step is open-domain conversational QA, with no passage given, where the model may also ask questions
Introduction to SDRG
• Microsoft Speech and Dialogue Research Group
• Conduct research on speech recognition and conversational AI
• We’re from Stanford, Princeton, Cambridge, IBM, …
• We’re now hiring researchers and machine learning engineers
• We have a world-class research environment:
• DGX-1 machines and large clusters with tons of GPUs
• Welcome to learn more!
Q&A
• Thank you!
• My Email: [email protected]
• More details in our paper:
• SDNet: Contextualized Attention-based Deep Network for Conversational Question Answering, by Chenguang Zhu, Michael Zeng, Xuedong Huang
• https://arxiv.org/abs/1812.03593