

Recursive Neural Networks and Its Applications

LU Yangyang

[email protected]

KERE Seminar, Oct. 29, 2014


Outline

Recursive Neural Networks

RNNs for Factoid Question Answering
∙ RNNs for Quiz Bowl
∙ Experiments

RNNs for Anomalous Event Detection in Newswire
∙ Neural Event Model (NEM)
∙ Experiments


Introduction

Artificial Neural Networks:

∙ For a single neuron with input $x$, output $y$, parameters $W, b$, and activation function $f$:

$z = Wx + b, \quad y = f(z)$

∙ For a simple ANN, layer by layer:

$z^{l} = W^{l} x^{l} + b^{l}, \quad y^{l+1} = f(z^{l})$
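In code, these two equations are a couple of matrix products; a minimal NumPy sketch, with dimensions and the tanh activation chosen arbitrarily for illustration:

```python
import numpy as np

def f(z):
    # Activation function; tanh is a common choice.
    return np.tanh(z)

# Single neuron: z = Wx + b, y = f(z)
x = np.random.randn(4)        # input vector
W = np.random.randn(1, 4)     # weight matrix
b = np.random.randn(1)        # bias
y = f(W @ x + b)

# Simple feedforward ANN: each layer's output is the next layer's input.
layers = [(np.random.randn(5, 4), np.random.randn(5)),
          (np.random.randn(3, 5), np.random.randn(3))]
a = x
for W_l, b_l in layers:
    a = f(W_l @ a + b_l)      # z^l = W^l x^l + b^l, x^{l+1} = f(z^l)
```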


Introduction (cont.)

Using neural networks: learning word vector representations

Word-level Representation → Sentence-level Representation?

One class of solutions:

∙ Composition: using syntactic information


Recursive AutoEncoder¹

Given a sentence $s$, we can obtain its binary parse tree.

∙ Child nodes: $c_1, c_2$

∙ Parent node: $p = f(W_e [c_1; c_2] + b)$

∙ $W_e$: encoding weights, $f$: activation function, $b$: bias

Training: encourage the decoded results to stay close to the original child representations.

¹ R. Socher, E. H. Huang, J. Pennington, A. Y. Ng, and C. D. Manning. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. NIPS'11.
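A minimal NumPy sketch of the encode/decode step for one tree node, assuming a tanh activation and a squared-error reconstruction loss (both common choices; the paper's unfolding variant decodes entire subtrees, which this sketch omits):

```python
import numpy as np

n = 50                                    # embedding dimension (illustrative)
rng = np.random.default_rng(0)
W_e = rng.normal(size=(n, 2 * n)) * 0.1   # encoding weights
b_e = np.zeros(n)
W_d = rng.normal(size=(2 * n, n)) * 0.1   # decoding weights
b_d = np.zeros(2 * n)

def encode(c1, c2):
    # Parent representation: p = f(W_e [c1; c2] + b)
    return np.tanh(W_e @ np.concatenate([c1, c2]) + b_e)

def reconstruction_loss(c1, c2):
    # Decode the parent back into two children and penalize the
    # distance to the original child representations.
    p = encode(c1, c2)
    c1_hat, c2_hat = np.split(W_d @ p + b_d, 2)
    return np.sum((c1 - c1_hat) ** 2) + np.sum((c2 - c2_hat) ** 2)

c1, c2 = rng.normal(size=n), rng.normal(size=n)
print(reconstruction_loss(c1, c2))
```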


Dependency Tree based RNNs²

Given a sentence $s$, we can get its dependency tree. We then add a hidden node for each word node, obtaining the reformed tree $d$.

For each node $h_i$ in the tree $d$:

$$h_i = f(z_i) \quad (1)$$

$$z_i = \frac{1}{\ell(i)} \left( W_v x_i + \sum_{j \in C(i)} \ell(j)\, W_{pos(i,j)}\, h_j \right) \quad (2)$$

where $x_i, h_i, z_i \in \mathbb{R}^n$ and $W_v, W_{pos(i,j)} \in \mathbb{R}^{n \times n}$

$\ell(i)$: the number of leaf nodes under $h_i$

$C(i)$: the set of hidden nodes under $h_i$

$pos(i, j)$: the position of $h_j$ with respect to $h_i$, such as $l_1$ or $r_1$

$W_l = (W_{l_1}, W_{l_2}, \ldots, W_{l_{k_l}}) \in \mathbb{R}^{k_l \times n \times n}$, $W_r = (W_{r_1}, W_{r_2}, \ldots, W_{r_{k_r}}) \in \mathbb{R}^{k_r \times n \times n}$

$k_l, k_r$: the maximum left and right width in the dataset

² R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics'14.
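A sketch of Eqs. (1)-(2) as a recursive traversal. The tree encoding (a dict from node to (child, position) pairs) and the convention that a node's own word contributes one leaf to $\ell(i)$ are illustrative assumptions:

```python
import numpy as np

def node_vector(i, tree, x, W_v, W_pos, f=np.tanh):
    """Compute (h_i, l(i)) per Eqs. (1)-(2).

    tree:  dict node -> list of (child, pos) pairs, pos like 'l1' or 'r1'
    x:     dict node -> word vector in R^n
    W_pos: dict pos -> matrix in R^{n x n}
    """
    z = W_v @ x[i]
    l_i = 1                        # the node's own word counts as one leaf
    for j, pos in tree.get(i, []):
        h_j, l_j = node_vector(j, tree, x, W_v, W_pos, f)
        z += l_j * (W_pos[pos] @ h_j)   # children weighted by leaf counts
        l_i += l_j
    return f(z / l_i), l_i              # normalize by l(i), then apply f
```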


Tasks using RNNs I

∙ R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics.

∙ M. Luong, R. Socher, and C. D. Manning. 2013. Better word representations with recursive neural networks for morphology. In CoNLL.

∙ R. Socher, J. Bauer, C. D. Manning, and A. Y. Ng. 2013a. Parsing with Compositional Vector Grammars. In ACL.

∙ R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng, and C. Potts. 2013d. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.

∙ E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. 2012. Improving Word Representations via Global Context and Multiple Word Prototypes. In ACL.

∙ R. Socher, B. Huval, B. Bhat, C. D. Manning, and A. Y. Ng. 2012a. Convolutional-Recursive Deep Learning for 3D Object Classification. In NIPS.

∙ R. Socher, B. Huval, C. D. Manning, and A. Y. Ng. 2012b. Semantic Compositionality Through Recursive Matrix-Vector Spaces. In EMNLP.


Tasks using RNNs II

∙ R. Socher, E. H. Huang, J. Pennington, A. Y. Ng, and C. D. Manning. 2011a. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In NIPS.

∙ R. Socher, C. Lin, A. Y. Ng, and C. D. Manning. 2011b. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In ICML.

∙ R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. 2011c. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In EMNLP.

∙ R. Socher, C. D. Manning, and A. Y. Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop.

∙ R. Socher and L. Fei-Fei. 2010. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In CVPR.

∙ L-J. Li, R. Socher, and L. Fei-Fei. 2009. Towards total scene understanding: classification, annotation and segmentation in an automatic framework. In CVPR.


RNN for Factoid Question Answering

∙ A Neural Network for Factoid Question Answering over Paragraphs

∙ EMNLP’14

∙ Mohit Iyyer¹, Jordan Boyd-Graber², Leonardo Claudino¹, Richard Socher³, Hal Daumé III¹

¹ University of Maryland, Department of Computer Science and UMIACS
² University of Colorado, Department of Computer Science
³ Stanford University, Department of Computer Science


Introduction

Factoid Question Answering:

∙ Given a description of an entity, identify the person, place, or thing discussed.

Quiz Bowl:

∙ A task: mapping natural language text to entities

∙ A challenging natural language problem with large amounts of diverse and compositional data

(Example question on the slide omitted; its answer: the Holy Roman Empire)

→ QANTA: A question answering neural network with trans-sentential averaging


Quiz Bowl

∙ A Game: Mapping raw text to a large set of well-known entities

∙ Questions: 4-6 sentences

∙ Every sentence in a quiz bowl question is guaranteed to contain clues that uniquely identify its answer, even without the context of previous sentences.

∙ A property called "pyramidality": sentences early in a question contain harder, more obscure clues, while later sentences are "giveaways".

∙ Answering the question correctly requires an actual understanding of the sentence.

∙ Factoid answers: e.g., history questions ask players to identify specific battles, presidents, or events

Solutions: Bag-of-Words vs. Recursive Neural Networks


How to represent question sentences?

For a single sentence:

∙ A sentence → a dependency tree

∙ Each node $n$ is associated with a word $w$, a word vector $x_w \in \mathbb{R}^d$, and a hidden vector $h_n \in \mathbb{R}^d$

∙ Weights:
  $W_r \in \mathbb{R}^{d \times d}$: one matrix per dependency relation $r$
  $W_v \in \mathbb{R}^{d \times d}$: incorporates $x_w$ at a node into the node vector $h_n$

($d = 100$ in the experiments)


How to represent a single sentence?

For any node 𝑛 with children 𝐾(𝑛) and word vector 𝑥𝑤:

$$h_n = f\left( W_v \cdot x_w + b + \sum_{k \in K(n)} W_{R(n,k)} \cdot h_k \right)$$

where $R(n, k)$ is the dependency relation between node $n$ and child node $k$.
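This is the same flavor of recursion as the DT-RNN above, but the composition matrix is selected by dependency relation rather than by child position. A minimal sketch, with `W_R` as a hypothetical dict from relation label (e.g., 'nsubj') to matrix:

```python
import numpy as np

def h(n, children, x_w, W_v, W_R, b, f=np.tanh):
    # h_n = f(W_v x_w + b + sum over children k of W_{R(n,k)} h_k)
    # children: dict node -> list of (child, relation) pairs
    z = W_v @ x_w[n] + b
    for k, rel in children.get(n, []):
        z += W_R[rel] @ h(k, children, x_w, W_v, W_R, b, f)
    return f(z)
```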


Training

Goal: mapping questions to their corresponding answer entities.
A limited number of possible answers → a multi-class classification task

∙ A softmax layer over every node in the tree could predict answers

∙ Observation: most answers are themselves words (features)³ in other questions (e.g., a question on World War II might mention the Battle of the Bulge, and vice versa)

∙ Improving upon the existing DT-RNNs: jointly learning answer and question representations in the same vector space rather than learning them separately

³ Different from the multimodal text-to-image mapping problem


Training (cont.)

Intuition: encourage the vectors of question sentences to be near their correct answers and far away from incorrect answers.

The Error For A Single Sentence:

$$C(S, \theta) = \sum_{s \in S} \sum_{z \in Z} L(rank(c, s, Z)) \cdot \max(0,\ 1 - x_c \cdot h_s + x_z \cdot h_s)$$

where

𝑆 : the set of all nodes in the sentence’s dependency tree

𝑠 : an individual node in 𝑆

𝑐 : the correct answer

𝑍 : the set of randomly selected incorrect answers (|𝑍| = 100 in experiments)

𝑧 : an individual incorrect answer in 𝑍

𝑟𝑎𝑛𝑘(𝑐, 𝑠, 𝑍) : the rank of correct answer 𝑐 with respect to the incorrect answers 𝑍

$$L(r) = \sum_{i=1}^{r} \frac{1}{i}$$
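A sketch of this loss for a single node $s$. Following common WARP-style practice, the rank of the correct answer is approximated by the number of sampled incorrect answers that violate the margin; that approximation is an assumption of this sketch, not necessarily the paper's exact procedure:

```python
import numpy as np

def L(r):
    # Rank weight L(r) = sum_{i=1}^{r} 1/i
    return float(np.sum(1.0 / np.arange(1, r + 1)))

def node_loss(h_s, x_c, X_Z):
    """h_s: hidden vector of node s; x_c: correct answer's vector;
    X_Z: matrix whose rows are the |Z| sampled incorrect answers' vectors."""
    s_c = x_c @ h_s                          # score of the correct answer
    s_z = X_Z @ h_s                          # scores of the incorrect answers
    margins = np.maximum(0.0, 1.0 - s_c + s_z)
    rank = int(np.sum(margins > 0))          # approximate rank(c, s, Z)
    return L(rank) * float(np.sum(margins))
```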


Training (cont.)

The objective function of the whole model:

$$J(\theta) = \frac{1}{N} \sum_{t \in T} C(t, \theta)$$

where

𝑇 : all sentences in the training set

𝑁 : the number of nodes in the training set

$\theta = (W_{r \in R}, W_v, W_e, b)$


From Sentences to Questions

The model just described considers each sentence in a quiz bowl question independently.

Previously heard sentences within the same question contain useful information that we do not want our model to ignore.

Sentence-level representation → larger paragraph-level representation

∙ The simplest and best aggregation method: averaging the representations of each sentence seen so far in a particular question (see the sketch below)

→ QANTA: a question answering neural network with trans-sentential averaging
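The aggregation itself is a running average; a small sketch that yields the question representation at every sentence position, as an incremental quiz bowl player would see it:

```python
import numpy as np

def question_vectors(sentence_reps):
    # sentence_reps: array of shape (num_sentences, d), one DT-RNN root
    # representation per sentence, in the order the sentences are revealed.
    reps = np.asarray(sentence_reps)
    counts = np.arange(1, len(reps) + 1)[:, None]
    return np.cumsum(reps, axis=0) / counts   # row t = mean of sentences 1..t
```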


Datasets

1. Expanding a previous dataset⁴: 46,842 questions in 14 different categories

2. NAQT⁵: 65,212 questions

3. Selected: 21,041 literature and 22,956 history questions (> 40% of the corpus)

4. Only a limited set of the most popular quiz bowl answers is considered

5. Wikipedia titles as training labels: mapping all raw answer strings to a canonical set (via Whoosh⁶)

6. Filtering out all answers that do not occur at least six times: 451 history and 595 literature answers remain, mapping to 4,460 and 5,685 questions respectively (on average 12 questions per answer)

7. Replacing all occurrences of answers in the question text with single entities (NER)

Final datasets (training/test); word embeddings initialized with word2vec:

Category     Questions     Sentences
History      3,761 / 699   14,217 / 2,768
Literature   4,777 / 908   17,972 / 3,577

⁴ Jordan Boyd-Graber, et al. 2012. Besting the quiz master: Crowdsourcing incremental classification games. In EMNLP.

⁵ NAQT runs quiz bowl tournaments and generously shared all of its questions from 1998-2013.

⁶ https://pypi.python.org/pypi/Whoosh/
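Step 6 is a plain frequency filter; a sketch under the assumption that the data is held as (question, answer) pairs:

```python
from collections import Counter

def filter_rare_answers(qa_pairs, min_count=6):
    # Keep only pairs whose answer occurs at least min_count times.
    counts = Counter(answer for _, answer in qa_pairs)
    return [(q, a) for q, a in qa_pairs if counts[a] >= min_count]
```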


Baselines

∙ BOW: a logistic regression classifier trained on binary unigram indicators

∙ BOW-DT: BOW + the feature set with dependency relation indicators

∙ IR-QB: using the state-of-the-art Whoosh IR engine + a KB that contains "pages" associated with each answer

∙ IR-WIKI: IR-QB + Wikipedia KB


Human Comparison

∙ Human records: 1,201 history guesses and 1,715 literature guesses from the 22 quiz bowl players who answered the most questions

∙ Standard quiz bowl scoring: 10 points for a correct answer, −5 points for an incorrect one


Discussion

∙ Making predictions at early sentence positions

∙ Where the Attribute Space Helps Answer Questions

∙ Where all Models Struggle

∙ Visualizing the Attribute Space


Summary

∙ Problem: Factoid Question Answering
∙ Quiz bowl competition: question sentences → entities

∙ Approach: QANTA, a question answering neural network with trans-sentential averaging

∙ A limited number of possible answers → a multi-class classification task

∙ Single-sentence representation: DT-RNN
∙ Jointly learning answer and question representations in the same vector space


RNN for Anomalous Event Detection in Newswire

∙ Modeling Newswire Events using Neural Networks for Anomaly Detection

∙ COLING’14

∙ Pradeep Dasigi, Eduard Hovy

Carnegie Mellon University, Language Technologies Institute


Introduction

Problem: automatic anomalous event detection in newswire (normal vs. anomalous)

∙ What are anomalous events (in newswire)?

Anomalous events are defined as those that are unusual compared to the general state of affairs and might invoke surprise when reported.

∙ Understanding events requires knowledge about the role fillers.

Hypothesis: anomaly is the result of an unexpected or unusual combination of semantic role fillers.
→ Encode the goodness of semantic role filler coherence

∙ Event-level anomaly is not the same as semantic incoherence.

Anomalous events are defined to be the subclass of events that are semantically coherent, but unusual only based on real-world knowledge.


What are Semantic Roles?⁸

Semantic roles (i.e., thematic roles):

∙ Used to indicate the role played by each entity in a sentence

∙ Ranging from very specific to very general.

∙ The entities that are labelled should have participated in an event.

(Shallow-parsing example figure omitted.⁷)

⁷ http://nlp.stanford.edu/projects/shallow-parsing.shtml

⁸ http://language.worldofcomputing.net/semantics/semantic-roles.html


Semantic Roles (Semantic Arguments)⁹

AGENT: one who performs an action. "Joe played well and won the prize."
CAUSE: one that causes something, or a reason for some happening. "Rain makes me happy."
EXPERIENCER: one who experiences. "Johan felt great pain when he heard of the sudden demise of his friend."
BENEFICIARY: one who benefits. "I prayed early in the morning for Susan."
LOCATION: the location. "Steve was swimming in the river."
MANNER: the way in which one behaves. "Tom behaved very gently even when he was insulted."
INSTR: the instrument. "Tom broke the wooden box with the hammer."
FROM-LOC: the source location. "John received the prize from the President."
TO-LOC: the target location. "Susan threw a pen to John."
AT-LOC: the location where something is. "The box contains a ball."
AT-TIME: the time. "I woke up at 5 o'clock to prepare for the examination."

∙ AGENT only: "Joe walked."
∙ AGENT + INSTR: "Joe flies with a parachute."
∙ AGENT + INSTR + BENEFICIARY: "Joe flies with a parachute for charity."

⁹ http://language.worldofcomputing.net/semantics/semantic-roles.html


Neural Event Model (NEM)

An event is the pair $(V, A)$:

∙ $V$: the predicate or semantic verb¹⁰

∙ $A$: the set of its semantic arguments¹¹

→ Learning event embeddings explicitly guided by the semantic role structure

Neural Event Model: a Recursive AutoEncoder over a binary tree

¹⁰ E.g., "attacks" in "Terrorist attacks on the World Trade Center..."

¹¹ E.g., agent, patient, time, location
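A sketch of how such an encoder could compose the pair $(V, A)$ with RAE-style binary composition. The left-to-right folding order here is an illustrative simplification; NEM structures the tree by the semantic roles themselves:

```python
import numpy as np

def compose(a, b, W, bias):
    # RAE-style binary composition: f(W [a; b] + bias)
    return np.tanh(W @ np.concatenate([a, b]) + bias)

def event_vector(verb_vec, arg_vecs, W, bias):
    # Fold the predicate V and its semantic arguments A into one vector.
    e = verb_vec
    for a in arg_vecs:
        e = compose(e, a, W, bias)
    return e
```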


Training

∙ Unsupervised: argument composition

∙ Trained in a contrastive estimation fashion, with
  $s$: the original entire argument
  $s_c$: a corrupted argument, obtained by randomly replacing one of the words in the argument at a time
  $V$: the set of representations of all the words in the vocabulary

∙ Supervised: event composition

∙ Labeled training set: whether the event is normal or anomalous, with
  $k$: the number of semantic arguments per event
  $L_{event}$: the label operator
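The contrastive objective itself appeared as a formula on the original slide and is not reproduced in this transcript; the following sketch only illustrates the idea described above, with the margin of 1 and the scalar `score` interface as assumptions:

```python
import numpy as np

def contrastive_loss(score, s_words, vocab_vectors, rng=np.random.default_rng(0)):
    """score: function mapping a list of word vectors to a coherence scalar.
    Corrupt argument s by swapping one word for a random vocabulary word,
    then ask the original to outscore the corrupted version by a margin."""
    i = rng.integers(len(s_words))
    corrupted = list(s_words)
    corrupted[i] = vocab_vectors[rng.integers(len(vocab_vectors))]
    return max(0.0, 1.0 - score(s_words) + score(corrupted))
```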


Datasets

Event Extraction:

∙ Using the Semantic Role Labeling (SRL) tool in SENNA¹²,¹³

∙ Considering only the roles A0 (AGENT), A1 (PATIENT), AM-TMP (TIME), and AM-LOC (LOCATION) as the arguments of the events

Data:

∙ NBC Weird Events (NWE): crawling 3,684 "weird news" headlines publicly available on the NBC News website¹⁴ → 4,271 events extracted by SENNA → 3,771 used as negative training data

∙ Gigaword Events (GWE): extracting events from headlines in the AFE section of Gigaword, and sampling roughly 3,771 GWE events as positive data

∙ Argument composition: 100k whole sentences from AFE headlines and from the weird-news headlines from which NWE was extracted

Word embedding initialization: SENNA

¹² Using the tags from CoNLL 2005: http://www.lsi.upc.edu/~srlconll/

¹³ http://ml.nec-labs.com/senna/

¹⁴ http://www.nbcnews.com


Evaluation

Test set annotation: 1,003 events posted as HITs on AMT¹⁵

∙ 3-way: highly unusual, strange, normal

∙ 4-way: highly unusual, strange, normal and cannot say

True label predictions: generated by MACE¹⁶

∙ Merging the two anomaly classes → binary classification

¹⁵ Human Intelligence Tasks on Amazon Mechanical Turk

¹⁶ Dirk Hovy, et al. 2013. Learning whom to trust with MACE. In NAACL-HLT.


Summary

∙ Problem: automatic anomalous event detection in newswire (normal vs. anomalous)

∙ Approach:
  ∙ Semantic Role Labeling: event extraction and semantic structure
  ∙ Neural Event Model: Recursive AutoEncoder (binary tree)
  ∙ Unsupervised argument composition training + supervised event composition training

∙ Data: NBC weird news + Gigaword AFE section