Classifying Sentences using Induced Structure
Menno Van Zaanen
Luiz Augusto Pizzato
Diego Mollá-Aliod
Centre for Language Technology
Macquarie University
Sydney, Australia
Van Zaanen, Pizzato, Molla; SPIRE-2005. Buenos Aires, 2-4 November 2005.(2/23)
Overview
• Sentence Classification Problem
• Induced Structure Approach
– Alignment-Based Learning
– Trie-Based Classifier
• Results
• Concluding Remarks
• Future Work
Sentence Classification
• Assists several NLP tasks: document summarisation, information extraction and question answering, among others.
• Question classification:
  • Definition: What is a golden parachute?
  • List: Name two brands of shaving cream.
  • Factoid questions:
    – HUM:IND: Who discovered penicillin?
    – LOC:CITY: What is the capital of Australia?
    – FOOD, PLANT, ANIMAL: What do bats eat?
Current approaches
• Handcrafted regular expressions:
– Pros: Rules are understandable. A few rules cover a large share of the questions (Zipf's Law).
– Cons: Difficult to construct. Limited performance.
• Machine learning:
– Pros: The computer finds the "rules" automatically.
– Cons: The rules and knowledge generated are not readable.
Classifying by Induced Structure
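• The process sits between machine learning (ML) and regular expressions (RE):
– Learn patterns from sentences;
– Use these patterns in the classification phase.
[Diagram: Training Data -> Extract Structure -> Structure; Sentence -> Sentence Classifier (which uses the Structure) -> Class]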
Classifying by Induced Structure
• Two distinct approaches are proposed:
– Alignment-Based Learning (ABL) Classifier
  • ABL is a generic grammatical inference framework that learns structure from plain text.
– Trie-Based Classifier
  • Classifies sentences based on partial matches in a trie structure.
Alignment-Based Learning Classifier (ABL)
• Developed under the idea that constituents in sentences can be interchanged.
– The book is on the table.
– The car is on the driveway.
• With the interchangeable parts bracketed:
– The (book) is on the (table).
– The (car) is on the (driveway).
[Diagram: the shared context "the ... is on the ..." with two interchangeable slots, {book, car} and {table, driveway}]
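The interchangeability idea can be sketched in a few lines of Python. This is only an illustration: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for the alignment step, which is an assumption and not necessarily the alignment algorithm used in ABL itself. The function name `align` is hypothetical.

```python
from difflib import SequenceMatcher

def align(sentence_a, sentence_b):
    """Return pairs of word sequences that differ between two sentences.

    The unequal parts are treated as hypothesised interchangeable
    constituents; the equal parts are the shared context.
    """
    a, b = sentence_a.split(), sentence_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    hypotheses = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            hypotheses.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return hypotheses

pairs = align("The book is on the table", "The car is on the driveway")
print(pairs)
```

For the two example sentences the differing parts come out as book/car and table/driveway, which are the bracketed constituents shown on the slide.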
Alignment-Based Learning Classifier (ABL)
EAT    Questions
DESC   (What) (is (caffeine))
DESC   (What) (is (Teflon))
LOC    (Where) is (Milan)
LOC    What (are the twin cities)

unhypo (pattern, EAT, frequency)
What is .*            DESC   2
What .*               DESC   2
.* is caffeine        DESC   1
.* is Teflon          DESC   1
Where is .*           LOC    1
.* is Milan           LOC    1
What .*               LOC    1

hypo (pattern, EAT, frequency)
caffeine              DESC   1
is caffeine           DESC   1
What                  DESC   2
Teflon                DESC   1
is Teflon             DESC   1
Milan                 LOC    1
Where                 LOC    1
are the twin cities   LOC    1
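A minimal sketch of how the two pattern tables could be derived from the bracketed questions: each bracketed constituent yields one "hypo" pattern (the constituent's words) and one "unhypo" pattern (the question with that constituent replaced by `.*`). The function names and data layout are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter

def constituents(bracketed):
    """Split a bracketed question into its words and constituent spans."""
    words, spans, open_starts = [], [], []
    for token in bracketed.replace("(", " ( ").replace(")", " ) ").split():
        if token == "(":
            open_starts.append(len(words))      # constituent starts here
        elif token == ")":
            spans.append((open_starts.pop(), len(words)))
        else:
            words.append(token)
    return words, spans

def extract_patterns(training):
    """Count (pattern, EAT) pairs for the hypo and unhypo variants."""
    hypo, unhypo = Counter(), Counter()
    for eat, bracketed in training:
        words, spans = constituents(bracketed)
        for start, end in spans:
            hypo[(" ".join(words[start:end]), eat)] += 1
            unhypo[(" ".join(words[:start] + [".*"] + words[end:]), eat)] += 1
    return hypo, unhypo

TRAINING = [
    ("DESC", "(What) (is (caffeine))"),
    ("DESC", "(What) (is (Teflon))"),
    ("LOC", "(Where) is (Milan)"),
    ("LOC", "What (are the twin cities)"),
]
hypo, unhypo = extract_patterns(TRAINING)
for (pattern, eat), freq in unhypo.items():
    print(pattern, eat, freq)
```

Run on the four training questions from the slide, this reproduces the same patterns and frequencies as the two tables.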
Trie-Based Classifier
• $T(S) = \{T(S/a_1), T(S/a_2), \ldots, T(S/a_r)\}$
– where $S$ is the set of sentences and $S/a_n$ is the set of sentences starting with $a_n$, stripped of that initial element.
[Diagram: a character-level trie over the alphabet a to z, with paths spelling "car" and "zebra"]
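The recursive definition can be read almost directly as code. The sketch below is an illustration under simple assumptions: a trie is a nested `dict`, sequences may be strings (character trie) or lists of words (word trie), and `build_trie` is a hypothetical name. It does not mark where a sequence ends; an explicit end symbol is one way to add that.

```python
def build_trie(sequences):
    """T(S): group S by first element a_n and recurse on S/a_n."""
    groups = {}
    for seq in sequences:
        if seq:  # an empty sequence contributes no further branches
            groups.setdefault(seq[0], []).append(seq[1:])
    return {head: build_trie(tails) for head, tails in groups.items()}

char_trie = build_trie(["car", "cat"])
word_trie = build_trie([["who", "is"], ["who", "was"]])
print(char_trie)
print(word_trie)
```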
Trie-Based Classifier
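[Diagram: word-level trie rooted at node 1, ^ (boq). Numbered nodes as they appear on the slide:
  2 where, 3 is, 4 Chile, 5 $ (eoq)
  6 who, 7 is, 8 the, 9 dean, 10 of, 11 ICS, 12 $ (eoq)
  13 J., 14 Smith, 15 $ (eoq)
  16 of, 17 ICS, 18 $ (eoq)
  19 how, 20 far, 21 is, 22 Athens, 23 $ (eoq)
  24 tall, 25 is, 26 Sting, 27 $ (eoq)
Each node stores a table of EAT frequencies, for example:
  node 7:   HUM:IND 1, HUM:DESC 2
  node 18:  HUM:DESC 1]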
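A minimal sketch of a word-level trie whose nodes keep EAT frequency tables, as in the diagram. The training questions, their EAT labels, and the fallback rule (answer with the most frequent EAT at the deepest node reached) are assumptions chosen for illustration; the slides do not spell them out. `TrieNode`, `add_question` and `classify` are hypothetical names.

```python
from collections import Counter

class TrieNode:
    def __init__(self):
        self.children = {}           # word -> TrieNode
        self.eat_counts = Counter()  # EAT -> number of questions passing through

def add_question(root, question, eat):
    """Insert '^ w1 ... wn $' and count the EAT at every node on the path."""
    node = root
    for token in ["^"] + question.split() + ["$"]:
        node = node.children.setdefault(token, TrieNode())
        node.eat_counts[eat] += 1

def classify(root, question):
    """Follow the question as far as the trie allows, then pick the top EAT."""
    node = root
    for token in ["^"] + question.split() + ["$"]:
        child = node.children.get(token)
        if child is None:
            break
        node = child
    if not node.eat_counts:
        return None
    return node.eat_counts.most_common(1)[0][0]

root = TrieNode()
# Illustrative training data; the labels are assumed, not taken from the slides.
add_question(root, "who is the dean of ICS", "HUM:IND")
add_question(root, "who is J. Smith", "HUM:DESC")
add_question(root, "who is Sting", "HUM:DESC")
add_question(root, "how far is Athens", "NUM:DIST")

who_is = root.children["^"].children["who"].children["is"]
print(dict(who_is.eat_counts))
print(classify(root, "who is Ann Lee"))
```

With this toy data, the node reached by "^ who is" holds HUM:IND 1 and HUM:DESC 2, mirroring the table shown for node 7.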
Trie-Based Classifier
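• Look-ahead process:
[Diagram: the question "^ who is prime minister of Australia $" shown against the trie path 1 ^ (boq), 6 who, 7 is, 9 dean, 10 of, 11 ICS, 12 $ (eoq). Two positions are marked "?", alongside the word "the" and node 8 "the".]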
Implementations
• ABL
– Hypo / Unhypo
– Words / POS
– default / prior
• Trie-based
– Strict / Flex
– Words / POS
Implementations
• ABL: Words / POS
EAT    Questions (POS-tagged)
DESC   (What/WP) (is/VBZ (caffeine/NN))
DESC   (What/WP) (is/VBZ (Teflon/NNP))
LOC    (Where/WRB) is/VBZ (Milan/NNP)
LOC    What/WP (are/VBP the/DT twin/JJ cities/NNS)
Implementations
• ABL: default / prior
unhypo (pattern, EAT, frequency)
What is .*       DESC   2
What .*          DESC   2
.* is caffeine   DESC   1
.* is Teflon     DESC   1
Where is .*      LOC    1
.* is Milan      LOC    1
What .*          LOC    1

Example: What is a mobile phone?
default:  4 DESC, 1 LOC
prior:    2 DESC, 1 LOC
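The two scores for "What is a mobile phone?" can be reproduced with a short sketch. The reading used here is inferred only from the numbers on the slide and should be treated as an assumption: "default" sums the frequencies of all matching patterns per class, while "prior" gives each matching pattern a single vote. The helper names are hypothetical.

```python
import re
from collections import Counter

# (pattern, EAT, frequency) rows of the unhypo table
UNHYPO = [
    ("What is .*", "DESC", 2),
    ("What .*", "DESC", 2),
    (".* is caffeine", "DESC", 1),
    (".* is Teflon", "DESC", 1),
    ("Where is .*", "LOC", 1),
    (".* is Milan", "LOC", 1),
    ("What .*", "LOC", 1),
]

def to_regex(pattern):
    """Keep '.*' as a wildcard and escape every other token literally."""
    parts = [".*" if tok == ".*" else re.escape(tok) for tok in pattern.split()]
    return re.compile(" ".join(parts))

def score(question, patterns, method="default"):
    """Per-class score: summed frequencies (default) or one vote per match (prior)."""
    text = question.rstrip("?").strip()
    scores = Counter()
    for pattern, eat, freq in patterns:
        if to_regex(pattern).fullmatch(text):
            scores[eat] += freq if method == "default" else 1
    return scores

default_scores = score("What is a mobile phone?", UNHYPO, "default")
prior_scores = score("What is a mobile phone?", UNHYPO, "prior")
print(default_scores, prior_scores)
```

Under either reading the question is assigned DESC, since DESC has the higher score in both cases.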
Implementations
• Trie-based: Words / POS
[Diagram: the look-ahead example with POS tags. Trie path: 1 ^ (boq), 6 who/WP, 7 is/VBZ, 9 dean/NN, 10 of/IN, 11 ICS/NNP, 12 $ (eoq). Question: ^/boq who/WP is/VBZ prime/JJ minister/NN of/IN Australia/NNP $/eoq. Two positions are marked "?", alongside the/DT and node 8 the/DT.]
Results
                            coarse            fine
                         words   POS      words   POS
Baseline                 0.188   0.188    0.110   0.110
ABL   hypo     default   0.516   0.682    0.336   0.628
               prior     0.554   0.624    0.238   0.472
      unhypo   default   0.652   0.638    0.572   0.558
               prior     0.580   0.594    0.520   0.432
Trie  strict             0.844   0.812    0.738   0.710
      flex               0.850   0.792    0.742   0.692
Concluding Remarks
• Numeric results are not better than those of ML approaches.
• Induced structure can obtain good results without using complex linguistic features.
• These approaches can produce rules in the form of regular expressions that can be manually adjusted to better fit the problem.
Future Work
• Regular expressions can be improved:
– Hand-tuning unique REs found by ABL
– Augmenting the complexity of REs by incorporating extra information
• Wildcard match:
– Words matched by the wildcard tend to be semantically related;
– They seem to be the focus words of the questions.
Review
• Sentence Classification Problem
• Induced Structure Approach
– Alignment-Based Learning
– Trie-Based Classifier
• Results
• Concluding Remarks
• Future Work