TRANSCRIPT
Textual Entailment
Dan Roth, University of Illinois, Urbana-Champaign, USA
ACL 2007
Ido Dagan, Bar-Ilan University, Israel
Fabio Massimo Zanzotto, University of Rome, Italy
Page 2
Outline
1. Motivation and Task Definition
2. A Skeletal Review of Textual Entailment Systems
3. Knowledge Acquisition Methods
4. Applications of Textual Entailment
5. A Textual Entailment View of Applied Semantics
Page 3
I. Motivation and Task Definition
Page 4
Motivation
Text applications require semantic inference
A common framework for applied semantics is needed, but still missing
Textual entailment may provide such framework
Page 5
Desiderata for Modeling Framework
A framework for a target level of language processing should provide:
1) A generic (feasible) module for applications
2) A unified (agreeable) paradigm for investigating language phenomena
Most semantics research is scattered:
WSD, NER, SRL, lexical semantic relations… (e.g. vs. syntax)
Dominating approach: interpretation
Page 6
Natural Language and Meaning
Meaning
Language
Ambiguity
Variability
Page 7
Variability of Semantic Expression
Model variability as relations between text expressions:
Equivalence: text1 ⇔ text2 (paraphrasing)
Entailment: text1 ⇒ text2 (the general case)
Dow ends up
Dow climbs 255
The Dow Jones Industrial Average closed up 255
Stock market hits a record high
Dow gains 255 points
Page 8
Typical Application Inference: Entailment
Question: Who bought Overture? >> Expected answer form: X bought Overture
text: “Overture’s acquisition by Yahoo” entails hypothesized answer: “Yahoo bought Overture”
Similar for IE: X acquire Y
Similar for “semantic” IR: t: Overture was bought for …
Summarization (multi-document) – identify redundant info
MT evaluation (and recent ideas for MT)
Educational applications
Page 9
KRAQ'05 Workshop - KNOWLEDGE and REASONING for ANSWERING QUESTIONS (IJCAI-05)
CFP: Reasoning aspects:
* information fusion
* search criteria expansion models
* summarization and intensional answers
* reasoning under uncertainty or with incomplete knowledge
Knowledge representation and integration:
* levels of knowledge involved (e.g. ontologies, domain knowledge)
* knowledge extraction models and techniques to optimize response accuracy
… but similar needs for other applications – can entailment provide a common empirical framework?
Page 10
Classical Entailment Definition
Chierchia & McConnell-Ginet (2001): A text t entails a hypothesis h if h is true in every circumstance (possible world) in which t is true
Strict entailment - doesn't account for some uncertainty allowed in applications
Page 11
“Almost certain” Entailments
t: The technological triumph known as GPS … was incubated in the mind of Ivan Getting.
h: Ivan Getting invented the GPS.
Page 12
Applied Textual Entailment
A directional relation between two text fragments: Text (t) and Hypothesis (h):
t entails h (t ⇒ h) if humans reading t will infer that h is most likely true
Operational (applied) definition:
Human gold standard – as in NLP applications
Assuming common background knowledge – which is indeed expected from applications
Page 13
Probabilistic Interpretation
Definition: t probabilistically entails h if:
P(h is true | t) > P(h is true)
t increases the likelihood of h being true
≡ positive PMI – t provides information on h’s truth
P(h is true | t): entailment confidence – the relevant entailment score for applications
In practice: “most likely” entailment expected
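The probabilistic definition above can be illustrated with a toy computation; the counts below are hypothetical, used purely to show the inequality being tested:

```python
# Sketch (not from the tutorial): checking the probabilistic entailment
# condition P(h is true | t) > P(h is true) from hypothetical counts.

def probabilistically_entails(n_h_true, n_total, n_h_true_given_t, n_t):
    """Return True if observing t raises the estimated probability of h."""
    p_h = n_h_true / n_total              # prior estimate P(h is true)
    p_h_given_t = n_h_true_given_t / n_t  # conditional estimate P(h is true | t)
    return p_h_given_t > p_h

# Hypothetical counts: h holds in 300 of 1000 circumstances overall,
# but in 90 of the 100 circumstances where t holds.
print(probabilistically_entails(300, 1000, 90, 100))  # True: t raises P(h)
```

The conditional estimate `p_h_given_t` is exactly the entailment confidence the slide refers to.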
Page 14
The Role of Knowledge
For textual entailment to hold we require: text AND knowledge ⇒ h, but knowledge alone should not entail h.
Systems are not supposed to validate h’s truth regardless of t (e.g. by searching for h on the web).
Page 15
PASCAL Recognizing Textual Entailment (RTE) Challenges
EU FP-6 Funded PASCAL Network of Excellence 2004-7
Bar-Ilan University; ITC-irst and CELCT, Trento; MITRE; Microsoft Research
Page 16
Generic Dataset by Application Use
7 application settings in RTE-1, 4 in RTE-2/3: QA, IE, “semantic” IR, comparable documents / multi-doc summarization, MT evaluation, reading comprehension, paraphrase acquisition
Most data created from actual applications output
RTE-2/3: 800 examples in development and test sets
50-50% YES/NO split
Page 17
RTE Examples
# | TEXT | HYPOTHESIS | TASK | ENTAILMENT
1 | Regan attended a ceremony in Washington to commemorate the landings in Normandy. | Washington is located in Normandy. | IE | False
2 | Google files for its long awaited IPO. | Google goes public. | IR | True
3 | …: a shootout at the Guadalajara airport in May, 1993, that killed Cardinal Juan Jesus Posadas Ocampo and six others. | Cardinal Juan Jesus Posadas Ocampo died in 1993. | QA | True
4 | The SPD got just 21.5% of the vote in the European Parliament elections, while the conservative opposition parties polled 44.5%. | The SPD is defeated by the opposition parties. | IE | True
Page 18
Participation and Impact
Very successful challenges, world wide:
RTE-1 – 17 groups
RTE-2 – 23 groups (~150 downloads)
RTE-3 – 25 groups; joint workshop at ACL-07
High interest in the research community:
Papers, conference sessions and areas, PhD’s, influence on funded projects
Textual Entailment special issue at JNLE
ACL-07 tutorial
Page 19
Methods and Approaches (RTE-2)
Measure similarity match between t and h (coverage of h by t):
Lexical overlap (unigram, N-gram, subsequence)
Lexical substitution (WordNet, statistical)
Syntactic matching/transformations
Lexical-syntactic variations (“paraphrases”)
Semantic role labeling and matching
Global similarity parameters (e.g. negation, modality)
Cross-pair similarity
Detect mismatch (for non-entailment)
Interpretation to logic representation + logic inference
Page 20
Dominant approach: Supervised Learning
Features model similarity and mismatch
Classifier determines relative weights of information sources
Train on development set and auxiliary t-h corpora
Pipeline: (t, h) → similarity features (lexical, n-gram, syntactic, semantic, global) → feature vector → classifier → YES / NO
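The dominant supervised pipeline — similarity features feeding a trained classifier that outputs YES/NO — can be sketched end to end. The feature set and perceptron learner below are illustrative stand-ins, not any participating system; the toy pairs echo examples from earlier slides:

```python
# Sketch of the supervised RTE pipeline: (t, h) -> features -> classifier.
# The single overlap feature and the perceptron are illustrative only.

def features(t, h):
    t_w, h_w = set(t.lower().split()), set(h.lower().split())
    overlap = len(t_w & h_w) / len(h_w)   # lexical coverage of h by t
    return [overlap, 1.0]                 # plus a bias term

def train(pairs, labels, epochs=20, lr=0.5):
    """Perceptron: learn relative weights of the feature(s) from examples."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for (t, h), y in zip(pairs, labels):
            x = features(t, h)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            for i in range(len(w)):
                w[i] += lr * (y - pred) * x[i]
    return w

def classify(w, t, h):
    x = features(t, h)
    return "YES" if sum(wi * xi for wi, xi in zip(w, x)) > 0 else "NO"

# Toy development set (labels: 1 = entails, 0 = does not entail).
pairs = [("Yahoo bought Overture last year", "Yahoo bought Overture"),
         ("Dow gains 255 points", "Dow climbs 255"),
         ("Regan attended a ceremony in Washington",
          "Washington is located in Normandy")]
labels = [1, 1, 0]
w = train(pairs, labels)
```

The learned weights effectively set an overlap threshold, which is exactly why purely lexical systems struggle with paraphrases like “Overture’s acquisition by Yahoo” vs. “Yahoo bought Overture”.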
Page 21
RTE-2 Results
First Author (Group) | Accuracy | Average Precision
Hickl (LCC) | 75.4% | 80.8%
Tatu (LCC) | 73.8% | 71.3%
Zanzotto (Milan & Rome) | 63.9% | 64.4%
Adams (Dallas) | 62.6% | 62.8%
Bos (Rome & Leeds) | 61.6% | 66.9%
11 groups | 58.1%–60.5% |
7 groups | 52.9%–55.6% |
Average: 60%; Median: 59%
Page 22
Analysis
For the first time, methods that carry some deeper analysis seemed (?) to outperform shallow lexical methods.
Cf. Kevin Knight’s invited talk at EACL-06, titled:
“Isn’t Linguistic Structure Important, Asked the Engineer”
Still, most systems that do utilize deep analysis did not score significantly better than the lexical baseline.
Page 23
Why?
System reports point at:
Lack of knowledge (syntactic transformation rules, paraphrases, lexical relations, etc.)
Lack of training data
It seems that systems that coped better with these issues performed best:
Hickl et al. – acquisition of large entailment corpora for training
Tatu et al. – large knowledge bases (linguistic and world knowledge)
Page 24
Some suggested research directions
Knowledge acquisition:
Unsupervised acquisition of linguistic and world knowledge from general corpora and web
Acquiring larger entailment corpora
Manual resources and knowledge engineering
Inference:
Principled framework for inference and fusion of information levels
Are we happy with bags of features?
Page 25
Complementary Evaluation Modes
“Seek” mode:
Input: h and corpus; Output: all entailing t’s in corpus
Captures information seeking needs, but requires post-run annotation (TREC-style)
Entailment subtasks evaluations:
Lexical, lexical-syntactic, logical, alignment…
Contribution to various applications:
QA – Harabagiu & Hickl, ACL-06; RE – Romano et al., EACL-06
Page 26
II. A Skeletal review of Textual Entailment Systems
Page 27
Textual Entailment
T: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year
H: Yahoo acquired Overture
T entails H; H is subsumed by T together with knowledge such as: Overture is a search company; Google is a search company; …; Google owns Overture
Machinery involved: phrasal verb paraphrasing, entity matching, semantic role labeling, alignment, integration
How?
Page 28
A general Strategy for Textual Entailment
Given a sentence T and a sentence H:
Re-represent T and re-represent H (lexically, syntactically, semantically), drawing on a knowledge base of semantic, structural & pragmatic transformations/rules.
Find the set of transformations/features of the new representation (or: use these to create a cost function) that allows embedding of H in T.
Decision.
Page 29
Details of The Entailment Strategy
Preprocessing: multiple levels of lexical pre-processing; syntactic parsing; shallow semantic parsing; annotating semantic phenomena
Representation: bag of words, n-grams, through tree/graph-based representations; logical representations
Knowledge Sources: syntactic mapping rules; lexical resources; semantic-phenomena-specific modules; RTE-specific knowledge sources; additional corpora/web resources
Control Strategy & Decision Making: single-pass/iterative processing; strict vs. parameter-based
Justification: what can be said about the decision?
Page 30
The Case of Shallow Lexical Approaches
Preprocessing: identify stop words
Representation: bag of words
Knowledge Sources: shallow lexical resources – typically WordNet
Control Strategy & Decision Making: single pass; compute similarity; use a threshold tuned on a development set (could be per task)
Justification: it works
Page 31
Shallow Lexical Approaches (Example)
Lexical/word-based semantic overlap: score based on matching each word in H with some word in T
Word similarity measure: may use WordNet
May take account of subsequences, word order
‘Learn’ threshold on maximum word-based match score
Text: The Cassini spacecraft has taken images that show rivers on Saturn’s moon Titan.
Hyp: The Cassini spacecraft has reached Titan.
Text: NASA’s Cassini-Huygens spacecraft traveled to Saturn in 2006.
Text: The Cassini spacecraft arrived at Titan in July, 2006.
Clearly, this may not appeal to what we think of as understanding, and it is easy to generate cases for which this does not work well.
However, it works (surprisingly) well with respect to current evaluation metrics (data sets?)
Page 32
An Algorithm: LocalLexicalMatching
For each word in Hypothesis, Text:
  if word matches a stopword, remove word
if no words left in Hypothesis or Text, return 0
numberMatched = 0
for each word W_H in Hypothesis:
  HYP_LEMMAS = Lemmatize(W_H)
  for each word W_T in Text:
    TEXT_LEMMAS = Lemmatize(W_T)
    if any term in HYP_LEMMAS matches any term in TEXT_LEMMAS using LexicalCompare() (which uses WordNet):
      numberMatched++
Return: numberMatched / |HYP_LEMMAS|
Page 33
An Algorithm: LocalLexicalMatching (Cont.)
LexicalCompare():
  if (LEMMA_H == LEMMA_T) return TRUE
  if (HypernymDistanceFromTo(textWord, hypothesisWord) <= 3) return TRUE
  if (MeronymyDistanceFromTo(textWord, hypothesisWord) <= 3) return TRUE
  if (MemberOfDistanceFromTo(textWord, hypothesisWord) <= 3) return TRUE
  if (SynonymOf(textWord, hypothesisWord)) return TRUE
Notes:
LexicalCompare is asymmetric & makes use of a single relation type at a time
Additional differences could be attributed to the stop word list (e.g., including aux verbs)
Straightforward improvements such as bi-grams do not help.
More sophisticated lexical knowledge (entities; time) should help.
LLM Performance: RTE-2: Dev: 63.00, Test: 60.50; RTE-3: Dev: 67.50, Test: 65.63
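The LLM pseudocode above can be made concrete. In this sketch the lemmatizer is a toy lookup table and `RELATED` is a hypothetical stand-in for the WordNet relations behind LexicalCompare(); a real system would query WordNet itself:

```python
# Runnable sketch of LocalLexicalMatching (LLM). The lemmatizer and the
# RELATED table are toy stand-ins for morphological analysis and WordNet.

STOPWORDS = {"the", "a", "an", "has", "have", "is", "was", "of", "at", "in"}
LEMMAS = {"arrived": "arrive", "reached": "reach", "taken": "take"}
RELATED = {"arrive": {"reach"}, "buy": {"purchase", "acquire"}}  # hypothetical

def lemmatize(word):
    """Toy lemmatizer: lowercase, strip punctuation, look up known forms."""
    w = word.lower().strip(".,")
    return LEMMAS.get(w, w)

def lexical_compare(text_lemma, hyp_lemma):
    """Asymmetric match: identity, or a WordNet-style relation text -> hyp."""
    return hyp_lemma == text_lemma or hyp_lemma in RELATED.get(text_lemma, set())

def llm_score(text, hypothesis):
    t = [lemmatize(w) for w in text.split() if w.lower().strip(".,") not in STOPWORDS]
    h = [lemmatize(w) for w in hypothesis.split() if w.lower().strip(".,") not in STOPWORDS]
    if not t or not h:
        return 0.0
    matched = sum(1 for wh in h if any(lexical_compare(wt, wh) for wt in t))
    return matched / len(h)

hyp = "The Cassini spacecraft has reached Titan."
s1 = llm_score("The Cassini spacecraft arrived at Titan in July, 2006.", hyp)
s2 = llm_score("NASA's Cassini-Huygens spacecraft traveled to Saturn in 2006.", hyp)
```

With a tuned threshold between `s2` and `s1`, the first text would be accepted and the second rejected, mirroring the slide's examples.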
Page 34
Details of The Entailment Strategy (Again)
Preprocessing: multiple levels of lexical pre-processing; syntactic parsing; shallow semantic parsing; annotating semantic phenomena
Representation: bag of words, n-grams, through tree/graph-based representations; logical representations
Knowledge Sources: syntactic mapping rules; lexical resources; semantic-phenomena-specific modules; RTE-specific knowledge sources; additional corpora/web resources
Control Strategy & Decision Making: single-pass/iterative processing; strict vs. parameter-based
Justification: what can be said about the decision?
Page 35
Preprocessing
Syntactic Processing: syntactic parsing (Collins; Charniak; CCG); dependency parsing (+types)
Lexical Processing: tokenization; lemmatization; phrasal verbs; idiom processing; named entities + normalization; date/time arguments + normalization (often used only during decision making)
Semantic Processing: semantic role labeling; nominalization; modality/polarity/factives; co-reference (often used only during decision making; only a few systems)
Page 36
Details of The Entailment Strategy (Again)
Preprocessing: multiple levels of lexical pre-processing; syntactic parsing; shallow semantic parsing; annotating semantic phenomena
Representation: bag of words, n-grams, through tree/graph-based representations; logical representations
Knowledge Sources: syntactic mapping rules; lexical resources; semantic-phenomena-specific modules; RTE-specific knowledge sources; additional corpora/web resources
Control Strategy & Decision Making: single-pass/iterative processing; strict vs. parameter-based
Justification: what can be said about the decision?
Page 37
Basic Representations
Raw Text → Local Lexical → Syntactic Parse → Semantic Representation → Logical Forms → Meaning Representation
(a spectrum of representations for Textual Entailment: what the representation does not capture must be handled by inference)
Most approaches augment the basic structure defined by the processing level with additional annotation and make use of a tree/graph/frame-based system.
Page 38
Basic Representations (Syntax)
Local Lexical
Syntactic Parse
Hyp: The Cassini spacecraft has reached Titan.
Page 39
Basic Representations (Shallow Semantics: Pred-Arg )
T: The government purchase of the Roanoke building, a former prison, took place in 1902.
H: The Roanoke building, which was a former prison, was bought by the government in 1902.
T frames:
  PRED: take place; ARG: “The govt. purchase … prison”; AM_TMP: “in 1902”
  PRED: purchase (nominal); ARG_1: “The Roanoke building”
H frames:
  PRED: buy; ARG_0: “The government”; ARG_1: “The Roanoke … prison”; AM_TMP: “In 1902”
  PRED: be; ARG_1: “The Roanoke building”; ARG_2: “a former prison”
[Roth & Sammons ’07]
Page 40
Basic Representations (Logical Representation)
[Bos & Markert] The semantic representation language is a first-order fragment of the language used in Discourse Representation Theory (DRT), conveying argument structure with a neo-Davidsonian analysis and including the recursive DRS structure to cover negation, disjunction, and implication.
Page 41
Representing Knowledge Sources
Rather straightforward in the logical framework.
Tree/graph-based representations may also use rule-based transformations to encode different kinds of knowledge, sometimes represented as generic or knowledge-based tree transformations.
Page 42
Representing Knowledge Sources (cont.)
In general, there is a mix of procedural and rule-based encodings of knowledge sources.
Done by hanging more information on the parse tree or predicate-argument representation [example from LCC’s system],
or via different frame-based annotation systems for encoding information, which are processed procedurally.
Page 43
Details of The Entailment Strategy (Again)
Preprocessing: multiple levels of lexical pre-processing; syntactic parsing; shallow semantic parsing; annotating semantic phenomena
Representation: bag of words, n-grams, through tree/graph-based representations; logical representations
Knowledge Sources: syntactic mapping rules; lexical resources; semantic-phenomena-specific modules; RTE-specific knowledge sources; additional corpora/web resources
Control Strategy & Decision Making: single-pass/iterative processing; strict vs. parameter-based
Justification: what can be said about the decision?
Page 44
Knowledge Sources
The knowledge sources available to the system are the most significant component of supporting TE.
Different systems draw the line between preprocessing capabilities and knowledge resources differently.
The way resources are handled also differs across approaches.
Page 45
Enriching Preprocessing
In addition to syntactic parsing, several approaches enrich the representation with various linguistic resources:
POS tagging
Stemming
Predicate-argument representation: verb predicates and nominalization
Entity annotation: stand-alone NERs with a variable number of classes
Acronym handling and entity normalization: mapping mentions of the same entity expressed in different ways to a single ID
Co-reference resolution
Dates, times and numeric values: identification and normalization
Identification of semantic relations: complex nominals, genitives, adjectival phrases, and adjectival clauses
Event identification and frame construction
Page 46
Lexical Resources
Recognizing that a word or a phrase in T entails a word or a phrase in H is essential in determining Textual Entailment.
WordNet is the most commonly used resource. In most cases, a WordNet-based similarity measure between words is used. This is typically a symmetric relation.
Lexical chains over WordNet are used; in some cases, care is taken to disallow some chains of specific relations.
Extended WordNet is used to exploit the derivation relation, which links verbs with their corresponding nominalized nouns.
Page 47
Lexical Resources (Cont.)
Lexical Paraphrasing Rules
A number of efforts to acquire relational paraphrase rules are under way, and several systems are making use of resources such as DIRT and TEASE.
Some systems seem to have acquired paraphrase rules that are in the RTE corpus:
person killed → claimed one life
hand reins over to → give starting job to
same-sex marriage → gay nuptials
cast ballots in the election → vote
dominant firm → monopoly power
death toll → kill
try to kill → attack
lost their lives → were killed
left people dead → people were killed
Page 48
Semantic Phenomena
A large number of semantic phenomena have been identified as significant to Textual Entailment.
A large number of them are being handled (in a restricted way) by some of the systems. Very little per-phenomenon quantification has been done, if at all.
Semantic implications of interpreting syntactic structures [Braz et al. ’05; Bar-Haim et al. ’07]:
Conjunctions: Jake and Jill ran up the hill ⇒ Jake ran up the hill; Jake and Jill met on the hill ⇏ *Jake met on the hill
Clausal modifiers: But celebrations were muted as many Iranians observed a Shi'ite mourning month. ⇒ Many Iranians observed a Shi'ite mourning month. (Semantic Role Labeling handles this phenomenon automatically)
Page 49
Semantic Phenomena (Cont.)
Relative clauses: The assailants fired six bullets at the car, which carried Vladimir Skobtsov. ⇒ The car carried Vladimir Skobtsov. (Semantic Role Labeling handles this phenomenon automatically)
Appositives: Frank Robinson, a one-time manager of the Indians, has the distinction for the NL. ⇒ Frank Robinson is a one-time manager of the Indians.
Passive: We have been approached by the investment banker. ⇒ The investment banker approached us. (Semantic Role Labeling handles this phenomenon automatically)
Genitive modifier: Malaysia's crude palm oil output is estimated to have risen. ⇒ The crude palm oil output of Malaysia is estimated to have risen.
Page 50
Logical Structure
Factivity: uncovering the context in which a verb phrase is embedded
  The terrorists tried to enter the building. ⇏ The terrorists entered the building.
Polarity: negative markers or a negation-denoting verb (e.g. deny, refuse, fail)
  The terrorists failed to enter the building. ⇏ The terrorists entered the building.
Modality/Negation: dealing with modal auxiliary verbs (can, must, should), which modify verbs’ meanings, and with the identification of the scope of negation.
Superlatives/Comparatives/Monotonicity: inflecting adjectives or adverbs.
Quantifiers, determiners and articles
Page 51
Some Examples [Braz et al., IJCAI workshop ’05; PARC Corpus]
T: Legally, John could drive. H: John drove.
T: Bush said that Khan sold centrifuges to North Korea. H: Centrifuges were sold to North Korea.
T: No US congressman visited Iraq until the war. H: Some US congressmen visited Iraq before the war.
T: The room was full of women. H: The room was full of intelligent women.
T: The New York Times reported that Hanssen sold FBI secrets to the Russians and could face the death penalty. H: Hanssen sold FBI secrets to the Russians.
T: All soldiers were killed in the ambush. H: Many soldiers were killed in the ambush.
Page 52
Details of The Entailment Strategy (Again)
Preprocessing: multiple levels of lexical pre-processing; syntactic parsing; shallow semantic parsing; annotating semantic phenomena
Representation: bag of words, n-grams, through tree/graph-based representations; logical representations
Knowledge Sources: syntactic mapping rules; lexical resources; semantic-phenomena-specific modules; RTE-specific knowledge sources; additional corpora/web resources
Control Strategy & Decision Making: single-pass/iterative processing; strict vs. parameter-based
Justification: what can be said about the decision?
Page 53
Control Strategy and Decision Making
Single iteration:
Strict logical approaches are, in principle, a single-stage computation. The pair is processed and transformed into logical form. Existing theorem provers act on the pair along with the KB.
Multiple iterations:
Graph-based algorithms are typically iterative. Following [Punyakanok et al. ’04], transformations are applied and an entailment test is done after each transformation is applied.
Transformations can be chained, but sometimes the order makes a difference. The algorithm can be greedy, or can be more exhaustive and search for the best path found [Braz et al. ’05; Bar-Haim et al. ’07]
Page 54
Transformation Walkthrough [Braz et. al’05]
T: The government purchase of the Roanoke building, a former prison, took place in 1902.
H: The Roanoke building, which was a former prison, was bought by the government in 1902.
Does ‘H’ follow from ‘T’?
Page 55
Transformation Walkthrough (1)
T: The government purchase of the Roanoke building, a former prison, took place in 1902.
H: The Roanoke building, which was a former prison, was bought by the government in 1902.
T frames:
  PRED: take place; ARG: “The govt. purchase … prison”; AM_TMP: “in 1902”
  PRED: purchase (nominal); ARG_1: “The Roanoke building”
H frames:
  PRED: buy; ARG_0: “The government”; ARG_1: “The Roanoke … prison”; AM_TMP: “In 1902”
  PRED: be; ARG_1: “The Roanoke building”; ARG_2: “a former prison”
Page 56
Transformation Walkthrough (2)
T: The government purchase of the Roanoke building, a former prison, took place in 1902.
⇒ (Phrasal Verb Rewriter) The government purchase of the Roanoke building, a former prison, occurred in 1902.
H: The Roanoke building, which was a former prison, was bought by the government in 1902.
New T frame: PRED: occur; ARG: “The govt. purchase … prison”; AM_TMP: “in 1902”
Page 57
Transformation Walkthrough (3)
T: The government purchase of the Roanoke building, a former prison, occurred in 1902.
⇒ (Nominalization Promoter) The government purchased the Roanoke building in 1902.
H: The Roanoke building, which was a former prison, was bought by the government in 1902.
New T frame: PRED: purchase; ARG_0: “The government”; ARG_1: “the Roanoke building, a former prison”; AM_TMP: “In 1902”
NOTE: depends on earlier transformation: order is important!
Page 58
Transformation Walkthrough (4)
T: The government purchase of the Roanoke building, a former prison, occurred in 1902.
⇒ (Apposition Rewriter) The Roanoke building be a former prison.
H: The Roanoke building, which was a former prison, was bought by the government in 1902.
New T frame: PRED: be; ARG_1: “The Roanoke building”; ARG_2: “a former prison”
Page 59
Transformation Walkthrough (5)
T: The government purchase of the Roanoke building, a former prison, took place in 1902.
H: The Roanoke building, which was a former prison, was bought by the government in 1902.
The derived T frames now align with H’s frames via WordNet (purchase ⇒ buy):
  T: PRED: purchase; ARG_0: “The government”; ARG_1: “The Roanoke … prison”; AM_TMP: “In 1902” ↔ H: PRED: buy; ARG_0: “The government”; ARG_1: “The Roanoke … prison”; AM_TMP: “In 1902”
  T: PRED: be; ARG_1: “The Roanoke building”; ARG_2: “a former prison” ↔ H: PRED: be; ARG_1: “The Roanoke building”; ARG_2: “a former prison”
Page 60
Characteristics
Multiple paths ⇒ optimization problem: shortest or highest-confidence path through transformations
Order is important; may need to explore different orderings
Module dependencies are ‘local’; module B does not need access to module A’s KB/inference, only its output
If the outcome is “true”, the (optimal) set of transformations and local comparisons forms a proof
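The iterative control strategy (apply a transformation, test whether H is embedded in T, repeat) can be sketched over a simplified fact-set representation. The two rules below mirror the walkthrough's Nominalization Promoter and Apposition Rewriter, but the representation and rule encodings are illustrative assumptions, not the actual system's:

```python
# Sketch: greedy transformation chaining with an embedding test after each
# step. T and H are sets of simplified predicate-argument facts (toy format).

T = {("take_place", "purchase(government, roanoke_building)", "1902"),
     ("apposition", "roanoke_building", "former_prison")}
H = {("buy", "government", "roanoke_building", "1902"),
     ("be", "roanoke_building", "former_prison")}

def nominalization_promoter(facts):
    """A purchase-nominal inside a take-place frame -> an explicit buy event."""
    out = set(facts)
    for f in facts:
        if f[0] == "take_place" and f[1].startswith("purchase("):
            args = f[1][len("purchase("):-1].split(", ")
            out.add(("buy", args[0], args[1], f[2]))
    return out

def apposition_rewriter(facts):
    """apposition(X, Y) -> be(X, Y)."""
    return set(facts) | {("be", f[1], f[2]) for f in facts if f[0] == "apposition"}

def entails(t_facts, h_facts, rules):
    """Greedily chain rules, testing embedding of H in T after each step."""
    for rule in rules:
        if h_facts <= t_facts:       # H already embedded in T
            break
        t_facts = rule(t_facts)      # order of application can matter
    return h_facts <= t_facts

print(entails(T, H, [nominalization_promoter, apposition_rewriter]))  # True
```

Dropping the Nominalization Promoter from the rule list makes the test fail, which illustrates why the rule KB, and sometimes rule order, determines what the system can prove.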
Page 61
Summary: Control Strategy and Decision Making
Despite the appeal of the strict logical approaches, as of today they do not work well enough.
Bos & Markert: the strict logical approach falls significantly behind good LLMs and multiple levels of lexical pre-processing. Only incorporating rather shallow features and using them in the evaluation saves this approach.
Braz et al.: the strict graph-based representation does not do as well as LLM.
Tatu et al.: results show that the strict logical approach is inferior to LLMs, but when put together, it produces some gain.
Using machine learning methods as a way to combine systems and multiple features has been found very useful.
Page 62
Hybrid/Ensemble Approaches
Bos et al.: use theorem prover and model builder
  Expand models of T, H using model builder, check sizes of models
  Test consistency of background knowledge with T, H
  Try to prove entailment with and without background knowledge
Tatu et al. (2006) use an ensemble approach:
  Create two logical systems, one lexical alignment system
  Combine system scores using coefficients found via search (train on annotated data)
  Modify coefficients for different tasks
Zanzotto et al. (2006) try to learn from comparison of structures of T, H for ‘true’ vs. ‘false’ entailment pairs:
  Use lexical, syntactic annotation to characterize match between T, H for successful, unsuccessful entailment pairs
  Train Kernel/SVM to distinguish between match graphs
Page 63
Justification
For most approaches, justification is given only by empirical evaluation on the preprocessed data.
Logical approaches: there is a proof-theoretic justification, modulo the power of the resources and the ability to map a sentence to a logical form.
Graph/tree-based approaches: there is a model-theoretic justification. The approach is sound, but not complete, modulo the availability of resources.
Page 64
Justifying Graph-Based Approaches [Braz et al. ’05]
R – a knowledge representation language, with a well-defined syntax and semantics for a domain D.
For text snippets s, t: r_s, r_t – their representations in R; M(r_s), M(r_t) – their model-theoretic representations.
There is a well-defined notion of subsumption in R, defined model-theoretically: for u, v ∈ R, u is subsumed by v when M(u) ⊆ M(v).
This is not an algorithm; we need a proof theory.
Page 65
Defining Semantic Entailment (2)
The proof theory is weak; it will show r_s ⊆ r_t only when they are relatively similar syntactically.
r ∈ R is faithful to s if M(r_s) = M(r).
Definition: Let s, t be text snippets with representations r_s, r_t ∈ R. We say that s semantically entails t if there is a representation r ∈ R that is faithful to s, for which we can prove that r ⊆ r_t.
Given r_s, we need to generate many equivalent representations r′_s and test r′_s ⊆ r_t.
This cannot be done exhaustively. How do we generate alternative representations?
Page 66
Defining Semantic Entailment (3)
A rewrite rule (l, r) is a pair of expressions in R such that l ⊆ r.
Given a representation r_s of s and a rule (l, r) for which r_s ⊆ l, the augmentation of r_s via (l, r) is r′_s = r_s ∧ r.
Claim: r′_s is faithful to s.
Proof: In general, since r′_s = r_s ∧ r, then M(r′_s) = M(r_s) ∩ M(r). However, since r_s ⊆ l ⊆ r, then M(r_s) ⊆ M(r).
Consequently: M(r′_s) = M(r_s), and the augmented representation is faithful to s.
Page 67
Comments
The claim suggests an algorithm for generating alternative (equivalent) representations and for semantic entailment.
The resulting algorithm is sound, but not complete. Completeness depends on the quality of the KB of rules.
The power of this algorithm is in the rules KB: l and r might be very different syntactically, but by satisfying model-theoretic subsumption they provide expressivity to the re-representation in a way that facilitates the overall subsumption.
Page 68
Non-Entailment
The problem of determining non-entailment is harder, mostly due to its structure.
Most approaches determine non-entailment heuristically: set a threshold for a cost function; if it is not met by the pair, say ‘no’.
Several approaches have identified specific features that hint at non-entailment.
A model-theoretic approach for non-entailment has also been developed, although its effectiveness isn't clear yet.
Page 69
What are we missing?
It is completely clear that the key missing resource is knowledge.
Better resources translate immediately to better results. At this point, existing resources seem to be lacking in coverage and accuracy. There are not enough high-quality public resources; no quantification.
Some examples:
Lexical knowledge: some cases are difficult to acquire systematically, e.g. A bought Y ⇒ A has/owns Y. Many of the current lexical resources are very noisy.
Numbers, quantitative reasoning
Time and date; temporal reasoning
Robust event-based reasoning and information integration
Page 70
Textual Entailment as a Classification Task
Page 71
RTE as classification task
RTE is a classification task: given a pair (T, H), we need to decide whether T implies H or T does not imply H.
We can learn a classifier from annotated examples.
What do we need: a learning algorithm and a suitable feature space.
Page 72
Defining the feature space
How do we define the feature space?
Possible features:
“Distance features” – features of “some” distance between T and H
“Entailment trigger features”
“Pair features” – the content of the T-H pair is represented
Possible representations of the sentences:
Bag-of-words (possibly with n-grams)
Syntactic representation
Semantic representation
T1: “At the end of the year, all solid companies pay dividends.”
H1: “At the end of the year, all solid insurance companies pay dividends.”
Page 73
Distance Features
Possible features:
Number of words in common
Longest common subsequence
Longest common syntactic subtree
…
T: “At the end of the year, all solid companies pay dividends.”
H: “At the end of the year, all solid insurance companies pay dividends.”
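The first two distance features listed above can be computed directly. A minimal sketch, assuming simple whitespace tokenization (punctuation stays attached to tokens):

```python
# Sketch of two distance features: word overlap and the longest common
# subsequence (LCS) over word sequences, via dynamic programming.

def common_words(t, h):
    """Number of distinct tokens shared by T and H."""
    return len(set(t.lower().split()) & set(h.lower().split()))

def lcs_length(t, h):
    """Length of the longest common subsequence of the token sequences."""
    a, b = t.lower().split(), h.lower().split()
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a):
        for j, wb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if wa == wb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

T = "At the end of the year, all solid companies pay dividends."
H = "At the end of the year, all solid insurance companies pay dividends."
```

Here the entire 11-token T is a subsequence of H, so both features score near their maxima even though the entailment is false; this is exactly the weakness the cross-pair approach later addresses.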
Page 74
Entailment Triggers
Possible features from (de Marneffe et al., 2006):
Polarity features: presence/absence of negative polarity contexts (not, no, few, without)
  “Oil price surged” / “Oil prices didn’t grow”
Antonymy features: presence/absence of antonymous words in T and H
  “Oil price is surging” / “Oil prices is falling down”
Adjunct features: dropping/adding of a syntactic adjunct when moving from T to H
  “all solid companies pay dividends” / “all solid companies pay cash dividends”
…
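A minimal sketch of the polarity-trigger feature above, assuming a small hand-picked marker list (illustrative only, not de Marneffe et al.'s actual lexicon):

```python
# Sketch of a polarity-trigger feature: fire when exactly one of T, H
# contains a negative-polarity marker. The marker list is a toy subset.

NEG_MARKERS = {"not", "no", "few", "without", "didn't", "never"}

def polarity_mismatch(t, h):
    """True when T and H disagree on the presence of a negation context."""
    t_neg = any(w in NEG_MARKERS for w in t.lower().split())
    h_neg = any(w in NEG_MARKERS for w in h.lower().split())
    return t_neg != h_neg

print(polarity_mismatch("Oil price surged", "Oil prices didn't grow"))  # True
```

In a real system this boolean becomes one entry of the feature vector handed to the classifier, alongside the distance and pair features.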
Page 75
Pair Features
Possible features:
Bag-of-word spaces of T and H
Syntactic spaces of T and H
T: “At the end of the year, all solid companies pay dividends.”
H: “At the end of the year, all solid insurance companies pay dividends.”
Bag-of-word pair space: end_T, year_T, solid_T, companies_T, pay_T, dividends_T, …; end_H, year_H, solid_H, companies_H, pay_H, dividends_H, insurance_H, …
Page 76
Pair Features: what can we learn?
Bag-of-word spaces of T and H: end_T, year_T, solid_T, companies_T, pay_T, dividends_T, …; end_H, year_H, solid_H, companies_H, pay_H, dividends_H, insurance_H, …
We can learn:
T implies H when T contains “end”…
T does not imply H when H contains “end”…
It seems to be totally irrelevant!!!
Page 77
ML Methods in the possible feature spaces
[Table: possible features (Distance, Entailment Trigger, Pair) × sentence representations (bag-of-words, syntactic, semantic); cells cite (Hickl et al., 2006), (Zanzotto & Moschitti, 2006), (Bos & Markert, 2006), (Ipken et al., 2006), (Kozareva & Montoyo, 2006), (de Marneffe et al., 2006), (Herrera et al., 2006), (Rodney et al., 2006)]
Page 78
Effectively using the Pair Feature Space (Zanzotto & Moschitti, 2006)
Roadmap:
Motivation: why it is important even if it seems not to be
Understanding the model with an example: challenges; a simple example
Defining the cross-pair similarity
Page 79
Observing the Distance Feature Space… (Zanzotto, Moschitti, 2006)
T1: "At the end of the year, all solid companies pay dividends."
H1: "At the end of the year, all solid insurance companies pay dividends."
T1 ⇒ H1
T1: "At the end of the year, all solid companies pay dividends."
H2: "At the end of the year, all solid companies pay cash dividends."
T1 ⇏ H2
In a distance feature space (% common words, % common syntactic dependencies), the two pairs are very likely the same point.
Page 80
What can happen in the pair feature space?
T1: "At the end of the year, all solid companies pay dividends."
H1: "At the end of the year, all solid insurance companies pay dividends."
T1 ⇒ H1
T1: "At the end of the year, all solid companies pay dividends."
H2: "At the end of the year, all solid companies pay cash dividends."
T1 ⇏ H2
T3: "All wild animals eat plants that have scientifically proven medicinal properties."
H3: "All wild mountain animals eat plants that have scientifically proven medicinal properties."
T3 ⇒ H3
(Zanzotto, Moschitti, 2006)
Page 81
Observations
Some examples are difficult to exploit in the distance feature space…
We need a space that considers both the content and the structure of textual entailment examples
Let us explore the pair space! … using the kernel trick: define the space by defining the distance K(P1, P2) instead of defining the features
K((T1, H1), (T1, H2))
Page 82
Target
How do we build it? Using a syntactic interpretation of sentences and a similarity among trees, KT(T', T''): this similarity counts the number of subtrees in common between T' and T''
This is a syntactic pair feature space
Question: do we need something more?
(Zanzotto, Moschitti, 2006)
Cross-pair similarity: KS((T', H'), (T'', H'')) = KT(T', T'') + KT(H', H'')
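As a sketch, the composite similarity KS((T', H'), (T'', H'')) = KT(T', T'') + KT(H', H'') can be written as below. The toy KT counts shared productions of nested-tuple trees; the actual system uses a subtree (tree-fragment) kernel over full parse trees.

```python
# Toy cross-pair similarity: KS sums a tree similarity over the two texts
# and the two hypotheses. Trees are nested tuples: (label, child1, child2, ...).
from collections import Counter

def productions(tree):
    """Counter of label -> child-labels productions of a nested-tuple tree."""
    out = Counter()
    if isinstance(tree, tuple):
        label, *children = tree
        child_labels = tuple(c[0] if isinstance(c, tuple) else c for c in children)
        out[(label, child_labels)] += 1
        for c in children:
            out.update(productions(c))
    return out

def KT(t1, t2):
    """Similarity between two trees: number of shared productions."""
    p1, p2 = productions(t1), productions(t2)
    return sum(min(n, p2[prod]) for prod, n in p1.items())

def KS(pair1, pair2):
    (t1, h1), (t2, h2) = pair1, pair2
    return KT(t1, t2) + KT(h1, h2)

t1 = ("S", ("NP", "companies"), ("VP", "pay"))
h1 = ("S", ("NP", "insurers"), ("VP", "pay"))
print(KS((t1, h1), (t1, h1)))  # 6 = KT(t1, t1) + KT(h1, h1) = 3 + 3
```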
Page 83
Observing the syntactic pair feature space
Can we use syntactic tree similarity?
(Zanzotto, Moschitti, 2006)
Page 84
Observing the syntactic pair feature space
Can we use syntactic tree similarity?
(Zanzotto, Moschitti, 2006)
Page 85
Observing the syntactic pair feature space
Can we use syntactic tree similarity? Not only!
(Zanzotto, Moschitti, 2006)
Page 86
Observing the syntactic pair feature space
Can we use syntactic tree similarity? Not only! We also want to exploit the implied rewrite rule
(Zanzotto, Moschitti, 2006)
[Figure: placeholders a, b, c, d aligned between the trees of T and H]
Page 87
Exploiting Rewrite Rules
To capture the textual entailment recognition rule (rewrite rule or inference rule), the cross-pair similarity measure should consider:
the structural/syntactical similarity between, respectively, texts and hypotheses
the similarity among the intra-pair relations between constituents
How to reduce the problem to a tree similarity computation?
(Zanzotto, Moschitti, 2006)
Page 88
Exploiting Rewrite Rules (Zanzotto, Moschitti, 2006)
Page 89
Exploiting Rewrite Rules: intra-pair operations (Zanzotto, Moschitti, 2006)
Page 90
Exploiting Rewrite Rules: intra-pair operations: finding anchors (Zanzotto, Moschitti, 2006)
Page 91
Exploiting Rewrite Rules: intra-pair operations: finding anchors, naming anchors with placeholders (Zanzotto, Moschitti, 2006)
Page 92
Exploiting Rewrite Rules: intra-pair operations: finding anchors, naming anchors with placeholders, propagating placeholders (Zanzotto, Moschitti, 2006)
Page 93
Exploiting Rewrite Rules: intra-pair operations (finding anchors, naming anchors with placeholders, propagating placeholders); cross-pair operations (Zanzotto, Moschitti, 2006)
Page 94
Exploiting Rewrite Rules: cross-pair operations: matching placeholders across pairs (Zanzotto, Moschitti, 2006)
Page 95
Exploiting Rewrite Rules: cross-pair operations: matching placeholders across pairs, renaming placeholders
Page 96
Exploiting Rewrite Rules: cross-pair operations: matching placeholders across pairs, renaming placeholders, calculating the similarity between syntactic trees with co-indexed leaves
Page 97
Exploiting Rewrite Rules: intra-pair operations (finding anchors, naming anchors with placeholders, propagating placeholders); cross-pair operations (matching placeholders across pairs, renaming placeholders, calculating the similarity between syntactic trees with co-indexed leaves) (Zanzotto, Moschitti, 2006)
Page 98
Exploiting Rewrite Rules
The initial example: sim(H1,H3) > sim(H2,H3)?
(Zanzotto, Moschitti, 2006)
Page 99
Defining the Cross-pair similarity
The cross-pair similarity is based on the distance between syntactic trees with co-indexed leaves:
KS((T', H'), (T'', H'')) = max over c in C of [ KT(t(T', c), t(T'', i)) + KT(t(H', c), t(H'', i)) ]
where C is the set of all the correspondences between the anchors of (T', H') and (T'', H''); t(S, c) returns the parse tree of the hypothesis (text) S where its placeholders are replaced by means of the substitution c; i is the identity substitution; KT(t1, t2) is a function that measures the similarity between the two trees t1 and t2.
(Zanzotto, Moschitti, 2006)
Page 100
Defining the Cross-pair similarity
Page 101
Refining Cross-pair Similarity
Controlling complexity: we reduced the size of the set of anchors using the notion of chunk
Reducing the computational cost: many subtree computations are repeated during the computation of KT(t1, t2); this can be exploited in a better dynamic programming algorithm (Moschitti&Zanzotto, 2007)
Focusing on the information within a pair that is relevant for the entailment: text trees are pruned according to where anchors attach
(Zanzotto, Moschitti, 2006)
Page 102
BREAK (30 min)
Page 103
III. Knowledge Acquisition Methods
Page 104
Knowledge Acquisition for TE
What kind of knowledge do we need?
Explicit Knowledge (Structured Knowledge Bases)
Relations among words (or concepts): symmetric (synonymy, co-hyponymy); directional (hyponymy, part-of, …)
Relations among sentence prototypes: symmetric (paraphrasing); directional (inference rules/rewrite rules)
Implicit Knowledge
Relations among sentences: symmetric (paraphrasing examples); directional (entailment examples)
Page 105
Acquisition of Explicit Knowledge
Page 106
Acquisition of Explicit Knowledge
The questions we need to answer:
What? What do we want to learn? Which resources do we need?
Using what? Which principles do we have?
How? How do we organize the "knowledge acquisition" algorithm?
Page 107
Acquisition of Explicit Knowledge: what?
Types of knowledge
Symmetric:
Co-hyponymy between words: cat ↔ dog
Synonymy between words: buy ↔ acquire; between sentence prototypes (paraphrasing): X bought Y ↔ X acquired Z% of Y's shares
Directional semantic relations:
Between words: cat ⇒ animal, buy ⇒ own, wheel (part-of) ⇒ car
Between sentence prototypes: X acquired Z% of Y's shares ⇒ X owns Y
Page 108
Acquisition of Explicit Knowledge : Using what?
Underlying hypotheses
Harris' Distributional Hypothesis (DH) (Harris, 1964): "Words that tend to occur in the same contexts tend to have similar meanings."
sim(w1, w2) ≈ sim(C(w1), C(w2))
Robison's Point-wise Assertion Patterns (PAP) (Robison, 1970): "It is possible to extract relevant semantic relations with some patterns."
w1 is in a relation r with w2 if the contexts match pattern_r(w1, w2)
Page 109
Distributional Hypothesis (DH): words or forms vs. the context (feature) space
sim_w(w1, w2) ≈ sim_ctx(C(w1), C(w2))
w1 = constitute, w2 = compose
Corpus: source of contexts
… sun is constituted of hydrogen …
… The Sun is composed of hydrogen …
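A minimal sketch of the Distributional Hypothesis in code: represent each word by a bag of context words and compare the vectors with cosine similarity. The two-sentence corpus and the bag-of-words notion of context are toy assumptions.

```python
# Toy DH: sim(w1, w2) is approximated by the similarity of their context vectors.
from collections import Counter
from math import sqrt

def context_vector(snippets):
    """Bag-of-words Counter over all context snippets of a word."""
    return Counter(w for s in snippets for w in s.lower().split())

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

C_constitute = context_vector(["sun is constituted of hydrogen"])
C_compose = context_vector(["the sun is composed of hydrogen"])
print(round(cosine(C_constitute, C_compose), 2))  # 0.73: the two verbs share contexts
```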
Page 110
Point-wise Assertion Patterns (PAP)
w1 is in a relation r with w2 if the contexts match patterns_r(w1, w2)
relation: w1 part_of w2
patterns: "w1 is constituted of w2", "w1 is composed of w2"
Corpus: source of contexts
… sun is constituted of hydrogen …
… The Sun is composed of hydrogen …
part_of(sun, hydrogen)
Statistical indicator S_corpus(w1, w2): selects correct vs. incorrect relations among words
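A sketch of the PAP side: surface patterns instantiated as regular expressions harvest candidate instances of a relation; a statistical indicator would then score the harvested pairs. The patterns and corpus are the slide's toy examples.

```python
# Toy PAP extractor for part_of: match lexical patterns against corpus sentences.
import re

PATTERNS_PART_OF = [
    r"(\w+) is constituted of (\w+)",
    r"(\w+) is composed of (\w+)",
]

def extract_part_of(corpus):
    pairs = set()
    for sentence in corpus:
        for pat in PATTERNS_PART_OF:
            pairs.update(re.findall(pat, sentence.lower()))
    return pairs

corpus = ["... sun is constituted of hydrogen ...",
          "... The Sun is composed of hydrogen ..."]
print(extract_part_of(corpus))  # {('sun', 'hydrogen')}
```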
Page 111
DH and PAP cooperate: words or forms vs. the context (feature) space
w1 = constitute, w2 = compose
Corpus: source of contexts
… sun is constituted of hydrogen …
… The Sun is composed of hydrogen …
Distributional Hypothesis + Point-wise Assertion Patterns
Page 112
Knowledge Acquisition: where do methods differ?
On the "word" side: target equivalence classes (concepts or relations); target forms (words or expressions)
On the "context" side: feature space; similarity function
Page 113
KA4TE: a first classification of some methods
[Table: methods classified by type of knowledge (symmetric vs. directional) and underlying hypothesis (Distributional Hypothesis vs. Point-wise Assertion Patterns): Concept Learning (Lin&Pantel, 2001a); Inference Rules (DIRT) (Lin&Pantel, 2001b); Noun Entailment (Geffet&Dagan, 2005); TEASE (Szpektor et al., 2004); ISA patterns (Hearst, 1992); Verb Entailment (Zanzotto et al., 2006); Relation Pattern Learning (ESPRESSO) (Pantel&Pennacchiotti, 2006)]
Page 114
Noun Entailment Relation
Type of knowledge: directional relations
Underlying hypothesis: distributional hypothesis
Main idea: distributional inclusion hypothesis (Geffet&Dagan, 2006)
w1 ⇒ w2 if all the prominent features of w1 occur with w2 in a sufficiently large corpus
[Figure: in the context feature space, the prominent features I(C(w1)) are included in I(C(w2))]
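The distributional inclusion idea can be sketched as a set test: propose w1 ⇒ w2 when the prominent features of w1 (here simply the top-n by weight, an illustrative choice) all occur with w2. The feature weights below are invented toy values.

```python
# Toy distributional inclusion test for directional entailment between nouns.
def prominent(features, n=3):
    """Top-n context features by weight."""
    return {f for f, _ in sorted(features.items(), key=lambda kv: -kv[1])[:n]}

def noun_entails(features_w1, features_w2, n=3):
    """w1 => w2 if the prominent features of w1 all occur with w2."""
    return prominent(features_w1, n) <= set(features_w2)

C_dog = {"barks": 5.0, "pet": 4.0, "tail": 3.0, "leash": 1.0}
C_animal = {"barks": 1.0, "pet": 2.0, "tail": 2.0, "eats": 4.0, "wild": 3.0}
print(noun_entails(C_dog, C_animal))   # True: dog => animal
print(noun_entails(C_animal, C_dog))   # False
```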
Page 115
Verb Entailment Relations
Type of knowledge: directional relations
Underlying hypothesis: point-wise assertion patterns
Main idea: win ⇒ play? player wins!
(Zanzotto, Pennacchiotti, Pazienza, 2006)
relation: v1 ⇒ v2
pattern: "agentive_nominalization(v2) v1" (e.g., "player wins")
Statistical indicator S(v1, v2): point-wise mutual information
Page 116
Verb Entailment Relations
Understanding the idea: selectional restriction
fly(x) ⇒ has_wings(x)
In general: v(x) ⇒ c(x) (if x is the subject of v, then x has the property c)
Agentive nominalization: an agentive noun is "the doer or the performer of an action v'"
"X is a player" may be read as play(x)
c(x) is clearly v'(x) if the property c is derived from v' by an agentive nominalization
(Zanzotto, Pennacchiotti, Pazienza, 2006)
Skipped
Page 117
Verb Entailment Relations
Understanding the idea: given the expression "player wins"
Seen as a selectional restriction: win(x) ⇒ play(x)
Seen as a selectional preference: P(play(x)|win(x)) > P(play(x))
Skipped
Page 118
Knowledge Acquisition for TE: How?
The algorithmic nature of a DH+PAP method:
Direct (starting point: the target words)
Indirect (starting point: the context feature space)
Iterative (interplay between the context feature space and the target words)
Page 119
Direct Algorithm
sim(w1, w2) ≈ sim(C(w1), C(w2))
1. Select target words wi from the corpus or from a dictionary
2. Retrieve the contexts of each wi and represent them in the feature space as C(wi)
3. For each pair (wi, wj):
1. Compute the similarity sim(C(wi), C(wj)) in the context space
2. If sim(wi, wj) = sim(C(wi), C(wj)) > t, then wi and wj belong to the same equivalence class W
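The direct algorithm above can be sketched as follows; Jaccard similarity over context-word sets stands in for the feature-space similarity, and the threshold t and the vectors are illustrative.

```python
# Toy direct algorithm: pairwise context similarity above a threshold t groups
# target words into equivalence classes.
from itertools import combinations

def jaccard(c1, c2):
    return len(c1 & c2) / len(c1 | c2)

def direct_algorithm(context_vectors, sim=jaccard, t=0.4):
    classes = []
    for wi, wj in combinations(context_vectors, 2):
        if sim(context_vectors[wi], context_vectors[wj]) > t:
            for W in classes:           # merge into an existing class if possible
                if wi in W or wj in W:
                    W.update({wi, wj})
                    break
            else:
                classes.append({wi, wj})
    return classes

vectors = {"cat": {"purr", "pet", "tail"},
           "dog": {"bark", "pet", "tail"},
           "car": {"drive", "wheel"}}
print(direct_algorithm(vectors))  # one equivalence class: cat and dog
```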
Page 120
Indirect Algorithm
1. Given an equivalence class W, select relevant contexts and represent them in the feature space
2. Retrieve target words (w1, …, wn) that appear in these contexts. These are likely to be words in the equivalence class W
3. Eventually, for each wi, retrieve C(wi) from the corpus
4. Compute the centroid I(C(W))
5. For each wi, if sim(I(C(W)), C(wi)) < t, eliminate wi from W
Page 121
Iterative Algorithm
1. For each word wi in the equivalence class W, retrieve the contexts C(wi) and represent them in the feature space
2. Extract words wj that have contexts similar to C(wi)
3. Extract the contexts C(wj) of these new words
4. For each new word wj, if sim(C(W), C(wj)) > t, put wj in W
Page 122
Knowledge Acquisition using DH and PAP
Direct Algorithms: Concepts from text via clustering (Lin&Pantel, 2001); Inference rules, aka DIRT (Lin&Pantel, 2001); …
Indirect Algorithms: Hearst's ISA patterns (Hearst, 1992); Question Answering patterns (Ravichandran&Hovy, 2002); …
Iterative Algorithms: Entailment rules from the Web, aka TEASE (Szpektor et al., 2004); Espresso (Pantel&Pennacchiotti, 2006); …
Page 123
TEASE
Type: iterative algorithm
On the "word" side: target equivalence classes: fine-grained relations; target forms: verbs with arguments
On the "context" side: feature space
Innovations with respect to research before 2004: first direct algorithm for extracting rules
[Figure: dependency-tree template for prevent(X,Y) with slot fillers X_{filler} and Y_{filler}, e.g. the template "X finally call Y indictable" with subj, obj and mod edges]
(Szpektor et al., 2004)
Page 124
TEASE
WEB + Lexicon
Input template: X subj-accuse-obj Y
Sample corpus for the input template: Paula Jones accused Clinton… BBC accused Blair… Sanhedrin accused St.Paul… …
Anchor sets: {Paula Jones_subj; Clinton_obj}, {Sanhedrin_subj; St.Paul_obj}, …
Sample corpus for the anchor sets: Paula Jones called Clinton indictable… St.Paul defended before the Sanhedrin…
Templates: X call Y indictable; Y defend before X; …
TEASE loop: Anchor Set Extraction (ASE) and Template Extraction (TE), iterated
(Szpektor et al., 2004)
Skipped
Page 125
TEASE
Innovations with respect to research before 2004:
First direct algorithm for extracting rules
A feature selection is performed to assess the most informative features
Extracted forms are clustered to obtain the most general sentence prototype of a given set of equivalent forms
(Szpektor et al., 2004)
[Figure: the dependency graphs of S1 "X call Y indictable for harassment" and S2 "X finally call Y indictable" are merged into the generalized form "X call Y indictable", with the example-index sets {1} and {2} unioned to {1,2}]
Skipped
Page 126
Espresso
Type: iterative algorithm
On the "word" side: target equivalence classes: relations; target forms: expressions, sequences of tokens
Innovations with respect to research before 2006: a measure to determine specific vs. general patterns (a ranking over the equivalent forms)
"Y is composed by X", "Y is made of X" ⇒ compose(X,Y)
(Pantel&Pennacchiotti, 2006)
Page 127
Espresso
[Flowchart: starting from SEEDS such as (leader, panel), (city, region), (oxygen, water): Pattern Induction (sentence retrieval, sentence generalization); Pattern Ranking/Selection (frequency count, pattern reliability ranking, pattern selection), e.g. 1.0 "Y is composed by X", 0.8 "Y is part of X", 0.2 "X, Y"; Instance Extraction with generic-pattern filtering (pattern instantiation, low-redundancy test, syntactic expansion, Web expansion and generic test via Google, Web instance filter); Instance Ranking/Selection (instance reliability ranking, instance selection), e.g. 1.0 (tree, land), 0.9 (atom, molecule), 0.7 (leader, panel), 0.6 (range of information, FBI report), 0.6 (artifact, exhibit), 0.2 (oxygen, hydrogen)]
(Pantel&Pennacchiotti, 2006)
Skipped
Page 128
Espresso
Innovations with respect to research before 2006:
A measure to determine specific vs. general patterns (a ranking over the equivalent forms), e.g. 1.0 "Y is composed by X", 0.8 "Y is part of X", 0.2 "X, Y"
Both pattern and instance selection are performed
Different use of general and specific patterns in the iterative algorithm
(Pantel&Pennacchiotti, 2006)
Skipped
Page 129
Acquisition of Implicit Knowledge
Page 130
Acquisition of Implicit Knowledge
The questions we need to answer:
What? What do we want to learn? Which resources do we need?
Using what? Which principles do we have?
Page 131
Acquisition of Implicit Knowledge: what?
Types of knowledge
Symmetric: near-synonymy between sentences
Acme Inc. bought Goofy ltd. ↔ Acme Inc. acquired 11% of Goofy ltd.'s shares
Directional semantic relations: entailment between sentences
Acme Inc. acquired 11% of Goofy ltd.'s shares ⇒ Acme Inc. owns Goofy ltd.
Note: tricky not-entailments are also relevant
Page 132
Acquisition of Implicit Knowledge : Using what?
Underlying hypotheses
Structural and content similarity: "Sentences are similar if they share enough content"
A revised Point-wise Assertion Patterns: "Some patterns of sentences reveal relations among sentences"
sim(s1, s2) and the relations between s1 and s2 are derived accordingly
Page 133
A first classification of some methods
[Table: methods classified by type of knowledge (symmetric; directional: entails / not entails) and underlying hypothesis (structural and content similarity vs. revised point-wise assertion patterns): Paraphrase Corpus (Dolan&Quirk, 2004); Relations among sentences (Burger&Ferro, 2005); Relations among sentences (Hickl et al., 2006)]
Page 134
Entailment relations among sentences
Type of knowledge: directional relations (entailment)
Underlying hypothesis: revised point-wise assertion patterns
Main idea: in headline news items, the first sentence/paragraph generally entails the title (Burger&Ferro, 2005)
relation: s2 ⇒ s1
pattern: news item with Title(s1) and First_Sentence(s2)
This pattern works on the structure of the text
Page 135
Entailment relations among sentences: examples from the web
Title: New York Plan for DNA Data in Most Crimes
Body: Eliot Spitzer is proposing a major expansion of New York's database of DNA samples to include people convicted of most crimes, while making it easier for prisoners to use DNA to try to establish their innocence. …
Title: Chrysler Group to Be Sold for $7.4 Billion
Body: DaimlerChrysler confirmed today that it would sell a controlling interest in its struggling Chrysler Group to Cerberus Capital Management of New York, a private equity firm that specializes in restructuring troubled companies. …
Page 136
Tricky Not-Entailment relations among sentences
Type of knowledge: directional relations (tricky not-entailment)
Underlying hypothesis: revised point-wise assertion patterns
Main idea: in a text, sentences with the same named entity generally do not entail each other; sentences connected by "on the contrary", "but", … do not entail each other
(Hickl et al., 2006)
relation: s1 ⇏ s2
patterns: s1 and s2 are in the same text and share at least a named entity; "s1. On the contrary, s2"
Page 137
Tricky Not-Entailment relations among sentences: examples from (Hickl et al., 2006)
T: One player losing a close friend is Japanese pitcher Hideki Irabu, who was befriended by Wells during spring training last year.
H: Irabu said he would take Wells out to dinner when the Yankees visit Toronto.
T: According to the professor, present methods of cleaning up oil slicks are extremely costly and are never completely efficient.
H: In contrast, he stressed, Clean Mag has a 100 percent pollution retrieval rate, is low cost and can be recycled.
Page 138
Context Sensitive Paraphrasing
He used a Phillips head to tighten the screw.
The bank owner tightened security after a spate of local crimes.
The Federal Reserve will aggressively tighten monetary policy.
Candidate substitutions: Loosen, Strengthen, Step up, Toughen, Improve, Fasten, Impose, Intensify, Ease, Beef up, Simplify, Curb, Reduce
Context Sensitive Paraphrasing
Can speak replace command?
The general commanded his troops. The general spoke to his troops.
The soloist commanded attention. The soloist spoke to attention.
Context Sensitive Paraphrasing
Need to know when one word can paraphrase another, not just if.
Given a word v and its context in sentence S, and another word u:
Can u replace v in S and have S keep the same or an entailed meaning?
Is the new sentence S', where u has replaced v, entailed by the previous sentence S?
The general commanded [v] his troops. [speak = u]
The general spoke to his troops. YES
The soloist commanded [v] attention. [speak = u]
The soloist spoke to attention. NO
Related Work
Paraphrase generation: given a sentence or phrase, generate paraphrases of that phrase which have the same or an entailed meaning in some context [DIRT; TEASE]
A sense disambiguation task, without naming the sense: Dagan et al. '06; Kauchak & Barzilay (in the context of improving MT evaluation); SemEval word substitution task; Pantel et al. '06
In these cases, this was done by learning (in a supervised way) a single classifier per word u
Context Sensitive Paraphrasing [Connor&Roth '07]
Use a single global binary classifier f(S, v, u) → {0, 1}
Unsupervised, bootstrapped learning approach
Key: the use of a very large amount of unlabeled data to derive a reliable supervision signal that is then used to train a supervised learning algorithm.
Features are the amount of overlap between the contexts u and v have both been seen with
Include context sensitivity by restricting to contexts similar to S: are both u and v seen in contexts similar to the local context S? This allows running the classifier on previously unseen pairs (u, v)
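A sketch of a context-sensitive overlap feature in this spirit (not Connor & Roth's actual feature set): restrict the co-occurrence vocabularies of u and v to the words of the local sentence S and measure their overlap. The example corpus is invented.

```python
# Toy context-overlap feature for a candidate substitution v -> u in sentence S.
def bow(text):
    return set(text.lower().split())

def context_overlap(S, contexts_v, contexts_u):
    """Jaccard overlap of the words seen with v and with u, restricted to
    the vocabulary of the local sentence S."""
    local = bow(S)
    words_v = set().union(*map(bow, contexts_v)) & local
    words_u = set().union(*map(bow, contexts_u)) & local
    union = words_v | words_u
    return len(words_v & words_u) / len(union) if union else 0.0

contexts_command = ["the general commanded his troops",
                    "the soloist commanded attention"]
contexts_speak = ["the general spoke to his troops"]
s_good = "The general commanded his troops"
s_bad = "The soloist commanded attention"
print(context_overlap(s_good, contexts_command, contexts_speak))  # 0.8
print(context_overlap(s_bad, contexts_command, contexts_speak))   # 0.25
```

The score is higher exactly in the context where the substitution preserves meaning, which is the signal the classifier would use.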
Page 143
IV. Applications of Textual Entailment
Page 144
Relation Extraction (Romano et al. EACL-06)
Identify different ways of expressing a target relation Examples: Management Succession, Birth - Death,
Mergers and Acquisitions, Protein Interaction
Traditionally performed in a supervised manner: requires dozens to hundreds of examples per relation; examples should cover broad semantic variability
Costly - feasible???
Little work on unsupervised approaches
Page 145
Proposed Approach
Input template: X prevent Y
TEASE: entailment rule acquisition
Templates: X prevention for Y, X treat Y, X reduce Y
Syntactic matcher with transformation rules
Relation instances: <sunscreen, sunburns>
Page 146
Dataset
Bunescu 2005: recognizing interactions between annotated protein pairs; 200 Medline abstracts
Input template: X interact with Y
Page 147
Manual Analysis - Results
93% of interacting protein pairs can be identified with lexical syntactic templates
Frequency of syntactic phenomena (%): transparent head 34; apposition 24; conjunction 24; set 13; relative clause 8; co-reference 7; coordination 7; passive form 2
Number of templates vs. recall (within the 93%): 10% recall with 2 templates; 20% with 4; 30% with 6; 40% with 11; 50% with 21; 60% with 39; 70% with 73; 80% with 107; 90% with 141; 100% with 175
Page 148
TEASE Output for X interact with Y
A sample of correct templates learned: X bind to Y; X binding to Y; X activate Y; X Y interaction; X stimulate Y; X attach to Y; X couple to Y; X interaction with Y; interaction between X and Y; X trap Y; X become trapped in Y; X recruit Y; X Y complex; X associate with Y; X recognize Y; X be linked to Y; X block Y; X target Y
Page 149
TEASE Potential Recall on Training Set
Iterative: taking the top 5 ranked templates as input
Morph: recognizing morphological derivations (cf. semantic role labeling vs. matching)
Experiment / recall: input 39%; input + iterative 49%; input + iterative + morph 63%
Page 150
Performance vs. Supervised Approaches
Supervised: 180 training abstracts
Page 151
Textual Entailment for Question Answering
Sanda Harabagiu and Andrew Hickl (ACL-06) : Methods for Using Textual Entailment in Open-Domain Question Answering
Typical QA architecture - 3 stages: 1) question processing; 2) passage retrieval; 3) answer processing
Incorporated their RTE-2 entailment system at stages 2&3, for filtering and re-ranking
Page 152
Integrated three methods
1) Test entailment between question and final answer – filter and re-rank by entailment score
2) Test entailment between question and candidate retrieved passage – combine entailment score in passage ranking
3) Test entailment between question and Automatically Generated Questions (AGQs) created from the candidate paragraph: utilizes an earlier method for generating Q-A pairs from a paragraph; the correct answer should match that of an entailed AGQ
TE is relatively easy to integrate at different stages
Results: 20% accuracy increase
Page 153
Answer Validation Exercise @ CLEF 2006-7
Peñas et al., Journal of Logic and Computation (to appear)
Allow textual entailment systems to validate (and prioritize) the answers of QA systems participating at CLEF
AVE participants receive: 1) question and answer - they need to generate the full hypothesis; 2) supporting passage - it should entail the answer hypothesis
Methodologically: enables measuring TE systems' contribution to QA performance, across many QA systems; TE developers do not need to have a full-blown QA system
Page 154
V. A Textual Entailment view of Applied Semantics
Page 155
Classical Approach = Interpretation
Stipulated Meaning
Representation(by scholar)
Language(by nature)
Variability
Logical forms, word senses, semantic roles, named entity types, … - scattered interpretation tasks
Feasible/suitable framework for applied semantics?
Page 156
Textual Entailment = Text Mapping
Assumed Meaning (by humans)
Language(by nature)
Variability
Page 157
General Case – Inference
MeaningRepresentation
Language
Inference
Interpretation
Textual Entailment
Entailment mapping is the actual applied goal - but also a touchstone for understanding!
Interpretation becomes a possible means; varying representation levels may be investigated
Page 158
Some perspectives
Issues with semantic interpretation: hard to agree on a representation language; costly to annotate semantic representations for training; difficult to obtain - is it more difficult than needed?
Textual entailment refers to texts: texts are theory neutral; amenable to unsupervised learning; the "proof is in the pudding" test
Page 159
Entailment as an Applied Semantics Framework
The new view: formulate (all?) semantic problems as entailment tasks
Some semantic problems are traditionally investigated as entailment tasks
But also… Revised definitions of old problems Exposing many new ones
Page 160
Some Classical Entailment Problems
Monotonicity – traditionally approached via entailment
Given that: dog ⇒ animal
Upward monotone: Some dogs are nice ⇒ Some animals are nice
Downward monotone: No animals are nice ⇒ No dogs are nice
Some formal approaches – via interpretation to logical form
Natural logic – avoids interpretation to FOL (cf. Stanford @ RTE-3)
Noun compound relation identification: a novel by Tolstoy ⇒ Tolstoy wrote a novel; practically an entailment task, when relations are represented lexically (rather than as interpreted semantic notions)
Page 161
Revised definition of an Old Problem: Sense Ambiguity
Classical task definition - interpretation: Word Sense Disambiguation
What is the RIGHT set of senses? Any concrete set is problematic/subjective… but WSD forces you to choose one
A lexical entailment perspective: instead of identifying an explicitly stipulated sense of a word occurrence… identify whether a word occurrence (i.e. its implicit sense) entails another word occurrence, in context
Dagan et al. (ACL-2006)
Page 162
Synonym Substitution
Source = record; Target = disc
positive: This is anyway a stunning disc, thanks to the playing of the Moscow Virtuosi with Spivakov.
negative: He said computer networks would not be affected and copies of information should be made on floppy discs.
negative: Before the dead soldier was placed in the ditch his personal possessions were removed, leaving one disc on the body for identification purposes.
Page 163
Unsupervised Direct: kNN-ranking
Test example score: the average cosine similarity of the target example with the k most similar (unlabeled) instances of the source word
Rationale: positive examples of the target will be similar to some source occurrence (of the corresponding sense); negative target examples won't be similar to source examples
Rank test examples by score
A classification slant on language modeling
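A minimal sketch of the kNN-ranking: score each target-word example by its average cosine similarity to the k most similar, unlabeled, source-word instances. The bag-of-words vectors and the two-instance source set are toy assumptions.

```python
# Toy kNN-ranking for sense matching between a source word and a target word.
from collections import Counter
from math import sqrt

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def knn_score(target_example, source_examples, k=2):
    """Average cosine similarity to the k nearest source instances."""
    sims = sorted((cosine(target_example, s) for s in source_examples), reverse=True)
    top = sims[:k]
    return sum(top) / len(top) if top else 0.0

source = [Counter("a stunning record of the moscow virtuosi".split()),
          Counter("the record label released the album".split())]
positive = Counter("a stunning disc by the moscow virtuosi".split())
negative = Counter("copies on floppy discs of the data".split())
# positive target instances score higher than negative ones
print(knn_score(positive, source) > knn_score(negative, source))  # True
```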
Page 164
Results (for synonyms): Ranking
kNN improves 8-18% precision up to 25% recall
Page 165
Other Modified and New Problems
Lexical entailment vs. classical lexical semantic relationships:
synonym ⇔ synonym
hyponym ⇒ hypernym (but much beyond WordNet - e.g. "medical technology")
meronym ⇐ ? ⇒ holonym - depending on the meronym type and the context: boil on elbow ⇒ boil on arm vs. government voted ⇒ minister voted
Named Entity Classification - by any textual type: Which pickup trucks are produced by Mitsubishi? Magnum ⇒ pickup truck
Argument mapping for nominalizations (derivations): X's acquisition of Y ⇒ X acquired Y; X's acquisition by Y ⇒ Y acquired X
Transparent head: sell to an IBM division ⇒ sell to IBM; sell to an IBM competitor ⇏ sell to IBM
…
…
Page 166
The importance of analyzing entailment examples
Few systematic manual data analysis works were reported
Vanderwende et al. at the RTE-1 workshop; Bar-Haim et al. at the ACL-05 EMSEE Workshop; within Romano et al. at EACL-06; the Xerox Parc data set; Braz et al. at the IJCAI-05 workshop
Contribute a lot to understanding and defining entailment phenomena and sub-problems
Should be done (and reported) much more…
Page 167
Unified Evaluation Framework
Defining semantic problems as entailment problems facilitates unified evaluation schemes (vs. current state)
Possible evaluation schemes:
1) Evaluate on the general TE task, while creating corpora which focus on target sub-tasks. E.g. a TE dataset with many sense-matching instances; measure the impact of sense-matching algorithms on TE performance
2) Define TE-oriented subtasks, and evaluate directly on the sub-task. E.g. a test collection manually annotated for sense-matching. Advantages: isolates the sub-problem; researchers can investigate individual problems without needing a full-blown TE system (cf. QA research). Such datasets may be derived from datasets of type (1)
Facilitates common inference goal across semantic problems
Page 168
Summary: Textual Entailment as Goal
The essence of the textual entailment paradigm:
Base applied semantic inference on entailment "engines" and KBs
Formulate various semantic problems as entailment sub-tasks
Interpretation and "mapping" methods may compete/complement at various levels of representation
Open question: which inferences can be represented at the "language" level, and which require logical or specialized representation and inference? (temporal, spatial, mathematical, …)
Page 169
Textual Entailment ≈ Human Reading Comprehension
From a children’s English learning book(Sela and Greenberg):
Reference Text: “…The Bermuda Triangle lies in the Atlantic Ocean, off the coast of Florida. …”
Hypothesis (True/False?): The Bermuda Triangle is near the United States
???
Page 170
Cautious Optimism: Approaching the Desiderata?
1) Generic (feasible) module for applications
2) Unified (agreeable) paradigm for investigating language phenomena
Thank you!