UNL Lexical Selection with Conceptual Vectors

Mathieu Lafourcade & Christian Boitet
LREC-2002, Las Palmas, May 2002 (presented 31/5/2002)

M. Lafourcade: LIRMM, Montpellier
Mathieu.Lafourcade@lirmm.fr http://www.lirmm.fr/~lafourca

Ch. Boitet: GETA, CLIPS, IMAG, Grenoble
http://www-clips.imag.fr/geta

Outline

The problem: disambiguation in UNL-French deconversion
– Finding the known UW nearest to an unknown UW
– Finding the best French lemma for a given UW

Conceptual vectors
– Nature & example on French (873 dimensions)
– Building (Dec. 2001: 64,000 terms, 210,000 CVs)

CVD (CV Disambiguation) running for French
– Recooking the vectors attached to a document tree
– Placing each recooked vector in the word sense tree

Using CVD in UNL-French deconversion: ongoing

The UNL-FR deconversion process

[Figure: pipeline diagram of the UNL-FR deconversion process.
Data structures shown: UNL-L1 graph, UNL-FRA graph (UW), UNL-FRA graph (French LU), "UNL Tree", GMA structure, UMA structure, UMC structure, French utterance.
Processing steps shown: validation & localization, lexical transfer, graph-to-tree conversion, structural transfer, paraphrase choice, syntactic generation, morphological generation, plus a box labeled "Conceptual vectors computations".]

The problem: disambiguation in UNL-French deconversion

Find the known UW nearest to an unknown UW
– known UWs (in KB context):
    obj(open(icl>occur),door)    a door opens
    obj(open(icl>do),door)       one opens a door
– input graph:
    obj(open(icl>occur,ins>concrete thing),door)
    ins(open(icl>occur,ins>concrete thing),key…)
    a key opens a door / a door opens with a key
  ==> choose nearest open(icl>occur) for correct result

Find best French lemma for a UW in a given context
– meeting(icl>event) ==> réunion [ACTION, DURATION…]
                         rencontre [EVENT, MOMENT…]

How to solve them?

1. unknown UW → best known UW
   1. Accessing the KB in real time is impractical (web server)
   2. The KB is not enough: still many possible candidates

2. known UW → best LU
   1. Often no clear symbolic conditions for selection
   2. Possibility to transform the UNL→LUfr dictionary into a kind of neural net (cf. MSR MindNet)

3. A possible unifying solution: lexical selection through DCV (Disambiguation using Conceptual Vectors), which works quite well for French in large-scale experiments

Conceptual vectors

CV = vector in concept space (4th level in Larousse)

V(to tidy up) = CHANGE [0.84], VARIATION [0.83], EVOLUTION [0.82], ORDER [0.77], SITUATION [0.76], STRUCTURE [0.76], RANK [0.76] …

V(to cut) = GAME [0.8], LIQUID [0.8], CROSS [0.79], PART [0.78], MIXTURE [0.78], FRACTION [0.75], TORTURE [0.75], WOUND [0.75], DRINK [0.74] …

Global vector of a term = normalized sum of the CVs of its meanings/senses

V(head) = HEAD [0.83], BEGINNING [0.75], ANTERIORITY [0.74], PERSON [0.74], INTELLIGENCE [0.68], HIERARCHY [0.65], …
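
The "normalized sum" that defines the global vector of a term can be sketched as follows. This is only an illustration: the slides do not say which norm is used, so the Euclidean norm is an assumption, and the 3-dimensional toy vectors are invented (the real space has 873 dimensions).

```python
import math

def normalize(v):
    """Scale v to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def global_vector(sense_vectors):
    """Normalized sum of the conceptual vectors of a term's senses."""
    dims = len(sense_vectors[0])
    summed = [sum(v[i] for v in sense_vectors) for i in range(dims)]
    return normalize(summed)

# Two invented "senses" of one term in a toy 3-concept space.
sense_a = [1.0, 0.0, 0.0]
sense_b = [0.0, 1.0, 0.0]
v_term = global_vector([sense_a, sense_b])
```

The resulting vector lies between the two senses and has unit length, so a term with several senses is represented by a blend of their themes.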

Conceptual vectors and sense space

Conceptual vector model
– Reminiscent of Vector Models (Salton et al.) & Sowa
– Applied on preselected concepts (not terms)
– Concepts are not independent

Set of k basic concepts
– Thesaurus Larousse = 873 concepts (translation of Roget’s)
– A vector = an 873-tuple of reals in [0..1]
– Encoding for each dimension: C = 2^15, i.e. integers in [0..32767]

Sense space = vector space + vector set
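
A minimal sketch of the per-dimension encoding, assuming a linear mapping from [0, 1] onto the integers [0..32767] with rounding to the nearest code. The slides give only the range, so the mapping itself and the names `encode` and `decode` are assumptions.

```python
SCALE = 2 ** 15 - 1  # 32767, the largest encoded value

def encode(component):
    """Map a real in [0, 1] to an integer code in [0..32767]."""
    if not 0.0 <= component <= 1.0:
        raise ValueError("component must lie in [0, 1]")
    return round(component * SCALE)

def decode(code):
    """Map an integer code back to a real in [0, 1]."""
    return code / SCALE

codes = [encode(x) for x in (0.0, 0.25, 1.0)]
roundtrip_error = abs(decode(encode(0.84)) - 0.84)
```

With 15 bits per dimension, the round-trip error on any component stays below 1/32767.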

Thematic relatedness

Conceptual vector distance
– Angular distance: $D_A(x, y) = \mathrm{angle}(x, y)$
– $0 \le D_A(x, y) \le \pi$

Interpretation
– if $D_A(x, y) = 0$: $x \parallel y$ (colinear): same idea
– if $D_A(x, y) = \pi/2$: $x \perp y$ (orthogonal): nothing in common
– if $D_A(x, y) = \pi$: $D_A(x, y) = D_A(x, -x)$: $-x$ is the anti-idea of $x$
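
The angular distance can be computed from the dot product as the arc cosine of the cosine similarity. The sketch below uses plain Python lists; the function name `angular_distance` and the 2-dimensional toy vectors are illustrative choices, not taken from the slides.

```python
import math

def angular_distance(x, y):
    """Angle between x and y in radians, in [0, pi]."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cosine)

x = [1.0, 0.0]
same = angular_distance(x, [2.0, 0.0])        # colinear: same idea
unrelated = angular_distance(x, [0.0, 3.0])   # orthogonal: nothing in common
opposite = angular_distance(x, [-1.0, 0.0])   # -x: anti-idea of x
```

Because the angle ignores vector length, two vectors with the same thematic profile but different magnitudes are at distance 0.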

Collection process

Start from a few handcrafted term/meanings/vectors
<do forever> // running constantly on Lafourcade’s Mac
  <choose a word at random (with or without a CV)>
    find NL definitions of its senses (mainly on the Web)
    for each sense definition SD
      analyze SD into linguistic tree TreeDef
      attach existing or null CVs to lexical nodes of TreeDef
      iterate propagation of CVs in TreeDef (ling. rules used here)
        until CV(root) converges or limit of cycle numbers is reached
      CV(sense) ← CV(root(TreeDef))
    use vector distance to arrange the CVs of senses into a binary « discrimination tree »
  </choose>
</do>
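
The last step of the loop above, arranging the sense CVs into a binary discrimination tree, is not detailed on the slide. One plausible way to do it, shown here purely as an assumption, is greedy bottom-up merging: repeatedly join the two closest nodes (by angular distance) under a parent whose vector is their normalized sum. The function names and the toy senses are hypothetical.

```python
import math

def angle(x, y):
    """Angular distance between two non-zero vectors, in radians."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / (nx * ny))))

def normalized_sum(x, y):
    """Unit-length sum of x and y (assumes the sum is not the zero vector)."""
    s = [a + b for a, b in zip(x, y)]
    n = math.sqrt(sum(a * a for a in s))
    return [a / n for a in s]

def build_tree(senses):
    """senses: dict label -> vector. Returns nested tuples (vector, label, left, right)."""
    nodes = [(vec, label, None, None) for label, vec in senses.items()]
    while len(nodes) > 1:
        # Pick the pair of nodes with the smallest angular distance.
        i, j = min(
            ((a, b) for a in range(len(nodes)) for b in range(a + 1, len(nodes))),
            key=lambda p: angle(nodes[p[0]][0], nodes[p[1]][0]),
        )
        left, right = nodes[i], nodes[j]
        parent = (normalized_sum(left[0], right[0]), None, left, right)
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [parent]
    return nodes[0]

# Three invented senses: s1 and s2 are close, s3 is unrelated.
tree = build_tree({
    "s1": [1.0, 0.0, 0.0],
    "s2": [0.9, 0.1, 0.0],
    "s3": [0.0, 0.0, 1.0],
})
```

In this toy case the two close senses end up under a common internal node, and the unrelated sense is split off at the root.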

An example discrimination tree

Status on French CVs

By Dec. 2001
– 64,000 terms
– 210,000 CVs
– Average of 3.3 senses/term

Method
– robot to access web lexicon servers
– large coverage French analyzer by J. Chauché in Sigmart

See more details on http://www.lirmm.fr/~lafourca

Disambiguation in French

Recook the vectors attached to a document tree
– Take a document
– Analyze it with the Sigmart analyzer into ONE possibly big tree (30 pages OK as a unit)
– Use the same process as for processing definitions
– Final CV(root) usable as thematic classifier of the document
– Final CV(lexemes) used as « sense in context »

Place each recooked vector in the discrimination tree
– Walk down the discrimination tree, using vector distance
– Stop at nearest node:
    If leaf node, full disambiguation (relative to available sense set)
    If internal node, partial disambiguation (subset of senses)
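
The walk down the discrimination tree can be sketched as below. The stopping rule is an interpretation of "stop at nearest node": descend to the closest child only while that child is strictly closer to the contextual vector than the current node. The `Node` class, the `place` function, and the toy vectors are hypothetical.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    vector: list
    senses: list                      # sense labels covered by this node
    children: list = field(default_factory=list)

def angle(x, y):
    """Angular distance between two non-zero vectors, in radians."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / (nx * ny))))

def place(context_vector, root):
    """Return the tree node nearest to the recooked (contextual) vector."""
    node = root
    while node.children:
        best = min(node.children, key=lambda c: angle(context_vector, c.vector))
        if angle(context_vector, best.vector) >= angle(context_vector, node.vector):
            break  # the current node is the nearest one: stop here
        node = best
    return node

half = math.sqrt(0.5)
leaf_a = Node([1.0, 0.0], ["a"])
leaf_b = Node([0.0, 1.0], ["b"])
root = Node([half, half], ["a", "b"], [leaf_a, leaf_b])

full = place([0.9, 0.1], root)      # reaches a leaf: full disambiguation
partial = place([1.0, 1.0], root)   # stays on an internal node: partial
```

A result with a single sense label corresponds to full disambiguation; a result with several labels is the subset of senses that remain possible.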

Example with some ambiguities

•The white ants strike rapidly the trusses of the roof

Initialize: attach CVs to lexemes

• The white ants strike rapidly the trusses of the roof

Up / Down propagation of the CVs

Result: sense selection

•The white ants strike rapidly the trusses of the roof

Disambiguation in UNL-French deconversion

Our set-up
– Example input UNL-graph
– Outline of the process

Two usages of DCV (disambiguation with CV)
– Finding the known UW nearest to an unknown UW
– Finding the best French lemma for a given UW

A UNL input graph

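[Figure: UNL graph for the sentence below.
Nodes: score(icl>event,agt>human,fld>sport).@entry.@past.@complete, Ronaldo, head(pof>body), corner, left, goal(icl>thing).
Relation labels on the arcs: agt, ins, plt, obj (twice), mod, pos.]

• “Ronaldo has headed the ball into the left corner of the goal”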

Corresponding UNL-tree with CVs attached: localization DCV

[Figure: the UNL tree, each node labeled with its relation and an attached conceptual vector.]

Root: score(icl>event,agt>human,fld>sport).@entry.@past.@complete
      V = Vevent(score) + Vhuman(score) + Vsport(score)

Other nodes (node: relation, attached vector):
– 1- Ronaldo: agt, V(human)
– head(pof>body): ins, Vbody(head)
– 2- Ronaldo: pos, V(human)
– corner: plt, Vplace(corner)
– left: mod, V(left)
– 1- goal(icl>thing): obj, Vthing(goal) (this node appears twice in the tree)

Result of first step: the « best » UWs

The vector contextualization generalizes both kinds of localization (lexical and cultural).

On each node, the selected UW is the one in the UNL-French database whose vector is closest to the contextualized vector.

Formulas used for up and down propagation:

$\uparrow\; V'(N) = V(N) \oplus \sum_{i=0}^{n} V(n_i)$

$\downarrow\; V'(n_i) = V(n_i) \oplus V(N) \otimes V(n_i)$
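
The two propagation formulas can be sketched in code. The slides do not define the operators, so the following rests on stated assumptions: ⊕ is taken to be a normalized sum, ⊗ a normalized term-by-term product (with a square root on each component), ⊗ binds tighter than ⊕, and all components are non-negative. The function names are hypothetical.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean length (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v] if n > 0 else list(v)

def oplus(x, y):
    """Assumed meaning of the ⊕ operator: normalized sum."""
    return normalize([a + b for a, b in zip(x, y)])

def otimes(x, y):
    """Assumed meaning of the ⊗ operator: normalized term-by-term product."""
    return normalize([math.sqrt(a * b) for a, b in zip(x, y)])

def propagate_up(v_parent, child_vectors):
    """Up: V'(N) = V(N) ⊕ Σ V(n_i)."""
    dims = len(v_parent)
    total = [sum(c[i] for c in child_vectors) for i in range(dims)]
    return oplus(v_parent, total)

def propagate_down(v_parent, v_child):
    """Down: V'(n_i) = V(n_i) ⊕ (V(N) ⊗ V(n_i))."""
    return oplus(v_child, otimes(v_parent, v_child))

# A node with a null CV takes the blend of its children on the way up.
up = propagate_up([0.0, 0.0, 0.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

# On the way down, the child is pulled toward what it shares with its parent.
child = [0.6, 0.8, 0.0]
down = propagate_down([1.0, 0.0, 0.0], child)
```

Under these assumptions, downward propagation reinforces the components a child shares with its parent, which is what lets context select among senses.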

Second step: select the « best » LUs

Depending on the strategy of the generator, a lexical unit (LU) may be
– a lemma
– a whole derivational family (pay, payment, payable…)

Dictionary: <UW, CVdict> → {<LUi, CVi>}

Input: <UW, CVcontext>

Output: LUi with nearest CVi
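
This selection step amounts to a nearest-neighbor lookup among the LUs listed for the UW. The sketch below reuses the meeting(icl>event) example from the problem statement, but the 4-dimensional space [ACTION, DURATION, EVENT, MOMENT] and all the numeric values are invented for illustration, as is the function name `select_lu`.

```python
import math

def angle(x, y):
    """Angular distance between two non-zero vectors, in radians."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / (nx * ny))))

# Hypothetical toy dictionary: UW -> list of (LU, CV) pairs.
# Dimensions (invented): [ACTION, DURATION, EVENT, MOMENT]
dictionary = {
    "meeting(icl>event)": [
        ("réunion",   [0.8, 0.6, 0.1, 0.1]),
        ("rencontre", [0.1, 0.1, 0.8, 0.6]),
    ],
}

def select_lu(uw, cv_context, dictionary):
    """Return the LU whose CV is nearest to the contextual vector of the UW."""
    candidates = dictionary[uw]
    lu, _ = min(candidates, key=lambda pair: angle(cv_context, pair[1]))
    return lu

# A context dominated by EVENT and MOMENT.
choice = select_lu("meeting(icl>event)", [0.2, 0.1, 0.9, 0.5], dictionary)
```

With an event-like context the lookup picks "rencontre"; a context weighted toward ACTION and DURATION would pick "réunion" instead.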

Conclusion

Another case of fruitful integration of symbolic & numerical methods

Further work planned
– integration into the running UNL-FR server
– work on feedback (Pr SU’s line of thought)
    if the user corrects the choice of LU for the chosen UW
    or worse, if the user chooses an LU corresponding to another UW!
    ==> then recompute the vectors by giving more weight to the chosen CVs