6th INTEX Workshop Sofia, Bulgaria 28-30 May 2003
Macro- or Microstructure? Improving the lexical coverage of an electronic dictionary while enriching microstructural information
X. Blanco, A. Catena, S. Fuentes
Autonomous University of Barcelona
Outline
Electronic Dictionaries of Spanish
Macro- or microstructure?
Case studies:
– prefixed forms
– suffixed forms
– cliticized forms
Introduction
[Diagram: Electronic Dictionaries, linked to Indexation, Morphology, HAMT, MT, IR, Search Engines, Syntax, Semantics, Spelling]
Macrostructure
Electronic Dictionary of simple forms: 80 000 entries – 1 000 000 inflectional forms
Electronic Dictionary of compound forms: 250 000 compound nouns, 4 500 compound adverbs
Other Databases: 75 000 Proper Nouns, 260 000 FFF, etc.
Microstructure Samples
guión (script)
• G: N5;4
• M: -
• T: Abst
• C: <textos>
• D: cinéma
• P: R1P1
• R: estándar
• V: -
• N0: Hum
• N1: de <films>
• N2: según <textos: lit>
• Caus1Func0: escribir
• Labor13: adaptar
• S0Labor13: adaptación
• A2Labor13: adaptado de, basado en
• Bon: interesante, apasionante, inteligente
• AntiBon: absurdo, incoherente, aburrido
• Real: leer
• S: script, argumento, intriga, trama
• Fr: scénario
• En: script
• De: Drehbuch
• (...)
Macro- or Microstructure?
Unknown words:
inabatible
inacostumbrado
imborrable
reacostumbrar
reaceptar
complet(o)ísimo
querid(o)ísimo
riquísimo
pastelito
puebl(o)ecito
Unknown words:
dame
dámelo
léelo
léeselo
...
donne-le-lui
Macro- or Microstructure?
[Diagram: Lemma, Microstructural information, Derived lemma]
Unknown simple words & Analyzed tokens (PFX & SFX)
323,072 unknown unigrams in Spanish web pages
68,818 candidates for new simple-word lexical entries
PFX.grf (99) 6,380 analyzed tokens
SFX.grf (54) 271 analyzed tokens
Three different classes of Analyzed tokens
• First case:
– The constructed form is lexicalized
– Then, we need to add a new, independent entry
e.g. prelavado (prewashing), macrofiesta...
Three different classes of Analyzed tokens
• Second case:
– The constructed form is lexically conditioned
– Then, it must be explicitly indicated in the “Lexical Functions” area of the microstructure of the lexical base
e.g. superamigos, ??archiamigos, ??hiperamigos, *maxiamigos
“somos superamigos” = “somos muy amigos”, “somos amigos íntimos” (close friends)
Lexical Function = Magn (close friends, heavy smoker, confirmed bachelor...)
Three different classes of Analyzed tokens
• Third case:
The constructed form is actually a constructed form!
In other words, a lexical unit of the dictionary plus a prefix that expresses a value of the actualization of this lexical unit: tense, aspect, diathesis, negation or quantification.
The bad news: in order to generate reasonable hypotheses, we need to represent some linguistic constraints.
Some linguistic constraints
anteparto,
{ante,ante.PFX+tps+anterioridad},{parto,parto.N1+Abst:ms}
antepuerto,
{ante,ante.PFX+posición},{puerto,puerto.N1+Loc:ms}
Pattern: ex + Nhum<post>:
• exrector,{ex,ex.PFX+tps+anterioridad},{rector,rector.N51:ms}
• expresidente,{ex,ex.PFX+tps+anterioridad},{presidente,presidente.N50:ms}
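The constraint shown above, where the reading of ante depends on the semantic class of the base noun, might be sketched in Python as follows. This is only an illustration and not the INTEX graphs (PFX.grf) used in the talk: the tables `LEXICON` and `PREFIX_READINGS` and the function `analyze` are hypothetical names, filled with the two ante examples from this slide.

```python
# Hypothetical mini-lexicon: lemma -> (grammatical code, semantic class, features).
# Entries are copied from the slide examples.
LEXICON = {
    "parto": ("N1", "Abst", "ms"),
    "puerto": ("N1", "Loc", "ms"),
}

# Hypothetical constraint table: (prefix, semantic class of the base) -> prefix tag.
PREFIX_READINGS = {
    ("ante", "Abst"): "PFX+tps+anterioridad",
    ("ante", "Loc"): "PFX+posición",
}


def analyze(token):
    """Return analyses of `token` as prefix + base, in the slide's notation."""
    analyses = []
    for (prefix, sem_class), tag in PREFIX_READINGS.items():
        if not token.startswith(prefix):
            continue
        base = token[len(prefix):]
        entry = LEXICON.get(base)
        if entry is None:
            continue  # the remainder is not a known lexical unit
        code, base_class, features = entry
        if base_class != sem_class:
            continue  # this reading of the prefix does not fit the base
        analyses.append(
            "{%s,%s.%s},{%s,%s.%s+%s:%s}"
            % (prefix, prefix, tag, base, base, code, base_class, features)
        )
    return analyses


print(analyze("anteparto"))
print(analyze("antepuerto"))
```

Only the reading compatible with the class of the base survives, so anteparto receives the temporal analysis and antepuerto the positional one.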
Prefixes: Tense
ante,PFX+temporal+anterioridad
ex,PFX+temporal+anterioridad
neo,PFX+temporal+posterioridad
pre,PFX+temporal+anterioridad
pos,PFX+temporal+posterioridad
post,PFX+temporal+posterioridad
Prefixes: Aspect & Aktionsart
re,PFX+modo de acción+iteración
sobre,PFX+modo de acción+nimifactivo
sub,PFX+modo de acción+refactivo
Prefixes: Diathesis
auto,PFX+diatesis+reflexividad
co,PFX+diatesis+comitativo
entre,PFX+diatesis+reciprocidad
Prefixes: Negation
anti,PFX+negación+oposición
contra,PFX+negación+oposición
des,PFX+negación+privación
des,PFX+negación+reversión
in,PFX+negación
sin,PFX+negación+privación
Prefixes: Quantification
bi,PFX+cuantificación
cuatri,PFX+cuantificación
deca,PFX+cuantificación
dodeca,PFX+cuantificación
endeca,PFX+cuantificación
enea,PFX+cuantificación
hecto,PFX+cuantificación
hepta,PFX+cuantificación
hexa,PFX+cuantificación
mono,PFX+cuantificación
multi,PFX+cuantificación
octa,PFX+cuantificación
octo,PFX+cuantificación
penta,PFX+cuantificación
pluri,PFX+cuantificación
poli,PFX+cuantificación
sex,PFX+cuantificación
tetra,PFX+cuantificación
tri,PFX+cuantificación
uni,PFX+cuantificación
Prefixes: Lexical Functions
ante,PFX+FL+Loc
anti,PFX+FL+Loc
archi,PFX+FL+Magn
casi,PFX+FL+AntiVer
circun,PFX+FL+Loc
cuasi,PFX+FL+AntiVer
endo,PFX+FL+Loc
entre,PFX+FL+Loc
epi,PFX+FL+Loc
equi,PFX+FL
exo,PFX+FL+Loc
extra,PFX+FL+Loc
extra,PFX+FL+Magn
hetero,PFX+FL
hiper,PFX+FL+Magn
hipo,PFX+FL+AntiMagn
homo,PFX+FL
infra,PFX+FL+AntiMagn
infra,PFX+FL+Loc
inter,PFX+FL+Loc
intra,PFX+FL+Loc
intro,PFX+FL+Loc
iso,PFX+FL+Magn
macro,PFX+FL+Magn
mal,PFX+FL+Magn
maxi,PFX+FL+Magn
medio,PFX+FL+Magn
mega,PFX+FL+Magn
meta,PFX+FL+Magn
micro,PFX+FL+Magn
mini,PFX+FL+Magn
para,PFX+FL+Magn
peri,PFX+FL+Magn
post,PFX+FL+Loc
pre,PFX+FL+Loc
pro,PFX+FL+Loc
pseudo,PFX+FL+AntiVer
re,PFX+FL+Loc
re,PFX+FL+Magn
requete,PFX+FL+Magn
retro,PFX+FL+Loc
seudo,PFX+FL+AntiVer
so,PFX+FL+Magn
sobre,PFX+FL+Loc
sobre,PFX+FL+Magn
soto,PFX+FL+Loc
sub,PFX+FL+AntiMagn
sub,PFX+FL+Loc
super,PFX+FL+Loc
super,PFX+FL+Magn
supra,PFX+FL+Loc
supra,PFX+FL+Loc
trans,PFX+FL+Loc
tras,PFX+FL+Loc
ultra,PFX+FL+Loc
ultra,PFX+FL+Magn
vice,PFX+FL+Loc
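Several prefixes appear more than once in these lists (re, sobre, sub, super, ultra...), so a lookup has to return every reading. The following Python sketch shows one way such entries could be loaded; the function name `load_prefixes` and the `ENTRIES` sample are assumptions made for illustration, with lines taken from the prefix slides.

```python
from collections import defaultdict

# A few entries from the prefix slides, in the same "form,PFX+feature+..." layout.
ENTRIES = """\
re,PFX+modo de acción+iteración
re,PFX+FL+Loc
re,PFX+FL+Magn
super,PFX+FL+Loc
super,PFX+FL+Magn
archi,PFX+FL+Magn
"""


def load_prefixes(text):
    """Map each prefix to the list of feature lists found in its entries."""
    table = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        form, _, tags = line.partition(",")
        # The first tag is the category (PFX); the rest are its features.
        _category, *features = tags.split("+")
        table[form].append(features)
    return dict(table)


prefixes = load_prefixes(ENTRIES)
print(prefixes["re"])
```

Keeping the readings as a list per prefix leaves the choice between them to later constraints, such as the semantic class of the base.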
Appreciative Suffixes
acha,.SFX+peyorativo
acho,.SFX+peyorativo
aco,.SFX+peyorativo
aja,.SFX+peyorativo
ajo,.SFX+peyorativo
al,.SFX+aumentativo
ales,.SFX+peyorativo
alla,.SFX+peyorativo
anga,.SFX+peyorativo
ángana,.SFX+peyorativo
ángano,.SFX+peyorativo
ango,.SFX+peyorativo
astra,.SFX+peyorativo
astre,.SFX+peyorativo
astro,.SFX+peyorativo
aza,.SFX+aumentativo
azo,.SFX+aumentativo
ecita,.SFX+diminutivo
ecito,.SFX+diminutivo
eja,.SFX+diminutivo
ejo,.SFX+diminutivo
engue,.SFX+peyorativo
eta,.SFX+diminutivo
ete,.SFX+diminutivo
ica,.SFX+diminutivo
ico,.SFX+diminutivo
illa,.SFX+diminutivo
illo,.SFX+diminutivo
ín,.SFX+diminutivo
ina,.SFX+diminutivo
ingo,.SFX+peyorativo
ingue,.SFX+peyorativo
ita,.SFX+diminutivo
ito,.SFX+diminutivo
ón,.SFX+aumentativo
ona,.SFX+aumentativo
orio,.SFX+peyorativo
orra,.SFX+peyorativo
orrio,.SFX+peyorativo
orro,.SFX+peyorativo
ota,.SFX+aumentativo
ote,.SFX+aumentativo
uca,.SFX+peyorativo
ucha,.SFX+peyorativo
ucho,.SFX+peyorativo
uco,.SFX+peyorativo
uda,.SFX+aumentativo
udo,.SFX+aumentativo
uela,.SFX+diminutivo
uelo,.SFX+diminutivo
uja,.SFX+peyorativo
ujo,.SFX+peyorativo
ute,.SFX+peyorativo
uza,.SFX+peyorativo
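As a rough illustration of how a suffix list like this one could be used to analyze unknown words such as pastelito or puebl(o)ecito, here is a small Python sketch. It is not the SFX.grf graph from the talk: `SUFFIXES` holds only a handful of the entries above, `LEXICON` is a hypothetical set of known lemmas, and restoring a dropped final vowel by trying "o", "a" and "e" is a simplifying assumption.

```python
# A small subset of the appreciative suffixes listed above.
SUFFIXES = {
    "ecito": "SFX+diminutivo",
    "ito": "SFX+diminutivo",
    "ita": "SFX+diminutivo",
    "azo": "SFX+aumentativo",
    "ucho": "SFX+peyorativo",
}

# Hypothetical set of lemmas already present in the dictionary.
LEXICON = {"pastel", "pueblo"}


def analyze_suffixed(token):
    """Return (base, suffix, tag) triples for `token`, longest suffix first."""
    analyses = []
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if not token.endswith(suffix):
            continue
        stem = token[: -len(suffix)]
        # The base may have lost its final vowel: puebl(o) + ecito.
        for base in (stem, stem + "o", stem + "a", stem + "e"):
            if base in LEXICON:
                analyses.append((base, suffix, SUFFIXES[suffix]))
    return analyses


print(analyze_suffixed("pastelito"))
print(analyze_suffixed("pueblecito"))
```

Forms such as riquísimo would need additional orthographic rules (rico, with c becoming qu), which this sketch does not attempt.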
Clitic Pronouns
• cantar,W
• cantando,G
• canta,Y:2s
• cante,Y:3s
• cantemos,Y:1p
• cantad,Y:2p
• canten,Y:3p
cantarme
cantándome
cántame
...
Clitic Pronouns
la,.CLIT+N1(3fs)
las,.CLIT+N1(3fp)
le,.CLIT+N1/N2(3s)
les,.CLIT+N1/N2(3p)
lo,.CLIT+N1(3ms)
los,.CLIT+N1(3mp)
me,.CLIT+N1/N2(1s)
mela,.CLIT+N2(1s)_N1(fs)
melas,.CLIT+N2(1s)_N1(fp)
melo,.CLIT+N2(1s)_N1(ms)
melos,.CLIT+N2(1s)_N1(mp)
nos,.CLIT+N1/N2(1p)
nosla,.CLIT+N2(1p)_N1(fs)
noslas,.CLIT+N2(1p)_N1(fp)
noslo,.CLIT+N2(1p)_N1(ms)
noslos,.CLIT+N2(1p)_N1(mp)
os,.CLIT+N1/N2(2p)
osla,.CLIT+N2(2p)_N1(fs)
oslas,.CLIT+N2(2p)_N1(fp)
oslo,.CLIT+N2(2p)_N1(ms)
oslos,.CLIT+N2(2p)_N1(mp)
se,.CLIT+Pron/N1/N2-0
sela,.CLIT+N2(3)_N1(fs)
selas,.CLIT+N2(3)_N1(fp)
sele,.CLIT+Pron_N1/N2(3s)
seles,.CLIT+Pron_N1/N2(3p)
selo,.CLIT+N2(3)_N1(ms)
selos,.CLIT+N2(3)_N1(mp)
seme,.CLIT+Pron_N1/N2(1s)
senos,.CLIT+Pron_N1/N2(1p)
seos,.CLIT+Pron_N1/N2(2p)
sete,.CLIT+Pron_N1/N2(2s)
te,.CLIT+N1/N2(2s)
tela,.CLIT+N2(2s)_N1(fs)
telas,.CLIT+N2(2s)_N1(fp)
telo,.CLIT+N2(2s)_N1(ms)
telos,.CLIT+N2(2s)_N1(mp)
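To make the idea concrete, the following Python sketch splits a cliticized form such as dámelo or cantándome into a verb form and a clitic group. It is a simplified assumption of how the analysis could work, not the authors' INTEX implementation: `CLITICS` contains only four entries from the list above, `VERB_FORMS` is a hypothetical table using the W (infinitive), G (gerund) and Y (imperative) codes from the earlier slide, and the written accent is simply removed from the host, which would be wrong for verb forms that carry their own accent.

```python
# A few clitic groups from the list above.
CLITICS = {
    "me": "CLIT+N1/N2(1s)",
    "lo": "CLIT+N1(3ms)",
    "melo": "CLIT+N2(1s)_N1(ms)",
    "selo": "CLIT+N2(3)_N1(ms)",
}

# Hypothetical table of verb forms that accept enclitics: form -> code.
VERB_FORMS = {
    "cantar": "W",
    "cantando": "G",
    "canta": "Y",
    "da": "Y",
    "lee": "Y",
}

# Assumes precomposed (NFC) characters; only acute accents are removed.
ACUTE_TO_PLAIN = str.maketrans("áéíóú", "aeiou")


def segment(token):
    """Return (host, code, clitic, clitic_tag) tuples for a cliticized form."""
    analyses = []
    for clitic in sorted(CLITICS, key=len, reverse=True):
        if not token.endswith(clitic):
            continue
        # The accent on the host is orthographic (dá + melo), so drop it.
        host = token[: -len(clitic)].translate(ACUTE_TO_PLAIN)
        if host in VERB_FORMS:
            analyses.append((host, VERB_FORMS[host], clitic, CLITICS[clitic]))
    return analyses


for word in ("dame", "dámelo", "léeselo", "cantándome"):
    print(word, segment(word))
```

Because the clitic tag records which argument positions (N1, N2) are filled, the result can be passed on to the syntactic description of the verb.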
Verbal Structures
discurrir#1/N0:Tps/Fr:s’écouler/En:to pass
discurrir#2/N0:Hum/N1:Prép Loc/Fr:parcourir/En:to wander
discurrir#3/N0:Inc<líquidos>/N1:Prép Loc/Fr:couler/En:to flow
discurrir#4/N0:Loc<vías fluv.>/N1:Prép Loc/Fr:couler/En:to flow
discurrir#5/N0:Hum/N1:Abst/Fr:inventer/En:to think up
discurrir#6/N0:Hum/N1:sobre Abst/Fr:réfléchir/En:to think
discurrir#7/N0:Hum/N1:sobre Abst/N2:con Hum/En:to discourse
e.g.:
Así que tendréis que discurrirlo vosotros => discurrir#5
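One plausible reading of this example is that the clitic lo fills a direct (non-prepositional) N1, and discurrir#5 is the only structure offering one. The Python sketch below encodes that reasoning under explicit assumptions: the helper names `parse_structure` and `senses_with_direct_n1` are hypothetical, and an N1 is treated as direct when its value is a bare class label with no preposition in front of it.

```python
# The seven structures from the slide, in the same slash-separated layout.
STRUCTURES = [
    "discurrir#1/N0:Tps/Fr:s’écouler/En:to pass",
    "discurrir#2/N0:Hum/N1:Prép Loc/Fr:parcourir/En:to wander",
    "discurrir#3/N0:Inc<líquidos>/N1:Prép Loc/Fr:couler/En:to flow",
    "discurrir#4/N0:Loc<vías fluv.>/N1:Prép Loc/Fr:couler/En:to flow",
    "discurrir#5/N0:Hum/N1:Abst/Fr:inventer/En:to think up",
    "discurrir#6/N0:Hum/N1:sobre Abst/Fr:réfléchir/En:to think",
    "discurrir#7/N0:Hum/N1:sobre Abst/N2:con Hum/En:to discourse",
]


def parse_structure(line):
    """Turn 'verb#n/key:value/...' into a dict with an extra 'id' key."""
    head, *fields = line.split("/")
    entry = {"id": head}
    for field in fields:
        key, _, value = field.partition(":")
        entry[key] = value
    return entry


def senses_with_direct_n1(lines):
    """Keep the senses whose N1 is a bare class label (assumed to be direct)."""
    selected = []
    for line in lines:
        entry = parse_structure(line)
        n1 = entry.get("N1")
        # "Prép Loc" and "sobre Abst" contain a preposition; "Abst" does not.
        if n1 is not None and " " not in n1:
            selected.append(entry["id"])
    return selected


print(senses_with_direct_n1(STRUCTURES))
```

A fuller treatment would also check the semantic class of the clitic's antecedent against the class required by N1.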
Conclusions and Perspectives
• A unigram can convey, in a cumulative form, information about actualization, lexical functions and argument structure.
• The mechanism is, in some way, recursive: auto-des-programar, auto-des-program-able...
mega-rollitos de primavera...
We need an integrated description of morphology, syntax and semantics!
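The recursive character of the mechanism can be illustrated with a short Python sketch that peels prefixes off a token until a known lexical unit remains. The names `PREFIXES`, `LEXICON` and `decompose` are hypothetical, the prefix tags are taken from the earlier slides, and suffixation (as in auto-des-program-able) is left out.

```python
# Three prefixes from the earlier slides, with one tag each.
PREFIXES = {
    "auto": "PFX+diatesis+reflexividad",
    "des": "PFX+negación+reversión",
    "re": "PFX+modo de acción+iteración",
}

# Hypothetical set of lemmas already present in the dictionary.
LEXICON = {"programar"}


def decompose(token):
    """Return every split of `token` into zero or more prefixes plus a known base."""
    results = []
    if token in LEXICON:
        results.append([token])
    for prefix in PREFIXES:
        rest = token[len(prefix):]
        if token.startswith(prefix) and rest:
            # Recurse: the remainder may itself be a prefixed form.
            for tail in decompose(rest):
                results.append([prefix] + tail)
    return results


print(decompose("autodesprogramar"))
```

Each prefix in the returned chain carries its own tag in `PREFIXES`, so the actualization values accumulate along the decomposition.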