lrec 2010 presentation

The Dictionary of Italian Collocations: Design and Integration in an Online Learning Environment Stefania Spina University for Foreigners Perugia, Italia

Post on 12-Jul-2015


Page 1: LREC 2010 presentation

The Dictionary of Italian Collocations: Design and Integration in an Online Learning Environment

Stefania Spina

University for Foreigners Perugia, Italia

Page 2: LREC 2010 presentation

The Dictionary of Italian Collocations

LREC 2010 - Stefania Spina - The Dictionary of Italian Collocations

Part of the APRIL project (“Personalised web environment for language learning”)

NLP resources as a support for the lexical competence of students of Italian within a Virtual Learning Environment (VLE).

Page 3: LREC 2010 presentation

Presentation outline

background and motivation

reference corpus

methodology

dictionary compilation

integration within VLE

Page 4: LREC 2010 presentation

Background

Collocations have different syntactic and semantic profiles, but share prototypical features:

1. semantic non-compositionality

2. non-substitutability of components by semantically similar words

3. non-insertion of external items

a continuum rather than definite categories

Page 5: LREC 2010 presentation

Continuum

For each feature, a fixed expression is contrasted with a freer combination:

semantic non-compositionality: tagliare la corda “run away” vs. aprire la porta “open the door”

non-substitutability: camera oscura “dark room” (* stanza oscura) vs. {fare|porre|rivolgere|formulare} una domanda “ask a question”

insertion of external items: sistema *molto operativo “operating system” vs. fare una lunga calda riposante doccia “take a long, hot, restful shower”

Page 6: LREC 2010 presentation

Motivation: collocations in SLA

improving learners' fluency

non-native speakers and L2 vocabulary: first single words, then more extended chunks

tendency to overuse the creative combination of isolated words

Sinclair’s open choice principle

Examples from Italian learner corpora:

preoccupata per il corso che mi mette nelle difficoltà (Russia); expected: mettere in difficoltà “cause problems”

e poi alla fine ho fatto questa decisione (Vietnam); expected: prendere una decisione “make a decision”

Page 7: LREC 2010 presentation

DICI

collocations require specific pedagogical attention

Dictionary of Italian Collocations (DICI):

it is corpus-based;

it is a learner-oriented tool: a list of the most common Italian collocations, classified on a frequency basis;

it is also based on statistical methodologies (dispersion in the different textual genres represented in the corpus).

Page 8: LREC 2010 presentation

Reference corpus

Perugia corpus: POS-tagged, lemmatized

Textual genre                 N. of words
fiction                       3 million
non-fiction                   2 million
web                           5 million
academic prose                1 million
press                         3 million
language of administration    1 million
television programs           1 million
spoken texts                  2 million
TOTAL                         18 million

Page 9: LREC 2010 presentation

POS filtering

Analysis of an existing list of collocations:

150 different POS sequences

10 most productive POS sequences:

POS sequence    Example
ADJ ADV N       nudo come un verme "as naked as a worm"
ADJ CONG ADJ    bianco e nero "black and white"
ADJ N           terzo mondo "third world"
N ADJ           cassa comune "common fund"
N CONG N        andata e ritorno "back and forth"
N N             caso limite "borderline case"
N PRE N         abito da sera "evening dress"
V ADJ           stare zitto "keep quiet"
V ART N         fare la doccia "take a shower"
V N             avere paura "be afraid"

Page 10: LREC 2010 presentation

Experimental methodology: 4 steps

1. extraction of candidate collocations from corpus;

2. filtering of the candidate collocations: frequency and dispersion;

3. compilation of the dictionary;

4. integration of the dictionary with the online learning environment.

6 POS sequences: ADJ CONG ADJ, N CONG N, N N, N PRE N, V ART N, V N

12-million-word sample, 4 sections: fiction, press, academic prose, web

Page 11: LREC 2010 presentation

Collocations extraction

via IMS Corpus Workbench

removing all the candidates with frequency = 1

41643 candidate collocations

Two more filters:

Dispersion

Manual (removal of non-collocations)
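
The extraction and frequency filter described above can be illustrated with a minimal Python sketch. It is not the IMS Corpus Workbench procedure used for the dictionary: the function name `extract_candidates`, the `(lemma, pos)` input format and the toy data are assumptions made for illustration, and sentence boundaries are ignored. Only the six POS sequences and the removal of candidates with frequency = 1 come from the slides.

```python
from collections import Counter

# The six POS sequences used for the 12-million-word sample (tag names as on the slides).
POS_SEQUENCES = [
    ("ADJ", "CONG", "ADJ"),
    ("N", "CONG", "N"),
    ("N", "N"),
    ("N", "PRE", "N"),
    ("V", "ART", "N"),
    ("V", "N"),
]


def extract_candidates(tagged_tokens, pos_sequences=POS_SEQUENCES, min_freq=2):
    """Count lemma n-grams whose POS tags match one of the given sequences.

    tagged_tokens: list of (lemma, pos) pairs in text order (hypothetical format).
    Returns a Counter of lemma tuples occurring at least min_freq times,
    so the default min_freq=2 drops every candidate with frequency = 1.
    """
    counts = Counter()
    for start in range(len(tagged_tokens)):
        for seq in pos_sequences:
            window = tagged_tokens[start:start + len(seq)]
            if len(window) != len(seq):
                continue  # not enough tokens left for this sequence
            if all(pos == wanted for (_, pos), wanted in zip(window, seq)):
                counts[tuple(lemma for lemma, _ in window)] += 1
    return Counter({ngram: n for ngram, n in counts.items() if n >= min_freq})


# Toy input: "fare la doccia" occurs twice, "avere paura" once.
toy = [
    ("fare", "V"), ("il", "ART"), ("doccia", "N"), ("e", "CONG"),
    ("fare", "V"), ("il", "ART"), ("doccia", "N"),
    ("avere", "V"), ("paura", "N"),
]
candidates = extract_candidates(toy)
```

In the toy input only the V ART N candidate survives, because the V N candidate occurs once and is removed by the frequency filter.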

Page 12: LREC 2010 presentation

Dispersion

Examples of genre-specific collocations:

Aggrottare la fronte “to frown” (fiction)

Vincere le elezioni “to win the elections” (press)

Dare una definizione “to give a definition” (academic prose)

Juilland’s D value (Juilland & Chang-Rodriguez, 1964)

D value combined with frequency = usage

Usage value ≥ 2: 2047 candidate collocations

Manual selection. Final result: list of 1553 word combinations = dictionary entries
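
The slides do not spell out the formulas, so the following Python sketch uses the standard definitions: Juilland's D = 1 - V / sqrt(n - 1), where V is the coefficient of variation (standard deviation divided by mean) of the frequencies in the n corpus sections, and usage U = D × F, where F is the total frequency. The function names are hypothetical. Since the four sections differ in size, the sketch assumes that the sub-frequencies have already been normalised (for example per million words); how frequencies were scaled before applying the threshold of 2 is not stated in the source.

```python
import math


def juilland_d(subfreqs):
    """Juilland's D for a list of per-section frequencies (n >= 2 sections).

    D = 1 - V / sqrt(n - 1), with V = population standard deviation / mean.
    D is 1 for a perfectly even distribution and 0 when all occurrences
    fall in a single section.
    """
    n = len(subfreqs)
    mean = sum(subfreqs) / n
    if mean == 0:
        return 0.0  # the item never occurs: treat dispersion as zero
    sd = math.sqrt(sum((f - mean) ** 2 for f in subfreqs) / n)
    return 1 - (sd / mean) / math.sqrt(n - 1)


def usage(subfreqs):
    """Usage coefficient U = D * F, where F is the total frequency."""
    return juilland_d(subfreqs) * sum(subfreqs)


# Hypothetical normalised frequencies in fiction, press, academic prose, web.
even = [5, 5, 5, 5]      # evenly spread over the four sections
skewed = [20, 0, 0, 0]   # concentrated in one section only
u_even = usage(even)
u_skewed = usage(skewed)
```

With the same total frequency, the evenly spread item keeps its full frequency as usage value, while the item confined to one section gets a usage value close to zero and would be filtered out.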

Page 13: LREC 2010 presentation

Collocations list

Page 14: LREC 2010 presentation

Compilation of the Dictionary

Lexical database enriched with two kinds of data:

Visible to the learner (client output): definition, examples, part-of-speech, syntactic context of occurrence of collocations

To be processed by other applications (server): internal syntactic configuration for automatic recognition

Collocation       Syntactic configuration
Fare la doccia    [V$fare][ADV]? la|una|NUM [ADJ]? [N$doccia]
Abito da sera     [N$abito] da_sera
Alti e bassi      alti_e_bassi
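
How such a configuration could drive automatic recognition can be sketched in Python. The reading of the notation is an assumption inferred from the three examples, not documented behaviour: `[POS$lemma]` is taken as a token with that POS and lemma, `[POS]` as any token with that POS, a trailing `?` as optional, `a|b|NUM` as alternatives (lowercase = word form, uppercase = POS tag), and `da_sera` as a fixed sequence of word forms. All function names and the `(form, pos, lemma)` token format are hypothetical.

```python
import re

FIELD = r"[^/\s]+"  # one field (form, POS or lemma) of an encoded token


def element_to_regex(element):
    """Translate one configuration element into a regex over encoded tokens."""
    optional = element.endswith("?")
    if optional:
        element = element[:-1]
    if element.startswith("[") and element.endswith("]"):
        # [POS$lemma] or [POS]
        pos, _, lemma = element[1:-1].partition("$")
        lemma_re = re.escape(lemma) if lemma else FIELD
        body = rf"{FIELD}/{re.escape(pos)}/{lemma_re} "
    elif "|" in element:
        # alternatives: uppercase = POS tag, otherwise a word form
        alts = []
        for alt in element.split("|"):
            if alt.isupper():
                alts.append(rf"{FIELD}/{re.escape(alt)}/{FIELD} ")
            else:
                alts.append(rf"{re.escape(alt)}/{FIELD}/{FIELD} ")
        body = "(?:" + "|".join(alts) + ")"
    else:
        # fixed word forms joined by underscores, e.g. da_sera
        body = "".join(rf"{re.escape(w)}/{FIELD}/{FIELD} " for w in element.split("_"))
    return f"(?:{body})?" if optional else f"(?:{body})"


def compile_configuration(config):
    """Compile a whole configuration; matches must start at a token boundary."""
    elements = re.findall(r"\[[^\]]+\]\??|[^\s\[\]]+", config)
    return re.compile(r"(?<![^ ])" + "".join(element_to_regex(e) for e in elements))


def encode(tokens):
    """Encode (form, pos, lemma) triples as 'form/POS/lemma ' strings."""
    return "".join(f"{form.lower()}/{pos}/{lemma} " for form, pos, lemma in tokens)


def find_collocation(config, tokens):
    """Return the surface strings of all matches of config in tokens."""
    pattern = compile_configuration(config)
    return [
        " ".join(tok.split("/")[0] for tok in m.group(0).split())
        for m in pattern.finditer(encode(tokens))
    ]


sentence = [
    ("Ho", "V", "avere"), ("fatto", "V", "fare"), ("una", "ART", "uno"),
    ("lunga", "ADJ", "lungo"), ("doccia", "N", "doccia"),
]
hits = find_collocation("[V$fare][ADV]? la|una|NUM [ADJ]? [N$doccia]", sentence)
```

Encoding each token as one string keeps the matcher a plain regular expression, which is one simple way to let the optional adverb and adjective slots absorb inserted material such as "lunga".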

Page 15: LREC 2010 presentation

DB integration in the VLE

Virtual Learning Environment: web application specifically devoted to language learning

LELE (Linguistically-Enhanced Learning Environment):

provides language learners with additional NLP resources, in order to improve their linguistic competence

receptive and productive learning activities concerning the recognition and the active use of collocations

Page 16: LREC 2010 presentation

LELE Features

to automatically recognize and highlight multi-word units in written Italian texts;

to show additional linguistic information about the selected collocations;

to generate collocation tests for collocational competence assessment of second or foreign language learners.

Page 17: LREC 2010 presentation

LELE scheme

[Diagram with the components: browser (client), server, LELE, DB + tagger]

Page 18: LREC 2010 presentation
Page 19: LREC 2010 presentation
Page 20: LREC 2010 presentation

Conclusions

Next step:

apply the same methodology to the whole corpus, for all the 10 selected POS sequences

Further research

refine statistical measures

assign collocations to different levels of competence

other tools (productive tasks)

Page 21: LREC 2010 presentation

Stefania Spina

[email protected]

http://april.unistrapg.it