Introduction to Computational Linguistics


Upload: thadeus-atalo

Post on 03-Jan-2016


Page 1: Introduction to Computational Linguistics

Introduction to Computational Linguistics

The Lexicon

Introduction

• An inventory of words is an essential component of programs for a wide variety of language-sensitive applications, such as:
  – Spellchecking, style checking
  – IR, IE, message understanding
  – Parsing, generation, MT
  – TTS and STT

• Such an inventory is usually called a dictionary or lexicon.

Dictionaries

• The purpose of a dictionary is to provide a wide range of information about words

• Some of this is linguistic information, e.g. syntactic category, pronunciation, distribution.

• But dictionaries also contain definitions of word senses, thus providing knowledge not just about language but about the world itself.

What is "dog"?

dog (ANIMAL)
noun [C]
a common four-legged animal, especially kept by people as a pet or to hunt or guard things:
  my pet dog
  wild dogs
  dog food
  We could hear dogs barking in the distance.

(from Cambridge Advanced Learner's Dictionary)

"Dictionary" versus "Lexicon"

• A dictionary is a collection of words.

• A lexicon is a collection of lexemes.

• A lexeme roughly corresponds to a set of words that are different forms of "the same word".

• For example, English run, runs, ran and running are forms of the same lexeme.

• A lexeme can also be regarded as a single word sense of a word.

Senses of Dog

• dog was found in the Cambridge Advanced Learner's Dictionary at the entries listed below.
  – dog (ANIMAL)
  – dog (PERSON)
  – dog (FOLLOW)
  – dog (PROBLEM)

These entries are different senses, or lexemes, for dog.

Two Views of the Lexicon give rise to different issues

• Lexicon as word database
  – How to represent the word collection
  – Access: given an arbitrary word, how to access the relevant entries
  – What information to provide and how to express it

• Lexicon as database about word senses
  – What are the relations between word senses?
  – How do word senses hook up with conceptual knowledge?

Representing the Word Collection

• Some possible representations:
  – Text file, 1 entry per line
  – Finite state automaton
  – Other specialised data structure which allows for common prefixes, e.g. letter tree

• Full form vs. lexeme + morphological analysis

FSA for Sublexicon Fragment

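[Figure: finite state automaton for a sublexicon fragment. Arc labels surviving from the diagram: t, h, e, s, e, i, s, a, t, o]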
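As a rough illustration of the finite state automaton representation, here is a minimal Python sketch. The diagram itself did not survive extraction, so the word list (the, these, this, that, those) is an assumption, chosen only because it is consistent with the surviving arc labels. The function names `build_fsa` and `accepts` are hypothetical. The automaton shares states for common prefixes but is not minimised.

```python
# Deterministic finite state automaton that accepts a small word list.
# ASSUMPTION: the word list is a guess that merely fits the arc labels
# left over from the slide's figure.
WORDS = ["the", "these", "this", "that", "those"]


def build_fsa(words):
    """Return (transitions, final_states) for an automaton accepting `words`.

    States are integers, 0 is the start state, and `transitions` maps
    (state, letter) -> next state. Words sharing a prefix share states.
    """
    transitions = {}
    final_states = set()
    next_state = 1
    for word in words:
        state = 0
        for letter in word:
            key = (state, letter)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        final_states.add(state)
    return transitions, final_states


def accepts(fsa, word):
    """True if the automaton is in a final state after reading `word`."""
    transitions, final_states = fsa
    state = 0
    for letter in word:
        key = (state, letter)
        if key not in transitions:
            return False
        state = transitions[key]
    return state in final_states


FSA = build_fsa(WORDS)
print(accepts(FSA, "these"))  # True
print(accepts(FSA, "thes"))   # False: a valid prefix, but not a final state
```

Lookup cost depends on the length of the word, not on the number of entries, which is one reason this representation suits large word lists.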

Letter Tree

ltree([
  [b, [a, [r, [k, bark]]]],
  [c, [a, [r, [r, [y, carry]]],
          [t, cat, [e, [g, [o, [r, [y, category]]]]]]]],
  [d, [e, [l, [a, [y, delay]]]]],
  [h, [e, [l, [p, help]]],
      [o, [p, hop, [e, hope]]]],
  [q, [u, [a, [r, [r, [y, quarry]]]],
          [i, [z, quiz]],
          [o, [t, [e, quote]]]]]
]).

Informal Definition of a Letter Tree

• A tree is a list of branches

• Each branch is a list
  – whose first element is a letter
  – whose remaining elements are either
    • another branch, or
    • a lexical entry for a word
  – These elements are in a specific order. Lexical entry (if any) comes first, and branches are in alphabetical order by their first letters.

Branch representing cat, category and cook

[c, [a, [t, cat,
         [e, [g, [o, [r, [y, category]]]]]]],
    [o, [o, [k, cook]]]]
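The informal definition of a letter tree can be sketched in Python with nested lists mirroring the bracket notation used on the slides. This is only one possible reading: the helper names `find_branch`, `insert` and `lookup` are hypothetical, and a lexical entry is assumed to be a plain string (the word itself).

```python
def find_branch(node, letter):
    """Return the sub-branch of `node` that starts with `letter`, or None."""
    for item in node:
        if isinstance(item, list) and item[0] == letter:
            return item
    return None


def insert(tree, word, entry):
    """Store `entry` for `word`: entry first, branches in alphabetical order."""
    if not word:
        raise ValueError("cannot insert the empty word")
    node = tree
    for letter in word:
        branch = find_branch(node, letter)
        if branch is None:
            branch = [letter]
            # Skip the letter and entry (non-lists), then smaller letters.
            pos = 0
            while pos < len(node) and (
                not isinstance(node[pos], list) or node[pos][0] < letter
            ):
                pos += 1
            node.insert(pos, branch)
        node = branch
    # `node` is now the branch for the last letter of the word.
    if len(node) > 1 and not isinstance(node[1], list):
        node[1] = entry          # overwrite an existing entry
    else:
        node.insert(1, entry)    # entry goes directly after the letter


def lookup(tree, word):
    """Return the lexical entry stored for `word`, or None."""
    if not word:
        return None
    node = tree
    for letter in word:
        node = find_branch(node, letter)
        if node is None:
            return None
    if len(node) > 1 and not isinstance(node[1], list):
        return node[1]
    return None


tree = []
for w in ["cook", "category", "cat"]:
    insert(tree, w, w)
print(tree)
```

Inserting the three words in any order yields the same single `c` branch as the one shown above, because the ordering rule fixes the position of each element.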

Full Form Dictionary

• There is an entry for every possible word.

• No need for morphological processing.

• Exceptions are handled automatically.

• OK when number of entries is not too large.

• Repeated information.

• Because languages have different morphological properties, full form is better for some languages than for others.

Morphological Analysis + Lexicon

[Diagram: the input word "cats" is passed to Morphological Analysis, which consults the LEXICON.]

LEXICON
  cat  N
  s    PL
  s    3SG
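A minimal sketch of how the lexeme lexicon and morphological analysis might cooperate on the input "cats". The lexicon entries come from the slide. The `ATTACHES_TO` table (which stem category each suffix reading requires) and the function name `analyse` are assumptions added for illustration.

```python
# Lexicon entries as on the slide: stems and suffixes with their labels.
STEMS = {"cat": "N"}
SUFFIXES = {"s": ["PL", "3SG"]}

# ASSUMPTION (not on the slide): the stem category each suffix reading needs.
ATTACHES_TO = {"PL": "N", "3SG": "V"}


def analyse(word):
    """Return all analyses of `word` as lists of (morpheme, label) pairs."""
    analyses = []
    if word in STEMS:
        analyses.append([(word, STEMS[word])])
    # Try every split of the word into stem + suffix.
    for cut in range(1, len(word)):
        stem, suffix = word[:cut], word[cut:]
        if stem in STEMS and suffix in SUFFIXES:
            for label in SUFFIXES[suffix]:
                if ATTACHES_TO[label] == STEMS[stem]:
                    analyses.append([(stem, STEMS[stem]), (suffix, label)])
    return analyses


print(analyse("cats"))  # [[('cat', 'N'), ('s', 'PL')]]
```

Only the plural reading survives for "cats", because in this sketch the 3SG suffix is restricted to verb stems and the lexicon lists cat only as a noun.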

Morphological Analysis

• Very roughly, morphological analysis of a word involves 2 subproblems:

• A segmentation problem: how to get from the written text to the sequence of morphemes that make it up.

• A morphotactic problem: how to combine the individual morphemes together in a legitimate way.

Segmentation/Morphotactic Subproblems

• Segmentation problem:
  – enlargement => en + large + ment

• Morphotactic problem: given what we know about en, large and ment, how can they be legitimately combined?
  – enlargement => (en + large) + ment
  – enlargement =/> en + (large + ment)
  – en + ADJ => V
  – V + ment => N
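The two subproblems can be sketched separately in Python. The two combination rules are taken from the slide. Treating en and ment as their own categories, and the function names `segment` and `parses`, are assumptions of this sketch.

```python
# Known morphemes and their categories. ASSUMPTION: the affixes "en" and
# "ment" simply act as their own category labels.
MORPHEMES = {"en": "en", "large": "ADJ", "ment": "ment"}

# Combination rules from the slide: (left, right) -> resulting category.
RULES = {("en", "ADJ"): "V", ("V", "ment"): "N"}


def segment(word):
    """Segmentation: all ways of splitting `word` into known morphemes."""
    if not word:
        return [[]]
    results = []
    for cut in range(1, len(word) + 1):
        head = word[:cut]
        if head in MORPHEMES:
            for tail in segment(word[cut:]):
                results.append([head] + tail)
    return results


def parses(morphs):
    """Morphotactics: all (bracketing, category) pairs licensed by RULES."""
    if len(morphs) == 1:
        return [(morphs[0], MORPHEMES[morphs[0]])]
    found = []
    for split in range(1, len(morphs)):
        for left, left_cat in parses(morphs[:split]):
            for right, right_cat in parses(morphs[split:]):
                result = RULES.get((left_cat, right_cat))
                if result is not None:
                    found.append(((left, right), result))
    return found


morphs = segment("enlargement")[0]
print(morphs)          # ['en', 'large', 'ment']
print(parses(morphs))  # [((('en', 'large'), 'ment'), 'N')]
```

The bracketing en + (large + ment) is rejected because no rule combines ADJ with ment, which matches the =/> line above.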

2-Level Morphology

• In 1981 the four Ks (Kimmo Koskenniemi, Lauri Karttunen, Ronald M. Kaplan and Martin Kay) were working on morphological analysis (MA).

• Basic idea was that MA is about computing a relation between sets of strings at two levels:
  – Surface Level (string of lexical words made from surface alphabet)
  – Lexical Level (string of morphemes made of lexical alphabet)

• Relation can be computed using finite state transducers.

• Finite-state model is reversible: the same transducer can be used for analysis and for generation.
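To make the idea of a reversible relation concrete, here is a toy finite state transducer in Python. It is a hypothetical example, not Koskenniemi's rule formalism: each arc pairs one lexical symbol with one surface symbol, and the tag `+PL`, the arc table and the function names are all assumptions.

```python
# Arcs of a toy transducer: (state, lexical symbol, surface symbol, next state).
# ASSUMPTION: "+PL" is an invented lexical-level tag for the plural morpheme.
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+PL", "s", 4),
]
FINAL = {3, 4}
LEXICAL, SURFACE = 1, 2  # positions of the two levels inside an arc


def run(symbols, source, target):
    """Match `symbols` on the `source` level and collect the `target` level."""
    results = []

    def walk(state, pos, out):
        if pos == len(symbols):
            if state in FINAL:
                results.append(out)
            return
        for arc in ARCS:
            if arc[0] == state and arc[source] == symbols[pos]:
                walk(arc[3], pos + 1, out + [arc[target]])

    walk(0, 0, [])
    return results


def to_surface(lexical):
    """Generation: lexical level -> surface level."""
    return run(lexical, LEXICAL, SURFACE)


def to_lexical(surface):
    """Analysis: surface level -> lexical level."""
    return run(surface, SURFACE, LEXICAL)


print(to_surface(["c", "a", "t", "+PL"]))  # [['c', 'a', 't', 's']]
print(to_lexical(list("cats")))            # [['c', 'a', 't', '+PL']]
```

Both directions use the same arc table. Only the choice of which level is read and which is written changes, which is the sense in which the model is reversible.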

What Information to Provide

• Specific Information – e.g. "kicks"

• Syntactic Information
  – POS = verb
  – Tense = pres
  – Number = singular
  – Person = 3
  – Type = Transitive

• Semantic Information
  – event-type = Physical Action
  – type-of subject = animate
  – type-of object = physical

What Information to Provide

• General Information

• Class Attributes
  – Agreement has (Number, Gender)

• Enumeration of possible values
  – Gender = [masc, fem]
  – Number = [sing, plur]

• Class Relationships
  – Transitive isa Verb
  – Common isa Noun
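One possible way to express the specific and general information from these two slides is as plain Python dictionaries. The attribute names and values are copied from the slides. The layout and the helper `is_a` are assumptions of this sketch.

```python
# General information: class attributes, value enumerations, relationships.
CLASS_ATTRIBUTES = {"Agreement": ["Number", "Gender"]}
VALUES = {"Gender": ["masc", "fem"], "Number": ["sing", "plur"]}
ISA = {"Transitive": "Verb", "Common": "Noun"}

# Specific information for the word form "kicks".
KICKS = {
    "syntactic": {
        "POS": "verb",
        "Tense": "pres",
        "Number": "singular",
        "Person": 3,
        "Type": "Transitive",
    },
    "semantic": {
        "event-type": "Physical Action",
        "type-of subject": "animate",
        "type-of object": "physical",
    },
}


def is_a(cls, ancestor):
    """True if `cls` equals `ancestor` or reaches it through isa links."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = ISA.get(cls)
    return False


print(is_a(KICKS["syntactic"]["Type"], "Verb"))  # True
```

Keeping the class relationships in one shared table means an individual entry such as "kicks" only has to name its most specific class.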

Two Views of the Lexicon give rise to different issues

• Lexicon as word database
  – How to represent the word collection
  – Access: given an arbitrary word, how to access the relevant entries
  – What information to provide and how to express it

• Lexicon as database about word senses
  – What are the relations between word senses?
  – How do word senses hook up with conceptual knowledge?

WordNet

• In 1985 a group of psychologists and linguists at Princeton had the idea of searching dictionaries conceptually rather than alphabetically.

• Attempt to organise a dictionary in terms of word meanings rather than word forms.

• What is the nature and organisation of the lexicalised concepts that words can express?

• Distinction between word forms, word meanings, and entries.

Lexical Matrix

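Word Meanings | Word Forms
              | F1     F2     ..   Fn
M1            | E1,1   E1,2
M2            | E2,1
..            |
Mm            |                    Em,n

Labels on the figure:
  – synonymy: row M1, where F1 and F2 both express M1
  – polysemy: column F1, where F1 expresses both M1 and M2
  – entries: the cells Ei,j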

WordNet

• A key aspect of WordNet is that a given meaning or word sense is represented as the set of words that can be used to express it.

• These meanings are called synsets – sets of words with synonymous readings.

• Synsets are established empirically according to a principle of substitutability that is relativised to context.

The Principle of Substitutability

• Two expressions are synonymous if the substitution of one for another never alters the truth value of a sentence in which the substitution is made.

• Two expressions are synonymous in linguistic context C if the substitution of one for the other in C does not alter the truth value.

• e.g. plank/board in carpentry contexts

Lexical Matrix

Word Meanings        | Word Forms
                     | board   committee   plank   ..   Fn
{board, committee}   | E1,1    E1,2
{board, plank}       | E2,1                E2,3
..                   |
Mm                   |                                  Em,n

(The cells Ei,j are the entries.)
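The board / committee / plank matrix can be written down as a set of (meaning, form) pairs, from which polysemy and synonymy fall out as simple queries. The row names M1 and M2 and all function names are assumptions of this sketch.

```python
# Entries of the lexical matrix as (meaning, form) pairs.
# M1 is the {board, committee} row, M2 the {board, plank} row.
ENTRIES = {
    ("M1", "board"), ("M1", "committee"),
    ("M2", "board"), ("M2", "plank"),
}


def synset(meaning):
    """All word forms that can express `meaning` (one row of the matrix)."""
    return sorted(form for m, form in ENTRIES if m == meaning)


def senses(form):
    """All meanings that `form` can express (one column of the matrix)."""
    return sorted(m for m, f in ENTRIES if f == form)


def is_polysemous(form):
    """A form is polysemous if its column has more than one entry."""
    return len(senses(form)) > 1


def are_synonyms(a, b):
    """Two distinct forms are synonyms if they share at least one row."""
    return a != b and any(m in senses(b) for m in senses(a))


print(synset("M1"))                    # ['board', 'committee']
print(is_polysemous("board"))          # True
print(are_synonyms("board", "plank"))  # True
```

Note that committee and plank are not synonyms of each other even though each is a synonym of board, since they never share a row.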

WordNet

• In WordNet, the synonymy relation between words is fundamental.

• Synsets can be thought of as representing concepts which stand in various semantic relations to each other.
  – X Antonym Y: meaning (synset) X is opposite to meaning (synset) Y (e.g. big, small)
  – X Hyponym Y: like isa (e.g. dog, mammal)
  – X Meronym Y: X is a part of Y (e.g. leg, man)

Lexicon as a Concept Graph

• We can thus imagine the WordNet Lexicon as a gigantic graph whose nodes are synsets and whose arcs are semantic relations between synsets.

• Such a structure can be regarded as a semantic map of the concepts used in a given language.

• Many applications can be created using the WordNet graph as a resource
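A very small concept graph along these lines can be sketched in plain Python. The data is illustrative and is not taken from the WordNet database: the dog/mammal, leg/man and big/small pairs come from the slides, while the mammal/animal arc and all names in the code are assumptions.

```python
from collections import deque


def S(*words):
    """A synset, written as a frozenset of word forms."""
    return frozenset(words)


# Arcs of the graph: (synset X, relation, synset Y).
# ASSUMPTION: the mammal -> animal arc is added only to get a longer chain.
ARCS = [
    (S("dog"), "hyponym", S("mammal")),
    (S("mammal"), "hyponym", S("animal")),
    (S("leg"), "meronym", S("man")),
    (S("big"), "antonym", S("small")),
]


def related(synset, relation):
    """Synsets Y such that `synset` stands in `relation` to Y."""
    return [y for x, r, y in ARCS if x == synset and r == relation]


def is_kind_of(x, y):
    """True if y is reachable from x by following hyponym arcs."""
    queue, seen = deque([x]), {x}
    while queue:
        node = queue.popleft()
        if node == y:
            return True
        for parent in related(node, "hyponym"):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return False


print(is_kind_of(S("dog"), S("animal")))  # True
print(is_kind_of(S("animal"), S("dog")))  # False
```

Applications that use the graph as a resource typically rely on this kind of traversal, for example to test class membership or to measure how far apart two synsets are.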

Using WordNet to Measure Semantic Orientations of Adjectives

Jaap Kamps, Maarten Marx, Robert J. Mokken, Maarten de Rijke

Conclusion

• Lexicon is a central building block of language-sensitive systems

• Schizophrenic status of lexical information: linguistic versus world knowledge.

• As a wordlist, lexicon has to solve problem of representation and access. Morphological analysis can help to keep number of entries to a manageable level.

• As a collection of definitions, lexicon has to deal with relationships between word meanings.