superficial & lexical level 1


Upload: hammett-bentley

Post on 30-Dec-2015

Page 1: Superficial & Lexical level  1

NLP superficial and lexic level 1

Superficial & Lexical level 1

• Superficial level
    • What is a word
• Lexical level
    • Lexicons
    • How to acquire lexical information

Page 2: Superficial & Lexical level  1

Superficial level 1

• Textual pre-processing
    • Getting the document(s)
        • Accessing a database
        • Accessing the Web (wrappers)
    • Getting the textual fragments of a document
        • Multimedia documents, Web pages, ...
    • Filtering out meta-information
        • HTML tags, XML tags, ...
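The step of filtering out markup can be sketched with Python's standard `html.parser` module. This is only a minimal illustration, assuming well-behaved HTML input; the names `TextExtractor` and `strip_markup` are hypothetical and not part of the slides.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags, scripts, and style sheets."""

    SKIPPED = {"script", "style"}  # elements whose content is not document text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_markup(html):
    """Return the textual fragments of an HTML string, without tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(strip_markup("<p>Hello <b>world</b>!</p>"))
```

Real pages usually need more care (encoding detection, boilerplate such as menus and ads), which this sketch does not attempt.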

Page 3: Superficial & Lexical level  1

Superficial level 2

• Text segmentation into paragraphs or sentences (Beeferman et al., 1999; Ratnaparkhi, 1998)
• Tokenization
    • Orthographic vs. grammatical word
• Multiword terms
• Dates, formulas, acronyms, abbreviations, quantities (and units), idioms, ...
• Named entities (Bikel et al., 1999; Borthwick, 1999; Mikheev et al., 1999)
    • NER, NEC, NERC
• Unknown words
• Language identification (Elworthy, 1999; Adams and Resnik, 1997)
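Sentence segmentation and tokenization can be illustrated with a small rule-based sketch. It is not the method of any of the works cited on this slide; the abbreviation list, the regular expression, and the function names are all assumptions chosen for illustration.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; real systems need far more.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc.", "e.g.", "i.e."}

# Alternatives are tried left to right, so special tokens come before plain words.
TOKEN_RE = re.compile(
    r"""
    \d{1,2}/\d{1,2}/\d{2,4}      # dates such as 30/12/2015
  | \d+(?:[.,]\d+)*              # quantities such as 150,000 or 0.4
  | (?:[A-Za-z]\.){2,}           # dotted acronyms such as U.S.A.
  | \w+(?:[-']\w+)*              # words, with internal hyphens or apostrophes
  | [^\w\s]                      # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into orthographic tokens."""
    return TOKEN_RE.findall(text)


def split_sentences(text):
    """Close a sentence at '.', '!' or '?' unless the chunk is a known abbreviation."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        if chunk[-1] in ".!?" and chunk.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Smith arrived. He left at noon."))
print(tokenize("Dr. Smith paid 150,000 euros on 30/12/2015."))
```

The sketch keeps dates and quantities as single tokens, which is one simple way to treat the special token classes listed above; multiword terms and named entities would need a lexicon or a trained model.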

Page 4: Superficial & Lexical level  1

Superficial level 3

• Vocabulary size (V)
• Heaps' law
    • $V = K N^{\beta}$
    • K depends on the text, $10 \le K \le 100$
    • N: total number of words
    • β depends on the language; for English, $0.4 \le \beta \le 0.6$
• Vocabulary grows sublinearly but does not saturate; it tends to stabilize at about 1 MB of text (150,000 words)

[Figure: vocabulary growth curve; x-axis: words, y-axis: different words]
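Heaps' law can be explored numerically. The sketch below assumes the usual form V = K · N^β; the default values K = 50 and β = 0.5 are arbitrary picks from the ranges given on the slide, and both function names are hypothetical.

```python
def heaps_vocabulary(n_words, k=50.0, beta=0.5):
    """Predicted vocabulary size V = K * N**beta (k and beta are assumed values)."""
    return k * n_words ** beta


def vocabulary_growth(tokens):
    """Number of distinct word types observed after each successive token."""
    seen, growth = set(), []
    for token in tokens:
        seen.add(token.lower())
        growth.append(len(seen))
    return growth


tokens = "the cat and the dog and the bird".split()
print(vocabulary_growth(tokens))
print(heaps_vocabulary(150_000))
```

Plotting `vocabulary_growth` for a real corpus against `heaps_vocabulary` for the same N is one way to estimate K and β for that text.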

Page 5: Superficial & Lexical level  1

Superficial level 4

• Word tokens vs. word types
• Statistical distribution of words in a document
    • Clearly non-uniform
    • The most common words cover more than 50% of occurrences
    • 50% of the words occur only once
    • ~12% of the document consists of words occurring fewer than 4 times
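The figures above can be measured on any tokenized document. The following is a minimal sketch; the function name, the dictionary keys, and the default `top_k` are assumptions, since the slide does not say how many words count as "most common".

```python
from collections import Counter


def distribution_stats(tokens, top_k=10, rare_below=4):
    """Token/type counts and three coverage ratios for a list of word tokens."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    n_tokens = len(tokens)   # word tokens: every occurrence counts
    n_types = len(counts)    # word types: distinct words
    return {
        "tokens": n_tokens,
        "types": n_types,
        # share of occurrences covered by the top_k most frequent types
        "top_k_coverage": sum(c for _, c in counts.most_common(top_k)) / n_tokens,
        # share of types that occur exactly once
        "hapax_share_of_types": sum(1 for c in counts.values() if c == 1) / n_types,
        # share of occurrences belonging to types seen fewer than rare_below times
        "rare_share_of_tokens": sum(c for c in counts.values() if c < rare_below) / n_tokens,
    }


print(distribution_stats("a a a a b b c d".split(), top_k=1))
```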

Page 6: Superficial & Lexical level  1

Superficial level 5

[Figure: histogram of word frequencies; x-axis: bin, y-axis: frequency]

Zipf's law: we sort the words occurring in a document by their frequency. The product of the frequency of a word (f) and its rank position (r) is approximately constant:

$f = C \cdot \frac{1}{r}$, with $C \approx N/10$ (N: total number of words)
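Zipf's law can be checked by ranking words by frequency and looking at the product of rank and frequency. This is a minimal sketch with a hypothetical function name; the toy input was built to follow the law exactly, which real text only does approximately.

```python
from collections import Counter


def rank_frequency_products(tokens):
    """Return (rank, word, frequency, rank * frequency), most frequent word first."""
    counts = Counter(tokens)
    # Sort by decreasing frequency; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]


tokens = ["the"] * 6 + ["of"] * 3 + ["cat"] * 2
for row in rank_frequency_products(tokens):
    print(row)
```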

Page 7: Superficial & Lexical level  1

Lexical level 1

• Part of Speech (POS)
    • Formal property of a word type that determines its acceptable uses in syntax
    • A POS can be seen as a class of words
    • A word type can have several POS; a word token has only one
• Plain categories
    • Open: many elements, neologisms, independent and semantically rich classes
    • N, Adj, Adv, V
• Functional categories
    • Closed
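The difference between the POS of a word type and the POS of a word token can be made concrete with a toy lexicon. Everything in this sketch is hypothetical: the entries, the tag names `Det`, `Prep` and `Pron`, and the helper functions.

```python
# Open classes as listed on the slide.
OPEN_CLASSES = {"N", "Adj", "Adv", "V"}

# Hypothetical toy lexicon: each word type maps to the set of POS it can take.
POS_LEXICON = {
    "love": {"N", "V"},
    "fast": {"Adj", "Adv", "V", "N"},
    "the": {"Det"},
    "of": {"Prep"},
}


def possible_pos(word_type):
    """All POS a word type may take (empty set for unknown words)."""
    return POS_LEXICON.get(word_type.lower(), set())


def is_ambiguous(word_type):
    """True when the word type has more than one possible POS."""
    return len(possible_pos(word_type)) > 1


def is_open_class(pos):
    return pos in OPEN_CLASSES


# In a tagged sentence every word token carries exactly one POS.
tagged_tokens = [("They", "Pron"), ("love", "V"), ("the", "Det"), ("sea", "N")]

print(is_ambiguous("love"), is_ambiguous("the"))
```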

Page 8: Superficial & Lexical level  1

Lexical level 2

Lexicon

• Repository of lexical information for human or computer use
• Two aspects to consider
    • Representation of lexical information
    • Acquisition of lexical information

Page 9: Superficial & Lexical level  1

Lexical level 3

Lexicon content

• Orthographic transcription
• Phonetic transcription
• Inflection model
• Diathesis alternations, subcategorization frames (SN: noun phrase)
    • LOVE VTR (OBJLIST: SN).
    • LOVE
        • CAT = VERB
        • SUBCAT = <SN, SN>
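The second LOVE entry can be written as a small attribute/value record. This is a sketch only: the class `LexicalEntry`, its field names, and the reading of SUBCAT as "subject, then object" are assumptions, not a format defined in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalEntry:
    lemma: str                  # orthographic form
    cat: str                    # category, e.g. "VERB"
    subcat: tuple = ()          # subcategorization frame, one label per argument
    phonetic: str = ""          # left empty: the slide gives no transcription
    inflection_model: str = ""  # left empty: the slide gives no model name


# CAT = VERB, SUBCAT = <SN, SN>
LOVE = LexicalEntry(lemma="love", cat="VERB", subcat=("SN", "SN"))


def is_transitive(entry):
    """Assumes the frame lists the subject first and then a single object."""
    return entry.cat == "VERB" and len(entry.subcat) == 2


print(LOVE)
print(is_transitive(LOVE))
```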

Page 10: Superficial & Lexical level  1

Lexical level 4

• POS
• Argument structure
• Semantic information
    • Dictionaries => definition
    • Lexicons => semantic types predefined in a hierarchy
• Lexical relations
    • Derivation
• Equivalence with other languages

Page 11: Superficial & Lexical level  1

Lexical level 5

Problems

• Form
    • Attribute/value pairs, binary or n-ary relations, coded values, open-domain values, ...
• Multiple assignments
    • One-to-many and many-to-one relations
• Contextual dependencies ...
• Facets of features
    • Mandatory or optional, cardinality, default values
• Grading
    • Exact values, preferences, probabilistic assignments
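Facets such as mandatory/optional, cardinality, and default values can be modelled as a small schema that lexical entries are checked against. The sketch below is one possible reading; `FeatureSpec`, `resolve`, and the example features are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    name: str
    mandatory: bool = False  # facet: mandatory or optional
    max_values: int = 1      # facet: cardinality (upper bound)
    default: tuple = ()      # facet: default value(s)


def resolve(entry, specs):
    """Apply defaults and check facets; entry maps feature names to tuples of values."""
    resolved = {}
    for spec in specs:
        values = tuple(entry.get(spec.name, spec.default))
        if spec.mandatory and not values:
            raise ValueError(f"missing mandatory feature: {spec.name}")
        if len(values) > spec.max_values:
            raise ValueError(f"too many values for feature: {spec.name}")
        resolved[spec.name] = values
    return resolved


specs = [
    FeatureSpec("cat", mandatory=True),
    FeatureSpec("pos", max_values=3),                # one-to-many assignment
    FeatureSpec("number", default=("singular",)),    # default value
]
print(resolve({"cat": ("VERB",), "pos": ("N", "V")}, specs))
```

Graded information (preferences or probabilities) could be added by storing (value, weight) pairs instead of bare values; the sketch leaves that out.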

Page 12: Superficial & Lexical level  1

Lexical level 6

Representation

• General-purpose databases
• Textual databases
• Lexical databases
• Object-oriented (OO) formalisms
• OO databases
• Frames
• Unification-based formalisms

Page 13: Superficial & Lexical level  1

Lexical level 7

Lexical information acquisition

• Dictionaries
    • Machine-readable dictionaries (MRD)
        • Predefined internal structure
        • Some degree of coding in some contents
        • Internal relations (synonymy, hyponymy, ...)
        • (Sometimes) restricted vocabulary
        • Some systematicity in building definitions

Page 14: Superficial & Lexical level  1

Lexical level 8

Information present in corpora

• Collocations
• Argument structure
• Frequency information
• Context
• Grammatical induction
• Probabilistic analysis
• Lexical relations
• Examples of use
• Selectional restrictions
• Nominal compounds
• Idioms, ...
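As one example of acquiring lexical information from a corpus, candidate collocations can be ranked with pointwise mutual information (PMI) over adjacent word pairs. PMI is a common choice but is not named in the slides; the function name and the `min_count` threshold are assumptions.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by PMI = log2(P(w1, w2) / (P(w1) * P(w2)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda item: -item[1])


text = "new york is big and new york is old but the city is new"
for pair, score in bigram_pmi(text.split()):
    print(pair, round(score, 3))
```

On a corpus this small the scores mean little; the point is only to show how frequency information in a corpus turns into a ranked list of collocation candidates.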

Page 15: Superficial & Lexical level  1

Lexical level 9

Corpus typology

• Raw corpus
• Horizontal or vertical corpus
• Tagged corpora
• Parenthesized (bracketed) corpora
• Treebanks
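The horizontal/vertical distinction can be illustrated with a format conversion. The sketch assumes the common convention that a horizontal corpus keeps running text with `word/TAG` items on one line, while a vertical corpus puts one token per line with tab-separated columns; the slides do not define the formats, and the function names and tags are hypothetical.

```python
def horizontal_to_vertical(line, sep="/"):
    """Turn 'word/TAG word/TAG ...' into one 'word<TAB>TAG' row per token."""
    rows = []
    for item in line.split():
        word, _, tag = item.rpartition(sep)
        if not word:
            # No separator found: keep the token and leave the tag column empty.
            word, tag = item, ""
        rows.append(f"{word}\t{tag}")
    return "\n".join(rows)


def vertical_to_horizontal(block, sep="/"):
    """Inverse conversion: join the rows back into a single line."""
    items = []
    for row in block.splitlines():
        if not row.strip():
            continue  # skip blank rows
        word, _, tag = row.partition("\t")
        items.append(f"{word}{sep}{tag}" if tag else word)
    return " ".join(items)


horizontal = "They/Pron love/V the/Det sea/N"
vertical = horizontal_to_vertical(horizontal)
print(vertical)
print(vertical_to_horizontal(vertical))
```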