superficial & lexical level 1
NLP superficial and lexic level 1
Superficial & Lexical level 1
• Superficial level
• What is a word
• Lexical level
• Lexicons
• How to acquire lexical information
Superficial level 1
• Textual pre-processing
• Getting the document(s)
• Accessing a database (DB)
• Accessing the Web (wrappers)
• Getting the textual fragments of a document
• Multimedia documents, Web pages, ...
• Filtering out meta-information
• HTML tags, XML tags, ...
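As an illustration of filtering out meta-information, the following is a minimal sketch that strips HTML tags with Python's standard `html.parser` module. The class name `TextExtractor`, the helper `strip_markup`, and the sample page are hypothetical and chosen only for this example; a real wrapper for Web pages would need to handle far more cases.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping tags and <script>/<style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Entering an element whose content is not document text.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_markup(html):
    """Return the textual fragments of an HTML string, joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = ("<html><head><style>p {color: red}</style></head>"
        "<body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>")
print(strip_markup(page))
```

Joining the fragments with single spaces is a simplification: it discards the paragraph structure that a later segmentation step might want to keep.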
Superficial level 2
• Text segmentation into paragraphs or sentences
• Tokenization
• orthographic vs grammatical word
• Multiword terms
• dates, formulas, acronyms, abbreviations, quantities (and units), idioms, ...
• Named entities
• NER, NEC, NERC
• Unknown words
• Language identification
Beeferman et al., 1999; Ratnaparkhi, 1998
Bikel et al., 1999; Borthwick, 1999; Mikheev et al., 1999
Elworthy, 1999; Adams and Resnik, 1997
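To make sentence segmentation and tokenization concrete, here is a small rule-based sketch. It is not the statistical approach of the works cited on this slide; the abbreviation list, the regular expression, and the function names are assumptions made for illustration only.

```python
import re

# Hypothetical, deliberately tiny abbreviation list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}


def split_sentences(text):
    """Split on whitespace-delimited pieces ending in . ? or !,
    unless the piece is a known abbreviation."""
    sentences, current = [], []
    for piece in text.split():
        current.append(piece)
        if piece[-1] in ".?!" and piece.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


# Numbers with decimal separators, words with internal hyphens or
# apostrophes, and any other single non-space symbol.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(sentence):
    """Return the orthographic tokens of one sentence."""
    return TOKEN_RE.findall(sentence)


text = "Dr. Smith paid 3.50 euros. He didn't complain!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

The tokenizer keeps `3.50` and `didn't` whole but splits `Dr.` into two tokens, which shows why orthographic tokens and grammatical words do not always coincide.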
Superficial level 3
• Vocabulary size (V)
• Heaps' law
• $V = K N^{\beta}$
• $K$ depends on the text, $10 \le K \le 100$
• $N$ is the total number of words
• $\beta$ depends on the language; for English, $0.4 \le \beta \le 0.6$
• The vocabulary grows sublinearly and never fully saturates, but its growth tends to level off at around 1 MB of text (about 150,000 words)
[Figure: number of different words plotted against the total number of words]
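The vocabulary growth described on this slide can be sketched in a few lines, taking Heaps' law in its usual form, V = K · N^β. The default values `k=44.0` and `beta=0.5` are illustrative assumptions that merely fall inside the typical ranges, and the toy token list is invented for the example.

```python
def heaps_vocabulary(n_tokens, k=44.0, beta=0.5):
    """Predicted vocabulary size V = K * N**beta.
    The defaults for k and beta are illustrative, not fitted values."""
    return k * n_tokens ** beta


def vocabulary_growth(tokens):
    """Observed number of distinct words after reading each token."""
    seen, curve = set(), []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


tokens = "the cat sat on the mat and the dog sat on the cat".split()
print(vocabulary_growth(tokens))

# Multiplying the text length by 10 multiplies V by less than 10.
for n in (1_000, 10_000, 100_000):
    print(n, round(heaps_vocabulary(n)))
```

On a real corpus, K and β would be estimated by fitting the observed growth curve rather than fixed in advance.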
Superficial level 4
• Word tokens vs word types
• Statistical distribution of words in a document
• Clearly non-uniform
• The most frequent words account for more than 50% of the occurrences
• About 50% of the word types occur only once
• ~12% of the document consists of words occurring fewer than 4 times
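The distinction between tokens and types, and the shares quoted on this slide, can be measured directly with a frequency count. The sketch below assumes an already tokenized, non-empty text; the function name `frequency_profile` and the dictionary keys are hypothetical.

```python
from collections import Counter


def frequency_profile(tokens):
    """Summarize the token/type distribution of a non-empty token list."""
    counts = Counter(tokens)
    n_tokens = sum(counts.values())   # word tokens: every occurrence
    n_types = len(counts)             # word types: distinct words
    hapaxes = sum(1 for c in counts.values() if c == 1)
    rare_mass = sum(c for c in counts.values() if c < 4)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "hapax_share_of_types": hapaxes / n_types,
        "rare_share_of_tokens": rare_mass / n_tokens,
    }


tokens = "to be or not to be that is the question".split()
print(frequency_profile(tokens))
```

A ten-word example is far too short to reproduce the percentages on the slide; those only emerge on documents of realistic length.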
Superficial level 5
[Figure: histogram of word frequencies (axes: Bin, Frequency)]
$f = C \cdot \frac{1}{r}$, with $C \approx N/10$
Zipf's law: we sort the words occurring in a document by their frequency. The product of the frequency of a word ($f$) and its rank ($r$) is approximately constant.
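A quick way to check the rank-frequency relation on a given text is to sort the words by frequency and inspect the product of rank and frequency. This is a minimal sketch; the function name `rank_frequency` is hypothetical and the sample tokens are artificial, built so that the product stays constant for the first ranks.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows,
    most frequent word first; ties are broken alphabetically."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]


# Artificial sample: frequencies 6, 3, 2, 1 for ranks 1 to 4.
tokens = ("a " * 6 + "b " * 3 + "c " * 2 + "d").split()
for row in rank_frequency(tokens):
    print(row)
```

On real text the product is only roughly constant, and the fit is usually worst at the very top and the very bottom of the ranking.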
Lexical level 1
• Part of Speech (POS)
• Formal property of a word type that determines its acceptable uses in syntax
• A POS can be seen as a class of words
• A word type can have several POS; a word token has only one
• Content (open-class) categories
• open, many members, admit neologisms, independent and semantically rich classes
• N, Adj, Adv, V
• Functional categories
• closed
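The point that a word type can have several POS while a word token has only one can be shown with a toy lexicon. Everything in this sketch is hypothetical: the entries, the tag names other than N, Adj, Adv and V, and the helper functions.

```python
# Hypothetical toy lexicon: each word type maps to the set of POS it admits.
LEXICON = {
    "book": {"N", "V"},
    "flight": {"N"},
    "the": {"Det"},
    "a": {"Det"},
}

# The open classes named on the slide.
OPEN_CLASSES = {"N", "Adj", "Adv", "V"}


def possible_tags(word):
    """All POS recorded for a word type (empty set if unknown)."""
    return LEXICON.get(word.lower(), set())


def is_ambiguous(word):
    """True when the word type admits more than one POS."""
    return len(possible_tags(word)) > 1


# In context, each token carries exactly one of its possible tags.
tagged_sentence = [("Book", "V"), ("the", "Det"), ("flight", "N")]

for word, tag in tagged_sentence:
    kind = "open" if tag in OPEN_CLASSES else "closed"
    print(word, tag, kind, "ambiguous" if is_ambiguous(word) else "unambiguous")
```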
Lexical level 2
• Repository of lexical information for human or computer use
• Two aspects to consider
• Representation of lexical information
• Acquisition of lexical information
Lexicon
Lexical level 3
• Orthographic transcription
• Phonetic transcription
• Inflection model
• Diathesis alternations, subcategorization frames
• LOVE VTR (OBJLIST: SN).
• LOVE
• CAT = VERB
• SUBCAT = <SN, SN>
Lexicon content
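The second LOVE entry above is an attribute/value structure, which maps naturally onto a small record type. The sketch below is one possible encoding, not the formalism used in the slides: the class `LexicalEntry`, its optional fields, and the helper `is_transitive` are hypothetical, and reading `SN` as a noun-phrase label and `<SN, SN>` as subject plus object is an assumption.

```python
from dataclasses import dataclass


@dataclass
class LexicalEntry:
    """One lexicon entry as attribute/value pairs."""
    lemma: str
    cat: str                     # part of speech, e.g. "VERB"
    subcat: tuple = ()           # subcategorization frame
    phonetic: str = ""           # optional, left empty in this sketch
    inflection_model: str = ""   # optional, left empty in this sketch


# LOVE: CAT = VERB, SUBCAT = <SN, SN>
love = LexicalEntry(lemma="love", cat="VERB", subcat=("SN", "SN"))


def is_transitive(entry):
    """Assumption: a verb with two SN slots takes a subject and an object."""
    return entry.cat == "VERB" and entry.subcat == ("SN", "SN")


print(love)
print(is_transitive(love))
```

Default values for the optional fields keep sparse entries short, at the cost of not distinguishing "unknown" from "empty".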
• POS
• Argument structure
• Semantic information
• dictionaries => definition
• lexicons => semantic types predefined in a hierarchy
• Lexical relations
• derivation
• Equivalence with other languages
Lexical level 4
Lexical level 5
• Form
• attribute/value pairs, binary or n-ary relations, coded values, open-domain values…
• Multiple assignments
• One-to-many and many-to-one relations
• Contextual dependencies …
• Facets of features
• Mandatory or optional, cardinality, default values
• Grading
• Exact values, preferences, probabilistic assignments
Problems
Lexical level 6
• General-purpose databases
• Textual databases
• Lexical databases
• OO formalisms
• OO databases
• Frames
• Unification-based formalisms
Representation
Lexical level 7
• Dictionaries
• Machine-readable dictionaries (MRD)
• Predefined internal structure
• Some degree of coding in some contents
• Internal relations (synonymy, hyponymy, ...)
• (sometimes) restricted vocabulary
• Some systematic patterns in how definitions are built
Lexical Information acquisition
Lexical level 8
• Collocations
• Argument structure
• Frequency information
• Context
• Grammatical Induction
• Probabilistic Analysis.
• Lexical relations
• Examples of use.
• Selectional Restrictions
• Nominal compounds
• Idioms, ...
Information present in corpora
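As one example of acquiring lexical information from a corpus, candidate collocations can be ranked by pointwise mutual information (PMI) over adjacent word pairs. PMI is a technique chosen here for illustration; the slides do not say which measure is intended. The function name, the `min_count` threshold, and the toy corpus are assumptions.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by PMI = log2(P(w1, w2) / (P(w1) * P(w2))).
    Pairs seen fewer than min_count times are ignored, since PMI
    overrates rare events."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda kv: -kv[1])


tokens = ("new york is big . new york is busy . "
          "the city is big . the city is new .").split()
for pair, score in bigram_pmi(tokens):
    print(pair, round(score, 2))
```

On a corpus this small the scores mean little; the sketch only shows the mechanics of counting pairs and comparing them with what independence would predict.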
Lexical level 9
• Raw corpus
• Horizontal or vertical corpus
• Tagged corpora
• Bracketed (parsed) corpora
• Treebanks
Corpus typology
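The difference between a horizontal and a vertical tagged corpus can be shown with a small conversion sketch. It assumes that "horizontal" means running text with a `word/TAG` pair per token and that "vertical" means one token per line with a tab before the tag; both conventions, the tag names, and the function names are assumptions, since the slide only names the corpus types.

```python
def horizontal_to_vertical(line, sep="/"):
    """Turn 'The/DT cat/NN' into one 'word<TAB>tag' row per token."""
    rows = []
    for item in line.split():
        # rpartition splits on the last separator, so './.' still works.
        word, _, tag = item.rpartition(sep)
        rows.append(f"{word}\t{tag}")
    return "\n".join(rows)


def vertical_to_horizontal(block, sep="/"):
    """Inverse conversion: join the rows back into one line."""
    pairs = (row.split("\t") for row in block.splitlines() if row.strip())
    return " ".join(f"{word}{sep}{tag}" for word, tag in pairs)


horizontal = "The/DT cat/NN sleeps/VBZ ./."
vertical = horizontal_to_vertical(horizontal)
print(vertical)
print(vertical_to_horizontal(vertical))
```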