Natural Language Processing
-
1
@tomzeal
CPE 510: Natural Language Processing
Friday, HSLTB, 8-10am
Human Language Processing: this sphere of study belongs to the
field of artificial intelligence.
Artificial intelligence is defined by what it has and what it
does, not by what it is.
What is Artificial Intelligence?
Alan Turing: the Turing test proposes a way to decide whether a
machine is intelligent.
Searle: the Chinese room experiment (a counterargument to the Turing test).
Features of Human Language
1. Human language has a sophisticated linguistic
system which allows the generation of infinite
expressions from a finite set of cues or words.
2. Human language is ambiguous. Ambiguity occurs at
different levels:
- Lexical ambiguity: written the same way and
pronounced the same way, but meaning different
things (Ade sa aso & Ade sa ere).
- Orthographic ambiguity: written the same
way but pronounced differently (I read the book [present] vs.
I read the book [past]; It is a record vs. Please record it).
- Sentential ambiguity: the expression as a whole is
ambiguous (I saw the boy with my glasses).
3. Language is domain bounded, i.e. it can be in a domain of
belief (some words can be a taboo or abomination) or a
domain of world view (how you see the world).
E.g. Wife → Iyawo, Husband → Oko, Family → Ebi, Lord → Oluwa
4. Language evolves with time and technology.
5. Language expresses three entities:
- What is done (Verb)
- Who did it and (Subject)
- To whom it is done. (Object)
6. It uses defaults to express multi-level conceptions of
ideas that are complex (e.g. "birds fly", yet an ostrich
is a bird and, in the real sense, does not fly).
What must be considered for Natural Language
Processing?
Language has a structure (grammatical structure)
Language has a semantic context
Language has constraints on the mapping of
structure to context
What is Language?
Why is it difficult to express knowledge definitively?
1. Whole Part Dilemma: If a system comprises many
components, can we describe the behaviour of the
system as an aggregation of its component parts?
2. Signifier-Signified Dilemma: things mean
different things to different people. Most often the
signifier is spoken of and not the signified.
How does perception come to us? Through the senses (sight, sound,
taste, smell, touch), but other things also go into what we
perceive (emotion and intuition), and language feeds into
this as well.
S/N | Formal Language | Human Language | Natural Language
1 | Means of communicating with computing devices | Means of communication among humans | Means of communication among natural entities
2 | Dry and sterile | Creative and fertile | Engrained in nature
3 | Uses discrete symbols | Uses signifiers to represent natural or abstract entities or phenomena | Uses natural signal phenomena primarily, e.g. waves (sound and water), sun rays
4 | It is definitive and seeks consistency | It is ambiguous (because of creativity) | It is definitive and consistent
Content of NLP
[Figure: a diagram showing that "Tree" means different things to a carpenter, a botanist, a sawmiller, and in computer science]
(A - Formal Language, B- Psychology, C - A.I, D - NLP)
Features of human Language in the context of
NLP
1. The construction of signifiers for ideas and concepts
is arbitrary.
2. Due to this arbitrariness, the signifiers or words that
constitute a language are difficult to count. There are
171,476 words in the Advanced Learner's Dictionary.
3. There is a systematic process for constructing
expressions from the basic signifiers of the language
(this is the grammar).
4. No two individuals have the same language
behaviour.
5. It is possible to generate infinite expressions from a
finite set of signifiers or words using the same
system.
Useful Terms in NLP
Alphabet: a finite set of symbols defining a
language. The symbols are such that:
- Each symbol in an alphabet must represent a
unique primitive, and each primitive represents
an indivisible or atomic entity in the domain of
interest.
- Each primitive by itself alone cannot register or
reckon a sense or concept.
A string is a sequence of symbols over an
alphabet; it is the concatenation of symbols of
an alphabet.
A word is a string with a meaning, and that
is what differentiates it from a string, i.e.
string + meaning assignment = word.
Vocabulary is the set of words in a language.
A syllable is a sound that can be produced in one
effort.
A language is formally defined as a set of strings over
an alphabet, but in this class it is a set of words
and the rules that govern them.
A meta-language is a language used to
describe another language; examples are mark-up
such as XML and comments in programs.
Syntax: the rules for constructing expressions.
Semantics: what an expression means.
Pragmatics: what people get from it.
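The alphabet / string / word distinction above can be sketched in a few lines of Python; the toy alphabet and vocabulary here are assumptions chosen for illustration:

```python
# A minimal sketch of the terms above, using an assumed toy alphabet and vocabulary.
alphabet = {"a", "b", "d", "e"}      # finite set of atomic symbols
vocabulary = {"ade", "bead"}         # strings that have been assigned meaning

def is_string_over(s, sigma):
    """A string is any concatenation of symbols drawn from the alphabet."""
    return all(ch in sigma for ch in s)

def is_word(s, sigma, vocab):
    """String + meaning assignment = word: modelled here as vocabulary membership."""
    return is_string_over(s, sigma) and s in vocab

print(is_string_over("abba", alphabet))       # True: a string, but not a word
print(is_word("ade", alphabet, vocabulary))   # True
```

So "abba" is a valid string over the alphabet but not a word, because no meaning has been assigned to it.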
Computational Levels
Level 0: Register
Level 1: Memory
Level 2: Arithmetic / Logic / Relation
Level 3: Counting and Ordering
Level 4: Selection / Decision
Level 5: Control and Parallel operation
Operation at level i requires level (i - 1) details.
Noam Chomsky Hierarchy
Type | Grammar | Example | Model
3 | Regular | (ab)^n | Finite State Automaton
2 | Context Free | a^n b^n | Push Down Automaton (stack)
1 | Context Sensitive (includes options of selection) | a^n b^n c^n | Linearly Bounded Automaton
0 | Unrestricted (Recursively Enumerable) | e.g. human language | Turing Machine
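The difference between the bottom two levels of the hierarchy can be sketched in Python: the regular language (ab)^n is recognizable with a regular expression, while a^n b^n needs counting, the job of a pushdown automaton's stack (simulated here with a simple counter):

```python
import re

def is_ab_n(s):
    """Type 3: (ab)^n is regular, so a plain regular expression suffices."""
    return re.fullmatch(r"(ab)*", s) is not None

def is_anbn(s):
    """Type 2: a^n b^n needs counting, which a finite-state pattern cannot do."""
    n = len(s) - len(s.lstrip("a"))   # count the leading a's (the "stack")
    return s == "a" * n + "b" * n

print(is_ab_n("ababab"))   # True
print(is_anbn("aaabbb"))   # True
print(is_anbn("aabbb"))    # False: counts do not match
```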
NLP System Development Steps
Why do we have to develop? Applications include: speech recognition, speech synthesis, machine translation, text summarization, automatic dialogue systems, automatic dictation, and language recognition, each drawing on vocabulary, semantics, synthesis, and grammar.
[Figure: Computer Science and Engineering, Cognitive Science, and Linguistics/Language together develop Language Tech, which impacts Human Language]
1. Understand the problem
2. State reasonable assumptions that are appropriate
3. Identify the language structure
4. Identify the system states
Speech and Signal Analysis
Praat.exe
Modelling Human Language
Subject - Ade
Verb - slapped
Object - Olu
Phrase Structure Grammar (Context Free Grammar
- Level 2)
G = (VN, VT, S, P) where:
VT = finite set of terminal symbols, i.e. the alphabet
VN = finite non-empty set of non-terminal symbols
S = a non-terminal symbol called the start symbol
P = set of productions or re-write rules (<...> denotes a non-terminal)
1. <S> ::= <NP> <VP>
2. <NP> ::= <N> | <Det> <N>
3. <VP> ::= <V> <NP>
4. <N> ::= {list of all nouns}
5. <V> ::= {list of all verbs}
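A grammar G like the one above can be written down as plain data and used to generate sentences. A minimal sketch in Python, where the rule set mirrors the notes' example and the word lists (Ade, Olu, slapped, the) are toy assumptions:

```python
import random

# P: productions of the toy grammar. Keys are non-terminals; anything
# that is not a key is treated as a terminal symbol.
P = {
    "S":   [["NP", "VP"]],
    "NP":  [["N"], ["Det", "N"]],
    "VP":  [["V", "NP"]],
    "N":   [["Ade"], ["Olu"]],
    "V":   [["slapped"]],
    "Det": [["the"]],
}

def generate(symbol="S"):
    """Expand a non-terminal by repeatedly applying re-write rules."""
    if symbol not in P:            # terminal: emit as-is
        return [symbol]
    rhs = random.choice(P[symbol]) # pick one production for this non-terminal
    out = []
    for sym in rhs:
        out.extend(generate(sym))
    return out

print(" ".join(generate()))
```

Every generated sentence contains "slapped", since S must expand through VP → V NP; this is the finite-rules-infinite-expressions idea in miniature (here the language is finite only because the word lists are).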
W: the domain of the robot arm
B = {A, B, C, D, E, F}
The environment is monotonic, i.e. the environment
won't change as the problem of the domain evolves; the
state only changes by effect of the environment. The
domain is closed.
In order to abstract, a state space must be defined.
Whatever method has been selected, the following must
be carried out:
1. Identify and label each object in the domain
2. Identify and label relationships between the
objects
3. Represent data on the spatial location of each
object
4. Express the world by formally representing facts
about the world
5. Devise a mechanism to determine whether or
not a formula, fact, or expression is logically
plausible
Methods
1. A semantic network can be used to represent
this.
2. First-order predicate calculus: the logic used
here is binary logic. The terms need to be
defined to allow the calculus to manipulate them.
Box = {A, B, C, D, E, F}
Table = {T}
Robot = {R}
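The five steps above can be sketched for this block world: label the objects, state the spatial relations as facts, and test whether a formula holds. The on/clear predicates and the particular arrangement of boxes are assumptions for illustration, not from the notes:

```python
# Objects in the domain (step 1).
boxes = {"A", "B", "C", "D", "E", "F"}
table, robot = "T", "R"

# Facts about the world (steps 2-4): on(x, y) means x sits directly on y.
facts = {("A", "T"), ("B", "A"), ("C", "T"), ("D", "T"), ("E", "T"), ("F", "T")}

def on(x, y):
    """Step 5: decide whether the formula on(x, y) holds in this state."""
    return (x, y) in facts

def clear(x):
    """A box is clear if nothing sits on it, so the robot arm can grasp it."""
    return not any(y == x for _, y in facts)

print(on("B", "A"))   # True
print(clear("A"))     # False: B sits on A
print(clear("B"))     # True
```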
31-01-2014
Machine Translation
Machine translation is the application of computers to
the task of translating text or speech from one human
language to another. The expression
being translated can be in different forms, such as text,
speech, image, sign, etc. The goal of machine translation
is to communicate the content of an expression in one
human language, referred to as the source language (SL), to its
equivalent in another human language, referred to as the
target language (TL). Machine translation is a multi-disciplinary
study which cuts across the arts and sciences.
The translation can be unidirectional, bidirectional,
or multidirectional.
Representation and Processing
Human translators usually employ at least five distinct
kinds of knowledge:
a. Knowledge of the source language
b. Knowledge of the target language
This allows them to produce text that is acceptable in
the target language.
c. Knowledge of the various correspondences between the source
language and the target language, i.e. how individual
words can be translated
d. Knowledge of the subject matter, including
ordinary general knowledge and common sense
This, along with knowledge of the source language, allows
them to understand what the text to be translated means.
e. Knowledge of the culture, social conventions,
customs, expectations, etc. of the speakers of the
source and target languages.
* Phonological Knowledge, Morphological Knowledge
Phonological knowledge: knowledge about the sound
system of a language; knowledge which, for example,
allows one to work out the likely pronunciation of words.
When dealing with written text, such knowledge is not
useful; however, there is related knowledge about
orthography which can be useful. This deals with writing
style.
Example: Aiye | Aye
Enia | Eniyan
Adie | Adiye
Morphological knowledge: this has to do with applying
knowledge from the study of the form and internal structure
of morphemes (words and their semantic building
blocks). Morphemes are the smallest linguistic units in a word that
can carry a meaning. It deals with how words can be constructed.
Example: print/er
(verb/noun)
un-, break, -able in unbreakable
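The examples above (print/er, un-break-able) can be sketched as naive affix stripping; the tiny prefix and suffix lists are assumptions, and real morphological analysis would need a proper lexicon rather than this heuristic:

```python
# Naive sketch of morpheme segmentation by affix stripping.
PREFIXES = ["un"]
SUFFIXES = ["able", "er"]

def morphemes(word):
    """Split a word into a (prefix, stem, suffix) sequence of morphemes."""
    parts = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p):
            parts.append(p + "-")
            word = word[len(p):]
            break
    stem = word
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s):
            stem = word[:-len(s)]
            parts.append(stem)
            parts.append("-" + s)
            break
    else:
        parts.append(stem)
    return parts

print(morphemes("unbreakable"))  # ['un-', 'break', '-able']
print(morphemes("printer"))      # ['print', '-er']
```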
Syntactic knowledge: knowledge about how sentences
and other sorts of phrases can be made up out of words.
Semantic knowledge: knowledge about the meanings of words,
phrases, and sentences, and about how the meaning of a phrase
is related to the meanings of its component words, phrases,
or sentences.
Tolu builds a house (SVO)
Pragmatic knowledge: concerns the practical use of
language; the use of spoken language in a social
context.
Representing Linguistic Knowledge
In general, syntax is concerned with two slightly different
analyses of sentences:
- The first is constituent or phrase structure
analysis: the division of sentences into their
constituent parts and the categorization of
these parts as nouns, verbs, etc.
- The second has to do with grammatical
relations: the assignment of grammatical
relations such as subject, object, head, and so
on to various parts of the sentence.
Ade ate the food → N V Det N (subject, predicate, object)
Grammar and Constituent Structure
Sentences are made up of words, traditionally categorized
into parts of speech or categories including nouns, verbs,
adjectives, and prepositions. A grammar of a language is a
set of rules which states how these parts of speech can
be put together to make grammatical or well-formed
sentences.
E.g.
a) Put some papers in the printer (follows the grammar
rules)
b) Print some put in papers (does not follow the grammar
rules)
Here are some simple rules for English grammar, with
examples. A sentence consists of a noun phrase followed by a verb phrase:
The user | should clean the printer
(noun phrase) | (modal + verb = verb phrase)
But within a verb phrase we can have a noun phrase:
VP → V NP
A noun phrase can begin with a determiner or article
such as "the" or "a".
English → Yoruba: unidirectional, bilingual
English → Yoruba and Yoruba → English: bidirectional, bilingual
English → (Hausa, Igbo, Yoruba): multidirectional
11-03-2014
Difference between Clauses and Sentences
A verb phrase can consist of verbs and auxiliary verbs, and
it can contain a noun phrase. It consists of a verb alone
only when it is a command, e.g. "go!"; otherwise it contains
a noun phrase such as "the printer".
We can have a structure like:
S → NP + VP
Parts of Speech
Subject: N | NP
Verb: V
Object: N | NP
(The subject and object can also be pronouns.)
Machine Translation Approaches
1. Statistical Based Approach (data driven):
translates sentences using a set of
stored words in a database or repository.
2. Rule Based Approach: understand the
grammar structure of the source and target
languages, express it as a context-free grammar,
and then apply it to the translation.
3. Hybrid Approach: combines the statistical and
rule-based approaches.
Part-of-speech tagging is the process of assigning a
part-of-speech label to each lexical item.
Lexicon: a collection of words
Lexical item = lexeme = word
A part of speech gives information about a word and its
neighbours. This is clearly true for major categories, e.g.
verb and noun.
Book the flight ("book" can be V or N)
Ade bought the book:
S → Ade bought the book
N → Ade, book
V → bought
Det → the
Yoruba: Ade ra iwe naa
(tone marks: high /, low \, mid unmarked)
Types of Part of Speech Tagging
1. Rule Based: rule-based taggers often have a
large database of hand-written disambiguation
rules, which specify, for example, that an
ambiguous word is a noun rather than a verb if
it follows a determiner.
2. Stochastic Tagging: generally resolves
tagging ambiguities by using a training corpus to
compute the probability of a given word having a
given tag in a given context.
Hidden Markov Model
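The stochastic idea above can be sketched as a unigram most-frequent-tag baseline: pick the tag maximizing P(tag | word) estimated from a training corpus. This is a simplification of a full hidden Markov model, which would also model tag-to-tag transition probabilities; the toy corpus here is an assumption:

```python
from collections import Counter, defaultdict

# Toy tagged training corpus (an assumption for illustration).
corpus = [
    [("book", "V"), ("the", "Det"), ("flight", "N")],
    [("read", "V"), ("the", "Det"), ("book", "N")],
    [("the", "Det"), ("book", "N"), ("fell", "V")],
]

# Count how often each word carries each tag.
counts = defaultdict(Counter)
for sentence in corpus:
    for word, t in sentence:
        counts[word][t] += 1

def tag(word):
    """Pick the tag maximizing P(tag | word) as estimated from the corpus."""
    return counts[word].most_common(1)[0][0]

print(tag("book"))  # 'N': seen twice as a noun, once as a verb
print(tag("the"))   # 'Det'
```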
Parsing
The essence is to determine whether the rules used are correct
before the final translation.
1. Top Down Parser
2. Bottom Up parser
Formal Grammar
This is generally described as a structure containing a
vocabulary and a set of rules for defining strings on the
basis of the vocabulary. The Chomsky hierarchy of
formal grammars:
Unrestricted Grammar (Type 0): the languages
defined by type 0 grammars are accepted by Turing machines;
Chomskian transformations are defined as type 0
grammars. Type 0 grammars have rules of the form
α → β
where α and β are arbitrary strings over the
vocabulary V, and α ≠ ε.
α and β may contain terminals and non-terminals of a
sentence. E.g.
α could be: S, NP, VP and PP; all these are non-terminals.
β could be: verb, noun, preposition, etc.;
the words in these categories are terminals.
Context Sensitive Grammar (Type 1): the languages
defined by type 1 grammars are accepted by linearly
bounded automata. The syntax of some natural
languages is generally held in computational linguistics
to be context sensitive. They have rules of the form:
αAβ → αγβ
where A ∈ N, γ ≠ ε, and α, β, γ ∈ V*.
S → ε is allowed, where S is the initial symbol and ε is
the empty string.
14-03-2014
Example: Type 1
Rule 1: S → aSBC
Rule 2: S → aBC
Rule 3: CB → BC
Rule 4: aB → ab
Rule 5: bB → bb
Rule 6: bC → bc
Rule 7: cC → cc
This grammar generates all strings consisting of a non-empty
sequence of a's followed by the same number of
b's followed by the same number of c's:
a^n b^n c^n
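A recognizer for the language a^n b^n c^n generated by the grammar above can be sketched directly (this checks membership in the language; it does not simulate the grammar's derivations):

```python
def is_anbncn(s):
    """True iff s = a^n b^n c^n for some n >= 1 (non-empty, equal counts)."""
    n = len(s) // 3
    return n >= 1 and s == "a" * n + "b" * n + "c" * n

print(is_anbncn("aabbcc"))  # True  (n = 2)
print(is_anbncn("aabbc"))   # False (counts do not match)
```

The three equal counts are exactly what pushes this language beyond context-free: a single stack can match a's against b's, but not against c's as well.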
Context Free Grammar (Type 2): the languages
defined by type 2 grammars are accepted by push-down
automata; the syntax of natural languages is definable almost
entirely in terms of context-free grammars and the tree
structures generated by them. Type 2 grammars have
rules of the form:
A → γ, where A ∈ N and γ ∈ V*.
There are special normal forms, e.g. Chomsky normal
form and Greibach normal form, into which any context-free
grammar can be equivalently converted.
Regular Grammars (Type 3): the languages defined by
these are accepted by finite state automata; morphological
structure, and perhaps all the syntax of formal spoken dialogue,
is describable by regular grammars.
Read up: there are two types, right linear and left linear.
Scenario 1: Subject Verb Object
a. John ate the rice
1. S → NP VP
2. NP → N
3. VP → V NP
4. NP → DET N
5. NP → DET ADJ N
b. The tall boy ate the white rice
[subject & object have modifiers]
Rule (5) above suffices.
Note: if we have a preposition, then it can be reduced to
PP → Prep NP
Adjectival phrase → (Det) Adj N
Adverbial phrase → Verb (Adv)
Re-ordering | Swapping
(1) He ate the rice
O je iresi naa
PRN V N Det
So you need to change the order in the re-write rules, i.e.
Det Noun → Noun Det
(2) The red car
Oko pupa naa
The adjectival phrase becomes Noun Adjective Det
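The Det Noun → Noun Det re-ordering can be sketched as a pass over a tagged sentence; the tag names and the example input are assumptions matching the notes' gloss:

```python
def reorder_det_noun(tagged):
    """Swap adjacent (Det, Noun) pairs to move toward Yoruba word order."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "Det" and out[i + 1][1] == "N":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2          # skip past the swapped pair
        else:
            i += 1
    return out

eng = [("he", "PRN"), ("ate", "V"), ("the", "Det"), ("rice", "N")]
print(reorder_det_noun(eng))
# [('he', 'PRN'), ('ate', 'V'), ('rice', 'N'), ('the', 'Det')]
```

This is only the structural half of rule-based translation; a bilingual lexicon would then replace each word with its target-language equivalent.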
Phrase Structure Grammar
This is the traditional method of representing the
structural constituents of a sentence in a phrase
structure tree.
The tall man | beat the boy
NP | VP
(subject) | (verb + object)
S → NP VP
NP → Det AdjP
AdjP → Adj N
VP → V NP
NP → Det N
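The five rules above can be turned into a small hand-rolled top-down parser. This is a minimal sketch: the lexicon is an assumption covering only the example sentence, and unknown words would raise a KeyError rather than being handled gracefully:

```python
# Toy lexicon (an assumption; a real system would use a large tagged lexicon).
LEXICON = {"the": "Det", "tall": "Adj", "man": "N", "boy": "N", "beat": "V"}

def parse(tokens):
    """Top-down parse using S -> NP VP, NP -> Det AdjP | Det N,
    AdjP -> Adj N, VP -> V NP. Returns a nested tuple tree or None."""
    tags = [LEXICON[t] for t in tokens]

    def np(i):
        if i < len(tags) and tags[i] == "Det":
            # NP -> Det AdjP, where AdjP -> Adj N
            if i + 2 < len(tags) and tags[i+1] == "Adj" and tags[i+2] == "N":
                adjp = ("AdjP", ("Adj", tokens[i+1]), ("N", tokens[i+2]))
                return ("NP", ("Det", tokens[i]), adjp), i + 3
            # NP -> Det N
            if i + 1 < len(tags) and tags[i+1] == "N":
                return ("NP", ("Det", tokens[i]), ("N", tokens[i+1])), i + 2
        return None, i

    def vp(i):
        if i < len(tags) and tags[i] == "V":
            obj, j = np(i + 1)
            if obj:
                return ("VP", ("V", tokens[i]), obj), j
        return None, i

    subj, i = np(0)
    pred, j = vp(i)
    if subj and pred and j == len(tokens):
        return ("S", subj, pred)
    return None

tree = parse("the tall man beat the boy".split())
print(tree is not None)  # True: the sentence is grammatical under these rules
```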
Inflectional verbs: in translating from English to
Yoruba, there's no need to take care of inflectional
verbs.
E.g. beat | beats → na
go | goes | went | gone → a single Yoruba form
21-03-2014
TEXT PROCESSING
Spelling Correction: scanning a sentence for words that
do not appear in the dictionary.
Grammar Checker: flagging words put together in a
sentence when there is no correspondence between them,
hence making the sentence out of context.
Language Recognition:
Information Retrieval: allows one to locate relevant
documents that are related to a query, but does not specify
where the answers are. In information retrieval, the
documents of interest are fetched by matching query
keywords to indexes of a document collection. The main
purpose of information retrieval is to prevent information
overload.
Information Extraction: this extracts the information
of interest for a well-defined extraction domain, and it
relies on filling out predefined templates. Such
information consists of entities and the relationships
between them; thus information extraction generates
structured information.
This is about getting specific information: the difference
is that information retrieval returns documents, while
extraction retrieves information particular to a
domain and often requires templates for extraction.
Handwriting Recognition: comes under character
recognition.
Essay Grading: searching through an essay for key terms,
as well as checking that it is written in good grammar.
Recommender System: a system that goes through
given data and then, based on defined variables supplied
to it, suggests an idea to the user.
Text Categorization: a case study in which this has been
used is the Federalist Papers; authorship was determined
using writing style and the words used. Basically, text
categorization looks through documents from unknown
authors and attempts to categorize them. Plagiarism
detection is also an application of text categorization.
Question Answering System: accepts questions in
natural language form, searches for answers over a
collection of documents and formulates a concise
answer. It involves question processing, document
retrieval, answer extraction and answer formulation.
Computational Semantics:
Tools applied for Text Processing
Pattern Matching: text written in any language is
about patterns; matching a defined pattern is what
makes text valid. Rules can be used to implement pattern
matching.
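As a sketch of rule-based pattern matching, a regular expression can flag likely proper nouns: capitalized words that are not sentence-initial. This is a heuristic assumption for illustration, not a rule from the notes:

```python
import re

def likely_proper_nouns(sentence):
    """Flag capitalized, non-sentence-initial words as likely proper nouns."""
    words = sentence.split()
    # Skip the first word: it is capitalized regardless of category.
    return [w for w in words[1:] if re.fullmatch(r"[A-Z][a-z]+", w)]

print(likely_proper_nouns("Yesterday Ade met Olu in Lagos"))
# ['Ade', 'Olu', 'Lagos']
```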
- Regular Expressions: e.g. detecting whether
a word is a proper noun.
- Statistics or probability:
- Dictionaries:
- Machine Learning: often relies on
probability theory and statistics alongside a
machine learning algorithm to be able to
recognize patterns
- Language Model:
- Spell Check: the process includes scanning
text and identifying errors
- Edit Distance: insertion, transposition,
substitution, deletion
- Contextual Spell Checking:
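The edit-distance operations listed above can be sketched with the classic dynamic-programming recurrence. This version counts insertions, deletions, and substitutions (Levenshtein distance); adding transposition would extend it to Damerau-Levenshtein:

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions
    turning string a into string b (Levenshtein distance)."""
    prev = list(range(len(b) + 1))      # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(edit_distance("procesing", "processing"))  # 1 (one missing 's')
```

A spell checker uses this to rank dictionary words by distance from a misspelled token and suggest the nearest ones.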
Parse tree for "The tall man beat the boy":
(S (NP (Det The) (AdjP (Adj tall) (N man)))
   (VP (V beat) (NP (Det the) (N boy))))