Natural language processing


Upload: salauwale

Post on 22-Nov-2015


TRANSCRIPT

  • 1

    @tomzeal

    CPE 510: Natural Language Processing

    Friday HSLTB 8-10am

    Human Language Processing: This area of study is part of the field of artificial intelligence.

    Artificial intelligence is defined by what it has and what it does, not by what it is.

    What is Artificial Intelligence?

    Alan Turing: The Turing test proposes a way to decide whether a machine is intelligent.

    Searle: Chinese room thought experiment

    Features of Human Language

    1. Human language has a sophisticated linguistic system which allows the generation of infinitely many expressions from a finite set of cues or words.

    2. Human language is ambiguous. Ambiguity occurs at different levels:

    - Lexical ambiguity: written the same way and pronounced the same way, but meaning different things (Ade sa Aso & Ade sa ere).

    - Orthographic ambiguity: written the same way, but pronounced differently (I read the book [present], I read the book [past]; It is a record, Please record it).

    - Sentential ambiguity: the expression as a whole is ambiguous (I saw the boy with my glasses).

    3. Language is domain bounded, i.e. bounded by a domain of belief (some words can be a taboo or an abomination) or a domain of world view (how you see the world):

    Wife - Iyawo, Husband - Oko, Family - Ebi, Lord - Oluwa

    4. Language evolves with time and technology.

    5. Language expresses three entities:

    - What is done (Verb)

    - Who did it (Subject)

    - To whom it is done (Object)

    6. It uses defaults to express multi-level conceptions of complex ideas (e.g. birds fly, but an ostrich does not fly even though it is a bird).

    What must be considered for Natural Language Processing?

    Language has a structure (grammatical structure).

    Language has a context (semantics).

    Language has a constraint on the mapping of structure to context.

    What is Language?

    Why is it difficult to express knowledge definitively?

    1. Whole Part Dilemma: If a system comprises many components, can we describe the behaviour of the system as an aggregation of its component parts?

    2. Signifier Signified Dilemma: Things mean different things to different people. Most often the signifier is spoken of and not the signified.

    How does perception come to us? Through the senses (sight, sound, taste, smell, touch), but other things also go into what we perceive (emotion and intuition), and language goes into this as well.

    S/N | Formal Language | Human Language | Natural Language

    1. | Means of communicating with computing devices | Means of communication among humans | Means of communication among natural entities

    2. | Dry and sterile | Creative and fertile | Engrained in nature

    3. | Uses discrete symbols | Uses signifiers to represent natural or abstract entities or phenomena | Uses natural signal phenomena primarily, e.g. waves (sound and water), sun rays

    4. | Definitive and seeks consistency | Ambiguous (because of creativity) | Definitive and consistent

    Content of NLP

    [Figure labels: Carpenter, Botanist, Saw Miller, Tree, Computer Science]

  • 2


    (A - Formal Language, B- Psychology, C - A.I, D - NLP)

    Features of human language in the context of NLP

    1. The construction of signifiers for ideas and concepts is arbitrary.

    2. Due to this arbitrariness, the signifiers or words that constitute a language are difficult to count. There are 171,476 words in the Advanced Learner's Dictionary.

    3. There is a systematic process for constructing expressions in the language from the basic signifiers (this is the grammar).

    4. No two individuals have the same language behaviour.

    5. It is possible to generate infinitely many expressions from a finite set of signifiers or words using the same system.

    Useful Terms in NLP

    Alphabet: a finite set of symbols defining a language. The symbols satisfy the following:

    Each symbol in an alphabet must represent a unique primitive, and each primitive represents an indivisible or atomic entity in the domain of interest.

    Each primitive on its own cannot convey a sense or concept.

    A string is a sequence of symbols over an alphabet, i.e. a concatenation of symbols of the alphabet.

    A word is a string that has a meaning; the meaning is what differentiates it from a string, i.e. String + meaning assignment = Word.

    Vocabulary is the set of words in a language.

    A syllable is a sound that can be produced in one effort.

    A language is formally defined as a set of strings over an alphabet. In this class it is a set of words and the rules that govern them.

    A meta language is a language used to describe another language. Examples are mark-up over XML and comments in programs.

    Syntax is the set of rules for constructing expressions.

    Semantics is what an expression means.

    Pragmatics is what people get from it.
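
    The terms above can be made concrete with a minimal sketch. The alphabet and vocabulary below are hypothetical toy values chosen only for illustration, and "having a meaning" is approximated by membership in the vocabulary.

```python
from itertools import product

# Hypothetical toy alphabet and vocabulary, chosen only for illustration.
ALPHABET = {"a", "d", "e"}
VOCABULARY = {"ade", "dad", "add"}  # strings that have been assigned a meaning


def is_string_over(s, alphabet):
    """A string is a concatenation of symbols drawn from the alphabet."""
    return all(symbol in alphabet for symbol in s)


def is_word(s):
    """A word is a string that also has a meaning (here: it is in the vocabulary)."""
    return is_string_over(s, ALPHABET) and s in VOCABULARY


def strings_of_length(n):
    """Enumerate every string of length n over the alphabet."""
    return ["".join(p) for p in product(sorted(ALPHABET), repeat=n)]


if __name__ == "__main__":
    candidates = strings_of_length(3)
    print(len(candidates), "strings of length 3 over the alphabet")
    print("of which words:", [s for s in candidates if is_word(s)])
```

    Every word is a string, but most strings over the alphabet are not words, which is the distinction drawn by "String + meaning assignment = Word".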

    Computational Level

    Level 0: Register

    Level 1: Memory

    Level 2: Arithmetic / Logic / Relation

    Level 3: Counting and Ordering

    Level 4: Selection / Decision

    Level 5: Control and Parallel operation

    An operation at level i requires the details of level (i - 1).

    Noam Chomsky Hierarchy

    Type | Grammar | Example | Model

    3 | Regular (regular expression) | $(ab)^n$ | Finite State Automata

    2 | Context Free | $a^n b^n$ | Push Down Automata (Stack)

    1 | Context Sensitive (includes options of selection) | $a^n b^n c^n$ | Linearly Bounded Automata

    0 | Unrestricted (Recursively Enumerable) | e.g. human language | Turing Machine
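
    The example languages in the hierarchy can be told apart by how much machinery is needed to recognise them. The following is a minimal sketch, assuming n >= 1 in each case. The function names are illustrative, and the last recogniser simply counts symbols rather than simulating a linearly bounded automaton.

```python
import re


def is_ab_n(s):
    """Type 3: (ab)^n, n >= 1, recognised with a regular expression."""
    return re.fullmatch(r"(ab)+", s) is not None


def is_an_bn(s):
    """Type 2: a^n b^n, n >= 1, recognised with a stack, as a pushdown automaton would."""
    stack = []
    i = 0
    while i < len(s) and s[i] == "a":            # push one marker per 'a'
        stack.append("a")
        i += 1
    while i < len(s) and s[i] == "b" and stack:  # pop one marker per 'b'
        stack.pop()
        i += 1
    # accept only if all input is consumed and every 'a' was matched by a 'b'
    return len(s) > 0 and i == len(s) and not stack


def is_an_bn_cn(s):
    """Type 1: a^n b^n c^n, n >= 1; a single stack is not enough, so compare three counts."""
    m = re.fullmatch(r"(a+)(b+)(c+)", s)
    return m is not None and len(m.group(1)) == len(m.group(2)) == len(m.group(3))


if __name__ == "__main__":
    for text in ["abab", "aabb", "aabbcc", "aabbc"]:
        print(text, is_ab_n(text), is_an_bn(text), is_an_bn_cn(text))
```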

    NLP System Development Steps

    Why do we have to develop?

    Vocabulary, Semantics, Synthesis, Grammar

    Speech Recognition, Speech Synthesis, Machine Translation, Text Summarization, Automatic Dialogue System, Automata Diacition, Language Recognition

    [Figure labels: Computer Science and Engineering; Cognitive Science; Linguistics and Language; Language Tech; Human Language; Impacts; Develops; regions A, B, C, D]

  • 3


    1. Understand the problem

    2. State reasonable assumptions that are appropriate

    3. Identify the language structure

    4. Identify the system states

    Speech and Signal Analysis

    Praat.exe

    Modelling Human Language

    Subject - Ade

    Verb - Slapped

    Object - Olu

    Phrase Structure Grammar (Context Free Grammar - Level 2)

    G = (VT, VN, S, P), where:

    VT = Finite set of terminal symbols, i.e. the alphabet

    VN = Finite non-empty set of non-terminal symbols

    S = Non-terminal symbol called the start symbol

    P = Set of production or re-write rules |- Non-terminal notation

    1. ::=

    2. ::= /

    3. ::= /

    4. ::= {list of all nouns}

    5.

  • 4


    W: The domain of the robot arm

    B = {A, B, C, D, E, F}

    The environment is monotonic, i.e. the environment won't change as the problem of the domain evolves. The state only changes by effect of the environment. The domain is closed.

    In order to abstract, a state space must be defined. Whatever method is selected, the following must be carried out:

    1. Identify and label each object in the domain

    2. Identify and label the relationships between the objects

    3. Represent data on the spatial location of each object

    4. Express the world by formally representing facts about the world

    5. Devise a mechanism to determine whether or not a formula, fact or expression is logically plausible.

    Methods

    1. A Semantic Network can be used to represent this

    2. First Order Predicate Calculus: The logic used here is binary logic. The terms need to be defined, and the calculus is allowed to manipulate them.

    Box = {A, B, C, D, E, F}

    Table = {T}

    Robot = {R}
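
    The five steps can be sketched in code for the box, table and robot sets above. This is only an illustration: the predicate names `on` and `holding`, the example facts, and the consistency rules in `is_plausible` are assumptions, not part of the notes.

```python
BOXES = {"A", "B", "C", "D", "E", "F"}
TABLE = "T"
ROBOT = "R"

# Facts are (predicate, subject, object) triples; the predicate names are hypothetical.
facts = {
    ("on", "A", "B"),       # box A sits on box B
    ("on", "B", "T"),       # box B sits on the table
    ("on", "C", "T"),
    ("holding", "R", "D"),  # the robot arm holds box D
}


def is_plausible(fact, known):
    """Return True if `fact` is consistent with the facts already known."""
    predicate, x, y = fact
    if predicate == "on":
        if x not in BOXES or y not in (BOXES | {TABLE}) or x == y:
            return False
        # a box rests on exactly one thing
        if any(p == "on" and a == x and b != y for p, a, b in known):
            return False
        # only one box fits directly on another box (the table has room for many)
        if y != TABLE and any(p == "on" and b == y and a != x for p, a, b in known):
            return False
        # a box held by the robot is not resting on anything
        return ("holding", ROBOT, x) not in known
    if predicate == "holding":
        if x != ROBOT or y not in BOXES:
            return False
        holds_other = any(p == "holding" and b != y for p, _, b in known)
        covered = any(p == "on" and b == y for p, _, b in known)
        return not holds_other and not covered
    return False  # unknown predicates are rejected


if __name__ == "__main__":
    print(is_plausible(("on", "E", "A"), facts))  # E can be placed on the free box A
    print(is_plausible(("on", "E", "B"), facts))  # B already has A on top of it
```

    Because the domain is closed, any predicate or object outside the three sets is rejected rather than treated as unknown.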

    31-01-2014

    Machine Translation

    Machine Translation is the application of computers to the task of translating text or speech from one human language to another. The expression being translated can be in different forms such as text, speech, image, sign etc. The goal of machine translation is to communicate the content of an expression in one human language, referred to as the source language (SL), by its equivalent in another human language, referred to as the target language (TL). Machine translation is a multidisciplinary study which cuts across the arts and sciences. The translation can be unidirectional, bidirectional or multidirectional.

    Representation and Processing

    Human translators usually employ at least five distinct kinds of knowledge:

    a. Knowledge of the source language

    b. Knowledge of the target language

    This allows them to produce texts that are acceptable in the target language.

    c. Knowledge of the various correspondences between the source language and the target language, i.e. how individual words can be translated

    d. Knowledge of the subject matter, including ordinary general knowledge and common sense

    This, along with knowledge of the source language, allows them to understand what the text to be translated means.

    e. Knowledge of the culture, social conventions, customs and expectations etc. of the speakers of the source and target languages.

    * Phonological Knowledge, Morphological Knowledge

    Phonological Knowledge: Knowledge about the sound system of a language, which for example allows one to work out the likely pronunciation of words. When dealing with written text, such knowledge is not useful. However, there is related knowledge about orthography which can be useful. This deals with writing style.

    Example Aiye | Aye

  • 5


    Enia | Eniyan

    Adie | Adiye

    Morphological Knowledge: This has to do with applying knowledge from the study of the form and internal structure of words and of morphemes (their semantic building blocks).

    Morphemes are the smallest linguistic units in a word that can carry a meaning.

    Example: Print/er (Verb/noun)

    Un-, break, -able in unbreakable

    It deals with how words can be constructed.
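
    As a rough illustration of morphological knowledge, the sketch below splits a word into morphemes by stripping affixes. The prefix, suffix and root lists are hypothetical and far too small for real use; they are only enough to cover the two examples above.

```python
# Hypothetical affix and root lists, just large enough for the examples in the notes.
PREFIXES = ["un", "re"]
SUFFIXES = ["able", "er", "ing"]
ROOTS = {"break", "print"}


def segment(word):
    """Split a word into prefix, root and suffix morphemes by affix stripping."""
    word = word.lower()
    for prefix in [""] + PREFIXES:
        for suffix in [""] + SUFFIXES:
            if not word.startswith(prefix) or not word.endswith(suffix):
                continue
            root = word[len(prefix): len(word) - len(suffix)]
            if root in ROOTS:
                # drop the empty strings used for "no prefix" / "no suffix"
                return [m for m in (prefix, root, suffix) if m]
    return [word]  # no analysis found: treat the whole word as one morpheme


if __name__ == "__main__":
    print(segment("unbreakable"))
    print(segment("printer"))
```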

    Syntactic Knowledge: Knowledge about how sentences and other sorts of phrases can be made up out of words.

    Semantic Knowledge: Knowledge about what words, phrases and sentences mean, and about how the meaning of a phrase is related to the meaning of its component words.

    Tolu builds a house (SVO)

    Pragmatic Knowledge: Knowledge about the practical use of language, i.e. the use of spoken language in a social context.

    Representing Linguistic Knowledge

    In general, syntax is concerned with two slightly different analyses of sentences:

    The first is constituent or phrase structure analysis: the division of sentences into their constituent parts and the categorization of these parts as nouns, verbs etc.

    The second has to do with grammatical relations: the assignment of grammatical relations such as object, subject, head and so on to various parts of the sentence.

    Ade       ate         the food

    N         V           Det N

    Subject   Predicate   Object

    Grammar and Constituent Structure

    Sentences are made up of words traditionally categorized into parts of speech or categories, including nouns, verbs, adjectives and prepositions. A grammar of a language is a set of rules which states how these parts of speech can be put together to make grammatical or well-formed sentences.

    E.g.

    a) "Put some papers in the printer" follows the grammar rules

    b) "Print some put in papers" does not follow the grammar rules

    Here are some simple rules of English grammar with examples. A sentence consists of a noun phrase followed by a verb phrase, such as:

    The user (Noun Phrase) should (modal) clean (verb) the printer

    Here "should clean the printer" is the verb phrase (VP).

    But within a verb phrase we can have a noun phrase: VP → V NP

    A noun phrase can consist of a determiner or article, such as "the" or "a", followed by a noun.

    English to Yoruba: Unidirectional, Bilingual

    English to Yoruba and Yoruba to English: Bidirectional, Bilingual

    English to (Hausa, Igbo, Yoruba).

    11-03-2014

    Difference between Clauses and Sentences

    A verb phrase can consist of verbs and auxiliary verbs. A verb phrase can also contain a noun phrase such as "the printer". It will consist of a verb alone only when it is a command, e.g. "Go!".

    We can have a structure like:

    S → NP + VP

    Parts of Speech

    Subject            Verb    Object

    N/NP/Pronoun       Verb    N/NP/Pronoun

  • 6

    Machine Translation Approach

    1. Statistical Based Approach (Data Driven): Translates sentences using a set of stored words and translations held in a database or repository.

    2. Rule Based Approach: Analyses the grammar structure of the source and target languages, expresses it as a context free grammar, and then applies the grammar rules to the translation.

    3. Hybrid Approach: Combines the statistical and rule based approaches.

    Part of Speech tagging: The process of assigning a part of speech label to each word in a text.

    Lexicon = Collection of words

    Lexical Item === Lexeme === Word

    Part of speech gives information about a word and its neighbours. This is clearly true for major categories, e.g. Verb and Noun.

    Book the flight ("book": V | N)

    Ade bought the book

    S → Ade bought the book

    N → Ade, book

    V → bought

    Det → the

    Ade ra iwe naa

    High /

    Low \

    Mid-tone

    Types of Part of Speech Tagging

    1. Rule Based: Rule based taggers often have a large database of handwritten disambiguation rules which specify, for example, that an ambiguous word is a noun rather than a verb if it follows a determiner.

    2. Stochastic Tagging: Generally resolves tagging ambiguities by using a training corpus to compute the probability of a given word having a given tag in a given context.

    Hidden Markov Model
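
    The rule based idea can be sketched with a single handwritten disambiguation rule. The toy lexicon, the tag names and the fallback choices (first listed tag, noun for unknown words) are assumptions made for this illustration; a real tagger would hold many more rules.

```python
# Hypothetical toy lexicon: each word maps to the tags it may take.
LEXICON = {
    "book": ["V", "N"],  # ambiguous between verb and noun
    "the": ["Det"],
    "flight": ["N"],
    "ade": ["N"],
    "bought": ["V"],
}


def tag(sentence):
    """Assign one part of speech tag to every token of the sentence."""
    tagged = []
    for token in sentence.split():
        # assumption: words missing from the lexicon default to noun
        options = LEXICON.get(token.lower(), ["N"])
        if len(options) == 1:
            choice = options[0]
        elif tagged and tagged[-1][1] == "Det" and "N" in options:
            # handwritten rule: an ambiguous word that follows a determiner is a noun
            choice = "N"
        else:
            choice = options[0]  # otherwise fall back to the first listed tag
        tagged.append((token, choice))
    return tagged


if __name__ == "__main__":
    print(tag("Book the flight"))
    print(tag("Ade bought the book"))
```

    The same word "book" receives a different tag in the two sentences because only the second occurrence follows a determiner.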

    Parsing

    The essence is to determine whether the rules used are correct before the final translation.

    1. Top Down Parser

    2. Bottom Up parser

    Formal Grammar

    This is generally described as a structure containing a vocabulary and a set of rules for defining strings on the basis of the vocabulary. The Chomsky hierarchy of formal grammars:

    Unrestricted Grammar (Type 0): The languages defined by type 0 grammars are accepted by Turing machines. Chomskian transformations are defined as type 0 grammars. Type 0 grammars have rules of the form

    $\alpha \to \beta$

    where $\alpha$ and $\beta$ are arbitrary strings over the vocabulary $V$ and $\alpha \neq \epsilon$.

    $\alpha$ and $\beta$ could be terminals and non-terminals of a sentence. E.g.

    $\alpha$ could be: S, NP, VP and PP. All these are non-terminals.

    $\beta$ could be: verb, noun, preposition etc. The words in these categories are terminals.

    Context Sensitive Grammar (Type 1): The languages defined by type 1 grammars are accepted by linearly bounded automata. The syntax of some natural languages is generally held in computational linguistics to require type 1 grammars. They have rules of the form:

    $\alpha A \beta \to \alpha B \beta$

    where $A \in N$, $B \neq \epsilon$ and $\alpha, \beta, B \in V^*$, together with

    $S \to \epsilon$, where $S \in N$ is the initial symbol and $\epsilon$ is the empty string.

    14-03-2014

    Example: Type 1

  • 7

    Rule 1: S → aSBC

    Rule 2: S → aBC

    Rule 3: CB → BC

    Rule 4: aB → ab

    Rule 5: bB → bb

    Rule 6: bC → bc

    Rule 7: cC → cc

    This grammar generates all strings consisting of a non-empty sequence of a's followed by the same number of b's followed by the same number of c's:

    $a^n b^n c^n$, $n \geq 1$
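
    The behaviour of this example grammar can be checked by brute force. The sketch below assumes the standard form of the seven rules (S → aSBC, S → aBC, CB → BC, aB → ab, bB → bb, bC → bc, cC → cc) and searches breadth-first over sentential forms. The function name `generate` and the length bound are illustrative choices.

```python
from collections import deque

# The seven re-write rules, assuming the standard form of this example grammar.
RULES = [
    ("S", "aSBC"),
    ("S", "aBC"),
    ("CB", "BC"),
    ("aB", "ab"),
    ("bB", "bb"),
    ("bC", "bc"),
    ("cC", "cc"),
]


def generate(max_len):
    """Return every terminal string of length <= max_len derivable from S."""
    seen = {"S"}
    queue = deque(["S"])
    sentences = set()
    while queue:
        form = queue.popleft()
        if form.islower():  # only terminals left: a derived sentence
            sentences.add(form)
            continue
        for left, right in RULES:
            start = form.find(left)
            while start != -1:  # apply the rule at every position where it matches
                new_form = form[:start] + right + form[start + len(left):]
                # no rule shortens a form, so over-long forms can be discarded
                if len(new_form) <= max_len and new_form not in seen:
                    seen.add(new_form)
                    queue.append(new_form)
                start = form.find(left, start + 1)
    return sentences


if __name__ == "__main__":
    print(sorted(generate(9), key=len))
```

    The search terminates because none of the rules makes a sentential form shorter, which is the property that lets a linearly bounded automaton accept such languages.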

    Context Free Grammar (Type 2): The languages defined by type 2 grammars are accepted by push down automata (PDA). The syntax of natural languages is definable almost entirely in terms of context free grammars and the tree structures generated by them. Type 2 grammars have rules of the form:

    $A \to \gamma$, where $A \in N$ and $\gamma \in V^*$

    There are special normal forms, e.g. Chomsky normal form and Greibach normal form, into which any context free grammar can be equivalently converted.

    Regular Grammars (Type 3): The languages defined by these are accepted by finite state automata (FSA). Morphological structure and perhaps all the syntax of informal spoken dialogue is describable by regular grammars.

    Read up: There are two types, right linear and left linear.

    Scenario 1: Subject Verb Object

    a. John ate the rice

    (1) S → NP VP

    (2) NP → N

    (3) VP → V NP

    (4) NP → DET N

    (5) NP → DET ADJ N

    b.

    c. The tall boy ate the white rice

    [Subject & object modifiers]

    Rule (5) above suffices.

    Note: If we have a preposition, then the phrase can be reduced to

    PP → Prep NP

    Adjectival Phrase → (Det) Adj N

    Adverbial Phrase → Verb (Adv)

    Re-ordering | Swapping

    (1) He ate the rice

    Ó jẹ ìrẹsì náà

    PRN V N Det

    So, you need to change the order in the re-write rules, i.e.

    Det Noun → Noun Det

    (2) The red car

    Ọkọ̀ pupa náà

    The adjectival phrase becomes Noun Adjective Det
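
    The re-ordering step can be sketched as a pass over part of speech tags that swaps the matched positions before word-for-word substitution. The toy lexicon (written without tone marks), the tag table and the rule format are assumptions for illustration only.

```python
# Toy English-to-Yoruba lexicon (tone marks omitted) and tag table; both are
# assumptions for illustration and cover only the example sentences.
LEXICON = {"ade": "Ade", "bought": "ra", "the": "naa", "book": "iwe",
           "red": "pupa", "car": "oko"}
TAGS = {"ade": "N", "bought": "V", "the": "Det", "book": "N",
        "red": "Adj", "car": "N"}

# Each rule: a source tag pattern and the order of its positions on the target side.
REORDER_RULES = [
    (("Det", "Adj", "N"), (2, 1, 0)),  # Det Adj N -> N Adj Det
    (("Det", "N"), (1, 0)),            # Det N -> N Det
]


def translate(sentence):
    """Re-order the source words with the rules, then substitute word for word."""
    words = sentence.lower().split()  # words outside the toy lexicon raise KeyError
    tags = [TAGS[w] for w in words]
    reordered = []
    i = 0
    while i < len(words):
        for pattern, order in REORDER_RULES:
            if tuple(tags[i:i + len(pattern)]) == pattern:
                reordered.extend(words[i + k] for k in order)
                i += len(pattern)
                break
        else:  # no rule matched at this position: keep the word in place
            reordered.append(words[i])
            i += 1
    return " ".join(LEXICON[w] for w in reordered)


if __name__ == "__main__":
    print(translate("Ade bought the book"))
    print(translate("The red car"))
```

    The longer pattern is listed first so that "Det Adj N" is not split by the shorter "Det N" rule.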

    Phrase Structure Grammar

    This is the traditional method of representing the structural constituents of a sentence in a phrase structure tree.

    The tall man | beat the boy

    NP | VP

    Subject | Object

    S

    NP VP

    S → NP VP

    NP → Det AdjP

    AdjP → Adj N

    VP → V NP

    NP → Det N
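
    A small top-down parser shows how these re-write rules yield a phrase structure tree for "The tall man beat the boy". This is a minimal sketch: the lexicon is an assumption, and the parser accepts the first production that succeeds, which is enough for this grammar but not for grammars in general.

```python
# Re-write rules of the phrase structure grammar for the example sentence.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "AdjP"], ["Det", "N"]],
    "AdjP": [["Adj", "N"]],
    "VP": [["V", "NP"]],
}
# Hypothetical lexicon mapping each word to its part of speech.
LEXICON = {"the": "Det", "tall": "Adj", "man": "N", "boy": "N", "beat": "V"}


def parse(symbol, tokens, pos):
    """Expand `symbol` top-down from tokens[pos]; return (tree, next_pos) or None."""
    if symbol not in GRAMMAR:  # part of speech: must match the tag of the next word
        if pos < len(tokens) and LEXICON.get(tokens[pos].lower()) == symbol:
            return (symbol, tokens[pos]), pos + 1
        return None
    for production in GRAMMAR[symbol]:
        children, cursor = [], pos
        for child in production:
            result = parse(child, tokens, cursor)
            if result is None:
                break  # this production fails: try the next alternative
            subtree, cursor = result
            children.append(subtree)
        else:
            return (symbol, *children), cursor
    return None


def parse_sentence(sentence):
    """Return the parse tree of a sentence, or None if it is not well formed."""
    tokens = sentence.split()
    result = parse("S", tokens, 0)
    if result is None or result[1] != len(tokens):
        return None
    return result[0]


if __name__ == "__main__":
    print(parse_sentence("The tall man beat the boy"))
    print(parse_sentence("beat man the tall"))
```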

  • 8

    Inflectional verbs: In translating from English to Yoruba, there's no need to take care of inflectional verb forms.

    E.g.: beat | beats → na

    Go | goes | went | gone went

    21-03-2014

    TEXT PROCESSING

    Spelling Correction: Scanning a sentence for words that do not appear in the dictionary.

    Grammar Checker: This is the flagging of words put together in a sentence when there is no correspondence between them, which makes the sentence out of context.

    Language Recognition:

    Information Retrieval: Allows one to locate relevant documents that are related to a query, but does not specify where the answers are. In information retrieval, the documents of interest are fetched by matching query keywords to the indexes of a document collection. The main purpose of information retrieval is to prevent information overload.
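
    Matching query keywords against an index of the document collection can be sketched with an inverted index. The three documents and the scoring by number of matched keywords are assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical document collection, keyed by document id.
DOCUMENTS = {
    1: "machine translation of Yoruba text",
    2: "speech recognition and speech synthesis",
    3: "statistical machine translation",
}


def build_index(documents):
    """Map every term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def retrieve(query, index):
    """Return document ids ranked by how many query keywords they contain."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # best score first; ties are broken by document id
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    index = build_index(DOCUMENTS)
    print(retrieve("machine translation", index))
    print(retrieve("speech", index))
```

    The result is a list of documents, not an answer, which is the point made above: retrieval says which documents are relevant but not where in them the answer lies.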

    Information Extraction: This extracts the information of interest for a well-defined extraction domain and relies on filling out predefined templates. Such information consists of entities and the relationships between them; thus information extraction generates structured information.

    This is about getting specific information. The difference is that information retrieval returns documents, while information extraction retrieves information particular to a domain and often requires templates for extraction.

    Handwriting Recognition: This comes under character recognition.

    Essay Grading: Searching an essay for key terms, as well as checking that it is written in good grammar.

    Recommender System: A system that goes through given data and then, based on defined variables supplied to it, suggests an idea to the user.

    Text Categorization: A case study in which this has been used is the Federalist Papers. This was accomplished using writing style and the words used. Basically, text categorization looks through documents from unknown authors and makes efforts to categorize them. Plagiarism detection is also an application of text categorization.

    Question Answering System: Accepts questions in natural language form, searches for answers over a collection of documents and formulates a concise answer. It involves question processing, document retrieval, answer extraction and answer formulation.

    Computational Semantics:

    Tools applied for Text Processing

    Pattern Matching: Text written in any language follows patterns. Matching a defined pattern is what makes it valid. Rules can be used to implement pattern matching.

    - Regular Expressions: An example is detecting whether a word is a proper noun.

    - Statistics or probability:

    - Dictionaries:

    - Machine Learning: Often relies on probability theory and statistics alongside a machine learning algorithm to be able to recognize patterns

    - Language Model:

    - Spell Check: The process includes scan, identify

    - Edit Distance: insertion, transposition, substitution, deletion

    - Contextual Spell Checking:
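
    The edit distance item can be illustrated with a dynamic programming sketch that counts insertions, deletions, substitutions and transpositions of adjacent letters (the optimal string alignment variant). The small dictionary and the `suggest` helper with its threshold of 2 are assumptions for illustration.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, substitutions and adjacent
    transpositions needed to turn string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


# Hypothetical dictionary used by the spell check sketch.
DICTIONARY = ["printer", "paper", "print", "language", "processing"]


def suggest(word, dictionary=DICTIONARY, max_distance=2):
    """Scan the dictionary and return the entries closest to `word`, nearest first."""
    scored = [(edit_distance(word, entry), entry) for entry in dictionary]
    return [entry for distance, entry in sorted(scored) if distance <= max_distance]


if __name__ == "__main__":
    print(edit_distance("pritner", "printer"))
    print(suggest("procesing"))
```

    A word that is itself in the dictionary has distance 0 to its own entry, so it is returned first and needs no correction.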

    Phrase structure tree for "The tall man beat the boy":

    [Sentence [NP [Det The] [AdjP [Adj tall] [N man]]] [VP [V beat] [NP [Det the] [N boy]]]]