
  • Slide 1/91

    Machine Learning Methods in Natural Language Processing

    Michael Collins, MIT CSAIL

  • Slide 2/91

    Some NLP Problems

    Information extraction

    Named entities

    Relationships between entities

    Finding linguistic structure

    Part-of-speech tagging

    Parsing

    Machine translation

  • Slide 3/91

    Common Themes

    Need to learn a mapping from one discrete structure to another:

    Strings to hidden state sequences: named-entity extraction, part-of-speech tagging

    Strings to strings: machine translation

    Strings to underlying trees: parsing

    Strings to relational data structures: information extraction

    Speech recognition is similar (and shares many techniques)

  • Slide 4/91

    Two Fundamental Problems

    TAGGING: Strings to Tagged Sequences

    a b e e a f h j → a/C b/D e/C e/C a/D f/C h/D j/C

    PARSING: Strings to Trees

    d e f g → (A (B (D d) (E e)) (C (F f) (G g)))

  • Slide 5/91

    Information Extraction

    Named Entity Recognition

    INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

    OUTPUT: Profits soared at [Company Boeing Co.], easily topping forecasts on [Location Wall Street], as their CEO [Person Alan Mulally] announced first quarter results.

    Relationships between Entities

    INPUT: Boeing is located in Seattle. Alan Mulally is the CEO.

    OUTPUT:

    Relationship = Company-Location; Company = Boeing; Location = Seattle

    Relationship = Employer-Employee; Employer = Boeing Co.; Employee = Alan Mulally

  • Slide 6/91

  • Slide 7/91

    Named Entity Extraction as Tagging

    INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

    OUTPUT: Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA

    NA = No entity
    SC = Start Company
    CC = Continue Company
    SL = Start Location
    CL = Continue Location
    SP = Start Person
    CP = Continue Person

  • Slide 8/91

    Parsing (Syntactic Structure)

    INPUT: Boeing is located in Seattle.

    OUTPUT: (S (NP (N Boeing)) (VP (V is) (VP (V located) (PP (P in) (NP (N Seattle))))))

  • Slide 9/91

    Machine Translation

    INPUT: Boeing is located in Seattle. Alan Mulally is the CEO.

    OUTPUT:

    Boeing ist in Seattle. Alan Mulally ist der CEO.

  • Slide 10/91

    Techniques Covered in this Tutorial

    Generative models for parsing

    Log-linear (maximum-entropy) taggers

    Learning theory for NLP

  • Slide 11/91

    Data for Parsing Experiments

    Penn WSJ Treebank = 50,000 sentences with associated trees. Usual set-up: 40,000 training sentences, 2,400 test sentences.

    An example tree, for the sentence:

    Canadian Utilities had 1988 revenue of C$ 1.16 billion , mainly from its natural gas and electric utility businesses in Alberta , where the company serves about 800,000 customers .

    [Treebank parse of the sentence, with part-of-speech tags (NNP, NNPS, VBD, CD, NN, IN, RB, JJ, CC, NNS, WRB, DT, VBZ, PUNC) and constituent labels (NP, VP, PP, QP, ADVP, WHADVP, SBAR, S, TOP).]

  • Slide 12/91

  • Slide 13/91

    2) Phrases

    (S (NP (DT the) (N burglar)) (VP (V robbed) (NP (DT the) (N apartment))))

    Noun Phrases (NP): the burglar, the apartment

    Verb Phrases (VP): robbed the apartment

    Sentences (S): the burglar robbed the apartment

  • Slide 14/91

  • Slide 15/91

    An Example Application: Machine Translation

    English word order is subject verb object

    Japanese word order is subject object verb

    English: IBM bought Lotus
    Japanese: IBM Lotus bought

    English: Sources said that IBM bought Lotus yesterday
    Japanese: Sources yesterday IBM Lotus bought that said

  • Slide 16/91

    Context-Free Grammars

    [Hopcroft and Ullman 1979] A context-free grammar is a 4-tuple G = (N, Σ, R, S) where:

    N is a set of non-terminal symbols

    Σ is a set of terminal symbols

    R is a set of rules of the form X → Y_1 Y_2 ... Y_n, for n ≥ 0, X ∈ N, Y_i ∈ (N ∪ Σ)

    S ∈ N is a distinguished start symbol

  • Slide 17/91

    A Context-Free Grammar for English

    N = {S, NP, VP, PP, D, Vi, Vt, N, P}

    S = S

    Σ = {sleeps, saw, man, woman, telescope, the, with, in}

    R =
    S → NP VP
    VP → Vi
    VP → Vt NP
    VP → VP PP
    NP → D N
    NP → NP PP
    PP → P NP
    Vi → sleeps
    Vt → saw
    N → man
    N → woman
    N → telescope
    D → the
    P → with
    P → in

    Note: S=sentence, VP=verb phrase, NP=noun phrase, PP=prepositional phrase,

    D=determiner, Vi=intransitive verb, Vt=transitive verb, N=noun, P=preposition

  • Slide 18/91

    Left-Most Derivations

    A left-most derivation is a sequence of strings s_1 ... s_n, where

    s_1 = S, the start symbol

    s_n ∈ Σ*, i.e. s_n is made up of terminal symbols only

    each s_i for i = 2 ... n is derived from s_{i-1} by picking the left-most non-terminal X in s_{i-1} and replacing it by some β, where X → β is a rule in R

    For example: [S], [NP VP], [D N VP], [the N VP], [the man VP], [the man Vi], [the man sleeps]

    Representation of a derivation as a tree: (S (NP (D the) (N man)) (VP (Vi sleeps)))
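    A minimal runnable sketch of this derivation (the encoding and function names below are illustrative, not from the tutorial), in Python:

```python
# Hypothetical encoding of the example grammar; each non-terminal maps to its right-hand sides.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "VP": [["Vi"], ["Vt", "NP"], ["VP", "PP"]],
    "NP": [["D", "N"], ["NP", "PP"]],
    "PP": [["P", "NP"]],
    "Vi": [["sleeps"]], "Vt": [["saw"]],
    "N":  [["man"], ["woman"], ["telescope"]],
    "D":  [["the"]], "P": [["with"], ["in"]],
}

def expand_leftmost(s, rule_index):
    """Rewrite the left-most non-terminal in s with the chosen rule."""
    for i, sym in enumerate(s):
        if sym in GRAMMAR:
            return s[:i] + GRAMMAR[sym][rule_index] + s[i + 1:]
    return s  # all terminals: the derivation is complete

s = ["S"]
for choice in [0, 0, 0, 0, 0, 0]:  # S->NP VP, NP->D N, D->the, N->man, VP->Vi, Vi->sleeps
    s = expand_leftmost(s, choice)
print(s)  # ['the', 'man', 'sleeps']
```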

  • Slide 19/91

    The Problem with Parsing: Ambiguity

    INPUT: She announced a program to promote safety in trucks and vans

    POSSIBLE OUTPUTS:

    [Six candidate parse trees, differing in the attachment of the PP "in trucks and vans" and in the scope of the coordination "trucks and vans".]

    And there are more...

  • Slide 20/91

    An Example Tree

    Canadian Utilities had 1988 revenue of C$ 1.16 billion , mainly from its natural gas and electric utility businesses in Alberta , where the company serves about 800,000 customers .

    [The Treebank parse tree for this sentence, as shown earlier.]

  • Slide 21/91

    A Probabilistic Context-Free Grammar

    S → NP VP 1.0
    VP → Vi 0.4
    VP → Vt NP 0.4
    VP → VP PP 0.2
    NP → D N 0.3
    NP → NP PP 0.7
    PP → P NP 1.0
    Vi → sleeps 1.0
    Vt → saw 1.0
    N → man 0.7
    N → woman 0.2
    N → telescope 0.1
    D → the 1.0
    P → with 0.5
    P → in 0.5

    Probability of a tree with rules α_i → β_i, for i = 1 ... n, is ∏_i P(α_i → β_i | α_i)

    Maximum Likelihood estimation: P(VP → V NP | VP) = Count(VP → V NP) / Count(VP)
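    As a concrete sketch (the dictionary encoding and function below are assumptions, not part of the tutorial), the toy PCFG above and the tree-probability product fit in a few lines of Python:

```python
# Rule probabilities P(alpha -> beta | alpha) for the toy grammar above.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("Vi",)): 0.4, ("VP", ("Vt", "NP")): 0.4, ("VP", ("VP", "PP")): 0.2,
    ("NP", ("D", "N")): 0.3, ("NP", ("NP", "PP")): 0.7,
    ("PP", ("P", "NP")): 1.0,
    ("Vi", ("sleeps",)): 1.0, ("Vt", ("saw",)): 1.0,
    ("N", ("man",)): 0.7, ("N", ("woman",)): 0.2, ("N", ("telescope",)): 0.1,
    ("D", ("the",)): 1.0, ("P", ("with",)): 0.5, ("P", ("in",)): 0.5,
}

def tree_prob(tree):
    """tree = (label, child, ...); a leaf child is a plain string.
    Returns the product of rule probabilities over the whole tree."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

# "the man sleeps": 1.0 * 0.3 * 1.0 * 0.7 * 0.4 * 1.0 = 0.084
t = ("S", ("NP", ("D", "the"), ("N", "man")), ("VP", ("Vi", "sleeps")))
print(tree_prob(t))
```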

  • Slide 22/91

    PCFGs

    [Booth and Thompson 73] showed that a CFG with rule probabilities correctly defines a distribution over the set of derivations provided that:

    1. The rule probabilities define conditional distributions over the different ways of rewriting each non-terminal.

    2. A technical condition on the rule probabilities ensures that the probability of the derivation terminating in a finite number of steps is 1. (This condition is not really a practical concern.)

  • Slide 23/91

    (TOP (S (NP (N IBM)) (VP (V bought) (NP (N Lotus)))))

    PROB = P(TOP → S) × P(S → NP VP | S) × P(NP → N | NP) × P(N → IBM | N) × P(VP → V NP | VP) × P(V → bought | V) × P(NP → N | NP) × P(N → Lotus | N)

  • Slide 24/91

    The SPATTER Parser [Magerman 95; Jelinek et al. 94]

    For each rule, identify the head child:

    S → NP VP (head = VP)
    VP → V NP (head = V)
    NP → DT N (head = N)

    Add the head word to each non-terminal:

    (S(questioned) (NP(lawyer) (DT the) (N lawyer)) (VP(questioned) (V questioned) (NP(witness) (DT the) (N witness))))

  • Slide 25/91

    A Lexicalized PCFG

    S(questioned) → NP(lawyer) VP(questioned) ??
    VP(questioned) → V(questioned) NP(witness) ??
    NP(lawyer) → D(the) N(lawyer) ??
    NP(witness) → D(the) N(witness) ??

    The big question: how to estimate the rule probabilities (??)

  • Slide 26/91

    CHARNIAK (1997)

    Decompose the probability of a lexicalized rule into two steps:

    P(S(questioned) → NP(lawyer) VP(questioned)) =

    P(NP VP | S(questioned))               (choose the rule)
    × P(lawyer | S, VP, NP, questioned)    (choose the head word of the non-head child)

  • Slide 27/91

    Smoothed Estimation

    P(NP VP | S(questioned)) =
    λ_1 × Count(S(questioned) → NP VP) / Count(S(questioned))
    + λ_2 × Count(S → NP VP) / Count(S)

    where 0 ≤ λ_1, λ_2 ≤ 1, and λ_1 + λ_2 = 1

  • Slide 28/91

    Smoothed Estimation

    P(lawyer | S, NP, VP, questioned) =
    λ_1 × Count(lawyer, S, NP, VP, questioned) / Count(S, NP, VP, questioned)
    + λ_2 × Count(lawyer, S, NP, VP) / Count(S, NP, VP)
    + λ_3 × Count(lawyer, NP) / Count(NP)

    where 0 ≤ λ_1, λ_2, λ_3 ≤ 1, and λ_1 + λ_2 + λ_3 = 1

  • Slide 29/91

    P(NP(lawyer) VP(questioned) | S(questioned)) =

    [λ_1 × Count(S(questioned) → NP VP) / Count(S(questioned)) + λ_2 × Count(S → NP VP) / Count(S)]

    × [λ_1 × Count(lawyer, S, NP, VP, questioned) / Count(S, NP, VP, questioned) + λ_2 × Count(lawyer, S, NP, VP) / Count(S, NP, VP) + λ_3 × Count(lawyer, NP) / Count(NP)]
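    A minimal sketch of this interpolated estimate (the counts, weights, and function name below are made up for illustration):

```python
def smoothed_estimate(counts, lambdas):
    """counts: (numerator, denominator) pairs, most specific level first.
    lambdas: non-negative weights summing to 1."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(l * num / den for l, (num, den) in zip(lambdas, counts) if den > 0)

# e.g. P(lawyer | S,NP,VP,questioned) blended with P(lawyer | S,NP,VP) and P(lawyer | NP):
print(smoothed_estimate([(3, 10), (40, 200), (500, 10000)], [0.6, 0.3, 0.1]))  # 0.245
```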

  • Slide 30/91

    Lexicalized Probabilistic Context-Free Grammars

    Transformation to lexicalized rules:

    S → NP VP vs. S(questioned) → NP(lawyer) VP(questioned)

    Smoothed estimation techniques blend different counts

    Search for most probable tree through dynamic programming

    Perform vastly better than PCFGs (88% vs. 73% accuracy)


  • Slide 31/91

    Independence Assumptions

    PCFGs:
    (S (NP (DT the) (N lawyer)) (VP (V questioned) (NP (DT the) (N witness))))

    Lexicalized PCFGs:
    (S(questioned) (NP(lawyer) (DT the) (N lawyer)) (VP(questioned) (V questioned) (NP(witness) (DT the) (N witness))))

  • Slide 32/91

    Results

    Method                                               Accuracy
    PCFGs (Charniak 97)                                  73.0%
    Conditional Models: Decision Trees (Magerman 95)     84.2%
    Lexical Dependencies (Collins 96)                    85.5%
    Conditional Models: Logistic (Ratnaparkhi 97)        86.9%
    Generative Lexicalized Model (Charniak 97)           86.7%
    Generative Lexicalized Model (Collins 97)            88.2%
    Logistic-inspired Model (Charniak 99)                89.6%
    Boosting (Collins 2000)                              89.8%

    Accuracy = average recall/precision

  • Slide 33/91

    Parsing for Information Extraction: Relationships between Entities

    INPUT: Boeing is located in Seattle.

    OUTPUT:
    Relationship = Company-Location
    Company = Boeing
    Location = Seattle

  • Slide 34/91

    A Generative Model [Miller et al. 2000]

    [Miller et al. 2000] use non-terminals to carry lexical items and semantic tags:

    (S[is, CL]
      (NP[Boeing, COMPANY] Boeing)
      (VP[is, CL-LOC] (V is)
        (VP[located, CL-LOC] (V located)
          (PP[in, CL-LOC] (P in)
            (NP[Seattle, LOCATION] Seattle)))))

    For PP[in, CL-LOC]: in = lexical head, CL-LOC = semantic tag

  • Slide 35/91

    A Generative Model [Miller et al. 2000]

    We're now left with an even more complicated estimation problem:

    P(S[is, CL] → NP[Boeing, COMPANY] VP[is, CL-LOC])

    See [Miller et al. 2000] for the details.

    The parsing algorithm recovers annotated trees: it simultaneously recovers syntactic structure and named-entity relationships.

    Accuracy (precision/recall) is greater than 80% in recovering relations.

  • Slide 36/91

    Techniques Covered in this Tutorial

    Generative models for parsing

    Log-linear (maximum-entropy) taggers

    Learning theory for NLP

  • Slide 37/91

    Tagging Problems

    TAGGING: Strings to Tagged Sequences

    a b e e a f h j → a/C b/D e/C e/C a/D f/C h/D j/C

    Example 1: Part-of-speech tagging

    Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V

    forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N Alan/N

    Mulally/N announced/V first/ADJ quarter/N results/N ./.

    Example 2: Named Entity Recognition

    Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA

    topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA

    CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA

    results/NA ./NA

  • Slide 38/91

    Log-Linear Models

    Assume we have sets X and Y

    Goal: define P(y | x) for any x ∈ X, y ∈ Y

    A feature-vector representation is Φ: X × Y → R^d

    Parameters W ∈ R^d

    Define P(y | x; W) = e^{Φ(x,y)·W} / Z(x, W), where Z(x, W) = Σ_{y' ∈ Y} e^{Φ(x,y')·W}
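    A small Python sketch of this definition (the feature function phi and the label set Y are assumed inputs supplied by the caller):

```python
import math

def log_linear_prob(phi, W, x, y, Y):
    """P(y | x; W) = exp(phi(x, y) . W) / Z(x, W); phi returns a list as long as W."""
    score = lambda y_: sum(w * f for w, f in zip(W, phi(x, y_)))
    Z = sum(math.exp(score(y_)) for y_ in Y)  # normalization over all labels
    return math.exp(score(y)) / Z
```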

  • Slide 39/91

    Log-Linear Taggers: Independence Assumptions

  • Slide 40/91

    Log-Linear Taggers: Independence Assumptions

    The basic idea:

    Chain rule: P(t_1 ... t_n | w_1 ... w_n) = ∏_{j=1}^n P(t_j | w_1 ... w_n, t_1 ... t_{j-1})

    Independence assumptions: P(t_j | w_1 ... w_n, t_1 ... t_{j-1}) = P(t_j | w_1 ... w_n, t_{j-2}, t_{j-1})

    Two questions:

    1. How to parameterize P(t_j | w_1 ... w_n, t_{j-2}, t_{j-1})?

    2. How to find argmax_{t_1 ... t_n} P(t_1 ... t_n | w_1 ... w_n)?

  • Slide 41/91

    The Parameterization Problem

    Hispaniola/NNP quickly/RB became/VB an/DT

    important/JJ base/?? from which Spain expanded

    its empire into the rest of the Western Hemisphere .

    There are many possible tags in the position ??

    Need to learn a function from (context, tag) pairs to a probability

  • Slide 42/91

    Representation: Histories

    A history is a 4-tuple ⟨t_{-2}, t_{-1}, w_{[1:n]}, i⟩ where:

    t_{-2}, t_{-1} are the previous two tags

    w_{[1:n]} are the n words in the input sentence

    i is the index of the word being tagged

    Example: Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere .

    Here t_{-2}, t_{-1} = DT, JJ; w_{[1:n]} = ⟨Hispaniola, quickly, became, ...⟩; i = 6

  • Slide 43/91

    Feature Vector Representations

    Take a history/tag pair (h, t).

    φ_s(h, t) for s = 1 ... d are features representing tagging decision t in context h.

    Example: POS Tagging [Ratnaparkhi 96]

    Word/tag features:
    φ_1(h, t) = 1 if current word w_i is base and t = VB; 0 otherwise
    φ_2(h, t) = 1 if current word w_i ends in ing and t = VBG; 0 otherwise

    Contextual features:
    φ_3(h, t) = 1 if ⟨t_{-2}, t_{-1}, t⟩ = ⟨DT, JJ, VB⟩; 0 otherwise

  • Slide 44/91

    Part-of-Speech (POS) Tagging [Ratnaparkhi 96]

    Word/tag features:
    φ_1(h, t) = 1 if current word w_i is base and t = VB; 0 otherwise

    Spelling features:
    φ_2(h, t) = 1 if current word w_i ends in ing and t = VBG; 0 otherwise
    φ_3(h, t) = 1 if current word w_i starts with pre and t = NN; 0 otherwise

  • Slide 45/91

    Ratnaparkhi's POS Tagger

    Contextual features:
    φ_4(h, t) = 1 if ⟨t_{-2}, t_{-1}, t⟩ = ⟨DT, JJ, VB⟩; 0 otherwise
    φ_5(h, t) = 1 if ⟨t_{-1}, t⟩ = ⟨JJ, VB⟩; 0 otherwise
    φ_6(h, t) = 1 if ⟨t⟩ = ⟨VB⟩; 0 otherwise
    φ_7(h, t) = 1 if previous word w_{i-1} = the and t = VB; 0 otherwise
    φ_8(h, t) = 1 if next word w_{i+1} = the and t = VB; 0 otherwise
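    A sketch of such indicator features over a history (the tuple layout and feature names below are assumptions, not Ratnaparkhi's actual feature set):

```python
def features(h, t):
    """h = (t_minus2, t_minus1, words, i); t = proposed tag for words[i]."""
    t2, t1, words, i = h
    w = words[i]
    return {
        "word=base,t=VB":   1 if w == "base" and t == "VB" else 0,          # word/tag
        "suffix=ing,t=VBG": 1 if w.endswith("ing") and t == "VBG" else 0,   # spelling
        "prefix=pre,t=NN":  1 if w.startswith("pre") and t == "NN" else 0,  # spelling
        "tags=DT,JJ,t=VB":  1 if (t2, t1, t) == ("DT", "JJ", "VB") else 0,  # contextual
        "prev=the,t=VB":    1 if i > 0 and words[i - 1] == "the" and t == "VB" else 0,
    }

h = ("DT", "JJ", "Hispaniola quickly became an important base".split(), 5)
print(features(h, "VB"))  # the word/tag and tag-trigram features fire
```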

  • Slide 46/91

    Log-Linear (Maximum-Entropy) Models

    Take a history/tag pair (h, t).

    φ_s(h, t) for s = 1 ... d are features

    W_s for s = 1 ... d are parameters

    The parameters define a conditional distribution

    P(t | h; W) = e^{Φ(h,t)·W} / Z(h, W), where Z(h, W) = Σ_{t' ∈ T} e^{Φ(h,t')·W}

  • Slide 47/91

    Log-Linear (Maximum-Entropy) Models

    P(t_{[1:n]} | w_{[1:n]}) = ∏_{j=1}^n P(t_j | h_j), with:

    word sequence w_{[1:n]}; tag sequence t_{[1:n]}; histories h_j = ⟨t_{j-2}, t_{j-1}, w_{[1:n]}, j⟩

    Equivalently, log P(t_{[1:n]} | w_{[1:n]}) = Σ_j Φ(h_j, t_j)·W (linear score) − Σ_j log Z(h_j, W) (local normalization terms)

  • Slide 48/91

    Parameter estimation: maximize the likelihood of the training data through gradient descent or iterative scaling.

    Search for argmax_{t_{[1:n]}} P(t_{[1:n]} | w_{[1:n]}): dynamic programming (Viterbi), O(n|T|^3) complexity.

    Experimental results:

    Almost 97% accuracy for POS tagging [Ratnaparkhi 96]

    Over 90% accuracy for named-entity extraction [Borthwick et al. 98]

    Better results than an HMM for FAQ segmentation [McCallum et al. 2000]

  • Slide 49/91

    Linear Models for Classification

  • Slide 50/91

    Goal: learn a function F: X → {−1, +1}

    Training examples (x_i, y_i) for i = 1 ... m

    A representation Φ: X → R^d, a parameter vector W ∈ R^d

    The classifier is defined as F(x) = sign(Φ(x)·W)

    Unifying framework for many results: support vector machines, boosting, kernel methods, logistic regression, margin-based generalization bounds, online algorithms (perceptron, winnow), mistake bounds, etc.

    How can these methods be generalized beyond classification problems?

  • Slide 51/91

    Linear Models for Parsing and Tagging

    Goal: learn a function F: X → Y

    Training examples (x_i, y_i) for i = 1 ... m

    Three components to the model:

    A function GEN(x) enumerating candidates for x

    A representation Φ: X × Y → R^d

    A parameter vector W ∈ R^d

    The function is defined as F(x) = argmax_{y ∈ GEN(x)} Φ(x, y)·W
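    The three components combine into a one-line decision rule; a sketch, assuming GEN, Phi, and W are supplied by the caller:

```python
def F(x, GEN, Phi, W):
    """F(x) = argmax over y in GEN(x) of Phi(x, y) . W."""
    return max(GEN(x), key=lambda y: sum(w * f for w, f in zip(W, Phi(x, y))))
```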

  • Slide 52/91

    Component 1: GEN

    GEN(x) enumerates a set of candidates for a sentence x.

    She announced a program to promote safety in trucks and vans

    [GEN produces six candidate parse trees for this sentence, differing in the attachment of the PP "in trucks and vans" and in the scope of the coordination.]

  • Slide 53/91

    Examples of GEN:

    A context-free grammar

    A finite-state machine

    The top N most probable analyses from a probabilistic grammar

  • Slide 54/91

    Features

  • Slide 55/91

    A feature is a function on a structure, e.g.,

    h(y) = number of times the rule (A → B C) is seen in y

    [Three example trees on which the feature is evaluated.]

  • Slide 56/91

    Feature Vectors

    A set of functions h_1 ... h_d define a feature vector Φ(y) = ⟨h_1(y), h_2(y), ..., h_d(y)⟩

    [The same example trees, each mapped to its feature vector.]

  • Slide 57/91

    Component 3: W

    W is a parameter vector, W ∈ R^d.

    Φ and W together map a candidate to a real-valued score Φ(y)·W.

    [Example: a candidate parse tree for "She announced a program to promote safety in trucks and vans" is mapped to its feature vector and then to the score Φ(y)·W.]

  • Slide 58/91

    Putting it all Together

    X is a set of sentences, Y is a set of possible outputs (e.g. trees)

    Need to learn a function F: X → Y

    Given GEN, Φ, W, define F(x) = argmax_{y ∈ GEN(x)} Φ(y)·W

    Choose the highest-scoring tree as the most plausible structure

    Given examples (x_i, y_i), how do we set W?

    She announced a program to promote safety in trucks and vans

  • Slide 59/91

    [The six candidate parses for the sentence are mapped to scores 13.6, 12.2, 12.1, 3.3, 9.4, 11.1 under Φ(y)·W; the highest-scoring parse (13.6) is returned as F(x).]

  • Slide 60/91

    Markov Random Fields [Johnson et al. 1999, Lafferty et al. 2001]

    Parameters define a conditional distribution over candidates:

    P(y | x; W) = e^{Φ(y)·W} / Σ_{y' ∈ GEN(x)} e^{Φ(y')·W}

    Gaussian prior: log P(W) = −C‖W‖² + constant

    MAP parameter estimates maximise Σ_i log P(y_i | x_i; W) + log P(W)

    Note: this is a globally normalised model

  • Slide 61/91

    A Variant of the Perceptron Algorithm

    Inputs: training set (x_i, y_i) for i = 1 ... n

    Initialization: W = 0

    Define: F(x) = argmax_{y ∈ GEN(x)} Φ(y)·W

    Algorithm: for t = 1 ... T, i = 1 ... n:
      z_i = F(x_i)
      if z_i ≠ y_i then W = W + Φ(y_i) − Φ(z_i)

    Output: parameters W
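    A minimal sketch of this perceptron variant (sparse feature dictionaries are an assumed encoding, not part of the slides):

```python
from collections import defaultdict

def train_perceptron(examples, GEN, Phi, T=5):
    """examples: list of (x, y) pairs; Phi(x, y) returns a sparse dict of feature counts."""
    W = defaultdict(float)
    score = lambda feats: sum(W[k] * v for k, v in feats.items())
    for _ in range(T):
        for x, y in examples:
            z = max(GEN(x), key=lambda c: score(Phi(x, c)))  # current best candidate
            if z != y:  # on an error, move W toward the gold candidate
                for k, v in Phi(x, y).items():
                    W[k] += v
                for k, v in Phi(x, z).items():
                    W[k] -= v
    return W
```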

  • Slide 62/91

    Results that Carry Over from Classification

  • Slide 63/91

    [Freund and Schapire 99] define the voted perceptron, and prove results for the inseparable case.

    Compression bounds [Littlestone and Warmuth, 1986]:

    Say the perceptron algorithm makes m mistakes when run to convergence over a training set of size n.

    Then for all distributions D, with probability at least 1 − δ over the choice of a training set of size n drawn from D, if h is the hypothesis at convergence, the generalization error of h is bounded by a function of m, n, and δ (a compression bound: the hypothesis depends only on the m mistake examples).

  • Slide 64/91

    Define:

  • Slide 65/91

    The margin on the i-th training example: m_i = Φ(x_i, y_i)·W − max_{y ∈ GEN(x_i), y ≠ y_i} Φ(x_i, y)·W

    Generalization Theorem: For all distributions D generating examples, for all parameter vectors W with ‖W‖ = 1, for all margins γ > 0, with probability at least 1 − δ over the choice of a training set of size n drawn from D, the expected error is bounded by the fraction of training examples with margin m_i < γ, plus a term that grows with R/γ, C, and log(1/δ) and shrinks with n. Here R is a constant such that, for all x and all y ∈ GEN(x), ‖Φ(x, y)‖ ≤ R; the variable C is the smallest positive integer such that |GEN(x_i)| ≤ C.

    Proof: See [Collins 2002b]. Based on [Bartlett 1998, Zhang, 2002]; closely related to multiclass bounds in e.g., [Schapire et al., 1998].

  • Slide 66/91

    Perceptron Algorithm 1: Tagging

    Going back to tagging:

    Inputs x are sentences w_{[1:n]}

    GEN(w_{[1:n]}) = T^n, i.e. all tag sequences of length n

    Global representations Φ are composed from local feature vectors φ:

    Φ(w_{[1:n]}, t_{[1:n]}) = Σ_{j=1}^n φ(h_j, t_j), where h_j = ⟨t_{j-2}, t_{j-1}, w_{[1:n]}, j⟩

  • Slide 67/91

    Perceptron Algorithm 1: Tagging

    Typically, local features are indicator functions, e.g.,

    φ_s(h, t) = 1 if current word w_j ends in ing and t = VBG; 0 otherwise

    and global features are then counts:

    Φ_s(w_{[1:n]}, t_{[1:n]}) = number of times a word ending in ing is tagged as VBG in (w_{[1:n]}, t_{[1:n]})

  • Slide 68/91

    Perceptron Algorithm 1: Tagging

    Score for a (w_{[1:n]}, t_{[1:n]}) pair is Φ(w_{[1:n]}, t_{[1:n]})·W = Σ_{j=1}^n φ(h_j, t_j)·W

    The Viterbi algorithm computes argmax_{t_{[1:n]} ∈ T^n} Φ(w_{[1:n]}, t_{[1:n]})·W

    Log-linear taggers (see the earlier part of the tutorial) are locally normalised models:

    log P(t_{[1:n]} | w_{[1:n]}) = Σ_{j=1}^n log P(t_j | h_j) = Σ_{j=1}^n φ(h_j, t_j)·W (linear model) − Σ_{j=1}^n log Z(h_j, W) (local normalization)
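    A sketch of Viterbi search for this linear tagging model, simplified to condition on a single previous tag (the slides condition on two; the extension tracks tag pairs as states):

```python
def viterbi(words, tags, local_score):
    """local_score(prev_tag, words, i, tag) -> float. Returns the argmax tag sequence."""
    n = len(words)
    delta = {"<s>": 0.0}  # best score of any partial sequence ending in each tag
    backs = []            # backpointers, one dict per position
    for i in range(n):
        new_delta, bp = {}, {}
        for t in tags:
            prev, s = max(
                ((p, delta[p] + local_score(p, words, i, t)) for p in delta),
                key=lambda ps: ps[1],
            )
            new_delta[t], bp[t] = s, prev
        delta = new_delta
        backs.append(bp)
    t = max(delta, key=delta.get)   # best final tag
    seq = [t]
    for bp in reversed(backs[1:]):  # follow backpointers right to left
        t = bp[t]
        seq.append(t)
    return list(reversed(seq))
```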

  • Slide 69/91

    Training the Parameters

    Inputs: training set (w^i_{[1:n_i]}, t^i_{[1:n_i]}) for i = 1 ... n

    Initialization: W = 0

    Algorithm: for t = 1 ... T, i = 1 ... n:
      z = argmax_{u ∈ T^{n_i}} Φ(w^i, u)·W   (output on the i-th sentence with current parameters, via Viterbi)
      if z ≠ t^i then W = W + Φ(w^i, t^i) − Φ(w^i, z)   (add the correct tags' feature values, subtract the incorrect tags' feature values)

    Output: parameter vector W

  • Slide 70/91

    An Example

    Say the correct tags for the i-th sentence are:

    the/DT man/NN bit/VBD the/DT dog/NN

    Under the current parameters, the output is:

    the/DT man/NN bit/NN the/DT dog/NN

    Assume also that the features track: (1) all tag bigrams; (2) word/tag pairs

    Parameters incremented: ⟨NN, VBD⟩, ⟨VBD, DT⟩, ⟨VBD → bit⟩

    Parameters decremented: ⟨NN, NN⟩, ⟨NN, DT⟩, ⟨NN → bit⟩

  • Slide 71/91

    Experiments

    Wall Street Journal part-of-speech tagging data:
    Perceptron = 2.89% error, Max-ent = 3.28% error (11.9% relative error reduction)

    [Ramshaw and Marcus 95] NP chunking data:
    Perceptron = 93.63% F-measure, Max-ent = 93.29% F-measure (5.1% relative error reduction)

    See [Collins 2002a]

  • Slide 72/91

    Perceptron Algorithm 2: Reranking Approaches

    GEN(x) is the top N most probable candidates from a base model:

    Parsing: a lexicalized probabilistic context-free grammar

    Tagging: a maximum-entropy tagger

    Speech recognition: an existing recogniser

  • Slide 73/91

    Parsing Experiments

    Beam search is used to parse training and test sentences: around 27 parses for each sentence.

    Representation: Φ_1(x, y) is the log-likelihood of (x, y) from the first-pass parser; Φ_2(x, y) ... Φ_d(x, y) are indicator functions, e.g.

    Φ_s(x, y) = 1 if y contains a particular rule or structural fragment; 0 otherwise

    [Example candidate parse for "She announced a program to promote safety in trucks and vans", with such fragments marked.]
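    A sketch of the resulting reranker (the candidate layout and names below are assumptions):

```python
def rerank(candidates, indicator_fns, W):
    """candidates: (tree, base_log_prob) pairs from the first-pass parser.
    Feature 1 is the base model's log-probability; the rest are 0/1 indicators."""
    def score(tree, logp):
        feats = [logp] + [f(tree) for f in indicator_fns]
        return sum(w * v for w, v in zip(W, feats))
    return max(candidates, key=lambda c: score(*c))
```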

  • Slide 74/91

    "

    '

    ,

    if

    contains a boundary = [The

    otherwise

    Whether youre an aging flower child or a clueless

    [Gen-Xer], [The Day They Shot John Lennon], playing at the

    [Dougherty Arts Center], entertains the imagination.

    "

    '

    Whether youre an aging flower child or a clueless

    [Gen-Xer], [The Day They Shot John Lennon], playing at the

    [Dougherty Arts Center] entertains the imagination

  • Slide 75/91

    Whether you're an aging flower child or a clueless [Gen-Xer], [The Day They Shot John Lennon], playing at the [Dougherty Arts Center], entertains the imagination.

    Whether you're an aging flower child or a clueless Gen-Xer, The Day [They Shot John Lennon], playing at the [Dougherty Arts Center], entertains the imagination.

    Whether you're an aging flower child or a clueless [Gen-Xer], The Day [They Shot John Lennon], playing at the [Dougherty Arts Center], entertains the imagination.

  • Slide 76/91

    Experiments

    Parsing the Wall Street Journal Treebank:
    Training set = 40,000 sentences; test = 2,416 sentences
    State-of-the-art parser: 88.2% F-measure
    Reranked model: 89.5% F-measure (11% relative error reduction)
    Boosting: 89.7% F-measure (13% relative error reduction)

    Recovering Named Entities in Web Data:
    Training data = 53,609 sentences (1,047,491 words); test data = 14,717 sentences (291,898 words)
    State-of-the-art tagger: 85.3% F-measure
    Reranked model: 87.9% F-measure (17.7% relative error reduction)
    Boosting: 87.6% F-measure (15.6% relative error reduction)

  • Slide 77/91

    Perceptron Algorithm 3: Kernel Methods (work with Nigel Duffy)

    It is simple to derive a dual form of the perceptron algorithm.

    If we can compute inner products Φ(y)·Φ(y') efficiently, we can learn efficiently with the representation Φ.

  • Slide 78/91

    All Subtrees Representation [Bod 98]

    Given: non-terminal symbols {A, B, C, ...} and terminal symbols {a, b, c, ...}

    There is an infinite set of subtrees, e.g.:
    (A B C), (A (B b) E), (A (B b) C), (A B), ...

    Step 1: choose an (arbitrary) mapping from subtrees to integers, and define

    h_i(y) = number of times subtree i is seen in y

    Φ(y) = ⟨h_1(y), h_2(y), h_3(y), ...⟩

  • Slide 79/91

    All Subtrees Representation

    Φ is now huge. But the inner product Φ(y_1)·Φ(y_2) can be computed efficiently using dynamic programming. See [Collins and Duffy 2001, Collins and Duffy 2002].

  • Slide 80/91

  • Slide 81/91

    An Example

    Consider two trees:

    y_1 = (A (B (D d) (E e)) (C (F f) (G g)))
    y_2 = (A (B (D d) (E e)) (C (F h) (G i)))

    Φ(y_1)·Φ(y_2) is a sum, over pairs of nodes (one from each tree), of the number of common subtrees rooted at the pair.

    Most of these terms are 0 (e.g. node pairs with different productions).

    Some are non-zero: the two B nodes share the subtrees (B D E), (B (D d) E), (B D (E e)), and (B (D d) (E e)).

  • Slide 82/91

    Recursive Definition of C(n_1, n_2)

    C(n_1, n_2) = the number of common subtrees rooted at nodes n_1 and n_2:

    If the productions at n_1 and n_2 are different: C(n_1, n_2) = 0

    Else if n_1, n_2 are pre-terminals: C(n_1, n_2) = 1

    Else: C(n_1, n_2) = ∏_{j=1}^{nc(n_1)} (1 + C(ch(n_1, j), ch(n_2, j)))

    where nc(n_1) is the number of children of node n_1, and ch(n_1, j) is the j-th child of n_1.
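    A direct transcription of this recursion (trees encoded as (label, child, ...) tuples with terminal leaves as strings; nodes are assumed to be either pre-terminals or to have only tree-valued children):

```python
def C(n1, n2):
    """Number of common subtrees rooted at n1 and n2."""
    prod1 = (n1[0], tuple(c if isinstance(c, str) else c[0] for c in n1[1:]))
    prod2 = (n2[0], tuple(c if isinstance(c, str) else c[0] for c in n2[1:]))
    if prod1 != prod2:
        return 0                        # different productions: no common subtrees
    if all(isinstance(c, str) for c in n1[1:]):
        return 1                        # pre-terminals
    result = 1
    for c1, c2 in zip(n1[1:], n2[1:]):  # same production, so children line up
        result *= 1 + C(c1, c2)
    return result

b = ("B", ("D", "d"), ("E", "e"))
print(C(b, b))  # 4: (B D E), (B (D d) E), (B D (E e)), (B (D d) (E e))
```

    The full inner product Φ(y_1)·Φ(y_2) is then the sum of C(n_1, n_2) over all node pairs.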

  • Slide 83/91

    Illustration of the Recursion

    [The same two trees as above.]

    How many subtrees do the two B nodes have in common? i.e., what is C(n_1, n_2) for the B nodes?

    Four: (B D E), (B (D d) E), (B D (E e)), (B (D d) (E e)). The two C nodes (with children F and G) share only the bare production subtree (C F G), since their pre-terminals dominate different words.

  • Slide 84/91

    Similar Kernels Exist for Tagged Sequences

  • Slide 85/91

    Whether you're an aging flower child or a clueless [Gen-Xer], [The Day They Shot John Lennon], playing at the [Dougherty Arts Center], entertains the imagination.

    [Features track sub-fragments of the tagged sequence, e.g. "Whether ... [Gen-Xer], ... Day They ... John Lennon], playing".]

  • Slide 86/91

  • Slide 87/91

    Open Questions

    Can the large-margin hypothesis be trained efficiently, even when GEN(x) is huge? (see [Altun, Tsochantaridis, and Hofmann, 2003])

    Can the large-margin bound be improved, to remove the C factor?

    Which representations lead to good performance on parsing, tagging etc.?

    Can methods with global kernels be implemented efficiently?

  • Slide 88/91

    Conclusions

    Some Other Topics in Statistical NLP:

    Machine translation
    Unsupervised/partially supervised methods
    Finite state machines
    Generation
    Question answering
    Coreference
    Language modeling for speech recognition
    Lexical semantics
    Word sense disambiguation
    Summarization

  • Slide 89/91

    References

    [Altun, Tsochantaridis, and Hofmann, 2003] Altun, Y., Tsochantaridis, I., and Hofmann, T. (2003). Hidden Markov Support Vector Machines. In Proceedings of ICML 2003.

    [Bartlett 1998] Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536.

    [Bod 98] Bod, R. (1998). Beyond Grammar: An Experience-Based Theory of Language. CSLI Publications/Cambridge University Press.

    [Booth and Thompson 73] Booth, T., and Thompson, R. (1973). Applying probability measures to abstract languages. IEEE Transactions on Computers, C-22(5):442-450.

    [Borthwick et al. 98] Borthwick, A., Sterling, J., Agichtein, E., and Grishman, R. (1998). Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition. In Proceedings of the Sixth Workshop on Very Large Corpora.

    [Collins and Duffy 2001] Collins, M., and Duffy, N. (2001). Convolution Kernels for Natural Language. In Proceedings of NIPS 14.

    [Collins and Duffy 2002] Collins, M., and Duffy, N. (2002). New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron. In Proceedings of ACL 2002.

    [Collins 2002a] Collins, M. (2002a). Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with the Perceptron Algorithm. In Proceedings of EMNLP 2002.

    [Collins 2002b] Collins, M. (2002b). Parameter Estimation for Statistical Parsing Models: Theory and Practice of Distribution-Free Methods. To appear as a book chapter.

    [Crammer and Singer 2001a] Crammer, K., and Singer, Y. (2001a). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(Dec):265-292.

  • Slide 90/91

    [Crammer and Singer 2001b] Crammer, K., and Singer, Y. (2001b). Ultraconservative Online Algorithms for Multiclass Problems. In Proceedings of COLT 2001.

    [Freund and Schapire 99] Freund, Y., and Schapire, R. (1999). Large Margin Classification using the Perceptron Algorithm. Machine Learning, 37(3):277-296.

    [Helmbold and Warmuth 95] Helmbold, D., and Warmuth, M. (1995). On Weak Learning. Journal of Computer and System Sciences, 50(3):551-573.

    [Hopcroft and Ullman 1979] Hopcroft, J. E., and Ullman, J. D. (1979). Introduction to Automata Theory, Languages, and Computation. Reading, Mass.: Addison-Wesley.

    [Johnson et al. 1999] Johnson, M., Geman, S., Canon, S., Chi, S., and Riezler, S. (1999). Estimators for stochastic unification-based grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. San Francisco: Morgan Kaufmann.

    [Lafferty et al. 2001] Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML 2001, pages 282-289.

    [Littlestone and Warmuth, 1986] Littlestone, N., and Warmuth, M. (1986). Relating data compression and learnability. Technical report, University of California, Santa Cruz.

    [MSM93] Marcus, M., Santorini, B., and Marcinkiewicz, M. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313-330.

    [McCallum et al. 2000] McCallum, A., Freitag, D., and Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. In Proceedings of ICML 2000.

    [Miller et al. 2000] Miller, S., Fox, H., Ramshaw, L., and Weischedel, R. (2000). A Novel Use of Statistical Parsing to Extract Information from Text. In Proceedings of ANLP 2000.

  • Slide 91/91

    [Ramshaw and Marcus 95] Ramshaw, L., and Marcus, M. P. (1995). Text Chunking Using Transformation-Based Learning. In Proceedings of the Third ACL Workshop on Very Large Corpora.

    [Ratnaparkhi 96] Ratnaparkhi, A. (1996). A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference.

    [Schapire et al., 1998] Schapire, R., Freund, Y., Bartlett, P., and Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686.

    [Zhang, 2002] Zhang, T. (2002). Covering Number Bounds of Certain Regularized Linear Function Classes. Journal of Machine Learning Research, 2(Mar):527-550.