English version

Parsing natural languages: from "combinatorial" to "deterministic" parsing

Jacques Vergne

GREYC - Université de Caen

http://www.info.unicaen.fr/~jvergne

3/7/2001 © Jacques Vergne TALN 2001 -2-

Introduction

• our 1998 parser:

- 1st place at the GRACE evaluation (1995-1998)

- GRACE: Grammaires et Ressources pour les Analyseurs de Corpus et leur Évaluation (Grammars and Resources for Corpus Analysers and their Evaluation)

- 22 participants from France, Switzerland, Germany, Quebec and the USA: labs and companies (AT&T, IBM, Xerox, France-Télécom, ...)

- decision = 100% (= tokens with a unique tag / total number of tokens)

- precision = 94.5% (= tokens with the same tag as the human annotator / tokens with a unique tag)

• what are the features of this parser?

it is a deterministic parser

Introduction: our aims

• stressing the evolution of concepts and methods in parsing

• understanding why compiling led to combinatorial parsing

• understanding the principles of deterministic parsing

Introduction: our workspace

[Diagram: data to be parsed -> parser (parsing process + resources of the process) -> parsed data]

• parsed languages: programming languages -> natural languages

• resources: declarative -> procedural

• resources: static -> dynamic

• process: combinatorial -> deterministic

Introduction: our way into this space

[Roadmap diagram: parsing programming languages -> parsing natural languages: combinatorial parsing -> tagging, chunking -> deterministic parsing]

• parsing models: origin, historical evolution

• criteria: parsing process, resources of the process

Plan of the lecture

• 1. parsing programming languages -> parsing natural languages

• 2. parsing natural languages: combinatorial -> deterministic

• 3. some features of our parsers

• 1. Parsing programming languages -> parsing natural languages

• 1.1 Formal grammars: a modelling tool for natural language syntax

• model: a simplified and formalised representation of an object or a process

• Noam Chomsky

- first trained as a mathematician

- then as a linguist, a student of Harris (himself a student of Bloomfield),

but at odds with him about the object of study: attested material -> the speaker's "competence"

- 1957: Syntactic Structures, a linguist's book

- for Chomskyan linguists: modelling the "competence" of the native speaker (in generation)

- for NLP practitioners: modelling the syntax of natural languages as attested material (in parsing)

- divergence between the two trends around 1971 (Extended Standard Theory)

• 1.2 Formal grammars: a modelling tool for programming language syntax

• 1958-1960: ALGOL 60, the first programming language whose syntax is defined by a (context-free) formal grammar

• formal grammar -> a method to design compilers

• first "ALGorithmic"-oriented programming language:

no more goto in control structures, but:

alternative: if ... then ... else

repetitive: for ... step ... until ... do

for ... while ... do

• 1.2 Formal grammars: a modelling tool for programming language syntax

ALGOL 60: the first language with recursive block structures:

formal grammar:

program -> complex_statement *

complex_statement -> simple_statement | block

block -> complex_statement *

[UML diagram of the same structure: a program is composed of complex_statements (*); a complex_statement is either a simple_statement or a block; a block is composed of complex_statements (*)]
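The recursion in this grammar can be illustrated with a small recogniser. This is only a sketch under an assumed encoding that is not in the lecture: a simple statement is a Python string, and a block (or a program) is a list of complex statements.

```python
def is_complex_statement(node):
    """complex_statement -> simple_statement | block"""
    # Assumption of this sketch: a simple statement is any string.
    return isinstance(node, str) or is_block(node)


def is_block(node):
    """block -> complex_statement *"""
    return isinstance(node, list) and all(is_complex_statement(n) for n in node)


def is_program(node):
    """program -> complex_statement *"""
    return isinstance(node, list) and all(is_complex_statement(n) for n in node)


# A program with a block nested inside a block.
example = ["x := 1", ["y := 2", ["z := 3"]]]
print(is_program(example))
```

Each function mirrors one grammar rule, and the mutual recursion between `is_complex_statement` and `is_block` is what allows blocks of arbitrary depth.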

• 1.3 A member of the ALGOL group: Bernard Vauquois

• 7 countries: Germany, Denmark, USA, France, Great Britain, the Netherlands, Switzerland

• 14 delegates: Backus, Bauer, Green, Katz, McCarthy, Naur, Perlis, Rutishauser, Samelson, Turanski, Vauquois, Wegstein, van Wijngaarden, Woodger

• conferences: Zurich (1958), Mainz (1958), Copenhagen (1959), Paris (June 1959, January 1960)

• descendants of ALGOL 60: Pascal -> C -> C++ -> Java -> Ada

• 1.4 Bernard Vauquois, director of the CETA in 1961

• astronomer and mathematician -> computer scientist and linguist

• computer science teacher (formal language theory) at the University of Grenoble

• his ideas for founding second-generation Machine Translation:

- using formal language theory

- basing Machine Translation on the compiling model

• Christian Boitet, "L'apport scientifique de Bernard Vauquois" (Analectes, 1989):

"It is probably the CETA, at the initiative of B. Vauquois, that deserves credit for introducing the analogy between MT and compiling. An MT system is thus seen as a kind of 'natural language compiler'."

• 1.5 Compiling <-> Machine Translation

[Diagram: two parallel chains]

- compiling chain: human -> statements in a natural language -> (human translation: designing-programming) -> statements in a programming language -> (automatic translation: compiling) -> statements in machine language -> processor

- Machine Translation chain: human -> texts in a source NL -> (automatic translation of natural languages) -> texts in a target NL -> human

• the parsed languages are different

• 1.6 Transpositions of formal grammars into NLP

- linguistics: modelling the "competence" of the native speaker, in generation (Chomsky, Extended Standard Theory)

- computer science: modelling the syntax of programming languages (ALGOL 60)

- combinatorial NLP: modelling the syntax of natural languages (attested material), in parsing

• computer science -> NLP: analogy between MT and compiling (Vauquois)

• linguistics -> NLP: abandonment of the object modelled by Chomsky

• 1.7 Compiling -> Natural Language parsing

                                  compiling                      parsing natural languages
input                             program                        sentence
process                           programming language parser    natural language parser
exhaustive lexical resources      primitives                     dictionary
exhaustive syntactic resources    formal grammar                 formal grammar
output                            constituent tree               constituent tree

resources: the same static model of the parsed language

• 1.7 Compiling -> Natural Language parsing

criteria              compiling                    parsing natural languages
process               repetitive / token           combinatorial
                      deterministic                non deterministic
complexity in time    theoretical: polynomial      theoretical: exponential
                      practical: linear            practical: polynomial
language              formal language              natural language

the same model of the parsed language, but a different process:

the model of the parsed language is transposed

but the process is not transposed

• 1.7 What are the differences between programming languages and natural languages?

criteria                    programming languages         natural languages
dictionary                  closed and frozen             open and changing
how many tags per token?    1 token <-> 1 unique tag      1 token <-> several tags

=> compiling = deterministic process -> parsing natural languages = non deterministic process

• 2. Parsing natural languages: combinatorial -> deterministic

- combinatorial parsing: resources = static models of the expected structures: formal grammars

- tagging, chunking, deterministic parsing: resources = dynamic models of the computation process

• 2.0 An example of two ways of solving a problem

• A problem: a father is 4 times as old as his son, and their age difference is 30 years

• Its combinatorial resolution:

- let s be the age of the son, and f the age of the father

- with 0<s<100, 0<f<100 (assuming an integer solution):

for each of the roughly 10 000 couples (s, f), if both constraints are satisfied then output the couple (s, f)

- number of solutions: unknown a priori: 0, 1 or n

• Its deterministic resolution:

- posing the system of 2 equations with 2 unknowns: f=4s, f-s=30

- solving the system => unique solution:

f=4s and f-s=30 => 4s-s=3s=30 => s=10 => f=30+s=40
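Both resolutions of the father and son problem can be written down in a few lines. The sketch below is illustrative only; the function names and the `max_age` parameter are assumptions of this example.

```python
from itertools import product


def solve_combinatorially(max_age=100):
    """Enumerate every couple (s, f) with 0 < s, f < max_age, keep those
    that satisfy both constraints. The number of results is not known
    in advance: it may be 0, 1 or n."""
    return [
        (s, f)
        for s, f in product(range(1, max_age), repeat=2)
        if f == 4 * s and f - s == 30
    ]


def solve_deterministically():
    """Compute the solution directly: f = 4s and f - s = 30 give 3s = 30."""
    s = 30 // 3
    f = 30 + s
    return s, f


print(solve_combinatorially())
print(solve_deterministically())
```

The first function tests thousands of couples to find one; the second performs two arithmetic operations and can only return a single result.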

• 2.1 Parsing natural languages: combinatorial

• 2.1 Posing a problem in a combinatorial way

• "combinatorial":

characterises a way of posing and solving a problem, but not the problem itself

• posing a problem in a combinatorial way:

- the attributes of a set of units have several possible values

- there are constraints on the attribute values

- we want to find the attribute values which satisfy the constraints

= posing it as a Constraint Satisfaction Problem (CSP)

• 2.1 Solving a problem in a combinatorial way

• in a combinatorial resolution:

-1- all possible combinations are enumerated

-2- for each combination, the constraint satisfaction is verified

[Diagram: units to process + possible values of attributes -> enumerate -> combinations of values -> verify (against the constraints) -> verified combinations (0, 1, n)]

• theoretical time complexity: exponential in the number of units

• 2.1 Posing NL parsing in a combinatorial way

• the problem of NL parsing is traditionally posed and solved in a combinatorial way

• the problem is posed as follows:

- the words of a sentence have several possible tags

- all possible attribute values of all words are "exhaustively" enumerated in the dictionary

- the constraints on tags are the possible syntactic structures of sentences and phrases, made explicit in the formal grammar

- we want to find the word tags which satisfy the constraints (= "disambiguation")

• 2.1 Solving NL parsing in a combinatorial way

[Diagram: units to process (words of the sentence) + possible values of attributes (possible tags, from the dictionary) -> enumerate -> combinations of tags -> verify against the constraints (formal grammar) -> verified combinations (0, 1, n)]
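As a toy illustration of this enumerate-then-verify scheme, the sketch below uses a hypothetical three-word dictionary and represents the "formal grammar" as a plain inventory of accepted tag sequences. Both are assumptions made for the example, not resources from the lecture.

```python
from itertools import product

# Hypothetical toy dictionary: every possible tag of every word.
DICTIONARY = {
    "le": ["det", "pron"],
    "site": ["noun"],
    "allemand": ["noun", "adj"],
}

# Hypothetical toy grammar: the static inventory of expected tag sequences.
GRAMMAR = {("det", "noun", "adj"), ("det", "adj", "noun")}


def parse_combinatorially(words):
    """Step 1: enumerate all tag combinations. Step 2: verify each one."""
    candidates = product(*(DICTIONARY[word] for word in words))
    return [tags for tags in candidates if tags in GRAMMAR]


print(parse_combinatorially(["le", "site", "allemand"]))
```

The number of candidates is the product of the number of tags per word, which is why the theoretical cost grows exponentially with sentence length; the result may contain 0, 1 or n analyses.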

• 2.2 Combinatorial NL parsing -> tagging

• 2.2 Combinatorial parsing -> tagging

                   combinatorial NL parsing                   tagging, chunking
input -> output    sentence -> parsed sentence                text -> parsed text
resources          declarative or static: formal grammar      procedural or dynamic: contextual rules
                   (expected structures)
process            combinatorial: recognising expectations    deterministic: interpreting rules
result             0, 1 or n phrase trees                     1 unique result
complexity         theoretical: exponential                   theoretical: linear
                   practical: polynomial                      practical: linear

• 2.2 Tagging, a process based on forms and their position

Le site allemand de Dasa à Hambourg devra assembler ce nouvel avion .

(The German Dasa site in Hamburg will have to assemble this new aircraft.)

• some linguistic properties of forms and their position:

word tag => constraint on the tag of the following word

det. => noun or adjective

prep. => det., noun or adjective

-> tagging process (of words)
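A minimal sketch of such a left-to-right tagging pass follows. The small function-word lexicon and the choice of "noun" as the default tag after a determiner or a preposition are assumptions of this example; the slide only states that the previous tag constrains the next one.

```python
# Hypothetical minimal lexicon: function words only.
FUNCTION_WORDS = {"le": "det", "ce": "det", "de": "prep", "à": "prep"}

# Assumed default for a non-function word, given the tag of the previous word.
DEFAULT_AFTER = {"det": "noun", "prep": "noun"}


def tag_words(words):
    """Assign exactly one tag per word in a single left-to-right pass."""
    tags = []
    previous = None
    for word in words:
        current = FUNCTION_WORDS.get(word.lower())
        if current is None:
            # No lexicon entry: use the constraint set by the previous tag.
            current = DEFAULT_AFTER.get(previous, "unknown")
        tags.append(current)
        previous = current
    return tags


print(tag_words(["Le", "site", "de", "Dasa", "à", "Hambourg"]))
```

Each word is visited once and receives one tag, so the cost is linear in the length of the text and no combination of tags is ever enumerated.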

• 2.2 Chunking, a process based on forms and their position

Le site allemand de Dasa à Hambourg devra assembler ce nouvel avion .

Le .... ........ de .... à ........ devra .......er ce ...... ..... .

• some linguistic properties of forms and their position:

function word => beginning and type of a chunk (Abney 1991)

-> segmentation process (chunking), tagging (of chunks)

[N Le site allemand] [pN de Dasa] [pN à Hambourg] [V devra assembler] [N ce nouvel avion] .
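The idea that a function word opens a chunk and fixes its type can be sketched as follows. The table of chunk starters is hypothetical, and treating "devra" as a chunk-opening form is an assumption made so that the example sentence segments as on the slide.

```python
# Hypothetical table: form that opens a chunk -> type of that chunk.
CHUNK_STARTERS = {"le": "N", "ce": "N", "de": "pN", "à": "pN", "devra": "V"}


def chunk(words):
    """Segment words into typed chunks in one left-to-right pass."""
    chunks = []
    for word in words:
        kind = CHUNK_STARTERS.get(word.lower())
        if kind is not None or not chunks:
            # A chunk starter (or the very first word) opens a new chunk.
            chunks.append((kind, [word]))
        else:
            # Any other word is attached to the chunk currently open.
            chunks[-1][1].append(word)
    return chunks


sentence = "Le site allemand de Dasa à Hambourg devra assembler ce nouvel avion"
for kind, members in chunk(sentence.split()):
    print(kind, " ".join(members))
```

Only the function words are looked up; the lexical words are never consulted in a dictionary, which is what the dotted version of the sentence on the slide is meant to show.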

• 2.3 Tagging -> deterministic parsing

• 2.3 Tagging -> deterministic parsing

• shared by tagging and deterministic parsing:

- input -> output: text -> parsed text

- resources of the process: contextual rules (procedural or dynamic resources)

- rules: conditions => actions

- deterministic process: interpreting rules

- 1 unique result

- complexity: theoretical: linear, practical: linear

• almost identical, so what are the differences?

- lexical resources: function words + word-ending morphemes; only one tag by default

- rules: conditions => actions, plus the linking of units

• 2.4 Parsing natural languages: deterministic

• 2.4 Posing and solving a problem in a deterministic way

• to pose and solve a problem in a deterministic way:

- posing the problem in terms of computing from data, and not in terms of choosing among known possible values

- knowing more about the properties of the units to process, and exploiting this knowledge better

=> finding more properties, and finding how to use them to compute the value of an attribute directly and definitively

• deterministic resolution:

[Diagram: units to process + properties -> compute -> unique solution; equivalently, operands + operations -> compute -> unique solution: the properties make it possible to build operations on the units to process]

• 2.4 Posing and solving NL parsing in a deterministic way

• explicitly taking into account the openness of NL:

- it is impossible to describe a natural language exhaustively, contrary to a programming language

- minimal lexical resources: function words + word-ending morphemes

- no formal grammar: no (exhaustive) inventory of expected structures, no grammaticality test

• "conditions => actions" rules make explicit, in the same formalism:

- minimal typographical, lexical and morphological resources,

- linguistic properties (explicit use of context),

- the linking process of units

static model of the expected structures -> dynamic model of the computing process

• 2.4 Solving NL parsing in a deterministic way

• computing process:

- a rule engine triggers "conditions => actions" rules once on each unit:

characters, tokens, phrases, clauses, sentences, (paragraphs, ...)

- conditions on unit attributes and on links between units (contiguities, constituencies, dependencies, co-ordinations, ...)

- actions: assigning values to attributes, setting links (dependencies and co-ordinations), and generating units of the level above

• process of linear complexity: flow processing at a constant rate

• the lexicon and the syntactic structures of the parsed text are computed and output
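A rule engine of this kind can be sketched in a few lines. Everything below is hypothetical: the representation of units as dictionaries, the two rules, and the policy of firing the first matching rule are assumptions chosen for illustration, not a description of the author's engine.

```python
def run_rules(units, rules):
    """Visit each unit once; fire the first rule whose condition holds."""
    for index, unit in enumerate(units):
        previous = units[index - 1] if index > 0 else None
        for condition, action in rules:
            if condition(unit, previous):
                action(unit, previous)
                break
    return units


def is_determiner(unit, previous):
    # Condition on an attribute of the unit itself (its written form).
    return unit["form"].lower() in {"le", "ce"}


def set_determiner(unit, previous):
    # Action: assign a value to an attribute.
    unit["tag"] = "det"


def follows_determiner(unit, previous):
    # Condition on the context: the contiguous unit on the left.
    return previous is not None and previous.get("tag") == "det"


def set_noun_and_link(unit, previous):
    # Actions: assign an attribute value and set a link between two units.
    unit["tag"] = "noun"
    previous["head"] = unit["form"]


RULES = [(is_determiner, set_determiner), (follows_determiner, set_noun_and_link)]

units = run_rules([{"form": "ce"}, {"form": "avion"}], RULES)
print(units)
```

Because each unit is visited once and the number of rules is fixed, the work per unit is bounded, which matches the constant-rate flow processing described above.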

• 3. Some features of our parsers

• 3.1 A process based on forms and their position

Le site allemand de Dasa à Hambourg devra assembler ce nouvel avion .

Le .... ........ de .... à ........ devra .......er ce ...... ..... .

[N Le site allemand] [pN de Dasa] [pN à Hambourg] [V devra assembler] [N ce nouvel avion] .

[Diagram: each chunk looks for a chunk to link to (N ? V ? pN ?); the links computed are subject - verb and object - verb]

• some linguistic properties of forms and their position:

-> segmenting process (chunking), tagging and linking (of chunks)

• 3.2 Another language, same process

The ............ of .... in ....... have .......ed a ............. .

[Diagram: chunks N, pN, pN, V, N; same expectations (N ? V ? pN ?) and same links as in French: subject - verb, object - verb]

• first rules package: written forms -> unit attributes

• following packages: computing on attributes, identical for several natural languages

• 3.3 Another level, same process

Qu'il s'agisse d'un logement vide ou meublé , le propriétaire ne peut pas s'opposer à ce que le locataire héberge un animal familier .

(Whether the dwelling is unfurnished or furnished, the landlord cannot object to the tenant keeping a pet.)

Qu'il ........ .... ........ .... .. ...... , .. ............ .. ..... ... ......... à ce que .. ......... ....... .. ...... ........ .

[subordinate clause: Qu'il s'agisse d'un logement vide ou meublé ,] [main clause: le propriétaire ne peut pas s'opposer] [subordinate clause: à ce que le locataire héberge un animal familier] .

[Diagram: each clause looks for a clause to link to (main clause ? sub. clause ?); the links computed are two subordinations]

• some linguistic properties of forms and their position:

-> segmentation process (into clauses), tagging and linking (of clauses)

• 3.4 A process of linear complexity to link units

• a 2-step process, while units are arriving:

[Diagram: unit i -> (step 1, rule 1: condition => action) -> typed virtual unit -> (step 2, rule 2: condition => actions) -> unit j]

- the virtual unit is an intermediary in the process, always invokable in conditions

• process of linear complexity, independent of the number of units arriving between the two linked units

• this process models a valence saturation
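One possible reading of this two-step linking is sketched below. It is an interpretation, not the author's implementation: the "virtual unit" is modelled as an entry in a dictionary keyed by the type of unit being waited for, and the two expectations (a noun chunk waits for a verb chunk, a verb chunk waits for a noun chunk) are assumptions of the example.

```python
# Assumed expectations: type of the arriving unit -> (awaited type, link label).
EXPECTS = {"N": ("V", "subject - verb"), "V": ("N", "object - verb")}


def link_units(chunk_types):
    """Link chunks in one pass; returns (index_i, index_j, label) triples."""
    pending = {}  # virtual units: awaited type -> (index of unit i, label)
    links = []
    for j, kind in enumerate(chunk_types):
        # Step 2: unit j saturates a pending expectation of its own type.
        if kind in pending:
            i, label = pending.pop(kind)
            links.append((i, j, label))
        # Step 1: unit j creates a virtual unit typed by what it awaits.
        if kind in EXPECTS:
            awaited, label = EXPECTS[kind]
            pending[awaited] = (j, label)
    return links


print(link_units(["N", "pN", "pN", "V", "N"]))
```

The lookup in `pending` costs the same whether zero or many chunks arrived between the two linked units, so the whole pass stays linear in the number of units.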

• Conclusion

combinatorial NLP:

- resources: a static model of all possible expected forms (formal grammar, exhaustive dictionary)

- grain: the word

-> abandonment of the static model of all expected forms

deterministic NLP:

- computing from forms and their position

- computing rules are based on some linguistic properties

- grains: document, paragraph, sentence, clause, chunk, ...

- partial resources: a dynamic model of the computing process

End of the lecture

• you can download this presentation at http://www.info.unicaen.fr/~jvergne/TALN2001_JVergne_en.ppt

• see also the Coling 2000 tutorial "Trends in Robust Parsing" at http://www.info.unicaen.fr/~jvergne/tutorialColing2000.html (presentation and references)

Your questions?

Results of GRACE

[Chart of the GRACE results, with two annotations: deterministic parser at decision = 100%; decision < 100% <=> multiple tags]

• decision = 100% (= tokens with a unique tag / total number of tokens)

• precision = 94.5% (= tokens with the same tag as the human annotator / tokens with a unique tag)
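The two GRACE measures can be computed as in the sketch below. The data layout (one set of tags per token for the system, one reference tag per token for the human annotation) and the sample values are assumptions of this example, not GRACE data.

```python
def decision_and_precision(system_tags, human_tags):
    """system_tags: one set of tags per token; human_tags: one tag per token."""
    # Tokens for which the system committed to a single tag.
    decided = [
        (tags, reference)
        for tags, reference in zip(system_tags, human_tags)
        if len(tags) == 1
    ]
    decision = len(decided) / len(system_tags)
    correct = sum(1 for tags, reference in decided if reference in tags)
    precision = correct / len(decided) if decided else 0.0
    return decision, precision


# Made-up sample: the third token keeps two tags, the fourth is wrong.
system = [{"det"}, {"noun"}, {"noun", "adj"}, {"verb"}]
human = ["det", "noun", "adj", "noun"]
print(decision_and_precision(system, human))
```

A deterministic parser always outputs exactly one tag per token, so its decision is 100% by construction and precision is then measured over all tokens.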