treebanks and parsing - computational...
TRANSCRIPT
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanksTreebanks and Parsing
L645
Dept. of Linguistics, Indiana UniversityFall 2009
1 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Def. Treebank
A treebank is a syntactically annotated corpus.
Issues:
◮ complete analysis vs. partial analysis◮ theory-neutral vs. theory-dependent◮ spoken vs. written language◮ constituents vs. dependency◮ annotate grammatical functions?◮ manual vs. automatic annotation
2 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Some Remarks
◮ treebanking is extremely labor-intensive (i.e. costly)
◮ good planning is therefore necessary
◮ good tools are crucial◮ they speed up the process◮ they help with consistency
◮ a detailed stylebook is essential
3 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Penn WSJ Treebank – Example
( (S (NP-SBJ (NP Pierre Vinken),(ADJP (NP 61 years)
old),)
(VP will(VP join
(NP the board)(PP-CLR as
(NP a nonexecutive director))(NP-TMP Nov. 29)))
.))
5 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Treebanks for English
◮ Penn Treebank◮ BLLIP Treebank◮ The Penn-Helsinki Parsed Corpus of Middle English◮ Susanne Corpus and Christine Project◮ International Corpus of English ICE◮ Lancaster Treebank◮ The Redwoods HPSG Treebank
6 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Treebanks Projects
◮ Basque◮ Eus3LB project
◮ Bulgarian◮ HPSG-based Syntactic Treebank of Bulgarian
(BulTreeBank)◮ Catalan
◮ CAT3LB project◮ Chinese
◮ The Chinese Treebank Project◮ Czech
◮ Prague Dependency Treebank
7 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Treebanks Projects (2)
◮ Danish◮ Danish Dependency Treebank
◮ Dutch◮ The Alpino Treebank
◮ French◮ Project TALANA
◮ German◮ NeGra Project - NeGra Corpus◮ Project TIGER◮ Verbmobil Treebank of Spoken German (TuBa-D/S)◮ The Tubingen Treebank of Written German
(TuBa-D/Z)
8 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Treebanks Projects (3)
◮ Italian◮ Turin University Treebank TUT◮ Italian Syntactic-Semantic Treebank
◮ Portuguese◮ The Floresta Sinta(c)tica project
◮ Slovene◮ Slovene Dependency Treebank
◮ Swedish◮ Swedish Treebank
◮ Turkish◮ METU treebank
9 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
The Annotation Scheme
◮ Should the annotation scheme be dependent on aparticular theory?
◮ Theory-neutrality doesn’t really exist. Everyannotation scheme is at least implicitlytheory-dependent.
◮ Grounding an annotation scheme in a linguistictheory tends to improve consistency of annotations.
10 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Theory-dependent Treebanks
◮ Prague Dependeny Treebank◮ based on Dependency Grammar
◮ The Redwoods HPSG Treebank◮ based on Head-Driven Phrase Structure Grammar
◮ CCGbank◮ translation of the Penn Treebank into a corpus of
Combinatory Categorial Grammar derivations
11 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Theory-neutral Treebanks
Theory-neutral treebanks do not adhere to any particularlinguistic theory
◮ encode those grammatical properties that aredistinguished by many, if not all grammaticalframeworks
Advantage:◮ more widely usable◮ less dependent on whatever version of a particular
grammatical theory may have existed at the timewhen the treebank annotation scheme wasdetermined
◮ examples: Penn Treebank, Negra treebank,Tubingen treebanks
12 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Extracting the CFG Grammar
0 1 2 3 4 5 6 7
500 501 502 503 504 505
506 507 508 509
510
511
wie
PWAV
sieht
VVFIN
das
PDS
bei
APPR
Ihnen
PPER
am
APPRART
Donnerstag
NN
aus
PTKVZ
HD HD HD HD HD VPT
ADJX
PRED
VXFIN
HD −
NX
HD −
NX
HD
NX
ON
PX
FOPP
PX
V−MOD
VF
−
LK
−
MF
−
VC
−
SIMPX
How does Thursday look for you?
13 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Extracting the CFG Grammar
0 1 2 3 4 5 6 7
500 501 502 503 504 505
506 507 508 509
510
511
wie
PWAV
sieht
VVFIN
das
PDS
bei
APPR
Ihnen
PPER
am
APPRART
Donnerstag
NN
aus
PTKVZ
HD HD HD HD HD VPT
ADJX
PRED
VXFIN
HD −
NX
HD −
NX
HD
NX
ON
PX
FOPP
PX
V−MOD
VF
−
LK
−
MF
−
VC
−
SIMPX
rules:
SIMPX → VF LK MF VCVF → ADJXLK → VXFINMF → NX PX PXVC → PTKVZ
13 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Extracting the CFG Grammar
0 1 2 3 4 5 6 7
500 501 502 503 504 505
506 507 508 509
510
511
wie
PWAV
sieht
VVFIN
das
PDS
bei
APPR
Ihnen
PPER
am
APPRART
Donnerstag
NN
aus
PTKVZ
HD HD HD HD HD VPT
ADJX
PRED
VXFIN
HD −
NX
HD −
NX
HD
NX
ON
PX
FOPP
PX
V−MOD
VF
−
LK
−
MF
−
VC
−
SIMPX
or:
SIMPX → VF-NH LK-NH MF-NH VC-NHVF-NH → ADJX-PREDLK-NH → VXFIN-HDMF-NH → NX-ON PX-FOPP PX-VMODVC-NH → PTKVZ-VPT
13 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Ambiguity – A Serious Problem
example sentence:
Volker Tegeler, stellvertretenderGeschaftsf uhrer des Landesverbandes, sagt:
(Volker Tegeler, vice CEO of the national association,says ...)
How many different analyses?◮ training corpus: 15 000 sentences, grammar: ca.
6 000 rules
14 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Ambiguity – A Serious Problem
example sentence:
Volker Tegeler, stellvertretenderGeschaftsf uhrer des Landesverbandes, sagt:
(Volker Tegeler, vice CEO of the national association,says ...)
How many different analyses?◮ training corpus: 15 000 sentences, grammar: ca.
6 000 rules
Answer: don’t know◮ but 58(!) root categories (if complex information is
used in the nodes)
14 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Ambiguity – Sample Analyses
the most probable parse:
R-SIMPX-ON
NX-ON
ENX-ON
NE-n
Volker
NE-n
Tegeler
, NX-ON
NX-ON
ADJX-n
ADJA-n
stellv.
NN
Geschaftsf.
NX-g
ART-g
des
NN-g
Landesv.
, R-SIMPX
VC
VXFIN
VVFIN
sagt
15 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Ambiguity – Sample Analyses
the almost correct parse:
SIMPX-ON
NX-ON
ENX-ON
NE-n
Volker
NE-n
Tegeler
, NX-ON
NX-ON
ADJX-n
ADJA-n
stellv.
NN
Geschaftsf.
NX-g
ART-g
des
NN-g
Landesv.
, SIMPX
VC-FIN
VVFIN
sagt
15 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Ambiguity – Sample Analyses
ADVX
ADVX
ADVX
NX-a
NE
Volker
ADV
Tegeler
, ADVX
ADJD
stellv.
SIMPX-C-OA
NX-OA
NX-OA
NN-a
Geschaftsf.
NX-d
NE-d
des
SIMPX-C
SIMPX
NX-d
NN-d
Landesv.
, SIMPX-C
R-SIMPX
VC
VXFIN
VVFIN
sagt
an “interesting” parse15 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Parsing and Treebanks – How do theyRelate?
◮ Progress in parsing has been guided byimprovements on the Penn Treebank
◮ How much of this is dependent on the annotationscheme?
◮ German: has two treebanks→ ideal for comparinginfluence of annotation scheme on parsing
16 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Negra and TuBa-D/Z
◮ both are based on newspaper texts: FrankfurterRundschau, taz
◮ both use same POS tagset: STTS◮ both annotate constituent structure and
function-argument structure◮ Negra: 20 000 sentences; TuBa-D/Z: 15 000
sentences (version 1), 27 000 sentences (version 3)
◮ different annotation decisions
17 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
A Tree from Negra
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
500 501 502
503
504
505
506
507
Das
PDS
kann
VMFIN
die
ART
Maske
NN
sein
VAINF
,
$,
die
PRELS
ihm
PPER
sein
PPOSAT
Charakter
NN
wie
KOKOM
ein
ART
Stempel
NN
ins
APPRART
Gesicht
NN
gedrückt
VVPP
hat
VAFIN
;
$.
NK NK CM NK NK AC NK
OA DA
NP
MO
PP
MO HD
NP
SB
VP
OC HD
NK NK
S
RC
HD
NP
PD
SB HD
VP
OC
S
That could be the mask that his character
pressed like a stamp on his face.
18 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
A Tree from TuBa-D/Z
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
500 501 502 503 504 505 506 507 508
509 510 511 512 513 514 515
516 517
518 519
520
521
Der
ART
Autokonvoi
NN
mit
APPR
den
ART
Probenbesuchern
NN
fährt
VVFIN
eine
ART
Straße
NN
entlang
PTKVZ
,
$,
die
PRELS
noch
ADV
heute
ADV
Lagerstraße
NN
heißt
VVFIN
.
$.
− HD − HD HD − HD VPT HD HD HD HD
−
NCX
HD
VXFIN
HD
NCX
OA
NCX
ON
ADVX
− HD
NCX
−
VXFIN
HD
NCX
HD
PX
−
ADVX
V−MOD
EN−ADD
PRED
NX
ON
C
−
MF
−
VC
−
R−SIMPX
OA−MOD
VF
−
LK
−
MF
−
VC
−
NF
−
SIMPX
The convoy of the rehearsal visitors’ cars goes
along a street that is still called Lagerstraße.
19 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Main Differences in Annotation
◮ phrase structure◮ Negra: extremely flat◮ TuBa-D/Z: premodification flat, postmodification high
◮ clause structure◮ Negra: VP⇒ crossing branches
no unary nodes◮ TuBa-D/Z: topological fields
◮ long-distance relationships◮ Negra: crossing branches◮ TuBa-D/Z: pure tree structure + special labels
20 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Crossing Branches
0 1 2 3 4 5 6 7 8 9 10 11
500 501 502
503
504
Diese
PDAT
Metapher
NN
kann
VMFIN
die
ART
Freizeitmalerin
NN
durchaus
ADV
auch
ADV
auf
APPR
ihr
PPOSAT
Leben
NN
anwenden
VVINF
.
$.
NK NK NK NK MO AC NK NK
NP
OA
PP
MO HD
HD
NP
SB MO
VP
OC
S
The amateur painter can by all means apply
this metaphor also to her life.
21 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Resolved Structure
0 1 2 3 4 5 6 7 8 9 10 11
500 501 502
503
504
Diese
PDAT
Metapher
NN
kann
VMFIN
die
ART
Freizeitmalerin
NN
durchaus
ADV
auch
ADV
auf
APPR
ihr
PPOSAT
Leben
NN
anwenden
VVINF
.
$.
NK NK NK NK MO AC NK NK
PP
MO HD
NP
OA HD
NP
SB MO
VP
OC
S
22 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Resolved Structure
0 1 2 3 4 5 6 7 8 9 10 11
500 501 502
503
504
Diese
PDAT
Metapher
NN
kann
VMFIN
die
ART
Freizeitmalerin
NN
durchaus
ADV
auch
ADV
auf
APPR
ihr
PPOSAT
Leben
NN
anwenden
VVINF
.
$.
NK NK NK NK MO AC NK NK
PP
MO HD
NP
OA HD
NP
SB MO
VP
OC
S
S→ NP VMFIN NP ADV VP
22 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Results
lab. prec. lab. rec. lab. F-scoreNegra no GF 69.96 69.95 69.95
all GF 47.20 56.43 51.41subject 52.50 58.02 55.12acc. object 35.14 36.30 35.71dat. object 8.38 3.58 5.00
TuBa no GF 89.86 88.51 89.18all GF 75.73 74.93 75.33subject 66.82 75.93 71.08acc. object 43.84 47.31 45.50dat. object 24.46 9.96 14.07
23 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Results
lab. prec. lab. rec. lab. F-scoreNegra no GF 69.96 69.95 69.95
all GF 47.20 56.43 51.41subject 52.50 58.02 55.12acc. object 35.14 36.30 35.71dat. object 8.38 3.58 5.00
TuBa no GF 89.86 88.51 89.18all GF 75.73 74.93 75.33subject 66.82 75.93 71.08acc. object 43.84 47.31 45.50dat. object 24.46 9.96 14.07
23 / 23
Treebanks andParsing
Treebanks
Extracting aGrammar
Ambiguity
Parsing andTreebanks
Results
lab. prec. lab. rec. lab. F-scoreNegra no GF 69.96 69.95 69.95
all GF 47.20 56.43 51.41subject 52.50 58.02 55.12acc. object 35.14 36.30 35.71dat. object 8.38 3.58 5.00
TuBa no GF 89.86 88.51 89.18all GF 75.73 74.93 75.33subject 66.82 75.93 71.08acc. object 43.84 47.31 45.50dat. object 24.46 9.96 14.07
23 / 23