the good, the bad, and the hazy: design decisions in web ... · design decisions in eb w rpus co...
TRANSCRIPT
The Good, the Bad, and the Hazy:
Design de isions in web orpus onstru tion
Roland S häfer, Adrien Barbaresi, and Felix Bildhauer
German Grammar and Linguisti s (FU Berlin)
WaC8, Lan aster, July 22, 2013
Design de isions in web orpus onstru tion
COW proje t:
http://hpsg.fu-berlin.de/ ow/
texrex ( urrent version: texrex-hyperhyper):
http://sour eforge.net/proje ts/texrex/
Our brand new book on web orpus onstru tion:
http://dx.doi.org/10.2200/S00508ED1V01Y201305HLT022
http://sites.morgan laypool. om/w /
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 1/32
Design de isions in web orpus onstru tion
Overview
Text quality
Rating experiment
Badness s ores
Design de isions and non-destru tive normalization
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 2/32
Design de isions in web orpus onstru tion
Text quality
We are here. . .
Text quality
Rating experiment
Badness s ores
Design de isions and non-destru tive normalization
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 3/32
Design de isions in web orpus onstru tion
Text quality
Text quality as understood here
�
senten es
�
ideally onne ted
�
not word/name lists
�
not tag louds
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 4/32
Design de isions in web orpus onstru tion
Text quality
The Good
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 5/32
Design de isions in web orpus onstru tion
Text quality
The Bad
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 6/32
Design de isions in web orpus onstru tion
Text quality
The Hazy
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 7/32
Design de isions in web orpus onstru tion
Text quality
The Hazier
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 8/32
Design de isions in web orpus onstru tion
Rating experiment
We are here. . .
Text quality
Rating experiment
Badness s ores
Design de isions and non-destru tive normalization
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 9/32
Design de isions in web orpus onstru tion
Rating experiment
Corpus
�
UKCOW2012 (beta version) [S häfer and Bildhauer, 2012℄
�
approx. 6 GT, rawled in 2012 from .uk
�
leaned with texrex-mrvain
�
HTML stripping, onversion to ISO-8859, other normalizations
�
boilerplate removal
�
aggressive w-shingling-based dedupli ation
�
evaluated as superior to ukWaC in ollo ation extra tion tasks
in Biemann et al. [2013℄ (or at least �equally good, but bigger�)
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 10/32
Design de isions in web orpus onstru tion
Rating experiment
Task for human raters
�
lassi� ation of text quality for 1,000 do uments
�
500 do uments from the early phase of the rawl: �early data�
�
500 do uments from the late phase: �late data�
�
boilerplate removed, presented as text-only
with paragraph breaks
�
s ale:
�
�2,�1: do ument should not be in the orpus
�
1, 2: do ument should be in the orpus
�
0: unde ided/do ument might or might not be in the orpus
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 11/32
Design de isions in web orpus onstru tion
Rating experiment
From the guidelines I
�
Do uments ontaining predominantly full senten es are good,
�predominantly� meaning onsiderably more than
50% of the text mass (as per eived by the oder).
�
Boilerplate material in senten e form is good
(You are not allowed to post omments in this forum.),
other boilerplate material is bad
(Copyright © 2046 UAC Ltd.).
�
Senten es trun ated or otherwise destroyed
by some post-pro essing method are good as long as
they are re ognizable as (the rest of) a senten e.
�
Repetitions of good senten es are good.
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 12/32
Design de isions in web orpus onstru tion
Rating experiment
From the guidelines II
�
De isions should not depend on the length of the do ument,
su h that a do ument ontaining only one good senten e
would still be maximally good.
�
Non-English material ontributes to badness.
�
Non-senten e material (lists, tables, tag louds)
ontributes to badness.
�
However, if a list et . is embedded in a oherent text
whi h dominates the do ument, the do ument is good
(prototypi ally re ipes with longer instru tions).
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 13/32
Design de isions in web orpus onstru tion
Rating experiment
Raters
�
raters A and R: orpus designers
with a shared understanding of desired orpus
�
rater S: student assistant,
experien e with three similar rating tasks
�
�training�: rating of 100 do uments together
with several hours of dis ussion of borderline ases
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 14/32
Design de isions in web orpus onstru tion
Rating experiment
Hazy results
statisti early 500 late 500 all 1,000
raw 0.566 0.300 0.433
κ (raw) 0.397 0.303 0.367
ICC pC , 1q 0.756 0.679 0.725
raw (r ¥ 0) 0.900 0.762 0.831
raw (r ¥ 1) 0.820 0.674 0.747
κ (r ¥ 0) 0.673 0.625 0.660
κ (r ¥ 1) 0.585 0.555 0.598
κ (r ¥ 2) 0.546 0.354 0.498
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 15/32
Design de isions in web orpus onstru tion
Rating experiment
A eptable results?
�
values below 0.68 [Krippendor�, 1980℄
�
even onsidering riti ism of Krippendorf's magi number
[Carletta, 1996, Bayerl and Paul, 2011℄:
un omfortably low for �gold standard�
�
more onfusion on late data (lower overall quality)
�
worse: disagreement between orpus designers
�
a eptan e at the threshold ¥ 0:
A: 78.4%, R: 73.8%, S: 84.9%
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 16/32
Design de isions in web orpus onstru tion
Badness s ores
We are here. . .
Text quality
Rating experiment
Badness s ores
Design de isions and non-destru tive normalization
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 17/32
Design de isions in web orpus onstru tion
Badness s ores
General method and idea
�
simple metri with known properties
�
language-independent, unsupervised. . .
does not involve an obviously di� ult design de ision
�
strategy for leansing: high re all for everyone,
a ept medio re pre ision
�
for retained do uments: use as annotation in �nal orpus,
allow orpus users to �set� pre ision
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 18/32
Design de isions in web orpus onstru tion
Badness s ores
Implementation
�
based on �frequent/short word� method
in language identi� ation [Grefenstette, 1995℄
�
similar to WaCky [Baroni et al., 2009℄
but without manually ompiled lists of fun tion words
�
totally unsupervised pro edure for rawled data
predominantly in a single language (TLD rawl):
�
training: get weighted mean and standard deviation
of relative frequen ies of the most frequent words (�pro�le�)
�
produ tion: al ulate for the top m of them
the �standardized� negative deviation for ea h do ument
�
lamped and added up: the Badness s ore
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 19/32
Design de isions in web orpus onstru tion
Badness s ores
Questions
Does pro�le generation
yield reliable results?
How should we sele t the threshold
for the a tual deletion of do uments?
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 20/32
Design de isions in web orpus onstru tion
Badness s ores
Pro�le development while training
−1.
8−
1.6
−1.
4−
1.2
DIE
5 305 705 −2.
0−
1.8
−1.
6−
1.4
DER
5 305 705
−1.
8−
1.6
−1.
4−
1.2
UND
5 305 705
−2.
2−
2.0
−1.
8−
1.6
IN
5 305 705
−2.
2−
2.0
−1.
8−
1.6
DAS
5 305 705
−2.
3−
2.1
−1.
9
DEN
5 305 705
−2.
3−
2.1
−1.
9
ZU
5 305 705
−2.
4−
2.2
−2.
0−
1.8
−1.
6
IST
5 305 705
−2.
4−
2.2
−2.
0−
1.8
VON
5 305 705
−3
−2
−1
01
ICH
5 305 705
German test pro�le; n � 1000; log10-transformed relative frequen ies
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 21/32
Design de isions in web orpus onstru tion
Badness s ores
Distribution of Badness depending on do ument length
(early pro�le on early data)
0 10 20 30 40 500.00
0.04
0.08
0.12
0 10 20 30 40 500.00
0.04
0.08
0.12
document length > 5,000 Bdocument length > 1,000 Bdocument length > 200 Bdocument length > 0 B
left: DECOW2012; right: UKCOW2012
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 22/32
Design de isions in web orpus onstru tion
Badness s ores
Pro�le omparison:
E�e t of thresholds (= umulative density of Badness)
0 10 20 30 40 50
0.0
0.2
0.4
0.6
0.8
1.0
0.6780.766
0.282
0.431
0 10 20 30 40 50
0.0
0.2
0.4
0.6
0.8
1.0
0.6940.785
0.2150.314
early profile, early dataearly profile, late datalate profile, early datalate profile, late data
left: DECOW2012; right: UKCOW2012; only do uments over 200 B;
values at Badness 15 and 20 marked
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 23/32
Design de isions in web orpus onstru tion
Badness s ores
Pro�le omparison:
Raw agreement between pro�les
0 10 20 30 40 50
0.80
0.85
0.90
0.95
1.00
0 10 20 30 40 50
0.80
0.85
0.90
0.95
1.00
early datalate data
left: DECOW2012; right: UKCOW2012
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 24/32
Design de isions in web orpus onstru tion
Badness s ores
DECOW2012 sample raw pro�le omparison
0 10 20 30 40 50
010
2030
4050
0 10 20 30 40 50
010
2030
4050
left: early data; right: late data; x-axis: late pro�le; y-axis: early pro�le
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 25/32
Design de isions in web orpus onstru tion
Badness s ores
UKCOW2012 sample raw pro�le omparison
0 10 20 30 40 50
010
2030
4050
0 10 20 30 40 50
010
2030
4050
left: early data; right: late data; x-axis: late pro�le; y-axis: early pro�le
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 26/32
Design de isions in web orpus onstru tion
Design de isions and non-destru tive normalization
We are here. . .
Text quality
Rating experiment
Badness s ores
Design de isions and non-destru tive normalization
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 27/32
Design de isions in web orpus onstru tion
Design de isions and non-destru tive normalization
Badness threshold for removal
If re all is more important than pre ision: 35
pre re F1 orre t baseline
S 0.914 0.959 0.936 0.888 0.849
A 0.856 0.973 0.911 0.851 0.781
R 0.808 0.976 0.884 0.811 0.738
Pre ision/re all et . for our three raters (do uments better than 0)
at a Badness threshold of 35
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 28/32
Design de isions in web orpus onstru tion
Design de isions and non-destru tive normalization
Reasons for non-destru tive normalization
�
our goal: arefully sampled and pro essed web orpora
for fundamental resear h � theoreti al linguisti s,
linguisti web hara terization
�
noise or distortion through pro essing intolerable
�
Leave major destru tive design de isions to the user!
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 29/32
Design de isions in web orpus onstru tion
Design de isions and non-destru tive normalization
Annotation with quality metri s instead of removal:
A preliminary version based on UKCOW2012
<do url="http://www.booktrust.org.uk/writing/writing-tips/33" bd ="d">
<p bp =" ">A essibility and text options</p>
<p bp ="e">Professionals</p>
<p bp =" ">How to write omedy</p>
<p bp ="a">You want to write omedy? Well it's easy. I've just done it and didn't
even use a spell he k. I'd definitely suggest starting with a ' ' rather than a 'k'
as that sounds either wa ky or German and either way there will be prejudi es about
what type of humour it is. Oh, sorry, omedy writing? I see. Well first try not
doing the most God-awful joke you an in your very first senten e. That's a definite
no-no. To be honest, I an't tell you how to write omedy as su h. I'm very mu h a
subs riber to the 'either you're funny or you're not' party. However even if you have
the invite to that very party, there's a high han e you might not know quite how to
utilise the orre t hat for the kit hen, or the right outfit to wear, in the same way
I've never worked out how to do a de ent analogy.</p>
<p bp ="a">What I'm saying is that if you're a funny hap or hapette, there are some
things that definitely hone those powers into a neat bit of witty prose, rather than a
bundle of nearly funny ons.</p>
Annotation for Badness, boilerplate
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 30/32
Design de isions in web orpus onstru tion
Design de isions and non-destru tive normalization
Bottom line
�
�Web orpus leansing� is a destru tive pro ess.
�
Even orpus designers a hieve only medio re agreement
w. r. t. the �quality orpus material�/�noise� de ision.
�
Better guidelines just for e orpus users to live
with more spe i� problemati destru tive de isions.
�
Strategy: Leave as mu h as possible in the orpus
and provide do umented and intuitive quality annotation.
�
Badness is su h a metri .
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 31/32
Design de isions in web orpus onstru tion
Design de isions and non-destru tive normalization
Referen es I
M. Baroni, S. Bernardini, A. Ferraresi, and E. Zan hetta. The WaCky Wide Web: A olle tion of very
large linguisti ally pro essed web- rawled orpora. Language Resour es and Evaluation, 43(3):
209�226, 2009.
P. S. Bayerl and K. I. Paul. What determines inter- oder agreement in manual annotations? a
meta-analyti investigation. Computational Linguisti s, 37(4):699�725, 2011.
C. Biemann, F. Bildhauer, S. Evert, D. Goldhahn, U. Quastho�, R. S häfer, J. Simon, L. Swiezinski,
and T. Zes h. S alable onstru tion of high-quality web orpora. Spe ial issue of JLCL, 2013. In
prep. The list of authors is preliminary and might re�e t neither the order nor the a tual list in the
printed version.
J. Carletta. Assessing agreement on lassi� ation tasks: The kappa statisti . Computational Linguisti s,
22(2):249�254, 1996.
G. Grefenstette. Comparing two language identi� ation s hemes. In Pro eedings of the 3rd Internation
onferen e on Statisti al Analysis of Textual Data (JADT 1995), pages 263�268, Rome, 1995.
K. Krippendor�. Content Analysis: An Introdu tion to its Methodology. Sage Publi ations, Beverly
Hills, 1980.
R. S häfer and F. Bildhauer. Building large orpora from the web using a new e� ient tool hain. In
N. Calzolari, K. Choukri, T. De ler k, M. U. Do§an, B. Maegaard, J. Mariani, J. Odijk, and
S. Piperidis, editors, Pro eedings of the Eight International Conferen e on Language Resour es and
Evaluation (LREC'12), pages 486�493, Istanbul, 2012. ELRA.
R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 32/32