
The Good, the Bad, and the Hazy:

Design decisions in web corpus construction

Roland Schäfer, Adrien Barbaresi, and Felix Bildhauer

German Grammar and Linguistics (FU Berlin)

WaC8, Lancaster, July 22, 2013


COW project:

http://hpsg.fu-berlin.de/cow/

texrex (current version: texrex-hyperhyper):

http://sourceforge.net/projects/texrex/

Our brand new book on web corpus construction:

http://dx.doi.org/10.2200/S00508ED1V01Y201305HLT022

http://sites.morganclaypool.com/wcc/


Overview

Text quality

Rating experiment

Badness scores

Design decisions and non-destructive normalization







Text quality

Text quality as understood here

• sentences
• ideally connected
• not word/name lists
• not tag clouds



Text quality

The Good



Text quality

The Bad



Text quality

The Hazy



Text quality

The Hazier







Rating experiment

Corpus

• UKCOW2012 (beta version) [Schäfer and Bildhauer, 2012]
• approx. 6 GT, crawled in 2012 from .uk
• cleaned with texrex-mrvain
• HTML stripping, conversion to ISO-8859, other normalizations
• boilerplate removal
• aggressive w-shingling-based deduplication
• evaluated as superior to ukWaC in collocation extraction tasks in Biemann et al. [2013] (or at least "equally good, but bigger")



Rating experiment

Task for human raters

• classification of text quality for 1,000 documents
• 500 documents from the early phase of the crawl: "early data"
• 500 documents from the late phase: "late data"
• boilerplate removed, presented as text-only with paragraph breaks
• scale:
  ◦ −2, −1: document should not be in the corpus
  ◦ 1, 2: document should be in the corpus
  ◦ 0: undecided/document might or might not be in the corpus



Rating experiment

From the guidelines I

• Documents containing predominantly full sentences are good, "predominantly" meaning considerably more than 50% of the text mass (as perceived by the coder).
• Boilerplate material in sentence form is good (You are not allowed to post comments in this forum.), other boilerplate material is bad (Copyright © 2046 UAC Ltd.).
• Sentences truncated or otherwise destroyed by some post-processing method are good as long as they are recognizable as (the rest of) a sentence.
• Repetitions of good sentences are good.



Rating experiment

From the guidelines II

• Decisions should not depend on the length of the document, such that a document containing only one good sentence would still be maximally good.
• Non-English material contributes to badness.
• Non-sentence material (lists, tables, tag clouds) contributes to badness.
• However, if a list etc. is embedded in a coherent text which dominates the document, the document is good (prototypically recipes with longer instructions).



Rating experiment

Raters

• raters A and R: corpus designers with a shared understanding of the desired corpus
• rater S: student assistant, experience with three similar rating tasks
• "training": rating of 100 documents together with several hours of discussion of borderline cases



Rating experiment

Hazy results

statistic      early 500   late 500   all 1,000
raw            0.566       0.300      0.433
κ (raw)        0.397       0.303      0.367
ICC(C, 1)      0.756       0.679      0.725
raw (r ≥ 0)    0.900       0.762      0.831
raw (r ≥ 1)    0.820       0.674      0.747
κ (r ≥ 0)      0.673       0.625      0.660
κ (r ≥ 1)      0.585       0.555      0.598
κ (r ≥ 2)      0.546       0.354      0.498
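Agreement figures of this kind could be computed roughly as follows. This is a minimal sketch, not the authors' evaluation script. It assumes that "raw" is the share of documents on which all three raters chose the same category, that κ is Fleiss' κ, and that "r ≥ k" means the five-point ratings were first collapsed into accept/reject at threshold k. All function names and the toy ratings are invented for illustration.

```python
from collections import Counter

def binarize(ratings, threshold):
    """Collapse ratings on the -2..2 scale into accept (True) / reject (False)."""
    return [tuple(r >= threshold for r in doc) for doc in ratings]

def raw_agreement(ratings):
    """Share of documents on which all raters chose the same category."""
    return sum(len(set(doc)) == 1 for doc in ratings) / len(ratings)

def fleiss_kappa(ratings):
    """Fleiss' kappa; every document must be rated by the same number of raters."""
    n_docs = len(ratings)
    n_raters = len(ratings[0])
    category_totals = Counter()
    agreement_sum = 0.0
    for doc in ratings:
        counts = Counter(doc)
        category_totals.update(counts)
        # share of agreeing rater pairs for this document
        agreement_sum += sum(c * (c - 1) for c in counts.values()) / (n_raters * (n_raters - 1))
    p_observed = agreement_sum / n_docs
    p_expected = sum((t / (n_docs * n_raters)) ** 2 for t in category_totals.values())
    if p_expected == 1.0:
        return 1.0  # only one category was ever used
    return (p_observed - p_expected) / (1 - p_expected)

# Invented ratings by three raters for six documents.
toy = [(2, 2, 1), (-1, -2, -1), (0, 1, 1), (2, 2, 2), (-2, -2, -2), (1, 0, -1)]
accepted = binarize(toy, 0)
print(raw_agreement(toy), raw_agreement(accepted), fleiss_kappa(accepted))
```

Collapsing the scale before measuring agreement mirrors the "r ≥ k" rows: raters who disagree on the exact score often still agree on whether a document belongs in the corpus.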



Rating experiment

Acceptable results?

• values below 0.68 [Krippendorff, 1980]
• even considering criticism of Krippendorff's magic number [Carletta, 1996, Bayerl and Paul, 2011]: uncomfortably low for "gold standard"
• more confusion on late data (lower overall quality)
• worse: disagreement between corpus designers
• acceptance at the threshold ≥ 0: A: 78.4%, R: 73.8%, S: 84.9%







Badness scores

General method and idea

• simple metric with known properties
• language-independent, unsupervised...: does not involve an obviously difficult design decision
• strategy for cleansing: high recall for everyone, accept mediocre precision
• for retained documents: use as annotation in final corpus, allow corpus users to "set" precision



Badness scores

Implementation

• based on "frequent/short word" method in language identification [Grefenstette, 1995]
• similar to WaCky [Baroni et al., 2009] but without manually compiled lists of function words
• totally unsupervised procedure for crawled data predominantly in a single language (TLD crawl):
  ◦ training: get weighted mean and standard deviation of relative frequencies of the most frequent words ("profile")
  ◦ production: calculate for the top m of them the "standardized" negative deviation for each document
  ◦ clamped and added up: the Badness score
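The training and production steps could look roughly like this. The sketch is an illustration under stated assumptions, not the texrex implementation: it uses plain rather than log-transformed relative frequencies, weights each document by its token count, and clamps each per-word deviation to the range 0 to 5. The function names, the default m, and the clamp value are all invented.

```python
import math
from collections import Counter

def train_profile(docs, m=10):
    """Return {word: (weighted mean, weighted sd)} for the m most frequent words.

    docs is a list of token lists; weighting by document length is an assumption.
    """
    docs = [tokens for tokens in docs if tokens]  # skip empty documents
    totals = Counter()
    for tokens in docs:
        totals.update(tokens)
    weight_sum = sum(len(tokens) for tokens in docs)
    profile = {}
    for word, _ in totals.most_common(m):
        # (relative frequency, weight) for every training document
        rel = [(tokens.count(word) / len(tokens), len(tokens)) for tokens in docs]
        mean = sum(f * n for f, n in rel) / weight_sum
        var = sum(n * (f - mean) ** 2 for f, n in rel) / weight_sum
        profile[word] = (mean, math.sqrt(var))
    return profile

def badness(tokens, profile, clamp=5.0):
    """Sum of clamped, standardized negative deviations from the profile."""
    if not tokens:
        raise ValueError("cannot score an empty document")
    counts = Counter(tokens)
    score = 0.0
    for word, (mean, sd) in profile.items():
        if sd == 0:
            continue  # no variation seen in training, so no standardization possible
        # positive when the word is rarer in this document than in the profile
        z = (mean - counts[word] / len(tokens)) / sd
        score += min(max(z, 0.0), clamp)
    return score

training = [
    "the cat sat on the mat and the dog sat on the rug".split(),
    "the sun and the moon are in the sky".split(),
]
profile = train_profile(training, m=3)
good = badness("the cat and the dog".split(), profile)
bad = badness("phones tablets laptops".split(), profile)
print(good, bad)
```

Only shortfalls count: a document that uses frequent words more often than the profile predicts is not penalized, which is why a word list scores worse than running text.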



Badness scores

Questions

Does profile generation yield reliable results?

How should we select the threshold for the actual deletion of documents?



Badness scores

Profile development while training

[Figure: ten panels, one per frequent word (DIE, DER, UND, IN, DAS, DEN, ZU, IST, VON, ICH), showing how the profile value for that word develops during training; x-axis ticks at 5, 305, and 705.]

German test profile; n = 1000; log10-transformed relative frequencies



Badness scores

Distribution of Badness depending on document length
(early profile on early data)

[Figure: two panels showing the distribution of Badness (x-axis: 0 to 50; y-axis: 0.00 to 0.12), with one curve each for document length > 0 B, > 200 B, > 1,000 B, and > 5,000 B.]

left: DECOW2012; right: UKCOW2012



Badness scores

Profile comparison:
Effect of thresholds (= cumulative density of Badness)

[Figure: two panels showing the cumulative density of Badness (x-axis: 0 to 50; y-axis: 0.0 to 1.0), with one curve each for early profile/early data, early profile/late data, late profile/early data, and late profile/late data. Marked values: 0.678, 0.766, 0.282, 0.431 (first panel); 0.694, 0.785, 0.215, 0.314 (second panel).]

left: DECOW2012; right: UKCOW2012; only documents over 200 B;
values at Badness 15 and 20 marked
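Reading a threshold off a cumulative curve amounts to asking what share of documents has a Badness at or below that value. A minimal sketch, with invented scores and an assumed "at or below" convention (the slides do not say whether the threshold itself is included):

```python
from bisect import bisect_right

def retained_share(scores, threshold):
    """Empirical cumulative share of documents with Badness <= threshold."""
    ordered = sorted(scores)
    return bisect_right(ordered, threshold) / len(ordered)

# Invented Badness values for eight documents.
scores = [2.0, 7.5, 12.0, 14.9, 18.0, 22.5, 30.0, 41.0]
for threshold in (15, 20):
    print(threshold, retained_share(scores, threshold))
```

Comparing this share across profiles and data slices, as in the figure, shows how sensitive a fixed threshold is to when the profile was trained.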



Badness scores

Profile comparison:
Raw agreement between profiles

[Figure: two panels showing raw agreement between the early and late profiles (x-axis: 0 to 50; y-axis: 0.80 to 1.00), with one curve each for early data and late data.]

left: DECOW2012; right: UKCOW2012



Badness scores

DECOW2012 sample raw profile comparison

[Figure: two panels plotting Badness under the early profile against Badness under the late profile; both axes run from 0 to 50.]

left: early data; right: late data; x-axis: late profile; y-axis: early profile



Badness scores

UKCOW2012 sample raw profile comparison

[Figure: two panels plotting Badness under the early profile against Badness under the late profile; both axes run from 0 to 50.]

left: early data; right: late data; x-axis: late profile; y-axis: early profile









Badness threshold for removal

If recall is more important than precision: 35

rater   precision   recall   F1      correct   baseline
S       0.914       0.959    0.936   0.888     0.849
A       0.856       0.973    0.911   0.851     0.781
R       0.808       0.976    0.884   0.811     0.738

Precision/recall etc. for our three raters (documents better than 0) at a Badness threshold of 35
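The columns of this table could be derived from per-document Badness scores and one rater's judgments as sketched below. The definitions are assumptions, not taken from the slides: a document counts as kept when its Badness is at or below the threshold, as good when the rater's score is at least 0, "correct" is read as accuracy, and "baseline" as the accuracy of keeping every document. The function names and toy data are invented.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def evaluate(badness_scores, ratings, threshold=35, min_rating=0):
    """Return (precision, recall, F1, correct, baseline) for one rater.

    Assumes at least one document is kept and at least one is rated good.
    """
    kept = [b <= threshold for b in badness_scores]
    good = [r >= min_rating for r in ratings]
    true_pos = sum(k and g for k, g in zip(kept, good))
    precision = true_pos / sum(kept)
    recall = true_pos / sum(good)
    correct = sum(k == g for k, g in zip(kept, good)) / len(good)
    baseline = sum(good) / len(good)
    return precision, recall, f1(precision, recall), correct, baseline

# F1 recomputed from the precision and recall reported for each rater.
reported = {"S": (0.914, 0.959), "A": (0.856, 0.973), "R": (0.808, 0.976)}
for rater, (p, r) in reported.items():
    print(rater, round(f1(p, r), 3))

# Invented example: five documents with Badness scores and one rater's ratings.
print(evaluate([5, 10, 40, 50, 20], [2, 1, -1, 0, -2]))
```

Recomputing F1 from the reported precision and recall reproduces the F1 column for all three raters.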




Reasons for non-destructive normalization

• our goal: carefully sampled and processed web corpora for fundamental research: theoretical linguistics, linguistic web characterization
• noise or distortion through processing intolerable
• Leave major destructive design decisions to the user!


Annotation with quality metrics instead of removal:

A preliminary version based on UKCOW2012

<doc url="http://www.booktrust.org.uk/writing/writing-tips/33" bdc="d">
Accessibility and text options
Professionals
How to write comedy
You want to write comedy? Well it's easy. I've just done it and didn't
even use a spell check. I'd definitely suggest starting with a 'c' rather than a 'k'
as that sounds either wacky or German and either way there will be prejudices about
what type of humour it is. Oh, sorry, comedy writing? I see. Well first try not
doing the most God-awful joke you can in your very first sentence. That's a definite
no-no. To be honest, I can't tell you how to write comedy as such. I'm very much a
subscriber to the 'either you're funny or you're not' party. However even if you have
the invite to that very party, there's a high chance you might not know quite how to
utilise the correct hat for the kitchen, or the right outfit to wear, in the same way
I've never worked out how to do a decent analogy.
What I'm saying is that if you're a funny chap or chapette, there are some
things that definitely hone those powers into a neat bit of witty prose, rather than a
bundle of nearly funny cons.

Annotation for Badness, boilerplate
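A corpus user could apply their own cut-off to such an annotation along the following lines. This is a hypothetical sketch: it assumes the Badness class is stored in a document attribute named bdc with single-letter values, and the set of accepted classes is the user's own choice. The meaning and ordering of the class letters are not specified in the excerpt, and the helper names are invented.

```python
import re

# Matches the opening tag of a document and captures its attribute string.
DOC_HEADER = re.compile(r'<doc\s+([^>]*)>')
# Matches name="value" pairs inside the attribute string.
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')

def doc_attributes(header_line):
    """Return the attributes of a <doc ...> header as a dict, or None if absent."""
    match = DOC_HEADER.search(header_line)
    if match is None:
        return None
    return dict(ATTRIBUTE.findall(match.group(1)))

def keep_document(header_line, accepted_classes):
    """True if the header carries a Badness class the user chose to accept."""
    attrs = doc_attributes(header_line)
    return attrs is not None and attrs.get("bdc") in accepted_classes

header = '<doc url="http://www.booktrust.org.uk/writing/writing-tips/33" bdc="d">'
print(doc_attributes(header))
print(keep_document(header, {"a", "b", "c", "d"}))
```

Because nothing is deleted from the corpus, a stricter or more lenient set of accepted classes can be chosen per study without rebuilding the corpus.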



Bottom line

• "Web corpus cleansing" is a destructive process.
• Even corpus designers achieve only mediocre agreement w. r. t. the "quality corpus material"/"noise" decision.
• Better guidelines just force corpus users to live with more specific problematic destructive decisions.
• Strategy: Leave as much as possible in the corpus and provide documented and intuitive quality annotation.
• Badness is such a metric.




References

M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta. The WaCky Wide Web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226, 2009.

P. S. Bayerl and K. I. Paul. What determines inter-coder agreement in manual annotations? A meta-analytic investigation. Computational Linguistics, 37(4):699–725, 2011.

C. Biemann, F. Bildhauer, S. Evert, D. Goldhahn, U. Quasthoff, R. Schäfer, J. Simon, L. Swiezinski, and T. Zesch. Scalable construction of high-quality web corpora. Special issue of JLCL, 2013. In prep. The list of authors is preliminary and might reflect neither the order nor the actual list in the printed version.

J. Carletta. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254, 1996.

G. Grefenstette. Comparing two language identification schemes. In Proceedings of the 3rd International Conference on Statistical Analysis of Textual Data (JADT 1995), pages 263–268, Rome, 1995.

K. Krippendorff. Content Analysis: An Introduction to its Methodology. Sage Publications, Beverly Hills, 1980.

R. Schäfer and F. Bildhauer. Building large corpora from the web using a new efficient tool chain. In N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 486–493, Istanbul, 2012. ELRA.

