the good, the bad, and the hazy: design decisions in web ... · design decisions in eb w rpus co...

33

Upload: others

Post on 29-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

The Good, the Bad, and the Hazy:

Design de isions in web orpus onstru tion

Roland S häfer, Adrien Barbaresi, and Felix Bildhauer

German Grammar and Linguisti s (FU Berlin)

WaC8, Lan aster, July 22, 2013

Page 2: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

COW proje t:

http://hpsg.fu-berlin.de/ ow/

texrex ( urrent version: texrex-hyperhyper):

http://sour eforge.net/proje ts/texrex/

Our brand new book on web orpus onstru tion:

http://dx.doi.org/10.2200/S00508ED1V01Y201305HLT022

http://sites.morgan laypool. om/w /

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 1/32

Page 3: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Overview

Text quality

Rating experiment

Badness s ores

Design de isions and non-destru tive normalization

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 2/32

Page 4: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Text quality

We are here. . .

Text quality

Rating experiment

Badness s ores

Design de isions and non-destru tive normalization

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 3/32

Page 5: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Text quality

Text quality as understood here

senten es

ideally onne ted

not word/name lists

not tag louds

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 4/32

Page 6: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Text quality

The Good

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 5/32

Page 7: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Text quality

The Bad

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 6/32

Page 8: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Text quality

The Hazy

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 7/32

Page 9: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Text quality

The Hazier

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 8/32

Page 10: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Rating experiment

We are here. . .

Text quality

Rating experiment

Badness s ores

Design de isions and non-destru tive normalization

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 9/32

Page 11: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Rating experiment

Corpus

UKCOW2012 (beta version) [S häfer and Bildhauer, 2012℄

approx. 6 GT, rawled in 2012 from .uk

leaned with texrex-mrvain

HTML stripping, onversion to ISO-8859, other normalizations

boilerplate removal

aggressive w-shingling-based dedupli ation

evaluated as superior to ukWaC in ollo ation extra tion tasks

in Biemann et al. [2013℄ (or at least �equally good, but bigger�)

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 10/32

Page 12: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Rating experiment

Task for human raters

lassi� ation of text quality for 1,000 do uments

500 do uments from the early phase of the rawl: �early data�

500 do uments from the late phase: �late data�

boilerplate removed, presented as text-only

with paragraph breaks

s ale:

�2,�1: do ument should not be in the orpus

1, 2: do ument should be in the orpus

0: unde ided/do ument might or might not be in the orpus

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 11/32

Page 13: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Rating experiment

From the guidelines I

Do uments ontaining predominantly full senten es are good,

�predominantly� meaning onsiderably more than

50% of the text mass (as per eived by the oder).

Boilerplate material in senten e form is good

(You are not allowed to post omments in this forum.),

other boilerplate material is bad

(Copyright © 2046 UAC Ltd.).

Senten es trun ated or otherwise destroyed

by some post-pro essing method are good as long as

they are re ognizable as (the rest of) a senten e.

Repetitions of good senten es are good.

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 12/32

Page 14: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Rating experiment

From the guidelines II

De isions should not depend on the length of the do ument,

su h that a do ument ontaining only one good senten e

would still be maximally good.

Non-English material ontributes to badness.

Non-senten e material (lists, tables, tag louds)

ontributes to badness.

However, if a list et . is embedded in a oherent text

whi h dominates the do ument, the do ument is good

(prototypi ally re ipes with longer instru tions).

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 13/32

Page 15: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Rating experiment

Raters

raters A and R: orpus designers

with a shared understanding of desired orpus

rater S: student assistant,

experien e with three similar rating tasks

�training�: rating of 100 do uments together

with several hours of dis ussion of borderline ases

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 14/32

Page 16: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Rating experiment

Hazy results

statisti early 500 late 500 all 1,000

raw 0.566 0.300 0.433

κ (raw) 0.397 0.303 0.367

ICC pC , 1q 0.756 0.679 0.725

raw (r ¥ 0) 0.900 0.762 0.831

raw (r ¥ 1) 0.820 0.674 0.747

κ (r ¥ 0) 0.673 0.625 0.660

κ (r ¥ 1) 0.585 0.555 0.598

κ (r ¥ 2) 0.546 0.354 0.498

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 15/32

Page 17: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Rating experiment

A eptable results?

values below 0.68 [Krippendor�, 1980℄

even onsidering riti ism of Krippendorf's magi number

[Carletta, 1996, Bayerl and Paul, 2011℄:

un omfortably low for �gold standard�

more onfusion on late data (lower overall quality)

worse: disagreement between orpus designers

a eptan e at the threshold ¥ 0:

A: 78.4%, R: 73.8%, S: 84.9%

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 16/32

Page 18: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Badness s ores

We are here. . .

Text quality

Rating experiment

Badness s ores

Design de isions and non-destru tive normalization

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 17/32

Page 19: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Badness s ores

General method and idea

simple metri with known properties

language-independent, unsupervised. . .

does not involve an obviously di� ult design de ision

strategy for leansing: high re all for everyone,

a ept medio re pre ision

for retained do uments: use as annotation in �nal orpus,

allow orpus users to �set� pre ision

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 18/32

Page 20: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Badness s ores

Implementation

based on �frequent/short word� method

in language identi� ation [Grefenstette, 1995℄

similar to WaCky [Baroni et al., 2009℄

but without manually ompiled lists of fun tion words

totally unsupervised pro edure for rawled data

predominantly in a single language (TLD rawl):

training: get weighted mean and standard deviation

of relative frequen ies of the most frequent words (�pro�le�)

produ tion: al ulate for the top m of them

the �standardized� negative deviation for ea h do ument

lamped and added up: the Badness s ore

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 19/32

Page 21: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Badness s ores

Questions

Does pro�le generation

yield reliable results?

How should we sele t the threshold

for the a tual deletion of do uments?

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 20/32

Page 22: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Badness s ores

Pro�le development while training

−1.

8−

1.6

−1.

4−

1.2

DIE

5 305 705 −2.

0−

1.8

−1.

6−

1.4

DER

5 305 705

−1.

8−

1.6

−1.

4−

1.2

UND

5 305 705

−2.

2−

2.0

−1.

8−

1.6

IN

5 305 705

−2.

2−

2.0

−1.

8−

1.6

DAS

5 305 705

−2.

3−

2.1

−1.

9

DEN

5 305 705

−2.

3−

2.1

−1.

9

ZU

5 305 705

−2.

4−

2.2

−2.

0−

1.8

−1.

6

IST

5 305 705

−2.

4−

2.2

−2.

0−

1.8

VON

5 305 705

−3

−2

−1

01

ICH

5 305 705

German test pro�le; n � 1000; log10-transformed relative frequen ies

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 21/32

Page 23: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Badness s ores

Distribution of Badness depending on do ument length

(early pro�le on early data)

0 10 20 30 40 500.00

0.04

0.08

0.12

0 10 20 30 40 500.00

0.04

0.08

0.12

document length > 5,000 Bdocument length > 1,000 Bdocument length > 200 Bdocument length > 0 B

left: DECOW2012; right: UKCOW2012

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 22/32

Page 24: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Badness s ores

Pro�le omparison:

E�e t of thresholds (= umulative density of Badness)

0 10 20 30 40 50

0.0

0.2

0.4

0.6

0.8

1.0

0.6780.766

0.282

0.431

0 10 20 30 40 50

0.0

0.2

0.4

0.6

0.8

1.0

0.6940.785

0.2150.314

early profile, early dataearly profile, late datalate profile, early datalate profile, late data

left: DECOW2012; right: UKCOW2012; only do uments over 200 B;

values at Badness 15 and 20 marked

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 23/32

Page 25: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Badness s ores

Pro�le omparison:

Raw agreement between pro�les

0 10 20 30 40 50

0.80

0.85

0.90

0.95

1.00

0 10 20 30 40 50

0.80

0.85

0.90

0.95

1.00

early datalate data

left: DECOW2012; right: UKCOW2012

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 24/32

Page 26: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Badness s ores

DECOW2012 sample raw pro�le omparison

0 10 20 30 40 50

010

2030

4050

0 10 20 30 40 50

010

2030

4050

left: early data; right: late data; x-axis: late pro�le; y-axis: early pro�le

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 25/32

Page 27: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Badness s ores

UKCOW2012 sample raw pro�le omparison

0 10 20 30 40 50

010

2030

4050

0 10 20 30 40 50

010

2030

4050

left: early data; right: late data; x-axis: late pro�le; y-axis: early pro�le

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 26/32

Page 28: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Design de isions and non-destru tive normalization

We are here. . .

Text quality

Rating experiment

Badness s ores

Design de isions and non-destru tive normalization

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 27/32

Page 29: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Design de isions and non-destru tive normalization

Badness threshold for removal

If re all is more important than pre ision: 35

pre re F1 orre t baseline

S 0.914 0.959 0.936 0.888 0.849

A 0.856 0.973 0.911 0.851 0.781

R 0.808 0.976 0.884 0.811 0.738

Pre ision/re all et . for our three raters (do uments better than 0)

at a Badness threshold of 35

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 28/32

Page 30: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Design de isions and non-destru tive normalization

Reasons for non-destru tive normalization

our goal: arefully sampled and pro essed web orpora

for fundamental resear h � theoreti al linguisti s,

linguisti web hara terization

noise or distortion through pro essing intolerable

Leave major destru tive design de isions to the user!

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 29/32

Page 31: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Design de isions and non-destru tive normalization

Annotation with quality metri s instead of removal:

A preliminary version based on UKCOW2012

<do url="http://www.booktrust.org.uk/writing/writing-tips/33" bd ="d">

<p bp =" ">A essibility and text options</p>

<p bp ="e">Professionals</p>

<p bp =" ">How to write omedy</p>

<p bp ="a">You want to write omedy? Well it's easy. I've just done it and didn't

even use a spell he k. I'd definitely suggest starting with a ' ' rather than a 'k'

as that sounds either wa ky or German and either way there will be prejudi es about

what type of humour it is. Oh, sorry, omedy writing? I see. Well first try not

doing the most God-awful joke you an in your very first senten e. That's a definite

no-no. To be honest, I an't tell you how to write omedy as su h. I'm very mu h a

subs riber to the 'either you're funny or you're not' party. However even if you have

the invite to that very party, there's a high han e you might not know quite how to

utilise the orre t hat for the kit hen, or the right outfit to wear, in the same way

I've never worked out how to do a de ent analogy.</p>

<p bp ="a">What I'm saying is that if you're a funny hap or hapette, there are some

things that definitely hone those powers into a neat bit of witty prose, rather than a

bundle of nearly funny ons.</p>

Annotation for Badness, boilerplate

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 30/32

Page 32: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Design de isions and non-destru tive normalization

Bottom line

�Web orpus leansing� is a destru tive pro ess.

Even orpus designers a hieve only medio re agreement

w. r. t. the �quality orpus material�/�noise� de ision.

Better guidelines just for e orpus users to live

with more spe i� problemati destru tive de isions.

Strategy: Leave as mu h as possible in the orpus

and provide do umented and intuitive quality annotation.

Badness is su h a metri .

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 31/32

Page 33: The Good, the Bad, and the Hazy: Design decisions in web ... · Design decisions in eb w rpus co construction Rating exp eriment e W re a here.. ext T y qualit Rating exp eriment

Design de isions in web orpus onstru tion

Design de isions and non-destru tive normalization

Referen es I

M. Baroni, S. Bernardini, A. Ferraresi, and E. Zan hetta. The WaCky Wide Web: A olle tion of very

large linguisti ally pro essed web- rawled orpora. Language Resour es and Evaluation, 43(3):

209�226, 2009.

P. S. Bayerl and K. I. Paul. What determines inter- oder agreement in manual annotations? a

meta-analyti investigation. Computational Linguisti s, 37(4):699�725, 2011.

C. Biemann, F. Bildhauer, S. Evert, D. Goldhahn, U. Quastho�, R. S häfer, J. Simon, L. Swiezinski,

and T. Zes h. S alable onstru tion of high-quality web orpora. Spe ial issue of JLCL, 2013. In

prep. The list of authors is preliminary and might re�e t neither the order nor the a tual list in the

printed version.

J. Carletta. Assessing agreement on lassi� ation tasks: The kappa statisti . Computational Linguisti s,

22(2):249�254, 1996.

G. Grefenstette. Comparing two language identi� ation s hemes. In Pro eedings of the 3rd Internation

onferen e on Statisti al Analysis of Textual Data (JADT 1995), pages 263�268, Rome, 1995.

K. Krippendor�. Content Analysis: An Introdu tion to its Methodology. Sage Publi ations, Beverly

Hills, 1980.

R. S häfer and F. Bildhauer. Building large orpora from the web using a new e� ient tool hain. In

N. Calzolari, K. Choukri, T. De ler k, M. U. Do§an, B. Maegaard, J. Mariani, J. Odijk, and

S. Piperidis, editors, Pro eedings of the Eight International Conferen e on Language Resour es and

Evaluation (LREC'12), pages 486�493, Istanbul, 2012. ELRA.

R. S häfer, A. Barbaresi, F. Bildhauer 2013, German Grammar and Linguisti s (FU Berlin) 32/32