motivation ml vandalism - webis.de · conclusion & outlook •vandalism can reduce the quality...

1
Motivation Related work Concentrates on unstructured knowledge bases Solution Idea Context Problems Corpus Construction SIGIR '15 - Santiago de Chile [email protected] Towards Vandalism Detection in Knowledge Bases: Corpus Construction and Analysis Stefan Heindorf Martin Potthast Benno Stein Gregor Engels Corpus Analysis Conclusion & Outlook Vandalism can reduce the quality of knowledge bases First standardized corpus to study vandalism in structured knowledge bases Next step: detect vandalism automatically Download our corpus! Manual Validation 1,000 rollbacked revision Manually reviewed 1,000 undone/reverted revisions 1,000 inconspicuous revisions Option 1: Rollback (by administrators) 86 ± 3 % * revisions indeed vandalism Option 2: Undo/Restore (by all users) 62 ± 3 % * revisions indeed vandalism * 95 % confidence level Overview of Corpus 24 million revisions 103 thousand vandalism revisions Corpus Revisions over Time What is vandalized? Who vandalizes? For our corpus: Rollbacked revisions are considered vandalism Information systems structured knowledge bases Patrollers are busy Vandalism is not detected in time Some people vandalize those knowledge bases ML Machine learning to detect vandalism Vandalism corpus Vandalism? Revision 1: yes Revision 2: no use 40% of vandalism in structured data New features beneficial 86 % of vandalism by unregistered users 57% 3% 8% 32% Top vandalized items Top items by category Unregistered Registered Cases Item title 47 Cristiano Ronaldo 43 Lionel Messi 43 One Direction 41 Portal:Featured content 34 Justin Bieber 33 Barack Obama 29 English Wikipedia 29 Selena Gomez Contributions Wikidata Vandalism Corpus WDVC-2015 Corpus analysis Automatic Revision Labeling Wikidata revision history Only non-bot revisions considered Goal: Automatic labeling as vandalism/non-vandalism Option 1: Rollback (by administrators) 103,205 rollbacked revisions Option 2: Undo/Restore (by all users) 64,820 undone/reverted revisions 1 2 3 4 2 5 revert 20% 20% 16% 14% 13% 9% 8% 1% 12% 21% 9% 15% 8% 4% 31% 1% Culture People Society Nature Meta items Technology Places Other Top 1,000 vandalized items Top 1,000 of all items http:// www.uni-weimar.de/en/media/ chairs /webis/corpora/corpus-wdvc-15/

Upload: others

Post on 30-Oct-2019

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Motivation ML Vandalism - webis.de · Conclusion & Outlook •Vandalism can reduce the quality of knowledge bases •First standardized corpus to study vandalism in structured knowledge

Motivation

Related workConcentrates on unstructured knowledge bases

Solution IdeaContext Problems

Corpus Construction

SIGIR '15 - Santiago de Chile [email protected]

Towards Vandalism Detection in Knowledge Bases:Corpus Construction and Analysis

Stefan Heindorf Martin Potthast Benno Stein Gregor Engels

Corpus Analysis

Conclusion & Outlook• Vandalism can reduce the quality of knowledge bases• First standardized corpus to study vandalism in structured knowledge bases• Next step: detect vandalism automatically

Downloadour corpus!

Manual Validation1,000 rollbacked revision

Manually reviewed 1,000 undone/reverted revisions1,000 inconspicuous revisions

Option 1: Rollback (by administrators)86 ± 3 %* revisions indeed vandalism

Option 2: Undo/Restore (by all users)62 ± 3 %* revisions indeed vandalism

* 95 % confidence level

Overview of Corpus• 24 million revisions• 103 thousand vandalism revisions

Corpus Revisions over Time

What is vandalized? Who vandalizes?

For our corpus:Rollbacked revisions are considered vandalism

Information systems structuredknowledge bases

Patrollers are busy

Vandalism is notdetected in time

Some people vandalize those knowledge bases MLMachine learning

to detect vandalismVandalismcorpus

Vandalism?Revision 1: yesRevision 2: no

use

40% of vandalismin structured data

New features beneficial86 % of vandalism

by unregistered users

57%

3%8%

32%

Top vandalized items Top items by category

Unregistered

Registered

Cases Item title47 Cristiano Ronaldo43 Lionel Messi43 One Direction41 Portal:Featured content34 Justin Bieber33 Barack Obama29 English Wikipedia29 Selena Gomez

Contributions• Wikidata Vandalism Corpus WDVC-2015• Corpus analysis

Automatic Revision Labeling• Wikidata revision history• Only non-bot revisions considered• Goal: Automatic labeling as vandalism/non-vandalism

Option 1: Rollback (by administrators)103,205 rollbacked revisions

Option 2: Undo/Restore (by all users)64,820 undone/reverted revisions

1

2

3

4

2

5

revert

20%

20%

16%

14%

13%

9%

8%

1%

12%

21%

9%

15%

8%

4%

31%

1%

Culture

People

Society

Nature

Meta items

Technology

Places

Other

Top 1,000 vandalized itemsTop 1,000 of all items

http://www.uni-weimar.de/en/media/

chairs/webis/corpora/corpus-wdvc-15/