analysing the usage of wikipedia on twitter: understanding inter-language links

Post on 13-Apr-2017


Analysing the Usage of Wikipedia on Twitter: Understanding Inter-Language Links

HICSS 49, January 8th, 2016

Eva Zangerle, Georg Schmidhammer, Günther Specht
University of Innsbruck, Austria

Motivation
Why this work matters…

• Wikipedia: central source of information
• 450 million users per month, 277 editions
• Research focused on intrinsic factors
  • community
  • content
  • quality

• What about extrinsic factors?

Our Vision: Extrinsic Quality Measures

Inter-language Link Analysis

Previous Research

Eva Zangerle, Georg Schmidhammer and Günther Specht. #Wikipedia on Twitter: Analyzing Tweets About Wikipedia. In Proceedings of the 11th International Symposium on Open Collaboration, OpenSym ’15, pages 14:1–14:8, New York, NY, USA, 2015. ACM.

• Extrinsic view on Wikipedia via Twitter

• 20% of all tweets link to a Wikipedia in a language other than the tweet's language (except for English and Japanese)

Research Questions

How are inter-language links distributed among the different Wikipedias?

What causes users to link to a Wikipedia other than the one in their language?

Crawl Twitter → Extract Links → Clean Data → Crawl Wikipedia → Quality Analyses

Crawling

• Twitter API
  • Search for keyword "wikipedia"
• 2014/10/20 – 2015/04/28
• 6,415,762 tweets in total
• Extraction of links from tweets
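The link-extraction step can be sketched as follows. This is only an illustration, not the authors' crawler: the regular expression, the function name `wikipedia_links`, and the assumption that URLs are already expanded (not shortened `t.co` links) are not taken from the slides.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern: any http(s) URL up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")

def wikipedia_links(tweet_text):
    """Return (language_code, url) pairs for Wikipedia URLs found in a tweet."""
    links = []
    for url in URL_RE.findall(tweet_text):
        host = (urlparse(url).hostname or "").lower()
        parts = host.split(".")
        # Accept hosts such as "en.wikipedia.org" or "es.m.wikipedia.org";
        # the first label is taken as the language edition.
        if len(parts) >= 3 and parts[-2:] == ["wikipedia", "org"] and parts[0] != "www":
            links.append((parts[0], url))
    return links

if __name__ == "__main__":
    print(wikipedia_links("See https://en.wikipedia.org/wiki/Black_Monday_(1987)"))
```

Comparing the returned language code with the tweet's language is one way to flag a link as inter-language.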

Cleaning Data

• Remove tweets that contain no Wikipedia URL
• Bots contained in dataset
  • 99th percentile (>130 tweets)
  • BotOrNot Detection Service for 1,083 accounts
  • users and tweets deleted from dataset
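A minimal sketch of the percentile step, assuming a nearest-rank percentile over tweets per user (the slides do not say which percentile definition was used). The function name `heavy_users` is hypothetical, and the BotOrNot check that follows is not shown.

```python
import math
from collections import Counter

def heavy_users(tweet_authors, percentile=99):
    """Return (threshold, users) where users tweeted more than the percentile threshold.

    tweet_authors: iterable with one author name per tweet.
    """
    counts = Counter(tweet_authors)
    ordered = sorted(counts.values())
    # Nearest-rank percentile (an assumption): smallest count such that at
    # least `percentile` percent of users have that many tweets or fewer.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    threshold = ordered[rank - 1]
    suspects = {user for user, n in counts.items() if n > threshold}
    return threshold, suspects

if __name__ == "__main__":
    authors = [f"user{i}" for i in range(99)] + ["bot"] * 200
    print(heavy_users(authors))
```

Only the accounts returned here would then need to be passed to a bot-detection service, which keeps the number of lookups small.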

Cleaning Data

Feature            Raw        Cleaned
Tweets             6,415,762  2,844,399
Retweets           2,040,816    855,959
Distinct Users     2,287,430  1,092,732
Mentions           4,673,284  2,437,092
Distinct Hashtags    213,574    127,958
Hashtag Usages     2,283,535    788,210
Distinct URLs      1,976,479  1,179,288
URL Usages         4,825,230  3,130,420

Crawling Wikipedia

• MediaWiki API
• Resolution of the revision ID for the time the tweet was sent
• Crawling of
  • article
  • headings
  • wikilinks
  • references
  • images
• Last 500 edits
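One way to resolve the revision that was current when a tweet was sent is the MediaWiki `action=query` module with `prop=revisions`. The sketch below only builds the request URL. The slides do not state which parameters were used, so this parameter choice and the function name `revision_query_url` are assumptions.

```python
from urllib.parse import urlencode

def revision_query_url(lang, title, tweet_time_iso):
    """Build a MediaWiki API URL for the newest revision at or before a timestamp."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvlimit": 1,                 # only the single closest revision
        "rvdir": "older",             # enumerate from rvstart backwards in time
        "rvstart": tweet_time_iso,    # e.g. "2015-01-15T12:00:00Z"
        "rvprop": "ids|timestamp",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"

if __name__ == "__main__":
    print(revision_query_url("en", "Black Monday (1987)", "2015-01-15T12:00:00Z"))
```

The returned revision ID can then be used to fetch the article content as it stood at tweet time rather than its current state.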

Quality Measures

1. Article length
2. Number of references (absolute)
3. Number of references (relative)
4. Diversity
5. Number of headings (absolute)
6. Number of headings (relative)
7. Informativeness
8. Number of images (relative)
9. Number of wikilinks (relative)
10. Currency
11. HasInfoBox
12. Complexity (Flesch Kincaid)

Warncke-Wang, M., Cosley, D., and Riedl, J. "Tell Me More: An Actionable Quality Model for Wikipedia", in Proceedings of WikiSym 2013.
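Measure 12 names Flesch Kincaid. The slides do not say whether the Reading Ease or the Grade Level score was used, nor how words, sentences and syllables were counted. As an illustration only, the standard Flesch-Kincaid Grade Level formula can be computed from pre-counted totals (the function name is hypothetical):

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from pre-computed counts.

    Counting words, sentences and syllables is left to the caller;
    the slides do not describe how this was done.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

if __name__ == "__main__":
    # 100 words in 5 sentences with 150 syllables
    print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

Higher values indicate text that is harder to read.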

Results

RQ1: Distribution of (Inter-language) Links

Top 3 inter-language targets:
• 62.68% English
• 6.26% Japanese
• 5.76% Spanish
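Target shares like these can be derived from pairs of tweet language and linked Wikipedia language. The following is a sketch under that assumption; the pair representation and the function name are illustrative and not taken from the slides.

```python
from collections import Counter

def interlanguage_target_shares(pairs):
    """Share of each target Wikipedia among inter-language links.

    pairs: iterable of (tweet_language, wikipedia_language) codes.
    A link counts as inter-language when the two codes differ.
    """
    targets = Counter(wiki for tweet, wiki in pairs if tweet != wiki)
    total = sum(targets.values())
    if total == 0:
        return {}
    # most_common() orders targets from most to least frequent.
    return {lang: count / total for lang, count in targets.most_common()}

if __name__ == "__main__":
    sample = [("de", "en"), ("de", "de"), ("es", "en"), ("fr", "ja"), ("en", "en")]
    print(interlanguage_target_shares(sample))
```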

RQ2: Causes for Inter-language Links

85% of inter-language links point to an article that has no counterpart in the tweet's language (out of 691,424 inter-language links)

RQ2: Causes for Inter-language Links

Remaining 15%: Could article quality be an issue?

originally posted: https://en.wikipedia.org/wiki/Black_Monday_(1987)
counterpart: https://es.wikipedia.org/wiki/Lunes_negro_(1987)


RQ2: Causes for Inter-language Links

• Remaining 99,776 articles: apply 12 quality measures to all originally posted articles and their counterparts

• Group articles into language pairs (original and counterpart language)

• For each article in a language pair, count the number of measures on which the original article performs better than its counterpart, and vice versa (result: two vectors)

• Wilcoxon signed rank test for each language pair
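The test on the two count vectors of one language pair can be sketched with the standard library alone. This sketch uses the normal approximation with tie correction and drops zero differences, which is rough for small samples. In practice a statistics package such as SciPy (`scipy.stats.wilcoxon`) would be the usual choice. The slides do not state which implementation or options were used, and the function name is hypothetical.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    x[i], y[i]: number of measures on which the original article (x) and its
    counterpart (y) perform better, for article i of one language pair.
    Returns (w_plus, w_minus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1 for a tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return w_plus, w_minus, p_value

if __name__ == "__main__":
    original = [8, 7, 9, 6, 10]
    counterpart = [4, 5, 3, 6, 2]
    print(wilcoxon_signed_rank(original, counterpart))
```

A language pair would count as significant at the level used in the slides when the returned p-value is below 0.05.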

RQ2: Causes for Inter-language Links

For 58% of all language combinations, the article in the tweeted language is of significantly better quality (p < 0.05)

Dominating Languages

Target      Better than (p < 0.05)                                       Count
English     Spanish, Japanese, French, Korean, Italian, German, Arabic,  28
            Indonesian, Portuguese, Dutch, Turkish, Swedish, Thai,
            Polish, Romanian, Finnish, Danish, Norwegian, Farsi, Welsh,
            Hindi, Bulgarian, Latvian, Bosnian, Slovak, Hungarian,
            Slovenian, Lithuanian
French      English, Japanese, Spanish                                   3
Spanish     English, Italian                                             2
Catalan     English, Portuguese                                          2
German      English                                                      1
Japanese    German                                                       1
Portuguese  Spanish                                                      1
Turkish     English                                                      1

Dominating Languages

• Most dominating target languages are English, Spanish, Japanese
  • most extensive Wikipedias
  • most active Wikipedias

More elaborate, mature articles than in the user's language

Quality Measures

66% of all articles tweeted feature a significantly higher quality for all twelve quality measures (p < 0.001)

Quality Measures

97% of all articles tweeted feature a significantly higher quality for more than six quality measures (p < 0.001)

Conclusion

85% of all inter-language links: no counterpart available

Articles tweeted are of significantly higher quality (with English, Japanese and German dominating)

Users deliberately tweet articles of higher quality

Questions?

any coffee break

Contact

@eva_zangerle
eva.zangerle@uibk.ac.at
http://www.evazangerle.at
http://dbis-informatik.uibk.ac.at
https://www.facebook.com/dbisibk
