Analysing the Usage of Wikipedia on Twitter: Understanding Inter-Language Links
HICSS 49, January 8th, 2016
Eva Zangerle, Georg Schmidhammer, Günther Specht
University of Innsbruck, Austria
2. Motivation: Why this work matters
• Wikipedia is a central source of information
• 450 million users per month, 277 editions
• Research has focused on intrinsic factors
  • community
  • content
  • quality
• What about extrinsic factors?
4. Our Vision: Extrinsic Quality Measures
• Inter-language link analysis
6. Previous Research
Eva Zangerle, Georg Schmidhammer and Günther Specht. #Wikipedia on Twitter: Analyzing Tweets About Wikipedia. In Proceedings of the 11th International Symposium on Open Collaboration, OpenSym ’15, pages 14:1–14:8, New York, NY, USA, 2015. ACM.
• Extrinsic view on Wikipedia via Twitter
• 20% of all tweets lead to a Wikipedia other than the tweet's language (except for English and Japanese)
7. Research Questions
RQ1: How are inter-language links distributed among the different Wikipedias?
RQ2: What causes users to link to a Wikipedia other than the one in their own language?
[Workflow diagram: Crawl Twitter, Extract Links, Clean Data, Crawl Wikipedia, Quality Analyses]
9. Crawling
• Twitter API
• Search for keyword "wikipedia"
• 2014/10/20 – 2015/04/28
• 6,415,762 tweets in total
• Extraction of links from tweets
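The link-extraction step could look roughly like the following. This is a minimal sketch, not the authors' implementation: it assumes the tweet text already contains expanded (non-shortened) URLs, and the names `WIKI_URL` and `extract_wikipedia_links` are made up for illustration.

```python
import re
from urllib.parse import unquote

# Matches desktop and mobile article URLs such as
# https://en.wikipedia.org/wiki/Black_Monday_(1987)
WIKI_URL = re.compile(
    r"https?://(?P<lang>[a-z][a-z-]*)\.(?:m\.)?wikipedia\.org/wiki/(?P<title>[^\s?#]+)",
    re.IGNORECASE,
)


def extract_wikipedia_links(tweet_text):
    """Return (language edition, article title) pairs found in a tweet."""
    links = []
    for match in WIKI_URL.finditer(tweet_text):
        lang = match.group("lang").lower()
        if lang == "www":
            # www.wikipedia.org is the portal, not a language edition
            continue
        # Decode percent-escapes and turn underscores into spaces
        title = unquote(match.group("title")).replace("_", " ")
        links.append((lang, title))
    return links
```

Comparing the extracted language edition with the tweet's language then tells whether a link is an inter-language link. Punctuation directly following a URL would end up in the title, so real data would need additional cleanup.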
10. Cleaning Data
• Filter out tweets that contain no Wikipedia URL
• Bots contained in dataset
  • 99th percentile (>130 tweets)
  • BotOrNot Detection Service for 1,083 accounts
  • users and tweets deleted from dataset
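One reading of the bot-removal step is sketched below: accounts above the 99th percentile of tweets per user become candidates, a detection service checks them, and confirmed bots are dropped together with their tweets. The function names, the tweet layout (`{"user": ...}`) and the `is_bot` callable are assumptions for illustration. `is_bot` merely stands in for a bot-detection service and does not reflect BotOrNot's actual interface.

```python
from collections import Counter


def percentile_threshold(counts, pct=99):
    """Nearest-rank percentile of per-user tweet counts."""
    ordered = sorted(counts)
    if not ordered:
        raise ValueError("no counts given")
    # Ceiling of pct% of n, computed with integer arithmetic
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def bot_candidates(tweet_authors, pct=99):
    """Users who tweeted more often than the pct-th percentile."""
    per_user = Counter(tweet_authors)
    threshold = percentile_threshold(per_user.values(), pct)
    return {user for user, n in per_user.items() if n > threshold}


def remove_confirmed_bots(tweets, candidates, is_bot):
    """Drop all tweets of candidates that the (hypothetical) detector confirms."""
    confirmed = {user for user in candidates if is_bot(user)}
    return [t for t in tweets if t["user"] not in confirmed]
```

Checking only the high-volume accounts keeps the number of calls to the detection service small.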
11. Cleaning Data
Feature              Raw          Cleaned
Tweets               6,415,762    2,844,399
Retweets             2,040,816      855,959
Distinct Users       2,287,430    1,092,732
Mentions             4,673,284    2,437,092
Distinct Hashtags      213,574      127,958
Hashtag Usages       2,283,535      788,210
Distinct URLs        1,976,479    1,179,288
URL Usages           4,825,230    3,130,420
12. Crawling Wikipedia
• MediaWiki API
• Resolution of the revision ID for the time the tweet was sent
• Crawling of
  • article
  • headings
  • wikilinks
  • references
  • images
• Last 500 edits
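Resolving the revision that was live when a tweet was sent can be expressed as a MediaWiki `action=query` request with `prop=revisions`. The sketch below only builds the request URL. The helper name `revision_query_url` is an assumption, and the slides do not state which parameters the authors actually used.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def revision_query_url(lang, title, tweeted_at, limit=1):
    """Build a MediaWiki API URL for the newest revision(s) at or before tweeted_at."""
    if tweeted_at.tzinfo is None:
        raise ValueError("tweeted_at must be timezone-aware")
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": limit,
        "rvdir": "older",  # enumerate from rvstart backwards in time
        "rvstart": tweeted_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "rvprop": "ids|timestamp",
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"
```

With `limit=1` the response would contain the single revision current at tweet time. A larger limit such as 500 would return that revision plus the edits preceding it, which is one possible way to obtain an edit history.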
13. Quality Measures
1. Article length
2. Number of references (absolute)
3. Number of references (relative)
4. Diversity
5. Number of headings (absolute)
6. Number of headings (relative)
7. Informativeness
8. Number of images (relative)
9. Number of wikilinks (relative)
10. Currency
11. HasInfoBox
12. Complexity (Flesch-Kincaid)
Warncke-Wang, M., Cosley, D., and Riedl, J. "Tell Me More: An Actionable Quality Model for Wikipedia", in Proceedings of WikiSym 2013.
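The "relative" measures can be illustrated with a small sketch. It assumes that relative means the raw count divided by article length, and that length is measured in characters. The slides do not spell out either point, and `ArticleStats` and `relative_measures` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class ArticleStats:
    length: int      # article length, assumed here to be in characters
    references: int
    headings: int
    images: int
    wikilinks: int


def relative_measures(stats):
    """Normalise raw counts by article length (assumed definition of 'relative')."""
    if stats.length <= 0:
        raise ValueError("article length must be positive")
    return {
        "references_rel": stats.references / stats.length,
        "headings_rel": stats.headings / stats.length,
        "images_rel": stats.images / stats.length,
        "wikilinks_rel": stats.wikilinks / stats.length,
    }
```

Normalising by length makes a short article and a long one comparable, since a long article has more references simply by virtue of its size.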
Results
15. RQ1: Distribution of (Inter-Language) Links
Top 3 inter-language targets:
• English: 62.68%
• Japanese: 6.26%
• Spanish: 5.76%
16. RQ2: Causes for Inter-Language Links
85% do not have a counterpart in the tweet's language (out of 691,424 inter-language links)
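Whether a tweeted article has a counterpart in the tweet's language could be checked through the MediaWiki API's `prop=langlinks` query, restricted to one language with `lllang`. The following is a sketch under that assumption. The helper names are made up, and the slides do not say how the authors performed this lookup.

```python
from urllib.parse import urlencode


def langlinks_query_url(source_lang, title, target_lang):
    """Build a MediaWiki API URL asking for the article's link into target_lang."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllang": target_lang,  # only return the link for this language
        "format": "json",
    }
    return f"https://{source_lang}.wikipedia.org/w/api.php?{urlencode(params)}"


def has_counterpart(api_response):
    """True if any page in a parsed JSON response carries a language link."""
    pages = api_response.get("query", {}).get("pages", {})
    # Pages arrive as a dict keyed by page id or as a list, depending on format version
    page_iter = pages.values() if isinstance(pages, dict) else pages
    return any(page.get("langlinks") for page in page_iter)
```

A page without a counterpart simply lacks the `langlinks` entry in the response, so the check does not depend on how the linked title itself is keyed.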
17. RQ2: Causes for Inter-Language Links
Remaining 15%: Could article quality be an issue?
• Originally posted: https://en.wikipedia.org/wiki/Black_Monday_(1987)
• Counterpart: https://es.wikipedia.org/wiki/Lunes_negro_(1987)
20. RQ2: Causes for Inter-Language Links
• Remaining 99,776 articles: apply 12 quality measures to all originally posted articles and their counterparts
• Group articles into language pairs (original and counterpart language)
• For each article in a language pair, count the number of measures on which the original article performs better than its counterpart, and vice versa (result: two vectors)
• Wilcoxon signed rank test for each language pair
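The grouping and counting steps above can be sketched as follows. The sketch assumes every measure is stored as a number where a higher value means better quality, which may not hold for measures such as complexity or currency. The tuple layout and the names `count_wins` and `win_vectors` are illustrative.

```python
from collections import defaultdict


def count_wins(original, counterpart):
    """Count measures on which each side scores strictly higher (ties count for neither)."""
    original_wins = sum(1 for m in original if original[m] > counterpart[m])
    counterpart_wins = sum(1 for m in original if counterpart[m] > original[m])
    return original_wins, counterpart_wins


def win_vectors(article_pairs):
    """Group win counts by (original language, counterpart language).

    article_pairs yields tuples of
    (original_lang, counterpart_lang, original_measures, counterpart_measures),
    where the measures are dicts mapping measure name to value.
    """
    vectors = defaultdict(lambda: ([], []))
    for o_lang, c_lang, o_measures, c_measures in article_pairs:
        o_wins, c_wins = count_wins(o_measures, c_measures)
        vectors[(o_lang, c_lang)][0].append(o_wins)
        vectors[(o_lang, c_lang)][1].append(c_wins)
    return dict(vectors)
```

The two vectors of each language pair are paired samples, so they could be passed to a signed-rank test implementation such as `scipy.stats.wilcoxon`, assuming SciPy is available.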
21. RQ2: Causes for Inter-Language Links
For 58% of all language combinations, the article in the tweeted language is of significantly better quality (p < 0.05)
22. Dominating Languages
Target: better than (p < 0.05) [count]
• English: Spanish, Japanese, French, Korean, Italian, German, Arabic, Indonesian, Portuguese, Dutch, Turkish, Swedish, Thai, Polish, Romanian, Finnish, Danish, Norwegian, Farsi, Welsh, Hindi, Bulgarian, Latvian, Bosnian, Slovak, Hungarian, Slovenian, Lithuanian [28]
• French: English, Japanese, Spanish [3]
• Spanish: English, Italian [2]
• Catalan: English, Portuguese [2]
• German: English [1]
• Japanese: German [1]
• Portuguese: Spanish [1]
• Turkish: English [1]
23. Dominating Languages
• Most dominating target languages are English, Spanish, Japanese
  • most extensive Wikipedias
  • most active Wikipedias
• More elaborate, mature articles than in the user's language
24. Quality Measures
66% of all articles tweeted feature a significantly higher quality for all twelve quality measures (p < 0.001)
25. Quality Measures
97% of all articles tweeted feature a significantly higher quality for more than six quality measures (p < 0.001)
26. Conclusion
85% of all inter-language links: no counterpart available
Articles tweeted are of significantly higher quality (with English, Japanese and German dominating)
Users deliberately tweet articles of higher quality
Questions?
any coffee break
Contact
www.evazangerle.at
http://dbis-informatik.uibk.ac.at
https://www.facebook.com/dbisibk