arxiv:2005.06012v1 [cs.si] 2 may 2020mega-cov: a billion-scale dataset of 65 languages for covid-19...

Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19 ∗

Muhammad Abdul-Mageed, AbdelRahim Elmadany, Dinesh Pabbi, Kunal Verma, Rannie Lin{muhammad.mageed,a.elmadany}@ubc.ca; [email protected]

[email protected], [email protected] Language Processing LabUniversity of British Columbia

Abstract

We describe Mega-COV, a billion-scaledataset from Twitter for studying COVID-19.The dataset is diverse (covers 268 countries),longitudinal (goes as back as 2007), multilin-gual (comes in 100+ languages), and has asignificant number of location-tagged tweets(∼ 169M tweets). We release tweet IDs fromthe dataset. We also develop and release a pow-erful model (acc=94%) for teasing apart mes-sages related to the pandemic from those thatare not. Our data and models can be usefulfor studying various phenomena related to thepandemic, and accelerating viable solutions toassociated problems. Our data and models arepublicly available for research. 1

1 Introduction

The seeds of the coronavirus disease 2019 (COVID-19) pandemic are reported to have started as a localoutbreak in Wuhan (Hubei, China) in December,2019, but soon spread around the world (WHO,2020). As of June 3, 2020, the number of confirmedcases around the world is estimated at 6,438,254. 2

In response to this ongoing public health emer-gency, researchers are mobilizing to track the pan-demic and study its impact not only on human life,but possibly on all sorts of life in our planet. Thedifferent ways the pandemic has its footprint onour lives is a question that will probably be studied

∗A previous version of the current manuscript appearedunder the title “Mega-COV: A Billion-Scale Dataset of 65Languages for COVID-19” on ArXiv.

1Data and models are available at: https://github.com/UBC-NLP/megacov.

2Source the Center for Systems Science andEngineering (CSSE) at Johns Hopkins University(JHU) dashboard at: https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6. TheCSSE source provides real time updates on the locationand number of confirmed COVID-19 cases. The numberof confirmed cases witnessed exponential growth in somecountries. Overall, the growth has also been fast. For example,in April 7, when we first documented that number in thecurrent manuscript, the number of cases was at 1,390,511.

for years to come. Enabling scholarship on the

Figure 1: Global coverage of Mega-COV.V0.2 basedon our geo-located data. Each dot is a city. Contiguouscities of the same color belong to the same country.

topic by providing relevant data is an importantendeavor. Toward this goal, we collect and releaseMega-Cov, a billion-scale multilingual Twitterdataset with geo-location information.

As several countries and regions around theworld went into lockdown, the public health emer-gency has restricted physical aspects of humancommunication considerably. As hundreds of mil-lions of people spend more time at home, communi-cation over social media becomes more importantthan what it has ever been. In particular, the con-tent of social media communication promises tocapture useful aspects of the lives of the millionsof people involved. Mega-Cov is intended as arepository of such a content.

While other early efforts to collect Twitter dataare ongoing, our goal is to complement these exist-ing resources in significant ways. More specifically,we designed our methods to harvest a dataset thatis unique in multiple ways, as follows:Massive Scale: Very large datasets lend them-selves to analyses that are not possible with smallerdata. Given the global nature of COVID-19, werealize that a large-scale dataset will be mostuseful as the scale allows for slicing and dicingthe data across different times, communities, lan-guages, and regions that are not possible other-

arX

iv:2

005.

0601

2v2

[cs

.SI]

7 A

ug 2

020

https://github.com/UBC-NLP/megacov


https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6



wise. For this reason, we dedicated significantresources to harvesting and preparing the dataset.Our initial release of Mega-COV, which we re-ferred to as Mega-COV V0.2, had more focuson North America. Mega-COV V0.2, our cur-rent version, has solid international coverage andbrings data from 1M users from 268 countries (seeSection 3.1). Overall, Mega-COV.V0.2 has ∼1.5B tweets (Section 2). This is ∼ 1.5 order ofmagnitude larger than #COVID-19 (Chen et al.,2020), the largest dataset we know of (∼ 144Mtweets as of May 30, 2020).Topic Diversity: We do not restrict our collectionto tweets carrying certain hashtags. This makes thedata general enough to involve content and topicsdirectly related to COVID-19, regardless of exis-tence of accompanying hashtags. This also allowsfor investigating themes that may not be directlylinked to the pandemic but where the pandemicmay have some bearings which should be takeninto account when investigating such themes. Sec-tion 3.4 and Section 3.5 provide a general overviewof issues discussed in the dataset.Longitudinal Coverage: We collect multiple datapoints (up to 3,200) from the same users, with agoal to allow for comparisons between the presentand the past across the same users, communities,and geographical regions (Section 3.2).Language Diversity: Since our method of collec-tion targets users, rather than hashtag-based con-tent, Mega-COV is linguistically diverse. In theory,the dataset should comprise any language posted toTwitter by a user whose data we have collected.Based on Twitter-assigned language codes, weidentify a total of 65 languages. We use a languageid tool to tag messages whose language is not iden-tified by Twitter and find that our collection hastweets in more than 104 languages. (Section 3.3).No Distribution Shift: Related to the two previ-ous points, but from a machine learning perspec-tive, collecting the data without conditioning onexistence of specific (or any) hashtags avoids intro-ducing distribution bias. In other words, the datacan be used to study various phenomena in-the-wild. This warrants more generalizable findingsand models.A dataset as large as Mega-COV can be hard tonavigate. In particular, an informative descriptionof the dataset is necessary for navigating it. Inthis paper, we provide an explanation of a numberof global aspects of the dataset, including its ge-

ographic, temporal, and linguistic coverage. Wealso provide a high-level content analysis of thedata, and explore user sharing of domains with afocus on news media. We also develop a powerfulmultilingual (65 languages) model for detectingtweets related to COVID-19. In the context of ourinvestigation of the Mega-COV V0.2, we makean array of important discoveries. For example,we strikingly discover that, perhaps for the firsttime in Twitter history, users address one anotherand retweet more than they post tweets. We alsofind a noticeable rise in ranks for news sites (basedon how frequent their domains are shared) during2020 as compared to 2019, with a shift towardglobal (rather than local) news media. A third find-ing is how use of the blogging platform surged inMarch, perhaps making it the busiest time in thehistory of the network.The rest of the paper is organized as follows: InSection 2, we describe our data collection meth-ods. Section 3 is where we investigate geographic,linguistic, temporal, and content aspects of ourdata. We provide a case study of human mobilityin Section 4. We describe our model for detectingCOVID-19 tweets in Section 5 and detail data re-lease format and ethical considerations in Section6.We provide a literature review in Section 7, and weconclude in Section 8. We now introduce our datacollection method.

2 Data Collection

Tweets Retweets Replies All2007-2020 612M 507M 369M 1.5B2020 122M 174M 129M 425MUsers 1M 976K 994K 1M

Table 1: Distribution of tweets, retweets, and replies inMega-COV V0.2. (Numbers are rounded).

To collect a sufficiently large dataset, we putcrawlers using the Twitter streaming API3 onAfrica, Asia, Australia, Europe, North America,and South America starting in early January, 2020.This allowed us to acquire a diverse set of tweetsfrom which we can extract a random set of userIDs whose timelines (up to 3,200 tweets) wethen iteratively crawled every two weeks. Thisgave us data from May 15th, 2020 backwards,depending on how prolific of a poster a useris (see Table 4 for a breakdown.). Mega-COV

3Streaming API is accessible at: https://github.com/tweepy/tweepy

https://github.com/tweepy/tweepy

https://github.com/tweepy/tweepy

V0.2, our current release, comprises a total of1, 023, 972 users who contribute 1, 487, 328, 805tweets. For each tweet, we collect the whole jsonobject. This gives us access to information such asuser location and the language tag Twitter assignsto each tweet. We then use the data streamingand processing engine, Spark, to merge all userfiles and run our analyses. To capture a widerange of behaviors, we keep tweets, re-tweets,and direct user-to-user interactions (responses)in the data. Table 1 offers a breakdown ofthe distribution of the different categories ofmessages in Mega-COV V0.2. Tweet IDs ofthe dataset are available at our GitHub 4 and canbe downloaded for research. For future releases,our plan is to update the dataset repository monthly.

3 Exploring Mega-COV

3.1 Geographic Diversity

A region from which a tweet is posted can be as-sociated with a specific ‘point’ location or a Twit-ter place with a ‘bounding box’ that describes alarger area such as city, town, or country. Werefer to tweets in this category as geo-locatedtweets. A smaller fraction of tweets are alsogeo-tagged with longitude and latitude. As Ta-ble 2 shows, Mega-COV V0.2 has ∼ 187M geo-located tweets from ∼ 740K users and ∼ 31Mgeo-tagged tweets from ∼ 267K users. Table 2also shows the distribution of tweets and users overthe top two countries represented in the dataset, theU.S. and Canada, and other locations (summed upas one category, but see also Table 3 for the coun-tries in the data per continent). For the year 2020,Mega-COV V0.2 has∼ 66M geo-located tweetsfrom ∼ 670K users and ∼ 3M geo-tagged tweetsfrom∼ 109K users. 5 We note that significant partsfrom the data could still be posted from the differ-ent countries but just not geo-located in the originaljson files retrieved from Twitter. Figure 2 showsactual point co-ordinates of locations from whichthe data were posted. Overall, Mega-COV V0.2has data posted from a total of 167, 202 cities from268 countries. Figure 3 shows the distribution ofdata over countries. The top 5 countries in the dataare the U.S., Canada, Brazil, the U.K., and Japan.

4Accessible at: https://github.com/UBC-NLP/megacov.

5The dataset has ∼ 134K “locations” which we could notresolve to a particular country using only the json information.

Other top countries in the data across the variouscontinents are presented in Table 3 as mentionedearlier.

Figure 2: World map coverage of Mega-COV V0.2.Each dot is a point co-ordinate (longitude and latitude)from which at least one tweet was posted. Clearly,users tweet while traveling, whether by air or sea.

Figure 3: Geographical diversity in Mega-COVV0.2. We show the distribution of our geo-locateddata over the top 20 countries with most tweets and re-sponses. Overall, there are 268 countries in the data.

3.2 Temporal Coverage

Our goal is to make it possible to exploitMega-COV V0.2 for comparing user social con-tent over time. Since we crawl user timelines, thedataset comprises content going back as early as2007. Figure 4 shows the distribution of data overthe period 2007-2020. Simple frequency of userposting shows a surge in Twitter use in the periodof Jan-April 2020 compared to the same monthsin 2020. 6 compared to the same period in 2019(see Figure A.1 in supplementary material). Indeed,we identify 40.53% more posting during the first4 months of 2020 compared to the same period in

6We do not have enough data from the month of May tocompare to.



Geolocated Tweeted From Geotagged Tweeted FromCanada U.S. Other Canada U.S. Other

All-Time 186,939,854 16,459,655 70,756,282 99,723,917 31,392,563 3,600,952 11,449,400 16,342,211All-Users 739,645 102,388 327,213 463,673 266,916 47,622 117,096 165,8602020 65,584,908 3,331,720 24,259,973 37,993,215 2,942,675 246,185 1,187,131 1,509,3592020-Users 670,314 61,205 254,067 392,627 109,348 14,525 43,486 62,837

Table 2: Mega-COV V0.2 geolocated and geotagged users and their tweets over North America vs. Otherlocations. See also Table 3 for statistics over top countries by continent.

Figure 4: Distribution of Mega-COV V0.2 data 2007-2020

Figure 5: Twitter user activity for Jan-May, 2020.

2019. This is expected, both due to physical dis-tancing and a wide range of human activity (e.g.,shopping, work) moving online perhaps causingusers to be on their machines for longer times andalso a need to express self and share informationmore. The clear spike in the month of March 2020is striking. It is particularly so given the patternof use where retweeting and replying to others isobservably more frequent than tweeting. This es-pecially takes place during the month of Marchand somewhat continues in April, as in Figure 5.Figure 4 and Figure 5 also show a breakdown oftweets, re-tweets, and replies. A striking discoveryis that, for 2020, users are engaged in conversa-tions with one another (short as these typically arein Twitter) more than tweeting directly to the plat-form. This is the first time this happens comparedto any previous years, based on our dataset. Inaddition, for 2020, we also see users re-tweeting

Continent Country All 2020Geo-Located Users Geo-Located Users

Africa

Nigeria 1,876,879 16,220 1,057,742 14,872South Africa 1,503,181 9,751 692,367 6,373Egypt 873,079 8,840 452,738 5,900Ghana 373,996 3,942 202,470 3,089Kenya 373,667 4,480 172,796 3,026

Asia

Japan 7,646,901 32,038 2,752,890 23,773Turkey 5,067,118 32,589 1,463,550 25,477Indonesia 4,540,286 22,893 1,871,154 18,056Philippines 4,078,410 15,477 1,636,265 11,011India 3,107,917 33,931 1,576,549 27,940

Australia Australia 1,179,205 12,090 352,215 5,454

Europe

United Kingdom 11,714,012 70,787 2,970,848 44,420Spain 4,327,475 43,236 1,431,567 20,902France 2,030,523 36,017 729,500 12,497Italy 1,829,369 27,071 527,648 8,308Germany 1,272,339 24,215 385,306 7,412

North AmericaUnited States 69,515,949 327,213 23,578,430 254,067Canada 16,066,337 102,388 3,200,804 61,205Mexico 3,665,791 36,190 1,106,352 17,406

South America

Brazil 15,879,664 48,339 8,060,537 41,277Argentina 3,142,778 14,576 1,298,381 10,901Colombia 1,612,765 10,319 629,426 6,884Chile 1,003,459 6,212 378,770 3,674Ecuador 447,250 3,435 170,098 2,221

Table 3: Distribution of data over top countries per con-tinent in Mega-COV V0.2 (all data vs. 2020).

more than tweeting. This is also happening for thefirst time.

3.3 Linguistic Diversity

We perform the language analysis based on tweets(n=∼ 1.5B), including re-tweets and replies. Twit-ter assigns 65 language ids to∼ 1.4B tweets, whilethe rest are tagged as “und” (for “undefined”).Mega-COV V0.2 has ∼ 104M (∼ 7%) tweetstagged as “und”. We ran langid (Lui and Baldwin,2012) on undefined tweets to tag the rest of lan-guages. After we merge all our language tags fromTwitter and langid, we acquire a total of 104 tags,making Mega-COV V0.2 very linguistically rich.Table 4 shows the top 20 languages identified byTwitter (left) and the top 20 languages tagged bylangid after removing the 65 Twitter languages(right).

2019 2020Hashtag Freq Hashtag Freq

NewProfilePic 64,922 COVID19 260,024love 41,964 coronavirus 219,615Repost 39,128 NewProfilePic 102,724art 35,825 BBB20 91,775photography 34,874 質問箱 86,220music 28,335 Covid 19 70,106travel 28,236 COVID-19 67,737fashion 23,814 匿名質問募集中 56,345GameofThrones 21,484 Coronavirus 53,251nature 18,563 covid19 47,940instagood 18,491 StayHome 44,165photooftheday 18,032 NintendoSwitch 42,812tbt 17,332 love 42,497realestate 17,255 bbb20 39,974shopmycloset 16,760 NowPlaying 39,069GameOfThrones 16,127 Repost 37,036peing 15,930 AnimalCrossing 36,369質問箱 15,803 art 35,570fitness 15,623 ACNH 35,528food 15,358 photography 35,209BellLetsTalk 14,853 COVID2019 33,512NowPlaying 14,849 shopmycloset 31,428family 14,060 music 30,537style 14,041 StayAtHome 30,313SoundCloud 13,904 QuedateEnCasa 30,194WeTheNorth 13,579 stayhome 27,540GOT 13,458 PS4share 27,487np 13,335 SocialDistancing 27,376MyTwitterAnniv. 12,965 lockdown 27,344Toronto 12,964 TikTok 27,287

Table 5: Top 30 hashtags in Mega-COV.V0.2 for2019 vs. 2020.

Lang Freq Lang FreqEnglish (en) 900M Hebrew (he) 1.2MSpanish (es) 122M Croatian (hr) 685KPortuguese (pt) 79.5M Maltese (mt) 325KJapanese (ja) 46.6M Slovak (sk) 246KArabic (ar) 45M BI (id) 208KIndonesion (in) 37M Latin (la) 183KFrench (fr) 29.5M Bosnian (bs) 143KTurkish (tr) 28.5M Dzongkha (dz) 137.8KTagalog (tl) 19M Swahili (sw) 92KItalian (it) 8.8M Azerbaijani (az) 68.9KThai (th) 7.7M Quechua (qu) 61KHindi (hi) 7M Albanian (sq) 61KDutch (nl) 6.9M Malay (ms) 59KRussuian (ru) 6.2M Kinyarwanda (rw) 56.8KGerman (de) 6M Esperanto (eo) 55KCatalan (ca) 3.5M Javanese (jv) 53KKorean (ko) 2.9M,824 Xhosa (xh) 47.7KHaitian Creole (ht) 2.8M Irish (ga) 44.6KPolish (pl) 2.4M Kurdish (ku) 43KEstonain (et) 2.1M Volapuk (vo) 41Ka

Table 4: Top 20 languages assigned by Twitter (left)and top 20 languages assigned by langid (right) inMega-COV V0.2. BI: Bahasa Indonesia.

3.4 Hashtag Content Analysis

Hashtags usually correlate with the topics userspost about. We provide the top 30 hashtags in the

Domain Rank Domain Ranktheguardian.com ↑ 3 thehill.com ↑ 51nytimes.com ↑ 10 globeandmail.com ↓ -38cnn.com ↑ 18 businessinsdr.com ↑ 31apple.news ↑ 4 theatlantic.com ↑ 27washingtonpost.com ↑ 16 newsbreakapp.com ↑ 472cbc.ca ↓ -13 eldiario.es ↑ 62bbc.co.uk ↓ -4 apnews.com ↑ 48bbc.com ↑ 3 abc.es ↑ 89nyti.ms ↓ -11 reuters.com ↑ 59foxnews.com ↑ 51 thestar.com ↓ -64forbes.com ↓ -14 francebleu.fr ↑ 424nbcnews.com ↑ 39 globalnews.ca ↓ -78wsj.com ↑ 11 independent.co.uk ↓ -10bloomberg.com ↑ 13 elmundo.es ↑ 21ctvnews.ca ↑ 2 indiatimes.com l 0nypost.com ↑ 100 radio-canada.ca ↓ -66cnbc.com ↑ 43 lavanguardia.com ↑ 96usatoday.com ↑ 6 dailymail.co.uk ↑ 23latimes.com ↑ 23 politico.com ↑ 403huffpost.com ↑ 66 sky.com ↑ 114

Table 6: Top 40 domains in 2020 data and their rankchange relative to their rank in 2019.

data in Table 5. As the table shows, users tweetheavily about the pandemic using hashtags such asCOVID19, coronavirus, Coronavirus, COVID19,Covid19, covid19 and StayAtHome. Simple wordclouds of hashtags from the various languages (Fig-ure 6 provides clouds from the top 10 languages)also show COVID-19 topics trending. We also ob-serve hashtags related to gaming (e.g., NowPlaydo,PSshare, and NintendoSwitch). This reflects howusers may be spending part of their newly-foundhome time. We also note frequent occurrence ofpolitical hashtags in languages such Arabic, Farsi,Indian, and Urdu. This is in contrast to discussionsin European languages where politics are not as vis-ible. For example, in Urdu, discussions involvingthe army and border issues show up. This may bepartly due to different political environments, butalso due to certain European countries such as Italy,Sweden, Spain, and the U.K. being hit harder (andearlier) than many countries in the Middle East andAsia. In Indian languages such as Tamil and Hindi,posts also focused on movies (e.g., Valimai), TVshows (e.g., Big Boss), doctors, and even fake newsalong with the pandemic-related hashtags.

An interesting observation from the Chineselanguage word cloud is the use of hashtags such asChinaPneumonia and WuhanPneumonia to refer tothe pandemic. We did not observe these same hash-tags in any of the other languages. Additionally, forsome reason, Apple seems to be trending during

(a) English (en) (b) Spanish (es) (c) Portugese (pt) (d) Japanese (ja) (e) Arabic (ar)

(f) Indonesian (in) (g) Turkish (tr) (h) French (fr) (i) Tagalog (tl) (j) Italian (it)

Figure 6: Word clouds for hashtags in tweets from the top 10 languages in the data. We note that tweets innon-English can still carry English hashtags or employ Latin script.

the first 4 months of 2020 in China owing to hash-tags such as appledaily and appledailytw. Somelanguages, such as Romanian and Vietnamese, in-volve discussions of bitcoin and crypto-currency.This was also seen in the Chinese language wordcloud, but not as prominently.

3.5 Domain Sharing Analysis

Domains in URLs shared by users also provide awindow on what is share-worthy. We perform ananalysis of the top 200 domains shared in each of2019 and 2020. The major observation we reach isthe surge in tweets involving news websites, andthe rise in ranks for the majority of these websitescompared to 2019. Table 6 shows the top 40 newsdomains in the 2020 data and their change inrank compared to 2019. Such a heavy sharing ofnews domains reflects users’ needs: Intuitively,at times of global disruption, people need morefrequent updates on ongoing events. Of particularimportance, especially relative to other ongoingpolitical polarization in the U.S., is the striking riseof the conservative news network Fox News, whichhas moved from a rank of 118 in 2019 to 67 in2020 with a swooping 51 positions jump. We alsonote the rank of some news sites (e.g., The Globeand Mail and The Star going down. This is perhapsdue to people resorting to international (and morediverse) sources of information to remain informedabout countries other than their own.

Other domains: Other noteworthy domain ac-tivities including those related to gaming, video andmusic, and social media tools. Ranks of these do-mains have not necessarily shifted higher than 2019but remain prominent. This shows these themesstill being relevant in 2020. In spite of the eco-nomic impact of the pandemic, shopping domains

such as etsy.me and poshmark.com have markedlyrisen in rank as people moved to shopping onlinein more significant ways. We now introduce a casestudy as to how our data can be used for mobilitytracking.

4 Case Study: Mapping Human Mobilitywith Mega-COV.V.02

Geolocation information in Mega-COV.V02.can be used to characterize and track human mobil-ity in various ways. We investigate some of thesenext.

Inter-Region Mobility. Mega-COV V0.2can be exploited to generate responsive maps whereend users can check mobility patterns between dif-ferent regions over time. In particular, Geolocationinformation can show mobility patterns betweenregions. As an illustration of this use case, albeitin a static form, we provide Figure 7a where weshow how users move between U.S. states. Wecan also exploit Mega-COV V0.2 to show inter-state mobility during a given window of time. 7

Figures 7b- 7e present user mobility between U.S.states. The figure shows a clear change from highermobility in Jan. and Feb. to much less activity inMarch, April, and May. Clear differences can becan be seen in key states where the pandemic hashit hard such as New York (NY), California (CA),and Washington State (WA). We provide mobilitypatterns visualizations for a number of countrieswhere the pandemic has hit (sometimes hard) in thesupplementary material, as follows: Brazil, Canada,Italy, Saudi Arabia, and the United Kingdom.

Intra-Region Mobility. We also use informa-tion in Mega-COV V0.2 to map each user to a

7Here, due to increased posting in 2020, we normalizethe number of visits between states by the total number of alltweets posted during a given month 2020.

single home region (i.e., city, state/province, andcountry). We follow Geolocation literature (Rolleret al., 2012; Graham et al., 2014; Han et al., 2016;Do et al., 2018) in setting a condition that a usermust have posted at least 10 tweets from a givenregion. However, we also condition that at least60% of all user tweets must have been posted fromthe same region. We use the resulting set of userswhose home location we can verify to map userweekly mobility within their own city, state, andcountry exclusively for both Canada and the U.S.as illustrating examples. We provide the related vi-sualization in supplementary material under “UserWeekly Intra-Region Mobility”.

5 Model

We develop a model for detecting tweets relatedto COVID-19. Such a model, we hypothesizeshould be useful when studying phenomena relatedto the pandemic using social data. In particular,we believe using tweets with hashtags related tothe pandemic alone is not sufficient, since usersalso post without use of these hashtags and so asignificant share of the content can go unnoticed.To develop our model, we use randomly sample800K tweets from 65 languages in the dataset pre-sented by (Chen et al., 2020), who collect tweetswith hashtags such as Coronavirus, covid-19, andpandemic. We then sample another 800K tweetsrandomly from the same 65 languages from our2019 portion of Mega-COV. We split the data intotrain, dev and test (80%,10%, 10%, respectively)and fine-tune the BERT-Base Multilingual Casedmodel (Devlin et al., 2018) on TRAIN after re-moving all hshtags used in (Chen et al., 2020).The model obtains 0.94% acc., which is significantgiven that the two classes are balanced. We willrelease this model and also train similar models onthe 104 languages along with Mega-COV.

6 Data Release and Ethics

Data Distribution: The size of the data makes itan attractive object of study. Collection and explo-ration of the data required significant computing in-frastructure and use of powerful data streaming andprocessing tools. To facilitate use of the dataset, weorganize the tweet ids we release by time (monthand year) and language. This should enable inter-ested researchers to work with the exact parts ofthe data related to their research questions even ifthey do not have large computing hardware.

Ethical Considerations: We collectMega-COV from the public domain (Twit-ter). In compliance with Twitter policy, we donot publish hydrated tweet content. Rather, weonly publish publicly available tweet IDs. AllTwitter policies, including respect and protectionof user privacy, apply. We decided not to assigngeographic region tags to the tweet ids wedistribute, but these already exist on the json objectthat can be retrieved from Twitter. We encourageall researchers who decide to use Mega-COVto review Twitter policy at https://developer.twitter.com/en/developer-terms/policy

before they start working with the data.

7 Related Works

Twitter Datasets for COVID-19: Several workshave focused on creating datasets for enablingCOVID-19 research. To the best of our knowledge,all these works depend on a list of hashtags relatedto COVID-19 and focus on a given period of time.For example, Chen et al. (2020) started collectingtweets on Jan. 22nd and continued updating by ac-tively tracking a list of 22 popular keywords such as#Coronavirus, #Corona, and #Wuhancoronavirus.As of May 30th, 2020 (Chen et al., 2020) report144M tweets. Singh et al. (2020) collect a datasetcovering Jan. 16th2020-March 15th2020 using alist of hashtags such as #2019nCoV, #ChinaPneu-monia and #ChinesePneumonia, for a total of 2.8Mtweets, ∼ 18M re-tweets, and ∼ 457K direct con-versations. Using location information on the data,authors report that tweets strongly correlated withnewly identified cases in these locations.Similarly, Alqurashi et al. (2020) use a list of key-words and hashtags related to Covid-19 with Twit-ter’s streaming API to collect a dataset of Arabictweets. The dataset covers the period of March1st2020-March 30th2020 and is at 4M tweets. Theauthors’ goal is to help researchers and policy mak-ers study the various societal issues prevailing dueto the pandemic. In the same vein, Lopez et al.(2020) collect a dataset of ∼ 6.5M in multiple lan-guages, with English accounting for ∼ 63.4% ofthe data. The dataset covers Jan. 22nd2020-March2020. Analyzing the data, authors observe the levelof re-tweets to rise abruptly as the crisis ramped upin Europe in late February and early March.Twitter in emergency and crisis: Social mediacan play a useful role in disaster and emergencysince they provide a mechanism for wide informa-

https://developer.twitter.com/en/developer-terms/policy

https://developer.twitter.com/en/developer-terms/policy

(a) Overall inter-state mobility (b) January (c) February

(d) March (e) April (f) May

Figure 7: Inter-state user mobility in the U.S. for Jan.-May, 2020

tion dissemination (Simon et al., 2015). Examplesinclude use of Twitter information for the TyphoonHaiyan in the Philippines (Takahashi et al., 2015),Tsunami in Padang Indonesia (Carley et al., 2016),the Nepal 2015 earthquakes (Verma et al., 2019),Harvey Hurricane (Marx et al., 2020). A numberof works have focused on developing systems foremergency response. An example is McCreadieet al. (2019).Misinformation About COVID-19: Misinforma-tion can spread fast during disaster. Social datahave been used to study rumors and various typesof fake information related to the Zika (Ghenai andMejova, 2017) and Ebola (Kalyanam et al., 2015)viruses. In the context of COVID-19, a number ofworks have focused on investigating the effect ofmisinformation on mental health (Rosenberg et al.,2020), the types, sources, claims, and responsesof a number of pieces of misinformation aboutCOVID-19 (Brennen et al., 2020), the propagationpattern of rumours about COVID-19 on Twitter andWeibo Do et al. (2019), check-worthiness Wrightand Augenstein (2020), modeling the spread ofmisinformation and related networks about the pan-demic Cinelli et al. (2020); Osho et al. (2020);Pierri et al. (2020); Koubaa (2020), estimating therate of misinformation in COVID-19 associatedtweets Kouzy et al. (2020), the use of bots (Fer-rara, 2020), predicting whether a user is COVID19positive or negative (Karisani and Karisani, 2020),and the quality of shared links Singh et al. (2020).Other works have focused on detecting racism andhate speech (Devakumar et al., 2020; Schild et al.,2020; Shimizu, 2020; Lyu et al., 2020) and emo-

tional response during the pandemic (Kleinberget al., 2020).

8 Conclusion

We presented Mega-COV, a billion-scale datasetof 104 languages for studying global responseto the ongoing COVID-19 pandemic. In addi-tion to being large and highly multilingual, ourdataset comprises data long pre-dating the pan-demic. This allows for comparative and longitudi-nal investigations. We provided a global descrip-tion of Mega-COV in terms of its geographic andtemporal coverage, over-viewed its linguistic diver-sity, and provided analysis of its content based onhashtags and top domains. We also provided a casestudy of how the data can be used to track globalhuman mobility. The scale of the dataset has alsoallowed us to make a number of striking discover-ies, including (1) the shift toward retweeting andreplying to other users rather than tweeting in 2020and (2) the role of international news sites as keysources of information during the pandemic. Ourdataset is publicly available for ethical research.

Acknowledgements

MAM acknowledges the support of the NaturalSciences and Engineering Research Council ofCanada (NSERC), the Social Sciences ResearchCouncil of Canada (SSHRC), and ComputeCanada (www.computecanada.ca).

www.computecanada.ca

ReferencesSarah Alqurashi, Ahmad Alhindi, and Eisa Alanazi.

2020. Large arabic twitter dataset on covid-19.arXiv preprint arXiv:2004.04315.

J Scott Brennen, Felix M Simon, Philip N Howard, andRasmus Kleis Nielsen. 2020. Types, sources, andclaims of covid-19 misinformation. Reuters Insti-tute.

Kathleen M Carley, Momin Malik, Peter M Landwehr,Jurgen Pfeffer, and Michael Kowalchuck. 2016.Crowd sourcing disaster management: The complexnature of twitter usage in padang indonesia. Safetyscience, 90:48–61.

Emily Chen, Kristina Lerman, and Emilio Ferrara.2020. Covid-19: The first public coronavirus twit-ter dataset. arXiv preprint arXiv:2003.07372.

Matteo Cinelli, Walter Quattrociocchi, AlessandroGaleazzi, Carlo Michele Valensise, Emanuele Brug-noli, Ana Lucia Schmidt, Paola Zola, Fabiana Zollo,and Antonio Scala. 2020. The covid-19 social mediainfodemic. arXiv preprint arXiv:2003.05004.

Delan Devakumar, Geordan Shannon, Sunil S Bhopal,and Ibrahim Abubakar. 2020. Racism and discrimi-nation in covid-19 responses. Lancet (London, Eng-land), 395(10231):1194.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2018. Bert: Pre-training of deepbidirectional transformers for language understand-ing. arXiv preprint arXiv:1810.04805.

Tien Huu Do, Xiao Luo, Duc Minh Nguyen, and NikosDeligiannis. 2019. Rumour detection via news prop-agation dynamics and user representation learning.arXiv preprint arXiv:1905.03042.

Tien Huu Do, Duc Minh Nguyen, Evaggelia Tsili-gianni, Bruno Cornelis, and Nikos Deligiannis. 2018.Twitter user geolocation using deep multiview learn-ing. In 2018 IEEE International Conference onAcoustics, Speech and Signal Processing (ICASSP),pages 6304–6308. IEEE.

Emilio Ferrara. 2020. # covid-19 on twitter: Bots, con-spiracies, and social media activism. arXiv preprintarXiv:2004.09531.

Amira Ghenai and Yelena Mejova. 2017. Catchingzika fever: Application of crowdsourcing and ma-chine learning for tracking health misinformation ontwitter. arXiv preprint arXiv:1707.03778.

Mark Graham, Scott A Hale, and Devin Gaffney. 2014.Where in the world are you? geolocation and lan-guage identification in twitter. The Professional Ge-ographer, 66(4):568–578.

Bo Han, Afshin Rahimi, Leon Derczynski, and Timo-thy Baldwin. 2016. Twitter geolocation predictionshared task of the 2016 workshop on noisy user-generated text. In Proceedings of the 2nd Workshop

on Noisy User-generated Text (WNUT), pages 213–217.

Janani Kalyanam, Sumithra Velupillai, Son Doan,Mike Conway, and Gert Lanckriet. 2015. Factsand fabrications about ebola: A twitter based study.arXiv preprint arXiv:1508.02079.

Negin Karisani and Payam Karisani. 2020. Miningcoronavirus (covid-19) posts in social media. arXivpreprint arXiv:2004.06778.

Bennett Kleinberg, Isabelle van der Vegt, and Maxi-milian Mozes. 2020. Measuring emotions in thecovid-19 real world worry dataset. arXiv preprintarXiv:2004.04225.

Anis Koubaa. 2020. Understanding the covid19 out-break: A comparative data analytics and study.arXiv preprint arXiv:2003.14150.

Ramez Kouzy, Joseph Abi Jaoude, Afif Kraitem,Molly B El Alam, Basil Karam, Elio Adib, JabraZarka, Cindy Traboulsi, Elie W Akl, and KhalilBaddour. 2020. Coronavirus goes viral: Quantify-ing the covid-19 misinformation epidemic on twitter.Cureus, 12(3).

Christian E Lopez, Malolan Vasu, and Caleb Galle-more. 2020. Understanding the perception ofcovid-19 policies by mining a multilanguage twitterdataset. arXiv preprint arXiv:2003.10359.

Marco Lui and Timothy Baldwin. 2012. langid. py:An off-the-shelf language identification tool. In Pro-ceedings of the ACL 2012 system demonstrations,pages 25–30. Association for Computational Lin-guistics.

Hanjia Lyu, Long Chen, Yu Wang, and Jiebo Luo. 2020.Sense and sensibility: Characterizing social mediausers regarding the use of controversial terms forcovid-19. arXiv preprint arXiv:2004.06307.

Julian Marx, Milad Mirbabaie, and Christian Ehnis.2020. Sense-giving strategies of media organ-isations in social media disaster communication:Findings from hurricane harvey. arXiv preprintarXiv:2004.08567.

Richard McCreadie, Cody Buntain, and Ian Soboroff.2019. Trec incident streams: Finding actionableinformation on social media. Proceedings of the16th International Conferenc e on Information Sys-tems for Crisis Response and Management, Valencia,Spain.

Abiola Osho, Caden Waters, and George Amariucai.2020. An information diffusion approach to ru-mor propagation and identification on twitter. arXivpreprint arXiv:2002.11104.

Francesco Pierri, Carlo Piccardi, and Stefano Ceri.2020. Topology comparison of twitter diffusionnetworks effectively reveals misleading information.Scientific reports, 10(1):1–9.

Stephen Roller, Michael Speriosu, Sarat Rallapalli,Benjamin Wing, and Jason Baldridge. 2012. Super-vised text-based geolocation using language modelson an adaptive grid. In Proceedings of the 2012 JointConference on Empirical Methods in Natural Lan-guage Processing and Computational Natural Lan-guage Learning, pages 1500–1510. Association forComputational Linguistics.

Hans Rosenberg, Shahbaz Syed, and Salim Rezaie.2020. The twitter pandemic: The critical role oftwitter in the dissemination of medical informationand misinformation during the covid-19 pandemic.Canadian Journal of Emergency Medicine, pages 1–7.

Leonard Schild, Chen Ling, Jeremy Blackburn, Gian-luca Stringhini, Yang Zhang, and Savvas Zannettou.2020. ” go eat a bat, chang!”: An early look onthe emergence of sinophobic behavior on web com-munities in the face of covid-19. arXiv preprintarXiv:2004.04046.

Kazuki Shimizu. 2020. 2019-ncov, fake news, andracism. The Lancet, 395(10225):685–686.

Tomer Simon, Avishay Goldberg, and Bruria Adini.2015. Socializing in emergencies—a review ofthe use of social media in emergency situations.International Journal of Information Management,35(5):609–619.

Lisa Singh, Shweta Bansal, Leticia Bode, Ceren Budak,Guangqing Chi, Kornraphop Kawintiranon, ColtonPadden, Rebecca Vanarsdall, Emily Vraga, andYanchen Wang. 2020. A first look at covid-19 infor-mation and misinformation sharing on twitter. arXivpreprint arXiv:2003.13907.

Bruno Takahashi, Edson C Tandoc Jr, and ChristineCarmichael. 2015. Communicating on twitter dur-ing a disaster: An analysis of tweets during typhoonhaiyan in the philippines. Computers in Human Be-havior, 50:392–398.

Rakesh Verma, Samaneh Karimi, Daniel Lee, Om-prakash Gnawali, and Azadeh Shakery. 2019.Newswire versus social media for disaster responseand recovery. arXiv preprint arXiv:1906.10607.

WHO. 2020. Who statement regarding cluster of pneu-monia cases in wuhan, china. Beijing: WHO, 9.

Dustin Wright and Isabelle Augenstein. 2020. Factcheck-worthiness detection as positive unlabelledlearning. arXiv preprint arXiv:2003.02736.

AppendicesA Summary of Supplementary Material

We provide the following supplementary items:

1. Figure A.2 shows the geographical diversityin Mega-COV.V0.2 based on geo-locateddata. We show the distribution in terms ofthe number of cities over the 20 countriesfrom which we retrieved the highest num-ber of locations in the dataset, broken byall-time and the year 2020. In total,Mega-COV.V0.2 has ∼ 167K cities from268 countries.

2. Figure A.1 shows the frequency of tweetingduring Jan-May 10th 2020 vs. Jan-May 2019.

3. Table A.1 shows the top 20 languages de-tected by langid in Mega-COV V0.1 whichwere not detected by twitter, broken by tweets,retweets, and replies.

4. We provides more examples for the user mo-bility between the states or regions of severalcountries: Canada (Figure A.3), and Italy(Figure A.4), during Jan.-Ma, 2020.

5. Section A.1 describes the User Weekly Intra-Region Mobility

Figure A.1: Frequency of tweeting during Jan-May10th 2020 vs. Jan-May 2019.

Figure A.2: Geographical diversity inMega-COV.V0.2 based on geo-located data.

A.1 User Weekly Intra-Region MobilityWe can also visualize user mobility as a distancefrom an average mobility score on a weekly basis.Namely, we calculate an average weekly mobilityscore for the year 2019 using geo-tag information(longitude and latitude) and use it as a baselineagainst which we plot user mobility for each weekof 2019 and 2020 up until April. In general, weobserve a drop in user mobility in Canada start-ing from mid-March. For U.S. users, we noticea very high mobility surge starting around end ofFeb. and early March, only waning down the lastweek of March and continuing in April as shownin Figure A.5. For both the U.S. and Canada, wehypothesize the surge in early March (much morenoticeable in the U.S.) is a result of people movingback to their hometowns, returning from travels,moving for basic need stocking, etc.

(a) Inter-provinces (b) January (c) February


Figure A.3: User mobility between Canada Provinces during Jan.-May 2020

Replies RT TweetsLang Frequency Lang Frequency Lang FrequencyHebrew (he) 337,237 Hebrew (he) 322,351 Hebrew (he) 504,327Croatian (hr) 242,772 Croatian (hr) 198,764 Croatian (hr) 243,601Maltese (mt) 94,695 Maltese (mt) 85,054 Maltese (mt) 145,395Dzongkha (dz) 64,063 Slovak(sk) 77,846 Slovak(sk) 131,544Bahasa Indonesia (id) 46,463 Latin (la) 66,061 Latin (la) 104,488Bosnian (bs) 43,066 Bahasa Indonesia (id) 61,295 Bahasa Indonesia (id) 100,561Slovak(sk) 36,662 Bosnian (bs) 45,276 Bosnian (bs) 54,200Swahili (sw) 21,803 Swahili (sw) 28,122 Dzongkha (dz) 46,950Azerbaijani (az) 20,242 Dzongkha (dz) 26,853 Swahili (sw) 42,076Latin (la) 13,030 Quechua (qu) 22,559 Malay (ms) 32,967Albanian (sq) 12,878 Malay (ms) 19,511 Quechua (qu) 31,175Xhosa (xh) 11,936 Esperanto (eo) 19,397 Albanian (sq) 30,361Irish (ga) 8,607 Kinyarwanda (rw) 19,371 Kinyarwanda (rw) 30,080Malagasy (mg) 7,727 Azerbaijani (az) 19,182 Azerbaijani (az) 29,507Quechua (qu) 7,449 Javanese (jv) 18,180 Javanese (jv) 29,121Kinyarwanda (rw) 7,427 Albanian (sq) 17,904 Esperanto (eo) 29,019Esperanto (eo) 6,755 Xhosa (xh) 14,886 Kurdish (ku) 24,259Malay (ms) 6,683 Irish (ga) 14,807 Afrikaans (af) 22,871Assamese (as) 6,442 Kurdish (ku) 14,475 Volapuk (vo) 21,840Volapuk (vo) 6,245 Galician (gl) 13,337 Irish (ga) 21,151

Table A.1: Top 20 languages detected by langid in Mega-COV V0.1 which were not detected by twitter, brokenby tweets, retweets, and replies

(a) Inter-Regions (b) January (c) February


Figure A.4: User mobility between Italy regions (regioni) during Jan.-May 2020

(a) Canada users

(b) U.S. users

Figure A.5: Canadian and American user weekly mobility during 2019-2020. Each point (a week) is modeled as amobility distance from weekly average mobility in 2019.

arxiv:2005.06012v1 [cs.si] 2 may 2020mega-cov: a billion-scale dataset of 65 languages for covid-19...

Documents