

What Happens After You Are Pwnd: Understanding The Use Of Leaked Account Credentials In The Wild

Jeremiah Onaolapo, Enrico Mariconti, and Gianluca Stringhini
University College London

{j.onaolapo,e.mariconti,g.stringhini}@cs.ucl.ac.uk

ABSTRACT

Cybercriminals steal access credentials to online accounts and then misuse them for their own profit, release them publicly, or sell them on the underground market. Despite the importance of this problem, the research community still lacks a comprehensive understanding of what these stolen accounts are used for. In this paper, we aim to shed light on the modus operandi of miscreants accessing stolen Gmail accounts. We developed an infrastructure that is able to monitor the activity performed by users on Gmail accounts, and leaked credentials to 100 accounts under our control through various means, such as having information-stealing malware capture them, leaking them on public paste sites, and posting them on underground forums. We then monitored the activity recorded on these accounts over a period of 7 months. Our observations allowed us to devise a taxonomy of malicious activity performed on stolen Gmail accounts, to identify differences in the behavior of cybercriminals that get access to stolen accounts through different means, and to identify systematic attempts to evade the protection systems in place at Gmail and blend in with the legitimate user activity. This paper gives the research community a better understanding of a so far understudied, yet critical aspect of the cybercrime economy.

1. INTRODUCTION

The wealth of information that users store in accounts on online services such as Gmail, Dropbox, and Facebook, as well as the possibility of misusing them for illicit activities, have attracted cybercriminals, who actively engage in compromising such accounts. Miscreants obtain the credentials to victims' online accounts by performing phishing scams [13], by infecting users with information-stealing malware [23], or by compromising the databases of websites that contain such information [5]. Such credentials are then sold on the black market to other cybercriminals who wish to use the stolen accounts for profit. This ecosystem has become a very sophisticated market in which only vetted sellers are allowed to join [24].

Cybercriminals can use compromised accounts in multiple ways. First, they can use them to send spam [14]. This practice is particularly effective because of the established reputation of such accounts: the already-established contacts of the account are likely to trust its owner, and are therefore more likely to open the messages that they receive from her [16]. Similarly, the stolen account is likely to have a history of good behavior with the online service, and the malicious messages sent by it are therefore less likely to be detected as spam, especially if the recipients are within the same service (e.g., a Gmail account used to send spam to other Gmail accounts) [27]. Because of these advantages, the developers of large spamming botnets include the capability for their bots to use stolen webmail service accounts to deliver spam [24]. Alternatively, cybercriminals can use the stolen accounts to collect sensitive information about the victim. Such information can include financial credentials (credit card numbers, bank account numbers), login information to other online services, and personal communications of the victim [11].

Despite the importance of stolen accounts for the underground economy, there is surprisingly little work on the topic. Bursztein et al. [11] studied the modus operandi of cybercriminals collecting Gmail account credentials through phishing scams. Their paper shows that criminals access these accounts to steal financial information from their victims, or use these accounts to send fraudulent emails. Despite the interesting insights, the narrowness of their threat model leaves many questions unanswered. Other researchers have not attempted to study the activity of criminals on compromised online accounts because it is usually difficult to monitor what happens to them without being a large online service. The rare exceptions are studies that look at information that is publicly observable, such as the messages shared on Twitter by compromised accounts [14, 15].

To close this gap, in this paper we present a system that is able to monitor the activity performed by attackers on Gmail accounts. To this end, we instrument the accounts by using Google Apps Script [1]; by doing so, we are able to monitor any time an email is read, favorited, or sent, or a new draft is created. We also monitor the accesses that the accounts receive, with particular attention to their system configuration and their origin. We call such accounts honey accounts. To allow researchers to set up their own honeypot infrastructures, we will release the source code of our honey account system publicly.

We set up 100 honey accounts, each resembling the Gmail account of an employee of a fictitious company. To understand how criminals use these accounts after they get compromised, we leak the credentials to such accounts on multiple outlets, modeling the different ways in which cybercriminals share and get access to such credentials. First, we leak credentials on paste sites, such as pastebin [4]. Paste sites are commonly used by cybercriminals to post account credentials after data breaches [2]. We also leak them to underground forums, which have been shown to be the place where cybercriminals gather to trade stolen commodities such as account credentials [24]. Finally, we log in to our honey accounts on virtual machines that have been previously infected with information-stealing malware. By doing this, the credentials are sent to the cybercriminal behind the command and control infrastructure, and will then be used directly or placed on the black market for sale [23]. We know that there are other outlets that attackers use, for instance phishing and data breaches, but we decided to focus on paste sites, underground forums, and malware in this paper. We worked in close collaboration with the Google anti-abuse team to make sure that any unwanted activity by the compromised accounts would be promptly blocked. The accounts are configured to send any email to a mail server under our control, to prevent them from successfully delivering spam.

After leaking our credentials, we recorded any interaction with our honey accounts for a period of 7 months. Our analysis allows us to draw a taxonomy of the different actions performed by criminals on stolen Gmail accounts, as well as provide interesting insights into the keywords that criminals typically search for when looking for valuable information in these accounts. We also show that criminals who obtain access to stolen accounts through certain outlets appear more skilled than others, and make additional efforts to avoid detection by Gmail. For instance, criminals who steal account credentials via malware make more efforts to hide their identity, by connecting from open proxies and the Tor network and disguising their browser user agent. Criminals who obtain access to stolen credentials through paste sites, on the other hand, tend to connect to these accounts from locations that are close to the typical location used by the owner of the account, if this information is shared with them. At the lowest level of sophistication are criminals who browse free underground forums looking for free samples of stolen accounts: these individuals do not take significant measures to avoid detection, and are therefore easier to detect and block.

In summary, this paper makes the following contributions:

• We develop a system to monitor the activity of Gmail accounts. We will publicly release the source code of our system, to allow other researchers to deploy their own Gmail honey accounts and further the understanding that the security community has of malicious activity on online services.

• We deployed 100 honey accounts on Gmail, and leaked their credentials through three different outlets: underground forums, public paste sites, and virtual machines infected with information-stealing malware.

• We provide detailed measurements of the activity logged by our honey accounts over a period of 7 months. We show that certain outlets on which credentials are leaked appear to be used by more skilled criminals, who act stealthily and actively attempt to evade detection systems.

2. BACKGROUND

Webmail accounts. Webmail service providers such as Gmail, Yahoo!, and Outlook.com provide their users with a convenient place to store, sort, and manage their emails and contacts. In this paper we focus on Gmail accounts, with particular attention to the actions performed by cybercriminals once they obtain access to someone else's account. We made this choice because Gmail allows users to set up scripts that augment the functionality of their accounts, and it was therefore the ideal platform for developing webmail-based honeypots. To ease the understanding of the rest of this paper, we briefly summarize the capabilities offered by webmail accounts in general, and by Gmail in particular.

In Gmail, after logging in, users are presented with a view of their inbox. The inbox contains all the emails that the user has received, and highlights the ones that have not been read yet by displaying them in boldface. Users can mark emails that are important to them and need particular attention by starring them. Users are also given a search functionality, which allows them to find emails of interest by typing related keywords. They can also organize their email by placing related messages in folders, or assigning them descriptive labels. Such operations can be automated by creating rules that automatically process received emails. When writing emails, content is saved in a Drafts folder until the user decides to send it. Once this happens, sent emails can be found in a dedicated folder, and they can be searched similarly to received emails.


Threat model. Cybercriminals can get access to account credentials in three ways. First, they can perform social engineering-based scams, such as setting up phishing web pages that resemble the login pages of popular online services [13], or sending spearphishing emails pretending to be members of customer support teams at such online services [26]. As a second way of obtaining user credentials, cybercriminals can install malware on victim computers and configure it to report back any account credentials typed by the user to the command and control server of the botnet [23]. As a third way of obtaining access to user credentials, cybercriminals can exploit vulnerabilities in the databases used by online services to store them [5].

After stealing account credentials, a cybercriminal can either use them privately for his own profit, release them publicly, or sell them on the underground market. Previous work studied the modus operandi of cybercriminals stealing user accounts through phishing and using them privately [11]. In this work, we study a broader threat model in which we mimic cybercriminals leaking credentials on paste sites [4] as well as miscreants advertising them for sale on underground forums [24]. In particular, previous research showed that cybercriminals often offer a small number of account credentials for free to test their "quality." We followed a similar approach, pretending to have more accounts for sale, but never following up on any further inquiries. In addition, we simulate infected victim machines in which the malware steals the user's credentials and sends them to the cybercriminal. We describe our setup and how we leaked account credentials on each outlet in detail in Section 3.2.

Finally, there are a number of actions that a cybercriminal who obtains access to a victim account can perform. Such actions include sending spam [14] or collecting sensitive information from the account [11]. In Section 3.1 we describe how we developed an infrastructure to monitor the activity of cybercriminals on our honey accounts, while in Section 4.2 we analyze in detail the types of activity that we observed.

3. METHODOLOGY

Our overall goal was to gain a better understanding of malicious activity in compromised webmail accounts. To achieve this goal, we developed a system able to monitor the activity on Gmail accounts. We instrumented these accounts to log all accesses they receive and all actions taken on them by cybercriminals. We set up a monitoring infrastructure to keep track of this activity information. We then deployed 100 honey accounts on Gmail. We proceeded to leak the accounts on different outlets, namely popular paste websites, underground forums, and information-stealing malware. The idea is to study differences in malicious activity across these outlets. In the following sections, we describe our system architecture and our experiment setup in detail.

3.1 System overview

Our system comprises two components, namely honey accounts and a monitoring infrastructure. These components are described in this section.

Honey accounts. Our honey accounts are webmail accounts instrumented with Google Apps Script to monitor activity in them. Google Apps Script is a cloud-based scripting language based on JavaScript, originally designed to augment the functionality of Gmail accounts and Google Drive documents, in addition to building web apps [3]. Google Apps Script provides APIs for performing time-triggered and event-triggered tasks, for instance, sending an email reminder once every day about emails marked "important." We incorporated scripts into each account to monitor the emails in the account honeypots. The scripts send notifications to a dedicated webmail account under our control whenever an email is read, sent, or "starred." In addition to information on actions taken on emails, the scripts also send us copies of all draft emails created in the honey accounts.

We included functions to scan the emails in each honey account every 10 minutes and report all discovered changes, namely read, sent, starred, and draft emails, back to us. We also added a "heartbeat message" function that sends us a predefined message once a day from each honey account, to attest that the account is still functional and has not been blocked by Google. The Google Apps Script code was well hidden in a Google Docs spreadsheet stored in the honey accounts. We believe that this measure makes it unlikely for attackers to find and delete our scripts. To minimize abuse, we changed each honeypot account's default send-from address to an email address pointing to a modified mailserver under our control. All emails sent from the account honeypots are delivered to this mailserver, which simply dumps the emails to disk and does not forward them to the intended destination.

Monitoring infrastructure. Google Apps Scripts are quite powerful, but they do not provide enough information in some cases. For example, they do not provide information about the location of the IP addresses that access webmail accounts. To keep track of such accesses, we set up scripts to drive a web browser, periodically log in to each honey account, and record information about visitors (cookie identifier, geolocation information, and times of accesses, among others). The scripts navigate to the visitor activity pages in the honey accounts and dump the pages to disk for offline parsing. It is interesting to note that by taking advantage of the Gmail activity page we are able to leverage Google's geolocation system, as well as their operating system fingerprinting techniques. In addition, notifications of actions on honey accounts are sent to a dedicated webmail account that we set up as a notification store. The notifications are sent by Google Apps Scripts and retrieved by another script managed by us.
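The change-detection step behind these notifications can be sketched as follows. This is a hypothetical Python rendering of the periodic scan logic only (the actual implementation uses Google Apps Script); the snapshot format and function name are our own.

```python
def diff_mailbox(previous, current):
    """Compare two snapshots of mailbox state, taken e.g. 10 minutes
    apart, and emit notification events for each change.

    Each snapshot maps a message id to {"read": bool, "starred": bool}.
    """
    events = []
    for msg_id, state in current.items():
        old = previous.get(msg_id)
        if old is None:
            # A message we have not seen before: e.g. a new draft or sent mail.
            events.append((msg_id, "new"))
            continue
        if state["read"] and not old["read"]:
            events.append((msg_id, "read"))
        if state["starred"] and not old["starred"]:
            events.append((msg_id, "starred"))
    return events
```

Each event would then be forwarded to the dedicated notification account; a separate daily "heartbeat" event confirms the scan is still running.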

We believe that our honey account and monitoring framework opens up multiple possibilities for researchers who want to further study the behavior of attackers in webmail accounts. For this reason, we will publicly release the source code of our system upon publication of this paper.

3.2 Experiment setup

As part of our experiments, we first set up a number of honey accounts on Gmail, and then leaked them on multiple outlets used by cybercriminals.

Honey account setup. We created 100 Gmail accounts and assigned them random combinations of popular first and last names, similar to what was done in [25]. Creating and setting up these accounts is a manual process, which is why it was not possible for us to create more than 100 accounts. Besides, Google rate-limits the opening of new accounts from the same IP address by presenting a phone verification page after a few accounts have been created. This limits the number of honey accounts we can set up in practice.

We populated the freshly created accounts with emails from the public Enron email dataset [19]. This dataset contains the emails sent by the executives of the energy corporation Enron, and was publicly released as evidence in the company's bankruptcy trial. This dataset is suitable for our purposes, since the emails that it contains are typical of those exchanged by corporate users. To make the honey accounts believable and avoid raising suspicion from cybercriminals accessing them, we mapped distinct recipients in the Enron dataset to our fictional characters, and replaced the original first and last names in the dataset with our honey first and last names. In addition, we changed all instances of "Enron" to a fictitious company name that we came up with, and updated all dates contained in the emails to reflect the time at which the accounts were populated.

Leaking account credentials. To achieve our objectives, we had to entice cybercriminals to interact with our account honeypots while we logged their accesses. We selected paste sites and underground forums as appropriate venues for leaking account credentials, since they tend to be misused by cybercriminals for the dissemination of stolen credentials. We also decided to set up virtual machines and perform logins to honey accounts after infecting them with information-stealing malware, since this is a popular way in which professional cybercriminals obtain access to stolen accounts [9]. Information-stealing malware collects users' form data and uploads it to the C&C server.
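The Enron-dataset rewriting described in the honey-account setup above can be sketched as a simple text transformation. This is a minimal illustrative version, assuming a single MM/DD/YYYY date format and example name mappings; the paper does not specify the actual formats or mappings used.

```python
import re
from datetime import datetime, timedelta

def rewrite_email(body, replacements, day_shift):
    """Prepare one Enron email for a honey account: swap real names
    and the company name for honey identities, and push dates forward
    so the messages look recent. `replacements` maps original strings
    (names, "Enron") to their honey substitutes.
    """
    for old, new in replacements.items():
        body = re.sub(re.escape(old), new, body, flags=re.IGNORECASE)

    def shift_date(match):
        # Shift each MM/DD/YYYY date forward by `day_shift` days.
        d = datetime.strptime(match.group(0), "%m/%d/%Y")
        return (d + timedelta(days=day_shift)).strftime("%m/%d/%Y")

    return re.sub(r"\b\d{2}/\d{2}/\d{4}\b", shift_date, body)
```

Running every email in a honey account through such a function keeps the corporate tone of the corpus while removing the identifying details.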

We divided the account honeypots into groups and leaked their credentials in different locations, as shown in Table 1. We leaked 50 accounts in total on paste sites. For 20 of them we leaked basic credentials (username and password pairs) on the popular paste sites pastebin.com and pastie.org. We then leaked 10 account credentials on popular Russian paste websites (p.for-us.nl and paste.org.ru). For the remaining 20 accounts we leaked username and password pairs along with location and date of birth information of the fictional character that "owned" each account. We specified locations close to London, UK for 10 of them, while we specified locations in the Midwest of the US for the other 10 accounts. By averaging the longitude and latitude of these locations, the middle point falls in Pontiac, Illinois.

group | number of honey accounts | outlet of leak
  1   | 30 | popular paste websites (no location information)
  2   | 20 | popular paste websites (including location information)
  3   | 10 | underground forums (no location information)
  4   | 20 | underground forums (including location information)
  5   | 20 | malware (no location information)

Table 1: List of account honeypot groupings and where we leaked them.
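The "middle point" mentioned above is a plain arithmetic average of coordinates. A sketch, using illustrative city coordinates rather than the actual locations leaked in the experiment:

```python
def centroid(points):
    """Average a list of (latitude, longitude) pairs. A plain
    arithmetic mean is adequate for nearby points, as in the
    Midwest grouping described in the text.
    """
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return (sum(lats) / len(points), sum(lons) / len(points))

# Hypothetical Midwest locations (lat, lon), for illustration only:
midwest = [(41.88, -87.63), (39.77, -86.16), (38.63, -90.20)]
middle = centroid(midwest)
```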

We leaked 30 account credentials on underground forums. For 10 of them we only specified username and password pairs, without additional information. Similar to what was done for paste sites, we leaked two groups of 10 accounts each with additional information claiming that the "owners" of such accounts were based near London, UK and in Midwestern US cities with a middle point in Pontiac, respectively. To leak credentials, we used the following forums: offensivecommunity.net, bestblackhatforums.eu, hackforums.net, and blackhatworld.com. We selected these forums because they were open for anybody to register, and were highly ranked in Google results. We acknowledge that some underground forums are not open, and have a strict vetting policy to let users in [24]. Unfortunately, however, we did not have access to any private forum. In addition, the same approach of studying open underground forums has been used by previous work [6]. When leaking credentials on underground forums, we mimicked the modus operandi of cybercriminals outlined by Stone-Gross et al. [24]. In that paper, the authors showed that cybercriminals often post a sample of their stolen datasets on a forum to show that the accounts are real, and promise to provide additional data in exchange for a fee. We logged the messages that our forum accounts received, mostly inquiries about obtaining the full dataset, but we did not follow up on them.

Finally, to study the activity of information-stealing malware on honey accounts, we leaked basic credentials of 20 accounts to information-stealing malware samples, using the malware honeypot infrastructure described in the next section. To this end, we selected malware samples from the Zeus family, which is one of the most popular malware families performing information stealing [9], as well as from the Corebot family.

The reason for leaking different accounts on different outlets is to study differences in the behavior of cybercriminals getting access to stolen credentials through different sources. The reason for providing different additional information (in some cases, no additional information) across the different groups is to observe differences in malicious activity depending on the amount and type of information available to the cybercriminal. As we will show in Section 4, the accesses that we observed in our honey accounts were heavily influenced by the presence of additional location information in the leaked content.

Malware honeypot infrastructure. We set up a sandbox and infected it with malware samples; these samples captured the credentials that a web browser used to log in to Gmail. Our sandbox works as follows: a web server entity manages the honey credentials (usernames and passwords) and malware samples; once the host creates a Virtual Machine (VM), it sends a request to the web server for the malware executable file and another one for the honey credential. The structure is similar to the one explained in [17]. The malware sample is then executed; after a timeout, a script carries out the login operation using the downloaded credential, in order to expose the honey credential to the malware that is already running in the VM. After a certain lifetime the VM is deleted and a new one is created; this new VM downloads another malware sample and a different credential, and repeats the login operation.

To maximize the efficiency of this configuration, before the experiment we carried out a test without the Gmail login component to select only samples whose C&C servers were still up and running. To avoid contributing to spam campaigns or other malicious actions, we set up a sinkhole server and followed other prudent practices described in [22], such as limiting the bandwidth of the network interfaces to avoid denial of service attacks, and limiting the lifetime of each VM.
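The VM lifecycle described above can be sketched as a single iteration of a loop. The class and callable names below are hypothetical stand-ins for the real sandbox components, not the paper's implementation:

```python
import time

def sandbox_cycle(create_vm, fetch_sample, fetch_credential,
                  exec_timeout, vm_lifetime):
    """One iteration of the malware-honeypot cycle: create a VM,
    fetch and run a malware sample, expose one honey credential via
    a scripted Gmail login, then destroy the VM so the next
    iteration starts clean.
    """
    vm = create_vm()
    try:
        vm.run(fetch_sample())               # sample served by the web server
        time.sleep(exec_timeout)             # let the malware hook the browser
        vm.login_gmail(fetch_credential())   # expose the honey credential
        time.sleep(vm_lifetime)              # give the malware time to exfiltrate
    finally:
        vm.destroy()                         # recycle; the caller repeats the cycle
```

The outer loop would repeat this cycle with a fresh sample and a fresh credential each time, within the bandwidth and lifetime limits mentioned above.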

3.3 Threats to validity

We acknowledge that seeding the honey accounts with emails from the Enron dataset may introduce bias into our results, and may make the honey accounts less believable to visitors. Another threat to validity is that the honey accounts were pre-populated with emails, and we did not send new emails to them after leaking them. Some visitors may have noticed that the honey accounts did not receive any new emails during the period of observation, and this may affect our results. Finally, since we only leaked honey credentials to the outlets listed previously (namely paste sites, underground forums, and malware), our results reflect the activity of participants present on those outlets only. Moreover, in the case of the underground forums, it is possible that sophisticated cybercriminals did not interact with our honey accounts. This is because we leaked credentials through freshly created forum accounts, which lack the high reputation it takes to interact with sophisticated cybercriminals on those forums. We also acknowledge that generalizing our findings would require more than 100 Gmail accounts.

Despite these issues, we believe that our methodology will help researchers further understand the underground economy of stolen accounts, and that our results shed light on what happens to stolen accounts. It is currently hard to gather data on compromised accounts, and our system provides a novel way to do so, especially for researchers. Our methodology can be applied to other online accounts, for instance those hosting documents, and our results can be used in the process of building tools to detect and mitigate illegitimate accesses to online accounts. Our results already show interesting trends and findings, which we intend to explore further in future work.

3.4 Ethics

The experiments performed in this paper require some ethical considerations. First of all, by giving cybercriminals access to our honey accounts, we incur the risk that these accounts will be used to damage third parties. To minimize this risk, as we said, we configured our accounts so that all emails would be forwarded to a sinkhole mailserver under our control and never delivered to the outside world. We also established a close collaboration with Google and made sure to report to them any malicious activity that needed attention. Although the suspicious login filters that Google typically uses to protect accounts from unauthorized accesses were disabled for our honey accounts, all other malicious activity detection algorithms were still in place, and in fact Google suspended a number of accounts under our control that attempted to send spam. To mitigate the possibility of cybercriminals finding our honey scripts, deleting them, and locking us out of the honey accounts, we provided Google with a full list of our honey accounts, so that they could monitor them in case something went wrong. Another point of risk was ensuring that the malware run in our virtual environment could not harm third parties. As we said, we followed common practices [22] such as restricting the bandwidth available to our virtual machines and sinkholing all email traffic sent by them. Finally, our experiments inherently deceive cybercriminals by providing them fake accounts with fake personal information in them. To make sure that our experiments were run in an ethical fashion, we sought and obtained IRB approval from our institution.

4. DATA ANALYSIS

We monitored the activity on our honey accounts for a period of 7 months, from 25 June 2015 to 16 February 2016. In this section, we first provide a taxonomy of the types of activity that we observed. We then provide a detailed analysis of the activity monitored on our honey accounts, focusing on the differences in modus operandi shown by cybercriminals who obtained credentials to our honey accounts from different outlets. We then investigate whether cybercriminals attempt to evade location-based detection systems by connecting from locations that are closer to where the owner of the account typically connects from. We also develop a metric to infer which keywords attackers search for when looking for interesting information in an email account. Finally, we analyze how certain types of cybercriminals appear to be stealthier and more advanced than others.

4.1 Overview

We created and instrumented 100 Gmail accounts for our experiments. We observed 327 unique accesses to the accounts during the experiment, during which 147 emails were read, 845 emails were sent, and 12 unique draft emails were composed by cybercriminals. To avoid biasing our results, we removed all accesses made to honey accounts by IP addresses from our monitoring infrastructure. We also removed all accesses that originated from the city where our monitoring infrastructure is located. 42 accounts were blocked by Google during the course of the experiment, since the cybercriminals that accessed them violated Google's Terms and Conditions while doing so. We developed a taxonomy of cybercriminals accessing the honey accounts, which we explain in detail in the next section.

4.2 A taxonomy of account activity

From our dataset of activity observed in the honey accounts, we devise a taxonomy of attackers based on unique accesses to such accounts. We identify four types of attackers, described in detail in the following.

Curious. These accesses constitute the most basic type of access to stolen accounts. After getting hold of account credentials, people log into those accounts to check whether the credentials work. Afterward, they do not perform any additional action. The majority of the observed accesses belong to this category, accounting for 224 accesses.

Gold diggers. When getting access to a stolen account, attackers often want to understand its worth. For this reason, on logging into honey accounts, some attackers search for sensitive information, such as account information and attachments that have financial-related names. They also seek information that may be useful in spearphishing attacks. We call these accesses "gold diggers." Previous research showed that this practice is quite common for manual account hijackers [11]. In this paper, we confirm that finding, provide a methodology to assess the keywords that cybercriminals search for, and analyze differences in the modus operandi of gold digger accesses for credentials leaked through different outlets. In total, we observed 82 accesses of this type.

Spammers. One of the main capabilities of webmail accounts is sending emails. Previous research showed that large spamming botnets have code in their bots and in their C&C infrastructure to take advantage of this capability, by having the bots connect directly to such accounts and send spam [24]. We consider accesses to belong to this category if they send any email. We observed 8 accesses of this type. This low number of accesses shows that sending spam appears not to be one of the main purposes that cybercriminals use stolen accounts for.

Hijackers. A stealthy cybercriminal is likely to keep a low profile when accessing a stolen account, to avoid raising suspicion from the account's legitimate owner. Less concerned miscreants, however, might simply lock the legitimate owner out of the account by changing the account's password.
We call these accesses "hijackers." In total, we observed 36 accesses of this type. It is interesting to note that a change of password prevents us from scraping the account activity page, and we are therefore unable to collect further information about the accesses performed to that account. However, since the password to the account no longer matches the leaked one, it is reasonable to assume that the person performing further logins to such honey accounts is now either the hijacker himself, or someone that the hijacker gave (or perhaps sold) the account credentials to. Interestingly, even after we lose control of the accounts, our monitoring scripts embedded in the accounts keep running, and we therefore keep receiving information on the emails that are read, sent, or starred in those accounts.
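The four categories can be summarized as simple rules over the logged actions of each access. A minimal sketch, with hypothetical field names (the actual signals come from our monitoring infrastructure); an access may match several labels at once, since the classes are not exclusive:

```python
def classify_access(access):
    """Label an access with taxonomy categories.

    `access` is a dict of hypothetical boolean fields derived from the
    monitoring logs. An access may match several labels at once.
    """
    labels = set()
    if access.get("changed_password"):
        labels.add("hijacker")
    if access.get("sent_email"):
        labels.add("spammer")
    if access.get("searched_inbox"):
        labels.add("gold digger")
    if not labels:
        labels.add("curious")  # login only, no further action
    return labels

# e.g. a spam-sending access that also locked the owner out
both = classify_access({"sent_email": True, "changed_password": True})
```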

It is important to note that the taxonomy classes that we described are not exclusive. For example, an attacker might change the password of an account, therefore falling into the "hijacker" category, and then use it to send spam emails, therefore also falling into the "spammer" category. Such overlaps happened often for the accesses recorded in our honey accounts. It is interesting to note that there was no access that behaved exclusively as a "spammer." Miscreants who sent spam through our honey accounts also acted as "hijackers" or as "gold diggers," searching for sensitive information in the account.

Figure 1: CDF of the length of unique accesses for different types of activity on our honey accounts.

Figure 1 shows plots of the Cumulative Distribution Function (CDF) of the length of accesses of the different types of attackers. As in the previous analysis, these accesses are identified by cookies, and account for the time between the first and the last time a cookie is observed on a certain honey account. As can be seen, the vast majority of accesses are very short, lasting only a few minutes and never coming back. "Spammer" accesses, in particular, tend to send emails in bursts for a certain period and then disconnect. "Hijacker" and "gold digger" accesses, on the other hand, have a long tail of about 10% of accesses that keep coming back for several days in a row. For "hijacker" accesses, the statistics reported are only a lower bound of the actual time that a cookie was active on an account: as we said, after the account password is changed, we lose the capability of tracking accesses (but not the information on the interaction with the accounts, for example which emails were read). The CDF shows that most "curious" accesses are repeated over many days, indicating that the cybercriminals keep coming back to find out if there is new information in the accounts. This conflicts with the finding in [11], which states that most cybercriminals connect to a compromised webmail account once, to assess its value within a few minutes. However, [11] focused only on accounts compromised via phishing pages, while we have a broader threat model, as stated earlier.

Figure 2: Distribution of types of accesses for credentials leaked through different outlets.

We wanted to understand the distribution of the different types of accesses across accounts that were leaked through different means. Figure 2 shows a breakdown of this distribution. As can be seen, cybercriminals who get access to stolen accounts through malware are the stealthiest, and never lock the legitimate users out of their accounts. Instead, they limit their activity to checking whether the credentials are real or searching for sensitive information in the account inbox, perhaps in an attempt to estimate the value of the accounts. Accounts leaked through paste sites and underground forums, on the other hand, see the presence of "hijackers": 20% of the accesses to accounts leaked through paste sites belong to this category. Accounts leaked through underground forums see the highest percentage of "gold digger" accesses, with about 30% of all accesses belonging to this category.

4.3 Activity on honey accounts

Google identifies each access to a Gmail account with a cookie identifier. For each unique access observed in each honey account, we recorded the cookie identifier, the time of first access t0, and the time of last known access tlast. In the case of repeated accesses with the same cookie identifier (i.e., multiple visits by the same attacker), we set tlast to the time of the last recorded visit, while t0 remains the time of first access. From this information, we computed the duration of activity as tlast − t0 per cookie. The majority of accesses are short, especially in the groups leaked on paste sites and underground forums. We observed that 80% of visitors to accounts leaked on paste sites and underground forums never came back after their first accesses, while 80% of visitors to accounts leaked to malware came back after some days.
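The per-cookie duration computation can be sketched as follows; the record layout (cookie identifier, timestamp) is our own simplification of the monitoring logs, and the sample timestamps are invented:

```python
from datetime import datetime

def access_durations(log):
    """Compute t_last - t0 per cookie from (cookie_id, timestamp) records."""
    first, last = {}, {}
    for cookie, ts in log:
        if cookie not in first or ts < first[cookie]:
            first[cookie] = ts
        if cookie not in last or ts > last[cookie]:
            last[cookie] = ts
    return {c: last[c] - first[c] for c in first}

log = [
    ("c1", datetime(2015, 7, 1, 10, 0)),
    ("c1", datetime(2015, 7, 3, 9, 30)),   # same attacker returns two days later
    ("c2", datetime(2015, 7, 2, 12, 0)),   # one-off "curious" access
]
durations = access_durations(log)
```

A cookie seen only once gets a duration of zero, matching the short, never-returning accesses described above.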


Figure 3: CDF of the time passed after account credentials are leaked and the first access by a cookie is recorded.

We also studied how long it takes cybercriminals to access accounts in the different groups after we leak them. We measured the duration between the time of first leak in each group and the time of first access. We generated a CDF to compare durations across malware, paste sites, and underground forums, as shown in Figure 3. Similarly, we generated a time-based plot of activity in the honey accounts, by computing the duration between the time of first leak in each group and the time of each unique access. This is shown in Figure 4. The idea is to identify time patterns of unique accesses.

Figure 3 shows that within the first 25 days, the following were recorded: 80% of all unique accesses to accounts leaked on paste sites, 60% of all unique accesses to accounts leaked on underground forums, and 40% of all unique accesses to accounts leaked to malware. A particularly interesting observation is the nature of unique accesses to accounts leaked to malware. A close look at Figure 3 reveals rapid increases in unique accesses to honey accounts leaked to malware about 30 days after the leak, and again after 100 days, indicated by two sharp inflection points.

Figure 4 sheds more light on what happened at those points. An interesting aspect to note is that accounts that are leaked on public outlets such as forums and paste sites can be accessed by multiple cybercriminals at the same time. Account credentials leaked through malware, on the other hand, are available only to the botmaster who stole them, until they decide to sell them or give them to someone else. Seeing bursts in accesses to accounts leaked through malware months after the actual leak happened could indicate that the accounts were aggregated by the same criminal from different C&C servers, or that the accounts were sold on the underground market and that another criminal is now using them. This hypothesis is somewhat confirmed by the fact that these bursts in accesses were of the "gold digger" type, while all previous accesses to the same accounts were of the "curious" type. Another interesting fact shown by Figure 4 is that accounts leaked to malware are accessed multiple times by the malware controllers who stole them, perhaps to check whether the accounts are active. As discussed previously, this is in contrast with what was discovered by previous work [11].

Figure 4: Duration between the time of leak and unique accesses for accounts leaked through different outlets.

In addition, Figure 4 shows that the majority of accounts leaked on paste sites were accessed within a few days of the leak, while a particular subset was not accessed for more than two months. That subset comprises the 10 credentials we leaked on Russian paste sites; the corresponding honey accounts were not accessed for more than two months from the time of leak. This indicates either that there are not many cybercriminals on the Russian paste sites, or that they did not believe the accounts were real and thus did not bother to access them.

4.4 System configuration of accesses

Using Google's system fingerprinting information available in the honey accounts, we collected information about the devices that accessed the honey accounts. Accounts leaked through paste sites and underground forums recorded a fraction of accesses from Android devices, while the accounts leaked through malware only recorded accesses from traditional computers. On the browser side, we noticed that accesses to accounts leaked through malware always presented an empty user agent, making it impossible for Google to identify the browser. Accounts leaked through underground forums and paste sites, on the other hand, present accesses from all the most popular browsers.

4.5 Location of accesses


We recorded the location information that we found in 173 unique accesses, and analyzed it to identify differences in origin location patterns across honey account groups. This was possible because Google provides location information (that is, the city from which a user logged into a Gmail account) in the account activity page of each Gmail account. 154 unique accesses did not have location information in them. According to information we got from Google, these are mostly accesses that originated from Tor exit nodes or anonymous proxies.

We observed accesses from a total of 29 countries. To understand whether the IP addresses that connected to our honey accounts had been recorded in previous malicious activity, we checked all observed IP addresses against the Spamhaus blacklist, and found 20 of them listed. Because of the nature of this blacklist, we believe that these addresses belong to malware-infected machines that cybercriminals use to connect to the stolen accounts.

One of our goals was to observe whether cybercriminals attempt to evade location-based login risk analysis systems by tweaking their access origins. In particular, we wanted to assess whether telling criminals the location where the owner of an account is based influences the locations that they use to connect to this account. Despite observing 57 accesses to our honey accounts leaked through malware, we discovered that all of these connections except one originated from Tor exit nodes. This shows that malware operators prefer to hide their location through anonymizing systems rather than modify their location based on where the stolen account typically connects from. Because of this observation, we focused the remaining experiments in this section on the accounts stolen through paste sites and underground forums.

To observe the impact of the availability of location information about the honey accounts on the locations that cybercriminals connect from, we calculated the median distances between the locations recorded in unique accesses and the midpoints of the locations advertised in our account leaks. As we mentioned, the midpoint of the advertised UK locations is London, while the midpoint of the advertised US locations is in Pontiac, Illinois. For example, for all unique accesses A to honey accounts leaked on paste sites and advertised with UK information, we extracted the location information, translated it to coordinates LA, and computed the dist_paste_UK vector as distance(LA, midUK), where midUK is London's coordinates. All distances are in kilometers. We extracted the median values of all distance vectors obtained, and plotted circles on UK and US maps, using those median distances as the radii of the circles, as shown in Figures 5a and 5b.
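The distance computation can be sketched as follows. The paper does not specify the exact geodesic formula, so we assume the standard haversine great-circle distance; the sample access coordinates are hypothetical:

```python
import math
import statistics

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 \
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))

MID_UK = (51.5074, -0.1278)  # London, midpoint of advertised UK locations

def median_distance(access_coords, midpoint):
    """Median distance of a group's access origins from a leak midpoint."""
    return statistics.median(haversine_km(c, midpoint) for c in access_coords)

# hypothetical access origins (lat, lon): Paris, Berlin, New York
accesses = [(48.8566, 2.3522), (52.52, 13.405), (40.7128, -74.006)]
r = median_distance(accesses, MID_UK)
```

The median radius r would then be drawn as a circle centered on the midpoint, as in Figures 5a and 5b.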

Interestingly, we observe that connections to accounts leaked with advertised location information originate from places closer to the midpoints than connections to accounts whose leaked information contained usernames and passwords only. Figure 5a shows that connections to accounts leaked on paste sites result in the smaller median circles, that is, the connections originate from locations closer to London, the UK midpoint. The smallest circle is for the accounts leaked on paste sites with advertised UK location information (radius 1,400 kilometers). In contrast, the circle for accounts leaked on paste sites without location information has a radius of 1,784 kilometers. The median circle of the accounts leaked on underground forums with no advertised location information is the largest circle in Figure 5a, while the one for accounts leaked on underground forums along with UK location information is smaller.

We obtained similar results in the US plot, with some interesting distinctions. As shown in Figure 5b, connections to honey accounts leaked on paste sites with advertised US locations are clustered around the US midpoint, as indicated by the circle with a radius of 939 kilometers, compared to the median circle of accounts leaked on paste sites without location information, which has a radius of 7,900 kilometers. However, although the median circle of accounts leaked on underground forums with advertised location information is smaller than that of the ones without advertised location information, the difference in their radii is small. This again supports the indication that cybercriminals on paste sites exhibit more location malleability, that is, they masquerade their access origins to appear closer to the advertised location, when provided. It also shows that cybercriminals on forums are less sophisticated, or care less, than the ones on paste sites.

Statistical significance. As we explained, Figures 5a and 5b show that accesses to leaked accounts happen closer to the advertised locations if this information is included in the leak. To confirm this finding, we performed a Cramér-von Mises test [12]. The Anderson version [7] of this test is used to understand whether two vectors of values are likely to have the same statistical distribution. The p-value has to be under 0.01 for us to reject the null hypothesis (i.e., that the two vectors of distances have the same distribution); otherwise it is not possible to state with statistical significance whether the two distance vectors are derived from the same distribution.

The p-values from the tests on the paste site vectors (0.0017415 for UK location information versus no location, and 0.0000007 for US location information versus no location) allow us to reject the null hypothesis, indicating that the two vectors come from different distributions, while we cannot say the same observing the


(a) Distance of login locations from the UK midpoint (b) Distance of login locations from the US midpoint

Figure 5: Distance of login locations from the midpoints of the locations advertised while leaking credentials. Red lines indicate credentials leaked on paste sites with no location information, green lines indicate credentials leaked on paste sites with location information, purple lines indicate credentials leaked on underground forums without location information, and blue lines indicate credentials leaked on underground forums with location information.

p-values for the tests on the forum vectors (0.272883 in the UK case and 0.272011 in the US one). Therefore, we can conclusively state that the statistical test shows that criminals using paste sites connect from closer locations when location information is provided along with the leaked credentials. We cannot reach that conclusion in the case of accounts leaked on underground forums, although Figures 5a and 5b indicate that there are some location differences in this case too.
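The comparison of distance vectors uses the two-sample Cramér-von Mises test. A minimal pure-Python sketch of the test statistic (Anderson's two-sample version, assuming no tied values; the distance values below are hypothetical, and in practice the p-value would be obtained from the statistic's null distribution, e.g. via scipy.stats.cramervonmises_2samp):

```python
def cvm_2samp_statistic(x, y):
    """Two-sample Cramér-von Mises statistic (Anderson's version),
    assuming no tied values between the samples."""
    n, m = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    # 1-based ranks of each sample's values within the pooled ordering
    ranks_x = [r for r, (_, src) in enumerate(pooled, 1) if src == 0]
    ranks_y = [r for r, (_, src) in enumerate(pooled, 1) if src == 1]
    u = n * sum((r - i) ** 2 for i, r in enumerate(ranks_x, 1)) \
        + m * sum((s - j) ** 2 for j, s in enumerate(ranks_y, 1))
    return u / (n * m * (n + m)) - (4 * n * m - 1) / (6 * (n + m))

# hypothetical distance vectors (km): with vs. without advertised location
dist_with_loc = [110.0, 260.0, 540.0, 890.0]
dist_without = [1500.0, 2300.0, 4100.0, 7900.0]
t = cvm_2samp_statistic(dist_with_loc, dist_without)
```

A larger statistic indicates a larger discrepancy between the two empirical distributions, so well-separated distance vectors (as for the paste site groups) yield a small p-value.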

4.6 What are "gold diggers" looking for?

Cybercriminals compromise online accounts because of the inherent value of those accounts. As a result, they assess accounts to decide how valuable they are and what exactly to do with them. We decided to study the words that they searched for in the honey accounts, in order to understand and potentially characterize anomalous searches in the accounts, to aid the design of future detection systems. A limiting factor was that we did not have access to the search logs of the honey accounts, but only to the content of the emails that were read. To overcome this limitation, we employed Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF is the product of two metrics, namely Term Frequency (TF) and Inverse Document Frequency (IDF). The idea is that we can infer the words that cybercriminals searched for by comparing the important words in the emails read by cybercriminals to the important words in all emails in the decoy accounts. TF-IDF ranks the words in a corpus by importance; as a result, we relied on it to infer the words that cybercriminals searched for in the honey accounts.

In its simplest form, TF is a measure of how frequently term t occurs in document d, as shown in Equation 1. IDF is a logarithmic scaling of the inverse of the fraction of documents containing term t, as shown in Equation 2, where D is the set of all documents in the corpus, N is the total number of documents in the corpus, and |{d ∈ D : t ∈ d}| is the number of documents in D that contain term t. Once TF and IDF are obtained, TF-IDF is computed by multiplying them, as shown in Equation 3.

tf(t, d) = f_{t,d}    (1)

idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )    (2)

tfidf(t, d, D) = tf(t, d) × idf(t, D)    (3)

The output of TF-IDF is a weighted metric that ranges between 0 and 1: the closer the weighted value is to 1, the more important the term is in the corpus.
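Equations 1-3 and the difference ranking used later in this section can be sketched as follows. This is a generic implementation of the raw-count variant; the paper's exact preprocessing and any normalization may differ, and the two-document example corpus is invented:

```python
import math
from collections import Counter

def tf(term, doc):
    """Raw term frequency f_{t,d} (Eq. 1); doc is a list of tokens."""
    return Counter(doc)[term]

def idf(term, corpus):
    """log(N / |{d in D : t in d}|) (Eq. 2); corpus is a list of docs."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tfidf(term, doc, corpus):
    """tf x idf (Eq. 3)."""
    return tf(term, doc) * idf(term, corpus)

# invented two-document corpus: emails read by attackers vs. all emails
d_read = "bitcoin payment account bitcoin transfer".split()
d_all = "transfer company energy information transfer please".split()
corpus = [d_read, d_all]

# rank terms by tfidf in d_read minus tfidf in d_all
vocab = set(d_read) | set(d_all)
diff = {t: tfidf(t, d_read, corpus) - tfidf(t, d_all, corpus) for t in vocab}
top = sorted(diff, key=diff.get, reverse=True)
```

Terms that occur only in the read emails (here, "bitcoin") get the largest difference and rise to the top of the ranking.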


Searched words  tfidfR   tfidfA   tfidfR−tfidfA    Common words  tfidfR   tfidfA   tfidfR−tfidfA
results         0.2250   0.0127   0.2122           transfer      0.2795   0.2949   -0.0154
bitcoin         0.1904   0.0      0.1904           please        0.2116   0.2608   -0.0493
family          0.1624   0.0200   0.1423           original      0.1387   0.1540   -0.0154
seller          0.1333   0.0037   0.1296           company       0.042    0.1531   -0.1111
localbitcoins   0.1009   0.0      0.1009           would         0.0864   0.1493   -0.063
account         0.1114   0.0247   0.0866           energy        0.0618   0.1471   -0.0853
payment         0.0982   0.0157   0.0824           information   0.0985   0.1308   -0.0323
bitcoins        0.0768   0.0      0.07684          about         0.1342   0.1226   0.0116
below           0.1236   0.0496   0.074            email         0.1402   0.1196   0.0207
listed          0.0858   0.02068  0.0651           power         0.0462   0.1175   -0.0713

Table 2: List of the top 10 words by tfidfR − tfidfA (left) and the top 10 words by tfidfA (right). The words on the left are the ones with the highest difference in importance between the emails read by attackers and the emails in the entire corpus. For this reason, they are the words that attackers most likely searched for when looking for sensitive information in the stolen accounts. The words on the right are the ones with the highest importance in the entire corpus.

We evaluated TF-IDF on all terms in a corpus of text comprising two documents, namely all emails dA in the honey accounts, and all emails dR read by the attackers. The idea is that words that have high importance in the emails read by a criminal, but lower importance in the overall dataset, are likely keywords that the attackers searched for on the Gmail site. We preprocessed the corpus by filtering out all words with fewer than 5 characters and removing all known header-related words (for instance "delivered" and "charset"), honey email handles, and signaling information that our monitoring infrastructure introduced into the emails. After running TF-IDF on all remaining terms in the corpus, we obtained the vectors TFIDFA and TFIDFR, the TF-IDF values of all terms in the corpus [dA, dR]. We then computed the vector TFIDFR − TFIDFA. The top 10 words by TFIDFR − TFIDFA, compared to the top 10 words by TFIDFA, are presented in Table 2. The idea is that words with TFIDFR values higher than their TFIDFA values rank high in the list, and those are the words that the cybercriminals likely searched for.

As seen in Table 2, the top 10 words by TFIDFR − TFIDFA are sensitive words, such as "bitcoin," "family," and "payment." These are the words with the greatest difference in importance between the emails read by attackers and all emails in the corpus. Comparing these words with the most important words in the entire corpus indicates that attackers likely searched for sensitive information, especially financial information. In addition, the words with the highest importance in the entire corpus (for example, "company" and "energy"), shown on the right side of Table 2, have much lower importance in the emails read by cybercriminals, and most of them have negative values in TFIDFR − TFIDFA. This is a strong indicator that the honey accounts were not searched for random terms, but for sensitive information.

Originally, the Enron dataset contained no "bitcoin" term. That term was introduced into the read-emails document dR through the actions of one of the cybercriminals that accessed some of the honey accounts. The cybercriminal attempted to send blackmail messages from some of our honey accounts to Ashley Madison scandal victims, requesting ransoms in bitcoin in exchange for silence. In the process, many draft emails containing information about bitcoin were created and abandoned by the cybercriminal, and other cybercriminals read them during later accesses. That way, our monitoring infrastructure picked up bitcoin-related terms, and they rank high in Table 2, showing that cybercriminals took a strong interest in those emails.

4.7 Interesting case studies

In this section, we present some interesting case studies we encountered during our experiments. They help to shed further light on the actions that cybercriminals take on compromised webmail accounts.

Three of the honey accounts were used by an attacker to send multiple blackmail messages to victims of the Ashley Madison scandal. The blackmailer threatened to expose the victims unless they made payments in bitcoin to a specified bitcoin wallet. Tutorials on how to make bitcoin payments were also included in the messages. The blackmailer also created and abandoned many draft emails targeted at more Ashley Madison victims, which some other visitors to the accounts read, thus contributing to the read content that our monitoring infrastructure recorded.

Two of the honey accounts received notification emails about the hidden Google Apps Script in those accounts "using too much computer time." The notifications were read by an attacker, and we received notifications about the reading actions.

Finally, an attacker registered on a carding forum using one of the honey accounts as the registration email address. As a result, registration confirmation information was sent to the honey account. This shows that some of the accounts were used as stepping stones by cybercriminals to perform other attacks.

4.8 Sophistication of attackers

From the accesses we recorded in the honey accounts, we identified three peculiar behaviors of cybercriminals that indicate their level of sophistication, namely configuration hiding (for instance, hiding user agent information), detection evasion (connecting from locations close to the advertised decoy location, if provided), and stealth (avoiding actions like hijacking and spamming). Attackers accessing the different groups of honey accounts exhibit different types of sophistication. Those accessing accounts leaked through malware are stealthier than the others: they do not hijack the accounts, and they do not send spam from them. They also access the accounts through Tor, and they hide their system configuration; for instance, there was no user agent string information in the accesses we recorded from them. Attackers accessing accounts leaked on paste sites tend to connect from locations closer to the ones specified as decoy locations in the leaked account, in a bid to evade detection. Attackers accessing accounts leaked on underground forums make little attempt to stay stealthy or to connect from closer locations. These differences in sophistication could be used to characterize attacker behavior.

5. DISCUSSION

In this section, we discuss the implications of the findings we made in this paper. First, we discuss what our findings mean for current mitigation techniques against compromised online service accounts, and how they could be used to devise better defenses. Then, we discuss some limitations of our method. Finally, we present some ideas for future work.

Implications of our findings. In this paper, we made multiple findings that provide the research community with a better understanding of what happens when online accounts get compromised. In particular, we discovered that if attackers are provided with location information about the online accounts, they tend to connect from places that are closer to that advertised location. We believe that this is an attempt to evade the security mechanisms currently employed by online services to discover suspicious logins. Such systems often rely on the origin of logins to assess how suspicious login attempts are. Our findings cast some doubt on the effectiveness of such suspicious login anomaly detection systems, especially in the presence of skilled attackers. We also observed that many accesses came through proxies and Tor exit nodes, making it hard to determine the exact origins of logins. This further reduces the effectiveness of these security mechanisms.

Despite confirming existing evasion techniques in use by cybercriminals, our experiments also highlighted interesting behaviors that could be used to develop effective systems to detect malicious activity. For example, our observations about the words searched for by cybercriminals show that behavioral modeling could work in identifying anomalous behavior in online accounts. Anomaly detection systems could be trained adaptively on the words searched for by the legitimate account owner over a period of time; a deviation from those words would then be flagged as anomalous, indicating that the account may have been compromised. Similarly, anomaly detection systems could be trained on the durations of connections during benign usage, and deviations from those could be flagged as anomalous.

Limitations. We encountered a number of limitations in the course of the experiments. For example, we were able to leak the honey accounts only through a few outlets, namely paste sites, underground forums, and malware. In particular, we could only target underground forums that were open to the public and for which registration was free. Similarly, we could not study some of the most recent families of information-stealing malware, such as Dridex, because they would not execute in our virtual environment.
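The search-vocabulary idea could be as simple as the following heuristic, which is entirely our own illustration rather than a system from the paper; the vocabulary set and threshold are hypothetical:

```python
def is_anomalous_search(query_terms, owner_vocabulary, threshold=0.5):
    """Flag a search whose terms mostly fall outside the owner's
    historical search vocabulary (illustrative heuristic only)."""
    if not query_terms:
        return False
    unseen = sum(1 for t in query_terms if t not in owner_vocabulary)
    return unseen / len(query_terms) > threshold

# hypothetical vocabulary learned from the legitimate owner's past searches
owner_vocab = {"invoice", "meeting", "flight", "newsletter"}
flagged = is_anomalous_search(["bitcoin", "payment"], owner_vocab)
```

A production system would of course learn the vocabulary adaptively and combine it with other signals, such as the connection-duration deviations mentioned above.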

Attackers could find the scripts we hid in the honey accounts and remove them, making it impossible for us to monitor the activity of the account. This is an intrinsic limitation of our monitoring architecture, but in principle studies similar to ours could be performed by the online service providers themselves, such as Google and Facebook. By having access to the full logs of their systems, such entities would have no need to set up monitoring scripts, and it would be impossible for attackers to evade their scrutiny.

Future work. In the future, we plan to continue exploring the ecosystem of stolen accounts, gaining a better understanding of the underground economy surrounding them. We will explore ways to make the decoy accounts more believable, in order to attract more cybercriminals and keep them engaged with the decoy accounts. We intend to set up additional scenarios, such as studying attackers who have a specific motivation, for example compromising accounts that belong to political activists (rather than the generic corporate accounts we used in this paper). We want to study the modus operandi of cybercriminals taking over other types of accounts, such as Online Social Network (OSN) and cloud storage accounts. We could modify and adapt


the monitoring infrastructure that we already have to other types of accounts. We also plan to devise a wider taxonomy of attackers in the future.

6. RELATED WORK

In this section, we briefly compare this paper with previous work, noting that most previous work focused on spam and social spam. Only a few studies focused on the manual hijacking of accounts and the activity performed on them.

Bursztein et al. [11] investigated the manual hijacking of online accounts through phishing pages. The study focuses on cybercriminals who steal user credentials and use them privately, and shows that manual hijacking is not as common as automated hijacking by botnets. The study also illustrates the usefulness of honey credentials (account honeypots) in the study of hijacked accounts. Compared to the work by Bursztein et al., in this paper we analyze a much broader threat model, looking at account credentials automatically stolen by malware, as well as the behavior of cybercriminals who obtain account credentials through underground forums and paste sites. By focusing on multiple types of miscreants, we were able to show differences in their modus operandi and provide multiple insights into the activities that happen on hijacked Gmail accounts in the wild. In addition, we provide an open source framework that can be used by other researchers to set up experiments similar to ours and further explore the ecosystem of stolen Google accounts. Despite the fact that the authors of [11] had more visibility into the account hijacking phenomenon than we did, since they were operating the Gmail service, the dataset that we collected is of comparable size to theirs: we logged 327 malicious accesses to 100 accounts, while they studied 575 high-confidence hijacked accounts.

Thomas et al. [28] studied Twitter accounts under the control of spammers. Stringhini et al. [25] studied social spam using 300 honeypot profiles, and presented a tool for the detection of spam on Facebook and Twitter. Similar work was also carried out in [8, 10, 20, 32]. Egele et al. [14] presented a system that detects malicious activity in online social networks by building statistical models of the "normal" behavior patterns of users. Stringhini et al. [26] developed a tool for spearphishing detection based on behavioral modeling of senders.

Stone-Gross et al. [24] studied a large-scale spam operation by analyzing 16 C&C servers of the Pushdo/Cutwail botnet. In the paper, the authors highlight that the Cutwail botnet, one of the largest of its time, has the capability of connecting to webmail accounts to send spam. This capability seemed not to be used by cybercriminals at the time of the study. During our measurement, we did not observe any mass-sending email activity that might suggest that our honey accounts were used by a large botnet to automatically send spam. In their paper, Stone-Gross et al. also describe the activity of cybercriminals on spamdot, a large underground forum. They show that cybercriminals were actively trading account information such as the one provided in this paper, offering free "teasers" of the overall datasets for sale. In this paper, we used a similar approach to leak account credentials on underground forums.

Thomas et al. [29] studied underground markets in which fake Twitter accounts are sold and then used to spread spam and other malicious content. Unlike this paper, they focus on fake accounts and not on legitimate ones that have been hijacked. Wang et al. [30] proposed the use of patterns of click events to spot fake accounts in online services, by building clickstream models of real users and fake accounts. In 2012, Liu et al. [21] studied content privacy issues in Peer-to-Peer (P2P) networks by deploying honeyfiles containing honey account credentials in P2P shared spaces. They monitored downloading events and concluded that attackers who downloaded the honeyfiles had malicious intent, aiming to profit economically from the private data they obtained. The study used a similar approach to ours, especially in the placement of honey account credentials; however, it placed more emphasis on honeyfiles than on honey credentials, and it studied P2P networks while our work focuses on compromised accounts on the World Wide Web (WWW). Kapravelos et al. [18] studied malicious browser extensions, and emphasized the huge risks that such extensions pose to users. They configured dynamic web pages as honeypots to study the behavior of browser extensions. Wang et al. [31], on the other hand, investigated malicious web pages by deploying VM-based honeypots that ran vulnerable operating systems and patrolled the World Wide Web to find malicious web pages.

7. CONCLUSION

In this paper, we presented a honey account system able to monitor the activity of cybercriminals who gain access to Gmail credentials. We will open source our system to encourage researchers to set up additional experiments and improve our community's knowledge of what happens after an account is compromised. We set up 100 honey accounts, and leaked their credentials on paste sites, on underground forums, and through virtual machines infected with malware. We measured the accesses received by our honey accounts over a period of 7 months, and provided detailed statistics of the activity of cybercriminals on these accounts, together with a taxonomy of such criminals. Our findings help the research community get a better understanding of the ecosystem of stolen online accounts, and will help researchers develop better detection systems against this malicious activity.


8. REFERENCES

[1] Apps Script. https://developers.google.com/apps-script/?hl=en.

[2] Dropbox User Credentials Stolen: A Reminder To Increase Awareness In House. http://www.symantec.com/connect/blogs/dropbox-user-credentials-stolen-reminder-increase-awareness-house.

[3] Overview of Google Apps Script. https://developers.google.com/apps-script/overview.

[4] Pastebin. pastebin.com.

[5] The Target Breach, By the Numbers. http://krebsonsecurity.com/2014/05/the-target-breach-by-the-numbers/.

[6] S. Afroz, A. C. Islam, A. Stolerman, R. Greenstadt, and D. McCoy. Doppelgänger finder: Taking stylometry to the underground. In IEEE Symposium on Security and Privacy, 2014.

[7] T. Anderson and D. Darling. Asymptotic theory of certain 'goodness of fit' criteria based on stochastic processes. Annals of Mathematical Statistics, 1952.

[8] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida. Detecting Spammers on Twitter. In Conference on Email and Anti-Spam (CEAS), 2010.

[9] H. Binsalleeh, T. Ormerod, A. Boukhtouta, P. Sinha, A. Youssef, M. Debbabi, and L. Wang. On the analysis of the Zeus botnet crimeware toolkit. In Privacy, Security and Trust (PST), 2010.

[10] Y. Boshmaf, I. Muslukhov, K. Beznosov, and M. Ripeanu. The socialbot network: when bots socialize for fame and money. In Proceedings of the 27th Annual Computer Security Applications Conference, 2011.

[11] E. Bursztein, B. Benko, D. Margolis, T. Pietraszek, A. Archer, A. Aquino, A. Pitsillidis, and S. Savage. Handcrafted fraud and extortion: Manual account hijacking in the wild. In ACM SIGCOMM Conference on Internet Measurement, 2014.

[12] H. Cramer. On the composition of elementary errors. Skandinavisk Aktuarietidskrift, 1928.

[13] R. Dhamija, J. D. Tygar, and M. Hearst. Why phishing works. In SIGCHI Conference on Human Factors in Computing Systems, 2006.

[14] M. Egele, G. Stringhini, C. Kruegel, and G. Vigna. COMPA: Detecting compromised accounts on social networks. In Symposium on Network and Distributed System Security (NDSS), 2013.

[15] M. Egele, G. Stringhini, C. Kruegel, and G. Vigna. Towards detecting compromised accounts on social networks. In IEEE Transactions on Dependable and Secure Computing (TDSC), 2015.

[16] T. Jagatic, N. Johnson, M. Jakobsson, and F. Menczer. Social Phishing. Communications of the ACM, 2007.

[17] J. P. John, A. Moshchuk, S. D. Gribble, and A. Krishnamurthy. Studying Spamming Botnets Using Botlab. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2009.

[18] A. Kapravelos, C. Grier, N. Chachra, C. Kruegel, G. Vigna, and V. Paxson. Hulk: Eliciting malicious behavior in browser extensions. In USENIX Security Symposium, 2014.

[19] B. Klimt and Y. Yang. Introducing the Enron Corpus. In Conference on Email and Anti-Spam (CEAS), 2004.

[20] K. Lee, J. Caverlee, and S. Webb. The social honeypot project: protecting online communities from spammers. In Proceedings of the 19th International Conference on World Wide Web, 2010.

[21] B. Liu, Z. Liu, J. Zhang, T. Wei, and W. Zou. How many eyes are spying on your shared folders? In ACM Workshop on Privacy in the Electronic Society, 2012.

[22] C. Rossow, C. J. Dietrich, C. Grier, C. Kreibich, V. Paxson, N. Pohlmann, H. Bos, and M. van Steen. Prudent practices for designing malware experiments: Status quo and outlook. In IEEE Symposium on Security and Privacy, 2012.

[23] B. Stone-Gross, M. Cova, L. Cavallaro, B. Gilbert, M. Szydlowski, R. Kemmerer, C. Kruegel, and G. Vigna. Your Botnet is My Botnet: Analysis of a Botnet Takeover. In ACM Conference on Computer and Communications Security (CCS), 2009.

[24] B. Stone-Gross, T. Holz, G. Stringhini, and G. Vigna. The underground economy of spam: A botmaster's perspective of coordinating large-scale spam campaigns. In USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET), 2011.

[25] G. Stringhini, C. Kruegel, and G. Vigna. Detecting Spammers on Social Networks. In Annual Computer Security Applications Conference (ACSAC), 2010.

[26] G. Stringhini and O. Thonnard. That ain't you: Blocking spearphishing through behavioral modelling. In Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), 2015.

[27] B. Taylor. Sender reputation in a large webmail service. In Conference on Email and Anti-Spam (CEAS), 2006.


[28] K. Thomas, C. Grier, D. Song, and V. Paxson. Suspended accounts in retrospect: an analysis of Twitter spam. In Proceedings of the 2011 ACM SIGCOMM Internet Measurement Conference, 2011.

[29] K. Thomas, D. McCoy, C. Grier, A. Kolcz, and V. Paxson. Trafficking Fraudulent Accounts: The Role of the Underground Market in Twitter Spam and Abuse. In USENIX Security Symposium, 2013.

[30] G. Wang, T. Konolige, C. Wilson, X. Wang, H. Zheng, and B. Y. Zhao. You are how you click: Clickstream analysis for sybil detection. In USENIX Security Symposium, 2013.

[31] Y.-M. Wang, D. Beck, X. Jiang, R. Roussev, C. Verbowski, S. Chen, and S. King. Automated web patrol with strider honeymonkeys. In Symposium on Network and Distributed System Security (NDSS), 2006.

[32] S. Webb, J. Caverlee, and C. Pu. Social Honeypots: Making Friends with a Spammer Near You. In Conference on Email and Anti-Spam (CEAS), 2008.
