
Anomaly Detection in Online Social Networks

Abstract

Anomalies in online social networks can signify irregular, and often illegal, behaviour. Detection of such anomalies has been used to identify malicious individuals, including spammers, sexual predators, and online fraudsters. In this paper we survey existing computational techniques for detecting anomalies in online social networks. We characterise anomalies as being either static or dynamic, and as being labelled or unlabelled, and survey methods for detecting these different types of anomalies. We suggest that the detection of anomalies in online social networks is composed of two sub-processes: the selection and calculation of network features, and the classification of observations from this feature space. In addition, this paper provides an overview of the types of problems that anomaly detection can address and identifies key areas for future research.

Keywords: Anomaly Detection, Link Mining, Link Analysis, Social Network Analysis, Online Social Networks

1. Introduction

Anomalies arise in online social networks as a consequence of particular individuals, or groups of individuals, making sudden changes in their patterns of interaction or interacting in a manner that markedly differs from their peers. The impacts of this anomalous behaviour can be observed in the resulting network structure. For example, fraudulent individuals in an online auction system may collaborate to boost their reputation. Because these individuals have a heightened level of interaction, they tend to form highly interconnected subregions within the network (Pandit et al., 2007). In order to detect this type of behaviour, the structure of a network can be examined and compared to an assumed or derived model of normal, non-collaborative interaction. Regions of the network whose structure differs from that expected under the normal model can then be classified as anomalies (also known as outliers, exceptions, abnormalities, etc.).

In recent times, the rise of online social networks and the digitisation of many forms of communication have meant that online social networks have become an important part of social network analysis (SNA). This includes research into the detection of anomalies in social networks, and numerous methods have now been developed. This development has occurred over a wide range of problem domains, with anomaly detection being applied to the detection of important and influential network participants (e.g. Shetty and Adibi (2005); Malm and Bichler (2011); Cheng and Dickinson (2013)), clandestine organisational structures (e.g. Shetty and Adibi (2005); Krebs (2002); Reid et al. (2005)), and fraudulent and predatory activity (e.g. Phua et al. (2010); Fire et al. (2012); Chau et al. (2006); Akoglu et al. (2013); Pandit et al. (2007)).

Since anomaly detection is coming to play an increasingly important role in SNA, the purpose of this paper is to survey existing techniques, and to outline the types of challenges that can be addressed. To our knowledge this survey represents the first attempt to examine anomaly detection with a specific focus on social networks. The contributions of this paper are as follows:

• provide an overview of existing challenges in a range of problem domains associated with online social networks that can be addressed using anomaly detection

• provide an overview of existing techniques for anomaly detection, and the manner in which these have been applied to social network analysis

• explore future challenges for online social networks, and the role that anomaly detection can play

• outline key areas where future research can improve the use of anomaly detection techniques in SNA

In drafting this review we did not set out to consider particular problem domains. Rather, we aimed to identify tools specifically designed for detection of anomalies, regardless of the particular social networks they were designed to analyse. However, as we conducted our survey we found that relevant work was predominantly published in the area of computer science, and consequently, many of the applications of anomaly detection that we encountered were focused on anomalies in online systems. Therefore, unless specifically stated otherwise, the term social network will be used throughout this paper to mean an online social network.

Within the social sciences literature, we found a number of papers focusing on the concept of network change (see for example McCulloh and Carley (2011); Arney and Arney (2013); Tambayong (2014)), which attempts to characterise the evolution of social networks. We see anomaly detection as being a subset of change detection, as anomaly detection could be used to identify change points where an evolving social network undergoes a rapid change; however, a network that evolves in a consistent fashion over an extended period of time is unlikely to be deemed anomalous. We have therefore elected to limit the scope of our review to those methods that deal specifically with anomaly detection.

Preprint submitted to Social Networks, May 7, 2014

This is the author's personal copy before proof for publication. Please cite the paper as: David Savage, Xiuzhen Zhang, Xinghuo Yu, Pauline Chou, Qingmai Wang, Anomaly detection in online social networks, Social Networks, Volume 39, October 2014, Pages 62-70, ISSN 0378-8733, http://dx.doi.org/10.1016/j.socnet.2014.05.002 (http://www.sciencedirect.com/science/article/pii/S0378873314000331)

2. Related Work

Previous reviews of anomaly detection have provided an overview of the general, non-network based problem, describing the use of various algorithms and the particular types of problems to which these algorithms are most suited (Hodge and Austin, 2004; Chandola et al., 2009; Markou and Singh, 2003a,b). A workshop on the detection of network based anomalies was also held at ACM 2013 (Akoglu and Faloutsos, 2013). The most recent review of general anomaly detection (Chandola et al., 2009) expands on previous works to define six categories of anomaly detection techniques: classification (supervised learning), clustering, nearest neighbour, statistical, information theoretic, and spectral analysis.

As well as categorising anomaly detection techniques, previous reviews describe a number of challenges for anomaly detection, mainly associated with the problem of defining normal behaviour, particularly in the face of evolving systems, or systems where anomalies result from malicious activities (Chandola et al., 2009; Hodge and Austin, 2004). In particular, Chandola et al. (2009) note that the development of general solutions to anomaly detection remains a significant challenge and that novel methods are often developed to solve specific problems, accommodating the specific requirements of these problems and the specific representation of the underlying systems. As discussed in Section 7, this has also been the case for some methods focused on anomaly detection in social networks.

In addition to the major reviews of anomaly detection described above, other works have considered anomaly detection as part of methodological surveys for particular problem domains. For example, methods for performing anomaly detection have been discussed as part of more general reviews in areas of fraud detection (Bolton and Hand, 2002; Phua et al., 2010), network intrusion (Jyothsna et al., 2011; Gogoi et al., 2011; Patcha and Park, 2007), and the deployment of wireless sensor networks (Zhang et al., 2010; Janakiram et al., 2006). While significant overlap exists between the analysis of computer and sensor networks and social networks, there are also a number of differences that must be taken into account. In particular, social networks are typically composed of many inter-connected communities, which has important consequences for the distribution of node degree, and the transitivity of the network (Newman and Park, 2003). Moreover, anomaly detection in both sensor and computer networks is typically required to occur online in (soft) real-time, and while this constraint may also apply in some SNA scenarios, it is not typically required. In addition, anomaly detection in sensor networks generally requires algorithms that reduce network traffic and have a low computational complexity (Zhang et al., 2010).

3. Problem domains for the application of anomaly detection in social networks

Anomalies in social networks are often representative of illegal and unwanted behaviour. The recent explosion of social media and online social systems means that many social networks have become key targets for malicious individuals attempting to illegally profit from, or otherwise cause harm to, the users of these systems.

Many users of online social systems such as Facebook, Google+, Twitter, etc. are regularly subjected to a barrage of spam and otherwise offensive material (Shrivastava et al., 2008; Fire et al., 2012; Akoglu et al., 2010; Hassanzadeh et al., 2012). Moreover, the relative anonymity and the unsupervised nature of interaction in many online systems provides a means for sexual predators to engage with young, vulnerable individuals (Fire et al., 2012). Since the perpetrators of these behaviours often display patterns of interaction that are quite different from regular users, they can be identified through the application of anomaly detection techniques. For example, sexual predators often interact with a set of individuals who are otherwise unconnected, leading to the formation of star-like structures (Fire et al., 2012). These types of structures can be identified by examining a range of network features (Akoglu et al., 2010; Shrivastava et al., 2008; Hassanzadeh et al., 2012), or through the use of trained classifiers (Fire et al., 2012).

Online retailers and online auctions have also become a key target for malicious individuals. By subverting the reputation systems of online auction systems, fraudsters are able to masquerade as honest users, fooling buyers into paying for expensive goods that are never delivered. This process is facilitated by the use of Sybil attacks (the use of multiple fake accounts) and through collaboration between fraudulent individuals to artificially boost reputation to a point where honest buyers are willing to participate in large transactions (Chau et al., 2006; Pandit et al., 2007). In many online stores, opinion spam, in the form of fake product reviews, is used in an attempt to distort consumers' perceptions of product quality and to influence buyer behaviour (Akoglu et al., 2013). Again, the malicious individuals who engage in these types of behaviour often form anomalous structures within the network, as their patterns of interaction can be quite different from regular users.

In addition to the social networks supported by dedicated online systems, mining of the social networks induced by mobile phone communications, financial transactions, etc. can also be used to identify illegal activities. Detection of anomalies in these types of networks has previously been used to identify organised criminal behaviour, including insurance fraud (Subelj et al., 2011), and terrorist activities (Reid et al., 2005; Krebs, 2002). Given the highly detrimental impact of these types of behaviour, anomaly detection in social networks can be seen as an extremely important component in the growing tool-box for performing social network analysis (SNA).

Outside of criminal or malicious behaviour, anomaly detection has also been used to detect important and influential individuals (Shetty and Adibi, 2005), individuals fulfilling particular roles within a community (Welser et al., 2011), levels of community participation (Bird et al., 2008), and unusual patterns in email traffic (Eberle and Holder, 2007).

4. Definitions

Anomalies are typically defined in terms of deviation from some expected behaviour. A recent review of general, non-network based anomaly detection defined anomalies as “patterns in data that do not conform to a well defined notion of normal behaviour” (Chandola et al., 2009). Another recent review defines anomalies as “an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data” ((Barnett and Lewis, 1994), cited in (Hodge and Austin, 2004)). We note two important aspects of these definitions.

First, the definitions given above highlight the importance of defining expected behaviour, but are somewhat imprecise in terms of exactly how the anomaly deviates from this expectation. This imprecision stems from the fact that the magnitude of any deviation will depend on the specific problem domain (Chandola et al., 2009).

Second, these definitions presuppose an understanding of what exactly should be observed in order to differentiate normal and anomalous behaviour. In order to detect real-life behaviour of interest we must observe the underlying system through a suitable set of features, and determining which features will provide the greatest separation of normal and anomalous behaviour is in itself a key challenge in anomaly detection (Chandola et al., 2009). This can be especially difficult in adversarial problem domains, as the behaviour of anomalous entities may change over time in direct response to the detection methods employed.

For the detection of anomalies in social networks, we are concerned with the pattern of interactions between the individuals in the network. Thus, following (Chandola et al., 2009; Hodge and Austin, 2004), we define network anomalies simply as patterns of interaction that significantly differ from the norm. As is the case for general anomaly detection, this definition is highly imprecise, and for a given domain, the specific meaning of ‘pattern of interaction’ and ‘significantly differ’ will reflect the particular behaviour of interest. For example, one analysis of emails between employees of the Enron corporation considered ‘patterns of interaction’ to simply be the number of emails sent by an individual over a given period of time (Priebe et al., 2005). For this particular study, this metric was deemed to be a suitable feature for identifying the real-life anomalous behaviour of interest. However, for a different study, concerned with a different aspect of employee behaviour, a different metric may be considered in order to capture the appropriate ‘patterns of interaction’.

In this paper, we focus on the detection of anomalies resulting from patterns of interaction that differ from normal behaviour within the same social network. However, for completeness, we note here that another form of anomaly exists, termed horizontal anomalies (Gao et al., 2013). Horizontal anomalies occur when the characteristics or behaviours of an entity vary depending on the source of the data (Gao et al., 2013; Papadimitriou et al., 2010). For example, a user of social media may have a similar set of friends or followers across a number of platforms (e.g. Twitter, Facebook, etc.), but for one particular platform (Google+ for example) have a markedly different set of acquaintances. Methods for detecting this type of anomaly are beyond the scope of this paper. However, given the highly diverse nature of the systems from which social network data is currently being extracted, we believe that the detection of horizontal anomalies will become extremely important in the near future, and suggest that further research in this area will be extremely fruitful.

5. Characterisation of anomalies

In analysing social networks it is the interactions between individuals that form the main object of study. Focus is given to the manner in which interactions between pairs of individuals influence interactions with and between other individuals in the system, and the relationship between these interactions and the attributes of the individuals involved (see Brandes et al. (2013); Getoor and Diehl (2005) for discussion of network analysis in general). This differentiates anomaly detection in social networks from traditional, non-network based analysis, where individuals' attributes are assumed to follow some population level distribution that ignores interactions and the resulting relationships between individuals and their peers.

Typically, social networks are represented using a graph, with vertices (nodes) representing individuals (or groups, companies, etc.) and edges (links, arcs, etc.) between the vertices representing interactions. In some instances, nodes may be partitioned into multiple types, and such networks can be represented using bipartite (or tripartite, etc.) graphs.

Depending on the type of analysis being performed, networks can be characterised as being either static or dynamic, and as being labelled or unlabelled. Obviously, dynamic networks change over time, with changes occurring in the pattern of interactions (who interacts with whom), or in the nature of these interactions (how X interacts with Y). Labelled networks include information regarding various attributes of both the individuals and their interactions.

It could be argued that all social networks are dynamic; however, it is often useful to analyse social networks as if they were static. As an example, consider the Enron email data set (analysed in Priebe et al. (2005); Akoglu et al. (2010); Eberle and Holder (2007, 2009)), which consists of a dynamic network representing the emails sent between Enron employees between 1998 and 2002. This network changes over time as different groups of individuals send and receive emails at different times. If employee A sends five emails to employee B, a dynamic representation of the network would track these five emails as links between A and B occurring only in the appropriate time-steps. If the properties of each email are considered (e.g. length, subject, etc.) then the network is treated as being labelled. By considering a dynamic representation of the network, Priebe et al. (2005) were able to detect anomalous time-steps where a particular individual suddenly increased the number of emails they sent relative to previous time-steps.
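A minimal sketch of this kind of dynamic detection, under simplifying assumptions: the counts below are synthetic, and a simple trailing-window z-score stands in for the scan statistics actually used by Priebe et al. (2005).

```python
from statistics import mean, stdev

def flag_bursts(counts, window=5, threshold=3.0):
    """Flag time-steps where a count exceeds the trailing-window mean
    by more than `threshold` standard deviations."""
    flagged = []
    for t in range(window, len(counts)):
        history = counts[t - window:t]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            sigma = 1e-9  # avoid division by zero on a flat history
        if (counts[t] - mu) / sigma > threshold:
            flagged.append(t)
    return flagged

# Synthetic weekly email counts for one employee; the burst at index 8
# is anomalous relative to the preceding time-steps.
weekly = [4, 5, 3, 6, 4, 5, 4, 5, 40, 6]
print(flag_bursts(weekly))  # [8]
```

Note that the score at each time-step is computed only from that individual's own previous behaviour, which is what makes this a dynamic, rather than static, anomaly.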

In contrast to a dynamic representation of the Enron network, if consideration is given to the total set of emails sent by each employee over the full time period, the time at which each email is sent is no longer deemed important and the network can be treated as static. In reducing the network from a dynamic to a static representation, multiple interactions occurring at different points in time may be combined in some way, and represented by a single link. This link may be labelled with the number of interactions it represents, and with some aggregation of the properties of these interactions (e.g. mean email length), or it may be that only the fact that an interaction took place is important, and the network can be treated as being unlabelled. For example, a static, unlabelled representation of the Enron email network was used in the analysis by Akoglu et al. (2010), where the formation of an anomalous star-like structure was identified, indicative of a single individual (Ken Lay, a former CEO) sending emails to large numbers of individuals who were not otherwise connected (i.e. they did not send emails to one another). Thus, so long as the information is actually available, a network can easily be represented in a number of different ways, depending on the requirements of the analysis.
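This reduction can be sketched with a hypothetical timestamped edge list, collapsing repeated interactions between a pair of individuals into a single labelled link (plain dictionaries stand in for a graph library):

```python
from collections import defaultdict

# Hypothetical dynamic network: (sender, receiver, time-step, email length).
emails = [
    ("A", "B", 1, 120), ("A", "B", 2, 80), ("A", "B", 2, 200),
    ("A", "B", 5, 40), ("A", "B", 7, 60), ("B", "C", 3, 300),
]

# Collapse to a static, labelled representation: one link per ordered pair,
# labelled with the interaction count and the mean email length.
totals = defaultdict(lambda: [0, 0])  # (sender, receiver) -> [count, total length]
for sender, receiver, _time, length in emails:
    totals[(sender, receiver)][0] += 1
    totals[(sender, receiver)][1] += length

static = {pair: {"count": count, "mean_length": total / count}
          for pair, (count, total) in totals.items()}
print(static[("A", "B")])  # {'count': 5, 'mean_length': 100.0}
```

Dropping the two labels from each link would yield the static, unlabelled representation; keeping the time-step field instead would preserve the dynamic one.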

Exactly how a social network is chosen to be represented will depend, of course, on the type of anomalies to be detected. As with the networks themselves, the anomalies that occur in social networks can also be characterised as being dynamic or static, and as labelled or unlabelled. A dynamic anomaly occurs with respect to previous network behaviour, while a static anomaly occurs with respect to the remainder of the network. Labelled anomalies relate to both network structure and vertex or edge attributes, while unlabelled anomalies relate only to network structure (see Akoglu and Faloutsos (2013)).

In addition to the characterisation as dynamic or static and labelled or unlabelled, anomalies can be further characterised as occurring with respect to a global or local context, and as occurring across a particular network unit, which we refer to as the minimal anomalous unit. A global or local context simply describes whether an anomaly occurs relative to the entire network, or relative only to its close neighbours. For example, an individual's income may be globally insignificant (many people have similar income), however if the incomes of their friends (defined by their links within the network) are all significantly higher, they may be considered a local anomaly (Gao et al., 2010; Ji et al., 2012). The minimal anomalous unit refers to the network structure that we consider to be anomalous. If we are interested in changes made by individuals, a sudden change in the medium of communication for example, we might consider a single vertex to be the minimal anomalous unit. However, if we are interested in larger groups of individuals, perhaps a group of fraudulent individuals collaborating to boost their reputation in an online auction system (Pandit et al., 2007), the minimal anomalous unit of interest scales to that of a sub-network. In this situation, the behaviour of any given vertex may not be anomalous, but taken together as a group, the pattern of communication may be anomalous with respect to other groups of vertices in the network. Throughout this review, we will consider how the different approaches to anomaly detection relate to each of these four characteristics.
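The income example can be made concrete with a small sketch. The graph, the incomes, and the scoring rule are all illustrative assumptions rather than a method from the cited papers: a vertex is scored only against the attribute values of its immediate neighbours.

```python
from statistics import mean, stdev

# Hypothetical friendship graph and incomes. Vertex "v" earns a globally
# unremarkable income, but far less than all of its immediate neighbours.
neighbours = {"v": ["a", "b", "c"], "u": ["a", "b"]}
income = {"v": 50_000, "u": 95_000, "a": 150_000, "b": 160_000, "c": 155_000}

def local_score(vertex):
    """Deviation of a vertex's attribute from its neighbourhood,
    measured in neighbourhood standard deviations."""
    vals = [income[n] for n in neighbours[vertex]]
    return abs(income[vertex] - mean(vals)) / stdev(vals)

print(local_score("v") > local_score("u"))  # True: "v" is the local anomaly
```

A global detector looking only at the marginal income distribution would score "v" as unremarkable; the anomaly exists only in the local context defined by the network links.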

6. Methods for anomaly detection

In Section 5, we defined four characteristics that can be used to categorise anomalies: static or dynamic, labelled or unlabelled, local or global context, and the minimal anomalous unit. In this section we outline various approaches for detecting these different types of anomalies. In practice we found that characterisation as static or dynamic and labelled or unlabelled best differentiates the various approaches, thus the following section is organised according to characterisation along these two axes.

Note that we have considered here only methods that have previously been used to analyse social networks. Network based anomaly detection has been investigated in a number of additional problem domains, including intrusion detection (García-Teodoro et al., 2009; Gogoi et al., 2011; Jyothsna et al., 2011), traffic modelling (Shekhar et al., 2001) and gene regulation (Kim and Gelenbe, 2009; Kim et al., 2012). However, since we are primarily interested in social networks we have not included these studies in our review.

6.1. Static unlabelled anomalies

Static, unlabelled anomalies occur when the behaviour of an individual or group of individuals leads to the formation of unusual network structures. Because the labels on edges and vertices are not considered, any information regarding the type of interaction, its duration, the age of the individuals involved, etc. is ignored. Only the fact that the interaction occurred is significant. Thus, in order to detect anomalous behaviour, assumptions must be made regarding the probability that a given pair of individuals will interact.

As an example, Shrivastava et al. (2008) noted that email spam and viral marketing material is typically sent from a single malicious individual to many targets. Since the targets are selected at random, they are unlikely to be connected independently of the malicious individual, and the resulting network structure will form a star. Based on this observation, the number of triangles in each ego-net¹ is used to label the subject individual as being either malicious or innocent, with a low triangle count indicating a malicious individual. Extensions to this basic approach have also been used to detect groups of malicious individuals acting in collaboration (Shrivastava et al., 2008).
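A minimal sketch of the triangle-count idea, using a toy adjacency structure (the network and vertex names are hypothetical):

```python
from itertools import combinations

# Hypothetical undirected network as adjacency sets. "spammer" sits at the
# centre of a star: its targets do not know one another.
adj = {
    "spammer": {"t1", "t2", "t3", "t4"},
    "regular": {"a", "b"},
    "a": {"regular", "b"},
    "b": {"regular", "a"},
    "t1": {"spammer"}, "t2": {"spammer"}, "t3": {"spammer"}, "t4": {"spammer"},
}

def egonet_triangles(subject):
    """Count triangles through `subject`: pairs of its neighbours that
    are themselves directly connected."""
    return sum(1 for u, v in combinations(adj[subject], 2) if v in adj[u])

# A low triangle count marks the star-like (potentially malicious) ego-net.
print(egonet_triangles("spammer"), egonet_triangles("regular"))  # 0 1
```

In a genuine social circle, a subject's friends tend to know each other, closing many triangles; randomly chosen spam targets close almost none.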

While Shrivastava et al. (2008) considered star-like structures to be indicative of malicious behaviour, Akoglu et al. (2010) demonstrated that both near-stars and near-cliques may be indicative of anomalous behaviour in social networks. Cliques and stars form the extremal values of a power-law relationship between the number of nodes in an ego-net and the number of edges (Ne ∝ Nv^α); thus, in order to detect anomalies, a power-law curve can be fitted to the network, and the residuals analysed for significant deviance from the expected relationship. This idea has been applied to numerous properties of the ego-net (Akoglu et al., 2010; Hassanzadeh et al., 2012), and in a comparison of various combinations of features, Hassanzadeh et al. (2012) showed that a power-law relationship between edge count and the average betweenness centrality (see Hassanzadeh et al. (2012) for definitions) of an individual's ego-net was most useful in differentiating between normal and anomalous (near-clique or near-star) ego-nets.
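A minimal version of this residual analysis can be sketched as follows, assuming each ego-net is summarised as a (node count, edge count) pair. The least-squares fit in log-log space and the threshold of 0.8 are illustrative choices, not the exact procedure of Akoglu et al. (2010):

```python
import math

def power_law_outliers(egonets, threshold=0.8):
    """Fit log E = alpha * log N + c by least squares over all ego-nets,
    then flag ego-nets whose absolute log-scale residual exceeds the
    threshold (near-stars fall below the curve, near-cliques above)."""
    xs = [math.log(n) for n, e in egonets.values()]
    ys = [math.log(e) for n, e in egonets.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    alpha = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    c = my - alpha * mx
    return {ego for ego, (n, e) in egonets.items()
            if abs(math.log(e) - (alpha * math.log(n) + c)) > threshold}

# Four typical ego-nets lying on E = N^1.5, plus a 25-node near-star.
egonets = {"u1": (4, 8), "u2": (9, 27), "u3": (16, 64),
           "u4": (25, 125), "star": (25, 24)}
print(power_law_outliers(egonets))  # {'star'}
```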

Another approach for detecting static, unlabelled anomalies is the use of signal processing techniques. The application of signal processing stems from work on graph partitioning and community detection (Miller et al., 2010b; Newman, 2006), and treats anomalous subgraphs as a signal embedded in a larger graph, considered to be background noise. The problem of detecting this anomalous subgraph is formulated in terms of a hypothesis test (Miller et al., 2010b,a, 2011a) with

H0: the graph is noise (no anomalous subgraph exists)
H1: the graph is signal + noise (an anomalous subgraph is embedded in the graph)

The null hypothesis assumes that the network is generated by some stochastic process describing the probability that any pair of nodes will be connected by an edge. Anomalous subgraphs are generated by different underlying processes, resulting in a density of edges that differs from that expected under the null model (Miller et al., 2010b).

[1] An ego-net is defined as the subgraph induced by consideration of a subject and their immediate neighbours.

Clearly, the degree to which a signal processing approach can be successfully applied depends on the identification of suitable null models describing the background graph. The most basic model of random graphs, the Erdos-Renyi model, treats the possibility of an edge between two vertices as an independent Bernoulli trial, so that any given edge exists with probability p. However, graphs generated using this process exhibit a Poisson distribution of vertex degrees, which has been shown to be a poor approximation of many real-world networks. Typically, real social networks exhibit a much fatter tail than the Poisson distribution (Newman et al., 2001; Menges et al., 2008). More sophisticated models, such as the recursive-matrix model (R-MAT), are able to generate more realistic graphs with some degree of clustering and non-Poisson degree distributions (Chakrabarti et al., 2004; Newman et al., 2001; Menges et al., 2008). However, the selection of a suitable model remains a challenge for signal processing approaches, and further research is required in this area.
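To make the role of the null model concrete, the sketch below scores a candidate subgraph against an Erdos-Renyi background by asking how probable its edge density would be under independent Bernoulli trials. The numbers are illustrative, and this binomial tail is a simplification rather than the full test statistic used in the signal processing literature:

```python
import math

def subgraph_density_pvalue(n_sub, m_sub, p):
    """P-value for observing at least m_sub edges among the
    C(n_sub, 2) possible edges of an n_sub-node subgraph, under an
    Erdos-Renyi null model with edge probability p (binomial tail)."""
    trials = n_sub * (n_sub - 1) // 2
    return sum(math.comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(m_sub, trials + 1))

# Background density: roughly 5% of possible edges exist.
# Candidate subgraph: 6 nodes carrying 12 of its 15 possible edges.
pval = subgraph_density_pvalue(6, 12, 0.05)
print(pval < 1e-6)  # True: far denser than the null model predicts
```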

6.2. Static labelled anomalies

By considering vertex and edge labels in addition to the network structure, a context can be defined in which typical structures may be considered anomalous. Conversely, the particular combination of edge and vertex labels within the context defined by a given network structure (say an ego-net or community) may also be considered anomalous.

In addition to the detection of cliques and stars using unlabelled properties, Akoglu et al. (2010) also considered edge labels to detect 'heavy' ego-nets, where the sum over a particular label is disproportionately high relative to the number of edges. For example, in analysing donations to US presidential candidates, the Democratic National Committee were found to have donated a substantial amount of money, but this money was spread over only a few candidates; thus the ego-net formed is considered 'heavy'. In contrast, the Kerry Committee received a large number of donations, but each donation was for a small amount. In this example, a bipartite network is considered (with candidates and donors as the two vertex types), and the ego-net provides a context in which the received or donated amounts may be considered anomalous. Outside this structure, a series of small or large donations may be perfectly normal.

Taking a similar view of context, the detection of community-dependent anomalies has been investigated by dividing the network into communities based on individuals' patterns of interactions, and then considering the attributes of the members of each community (Gao et al., 2010; Ji et al., 2012; Gupta et al., 2012; Muller et al., 2013). Attribute values which would be deemed normal across the entire network may appear as anomalies within the more restricted context imposed by the community setting. For example, Gao et al. (2010) describe a situation where the patterns of interaction between a particular group of individuals mark them as a community within a larger network, but one member of this community has a far lower income than their peers. Globally, this individual's income is completely normal; however, within the context provided by the community their income is seen as an anomaly.

In the realm of spam detection, static labelled anomalies have been used to identify opinion spam in online product reviews (Akoglu et al., 2013; Chau et al., 2006; Pandit et al., 2007). These researchers have applied belief propagation to anomaly detection, whereby 'hidden' labels assigned to vertices and edges are updated in an iterative manner, conditional on the labels observed in the system of interest.

As an example of belief propagation, the FraudEagle algorithm (Akoglu et al., 2013) uses a bipartite graph to represent reviews of products in an online retail store, with users forming one set of vertices and products forming the other. Edges between users and products represent product reviews and are signed according to the overall rating given in the review. The aim of FraudEagle is to assign a hidden label to each node, with users labelled as either honest or fraudulent, and products as good or bad. These labels are assigned based on the observed pattern of reviews. It is assumed that normal, honest reviewers will typically give good products positive reviews and bad products negative reviews, while fraudulent reviewers are assumed to regularly do the opposite. This assumption is encoded as a matrix describing the probability of a user with a hidden label of either honest or fraudulent giving a positive or negative review of a product whose hidden label designates it as either good or bad. By iteratively propagating labels through the network according to these assumed probabilities (i.e. by applying loopy belief propagation), the system is able to determine those users who can be considered as opinion spammers.
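The flavour of this iterative propagation can be illustrated with a much-simplified scoring scheme that alternates between estimating product quality and reviewer honesty from signed reviews. This mimics the spirit of the approach only; it is not the FraudEagle message-passing algorithm, and all names and values below are invented for illustration:

```python
def score_reviewers(reviews, iters=20):
    """Alternately estimate product quality and reviewer honesty from
    signed (+1/-1) reviews. A reviewer who keeps disagreeing with the
    consensus on the products they rate ends up with a low honesty
    score. A simplified illustration only, not FraudEagle itself."""
    users = {u for u, _, _ in reviews}
    products = {p for _, p, _ in reviews}
    honesty = {u: 1.0 for u in users}
    quality = {p: 0.0 for p in products}
    for _ in range(iters):
        # Product quality: honesty-weighted mean of its review signs.
        for p in products:
            rated = [(u, s) for u, q, s in reviews if q == p]
            quality[p] = sum(honesty[u] * s for u, s in rated) / len(rated)
        # Reviewer honesty: mean agreement with current quality,
        # mapped from [-1, 1] onto [0, 1].
        for u in users:
            given = [(p, s) for v, p, s in reviews if v == u]
            agree = sum(s * quality[p] for p, s in given) / len(given)
            honesty[u] = (1 + agree) / 2
    return honesty, quality

# Invented example: three users praise product p1; one account trashes
# p1 and praises its own otherwise-unreviewed product p2.
reviews = [("a", "p1", +1), ("b", "p1", +1), ("c", "p1", +1),
           ("spam", "p1", -1), ("spam", "p2", +1)]
honesty, _ = score_reviewers(reviews)
print(honesty["spam"] < honesty["a"])  # True
```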

While not strictly applied to social networks, TrustRank (Gyongyi et al., 2004) could also be considered as an example of a belief propagation approach to anomaly detection. In TrustRank, trustworthy pages are assumed to be unlikely to link to spam pages, and given an initially labelled set of trustworthy pages, trust is propagated through page out-links, so that those pages within k steps of the initially trusted pages are also labelled as trustworthy. By ranking pages according to the propagated trust score, spam pages, representing anomalies, can be detected.

Another approach to the detection of static labelled anomalies is the application of information theory, which provides a set of measures for quantifying the amount of information described by some event or observation. The most commonly used measure, entropy, describes the number of bits required to encode the information associated with a given event. Thus, given a set of observations O, the global entropy H(O) can be thought of as a measure of its degree of randomness or heterogeneity. A set containing a large number of different observations will require more bits to store the associated information and will therefore have a higher global entropy. In contrast, a totally homogeneous set will require relatively few bits to store the associated information.
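For concreteness, the global entropy of a set of observations can be computed as follows (a standard Shannon entropy calculation, not specific to any one of the surveyed papers):

```python
import math
from collections import Counter

def entropy(observations):
    """Shannon entropy (in bits) of a multiset of observations."""
    counts = Counter(observations)
    total = len(observations)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A homogeneous set needs no bits to describe; a set of four distinct
# values needs log2(4) = 2 bits per observation.
assert entropy(["a", "a", "a", "a"]) == 0.0
assert entropy(["a", "b", "c", "d"]) == 2.0
```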

In terms of anomaly detection, entropy can be used to identify those subgraphs that, if removed from the network, would lead to a significant change in entropy. These subgraphs may represent important communities or individuals (Shetty and Adibi, 2005; Tsugawa et al., 2010; Serin and Balcisoy, 2012), or may represent a normal pattern of behaviour that can then be used as a point of comparison, with those subgraphs that deviate from this norm considered to be anomalous (Noble and Cook, 2003; Eberle and Holder, 2007, 2009). In comparing subgraphs, an information theoretic approach considers a global context, using a very different representation of the network from that used in the other approaches described in this paper. In applying an information theoretic approach, interactions are represented as subgraphs, and values typically represented as labels on edges or vertices are themselves represented as vertices in these subgraphs.

Information theoretic approaches have been shown to be capable of detecting anomalous interactions based on complex structures and flows of information. For example, a version of this method was applied to the Enron email data set (Eberle and Holder, 2009), resulting in detection of a single instance of an email being sent from a director to a non-management employee, and then being forwarded to another non-management employee. Within the Enron data set many emails are sent from directors to non-management employees and between pairs of non-management employees, but this was the only instance involving a forwarded email from a director to a non-management employee. This degree of sensitivity is useful in problem domains consisting of adversarial systems where individuals actively seek to hide illegal or otherwise unwanted behaviour, and deviations from the norm are likely to be small (Eberle and Holder, 2009).

6.3. Dynamic unlabelled anomalies

Dynamic unlabelled anomalies arise when patterns of interaction change over time, such that the structure of a network in one time-step differs markedly from that in previous time-steps. As with static unlabelled anomalies, there are numerous graph properties that we may consider, each of which can be used to generate time-series that can be analysed using traditional tools. Indeed, we could go so far as to use a static analysis (e.g. the parameter of a fitted power-law relationship) as the subject of a time-series and consider how this value changes over time. The difficulty lies, of course, in selecting an appropriate feature, or set of features, that adequately captures the real-world behaviour of interest.

Within the literature we found examples of dynamic unlabelled network analysis using a variety of methods, including scan statistics (Priebe et al., 2005; Neil et al., 2011; Cheng and Dickinson, 2013; Marchette, 2012; Park et al., 2009; McCulloh and Carley, 2011), Bayesian inference (Heard et al., 2010), auto-regressive moving average (ARMA) models (Pincombe, 2005), and link prediction (Huang and Zeng, 2006). The use of these methods has mostly focused on individual nodes or ego-nets, considering the node degree or the size of the ego-net.

Link prediction attempts to predict future interactions between individuals based on previous interactions. In order to detect anomalies, a link prediction algorithm is run, and for each pair of individuals in the network, the likelihood of an interaction occurring in the next time-step is calculated. This predicted likelihood is then compared to the observed interactions, and those observed interactions having a low predicted likelihood are deemed anomalous (Huang and Zeng, 2006).
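As an illustrative sketch, using a simple common-neighbours predictor rather than the specific algorithm of Huang and Zeng (2006), observed interactions can be scored against a prediction derived from the previous snapshot:

```python
def anomalous_edges(adj_past, edges_now, min_score=1):
    """Return observed interactions whose predicted likelihood, here a
    simple common-neighbour count in the earlier snapshot, falls below
    min_score."""
    return [(u, v) for u, v in edges_now
            if len(adj_past.get(u, set()) & adj_past.get(v, set())) < min_score]

# Yesterday's network: two tight groups with no contact between them.
past = {1: {2, 3}, 2: {1, 3}, 3: {1, 2},
        4: {5, 6}, 5: {4, 6}, 6: {4, 5}}
# Today, 1-2 interact again (expected) and 3-4 interact (unexpected).
print(anomalous_edges(past, [(1, 2), (3, 4)]))  # [(3, 4)]
```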

Using scan statistics, analysis is performed by averaging a time-series over a moving window [t − τ, t] and comparing this average to the current time-step. If the difference is larger than some predefined threshold, the time-step is considered to be anomalous (Priebe et al., 2005; Neil et al., 2011; Cheng and Dickinson, 2013; Marchette, 2012; Park et al., 2009). The use of ARMA models involves fitting the model to a time-series and considering the residuals for this fit (see Pincombe (2005)). Those residuals above a specified cutoff are deemed to represent anomalous time-steps.
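A bare-bones version of this moving-window comparison might look like the following, where the series could be, for example, a node's degree or ego-net size at each time-step. The window size and threshold are arbitrary illustrative values:

```python
def scan_anomalies(series, window=5, threshold=2.0):
    """Flag time-steps whose value deviates from the mean of the
    preceding `window` observations by more than `threshold`."""
    flagged = []
    for t in range(window, len(series)):
        baseline = sum(series[t - window:t]) / window
        if abs(series[t] - baseline) > threshold:
            flagged.append(t)
    return flagged

# Weekly ego-net sizes for one user: stable around 10, then a burst.
sizes = [10, 11, 9, 10, 10, 10, 11, 25, 10, 9]
print(scan_anomalies(sizes))  # [7, 8, 9]: the burst at t=7, plus the
# steps whose moving-window baseline the burst inflates
```

A practical refinement would exclude flagged values from later baselines, so that a single burst does not also mark its neighbours as anomalous.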

In Bayesian analysis, the generated time-series is assumed to follow some distribution (i.e. each time-step represents a draw from the assumed distribution). Given an initial estimate of the distribution parameters, a series of updates can be performed in an iterative manner. For each time-step, the prior distribution is updated by considering the value at that time-step and calculating a posterior distribution that incorporates the information from all of the previous time-steps. Using this posterior distribution, a p-value is calculated, giving the probability that the value at the next time-step was drawn from this distribution. If this p-value is below a significance threshold, the value can be considered anomalous. The posterior distribution is then treated as the prior for the next update (Heard et al., 2010).
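The following sketch implements this sequential prior-update-test loop for the simple case of a Gaussian series with known observation variance (a standard conjugate setting; Heard et al. (2010) use a different model, so treat this purely as an illustration of the mechanism):

```python
import math

def sequential_anomalies(series, prior_mean=0.0, prior_var=100.0,
                         noise_var=1.0, alpha=0.01):
    """Flag values that are surprising under the posterior predictive
    distribution built from all earlier values. The latent mean has a
    Gaussian prior; after each observation the posterior becomes the
    prior for the next time-step."""
    mean, var = prior_mean, prior_var
    flagged = []
    for t, x in enumerate(series):
        # Two-sided Gaussian tail probability under the predictive
        # distribution N(mean, var + noise_var).
        z = abs(x - mean) / math.sqrt(var + noise_var)
        if math.erfc(z / math.sqrt(2)) < alpha:
            flagged.append(t)
        # Conjugate update of the posterior over the latent mean.
        prec = 1.0 / var + 1.0 / noise_var
        mean = (mean / var + x / noise_var) / prec
        var = 1.0 / prec
    return flagged

# A stable series (e.g. weekly node degree) with one abrupt jump.
series = [5.1, 4.9, 5.0, 5.2, 4.8, 12.0, 5.0]
print(sequential_anomalies(series))  # [5]
```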


The use of scan statistics is the most prominent form of analysis we found in the literature, and extensions to the basic form have been suggested that allow correlated changes in graph properties to be detected (Cheng and Dickinson, 2013). This enables inter-related vertices or subgraphs to be identified, and may be of benefit in detecting clandestine organisations. Another extension is the application to hypergraphs (Park et al., 2009), which are a generalisation of graphs allowing hyperedges that can connect more than two vertices. Such graphs can be used to represent interactions that involve more than two individuals, such as meetings, group emails and text messages, and co-authorship on research papers (Park et al., 2009; Silva and Willett, 2008).

Other researchers have considered larger structures such as maximal cliques (Chen et al., 2012) and coronets [2] (Ji et al., 2013), with a view to detecting anomalous evolution of a system at a neighbourhood scale. If we consider only the pattern of interactions between individuals, a maximal clique can evolve in only six different ways: by growing, shrinking, merging, splitting, appearing or vanishing (Chen et al., 2012). Similarly, a coronet can evolve only through the insertion or deletion of a node, or the insertion or deletion of a number of edges. If normal behaviour is assumed to result in stable, non-evolving neighbourhoods, then any neighbourhood that undergoes one of these transformations can be considered to be anomalous. For less stable systems, this idea could be extended to include consideration of the number of neighbourhoods that change over a time-step, or the magnitude of this change. In such scenarios, scan statistics, Bayesian inference or any other form of time-series analysis may be employed in order to detect anomalous time-steps.

In addition to the approaches discussed above, there is a large body of work that focuses on anomaly detection in time-series in general (see Cheboli (2010); Chandola (2009) for recent reviews). We believe that these existing methods could also be used to detect dynamic anomalies in social networks, assuming that suitable features can be identified for generating the required time-series. Moreover, substantial work has been undertaken in the field of change detection in social networks (e.g. McCulloh and Carley (2011); Arney and Arney (2013); Tambayong (2014)), and we believe that there is considerable room for cross-over between these two fields.

6.4. Dynamic labelled anomalies

In conducting our survey, we found dynamic labelled anomalies to be the least represented in the literature. A recent paper describes the application of a signal processing approach to anomaly detection in dynamic, labelled networks (Miller et al., 2013); however, this was the only example we found.

Extending the basic signal processing approach used for static, unlabelled anomalies (Miller et al., 2010b,a), detection of dynamic labelled anomalies assumes that the probability of an edge occurring between any two nodes is a function of a linear combination of node attributes (Miller et al., 2013).

[2] A coronet is a concept of neighbourhood, where the 'closeness' to a subject is defined in terms of the number of interactions that occur between individuals, rather than simple connectedness. See (Ji et al., 2013) for a formal definition. Note that the number of interactions between two individuals is often expressed as a weight, which could be considered a label on the edge. However, since multiple interactions can also be represented by multiple edges, we do not consider this to be a true label.

Coefficients for this linear

combination are fitted to the network of interest. The dynamic nature of the network is handled by considering the network structure in discrete time-steps, and treating each time-step as for a static network. Comparing the expected graph to the observed graph in each time-step gives a residual matrix that can be integrated over a specified time period using a filter (see Miller et al. (2013, 2011b) for details). Note that this method assumes the number of vertices remains constant over the specified time period, and that only the edges change.

In addition to the signal processing approach described above, we suggest that those methods used for detecting dynamic unlabelled anomalies could be adapted for the detection of dynamic labelled anomalies. We argue that this type of development will be extremely beneficial, as including the additional information provided by node and edge labels is likely to significantly improve anomaly detection, providing additional dimensions for separating normal and anomalous behaviour. For example, a recent study demonstrated that link prediction can be significantly improved by including topic information from users' tweets (Rowe et al., 2012). Obviously, if the algorithm used to predict links is improved, any anomaly detection using this algorithm will also be improved.

As with dynamic unlabelled anomalies, we also suggest that existing methods for time-series analysis can be applied to the detection of dynamic labelled anomalies. A number of methods for detecting anomalies within discrete sequences are discussed in (Chandola et al., 2012), and we believe that many of these methods could be applied to the detection of dynamic labelled anomalies in social networks.

7. Discussion

In this paper we have surveyed a small but growing number of solutions to detecting anomalies in online social networks. While these approaches differ in their treatment of the network, considering dynamic or static representations and including or ignoring node attributes, they are all based primarily on the analysis of interactions between individuals. It is this basis that differentiates the detection of anomalies in social networks from other forms of anomaly detection.

Note that while anomalies and networks can be characterised as static or dynamic, labelled or unlabelled, etc., the various approaches to anomaly detection can, of course, also be grouped by the types of algorithms employed and by the problem domain they were originally developed to address. In conducting our survey, we found it quite interesting that the partitioning of detection methods by algorithm type very nearly matches the partition induced by the authorship of related literature. During our review, we found very little crossover between the different research groups exploring this area, with each group tending to focus on a single or limited number of approaches. Certainly, we did not find clear lines of development, where a particular approach has evolved through incremental improvements over a large number of iterations. Clearly, this reflects the young nature of this field, and we anticipate that this state of affairs will change significantly over the next decade.

In considering how the various approaches presented in the literature apply to different types of network anomalies, we found that the methods used to actually detect anomalies, as opposed to calculating network features, are largely independent of the type of anomaly being detected. Once a suitable feature space has been identified and calculated, it is often the case that any number of traditional anomaly detection techniques could be applied. For example, in Section 6.1 (static unlabelled anomalies), we discuss the use of regression to detect anomalous ego-nets displaying an unusual edge count relative to node count (Akoglu et al., 2010). However, rather than applying regression, once the edge count - node count feature space is extracted, an alternative anomaly detection method, such as a nearest neighbour method, could just as easily have been applied. Therefore, we argue that the process of detecting anomalies in social networks is composed of two quite separable sub-processes; namely, the calculation of a suitable feature space, and the detection of anomalies within this space. Moreover, we argue that the selection of a suitable feature space can be further broken down so that, taking a high-level view of the problem, the vast majority of the approaches we reviewed can be generalised into the following five steps.

1. Determine the smallest unit affected by the behaviour of interest (node, neighbourhood, community, etc.).
2. Identify the particular properties of this unit that are expected to deviate from the norm.
3. Identify the context in which these deviations are expected to occur.
4. Calculate the properties of interest, extracting a feature space in which the distances between observations can be measured and compared in a meaningful way.
5. Within this space, calculate distances between observations, possibly applying traditional (non-network) tools for anomaly detection (e.g. clustering, nearest neighbour analysis, etc.).
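To make the separation between feature extraction (steps 1-4) and detection (step 5) concrete, the sketch below applies a generic nearest-neighbour detector to a hypothetical feature space of (log node count, log edge count) pairs; the feature values are invented for illustration:

```python
import math

def knn_outlier_scores(features, k=2):
    """Step 5: score each observation by the mean distance to its k
    nearest neighbours in the feature space; high scores suggest
    anomalies. Steps 1-4 are assumed to have produced `features`."""
    scores = {}
    for name, point in features.items():
        dists = sorted(math.dist(point, other)
                       for key, other in features.items() if key != name)
        scores[name] = sum(dists[:k]) / k
    return scores

# Hypothetical output of steps 1-4: one point per ego-net, in a
# (log node count, log edge count) feature space.
features = {
    "u1": (1.6, 2.1), "u2": (1.7, 2.2), "u3": (1.5, 2.0),
    "u4": (1.6, 2.2), "star": (3.0, 3.1),
}
scores = knn_outlier_scores(features)
print(max(scores, key=scores.get))  # star
```

Swapping the detector (clustering, density estimation, etc.) requires no change to the feature-extraction steps, which is precisely the separability argued for above.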

In each of the papers we surveyed we found something akin to the five steps given above. Obviously there is a major focus on step 4, as the naive approach to this step is often computationally expensive. Given the size of many online social networks, and the demand for tools to handle Big Data, we see the development of scalable anomaly detection solutions as being an extremely important area of research. However, we also see the selection of a suitable feature space (steps 1-3) as being equally important.

We believe that our steps 1-3, which encompass the mapping of real-world behaviours of interest to suitable

feature spaces, form a key challenge for detecting anomalies in social networks. In conducting our survey, few of the papers we reviewed explicitly described how the mapping between the behaviour of interest and the network properties considered was performed. Those few papers that did (e.g. Akoglu et al. (2013); Shrivastava et al. (2008); Friedland (2009); Gao et al. (2010); Ji et al. (2013)) clearly explained the combination of domain knowledge and mathematical reasoning that underpins the use of a particular approach and the examination of particular network features. For the majority of papers, though, the reasons for considering a particular set of features were unclear, and in many examples the real-world behaviour giving rise to any anomalies detected was discussed only in response to the anomaly actually being found, rather than as a motivation for using a particular approach. We believe that this is problematic, as many of the approaches we reviewed were tested only on a single, or a limited number of, data sets, and may therefore be biased towards the particular anomalies inherent in those data sets. Without explicit reasoning as to why a particular behaviour should be identifiable by a particular anomalous network feature, there is no reason to suspect that this feature will behave in a particular manner across different data sets representing different social networks.

The lack of papers clearly describing the reasons for examining a particular set of features suggests to us that selection of a suitable feature space may be extremely difficult in practice. Many properties of social networks are correlated in some fashion, and it is not clear which combinations of features can be used to capture orthogonal concepts. Some comparison of feature spaces has been undertaken (e.g. Hassanzadeh et al. (2012); McCulloh and Carley (2011)); however, such comparisons have been restricted in the number of features considered and the types of anomalies being detected. We predict that the requirements for anomaly detection in social networks will rapidly advance in the near future, with ever larger volumes of data and increasingly complex behaviours being considered. This may in turn lead to the consideration of increasingly complex feature spaces. Thus, the development of any guidelines or heuristics for mapping real-world behaviour to appropriate feature spaces will be of benefit.

One possibility that we see for identifying suitable feature spaces is the application of a shotgun approach. By applying multiple detection techniques across numerous graph properties, it may be possible to identify mechanistic links between the actual behaviours of interest and resulting changes in network properties, and to determine those features that best differentiate anomalous and normal behaviours.

Another major challenge for detecting anomalies in online social networks is the evaluation and comparison of different methods. Clearly, not all algorithms are universally applicable, and many of the existing approaches have been developed with specific problem domains and data formats in mind. Moreover, there is a distinct lack of publicly available data sets with known ground truths. In order to compare existing and novel techniques, it is important to consider how accuracy (recall) and precision are influenced by the scale of the analysis, the density of anomalies, and the magnitude of difference between anomalous and normal observations. However, the range of different data sets required to perform such comparisons is not currently available. Thus, many approaches are tested on a single data set, with verification of results performed for the top-ranked anomalies using manual investigation. For example, Pandit et al. (2007) analyse transactions in an online auction system, and in order to test their approach, results were ranked according to an anomaly score. The top-ranked user accounts were then manually investigated by checking the auction website for complaints made against these accounts. This type of manual investigation is highly time-consuming and highly dependent on the level of reporting by the owners of the target system. In many cases, manual investigation of user accounts in this way is impossible.

Some data sets do exist that are relatively well characterised, including the Enron email data set and, to a lesser extent, the Paraiso communication data set (synthetic, VAST 2008, www.cs.umd.edu/hcil/VASTchallenge08/). Anomalies detected within these data sets can be easily verified against the known series of events. However, these data sets are relatively small, and may only be suitable for a small subset of problem domains. Therefore, in order to address this challenge, we believe that the generation of synthetic data sets is a logical course of action.

A number of methods for generating random social networks have been proposed, ranging from simple models assuming some global probability that a given pair of nodes will be connected, to more complex models considering the dynamics of each individual node (Menges et al., 2008). Between these two extremes lies a broad spectrum of models that have been developed over time and thus reflect our growing understanding of the general properties of social networks (e.g. high transitivity, positive assortativity, presence of community structures). Consequently, early models (reviewed by Chakrabarti and Faloutsos (2006), see also Sallaberry et al. (2013)) capture only a small number of what we now consider to be general properties of social networks (Akoglu and Faloutsos, 2009), while more recent models (e.g. Akoglu and Faloutsos (2009); Sallaberry et al. (2013)) are able to produce far more realistic representations of a wide range of social networks. As an example, Akoglu and Faloutsos (2009) describe eleven properties of social networks which they suggest should be reproduced by random network generators, and provide a suitable method for generation that does indeed reproduce these properties. This method is reported to be highly flexible, and can be used to generate weighted or unweighted, directed or undirected, and unipartite or bipartite networks. The method has since been applied in the testing of a novel anomaly detection method, being used to generate a bipartite network representing user reviews of online products in an e-commerce setting (Akoglu et al., 2013) (a method for generating these types of networks is also described as part of the Berlin SPARQL Benchmark, http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark, last accessed May 2014).

The vast majority of network generation methods proposed to date can be described as phenomenological in nature. While these methods are able to reproduce many of the known properties of social networks, they do so without describing the underlying mechanisms that cause these properties to arise. In contrast, agent-based models (alternatively, individual-based models) such as that proposed by Menges et al. (2008) (see also Glasser and Lindauer (2013); Hummel et al. (2012)) describe the specific actions taken by individuals in the underlying system that lead to the formation of a social network. In this type of simulation, the macro-level properties of the resulting social network can be seen as emergent properties stemming from the micro-level behaviour of the underlying agent-based system. The model does not explicitly set out to reproduce these properties, but rather to represent the behaviour of the individuals involved. In the course of representing this behaviour, the model reproduces a realistic social network. We believe that this approach has great potential for generating realistic data sets and represents an extremely interesting area of research, particularly in adversarial problem domains where anomalous agents could attempt to disguise their behaviour by mimicking 'normal' agents.
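The emergent character of agent-based generation can be conveyed by a toy sketch (our own illustration, not the Menges et al. model; all parameter names are ours): each agent repeatedly sends a message, usually to a prior contact and occasionally to a stranger, and the network is simply the by-product of these micro-level choices.

```python
import random
from collections import defaultdict

def simulate_agents(n_agents=50, steps=2000, p_new=0.2, seed=1):
    """Toy agent-based sketch: agents mostly reinforce existing ties
    and occasionally reach out to strangers; the edge set is an
    emergent by-product of these individual decisions."""
    rng = random.Random(seed)
    contacts = defaultdict(list)  # sender -> list of past recipients
    edges = set()
    for _ in range(steps):
        sender = rng.randrange(n_agents)
        known = contacts[sender]
        if known and rng.random() > p_new:
            # Reinforce an old tie; re-appending the recipient makes
            # frequently contacted agents ever more likely choices.
            receiver = rng.choice(known)
        else:
            # Reach out to a stranger chosen uniformly at random.
            receiver = rng.randrange(n_agents)
        if receiver != sender:
            contacts[sender].append(receiver)
            edges.add((sender, receiver))
    return edges

net = simulate_agents()
```

No macro-level property is coded directly, yet the reinforcement rule alone skews the contact distribution; richer behavioural rules yield correspondingly richer emergent structure.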

7. Conclusion

In this paper we have surveyed a number of approaches for detecting anomalies in online social networks. We found that these different approaches can be usefully categorised based on a characterisation of anomalies as being static or dynamic and labelled or unlabelled. Depending on this characterisation, different features of the network may be examined, and we have suggested that the selection of appropriate features is a highly non-trivial task. We believe that as the requirements of social network analysis continue to grow, the challenges inherent in Big Data analysis will necessitate sophisticated heuristics and algorithms for mapping complex behaviours to network features, and scalable solutions for calculating rich feature spaces which can be subjected to anomaly detection analysis.

8. References

Akoglu, L., Chandy, R., Faloutsos, C., 2013. Opinion fraud detection in online reviews by network effects. In: Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media.
Akoglu, L., Faloutsos, C., 2009. RTG: a recursive realistic graph generator using random typing. Data Mining and Knowledge Discovery 19 (2), 194–209.
Akoglu, L., Faloutsos, C., 2013. Anomaly, event, and fraud detection in large network datasets. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. ACM, pp. 773–774.
Akoglu, L., McGlohon, M., Faloutsos, C., 2010. OddBall: Spotting anomalies in weighted graphs. In: Advances in Knowledge Discovery and Data Mining. Springer, pp. 410–421.
Arney, D. C., Arney, K., 2013. Modeling insurgency, counter-insurgency, and coalition strategies and operations. The Journal of Defense Modeling and Simulation: Applications, Methodology, Technology 10 (1), 57–73.
Barnett, V., Lewis, T., 1994. Outliers in Statistical Data, 3rd Edition. John Wiley & Sons.
Bird, C., Pattison, D., D'Souza, R., Filkov, V., Devanbu, P., 2008. Latent social structure in open source projects. In: Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, pp. 24–35.
Bolton, R. J., Hand, D. J., 2002. Statistical fraud detection: A review. Statistical Science, 235–249.
Brandes, U., Robins, G., McCranie, A., Wasserman, S., 2013. What is network science? Network Science 1 (1), 1–15.
Chakrabarti, D., Faloutsos, C., 2006. Graph mining: Laws, generators, and algorithms. ACM Computing Surveys (CSUR) 38 (1), 2.
Chakrabarti, D., Zhan, Y., Faloutsos, C., 2004. R-MAT: A recursive model for graph mining. Computer Science Department, 541.
Chandola, V., 2009. Anomaly detection for symbolic sequences and time series data. Ph.D. thesis, University of Minnesota.
Chandola, V., Banerjee, A., Kumar, V., 2009. Anomaly detection: A survey. ACM Computing Surveys (CSUR) 41 (3), 15.
Chandola, V., Banerjee, A., Kumar, V., 2012. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering 24 (5), 823–839.
Chau, D. H., Pandit, S., Faloutsos, C., 2006. Detecting fraudulent personalities in networks of online auctioneers. In: Knowledge Discovery in Databases: PKDD 2006. Springer, pp. 103–114.
Cheboli, D., 2010. Anomaly detection of time series. Ph.D. thesis, University of Minnesota.
Chen, Z., Hendrix, W., Samatova, N. F., 2012. Community-based anomaly detection in evolutionary networks. Journal of Intelligent Information Systems 39 (1), 59–85.
Cheng, A., Dickinson, P., 2013. Using scan-statistical correlations for network change analysis. In: Trends and Applications in Knowledge Discovery and Data Mining. Springer, pp. 1–13.
Eberle, W., Holder, L., 2009. Applying graph-based anomaly detection approaches to the discovery of insider threats. In: Intelligence and Security Informatics, 2009. ISI'09. IEEE International Conference on. IEEE, pp. 206–208.
Eberle, W., Holder, L. B., 2007. Mining for structural anomalies in graph-based data. In: DMIN. pp. 376–389.
Fire, M., Katz, G., Elovici, Y., 2012. Strangers intrusion detection - detecting spammers and fake profiles in social networks based on topology anomalies. Human Journal 1 (1), 26–39.
Friedland, L., 2009. Anomaly detection for inferring social structure.
Gao, J., Du, N., Fan, W., Turaga, D., Parthasarathy, S., Han, J., 2013. A multi-graph spectral framework for mining multi-source anomalies. In: Graph Embedding for Pattern Analysis. Springer, pp. 205–227.
Gao, J., Liang, F., Fan, W., Wang, C., Sun, Y., Han, J., 2010. On community outliers and their efficient detection in information networks. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 813–822.
García-Teodoro, P., Díaz-Verdejo, J., Maciá-Fernández, G., Vázquez, E., 2009. Anomaly-based network intrusion detection: Techniques, systems and challenges. Computers & Security 28 (1-2), 18–28.
Getoor, L., Diehl, C. P., 2005. Link mining: a survey. ACM SIGKDD Explorations Newsletter 7 (2), 3–12.
Glasser, J., Lindauer, B., 2013. Bridging the gap: A pragmatic approach to generating insider threat data. In: Security and Privacy Workshops (SPW), 2013 IEEE. IEEE, pp. 98–104.
Gogoi, P., Bhattacharyya, D. K., Borah, B., Kalita, J. K., 2011. A survey of outlier detection methods in network anomaly identification. The Computer Journal 54 (4), 570–588.

Gupta, M., Gao, J., Sun, Y., Han, J., 2012. Community trend outlier detection using soft temporal pattern mining. Springer, pp. 692–708.
Gyöngyi, Z., Garcia-Molina, H., Pedersen, J., 2004. Combating web spam with TrustRank. In: Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30. VLDB Endowment, pp. 576–587.
Hassanzadeh, R., Nayak, R., Stebila, D., 2012. Analyzing the effectiveness of graph metrics for anomaly detection in online social networks. Lecture Notes in Computer Science: Web Information Systems Engineering 7651, 624–630.
Heard, N. A., Weston, D. J., Platanioti, K., Hand, D. J., 2010. Bayesian anomaly detection methods for social networks. The Annals of Applied Statistics 4 (2), 645–662.
Hodge, V. J., Austin, J., 2004. A survey of outlier detection methodologies. Artificial Intelligence Review 22 (2), 85–126.
Huang, Z., Zeng, D. D., 2006. A link prediction approach to anomalous email detection. In: Systems, Man and Cybernetics, 2006. SMC'06. IEEE International Conference on. Vol. 2. IEEE, pp. 1131–1136.
Hummel, A., Kern, H., Kuhne, S., Dohlernd, A., 2012. An agent-based simulation of viral marketing effects in social networks. In: 26th European Simulation and Modelling Conference - ESM'2012. pp. 212–219.
Janakiram, D., Adi Mallikarjuna Reddy, V., Phani Kumar, A. V. U., 2006. Outlier detection in wireless sensor networks using Bayesian belief networks. In: Communication System Software and Middleware, 2006. Comsware 2006. First International Conference on. IEEE, pp. 1–6.
Ji, T., Gao, J., Yang, D., 2012. A scalable algorithm for detecting community outliers in social networks. Springer, pp. 434–445.
Ji, T., Yang, D., Gao, J., 2013. Incremental local evolutionary outlier detection for dynamic social networks. Springer, pp. 1–15.
Jyothsna, V., Prasad, V. V. R., Prasad, K. M., 2011. A review of anomaly based intrusion detection systems. International Journal of Computer Applications (0975–8887) Volume.
Kim, H., Atalay, R., Gelenbe, E., 2012. G-network modelling based abnormal pathway detection in gene regulatory networks. In: Computer and Information Sciences II. Springer, pp. 257–263.
Kim, H., Gelenbe, E., 2009. Anomaly detection in gene expression via stochastic models of gene regulatory networks. BMC Genomics 10 (Suppl 3), S26.
Krebs, V. E., 2002. Mapping networks of terrorist cells. Connections 24 (3), 43–52.
Malm, A., Bichler, G., 2011. Networks of collaborating criminals: Assessing the structural vulnerability of drug markets. Journal of Research in Crime and Delinquency 48 (2), 271–297.
Marchette, D., 2012. Scan statistics on graphs. Wiley Interdisciplinary Reviews: Computational Statistics 4 (5), 466–473.
Markou, M., Singh, S., 2003a. Novelty detection: a review - part 1: statistical approaches. Signal Processing 83 (12), 2481–2497.
Markou, M., Singh, S., 2003b. Novelty detection: a review - part 2: neural network based approaches. Signal Processing 83 (12), 2499–2521.
McCulloh, I., Carley, K. M., 2011. Detecting change in longitudinal social networks. Tech. rep., DTIC Document.
Menges, F., Mishra, B., Narzisi, G., 2008. Modeling and simulation of e-mail social networks: a new stochastic agent-based approach. In: Proceedings of the 40th Conference on Winter Simulation. Winter Simulation Conference, pp. 2792–2800.
Miller, B., Bliss, N., Wolfe, P. J., 2010a. Subgraph detection using eigenvector L1 norms. In: Advances in Neural Information Processing Systems. pp. 1633–1641.
Miller, B. A., Arcolano, N., Bliss, N. T., 2013. Efficient anomaly detection in dynamic, attributed graphs. In: Intelligence and Security Informatics. pp. 179–184.
Miller, B. A., Beard, M. S., Bliss, N. T., 2011a. Eigenspace analysis for threat detection in social networks. In: Information Fusion (FUSION), 2011 Proceedings of the 14th International Conference on. IEEE, pp. 1–7.
Miller, B. A., Beard, M. S., Bliss, N. T., 2011b. Matched filtering for subgraph detection in dynamic networks. In: Statistical Signal Processing Workshop (SSP), 2011 IEEE. IEEE, pp. 509–512.
Miller, B. A., Bliss, N. T., Wolfe, P. J., 2010b. Toward signal processing theory for graphs and non-Euclidean data. In: Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, pp. 5414–5417.
Müller, E., Sánchez, P. I., Mülle, Y., Böhm, K., 2013. Ranking outlier nodes in subspaces of attributed graphs. In: Data Engineering Workshops (ICDEW), 2013 IEEE 29th International Conference on. IEEE, pp. 216–222.
Neil, J., Storlie, C., Hash, C., Brugh, A., Fisk, M., 2011. Scan statistics for the online detection of locally anomalous subgraphs. Ph.D. thesis, University of New Mexico.
Newman, M. E., 2006. Finding community structure in networks using the eigenvectors of matrices. Physical Review E 74 (3), 036104.
Newman, M. E., Park, J., 2003. Why social networks are different from other types of networks. Physical Review E 68 (3), 036122.
Newman, M. E. J., Strogatz, S. H., Watts, D. J., 2001. Random graphs with arbitrary degree distributions and their applications. Physical Review E 64 (2).
Noble, C. C., Cook, D. J., 2003. Graph-based anomaly detection. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 631–636.
Pandit, S., Chau, D. H., Wang, S., Faloutsos, C., 2007. NetProbe: a fast and scalable system for fraud detection in online auction networks. In: Proceedings of the 16th International Conference on World Wide Web. ACM, pp. 201–210.
Papadimitriou, P., Dasdan, A., Garcia-Molina, H., 2010. Web graph similarity for anomaly detection. Journal of Internet Services and Applications 1 (1), 19–30.
Park, Y., Priebe, C., Marchette, D., Youssef, A., 2009. Anomaly detection using scan statistics on time series hypergraphs. In: SDM.
Patcha, A., Park, J.-M., 2007. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Computer Networks 51 (12), 3448–3470.
Phua, C., Lee, V., Smith, K., Gayler, R., 2010. A comprehensive survey of data mining-based fraud detection research. arXiv preprint arXiv:1009.6119.
Pincombe, B., 2005. Anomaly detection in time series of graphs using ARMA processes. ASOR Bulletin 24 (4), 2.
Priebe, C. E., Conroy, J. M., Marchette, D. J., Park, Y., 2005. Scan statistics on Enron graphs. Computational & Mathematical Organization Theory 11 (3), 229–247.
Reid, E., Qin, J., Zhou, Y., Lai, G., Sageman, M., Weimann, G., Chen, H., 2005. Collecting and analyzing the presence of terrorists on the web: A case study of jihad websites. In: Intelligence and Security Informatics. Springer, pp. 402–411.
Rowe, M., Stankovic, M., Alani, H., 2012. Who will follow whom? Exploiting semantics for link prediction in attention-information networks. In: The Semantic Web - ISWC 2012. Springer, pp. 476–491.
Sallaberry, A., Zaidi, F., Melançon, G., 2013. Model for generating artificial social networks having community structures with small-world and scale-free properties. Social Network Analysis and Mining 3 (3), 597–609.
Serin, E., Balcisoy, S., 2012. Entropy based sensitivity analysis and visualization of social networks. In: Advances in Social Networks Analysis and Mining (ASONAM), 2012 IEEE/ACM International Conference on. IEEE, pp. 1099–1104.
Shekhar, S., Lu, C.-T., Zhang, P., 2001. Detecting graph-based spatial outliers: algorithms and applications (a summary of results). In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 371–376.
Shetty, J., Adibi, J., 2005. Discovering important nodes through graph entropy: the case of Enron email database. In: Proceedings of the 3rd International Workshop on Link Discovery. ACM, pp. 74–81.
Shrivastava, N., Majumder, A., Rastogi, R., 2008. Mining (social) network graphs to detect random link attacks. In: Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on. IEEE, pp. 486–495.

Silva, J., Willett, R., 2008. Detection of anomalous meetings in a social network. In: Information Sciences and Systems, 2008. CISS 2008. 42nd Annual Conference on. IEEE, pp. 636–641.
Tambayong, L., 2014. Change detection in dynamic political networks: The case of Sudan. In: Theories and Simulations of Complex Social Systems. Springer, pp. 43–59.
Tsugawa, S., Ohsaki, H., Imase, M., 2010. Inferring success of online development communities: Application of graph entropy for quantifying leaders' involvement. In: Information and Telecommunication Technologies (APSITT), 2010 8th Asia-Pacific Symposium on. IEEE, pp. 1–6.
Šubelj, L., Furlan, Š., Bajec, M., 2011. An expert system for detecting automobile insurance fraud using social network analysis. Expert Systems with Applications 38 (1), 1039–1052.
Welser, H. T., Cosley, D., Kossinets, G., Lin, A., Dokshin, F., Gay, G., Smith, M., 2011. Finding social roles in Wikipedia. In: Proceedings of the 2011 iConference. ACM, pp. 122–129.
Zhang, Y., Meratnia, N., Havinga, P., 2010. Outlier detection techniques for wireless sensor networks: A survey. IEEE Communications Surveys & Tutorials 12 (2), 159–170.