Detecting Cyber Security Threats in Weblogs
Using Probabilistic Models
Flora S. Tsai and Kap Luk Chan
School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore, 639798
Abstract. Organizations and governments are becoming vulnerable to a wide variety of security breaches against their information infrastructure. The magnitude of this threat is evident from the increasing rate of cyber attacks against computers and critical infrastructure. Weblogs, or blogs, have also rapidly gained in numbers over the past decade. Weblogs may provide up-to-date information on the prevalence and distribution of various cyber security threats as well as terrorism events. In this paper, we analyze weblog posts for various categories of cyber security threats related to the detection of cyber attacks, cyber crime, and terrorism. Existing studies on intelligence analysis have focused on analyzing news or forums for cyber security incidents, but few have looked at weblogs. We use probabilistic latent semantic analysis to detect keywords from cyber security weblogs with respect to certain topics. We then demonstrate how this method can present the blogosphere in terms of topics with measurable keywords, hence tracking popular conversations and topics in the blogosphere. By applying a probabilistic approach, we can improve information retrieval in weblog search and keywords detection, and provide an analytical foundation for the future of security intelligence analysis of weblogs.
Keywords: cyber security, weblog, blog, probabilistic latent semantic
analysis, cyber crime, cyber terrorism, data mining.
1 Introduction
Cyber security is defined as the intersection of computer, network, and information security issues which directly affect the national security infrastructure [15]. Cyber security problems are frequent, serious, and global in nature. The number of cyber attacks by persons and malicious software is increasing rapidly. Many cyber criminals or hackers may post their ongoing achievements in weblogs, or blogs, which are websites where entries are made in a reverse chronological order. In addition, weblogs may provide up-to-date information on the prevalence and distribution of various cyber security incidents and threats.
Weblogs range in scope from individual diaries to arms of political campaigns, media programs, and corporations. Weblogs' explosive growth is generating large volumes of raw data and is considered by many industry watchers one of the top ten industry trends [3]. Blogosphere is the collective term encompassing all blogs as a community or social network. Because of the huge volume of existing weblog posts and their free format nature, information in the blogosphere is rather random and chaotic, but immensely valuable in the right context. Weblogs can thus potentially contain usable and measurable information related to cyber security threats, such as malware, viruses, cyber blackmail, and other cyber crime.

C.C. Yang et al. (Eds.): PAISI 2007, LNCS 4430, pp. 46–57, 2007. © Springer-Verlag Berlin Heidelberg 2007
With the rapid growth of blogs on the web, the blogosphere has a considerable effect on the media. Studies on the blogosphere include measuring the influence of the blogosphere [6], analyzing blog threads to discover the important bloggers [11], determining the spatiotemporal theme pattern on blogs [10], focusing on the topic-centric view of the blogosphere [1], detecting the blogs' growing trends [7], tracking the propagation of discussion topics in the blogosphere [8], and searching and detecting topics in corporate blogs [16].
Existing studies have focused on analyzing forums and news articles for cyber threats [12,18,19], but few have looked at weblogs. In this paper, we focus on analyzing cyber security weblogs, which are blogs providing commentary or analysis of cyber security threats and incidents.
In our work, we analyzed various weblog posts to detect the keywords of various topics of the blog entries, hence tracking the trends and topics of conversations in the blogosphere. Probabilistic Latent Semantic Analysis (PLSA) was used to detect the keywords from various cyber security blog entries with respect to certain topics. By using PLSA, we can present the blogosphere in terms of topics with measurable keywords.
The paper is organized as follows. Section 2 reviews the related work on intelligence analysis and extraction of useful information from weblogs. Section 3 gives an overview of latent semantic models such as Latent Semantic Analysis and the Probabilistic Latent Semantic Analysis model for mining weblog-related topics. Section 4 presents experimental results, and Section 5 concludes the paper.
2 Review of Related Work
This section reviews related work in intelligence analysis and extraction of useful information from weblogs.
2.1 Intelligence Analysis
Intelligence analysis is the process of producing formal descriptions of situations and entities of strategic importance [17]. Although its practice is found in its purest form inside intelligence agencies, such as the CIA in the United States or MI6 in the UK, its methods are also applicable in fields such as business intelligence or competitive intelligence.
Recent works related to security intelligence analysis include using entity recognizers to extract names of people, organizations, and locations from news articles, and applying probabilistic topic models to learn the latent structure behind the named entities and other words [12]. Another study analyzed the evolution of terror attack incidents from online news articles using techniques related to temporal and event relationship mining [18]. In addition, Support Vector Machines were used for improving document classification for the insider threat problem within the intelligence community by analyzing a collection of documents from the Center for Nonproliferation Studies (CNS) related to weapons of mass destruction [19]. These studies illustrate the growing need for security intelligence analysis, and the usage of machine learning and information retrieval techniques to provide such analysis. However, much work has yet to be done in obtaining intelligence information from the vast collection of weblogs that exist throughout the world.
2.2 Information Extraction from Weblogs
Current weblog text analysis focuses on extracting useful information from weblog entry collections, and determining certain trends in the blogosphere. NLP (Natural Language Processing) algorithms have been used to determine the most important keywords and proper names within a certain time period from thousands of active weblogs, which can automatically discover trends across blogs, as well as detect key persons, phrases and paragraphs [7]. A study on the propagation of discussion topics through the social network in the blogosphere developed algorithms to detect the long-term and short-term topics and keywords, which were then validated with real weblog entry collections [8]. In evaluating suitable methods of ranking term significance in an evolving RSS feed corpus, three statistical feature selection methods were implemented: chi-squared (χ²), Mutual Information (MI), and Information Gain (I). The conclusion was that the χ² method seems to be the best of the three, but a full human classification exercise would be required to evaluate it further [14]. A probabilistic approach based on PLSA was proposed in [10] to extract common themes from blogs, and also to generate the theme life cycle for each given location and the theme snapshots for each given time period. PLSA has also been previously used for weblog search and mining of corporate blogs [16].
Our work differs from existing studies in two respects: (1) we focus on cyber security weblog entries, which have not been studied before in the context of intelligence analysis; (2) we use probabilistic models to extract popular keywords for each topic in order to detect themes and trends in cyber threats and terrorism events.
3 Latent Semantic Models
This section reviews the latent semantic models used for this work, which involve latent semantic analysis and an extension of probabilistic latent semantic analysis for topic detection in weblogs.
3.1 Latent Semantic Analysis
Latent Semantic Analysis (LSA) [4] is a well-known technique for information retrieval and document classification. LSA solves two fundamental problems in natural language processing, synonymy and polysemy:

- In synonymy, different words may have the same meaning. Thus, a person issuing a query in a search engine may use a different word from what appears in a document, and may not retrieve the document.
- In polysemy, the same word can have multiple meanings, so a searcher can get unwanted documents with the alternate meanings.
LSA solves the problem of lexical matching methods by using statistically derived conceptual indices instead of individual words for retrieval [2]. LSA uses a term-document matrix (TDM) which describes patterns of term (word) distribution across a set of documents.
LSA then finds a low-rank approximation which is smaller and less noisy than the original term-document matrix. The downsizing of the matrix is achieved through the use of singular value decomposition (SVD), where the set of all the terms is then represented by a vector space of lower dimensionality than the total number of terms in the vocabulary. The consequence of the rank lowering is that some dimensions get merged.
In LSA, each element of the $n \times m$ term-document matrix reflects the occurrence of a particular word in a particular document, i.e.,

$$A = [a_{ij}], \tag{1}$$

where $a_{ij}$ is the number of times (the frequency) that term $i$ appears in document $j$. As each word will not usually appear in every document, the matrix $A$ is typically sparse with rarely any noticeable nonzero structure [2].
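As an illustration of Equation (1), the following minimal sketch builds a small term-document count matrix in Python. The three toy documents and the helper name `term_document_matrix` are assumptions made for this example and are not part of the original experiments.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (vocab, A) where A[i][j] is the count of term i in document j."""
    counts = [Counter(doc.split()) for doc in docs]   # one counter per document
    vocab = sorted(set().union(*counts))              # all distinct terms, sorted
    A = [[c[term] for c in counts] for term in vocab] # rows = terms, columns = documents
    return vocab, A

# Hypothetical, already-preprocessed documents.
docs = [
    "spyware malware software",
    "malware software software",
    "iraq war",
]

vocab, A = term_document_matrix(docs)
for term, row in zip(vocab, A):
    print(f"{term:10s} {row}")
```

Most entries are zero even in this tiny case, which is the sparsity noted above.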
The matrix $A$ is then factored into the product of three matrices using SVD. Given a matrix $A$, where $\mathrm{rank}(A) = r$, the SVD of $A$ is defined as:

$$A = U S V^{T}. \tag{2}$$

The columns of $U$ and $V$ are referred to as the left and right singular vectors, respectively, and the singular values of $A$ are the diagonal elements of $S$, or the nonnegative square roots of the $n$ eigenvalues of $AA^{T}$.
As defined by Equation (2), the SVD is used to represent the original relationships among terms and documents as sets of linearly independent vectors. By performing truncated SVD using the k largest singular values and corresponding singular vectors, the original TDM can be reduced to a smaller collection of vectors in k-space for conceptual query processing [2].
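The truncation step can be sketched with NumPy as follows. This is only an illustrative example under stated assumptions: the 5 x 5 toy matrix and term list are invented, and `numpy.linalg.svd` stands in for whatever SVD routine was actually used in the experiments.

```python
import numpy as np

# Hypothetical term-document matrix (rows = terms, columns = documents).
terms = ["spyware", "malware", "software", "iraq", "war"]
A = np.array([
    [1, 0, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 2, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

# Full (thin) SVD: A = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and their singular vectors.
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
A_k = U_k @ np.diag(s_k) @ Vt_k          # rank-k approximation of A

# Coordinates of each term in k-space (rows of U_k scaled by the singular values).
term_coords = U_k * s_k
for term, xy in zip(terms, term_coords):
    print(f"{term:10s} {np.round(xy, 3)}")
```

With k=2, the rows of `term_coords` are the kind of two-dimensional term positions that a similarity plot would display.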
3.2 Probabilistic Latent Semantic Analysis for Weblog Mining
Probabilistic Latent Semantic Analysis (PLSA) [9] is based on a generative probabilistic model that stems from a statistical approach to LSA [4]. PLSA is able to capture the polysemy and synonymy in text for applications in the information retrieval domain. Similar to LSA, PLSA uses a term-document matrix which describes patterns of term (word) distribution across a set of documents (blog entries). By implementing PLSA, topics are generated from the blog entries, where each topic produces a list of word usage. The model is fitted by maximum likelihood estimation using the expectation maximization (EM) algorithm.
The starting point for PLSA is the aspect model [9]. The aspect model is a latent variable model for co-occurrence data associating an unobserved class variable $z_k \in \{z_1, \ldots, z_K\}$ with each observation, an observation being the occurrence of a keyword in a particular blog entry. There are three probabilities used in PLSA:

1. $P(b_i)$ denotes the probability that a keyword occurrence will be observed in a particular blog entry $b_i$,
2. $P(w_j \mid z_k)$ denotes the class-conditional probability of a specific keyword conditioned on the unobserved class variable $z_k$,
3. $P(z_k \mid b_i)$ denotes a blog-specific probability distribution over the latent variable space.
In the collection, the probability of each blog and the probability of each keyword are known, while the probability of an aspect given a blog and the probability of a keyword given an aspect are unknown. Using the above three probabilities, the generative process consists of three steps:

1. select a blog entry $b_i$ with probability $P(b_i)$,
2. pick a latent class $z_k$ with probability $P(z_k \mid b_i)$,
3. generate a keyword $w_j$ with probability $P(w_j \mid z_k)$.
As a result, a joint probability model is obtained in asymmetric parameterization:

$$P(b_i, w_j) = P(b_i)\,P(w_j \mid b_i), \tag{3}$$

$$P(w_j \mid b_i) = \sum_{k=1}^{K} P(w_j \mid z_k)\,P(z_k \mid b_i). \tag{4}$$
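Equations (3) and (4) amount to a matrix product followed by a row-wise scaling. The sketch below evaluates them for hypothetical parameter values (K = 2 topics, M = 3 keywords, N = 2 blog entries); all numbers and array names are assumptions chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical model parameters.
P_w_given_z = np.array([[0.7, 0.2, 0.1],    # P(w_j | z_1)
                        [0.1, 0.3, 0.6]])   # P(w_j | z_2)
P_z_given_b = np.array([[0.9, 0.1],         # P(z_k | b_1)
                        [0.2, 0.8]])        # P(z_k | b_2)
P_b = np.array([0.5, 0.5])                  # P(b_i)

# Equation (4): mix the topic-conditional keyword distributions for each blog entry.
P_w_given_b = P_z_given_b @ P_w_given_z     # shape (N, M)

# Equation (3): joint probability of blog entry and keyword.
P_bw = P_b[:, None] * P_w_given_b           # shape (N, M)

print(P_w_given_b)
print(P_bw.sum())
```

Each row of `P_w_given_b` is a proper distribution over keywords, and the joint table sums to one, which is a quick sanity check on any fitted model.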
After the aspect model is specified, it is fitted using the EM algorithm. The EM algorithm involves two steps, namely the expectation (E) step and the maximization (M) step. The E-step computes the posterior probability of the latent variable by applying Bayes' formula:

$$P(z_k \mid b_i, w_j) = \frac{P(w_j \mid z_k)\,P(z_k \mid b_i)}{\sum_{l=1}^{K} P(w_j \mid z_l)\,P(z_l \mid b_i)}. \tag{5}$$
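The M-step updates the parameters based on the expected complete-data log-likelihood, which depends on the posterior probability computed in the E-step. Hence the M-step re-estimates the following two probabilities:

$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{N} n(b_i, w_j)\,P(z_k \mid b_i, w_j)}{\sum_{m=1}^{M} \sum_{i=1}^{N} n(b_i, w_m)\,P(z_k \mid b_i, w_m)}, \tag{6}$$

$$P(z_k \mid b_i) = \frac{\sum_{j=1}^{M} n(b_i, w_j)\,P(z_k \mid b_i, w_j)}{n(b_i)}, \tag{7}$$

where $n(b_i, w_j)$ is the number of times keyword $w_j$ occurs in blog entry $b_i$, and $n(b_i)$ is the total number of keyword occurrences in $b_i$.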
The EM iteration is continued to increase the likelihood function until the specified conditions are met and the program is terminated. These conditions can be a convergence condition or a cut-off point, which is specified for reaching a local maximum rather than a global maximum.

In short, the PLSA model selects the model parameter values that maximize the probability of the observed data, and returns the relevant probability distributions by applying the EM algorithm. Word usage analysis is a common application of the aspect model. Based on the pre-processed term-document matrix, the blogs are then classified into different aspects or topics. For each aspect, the keyword usage, such as the probable words in the class-conditional distribution $P(w_j \mid z_k)$, is determined. Empirical results indicate the advantages of PLSA in reducing perplexity, and high performance of precision and recall in information retrieval [9].
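The E-step and M-step of Equations (5) to (7) can be sketched in NumPy as below. This is a minimal illustration, not the implementation used in the paper: the function name `plsa`, the random initialization, the stopping tolerance, and the toy count matrix are all assumptions. The tracked log-likelihood omits the $P(b_i)$ term, which does not depend on the parameters being re-estimated, and every blog entry is assumed to contain at least one keyword.

```python
import numpy as np

def plsa(counts, K, n_iter=100, tol=1e-6, seed=0):
    """Fit the aspect model by EM.

    counts[i, j] = n(b_i, w_j), shape (N, M).
    Returns P(w|z) with shape (K, M), P(z|b) with shape (N, K),
    and the log-likelihood recorded at each iteration.
    """
    rng = np.random.default_rng(seed)
    N, M = counts.shape
    # Random, row-normalized starting points (an assumed initialization).
    P_w_z = rng.random((K, M))
    P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    P_z_b = rng.random((N, K))
    P_z_b /= P_z_b.sum(axis=1, keepdims=True)

    history = []
    for _ in range(n_iter):
        # E-step, Eq. (5): posterior[i, j, k] = P(z_k | b_i, w_j).
        joint = P_z_b[:, None, :] * P_w_z.T[None, :, :]     # (N, M, K)
        mix = joint.sum(axis=2, keepdims=True)              # P(w_j | b_i), Eq. (4)
        posterior = joint / np.maximum(mix, 1e-12)

        # Log-likelihood of the current parameters (up to the constant P(b_i) term).
        history.append(np.sum(counts * np.log(np.maximum(mix[:, :, 0], 1e-12))))

        # M-step, Eqs. (6) and (7).
        weighted = counts[:, :, None] * posterior           # n(b_i, w_j) P(z_k | b_i, w_j)
        P_w_z = weighted.sum(axis=0).T                      # sum over blog entries
        P_w_z /= P_w_z.sum(axis=1, keepdims=True)           # normalize over keywords
        P_z_b = weighted.sum(axis=1)                        # sum over keywords
        P_z_b /= counts.sum(axis=1, keepdims=True)          # divide by n(b_i)

        # Convergence condition: stop when the improvement falls below tol.
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
    return P_w_z, P_z_b, history

# Hypothetical counts: two entries use keywords 0-1, two use keywords 2-3.
counts = np.array([[4, 3, 0, 0],
                   [3, 5, 0, 0],
                   [0, 0, 4, 2],
                   [0, 0, 3, 6]], dtype=float)
P_w_z, P_z_b, history = plsa(counts, K=2)
print(np.round(P_w_z, 3))
print(np.round(P_z_b, 3))
```

Because EM only reaches a local maximum, different seeds can give different solutions; running several restarts and keeping the run with the highest final log-likelihood is one common way to handle this.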
4 Experiments and Results
We have used latent semantic models to analyze weblogs related to cyber security threats and incidents, and applied probabilistic models for weblog analysis on our dataset. Dimensionality reduction was performed with latent semantic analysis to show the similarity plot of weblog terms. We extract the most relevant categories and show the topics extracted for each category. Experiments show that the probabilistic model can reveal interesting patterns in the underlying topics for our dataset of security-related weblogs.
4.1 Data Corpus
For our experiments, we extracted a subset of the Nielsen BuzzMetrics weblog data corpus (http://www.icwsm.org/data.html) that focuses on blogs related to cyber security threats and incidents related to cyber crime and terrorism. The original dataset consists of 14 million weblog posts collected by Nielsen BuzzMetrics for May 2006. Although the blog entries span only a short period of time, they are indicative of the amount and variety of blog posts that exist in different languages throughout the world.

Blog entries in the English language related to cyber security threats such as malware, cyber crime, and terrorism were extracted and stored for use in our analysis. Figure 1 shows an excerpt of a weblog post related to cyber blackmail.
Cyber blackmail is on the increase ... Criminal gangs have moved away from the stealth use of infected computers ... to direct blackmailing of victims. ... Cyber blackmailing is done ... by encrypting data or by corrupting system information. The criminal then demands a ransom for its return to the victim. ...
Fig. 1. Excerpt of weblog post related to cyber blackmail and ransom
There are a total of 5493 entries in our dataset, and each weblog entry is saved as a text file for further text preprocessing. For the preprocessing of the blog data, HTML tags were removed and lexical analysis was performed by removing stopwords, stemming, and pruning using the Text to Matrix Generator (TMG) [20]. The total number of terms after pruning and stopword removal is 797. The term-document matrix was then input to the LSA and PLSA algorithms.
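The preprocessing steps described above (tag removal, stopword removal, stemming, pruning) can be sketched in plain Python. This is not TMG, which is a separate toolbox; the stopword list, the crude suffix-stripping stemmer, the `min_docs` pruning threshold, and the sample posts are all hypothetical stand-ins chosen for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (a real list would be much longer).
STOPWORDS = {"the", "a", "of", "is", "on", "to", "and", "by", "for", "its"}

def strip_html(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def crude_stem(token):
    """Rough suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, tokenize on letters, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", strip_html(text).lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

def prune(docs_tokens, min_docs=2):
    """Keep only terms that occur in at least min_docs documents."""
    doc_freq = Counter(term for tokens in docs_tokens for term in set(tokens))
    keep = {term for term, n in doc_freq.items() if n >= min_docs}
    return [[t for t in tokens if t in keep] for tokens in docs_tokens]

# Hypothetical posts.
posts = [
    "<p>Cyber blackmail is on the increase</p>",
    "<div>Criminals demand a ransom by encrypting data</div>",
    "Cyber criminals moved to blackmailing victims",
]
tokens = [preprocess(p) for p in posts]
pruned = prune(tokens)
print(tokens)
print(pruned)
```

The pruned token lists are what would be counted to form the term-document matrix passed to LSA and PLSA.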
4.2 Semantic Detection of Terms
We used the LSA model [4] for semantic detection of terms, as LSA is able to consider weblog entries with similar words that are semantically close. The results of applying LSA to this term-document matrix (with k=2) are shown in Figure 2.
The plot shows the similarity in two-dimensional space of the terms in the weblog entries. Although many terms are not visible because of the large number of words, there are a few groupings evident from the graph. Some of the visible terms include the grouping of spyware, malware, and software at the top center of the plot. Another group visible at the right includes Iraq, war, Bush, and American. Yet another grouping includes the terms death, prison, and life at the bottom of the graph.

Fig. 2. Two-dimensional plot of terms for weblog entries using LSA

Zooming into the large cluster from Figure 2, Figure 3 shows a subset of the big cluster of keywords. A larger group of keywords (spyware, malware, software, data, user, window, privacy, spy, domestic) can be identified, thus showing the ability to descend through a hierarchical grouping of keywords.

Fig. 3. Zoomed-in graph of Figure 2

The graphs demonstrate that closely related terms can be visualized in two-dimensional space. Although the two-dimensional graphs may be an over-simplification of the dimensionality reduction that takes place, the plots can help to visualize the terms and relate them to the topics produced for the weblogs.
4.3 Results for Weblog Topic Analysis
We conducted experiments using PLSA on the weblog entries. Tables 1-4 summarize the keywords found for each of the four topics (Computer Security, Osama bin Laden, Iraq War, and US National Security).

By looking at the various topics listed, we can see that the probabilistic approach lists the important keywords of each topic in a quantitative fashion. The keywords listed can be related back to the original topics. For example, the keywords detected in the Computer Security topic feature items such as computers, spyware, software, and internet.

Table 1. List of keywords for Topic 1: Computer Security

Keyword       Probability
comput        0.023716
malwar        0.020509
spywar        0.018047
softwar       0.014650
secur         0.014257
window        0.013527
internet      0.013436
http          0.013266
web           0.012022
user          0.011651

Table 2. List of keywords for Topic 2: Osama bin Laden

Keyword       Probability
moussaoui     0.0170900
don           0.0083916
life          0.0083244
bin           0.0079485
laden         0.0078515
osama         0.0074730
prison        0.0074594
peopl         0.0064921
death         0.0063618
god           0.0061734

Table 3. List of keywords for Topic 3: Iraq War

Keyword       Probability
iraq          0.0134810
war           0.0089393
islam         0.0087796
zarqawi       0.0086076
militari      0.0073831
muslim        0.0073576
afghanistan   0.0072725
iran          0.0070231
qaeda         0.0069711
iraqi         0.0065779

Table 4. List of keywords for Topic 4: US National Security

Keyword       Probability
nsa           0.0142720
bush          0.0127100
phone         0.0121350
program       0.0099155
presid        0.0098480
cia           0.0093545
american      0.0088027
record        0.0086708
call          0.0086591
administr     0.0084607

Figure 4 shows the graph of the topic-document distribution of the weblog entries by date. Some of the topics have a higher density of documents distributed around certain dates. This can be used to match certain events in each topic to the weblog entries. For example, the heavy clustering of documents for Topic 4 (US National Security) indicates that there was an increase in the weblog conversations on this topic around the middle of May 2006. This may be due to US President Bush's comment on May 11, 2006 about a USA Today report on a massive NSA database that collects information about all phone calls made within the United States [5]. This is one example of an event that can trigger much conversation in the blogosphere.
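Keyword lists of this kind and a per-document topic assignment can both be read directly off the fitted distributions. The sketch below assumes hypothetical arrays `P_w_given_z` and `P_z_given_b` with invented values and a five-term vocabulary; the helper names are illustrative and do not come from the original work.

```python
import numpy as np

def top_keywords(P_w_given_z, vocab, n=10):
    """For each topic, return the n (keyword, probability) pairs with the highest P(w | z)."""
    result = []
    for row in P_w_given_z:
        order = np.argsort(row)[::-1][:n]            # indices sorted by descending probability
        result.append([(vocab[j], float(row[j])) for j in order])
    return result

def dominant_topic(P_z_given_b):
    """Assign each blog entry to the topic with the highest P(z | b)."""
    return np.argmax(P_z_given_b, axis=1)

# Hypothetical fitted distributions for 2 topics, 5 stemmed keywords, 3 blog entries.
vocab = ["comput", "malwar", "iraq", "war", "nsa"]
P_w_given_z = np.array([
    [0.50, 0.40, 0.04, 0.03, 0.03],
    [0.05, 0.05, 0.50, 0.35, 0.05],
])
P_z_given_b = np.array([[0.9, 0.1],
                        [0.3, 0.7],
                        [0.6, 0.4]])

for k, keywords in enumerate(top_keywords(P_w_given_z, vocab, n=2), start=1):
    print(f"Topic {k}: {keywords}")
print(dominant_topic(P_z_given_b))
```

Grouping the dominant-topic assignments by posting date would give a rough count of documents per topic per day, which is one way to look for date-related clustering.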
For the topic of Computer Security, we further decompose into separatesubtopics, two of which are shown in Tables 5-6. Malware, which includes com-puter viruses, worms, trojan horses, spyware, adware, and other malicious soft-ware, is the topic derived from examining the keywords in Subtopic 1. Subtopic2 is classified as Macintosh, and reflects the increasing reports of cyber attacksaffecting Macintosh computers. Therefore, we can classify and decompose thetopics into a hierarchy of subtopics, which may be useful for examining largerdata sets.
The power of PLSA in cyber security applications include the ability to au-tomatically detect terms and keywords related to cyber security threats andterror events. By presenting blogs with measurable keywords, we can improve
[Figure: topics 1-4 (vertical axis) plotted against documents 1000-5000 (horizontal axis), with documents ordered by date from May 1 to May 31, 2006.]
Fig. 4. Topic by document distribution by date
Table 5. List of keywords for Subtopic 1: Malware
Keyword     Probability
spywar      0.0174610
trojan      0.0117580
adwar       0.0092120
scan        0.0088160
anti        0.0087841
free        0.0074914
spybot      0.0074895
remov       0.0072834
download    0.0069242
viru        0.0068496
Table 6. List of keywords for Subtopic 2: Macintosh
Keyword     Probability
mac         0.0078348
secur       0.0056874
appl        0.0054443
microsoft   0.0041262
attack      0.0041169
system      0.0039921
report      0.0037533
cyber       0.0037227
crime       0.0036151
comput      0.0035352
our understanding of cyber security issues in terms of the distribution and trends of current threats and events. This has implications for security agencies wishing to monitor real-time threats present in weblogs or other related documents.
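To make the keyword detection step concrete, the sketch below fits a small PLSA model by expectation-maximization and lists the most probable terms per topic, in the same form as the keyword tables. It is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: it uses the asymmetric parameterization P(w|d) = sum over z of P(w|z) P(z|d), a fixed number of iterations, and a hypothetical four-term toy corpus. The function names `plsa` and `top_keywords` are also hypothetical.

```python
import random


def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM on a document-term count matrix (list of rows).

    Returns (p_w_z, p_z_d) with p_w_z[z][w] = P(w|z) and p_z_d[d][z] = P(z|d).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random positive initialization, normalized into probabilities.
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    for _ in range(n_iter):
        # Tiny floor keeps every normalization well defined.
        new_w_z = [[1e-12] * n_words for _ in range(n_topics)]
        new_z_d = [[1e-12] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z|d,w) proportional to P(w|z) P(z|d).
                post = normalize([p_w_z[z][w] * p_z_d[d][z]
                                  for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    weight = counts[d][w] * post[z]
                    new_w_z[z][w] += weight
                    new_z_d[d][z] += weight
        # M-step: renormalize expected counts into probabilities.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d


def top_keywords(p_w_z, vocab, k):
    """For each topic, the k terms with the highest P(w|z)."""
    return [
        sorted(zip(vocab, row), key=lambda pair: pair[1], reverse=True)[:k]
        for row in p_w_z
    ]


# Hypothetical toy corpus: stemmed terms as columns, one row per post.
vocab = ["spywar", "trojan", "iraq", "war"]
counts = [
    [4, 3, 0, 0],
    [5, 2, 0, 0],
    [0, 0, 3, 4],
    [0, 0, 2, 5],
]

p_w_z, p_z_d = plsa(counts, n_topics=2)
for z, keywords in enumerate(top_keywords(p_w_z, vocab, k=2)):
    print(z, [(term, round(p, 4)) for term, p in keywords])
```

On data this cleanly separated, the two topics would be expected to align with the two groups of posts, but EM only converges to a local optimum and the topic order depends on the random initialization.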
5 Conclusions
In this paper, we analyzed weblog posts for various categories of cyber security threats related to the detection of cyber attacks, cyber crime, and cyber terrorism. To our knowledge, this is the first such study focusing on cyber security weblogs. We used latent semantic analysis to illustrate similarities in terms distributed across all the terms in the weblog dataset. Our experiments on our dataset of weblogs demonstrate how our probabilistic weblog model can present the blogosphere in terms of topics with measurable keywords, hence tracking popular conversations and topics in the blogosphere. By applying a probabilistic approach, we can improve information retrieval in weblog search and keyword detection, and provide an analytical foundation for the future of security intelligence analysis of weblogs.
Potential applications of this stream of research may include automatically monitoring and identifying trends in cyber terror and security threats in weblogs. This may be significant for government and intelligence agencies wishing to monitor, in real time, potential international terror threats present in weblog conversations and the blogosphere.
References
1. Avesani, P., Cova, M., Hayes, C., Massa, P.: Learning Contextualised Weblog Topics. WWW '05 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics (2005)
2. Berry, M., Dumais, S., O'Brien, G.: Using linear algebra for intelligent information retrieval. SIAM Review 37(4) (1995) 573-595
3. Columbus, L.: Blog Mining Gets Real. CRM Buyer (2005)
4. Deerwester, S., Dumais, S., Landauer, T., Furnas, G., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society of Information Science 41(6) (1990) 391-407
5. Diamond, J.: NSA has massive database of Americans' phone calls. USA Today (May 10, 2006)
6. Gill, K.E.: How Can We Measure the Influence of the Blogosphere? WWW '04 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics (2004)
7. Glance, N.S., Hurst, M., Tomokiyo, T.: BlogPulse: Automated Trend Discovery for Weblogs. WWW '04 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics (2004)
8. Gruhl, D., Guha, R., Liben-Nowell, D., Tomkins, A.: Information Diffusion Through Blogspace. WWW '04 (2004)
9. Hofmann, T.: Probabilistic Latent Semantic Indexing. SIGIR '99 (1999)
10. Mei, Q., Liu, C., Su, H., Zhai, C.: A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs. WWW '06 (2006)
11. Nakajima, S., Tatemura, J., Hino, Y., Hara, Y., Tanaka, K.: Discovering Important Bloggers based on Analyzing Blog Threads. WWW '05 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics (2005)
12. Newman, D., Chemudugunta, C., Smyth, P., Steyvers, M.: Analyzing Entities and Topics in News Articles Using Statistical Topic Models. ISI '06 (2006)
13. Pikas, C.K.: Blog Searching for Competitive Intelligence, Brand Image, and Reputation Management. Online 29(4) (2005) 16-21
14. Prabowo, R., Thelwall, M.: A Comparison of Feature Selection Methods for an Evolving RSS Feed Corpus. Information Processing and Management 42 (2006) 1491-1512
15. Tsai, F.S., Chan, C.K. (eds): Cyber Security. Pearson Education, Singapore (2006)
16. Tsai, F.S., Chen, Y., Chan, K.L.: Probabilistic Latent Semantic Analysis for Search and Mining of Corporate Blogs (2007)
17. Wikipedia contributors: Intelligence Analysis. In: Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Intelligence_analysis (accessed Nov 7, 2006)
18. Yang, C.C., Shi, X., Wei, C.-P.: Tracing the Event Evolution of Terror Attacks from On-Line News. ISI '06 (2006)
19. Yilmazel, O., Symonenko, S., Balasubramanian, N., Liddy, E.D.: Leveraging One-Class SVM and Semantic Analysis to Detect Anomalous Content. ISI '05 (2005)
20. Zeimpekis, D., Gallopoulos, E.: TMG: A MATLAB Toolbox for generating term-document matrices from text collections. Grouping Multidimensional Data: Recent Advances in Clustering. Springer (2005) 187-210