detecting cyber security threats in weblogs

Upload: ayoubit

Post on 08-Apr-2018

218 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/7/2019 Detecting Cyber Security Threats in Weblogs

    1/12

    Detecting Cyber Security Threats in Weblogs

    Using Probabilistic Models

    Flora S. Tsai and Kap Luk Chan

    School of Electrical & Electronic Engineering,Nanyang Technological University, Singapore, 639798

    [email protected]

    Abstract. Organizations and governments are becoming vulnerable toa wide variety of security breaches against their information infrastruc-ture. The magnitude of this threat is evident from the increasing rate ofcyber attacks against computers and critical infrastructure. Weblogs, orblogs, have also rapidly gained in numbers over the past decade. Weblogsmay provide up-to-date information on the prevalence and distribution ofvarious cyber security threats as well as terrorism events. In this paper,we analyze weblog posts for various categories of cyber security threatsrelated to the detection of cyber attacks, cyber crime, and terrorism. Ex-isting studies on intelligence analysis have focused on analyzing news orforums for cyber security incidents, but few have looked at weblogs. We

    use probabilistic latent semantic analysis to detect keywords from cybersecurity weblogs with respect to certain topics. We then demonstrate howthis method can present the blogosphere in terms of topics with measur-able keywords, hence tracking popular conversations and topics in theblogosphere. By applying a probabilistic approach, we can improve infor-mation retrieval in weblog search and keywords detection, and providean analytical foundation for the future of security intelligence analysis ofweblogs.

    Keywords: cyber security, weblog, blog, probabilistic latent semantic

    analysis, cyber crime, cyber terrorism, data mining.

    1 Introduction

    Cyber security is defined as the intersection of computer, network, and informa-tion security issues which directly affect the national security infrastructure [15].Cyber security problems are frequent, serious, and global in nature. The numberof cyber attacks by persons and malicious software are increasing rapidly. Manycyber criminals or hackers may post their ongoing achievements in weblogs, orblogs, which are websites where entries are made in a reverse chronological order.In addition, weblogs may provide up-to-date information on the prevalence anddistribution of various cyber security incidents and threats.

    Weblogs range in scope from individual diaries to arms of political campaigns,media programs, and corporations. Weblogs explosive growth is generating largevolumes of raw data and is considered by many industry watchers one of the top

    C.C. Yang et al. (Eds.): PAISI 2007, LNCS 4430, pp. 4657, 2007.c Springer-Verlag Berlin Heidelberg 2007

    http://-/?-http://-/?-
  • 8/7/2019 Detecting Cyber Security Threats in Weblogs

    2/12

    Detecting Cyber Security Threats in Weblogs Using Probabilistic Models 47

    ten industry trends [3]. Blogosphere is the collective term encompassing all blogsas a community or social network. Because of the huge volume of existing weblogposts and their free format nature, information in the blogosphere is ratherrandom and chaotic, but immensely valuable in the right context. Weblogs canthus potentially contain usable and measurable information related to cybersecurity threats, such as malware, viruses, cyber blackmail, and other cybercrime.

    With the amazing growth of blogs on the web, the blogosphere affects muchin the media. Studies on the blogosphere include measuring the influence of theblogosphere [6], analyzing the blog threads for discovering the important bloggers[11], determining the spatiotemporal theme pattern on blogs [10], focusing thetopic-centric view of the blogosphere [1], detecting the blogs growing trends [7],tracking the propagation of discussion topics in the blogosphere [8], and searchingand detecting topics in corporate blogs [16].

    Existing studies have focused on analyzing forums and news articles for cy-ber threats [12,18,19], but few have looked at weblogs. In this paper, we focuson analyzing cyber security weblogs, which are blogs providing commentary oranalysis of cyber security threats and incidents.

    In our work, we analyzed various weblog posts to detect the keywords ofvarious topics of the blog entries, hence tracking the trends and topics of conver-sations in the blogosphere. Probabilistic Latent Semantic Analysis (PLSA) wasused to detect the keywords from various cyber security blog entries with respectto certain topics. By using PLSA, we can present the blogosphere in terms oftopics with measurable keywords.

    The paper is organized as follows. Section 2 reviews the related work onintelligence analysis and extraction of useful information from weblogs. Section3 describes an overview of the Latent Semantic models such as Latent SemanticAnalysis and Probabilistic Latent Semantic Analysis model for mining of weblog-related topics. Section 4 presents experimental results, and Section 5 concludesthe paper.

    2 Review of Related Work

    This section reviews related work in intelligence analysis and extraction of usefulinformation from weblogs.

    2.1 Intelligence Analysis

    Intelligence analysis is the process of producing formal descriptions of situationsand entities of strategic importance [17]. Although its practice is found in itspurest form inside intelligence agencies, such as the CIA in the United Statesor MI6 in the UK, its methods are also applicable in fields such as businessintelligence or competitive intelligence.

    Recent works related to security intelligence analysis include using entity rec-ognizers to extract names of people, organizations, and locations from news

    http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/7/2019 Detecting Cyber Security Threats in Weblogs

    3/12

    48 F.S. Tsai and K.L. Chan

    articles, and applying probabilistic topic models to learn the latent structurebehind the named entities and other words [12]. Another study analyzed theevolution of terror attack incidents from online news articles using techniques re-lated to temporal and event relationship mining [18]. In addition, Support VectorMachines were used for improving document classification for the insider threatproblem within the intelligence community by analyzing a collection of docu-ments from the Center for Nonproliferation Studies (CNS) related to weaponsof mass destruction [19]. These studies illustrate the growing need for securityintelligence analysis, and the usage of machine learning and information retrievaltechniques to provide such analysis. However, much work has yet to be done inobtaining intelligence information from the vast collection of weblogs that existthroughout the world.

    2.2 Information Extraction from Weblogs

    Current weblog text analysis focuses on extracting useful information from we-blog entry collections, and determining certain trends in the blogophere. NLP(Natural Language Processing) algorithms have been used to determine the mostimportant keywords and proper names within a certain time period from thou-sands of active weblogs, which can automatically discover trends across blogs, aswell as detect key persons, phrases and paragraphs [7]. A study on the propaga-

    tion of discussion topics through the social network in the blogophere developedalgorithms to detect the long-term and short-term topics and keywords, whichwere then validated with real weblog entry collections [8]. On evaluating thesuitable methods of ranking term significance in an evolving RSS feed corpus,three statistical feature selection methods were implemented: 2, Mutual Infor-mation (MI) and Information Gain (I), and the conclusion was that 2 methodseems to be the best among all, but full human classification exercise would berequired to further evaluate such method [14]. A probabilistic approach basedon PLSA was proposed in [10] to extract common themes from blogs, and also

    generate the theme life cycle for each given location and the theme snapshots foreach given time period. PLSA has also been previously used for weblog searchand mining of corporate blogs [16].

    Our work differs from existing studies in two respects: (1) We focus on cybersecurity weblog entries which has not been studied before in the context ofintelligence analysis (2) We have used probabilistic models to extract popularkeywords for each topic in order to detect themes and trends in cyber threatsand terrorism events.

    3 Latent Semantic Models

    This section reviews the latent semantic models used for this work, which involvelatent sematic analysis and extending probabilistic latent semantic analysis fortopic detection in weblogs.

    http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/7/2019 Detecting Cyber Security Threats in Weblogs

    4/12

    Detecting Cyber Security Threats in Weblogs Using Probabilistic Models 49

    3.1 Latent Semantic Analysis

    Latent Semantic Analysis (LSA) [4] is a well-known technique for informationretrieval and document classification. LSA solves two fundamental problems in

    natural language processing: synonymy and polysemy:

    In synonymy, different words may have the same meaning. Thus, a personissuing a query in a search engine may use a different word from what appearsin a document, and may not retrieve the document.

    In polysemy, the same word can have multiple meanings, so a searcher canget unwanted documents with the alternate meanings.

    LSA solves the problem of lexical matching methods by using statistically

    derived conceptual indices instead of individual words for retrieval [2]. LSA usesa term-document matrix (TDM) which describes patterns of term (word) distri-bution across a set of documents.

    LSA then finds a low-rank approximation which is smaller and less noisy thanthe original term-document matrix. The downsizing of the matrix is achievedthrough the use of singular value decomposition (SVD), where the set of all theterms is then represented by a vector space of lower dimensionality than thetotal number of terms in the vocabulary. The consequence of the rank loweringis that some dimensions get merged.

    In LSA, each element of then

    m

    term-document matrix reflects the occur-rence of a particular word in a particular document, i.e.,

    A = [aij ], (1)

    where aij is the number of times or frequency in which term i appears in docu-ment j. As each word will not usually appear in every document, the matrix Ais typically sparse with rarely any noticeable nonzero structure [2].

    The matrix A is then factored into the product of three matrices using SVD.Given a matrix A, where rank(A) = r, the SVD of A is defined as:

    A = USVT. (2)

    The columns of U and V are referred to as the left and right singular vectors,respectively, and the singular values of A are the diagonal elements of S, or thenonnegative square roots of the n eigenvalues of AAT.

    As defined by Equation (2), the SVD is used to represent the original rela-tionships among terms and documents as sets of linearly-independent vectors.Performing truncated SVD by using the k-largest singular values and correspond-

    ing singular vectors, the original TDM can be reduced to a smaller collection ofvectors in k-space for conceptual query processing [2].

    3.2 Probabilistic Latent Semantic Analysis for Weblog Mining

    Probabilistic Latent Semantic Analysis (PLSA) [9] is based on a generative prob-abilistic model that stems from a statistical approach to LSA [4]. PLSA is able to

    http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/7/2019 Detecting Cyber Security Threats in Weblogs

    5/12

    50 F.S. Tsai and K.L. Chan

    capture the polysemy and synonymy in text for applications in the informationretrieval domain. Similar to LSA, PLSA uses a term-document matrix whichdescribes patterns of term (word) distribution across a set of documents (blogentries). By implementing PLSA, topics are generated from the blog entries,where each topic produces a list of word usage, using the maximum likelihoodestimation method, the expectation maximization (EM) algorithm.

    The starting point for PLSA is the aspect model [9]. The aspect model isa latent variable model for co-occurrence data associating an unobserved classvariable zk {z1, . . . , zk} with each observation, an observation being the oc-currence of a keyword in a particular blog entry. There are three probabilitiesused in PLSA:

    1. P(bi) denotes the probability that a keyword occurrence will be observed in

    a particular blog entry bi,2. P(wj |zk) denotes the class-conditional probability of a specific keyword con-

    ditioned on the unobserved class variable zk,3. P(zk|di) denotes a blog-specific probability distribution over the latent vari-

    able space.

    In the collection, the probability of each blog and the probability of eachkeyword are known, while the probability of an aspect given a blog and theprobability of a keyword given an aspect are unknown. By using the above three

    probabilities and conditions, three fundamental schemes are implemented:

    1. select a blog entry bi with probability P(bi),2. pick a latent class zk with probability P(zk|bi),3. generate a keyword wj with probability P(wj |zk).

    As a result, a joint probability model is obtained in asymmetric parameteri-zation:

    P(bi, wj) = P(bi)P(wj |bi), (3)

    P(wj |bi) =K

    k=1

    P(wj |zk)P(zk|bi) (4)

    After the aspect model is generated, the model is fitted using the EM algo-rithm. The EM algorithm involves two steps, namely the expectation (E) stepand the maximization (M) step. The E-step computes the posterior probabilityfor the latent variable, by implying Bayes formula, so the parameterization ofjoint probability model is obtained as:

    P(zk|bi, wj) =P(wj |zk)P(zk|bi)Kl=1 P(wj |zl)P(zl|bi)

    (5)

    http://-/?-http://-/?-
  • 8/7/2019 Detecting Cyber Security Threats in Weblogs

    6/12

    Detecting Cyber Security Threats in Weblogs Using Probabilistic Models 51

    The M-step updates the parameters based on the expected complete datalog-likelihood depending on the posterior probability resulted from the E-step.Hence the M-step re-estimates the following two probabilities:

    P(wj |zk) =

    Ni=1 n(bi, wj)P(zk|bi, wj)M

    m=1

    Ni=1 n(bi, wm)P(zk|bi, wm)

    (6)

    P(zk|bi) =

    Mj=1 n(bi, wj)P(zk|bi, wj)

    n(bi)(7)

    The EM iteration is continued to increase the likelihood function until thespecific conditions are met and the program is terminated. These conditions canbe a convergence condition, or a cut-off point, which is specified for reaching a

    local maximum, rather than a global maximum.In short, the PLSA model selects the model parameter values that maximizethe probability of the observed data, and returns the relevant probability dis-tributions by implying the EM algorithm. Word usage analysis with the aspectmodel is a common application of the aspect model. Based on the pre-processedterm-document matrix, the blogs are then classified onto different aspects ortopics. For each aspect, the keyword usage, such as the probable words in theclass-conditional distribution P(wj |zk), is determined. Empirical results indi-cate the advantages of PLSA in reducing perplexity, and high performance of

    precision and recall in information retrieval [9].

    4 Experiments and Results

    We have used latent semantic models to analyze weblogs related to cyber securitythreats and incidents, and applied probabilistic models for weblog analysis onour dataset. Dimensionality reduction was performed with latent semantic anal-ysis to show the similarity plot of weblog terms. We extract the most relevantcategories and show the topics extracted for each category. Experiments show

    that the probabilistic model can reveal interesting patterns in the underlyingtopics for our dataset of security-related weblogs.

    4.1 Data Corpus

    For our experiments, we extracted a subset of the Nielson BuzzMetrics weblogdata corpus1 that focuses on blogs related to cyber security threats and incidentsrelated to cyber crime and terrorism. The original dataset consists of 14 millionweblog posts collected by Nielsen BuzzMetrics for May 2006. Although the blog

    entries span only a short period of time, they are indicative of the amount andvariety of blog posts that exists in different languages throughout the world.

    Blog entries in the English language related to cyber security threats such asmalware, cyber crime, and terrorism were extracted and stored for use in ouranalysis. Figure 1 shows an excerpt of a weblog post related to cyber blackmail.

    1 http://www.icwsm.org/data.html

    http://-/?-http://-/?-http://-/?-http://www.icwsm.org/data.htmlhttp://www.icwsm.org/data.htmlhttp://-/?-http://-/?-http://-/?-
  • 8/7/2019 Detecting Cyber Security Threats in Weblogs

    7/12

    52 F.S. Tsai and K.L. Chan

    Cyber blackmail is on the increase ... Criminal gangs have moved away from the stealthuse of infected computers ... to direct blackmailing of victims. ... Cyber blackmailingis done ... by encrypting data or by corrupting system information. The criminal then

    demands a ransom for its return to the victim. ...

    Fig. 1. Excerpt of weblog post related to cyber blackmail and ransom

    There are a total of 5493 entries in our dataset, and each weblog entry is savedas a text file for further text preprocessing. For the preprocessing of the blogdata, HTML tags were removed and lexical analysis was performed by removingstopwords, stemming, and pruning using the Text to Matrix Generator (TMG)

    [20]. The total number of terms after pruning and stopword removal is 797. Theterm-document matrix was then input to the LSA and PLSA algorithms.

    4.2 Semantic Detection of Terms

    We used the LSA model [4] for analyzing semantic detection of terms, as LSA isable to consider weblog entries with similar words which are semantically close.The results of applying LSA on this term-document matrix (with k=2) is shownin Figure 2.

    The plot shows the similarity in two-dimensional space of the terms in theweblog entries. Although many terms are not visible because of the large numberof words, there are a few groupings evident from the graph. Some of the visibleterms include the grouping of spyware, malware, and software at the top centerof the plot. Another group visible at the right include Iraq, war, Bush, and

    0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

    1

    0.8

    0.6

    0.4

    0.2

    0

    0.2

    0.4

    0.6

    0.8

    abilabsolutabuabus

    accept

    accessaccount

    accus actaction

    activ

    actual adaddadditaddress

    administr

    admit afghanistanag

    agenc

    agentago

    agreaidairallegalli allow

    america

    american

    amountanalysi

    announcanswer

    anti

    apparappearapproachapprov

    aprilarab

    areaarenarguargumentarmarmiarrest

    articlask

    associ

    attack

    attemptattentauthor

    avoidawar backbad

    baghdadbasebasicbattlbeganbeginbeliev bigbillbillion

    bin

    bitblackblameblog

    blowbodiboi bombbookborder

    breakbringbritish

    brother

    broughtbui build

    burn

    bush

    busi

    call

    campcampaign

    capit capturcar

    cardcarecarri

    case

    caughtcaus

    cellcenter

    centralchalleng

    chanc

    chang

    charg

    checkcheneichiefchildrenchoicchristian

    cia

    citicitizencivil

    civilian claimclass

    clearclintonclose

    code

    collect

    combatcomecommand comment

    commit

    committe

    common

    commun

    compani

    complet

    comput

    concernconductconfirm

    conflict

    congress

    connect

    conservconsid

    constitutcontactcontent continu

    controlconvers

    convict

    corporcost countricoupl

    court

    covercreat

    crimecrimin

    criticcrossculturcurrent

    custom

    cut

    daidailidamag

    danger

    data

    databas

    datedaviddead

    deal

    death

    debat

    deciddecisdeclardefeatdefenddefensdemand

    democraci

    democrat

    deni

    departdesign

    destroidestructdetail

    determindevelop

    dididn

    die

    differdirect

    directli

    directordiscov discuss

    documentdoesndollar

    domest

    dondontdoubt

    drivedropdueearliearliereasi easteconomeduc

    effecteffortelectemail

    emergendenemi

    enforcengagenter entirestablisheuropeventevid

    evil

    execut

    existexpectexperiexpertexplainexpressextremey face factfailfailurfair

    fallfals

    famili

    fatherfavor fbifear

    federfeel

    fieldfight

    figur

    file

    final

    findfine firefly

    focus

    followforcforeign

    formforward foundfree

    freedomfridai friendfrontfullfundfuturgain gamegathergave

    gengener

    georggiveglobalgoal

    god

    good

    govern

    greatground

    groupgrowguess guigunhalf hand

    happenhappihard

    hatehaven head

    hearheardheartheldhell

    helphide high

    hijack

    historihitholdhome

    hopehot

    hour

    houshttp

    huge

    humanhundrhusseinideaidentifi

    ignor

    illeg

    imagimaginimmigrimport

    includincreasindependindividu

    inform

    initiinnoc

    insid

    instal

    institutinsurg

    intellig

    interestinterninternet

    interviewinvadinvas

    investig

    involviran

    iraqiraqi

    islam

    isn israelissu

    januari jobjohnjoinjournal

    judg

    justic

    justifikei

    kid

    kill

    kind

    knewknowledglack

    laden

    landlarglargestlatelatestlaunch

    law

    lawyerlead

    leaderlearnleavled left

    legallet

    letterlevelli liberliberti

    life

    lightlimit line

    linklist listen

    livelocal

    longlongerlose

    lost

    lotlovelow

    machinmade

    mailmain

    majormake

    malwar

    man

    managmarchmark

    marketmassmattermeanmeasur

    mediameet

    membermenmentionmessagmet

    michaelmiddl militari

    million

    mind

    mine

    minutmiss

    missionmistakmoment

    moneimonitor

    monthmoral

    mornmovemovementmovi

    murder muslim

    name

    nation

    naturneed

    network

    newnewspapnice nightnorth

    notenotic

    nsa

    nuclear

    number

    objectoffer offic officioilonlin open

    operopinionopportunopposopposit

    orderorgan

    origin

    osama

    pagepaipakistan

    paper

    part

    parti

    particippasspastpeacpentagon peoplperiod person

    phone

    pickpicturpiec placeplaiplan

    plane

    plot

    pointpolic

    polici politpoll

    poorpopulpopularposit

    possiblpost

    potenti powerpracticprepar present

    presid

    pressprettiprevent

    price

    prison

    privaci

    privat problemprocessproducproduct

    program

    progressprojectpromis

    protect

    protestprove

    provid

    publicpublish

    pullpurpospush

    putqaeda

    questionquotradic

    raisrate

    reach readreadi realrealitirealiz reasonreceivrecent

    record

    redreferrefusregimregionrelat

    releasreligireligionremainrememb

    removreport

    repres

    republicanrequir

    research

    respectrespond respons

    rest

    resultreturnreveal

    review rightriserisk

    role

    rollroom

    rulerun

    saddamsafe

    save

    school

    search secretsecretari

    secur

    seeksell

    senat

    sendseniorsens

    sentenc

    sept

    septemb

    seri

    serv

    servic

    setshareshortshot showsidesignsignificsimilarsimpl simplisingl

    sit

    site

    situatsmallsocialsocieti

    softwar

    soldiersonsortsound

    sourc

    south

    speakspecial

    specifspeech

    spend

    spentspread

    spy

    spywar

    stai standstandard start state

    statementstep stop

    storistrategi

    streetstrikestrongstudentstudistuffsubjectsuccess

    suffer

    suggestsuicid

    support

    suppossurpris

    surveil

    suspectsystemtake taliban

    talktarget

    taxteam

    technologtelephon

    tell

    ten

    termterror

    terrorist

    test thingthink thought

    thousand

    threat

    threaten

    thursdai

    ti timetodai

    told

    tool top

    tortur

    totaltown

    track

    tradetrain

    trial

    trooptroubltruetrust

    truthtuesdai

    turntypeunderstand unit

    universupdat

    usa

    user

    version

    victim

    videoview

    violatviolenc

    viru

    visitvoic vote

    waiwaitwalkwall

    want

    warwarn

    warrantwashington

    wasnwatchwater

    weaponweb

    websit weekwestwestern

    whitewide win

    window

    womanwomen

    won

    wordwork

    worldworri

    worsworstworthwritewritten

    wrongwrote

    www

    yeah yearyesterdaiyorkyoung

    zacaria

    zarqawi

    Fig. 2. Two-dimensional plot of terms for weblog entries using LSA

    http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/7/2019 Detecting Cyber Security Threats in Weblogs

    8/12

    Detecting Cyber Security Threats in Weblogs Using Probabilistic Models 53

    0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5

    0.4

    0.3

    0.2

    0.1

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    abil

    absolut abuabus

    accept

    access

    account

    accusactaction

    activ

    actualad

    addaddit

    address

    admitafghanistan

    ag

    agenc

    agentago

    agre

    aid airallegalli allowamount

    analysi

    announc answer

    anti

    apparappearapproach

    approv

    april

    arab

    areaaren

    arguargumentarmarmi

    arrest

    articl

    askassoci attemptattent

    author

    avoidawar

    backbad

    baghdadbasebasicbattl

    beganbegin

    believbig

    billbillion

    bitblackblame

    blog

    blow bodi boi bomb

    book

    border

    breakbring

    british

    brother

    brought

    bui build

    burn

    busi

    campcampaign

    capit capturcar

    card

    carecarri

    case

    caughtcaus

    cell

    center

    centralchalleng

    chanc

    chang

    charg

    checkchenei

    chiefchildrenchoic

    christian

    cia

    citi

    citizencivil

    civilianclaim

    class

    clear

    clintonclose

    code

    collect

    combatcome

    command comment

    commit

    committe

    common

    commun

    compani

    completconcern

    conductconfirm

    conflict

    congress

    connect

    conservconsid

    constitut

    contactcontent

    continucontrolconvers

    convict

    corporcostcoupl

    court

    covercreat

    crimecrimin

    critic

    crosscultur

    current

    custom

    cutdaili

    damag

    danger

    data

    databas

    datedavid

    dead

    dealdebat

    deciddecis

    declar

    defeat defenddefensdemand

    democraci

    democrat

    deni

    depart

    design

    destroidestruct

    detail

    determin

    develop

    dididn

    die

    differdirect

    directli

    director

    discov discuss

    document

    doesndollar

    domest

    dont

    doubt

    drivedropdue

    earliearlier

    easieast

    economeduceffect

    effortelect

    email

    emerg

    endenemi

    enforcengagenter

    entirestablisheurop

    eventevidevil

    execut

    existexpectexperiexpert

    explainexpressextremey face fact

    failfailur

    fair

    fallfals

    famili

    fatherfavor fbi

    fear

    federfeel

    fieldfight

    figur

    file

    final

    findfinefire

    fly

    focus

    follow

    forcforeignformforward found

    free

    freedomfridai friendfront fullfundfutur

    gain gamegather

    gave

    gengener

    georg

    giveglobalgoal

    god

    goodgreat

    groundgroupgrow

    guessgui

    gunhalf

    handhappenhappi

    hard

    hate

    haven head

    hearheardheartheld

    hell

    helphide high

    hijack

    historihithold

    home

    hope

    hot

    hour

    houshttp

    huge

    human

    hundr husseinidea

    identifiignor

    illeg

    imagimagin

    immigrimportinclud

    increasindepend

    individu

    inform

    initi

    innoc

    insid

    instal

    institutinsurg

    interestintern

    internet

    interviewinvadinvas

    investig

    involv

    iraniraqi

    isn israel

    issujanuari

    jobjohnjoin

    journal

    justic

    justifi

    keikid

    kind

    knew

    knowledglack landlarg

    largest

    latelatestlaunch

    law

    lawyer

    lead

    leaderlearn

    leavled

    left

    legal

    let

    letterlevel

    li liber

    liberti

    lightlimit line

    linklist

    listen

    live

    local

    longlonger

    lose

    lost

    lotlove

    low

    machin

    made

    mail

    mainmajor

    malwar

    man

    managmarch

    mark

    market

    massmattermeanmeasur

    media

    meet

    membermen

    mention messagmetmichael

    middl

    million

    mind

    mine

    minut

    miss

    mission

    mistakmoment

    moneimonitor

    month

    moral

    morn movemovement movi

    murder muslim

    name

    natur

    need

    network

    newspap

    nicenightnorth notenotic

    nuclear

    number

    object offeroffic officioil

    onlinopen

    operopinionopportun

    opposoppositorder

    organorigin

    osama

    pagepai

    pakistan

    paper

    part

    parti

    particippasspast peacpentagon

    periodperson

    pickpicturpiec

    placeplaiplan

    plane

    plot

    pointpolic

    policipolit

    poll

    poorpopulpopular posit

    possibl

    potentipower

    practicpreparpresent presspretti

    preventprice

    privaci

    privatproblem

    processproduc

    product

    progressproject

    promis

    protect

    protestprove

    provid

    publicpublish

    pull

    purpospush

    put

    question

    quotradic

    rais

    ratereach readreadi

    realrealitirealiz reasonreceiv

    recent

    record

    redreferrefusregimregion relat

    releasreligireligion

    remainrememb

    remov

    repres

    republican

    requirresearch

    respectrespond

    respons

    rest

    resultreturn

    revealreview

    rightrise

    risk

    role

    roll

    room

    rule

    run

    saddam

    safe

    save

    school

    searchsecret

    secretariseeksell

    senat

    sendsenior

    sens

    septemb

    seri

    serv

    servic

    setshareshort

    shot showsidesignsignificsimilar

    simpl simplisingl

    sit

    site

    situatsmall

    social

    societi

    softwar

    soldierson

    sort

    sound

    sourc

    south

    speak

    specialspecif

    speech

    spend

    spent

    spread

    spy

    spywar

    staistand

    standardstart

    statement

    step stopstori

    strategistreetstrike

    strong

    studentstudistuff

    subjectsuccess

    suffer

    suggest

    suicid

    support

    suppossurpris

    surveil

    suspect system

    taketaliban

    talk

    target

    taxteam

    technologtelephon

    tell

    ten

    term

    test

    think thoughtthousand

    threat

    threaten

    thursdai

    ti

    todai

    told

    tooltop

    tortur

    totaltown

    track

    tradetrain

    trooptroubl

    true

    trust

    truth

    tuesdai

    turntypeunderstand

    univers

    updat

    usa

    user

    version

    victim

    videoview

    violat

    violenc

    viru

    visit

    voicvote

    wai

    waitwalkwall

    want

    warn

    warrant

    washington

    wasnwatchwater

    weapon

    web

    websit weekwestwestern

    whitewide

    win

    window

    woman

    women

    won

    word

    workworri

    worsworstworth

    writewritten

    wrong

    wrote

    www

    yeah yesterdai

    yorkyoung zarqawi

    Fig. 3. Zoomed-in graph of Figure 2

    American. Yet another grouping include the terms death, prison, and life at the

    bottom of the graph. Zooming into the large cluster from Figure 2, Figure 3 showsa subset of the big cluster of keywords. A larger group of keywords (spyware,malware, software, data, user, window, privacy, spy, domestic) can be identified,thus showing the ability descend through a hierarchical grouping of keywords.The implications of the graphs demonstrate the possibility to visualize closely-related terms in two-dimensional space. Although the two-dimensional graphsmay be an over-simplification of the dimensionality reduction that takes place,the plot can help to visualize the terms and relate to the topics produced for theweblogs.

    4.3 Results for Weblog Topic Analysis

    We conducted some experiments using PLSA for the weblog entries. Tables 1-4summarizes the keywords found for each of the four topics (Computer Security,Osama bin Laden, Iraq War, and US National Security).

    By looking at the various topics listed, we are able to see that the probabilis-tic approach is able to list important keywords of each topic in a quantitativefashion. The keywords listed can relate back to the original topics. For example,the keywords detected in the Computer Security topic features items such ascomputers, spyware, software, and internet.

    Figure 4 shows the graph of the topic-document distribution of the weblogentries by date. Some of the topics have a higher density of documents distributedaround certain dates. This can be used to match certain events in each topic tothe weblog entries. For example, the heavy clustering of documents for Topic4 (US National Security) indicate that there was an increase on the weblog

    http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/7/2019 Detecting Cyber Security Threats in Weblogs

    9/12

    54 F.S. Tsai and K.L. Chan

    Table 1. List of keywords for Topic 1:Computer Security

    Keyword Probability

    comput 0.023716malwar 0.020509spywar 0.018047softwar 0.014650secur 0.014257

    window 0.013527internet 0.013436http 0.013266web 0.012022user 0.011651

    Table 2. List of keywords for Topic 2:Osama bin Laden

    Keyword Probability

    moussaoui 0.0170900don 0.0083916life 0.0083244bin 0.0079485

    laden 0.0078515osama 0.0074730prison 0.0074594peopl 0.0064921death 0.0063618god 0.0061734

    Table 3. List of keywords for Topic 3:Iraq War

    Keyword Probability

    iraq 0.0134810war 0.0089393islam 0.0087796

    zarqawi 0.0086076militari 0.0073831muslim 0.0073576

    afghanistan 0.0072725iran 0.0070231qaeda 0.0069711iraqi 0.0065779

    Table 4. List of keywords for Topic 4: USNational Security

    Keyword Probability

    nsa 0.0142720bush 0.0127100phone 0.0121350

    program 0.0099155presid 0.0098480cia 0.0093545

    american 0.0088027record 0.0086708call 0.0086591

    administr 0.0084607

    conversations in this topic around the middle of May 2006. This can be due toUS President Bushs comment on May 11, 2006 about a USA Today report ona massive NSA database that collects information about all phone calls madewithin the United States [5]. This is one example of an event that can triggermuch conversation in the blogosphere.

    For the topic of Computer Security, we further decompose into separatesubtopics, two of which are shown in Tables 5-6. Malware, which includes com-puter viruses, worms, trojan horses, spyware, adware, and other malicious soft-ware, is the topic derived from examining the keywords in Subtopic 1. Subtopic2 is classified as Macintosh, and reflects the increasing reports of cyber attacksaffecting Macintosh computers. Therefore, we can classify and decompose thetopics into a hierarchy of subtopics, which may be useful for examining largerdata sets.

    The power of PLSA in cyber security applications include the ability to au-tomatically detect terms and keywords related to cyber security threats andterror events. By presenting blogs with measurable keywords, we can improve

    http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/7/2019 Detecting Cyber Security Threats in Weblogs

    10/12

    Detecting Cyber Security Threats in Weblogs Using Probabilistic Models 55

    s

    cipoT

    Documents

    1000 2000 3000 4000 5000

    1

    2

    3

    4

    May 1 5 10 15 20 25 31 (2006)

    Documents By Date

    Fig. 4. Topic by document distribution by date

    Table 5. List of keywords for Subtopic 1:Malware

    Keyword Probabilityspywar 0.0174610trojan 0.0117580adwar 0.0092120scan 0.0088160

    anti 0.0087841free 0.0074914

    spybot 0.0074895remov 0.0072834

    download 0.0069242viru 0.0068496

    Table 6. List of keywords for Subtopic 2:Macintosh

    Keyword Probabilitymac 0.0078348secur 0.0056874appl 0.0054443

    microsoft 0.0041262

    attack 0.0041169system 0.0039921report 0.0037533cyber 0.0037227crime 0.0036151comput 0.0035352

    our understanding of cyber security issues in terms of distribution and trends ofcurrent threats and events. This has implications for security agencies wishingto monitor real-time threats present in weblogs or other related documents.

    5 Conclusions

    In this paper, we analyzed weblog posts for various categories of cyber securitythreats related to the detection of cyber security threats, cyber crime, and cyberterrorism. To our knowledge, is the first such study focusing on cyber securityweblogs. We use latent semantic analysis to illustrate similarities in terms dis-

    tributed across all the terms in the weblog dataset. Our experiments on ourdataset of weblogs demonstrate how our probabilistic weblog model can presentthe blogosphere in terms of topics with measurable keywords, hence trackingpopular conversations and topics in the blogosphere. By applying a probabilisticapproach, we can improve information retrieval in weblog search and keywords

  • 8/7/2019 Detecting Cyber Security Threats in Weblogs

    11/12

    56 F.S. Tsai and K.L. Chan

    detection, and provide an analytical foundation for the future of security intel-ligence analysis of weblogs.

    Potential applications of this stream of research may include automaticallymonitoring and identifying trends in cyber terror and security threats in weblogs.This can have some significance for government and intelligence agencies wishingto monitor real-time potential international terror threats present in weblogconversations and the blogosphere.

    References

    1. Avesani, P., Cova, M., Hayes, C., Massa, P.: Learning Contextualised Weblog Top-ics. WWW 05 Workshop on the Weblogging Ecosystem: Aggregation, Analysis

    and Dynamics (2005)2. Berry, M., Dumais, S. and OBrien, G.: Using linear algebra for intelligent infor-

    mation retrieval. SIAM Review, 37(4):573595 (1995).3. Columbus, L.: Blog Mining Gets Real. CRM Buyer (2005).4. Deerwester, S., Dumais, S., Landauer, T., Furnas,G., Harshman, R.: Indexing by

    latent semantic analysis. In Journal of the American Society of Information Science,41(6) (1990) 391407

    5. Diamond, J.: NSA has massive database of Americans phone calls. In USA Today(May 10, 2006)

    6. Gill, K.E.: How Can We Measure the Influence of the Blogosphere? WWW 04Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics(2004)

    7. Glance, N.S. Hurst, M. Tomokiyo, T: BlogPulse: Automated Trend Discovery forWeblogs. WWW 04 Workshop on the Weblogging Ecosystem: Aggregation, Anal-ysis and Dynamics (2004)

    8. Gruhl, D. Guha, R.,Liben-Nowell, D., Tomkins, A.: Information Diffusion ThroughBlogspace. WWW 04 (2004)

    9. Hofmann, T.: Probabilistic Latent Semantic Indexing. SIGIR99 (1999)10. Mei, Q., Liu, C., Su, H., Zhai, C.: A Probabilistic Approach to Spatiotemporal

    Theme Pattern Mining on Weblogs. WWW 06 (2006)11. Nakajima, S., Tatemura, J., Hino,Y., Hara,Y., Tanaka, K.: Discovering ImportantBloggers based on Analyzing Blog Threads. WWW 05 Workshop on the Weblog-ging Ecosystem: Aggregation, Analysis and Dynamics (2005)

    12. Newman, D., Chemudugunta, C., Smyth, P., Steyvers, M.: Analyzing Entities andTopics in News Articles Using Statistical Topic Models. ISI 06 (2006)

    13. Pikas, C.K.: Blog Searching for Competitive Intelligence, Brand Image, and Rep-utation Management. Online. 29(4) (2005) 1621

    14. Prabowo, R., Thelwall, M.: A Comparison of Feature Selection Methods for anEvolving RSS Feed Corpus, Information Processing and Management, 42 (2006)

    1491151215. Tsai, F.S., Chan, C.K. (eds): Cyber Security, Pearson Education, Singapore (2006)

    16. Tsai, F.S., Chen, Y., Chan, K.L.: Probabilistic Latent Semantic Analysis for Searchand Mining of Corporate Blogs (2007)

    17. Wikipedia contributors: Intelligence Analysis. In: Wikipedia, The Free Encyclope-dia, http://en.wikipedia.org/wiki/Intelligence analysis. (accessed Nov 7,2006).

    http://en.wikipedia.org/wiki/Intelligence_analysishttp://en.wikipedia.org/wiki/Intelligence_analysis
  • 8/7/2019 Detecting Cyber Security Threats in Weblogs

    12/12

    Detecting Cyber Security Threats in Weblogs Using Probabilistic Models 57

    18. Yang, C.C., Shi, X., Wei, C.-P.: Tracing the Event Evolution of Terror Attacksfrom On-Line News. ISI 06 (2006)

    19. Yilmazel, O., Symonenko, S., Balasubramanian, N., Liddy, E.D.: Leveraging One-Class SVM and Semantic Analysis to Detect Anomalous Content. ISI 05 (2005)

    20. Zeimpekis, D., Gallopoulos, E.: TMG: A MATLAB Toolbox for generating term-document matrices from text collections. Grouping Multidimensional Data: RecentAdvances in Clustering. Springer (2005) 187210