

Expert Systems with Applications 38 (2011) 5330–5335


A tag-topic model for blog mining

Flora S. Tsai

School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore

Article info

Keywords: Blog mining, Weblog, Tags, Author-Topic model, Isomap, Latent Dirichlet Allocation

0957-4174/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2010.10.025

Tel.: +65 6790 6369; fax: +65 6793 3318. E-mail address: [email protected]

Abstract

Blog mining addresses the problem of mining information from blog data. Although mining blogs may share many similarities to Web and text documents, existing techniques need to be reevaluated and adapted for the multidimensional representation of blog data, which exhibit dimensions not present in traditional documents, such as tags. Blog tags are semantic annotations in blogs which can be valuable sources of additional labels for the myriad of blog documents. In this paper, we present a tag-topic model for blog mining, which is based on the Author-Topic model and Latent Dirichlet Allocation. The tag-topic model determines the most likely tags and words for a given topic in a collection of blog posts. The model has been successfully implemented and evaluated on real-world blog data.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

A blog, or weblog, is a type of online journal where entries are made in reverse chronological order. Blogs can comment on a particular subject, as well as form part of a social network (Tsai, Han, Xu, & Chua, 2009). The blogosphere is defined as the collection of all blogs as a community or social network. Because of the large number of existing blog documents (posts), the blogosphere content may be random and chaotic (Chen, Tsai, & Chan, 2008). As a result, effective mining and visualization techniques are needed to aid in the analysis and understanding of blog data.

A tag is a keyword that can be used to describe a blog. Tag metadata helps users quickly find related blog entries that are tagged with a topic of interest. Tags can be chosen by the blogger, the viewer, or both. If many users tag many items, this tag collection forms a folksonomy. Tagging was popularized by Web 2.0 and is an important feature of many existing services.

Many blog systems allow bloggers to add new tags to a post, in addition to placing the post into categories. For example, a post may display that it has been tagged with "web" and "security". Each of those tags can link to a main page that lists all of the related posts with the same tag. A sidebar may list all the tags for that blog, with each tag leading to an index page. If a post is incorrectly classified, a blogger can edit the list of tags.

Analysis of large data sets with multiple tags may require the use of dimensionality reduction or projection techniques to transform the data into a smaller set. Dimensionality reduction finds a smaller set of features that can describe the original set of observed dimensions. Dimensionality reduction can uncover hidden structure, which is useful for understanding and visualizing the data.

Previous studies (Chen, Tsai, & Chan, 2007; Liang, Tsai, & Kwee, 2009; Tsai & Chan, 2007a) use existing data mining techniques without considering the additional dimensions present in blogs. In this paper, we show how blog mining is different from traditional Web and text mining by defining the multiple dimensions in blog documents and comparing them to Web and text documents. Next, we describe a tag-topic model for mining the multiple tags present in blogs. Finally, we apply the Isomap (Tenenbaum, de Silva, & Langford, 2000) dimensionality reduction technique to visualize real-world collections of security blogs.

The paper is organized as follows: Section 2 describes past work in blog content and tag mining. Section 3 presents the models and techniques for blog mining, including the proposed tag-topic model to analyze and visualize the multiple tags present in blog data. Section 4 presents experimental results on real-world blog data, and Section 5 concludes the paper.

2. Blog content and tag mining

2.1. Dimensions of blog documents

A blog is structured differently from a typical Web or text document. Table 1 compares the different components of blog, Web, and text documents. URL stands for the Uniform Resource Locator, the Web address from which a document can be found. A permalink is specific to blogs, and is a URL that points to a specific blog entry after the entry has passed from the front page into the blog archives. Outlinks are documents that are linked from the blog or Web document. Tags are labels that people use to make it easier to find related blog posts, photos, and videos.


Table 1. Comparison of blog, Web, and text documents.

Components   Blog   Web   Text
Title        ✓      ✓     -
Content      ✓      ✓     ✓
Tags         ✓      -     -
Author       ✓      -     -
URL          ✓      ✓     -
Permalink    ✓      -     -
Outlinks     ✓      ✓     -
Time         ✓      -     -
Date         ✓      -     -


If we consider the different components of blogs, we can group general blog data mining into five main dimensions (blog content, tags, authors, links, and time), shown in Table 2.

The next sections define and summarize blog content and tag mining techniques.

2.2. Blog content mining

Blog content consists of the title and content of the blog documents. Many of the techniques are similar to those used for text and Web documents; however, important distinctions that pose challenges in natural language processing include the common use of abbreviations and slang words, spelling and grammatical errors, and different languages present within one document.

Many blog content mining techniques focus on sentiment or opinion mining, that is, judging whether a particular blog post is negative, positive, or neutral toward a particular entity (such as a person or product). In fact, one of the main tasks in the Text Retrieval Conference (TREC) Blog Track was the Opinion Retrieval Task, which involved locating blog posts that express an opinion about a given target (Ounis, de Rijke, Macdonald, Mishne, & Soboroff, 2006; Ounis, Macdonald, & Soboroff, 2008; Macdonald, Ounis, & Soboroff, 2007).

Another prevalent theme in blog content mining is the filtering of spam blogs, or splogs, which can greatly distort any estimates of the number of blogs posted. Previous work in splog detection includes self-similarity analysis on blog temporal dynamics (Lin, Sundaram, Chi, Tatemura, & Tseng, 2007) and the use of Support Vector Machines (SVMs) for blog identification and splog detection (Kolari, Finin, & Joshi, 2006).

Yet another important task in blog content mining is topic distillation, which was the second main task in TREC Blog 2007 (Macdonald et al., 2007) and 2008 (Ounis et al., 2008). The blog distillation, or feed search, task focuses on blog feeds, which are aggregates of blog posts. The blog distillation task searches for a blog feed with a principal, recurring interest in topic t. For a given topic t, systems should suggest feeds that are principally devoted to t over the timespan of the feed, and that would be recommended for subscription as an interesting feed about t (Macdonald et al., 2007). This task has direct relevance to the problem of searching for blogs to which a user may wish to subscribe. As many blog posts are inherently noisy, finding the relevant feeds is not a trivial problem.

Table 2. Blog dimensions.

Dimensions   Blog components
Content      Title and content
Tags         Tags (labels or categories)
Author       Author or blogger
Links        URL, permalink, outlinks
Time         Date and time

2.3. Blog tag mining

A blog tag is a word that categorizes documents according to their topic. Blog tag mining is a subset of social media tag mining. Social media sites, such as Flickr, MySpace, and del.icio.us, allow users to semantically annotate many different types of content. These user-generated tags classify content so that it can be easily found.

Because blog tags are typically user-generated, different users may use different tags to describe a similar blog. There is also a lack of information about the meaning of each tag. For example, the tag "apple" could refer to either the fruit or the company. The personalized variety of tags makes it difficult to find comprehensive information about a subject. Our proposed model attempts to solve some of the difficulties of blog tag mining by applying probabilistic and dimensionality reduction techniques, which can reduce the noise in blog tags.

3. Models and techniques for blog mining

In this section, we propose and apply probabilistic models and dimensionality reduction techniques for analyzing and visualizing the multiple tags present in blog data. This approach can easily be extended to different categories of multidimensional data, such as other types of social media. The techniques are based on Latent Dirichlet Allocation (Blei, Ng, & Jordan, 2003), a modified version of the Author-Topic model, and the Isomap dimensionality reduction algorithm.

3.1. Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) (Blei et al., 2003) models text documents as mixtures of latent topics, which are key concepts presented in the text. LDA is not as vulnerable to overfitting as traditional methods based on Latent Semantic Analysis (LSA) (Chen et al., 2008; Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990).

The topic mixture is drawn from a conjugate Dirichlet prior that is the same for all documents. The steps adapted for blog documents are summarized below:

(1) Select a multinomial distribution $\phi_t$ for each topic $t$ from a Dirichlet distribution with parameter $\beta$.

(2) For each blog document $b$, select a multinomial distribution $\theta_b$ from a Dirichlet distribution with parameter $\alpha$.

(3) For each word token $w$ in blog $b$, select a topic $t$ from $\theta_b$.

(4) Select a word $w$ from $\phi_t$.

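The probability of generating a corpus is:

$$\int\!\!\int \prod_{t=1}^{K} P(\phi_t \mid \beta) \prod_{b=1}^{N} P(\theta_b \mid \alpha) \left( \prod_{i=1}^{N_b} \sum_{t_i=1}^{K} P(t_i \mid \theta)\, P(w_i \mid t, \phi) \right) d\theta \, d\phi \qquad (1)$$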
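The four generative steps listed above can be sketched in a few lines of Python. This is only an illustration of the sampling process, not the implementation used in the paper: the function names `sample_dirichlet` and `generate_corpus`, the integer word ids, and the symmetric priors are assumptions made for the example.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    if total == 0.0:
        # Extremely small concentrations can underflow; fall back to a point mass.
        draws = [0.0] * size
        draws[rng.randrange(size)] = 1.0
        total = 1.0
    return [x / total for x in draws]


def generate_corpus(n_topics, vocab_size, doc_lengths, alpha, beta, seed=0):
    """Generate word-id documents following the LDA generative steps (1)-(4)."""
    rng = random.Random(seed)
    # Step 1: one word distribution phi_t per topic, drawn from Dirichlet(beta).
    phi = [sample_dirichlet(rng, beta, vocab_size) for _ in range(n_topics)]
    corpus = []
    for length in doc_lengths:
        # Step 2: one topic mixture theta_b per blog document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet(rng, alpha, n_topics)
        words = []
        for _ in range(length):
            # Step 3: select a topic t from theta_b.
            t = rng.choices(range(n_topics), weights=theta)[0]
            # Step 4: select a word w from phi_t.
            w = rng.choices(range(vocab_size), weights=phi[t])[0]
            words.append(w)
        corpus.append(words)
    return phi, corpus
```

Calling `generate_corpus(n_topics=3, vocab_size=6, doc_lengths=[5, 8], alpha=1.0, beta=0.5)` returns the topic-word distributions and two synthetic documents of 5 and 8 word ids.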

3.2. Tag-topic model

An extension of LDA to probabilistic Author-Topic (AT) modeling (Rosen-Zvi, Griffiths, Steyvers, & Smyth, 2004; Steyvers, Smyth, Rosen-Zvi, & Griffiths, 2004) is proposed for the blog tag and topic visualization. The AT model is based on Gibbs sampling, a Markov chain Monte Carlo technique, where each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over terms (words) for that topic (Steyvers et al., 2004).


We have extended the AT model for analysis of blog tags. For the tag-topic (TT) model, each tag is represented by a probability distribution over topics, and each topic is represented by a probability distribution over terms for that topic.

Fig. 1 shows the generative model of the TT model using plate notation.

For the TT model, the probability of generating a blog is given by:

$$\prod_{i=1}^{N_b} \frac{1}{T_b} \sum_{l} \sum_{t=1}^{K} \phi_{w_i t}\, \theta_{t l} \qquad (2)$$

where blog $b$ has $T_b$ tags. The probability is then integrated over $\phi$ and $\theta$ and their Dirichlet distributions, and sampled using the Gibbs sampling Monte Carlo technique.
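As an illustration of Eq. (2), the following sketch evaluates the probability of one blog for fixed, already estimated distributions. The array layout is an assumption made for the example: `phi[t][w]` holds the probability of word `w` under topic `t`, and `theta[l][t]` holds the probability of topic `t` under tag `l`. The function name `blog_probability` is hypothetical.

```python
def blog_probability(words, tags, phi, theta):
    """Evaluate Eq. (2) for one blog.

    words: list of word ids w_i in the blog
    tags:  list of tag ids attached to the blog (T_b = len(tags))
    phi:   phi[t][w]   = P(word w | topic t)
    theta: theta[l][t] = P(topic t | tag l)
    """
    n_topics = len(phi)
    prob = 1.0
    for w in words:
        # Sum over the blog's tags and over all topics.
        per_word = 0.0
        for l in tags:
            for t in range(n_topics):
                per_word += phi[t][w] * theta[l][t]
        # Each tag of the blog is weighted uniformly by 1 / T_b.
        prob *= per_word / len(tags)
    return prob
```

For long blogs the product of many small factors can underflow, so a practical version would more likely accumulate log-probabilities instead.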

The similarity matrices for tags and content can then be calculated using the symmetrized Kullback-Leibler (KL) distance between topic distributions, which measures the difference between two probability distributions. The similarity matrices can be visualized using the Isomap dimensionality reduction technique described in the following section.
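A minimal sketch of the pairwise distance computation is given below. The text does not state which form of the symmetrized KL distance is used; this example assumes the plain sum KL(p||q) + KL(q||p) without a factor of 1/2, and it assumes strictly positive probabilities, which holds for Dirichlet-smoothed estimates. The function names are hypothetical.

```python
import math


def symmetrized_kl(p, q):
    """Symmetrized KL distance KL(p||q) + KL(q||p) for strictly positive p and q."""
    return sum(
        pi * math.log(pi / qi) + qi * math.log(qi / pi)
        for pi, qi in zip(p, q)
    )


def distance_matrix(distributions):
    """Pairwise symmetrized KL distances between topic distributions."""
    n = len(distributions)
    return [
        [symmetrized_kl(distributions[i], distributions[j]) for j in range(n)]
        for i in range(n)
    ]
```

Passing the tag-topic distributions gives the tag distance matrix, and passing the document-topic distributions gives the document distance matrix.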

3.3. Isometric feature mapping (Isomap)

Isomap (Tenenbaum et al., 2000) is a nonlinear dimensionality reduction technique that uses multidimensional scaling (MDS) (Davison, 2000) techniques with geodesic interpoint distances instead of Euclidean distances. Geodesic distances represent the shortest paths along the curved surface of the manifold. Unlike linear techniques, Isomap can discover the nonlinear degrees of freedom that underlie complex natural observations (Tenenbaum et al., 2000).

Isomap deals with finite data sets of points in $\mathbb{R}^n$ which are assumed to lie on a smooth submanifold $M^d$ of low dimension $d < n$. The algorithm attempts to recover $M$ given only the data points. Isomap estimates the unknown geodesic distance in $M$ between data points in terms of the graph distance with respect to some graph $G$ constructed on the data points.

The Isomap algorithm consists of three basic steps:

(1) Find the nearest neighbors on the manifold $M$, based on the distances between pairs of points in the input space.

(2) Approximate the geodesic distances between all pairs of points on the manifold $M$ by computing their shortest path distances in the graph $G$.

(3) Apply MDS to the matrix of graph distances, constructing an embedding of the data in a d-dimensional Euclidean space $Y$ that best preserves the manifold's estimated intrinsic geometry (Tenenbaum et al., 2000).

Fig. 1. The graphical model for the tag-topic model using plate notation.

If two points appear on a nonlinear manifold, their Euclidean distance in the high-dimensional input space may not accurately reflect their intrinsic similarity. The geodesic distance along the low-dimensional manifold is thus a better representation for these points. The neighborhood graph G constructed in the first step allows an estimate of the true geodesic path to be computed efficiently in step two, as the shortest path in G. The two-dimensional embedding recovered by Isomap in step three is the one that best preserves the shortest path distances in the neighborhood graph. The embedding now represents simpler and cleaner approximations to the true geodesic paths than do the corresponding graph paths (Tenenbaum et al., 2000).
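Steps (1) and (2) of Isomap can be sketched with the standard library alone. The example below builds a k-nearest-neighbor graph from a symmetric distance matrix and runs the Floyd-Warshall algorithm to obtain graph shortest-path distances. Step (3), the MDS embedding, is omitted because it needs an eigendecomposition. The function name `geodesic_distances` and the choice of Floyd-Warshall are assumptions made for the illustration, not details taken from the paper.

```python
import math


def geodesic_distances(dist, k):
    """Estimate geodesic distances from a symmetric distance matrix.

    Step 1 connects every point to its k nearest neighbors.
    Step 2 computes all-pairs shortest paths in that graph.
    Pairs in different connected components keep the distance math.inf.
    """
    n = len(dist)
    graph = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        # Step 1: k nearest neighbors of point i, excluding i itself.
        others = [j for j in range(n) if j != i]
        neighbors = sorted(others, key=lambda j: dist[i][j])[:k]
        for j in neighbors:
            graph[i][j] = dist[i][j]
            graph[j][i] = dist[i][j]  # keep the graph undirected
    # Step 2: Floyd-Warshall shortest paths over the neighborhood graph.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                via = graph[i][m] + graph[m][j]
                if via < graph[i][j]:
                    graph[i][j] = via
    return graph
```

For points sampled along a curved arc, the value returned for the two endpoints follows the arc through intermediate points and is therefore larger than their straight-line distance, which is the effect described in the paragraph above.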

Isomap is a noniterative, polynomial-time algorithm for nonlinear dimensionality reduction. Isomap is able to compute a globally optimal solution, and for a certain class of data manifolds (such as the Swiss roll), is guaranteed to converge asymptotically to the true structure (Tenenbaum et al., 2000). However, Isomap may not easily handle more complex domains, such as those with non-trivial curvature or topology. Because a previous study showed that Isomap was generally able to perform well on visualization of synthetic as well as real-world data (Tsai & Chan, 2007b), we have applied Isomap for visualizing blog content and tags.

4. Experiments and results

We used the tag-topic model for blog data mining on our collection of real-world blog data. Dimensionality reduction was performed with Isomap to show the similarity plot of blog content and tags. Experiments show that the tag-topic model can reveal interesting patterns in the underlying tags and topics for our dataset of security-related blogs.

4.1. Data corpus

For our experiments, we extracted a subset of the Nielsen BuzzMetrics blog data corpus (see footnote 1) that focuses on blogs related to security threats and incidents involving cyber crime and computer viruses. The original dataset consists of 14 million blog posts collected by Nielsen BuzzMetrics for May 2006. Although the blog entries span only a short period of time, they are indicative of the amount and variety of blog posts that exist in different languages throughout the world.

Blog entries related to security threats such as malware, cyber crime, computer virus, encryption, and information security were extracted by keyword search and stored for use in our analysis.

There were a total of 3096 entries in our dataset; however, as most of the blog posts do not have tags associated with them, we eliminated those documents with null or blank tags, as well as those with tags labeled as "uncategorized". Each of the remaining 948 blog entries was saved as a text file for further text preprocessing. For the preprocessing of the blog content, HTML tags were removed, and lexical analysis was performed by removing stopwords, stemming, and pruning with the Text to Matrix Generator (TMG) (Zeimpekis & Gallopoulos, 2006) prior to generating the term-document matrix using term frequency (TF) local term weighting. The total number of terms after pruning and stopword removal was 4111. For the tag-document matrix, tags separated by "and", "/", or "&" were treated as separate tags. Otherwise, the words were combined to form one tag. The tag-document matrix was generated with binary local term weighting, resulting in a total of 552 unique tags. The term-document matrix and tag-document matrix were used to compute the tag-topic model.

Footnote 1: http://www.icwsm.org/data.html.

Table 3. Topic 11: Malware.

Term          Probability
browser       0.07184
worm          0.04667
yahoo         0.03283
user          0.03121
safeti        0.02768
instal        0.02488
facetim       0.02355
hijack        0.02002
malwar        0.01870
site          0.01708

Tag           Probability
world         0.13636
web           0.09365
videogames    0.07790
links         0.05805
www           0.05079
news          0.05011
opinion       0.03409
internet      0.03245
windows       0.02834
economy       0.02369

Table 5. Topic 26: Spyware.

Term                    Probability
spyware                 0.10403
comput                  0.02331
software                0.02177
anti                    0.01868
yahoo                   0.01800
web                     0.01594
user                    0.01525
system                  0.01320
new                     0.01234
person                  0.01183

Tag                     Probability
spywarenews             0.52080
quizzes                 0.04806
thankyouforsmoking      0.04412
aquifer                 0.03719
catchingupwithtowanda   0.01765
writing                 0.01623
spywarebooks            0.01529
secularhumanism         0.00961
sport                   0.00804
warroom                 0.00756
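The tag handling rule described for the tag-document matrix (split on "and", "/", or "&", otherwise join the words into a single tag) can be sketched as follows. The function name `split_tags` is hypothetical, and lowercasing is an assumption made here because the tags reported in the tables are all lowercase; the paper does not describe its exact string handling.

```python
import re

# Separators named in the text: the word "and", "/", and "&".
_SEPARATORS = re.compile(r"\s+and\s+|/|&", flags=re.IGNORECASE)


def split_tags(raw_label):
    """Split a raw tag label into normalized tags.

    Parts separated by "and", "/", or "&" become separate tags.
    Within each part, the remaining words are joined into one tag.
    """
    tags = []
    for part in _SEPARATORS.split(raw_label):
        # Assumption: lowercase and drop whitespace, e.g. "Spyware News" -> "spywarenews".
        tag = "".join(part.lower().split())
        if tag:
            tags.append(tag)
    return tags
```

Under these assumptions, `split_tags("news and politics")` gives two tags, while `split_tags("Spyware News")` gives the single tag `spywarenews`. A binary tag-document matrix then only records whether each resulting tag is attached to a post.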

In this model, each tag is represented by a probability distribution over topics, and each topic is represented as a probability distribution over terms for that topic (Steyvers et al., 2004). The topic-term and tag-topic distributions were then learned from the blog data in an unsupervised manner. The parameters used in our experiments were the number of topics (t = 50) and number of iterations (N = 2000). We used symmetric Dirichlet priors in the TT estimation with $\alpha = 50/t$ and $\beta = 0.01$, which are common settings in the literature.
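To make the estimation procedure more concrete, here is a small collapsed Gibbs sampler in the style of the Author-Topic model, with tags playing the role of authors. It is a sketch under stated assumptions and not the implementation used for the experiments: the function name `tag_topic_gibbs`, the integer ids for words and tags, and the use of the final sample for the point estimates are all choices made for this example.

```python
import random


def tag_topic_gibbs(docs, doc_tags, n_tags, vocab_size, n_topics,
                    alpha, beta, n_iter, seed=0):
    """Collapsed Gibbs sampling for a tag-topic model.

    docs:     list of documents, each a list of word ids
    doc_tags: list of tag-id lists, one (non-empty) list per document
    Returns (phi, theta) with phi[t][w] = P(word | topic)
    and theta[l][t] = P(topic | tag).
    """
    rng = random.Random(seed)
    word_topic = [[0] * n_topics for _ in range(vocab_size)]
    tag_topic = [[0] * n_topics for _ in range(n_tags)]
    topic_total = [0] * n_topics
    tag_total = [0] * n_tags

    def update(w, l, t, delta):
        word_topic[w][t] += delta
        tag_topic[l][t] += delta
        topic_total[t] += delta
        tag_total[l] += delta

    # Random initialization: each token gets one of its blog's tags and a topic.
    assign = []
    for d, words in enumerate(docs):
        row = []
        for w in words:
            l, t = rng.choice(doc_tags[d]), rng.randrange(n_topics)
            update(w, l, t, +1)
            row.append((l, t))
        assign.append(row)

    for _ in range(n_iter):
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                l, t = assign[d][i]
                update(w, l, t, -1)  # remove the token's current assignment
                pairs, weights = [], []
                for l2 in doc_tags[d]:
                    for t2 in range(n_topics):
                        p_word = (word_topic[w][t2] + beta) / (topic_total[t2] + vocab_size * beta)
                        p_topic = (tag_topic[l2][t2] + alpha) / (tag_total[l2] + n_topics * alpha)
                        pairs.append((l2, t2))
                        weights.append(p_word * p_topic)
                # Sample a (tag, topic) pair jointly for this token.
                l, t = rng.choices(pairs, weights=weights)[0]
                update(w, l, t, +1)
                assign[d][i] = (l, t)

    phi = [[(word_topic[w][t] + beta) / (topic_total[t] + vocab_size * beta)
            for w in range(vocab_size)] for t in range(n_topics)]
    theta = [[(tag_topic[l][t] + alpha) / (tag_total[l] + n_topics * alpha)
              for t in range(n_topics)] for l in range(n_tags)]
    return phi, theta
```

With the settings reported above, a call would use `n_topics=50`, `alpha=50/50` (that is, 1.0), `beta=0.01`, and `n_iter=2000`. Sorting a row of `phi` or a column of `theta` by probability gives ranked lists of terms and tags per topic.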

The most likely terms and corresponding tags for four of the topics (11, 22, 26, and 48) of the blog entry collection are listed in Tables 3–6.

From the results, we observe that some of the blog tags may not be very descriptive of the topic. For example, for the topic Spyware, the tags "quizzes", "thankyouforsmoking", "aquifer", and "catchingupwithtowanda" do not seem especially relevant to the topic. Since tags are user generated, there is often a problem of mislabeling, or of using long phrases instead of one or two words to tag a blog. Bloggers also have a tendency to use the same tag for many or all of their posts, no matter what the subject.

Table 4. Topic 22: Windows security.

Term            Probability
threat          0.02759
secure          0.02566
custom          0.02227
window          0.02203
antivirus       0.02178
beta            0.01985
protect         0.01960
response        0.01839
vista           0.01839
offer           0.01476

Tag             Probability
diggnews        0.47986
miscellanea     0.03511
gallery         0.02606
world           0.02111
musique         0.01960
spywarenews     0.01637
blogging        0.01271
warroom         0.01228
photos          0.00862
mobilesociety   0.00797

4.2. Blog content visualization

For visualizing the document similarities, the symmetrized Kullback-Leibler distance between topic distributions was calculated for each document pair. Fig. 2 shows the 2D plot of the document similarities based on the document-topic distributions. A random sample of 100 titles was taken and shown in the plot.

4.3. Blog tag visualization

For visualizing the tag similarities, the symmetrized Kullback-Leibler distance between topic distributions was calculated for each tag pair. Fig. 3 shows the 2D plot of the tag similarities based on the tag-topic distributions of the most popular tags. In the plot, each tag was scaled according to the number of blogs posted using that tag. The distances between the tags reflect how dissimilar the tags are, based on the topic distributions of the blogs that were posted. As seen from the plot, the majority of blogs in our dataset were tagged with either "spywarenews" or "news". Because of the free-form nature of the tags, problems may arise due to nonstandardized tag labels. This problem may be solved when a larger set of blogs is used. In addition, some of the tags overlap because they are tagged to the same or similar topics. This may be due to the specialized nature of our dataset, which focused on security blogs. If a larger set of blogs is used, there may not be as many overlapping tags.

Fig. 2. Results on visualization of blog content using Isomap (k = 100).

Fig. 3. Results on visualization of blog tags using Isomap (k = 20).

Table 6. Topic 48: Identity theft.

Term           Probability
secure         0.04668
card           0.02941
theft          0.02462
access         0.02334
credit         0.02302
compani        0.01982
ident          0.01695
execute        0.01567
laptop         0.01567
employe        0.01503

Tag            Probability
photos         0.31245
security       0.04562
religion       0.03325
miscellanea    0.03243
vehicles       0.02556
review         0.01539
veggingout     0.01182
wespen         0.01154
intellisense   0.01127
writing        0.01127

5. Conclusion and future work

In this paper, we proposed a tag-topic model for blog mining based on the Author-Topic model. In this model, each tag is represented by a probability distribution over topics, and each topic is represented as a probability distribution over terms for that topic. This can solve the problem of finding the most likely tags and terms for a given topic.

We have successfully implemented and evaluated the tag-topic model on real-world security blogs. Using the output of the tag-topic model, we present results in visualizing which tags are similar to each other with the Isomap dimensionality reduction technique. In addition, we also plot the results of the blog document similarities, based on the same techniques.

Since the tags are user generated, there may be some inherent noise in the tags. Dimensionality reduction can help remove the noise in the tags, and may prove useful for future studies focusing on tag mining and visualization. The tag-topic model can be extended in the future for larger datasets as well as other types of social media with semantic annotations.

References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res., 3, 993–1022.

Chen, Y., Tsai, F. S., & Chan, K. L. (2007). Blog search and mining in the business domain. In DDDM '07: Proceedings of the 2007 international workshop on domain driven data mining (pp. 55–60). New York, NY, USA: ACM.

Chen, Y., Tsai, F. S., & Chan, K. L. (2008). Machine learning techniques for business blog search and mining. Expert Systems with Applications, 35(3), 581–590.

Davison, M. (2000). Multidimensional scaling. Florida: Krieger Publishing Company.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

Kolari, P., Finin, T., & Joshi, A. (2006). SVMs for the blogosphere: Blog identification and splog detection. In AAAI spring symposium on computational approaches to analysing Weblogs.

Liang, H., Tsai, F. S., & Kwee, A. T. (2009). Detecting novel business blogs. In ICICS 2009 - Conference proceedings of the 7th international conference on information, communications and signal processing (ICICS).

Lin, Y.-R., Sundaram, H., Chi, Y., Tatemura, J., & Tseng, B. L. (2007). Splog detection using self-similarity analysis on blog temporal dynamics. In AIRWeb '07: Proceedings of the third international workshop on adversarial information retrieval on the web (pp. 1–8). New York, NY, USA: ACM.

Macdonald, C., Ounis, I., & Soboroff, I. (2007). Overview of the TREC-2007 blog track. In The sixteenth text REtrieval conference (TREC 2007) proceedings.

Ounis, I., de Rijke, M., Macdonald, C., Mishne, G. A., & Soboroff, I. (2006). Overview of the TREC-2006 Blog track. In TREC 2006 working notes (pp. 15–27).

Ounis, I., Macdonald, C., & Soboroff, I. (2008). Overview of the TREC-2008 Blog track. In TREC 2008 working notes.

Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. In AUAI '04: Proceedings of the 20th conference on uncertainty in artificial intelligence (pp. 487–494). Arlington, Virginia, United States: AUAI Press.

Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic author-topic models for information discovery. In KDD '04: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 306–315). New York, NY, USA: ACM.

Tenenbaum, J., de Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.

Tsai, F. S., & Chan, K. L. (2007a). Detecting cyber security threats in weblogs using probabilistic models. In Lecture notes in computer science LNCS (Vol. 4430, pp. 46–57).

Tsai, F. S., & Chan, K. L. (2007b). Dimensionality reduction techniques for data exploration. In 2007 6th international conference on information, communications and signal processing, ICICS.

Tsai, F. S., Han, W., Xu, J., & Chua, H. C. (2009). Design and development of a mobile peer-to-peer social networking application. Expert Systems with Applications, 36(8), 11077–11087.

Zeimpekis, D., & Gallopoulos, E. (2006). TMG: A MATLAB Toolbox for generating term-document matrices from text collections. In Grouping multidimensional data (pp. 187–210). Cambridge, MA: MIT Press.