Expert Systems with Applications 38 (2011) 5330–5335
Contents lists available at ScienceDirect
Expert Systems with Applications
journal homepage: www.elsevier.com/locate/eswa
A tag-topic model for blog mining
Flora S. Tsai
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore
Article info

Keywords: Blog mining; Weblog; Tags; Author-Topic model; Isomap; Latent Dirichlet Allocation
doi:10.1016/j.eswa.2010.10.025
Tel.: +65 6790 6369; fax: +65 6793 3318. E-mail address: [email protected]
Abstract
Blog mining addresses the problem of mining information from blog data. Although mining blogs may share many similarities to Web and text documents, existing techniques need to be reevaluated and adapted for the multidimensional representation of blog data, which exhibit dimensions not present in traditional documents, such as tags. Blog tags are semantic annotations in blogs which can be valuable sources of additional labels for the myriad of blog documents. In this paper, we present a tag-topic model for blog mining, which is based on the Author-Topic model and Latent Dirichlet Allocation. The tag-topic model determines the most likely tags and words for a given topic in a collection of blog posts. The model has been successfully implemented and evaluated on real-world blog data.
© 2010 Elsevier Ltd. All rights reserved.
1. Introduction
A blog, or weblog, is a type of online journal where entries are made in reverse chronological order. Blogs can comment on a particular subject, as well as form a social network (Tsai, Han, Xu, & Chua, 2009). The blogosphere is defined as the collection of all blogs as a community or social network. Because of the large number of existing blog documents (posts), the blogosphere content may be random and chaotic (Chen, Tsai, & Chan, 2008). As a result, effective mining and visualization techniques are needed to aid in the analysis and understanding of blog data.
A tag is a keyword that can be used to describe a blog. The tag metadata is useful for users to quickly find related blog entries that are tagged to a topic of interest. Tags can be chosen by the blogger, the viewer, or both. If many users tag many items, this tag collection forms a folksonomy. Tagging was popularized by Web 2.0 and is an important feature of many existing services.
Many blog systems allow bloggers to add new tags to a post, in addition to placing the post into categories. For example, a post may display that it has been tagged with "web" and "security". Each of those tags can link to a main page that lists all of the related posts with the same tag. A sidebar may list all the tags for that blog, with each tag leading to an index page. If a post is incorrectly classified, a blogger can edit the list of tags.
Analysis of large datasets with multiple tags may require the use of dimensionality reduction or projection techniques to transform the data into a smaller set. Dimensionality reduction finds a smaller set of features that can describe the original set of observed dimensions. Dimensionality reduction can uncover hidden structure, which is useful for understanding and visualizing the data.
Previous studies (Chen, Tsai, & Chan, 2007; Liang, Tsai, & Kwee, 2009; Tsai & Chan, 2007a) use existing data mining techniques without considering the additional dimensions present in blogs. In this paper, we show how blog mining is different from traditional Web and text mining by defining the multiple dimensions in blog documents and comparing them to Web and text documents. Next, we describe a tag-topic model for mining the multiple tags present in blogs. Finally, we implement the Isomap (Tenenbaum, de Silva, & Langford, 2000) dimensionality reduction technique for visualizing real-world collections of security blogs.
The paper is organized as follows: Section 2 describes past work in blog content and tag mining. Section 3 presents the models and techniques for blog mining, including the proposed tag-topic model to analyze and visualize the multiple tags present in blog data. Section 4 presents experimental results on real-world blog data, and Section 5 concludes the paper.
2. Blog content and tag mining
2.1. Dimensions of blog documents
A blog is structured differently from a typical Web or text document. Table 1 compares the different components of blog, Web, and text documents. URL stands for Uniform Resource Locator, the Web address from which a document can be found. A permalink is specific to blogs, and is a URL that points to a specific blog entry after the entry has passed from the front page into the blog archives. Outlinks are documents that are linked from the blog or Web document. Tags are labels that people use to make it easier to find related blog posts, photos, and videos.
Table 1
Comparison of blog, Web, and text documents.

Components   Blog   Web   Text
Title         ✓      ✓
Content       ✓      ✓     ✓
Tags          ✓
Author        ✓
URL           ✓      ✓
Permalink     ✓
Outlinks      ✓      ✓
Time          ✓
Date          ✓
If we consider the different components of blogs, we can group general blog data mining into five main dimensions (blog content, tags, authors, links, and time), shown in Table 2.
The next sections define and summarize blog content and tag mining techniques.
2.2. Blog content mining
Blog content consists of the title and content of the blog documents. Many of the techniques are similar to those for text and Web documents; however, important distinctions that pose challenges in natural language processing include the common use of abbreviations and slang words, spelling and grammatical errors, and different languages present within one document.
Many blog content mining techniques focus on sentiment or opinion mining, or judging whether a particular blog post is negative, positive, or neutral toward a particular entity (such as a person or product). In fact, one of the main tasks in the Text Retrieval Conference (TREC) Blog Track was the Opinion Retrieval Task, which involved locating blog posts that express an opinion about a given target (Ounis, de Rijke, Macdonald, Mishne, & Soboroff, 2006; Ounis, Macdonald, & Soboroff, 2008; Macdonald, Ounis, & Soboroff, 2007).
Another prevalent theme in blog content mining is the filtering of spam blogs, or splogs, which can greatly misrepresent any estimations of the number of blogs posted. Previous work in splog detection includes splog detection using self-similarity analysis on blog temporal dynamics (Lin, Sundaram, Chi, Tatemura, & Tseng, 2007) and using Support Vector Machines (SVMs) to identify splogs (Kolari, Finin, & Joshi, 2006).
Yet another important task in blog content mining is topic distillation, which was the second main task in TREC Blog 2007 (Macdonald et al., 2007) and 2008 (Ounis et al., 2008). The blog distillation, or feed search, task focuses on blog feeds, which are aggregates of blog posts. The blog distillation task searches for a blog feed with a principal, recurring interest in a topic t. For a given topic t, systems should suggest feeds that are principally devoted to t over the timespan of the feed, and that would be recommended as an interesting feed about t to subscribe to (Macdonald et al., 2007). This task has direct relevance to the problem of searching for blogs
Table 2
Blog dimensions.

Dimensions   Blog components
Content      Title and content
Tags         Tags (labels or categories)
Author       Author or blogger
Links        URL, permalink, outlinks
Time         Date and time
to which a user may wish to subscribe. As many blog posts are inherently noisy, finding the relevant feeds is not a trivial problem.
2.3. Blog tag mining
A blog tag is a word that categorizes a document according to its topic. Blog tag mining is a subset of social media tag mining. Social media sites, such as Flickr, MySpace, and del.icio.us, allow users to semantically annotate many different types of content. These user-generated tags classify content so it can be easily found.
Because blog tags are typically user-generated, different users may use different tags to describe a similar blog. There is also a lack of information about the meaning of each tag. For example, the tag "apple" could refer to either the fruit or the company. This personalized variety of tags makes finding comprehensive information about a subject difficult. Our proposed model attempts to solve some of the difficulties of blog tag mining by applying probabilistic and dimensionality reduction techniques, which can reduce the noise in blog tags.
3. Models and techniques for blog mining
In this section, we propose and apply probabilistic models and dimensionality reduction techniques for analyzing and visualizing the multiple tags present in blog data. This model can easily be extended to different categories of multidimensional data, such as other types of social media. The techniques are based on Latent Dirichlet Allocation (Blei, Ng, & Jordan, 2003), a modified version of the Author-Topic model, and the Isomap dimensionality reduction algorithm.
3.1. Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) (Blei et al., 2003) models text documents as mixtures of latent topics, which are key concepts presented in the text. LDA is not as vulnerable to overfitting as traditional methods based on Latent Semantic Analysis (LSA) (Chen et al., 2008; Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990).
The topic mixture is drawn from a conjugate Dirichlet prior that is the same for all documents. The steps adapted for blog documents are summarized below:
(1) Select a multinomial distribution φ_t for each topic t from a Dirichlet distribution with parameter β.
(2) For each blog document b, select a multinomial distribution θ_b from a Dirichlet distribution with parameter α.
(3) For each word token w in blog b, select a topic t from θ_b.
(4) Select a word w from φ_t.
The probability of generating a corpus is:

\[
\int\!\!\int \prod_{t=1}^{K} P(\phi_t \mid \beta) \prod_{b=1}^{N} P(\theta_b \mid \alpha) \left( \prod_{i=1}^{N_b} \sum_{t_i=1}^{K} P(t_i \mid \theta)\, P(w_i \mid t_i, \phi) \right) d\theta \, d\phi \qquad (1)
\]
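The generative steps (1)-(4) can be sketched as a toy simulation. The code below is an illustration of the generative process, not the paper's implementation; it draws Dirichlet samples via normalized Gamma variates from the standard library, and the corpus sizes are arbitrary:

```python
import random

def dirichlet(alpha):
    """Sample from a Dirichlet distribution via normalized Gammas."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(n_blogs, n_words, K, V, alpha=0.1, beta=0.01):
    """Toy LDA generative process over word ids 0..V-1:
    step (1) topic-word distributions phi_t, step (2) per-blog topic
    mixtures theta_b, steps (3)-(4) draw a topic then a word."""
    phi = [dirichlet([beta] * V) for _ in range(K)]            # step 1
    corpus = []
    for _ in range(n_blogs):
        theta_b = dirichlet([alpha] * K)                       # step 2
        words = []
        for _ in range(n_words):
            t = random.choices(range(K), weights=theta_b)[0]   # step 3
            w = random.choices(range(V), weights=phi[t])[0]    # step 4
            words.append(w)
        corpus.append(words)
    return corpus

random.seed(0)
docs = generate_corpus(n_blogs=3, n_words=5, K=4, V=10)
print(docs)
```

Inference reverses this process: given only the observed words, the hidden φ and θ are estimated, which Eq. (1) makes explicit by integrating them out.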
3.2. Tag-topic model
An extension of LDA to probabilistic Author-Topic (AT) modeling (Rosen-Zvi, Griffiths, Steyvers, & Smyth, 2004; Steyvers, Smyth, Rosen-Zvi, & Griffiths, 2004) is proposed for the blog tag and topic visualization. The AT model is based on Gibbs sampling, a Markov chain Monte Carlo technique, where each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over terms (words) for that topic (Steyvers et al., 2004).
We have extended the AT model for analysis of blog tags. For the tag-topic (TT) model, each tag is represented by a probability distribution over topics, and each topic is represented by a probability distribution over terms for that topic.
Fig. 1 shows the generative model of the TT model using plate notation.

For the TT model, the probability of generating a blog is given by:
\[
\prod_{i=1}^{N_b} \frac{1}{T_b} \sum_{l} \sum_{t=1}^{K} \phi_{w_i t}\, \theta_{t l} \qquad (2)
\]
where blog b has T_b tags. The probability is then integrated over φ and θ and their Dirichlet distributions and sampled using the Gibbs sampling Monte Carlo technique.
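Once φ and θ are estimated, Eq. (2) can be evaluated directly: each word token averages over the blog's tags and sums over topics. The toy sketch below illustrates this; the indexing conventions and the small example numbers are ours, not the paper's:

```python
def blog_likelihood(word_ids, tag_ids, phi, theta):
    """Evaluate Eq. (2): for each word token, average uniformly (1/T_b)
    over the blog's tags and sum over topics of phi[w][t] * theta[t][l],
    where phi[w][t] = P(word w | topic t) and
    theta[t][l] = P(topic t | tag l)."""
    T_b = len(tag_ids)
    prob = 1.0
    for w in word_ids:
        token_p = 0.0
        for l in tag_ids:
            for t in range(len(theta)):
                token_p += phi[w][t] * theta[t][l]
        prob *= token_p / T_b
    return prob

# Two topics, two tags, vocabulary of three words (toy values).
phi = [[0.8, 0.1], [0.1, 0.8], [0.1, 0.1]]   # phi[w][t]
theta = [[0.9, 0.2], [0.1, 0.8]]             # theta[t][l]
p = blog_likelihood(word_ids=[0, 1], tag_ids=[0, 1], phi=phi, theta=theta)
print(round(p, 6))   # 0.201275
```

In practice these two distributions are not evaluated forward like this but are learned from the data by Gibbs sampling, as described above.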
The similarity matrices for tags and content can then be calculated using the symmetrized Kullback-Leibler (KL) distance between topic distributions, which measures the difference between two probability distributions. The similarity matrices can be visualized using the Isomap dimensionality reduction technique described in the following section.
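A minimal sketch of the symmetrized KL distance between two topic distributions follows. It assumes the common (KL(p‖q) + KL(q‖p))/2 symmetrization and adds a small epsilon to guard against zero probabilities; both choices are ours, since the paper does not spell them out:

```python
from math import log

def symmetrized_kl(p, q, eps=1e-12):
    """Symmetrized Kullback-Leibler distance between two discrete
    distributions: (KL(p||q) + KL(q||p)) / 2, with epsilon smoothing."""
    kl_pq = sum(pi * log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
print(symmetrized_kl(p, p))       # identical distributions -> 0.0
print(symmetrized_kl(p, q) > 0)   # True
```

Applying this to every pair of tag (or document) topic distributions yields the symmetric similarity matrix that Isomap then embeds.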
3.3. Isometric feature mapping (Isomap)
Isomap (Tenenbaum et al., 2000) is a nonlinear dimensionality reduction technique that uses multidimensional scaling (MDS) (Davison, 2000) techniques with geodesic interpoint distances instead of Euclidean distances. Geodesic distances represent the shortest paths along the curved surface of the manifold. Unlike the linear techniques, Isomap can discover the nonlinear degrees of freedom that underlie complex natural observations (Tenenbaum et al., 2000).
Isomap deals with finite data sets of points in R^n which are assumed to lie on a smooth submanifold M^d of low dimension d < n. The algorithm attempts to recover M given only the data points. Isomap estimates the unknown geodesic distance in M between data points in terms of the graph distance with respect to some graph G constructed on the data points.
The Isomap algorithm consists of three basic steps:
(1) Find the nearest neighbors on the manifold M, based on the distances between pairs of points in the input space.
(2) Approximate the geodesic distances between all pairs of points on the manifold M by computing their shortest path distances in the graph G.
Fig. 1. The graphical model for the tag-topic model using plate notation.
(3) Apply MDS to the matrix of graph distances, constructing an embedding of the data in a d-dimensional Euclidean space Y that best preserves the manifold's estimated intrinsic geometry (Tenenbaum et al., 2000).
If two points appear on a nonlinear manifold, their Euclidean distance in the high-dimensional input space may not accurately reflect their intrinsic similarity. The geodesic distance along the low-dimensional manifold is thus a better representation for these points. The neighborhood graph G constructed in the first step allows an estimation of the true geodesic path to be computed efficiently in step two, as the shortest path in G. The two-dimensional embedding recovered by Isomap in step three best preserves the shortest path distances in the neighborhood graph. The embedding thus represents simpler and cleaner approximations to the true geodesic paths than do the corresponding graph paths (Tenenbaum et al., 2000).
Isomap is a very useful noniterative, polynomial-time algorithm for nonlinear dimensionality reduction. Isomap is able to compute a globally optimal solution, and for a certain class of data manifolds (such as the Swiss roll), it is guaranteed to converge asymptotically to the true structure (Tenenbaum et al., 2000). However, Isomap may not easily handle more complex domains such as those with non-trivial curvature or topology. Because a previous study showed that Isomap was generally able to perform well on visualization of synthetic as well as real-world data (Tsai & Chan, 2007b), we have applied Isomap to visualizing blog content and tags.
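The three steps can be sketched end to end. The following minimal numpy-based Isomap is a from-scratch illustration under our own implementation choices (dense Floyd-Warshall shortest paths, classical MDS by eigendecomposition); a production system would use an optimized library implementation:

```python
import numpy as np

def isomap(X, n_neighbors=5, n_components=2):
    """Minimal Isomap following the three steps in the text:
    (1) k-nearest-neighbor graph on Euclidean distances,
    (2) all-pairs shortest paths (Floyd-Warshall) as geodesics,
    (3) classical MDS on the geodesic distance matrix."""
    n = X.shape[0]
    # Step 1: pairwise Euclidean distances and symmetric kNN graph.
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    g = np.full((n, n), np.inf)
    np.fill_diagonal(g, 0.0)
    for i in range(n):
        for j in np.argsort(d[i])[1:n_neighbors + 1]:
            g[i, j] = g[j, i] = d[i, j]
    # Step 2: Floyd-Warshall shortest paths approximate geodesics.
    for k in range(n):
        g = np.minimum(g, g[:, k:k + 1] + g[k:k + 1, :])
    # Step 3: classical MDS via double centering and eigendecomposition.
    g2 = g ** 2
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ g2 @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:n_components]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Noisy points sampled along an arc embed into 2D.
rng = np.random.default_rng(0)
t = np.linspace(0, 3, 40)
X = np.c_[np.cos(t), np.sin(t)] + rng.normal(scale=0.01, size=(40, 2))
Y = isomap(X, n_neighbors=6)
print(Y.shape)   # (40, 2)
```

The O(n^3) Floyd-Warshall step is the bottleneck; for the few hundred tags and documents in the experiments below this is inexpensive, but larger collections would call for Dijkstra on a sparse graph.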
4. Experiments and results
We used the tag-topic model for blog data mining on our collection of real-world blog data. Dimensionality reduction was performed with Isomap to show the similarity plot of blog content and tags. Experiments show that the tag-topic model can reveal interesting patterns in the underlying tags and topics for our dataset of security-related blogs.
4.1. Data corpus
For our experiments, we extracted a subset of the Nielsen BuzzMetrics blog data corpus¹ that focuses on blogs related to security threats and incidents related to cyber crime and computer viruses. The original dataset consists of 14 million blog posts collected by Nielsen BuzzMetrics for May 2006. Although the blog entries span only a short period of time, they are indicative of the amount and variety of blog posts that exist in different languages throughout the world.
Blog entries related to security threats such as malware, cyber crime, computer viruses, encryption, and information security were extracted by keyword search and stored for use in our analysis.
There were a total of 3096 entries in our dataset; however, as most of the blog posts do not have tags associated with them, we eliminated those documents with null or blank tags, as well as those with tags labeled as "uncategorized". Each of the remaining 948 blog entries was saved as a text file for further text preprocessing. For the preprocessing of the blog content, HTML tags were removed, and lexical analysis was performed by removing stopwords, stemming, and pruning with the Text to Matrix Generator (TMG) (Zeimpekis & Gallopoulos, 2006) prior to generating the term-document matrix using term frequency (TF) local term weighting. The total number of terms after pruning and stopword removal was 4111. For the tag-document matrix, tags separated by "and", "/", or "&" were treated as separate tags. Otherwise, the words were
1 http://www.icwsm.org/data.html.
Table 3
Topic 11: malware.

Term      Probability
browser   0.07184
worm      0.04667
yahoo     0.03283
user      0.03121
safeti    0.02768
instal    0.02488
facetim   0.02355
hijack    0.02002
malwar    0.01870
site      0.01708

Tag          Probability
world        0.13636
web          0.09365
videogames   0.07790
links        0.05805
www          0.05079
news         0.05011
opinion      0.03409
internet     0.03245
windows      0.02834
economy      0.02369

Table 5
Topic 26: Spyware.

Term       Probability
spyware    0.10403
comput     0.02331
software   0.02177
anti       0.01868
yahoo      0.01800
web        0.01594
user       0.01525
system     0.01320
new        0.01234
person     0.01183

Tag                     Probability
spywarenews             0.52080
quizzes                 0.04806
thankyouforsmoking      0.04412
aquifer                 0.03719
catchingupwithtowanda   0.01765
writing                 0.01623
spywarebooks            0.01529
secularhumanism         0.00961
sport                   0.00804
warroom                 0.00756
combined to form one tag. The tag-document matrix was generated with binary local term weighting, resulting in a total of 552 unique tags. The term-document matrix and tag-document matrix were used to compute the tag-topic model.
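The tag-splitting rule described above (separators "and", "/", "&", with everything else kept as one multi-word tag) can be sketched as follows; the exact regular expression and lowercasing are our assumptions about the preprocessing:

```python
import re

def split_tags(raw_tag):
    """Split a raw tag string on the separators described in the text
    ("and", "/", "&"); otherwise keep it as a single multi-word tag.
    The exact rule used in the original preprocessing may differ."""
    parts = re.split(r"\s+and\s+|/|&", raw_tag)
    return [p.strip().lower() for p in parts if p.strip()]

print(split_tags("security and privacy"))   # ['security', 'privacy']
print(split_tags("news/opinion"))           # ['news', 'opinion']
print(split_tags("identity theft"))         # ['identity theft']
```

Collecting the unique outputs over all 948 posts would yield the rows of the binary tag-document matrix.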
In this model, each tag is represented by a probability distribution over topics, and each topic is represented as a probability distribution over terms for that topic (Steyvers et al., 2004). The topic-term and tag-topic distributions were then learned from the blog data in an unsupervised manner. The parameters used in our experiments were the number of topics (t = 50) and the number of iterations (N = 2000). We used symmetric Dirichlet priors in the TT estimation with α = 50/t and β = 0.01, which are common settings in the literature.
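Estimation proceeds by collapsed Gibbs sampling in the Author-Topic style, with tags in place of authors. The sketch below shows the form of one sweep under the symmetric priors mentioned above; it is a simplified illustration of the sampler, not the exact implementation of Rosen-Zvi et al. (2004), and the tiny corpus is invented:

```python
import random

def gibbs_sweep(blogs, tags, z, y, C_wt, C_tl, n_t, n_l, alpha, beta, V, K):
    """One collapsed-Gibbs sweep: for each token, remove its current
    (topic, tag) assignment from the counts, then resample the pair
    proportional to the smoothed word-topic and topic-tag counts."""
    for b, words in enumerate(blogs):
        for i, w in enumerate(words):
            t, l = z[b][i], y[b][i]
            C_wt[w][t] -= 1; C_tl[t][l] -= 1; n_t[t] -= 1; n_l[l] -= 1
            weights, pairs = [], []
            for l2 in tags[b]:
                for t2 in range(K):
                    p = ((C_wt[w][t2] + beta) / (n_t[t2] + V * beta) *
                         (C_tl[t2][l2] + alpha) / (n_l[l2] + K * alpha))
                    weights.append(p); pairs.append((t2, l2))
            t, l = random.choices(pairs, weights=weights)[0]
            z[b][i], y[b][i] = t, l
            C_wt[w][t] += 1; C_tl[t][l] += 1; n_t[t] += 1; n_l[l] += 1

# Tiny invented corpus: 2 blogs, vocabulary of 4 words, 2 topics, 2 tags.
random.seed(1)
V, K = 4, 2
blogs = [[0, 1, 2], [2, 3, 3]]
tags = [[0], [0, 1]]                # tag ids attached to each blog
z = [[random.randrange(K) for _ in ws] for ws in blogs]
y = [[random.choice(tags[b]) for _ in ws] for b, ws in enumerate(blogs)]
C_wt = [[0] * K for _ in range(V)]
C_tl = [[0] * 2 for _ in range(K)]
n_t, n_l = [0] * K, [0] * 2
for b, ws in enumerate(blogs):
    for i, w in enumerate(ws):
        C_wt[w][z[b][i]] += 1; C_tl[z[b][i]][y[b][i]] += 1
        n_t[z[b][i]] += 1; n_l[y[b][i]] += 1
for _ in range(20):
    gibbs_sweep(blogs, tags, z, y, C_wt, C_tl, n_t, n_l,
                alpha=50 / K, beta=0.01, V=V, K=K)
print(sum(n_t))   # 6 (token count is conserved across sweeps)
```

After burn-in, the smoothed counts C_wt and C_tl estimate the topic-term distribution φ and the tag-topic distribution θ, from which tables like Tables 3-6 are read off.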
The most likely terms and corresponding tags from each topic of the blog entry collection are listed in Tables 3–6.
From the results, we observe that some of the blog tags may not be very descriptive of the topic. For example, for the topic Spyware, the tags "quizzes", "thankyouforsmoking", "aquifer", and "catchingupwithtowanda" do not seem especially relevant to the topic.
Table 4
Topic 22: Windows security.

Term        Probability
threat      0.02759
secure      0.02566
custom      0.02227
window      0.02203
antivirus   0.02178
beta        0.01985
protect     0.01960
response    0.01839
vista       0.01839
offer       0.01476

Tag             Probability
diggnews        0.47986
miscellanea     0.03511
gallery         0.02606
world           0.02111
musique         0.01960
spywarenews     0.01637
blogging        0.01271
warroom         0.01228
photos          0.00862
mobilesociety   0.00797
Since tags are user generated, there is often a problem of mislabeling, or of using long phrases instead of one or two words to tag a blog. Bloggers also have a tendency to use the same tag for many or all of their posts, no matter what the subject.
4.2. Blog content visualization
For visualizing the document similarities, the symmetrized Kullback-Leibler distance between topic distributions was calculated for each document pair. Fig. 2 shows the 2D plot of the document similarities based on the document-topic distributions. A random sample of 100 titles was taken and shown in the plot.
4.3. Blog tag visualization
For visualizing the tag similarities, the symmetrized Kullback-Leibler distance between topic distributions was calculated for each tag pair. Fig. 3 shows the 2D plot of the tag similarities based on the tag-topic distributions of the most popular tags. In the plot,
Table 6
Topic 48: Identity theft.

Term      Probability
secure    0.04668
card      0.02941
theft     0.02462
access    0.02334
credit    0.02302
compani   0.01982
ident     0.01695
execute   0.01567
laptop    0.01567
employe   0.01503

Tag            Probability
photos         0.31245
security       0.04562
religion       0.03325
miscellanea    0.03243
vehicles       0.02556
review         0.01539
veggingout     0.01182
wespen         0.01154
intellisense   0.01127
writing        0.01127
Fig. 2. Results on visualization of blog content using Isomap (k = 100).
Fig. 3. Results on visualization of blog tags using Isomap (k = 20).
each tag was scaled according to the number of blogs posted using that tag. The distances between the tags are proportional to the similarity between tags, based on the topic distributions of the blogs that were posted. As seen from the graphs, the majority of blogs in our dataset were tagged with either "spywarenews" or "news". Because of the free-form nature of the tags, problems may arise due to nonstandardized tag labels. This problem may be solved when a larger set of blogs is taken. In addition, some of the tags overlap because they are tagged to the same or similar topics. This may be due to the specialized nature of our dataset, which focused on security blogs. If a larger set of blogs is taken, there may not be as many overlapping tags.
5. Conclusion and future work
In this paper, we proposed a tag-topic model for blog mining based on the Author-Topic model. In this model, each tag is represented by a probability distribution over topics, and each topic is represented as a probability distribution over terms for that topic. This can solve the problem of finding the most likely tags and terms for a given topic.
We have successfully implemented and evaluated the tag-topic model on real-world security blogs. Using the output of the tag-topic model, we present results in visualizing which tags are similar to each other with the Isomap dimensionality reduction technique. In addition, we plot the results of the blog document similarities, based on the same techniques.
Since the tags are user generated, there may be some inherent noise in the tags. Dimensionality reduction can help remove the noise in the tags, and may prove useful for future studies focusing on tag mining and visualization. The tag-topic model can be extended in the future to larger datasets as well as to other types of social media with semantic annotations.
References
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Chen, Y., Tsai, F. S., & Chan, K. L. (2007). Blog search and mining in the business domain. In DDDM '07: Proceedings of the 2007 international workshop on domain driven data mining (pp. 55–60). New York, NY, USA: ACM.
Chen, Y., Tsai, F. S., & Chan, K. L. (2008). Machine learning techniques for business blog search and mining. Expert Systems with Applications, 35(3), 581–590.
Davison, M. (2000). Multidimensional scaling. Florida: Krieger Publishing Company.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.
Kolari, P., Finin, T., & Joshi, A. (2006). SVMs for the blogosphere: Blog identification and splog detection. In AAAI spring symposium on computational approaches to analysing weblogs.
Liang, H., Tsai, F. S., & Kwee, A. T. (2009). Detecting novel business blogs. In ICICS 2009: Proceedings of the 7th international conference on information, communications and signal processing (ICICS).
Lin, Y.-R., Sundaram, H., Chi, Y., Tatemura, J., & Tseng, B. L. (2007). Splog detection using self-similarity analysis on blog temporal dynamics. In AIRWeb '07: Proceedings of the third international workshop on adversarial information retrieval on the web (pp. 1–8). New York, NY, USA: ACM.
Macdonald, C., Ounis, I., & Soboroff, I. (2007). Overview of the TREC-2007 blog track. In The sixteenth text retrieval conference (TREC 2007) proceedings.
Ounis, I., de Rijke, M., Macdonald, C., Mishne, G. A., & Soboroff, I. (2006). Overview of the TREC-2006 blog track. In TREC 2006 working notes (pp. 15–27).
Ounis, I., Macdonald, C., & Soboroff, I. (2008). Overview of the TREC-2008 Blog track.In TREC 2008 working notes.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. In AUAI '04: Proceedings of the 20th conference on uncertainty in artificial intelligence (pp. 487–494). Arlington, Virginia, United States: AUAI Press.
Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic author-topic models for information discovery. In KDD '04: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 306–315). New York, NY, USA: ACM.
Tenenbaum, J., de Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.
Tsai, F. S., & Chan, K. L. (2007a). Detecting cyber security threats in weblogs using probabilistic models. In Lecture Notes in Computer Science (Vol. 4430, pp. 46–57).
Tsai, F. S., & Chan, K. L. (2007b). Dimensionality reduction techniques for data exploration. In 2007 6th international conference on information, communications and signal processing (ICICS).
Tsai, F. S., Han, W., Xu, J., & Chua, H. C. (2009). Design and development of a mobile peer-to-peer social networking application. Expert Systems with Applications, 36(8), 11077–11087.
Zeimpekis, D., & Gallopoulos, E. (2006). TMG: A MATLAB toolbox for generating term-document matrices from text collections. In Grouping multidimensional data (pp. 187–210). Cambridge, MA: MIT Press.