
Expert Systems with Applications 38 (2011) 2766–2773


Dimensionality reduction techniques for blog visualization

Flora S. Tsai, School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore


Keywords: Blog; Weblog; Visualization; Dimensionality reduction; Manifold; Multidimensional scaling; Isomap; Locally Linear Embedding

doi:10.1016/j.eswa.2010.08.067

Tel.: +65 6790 6369; fax: +65 6793 3318. E-mail address: [email protected]

Exploratory data analysis often relies heavily on visual methods because of the power of the human eye to detect structures. For large, multidimensional data sets which cannot be easily visualized, the number of dimensions of the data can be reduced by applying dimensionality reduction techniques. This paper reviews current linear and nonlinear dimensionality reduction techniques in the context of data visualization. The dimensionality reduction techniques were used in our case study of business blogs. The superior techniques were able to discriminate the various categories of blogs quite accurately. To our knowledge, this is the first study using dimensionality reduction techniques for visualization of blogs. In summary, we have applied dimensionality reduction for visualization of real-world blog data, with potential applications in the ever-growing digital realm of social media.


1. Introduction

Dimensionality reduction reduces a large set of observed dimensions into a smaller set of features. Among its advantages are easier visualization and compression of the data and shorter processing times. The process of reducing the number of dimensions can serve to distinguish the important features or variables, thus providing additional insight into the nature of the data.

Dimensionality reduction or projection techniques can transform large data of multiple dimensions into a smaller, more manageable set. Thus, we can uncover hidden structure that aids in the understanding and visualization of the data. Linear dimensionality reduction techniques such as Principal Component Analysis (PCA) (Pearson, 1901) and Multidimensional Scaling (MDS) (Cox & Cox, 2000; Davison, 2000; Kruskal & Wish, 1978) are capable only of handling data that is inherently linear in nature, but nonlinear techniques for dimensionality reduction such as Locally Linear Embedding (LLE) (Roweis & Saul, 2000) and Isometric Feature Mapping (Isomap) (Tenenbaum, de Silva, & Langford, 2000) can handle nonlinear data with a certain type of topological manifold, such as the Swiss Roll. However, LLE and Isomap both fail on other types of nonlinear data, such as a sphere or torus (Saul & Roweis, 2003). In addition, the nonlinear techniques tend to be extremely sensitive to noise (Tsai & Chan, 2007b). This paper reviews dimensionality reduction techniques for visualization and applies the techniques to visualize blogs. Although previous studies have used dimensionality reduction techniques for data visualization (Geng, Zhan, & Zhou, 2005; Tsai & Chan, 2007b; Yang & Hubball, 2007), to our knowledge, the techniques have not been applied to visualize weblogs, or blogs, which are websites where entries are made in reverse chronological order. The rapid growth of blogs (Chen, Tsai, & Chan, 2007) and other new forms of social media (Tsai, Han, Xu, & Chua, 2009) has created a critical need for new technologies to transfer the digital realm of blogs and other media into a manageable form, and visualization aided by dimensionality reduction can help in this respect.

Previous studies on blog visualization applied tomographic clustering to visualize blog communities (Tseng, Tatemura, & Wu, 2005), visualized the cyber security threats present in security blogs (Tsai & Chan, 2007a), and implemented a probabilistic approach for spatiotemporal theme pattern mining on weblogs (Mei, Liu, Su, & Zhai, 2006). However, these approaches seldom use dimensionality reduction for blog visualization. In this paper we demonstrate dimensionality reduction in visualizing a collection of business blogs (Chen, Tsai, & Chan, 2008). To our knowledge, this is the first study using dimensionality reduction techniques to visualize blogs.

2. Linear dimensionality reduction techniques

If the transformation to a lower-dimensional space is a linear combination of the original variables, then this is called linear dimensionality reduction. In feature extraction, all available variables are used and the data is transformed using a linear transformation to a reduced dimension space. The aim is to replace the original variables by a smaller set of underlying variables. The techniques covered here are also referred to as techniques of exploratory data analysis, geometric methods, or methods of ordination, where no prior assumption is made about the existence of groups or clusters in the data. Geometric methods are sometimes further categorized as being variable-directed when they are primarily concerned with relationships between variables, or individual-directed when they are primarily concerned with relationships between individuals (Webb, 2002).

2.1. Principal Component Analysis (PCA)

PCA, also known as the Karhunen–Loève transform, is a well-established method of dimensionality reduction introduced by Pearson (1901). The purpose of PCA is to derive new variables (in decreasing order of importance) that are linear combinations of the original variables and are uncorrelated. Geometrically, PCA can be described as a rotation of the axes of the original coordinate system to a new set of orthogonal axes that are ordered in terms of the amount of variation of the original data they account for (Webb, 2002).

One of the reasons for performing PCA is to find a smaller group of underlying variables that describe the data. The hope is that the first few components will account for most of the variation in the original data. PCA is a variable-directed technique, making no assumptions about the existence of groupings within the data, and is thus considered an unsupervised feature extraction technique (Webb, 2002).

PCA projects n-dimensional data onto a lower d-dimensional subspace in a way that minimizes the sum-squared error, or (equivalently) maximizes the variance, or (equivalently) gives uncorrelated projected distributions (Duda, Hart, & Stork, 2000).

Since PCA is a linear transformation method, it is simple to compute and is guaranteed to work. It is useful in reducing dimensionality and finding new, more informative, uncorrelated features. However, PCA may not be able to accurately represent nonlinear data.
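For illustration, a minimal Python sketch of PCA-based visualization is given below (it is not code from this paper); the data matrix and the four category labels are random placeholders standing in for document vectors.

```python
# Minimal sketch (placeholder data): project high-dimensional document vectors
# onto the first two principal components and plot them by category.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((200, 50))             # placeholder: 200 documents, 50 features
labels = rng.integers(0, 4, 200)      # hypothetical labels for 4 blog categories

pca = PCA(n_components=2)             # keep the two directions of largest variance
X2 = pca.fit_transform(X)             # rotate axes and project

plt.scatter(X2[:, 0], X2[:, 1], c=labels, cmap="tab10", s=10)
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.show()
print("variance explained:", pca.explained_variance_ratio_)
```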

2.2. Multidimensional Scaling (MDS)

Multidimensional scaling (MDS) is a general approach which achieves a lower-dimensional representation of data, while trying to preserve the distances between the data points (Hand, Mannila, & Smyth, 2001). This class of methods is sometimes called distance methods. The distance can be represented as either a similarity or dissimilarity measure. We can think of squeezing a high-dimensional point cloud into a small number of dimensions (2 or 3) while preserving as well as possible the interpoint distances (Venables & Ripley, 2002). Recent work includes MDS for visualization of high-dimensional datasets (Yang & Hubball, 2007) and robust linear dimensionality reduction (Koren & Carmel, 2004).

MDS is equivalent to PCA when the distances are Euclidean. There are various MDS methods, differing in the types of metrics used as well as the calculations performed. However, all of the methods are governed by a set of similar principles. The starting point for MDS is the determination of the "spatial distance model" (Davison, 2000). In order to determine the proximities, the following notations are used:

Let Δ and D, N × N matrices, represent the collection of objects, indexed by i and j, where the proximity or data value connecting object i with object j is represented by δ_ij (Kruskal & Wish, 1978), and the distances between pairs of points x_i and x_j are represented by d_ij, as shown in Eqs. (1) and (2):

$$\Delta = \begin{bmatrix} \delta_{11} & \delta_{12} & \cdots & \delta_{1N} \\ \delta_{21} & \delta_{22} & \cdots & \delta_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \delta_{N1} & \delta_{N2} & \cdots & \delta_{NN} \end{bmatrix}, \qquad (1)$$

$$D = \begin{bmatrix} d_{11} & d_{12} & \cdots & d_{1N} \\ d_{21} & d_{22} & \cdots & d_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ d_{N1} & d_{N2} & \cdots & d_{NN} \end{bmatrix}. \qquad (2)$$

The aim of MDS is to find a configuration such that the distances d_ij match, as well as possible, the dissimilarities δ_ij (Cox & Cox, 2000).

The variations of MDS come in the differences in the functions used to transform the dissimilarities. Classical Metric Multidimensional Scaling is a basic form of MDS, in which the distances between points in the result, d_ij, are as close as possible to the dissimilarities δ_ij, measured in Euclidean distances. This is also sometimes referred to as principal coordinate analysis, which is also equivalent to PCA (Venables & Ripley, 2002). MDS methods such as Classical MDS are called metric methods because the relationship between d_ij and δ_ij depends on the numerical or metric properties of the dissimilarities. Nonmetric MDS refers to those methods where the relationship between d_ij and δ_ij depends on the rank ordering of the dissimilarities (Kruskal & Wish, 1978).

MDS is relatively simple to implement, very useful for visualization, and able to uncover hidden structure in the data.
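A minimal sketch of metric MDS on a precomputed Euclidean dissimilarity matrix, as in Eqs. (1) and (2), follows; the points are random placeholders and the scikit-learn MDS implementation stands in for the general formulation.

```python
# Minimal sketch (placeholder data): metric MDS on an N x N Euclidean
# dissimilarity matrix, seeking 2-D points whose distances match it.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.random((100, 30))                          # placeholder high-dimensional points
delta = squareform(pdist(X, metric="euclidean"))   # dissimilarity matrix, Eq. (1)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Y = mds.fit_transform(delta)                       # 2-D configuration (distances d_ij)
print(Y.shape, "stress:", round(mds.stress_, 2))
```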

2.3. Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA) (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990) is a well-known technique for information retrieval and document classification. It is a form of linear dimensionality reduction, and solves two fundamental problems in natural language processing: synonymy and polysemy. In synonymy, different words may have the same meaning. Thus, a person issuing a query in a search engine may use a different word than appears in a document, and may not retrieve the document. In polysemy, the same word can have multiple meanings, so a searcher can get unwanted documents with the alternate meanings. LSA has been previously used for corporate blog mining (Tsai, Chen, & Chan, 2007).

LSA solves the problem of lexical matching methods by using statistically derived conceptual indices instead of individual words for retrieval (Berry, Dumais, & O'Brien, 1995). LSA uses a term-document matrix which describes patterns of term (word) distribution across a set of documents.

LSA then finds a low-rank approximation which is smaller and less noisy than the original term-document matrix. The downsizing of the matrix is achieved through the use of singular value decomposition (SVD), where the set of all the terms is then represented by a vector space of lower dimensionality than the total number of terms in the vocabulary. The consequence of the rank lowering is that some dimensions get "merged". Thus, LSA is ideally suited for documents where the text input is noisy (Berry et al., 1995).

In LSA, each element of the n × m term-document matrix reflects the occurrence of a particular word in a particular document, i.e.,

$$A = [a_{ij}], \qquad (3)$$

where a_ij is the number of times or frequency in which term i appears in document j. As each word will not usually appear in every document, the matrix A is typically sparse with rarely any noticeable nonzero structure (Berry et al., 1995).

The matrix A is then factored into the product of three matrices using Singular Value Decomposition (SVD). Given a matrix A, where rank(A) = r, the SVD of A is defined as:

$$A = U S V^T. \qquad (4)$$


The columns of U and V are referred to as the left and right singular vectors, respectively, and the singular values of A are the diagonal elements of S, or the nonnegative square roots of the n eigenvalues of AA^T.

As defined by Eq. (4), the SVD is used to represent the original relationships among terms and documents as sets of linearly-independent vectors. Performing truncated SVD by using the k largest singular values and corresponding singular vectors, the original term-by-document matrix can be reduced to a smaller collection of vectors in k-space for conceptual query processing (Berry et al., 1995).
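A minimal sketch of LSA as a truncated SVD of a small TF-IDF matrix follows; the toy "blog entries" are invented for illustration, and the rows here are documents rather than terms (i.e., the transpose of the matrix A above).

```python
# Minimal sketch (toy documents): LSA = truncated SVD keeping k singular values.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "new mobile phone with better battery",
    "company launches a blog on its web site",
    "marketing campaign builds the brand",
    "how to save money and manage debt",
]
A = TfidfVectorizer().fit_transform(docs)     # document x term weight matrix
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = svd.fit_transform(A)               # k-dimensional concept coordinates
print(doc_vecs)
```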

2.4. Probabilistic latent semantic analysis model for blog mining

Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 2001) is based on a generative probabilistic model that stems from a statistical approach to LSA (Deerwester et al., 1990). PLSA is able to capture the polysemy and synonymy in text for applications in the information retrieval domain. Similar to LSA, PLSA uses a term-document matrix which describes patterns of term (word) distribution across a set of documents (blog entries). By implementing PLSA, topics are generated from the blog entries, where each topic produces a list of word usage; the model parameters are estimated by maximum likelihood using the expectation maximization (EM) algorithm.

The aspect model (Hofmann, 2001) in PLSA is a latent variable model for co-occurrence data associating an unobserved class variable z_k ∈ {z_1, ..., z_K} with each observation, an observation being the occurrence of a keyword in a particular blog entry. There are three probabilities used in PLSA:

1. P(b_i) denotes the probability that a keyword occurrence will be observed in a particular blog entry b_i,

2. P(w_j | z_k) denotes the class-conditional probability of a specific keyword conditioned on the unobserved class variable z_k,

3. P(z_k | b_i) denotes a blog-specific probability distribution over the latent variable space.

In the collection, the probability of each blog and the probability of each keyword are known, while the probability of an aspect given a blog and the probability of a keyword given an aspect are unknown. Using the above three probabilities and conditions, a three-step generative scheme is implemented:

1. Select a blog entry b_i with probability P(b_i),

2. Pick a latent class z_k with probability P(z_k | b_i),

3. Generate a keyword w_j with probability P(w_j | z_k).

As a result, a joint probability model is obtained in asymmetric parameterization:

$$P(b_i, w_j) = P(b_i)\,P(w_j \mid b_i), \qquad (5)$$

$$P(w_j \mid b_i) = \sum_{k=1}^{K} P(w_j \mid z_k)\,P(z_k \mid b_i). \qquad (6)$$

After the aspect model is generated, the model is fitted using the EM algorithm. The EM algorithm involves two steps, namely the expectation (E) step and the maximization (M) step. The E-step computes the posterior probability of the latent variable by applying Bayes' formula, so the parameterization of the joint probability model is obtained as:

$$P(z_k \mid b_i, w_j) = \frac{P(w_j \mid z_k)\,P(z_k \mid b_i)}{\sum_{l=1}^{K} P(w_j \mid z_l)\,P(z_l \mid b_i)}. \qquad (7)$$

The M-step updates the parameters based on the expected complete-data log-likelihood, which depends on the posterior probability resulting from the E-step. Hence the M-step re-estimates the following two probabilities:

$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{N} n(b_i, w_j)\,P(z_k \mid b_i, w_j)}{\sum_{m=1}^{M} \sum_{i=1}^{N} n(b_i, w_m)\,P(z_k \mid b_i, w_m)}, \qquad (8)$$

$$P(z_k \mid b_i) = \frac{\sum_{j=1}^{M} n(b_i, w_j)\,P(z_k \mid b_i, w_j)}{n(b_i)}. \qquad (9)$$

The EM iteration is continued to increase the likelihood function until the specific conditions are met and the program is terminated. These conditions can be a convergence condition, or a cut-off point, which is specified for reaching a local maximum, rather than a global maximum.

In summary, the PLSA model selects the model parameter values that maximize the probability of the observed data, and returns the relevant probability distributions by using the EM algorithm. Based on the preprocessed term-document matrix, the blogs are then classified onto different aspects or topics. For each aspect, the keyword usage, such as the probable words in the class-conditional distribution P(w_j | z_k), is determined. Empirical results indicate the advantages of PLSA in reducing perplexity, and high performance of precision and recall in information retrieval (Hofmann, 2001).
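A minimal sketch of the EM updates in Eqs. (7)–(9) is given below; the count matrix n(b_i, w_j), the number of aspects K, and the iteration count are placeholders rather than values from this paper.

```python
# Minimal sketch (random counts): PLSA EM for P(w_j|z_k) and P(z_k|b_i).
import numpy as np

rng = np.random.default_rng(0)
n = rng.integers(0, 5, size=(20, 50)).astype(float)   # n(b_i, w_j): 20 blogs, 50 words
N, M, K = n.shape[0], n.shape[1], 4                    # K latent aspects (assumed)

p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)   # P(w_j | z_k)
p_z_b = rng.random((N, K)); p_z_b /= p_z_b.sum(axis=1, keepdims=True)   # P(z_k | b_i)

for _ in range(50):
    # E-step, Eq. (7): posterior P(z_k | b_i, w_j), shape (N, M, K)
    joint = p_z_b[:, None, :] * p_w_z.T[None, :, :]
    post = joint / joint.sum(axis=2, keepdims=True)
    # M-step, Eq. (8): re-estimate P(w_j | z_k)
    num_wz = np.einsum("im,imk->km", n, post)
    p_w_z = num_wz / num_wz.sum(axis=1, keepdims=True)
    # M-step, Eq. (9): re-estimate P(z_k | b_i)
    num_zb = np.einsum("im,imk->ik", n, post)
    p_z_b = num_zb / n.sum(axis=1, keepdims=True)

print(p_w_z.shape, p_z_b.shape)   # topic-word and blog-topic distributions
```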

3. Nonlinear dimensionality reduction techniques

As seen in the previous section, PCA is useful in identifying significant coordinates and linear correlations in high-dimensional data. PCA as well as classical MDS are unsuitable if the data set contains nonlinear relationships among the variables. General MDS techniques are appropriate when the data is highly nonmetric or sparse. If the original high-dimensional data set contains nonlinear relationships, then nonlinear dimensionality reduction techniques may be more appropriate. In recent years, there has been much interest in the development of nonlinear dimensionality reduction techniques for data lying on a high-dimensional manifold, a topological space which is locally Euclidean. These methods, also known as manifold learning algorithms, are generally based on the MDS approach, whereby a lower-dimensional representation of data is achieved while preserving the original distances between the data points. However, there are some slight modifications and assumptions for these manifold learning techniques that make them different from MDS.

Methods such as LLE and Isomap rely on applying linear techniques on a set of locally linear neighborhoods. Therefore they can be classified into the category of local linear dimensionality reduction techniques. Although these local linear techniques perform well for particular classes of manifolds, they fail to achieve results with other classes of manifolds.

3.1. Isometric Feature Mapping (Isomap)

Isomap (Tenenbaum et al., 2000) is a nonlinear dimensionality reduction technique that uses MDS techniques with geodesic interpoint distances, which represent the shortest paths along the curved surface of the manifold. Isomap can be used to discover the nonlinear degrees of freedom that underlie complex natural observations (Tenenbaum et al., 2000).

Isomap deals with finite data sets of points in R^n which are assumed to lie on a smooth submanifold M^d of low dimension d < n. The algorithm attempts to recover M given only the data points. Isomap estimates the unknown geodesic distance in M between data points in terms of the graph distance with respect to some graph G constructed on the data points.

The Isomap algorithm consists of three basic steps:


1. Determine which points are neighbors on the manifold M, based on the distances between pairs of points in the input space.

2. Estimate the geodesic distances between all pairs of points on the manifold M by computing their shortest path distances in the graph G.

3. Apply MDS to the matrix of graph distances, constructing an embedding of the data in a d-dimensional Euclidean space Y that best preserves the manifold's estimated intrinsic geometry (Tenenbaum et al., 2000).

For two arbitrary points on a nonlinear manifold, their Euclidean distance in the high-dimensional input space may not accurately reflect their intrinsic similarity, as measured by the geodesic distance along the low-dimensional manifold. The neighborhood graph G constructed in step one of the Isomap algorithm allows an approximation to the true geodesic path to be computed efficiently in step two, as the shortest path in G. The two-dimensional embedding recovered by Isomap in step three best preserves the shortest path distances in the neighborhood graph. The embedding then represents simpler and cleaner approximations to the true geodesic paths than do the corresponding graph paths (Tenenbaum et al., 2000).
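A minimal sketch of the three-step Isomap procedure via scikit-learn follows, on random placeholder data; k = 12 mirrors the neighborhood size used later in Section 4.

```python
# Minimal sketch (placeholder data): k-NN graph -> shortest-path geodesics -> MDS.
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
X = rng.random((300, 20))                       # placeholder high-dimensional points
iso = Isomap(n_neighbors=12, n_components=2)    # k = 12 neighbors, 2-D embedding
Y = iso.fit_transform(X)
print(Y.shape)
```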

3.2. Locally Linear Embedding (LLE)

Like Isomap, LLE is a nonlinear dimensionality reduction technique that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs. Unlike Isomap, LLE eliminates the need to estimate pairwise distances between widely separated data points (Roweis & Saul, 2000). LLE assumes that the data manifold is linear when viewed locally.

The LLE algorithm is summarized as follows:

1. Determine which points are neighbors on the manifold M, based on the distances between pairs of points in the input space (same as Isomap).

2. Compute the weights W_ij that best reconstruct each data point x_i from its neighbors.

3. Compute the vectors y_i that are best reconstructed by the weights W_ij (Roweis & Saul, 2000).

Although LLE is very efficient, it can only find an embedding that preserves the local structure, is not guaranteed to asymptotically converge, and may introduce unpredictable distortions. Both the Isomap and LLE algorithms are unlikely to work well for manifolds like a sphere or a torus, require dense data points for good estimation, and are strongly dependent on a good local neighborhood.
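A minimal sketch of standard LLE on random placeholder data is shown below, again with k = 12 neighbors.

```python
# Minimal sketch (placeholder data): reconstruction weights W_ij, then embedding y_i.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
X = rng.random((300, 20))
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method="standard")
Y = lle.fit_transform(X)
print(Y.shape, "reconstruction error:", lle.reconstruction_error_)
```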

3.3. Hessian Locally Linear Embedding (HLLE)

Variants of LLE, such as Hessian Eigenmaps (Donoho & Grimes, 2003), combine LLE with Laplacian Eigenmaps (Belkin & Niyogi, 2003). The Hessian Eigenmap modifies the Laplacian Eigenmap framework, substituting a quadratic form based on the Hessian for one based on the Laplacian (Donoho & Grimes, 2003).

HLLE is a method for recovering the underlying parametrization of scattered data (m_i) lying on a manifold M embedded in high-dimensional Euclidean space. The manifold M, viewed as a Riemannian submanifold of the ambient Euclidean space R^n, is locally isometric to an open, connected subset H of Euclidean space R^d. Because H does not need to be convex, HLLE is able to handle a wider class of situations than the original Isomap (Donoho & Grimes, 2003), such as data with a central square removed. The underlying correct parameter space that generated such data is a square with a central square removed, similar to what is obtained by the Hessian approach (Donoho & Grimes, 2003).

The HLLE algorithm involves the following steps:

• Identify k-nearest neighbors.
• Obtain Tangent Coordinates.
• Develop Hessian Estimator.
• Develop Quadratic Form.
• Find Approximate Null Space.
• Find Basis for Null Space.

Although HLLE contains an orthogonalization step that makes the local fits more robust to pathological neighborhoods than LLE, HLLE still requires a numerical second differencing at each point (Donoho & Grimes, 2003).
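A minimal sketch of HLLE via the 'hessian' variant in scikit-learn follows, on random placeholder data; this implementation requires the neighborhood size k to exceed d(d + 3)/2 for a d-dimensional embedding.

```python
# Minimal sketch (placeholder data): Hessian LLE with k = 12 > 2*(2+3)/2 = 5.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
X = rng.random((300, 20))
hlle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method="hessian")
Y = hlle.fit_transform(X)
print(Y.shape)
```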

3.4. Local Tangent Space Alignment (LTSA)

Based on a set of unorganized data points sampled with noise from a parameterized manifold, the local geometry of the manifold is learned by constructing an approximation for the tangent space at each data point, and those tangent spaces are then aligned to give the global coordinates of the data points with respect to the underlying manifold (Zhang & Zha, 2005).

The LTSA algorithm is given as follows:

• Given N n-dimensional points sampled possibly with noise from an underlying d-dimensional manifold, the algorithm produces N d-dimensional coordinates T ∈ R^(d×N) for the manifold, constructed from k local nearest neighbors.

• The first step involves extracting local information. For each i = 1, ..., N, determine the k nearest neighbors x_ij of x_i, j = 1, ..., k. Then compute the d largest unit eigenvectors g_1, ..., g_d of the correlation matrix

$$(X_i - \bar{x}_i e^T)^T (X_i - \bar{x}_i e^T) \qquad (10)$$

and set

$$G_i = \left[\, e/\sqrt{k},\; g_1, \ldots, g_d \,\right]. \qquad (11)$$

• The second step involves constructing the alignment matrix. If a direct eigensolver will be used, form the matrix B by locally summing (12), with initial B = 0:

$$B(I_i, I_i) \leftarrow B(I_i, I_i) + I - G_i G_i^T, \quad i = 1, \ldots, N. \qquad (12)$$

Otherwise, implement a routine that computes the matrix–vector multiplication Bu for an arbitrary vector u.

• The last step involves aligning the global coordinates: compute the d + 1 smallest eigenvectors of B, pick the eigenvector matrix [u_2, ..., u_(d+1)] corresponding to the 2nd to (d + 1)st smallest eigenvalues, and set T = [u_2, ..., u_(d+1)]^T.

Compared with LLE, the LTSA algorithm is less sensitive to the choice of the k neighborhoods. When LLE was implemented on the S-Curve data set with different neighborhood sizes k, the deformations (stretching and compression) in the generated coordinates were quite prominent and depended on the value of k, whereas LTSA exhibits much fewer geometric deformations in the generated coordinates. In general, k should be chosen to match the sampling density, noise level, and the curvature at each of the data points so as to extract an accurate tangent space. Too few neighbors may result in a rank-deficient tangent space and lead to over-fitting, while too large a neighborhood will introduce too much bias and the computed tangent space will not match the local geometry well (Zhang & Zha, 2005).
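A minimal sketch of LTSA via the 'ltsa' variant in scikit-learn is given below, on random placeholder data; the rows of the output correspond to the global coordinates T described above.

```python
# Minimal sketch (placeholder data): align local tangent-space coordinates globally.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
X = rng.random((300, 20))
ltsa = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method="ltsa")
T = ltsa.fit_transform(X)
print(T.shape)
```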


Table 1
List of keywords for blog categories.

Product    Company   Marketing   Finance
Mobile     Blog      Market      Save
Battery    Ebay      Company     Money
Device     Amazon    Custom      Debt
Phone      Google    Busi        Year
Window     Web       Firm        Financ
Tablet     Develop   Advertise   Financi
Umpc       Api       Brand       Credit
Samsung    Site      Product     Card
Keyboard   Search    Corpor      College
Apple      Product   Client      Invest

Fig. 1. Results on visualization of blogs using PCA.


Fig. 2. Results on visualization of blogs using MDS.

LTSA constructs approximations of tangent spaces in order to represent the local geometry of the manifold (Zhang & Zha, 2005). It is less sensitive than LLE to the choice of k neighborhoods.

Although LTSA and other nonlinear dimensionality reduction algorithms are able to handle nonlinearities in data, they are generally not as robust as the linear dimensionality reduction techniques.

4. Experiments and results

To see the performance of the various algorithms in visualizing blogs, experiments were conducted on BizBlogs07 (Chen et al., 2008), a data corpus of business blogs. BizBlogs07 contains 1269 business blog entries from various CEOs' blog sites and business blog sites. There are a total of 86 companies represented in the blog entries, and the blogs were classified into four categories based on the contents or the main description of the blog: Product, Company, Marketing, and Finance (Chen et al., 2008). Using PLSA, the first ten keywords with the highest probability for the corresponding topic from each category (Product, Company, Marketing, and Finance) of the entire blog entry collection are listed in Table 1.

Fig. 3. Results on visualization of blogs using LLE (k = 12).

Fig. 4. Results on visualization of blogs using Isomap (k = 12).

In order to prepare the dataset, we first created a normalized term-document matrix with term frequency (TF) local term weighting and inverse document frequency (IDF) global term weighting. From this matrix, we applied LSA to create the 1269 × 1269 document–document similarity matrix, and used this as input to the dimensionality reduction algorithms. The results can be seen in Figs. 1–6.
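A sketch of this preparation pipeline is given below; it is not the released BizBlogs07 code, and a random matrix stands in for the TF-IDF-weighted term-document matrix.

```python
# Minimal sketch (random stand-in data): TF-IDF-like matrix -> LSA document-document
# similarity matrix -> 2-D embeddings from several of the reviewed algorithms.
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding

rng = np.random.default_rng(0)
A = rng.random((200, 500))                  # stand-in for the weighted doc x term matrix
docs_k = TruncatedSVD(n_components=50, random_state=0).fit_transform(A)   # LSA step
S = docs_k @ docs_k.T                       # document-document similarity matrix

embeddings = {
    "PCA": PCA(n_components=2).fit_transform(S),
    "MDS": MDS(n_components=2, random_state=0).fit_transform(S),
    "Isomap": Isomap(n_neighbors=12, n_components=2).fit_transform(S),
    "LLE": LocallyLinearEmbedding(n_neighbors=12, n_components=2).fit_transform(S),
}
for name, Y in embeddings.items():
    print(name, Y.shape)
```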

The linear dimensionality reduction techniques such as PCA and MDS were able to discriminate between the main categories, and the distinction between the Product and Finance categories was fairly strong. Although there were obvious overlaps in the categories of Company and Marketing, MDS was able to separate the classes fairly well, with only a few outliers. Most of the nonlinear dimensionality reduction techniques, on the other hand, were not able to discriminate well between the classes, even among the fairly distinct categories of Product and Finance. The Isomap algorithm performed the best out of the nonlinear algorithms. The results from the other nonlinear algorithms, LLE, HLLE, and LTSA, did not discriminate well between the various categories. This shows that the nature of the BizBlogs07 dataset does not match the type of topological structures with which the nonlinear dimensionality reduction techniques perform well.

Fig. 5. Results on visualization of blogs using HLLE (k = 12).

Fig. 6. Results on visualization of blogs using LTSA (k = 12).

5. Conclusion

The applicability of dimensionality reduction techniques for data and blog visualization has been evaluated in this work. A summary of some current linear and nonlinear dimensionality reduction techniques has been presented. PCA is useful in identifying significant coordinates and linear correlations in high-dimensional data. PCA as well as classical MDS are unsuitable if the data set contains nonlinear relationships among the variables. General MDS techniques are appropriate when the data is highly nonmetric or sparse. If the original high-dimensional data set contains nonlinear relationships, then nonlinear dimensionality reduction techniques are more appropriate. LSA is a dimensionality reduction technique that is widely used in information retrieval and classification.

We applied various dimensionality reduction techniques to visualize a set of business blogs. Some of the linear techniques, such as MDS, perform better than the nonlinear techniques, such as LLE. The superior techniques were able to discriminate the various categories of blogs quite accurately. To our knowledge, this is the first study using dimensionality reduction techniques to visualize blogs. Future work can focus on learning the topology for other types of datasets, to aid in the selection of the best dimensionality reduction algorithm. In conclusion, we have successfully applied dimensionality reduction to visualize real-world blog data, with potential applications in the ever-growing digital realm of social media.

References

Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(2), 1373–1396.

Berry, M. W., Dumais, S. T., & O'Brien, G. W. (1995). Using linear algebra for intelligent information retrieval. SIAM Review, 37(4), 573–595.

Chen, Y., Tsai, F. S., & Chan, K. L. (2007). Blog search and mining in the business domain. In 2007 International workshop on domain driven data mining, DDDM2007 (pp. 55–60). New York, NY, USA: ACM.

Chen, Y., Tsai, F. S., & Chan, K. L. (2008). Machine learning techniques for business blog search and mining. Expert Systems with Applications, 35(3), 581–590.

Cox, T. F., & Cox, M. A. A. (2000). Multidimensional scaling (2nd ed.). New York: Chapman & Hall/CRC.

Davison, M. (2000). Multidimensional scaling. Florida: Krieger Publishing Company.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

Donoho, D. L., & Grimes, C. (2003). Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. PNAS, 100(10), 5591–5596.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification. New York: Wiley-Interscience.

Geng, X., Zhan, D.-C., & Zhou, Z.-H. (2005). Supervised nonlinear dimensionality reduction for visualization and classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 35(6), 1098–1107.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Massachusetts: MIT Press.

Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1–2), 177–196.

Koren, Y., & Carmel, L. (2004). Robust linear dimensionality reduction. IEEE Transactions on Visualization and Computer Graphics, 10(4), 459–470.

Kruskal, J., & Wish, M. (1978). Multidimensional scaling. London: Sage Publications.

Mei, Q., Liu, C., Su, H., & Zhai, C. (2006). A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In WWW '06: Proceedings of the 15th international conference on World Wide Web (pp. 533–542). New York, NY, USA: ACM.

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, Series B, 2(11), 559–572.

Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.

Saul, L. K., & Roweis, S. T. (2003). Think globally, fit locally: Unsupervised learning of low-dimensional manifolds. Journal of Machine Learning Research, 4, 119–155.

Tenenbaum, J., de Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.

Tsai, F. S., & Chan, K. L. (2007a). Detecting cyber security threats in weblogs using probabilistic models. In Lecture notes in computer science (LNCS) (Vol. 4430, pp. 46–57).

Tsai, F. S., & Chan, K. L. (2007b). Dimensionality reduction techniques for data exploration. In 2007 6th International conference on information, communications and signal processing (ICICS).

Tsai, F. S., Chen, Y., & Chan, K. L. (2007). Probabilistic techniques for corporate blog mining. In Lecture notes in computer science (LNCS) (Vol. 4819, pp. 35–44).

Tsai, F. S., Han, W., Xu, J., & Chua, H. C. (2009). Design and development of a mobile peer-to-peer social networking application. Expert Systems with Applications, 36(8), 11077–11087.

Tseng, B. L., Tatemura, J., & Wu, Y. (2005). Tomographic clustering to visualize blog communities as mountain views. In WWW 2005 workshop on the weblogging ecosystem.

Venables, W., & Ripley, B. (2002). Modern applied statistics with S. New York: Springer.

Webb, A. (2002). Statistical pattern recognition. New London: John Wiley.

Yang, J., & Hubball, D. (2007). Value and relation display: Interactive visual exploration of large data sets with hundreds of dimensions. IEEE Transactions on Visualization and Computer Graphics, 13(3), 494–507.

Zhang, Z., & Zha, H. (2005). Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM Journal of Scientific Computing, 26(1), 313–338.