

IEEE 2010 10th International Conference on Intelligent Systems Design and Applications (ISDA), Cairo, Egypt (2010.11.29-2010.12.1)

A Novel Interactive Visualization Framework For Gene Expression Analysis

Nesreen Mahmoud, Hend El-Eraky, Noha A. Yousri, Sara Magdy, Omneya Khaled, Mai Ahmed, Shaimaa Abdel-Moety

Computers and System Engineering Department, Faculty of Engineering, Alexandria University, Egypt

{n.mcse, hend.eleraky}@yahoo.com, [email protected]

Abstract: Visualization techniques provide attractive tools for exploring and analyzing huge, high-dimensional gene expression data sets. Several visualization techniques have been developed that enable users to visually analyze high-dimensional data. However, these techniques should be integrated with efficient exploration techniques, such as clustering, outlier analysis, cluster ensembles, and cluster validation, to improve the quality of the analysis. Integrating such techniques in an interactive framework exploits users' knowledge to further enhance the analysis process. This work proposes an interactive visualization framework for gene expression analysis that adds several features to the visualization process by integrating it with recent data analysis algorithms.

Keywords: gene expression, visualization, clustering, interactivity, high dimensions, data analysis

I. INTRODUCTION

Gene expression data are characterized by their huge size and high dimensionality. Efficient techniques are thus needed to discover relations between genes and tumors, or among genes. Visualization and data mining techniques can be used for this purpose. Visualizing data is straightforward when the number of dimensions does not exceed three, because human perception operates in a three-dimensional space with orthogonal coordinates. High-dimensional data, however, require efficient visualization techniques that present the data in a lower-dimensional space.

Several approaches exist for visualizing high-dimensional data. One approach is dimensionality reduction, as in SVD [1] and PCA [2], which projects the data from their original space onto two or three principal components so that they can be viewed in 2D or 3D. Another approach maps the data from the original dimension space to a 2D or 3D view, as in Parallel Coordinates [3] and Heat Maps [4]. Each approach has its merits and disadvantages, as discussed later. Both mapping and projection lose information and/or clarity, because there are only three spatial dimensions, which are called extrinsic dimensions. Dimensions beyond three have to be omitted or mapped to intrinsic dimensions such as color. Therefore, several visualization techniques can be integrated in order to reveal several aspects of the same data set.

Clustering can be used to analyze gene expression data by grouping similar expression patterns together, revealing different relations between genes. While most existing clustering algorithms can discover coherent (globular) gene expression patterns (corresponding to globular clusters), they cannot discover connected patterns in an arbitrary-shaped cluster [5]. Efficient density-based [6] and distance-relatedness-based algorithms [5], [7] have been introduced to address this problem. Visualizing clusters gives a better understanding of the structure of the data and makes the clustering results easier to interpret.

The framework proposed here integrates efficient clustering and data analysis with different visualization techniques to improve the analysis process. It uses recent techniques in both domains to get the most out of the data. It also aims to promote continuous interaction with the data while visualizing them. The framework contains features that enhance gene expression analysis and make the application reliable and efficient compared to current systems, as illustrated in the following points:

• Other data analysis techniques are integrated with clustering analysis, such as outlier analysis, cluster validity, and cluster ensemble methods.
• Besides traditional operations such as zooming, scaling, and rotating, more interactive operations are added to each visualization: switching between different visualizations for the same clustering/outlier analysis results; changing an algorithm's parameters interactively and saving the corresponding results; reducing cluster sizes using representatives; and, most importantly, keeping a history of rendered visualization views for comparison purposes and evaluating clustering results in parallel while parameters change.

II. PROPOSED FRAMEWORK

The framework allows the user to analyze data easily by giving access to clustering and visualization methods in an integrated manner. The framework has been implemented as a Java application, and its different visualizations are shown in the following sections for illustration.

A. Interactive Visualization techniques

The techniques implemented in the proposed framework are discussed next.

Parallel Coordinates

In parallel coordinates (PC) [3], each feature/dimension is represented as a horizontal or vertical axis. Each pattern is mapped to a polyline that passes through all axes, crossing each axis at a position proportional to its value for that dimension.

978-1-4244-8136-1/10/$26.00 ©2010 IEEE


One of the main advantages of PC is its ability to project the whole set of dimensions into a 2-D view. Its main weakness is that it does not scale to large data sets: visual clutter usually results when the data set is huge. This is handled by filtering and brushing operations, each of which increases the clarity of the PC display [3] by focusing on subsets of the data. Brushing lets users highlight their regions of interest. Another solution to the visual clutter problem is to cluster the data and then show only the polylines of the clusters' representatives, or to select one cluster to view.

For user interactivity, operations such as axis ordering, scaling, zooming, focusing, and re-centering (to unite the origin of the dimensions) are included. Filtering and brushing, which focus on specified data subsets, are also added to help understand correlations across multiple dimensions. Another feature is a histogram over the parallel coordinates that shows where patterns accumulate along the coordinate axes. The size of the bin at each subdivision indicates the number of patterns crossing that axis within the subdivision.

To allow users to interact with clustering and outlier analysis, the genes are colored according to their cluster memberships, and the user can add and remove clusters as required. Users can also remove the topmost outliers from the data to make cluster behavior clearer. Parameters can also be varied interactively. Figure 1 shows an example of parallel coordinates in which two clusters are displayed in different colors. They show the expression of genes in two different classes of the leukemia data set.

Figure 1: Two Leukemia clusters using Parallel Coordinates, colored as red and blue.
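The parallel coordinates mapping and the per-axis histogram described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each dimension is min-max scaled to [0, 1], bins have equal width, and the function names `polyline_positions` and `axis_histogram` are hypothetical rather than taken from the framework.

```python
def polyline_positions(data):
    """Map each pattern (row) to crossing positions in [0, 1], one per axis."""
    dims = list(zip(*data))
    lows = [min(d) for d in dims]
    highs = [max(d) for d in dims]
    result = []
    for row in data:
        positions = []
        for value, lo, hi in zip(row, lows, highs):
            # Assumption: a constant dimension is drawn at mid-axis.
            positions.append(0.5 if hi == lo else (value - lo) / (hi - lo))
        result.append(positions)
    return result


def axis_histogram(positions, axis, bins=4):
    """Count how many polylines cross `axis` within each equal-width bin."""
    counts = [0] * bins
    for p in positions:
        index = min(int(p[axis] * bins), bins - 1)  # position 1.0 goes in the last bin
        counts[index] += 1
    return counts


# Three toy patterns with three dimensions each (made-up values).
expression = [[1.0, 10.0, 5.0], [2.0, 30.0, 5.0], [3.0, 20.0, 5.0]]
positions = polyline_positions(expression)
histogram = axis_histogram(positions, axis=0, bins=2)
```

Each inner list of `positions` holds the vertices of one polyline, so drawing a pattern amounts to connecting those positions across consecutive axes.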

Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique that keeps as much of the variation in the data as possible [2]. Each principal component is a linear combination of the original variables, and the most significant components capture most of the variation in the data set. It is known to be an efficient dimensionality reduction technique that removes the effect of redundant features and focuses on the effect of the discriminating features. In addition, PCA may sometimes help in discovering the cluster structure before any clustering algorithm is applied. However, PCA assumes a linear relationship between the original features. Thus, if the relation between features is non-linear, PCA fails because the largest variance does not lie along a single vector. It also faces a problem with high-dimensional or very large data sets, as computing the covariance matrix costs O(np^2), where n is the number of objects in a p-dimensional data set [8].

In the proposed framework, 2-D and 3-D PCA visualizations are used, with additional interactivity operations. These operations are zooming, stretching and shrinking (which modify the horizontal-to-vertical scale ratio), and scrolling. Another important feature is showing the expression signal of a selected gene when its point is clicked in the 2-D or 3-D view.

Figure 2: PCA 3-D visualization for the leukemia data set, showing 3 clusters

Other operations are added to interact with data analysis such as clustering and outlier analysis. Cluster coloring helps in discriminating between clusters by giving each cluster a distinct color. It also discriminates clusters from outliers. In addition, outliers produced by some clustering techniques can be hidden to get a better view of the structure in the data. For integration with outlier analysis, a feature that removes the top-n outliers is added to help the user analyze the data more accurately. The 3-D PCA visualization provides an additional feature, scene rotation, in which the user can rotate the scene in any direction. This can help the user see hidden data points and see the effect of changing the angle of rotation of the principal components. Different cluster structures can then be visually perceived. Figure 2 shows an example of visualization using 3-D PCA on the leukemia data set.
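To make the PCA projection and its O(np^2) covariance step concrete, the following Python sketch computes the covariance matrix, finds the first principal component, and projects the patterns onto it. It is an illustration only: power iteration is used here as a simple stand-in for a full eigendecomposition and is not claimed to be the method used in the framework, and all function names are hypothetical.

```python
import math


def covariance_matrix(data):
    """p x p sample covariance of n patterns; the nested sums are the O(n * p^2) step."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    return cov, centered


def first_component(cov, iterations=200):
    """Power iteration: converges to the direction of largest variance.

    Assumption: the starting vector is not orthogonal to that direction.
    """
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """Coordinate of each centered pattern along the given component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]


# Toy data lying exactly on the line y = 2x, so one component explains everything.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
cov, centered = covariance_matrix(data)
pc1 = first_component(cov)
scores = project(centered, pc1)
```

A 2-D or 3-D view would need the second and third components as well, which could be obtained by removing the found component from the covariance matrix and repeating the iteration.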

Rolling Dice Scatter Plot

Scatter plots are among the most popular visualization techniques because of their simplicity and clarity. In a scatter plot matrix, each cell is the scatter plot of two variables. This approach is useful for interpreting more than two variables and discovering hidden trends. It shows all possible two-way relations between variables/dimensions, which helps the reader understand more complex hidden trends in high-dimensional data sets. However, it is not suitable when the data are high dimensional, as the scatter plot matrix becomes difficult to read.

The rolling dice scatter plot is a recent approach, introduced in [9], for interactive exploration of high-dimensional data. It introduces structured navigation that uses the scatter plot matrix to move between dimensions interactively. It provides the following navigational operations on the scatter plot matrix: 1) Stepping, where the user uses the arrow buttons (as shown in Figure 3.a) to step to the adjacent cells of the scatter plot matrix, which allows step-by-step navigation on the matrix to see the effect of each dimension on the visualized data; and 2) Path planning, where the user specifies a path (see Figure 3.a) that is then shown on the Scatter Cube: transitions appear one after the other in the sequence specified by the path, and the Scatter Cube rotates so that a new scatter plot is shown on its front face. Instead of showing 2-D scatter plots, a new idea is introduced: every two adjacent cells in the scatter plot matrix are merged into one 3-D scatter plot, which could help the user see the relations between more dimensions in a flexible way.

Figure 3: a) A path on the navigation window, b) Scatter Cube with PCA data, c) Leukemia original data plotted using the first two dimensions.

Some visualization operations are added for more interactivity. These operations include:

• Input selection: The user can choose whether to work on the original data or on the data projected onto the PCA components (see Figures 3.a, 3.b, and 3.c).
• Dimension reordering: This brings similar dimensions near each other, which helps reduce sudden changes when moving between scatter plot matrix cells.
• Cell selection: The user can select any cell in order to investigate each scatter plot individually, to see which scatter plots are clearer than others and which dimensions are more effective than others.
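The stepping and path planning operations of the rolling dice scatter plot can be pictured as moves on a grid of (row, column) cells, where each cell pairs two dimensions. The sketch below expands a planned path into single steps between adjacent cells. The rule of changing the column first and the row second is an assumption made for illustration, and `expand_path` is a hypothetical name.

```python
def expand_path(waypoints):
    """Expand a list of (row, col) matrix cells into steps between adjacent cells."""
    steps = [waypoints[0]]
    for target_row, target_col in waypoints[1:]:
        row, col = steps[-1]
        while col != target_col:  # assumption: change the column dimension first
            col += 1 if target_col > col else -1
            steps.append((row, col))
        while row != target_row:  # then change the row dimension
            row += 1 if target_row > row else -1
            steps.append((row, col))
    return steps


# Plan a path from the cell pairing dimensions (0, 1) to the cell pairing (2, 3).
path = expand_path([(0, 1), (2, 3)])
```

Every step changes exactly one of the two plotted dimensions, which matches the idea of one cube rotation per transition.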

Heat Map

A heat map [4] is a graphical representation of data in which the values taken by a variable are represented by colors. For gene expression data, red and green are usually used to reflect relatively high and relatively low expression values, respectively.

Colors provide a much more powerful way to represent data values than a numerical representation. They also result in a display that exhibits clear multivariate patterns that can be easily compared, especially when clustering is used. Unlike other techniques such as parallel coordinates, heat maps do not suffer from occlusion (objects hiding behind and being obscured by other objects), which is caused by clutter when many patterns appear near each other. Clustering results can be displayed on the heat map to show the expression behavior of similar genes. When a hierarchical clustering technique is used, a dendrogram [10] can also appear on one side of the heat map, which enhances the user's perception of the clusters. The expression signal of each gene can also be displayed when the user clicks on it on the heat map. This indicates the expression behavior of that single gene, or of the cluster to which it belongs. The heat map can also display the expression value of a gene under a specific condition. Figure 4 shows an example of the heat map visualization, with the clustering obtained from hierarchical clustering and the dendrogram (explained later) shown on its left.

Figure 4: Heat map with dendrogram, revealing 3 clusters in the leukemia data set
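The red/green color coding used in the heat map can be sketched as a simple value-to-color function. The linear scale, the black midpoint, and the name `expression_to_rgb` are assumptions for illustration; the paper only states that red and green reflect relatively high and relatively low expression.

```python
def expression_to_rgb(value, low, high):
    """Map a value to (r, g, b): green for low, red for high.

    Assumption: the midpoint of [low, high] is drawn black and the scale is linear.
    """
    mid = (low + high) / 2.0
    half = (high - low) / 2.0
    if half == 0:
        return (0, 0, 0)
    t = max(-1.0, min(1.0, (value - mid) / half))  # clamp to [-1, 1]
    if t >= 0:
        return (round(255 * t), 0, 0)
    return (0, round(255 * -t), 0)


# One gene's made-up expression values across four conditions.
row_colors = [expression_to_rgb(v, -2.0, 2.0) for v in [-2.0, 0.0, 1.0, 5.0]]
```

Applying the function to every cell of the gene-by-condition matrix, with rows ordered by cluster, gives the kind of display described above.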

Dendrogram

A dendrogram [10] is a binary tree used to visualize the results of hierarchical clustering [11] under any criterion, such as single linkage, complete linkage, or average linkage [11]. At the top of the dendrogram is the root, which represents the single final cluster that contains all data objects. At the bottom are labeled leaf nodes that represent the data objects (genes). The importance of the dendrogram lies in its ability to show the cluster structure at different levels of abstraction. In the proposed framework, the user can interactively explore different clustering solutions by cutting the dendrogram at different levels.

Interactivity features are added to the dendrogram as follows. The dendrogram has two sliders: one specifies the number of clusters at which to cut the tree, and the other specifies a threshold or distance at which to cut the tree. As these sliders change, different parts of the dendrogram are removed, resulting in different clustering results. Zooming and scrolling are also added to make it easier to view leaf nodes and subtrees in huge data sets. As in the other techniques in the proposed framework, the expression signal appears when one of the leaf nodes (genes) is selected. When a cluster subtree is selected, the signals of all the genes belonging to it are displayed. This can help in identifying the objects in the same cluster.

The weakness of the dendrogram is that it is not suitable for huge data sets because of its high time and space complexity. Thus, it can be used for small or compressed data sets. The dendrogram has a separate visualization window, and is also shown on the side of the heat map, as in Figure 4.
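The two dendrogram sliders correspond to two ways of cutting the same merge history: by number of clusters or by distance threshold. The following Python sketch builds a single-linkage merge history with a naive search and then replays it up to the chosen cut. It is a simplified illustration, not the framework's implementation, and the names `single_linkage_merges` and `cut` are hypothetical.

```python
import math


def single_linkage_merges(points):
    """Return the merge history [(distance, cluster_a, cluster_b), ...]."""
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, clusters[a], clusters[b]))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


def cut(n_points, merges, n_clusters=None, threshold=None):
    """Replay merges until the cluster count or the distance threshold is reached."""
    clusters = {frozenset([i]) for i in range(n_points)}
    for d, left, right in merges:
        if n_clusters is not None and len(clusters) <= n_clusters:
            break
        if threshold is not None and d > threshold:
            break
        clusters -= {left, right}
        clusters.add(left | right)
    return sorted(sorted(c) for c in clusters)


# Five one-dimensional toy points (made-up values).
points = [(0.0,), (1.0,), (10.0,), (11.0,), (50.0,)]
merges = single_linkage_merges(points)
by_count = cut(len(points), merges, n_clusters=3)
by_distance = cut(len(points), merges, threshold=20.0)
```

Because the merge history is computed once, moving either slider only requires replaying the stored merges, which is what makes interactive cutting cheap compared with re-clustering.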

Network Visualization

Vizster [12] is an open-source interactive visualization tool for online social networks. It is embedded in the framework to add another visualization technique that lets the user view all genes without dimensionality reduction. Genes are the nodes of the network, and edges can be used to represent similarities between genes. However, because of the visual clutter of edges, they are removed when clustering is used to group genes.

The original algorithm used by Vizster selects the positions of the nodes in 2-D space randomly. To be able to view gene clusters, Multi-Dimensional Scaling (MDS) [13] is proposed instead to map the real distances between genes into 2-D distances that preserve the relative proximity between genes. The goal is to place the most similar nodes together according to their real distances, so the MDS algorithm is used to locate each node.

When the visualization is integrated with clustering results, outliers can be shown or removed, as in the case of the DBSCAN algorithm. Another important feature proposed here is the ability to view cluster representatives, where the principle of core patterns is applied [14]. The cores are ranked according to their neighborhood densities and can appear gradually in order of their coreness. This lets users see first the most representative gene expression patterns in the whole data set or in each cluster. A user can interactively select the number of representatives, and when a representative node is clicked, it is expanded into another set of representatives according to the selected number.

One of the important features of the network structure is its ability to visualize and interact with clustering ensemble analysis. Clustering solutions obtained from different algorithms or at different parameter settings are displayed as nodes, and edges can be used to display the similarity between those solutions. In this case, the user can select the most related solutions as input to the ensemble. The user can also view the results of clustering the clustering solutions. An example of the network structure using Vizster is shown in Figure 5.

Figure 5: Network visualization for the normal dataset, clustered using K-Means at k=2.

Stacked Bar Chart

A novel method is used to visualize the results of fuzzy clustering as a stacked bar chart. Each expression pattern is represented as a bar, and its membership in each cluster is represented by coloring a part of the bar whose length is proportional to the membership value. Figures 6.a and 6.b show fuzzy clustering results before and after ordering the patterns.

Figure 6: Stacked Bar Chart: a) Before ordering, b) After ordering
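The stacked bar representation of fuzzy memberships can be sketched as follows. Each bar is split into segments whose lengths equal the memberships. The paper does not say how patterns are ordered, so the ordering rule below (by dominant cluster, strongest membership first) is purely an assumption, and both function names are hypothetical.

```python
def bar_segments(memberships):
    """Turn one pattern's memberships into stacked (cluster, start, end) segments."""
    segments, start = [], 0.0
    for cluster, m in enumerate(memberships):
        segments.append((cluster, start, start + m))
        start += m
    return segments


def order_patterns(membership_matrix):
    """Order pattern indices by dominant cluster, strongest membership first.

    Assumption: this ordering rule is illustrative, not the one used in the paper.
    """
    def key(i):
        row = membership_matrix[i]
        dominant = max(range(len(row)), key=row.__getitem__)
        return (dominant, -row[dominant])
    return sorted(range(len(membership_matrix)), key=key)


# Made-up fuzzy memberships of four patterns in two clusters (rows sum to 1).
u = [[0.25, 0.75], [0.75, 0.25], [0.5, 0.5], [1.0, 0.0]]
order = order_patterns(u)
first_bar = bar_segments(u[0])
```

After ordering, bars dominated by the same cluster sit next to each other, so blocks of one color become visible.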

B. Clustering algorithms

Different clustering algorithms are implemented. They include both classical algorithms, such as partitioning and hierarchical algorithms, and more efficient density-based and distance-relatedness-based algorithms. The implemented algorithms are K-Means, agglomerative hierarchical clustering [11] (single, complete, and average linkage), fuzzy C-Means, DBSCAN [6], and Mitosis [7].

Hierarchical clustering [11] consists of two types of algorithms, agglomerative and divisive. Agglomerative clustering starts with each data object in its own cluster and merges clusters until all objects are in one single cluster, so it is called a bottom-up method. Divisive clustering starts with all the data objects in one cluster and splits clusters until each data object is in its own cluster, so it is called a top-down method.

Density-based algorithms such as DBSCAN [6] use the neighborhood density of data objects to detect clusters of arbitrary shapes and to detect noise. The basic assumption is that the density inside a cluster is higher than the density outside. Moreover, DBSCAN assumes a static density for all clusters.

Unlike the static model used by DBSCAN, the distance-relatedness-based algorithm Mitosis [5], [7] uses a dynamic model. This enables it to detect clusters of arbitrary shapes and densities, rather than just clusters of static density as in DBSCAN. It uses distances to measure the density contexts of patterns, merging patterns of similar density contexts together.
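The density-based idea behind DBSCAN can be sketched compactly: a point with at least `min_pts` neighbors within radius `eps` is a core point, clusters grow outward from core points, and unreachable points are labeled as noise. The version below is a naive quadratic-time illustration using Euclidean distance; the parameter names are conventional but not taken from the paper.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbors(i):
        # The neighborhood includes the point itself.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels


# Two dense toy groups and one isolated point (made-up coordinates).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The single pair (`eps`, `min_pts`) applies to the whole data set, which is the static density assumption that Mitosis is described as relaxing.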

Clustering Ensembles

Clustering ensembles improve clustering results by combining either multiple solutions of the same algorithm or solutions of different algorithms. Many ensemble models are present in the literature [15]. In the proposed framework, four models are implemented: K-Means ensemble, DBSCAN ensemble, General ensemble, and General ensemble using graph.

The K-Means ensemble is based on the idea of evidence-accumulation-based clustering [15], which combines the results of multiple clustering solutions into a single data partition. In this framework, the K-Means ensemble is implemented by combining K-Means results obtained at different initializations.

The DBSCAN ensemble [16] avoids having to select a specific density threshold, which is hard to specify. DBSCAN [6] is executed multiple times at diverse thresholds, and a consensus function is then used to combine the obtained solutions.

The General ensemble merges the results of multiple clustering algorithms into a single clustering; the user can select any number of algorithms to combine in order to get one accurate result.

In the General ensemble using graph [17], the clustering results of each individual clustering algorithm are converted into a distance matrix. These distance matrices are combined, and a weighted graph is constructed from the combined matrix. A graph partitioning approach is then used to cluster the graph and generate the final clusters.

The use of meta clustering [17] in the proposed framework aims at creating an interaction between users, the clustering system, and the data. With meta clustering, users do not need to try many times to reach the optimal solution, because meta clustering suggests possible solutions by grouping similar solutions into clusters. The user can then select one of them to obtain a final clustering solution.
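The evidence accumulation idea can be illustrated with a co-association matrix: for every pair of objects, count the fraction of clustering solutions that put them in the same cluster, then extract a consensus partition from that matrix. In the sketch below the consensus step links pairs whose co-association exceeds a threshold of 0.5, which is an assumption chosen for simplicity; the function names are hypothetical.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of objects shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


def consensus(matrix, threshold=0.5):
    """Connected components of the graph linking pairs above the threshold."""
    n = len(matrix)
    labels = [None] * n
    current = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Four made-up clustering solutions of the same four objects.
# Cluster ids are arbitrary in each run; only co-membership matters.
partitions = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1], [0, 1, 2, 2]]
matrix = co_association(partitions)
final = consensus(matrix)
```

Because only co-membership is counted, solutions with different label names or different numbers of clusters can be combined, which is what allows K-Means runs with different initializations to be merged.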

Cluster Validity

Validation measures are used to evaluate a clustering result and to find the best parameter settings for the data set. In general, validity measures aim at optimizing specific objectives such as compactness and separation; that is, patterns in the same cluster should be more similar to each other than to patterns outside the cluster. Both classical and recent validity measures are implemented in this framework. The classical indices are the Xie-Beni (XB) index and Dunn's index (see [18]). Both prefer more compact and well-separated clusters. The problem with such classical indices is that they are generally designed to evaluate center-based clustering, where clusters are assumed to be globular. There is therefore a need for a validity measure for clusters of arbitrary shapes and densities, such as the one proposed in [19], which is suitable for evaluating results obtained by more efficient algorithms such as DBSCAN and Mitosis. This measure is based on minimizing the standard deviation of the minimum spanning tree (MST) distances of each cluster, as a homogeneity measure, and minimizing the number of neighborhoods that mix patterns from different clusters, as a density separateness measure.
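As an example of a classical compactness and separation measure, Dunn's index in its basic form divides the smallest distance between points of different clusters by the largest distance between points of the same cluster, so larger values indicate more compact, better-separated clusters. The sketch below assumes Euclidean distance and at least one cluster with two or more distinct points; the function name is hypothetical.

```python
import math


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by largest cluster diameter."""
    min_separation = math.inf
    max_diameter = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if labels[i] == labels[j]:
                max_diameter = max(max_diameter, d)
            else:
                min_separation = min(min_separation, d)
    # Assumption: max_diameter > 0, i.e. some cluster has two distinct points.
    return min_separation / max_diameter


# Four toy points forming two tight pairs (made-up coordinates).
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
good = dunn_index(points, [0, 0, 1, 1])
bad = dunn_index(points, [0, 1, 0, 1])
```

Comparing the index across parameter settings is one way to pick a setting, with the caveat stated above that such indices favor globular clusters.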

Outlier Analysis

Outlier analysis is an integral part of the data analysis process. Outliers are considered abnormal patterns in the data, and removing them can improve the clustering solution. In addition, outlier analysis may help in discovering novel genes. The framework uses two approaches to outlier analysis. The first is a density-based approach, LOF [20], which measures outlierness as the ratio of the density of a pattern's neighborhood to the densities of the surrounding patterns' neighborhoods. Thus, it is suitable for data in which different densities exist. The second is a distance-based approach, as in KNN [21] and LDOF [22], which considers distances between patterns for measuring outlierness.

Figure 7: Comparing KNN and LDOF outlier algorithms

An experiment was done to compare the performance of KNN and LDOF while changing the neighborhood size, specified by the parameter k. The precision is calculated as the fraction of the real outliers that are detected. Figure 7 shows that LDOF performs better than KNN as the value of k changes for a sample data set.
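The two distance-based scores and the evaluation measure can be sketched as follows. Two assumptions are made here: the KNN score is taken to be the distance to the k-th nearest neighbor, and the LDOF score is taken to be the average distance to the k nearest neighbors divided by the average pairwise distance among those neighbors (which needs k of at least 2 and distinct neighbors). All function names are hypothetical, and the toy data are not from the paper's experiment.

```python
import math


def knn_neighbors(points, i, k):
    """Indices of the k nearest neighbors of point i, nearest first."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def knn_score(points, i, k):
    """Assumed KNN score: distance from point i to its k-th nearest neighbor."""
    kth = knn_neighbors(points, i, k)[-1]
    return math.dist(points[i], points[kth])


def ldof_score(points, i, k):
    """Assumed LDOF score: mean distance to neighbors over their mean pairwise distance."""
    nbrs = knn_neighbors(points, i, k)
    outer = sum(math.dist(points[i], points[j]) for j in nbrs) / k
    pairs = [(a, b) for a in nbrs for b in nbrs if a < b]
    inner = sum(math.dist(points[a], points[b]) for a, b in pairs) / len(pairs)
    return outer / inner


def top_n(points, score, k, n):
    """Indices of the n points with the highest outlier score."""
    ranked = sorted(range(len(points)), key=lambda i: score(points, i, k), reverse=True)
    return set(ranked[:n])


def detected_fraction(detected, real):
    """Fraction of the real outliers that were detected."""
    return len(detected & real) / len(real)


# Four clustered toy points and one distant point treated as the real outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
real_outliers = {4}
knn_result = detected_fraction(top_n(points, knn_score, k=2, n=1), real_outliers)
ldof_result = detected_fraction(top_n(points, ldof_score, k=2, n=1), real_outliers)
```

Repeating the last two lines for a range of k values gives the kind of curve compared in Figure 7, although on this trivial example both scores find the outlier.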

C. Interactivity

Interactivity is important in enhancing the data analysis process. Interactivity is illustrated in this framework as follows: • Switching between views enables users to switch

between different visualizations for the same clustering solution. This enables the user to analyze the visualized data more appropriately.

• Users can change the parameters of any clustering technique and see its effect immediately on the visualized data. However, this may require some offline processing with algorithms that are computationallyintensive.

• Result saving allows the user to save the clustering algorithm's parameter values, the clustering result, and the validity result. The user can then navigate through the whole set of saved results and select the required one by clicking on its corresponding view. This enables users to see multiple visualizations at the same time, as well as multiple clustering solutions for the same algorithm. The user can switch back to any of the stored results, and switch between different visualizations for the selected clustering result.

• Background processes are proposed in order to evaluate the clustering results while different algorithms are executed with different parameters. These include the validation process. LOF can run in the background, and the user can select any algorithm and any visualization technique.
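Result saving combined with background validation could be organized along the following lines. This is only a sketch under stated assumptions: the class `ResultStore`, its methods, and the caller-supplied `validity_fn` are hypothetical names, and a thread pool stands in for whatever background mechanism the framework actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


class ResultStore:
    """Keeps saved clustering results and validates each one in the background."""

    def __init__(self, validity_fn, max_workers=2):
        self._validity_fn = validity_fn  # maps cluster labels to a validity score
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._results = []

    def save(self, algorithm, params, labels):
        """Store a result and start its validation without blocking the caller."""
        labels = list(labels)
        future = self._pool.submit(self._validity_fn, labels)
        self._results.append((algorithm, dict(params), labels, future))
        return len(self._results) - 1  # index the user can switch back to

    def get(self, index):
        """Return a stored result, waiting for its validity score if needed."""
        algorithm, params, labels, future = self._results[index]
        return algorithm, params, labels, future.result()

    def best(self):
        """Index of the stored result with the highest validity score."""
        return max(range(len(self._results)), key=lambda i: self._results[i][3].result())

    def close(self):
        """Wait for pending validations and release the worker threads."""
        self._pool.shutdown(wait=True)
```

Because `save` returns immediately, the user can keep changing parameters or views while validity scores are computed, and can later return to any stored result by its index.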

Figure 8 shows an example where multiple views are saved for later use by the user.

Figure 8: Multiple Views saved at different clustering results

D. Comparison to other frameworks

Considering other visualization frameworks, XGobi [23] supports visualizations such as Scatter Plots and Parallel Coordinates. XGvis [23] supports networks and graphs, and uses metric and non-metric MDS to locate the objects in any number of dimensions. GGobi [24] supports Scatter Plots, Bar Charts and Parallel Coordinates, where plots are interactive and linked with brushing and identification. XmdvTool [25] supports Scatter Plots, Star Glyphs, Parallel Coordinates and Dimensional Stacking.

In comparison, the proposed framework intends to integrate more visualization techniques with clustering and other data analysis techniques, specifically for gene expression data. Interactivity is illustrated in this framework by switching between views, changing parameters, saving the clustering results, continuously recording validation results, and monitoring multiple visualization views. Interactivity also aids the use of ensembles: users can select related clustering solutions to combine, or change the parameters of different algorithms, in order to obtain a more accurate clustering result. Other differences from existing tools include the use of clustering ensembles, outlier analysis and fuzzy analysis.
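One way a user's selection of clustering solutions could feed an ensemble is through a co-association matrix, the core structure of evidence accumulation [15]. The sketch below assumes that each selected solution is a list of cluster labels over the same patterns. It is an illustration only and is not claimed to be the framework's ensemble method; the function name is hypothetical.

```python
def co_association(partitions):
    """Fraction of the selected partitions in which each pair of patterns co-clusters."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]
```

Entries close to 1 mark pairs of patterns that the selected solutions consistently group together. A final clustering can then be derived from this matrix, for example by hierarchical clustering.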

III. CONCLUSION

An interactive visualization framework is proposed to analyze gene expression data. Several new features are added to enhance the analysis of the data. A group of visualization techniques is integrated with clustering and related data analysis techniques. Users are able to take part in the analysis process through several novel interactivity features. Examples on different expression sets are shown to illustrate the use of the framework. Future work will consider publishing a version of the framework (to be linked from http://www.alexeng.edu.eg/~nyousri/).

REFERENCES

[1] N. S. Holter, M. Mitra, A. Maritan, M. Cieplak, J. R. Banavar, and N. V. Fedoroff, "Fundamental Patterns Underlying Gene Expression Profiles: Simplicity from Complexity", PNAS, vol. 97, no. 15, July 18, 2000.

[2] K. Y. Yeung and W. L. Ruzzo, "An empirical study on Principal Component Analysis for clustering gene expression data", Bioinformatics Journal, vol. 17, 2001.

[3] H. Hauser, F. Ledermann, and H. Doleisch. "Angular Brushing of Extended Parallel Coordinates". Proceedings of the IEEE Symposium on Information Visualization 2002 (InfoVis'02).

[4] D. Cook, H. Hofmann, E-K. Lee, H. Yang, B. Nikolau, E. Wurtele, "Exploring Gene Expression Data Using Plots", Journal of Data Science 5, 2007, pp. 151-182.

[5] N. A. Yousri, M. A. Ismail and M. S. Kamel, "Discovering Connected Patterns in Gene Expression Arrays", IEEE CIBCB, Hawaii, USA, April 2007, pp. 113-120.

[6] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise", Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996.

[7] N. A. Yousri, M. S. Kamel, M. A. Ismail, “A Distance-Relatedness Dynamic Model for Clustering High Dimensional Data of Arbitrary Shapes and Densities”, Pattern Recognition, 2009.

[8] S. Roweis, "EM Algorithms for PCA and SPCA", Neural Information Processing Systems 10 (NIPS'97), 1997, pp. 626-632.

[9] N. Elmqvist, P. Dragicevic, and J-D Fekete, "Rolling the Dice: Multidimensional Visual Exploration using Scatter plot Matrix Navigation", IEEE Transactions on Visualization and Computer Graphics (Proc. InfoVis 2008), 14(6):1141-1148, 2008.

[10] J. Chen, A. M. MacEachren, and D. J. Peuquet, "Constructing Overview + Detail Dendrogram-Matrix Views", IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 6, November/December 2009.

[11] C. Ding and X. He, "Cluster merging and splitting in hierarchical clustering algorithms", IEEE International Conference on Data Mining (ICDM’02) 2002.

[12] J. Heer, D. boyd, "Vizster: Visualizing Online Social Networks", InfoViz, 2008.

[13] M. Steyvers, "Multidimensional Scaling", Encyclopedia of Cognitive Science, 2002.

[14] N. A. Yousri, M. S. Kamel, M. A. Ismail, “Pattern Cores and Connectedness in Cancer Gene Expression”, in BIBE 2007.

[15] A. L. N. Fred and A. K. Jain, "Data Clustering Using Evidence Accumulation", Proceedings of the 16th International Conference on Pattern Recognition (ICPR'02), vol. 4, 2002.

[16] L. Xia, J. Jing. "An Ensemble Density-based Clustering Method", International Conference on Intelligent Systems and Knowledge Engineering (ISKE 2007).

[17] R. Caruana, M. Elhawary, N. Nguyen, C. Smith, "MetaClustering", ICDM 2006.

[18] M. Halkidi, Y. Batistakis, M. Vazirgiannis, "On clustering validation techniques", Journal of Intelligent Information Systems, 17:2/3, 107–145, 2001.

[19] N. A. Yousri, M. S. Kamel, M. A. Ismail, "A Novel Validity Measure for Clusters of Arbitrary Shapes and Densities", ICPR 2008.

[20] M. M. Breunig, H-P Kriegel, R. T. Ng, J. Sander, "LOF: Identifying Density-Based Local Outliers", Proc. ACM SIGMOD Int. Conf. on Management of Data, Dallas, TX, 2000.

[21] M. I. Petrovskiy, "Outlier Detection Algorithms in Data Mining", Programming and Computer Software 29(4): 228-237, 2003.

[22] K. Zhang, M. Hutter and H. Jin, "A New Local Distance-Based Outlier Detection Approach for Scattered Real-World Data", Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, LNAI, 2009.

[23] http://lib.stat.cmu.edu/general/XGobi/

[24] http://www.ggobi.org/

[25] http://davis.wpi.edu/xmdv/index.html
