DICON: Visual Analysis On Multidimensional Clusters
Nan Cao, David Gotz, Jimeng Sun, Huamin Qu
Topic: Cluster Analysis Link: http://en.wikipedia.org/wiki/Cluster_analysis
Applications:
• Biology
• Medicine
• Market research
• Education research
• Other applications
Cluster Analysis
(Figure: a dataset and its cluster analysis results for K = 3 and K = 5)
Cluster Analysis
Ground Truth: The data contains 6 clusters
• Problems of cluster analysis
– The clustering result does not always precisely reveal the ground truth of the data
– Cluster analysis depends highly on the experience of the analyst. It is very unlikely to find the ground truth within a single iteration
– For a multidimensional dataset, it is difficult to explain the meaning of the clusters
How can information visualization aid cluster analysis?
Challenges
• How can we interpret the multidimensional cluster results?
• How can we make comparisons among multidimensional clusters?
• How can we refine the clustering results and detect multidimensional patterns?
Solution
• Goal:
– Design a novel visualization for multidimensional cluster analysis that facilitates cluster interpretation, quality evaluation, comparison, and manipulation
• Approach:
– A multidimensional cluster icon design that encodes multiple data attributes as well as derived statistical information for cluster interpretation
– A stabilized icon layout algorithm that generates similar icons for similar clusters for cluster comparison
– New visual cues that evaluate cluster quality and highlight information patterns, together with intuitive user interactions driven by these cues to support cluster refinement via direct manipulation of icons
How can we interpret the multidimensional cluster results in detail?
Encoding the single entity
Packing entities into clusters
Global layout
Using an iconic design to visualize multidimensional clusters at multiple granularities
Visual Encoding
Encoding data items in detail
Packing Entities into clusters
(Figure: an entity with feature values 0.3, 0.2, 0.1, 0.1, 0.2, 0.1; feature legend: cancer, diabetes, kidney disorder, heart disease, fever, high blood pressure)
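The entity encoding can be illustrated with a small sketch. It assumes, as a simplification not stated on the slide, that an entity is drawn as a cell of fixed area and that each feature receives a share of that area proportional to its value. The function name `feature_areas` and the pairing of the six values with the six feature names in listed order are illustrative assumptions.

```python
def feature_areas(values, cell_area=1.0):
    """Split one entity cell among its features in proportion to their values.

    Assumption: area share is proportional to feature value; this is a
    hypothetical helper, not the authors' encoding code.
    """
    total = sum(values.values())
    if total <= 0:
        raise ValueError("entity has no positive feature value")
    return {name: cell_area * v / total for name, v in values.items()}


# Hypothetical patient entity; the value-to-feature pairing is assumed.
patient = {
    "cancer": 0.3,
    "diabetes": 0.2,
    "kidney disorder": 0.1,
    "heart disease": 0.1,
    "fever": 0.2,
    "high blood pressure": 0.1,
}

areas = feature_areas(patient, cell_area=100.0)
for name, area in areas.items():
    print(f"{name}: {area:.1f}")
```

Normalizing by the total keeps the sketch usable for feature vectors that do not already sum to 1.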
Global Layout
E.g. the patient dataset
Design Guideline 1: Intuitively share the same visual encodings at the feature level, the entity level, and the cluster level
DEMO
How can we make comparisons among multidimensional clusters?
Encoding data items in detail
Packing Entities into clusters
Global Layout
(Figure: an entity with feature values 0.3, 0.2, 0.1, 0.1, 0.2, 0.1; feature legend: cancer, diabetes, kidney disorder, heart disease, HIV, high blood pressure)
Design Guideline 2: Similar clusters should be represented by similar icons
– Overview: Similar clusters have similar data distributions
– Details: Similar clusters must be laid out in a similar way
Statistical Embedding (overview)
Stabilized Icon Layout (detail)
Statistical Embedding (1)
• Kurtosis
• Skewness
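As a worked illustration of the two statistics, here is a minimal sketch that computes skewness and excess kurtosis from population moments. The slide does not show how these values are mapped onto the icon, so the sketch covers only the statistics themselves; the function name `skewness_kurtosis` and the choice of population (biased) estimators are assumptions.

```python
def skewness_kurtosis(xs):
    """Return (skewness, excess kurtosis) using population moments.

    Assumption: biased estimators m3 / m2^1.5 and m4 / m2^2 - 3.
    """
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    if m2 == 0:
        raise ValueError("all values are identical; moments are undefined")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 10]

print(skewness_kurtosis(symmetric))
print(skewness_kurtosis(right_tailed))
```

A symmetric distribution has skewness 0, and a long right tail gives a positive skewness, which is why the pair of values can summarize the shape of a cluster's distribution in overview.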
Stabilized Layout
1. Initial Spiral Layout
2. Weighted Centroid Voronoi Tessellation
3. Random Layout for features
4. Optimization
(Equation: min over X of a sum of three squared-norm terms, labeled Centroid, Similarity, and Smoothness)
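Steps 1 and 2 of the stabilized layout can be sketched as follows. This is a simplified stand-in, not the authors' algorithm: the spiral is assumed to be a golden-angle spiral (the slide only says "spiral"), and the tessellation step is approximated by plain, unweighted Lloyd relaxation over a grid of sample points in a unit disk. Steps 3 and 4 are not covered. All function names are hypothetical.

```python
import math


def spiral_layout(n, radius=1.0):
    """Place n sites on a golden-angle spiral inside a disk (assumed spiral type)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    sites = []
    for k in range(n):
        r = radius * math.sqrt((k + 0.5) / n)
        theta = k * golden_angle
        sites.append((r * math.cos(theta), r * math.sin(theta)))
    return sites


def disk_samples(radius=1.0, steps=60):
    """Regular grid of sample points, clipped to the disk."""
    samples = []
    cell = 2.0 * radius / steps
    for i in range(steps):
        for j in range(steps):
            x = -radius + (i + 0.5) * cell
            y = -radius + (j + 0.5) * cell
            if x * x + y * y <= radius * radius:
                samples.append((x, y))
    return samples


def lloyd_step(sites, samples):
    """Move each site to the centroid of the samples that are nearest to it."""
    sums = [[0.0, 0.0, 0] for _ in sites]
    for x, y in samples:
        nearest = min(
            range(len(sites)),
            key=lambda s: (sites[s][0] - x) ** 2 + (sites[s][1] - y) ** 2,
        )
        sums[nearest][0] += x
        sums[nearest][1] += y
        sums[nearest][2] += 1
    # A site with no samples keeps its previous position.
    return [
        (sx / count, sy / count) if count else site
        for site, (sx, sy, count) in zip(sites, sums)
    ]


sites = spiral_layout(12)
samples = disk_samples()
for _ in range(10):
    sites = lloyd_step(sites, samples)
print(sites)
```

Starting every cluster from the same deterministic spiral, rather than from random positions, is one way to keep the resulting layouts comparable across clusters.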
Global Layout
Design Guideline 3: Fit multiple scales and be embeddable into various other visualizations
– Both color and shape are highly scalable and remain distinguishable even in a very small area
How can we refine the clustering results and detect interesting patterns within the multidimensional clusters?
Interactive visual analysis driven by visual cues
Cluster Quality Cue
Cluster Quality: Defined by the signed variances of the entities it contains
– $f$: the feature vector of a single entity
– $\mu$: the mean feature vector of the cluster C that contains $f$
– $\sigma$: the variance between $f$ and $\mu$
Signed variance: $\mathrm{sign}(f) \cdot \sigma$, where $\mathrm{sign}(f) \in \{1, -1\}$
High-quality clusters have a homogeneous representation
Low-quality clusters have a heterogeneous representation
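The quality cue can be sketched in code under two loudly stated assumptions, since the slide does not preserve the exact definitions: the variance between an entity and the cluster mean is taken to be the squared Euclidean distance, and the sign is taken to be +1 when the entity's vector norm is at least the mean's norm and -1 otherwise. Both rules and the function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of the entity feature vectors in one cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def signed_variances(vectors):
    """Signed variance of every entity in a cluster (assumed definitions)."""
    mu = mean_vector(vectors)
    mu_norm = math.sqrt(sum(m * m for m in mu))
    result = []
    for f in vectors:
        # Assumption: variance = squared Euclidean distance to the mean.
        sigma = sum((a - b) ** 2 for a, b in zip(f, mu))
        # Assumption: sign compares the norm of f with the norm of the mean.
        f_norm = math.sqrt(sum(a * a for a in f))
        sign = 1 if f_norm >= mu_norm else -1
        result.append(sign * sigma)
    return result


homogeneous = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
heterogeneous = [[0.9, 0.1], [0.1, 0.1], [0.2, 0.8]]

print(signed_variances(homogeneous))
print(signed_variances(heterogeneous))
```

Under these assumptions a homogeneous cluster yields values near zero, while a heterogeneous cluster yields values spread on both sides of zero.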
Feature Co-occurrence and Dominant Cue
Feature vector: f1 f2 f3 f4 f5 f6 (f2 and f5 highlighted)
If fi > 0, we say that feature i has occurred
If fi > 0 and fj > 0 in the same feature vector, we say that features i and j co-occur
Co-occurrence Score: $C_i = \sum_j p(f_j > 0 \mid f_i > 0)^2$
Co-occurrence Cue: Highlights the features that most often co-occur with others
Dominant Cue: Highlights the features that do not co-occur with any other feature
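A minimal sketch of the co-occurrence score follows. It assumes the score for feature i sums the squared conditional probabilities p(fj > 0 | fi > 0) over the other features j, with each probability estimated by counting entities in the cluster. Excluding j = i and returning 0 for a feature that never occurs are assumptions, and the function name is hypothetical.

```python
def cooccurrence_scores(vectors):
    """Estimate one co-occurrence score per feature by counting entities.

    Assumption: C_i = sum over j != i of p(f_j > 0 | f_i > 0)^2.
    """
    dims = len(vectors[0])
    scores = []
    for i in range(dims):
        with_i = [v for v in vectors if v[i] > 0]
        if not with_i:
            scores.append(0.0)  # feature i never occurs in this cluster
            continue
        total = 0.0
        for j in range(dims):
            if j == i:
                continue
            p = sum(1 for v in with_i if v[j] > 0) / len(with_i)
            total += p * p
        scores.append(total)
    return scores


# Toy cluster: features 0 and 1 always appear together, feature 2 appears alone.
cluster = [
    [0.5, 0.5, 0.0],
    [0.4, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
scores = cooccurrence_scores(cluster)
print(scores)
```

In this toy cluster the first two features would be candidates for the co-occurrence cue, and the third, which occurs but never alongside another feature, would be a candidate for the dominant cue.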
Interactions and Animated Transition
• Interactions
– Attribute group
– Split: binary split / outlier split
– Merge: drag merge and select merge
• Animation Path Bundling
– Aggregate the animation paths with similar trends
– Inspired by hierarchical edge bundling
• Demo
Evaluation
Comparing with other techniques
Advantages:
• The clusters are easy to identify
• Immediately conveys the size of each cluster
• Fast comparison
• Highly compressed, can be embedded into other visualizations
• Based on intuitive designs
Disadvantages:
• Multidimensional data only
• No precise values can be directly observed
• Entities are split into multiple parts
Case Study (1): Study on Patient Similarity
1. Find a group of patients that are similar to a target patient. The similarity is automatically computed based on five features
2. An initial clustering result is given
3. Users are required to refine the clusters and interpret why the patients in each cluster are similar
Case Study (2)
By highlighting all the co-occurring features, we find different disease distribution patterns
User Study
• T1: Compare the feature details of 9 clusters
• T2: Compare a large set of clusters (50 clusters)
• 3 (groups) × 10 (users) × 2 (tasks)
Icons laid out randomly
Icons laid out by our algorithm
With statistical embedding
User Study Results
• Findings:
– The cluster icon design is extremely efficient for cluster comparison (12 s on average to compare 50 clusters)
– The proposed design principles greatly help comparison
Related Work
• Pixel Based Technique
• Iconic Techniques
• Parallel Coordinates
• Scatter Plots
Prior Art: Icon-based techniques
• Chernoff face visualization
• Stick figure technique
– Two dimensions are mapped to the display dimensions and the remaining dimensions are mapped to the angles and/or limb lengths of the stick figure icon
– The number of dimensions that can be visualized is limited
• Shape encoding
• Color icons
Prior Art: Pixel-Oriented Techniques
• Query Independent
– Space-Filling Curve Arrangements
– Recursive Pattern Technique
• Query Dependent
– Spiral Technique
– Axes Technique
– Circle Segments
Prior Art: Table-based techniques
• Table Lens
• Tableau
• Heat Map
Prior Art: Others (Hybrid Techniques)
• NodeTrix: a Hybrid Visualization of Social Networks. Nathalie Henry, Jean-Daniel Fekete, Michael J. McGuffin, InfoVis 2007
• Scattering Points in Parallel Coordinates. Xiaoru Yuan, Peihong Guo, He Xiao, Hong Zhou, Huamin Qu, InfoVis 2009
• Bubble Sets: Revealing Set Relations with Isocontours over Existing Visualizations. Christopher Collins, Gerald Penn, Sheelagh Carpendale, InfoVis 2009
• Rolling the Dice: Multidimensional Visual Exploration using Scatterplot Matrix Navigation. Niklas Elmqvist, Pierre Dragicevic, Jean-Daniel Fekete, InfoVis 2008
• Interactive Dimensionality Reduction Through User-defined Combinations of Quality Metrics. Sara Johansson, Jimmy Johansson, InfoVis 2009
• FacetAtlas: Multifaceted Visualization for Rich Text Corpora, InfoVis 2010