Computational Intelligence: Methods and Applications

Computational Intelligence: Methods and Applications. Lecture 4 CI: simple visualization. Source: Włodzisław Duch; Dept. of Informatics, UMK; Google: W Duch


TRANSCRIPT

Page 1: Computational Intelligence:  Methods and Applications

Computational Intelligence: Methods and Applications

Lecture 4 CI: simple visualization.

Source: Włodzisław Duch; Dept. of Informatics, UMK; Google: W Duch

Page 2: Computational Intelligence:  Methods and Applications

2D projections: scatterplots. Simplest projections: use scatterplots, select only 2 features. Example: sugar vs. tooth decay.

If d = 3 then d(d-1)/2 = 3 feature pairs in 2D are formed, sometimes displayed in one figure.

Each 2D point is an orthogonal projection along the remaining d-2 dimensions.

What to look for:

correlations between variables,

clustering of different objects.

Problem: for discrete values data points overlap.

Extreme case: binary data in many dimensions, all structure is hidden, each scatterogram shows 4 points.
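A minimal sketch of such pairwise scatterplots (assuming Python with numpy, pandas and matplotlib; the data here is synthetic):

```python
# Sketch: all d(d-1)/2 pairwise scatterplots of a small dataset in one figure.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])

# d = 3 features give 3 distinct feature pairs; the diagonal shows histograms.
pd.plotting.scatter_matrix(df, diagonal="hist", figsize=(6, 6))
plt.show()
```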

Page 3: Computational Intelligence:  Methods and Applications

Sugar example

What conclusion can we draw?

Can there be alternative explanations?

Page 4: Computational Intelligence:  Methods and Applications

Brain-body index example

What conclusion can we draw?

Are whales and elephants smarter than man?

Are correlations sufficient to establish causes?

Page 5: Computational Intelligence:  Methods and Applications

4 Gaussians in 8D, X1 vs. X2

Scatterograms of 8D data in F1/F2 dimensions.

4 Gaussian distributions, each in 4D, have been generated: the red one centered at (0,0,0,0), the green at (1, 1/2, 1/3, 1/4), the yellow at 2·(1, 1/2, 1/3, 1/4), and the blue at 3·(1, 1/2, 1/3, 1/4).

Demonstration of various projections using Ghostminer software.
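A minimal sketch of how such data could be generated and projected (numpy/matplotlib; the lecture demo itself used Ghostminer, whose exact settings are not given here):

```python
# Sketch: 4 Gaussian clusters in the first 4 dimensions, plotted as X1 vs. X2.
# Unit variance per dimension is an assumption; the original demo may differ.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
base = np.array([1.0, 1/2, 1/3, 1/4])
centers = [0 * base, 1 * base, 2 * base, 3 * base]   # red, green, yellow, blue
colors = ["red", "green", "gold", "blue"]

for center, color in zip(centers, colors):
    cluster = rng.normal(loc=center, scale=1.0, size=(200, 4))
    plt.scatter(cluster[:, 0], cluster[:, 1], s=8, c=color)  # project on (X1, X2)

plt.xlabel("X1"); plt.ylabel("X2")
plt.show()
```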

Page 6: Computational Intelligence:  Methods and Applications

4 Gaussians in 8D, X1 vs. X5

What happened here?

All Xi vs. Xi+4 scatterplots look like this.

How were the remaining 4 features generated?

Page 7: Computational Intelligence:  Methods and Applications

Cars example. Scatterograms for all feature pairs, data on cars with 3, 4, 5, 6 or 8 cylinders.

Too detailed? We are interested in trends that can be seen in probability density functions.

Cluster all points that are close for cars with N cylinders. This may be done by adding Gaussian noise with a growing variance to each point.

See this in the movie: Movie for cars.
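A hedged sketch of the noise-smoothing idea above (synthetic stand-in data, not the original car features): each point is replicated and jittered with Gaussian noise of growing variance, so tight clusters blur into smooth density blobs.

```python
# Sketch: blur tight clusters by jittering points with Gaussian noise of growing variance.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Two tight clusters standing in for cars with different numbers of cylinders.
data = np.vstack([rng.normal([0.0, 0.0], 0.05, size=(100, 2)),
                  rng.normal([2.0, 3.0], 0.05, size=(100, 2))])

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, sigma in zip(axes, [0.0, 0.2, 0.6]):
    # Replicate each point 20 times and add noise with standard deviation sigma.
    jittered = np.repeat(data, 20, axis=0) + rng.normal(0.0, sigma, size=(len(data) * 20, 2))
    ax.scatter(jittered[:, 0], jittered[:, 1], s=2, alpha=0.2)
    ax.set_title(f"sigma = {sigma}")
plt.show()
```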

Page 8: Computational Intelligence:  Methods and Applications

Direct representation: GT

How to deal with more than 3D? We cannot see more dimensions.

Grand Tour: move between different 2D projections; implemented in XGobi, XLispStat, ExplorN software packages.

Ex: 7D data viewed as scatterplot in Grand Tour

More examples: http://www.public.iastate.edu/~dicook/JSS/paper/paper.html

Try to view a 9D cube – most of the time it looks like a Gaussian cloud.

It may take time to “calibrate our eyes” to imagine high-D structure.

Page 9: Computational Intelligence:  Methods and Applications

Direct representation: star. Star plots, radar plots:

represent the value of each component in a “spider net”.

Useful to display a single vector or a few vectors per plot; uses many plots.

Too many individual plots? Cluster similar ones, as in the car example.

[Star plot example with axes x1, x2, x3, x4, x5.]

Page 10: Computational Intelligence:  Methods and Applications

Direct representation: star. Working-woman population changes in different states, projections and reality.

Page 11: Computational Intelligence:  Methods and Applications

Direct representations: ||. Parallel coordinates: instead of perpendicular axes use parallel!

Many engineering applications, popular in bioinformatics.

Two clusters in 3D

See more examples at: http://www.nbb.cornell.edu/neurobio/land/PROJECTS/Inselberg/

Instead of creating perpendicular axes, put each coordinate on the horizontal x axis and its value on the vertical y axis. A point in N dimensions becomes a polyline with N vertices, one per axis (see the sketch below).
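A minimal sketch of a parallel-coordinates plot of two clusters in 3D (assuming pandas/matplotlib; synthetic data):

```python
# Sketch: parallel-coordinates view of two clusters in 3D.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
a = rng.normal(loc=[0, 0, 0], scale=0.3, size=(50, 3))
b = rng.normal(loc=[1, 2, 1], scale=0.3, size=(50, 3))

df = pd.DataFrame(np.vstack([a, b]), columns=["x1", "x2", "x3"])
df["cluster"] = ["A"] * 50 + ["B"] * 50

# Each row (a point in 3D) becomes a polyline over the three parallel axes.
pd.plotting.parallel_coordinates(df, "cluster", color=["red", "blue"])
plt.show()
```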

Page 12: Computational Intelligence:  Methods and Applications

|| lines. Lines in parallel representation: 2D line, 3D line, 4D line.

Page 13: Computational Intelligence:  Methods and Applications

|| cubes. Hypercubes in parallel representation: 2D (square), 3D cube: 8 vertices, 8D: 256 vertices.

Page 14: Computational Intelligence:  Methods and Applications

|| spheres. Hyperspheres in parallel representation: 2D (circle), 3D (sphere), ... 8D: ???

Try some other geometrical figures and see what patterns are created.

Page 15: Computational Intelligence:  Methods and Applications

|| coordinates. Representation of a 10-dimensional line (x1, ..., x10)^T, car information data.

Parallax software: http://www.kdnuggets.com/software/parallax/

IBM Visualization Data Explorer http://www.research.ibm.com/dx/

has a Parallel Coordinates module:

http://www.cs.wpi.edu/Research/DataExplorer/contrib/parcoord/

Financial analysis example.

Page 16: Computational Intelligence:  Methods and Applications

More tools

Statgraphics charting tools

Modeling and Decision Support Tools collected at the University of Cambridge (UK) are at:

http://www.ifm.eng.cam.ac.uk/dstools/

Book: T. Soukup, I. Davidson, Visual Data Mining: Techniques and Tools for Data Visualization and Mining. Wiley 2002.

More tools:

http://www.is.umk.pl/~duch/CI.html#vis

Page 17: Computational Intelligence:  Methods and Applications

Computational Intelligence: Methods and Applications

Lecture 5 EDA and linear transformations.

Source: Włodzisław Duch; Dept. of Informatics, UMK; Google: W Duch

Page 18: Computational Intelligence:  Methods and Applications

Chernoff faces. Humans have specialized brain areas for face recognition.

For d < 20 represent each feature by changing some face elements.

Interesting applets:

http://www.cs.uchicago.edu/~wiseman/chernoff/

http://www.cs.unm.edu/~dlchao/flake/chernoff/ (Chernoff park)

http://kspark.kaist.ac.kr/Human%20Engineering.files/Chernoff/Chernoff%20Faces.htm

Page 19: Computational Intelligence:  Methods and Applications

Fish view. Other shapes may also be used to visualize data, for example fish.

Page 20: Computational Intelligence:  Methods and Applications

Ring visualization (SunBurst). Shows a tree-like hierarchical representation in the form of rings.

Page 21: Computational Intelligence:  Methods and Applications

Other EDA techniques

The NIST Engineering Statistics Handbook has a chapter on exploratory data analysis (EDA).

http://www.itl.nist.gov/div898/handbook/index.htm

Unfortunately, many visualization programs are written for X-Windows only, or in Fortran, S, or R.

Sonification: data converted to sounds!

Example: sound of EEG data.

Java Voice

Think about potential applications! More: http://sonification.de/

http://en.wikipedia.org/wiki/Sonification

Page 22: Computational Intelligence:  Methods and Applications

CI approach to visualization

Scatterograms: project all data on two features.

Find more interesting directions to create projections.

Linear projections:

• Principal Component Analysis,

• Discriminant Component Analysis,

• Projection Pursuit – define “interesting” projections.

Non-linear methods – more advanced, some will appear later.

Statistical methods: multidimensional scaling.

Neural methods: competitive learning, Self-Organizing Maps.

Kernel methods, principal curves and surfaces.

Information-theoretic methods.

Page 23: Computational Intelligence:  Methods and Applications

Distances in feature spaces

Data vectors in d dimensions: $\mathbf{X}^T = (X_1, \ldots, X_d)$, $\mathbf{Y}^T = (Y_1, \ldots, Y_d)$.

Distance, or metric function, is a 2-argument function that satisfies:

$$d(\mathbf{X},\mathbf{Y}) \ge 0,\ d(\mathbf{X},\mathbf{X}) = 0; \qquad d(\mathbf{X},\mathbf{Y}) = d(\mathbf{Y},\mathbf{X}); \qquad d(\mathbf{X},\mathbf{Y}) \le d(\mathbf{X},\mathbf{Z}) + d(\mathbf{Z},\mathbf{Y})$$

Distance functions measure (dis)similarity.

Popular distance functions:

Euclidean distance (L2 norm):

$$d_2(\mathbf{X},\mathbf{Y}) = \left(\sum_{i=1}^{d}\left(X_i - Y_i\right)^2\right)^{1/2}$$

Manhattan (city-block) distance (L1 norm):

$$d_1(\mathbf{X},\mathbf{Y}) = \sum_{i=1}^{d}\left|X_i - Y_i\right|$$
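A minimal sketch of the two metrics in code (plain numpy; equivalent functions exist in scipy.spatial.distance):

```python
# Sketch: Euclidean (L2) and Manhattan (L1) distances between two d-dimensional vectors.
import numpy as np

def euclidean(x: np.ndarray, y: np.ndarray) -> float:
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan(x: np.ndarray, y: np.ndarray) -> float:
    return float(np.sum(np.abs(x - y)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])
print(euclidean(x, y))  # sqrt(1 + 4 + 0) ~ 2.236
print(manhattan(x, y))  # 1 + 2 + 0 = 3
```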

Page 24: Computational Intelligence:  Methods and Applications

Two metric functions

Equidistant points in 2D:

Euclidean case: circle or sphere Manhattan case: square

Identical distance between two points X, Y: imagine that in 10D!

[Figure: sets of points equidistant from X and Y in the (X1, X2) plane for the Euclidean and Manhattan metrics.]

All points in the shaded area have the same Manhattan distance to X and Y!

The Euclidean metric is isotropic; the Manhattan metric is non-isotropic.

Page 25: Computational Intelligence:  Methods and Applications

Linear transformations

2D vectors X in a unit circle with mean (1,1); Y = AX, where A is a 2×2 matrix.

The shape and the mean of the data distribution are changed.

Scaling (diagonal a_ii elements), rotation (off-diagonal elements), mirror reflection.

Distances between vectors are not invariant: ||Y1-Y2||≠||X1-X2||

$$\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$

Page 26: Computational Intelligence:  Methods and Applications

Invariant distances

Euclidean distance is not invariant to linear transformations Y = AX; scaling of units has a strong influence on distances.

How to select scaling/rotations for simplest description of data?

Orthonormal matrices, $\mathbf{A}^T\mathbf{A} = \mathbf{I}$, induce rigid rotations.

Achieving full invariance therefore requires standardization of the data (scaling invariance) and the use of the covariance matrix.

The Mahalanobis metric replaces $\mathbf{A}^T\mathbf{A}$ with the inverse of the covariance matrix.

$$\|\mathbf{Y}_1 - \mathbf{Y}_2\|^2 = (\mathbf{Y}_1 - \mathbf{Y}_2)^T(\mathbf{Y}_1 - \mathbf{Y}_2) = (\mathbf{X}_1 - \mathbf{X}_2)^T \mathbf{A}^T\mathbf{A}\,(\mathbf{X}_1 - \mathbf{X}_2)$$

Page 27: Computational Intelligence:  Methods and Applications

Data standardization

For each vector component of $\mathbf{X}^{(j)T} = (X_1^{(j)}, \ldots, X_d^{(j)})$, $j = 1 \ldots n$, calculate mean and std; n – number of vectors, d – their dimension.

$$\bar{X}_i = \frac{1}{n}\sum_{j=1}^{n} X_i^{(j)}; \qquad \bar{\mathbf{X}} = \frac{1}{n}\sum_{j=1}^{n} \mathbf{X}^{(j)}$$

Vector of mean feature values; averages over rows of the data matrix:

$$\mathbf{X} = \begin{pmatrix} X_1^{(1)} & X_1^{(2)} & \cdots & X_1^{(n)} \\ X_2^{(1)} & X_2^{(2)} & \cdots & X_2^{(n)} \\ \vdots & \vdots & \ddots & \vdots \\ X_d^{(1)} & X_d^{(2)} & \cdots & X_d^{(n)} \end{pmatrix}$$

Page 28: Computational Intelligence:  Methods and Applications

Standard deviation

Calculate standard deviation:

Why n-1, not n? If the true mean were known the denominator would be n, but when the mean is estimated from the data, the formula with n-1 gives an unbiased estimate of the true variance.

Transform X => Z, standardized data vectors:

$$\bar{X}_i = \frac{1}{n}\sum_{j=1}^{n} X_i^{(j)}; \qquad \sigma_i^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(X_i^{(j)} - \bar{X}_i\right)^2$$

Vector of mean feature values; variance = square of the standard deviation (std), the sum of squared deviations from the mean value.

$$Z_i^{(j)} = \frac{X_i^{(j)} - \bar{X}_i}{\sigma_i}$$
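A minimal sketch of standardization in code (numpy, with rows as vectors rather than the column convention of the slides; uses the n-1 denominator):

```python
# Sketch: column-wise standardization (zero mean, unit variance) of an n x d data matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, -2.0, 0.5], scale=[10.0, 0.1, 1.0], size=(200, 3))

mean = X.mean(axis=0)          # vector of mean feature values
std = X.std(axis=0, ddof=1)    # per-feature standard deviation with n-1 denominator
Z = (X - mean) / std           # standardized data

print(Z.mean(axis=0))          # ~0 for each feature
print(Z.std(axis=0, ddof=1))   # ~1 for each feature
```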

Page 29: Computational Intelligence:  Methods and Applications

Standardized data

Std data: zero mean and unit variance.

Standardize data after making data transformation.

Effect: the data become invariant to scaling only; for diagonal transformations, distances after standardization are invariant and are based on identical units.

Note: this does not mean that all data models will perform better!

How to make data invariant to any linear transformations?

$$\bar{Z}_i = \frac{1}{n}\sum_{j=1}^{n} Z_i^{(j)} = \frac{1}{n\,\sigma_i}\sum_{j=1}^{n}\left(X_i^{(j)} - \bar{X}_i\right) = 0$$

$$\sigma_{Z_i}^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(Z_i^{(j)}\right)^2 = \frac{1}{(n-1)\,\sigma_i^2}\sum_{j=1}^{n}\left(X_i^{(j)} - \bar{X}_i\right)^2 = 1$$

Page 30: Computational Intelligence:  Methods and Applications

Std example

[Figure: feature values before and after standardization.]

Mean and std are shown using a colored bar; minimum and max values may extend outside.

Some features (e.g. yellow) have large values; some (e.g. gray) have small values; this may depend on the units used to measure them.

Standardized data all have mean 0 and σ = 1, thus the contribution from different features to similarity or distance calculations is comparable.

Page 31: Computational Intelligence:  Methods and Applications

Computational Intelligence: Methods and Applications

Lecture 6 Principal Component Analysis.

Source: Włodzisław Duch; Dept. of Informatics, UMK; Google: W Duch

Page 32: Computational Intelligence:  Methods and Applications

Linear transformations – example. 2D vectors X uniformly distributed in a unit circle with mean (1,1);

Y = AX, A = 2x2 matrix

The shape is elongated, rotated and the mean is shifted.

$$\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$
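A minimal sketch reproducing this example (numpy/matplotlib; points drawn uniformly from the unit circle around (1,1), then transformed by A):

```python
# Sketch: apply the 2x2 transformation A = [[2, 1], [1, 1]] to points uniformly
# distributed in a unit circle centered at (1, 1) and compare the two clouds.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
r = np.sqrt(rng.uniform(size=500))           # uniform density inside the circle
phi = rng.uniform(0, 2 * np.pi, size=500)
X = np.stack([1 + r * np.cos(phi), 1 + r * np.sin(phi)])   # shape (2, n), mean ~(1, 1)

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
Y = A @ X                                     # elongated, rotated, mean shifted to ~(3, 2)

plt.scatter(X[0], X[1], s=5, label="X")
plt.scatter(Y[0], Y[1], s=5, label="Y = AX")
plt.legend(); plt.axis("equal"); plt.show()
```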

Page 33: Computational Intelligence:  Methods and Applications

Invariant distances

Euclidean distance is not invariant to general linear transformations

This is invariant only for orthonormal matrices, $\mathbf{A}^T\mathbf{A} = \mathbf{I}$, that make rigid rotations without stretching or shrinking distances.

Idea: standardize the data in some way to create invariant distances.

$$\mathbf{Y} = \mathbf{A}\mathbf{X}$$

$$\|\mathbf{Y}_1 - \mathbf{Y}_2\|^2 = (\mathbf{Y}_1 - \mathbf{Y}_2)^T(\mathbf{Y}_1 - \mathbf{Y}_2) = (\mathbf{X}_1 - \mathbf{X}_2)^T \mathbf{A}^T\mathbf{A}\,(\mathbf{X}_1 - \mathbf{X}_2)$$

Page 34: Computational Intelligence:  Methods and Applications

Data standardization

For each vector component of $\mathbf{X}^{(j)T} = (X_1^{(j)}, \ldots, X_d^{(j)})$, $j = 1 \ldots n$, calculate mean and std; n – number of vectors, d – their dimension.

$$\bar{X}_i = \frac{1}{n}\sum_{j=1}^{n} X_i^{(j)}; \qquad \bar{\mathbf{X}} = \frac{1}{n}\sum_{j=1}^{n} \mathbf{X}^{(j)}$$

Vector of mean feature values; averages over rows of the data matrix:

$$\mathbf{X} = \begin{pmatrix} X_1^{(1)} & X_1^{(2)} & \cdots & X_1^{(n)} \\ X_2^{(1)} & X_2^{(2)} & \cdots & X_2^{(n)} \\ \vdots & \vdots & \ddots & \vdots \\ X_d^{(1)} & X_d^{(2)} & \cdots & X_d^{(n)} \end{pmatrix}$$

Page 35: Computational Intelligence:  Methods and Applications

Standard deviation

Calculate standard deviation:

Transform X => Z, standardized data vectors

$$\bar{X}_i = \frac{1}{n}\sum_{j=1}^{n} X_i^{(j)}; \qquad \sigma_i^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(X_i^{(j)} - \bar{X}_i\right)^2$$

Vector of mean feature values; variance = square of the standard deviation (std), the sum of squared deviations from the mean value.

$$Z_i^{(j)} = \frac{X_i^{(j)} - \bar{X}_i}{\sigma_i}$$

Page 36: Computational Intelligence:  Methods and Applications

Std data

Std data: zero mean and unit variance.

Standardize data after making data transformation.

Effect: data is invariant to scaling only (diagonal transformation).

Distances are invariant, data distribution is the same.

How to make data invariant to any linear transformations?

$$\bar{Z}_i = \frac{1}{n}\sum_{j=1}^{n} Z_i^{(j)} = \frac{1}{n\,\sigma_i}\sum_{j=1}^{n}\left(X_i^{(j)} - \bar{X}_i\right) = 0$$

$$\sigma_{Z_i}^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(Z_i^{(j)}\right)^2 = \frac{1}{(n-1)\,\sigma_i^2}\sum_{j=1}^{n}\left(X_i^{(j)} - \bar{X}_i\right)^2 = 1$$

Page 37: Computational Intelligence:  Methods and Applications

Data standardization example

In the "Linear transformations – example" slide above, Y = AX; assume all X means = 1 and all variances = 1.

Transformation:

$$\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$

Vector of mean feature values:

$$\bar{\mathbf{X}} = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \;\Rightarrow\; \bar{\mathbf{Y}} = \mathbf{A}\bar{\mathbf{X}} = \begin{pmatrix} 3 \\ 2 \end{pmatrix}$$

Variance (check it!):

$$\boldsymbol{\sigma}_X^2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \;\Rightarrow\; \boldsymbol{\sigma}_Y^2 = \mathrm{Diag}\!\left(\mathbf{A}\mathbf{A}^T\right) = \begin{pmatrix} 5 \\ 2 \end{pmatrix}$$

$$\|\mathbf{Y}_1 - \mathbf{Y}_2\|^2 = (\mathbf{X}_1 - \mathbf{X}_2)^T \mathbf{A}^T\mathbf{A}\,(\mathbf{X}_1 - \mathbf{X}_2)$$

How to make this invariant?

Page 38: Computational Intelligence:  Methods and Applications

Covariance matrix

Variance (spread around mean value) + correlation between features.

where X is d x n dimensional matrix of vectors shifted to their means.

Covariance matrix is symmetric Cij = Cji and positive definite.

Diagonal elements are variances (squares of std), $\sigma_i^2 = C_{ii}$.

$$C_{ij} = \frac{1}{n-1}\sum_{k=1}^{n}\left(X_i^{(k)} - \bar{X}_i\right)\left(X_j^{(k)} - \bar{X}_j\right), \qquad i, j = 1 \ldots d$$

$$\mathbf{C}_X = \frac{1}{n-1}\sum_{k=1}^{n}\left(\mathbf{X}^{(k)} - \bar{\mathbf{X}}\right)\left(\mathbf{X}^{(k)} - \bar{\mathbf{X}}\right)^T = \frac{1}{n-1}\,\mathbf{X}\mathbf{X}^T$$

Pearson correlation coefficient:

$$r_{ij} = \frac{C_{ij}}{\sigma_i \sigma_j} \in [-1, +1]$$

A spherical distribution of data has $\mathbf{C}_X = \mathbf{I}$ (unit matrix).

Elongated ellipsoids: large off-diagonal elements, strong correlations between features.

CX is d x d
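A minimal sketch of estimating the covariance matrix and Pearson correlations from data (numpy; columns are vectors, as in the slides):

```python
# Sketch: covariance matrix and Pearson correlations for a d x n data matrix.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=n)     # correlated with x1
X = np.stack([x1, x2])                       # shape (d, n) = (2, n)

Xc = X - X.mean(axis=1, keepdims=True)       # shift vectors to their means
C = Xc @ Xc.T / (n - 1)                      # C_X = X X^T / (n - 1), d x d
sigma = np.sqrt(np.diag(C))                  # stds are square roots of the diagonal
r = C / np.outer(sigma, sigma)               # Pearson correlation coefficients

print(np.round(C, 3))
print(np.round(r, 3))                        # off-diagonal close to the true correlation
```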

Page 39: Computational Intelligence:  Methods and Applications

Correlation

Correlation coefficient is linear and may be confusing …

Page 40: Computational Intelligence:  Methods and Applications

Mahalanobis distance

Linear combinations of features lead to rotations and scaling of the data.

Mahalanobis distance is defined as:

$$\|\mathbf{X}\|_{\mathbf{C}_X}^2 = \mathbf{X}^T \mathbf{C}_X^{-1}\,\mathbf{X}$$

It is invariant to linear transformations $\mathbf{Y} = \mathbf{A}\mathbf{X}$, for which $\bar{\mathbf{Y}} = \mathbf{A}\bar{\mathbf{X}}$ and $\mathbf{C}_Y = \mathbf{A}\mathbf{C}_X\mathbf{A}^T$:

$$\|\mathbf{Y}_1 - \mathbf{Y}_2\|_{\mathbf{C}_Y}^2 = (\mathbf{Y}_1 - \mathbf{Y}_2)^T\,\mathbf{C}_Y^{-1}\,(\mathbf{Y}_1 - \mathbf{Y}_2) = (\mathbf{X}_1 - \mathbf{X}_2)^T \mathbf{A}^T \left(\mathbf{A}\mathbf{C}_X\mathbf{A}^T\right)^{-1} \mathbf{A}\,(\mathbf{X}_1 - \mathbf{X}_2) = \|\mathbf{X}_1 - \mathbf{X}_2\|_{\mathbf{C}_X}^2$$
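A minimal sketch of the Mahalanobis distance computed from an estimated covariance matrix (numpy; synthetic data):

```python
# Sketch: Mahalanobis distance between two vectors, using the inverse covariance matrix.
import numpy as np

def mahalanobis_sq(x1: np.ndarray, x2: np.ndarray, cov_inv: np.ndarray) -> float:
    d = x1 - x2
    return float(d @ cov_inv @ d)

rng = np.random.default_rng(0)
# Columns are data vectors, as in the slides.
X = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 0.8], [0.8, 1.0]], size=300).T

C = np.cov(X)                      # 2 x 2 covariance matrix (rows are variables)
C_inv = np.linalg.inv(C)

print(mahalanobis_sq(X[:, 0], X[:, 1], C_inv))
```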

Page 41: Computational Intelligence:  Methods and Applications

Principal components

How to avoid correlated features?

Correlations => the covariance matrix is non-diagonal!

Solution: diagonalize it, then use the transformation that makes it

diagonal to de-correlate features.

C – symmetric, positive definite matrix: $\mathbf{X}^T\mathbf{C}\mathbf{X} > 0$ for $\|\mathbf{X}\| > 0$;

its eigenvectors are orthonormal, $\mathbf{Z}^{(i)T}\mathbf{Z}^{(j)} = \delta_{ij}$; its eigenvalues are all non-negative, $\lambda_i \ge 0$.

Z – matrix of orthonormal eigenvectors (because CX is real+symmetric),

transforms X into Y, with diagonal CY, i.e. decorrelated.

$$\mathbf{C}_X\,\mathbf{Z}^{(i)} = \lambda_i\,\mathbf{Z}^{(i)}; \qquad \mathbf{Y} = \mathbf{Z}^T\mathbf{X}; \qquad \mathbf{C}_Y = \mathbf{Z}^T\mathbf{C}_X\mathbf{Z} = \mathbf{Z}^T\mathbf{Z}\boldsymbol{\Lambda} = \boldsymbol{\Lambda}$$

In matrix form: X, Y are d×n; Z, C_X, C_Y are d×d.
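A minimal sketch of de-correlating features by diagonalizing the covariance matrix (numpy eigendecomposition; synthetic 2D data):

```python
# Sketch: PCA via eigendecomposition of the covariance matrix (columns of X are vectors).
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.2], [1.2, 1.0]], size=500).T

Xc = X - X.mean(axis=1, keepdims=True)
C_X = Xc @ Xc.T / (X.shape[1] - 1)

lam, Z = np.linalg.eigh(C_X)        # eigenvalues (ascending) and orthonormal eigenvectors
Y = Z.T @ Xc                        # transformed (principal) components

C_Y = np.cov(Y)
print(np.round(C_Y, 3))             # approximately diagonal, eigenvalues on the diagonal
print(np.round(lam, 3))
```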

Page 42: Computational Intelligence:  Methods and Applications

Matrix form

Eigenproblem for the C matrix in matrix form: $\mathbf{C}_X\,\mathbf{Z} = \mathbf{Z}\boldsymbol{\Lambda}$

$$\begin{pmatrix} C_{11} & C_{12} & \cdots & C_{1d} \\ C_{21} & C_{22} & \cdots & C_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ C_{d1} & C_{d2} & \cdots & C_{dd} \end{pmatrix} \begin{pmatrix} Z_{11} & Z_{12} & \cdots & Z_{1d} \\ Z_{21} & Z_{22} & \cdots & Z_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{d1} & Z_{d2} & \cdots & Z_{dd} \end{pmatrix} = \begin{pmatrix} Z_{11} & Z_{12} & \cdots & Z_{1d} \\ Z_{21} & Z_{22} & \cdots & Z_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{d1} & Z_{d2} & \cdots & Z_{dd} \end{pmatrix} \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_d \end{pmatrix}$$

Page 43: Computational Intelligence:  Methods and Applications

Principal components. PCA: old idea, C. Pearson (1901), H. Hotelling (1933).

Result: PC are linear combinations of all features, providing new uncorrelated features, with diagonal covariance matrix = eigenvalues.

$$\mathbf{Y} = \mathbf{Z}^T\mathbf{X}; \qquad \mathbf{C}_Y = \mathbf{Z}^T\mathbf{C}_X\mathbf{Z} = \boldsymbol{\Lambda}; \qquad \mathbf{C}_X = \mathbf{Z}\boldsymbol{\Lambda}\mathbf{Z}^T$$

Small $\lambda_i$ => small variance => the data change little in direction $Y_i$.

PCA minimizes reconstruction errors of the C matrix: the $\mathbf{Z}^{(i)}$ vectors for large $\lambda_i$ are sufficient to get a good approximation $\mathbf{C}_X \approx \sum_i \lambda_i \mathbf{Z}^{(i)}\mathbf{Z}^{(i)T}$ (sum over the large-eigenvalue components), because vectors for small eigenvalues have a very small contribution to the covariance matrix.

Y – principal components, or vectors X transformed using eigenvectors of CX

Covariance matrix of transformed vectors is diagonal => ellipsoidal distribution of data.

Page 44: Computational Intelligence:  Methods and Applications

Two components for visualization

New coordinate system: axes ordered according to variance = size of the eigenvalue.

The first k dimensions account for

$$V_k = \frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{d}\lambda_i}$$

fraction of all variance (note that the $\lambda_i$ are variances); frequently 80–90% is sufficient for a rough description.

Diagonalization methods: see Numerical Recipes, www.nr.com
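A minimal sketch of computing the $V_k$ fraction from data (numpy; synthetic data with growing per-feature spread):

```python
# Sketch: fraction of total variance captured by the first k principal components.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 1000)) * np.arange(1, 9)[:, None]   # 8 features, growing spread

C = np.cov(X)                                 # 8 x 8 covariance matrix
lam = np.linalg.eigvalsh(C)[::-1]             # eigenvalues sorted in decreasing order

V = np.cumsum(lam) / lam.sum()                # V_k for k = 1 .. d
k = int(np.searchsorted(V, 0.9) + 1)          # smallest k explaining ~90% of the variance
print(np.round(V, 3), k)
```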

Page 45: Computational Intelligence:  Methods and Applications

PCA properties

PC Analysis (PCA) may be achieved by:

• transformation making covariance matrix diagonal

• projecting the data onto a line for which the sum of squares of distances from the original points to their projections is minimal.

• orthogonal transformation to new variables Y(W) that have stationary variances – around the maximum the change of variance is minimal.

True covariance matrices are usually not known, they have to be estimated from data.

This works well on single-cluster data;

more complex structure may require local PCA: the PCA transformation should then be done separately for each cluster or neighborhood of a query vector X.

Page 46: Computational Intelligence:  Methods and Applications

Some remarks on PCA

PC results obviously depend on the initial scaling of the features, therefore one should standardize the data first to make it independent of scaling or measurement units. Example: Heart data.

Assume that the data matrix X has been standardized; show that:

$$\bar{Y}_i = 0, \qquad \sigma^2(Y_i) = \lambda_i,$$

that is, the mean stays zero and the variance of each principal component equals its eigenvalue. Therefore rejecting Yi components with small variance leads to small errors in the reconstruction of X = ZY, where the rejected components are replaced by zero values.

PC is useful for:

finding new, more informative, uncorrelated features;

reducing dimensionality: reject low variance features,

reconstructing original data from lower-dimensional projections.


Page 47: Computational Intelligence:  Methods and Applications

PCA Wisconsin example. Wisconsin Breast Cancer data:

• Collected at the University of Wisconsin Hospitals, USA.

• 699 cases, 458 (65.5%) benign (red), 241 malignant (green).

• 9 features: quantized 1, 2 .. 10, cell properties, ex:

Clump Thickness, Uniformity of Cell Size, Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei,

Bland Chromatin, Normal Nucleoli, Mitoses.

2D scatterograms do not show any structure no matter which subspaces are taken!

Page 48: Computational Intelligence:  Methods and Applications

Example cont.Example cont.PC gives useful information already in 2D.

Taking first PCA component of the standardized data:

A single threshold on Y1 at 0.41 separates benign from malignant: 18 errors / 699 cases = 97.4% accuracy.

Transformed vectors are not standardized; their std's are shown below.

Eigenvalues decrease to zero slowly, but classes are well separated.

Page 49: Computational Intelligence:  Methods and Applications

PCA disadvantages. Useful for dimensionality reduction, but:

• Largest variance determines which components are used, but does not guarantee interesting viewpoint for clustering data.

• The meaning of features is lost when linear combinations are formed.

Analysis of coefficients in Z1 and other important eigenvectors may show which original features are given much weight.

PCA may also be done efficiently by performing singular value decomposition (SVD) of the standardized data matrix.
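A minimal sketch of PCA via SVD of the standardized data matrix (numpy; synthetic data, columns are vectors):

```python
# Sketch: PCA via SVD; eigenvalues of the covariance matrix come from the singular values.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[3, 1, 0], [1, 2, 0], [0, 0, 0.5]], size=400).T

# Standardize each feature (row) to zero mean and unit variance.
Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

U, s, Vt = np.linalg.svd(Z, full_matrices=False)
lam = s ** 2 / (Z.shape[1] - 1)       # eigenvalues of the covariance matrix
Y = U.T @ Z                           # principal components (same as eigenvector projection)

print(np.round(lam, 3))
```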

PCA is also called the Karhunen-Loève transformation.

Many variants of PCA are described in A. Webb, Statistical pattern recognition, J. Wiley 2002.

Page 50: Computational Intelligence:  Methods and Applications

2 skewed distributions

PCA transformation for 2D data:

First component will be chosen along the largest variance line, both clusters will strongly overlap, no interesting structure will be visible.

In fact, projection onto the axis orthogonal to the first PCA component has much more discriminating power.

Discriminant coordinates should be used to reveal class structure.