
Post on 04-Aug-2020


Multivariate description
Visualisation

Reduction of dimensionality

Data Mining course
Master in Information Technologies

Enginyeria Informàtica

Tomàs Aluja


Two types of datasets to analyze

Data in Data Mining: massive, secondary, not random, with errors and missing values

Topics: socio-economic, opinions, products

Data to explore
Data to model: inputs and output(s)

Course DM: Multivariate Visualisation. T. Aluja


Data exploration: Visualisation + “clustering”

• Data contains information about the generating phenomenon.

• Visualization. The human eye …
  – We accept a loss of information in exchange for gaining interpretability.

• Synthesis of reality (clustering)
  – Reality is complex; we make it operational by simplifying it into a limited number of clusters.

Snow’s Cholera Map, 1855


South and North Korea at night

South Korea: guess where Seoul is.

North Korea: notice how dark it is.


Graph visualisation

Ggobi project


Parallel coordinates of IRIS data


Iris versicolor

Iris virginica

Iris setosa


Visualization of the table: BCN quarters × profession of inhabitants


Spanish Inquisition 1567-1600: sentences & crimes


Visualisation of international cities according to their salaries. USB 1994.


Microarray data: 64 cancers, 6830 genes (chromatography)


M. Turk and A. Pentland. Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 3(1), 1991.

Reconstitution of images


Actual image


Reconstituted image


Monitoring of the inner temperatures of Lascaux cave (France): 


Multivariate Visualization
Selection of the active topic

• Exploratory situation (without a response variable but with illustrative variables).

[Figure: data matrix with n individuals (rows) and p variables (columns), split into active variables and illustrative variables.]


Active topic             Multivariate technique

Continuous variables     PCA - Principal Component Analysis

Count variables          CA - (Simple) Correspondence Analysis

Categorical variables    MCA - Multiple Correspondence Analysis


PCA, CA, MCA can be useful for …

• Visualisation of the information contained in a data matrix
• Detection of “outliers”
• Reduction of the dimensionality (feature selection)
• Image compression
• Extraction of new derived (latent) variables, “feature extraction”
• Smoothing of data (error reduction, avoiding collinearity)
• First phase of preparing the explanatory variables for modeling


Principal Component Analysis

• Cloud of points associated to the rows of the data matrix

• Total information contained in the cloud of points: the inertia with respect to the centre of gravity G

[Figure: the n × p data matrix X, whose rows i, i' are points of a cloud in $R^p$ with axes var1, var2, var3.]

Harold Hotelling, 1895-1973, American statistician


• Purpose:
  – To project the cloud of points onto a subspace (a plane) that retains as much of the original cloud's information as possible.


Principal Component Analysis

• Fitness criterion
  – Find the subspace maximizing the projected inertia.

• Decomposition of inertia in orthogonal directions (factorial axes)

$$I_{total} = I_1 + I_2 + \dots + I_p, \qquad I_1 > I_2 > \dots > I_p$$


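Fit in $R^p$

Let X be the data matrix, centered or standardized, and N the diagonal matrix of the individual weights $p_i$.

Let $u \in R^p$ be the unit vector defining the direction maximizing the projected inertia, and $\psi = Xu$ the projection of the cloud on it:

$$\max_u \sum_{i=1}^{n} p_i \psi_i^2 = \psi' N \psi = u' X' N X u \quad \text{subject to } u'u = 1$$

Solution: diagonalization of the correlation matrix (or covariance)

$$X'NXu = \lambda u, \qquad X'NX = \begin{cases} Cov(X) & \text{if X is centered} \\ Cor(X) & \text{if X is standardized} \end{cases}$$

$$\max_{u'u = 1} u'X'NXu = u_1'X'NXu_1 = \lambda_1$$

Eigenvalues and eigenvectors: $\lambda_1, \dots, \lambda_r$ and $u_1, \dots, u_r$, with $r = rank(X)$.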
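As a rough numerical illustration of this fit, the sketch below builds $X'NX$ for a small made-up table with two variables and uniform weights $p_i = 1/n$ (both are assumptions, not data from the course). It then checks that the inertia projected on the first axis equals the largest eigenvalue. The closed-form eigenvalues only work because the matrix is 2 × 2.

```python
import math

# Toy data: n = 5 individuals, p = 2 variables (made-up values).
X = [[2.0, 1.0], [3.0, 4.0], [5.0, 4.0], [6.0, 7.0], [9.0, 9.0]]
n, p = len(X), len(X[0])

# Center each column; uniform weights p_i = 1/n, so N = I/n (assumption).
means = [sum(row[j] for row in X) / n for j in range(p)]
Xc = [[row[j] - means[j] for j in range(p)] for row in X]

# S = X'NX, the covariance matrix of the centered data.
S = [[sum(r[j] * r[k] for r in Xc) / n for k in range(p)] for j in range(p)]

# Eigenvalues of a symmetric 2x2 matrix in closed form.
a, b, d = S[0][0], S[0][1], S[1][1]
half_trace = (a + d) / 2
radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
lam1, lam2 = half_trace + radius, half_trace - radius

# First factorial axis u1: unit eigenvector associated with lam1
# (this form of the eigenvector requires b != 0).
u1 = [b, lam1 - a]
norm = math.sqrt(u1[0] ** 2 + u1[1] ** 2)
u1 = [c / norm for c in u1]

# Projections psi = X u1; their weighted inertia should equal lam1.
psi = [r[0] * u1[0] + r[1] * u1[1] for r in Xc]
projected_inertia = sum(v * v for v in psi) / n
total_inertia = a + d  # trace of S = lam1 + lam2
```

The two eigenvalues add up to the total inertia (the trace of $X'NX$), which is the decomposition of inertia along orthogonal axes.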


[Figure: multidimensional cloud in $R^p$ with Axis 1 and Axis 2.]

Principal components (derived latent variables, factors, …)

Direction maximizing the projected inertia: $u_1$. Direction maximizing the projected inertia orthogonal to $u_1$: $u_2$, ...

$$\psi_\alpha = X u_\alpha$$

Inertia of a PC:

$$\psi_\alpha' N \psi_\alpha = \lambda_\alpha$$

Assessing the importance of orthogonal directions: scree plot of eigenvalues.

[Figure: "Scree Plot of Clarity-Quality"; x-axis: Component Number (1 to 6), y-axis: Eigenvalue (0 to 3).]


Cloud of points associated with the columns of a data matrix, in $R^n$

[Figure: the n × p data matrix X and the cloud of variables in $R^n$ (axes ind1, ind2, ind3; label var j, $s_j$), shown for centered variables and for standardized variables. Annotations: highly correlated variables (x, y); orthogonal variable (z); strongly negative correlation with x and y (w).]


Fit in $R^n$ (standardized data)

Optimal joint visualisation of the correlations between variables

[Figure: the original cloud of the variables v1, v2, v3, v4 and its projection on the first factorial plane (Axis 1, Axis 2).]


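Fit in $R^n$

Data matrix X: centered or standardized

Let $v \in R^n$ be the unit vector defining the direction maximizing the inertia, and $\varphi = X'N^{1/2}v$:

$$\max_v \sum_{j=1}^{p} \varphi_j^2 = \varphi'\varphi = v'N^{1/2}XX'N^{1/2}v \quad \text{subject to } v'v = 1$$

$$\Rightarrow N^{1/2}XX'N^{1/2}v = \lambda v$$

$$\varphi_\alpha = X'N^{1/2}v_\alpha, \qquad \varphi_\alpha'\varphi_\alpha = \lambda_\alpha$$

Transition relationships between both fits:

$$u_\alpha = \lambda_\alpha^{-1/2} X'N^{1/2}v_\alpha, \qquad v_\alpha = \lambda_\alpha^{-1/2} N^{1/2}Xu_\alpha$$

Indirect projection formulas:

$$\varphi_\alpha = \lambda_\alpha^{1/2} u_\alpha, \qquad \psi_\alpha = \lambda_\alpha^{1/2} N^{-1/2} v_\alpha$$

Interpreting the projections:

$$\varphi_\alpha = \lambda_\alpha^{-1/2} X'N\psi_\alpha, \qquad \varphi_{j\alpha} = \begin{cases} cor(x_j, \psi_\alpha) & \text{if X is standardized} \\ s_j \, cor(x_j, \psi_\alpha) & \text{if X is centered} \end{cases}$$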
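To make the interpretation rule concrete, the sketch below checks on two made-up standardized variables that the coordinate of a variable on the first axis, $\varphi_{j1} = \lambda_1^{1/2} u_{j1}$, coincides with its correlation with the first principal component. It relies on a known special case: for two standardized, positively correlated variables, $u_1 = (1/\sqrt{2}, 1/\sqrt{2})$ and $\lambda_1 = 1 + r$. The data and the uniform weights are assumptions.

```python
import math

# Two hypothetical variables observed on 5 individuals.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(x1)

def standardize(x):
    """Center and scale to unit variance (weights 1/n)."""
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / n)
    return [(v - m) / s for v in x]

def cor(a, b):
    """Pearson correlation with uniform weights 1/n."""
    za, zb = standardize(a), standardize(b)
    return sum(u * v for u, v in zip(za, zb)) / n

z1, z2 = standardize(x1), standardize(x2)
r = cor(x1, x2)

# Special case p = 2, r > 0: first axis and eigenvalue are known in closed form.
u1 = [1 / math.sqrt(2), 1 / math.sqrt(2)]
lam1 = 1 + r
psi1 = [u1[0] * a + u1[1] * b for a, b in zip(z1, z2)]

# Coordinate of each variable on the first axis, computed two ways.
phi_from_axis = [math.sqrt(lam1) * u for u in u1]  # phi = sqrt(lambda) * u
phi_from_cor = [cor(x1, psi1), cor(x2, psi1)]      # phi_j = cor(x_j, psi)
```

Both lists agree, which is why the variables can be read on the factorial plane through their correlations with the axes.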


PCA is a device for finding artificial latent variables from observed ones.

[Figure: the observed variables Var. 1, Var. 2, ..., Var. p (real world) are linked by PCA to the factors Fac. 1, ..., Fac. q (world of ideas, concepts, theories, …), which explain them.]

PCA: $\psi_\alpha = u_{1\alpha} x_1 + u_{2\alpha} x_2 + \dots + u_{p\alpha} x_p$

$$\Psi_{(n,p)} = X_{(n,p)} U_{(p,p)}$$

But only the first q factors convey structural information; the remaining ones are noise.


PCA in practice

• Role of the variables: normed or non-normed analysis
  – Normed PCA gives all variables the same importance; we achieve this by standardizing the data (diagonalization of the correlation matrix).
  – Non-normed PCA gives each variable an importance proportional to its standard deviation; we achieve this by working with the merely centered data matrix (diagonalization of the covariance matrix).

• Which variables to analyze?
  – This is the most crucial decision. Often the information contained is obvious; in that case, try partial analyses. PCA is an exploration device.
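The difference between the two analyses can be seen on the diagonal of the matrix being diagonalized. In the sketch below (made-up heights and weights, uniform weights 1/n, all assumptions), each variable contributes its variance to the total inertia in the non-normed case, so the variable with the larger spread dominates. After standardization every variable contributes exactly 1.

```python
import math

# Hypothetical data: height (cm) and weight (kg) for 5 people.
X = [[150.0, 50.0], [160.0, 55.0], [170.0, 65.0], [180.0, 80.0], [190.0, 100.0]]
n, p = len(X), len(X[0])

means = [sum(r[j] for r in X) / n for j in range(p)]
sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in X) / n) for j in range(p)]

centered = [[r[j] - means[j] for j in range(p)] for r in X]
standardized = [[c[j] / sds[j] for j in range(p)] for c in centered]

def cross_product(M):
    """M'M / n: covariance if M is centered, correlation if standardized."""
    return [[sum(r[j] * r[k] for r in M) / n for k in range(p)] for j in range(p)]

cov = cross_product(centered)      # diagonalized by non-normed PCA
cor = cross_product(standardized)  # diagonalized by normed PCA

# Share of the total inertia carried by each variable in each analysis.
share_non_normed = [cov[j][j] / (cov[0][0] + cov[1][1]) for j in range(p)]
share_normed = [cor[j][j] / (cor[0][0] + cor[1][1]) for j in range(p)]
```

Changing the unit of one variable (say, metres instead of centimetres) would change `share_non_normed` but leave `share_normed` untouched, which is the usual argument for the normed analysis when units differ.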


PCA in practice

• How many factorial directions are significant?
  – Difficult to assess. How many axes remain stable with independent data?
  – Use the scree plot.
  – Perform random perturbation of the data to assess stability.

• How to interpret the axes
  – The significant axes convey structural (deterministic) information about the phenomenon under study; they can be interpreted and given a name (this is the most appealing outcome).
  – Interpretation is based on the correlations between the principal components (the new artificial latent variables) and the original variables; a PC is a kind of mean of the variables most correlated with it.


Projection of illustrative variables

• Continuous
  – We depict their correlations with the factorial axes.

• Categorical
  – We represent a categorical variable by the set of centres of gravity of the subclouds of individuals corresponding to each of its levels.

Very useful … It allows each illustrative variable to be related to the active topic as a whole.


Finding the PCA solution iteratively (NIPALS)

Initialize X1 ← X
For h = 1, ..., r = rank(X)
    Ψh = mean column of Xh
    Repeat until convergence of uh
        uh = Xh' Ψh
        uh = uh / |uh|
        Ψh = Xh uh
    Xh+1 = Xh - Ψh uh'

[Figure: Xh maps uh in $R^p$ to ψh in $R^n$; Xh' maps ψh back to $R^p$.]

At convergence: $X_h' X_h u_h \propto u_h$ and $X_h X_h' \psi_h \propto \psi_h$
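The loop above can be sketched in plain Python as follows. This is a minimal illustration, not the course's implementation: the function name `nipals`, the tolerance, the iteration cap and the fallback start vector are all assumptions. It also assumes X is already column-centered and that the number of requested components does not exceed rank(X).

```python
import math

def nipals(X, n_components, tol=1e-12, max_iter=1000):
    """NIPALS sketch: returns (scores, axes) for a column-centered matrix X."""
    Xh = [row[:] for row in X]  # working copy, deflated after each component
    n, p = len(Xh), len(Xh[0])
    scores, axes = [], []
    for _ in range(n_components):
        # Start from the mean column of Xh (vector of row means).
        psi = [sum(row) / p for row in Xh]
        if all(abs(v) < tol for v in psi):
            psi = [row[0] for row in Xh]  # fallback start (assumption)
        u = [0.0] * p
        for _ in range(max_iter):
            # u = Xh' psi, then normalize to unit length.
            new_u = [sum(Xh[i][j] * psi[i] for i in range(n)) for j in range(p)]
            norm = math.sqrt(sum(c * c for c in new_u))
            new_u = [c / norm for c in new_u]
            # psi = Xh u
            psi = [sum(Xh[i][j] * new_u[j] for j in range(p)) for i in range(n)]
            change = max(abs(a - b) for a, b in zip(new_u, u))
            u = new_u
            if change < tol:
                break
        scores.append(psi)
        axes.append(u)
        # Deflation: remove the rank-one part psi u' from Xh.
        Xh = [[Xh[i][j] - psi[i] * u[j] for j in range(p)] for i in range(n)]
    return scores, axes

# Usage on a small centered table (made-up values), uniform weights 1/n.
X = [[-3.0, -4.0], [-2.0, -1.0], [0.0, -1.0], [1.0, 2.0], [4.0, 4.0]]
scores, axes = nipals(X, 2)
lam = [sum(v * v for v in psi) / len(X) for psi in scores]  # inertia per axis
```

Each pass is a power iteration on $X_h'X_h$, so the axes come out in decreasing order of inertia and the deflation step makes successive axes orthogonal.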


A relevant application: Google

• Google™ uses SVD to accelerate finding relevant web pages. Define a web site as an authority if many sites link to it, and as a hub if it links to many sites. We want to compute a ranking $x_1, \dots, x_N$ of authorities and $y_1, \dots, y_M$ of hubs.

As a first pass, we can compute the ranking scores as follows: $x_i^0$ is the number of links pointing to i and $y_i^0$ is the number of links going out of i. But not all links should be weighted equally. For example, links from authorities (or hubs) should count more. So we can revise the rankings as follows:

$$x^1 = A'y^0, \qquad y^1 = Ax^0$$

where A is the adjacency matrix, with $a_{ij} = 1$ if i links to j (of order $10^9$).

But an authority also depends on the pages linking to the pages that link to it. Hence, iterating:

$$x^k = A'Ax^{k-1}, \qquad y^k = AA'y^{k-1}$$
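The iteration can be tried on a tiny hypothetical web of four pages. In the sketch below the adjacency matrix is made up, and the rescaling to unit length at each step is an added assumption (it keeps the scores bounded and does not change the ranking). The loop alternates $x \leftarrow A'y$ and $y \leftarrow Ax$, which amounts to repeatedly applying $A'A$ to x and $AA'$ to y.

```python
import math

# Hypothetical 4-page web: A[i][j] = 1 if page i links to page j.
A = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
n = len(A)

def normalize(v):
    """Rescale to unit length (assumption: only the ranking matters)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

# First pass: authorities start from in-degrees, hubs from out-degrees.
x = normalize([sum(A[i][j] for i in range(n)) for j in range(n)])
y = normalize([sum(A[i]) for i in range(n)])

for _ in range(100):
    # x <- A'y: a page is a good authority if good hubs point to it.
    x = normalize([sum(A[i][j] * y[i] for i in range(n)) for j in range(n)])
    # y <- Ax: a page is a good hub if it points to good authorities.
    y = normalize([sum(A[i][j] * x[j] for j in range(n)) for i in range(n)])

best_authority = max(range(n), key=lambda j: x[j])
best_hub = max(range(n), key=lambda i: y[i])
```

The scores converge to the dominant eigenvectors of $A'A$ and $AA'$, that is, to the first right and left singular vectors of A, which is where the SVD comes in.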


prcomp(x, retx=T, center=T, scale.=F, tol = NULL, ...)

Arguments:
x: a numeric matrix (or data frame) which provides the data.
retx: a logical value indicating whether the rotated variables should be returned.
center: a logical value indicating whether the variables should be shifted to be zero centered.
scale.: a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place.
tol: a value indicating the magnitude below which components should be omitted.

Attributes:
sdev: the standard deviations of the principal components (i.e., the square roots of the eigenvalues).
rotation: the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors).
x: if 'retx' is true the value of the rotated data (the centred (and scaled if requested) data multiplied by the 'rotation' matrix) is returned.


biplot(x, y, var.axes = TRUE, main = NULL, ...)

Arguments:

x: The first set of points (a two-column matrix), usually associated with observations.

y: The second set of points (a two-column matrix), usually associated with variables.

var.axes: If 'TRUE' the second set of points have arrows representing them as (unscaled) axes.


Beyond PCA ⇒ MCA

• PCA only analyzes continuous variables through their correlations, hence it can only reveal linear relationships between variables.

• Thus, transform the original variables: recode them as ordinal variables to take non-linearities into account.

[Figure: variable j is recoded into categories $a_{j1}, \dots, a_{jk}$, so that a value $x_{ij}$ becomes an indicator row such as 001000; f(X) → Ψ.]

Ludovic Lebart, French statistician, promoter of MCA


MCA of hypercubes

• Dimensions (= categorical variables)
• Measured variables in cells (= responses; they may be continuous or categorical)
• The hypercube can be explicit or implicit in a relational DB.

Hypercube dimensions    Numerical coding (binning) (= Z)

                        A1 A2 A3 A4   B1 B2   C1 C2 C3
A1 B1 C1                1  0  0  0    1  0    1  0  0
A1 B2 C2                1  0  0  0    0  1    0  1  0
A3 B1 C3                0  0  1  0    1  0    0  0  1
A2 B2 C1                0  1  0  0    0  1    1  0  0
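The numerical coding of the dimensions can be sketched as follows, using the four rows and the level order shown on the slide. The function name `disjunctive_coding` is illustrative, not taken from any library.

```python
# Rows of the hypercube: one level per dimension (values from the slide).
rows = [("A1", "B1", "C1"), ("A1", "B2", "C2"), ("A3", "B1", "C3"), ("A2", "B2", "C1")]

# Levels of each dimension, in the column order of the slide.
levels = [["A1", "A2", "A3", "A4"], ["B1", "B2"], ["C1", "C2", "C3"]]

def disjunctive_coding(rows, levels):
    """Build the indicator matrix Z: one 0/1 column per level of each dimension."""
    Z = []
    for row in rows:
        coded = []
        for value, dim_levels in zip(row, levels):
            coded.extend(1 if value == level else 0 for level in dim_levels)
        Z.append(coded)
    return Z

Z = disjunctive_coding(rows, levels)

# Row totals equal the number of dimensions p; column totals are the counts n_j.
row_totals = [sum(z) for z in Z]
column_totals = [sum(z[j] for z in Z) for j in range(len(Z[0]))]
```

Every row of Z sums to p = 3, one indicator per dimension. Level A4 never occurs in these four rows, so its column total is 0.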


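• Active variables: dimensions
• Illustrative variables: responses

[Figure: data matrix with n individuals (rows); the dimensions, coded as indicator columns such as 1000 10 100, are the active variables and the response variables are illustrative.]

We will visualize the responses on the grid provided by the dimensions.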


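MCA: active grid

Age   CSP   Income level
 2     1     3             →   0 1 0   1 0 0   0 0 1

[Figure: the indicator matrix with n rows and column blocks Age, CSP and Income (row totals p, column totals $n_j$, grand total np), and the resulting factorial map of the levels Ed1, Ed2, Ed3 (age), CSP1, CSP2, CSP3 and ing1, ing2, ing3 (income).]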

MCA as a non-linear PCA


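Z: indicator matrix with n rows (individuals) and J columns (the modalities 1 … j … J of the p variables); a row looks like 0010 01 010 and sums to p.

$$n_j = \sum_{i=1}^{n} z_{ij}, \qquad D = diag(n_1, \dots, n_J), \qquad p_i = \frac{p}{np} = \frac{1}{n}$$

Row profile: $\frac{1}{p} Z$

Chi-square metric: $\left( \frac{1}{np} D \right)^{-1}$

$$\psi = \frac{1}{p} Z \, np D^{-1} u$$

$$\max_u \sum_{i=1}^{n} p_i \psi_i^2 = \frac{1}{n} \psi'\psi = n \, u'D^{-1}Z'ZD^{-1}u \quad \text{subject to } u' \, np D^{-1} u = 1$$

$$\Rightarrow \text{eig: } \frac{1}{p} Z'ZD^{-1}u = \lambda u$$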


What are the factors in MCA?

$$\max \sum_{j=1}^{p} cor^2(\Psi, z_j), \qquad z_j = \sum_{k} u_{jk} a_{jk}$$

⇒ Optimal quantification of the categorical variables

[Figure: in $R^n$, the factor Ψ and the quantified variables z1, z2, z3 obtained from Age, CSP and Income level.]

Original categorical data → MCA → Equivalent continuous factors

But we will work with more dimensions than in PCA.

Interesting properties of the MCA displays

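• Every individual is the centre of gravity (cdg) of the modalities it has chosen (apart from a multiplicative factor):

$$\psi_{i\alpha} = \frac{1}{\sqrt{\lambda_\alpha}} \sum_{j=1}^{J} \frac{z_{ij}}{p} \varphi_{j\alpha}$$

• A modality (= level) is the centre of gravity of the individuals having chosen it (apart from a multiplicative factor):

$$\varphi_{j\alpha} = \frac{1}{\sqrt{\lambda_\alpha}} \sum_{i=1}^{n} \frac{z_{ij}}{n_j} \psi_{i\alpha}$$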


MCA iterative algorithm

Initialize Y1 ← Z, with Z = [Z1, ..., Zp]; D = diag(Z'Z); Dk = Zk'Zk
For h = 1, ..., rank(Z)
    Ψh = row mean of Yh
    Repeat until convergence of uh
        uh = D^-1 Yh' Ψh
        uh = uh / |uh|
        Ψh = (1/p) Yh uh
    Yh+1 = Yh - Ψh uh'
    zk = Zk uk;  uk = Dk^-1 Zk' Ψh,  k = 1...p


Projection of the illustrative variables

• Continuous
  – From their correlations with the factorial axes.

• Categorical
  – As the set of centres of gravity of the individuals having chosen each level of the categorical variable.


library(MASS)
mca(df, nf = 2, abbrev = FALSE)

Arguments:
df: A data frame containing only factors.
nf: The number of dimensions for the MCA.

Attributes:
rs: The coordinates of the rows, in 'nf' dimensions.
cs: The coordinates of the column vertices, one for each level of each factor.
fs: Weights for each row, used to interpolate additional factors in 'predict.mca'.
d: The singular values for the 'nf' dimensions.


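Simple Correspondence Analysis
Analysis of crosstables

[Figure: the crosstable Quarters × CSP with cells $n_{kj}$, margins $n_k$ and $n_j$ and total n, next to the equivalent coding of the n individuals as two indicator blocks (e.g. 00010000 for the quarter and 00000010000 for the CSP), quantified as z1 and z2 in $R^n$.]

$$\max cor(z_1, z_2), \qquad z_1 = \sum_{k} u_k a_k, \qquad z_2 = \sum_{j} v_j b_j$$

Jean-Paul Benzécri, father of the Analyse des Données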


library(MASS)
corresp(x, data, ...)

Arguments:
x: A two-way frequency table. Currently accepted forms are matrices, data frames ...
nf: The number of factors to be computed (max. value = min(nrow - 1, ncol - 1)).
