Descriptive Analysis and PCA

Hervé Abdi, The University of Texas at Dallas

[email protected]

Dominique Valentin, ENSBANA/CESG

[email protected]

Back to the yogurt example

Texture
Thickness: consistency of the mass in the mouth
Rate of melt: amount of product melted after a certain pressure of the tongue
Graininess: amount of particles in the mass
Mouth coating: amount of film left on the mouth surfaces

Basic tastes
Sweet: sucrose
Sour: lactic acid
Bitter: caffeine
Salty: sodium chloride

Aroma
Water: tastes watered down
Flour: 1 spoon of flour mixed in water
Wood: shavings from pencil sharpening
Chalk: Smecta
Milk: whole milk
Raw pie crust: commercial raw pie crust
Cream: crème fraîche
Hazelnut: hazelnut powder
Earthy: earth
Mushroom: dried mushrooms soaked in water

9 panelists

5 yogurts: 2 cow's-milk yogurts and 3 soy yogurts

[Figure: example rating scales anchored from "Not at all" (Pas du tout) to "Very" (Très), shown for Bitter (Amer), Salty (Salé), and Astringent.]


Texture

[Figure: bar charts of mean intensity (Intensité moyenne, scale 0 to 10) for each yogurt (sojacarrefour, sojasun, sojade, velouté danone, leaderprice). Panels: Farineux (floury), Épais (thickness), Gras (mouth coating), Fondant (melt), Astringent. Grouping letters (a to d) appear on some bars.]

Taste

[Figure: bar charts of mean intensity (Intensité moyenne, scale 0 to 10) for each yogurt. Panels: Sucré (sweet), Acide (sour), Amer (bitter). Grouping letters (a to d) appear on the bars.]


Aroma

[Figure: bar charts of mean intensity (Intensité moyenne, scale 0 to 10) for each yogurt. Panels: Farine (flour), Craie (chalk), Crème (cream), Noisette (hazelnut). Grouping letters (a to d) appear on some bars.]


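[Figure: circle of correlations on Factor 1 (61.04 %) and Factor 2 (17.84 %), both axes from -0.8 to 0.8. Variables plotted: farineux (floury), epais (thick), gras (mouth coating), fondant (melting), sucre (sweet), acide (sour), astringent, eau (water), farine (flour), bois (wood), craie (chalk), lait (milk), creme (cream), noisette (hazelnut), terreux (earthy), champignon (mushroom).]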

[Figure: projection of the products on Factor 1 (61.04 %, axis from -4.5 to 3.0) and Factor 2 (17.84 %, axis from -2 to 2). Products plotted: soja bio, soja champion, Soja leaderprice, Soja carrefour, Soja bifidus, Soja sun, sojade, Soja délice, carrefour, velouté danone, danone bifidus, Leader price.]

A solution: Principal Component Analysis

What is PCA?

A statistical technique used to transform a number of correlated variables into a smaller number of uncorrelated variables called principal components.

The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.

The mathematical technique used in PCA is called eigen-analysis.
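The definition can be checked numerically. The sketch below is only an illustration under stated assumptions: the data are hypothetical (three correlated variables generated from one latent factor), and the names `latent`, `X`, and `components` are not from the slides. It shows that the components obtained by eigen-analysis are uncorrelated and ordered by decreasing variance.

```python
import numpy as np

# Hypothetical data: 200 observations of 3 variables driven by one latent factor
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = np.hstack([
    latent + 0.3 * rng.normal(size=(200, 1)),
    2.0 * latent + 0.3 * rng.normal(size=(200, 1)),
    -latent + 0.3 * rng.normal(size=(200, 1)),
])

# Center the variables, then run the eigen-analysis of their covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns ascending order
order = np.argsort(eigenvalues)[::-1]            # re-sort: largest variance first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Principal components: projections of the centered data on the eigenvectors
components = Xc @ eigenvectors

# The covariance matrix of the components is diagonal (uncorrelated components)
comp_cov = np.cov(components, rowvar=False)
explained = eigenvalues / eigenvalues.sum()
print(np.round(comp_cov, 6))
print(np.round(explained, 3))
```

With these simulated data, nearly all the variability ends up on the first component, because the three variables share a single source of variation.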

When to use PCA?

To analyze two-dimensional data tables describing I observations by J quantitative variables.

[Diagram: a data table with the observations i = 1, ..., I as rows, the variables j = 1, ..., J as columns, and entries yij.]

Why use PCA?

1. To evaluate the similarity between the observations (here, the products)

2. To detect structure in the relationships between variables (here, the descriptors)

3. To reduce the number of variables to allow for a graphical representation of the data

To give a synthetic description of the products

General principle of PCA

[Diagram: the I × J data table (observations by variables, entries yij) is transformed by diagonalization, or eigen-analysis, into an I × K table of principal components (PC1, ..., PCk, ..., PCK, entries Cpik). The results are displayed as two plots on the PC1 and PC2 axes: the circle of correlations and the projection of observations.]
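The transformation in this diagram can be written compactly. The notation below is an assumption, not taken from the slides: Y denotes the data table after centering each column, S its covariance matrix, U the matrix of eigenvectors, Λ the diagonal matrix of eigenvalues, and F the table of principal components. Dividing by I − 1 is one common convention; other scalings change the eigenvalues but not the eigenvectors.

```latex
% Assumed notation: Y is the I x J data table after centering each column.
\begin{align*}
\mathbf{S} &= \frac{1}{I-1}\,\mathbf{Y}^{\top}\mathbf{Y}
  && \text{(covariance matrix of the } J \text{ variables)} \\
\mathbf{S} &= \mathbf{U}\,\boldsymbol{\Lambda}\,\mathbf{U}^{\top},
  \qquad \mathbf{U}^{\top}\mathbf{U} = \mathrm{Id}
  && \text{(diagonalization, i.e. eigen-analysis)} \\
\mathbf{F} &= \mathbf{Y}\,\mathbf{U}
  && \text{(principal components, entries } Cp_{ik}\text{)}
\end{align*}
```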

A baby example: wine profile

Wine  Amber  Blackcurrant  Coconut  Leather  Musk  Gooseberry  Woody  Vanilla  Raspberry
v1    7      3             1        6        9     3           1      0        2
v2    0      5             1        1        0     7           0      1        6
v3    1      9             0        0        0     6           1      1        5
v4    1      6             7        0        1     6           4      6        4
v5    6      1             8        5        4     2           5      5        1
v6    1      6             5        1        0     5           5      7        6
v7    7      3             1        6        8     2           1      0        2
v8    6      3             0        5        5     3           1      1        3
v9    0      4             4        1        0     7           6      5        5
v10   4      2             6        5        6     2           5      7        1
v11   5      1             4        6        7     1           6      7        2
v12   1      6             0        1        0     5           0      1        8


How to find the principal components?

Step 1: get some data

Step 2: subtract the means of the variables

Step 3: find the eigenvectors and eigenvalues of the covariance matrix

Step 4: find the principal components by projecting the observations onto the eigenvectors

Step 5: compute the loadings as the correlations between the original variables and the principal components
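The five steps can be sketched on the wine profile data given earlier. This is a minimal illustration, not the authors' own computation. The order of the descriptor names is an assumption (the table header is flattened in the source), and the variable names `Yc`, `scores`, and `loadings` are chosen here for illustration.

```python
import numpy as np

# Step 1: get some data (12 wines x 9 descriptors, from the wine profile table).
# The column order below is an assumption about the flattened table header.
descriptors = ["amber", "blackcurrant", "coconut", "leather", "musk",
               "gooseberry", "woody", "vanilla", "raspberry"]
Y = np.array([
    [7, 3, 1, 6, 9, 3, 1, 0, 2],
    [0, 5, 1, 1, 0, 7, 0, 1, 6],
    [1, 9, 0, 0, 0, 6, 1, 1, 5],
    [1, 6, 7, 0, 1, 6, 4, 6, 4],
    [6, 1, 8, 5, 4, 2, 5, 5, 1],
    [1, 6, 5, 1, 0, 5, 5, 7, 6],
    [7, 3, 1, 6, 8, 2, 1, 0, 2],
    [6, 3, 0, 5, 5, 3, 1, 1, 3],
    [0, 4, 4, 1, 0, 7, 6, 5, 5],
    [4, 2, 6, 5, 6, 2, 5, 7, 1],
    [5, 1, 4, 6, 7, 1, 6, 7, 2],
    [1, 6, 0, 1, 0, 5, 0, 1, 8],
], dtype=float)

# Step 2: subtract the mean of each variable
Yc = Y - Y.mean(axis=0)

# Step 3: eigenvectors and eigenvalues of the covariance matrix
cov = np.cov(Yc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending order
order = np.argsort(eigenvalues)[::-1]            # largest eigenvalue first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 4: project the observations onto the eigenvectors
scores = Yc @ eigenvectors

# Step 5: loadings = correlations between variables and the first two components
n_keep = 2
loadings = np.array([
    [np.corrcoef(Y[:, j], scores[:, k])[0, 1] for k in range(n_keep)]
    for j in range(Y.shape[1])
])

for name, row in zip(descriptors, loadings):
    print(f"{name:>12s}  {row[0]: .2f}  {row[1]: .2f}")
```

The rows of `scores` give the coordinates of the wines for the projection plot, and the rows of `loadings` give the coordinates of the descriptors for the circle of correlations.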

A 2D example: step 1 get the data

20 words:

Variable 1 = number of letters in the word

Variable 2 = number of lines used to define the word in the dictionary


A 2D example: step 2 subtract the mean

Y = "length of words", MY = 6, y = (Y − MY)

W = "number of lines of the definition", MW = 8, w = (W − MW)
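As a minimal sketch of step 2, the code below centers the two variables. The 20 values per variable are assumed: they are not reproduced from the slides, and were chosen to be consistent with the means MY = 6 and MW = 8 stated above.

```python
import numpy as np

# Assumed values for the 20 words (not recovered from the slides),
# chosen so that the means match MY = 6 and MW = 8.
Y = np.array([3, 6, 2, 6, 2, 9, 6, 5, 9, 4,
              7, 11, 5, 4, 3, 9, 10, 5, 4, 10], dtype=float)  # word length
W = np.array([14, 7, 11, 9, 9, 4, 8, 11, 5, 8,
              2, 4, 12, 9, 8, 1, 4, 13, 15, 6], dtype=float)  # definition lines

M_Y, M_W = Y.mean(), W.mean()

# Centered variables: deviations from the mean
y = Y - M_Y
w = W - M_W

print(M_Y, M_W)          # 6.0 8.0
print(y.sum(), w.sum())  # 0.0 0.0 (centered variables sum to zero)
```

After centering, the origin of the plot sits at the mean word, which is why the principal axes in the following steps pass through the center of the cloud of points.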


A 2D example: step 3 find the eigenvectors

A 2D example: step 4 project the observations

A 2D example: step 5 compute the loadings

The loadings are Pearson correlation coefficients between the variables and the components:

r(W, F1) = 0.97

r(W, F2) = 0.23

r(Y, F1) = -0.87

r(Y, F2) = 0.50
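Steps 3 to 5 of the 2D example can be sketched as follows. The data values are assumed rather than taken from the slides. They were chosen to be consistent with the means (6 and 8), the eigenvalues (392 and 52), and the loadings quoted in this section. The eigen-analysis is applied to the cross-product matrix of the centered data (no division by I − 1), which is the scaling that yields those eigenvalues. Eigenvector signs are arbitrary, so the code orients them to match the signs of the loadings above.

```python
import numpy as np

# Assumed values for the 20 words (not recovered from the slides)
Y = np.array([3, 6, 2, 6, 2, 9, 6, 5, 9, 4,
              7, 11, 5, 4, 3, 9, 10, 5, 4, 10], dtype=float)  # word length
W = np.array([14, 7, 11, 9, 9, 4, 8, 11, 5, 8,
              2, 4, 12, 9, 8, 1, 4, 13, 15, 6], dtype=float)  # definition lines

# Step 2: centered data table, columns (y, w)
X = np.column_stack([Y - Y.mean(), W - W.mean()])

# Step 3: eigen-analysis of the cross-product matrix X^T X
eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X)  # ascending order
order = np.argsort(eigenvalues)[::-1]                # largest first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Sign convention (arbitrary): make the coefficient of w positive on each axis
eigenvectors = eigenvectors * np.sign(eigenvectors[1, :])

# Step 4: project the observations onto the eigenvectors
F = X @ eigenvectors

# Step 5: loadings as Pearson correlations between variables and components
def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

loadings = {
    ("W", "F1"): pearson(W, F[:, 0]),
    ("W", "F2"): pearson(W, F[:, 1]),
    ("Y", "F1"): pearson(Y, F[:, 0]),
    ("Y", "F2"): pearson(Y, F[:, 1]),
}

print(np.round(eigenvalues))  # [392.  52.]
for (var, comp), value in loadings.items():
    print(f"r({var}, {comp}) = {value:.2f}")
```

For each variable, the squared loadings on F1 and F2 sum to 1 here, because two components fully describe two variables.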


A 2D example: draw the circle of correlation

[Figure: circle of correlations on the axes F1 and F2, with W plotted at (0.97, 0.23) and Y at (-0.87, 0.50).]

How to compute the explained variance?

Component  Eigenvalue  % variance  Cumulated % variance
1          392         88          88
2          52          12          100
Total      444

(392 / 444) × 100 = 88%
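The same arithmetic, as a short sketch using the two eigenvalues from the table (the variable names are chosen here for illustration):

```python
import numpy as np

# Eigenvalues from the 2D example
eigenvalues = np.array([392.0, 52.0])

# Percentage of variance explained by each component, and its running total
percent = 100 * eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(percent)

print(np.round(percent))     # [88. 12.]
print(np.round(cumulative))  # [ 88. 100.]
```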

How many components to keep?

The Kaiser criterion: retain only components with eigenvalues greater than 1.

The scree test: plot the eigenvalues in decreasing order and keep the components that come before the curve levels off.

Common sense: keep dimensions that are interpretable.

Examine several solutions and choose the one that makes the most sense.

[Figure: scree plot of the eigenvalues (vertical axis from 0 to 4) against the component number (1 to 8).]
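A minimal sketch of the Kaiser criterion and of the quantities a scree plot displays. The eigenvalues are hypothetical (they are not read from the figure) and are assumed to come from a PCA of 8 standardized variables, the setting in which the threshold of 1 equals the average eigenvalue.

```python
import numpy as np

# Hypothetical eigenvalues for 8 standardized variables (they sum to 8)
eigenvalues = np.array([3.6, 1.9, 1.1, 0.6, 0.4, 0.2, 0.1, 0.1])

# Kaiser criterion: keep components whose eigenvalue is greater than 1
n_kaiser = int(np.sum(eigenvalues > 1.0))

# Scree test: inspect the drop from each eigenvalue to the next;
# the curve "levels off" where the drops become small
drops = -np.diff(eigenvalues)

# Share of the total variance retained by the kept components
retained = 100 * eigenvalues[:n_kaiser].sum() / eigenvalues.sum()

print(n_kaiser)            # 3
print(np.round(drops, 1))
print(round(retained, 1))  # 82.5
```

The criteria are guides, not rules: a retained component still needs to be interpretable.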

Should I normalize the data?

Yes, if the variables are not measured on the same scale.

Otherwise it depends:

Normalized: same weight for all variables

Not normalized: weight proportional to standard deviation
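The effect of normalization can be illustrated with a small sketch. The data are hypothetical: two unrelated variables whose standard deviations differ by a factor of about 100. The helper `pca_eigenvalues` is a name introduced here, not an existing library function.

```python
import numpy as np

# Hypothetical data: two unrelated variables on very different scales
rng = np.random.default_rng(1)
n = 100
a = rng.normal(0, 1, n)    # standard deviation about 1
b = rng.normal(0, 100, n)  # standard deviation about 100
X = np.column_stack([a, b])

def pca_eigenvalues(data, normalize):
    """Eigenvalues (largest first) of the covariance matrix of the
    centered data, optionally after dividing each variable by its
    standard deviation."""
    centered = data - data.mean(axis=0)
    if normalize:
        centered = centered / centered.std(axis=0, ddof=1)
    cov = np.cov(centered, rowvar=False)
    return np.sort(np.linalg.eigvalsh(cov))[::-1]

raw = pca_eigenvalues(X, normalize=False)
normalized = pca_eigenvalues(X, normalize=True)

# Share of the total variance carried by the first component
share_raw = raw[0] / raw.sum()
share_normalized = normalized[0] / normalized.sum()

print(round(share_raw, 4))
print(round(share_normalized, 4))
```

Without normalization the first component is almost entirely the large-scale variable. After normalization the eigenvalues sum to the number of variables and neither variable dominates.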
