Dimensionality Reduction & Latent Variable Modelling
Andreas Damianou
University of Sheffield, UK
Workshop on Data Science in Africa, 17th June, 2015
Dedan Kimathi University of Technology, Kenya
Slides: adamian.github.io/talks/Damianou_dimRedKenya15.pdf
Working with data
● Data-science: everything revolves around a dataset
● Dataset: the set of data (to be) collected for our algorithms to learn from
● Example: animals dataset

              height   weight
  animal 1      33       20
  animal 2      40       28
  animal 3      28       12
  animal 4      70       36
  animal 5      15        7

Y = this table written as a matrix: 5 rows (instances), 2 columns (features)
What is “dimensionality”?
● Simply, the number of features used to describe each instance.
● In the previous example: height and weight.
● The number of features depends on our selection or limitations during data collection.
Notation
It's convenient to use notation from linear algebra.

  Y = [ y_{1,1}  y_{1,2}  ...  y_{1,d}
        y_{2,1}  y_{2,2}  ...  y_{2,d}
          ...
        y_{n,1}  y_{n,2}  ...  y_{n,d} ]

n rows → n instances
d columns → d features (dimensions)
So, the matrix Y contains d-dimensional data
High-dimensional data
● Data having a large number of features, d
● Examples
  – Micro-array data
  – Images

Y = the same n × d matrix as before, but now with a large number of columns d.
Properties of high-dimensional data

[Figure: photos of British and non-British people, used as an informal classification example.]

  Name       Height
  Neil:      1.97
  John:      1.94
  Mike:      1.89
  Ciira:     1.76
  Andreas:   1.74

[Figure: the five heights on a 1D axis from 1.70 to 2.00, labelled A, C, M, J, N, plus a new instance y*.]

Nearest neighbour classification for a new instance y*:
Distance from C to M: sqrt((C_h – M_h)^2) = |1.76 – 1.89| = 0.13
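The nearest-neighbour idea on this 1D height data can be sketched in a few lines of Python. The names and heights come from the slide; the query value 1.80 is made up for illustration:

```python
# Heights (in metres) from the slide.
heights = {"Neil": 1.97, "John": 1.94, "Mike": 1.89, "Ciira": 1.76, "Andreas": 1.74}

def nearest_neighbour(y_star, data):
    """Return the name whose height is closest to the new instance y_star."""
    return min(data, key=lambda name: abs(data[name] - y_star))

print(nearest_neighbour(1.80, heights))  # Ciira (|1.80 - 1.76| = 0.04 is smallest)
```

With only one feature, "closest" is just the absolute difference of two numbers; the next slides show what happens as features are added.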
![Page 13: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set](https://reader034.vdocuments.mx/reader034/viewer/2022050305/5f6d74bd70cf2159b869523c/html5/thumbnails/13.jpg)
Adding another feature: weight
J
A
C
MN
Distance from C to M: sqrt((C_h – M_h)^2 + (C_w – M_w)^2) = 16
![Page 14: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set](https://reader034.vdocuments.mx/reader034/viewer/2022050305/5f6d74bd70cf2159b869523c/html5/thumbnails/14.jpg)
Adding another feature: “beardness”
Distance from C to M: sqrt((C_h – M_h)^2 + (C_w – M_w)^2 + (C_b – M_b)^2 ) = 65
A
C
M
J
N
![Page 15: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set](https://reader034.vdocuments.mx/reader034/viewer/2022050305/5f6d74bd70cf2159b869523c/html5/thumbnails/15.jpg)
Adding another feature: “beardness”
Distance from C to M: sqrt((C_h – M_h)^2 + (C_w – M_w)^2 + (C_b – M_b)^2 ) = 65
A
C
M
J
N
Show demo!! (humanClassif)
![Page 16: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set](https://reader034.vdocuments.mx/reader034/viewer/2022050305/5f6d74bd70cf2159b869523c/html5/thumbnails/16.jpg)
The (potential) problem
● What happens as the dimensionality grows?–
–
● Why is that a problem?–
● Is that always a problem?–
![Page 17: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set](https://reader034.vdocuments.mx/reader034/viewer/2022050305/5f6d74bd70cf2159b869523c/html5/thumbnails/17.jpg)
The (potential) problem
● What happens as the dimensionality grows?– Distances grow bigger
– Everything seems to be spread apart
● Why is that a problem?–
● Is that always a problem?–
![Page 18: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set](https://reader034.vdocuments.mx/reader034/viewer/2022050305/5f6d74bd70cf2159b869523c/html5/thumbnails/18.jpg)
The (potential) problem
● What happens as the dimensionality grows?– Distances grow bigger
– Everything seems to be spread apart
● Why is that a problem?– Think about doing nearest neighbour classification. For a test
instance, everything would just be super far...
● Is that always a problem?–
![Page 19: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set](https://reader034.vdocuments.mx/reader034/viewer/2022050305/5f6d74bd70cf2159b869523c/html5/thumbnails/19.jpg)
The (potential) problem
● What happens as the dimensionality grows?– Distances grow bigger
– Everything seems to be spread apart
● Why is that a problem?– Think about doing nearest neighbour classification. For a test
instance, everything would just be super far...
● Is that always a problem?– No, but with real, “dirty” data, it can often be
![Page 20: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set](https://reader034.vdocuments.mx/reader034/viewer/2022050305/5f6d74bd70cf2159b869523c/html5/thumbnails/20.jpg)
The (potential) problem
● What happens as the dimensionality grows?– Distances grow bigger
– Everything seems to be spread apart
● Why is that a problem?– Think about doing nearest neighbour classification. For a test
instance, everything would just be super far...
● Is that always a problem?– No, but with real, “dirty” data, it can often be
–
● Another problem we'll see later: noise in the data
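The "everything seems to be spread apart" effect can be checked numerically. A minimal sketch, using random Gaussian points as a stand-in for a real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# As d grows, pairwise Euclidean distances between random points grow
# (roughly like sqrt(d)) and concentrate: the ratio std/mean shrinks,
# so every point looks roughly equally far from every other point.
stats = {}
for d in [1, 10, 100, 1000]:
    X = rng.standard_normal((50, d))
    diffs = X[:, None, :] - X[None, :, :]            # all pairwise differences
    dists = np.sqrt((diffs ** 2).sum(-1))[np.triu_indices(50, k=1)]
    stats[d] = (dists.mean(), dists.std() / dists.mean())
    print(d, round(stats[d][0], 2), round(stats[d][1], 3))
```

The mean distance grows with d while the relative spread of distances shrinks, which is exactly what hurts nearest-neighbour methods.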
Why do dimensionality reduction?
● Pre-process data for another task (e.g. classification)
● Compression (lossy)
● Visualisation
● Data understanding / cleaning
Compression

If one column of Y is a function of the others, e.g. a third column holding h + w:

  Y = [ h_1  w_1  h_1+w_1
        h_2  w_2  h_2+w_2
         ...
        h_5  w_5  h_5+w_5 ]

then that column is redundant!! It can be dropped and reconstructed exactly.
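Assuming the toy animal values from the earlier table, the redundancy can be verified numerically: a column equal to h + w adds a column but no new information, so the matrix rank stays at 2.

```python
import numpy as np

# Toy animal data (heights and weights from the earlier slide).
h = np.array([33., 40., 28., 70., 15.])
w = np.array([20., 28., 12., 36., 7.])

# Append a redundant third column that is just h + w.
Y = np.column_stack([h, w, h + w])

# Three columns, but only rank 2: we can store 2 columns and
# reconstruct the third exactly, i.e. lossless compression here.
print(Y.shape, np.linalg.matrix_rank(Y))
```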
Spring-ball example
On the board!
Clustering vs Dimensionality Reduction

Clustering:                n × d matrix → m × d matrix, with m < n (fewer instances)
Dimensionality Reduction:  n × d matrix → n × k matrix, with k < d (fewer features)
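The shape contrast can be sketched directly. This is only an illustration of the matrix sizes involved: the "centroids" and the projection below are placeholders, not real clustering or PCA.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((100, 5))        # n = 100 instances, d = 5 features

# Clustering compresses the rows: n x d -> m x d (m < n summary points).
m = 3
centroids = Y[rng.choice(100, m, replace=False)]   # trivial stand-in for k-means
print(centroids.shape)                             # (3, 5)

# Dimensionality reduction compresses the columns: n x d -> n x k (k < d).
k = 2
X = Y @ rng.standard_normal((5, k))                # some linear projection
print(X.shape)                                     # (100, 2)
```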
Dimensionality reduction in action

1st way: One solution is to just drop one of the features.

  Y = [ h_1  w_1            X' = [ h_1
        h_2  w_2    →              h_2
         ...                       ...
        h_5  w_5 ]                 h_5 ]
      height weight              height

[Figure: the five instances A–E plotted in (h, w) space, and their projection onto the h axis alone.]
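In code, the "1st way" is just column selection on the toy animal data:

```python
import numpy as np

# Toy animal data: columns are (height, weight).
Y = np.array([[33., 20.], [40., 28.], [28., 12.], [70., 36.], [15., 7.]])

# Keep only the height column; the weight information is simply thrown away.
X_drop = Y[:, [0]]
print(X_drop.shape)   # (5, 1)
```

Cheap, but any structure carried by the dropped feature is lost completely.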
Dimensionality reduction in action

2nd way: Another solution is to transform the two features into one.

  Y = [ h_1  w_1            X'' = [ s_1
        h_2  w_2    →               s_2
         ...                        ...
        h_5  w_5 ]                  s_5 ]
      height weight               size

  s = f(h, w)

[Figure: the five instances A–E in (h, w) space projected onto a single new axis s.]

But how do we find s?
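One simple family of transforms f is a linear combination: project each row onto a unit direction u, so s = u_h·h + u_w·w. The direction used here is an arbitrary, hypothetical choice just to show the mechanics; the next slides show how PCA chooses it.

```python
import numpy as np

# Toy animal data: columns are (height, weight).
Y = np.array([[33., 20.], [40., 28.], [28., 12.], [70., 36.], [15., 7.]])

u = np.array([0.6, 0.8])   # hypothetical unit direction (0.36 + 0.64 = 1)
s = Y @ u                  # one "size" number per animal: s = f(h, w)
print(s.shape)             # (5,)
```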
Principal Component Analysis
Run demo!! (pcaPlot)
Eigendecomposition
● Gives a feeling of the properties of the matrix

[Figure: a 2D cloud of points with its two principal axes u1 and u2 drawn through it.]

● u1 and u2 define the axes with maximum variances, where the data is most spread
● To reduce the dimensionality, we project the data onto the axis where the data is most spread
● There is no class information given
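This recipe can be sketched directly with an eigendecomposition of the covariance matrix (synthetic correlated data stands in for a real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2D data: most of the variance lies along one direction.
Y = rng.standard_normal((200, 2)) @ np.array([[3., 1.], [0., 0.5]])
Y = Y - Y.mean(axis=0)                 # centre the data first

# Eigendecomposition of the covariance matrix gives the principal axes.
C = Y.T @ Y / len(Y)
eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
u1 = eigvecs[:, -1]                    # u1: axis of maximum variance

# Project onto u1: 2D -> 1D while keeping most of the spread.
X = Y @ u1
print(round(eigvals[-1] / eigvals.sum(), 3))   # fraction of variance kept
```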
● We need to optimise in order to minimise the distance between the true point and its reconstruction:

  x* = argmin_x || y – rec(x) ||^2

  where y is 1 × d and x is 1 × q.

● rec(x) = x W', where W' denotes the transpose of W
● What is the dimensionality of W'? Answer: q × d (so that x W' is 1 × d)
● Then: x = y W, where W is d × q
● We can determine both x and W by solving an optimisation problem (an eigenvalue problem)
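The code/reconstruct pair x = y W, rec(x) = x Wᵀ can be sketched with W taken as the top-q eigenvectors of the covariance, which is the eigenvalue-problem solution mentioned above (synthetic data, row-vector convention as on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 3))
Y = Y - Y.mean(axis=0)                 # centred n x d data, d = 3

# W: the top-q eigenvectors of the covariance, shape d x q.
q = 2
_, eigvecs = np.linalg.eigh(Y.T @ Y / len(Y))
W = eigvecs[:, -q:]

X = Y @ W            # codes:           n x q   (x = y W)
Y_rec = X @ W.T      # reconstructions: n x d   (rec(x) = x W')
err = np.mean((Y - Y_rec) ** 2)
print(X.shape, Y_rec.shape, round(err, 4))
```

The reconstruction error equals the variance along the discarded eigenvector, which is exactly what the argmin above minimises.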
● When class labels are given, PCA cannot take them into account... but we hope that the natural separation of the data (see clustering) will encode this information

[Figure: a 2D dataset with two classes ('x' and '+') forming two natural clusters.]
Principal Component Analysis
• Remember: data are noisy!
• Trade-off: reduce size / noise without losing too much information

[Figures: projections 2D → 1D and 3D → 2D.]
But what about this case...

[Figure: two elongated, parallel clouds of points from two classes ('x' and 'o'). PCA projects onto the direction of maximum variance, which runs along both clouds and mixes the classes; LDA picks a direction that keeps the two classes apart.]

In discriminant analysis, we want to maximise the spread between classes.
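For two classes, Fisher's LDA direction has a closed form, w ∝ S_w⁻¹(μ₁ − μ₀), where S_w is the within-class scatter. A minimal sketch on synthetic data shaped like the figure (two elongated, parallel clouds):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two elongated, parallel classes, offset along the short axis.
base = rng.standard_normal((100, 2)) @ np.array([[4., 0.], [0., 0.3]])
X0 = base[:50]                          # class 0
X1 = base[50:] + np.array([0., 2.])     # class 1, shifted in the 2nd feature

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
Sw = np.cov(X0.T) + np.cov(X1.T)        # within-class scatter
w = np.linalg.solve(Sw, mu1 - mu0)      # Fisher direction: Sw^{-1} (mu1 - mu0)
w = w / np.linalg.norm(w)

# Class means projected onto w are well separated, unlike on the
# PCA axis, which here runs along the long (class-mixing) direction.
gap = abs((mu1 - mu0) @ w)
print(round(gap, 2))
```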
Latent variable models
● Non-probabilistic approach: map the data directly to the latent space, x = y W (as in PCA)
● Probabilistic approach:
  Learn “backwards”: model the relationship between Y and X:
  p(Y|X). This comes from: y = Wx + ε
  Then, we can get the posterior distribution:
  p(X|Y)

[Figure: graphical model with the latent X generating the observed Y.]
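The slide leaves the priors unstated; assuming the standard probabilistic-PCA choices x ~ N(0, I) and ε ~ N(0, σ²I), the posterior p(x|y) is Gaussian with a closed form, N(M⁻¹Wᵀy, σ²M⁻¹) where M = WᵀW + σ²I. A sketch (column-vector convention):

```python
import numpy as np

# Posterior p(x | y) for the linear-Gaussian model y = W x + eps,
# with x ~ N(0, I) and eps ~ N(0, sigma2 * I)  (probabilistic PCA).
# Column vectors here: y is d-dim, x is q-dim, W is d x q.
d, q, sigma2 = 3, 2, 0.1
rng = np.random.default_rng(0)
W = rng.standard_normal((d, q))
y = rng.standard_normal(d)

M = W.T @ W + sigma2 * np.eye(q)        # q x q
mean = np.linalg.solve(M, W.T @ y)      # posterior mean of x
cov = sigma2 * np.linalg.inv(M)         # posterior covariance of x
print(mean.shape, cov.shape)            # (2,) (2, 2)
```

Unlike the deterministic projection x = y W, the posterior also carries uncertainty about the latent point via its covariance.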
Linear vs Non-linear
(Slide from Neil Lawrence.)