Bits of Machine Learning
Part 2: Unsupervised Learning
Alexandre Proutiere and Vahan Petrosyan
KTH (The Royal Institute of Technology)
Outline of the Course
1. Supervised Learning
• Regression and Classification
• Neural Networks (Deep Learning)
2. Unsupervised Learning
• Clustering
• Reinforcement Learning
Clustering
Clustering extracts structures from unlabelled data $\{X_i,\ i = 1, \ldots, n\}$, i.e.,
groups similar data points into a small set of clusters
Clustering: Applications
• Historical application: find geographical correlations in cholera
deaths in London (John Snow, 1854)
• Image segmentation: identify regions of interests in an image
• Marketing: detect groups of customers to design targeted ads
• Insurance: Identify groups of users with high risk
• Online services and the web: e.g., cluster webpages based on their
content
• Bioinformatics: group proteins w.r.t. their functionality and
chemical structure
...
Algorithms
Commonly used clustering algorithms
• Center-based approaches (K-means, K-medians, Gaussian mixture
models, affinity propagation)
• Hierarchical/agglomerative approaches (single linkage, complete
linkage, average linkage, Ward's method, BIRCH)
• Graph-based approaches (spectral clustering)
• Density-based approaches (DBSCAN, OPTICS)
• Bayesian nonparametric approaches (Dirichlet processes, Pitman-Yor
processes)
...
Clustering evaluation
1. Performance benchmarks: artificially generate (clustered) data
and evaluate the number of misclassified points
2. From the data only: a hard problem. What is a good clustering?
- Intra-cluster cohesion: measures how close to each other the data points in the
same cluster are
Example: the cohesion of cluster $C$ could be $\sum_{i,j \in C} \|X_i - X_j\|^2$
The K-means algorithm finds, over the possible clusters $C_1, \ldots, C_K$, local
minima of
$$\sum_{k=1}^K \sum_{i \in C_k} \|X_i - \mu_k\|^2$$
where $\mu_k$ is the center of cluster $k$
- Inter-cluster separation: distance between points of different clusters
Clustering evaluation
2. From the data only: Graph-based approach. Let $V = \{1, \ldots, n\}$ and define a
complete weighted graph from the data, with weight $w_{ij} = f(\|X_i - X_j\|)$ on
edge $(i, j)$
Example: Normalized cut. The Ncut of $(C_1, C_2)$ is
$$\mathrm{Ncut}(C_1, C_2) = \frac{\mathrm{cut}(C_1, C_2)}{\mathrm{cut}(C_1, V)} + \frac{\mathrm{cut}(C_2, C_1)}{\mathrm{cut}(C_2, V)},$$
where $\mathrm{cut}(A, B) = \sum_{i \in A, j \in B} w_{ij}$
We may wish to find clusters minimising the Ncut (NP-hard)
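To make the definition concrete, here is a small NumPy sketch (ours, not part of the lecture material) that evaluates the Ncut of a given two-way partition from a similarity matrix W:

```python
import numpy as np

def cut(W, A, B):
    # cut(A, B) = sum of the weights w_ij over i in A, j in B
    return W[np.ix_(list(A), list(B))].sum()

def ncut(W, C1, C2):
    # Ncut(C1, C2) = cut(C1, C2)/cut(C1, V) + cut(C2, C1)/cut(C2, V)
    V = list(range(W.shape[0]))
    return cut(W, C1, C2) / cut(W, C1, V) + cut(W, C2, C1) / cut(W, C2, V)
```

Minimising this quantity over all partitions is the NP-hard problem mentioned above; spectral clustering, introduced later in this lecture, can be viewed as a relaxation of this objective.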
K-means (Lloyd 1957, MacQueen 1967)
• K-means aims at finding clusters $C = \{C_1, \ldots, C_K\}$ solving:
$$\min_C \sum_{k=1}^K \sum_{i \in C_k} \|X_i - \mu_k\|^2$$
where $K$ is the number of clusters, and $\mu_k$ is the center of the $k$-th
cluster.
• Solving the K-means clustering problem in $\mathbb{R}^d$ is:
1. NP-hard in general dimension $d$, even for 2 clusters
2. NP-hard for a general number of clusters $K$, even in the plane
3. When $K$ and $d$ are fixed, the problem can be solved in time
$O(n^{dK+1} \log n)$
K-means
1. Initialisation: pick K random points as initial cluster centers
2. Cluster improvement: while the cluster assignment changes, do:
a. Assign each data point to the cluster with the closest center
b. Update each cluster center to the average position of the points
within the cluster
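As an illustration, here is a minimal NumPy sketch of this loop (our own code with hypothetical names, not the lab's implementation):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal sketch of Lloyd's K-means iteration described above."""
    rng = np.random.default_rng(seed)
    # 1. Initialisation: pick K random data points as initial centers
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # 2a. Assign each point to the cluster with the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignment unchanged: converged
        labels = new_labels
        # 2b. Move each center to the average position of its points
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers
```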
K-means algorithm: Properties
• Guaranteed to converge in a finite number of iterations
• Running time per iteration: O(nKd)
• Issues
1. Can be stuck in a bad local minimum
2. Different initialisations yield different solutions
3. The number of clusters K needs to be specified in advance
For approaches addressing the initialisation issue and the choice of the number
of clusters, see: V. Petrosyan and A. Proutiere, "Viral Clustering: A
Robust Method to Extract Structures in Heterogeneous Datasets", AAAI
2016
Gaussian Mixture Models (GMM) and the EM Algorithm
Approach: assume that the data points are generated according to a
GMM in an i.i.d. manner, and estimate the parameters of the GMM
GMM: Each point is distributed according to the distribution with density
$p(x|\theta)$, where $\theta = (\theta_k)_{k=1,\ldots,K}$ and $\theta_k = (\alpha_k, \mu_k, \Sigma_k)$:
$$p(x|\theta) = \sum_{k=1}^K \alpha_k p_G(x|\theta_k)$$
where
$$p_G(x|\theta_k) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)\right)$$
$\alpha_k$: probability that a point is attached to cluster $k$
$\mu_k$: center of cluster $k$
$\Sigma_k$: covariance matrix of the $k$-th component of the GMM
The Expectation-Maximisation algorithm
Dempster-Laird-Rubin, 1977
Aims at maximising the log-likelihood of the model given the data
Two alternating steps:
1. E-step: For a given estimate of $\theta$, evaluate the probability that a data
point belongs to each cluster. The membership weight of
$X_i$ in cluster $k$ is:
$$w_{ik} = \frac{\alpha_k p_G(X_i|\theta_k)}{\sum_{m=1}^K \alpha_m p_G(X_i|\theta_m)}$$
2. M-step: Update $\theta$ so as to increase the log-likelihood
The Expectation-Maximisation algorithm
Initialisation: select $\alpha_k^1, \mu_k^1, \Sigma_k^1$ arbitrarily, $t = 1$
Until convergence, do:
1. E-step: Compute for all $i$ and $k$
$$w_{ik} = \frac{\alpha_k^t p_G(X_i|\mu_k^t, \Sigma_k^t)}{\sum_{m=1}^K \alpha_m^t p_G(X_i|\mu_m^t, \Sigma_m^t)}, \qquad N_k = \sum_{i=1}^n w_{ik}$$
2. M-step: Update the parameters as follows: for all $k$,
$$\alpha_k^{t+1} = \frac{N_k}{n}, \qquad \mu_k^{t+1} = \frac{1}{N_k} \sum_{i=1}^n w_{ik} X_i$$
$$\Sigma_k^{t+1} = \frac{1}{N_k} \sum_{i=1}^n w_{ik} (X_i - \mu_k^{t+1})(X_i - \mu_k^{t+1})^\top$$
3. $t := t + 1$
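A compact NumPy/SciPy sketch of these updates (ours; variable names mirror the formulas, and the initialisation below is just one arbitrary choice):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialisation: uniform weights, random centers, identity covariances
    alpha = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)].astype(float)
    Sigma = np.stack([np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: w[i, k] proportional to alpha_k * p_G(X_i | mu_k, Sigma_k)
        w = np.column_stack([alpha[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                             for k in range(K)])
        w /= w.sum(axis=1, keepdims=True)
        Nk = w.sum(axis=0)
        # M-step: re-estimate weights, centers and covariances
        alpha = Nk / n
        mu = (w.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (w[:, k, None] * Xc).T @ Xc / Nk[k]
    return alpha, mu, Sigma
```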
EM algorithm: Properties
• Convergence: "On the Convergence Properties of the EM
Algorithm", C.F.J. Wu, Annals of Statistics, 1983
• Issues
1. Can be stuck in a bad local minimum
2. Suffers from the curse of dimensionality (not suitable for high
dimensional data)
Spectral Clustering
K-means and the EM algorithm for GMM are center-based algorithms.
How to deal with clusters that do not look like potatoes?
... apply spectral methods on similarity matrices or graphs!
Spectral method: work with eigenvalues and eigenvectors of the similarity
matrix
Spectral Clustering
Data: $X_1, \ldots, X_n \in \mathbb{R}^d$
Similarity matrix: $W = (w_{ij}) \in \mathbb{R}^{n \times n}$ with, for example (Gaussian
kernel),
$$\forall i, j, \quad w_{ij} = 1_{i \neq j} \exp\left(-\frac{\|X_i - X_j\|^2}{\sigma}\right)$$
Spectral analysis: we take the top-$K$ eigenvectors of $W$ (those
corresponding to the largest singular values)
The spectral idea: the data points are well represented by their
projections on the linear span of these eigenvectors
Spectral Clustering
”On spectral clustering: Analysis and an algorithm”, A. Ng, M. Jordan,
Y. Weiss, NIPS 2001
1. Input: $W \in \mathbb{R}^{n \times n}$, $K \in \{1, 2, \ldots, n\}$
2. Let $D$ be a diagonal matrix with $D_{ii} := \sum_{j=1}^n w_{ij}$
3. $L := D^{-1/2} W D^{-1/2}$
4. $E = [e_1, e_2, \ldots, e_K] := \mathrm{eigs}(L, K) \in \mathbb{R}^{n \times K}$
5. Normalize each row of $E$
6. $y :=$ K-means$(E, K)$
7. Output: $y$
eigs(L,K) returns the top K eigenvectors of L
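Here is a minimal NumPy sketch of these steps (ours; it reuses the kmeans sketch given earlier, and uses eigh since L is symmetric):

```python
import numpy as np

def spectral_clustering(W, K):
    # 2.-3. Normalised similarity matrix L = D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ W @ D_inv_sqrt
    # 4. Top-K eigenvectors of L (eigh sorts eigenvalues in ascending
    #    order, so the top-K eigenvectors are the last K columns)
    _, vecs = np.linalg.eigh(L)
    E = vecs[:, -K:]
    # 5. Normalise each row of E to unit length
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    # 6.-7. Cluster the rows of E with K-means
    labels, _ = kmeans(E, K)
    return labels
```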
Tuning Spectral Clustering
We may consider different notions of similarity:
1. Gaussian kernel: $w_{ij} = e^{-\|X_i - X_j\|^2/\sigma}$ for $i \neq j$ and $w_{ii} = 0$. Choice of
$\sigma$?
2. $k$-nearest-neighbour similarity: $w_{ij} = 1$ if $i$ is one of the $k$ nearest
neighbours of $j$, or if $j$ is one of the $k$ nearest neighbours of $i$.
Choice of $k$?
3. Self-tuning kernel: $w_{ij} = e^{-\|X_i - X_j\|^2 / (\|X_i - X_{i_m}\| \cdot \|X_j - X_{j_m}\|)}$ for $i \neq j$ and
$w_{ii} = 0$, where $i_m$ is the index of the $m$-th nearest neighbour of $i$.
Choice of $m$?
[Play with these parameters in the lab ... :-) ]
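For instance, the self-tuning similarity above can be built in a few lines (our sketch; m = 7 is just a placeholder default, and choosing it well is precisely the tuning question raised above):

```python
import numpy as np

def self_tuning_similarity(X, m=7):
    # Pairwise distances ||X_i - X_j||
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Local scale: distance from each point to its m-th nearest neighbour
    # (after sorting each row, column 0 is the point itself)
    scale = np.sort(D, axis=1)[:, m]
    W = np.exp(-D**2 / (scale[:, None] * scale[None, :]))
    np.fill_diagonal(W, 0.0)
    return W
```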
Outline of the Course
1. Supervised Learning
• Regression and Classification
• Neural Networks (Deep Learning)
2. Unsupervised Learning
• Clustering
• Reinforcement Learning
Reinforcement learning
Learning optimal sequential behaviour / control from interacting with the
environment
State dynamics:
$$s_{t+1}^\pi = F_t(s_t^\pi, a_t^\pi)$$
Reinforcement learning
Learning optimal sequential behaviour / control from interacting with the
environment
[. . .]
By the time we learn to live
It’s already too late
Our hearts cry in unison at night
[. . .]
Louis Aragon
Reinforcement learning: Applications
• Making a robot walk
• Portfolio optimisation
• Playing games better than humans
• Helicopter stunt manoeuvres
• Optimal communication protocols in radio networks
• Display ads
• Search engines
• ...
1. Bandit Optimisation
State dynamics:
$$s_{t+1}^\pi = F_t(s_t^\pi, a_t^\pi)$$
• Interact with an i.i.d. or adversarial environment
• The reward is independent of the state and is the only feedback:
- i.i.d. environment: $r_t(a, s) = r_t(a)$, a random variable with mean $\theta_a$
- adversarial environment: $r_t(a, s) = r_t(a)$ is arbitrary!
2. Markov Decision Process (MDP)
State dynamics:
$$s_{t+1}^\pi = F_t(s_t^\pi, a_t^\pi)$$
• History at $t$: $h_t^\pi = (s_1^\pi, a_1^\pi, \ldots, s_{t-1}^\pi, a_{t-1}^\pi, s_t^\pi)$
• Markovian environment: $\mathbb{P}[s_{t+1}^\pi = s' \mid h_t^\pi, s_t^\pi = s, a_t^\pi = a] = p(s'|s, a)$
• Stationary deterministic rewards (for simplicity): $r_t(a, s) = r(a, s)$
What is to be learnt and optimised?
• Bandit optimisation: the average rewards of the actions are unknown
Information available at time $t$ under $\pi$:
$a_1^\pi, r_1(a_1^\pi), \ldots, a_{t-1}^\pi, r_{t-1}(a_{t-1}^\pi)$
• MDP: the state dynamics $p(\cdot|s, a)$ and the reward function $r(a, s)$
are unknown
Information available at time $t$ under $\pi$:
$s_1^\pi, a_1^\pi, r_1(a_1^\pi, s_1^\pi), \ldots, s_{t-1}^\pi, a_{t-1}^\pi, r_{t-1}(a_{t-1}^\pi, s_{t-1}^\pi), s_t^\pi$
• Objective: maximise the cumulative reward
$$\sum_{t=1}^T \mathbb{E}[r_t(a_t^\pi, s_t^\pi)] \quad \text{or} \quad \sum_{t=1}^\infty \lambda^t \mathbb{E}[r_t(a_t^\pi, s_t^\pi)]$$
Regret
• Difference between the cumulative reward of an "Oracle" policy and
that of the agent π
• Regret quantifies the price to pay for learning!
• Exploration vs. exploitation trade-off: we need to probe all actions
to play the best later ...
1. Bandit Optimisation
First application: Clinical trial, Thompson 1933
- A set of possible actions at each step
- Unknown sequence of rewards for each action
- Bandit feedback: only rewards of chosen actions are observed
- Goal: maximise the cumulative reward (up to step T )
Two examples:
a. Finite number of actions, stochastic rewards
b. Continuous actions, concave adversarial rewards
a. Stochastic bandits – Robbins 1952
• Finite set of actions A
• (Unknown) rewards of action $a \in A$: $(r_t(a), t \geq 0)$, i.i.d. Bernoulli
with $\mathbb{E}[r_t(a)] = \theta_a$
• Optimal action: $a^\star \in \arg\max_a \theta_a$
• Online policy $\pi$: select action $a_t^\pi$ at time $t$ depending on
$a_1^\pi, r_1(a_1^\pi), \ldots, a_{t-1}^\pi, r_{t-1}(a_{t-1}^\pi)$
• Regret up to time $T$: $R^\pi(T) = T\theta_{a^\star} - \sum_{t=1}^T \theta_{a_t^\pi}$
a. Stochastic bandits
Fundamental performance limits (Lai and Robbins, 1985):
For any reasonable $\pi$:
$$\liminf_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \geq \sum_{a \neq a^\star} \frac{\theta_{a^\star} - \theta_a}{\mathrm{KL}(\theta_a, \theta_{a^\star})}$$
where $\mathrm{KL}(a, b) = a \log(\frac{a}{b}) + (1-a) \log(\frac{1-a}{1-b})$ (the KL divergence between
Bernoulli distributions)
Algorithms:
(i) $\varepsilon$-greedy: linear regret
(ii) $\varepsilon_t$-greedy: logarithmic regret ($\varepsilon_t = 1/t$)
(iii) Upper Confidence Bound (UCB) algorithm: play the action with the highest
index
$$b_a(t) = \hat{\theta}_a(t) + \sqrt{\frac{2 \log(t)}{n_a(t)}}$$
$\hat{\theta}_a(t)$: empirical reward of $a$ up to $t$
$n_a(t)$: number of times $a$ has been played up to $t$
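A short simulation sketch of the UCB index policy (ours; draw_reward(a) is a hypothetical function returning a Bernoulli reward with mean theta_a):

```python
import numpy as np

def ucb(draw_reward, n_actions, T):
    n = np.zeros(n_actions)   # n_a(t): number of times each action was played
    s = np.zeros(n_actions)   # cumulative reward of each action
    total = 0.0
    for t in range(1, T + 1):
        if t <= n_actions:
            a = t - 1         # play each action once to initialise the indices
        else:
            b = s / n + np.sqrt(2 * np.log(t) / n)   # index b_a(t)
            a = int(np.argmax(b))
        r = draw_reward(a)
        n[a] += 1
        s[a] += r
        total += r
    return total
```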
b. Adversarial Convex Bandits
At the beginning of each year, Volvo has to select a vector x (in a convex
set) representing the relative efforts in producing various models (S60,
V70, V90, . . .). The reward is an arbitrarily varying and unknown
concave function of x. How to maximise reward over say 50 years?
b. Adversarial Convex Bandits
• Continuous set of actions $A = [0, 1]$
• (Unknown) arbitrary but concave rewards of action $x \in A$: $r_t(x)$
• Online policy $\pi$: select action $x_t^\pi$ at time $t$ depending on
$x_1^\pi, r_1(x_1^\pi), \ldots, x_{t-1}^\pi, r_{t-1}(x_{t-1}^\pi)$
• Regret up to time $T$ (defined w.r.t. the best fixed action in hindsight up to
time $T$):
$$R^\pi(T) = \max_{x \in [0,1]} \sum_{t=1}^T r_t(x) - \sum_{t=1}^T r_t(x_t^\pi)$$
Can we do something smart at all? Achieve a sublinear regret?
b. Adversarial Convex Bandits
• If $r_t(\cdot) = r(\cdot)$ for all $t$, and if $r(\cdot)$ were known, we could apply a gradient
ascent algorithm
• One-point gradient estimate:
$$\hat{f}(x) = \mathbb{E}_{v \in B}[f(x + \delta v)], \quad B = \{x : \|x\|_2 \leq 1\}$$
$$\mathbb{E}_{u \in S}[f(x + \delta u) u] = \delta \nabla \hat{f}(x), \quad S = \{x : \|x\|_2 = 1\}$$
• Simulated Gradient Ascent algorithm: at each step $t$, do
- pick $u_t$ uniformly at random in $S$
- play $y_t = x_t + \delta u_t$ and observe $r_t(y_t)$
- update $x_{t+1} = x_t + \alpha r_t(y_t) u_t$
• Regret: $R(T) = O(T^{5/6})$
2. Markov Decision Process (MDP)
State dynamics:
$$s_{t+1}^\pi = F_t(s_t^\pi, a_t^\pi)$$
• Markovian environment: $\mathbb{P}[s_{t+1}^\pi = s' \mid h_t^\pi, s_t^\pi = s, a_t^\pi = a] = p(s'|s, a)$
• Stationary deterministic rewards (for simplicity): $r_t(a, s) = r(a, s)$
• $p(\cdot|s, a)$ and $r(\cdot, \cdot)$ are initially unknown
Example
Playing Pacman (Google DeepMind experiment, 2015)
State: the current displayed image
Action: right, left, down, up
Feedback: the score and its increments + the state
Bellman’s equation
Objective: maximise the average discounted reward $\sum_{t=1}^\infty \lambda^t \mathbb{E}[r(a_t^\pi, s_t^\pi)]$
Assume the transition probabilities and the reward function are known
• Value function: maps the initial state $s$ to the corresponding
maximum reward $v(s)$
• Bellman's equation:
$$v(s) = \max_{a \in A} \Big( r(a, s) + \lambda \sum_j p(j|s, a) v(j) \Big)$$
• Solve Bellman's equation. The optimal policy is given by:
$$a^\star(s) = \arg\max_{a \in A} \Big( r(a, s) + \lambda \sum_j p(j|s, a) v(j) \Big)$$
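With p and r known, Bellman's equation can be solved by value iteration; here is a NumPy sketch (ours; the array layout r[a, s] and p[s, a, j] is an assumption for illustration, not the lecture's notation):

```python
import numpy as np

def value_iteration(r, p, lam, tol=1e-8):
    # r[a, s]: reward of action a in state s
    # p[s, a, j]: transition probability p(j | s, a); lam: discount factor
    A, S = r.shape
    v = np.zeros(S)
    while True:
        # Q[s, a] = r(a, s) + lam * sum_j p(j | s, a) v(j)
        Q = r.T + lam * (p @ v)
        v_new = Q.max(axis=1)              # Bellman update
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    policy = Q.argmax(axis=1)              # a*(s)
    return v, policy
```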
Q-learning
What if the transition probabilities and the reward function are unknown?
• Q-value function: the maximum expected reward starting from state $s$
and playing action $a$:
$$Q(s, a) = r(a, s) + \lambda \sum_j p(j|s, a) \max_{b \in A} Q(j, b)$$
Note that $v(s) = \max_{a \in A} Q(s, a)$
• Algorithm: update the Q-value estimate sequentially so that it
converges to the true Q-value
Q-learning
1. Initialisation: select $Q \in \mathbb{R}^{S \times A}$ arbitrarily, and an initial state $s_0$
2. Q-value iteration: at each step $t$, select action $a_t$ (each
state-action pair must be selected infinitely often)
Observe the new state $s_{t+1}$ and the reward $r(s_t, a_t)$
Update $Q(s_t, a_t)$:
$$Q(s_t, a_t) := Q(s_t, a_t) + \alpha_t \left[ r(s_t, a_t) + \lambda \max_{a \in A} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
The iteration converges to $Q$ if $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
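A tabular sketch of this iteration (ours; env_step(s, a) is a hypothetical environment returning the next state and reward, and uniform exploration stands in for "each pair selected infinitely often"):

```python
import numpy as np

def q_learning(env_step, S, A, T, lam=0.95, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((S, A))
    visits = np.zeros((S, A))
    s = 0                                   # initial state s_0
    for t in range(T):
        a = int(rng.integers(A))            # uniform exploration
        s_next, r = env_step(s, a)
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]          # satisfies the two step-size conditions
        Q[s, a] += alpha * (r + lam * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q
```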
Q-learning: demo
The crawling robot ...
https://www.youtube.com/watch?v=2iNrJx6IDEo
Scaling up Q-learning
Q-learning converges very slowly, especially when the state and action
spaces are large ...
State-of-the-art algorithms (optimal exploration, ideas from bandit optimisation)
achieve regret $O(S\sqrt{AT})$
What if the action and state are continuous variables?
Example: Mountain car demo (see Sutton's tutorial, NIPS 2015)
Q-learning with function approximation
Idea: restrict our attention to Q-value functions belonging to a family of
functions Q
Examples:
1. Linear functions: $Q = \{Q_\theta, \theta \in \mathbb{R}^M\}$,
$$Q_\theta(s, a) = \sum_{i=1}^M \phi_i(s, a)\theta_i = \phi(s, a)^\top \theta$$
where each $\phi_i$ is linear, and the $\phi_i$'s are linearly independent.
2. Deep networks: Q = {Qw,w ∈ RM}, Qw(s, a) given as the output
of a neural network with weights w and inputs (s, a)
Q-learning with linear function approximation
1. Initialisation: select $\theta \in \mathbb{R}^M$ arbitrarily, and an initial state $s_0$
2. Q-value iteration: at each step $t$, select action $a_t$ (each
state-action pair must be selected infinitely often)
Observe the new state $s_{t+1}$ and the reward $r(s_t, a_t)$
Update $\theta$:
$$\theta := \theta + \alpha_t \Delta_t \nabla_\theta Q_\theta(s_t, a_t) = \theta + \alpha_t \Delta_t \phi(s_t, a_t)$$
where $\Delta_t = r(s_t, a_t) + \lambda \max_{a \in A} Q_\theta(s_{t+1}, a) - Q_\theta(s_t, a_t)$
For convergence results, see "An Analysis of Reinforcement Learning with
Function Approximation", Melo et al., ICML 2008
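A sketch of this update rule (ours; phi(s, a), returning a feature vector in R^M, and env_step(s, a), returning the next state and reward, are assumed interfaces, not the lecture's API):

```python
import numpy as np

def q_learning_linear(env_step, phi, M, n_actions, T, s0,
                      lam=0.95, alpha=0.01, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(M)
    s = s0
    for t in range(T):
        a = int(rng.integers(n_actions))    # uniform exploration
        s_next, r = env_step(s, a)
        # Temporal-difference error Delta_t
        q_next = max(phi(s_next, b) @ theta for b in range(n_actions))
        delta = r + lam * q_next - phi(s, a) @ theta
        # theta := theta + alpha_t * Delta_t * phi(s_t, a_t)
        theta = theta + alpha * delta * phi(s, a)
        s = s_next
    return theta
```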
Q-learning with function approximation
Success stories:
• TD-Gammon (Backgammon), Tesauro 1995 (neural nets)
• Acrobatic helicopter autopilots, Ng et al. 2006
• Jeopardy, IBM Watson, 2011
• 49 Atari games, pixel-level visual inputs, Google DeepMind 2015