
Bits of Machine Learning

Part 2: Unsupervised Learning

Alexandre Proutiere and Vahan Petrosyan

KTH (The Royal Institute of Technology)

Outline of the Course

1. Supervised Learning

• Regression and Classification

• Neural Networks (Deep Learning)

2. Unsupervised Learning

• Clustering

• Reinforcement Learning

1


Clustering

Clustering extracts structures from unlabelled data $\{X_i,\ i = 1, \ldots, n\}$, i.e., it groups similar data points into a small set of clusters.

3

Clustering: Applications

• Historical application: find geographical correlations in cholera deaths in London (John Snow, 1854)

• Image segmentation: identify regions of interest in an image

• Marketing: detect groups of customers to design targeted ads

• Insurance: identify groups of users with high risk

• Online services and web: e.g. cluster webpages based on their content

• Bioinformatics: group proteins w.r.t. their functionality and chemical structure

...

4

Algorithms

Commonly used clustering algorithms:

• Center-based approaches (K-means, K-medians, Gaussian mixture models, affinity propagation)

• Hierarchical/agglomerative approaches (single linkage, complete linkage, average linkage, Ward's method, BIRCH)

• Graph-based approaches (spectral clustering)

• Density-based approaches (DBSCAN, OPTICS)

• Bayesian nonparametric approaches (Dirichlet processes, Pitman-Yor processes)

...

5


Clustering evaluation

1. Performance benchmarks: artificially generate (clustered) data

and evaluate the number of misclassified points

2. From the data only: a hard problem. What is a good clustering?

- Intra-cluster cohesion: measures how near data points in the same cluster are to each other.

Example: the cohesion of cluster $C$ could be $\sum_{i,j \in C} \|X_i - X_j\|^2$.

The K-means algorithm finds, over the possible clusters $C_1, \ldots, C_K$, local minima of

$$\sum_{k=1}^{K} \sum_{i \in C_k} \|X_i - \mu_k\|^2$$

where $\mu_k$ is the center of cluster $k$.

- Inter-cluster separation: distance between points of different clusters

7

Clustering evaluation

2. From the data only: graph-based approach. Let $V = \{1, \ldots, n\}$ and define a complete weighted graph on the data, with weight $w_{ij} = f(\|X_i - X_j\|)$ on edge $(i, j)$.

Example: Normalized cut. The Ncut of $(C_1, C_2)$ is

$$\mathrm{Ncut}(C_1, C_2) = \frac{\mathrm{cut}(C_1, C_2)}{\mathrm{cut}(C_1, V)} + \frac{\mathrm{cut}(C_2, C_1)}{\mathrm{cut}(C_2, V)},$$

where $\mathrm{cut}(A, B) = \sum_{i \in A, j \in B} w_{ij}$.

We may wish to find clusters minimising the Ncut (NP-hard).
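To make the definition concrete, here is a minimal sketch (assuming NumPy; `W` is a symmetric similarity matrix and `C1`, `C2` are index arrays partitioning `V`; all names are illustrative):

```python
import numpy as np

def ncut(W, C1, C2):
    """Normalised cut of the partition (C1, C2) of a weighted graph W."""
    C1, C2 = np.asarray(C1), np.asarray(C2)
    cross = W[np.ix_(C1, C2)].sum()   # cut(C1, C2): total weight of crossing edges
    vol1 = W[C1, :].sum()             # cut(C1, V)
    vol2 = W[C2, :].sum()             # cut(C2, V)
    return cross / vol1 + cross / vol2
```

For a symmetric $W$, $\mathrm{cut}(C_1, C_2) = \mathrm{cut}(C_2, C_1)$, so the same crossing weight appears in both terms.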

8

K-means (Lloyd 1957, MacQueen 1967)

• K-means aims at finding clusters $C = \{C_1, \ldots, C_K\}$ solving:

$$\min_{C} \sum_{k=1}^{K} \sum_{i \in C_k} \|X_i - \mu_k\|^2$$

where $K$ is the number of clusters and $\mu_k$ is the center of the $k$-th cluster.

• Solving the K-means clustering problem in $\mathbb{R}^d$ is:

1. NP-hard in general dimension $d$, even for 2 clusters

2. NP-hard for a general number of clusters $K$, even in the plane

3. When $K$ and $d$ are fixed, the problem can be solved in time $O(n^{dK+1} \log n)$

9

K-means

1. Initialisation: pick K random points as initial cluster centers

2. Cluster improvement: while the cluster assignment changes, do (see the sketch below):

a. Assign each data point to the cluster with the closest center

b. Update each cluster center to the average position of the points within its cluster
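A minimal sketch of this two-step loop, assuming NumPy, a data matrix `X` of shape `(n, d)` and Euclidean distances (the helper name `k_means` and its defaults are illustrative):

```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignments and center updates."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    centers = X[rng.choice(n, size=K, replace=False)].astype(float)  # 1. random initial centers
    labels = None
    for _ in range(n_iter):
        # 2a. assign each point to the cluster with the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # assignments unchanged: converged
        labels = new_labels
        # 2b. move each center to the mean of its assigned points
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers

# usage: labels, centers = k_means(np.random.randn(200, 2), K=3)
```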

10

Example 1

11

Example 2: Image Segmentation

12

K-means algorithm: Properties

• Guaranteed to converge in a finite number of iterations

• Running time per iteration: O(nKd)

• Issues

1. Can be stuck in a bad local minimum

2. Different initialisations yield different solutions

3. The number of clusters K needs to be specified in advance

An example of an approach addressing both the initialization and the choice of the number of clusters: V. Petrosyan and A. Proutiere, "Viral Clustering: A Robust Method to Extract Structures in Heterogeneous Datasets," AAAI 2016.

13

Gaussian Mixture Models (GMM) and the EM Algorithm

Approach: assume that the data points are generated according to a

GMM in an i.i.d. manner, and estimate the parameters of the GMM

GMM: each point is distributed according to the density $p(x|\theta)$, where $\theta = (\theta_k)_{k=1,\ldots,K}$ and $\theta_k = (\alpha_k, \mu_k, \Sigma_k)$:

$$p(x|\theta) = \sum_{k=1}^{K} \alpha_k\, p_G(x|\theta_k)$$

where $p_G(x|\theta_k) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k)\right)$

$\alpha_k$: probability that a point is attached to cluster $k$

$\mu_k$: center of cluster $k$

$\Sigma_k$: covariance matrix of the $k$-th component of the GMM

14

GMM Example

The density of a GMM with three components in $\mathbb{R}^2$

15

The Expectation-Maximisation algorithm

Dempster-Laird-Rubin, 1977

Aims at maximising the log-likelihood of the model given the data

Two alternating steps:

1. E-step: for a given estimate of $\theta$, evaluate the probability that each data point belongs to the various clusters. The membership weight of $X_i$ in cluster $k$ is:

$$w_{ik} = \frac{\alpha_k\, p_G(X_i|\theta_k)}{\sum_{m=1}^{K} \alpha_m\, p_G(X_i|\theta_m)}$$

2. M-step: update $\theta$ so as to increase the log-likelihood

16

The Expectation-Maximisation algorithm

Initialisation: select $\alpha_k^1, \mu_k^1, \Sigma_k^1$ arbitrarily, $t = 1$

Until convergence, do:

1. E-step: compute, for all $i$ and $k$,

$$w_{ik} = \frac{\alpha_k^t\, p_G(X_i|\mu_k^t, \Sigma_k^t)}{\sum_{m=1}^{K} \alpha_m^t\, p_G(X_i|\mu_m^t, \Sigma_m^t)}, \qquad N_k = \sum_{i=1}^{n} w_{ik}$$

2. M-step: update the parameters as follows: for all $k$,

$$\alpha_k^{t+1} = \frac{N_k}{n}, \qquad \mu_k^{t+1} = \frac{1}{N_k} \sum_{i=1}^{n} w_{ik} X_i$$

$$\Sigma_k^{t+1} = \frac{1}{N_k} \sum_{i=1}^{n} w_{ik} (X_i - \mu_k^{t+1})(X_i - \mu_k^{t+1})^{\top}$$

3. $t := t + 1$
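A minimal sketch of these E and M updates, assuming NumPy and SciPy (the Gaussian density $p_G$ is evaluated with `scipy.stats.multivariate_normal`; the initialisation choices and the small covariance regulariser are illustrative, not part of the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture, following the E/M updates above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.full(K, 1.0 / K)                          # arbitrary initialisation (t = 1)
    mu = X[rng.choice(n, size=K, replace=False)].astype(float)
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: membership weights w_ik and effective counts N_k
        dens = np.column_stack([alpha[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                                for k in range(K)])
        w = dens / dens.sum(axis=1, keepdims=True)
        N = w.sum(axis=0)
        # M-step: re-estimate mixture weights, means and covariances
        alpha = N / n
        mu = (w.T @ X) / N[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (w[:, k, None] * diff).T @ diff / N[k] + 1e-6 * np.eye(d)
    return alpha, mu, Sigma, w
```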

17

EM algorithm: Properties

• Convergence: "On the Convergence Properties of the EM Algorithm", C. F. J. Wu, Annals of Statistics, 1983

• Issues

1. Can be stuck in a bad local minimum

2. Suffers from the curse of dimensionality (not suitable for high

dimensional data)

18

Spectral Clustering

K-means and the EM algorithm for GMM are center-based algorithms.

How to deal with clusters that do not look like potatoes?

... apply spectral methods on similarity matrices or graphs!

Spectral method: work with eigenvalues and eigenvectors of the similarity

matrix

19

Spectral Clustering

Data: $X_1, \ldots, X_n \in \mathbb{R}^d$

Similarity matrix: $W = (w_{ij}) \in \mathbb{R}^{n \times n}$ with, for example (Gaussian kernel),

$$\forall i, j, \quad w_{ij} = \mathbb{1}_{\{i \neq j\}} \exp\left(-\frac{\|X_i - X_j\|^2}{\sigma}\right)$$

Spectral analysis: we take the top-$K$ eigenvectors of $W$ (those corresponding to the largest singular values)

The spectral idea: the data points are well represented by their projections onto the linear span of these eigenvectors

20

Spectral Clustering

”On spectral clustering: Analysis and an algorithm”, A. Ng, M. Jordan,

Y. Weiss, NIPS 2001

1. Input: $W \in \mathbb{R}^{n \times n}$, $K \in \{1, 2, \ldots, n\}$

2. Let $D$ be the diagonal matrix with $D_{ii} := \sum_{j=1}^{n} w_{ij}$

3. $L := D^{-1/2} W D^{-1/2}$

4. $E = [e_1, e_2, \ldots, e_K] := \mathrm{eigs}(L, K) \in \mathbb{R}^{n \times K}$

5. Normalize each row of $E$

6. $y := \text{K-means}(E, K)$

7. Output: $y$

$\mathrm{eigs}(L, K)$ returns the top $K$ eigenvectors of $L$
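A minimal sketch of these steps, assuming NumPy for the eigendecomposition and scikit-learn's `KMeans` for step 6 (any K-means routine would do; the small constants guarding against division by zero are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, K, seed=0):
    """Ng-Jordan-Weiss spectral clustering on a similarity matrix W."""
    d = W.sum(axis=1)                                   # degrees D_ii
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = D_inv_sqrt @ W @ D_inv_sqrt                     # L = D^{-1/2} W D^{-1/2}
    _, eigvecs = np.linalg.eigh(L)                      # eigenvalues in ascending order
    E = eigvecs[:, -K:]                                 # top-K eigenvectors
    E = E / np.maximum(np.linalg.norm(E, axis=1, keepdims=True), 1e-12)  # row-normalise
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(E)
```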

21

Example

22

Tuning Spectral Clustering

We may consider different notions of similarity:

1. Gaussian kernel: $w_{ij} = e^{-\|X_i - X_j\|^2 / \sigma}$ for $i \neq j$ and $w_{ii} = 0$. Choice of $\sigma$?

2. $k$-nearest-neighbour similarity: $w_{ij} = 1$ if $i$ is one of the $k$ nearest neighbours of $j$, or if $j$ is one of the $k$ nearest neighbours of $i$. Choice of $k$?

3. Self-tuning kernel: $w_{ij} = e^{-\frac{\|X_i - X_j\|^2}{\|X_i - X_{i_m}\| \cdot \|X_j - X_{j_m}\|}}$ for $i \neq j$ and $w_{ii} = 0$, where $i_m$ is the index of the $m$-th nearest neighbour of $i$. Choice of $m$?

[Play with these parameters in the lab ... :-) ]
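A small sketch of how these three similarity matrices could be built for the lab, assuming NumPy (the function names and the symmetrisation choice for the k-NN graph are illustrative):

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """w_ij = exp(-||Xi - Xj||^2 / sigma), zero diagonal."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq / sigma)
    np.fill_diagonal(W, 0.0)
    return W

def knn_similarity(X, k):
    """w_ij = 1 if i is among the k nearest neighbours of j, or vice versa."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    order = np.argsort(sq, axis=1)[:, 1:k + 1]      # skip column 0 (the point itself)
    W = np.zeros_like(sq)
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, order.ravel()] = 1.0
    return np.maximum(W, W.T)                        # symmetrise: "i near j OR j near i"

def self_tuning_kernel(X, m):
    """w_ij = exp(-||Xi - Xj||^2 / (||Xi - Xi_m|| * ||Xj - Xj_m||)), zero diagonal."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    dist = np.sqrt(sq)
    sigma = np.sort(dist, axis=1)[:, m]              # distance to the m-th nearest neighbour
    W = np.exp(-sq / np.outer(sigma, sigma))
    np.fill_diagonal(W, 0.0)
    return W
```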

23

Outline of the Course

1. Supervised Learning

• Regression and Classification

• Neural Networks (Deep Learning)

2. Unsupervised Learning

• Clustering

• Reinforcement Learning

24

Reinforcement learning

Learning optimal sequential behaviour / control from interacting with the

environment

State dynamics:

$$s_{t+1}^{\pi} = F_t(s_t^{\pi}, a_t^{\pi})$$

25

Reinforcement learning

Learning optimal sequential behaviour / control from interacting with the

environment

[. . .]

By the time we learn to live

It’s already too late

Our hearts cry in unison at night

[. . .]

Louis Aragon

26

Reinforcement learning: Applications

• Making a robot walk

• Portfolio optimisation

• Playing games better than

humans

• Helicopter stunt

manoeuvres

• Optimal communication

protocols in radio networks

• Display ads

• Search engines

• ...

27

1. Bandit Optimisation

State dynamics:

$$s_{t+1}^{\pi} = F_t(s_t^{\pi}, a_t^{\pi})$$

• Interact with an i.i.d. or adversarial environment

• The reward is independent of the state and is the only feedback:

- i.i.d. environment: $r_t(a, s) = r_t(a)$, a random variable with mean $\theta_a$

- adversarial environment: $r_t(a, s) = r_t(a)$ is arbitrary!

28

2. Markov Decision Process (MDP)

State dynamics:

$$s_{t+1}^{\pi} = F_t(s_t^{\pi}, a_t^{\pi})$$

• History at $t$: $h_t^{\pi} = (s_1^{\pi}, a_1^{\pi}, \ldots, s_{t-1}^{\pi}, a_{t-1}^{\pi}, s_t^{\pi})$

• Markovian environment: $\mathbb{P}[s_{t+1}^{\pi} = s' \mid h_t^{\pi}, s_t^{\pi} = s, a_t^{\pi} = a] = p(s'|s, a)$

• Stationary deterministic rewards (for simplicity): $r_t(a, s) = r(a, s)$

29

What is to be learnt and optimised?

• Bandit optimisation: the average rewards of the actions are unknown

Information available at time $t$ under $\pi$: $a_1^{\pi}, r_1(a_1^{\pi}), \ldots, a_{t-1}^{\pi}, r_{t-1}(a_{t-1}^{\pi})$

• MDP: the state dynamics $p(\cdot|s, a)$ and the reward function $r(a, s)$ are unknown

Information available at time $t$ under $\pi$: $s_1^{\pi}, a_1^{\pi}, r_1(a_1^{\pi}, s_1^{\pi}), \ldots, s_{t-1}^{\pi}, a_{t-1}^{\pi}, r_{t-1}(a_{t-1}^{\pi}, s_{t-1}^{\pi}), s_t^{\pi}$

• Objective: maximise the cumulative reward

$$\sum_{t=1}^{T} \mathbb{E}[r_t(a_t^{\pi}, s_t^{\pi})] \quad \text{or} \quad \sum_{t=1}^{\infty} \lambda^t\, \mathbb{E}[r_t(a_t^{\pi}, s_t^{\pi})]$$

30

Regret

• Difference between the cumulative reward of an ”Oracle” policy and

that of agent π

• Regret quantifies the price to pay for learning!

• Exploration vs. exploitation trade-off: we need to probe all actions

to play the best later ...

31

1. Bandit Optimisation

First application: Clinical trial, Thompson 1933

- A set of possible actions at each step

- Unknown sequence of rewards for each action

- Bandit feedback: only rewards of chosen actions are observed

- Goal: maximise the cumulative reward (up to step T )

Two examples:

a. Finite number of actions, stochastic rewards

b. Continuous actions, concave adversarial rewards

32

a. Stochastic bandits – Robbins 1952

• Finite set of actions $A$

• (Unknown) rewards of action $a \in A$: $(r_t(a),\ t \geq 0)$ i.i.d. Bernoulli with $\mathbb{E}[r_t(a)] = \theta_a$

• Optimal action: $a^{\star} \in \arg\max_a \theta_a$

• Online policy $\pi$: select action $a_t^{\pi}$ at time $t$ depending on $a_1^{\pi}, r_1(a_1^{\pi}), \ldots, a_{t-1}^{\pi}, r_{t-1}(a_{t-1}^{\pi})$

• Regret up to time $T$: $R^{\pi}(T) = T\theta_{a^{\star}} - \sum_{t=1}^{T} \theta_{a_t^{\pi}}$

33

a. Stochastic bandits

Fundamental performance limits (Lai-Robbins, 1985). For any reasonable $\pi$:

$$\liminf_{T \to \infty} \frac{R^{\pi}(T)}{\log(T)} \geq \sum_{a \neq a^{\star}} \frac{\theta_{a^{\star}} - \theta_a}{\mathrm{KL}(\theta_a, \theta_{a^{\star}})}$$

where $\mathrm{KL}(a, b) = a \log(\frac{a}{b}) + (1-a)\log(\frac{1-a}{1-b})$ (the KL divergence between Bernoulli distributions)

Algorithms:

(i) $\varepsilon$-greedy: linear regret

(ii) $\varepsilon_t$-greedy: logarithmic regret ($\varepsilon_t = 1/t$)

(iii) Upper Confidence Bound (UCB) algorithm: play the action maximising the index

$$b_a(t) = \hat{\theta}_a(t) + \sqrt{\frac{2\log(t)}{n_a(t)}}$$

$\hat{\theta}_a(t)$: empirical mean reward of $a$ up to $t$

$n_a(t)$: number of times $a$ has been played up to $t$
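A minimal sketch of the UCB index policy, assuming NumPy; `reward_fn(a, rng)` is a stand-in for pulling arm `a`, and the Bernoulli example in the comment is illustrative:

```python
import numpy as np

def ucb(reward_fn, n_arms, T, seed=0):
    """Upper Confidence Bound index policy for stochastic bandits."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_arms)          # n_a(t): number of plays of each arm
    means = np.zeros(n_arms)           # theta_hat_a(t): empirical mean reward
    total = 0.0
    for t in range(1, T + 1):
        if t <= n_arms:
            a = t - 1                  # play each arm once to initialise the indices
        else:
            index = means + np.sqrt(2.0 * np.log(t) / counts)
            a = int(np.argmax(index))
        r = reward_fn(a, rng)          # observe a (random) reward for arm a
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]   # running average update
        total += r
    return total, means

# usage with Bernoulli arms of means theta (hypothetical example):
# theta = [0.3, 0.5, 0.7]
# ucb(lambda a, rng: rng.binomial(1, theta[a]), n_arms=3, T=10_000)
```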

34

b. Adversarial Convex Bandits

At the beginning of each year, Volvo has to select a vector x (in a convex

set) representing the relative efforts in producing various models (S60,

V70, V90, . . .). The reward is an arbitrarily varying and unknown

concave function of x. How to maximise reward over say 50 years?

35


b. Adversarial Convex Bandits

• Continuous set of actions $A = [0, 1]$

• (Unknown) arbitrary but concave rewards of action $x \in A$: $r_t(x)$

• Online policy $\pi$: select action $x_t^{\pi}$ at time $t$ depending on $x_1^{\pi}, r_1(x_1^{\pi}), \ldots, x_{t-1}^{\pi}, r_{t-1}(x_{t-1}^{\pi})$

• Regret up to time $T$ (defined w.r.t. the best fixed action in hindsight up to time $T$):

$$R^{\pi}(T) = \max_{x \in [0,1]} \sum_{t=1}^{T} r_t(x) - \sum_{t=1}^{T} r_t(x_t^{\pi})$$

Can we do something smart at all? Achieve a sublinear regret?

41

b. Adversarial Convex Bandits

• If $r_t(\cdot) = r(\cdot)$ and $r(\cdot)$ were known, we could apply a gradient ascent algorithm

• One-point gradient estimate:

$$\hat{f}(x) = \mathbb{E}_{v \in B}[f(x + \delta v)], \quad B = \{x : \|x\|_2 \leq 1\}$$

$$\mathbb{E}_{u \in S}[f(x + \delta u)\, u] = \delta \nabla \hat{f}(x), \quad S = \{x : \|x\|_2 = 1\}$$

so the reward observed at a randomly perturbed point gives a (scaled) unbiased estimate of the gradient of the smoothed reward $\hat{f}$.

• Simulated Gradient Ascent algorithm: at each step $t$, do

- choose $u_t$ uniformly in $S$

- play $y_t = x_t + \delta u_t$ and observe $r_t(y_t)$

- update $x_{t+1} = x_t + \alpha\, r_t(y_t)\, u_t$

• Regret: $R(T) = O(T^{5/6})$
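A minimal one-dimensional sketch of this simulated gradient ascent, assuming NumPy; the step sizes `delta` and `alpha`, the clipping used to keep the played point feasible, and `reward_fn` are illustrative choices rather than the tuned values behind the $O(T^{5/6})$ bound:

```python
import numpy as np

def one_point_gradient_bandit(reward_fn, T, delta=0.05, alpha=0.01, seed=0):
    """Bandit gradient ascent on [0, 1] with a one-point gradient estimate."""
    rng = np.random.default_rng(seed)
    x = 0.5                                       # current base point
    total = 0.0
    for t in range(T):
        u = rng.choice([-1.0, 1.0])               # uniform on the 1-D unit sphere S
        y = x + delta * u                         # perturbed action actually played
        r = reward_fn(t, y)                       # bandit feedback: only r_t(y) is observed
        x = float(np.clip(x + alpha * r * u, delta, 1.0 - delta))  # ascent step, kept feasible
        total += r
    return x, total

# usage with a (hypothetical) slowly drifting concave reward on [0, 1]:
# f = lambda t, y: -(y - 0.6 - 0.1 * np.sin(t / 500.0)) ** 2
# one_point_gradient_bandit(f, T=10_000)
```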

42

2. Markov Decision Process (MDP)

State dynamics:

$$s_{t+1}^{\pi} = F_t(s_t^{\pi}, a_t^{\pi})$$

• Markovian environment: $\mathbb{P}[s_{t+1}^{\pi} = s' \mid h_t^{\pi}, s_t^{\pi} = s, a_t^{\pi} = a] = p(s'|s, a)$

• Stationary deterministic rewards (for simplicity): $r_t(a, s) = r(a, s)$

• $p(\cdot|s, a)$ and $r(\cdot, \cdot)$ are initially unknown

43

Example

Playing Pacman (Google DeepMind experiment, 2015)

State: the currently displayed image

Action: right, left, down, up

Feedback: the score and its increments, plus the next state

44

Bellman’s equation

Objective: maximise the expected discounted reward $\sum_{t=1}^{\infty} \lambda^t\, \mathbb{E}[r(a_t^{\pi}, s_t^{\pi})]$

Assume first that the transition probabilities and the reward function are known.

• Value function: maps the initial state $s$ to the corresponding maximum reward $v(s)$

• Bellman's equation:

$$v(s) = \max_{a \in A} \left\{ r(a, s) + \lambda \sum_{j} p(j|s, a)\, v(j) \right\}$$

• Solve Bellman's equation. The optimal policy is then given by:

$$a^{\star}(s) = \arg\max_{a \in A} \left\{ r(a, s) + \lambda \sum_{j} p(j|s, a)\, v(j) \right\}$$
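When $p$ and $r$ are known, Bellman's equation can be solved by value iteration; here is a minimal sketch assuming NumPy, with a transition tensor `p[a, s, s2]` and a reward matrix `r[a, s]` (these array conventions are illustrative):

```python
import numpy as np

def value_iteration(p, r, lam, n_iter=1000, tol=1e-8):
    """Solve Bellman's equation by value iteration.

    p[a, s, s2] -- transition probability from s to s2 under action a
    r[a, s]     -- reward of playing a in state s
    lam         -- discount factor in (0, 1)
    """
    A, S, _ = p.shape
    v = np.zeros(S)
    for _ in range(n_iter):
        # v(s) = max_a { r(a, s) + lam * sum_j p(j|s, a) v(j) }
        v_new = (r + lam * p @ v).max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    policy = (r + lam * p @ v).argmax(axis=0)   # a*(s): the maximising action
    return v, policy
```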

45

Q-learning

What if the transition probabilities and the reward function are unknown?

• Q-value function: the maximum expected reward starting from state $s$ and playing action $a$:

$$Q(s, a) = r(a, s) + \lambda \sum_{j} p(j|s, a) \max_{b \in A} Q(j, b)$$

Note that $v(s) = \max_{a \in A} Q(s, a)$

• Algorithm: update the Q-value estimate sequentially so that it converges to the true Q-value

46

Q-learning

1. Initialisation: select $Q \in \mathbb{R}^{S \times A}$ arbitrarily, and an initial state $s_0$

2. Q-value iteration: at each step $t$, select an action $a_t$ (each state-action pair must be selected infinitely often)

Observe the new state $s_{t+1}$ and the reward $r(s_t, a_t)$

Update $Q(s_t, a_t)$:

$$Q(s_t, a_t) := Q(s_t, a_t) + \alpha_t \left[ r(s_t, a_t) + \lambda \max_{a \in A} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

The iterates converge to $Q$ if $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
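A minimal tabular sketch, assuming NumPy and a hypothetical environment interface `env_step(s, a, rng) -> (next_state, reward)`; the ε-greedy exploration and the step size $\alpha_t = 1/n(s_t, a_t)$ are illustrative choices satisfying the conditions above:

```python
import numpy as np

def q_learning(env_step, S, A, T, lam=0.9, eps=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((S, A))
    visits = np.zeros((S, A))
    s = 0                                        # initial state s0
    for _ in range(T):
        # epsilon-greedy selection helps visit every state-action pair often
        a = rng.integers(A) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, reward = env_step(s, a, rng)     # observe s_{t+1} and r(s_t, a_t)
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]               # sum alpha_t = inf, sum alpha_t^2 < inf
        Q[s, a] += alpha * (reward + lam * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return Q
```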

47

Q-learning: demo

The crawling robot ...

https://www.youtube.com/watch?v=2iNrJx6IDEo

48

Scaling up Q-learning

Q-learning converges very slowly, especially when the state and action

spaces are large ...

State-of-the-art algorithms (optimal exploration, using ideas from bandit optimisation) achieve regret $O(S\sqrt{AT})$.

What if the actions and states are continuous variables?

Example: Mountain Car demo (see Sutton's tutorial, NIPS 2015)

49

Q-learning with function approximation

Idea: restrict our attention to Q-value functions belonging to a family of functions $\mathcal{Q}$

Examples:

1. Linear functions: $\mathcal{Q} = \{Q_\theta,\ \theta \in \mathbb{R}^M\}$, with

$$Q_\theta(s, a) = \sum_{i=1}^{M} \phi_i(s, a)\, \theta_i = \phi(s, a)^{\top} \theta$$

where the $\phi_i$'s are fixed, linearly independent feature functions ($Q_\theta$ is linear in $\theta$).

2. Deep networks: $\mathcal{Q} = \{Q_w,\ w \in \mathbb{R}^M\}$, where $Q_w(s, a)$ is given as the output of a neural network with weights $w$ and inputs $(s, a)$

50

Q-learning with linear function approximation

1. Initialisation: select $\theta \in \mathbb{R}^M$ arbitrarily, and an initial state $s_0$

2. Q-value iteration: at each step $t$, select an action $a_t$ (each state-action pair must be selected infinitely often)

Observe the new state $s_{t+1}$ and the reward $r(s_t, a_t)$

Update $\theta$:

$$\theta := \theta + \alpha_t \Delta_t \nabla_\theta Q_\theta(s_t, a_t) = \theta + \alpha_t \Delta_t\, \phi(s_t, a_t)$$

where $\Delta_t = r(s_t, a_t) + \lambda \max_{a \in A} Q_\theta(s_{t+1}, a) - Q_\theta(s_t, a_t)$

For convergence results, see "An Analysis of Reinforcement Learning with Function Approximation", Melo et al., ICML 2008
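A minimal sketch of this update with a linear approximation, again assuming NumPy and a hypothetical `env_step(s, a, rng) -> (next_state, reward)` interface; `features(s, a)` returns the feature vector $\phi(s, a)$, and the ε-greedy exploration is an illustrative choice:

```python
import numpy as np

def q_learning_linear(env_step, features, A, T, lam=0.9, alpha=0.01, eps=0.1, seed=0):
    """Q-learning with Q_theta(s, a) = phi(s, a)^T theta."""
    rng = np.random.default_rng(seed)
    s = 0                                           # initial state s0
    theta = np.zeros(len(features(s, 0)))
    q = lambda s, a: features(s, a) @ theta         # current Q-value estimate
    for _ in range(T):
        a = rng.integers(A) if rng.random() < eps else int(np.argmax([q(s, b) for b in range(A)]))
        s_next, reward = env_step(s, a, rng)        # observe the next state and the reward
        # temporal-difference error Delta_t
        td = reward + lam * max(q(s_next, b) for b in range(A)) - q(s, a)
        theta = theta + alpha * td * features(s, a) # theta := theta + alpha_t Delta_t phi(s_t, a_t)
        s = s_next
    return theta
```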

51

Q-learning with function approximation

Success stories:

• TD-Gammon (Backgammon), Tesauro 1995 (neural nets)

• Acrobatic helicopter autopilots, Ng et al. 2006

• Jeopardy, IBM Watson, 2011

• 49 Atari games from pixel-level visual inputs, Google DeepMind 2015

52

Questions?

Alexandre Proutiere

[email protected]

53