Posted on 21-Dec-2015
Review ML and ML(Ext)
Bob Durrant, School of Computer Science
University of Birmingham
Machine Learning
• We introduced the concept of a learning machine, a program that can learn from data.
  – Learning = ability to improve performance automatically through experience
  – Experience = previously seen examples
• We looked at a range of approaches for different sorts of problems, and talked about:
  – Which methods are good for what
  – How the algorithms work
    • What they can do and why
    • What they cannot do and why
Learning
• In common usage, learning is a poorly defined term.
• We defined what we mean by learning:
  – Improve over a task T
  – With respect to a performance measure P
  – Based on some experience E
• Example: Learning to play checkers
  – T: play checkers
  – P: % of games won in world tournament
  – E: opportunity to play against self
Different Learning Methods
• Supervised learning:
  – The algorithm gets information that says which class/what value is associated with each instance of the training data.
  – The algorithm gets immediate feedback.
• Reinforcement learning:
  – The algorithm carries out a sequence of actions and gains rewards only for reaching specific states.
  – The algorithm gets delayed feedback.
• Unsupervised learning:
  – The algorithm applies rules that reveal (or enforce) structure on the data.
  – The algorithm gets no feedback.
Two Main Problem Classes:
• Classification problems
• Regression problems
Classification Problems
• The training data have labels that identify which class they belong to.
• We want to classify data that we haven’t seen before.
  – That is, assign the correct label to the new data.
• The algorithm learns how to classify the unseen data from the training data.
Regression Problems
• The training data are input values with associated output values.
• The output values are the result of evaluating some (unknown) function of the inputs.
• The algorithm learns to approximate the function from the training data.
• The algorithm can evaluate the function for new data, using its learned model.
  – That is, assign the correct output value to the new data.
Probabilistic Models of Sequences
• Simple models for predicting the next term in a sequence.
• Maximum likelihood model:
  – Count frequency of each term in the sequence (= training data).
    • e.g. the sequence S = {aabbabababbb} has 5 a’s and 7 b’s.
  – Assign next term in sequence according to its distribution in the training data.
    • e.g. here P(a|S) = 5/12 and P(b|S) = 7/12.
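The maximum likelihood model above can be sketched in a few lines; the function name is illustrative, not from the slides:

```python
from collections import Counter

def ml_next_term_probs(sequence):
    """Estimate P(next term) from symbol frequencies in the training sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return {symbol: count / total for symbol, count in counts.items()}

probs = ml_next_term_probs("aabbabababbb")
print(probs["a"], probs["b"])  # 5/12 and 7/12, as on the slide
```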
Probabilistic Models of Sequences
• First-order Markov model:
  – Assign probabilities for initial state.
  – Count frequency of successions in the sequence (= training data).
    • e.g. the sequence S = {aabbabababbb} contains the subsequence aa once, ab 4 times, bb 3 times and ba 3 times.
  – Assign initial term according to the initial state probabilities.
    • e.g. here we might have P(a) = P(b) = ½ for the initial term.
  – Assign succeeding terms according to the distribution of the subsequences in the training data.
    • e.g. here P(a|a) = 1/5, P(b|a) = 4/5, P(a|b) = ½, P(b|b) = ½
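Estimating the transition probabilities from bigram counts, as described above, might look like this (the function name is an assumption for illustration):

```python
from collections import Counter, defaultdict

def markov_transition_probs(sequence):
    """Estimate P(next | current) from successive-pair counts in the training data."""
    bigrams = Counter(zip(sequence, sequence[1:]))   # count aa, ab, ba, bb, ...
    totals = defaultdict(int)
    for (cur, nxt), c in bigrams.items():
        totals[cur] += c                             # transitions out of each state
    return {(cur, nxt): c / totals[cur] for (cur, nxt), c in bigrams.items()}

P = markov_transition_probs("aabbabababbb")
print(P[("a", "b")])  # 4/5, matching the slide's counts
```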
Probabilistic Models of Sequences
• Hidden Markov model:
  – The observed data do not satisfy the Markov assumption...
  – ...but they rely upon a hidden (latent) variable that does satisfy the Markov condition.
Reinforcement Learning
• Deterministic Q-learning Algorithm
• Goal: Maximise reward.
  – For each (s,a) initialise table entry Q̂(s,a)
  – Observe current state s
  – Do forever:
    • Select an action a and execute it
    • Receive immediate reward r(s,a)
    • Observe new state s'
    • Update table entry as follows:
      Q̂(s,a) := r(s,a) + γ max_{a'} Q̂(s',a')
    • s := s'
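The deterministic update loop above, run on a toy 3-state chain (the environment, reward of 1 at the goal state, and γ = 0.9 are invented for illustration):

```python
import random

# States 0..2 on a line; action -1/+1 moves along it; reaching state 2 gives reward 1.
GAMMA = 0.9
STATES, ACTIONS = range(3), (-1, +1)

def step(s, a):
    """Deterministic transition: move, clipped to the chain ends."""
    s2 = min(max(s + a, 0), 2)
    return s2, (1.0 if s2 == 2 else 0.0)

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}   # initialise table entries
random.seed(0)
for episode in range(200):
    s = 0
    while s != 2:                     # episode ends at the goal state
        a = random.choice(ACTIONS)    # explore uniformly at random
        s2, r = step(s, a)
        # Q(s,a) := r(s,a) + gamma * max_a' Q(s',a')
        Q[(s, a)] = r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
        s = s2

print(Q[(1, +1)], Q[(0, +1)])  # converges to 1.0 and 0.9
```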
Reinforcement Learning
• Non-deterministic Q-learning Algorithm
• Goal: Maximise expected reward.
  – For each (s,a) initialise table entry Q̂(s,a)
  – Observe current state s
  – Do forever:
    • Select an action a and execute it
    • Receive immediate reward r(s,a)
    • Observe possible new states s'
    • Update table entry as follows:
      Q(s,a) = E[ r(s,a) + γ max_{a'} Q(s',a') ]
             = E[r(s,a)] + γ Σ_{s'} p(s'|s,a) max_{a'} Q(s',a')
    • s := s'
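In practice the expectation above is approximated from samples by averaging with a learning rate α; a minimal sketch of that update rule (the learning-rate form is standard Q-learning, not spelled out on the slide):

```python
def q_update(Q, s, a, r, s2, actions, alpha, gamma=0.9):
    """Q(s,a) := (1 - alpha) Q(s,a) + alpha [ r + gamma max_a' Q(s',a') ]."""
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q

# Tiny worked example with made-up numbers:
Q = {(0, 0): 0.0, (1, 0): 2.0}
q_update(Q, s=0, a=0, r=1.0, s2=1, actions=[0], alpha=0.5)
print(Q[(0, 0)])  # 0.5*0 + 0.5*(1 + 0.9*2) = 1.4
```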
Probabilistic Latent Semantic Analysis
• Represent each document as a column vector
• Build term-by-document matrix
• Select number of topics, K, that will be used to describe the data
• Generate the term-by-topic and topic-by-document matrices using PLSA algorithm
• Probability of retrieving document is:
  P(doc) = Π_{t=1}^{T} [ Σ_{k=1}^{K} P(t|k) P(k|doc) ]^{X(t,doc)}
Probabilistic Latent Semantic Analysis
• Inputs:
  – T x N term-by-document matrix X
  – Number of topics K sought
• Initialise the T x K array P1 and K x N array P2 randomly with numbers between [0,1] and normalise them to sum to 1
• Iterate until convergence.
  – For d=1 to N, For t=1 to T, For k=1 to K:
    P1(t,k) := Σ_{d=1}^{N} [ X(t,d) P1(t,k) P2(k,d) / Σ_{k'=1}^{K} P1(t,k') P2(k',d) ];   P1(t,k) := P1(t,k) / Σ_{t'=1}^{T} P1(t',k)
    P2(k,d) := Σ_{t=1}^{T} [ X(t,d) P1(t,k) P2(k,d) / Σ_{k'=1}^{K} P1(t,k') P2(k',d) ];   P2(k,d) := P2(k,d) / Σ_{k'=1}^{K} P2(k',d)
• Output: arrays P1 and P2, which hold the estimated parameters P(t|k) and P(k|d) respectively.
Naive Bayes Classifier
• Estimate (from the training examples) the following parameters:
  – For each target value (hypothesis) h:   P̂(h) := estimate of P(h)
  – For each attribute value a_t of each data instance:   P̂(a_t|h) := estimate of P(a_t|h)
  – Under the naive Bayes assumption that the attributes a_t are conditionally independent.
• Classify a new data instance, x = (a_1, …, a_T), as:
  h_NaiveBayes = argmax_h P(h) P(x|h) = argmax_h P(h) Π_t P(a_t|h)
• That is, classify the new point according to the MAP estimate of the hypothesis.
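A minimal sketch of these estimates and the MAP classification rule; the tiny weather-style dataset and the small floor for unseen attribute values are invented for illustration:

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (attribute_tuple, label). Returns estimates P(h), P(a_t|h)."""
    prior = Counter(label for _, label in examples)
    cond = defaultdict(Counter)
    for attrs, label in examples:
        for t, a in enumerate(attrs):
            cond[label][(t, a)] += 1
    n = len(examples)
    P_h = {h: c / n for h, c in prior.items()}
    P_a_h = {h: {ta: c / prior[h] for ta, c in cnt.items()} for h, cnt in cond.items()}
    return P_h, P_a_h

def classify_nb(P_h, P_a_h, x):
    """argmax_h P(h) * prod_t P(a_t|h); unseen values get a tiny probability floor."""
    def score(h):
        p = P_h[h]
        for t, a in enumerate(x):
            p *= P_a_h[h].get((t, a), 1e-6)
        return p
    return max(P_h, key=score)

data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rain", "mild"), "yes"), (("rain", "cool"), "yes")]
P_h, P_a_h = train_nb(data)
print(classify_nb(P_h, P_a_h, ("rain", "mild")))  # -> yes
```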
Independent Components Analysis
• Recover n independent signals given only n mixtures of all the sources.
• The mixtures are linear combinations of the sources.
• Ill-posed problem; signals can only be recovered up to permutation and scaling.
• Recovery is achieved by maximising the non-Gaussianity of the recovered (unmixed) signals.
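Why non-Gaussianity is the right objective can be illustrated numerically: by the central limit theorem, mixing independent non-Gaussian sources makes the result more Gaussian, so an unmixing direction that maximises non-Gaussianity points back at a source. The uniform sources and kurtosis measure below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
s1 = rng.uniform(-1, 1, 100_000)       # two independent sub-Gaussian sources
s2 = rng.uniform(-1, 1, 100_000)
mixture = (s1 + s2) / np.sqrt(2)       # a linear mixture of the sources

def excess_kurtosis(x):
    """Zero for a Gaussian; a simple scalar measure of non-Gaussianity."""
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3.0

# The mixture's kurtosis is closer to the Gaussian value 0 than the source's.
print(excess_kurtosis(s1), excess_kurtosis(mixture))
```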
Decision Trees
• Classification method based on maximising information gain.
  – Calculate average entropy for each attribute:
  – Split on attribute that minimises average entropy.
  – Repeat on remaining attributes until tree is complete or no further improvements possible.
  Entropy(S) = −(n⁺/n) log₂(n⁺/n) − (n⁻/n) log₂(n⁻/n)

  Average Entropy(S, A) = Σ_{v ∈ Values(A)} (|S_v|/|S|) Entropy(S_v)

  Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v|/|S|) Entropy(S_v)
K Nearest Neighbours
• Lazy learning method used for classification and regression.
• Nearest neighbours: defined by similarity under a metric, e.g. Euclidean distance.
• A query point is classified on the basis of the labels of its K nearest neighbouring points.
• Function estimation is given by the average of the values of the K nearest neighbouring points.
• Distance-weighted variant: Shepard's method.
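The classification rule above in a few lines, using Euclidean distance and majority vote; the toy points and K = 3 are illustrative:

```python
import math
from collections import Counter

def knn_classify(train, x, k=3):
    """train: list of (point, label). Majority vote among the k nearest points to x."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_classify(train, (1, 1)))  # -> A
```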
Case Based Reasoning
• Advanced form of instance-based learning
• Can tackle complex instance objects
  – These can include complex structural descriptions of cases & adaptation rules
• Must devise similarity measure for these complex structures
• Matching can be adaptive – tries to model human problem-solving:
  – uses past experience (cases) to solve new problems
  – retains solutions to new problems
• E.g. R4 model
Case Based Reasoning
• R4 model (figure): a new case is matched against the case base to retrieve the closest case(s); the closest case is reused, adapted if necessary using the knowledge and adaptation rules, to suggest a solution; the solution is revised; and the result is retained (learned) back into the case base.
Support Vector Machine
• Method for supervised learning problems
  – Classification
  – Regression
• Discriminative approach (vs. generative approaches)
• Two key ideas:
  – Assuming linearly separable classes, learn separating hyperplane with maximum margin
  – Map inputs into high-dimensional space to deal with linearly non-separable cases (data may be linearly separable in HD space)
Non-linear SVM
• In the solution of the SVM (i.e. of the quadratic programming problem with linear inequality constraints), the data enter the solver only in the form of dot products
• This means we can make the SVM non-linear without complicating the algorithm
• We use kernel functions to do this
Kernel Functions
• Transform x → Φ(x)
• The linear algorithm depends only on x·x_i, hence the transformed algorithm depends only on Φ(x)·Φ(x_i)
• Use a kernel function K(x_i, x_j) such that K(x_i, x_j) = Φ(x_i)·Φ(x_j)
Kernel Functions
• Example 1: 2D input space, 3D feature space
  Φ(x) = (x₁², √2 x₁x₂, x₂²)ᵀ,   K(x_i, x_j) = (x_i · x_j)²
• Example 2: the Gaussian kernel
  K(x_i, x_j) = exp{ −‖x_i − x_j‖² / 2σ² }
  – in this case the dimension of Φ is infinite
• Making new kernels:
  K(x₁, x₂) = K₁(x₁, x₂) + K₂(x₁, x₂)
  K(x₁, x₂) = λ K₁(x₁, x₂),  λ > 0
  K(x₁, x₂) = K₁(x₁, x₂) K₂(x₁, x₂)
  K(x₁, x₂) = f(x₁) f(x₂)
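Example 1 can be verified numerically: the feature-space dot product with Φ(x) = (x₁², √2 x₁x₂, x₂²) equals the polynomial kernel (x·x')², so the kernel evaluates the 3D dot product without ever computing Φ:

```python
import math

def phi(x):
    """The explicit 2D -> 3D feature map from Example 1."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
print(dot(phi(x), phi(y)), dot(x, y) ** 2)  # both equal 1.0 here
```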
Clustering Methods
• Find K clusters in the data, i.e. K collections of similar data
• K usually known for real-world applications
• If K is not known, need principled method for finding best K value (NB overfitting)
• We considered two established approaches:
  – K Means
  – Hierarchical Clustering
K Means Clustering
Begin
  initialize μ₁, μ₂, …, μ_K (randomly selected)
  do
    classify n samples according to nearest μᵢ
    recompute μᵢ
  until no change in μᵢ
  return μ₁, μ₂, …, μ_K
End
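The pseudocode above translates directly into Python; the toy 2D points are invented, and empty clusters simply keep their old mean:

```python
import math
import random

def k_means(points, k, seed=0):
    """Assign each sample to the nearest mean, recompute means, stop when unchanged."""
    rng = random.Random(seed)
    means = rng.sample(points, k)                       # randomly selected initial means
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                                # classify samples by nearest mean
            i = min(range(k), key=lambda j: math.dist(p, means[j]))
            clusters[i].append(p)
        new_means = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else means[j]
            for j, cl in enumerate(clusters)            # recompute each mean
        ]
        if new_means == means:                          # until no change in the means
            return means, clusters
        means = new_means

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, clusters = k_means(points, k=2)
```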
Hierarchical Clustering
• Many times, clusters are not disjoint, but a cluster may have subclusters, in turn having sub-subclusters...
Curse of Dimensionality (Non-examinable)
• When working with high dimensional (HD) data we encounter a range of problems, collectively known as the curse of dimensionality. These include:
  – Geometric intuition lets us down
  – Combinatorial explosion means that exhaustive search is intractable
  – Variance in distance between data points and query point tends to zero
  – All points are nearly orthogonal to one another
  – All sample points are close to the edges of samples
  – HD spaces are very sparsely populated by the sample data
Workarounds
• Some approaches that can work (for some problems):
  – Use domain knowledge to tailor your solution
  – Preprocess the data using a dimension reduction technique e.g. PCA, CS
  – Try a different metric e.g. L1, Lq (0<q<1) can do better than L2
  – Use a non-metric similarity measure e.g. ranking
  – Use a method that doesn’t suffer (too much) from the curse e.g. SVM (where generalization ability depends on margin/diameter of data)
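The distance-concentration effect mentioned above (variance in distances tending to zero) is easy to demonstrate; the sample sizes, dimensions, and contrast measure below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n=1000):
    """(max - min) / min over distances from the origin to n uniform points in [0,1]^dim."""
    d = np.linalg.norm(rng.uniform(0, 1, size=(n, dim)), axis=1)
    return (d.max() - d.min()) / d.min()

low, high = distance_contrast(2), distance_contrast(1000)
print(low, high)  # the contrast in 1000 dimensions is far smaller than in 2
```

In high dimension all the points look almost equidistant from the query, which is exactly why nearest-neighbour-style methods degrade there.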