Posted on 21-Dec-2015
Review ML and ML(Ext)
Bob Durrant, School of Computer Science
University of Birmingham
Machine Learning
• We introduced the concept of a learning machine, a program that can learn from data.
  – Learning = ability to improve performance automatically through experience
  – Experience = previously seen examples
• We looked at a range of approaches for different sorts of problems, and talked about:
  – Which methods are good for what
  – How the algorithms work
    • What they can do and why
    • What they cannot do and why
Learning
• In common usage, learning is a poorly defined term.
• We defined what we mean by learning:
  – Improve over a task T
  – With respect to a performance measure P
  – Based on some experience E
• Example: Learning to play checkers
  – T: play checkers
  – P: % of games won in world tournament
  – E: opportunity to play against self
Different Learning Methods
• Supervised learning:
  – The algorithm gets information that says which class/what value is associated with each instance of the training data.
  – The algorithm gets immediate feedback.
• Reinforcement learning:
  – The algorithm carries out a sequence of actions and gains rewards only for reaching specific states.
  – The algorithm gets delayed feedback.
• Unsupervised learning:
  – The algorithm applies rules that reveal (or enforce) structure on the data.
  – The algorithm gets no feedback.
Two Main Problem Classes:
• Classification problems
• Regression problems
Classification Problems
• The training data have labels that identify which class they belong to.
• We want to classify data that we haven’t seen before.
  – That is, assign the correct label to the new data.
• The algorithm learns how to classify the unseen data from the training data.
Regression Problems
• The training data are input values with associated output values.
• The output values are the result of evaluating some (unknown) function of the inputs.
• The algorithm learns to approximate the function from the training data.
• The algorithm can evaluate the function for new data, using its learned model.
  – That is, assign the correct output value to the new data.
Probabilistic Models of Sequences
• Simple models for predicting the next term in a sequence.
• Maximum likelihood model:
  – Count frequency of each term in the sequence (= training data).
    • e.g. the sequence S = {aabbabababbb} has 5 a’s and 7 b’s.
  – Assign next term in sequence according to its distribution in the training data.
    • e.g. here P(a|S) = 5/12 and P(b|S) = 7/12.
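The maximum likelihood model above can be sketched in a few lines; the function name is illustrative, not from the slides:

```python
from collections import Counter

def ml_next_term_probs(sequence):
    """Estimate P(next term) from symbol frequencies in the training sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return {symbol: count / total for symbol, count in counts.items()}

probs = ml_next_term_probs("aabbabababbb")
print(probs["a"], probs["b"])  # 5/12 and 7/12, as on the slide
```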
Probabilistic Models of Sequences
• First-order Markov model:
  – Assign probabilities for initial state.
  – Count frequency of successions in the sequence (= training data).
    • e.g. the sequence S = {aabbabababbb} contains the subsequence aa once, ab 4 times, bb 3 times and ba 3 times.
  – Assign initial term according to the initial state probabilities.
    • e.g. here we might have P(a) = P(b) = ½ for the initial term.
  – Assign succeeding terms according to the distribution of the subsequences in the training data.
    • e.g. here P(a|a) = 1/5, P(b|a) = 4/5, P(a|b) = ½, P(b|b) = ½
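Estimating the transition probabilities from bigram counts, as described above, might look like this (the function name is an assumption for illustration):

```python
from collections import Counter, defaultdict

def markov_transition_probs(sequence):
    """Estimate P(next | current) from successive-pair counts in the training data."""
    bigrams = Counter(zip(sequence, sequence[1:]))   # count aa, ab, ba, bb, ...
    totals = defaultdict(int)
    for (cur, nxt), c in bigrams.items():
        totals[cur] += c                             # transitions out of each state
    return {(cur, nxt): c / totals[cur] for (cur, nxt), c in bigrams.items()}

P = markov_transition_probs("aabbabababbb")
print(P[("a", "b")])  # 4/5, matching the slide's counts
```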
Probabilistic Models of Sequences
• Hidden Markov model:
  – The observed data do not satisfy the Markov assumption...
  – ...but they rely upon a hidden (latent) variable that does satisfy the Markov condition.
Reinforcement Learning
• Deterministic Q-learning Algorithm
• Goal: Maximise reward.
  – For each (s,a) initialise table entry Q̂(s,a)
  – Observe current state s
  – Do forever:
    • Select an action a and execute it
    • Receive immediate reward r(s,a)
    • Observe new state s'
    • Update table entry as follows:
      Q̂(s,a) := r(s,a) + γ max_{a'} Q̂(s',a')
    • s := s'
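The deterministic update loop above, run on a toy 3-state chain (the environment, reward of 1 at the goal state, and γ = 0.9 are invented for illustration):

```python
import random

# States 0..2 on a line; action -1/+1 moves along it; reaching state 2 gives reward 1.
GAMMA = 0.9
STATES, ACTIONS = range(3), (-1, +1)

def step(s, a):
    """Deterministic transition: move, clipped to the chain ends."""
    s2 = min(max(s + a, 0), 2)
    return s2, (1.0 if s2 == 2 else 0.0)

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}   # initialise table entries
random.seed(0)
for episode in range(200):
    s = 0
    while s != 2:                     # episode ends at the goal state
        a = random.choice(ACTIONS)    # explore uniformly at random
        s2, r = step(s, a)
        # Q(s,a) := r(s,a) + gamma * max_a' Q(s',a')
        Q[(s, a)] = r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
        s = s2

print(Q[(1, +1)], Q[(0, +1)])  # converges to 1.0 and 0.9
```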
Reinforcement Learning
• Non-deterministic Q-learning Algorithm
• Goal: Maximise expected reward.
  – For each (s,a) initialise table entry Q̂(s,a)
  – Observe current state s
  – Do forever:
    • Select an action a and execute it
    • Receive immediate reward r(s,a)
    • Observe possible new states s'
    • Update table entry as follows:
      Q(s,a) = E[ r(s,a) + γ max_{a'} Q(s',a') ]
             = E[r(s,a)] + γ Σ_{s'} p(s'|s,a) max_{a'} Q(s',a')
    • s := s'
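In practice the expectation above is approximated from samples by averaging with a learning rate α; a minimal sketch of that update rule (the learning-rate form is standard Q-learning, not spelled out on the slide):

```python
def q_update(Q, s, a, r, s2, actions, alpha, gamma=0.9):
    """Q(s,a) := (1 - alpha) Q(s,a) + alpha [ r + gamma max_a' Q(s',a') ]."""
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q

# Tiny worked example with made-up numbers:
Q = {(0, 0): 0.0, (1, 0): 2.0}
q_update(Q, s=0, a=0, r=1.0, s2=1, actions=[0], alpha=0.5)
print(Q[(0, 0)])  # 0.5*0 + 0.5*(1 + 0.9*2) = 1.4
```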
Probabilistic Latent Semantic Analysis
• Represent each document as a column vector
• Build term-by-document matrix
• Select number of topics, K, that will be used to describe the data
• Generate the term-by-topic and topic-by-document matrices using PLSA algorithm
• Probability of retrieving document is:
  P(doc) = Π_{t=1}^{T} [ Σ_{k=1}^{K} P(t|k) P(k|doc) ]^{X(t,doc)}
Probabilistic Latent Semantic Analysis
• Inputs:
  – T x N term-by-document matrix X
  – Number of topics K sought
• Initialise the T x K array P1 and K x N array P2 randomly with numbers between [0,1] and normalise them to sum to 1
• Iterate until convergence.
  – For d=1 to N, For t=1 to T, For k=1 to K:
    P1(t,k) := Σ_{d=1}^{N} [ X(t,d) P1(t,k) P2(k,d) / Σ_{k'=1}^{K} P1(t,k') P2(k',d) ];   P1(t,k) := P1(t,k) / Σ_{t'=1}^{T} P1(t',k)
    P2(k,d) := Σ_{t=1}^{T} [ X(t,d) P1(t,k) P2(k,d) / Σ_{k'=1}^{K} P1(t,k') P2(k',d) ];   P2(k,d) := P2(k,d) / Σ_{k'=1}^{K} P2(k',d)
• Output: arrays P1 and P2, which hold the estimated parameters P(t|k) and P(k|d) respectively.
Naive Bayes Classifier
• Estimate (from the training examples) the following parameters:
  – For each target value (hypothesis) h:   P̂(h) := estimate of P(h)
  – For each attribute value a_t of each data instance:   P̂(a_t|h) := estimate of P(a_t|h)
  – Under the naive Bayes assumption that the attributes a_t are conditionally independent.
• Classify a new data instance, x = (a_1, …, a_T), as:
  h_NaiveBayes = argmax_h P(h) P(x|h) = argmax_h P(h) Π_t P(a_t|h)
• That is, classify the new point according to the MAP estimate of the hypothesis.
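A minimal sketch of these estimates and the MAP classification rule; the tiny weather-style dataset and the small floor for unseen attribute values are invented for illustration:

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (attribute_tuple, label). Returns estimates P(h), P(a_t|h)."""
    prior = Counter(label for _, label in examples)
    cond = defaultdict(Counter)
    for attrs, label in examples:
        for t, a in enumerate(attrs):
            cond[label][(t, a)] += 1
    n = len(examples)
    P_h = {h: c / n for h, c in prior.items()}
    P_a_h = {h: {ta: c / prior[h] for ta, c in cnt.items()} for h, cnt in cond.items()}
    return P_h, P_a_h

def classify_nb(P_h, P_a_h, x):
    """argmax_h P(h) * prod_t P(a_t|h); unseen values get a tiny probability floor."""
    def score(h):
        p = P_h[h]
        for t, a in enumerate(x):
            p *= P_a_h[h].get((t, a), 1e-6)
        return p
    return max(P_h, key=score)

data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rain", "mild"), "yes"), (("rain", "cool"), "yes")]
P_h, P_a_h = train_nb(data)
print(classify_nb(P_h, P_a_h, ("rain", "mild")))  # -> yes
```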
Independent Components Analysis
• Recover n independent signals given only n mixtures of all the sources.
• The mixtures are linear combinations of the sources.
• Ill-posed problem; signals can only be recovered up to permutation and scaling.
• Recovery is achieved by maximising the non-Gaussianity of the recovered (unmixed) signals.
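Why non-Gaussianity is the right objective can be illustrated numerically: by the central limit theorem, mixing independent non-Gaussian sources makes the result more Gaussian, so an unmixing direction that maximises non-Gaussianity points back at a source. The uniform sources and kurtosis measure below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
s1 = rng.uniform(-1, 1, 100_000)       # two independent sub-Gaussian sources
s2 = rng.uniform(-1, 1, 100_000)
mixture = (s1 + s2) / np.sqrt(2)       # a linear mixture of the sources

def excess_kurtosis(x):
    """Zero for a Gaussian; a simple scalar measure of non-Gaussianity."""
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3.0

# The mixture's kurtosis is closer to the Gaussian value 0 than the source's.
print(excess_kurtosis(s1), excess_kurtosis(mixture))
```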
Decision Trees
• Classification method based on maximising information gain.
  – Calculate average entropy for each attribute:
  – Split on attribute that minimises average entropy.
  – Repeat on remaining attributes until tree is complete or no further improvements possible.
  Entropy(S) = −(n⁺/n) log₂(n⁺/n) − (n⁻/n) log₂(n⁻/n)

  Average Entropy(S, A) = Σ_{v ∈ Values(A)} (|S_v|/|S|) Entropy(S_v)

  Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v|/|S|) Entropy(S_v)
K Nearest Neighbours
• Lazy learning method used for classification and regression.
• Nearest neighbours: defined by similarity under a metric, e.g. Euclidean distance.
• A query point is classified on the basis of the labels of its K nearest neighbouring points.
• Function estimation is given by the average of the values of the K nearest neighbouring points.
• Distance-weighted variant: Shepard's method.
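The classification rule above in a few lines, using Euclidean distance and majority vote; the toy points and K = 3 are illustrative:

```python
import math
from collections import Counter

def knn_classify(train, x, k=3):
    """train: list of (point, label). Majority vote among the k nearest points to x."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_classify(train, (1, 1)))  # -> A
```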
Case Based Reasoning
• Advanced form of instance-based learning
• Can tackle complex instance objects
  – These can include complex structural descriptions of cases & adaptation rules
• Must devise similarity measure for these complex structures
• Matching can be adaptive – tries to model human problem-solving:
  – uses past experience (cases) to solve new problems
  – retains solutions to new problems
• E.g. R4 model
Case Based Reasoning
• R4 model (figure): a new case is matched against the case base to retrieve the closest case(s); the closest case is reused, adapted if necessary using the knowledge and adaptation rules, to suggest a solution; the solution is revised; and the result is retained (learned) back into the case base.
Support Vector Machine
• Method for supervised learning problems
  – Classification
  – Regression
• Discriminative approach (vs. generative approaches)
• Two key ideas:
  – Assuming linearly separable classes, learn separating hyperplane with maximum margin
  – Map inputs into high-dimensional space to deal with linearly non-separable cases (data may be linearly separable in HD space)
Non-linear SVM
• In the solution of the SVM (i.e. of the quadratic programming problem with linear inequality constraints), the data enter the solver only in the form of dot products
• This means we can make the SVM non-linear without complicating the algorithm
• We use kernel functions to do this
Kernel Functions
• Transform x → Φ(x)
• The linear algorithm depends only on x·x_i, hence the transformed algorithm depends only on Φ(x)·Φ(x_i)
• Use a kernel function K(x_i, x_j) such that K(x_i, x_j) = Φ(x_i)·Φ(x_j)
Kernel Functions
• Example 1: 2D input space, 3D feature space
  Φ(x) = (x₁², √2 x₁x₂, x₂²)ᵀ,   K(x_i, x_j) = (x_i · x_j)²
• Example 2: the Gaussian kernel
  K(x_i, x_j) = exp{ −‖x_i − x_j‖² / 2σ² }
  – in this case the dimension of Φ is infinite
• Making new kernels:
  K(x₁, x₂) = K₁(x₁, x₂) + K₂(x₁, x₂)
  K(x₁, x₂) = λ K₁(x₁, x₂),  λ > 0
  K(x₁, x₂) = K₁(x₁, x₂) K₂(x₁, x₂)
  K(x₁, x₂) = f(x₁) f(x₂)
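Example 1 can be verified numerically: the feature-space dot product with Φ(x) = (x₁², √2 x₁x₂, x₂²) equals the polynomial kernel (x·x')², so the kernel evaluates the 3D dot product without ever computing Φ:

```python
import math

def phi(x):
    """The explicit 2D -> 3D feature map from Example 1."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
print(dot(phi(x), phi(y)), dot(x, y) ** 2)  # both equal 1.0 here
```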
Clustering Methods
• Find K clusters in the data, i.e. K collections of similar data
• K usually known for real-world applications
• If K is not known, need principled method for finding best K value (NB overfitting)
• We considered two established approaches:
  – K Means
  – Hierarchical Clustering
K Means Clustering
Begin
  initialize μ₁, μ₂, …, μ_K (randomly selected)
  do
    classify n samples according to nearest μᵢ
    recompute μᵢ
  until no change in μᵢ
  return μ₁, μ₂, …, μ_K
End
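The pseudocode above translates directly into Python; the toy 2D points are invented, and empty clusters simply keep their old mean:

```python
import math
import random

def k_means(points, k, seed=0):
    """Assign each sample to the nearest mean, recompute means, stop when unchanged."""
    rng = random.Random(seed)
    means = rng.sample(points, k)                       # randomly selected initial means
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                                # classify samples by nearest mean
            i = min(range(k), key=lambda j: math.dist(p, means[j]))
            clusters[i].append(p)
        new_means = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else means[j]
            for j, cl in enumerate(clusters)            # recompute each mean
        ]
        if new_means == means:                          # until no change in the means
            return means, clusters
        means = new_means

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, clusters = k_means(points, k=2)
```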
Hierarchical Clustering
• Many times, clusters are not disjoint, but a cluster may have subclusters, in turn having sub-subclusters...
Curse of Dimensionality (Non-examinable)
• When working with high dimensional (HD) data we encounter a range of problems, collectively known as the curse of dimensionality. These include:
  – Geometric intuition lets us down
  – Combinatorial explosion means that exhaustive search is intractable
  – Variance in distance between data points and query point tends to zero
  – All points are nearly orthogonal to one another
  – All sample points are close to the edges of samples
  – HD spaces are very sparsely populated by the sample data
Workarounds
• Some approaches that can work (for some problems):
  – Use domain knowledge to tailor your solution
  – Preprocess the data using a dimension reduction technique e.g. PCA, CS
  – Try a different metric e.g. L1, Lq (0<q<1) can do better than L2
  – Use a non-metric similarity measure e.g. ranking
  – Use a method that doesn’t suffer (too much) from the curse e.g. SVM (where generalization ability depends on margin/diameter of data)
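The distance-concentration effect mentioned above (variance in distances tending to zero) is easy to demonstrate; the sample sizes, dimensions, and contrast measure below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n=1000):
    """(max - min) / min over distances from the origin to n uniform points in [0,1]^dim."""
    d = np.linalg.norm(rng.uniform(0, 1, size=(n, dim)), axis=1)
    return (d.max() - d.min()) / d.min()

low, high = distance_contrast(2), distance_contrast(1000)
print(low, high)  # the contrast in 1000 dimensions is far smaller than in 2
```

In high dimension all the points look almost equidistant from the query, which is exactly why nearest-neighbour-style methods degrade there.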