Machine Learning: Basics
Applied Machine Learning: Unit 1, Lecture 1
Anantharaman Narayana Iyer
narayana dot Anantharaman at gmail dot com
9th Jan 2016
Types of Learning Algorithms
• Supervised
  • Given a set of pairs (x, y), where y is a label (or class) and x is an observation, discover a function that assigns the correct labels to the x's.
• Unsupervised
  • The data is unlabelled. We need to explore the data to discover the intrinsic structures in it.
• Semi-supervised
  • Part of the data is labelled while the rest is unlabelled. The labelled data is used to bootstrap. For example, deep learning architectures leverage the vast amount of unlabelled data available over the web and use a small quantity of labelled data for fine-tuning.
• Reinforcement
  • Reinforcement learning (RL) is learning by interacting with an environment. An RL agent learns from the consequences of its actions rather than from being explicitly taught, and it selects its actions on the basis of past experience (exploitation) as well as by making new choices (exploration), which is essentially trial-and-error learning.
Supervised Learning
[Figure: scatter plot of training points in (x1, x2) space, with Class = True and Class = False points and candidate linear decision boundaries L1 through L5]
Key Concepts
• Supervised learning is a technique where the classifier is trained using training examples.
• The training examples contain the input attributes (features) and the expected outputs.
  • In the figure, x1 and x2 are the features.
• The input is typically an n-dimensional vector, and the output may have one or more dimensions.
• A binary classifier classifies the input vector into one of two classes.
  • Illustrated by the red and purple boxes in the figure.
• A linearly separable system is one where the class labels can be separated by a linear decision boundary (see the small sketch below).
  • The straight lines L1, L2, L3, L4, L5 show different decision boundaries in the figure.
• The example in the figure is a 2-dimensional linearly separable system. It can be generalized to an n-dimensional system, where the decision surface is called a hyperplane.
• Each decision surface can be considered a hypothesis.
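As a tiny illustration (an assumed example, not from the slides), a linear decision boundary in the (x1, x2) plane can be written as a*x1 + b*x2 + c = 0, and a point is classified by which side of the line it falls on:

def classify(x1, x2, a=1.0, b=-1.0, c=0.0):
    """Label a 2-D point by its side of the line a*x1 + b*x2 + c = 0 (illustrative boundary)."""
    return True if a * x1 + b * x2 + c > 0 else False

print(classify(3.0, 1.0))  # True: falls on one side of the boundary x1 = x2
print(classify(1.0, 3.0))  # False: falls on the other side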
Unsupervised Learning
[Figure: scatter plot of unlabelled points in (x1, x2) space, grouped into Cluster 1 and Cluster 2]
Key Concepts
• Unsupervised techniques do not require the expected outputs to be specified in the dataset.
• This is an advantage, since labelled data is scarce relative to the vast amount of data available on the web and in other media.
• Clustering is one of the machine learning algorithms that belongs to the category of unsupervised learning.
• In the figure, the system finds inputs that can be logically grouped together as a cluster. The example shows two such clusters (a minimal clustering sketch follows).
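For illustration, the sketch below groups a handful of unlabelled 2-D points into two clusters with k-means. The use of scikit-learn and the toy points are assumptions for the example, not part of the lecture.

import numpy as np
from sklearn.cluster import KMeans

# Toy unlabelled data: two loose groups of points in (x1, x2) space
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [5.0, 5.2], [5.3, 4.8], [4.9, 5.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster id (0 or 1) assigned to each point
print(kmeans.cluster_centers_)  # centroid of each discovered cluster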
Classification and Regression Problems
• The term regression refers to a system with a continuous variable as the output.
• Classification is a process by which the machine learning system partitions the input space into a discrete set of classes.
• Examples:
  • Credit card approval (approved / not approved decisions)
  • Credit line limit
  • Home loan approval
  • Sentiment polarity (positive, negative, neutral)
  • Sentiment as a real number: -1 <= sentiment <= 1
Notations
• m = number of training examples
• n = number of features in an input example
• x's = "input" variables / features
• y's = "output" variable / "target" variable
• The unknown target function f maps the input space to the outputs as:
  f: X -> Y
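In code terms (a hypothetical illustration, not from the slides), a training set with m examples and n features is commonly stored as an m x n matrix X together with a length-m vector y of targets:

import numpy as np

m, n = 4, 2                    # m training examples, n features each
X = np.random.rand(m, n)       # inputs: one n-dimensional feature vector per row
y = np.array([1, -1, 1, -1])   # outputs/targets: one label per training example

assert X.shape == (m, n) and y.shape == (m,)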
Problem Statement: ML Classifier
• Given a finite set of training examples and the space of all applicable hypotheses, select a hypothesis that best approximates the unknown target function.
  • The unknown target function f is the ideal function that characterizes the underlying pattern that generated the data.
  • Training examples are provided to the ML designer.
  • The output of this process is a hypothesis g that approximates f.
  • The hypothesis set and the learning algorithm together constitute the solution set.
Fig from: Yaser Abu-Mostafa, Caltech
Let’s begin: Perceptron Learning
• National cricket team selectors choose the members of the team and thus play a key role in its performance.
• Suppose we want to build a system that acts as a “virtual selector” by selecting (or rejecting) a player, given data on his past performances.
• Let us consider a selector who looks at only 2 input variables: batting average and bowling average.
• Here, the features are: x1 = batting average, x2 = bowling average.
• Let us use the Perceptron Learning Algorithm (PLA) for this purpose.
Example data
PLAYER BATTING AVERAGE BOWLING AVERAGE SELECTED
Shikhar Dhawan 45.46 -1 Yes
Rohit Sharma 37.89 60.37 Yes
Ajinkya Rahane 29.28 -1 Yes
Virat Kohli 52.61 145.5 Yes
Suresh Raina 35.82 48 Yes
Ambati Rayudu 60 53 Yes
Kedar Jadhav 20 -1 No
Manoj Tiwary 31.62 28.8 No
Manish Pandey -1 -1 No
Murali Vijay 19.46 -1 No
MS Dhoni 52.85 31 Yes
Wriddhiman Saha 13.66 -1 No
Robin Uthappa 26.96 -1 No
Sanju Samson -1 -1 No
Ravindra Jadeja 34.51 32.29 Yes
Akshar Patel 20 20.28 Yes
Stuart Binny 13.33 13 Yes
Parvez Rasool -1 30 Yes
R Ashwin 16.91 32.46 Yes
Karn Sharma -1 -1 No
Amit Mishra 4.8 23.95 No
Kuldeep Yadav -1 -1 No
Ishant Sharma 5.14 31.25 Yes
Bhuvneshwar Kumar 10.4 36.59 Yes
Mohammed Shami 9.12 26.08 Yes
Umesh Yadav 14.66 35.93 Yes
Varun Aaron 8 38.09 No
Dhawal Kulkarni -1 23 No
Mohit Sharma -1 58 No
Ashok Dinda 4.2 51 No
Visualization of team performance
[Figure: scatter plot of each player's batting average (x-axis, roughly 0 to 70) against bowling average (y-axis, roughly 0 to 120), with each point labelled by its (batting average, bowling average) values]
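A minimal sketch of how such a plot could be produced (assuming the player data has been loaded into two parallel lists; matplotlib is an assumed choice, not specified in the lecture, and only a subset of players is shown):

import matplotlib.pyplot as plt

# Hypothetical parallel lists taken from a few rows of the player table above
batting = [37.89, 35.82, 60.0, 52.85, 34.51]   # x1: batting averages
bowling = [60.37, 48.0, 53.0, 31.0, 32.29]     # x2: bowling averages

plt.scatter(batting, bowling)
plt.xlabel("Batting average (x1)")
plt.ylabel("Bowling average (x2)")
plt.title("Visualization of team performance")
plt.show()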
PLA Model
x = (x1, x2), where x1, x2 are the features of a given data sample.

Select the player if Σ_{i=1..d} w_i x_i > threshold, else reject.

The above can be written as:
  h(x) = sign((Σ_{i=1..d} w_i x_i) - threshold)
  h(x) = sign((Σ_{i=1..d} w_i x_i) + w0), where w0 = -threshold

Let us introduce an artificial input x0 = 1:
  h(x) = sign(Σ_{i=0..d} w_i x_i)

In vector form: h(x) = sign(wTx)
Fig from: Yaser Abu-Mostafa, Caltech
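As a concrete sketch (illustrative only; the example weights are arbitrary assumptions) of the hypothesis h(x) = sign(wTx) with the artificial input x0 = 1:

import numpy as np

def h(w, x):
    """Perceptron hypothesis: prepend x0 = 1 and return the sign of the weighted sum."""
    x = np.concatenate(([1.0], x))           # artificial input x0 = 1
    return 1 if np.dot(w, x) > 0 else -1     # sign(wTx)

# Example: weights (w0, w1, w2) chosen arbitrarily for illustration
w = np.array([-30.0, 1.0, 0.5])
print(h(w, np.array([45.46, 60.37])))        # +1 => select, -1 => reject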
PLA Training
• The perceptron implements: h(x) = sign(wTx)
• The goal of training is to determine the model parameters wi, given the training data (x1, y1), (x2, y2), ..., (xn, yn).
  • Note: usually x is a vector, and y can be a real number or a vector itself.
• Training algorithm (see the sketch after this slide):
  • Initialize w to small random numbers.
  • Iterate t = 1, 2, ...
    • Pick a misclassified point: h(xn) ≠ yn
    • Update the weight vector: w ← w + yn·xn
• It can be shown that, for linearly separable data, the algorithm converges in a finite number of iterations.
• A learning rate α can be used to control the increments to the weight vector.
Fig from: Yaser Abu-Mostafa, Caltech
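A minimal sketch of this training loop (an assumed illustration; the random choice of misclassified point and the use of the learning rate alpha are not prescribed by the slides):

import numpy as np

def pla_train(X, y, max_iter=1000, alpha=1.0):
    """Perceptron Learning Algorithm on data X (m x d) with labels y in {-1, +1}."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend artificial input x0 = 1
    w = np.random.randn(X.shape[1]) * 0.01         # small random initial weights
    for _ in range(max_iter):
        preds = np.where(X @ w > 0, 1, -1)         # h(x) = sign(wTx) for every example
        wrong = np.where(preds != y)[0]            # indices of misclassified points
        if len(wrong) == 0:                        # converged: all points classified correctly
            break
        i = np.random.choice(wrong)                # pick one misclassified point
        w = w + alpha * y[i] * X[i]                # update rule: w <- w + y_n * x_n
    return w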
Representational Power of Perceptrons
• The equation of the decision hyperplane is w·x = 0.
• The space of candidate hypotheses is H = {w | w ∈ ℝ^(n+1)}.
• A perceptron represents a hyperplane decision surface in the n-dimensional space of data instances, where the hyperplane separates positive examples from negative ones.
• Not all sets of points in the input space can be separated by such a hyperplane. Those that can are called linearly separable.
• Perceptrons can be used to represent many Boolean functions.
  • E.g., taking logical 0 as 0 and logical 1 as +1, one way to represent a 2-input AND function is to set the weights w0 = -1.5, w1 = w2 = 1. We can design OR logic similarly by setting w0 = -0.3.
• Functions like XOR are not linearly separable and so cannot be represented by a single perceptron.
• The ability of perceptrons to represent AND, OR, NAND, NOR is important because complex Boolean functions can be built by combining them (see the sketch below).
[Figure: two-input perceptrons over inputs x0, x1, x2: an AND gate with w0 = -1.5, w1 = w2 = 1, and an OR gate with w0 = -0.5, w1 = w2 = 1]
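A quick illustrative check that these weights implement AND and OR (assumed example code, not from the slides):

def perceptron(w, x1, x2):
    """Two-input perceptron: output 1 if w0 + w1*x1 + w2*x2 > 0, else 0."""
    return 1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else 0

AND = [-1.5, 1.0, 1.0]   # weights from the text: w0 = -1.5, w1 = w2 = 1
OR  = [-0.3, 1.0, 1.0]   # w0 = -0.3 from the text (the figure's -0.5 works equally well)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron(AND, x1, x2), perceptron(OR, x1, x2))
# Prints the AND and OR truth tables: (0,0)->0,0  (0,1)->0,1  (1,0)->0,1  (1,1)->1,1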
Exercise
• Design a perceptron that can represent:
  • NAND
  • NOR
  • NOT
Exercise
• Implement the Perceptron Learning Algorithm to learn the given training dataset (cricket player data). Test the classifier using the test data provided and report the accuracy, computed as the percentage of correct classifications.
  • Set the maximum number of iterations to 1000, 10000, 100000.
  • Does the algorithm converge? If so, after how many iterations?
  • How many misclassified points do you get?
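For reporting accuracy, a small helper along these lines could be used (illustrative; it assumes the pla_train sketch above and a held-out test set Xtest, ytest with labels in {-1, +1}):

import numpy as np

def accuracy(w, Xtest, ytest):
    """Percentage of test points classified correctly by h(x) = sign(wTx)."""
    Xb = np.hstack([np.ones((Xtest.shape[0], 1)), Xtest])  # add artificial input x0 = 1
    preds = np.where(Xb @ w > 0, 1, -1)
    return 100.0 * np.mean(preds == ytest)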