
CSCE 633: Machine Learning

Lecture 1: Introduction and Probability

Texas A&M University

8-28-18


Welcome to CSCE 633!

• About this class
  • Who are we?
  • Syllabus
  • Text and other references

• Introduction to Machine Learning


Goals of this lecture

• Course structure

• Understanding the difference between supervised and unsupervised machine learning

• Review of matrix operations

• Understanding differences in notation

• Understanding differences in some supervised models


Who are we?

• Instructor
  • Bobak Mortazavi
  • [email protected] - please put [CSCE 633] in the subject line
  • Office Hours: W 10-11a, R 11-12p, 328A HRBB

• TA
  • Xiaohan Chen
  • [email protected]
  • Office Hours: M 10:15-11:15a, W 4:30-5:30p, 320 HRBB

• Discussion Board on eCampus


Syllabus, Schedule, and eCampus


Text Book

Textbook for this course: Machine Learning: A Probabilistic Perspective (Murphy)


Other useful references


Assignments

• Homework will be compiled in LaTeX - you will turn in PDFs

• Programming for homework will be in R or Python

• For projects, if you need to use a different language, please explain why

• Project teams will be 2-3 people per team

• Project proposals will be a 2-page written document

• The project report will be an 8-page IEEE conference-style report


Kahoot

• kahoot.com

• smartphone applications

• ability to interact in class!

• not graded


Topics

• About this class
• Introduction to Machine Learning
  • What is machine learning?
  • Models and accuracy
  • Some general supervised learning


What is machine learning?

Big Data! 40 billion web pages, 100 hours of video on YouTube every minute, genomes of 1000s of people.

Murphy defines machine learning as a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty (such as planning how to collect more data!).

Some cool examples!

• Power P, Hobbs J, Ruiz H, Wei X, Lucey P. "Mythbusting Set-Pieces in Soccer." MIT Sloan Sports Analytics Conference, 2018.
• Waymo.com/tech

What is machine learning?

Statistical learning (machine learning) is a vast set of tools to understand data.

A brief review of linear algebra

$$X = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix}$$

where X is n × p (sometimes N × p, or N × D in the textbook).

$X_j$ is column j, so we can re-write X in terms of its columns or its rows:

$$X = (X_1, X_2, \cdots, X_p) \quad \text{or} \quad X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix}$$

where $x_i^T$ is row i (the i-th observation).

Matrix Multiplication

Suppose $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times s}$; then $(AB)_{ij} = \sum_{k=1}^{d} a_{ik} b_{kj}$. For example, consider:

$$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix},$$

then

$$AB = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}\begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} = \begin{pmatrix} 1 \times 5 + 2 \times 7 & 1 \times 6 + 2 \times 8 \\ 3 \times 5 + 4 \times 7 & 3 \times 6 + 4 \times 8 \end{pmatrix} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}$$
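A quick check of this worked example in NumPy (a minimal sketch; it assumes NumPy is available, as the course allows Python for homework):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# (AB)_ij = sum_k A_ik * B_kj, computed explicitly and with the built-in operator
AB_manual = np.array([[sum(A[i, k] * B[k, j] for k in range(A.shape[1]))
                       for j in range(B.shape[1])]
                      for i in range(A.shape[0])])
AB = A @ B

print(AB)                              # [[19 22] [43 50]]
print(np.array_equal(AB, AB_manual))   # True
```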


Vector Norms

• If $x \in \mathbb{R}^{p \times 1}$, what is the "size" of this vector?
• A norm is a mapping from $\mathbb{R}^p$ to $\mathbb{R}$ with the following properties:
  • $\|x\| \ge 0$ for all x
  • $\|x\| = 0 \leftrightarrow x = 0$
  • $\|\alpha x\| = |\alpha| \, \|x\|$, $\alpha \in \mathbb{R}$
  • $\|x + y\| \le \|x\| + \|y\|$ for all x and y
• 1-norm: $\|x\|_1 = \sum_{i=1}^{p} |x_i|$
• 2-norm: $\|x\|_2 = \sqrt{\sum_{i=1}^{p} x_i^2}$
• max-norm: $\|x\|_\infty = \max_{1 \le i \le p} |x_i|$
• $L_p$-norm (or p-norm): $\|x\|_p = \left( \sum_{i=1}^{p} |x_i|^p \right)^{1/p}$ (see the sketch below)
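To make these concrete, a small NumPy sketch (the example vector is made up):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])           # example vector, p = 3

one_norm = np.sum(np.abs(x))             # ||x||_1   = 8
two_norm = np.sqrt(np.sum(x ** 2))       # ||x||_2   ≈ 5.099
max_norm = np.max(np.abs(x))             # ||x||_inf = 4

def lp_norm(x, p):
    """General p-norm: (sum_i |x_i|^p)^(1/p)."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

# these agree with np.linalg.norm(x, ord=p)
print(one_norm, two_norm, max_norm, lp_norm(x, 3))
```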



What is machine learning?

Statistical learning (machine learning) is a vast set of tools to understand data.

• Supervised Modeling - where we aim to predict or estimate an output based upon our input data

• Unsupervised Modeling - where we aim to learn relationships and structure in input data without the guidance of output data

• Reinforcement Learning - where we aim to learn optimal rules, by rewarding the modeling method when it correctly satisfies some goal


Advertising Example

Assume we are trying to improve sales of a product through advertising in three different markets: TV, newspapers, and online/mobile. We define the advertising budget as:

• input variables X: consist of TV budget $X_1$, newspaper budget $X_2$, and mobile budget $X_3$ (also known as predictors, independent variables, features, covariates, or just variables)
• output variable Y: sales numbers (also known as the response variable, or dependent variable)
• We can generalize the input predictors to $X_1, X_2, \cdots, X_p = X$
• The model for the relationship is then: $Y = f(X) + \varepsilon$
• f is the systematic information; ε is the random error


Why estimate f?

• Prediction - in many situations, X is available to us but Y is not; we would like to estimate Y as accurately as possible.

• Inference - we would like to understand how Y is affected by changes in X.


Predicting f?

Create an estimate $\hat{f}(X)$ that generates a $\hat{Y}$ that estimates Y as closely as possible by attempting to minimize the reducible error, accepting that there is some irreducible error, generated by ε, that is independent of X.

Why is there irreducible error?


Estimating f

Assume we have an estimate $\hat{f}$ and a set of predictors X, such that $\hat{Y} = \hat{f}(X)$. Then we can calculate the expected value of the average squared error as:

$$E(Y - \hat{Y})^2 = E[f(X) + \varepsilon - \hat{f}(X)]^2 = [f(X) - \hat{f}(X)]^2 + \mathrm{Var}(\varepsilon)$$

The focus of this class is to learn techniques for estimating f that minimize the reducible error.
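To make the decomposition concrete, here is a small simulated sketch (the true f, noise level, and the deliberately biased estimate are all made-up choices): even a perfect estimate of f cannot push the expected squared error below Var(ε).

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: 2 * x + 1                 # the (in practice unknown) systematic part
sigma2 = 0.25                           # Var(eps), the irreducible error

x = rng.uniform(0, 1, 100_000)
y = f(x) + rng.normal(0, np.sqrt(sigma2), x.size)

f_hat_perfect = f                       # an estimate with no reducible error
f_hat_biased = lambda x: 1.5 * x + 1    # a deliberately wrong estimate

print(np.mean((y - f_hat_perfect(x)) ** 2))  # ≈ 0.25, i.e. Var(eps) only
print(np.mean((y - f_hat_biased(x)) ** 2))   # ≈ 0.25 plus the average of (f(x) - f_hat(x))^2
```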


How do we estimate f?

• Need: training data to learn from
• Training data $D = \{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$, where $x_i = (x_{i1}, x_{i2}, \cdots, x_{ip})^T$, for n subjects and p dimensions
• Note: the text uses the notation N subjects and D dimensions
• y can be:
  • categorical or nominal → the problem we solve is classification or pattern recognition
  • continuous → the problem we solve is regression
  • categorical with order (grades A, B, C, D, F) → the problem we solve is ordinal regression


Parametric Methods

• A two-step, model-based approach to estimating f
• Step 1 - make an assumption about the functional form of f; for example, that f is linear:
  • $f(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$
  • Advantage - we only have to estimate p + 1 coefficients to model f
• Step 2 - fit or train a model/procedure to estimate β
  • The most common method for this is Ordinary Least Squares (OLS), as sketched below.
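A sketch of the two steps on synthetic data (the data-generating coefficients and noise level are invented for illustration; OLS itself is covered properly later in the course). The β estimates come from a single least-squares solve:

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic data: n = 200 subjects, p = 3 predictors, known true coefficients
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ beta_true + rng.normal(scale=0.3, size=n)

# Step 1: assume f(X) = beta_0 + beta_1 x_1 + ... + beta_p x_p
# Step 2: estimate beta by ordinary least squares
X_design = np.column_stack([np.ones(n), X])            # prepend intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_hat)   # ≈ [4.0, 2.0, -1.0, 0.5]
```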


Parametric vs. Non-parametric


Flexibility vs. Interpretability


Model Accuracy

• Find some measure of error
• $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \hat{f}(x_i)\big)^2$
• Methods are trained to reduce MSE on the training data D - is that enough?
• Why do we care about previously seen data? Instead, we care about unseen test data, to understand how well the model predicts.
• So how do we select methods that reduce MSE on test data? (See the sketch below.)
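A small illustration of this point (synthetic data; polynomial fits of increasing degree stand in for methods of increasing flexibility): training MSE keeps dropping as flexibility grows, while test MSE does not.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)   # true f plus noise
    return x, y

x_train, y_train = make_data(50)
x_test, y_test = make_data(1000)

for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)        # fit on training data only
    mse_train = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    mse_test = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
```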


Bias Variance Tradeoff

• Variance - the amount $\hat{f}$ would change with different training data
• Bias - the error introduced by approximating f with a simpler model
• $E(y - \hat{f}(x))^2 = \mathrm{Var}(\hat{f}(x)) + [\mathrm{Bias}(\hat{f}(x))]^2 + \mathrm{Var}(\varepsilon)$
• Classification Error Rate $= \frac{1}{n} \sum_{i=1}^{n} I(y_i \ne \hat{y}_i)$
• Test error is minimized by understanding the conditional probability $p(Y = j \mid X = x_0)$
• Bayes Error Rate: $1 - E\big(\max_j p(Y = j \mid X)\big)$
• K-Nearest Neighbors estimates this conditional probability: $p(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)$
• More on this later! (A simulation of the tradeoff is sketched below.)
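A hedged sketch of the tradeoff (synthetic data; polynomial degree stands in for model flexibility): refitting the same model on many fresh training sets and looking at one test point shows variance growing and squared bias shrinking as flexibility increases.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(3 * x)        # true regression function
sigma = 0.3                        # noise std, so Var(eps) = 0.09
x0 = 0.5                           # a fixed test point

for degree in (1, 3, 9):
    preds = []
    for _ in range(500):                             # many independent training sets
        x = rng.uniform(-1, 1, 30)
        y = f(x) + rng.normal(scale=sigma, size=30)
        preds.append(np.polyval(np.polyfit(x, y, degree), x0))
    preds = np.array(preds)
    variance = preds.var()
    bias_sq = (preds.mean() - f(x0)) ** 2
    print(f"degree {degree}: variance {variance:.3f}, bias^2 {bias_sq:.3f}, "
          f"var + bias^2 + Var(eps) = {variance + bias_sq + sigma**2:.3f}")
```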


Key Challenge: Avoid Overfitting


U-Shape Tradeoff


Multiclass Classification

• $y_i \in \{1, 2, \cdots, C\}$, where, if C = 2, $y \in \{0, 1\}$, and if C > 2 we have a multi-class classification problem
• Maximum a Posteriori (MAP) estimate - find the most probable label (sketched below):
• $\hat{y} = \hat{f}(x) = \mathrm{argmax}_{c=1}^{C}\, p(y = c \mid x, D)$
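A tiny sketch of the MAP decision rule (the posterior probabilities below are made up; in practice a fitted model would produce $p(y = c \mid x, D)$):

```python
import numpy as np

# hypothetical posterior probabilities p(y = c | x, D) for C = 4 classes
class_probs = np.array([0.10, 0.55, 0.30, 0.05])

y_hat = int(np.argmax(class_probs)) + 1   # classes labeled 1..C as on the slide
print(y_hat)                              # 2, the most probable label
```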


Another Example: Advanced Notation

• Suppose you want to model the cost of a used car

• inputs: x are the car attributes (brand, mileage, year, etc.)

• output: y, the price of the car

• learn: model parameters w (the β from prior slides)

• Deterministic linear model: $y = f(x \mid w) = w^T x$

• Deterministic non-linear model: $y = f(x \mid w) = w^T \phi(x)$

• Non-deterministic non-linear model: $y = f(x \mid w) = w^T \phi(x) + \varepsilon$, where $\varepsilon \sim N(\mu, \sigma^2)$ (all three forms are sketched below)
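The three model forms side by side, as a sketch (the weights, the basis function φ, and the input encoding are all made-up examples):

```python
import numpy as np

w = np.array([3.0, -0.5])                   # example parameters
phi = lambda x: np.array([1.0, x[0] ** 2])  # example basis function phi(x)

def f_linear(x, w):            # deterministic linear:      y = w^T x
    return w @ x

def f_nonlinear(x, w):         # deterministic non-linear:  y = w^T phi(x)
    return w @ phi(x)

def f_noisy(x, w, sigma=0.1, rng=np.random.default_rng(4)):
    # non-deterministic non-linear: y = w^T phi(x) + eps, eps ~ N(0, sigma^2)
    return w @ phi(x) + rng.normal(0.0, sigma)

x = np.array([2.0, 1.0])       # one hypothetical car, encoded as two features
print(f_linear(x, w), f_nonlinear(x, w), f_noisy(x, w))
```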


Another Example: Linear Regression

• Parameters: $\theta = (w, \sigma^2)$

• Clustering - model cars by $p(x_i \mid \theta)$ and see if there is a natural grouping of cars

• $\mu(x) = w_0 + w_1 x = w^T x$

• Linear regression: $p(y \mid x, \theta) = N(y \mid w^T \phi(x), \sigma^2)$ - the basis function expansion with a Gaussian distribution (sketched below)
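A minimal sketch of this probabilistic view (assumes SciPy is available for the Gaussian density; the values of w, σ, and φ are illustrative):

```python
import numpy as np
from scipy.stats import norm

w = np.array([1.0, 2.0])                 # (w0, w1)
sigma = 0.5
phi = lambda x: np.array([1.0, x])       # basis expansion: intercept + x

def likelihood(y, x):
    """p(y | x, theta) = N(y | w^T phi(x), sigma^2)."""
    return norm.pdf(y, loc=w @ phi(x), scale=sigma)

print(likelihood(y=3.1, x=1.0))   # large density: y is near the mean w0 + w1*x = 3.0
print(likelihood(y=6.0, x=1.0))   # much smaller: y is far from the mean
```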


Logistic Regression

• Assume $y \in \{0, 1\}$; then a Bernoulli distribution is more appropriate than a Gaussian
• $p(y \mid x, w) = \mathrm{Ber}(y \mid \mu(x))$, where $\mu(x) = E[y \mid x] = p(y = 1 \mid x)$
• We require $0 \le \mu(x) \le 1$, which we can obtain, for example, with a sigmoid function:
• $\mu(x) = \mathrm{sigm}(w^T x) = \frac{1}{1 + \exp(-w^T x)}$ (sometimes called the logistic or logit function)
• Then we can classify by a decision rule such as $\hat{y} = 1 \leftrightarrow p(y = 1 \mid x) > 0.5$ (sketched below)
• In the β notation of earlier slides: $p(y \mid x, \beta) = \mathrm{sigm}(\beta^T x) = \frac{1}{1 + \exp(-\beta^T x)}$
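A short sketch of the sigmoid and the 0.5 decision rule (the weights here are arbitrary; fitting them, e.g. by maximum likelihood, comes later in the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0, 2.0])        # example weights (first entry acts as the bias)

def predict_proba(x):
    """mu(x) = p(y = 1 | x) = sigm(w^T x), with x[0] = 1 as the bias term."""
    return sigmoid(w @ x)

def predict(x):
    return int(predict_proba(x) > 0.5)   # y_hat = 1 iff p(y = 1 | x) > 0.5

x = np.array([1.0, 0.3, 0.8])
print(predict_proba(x), predict(x))      # ≈ 0.86, 1
```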


K Nearest Neighbor

• $p(y = c \mid x, D, k) = \frac{1}{k} \sum_{i \in N_k(x, D)} I(y_i = c)$
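A small sketch of this estimate on toy 2-D data (Euclidean distance; in practice one would typically use a library implementation such as scikit-learn):

```python
import numpy as np

# toy training set D: five points with class labels
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_train = np.array([0, 0, 1, 1, 1])

def knn_proba(x, c, k=3):
    """p(y = c | x, D, k): fraction of the k nearest neighbors of x with label c."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k nearest neighbors
    return np.mean(y_train[nearest] == c)

x_query = np.array([0.8, 0.9])
print(knn_proba(x_query, c=1))   # 1.0 - all three nearest neighbors are class 1
print(knn_proba(x_query, c=0))   # 0.0
```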


Curse of Dimensionality


No Free Lunch Theorem

• "All models are wrong but some models are useful" - G. Box, 1987

• No single ML system works for everything

• Principal Component Analysis is one way to address the curse of dimensionality

• Machine learning is not magic: it can't get something out of nothing, but it can get more from less!


Takeaways and Next Time

• What is the difference between supervised and unsupervised learning?

• What are some basics of supervised learning?

• Difference between parametric and non-parametric modeling

• Next Time: Probability, K-Nearest Neighbor, and Naive BayesClassification

• Sources: Machine Learning: A Probabilistic Perspective (Murphy), and An Introduction to Statistical Learning (James et al.)

B. Mortazavi, CSE. Copyright 2018.