
CSCE 633: Machine Learning

Lecture 1: Introduction and Probability

Texas A&M University

8-28-18


Welcome to CSCE 633!

• About this class
  • Who are we?
  • Syllabus
  • Text and other references

• Introduction to Machine Learning


Goals of this lecture

• Course structure

• Understanding the difference between supervised and unsupervised machine learning

• Review of matrix operations

• Understanding differences in notation

• Understanding differences in some supervised models


Who are we?

• Instructor
  • Bobak Mortazavi
  • [email protected] - please put [CSCE 633] in the subject line
  • Office Hours: W 10-11a, R 11-12p, 328A HRBB

• TA
  • Xiaohan Chen
  • [email protected]
  • Office Hours: M 10:15-11:15a, W 4:30-5:30p, 320 HRBB

• Discussion Board on eCampus


Syllabus, Schedule, and eCampus


Text Book

Textbook for this course: Machine Learning: A Probabilistic Perspective (Murphy)


Other useful references


Assignments

• Homework will be compiled in LaTeX - you will turn in PDFs

• Programming for homework will be in R or Python

• For projects, if you need to use a different language, please explain why

• Project teams will be 2-3 people per team

• Project proposals will be a 2-page written document

• The project report will be an 8-page IEEE conference-style report


Kahoot

• kahoot.com

• smartphone applications

• ability to interact in class!

• not graded


Topics

• About this class
• Introduction to Machine Learning
  • What is machine learning?
  • Models and accuracy
  • Some general supervised learning


What is machine learning?

Big Data! 40 billion web pages, 100 hours of video on YouTube every minute, genomes of 1000s of people.

Murphy defines machine learning as a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty (such as planning how to collect more data!).

Some cool examples!

• Power P, Hobbs J, Ruiz H, Wei X, Lucey P. "Mythbusting Set-Pieces in Soccer." MIT Sloan Sports Analytics Conference, 2018.
• Waymo.com/tech

What is machine learning?

Statistical learning (machine learning) is a vast set of tools to understand data.

A brief review of linear algebra

$$X = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix}$$

where X is n × p (sometimes N × p, or N × D in the textbook).

$X_j$ is column j, so we can re-write X in terms of its columns or its rows:

$$X = (X_1, X_2, \cdots, X_p) \quad \text{or} \quad X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix}$$

where $x_i^T$ is row i (the i-th observation).

Matrix Multiplication

Suppose $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times s}$; then $(AB)_{ij} = \sum_{k=1}^{d} a_{ik} b_{kj}$. For example, consider:

$$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix},$$

then

$$AB = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}\begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} = \begin{pmatrix} 1 \times 5 + 2 \times 7 & 1 \times 6 + 2 \times 8 \\ 3 \times 5 + 4 \times 7 & 3 \times 6 + 4 \times 8 \end{pmatrix} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}$$
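A quick check of this worked example in NumPy (a minimal sketch; it assumes NumPy is available, as the course allows Python for homework):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# (AB)_ij = sum_k A_ik * B_kj, computed explicitly and with the built-in operator
AB_manual = np.array([[sum(A[i, k] * B[k, j] for k in range(A.shape[1]))
                       for j in range(B.shape[1])]
                      for i in range(A.shape[0])])
AB = A @ B

print(AB)                              # [[19 22] [43 50]]
print(np.array_equal(AB, AB_manual))   # True
```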


Vector Norms

• If $x \in \mathbb{R}^{p \times 1}$, what is the "size" of this vector?
• A norm is a mapping from $\mathbb{R}^p$ to $\mathbb{R}$ with the following properties:
  • $\|x\| \ge 0$ for all x
  • $\|x\| = 0 \leftrightarrow x = 0$
  • $\|\alpha x\| = |\alpha| \, \|x\|$, $\alpha \in \mathbb{R}$
  • $\|x + y\| \le \|x\| + \|y\|$ for all x and y
• 1-norm: $\|x\|_1 = \sum_{i=1}^{p} |x_i|$
• 2-norm: $\|x\|_2 = \sqrt{\sum_{i=1}^{p} x_i^2}$
• max-norm: $\|x\|_\infty = \max_{1 \le i \le p} |x_i|$
• $L_p$-norm (or p-norm): $\|x\|_p = \left( \sum_{i=1}^{p} |x_i|^p \right)^{1/p}$ (see the sketch below)
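To make these concrete, a small NumPy sketch (the example vector is made up):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])           # example vector, p = 3

one_norm = np.sum(np.abs(x))             # ||x||_1   = 8
two_norm = np.sqrt(np.sum(x ** 2))       # ||x||_2   ≈ 5.099
max_norm = np.max(np.abs(x))             # ||x||_inf = 4

def lp_norm(x, p):
    """General p-norm: (sum_i |x_i|^p)^(1/p)."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

# these agree with np.linalg.norm(x, ord=p)
print(one_norm, two_norm, max_norm, lp_norm(x, 3))
```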



What is machine learning?

Statistical learning (machine learning) is a vast set of tools to understand data.

• Supervised Modeling - where we aim to predict or estimate an output based upon our input data

• Unsupervised Modeling - where we aim to learn relationships and structure in input data without the guidance of output data

• Reinforcement Learning - where we aim to learn optimal rules, by rewarding the modeling method when it correctly satisfies some goal


Advertising Example

Assume we are trying to improve sales of a product through advertising in three different markets: TV, newspapers, and online/mobile. We define the advertising budget as:

• input variables X: consist of TV budget $X_1$, newspaper budget $X_2$, and mobile budget $X_3$ (also known as predictors, independent variables, features, covariates, or just variables)
• output variable Y: sales numbers (also known as the response variable, or dependent variable)
• We can generalize the input predictors to $X_1, X_2, \cdots, X_p = X$
• The model for the relationship is then: $Y = f(X) + \varepsilon$
• f is the systematic information; ε is the random error


Why estimate f?

• Prediction - in many situations, X is available to us but Y is not; we would like to estimate Y as accurately as possible.

• Inference - we would like to understand how Y is affected by changes in X.


Predicting f?

Create an estimate $\hat{f}(X)$ that generates a $\hat{Y}$ that estimates Y as closely as possible by attempting to minimize the reducible error, accepting that there is some irreducible error, generated by ε, that is independent of X.

Why is there irreducible error?


Estimating f

Assume we have an estimate $\hat{f}$ and a set of predictors X, such that $\hat{Y} = \hat{f}(X)$. Then we can calculate the expected value of the average squared error as:

$$E(Y - \hat{Y})^2 = E[f(X) + \varepsilon - \hat{f}(X)]^2 = [f(X) - \hat{f}(X)]^2 + \mathrm{Var}(\varepsilon)$$

The focus of this class is to learn techniques for estimating f that minimize the reducible error.
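To make the decomposition concrete, here is a small simulated sketch (the true f, noise level, and the deliberately biased estimate are all made-up choices): even a perfect estimate of f cannot push the expected squared error below Var(ε).

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: 2 * x + 1                 # the (in practice unknown) systematic part
sigma2 = 0.25                           # Var(eps), the irreducible error

x = rng.uniform(0, 1, 100_000)
y = f(x) + rng.normal(0, np.sqrt(sigma2), x.size)

f_hat_perfect = f                       # an estimate with no reducible error
f_hat_biased = lambda x: 1.5 * x + 1    # a deliberately wrong estimate

print(np.mean((y - f_hat_perfect(x)) ** 2))  # ≈ 0.25, i.e. Var(eps) only
print(np.mean((y - f_hat_biased(x)) ** 2))   # ≈ 0.25 plus the average of (f(x) - f_hat(x))^2
```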


How do we estimate f?

• Need: training data to learn from
• Training data $D = \{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$, where $x_i = (x_{i1}, x_{i2}, \cdots, x_{ip})^T$, for n subjects and p dimensions
• Note: the text uses the notation N subjects and D dimensions
• y can be:
  • categorical or nominal → the problem we solve is classification or pattern recognition
  • continuous → the problem we solve is regression
  • categorical with order (grades A, B, C, D, F) → the problem we solve is ordinal regression


Parametric Methods

• A two-step, model-based approach to estimating f
• Step 1 - make an assumption about the functional form of f; for example, that f is linear:
  • $f(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$
  • Advantage - we only have to estimate p + 1 coefficients to model f
• Step 2 - fit or train a model/procedure to estimate β
  • The most common method for this is Ordinary Least Squares (OLS), as sketched below.
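A sketch of the two steps on synthetic data (the data-generating coefficients and noise level are invented for illustration; OLS itself is covered properly later in the course). The β estimates come from a single least-squares solve:

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic data: n = 200 subjects, p = 3 predictors, known true coefficients
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ beta_true + rng.normal(scale=0.3, size=n)

# Step 1: assume f(X) = beta_0 + beta_1 x_1 + ... + beta_p x_p
# Step 2: estimate beta by ordinary least squares
X_design = np.column_stack([np.ones(n), X])            # prepend intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_hat)   # ≈ [4.0, 2.0, -1.0, 0.5]
```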


Parametric vs. Non-parametric


Flexibility vs. Interpretability


Model Accuracy

• Find some measure of error
• $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \hat{f}(x_i)\big)^2$
• Methods are trained to reduce MSE on the training data D - is that enough?
• Why do we care about previously seen data? Instead, we care about unseen test data, to understand how well the model predicts.
• So how do we select methods that reduce MSE on test data? (See the sketch below.)
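A small illustration of this point (synthetic data; polynomial fits of increasing degree stand in for methods of increasing flexibility): training MSE keeps dropping as flexibility grows, while test MSE does not.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)   # true f plus noise
    return x, y

x_train, y_train = make_data(50)
x_test, y_test = make_data(1000)

for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)        # fit on training data only
    mse_train = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    mse_test = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
```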


Bias Variance Tradeoff

• Variance - the amount $\hat{f}$ would change with different training data
• Bias - the error introduced by approximating f with a simpler model
• $E(y - \hat{f}(x))^2 = \mathrm{Var}(\hat{f}(x)) + [\mathrm{Bias}(\hat{f}(x))]^2 + \mathrm{Var}(\varepsilon)$
• Classification Error Rate $= \frac{1}{n} \sum_{i=1}^{n} I(y_i \ne \hat{y}_i)$
• Test error is minimized by understanding the conditional probability $p(Y = j \mid X = x_0)$
• Bayes Error Rate: $1 - E\big(\max_j p(Y = j \mid X)\big)$
• K-Nearest Neighbors estimates this conditional probability: $p(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)$
• More on this later! (A simulation of the tradeoff is sketched below.)
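A hedged sketch of the tradeoff (synthetic data; polynomial degree stands in for model flexibility): refitting the same model on many fresh training sets and looking at one test point shows variance growing and squared bias shrinking as flexibility increases.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(3 * x)        # true regression function
sigma = 0.3                        # noise std, so Var(eps) = 0.09
x0 = 0.5                           # a fixed test point

for degree in (1, 3, 9):
    preds = []
    for _ in range(500):                             # many independent training sets
        x = rng.uniform(-1, 1, 30)
        y = f(x) + rng.normal(scale=sigma, size=30)
        preds.append(np.polyval(np.polyfit(x, y, degree), x0))
    preds = np.array(preds)
    variance = preds.var()
    bias_sq = (preds.mean() - f(x0)) ** 2
    print(f"degree {degree}: variance {variance:.3f}, bias^2 {bias_sq:.3f}, "
          f"var + bias^2 + Var(eps) = {variance + bias_sq + sigma**2:.3f}")
```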


Key Challenge: Avoid Overfitting


U-Shape Tradeoff


Multiclass Classification

• $y_i \in \{1, 2, \cdots, C\}$, where, if C = 2, $y \in \{0, 1\}$, and if C > 2 we have a multi-class classification problem
• Maximum a Posteriori (MAP) estimate - find the most probable label (sketched below):
• $\hat{y} = \hat{f}(x) = \mathrm{argmax}_{c=1}^{C}\, p(y = c \mid x, D)$
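A tiny sketch of the MAP decision rule (the posterior probabilities below are made up; in practice a fitted model would produce $p(y = c \mid x, D)$):

```python
import numpy as np

# hypothetical posterior probabilities p(y = c | x, D) for C = 4 classes
class_probs = np.array([0.10, 0.55, 0.30, 0.05])

y_hat = int(np.argmax(class_probs)) + 1   # classes labeled 1..C as on the slide
print(y_hat)                              # 2, the most probable label
```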


Another Example: Advanced Notation

• Suppose you want to model the cost of a used car

• inputs: x are the car attributes (brand, mileage, year, etc.)

• output: y, the price of the car

• learn: model parameters w (the β from prior slides)

• Deterministic linear model: $y = f(x \mid w) = w^T x$

• Deterministic non-linear model: $y = f(x \mid w) = w^T \phi(x)$

• Non-deterministic non-linear model: $y = f(x \mid w) = w^T \phi(x) + \varepsilon$, where $\varepsilon \sim N(\mu, \sigma^2)$ (all three forms are sketched below)
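The three model forms side by side, as a sketch (the weights, the basis function φ, and the input encoding are all made-up examples):

```python
import numpy as np

w = np.array([3.0, -0.5])                   # example parameters
phi = lambda x: np.array([1.0, x[0] ** 2])  # example basis function phi(x)

def f_linear(x, w):            # deterministic linear:      y = w^T x
    return w @ x

def f_nonlinear(x, w):         # deterministic non-linear:  y = w^T phi(x)
    return w @ phi(x)

def f_noisy(x, w, sigma=0.1, rng=np.random.default_rng(4)):
    # non-deterministic non-linear: y = w^T phi(x) + eps, eps ~ N(0, sigma^2)
    return w @ phi(x) + rng.normal(0.0, sigma)

x = np.array([2.0, 1.0])       # one hypothetical car, encoded as two features
print(f_linear(x, w), f_nonlinear(x, w), f_noisy(x, w))
```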


Another Example: Linear Regression

• Parameters: $\theta = (w, \sigma^2)$

• Clustering - model cars by $p(x_i \mid \theta)$ and see if there is a natural grouping of cars

• $\mu(x) = w_0 + w_1 x = w^T x$

• Linear regression: $p(y \mid x, \theta) = N(y \mid w^T \phi(x), \sigma^2)$ - the basis function expansion with a Gaussian distribution (sketched below)
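A minimal sketch of this probabilistic view (assumes SciPy is available for the Gaussian density; the values of w, σ, and φ are illustrative):

```python
import numpy as np
from scipy.stats import norm

w = np.array([1.0, 2.0])                 # (w0, w1)
sigma = 0.5
phi = lambda x: np.array([1.0, x])       # basis expansion: intercept + x

def likelihood(y, x):
    """p(y | x, theta) = N(y | w^T phi(x), sigma^2)."""
    return norm.pdf(y, loc=w @ phi(x), scale=sigma)

print(likelihood(y=3.1, x=1.0))   # large density: y is near the mean w0 + w1*x = 3.0
print(likelihood(y=6.0, x=1.0))   # much smaller: y is far from the mean
```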


Logistic Regression

• Assume $y \in \{0, 1\}$; then a Bernoulli distribution is more appropriate than a Gaussian
• $p(y \mid x, w) = \mathrm{Ber}(y \mid \mu(x))$, where $\mu(x) = E[y \mid x] = p(y = 1 \mid x)$
• We require $0 \le \mu(x) \le 1$, which we can obtain, for example, with a sigmoid function:
• $\mu(x) = \mathrm{sigm}(w^T x) = \frac{1}{1 + \exp(-w^T x)}$ (sometimes called the logistic or logit function)
• Then we can classify by a decision rule such as $\hat{y} = 1 \leftrightarrow p(y = 1 \mid x) > 0.5$ (sketched below)
• In the β notation of earlier slides: $p(y \mid x, \beta) = \mathrm{sigm}(\beta^T x) = \frac{1}{1 + \exp(-\beta^T x)}$
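A short sketch of the sigmoid and the 0.5 decision rule (the weights here are arbitrary; fitting them, e.g. by maximum likelihood, comes later in the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0, 2.0])        # example weights (first entry acts as the bias)

def predict_proba(x):
    """mu(x) = p(y = 1 | x) = sigm(w^T x), with x[0] = 1 as the bias term."""
    return sigmoid(w @ x)

def predict(x):
    return int(predict_proba(x) > 0.5)   # y_hat = 1 iff p(y = 1 | x) > 0.5

x = np.array([1.0, 0.3, 0.8])
print(predict_proba(x), predict(x))      # ≈ 0.86, 1
```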


K Nearest Neighbor

• $p(y = c \mid x, D, k) = \frac{1}{k} \sum_{i \in N_k(x, D)} I(y_i = c)$
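A small sketch of this estimate on toy 2-D data (Euclidean distance; in practice one would typically use a library implementation such as scikit-learn):

```python
import numpy as np

# toy training set D: five points with class labels
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_train = np.array([0, 0, 1, 1, 1])

def knn_proba(x, c, k=3):
    """p(y = c | x, D, k): fraction of the k nearest neighbors of x with label c."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k nearest neighbors
    return np.mean(y_train[nearest] == c)

x_query = np.array([0.8, 0.9])
print(knn_proba(x_query, c=1))   # 1.0 - all three nearest neighbors are class 1
print(knn_proba(x_query, c=0))   # 0.0
```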


Curse of Dimensionality


No Free Lunch Theorem

• "All models are wrong but some models are useful" - G. Box, 1987

• No single ML system works for everything

• Principal Component Analysis is one way to address the curse of dimensionality

• Machine learning is not magic: it can't get something out of nothing, but it can get more from less!


Takeaways and Next Time

• What is the difference between supervised and unsupervised learning?

• What are some basics of supervised learning?

• Difference between parametric and non-parametric modeling

• Next Time: Probability, K-Nearest Neighbor, and Naive BayesClassification

• Sources: Machine Learning: A Probabilistic Perspective (Murphy), and An Introduction to Statistical Learning (James et al.)

B. Mortazavi, CSE. Copyright 2018.