
Bayesian Learning, Part 1 of (probably) 4

Reading: DH&S, Ch. 2.{1-5}, 3.{1-4}

Administrivia

•Finalizing the straw poll...

•Reminder: I’m out of town, Mar 1-3

Blatant advertisement!

2nd CS UNM Student Conference (CSUSC 2006)

March 3, 2006

http://www.cs.unm.edu/~csgsa/conference/

•See cool work going on in CS

•Learn what constitutes good research

•Support your fellow grad students

•Free food...

Reading #2

•Kolter, J.Z., & Maloof, M.A. (2005). “Using additive expert ensembles to cope with concept drift.” In Proceedings of the Twenty-second International Conference on Machine Learning (ICML-2005), 449–456. New York, NY: ACM Press.

•http://www.cs.georgetown.edu/~maloof/pubs/icml05.php

•Due: Tues, Mar 7

Yesterday, today, and...

•Last time:

•Finish up SVMs (informal notes)

•Discussion of R1

•This time:

•Intro to statistical/generative modeling

•The Bayesian viewpoint

•(maybe) Maximum likelihood (ML) estimation

•Next time:

•ML for sure...

ML trivia of the day...

•Which data mining techniques [have] you used in a successfully deployed application?

http://www.kdnuggets.com/

Assumptions

•“Assume makes an a** out of U and ME”...

•Bull****

•Assumptions are unavoidable

•It is not possible to have an assumption-free learning algorithm

•Must always have some assumption about how the data works

•Makes learning faster, more accurate, more robust

Example assumptions

•Decision tree:

•Axis orthogonality

•Impurity-based splitting

•Greedy search ok

•Accuracy (0/1 loss) objective function

Example assumptions

•k-NN:

•Distance function/metric

•Accuracy objective

•Data drawn from probability distribution

•k controls “smoothness” of prob. estimate

Example assumptions

•Linear discriminant (hyperplane classifier) via MSE:

•Data is linearly separable

•Squared-error cost

Example assumptions

•Support vector machines

•Data is (close to) linearly separable...

•... in some high-dimensional projection of input space

•Interesting nonlinearities can be captured by kernel functions

•Max margin objective function

Specifying assumptions

•Bayesian learning assumes:

•Data were generated by some stochastic process

•Can write down (some) mathematical form for that process

•CDF/PDF/PMF

•Mathematical form needs to be parameterized

•Have some “prior beliefs” about those params

Specifying assumptions

•Makes strong assumptions about form (distribution) of data

•Essentially, an attempt to make assumptions explicit and to divorce them from learning algorithm

•In practice, not a single learning algorithm, but a recipe for generating problem-specific algs.

•Will work well to the extent that these assumptions are right

Example

•F={height, weight}

•Ω={male, female}

•Q1: Any guesses about individual distributions of height/weight by class?

•What probability function (PDF)?

•Q2: What about the joint distribution?

•Q3: What about the means of each?

•Reasonable guess for the upper/lower bounds on the means?

Some actual data*

* Actual synthesized data, anyway...

General idea

•Find probability distribution that describes classes of data

•Find decision surface in terms of those probability distributions

H/W data as PDFs

Or, if you prefer...

General idea

•Find probability distribution that describes classes of data

•Find decision surface in terms of those probability distributions

•What would be a good rule?

Recall: Bayes optimality

•For 0/1 loss (accuracy), we showed that the optimal decision is (Lecture 7, Feb 7):

•Equivalently, it’s sometimes useful to use log odds ratio test:
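The equations from the original slides did not survive the transcript; assuming the standard forms for two classes ω1, ω2 with priors P(ωi) and class-conditional densities p(x|ωi), the rule is presumably:

\[
\text{decide } \omega_1 \iff P(\omega_1 \mid x) > P(\omega_2 \mid x)
\;\Longleftrightarrow\;
p(x \mid \omega_1)\,P(\omega_1) > p(x \mid \omega_2)\,P(\omega_2)
\]

and the corresponding log odds ratio test is

\[
\ln \frac{p(x \mid \omega_1)\,P(\omega_1)}{p(x \mid \omega_2)\,P(\omega_2)} \;>\; 0
\;\Longrightarrow\; \text{decide } \omega_1 .
\]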

Recall: Bayes optimality

•In pictures:
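As a concrete illustration of this recipe (fit a class-conditional density to each class, then classify with the log odds test), here is a minimal sketch in Python; the 1-d “height” numbers and the equal-prior assumption are made up for illustration and are not the lecture's synthesized dataset:

import numpy as np

# Hypothetical 1-d "height" samples (cm) for each class; purely illustrative.
heights_male   = np.array([178.0, 182.0, 169.0, 175.0, 185.0, 172.0])
heights_female = np.array([162.0, 158.0, 171.0, 165.0, 160.0, 168.0])

def fit_gaussian(x):
    # "Learning" here is just estimating the parameters (mean, variance).
    return x.mean(), x.var()

def log_gaussian_pdf(x, mu, var):
    # Log of the 1-d Gaussian density N(x; mu, var).
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mu) ** 2 / (2.0 * var)

mu_m, var_m = fit_gaussian(heights_male)
mu_f, var_f = fit_gaussian(heights_female)

def classify(x):
    # Log odds ratio test, assuming equal priors P(male) = P(female).
    log_odds = log_gaussian_pdf(x, mu_m, var_m) - log_gaussian_pdf(x, mu_f, var_f)
    return "male" if log_odds > 0 else "female"

print(classify(180.0))   # "male" under these made-up numbers
print(classify(161.0))   # "female"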

Bayesian learning process

•So where do the probability distributions come from?

•The art of Bayesian data modeling is:

•Deciding what probability models to use

•Figuring out how to find the parameters

•In Bayesian learning, the “learning” is (almost) all in finding the parameters

Back to the H/W data

•Gaussian (a.k.a. normal or bell curve) is a reasonable assumption for this data

•Other distributions better for other data

•Can make reasonable guesses about means

•Probably not -3 kg or 2 million lightyears

•Assumptions like these are called

•Model assumptions (Gaussian)

•Parameter priors (means)

•How do we incorporate these into learning?

Prior knowledge

5 minutes of math...

•Our friend the Gaussian distribution

•In 1 dimension (formula below):

•Mean: μ

•Std deviation: σ

•Both parameters scalar

•Usually, we talk about variance σ² rather than std dev
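The density itself is missing from the transcript; the standard 1-d Gaussian with mean μ and variance σ² is:

\[
p(x) \;=\; \frac{1}{\sqrt{2\pi}\,\sigma}\,
\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad \sigma^2 = \operatorname{Var}[x].
\]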

Gaussian: the pretty picture

Location parameter: μ

Scale parameter: σ

5 minutes of math...

•In d dimensions (formula below):

•Where:

•Mean vector: μ

•Covariance matrix: Σ

•Determinant of covariance: |Σ|
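The multivariate form is likewise missing from the transcript; the standard d-dimensional Gaussian with mean vector μ and covariance matrix Σ is:

\[
p(\mathbf{x}) \;=\; \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\,
\exp\!\left(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right).
\]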

Exercise:

•For the 1-d Gaussian:

•Given two classes, with means μ1 and μ2 and std devs σ1 and σ2

•Find a description of the decision point if the std devs are the same, but diff means

•And if means are the same, but std devs are diff

•For the d-dim Gaussian,

•What shapes are the isopotentials? Why?

•Repeat above exercise for d-dim Gaussian
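A sketch of the first part of the exercise (not from the slides, and assuming equal priors): setting the two class-conditional densities equal with σ1 = σ2 = σ and taking logs,

\[
\frac{(x-\mu_1)^2}{2\sigma^2} \;=\; \frac{(x-\mu_2)^2}{2\sigma^2}
\;\;\Longrightarrow\;\;
x^{*} \;=\; \frac{\mu_1 + \mu_2}{2},
\]

so the decision point is the midpoint of the two means. With equal means but σ1 ≠ σ2, the same calculation gives a threshold on (x − μ)², i.e., two decision points placed symmetrically about the shared mean. For the d-dimensional Gaussian, the density depends on x only through the quadratic form (x − μ)ᵀΣ⁻¹(x − μ), so the isopotentials are ellipsoids (spheres when Σ is a multiple of the identity).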