Bayesian learning, part 1 of (probably) 4
Reading: DH&S, ch. 2.{1-5}, 3.{1-4}
Post on 19-Dec-2015
Blatant advertisement!
2nd CS UNM Student Conference (CSUSC 2006)
March 3, 2006
http://www.cs.unm.edu/~csgsa/conference/
•See cool work going on in CS
•Learn what constitutes good research
•Support your fellow grad students
•Free food...
Reading #2
•Kolter, J.Z., & Maloof, M.A. (2005). “Using additive expert ensembles to cope with concept drift.” In Proceedings of the Twenty-Second International Conference on Machine Learning (ICML-2005), 449–456. New York, NY: ACM Press.
•http://www.cs.georgetown.edu/~maloof/pubs/icml05.php
•Due: Tues, Mar 7
Yesterday, today, and...
•Last time:
•Finish up SVMs (informal notes)
•Discussion of R1
•This time:
•Intro to statistical/generative modeling
•The Bayesian viewpoint
•(maybe) Maximum likelihood (ML) estimation
•Next time:
•ML for sure...
ML trivia of the day...
•Which data mining techniques [have] you used in a successfully deployed application?
http://www.kdnuggets.com/
Assumptions
•“Assume makes an a** out of U and ME”...
•Bull****
•Assumptions are unavoidable
•It is not possible to have an assumption-free learning algorithm
•Must always have some assumption about how the data works
•Makes learning faster, more accurate, more robust
Example assumptions
•Decision tree:
•Axis orthogonality
•Impurity-based splitting
•Greedy search ok
•Accuracy (0/1 loss) objective function
Example assumptions
•k-NN:
•Distance function/metric
•Accuracy objective
•Data drawn from probability distribution
•k controls “smoothness” of prob. estimate
Example assumptions
•Linear discriminant (hyperplane classifier) via MSE:
•Data is linearly separable
•Squared-error cost
Example assumptions
•Support vector machines
•Data is (close to) linearly separable...
•... in some high-dimensional projection of input space
•Interesting nonlinearities can be captured by kernel functions
•Max margin objective function
Specifying assumptions
•Bayesian learning assumes:
•Data were generated by some stochastic process
•Can write down (some) mathematical form for that process
•CDF/PDF/PMF
•Mathematical form needs to be parameterized
•Have some “prior beliefs” about those params
Specifying assumptions
•Makes strong assumptions about form (distribution) of data
•Essentially, an attempt to make assumptions explicit and to divorce them from learning algorithm
•In practice, not a single learning algorithm, but a recipe for generating problem-specific algs.
•Will work well to the extent that these assumptions are right
Example
•F = {height, weight}
•Ω={male, female}
•Q1: Any guesses about individual distributions of height/weight by class?
•What probability function (PDF)?
•Q2: What about the joint distribution?
•Q3: What about the means of each?
•Reasonable guess for the upper/lower bounds on the means?
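The generative viewpoint behind this example can be sketched in a few lines: pick a class, then draw (height, weight) from that class’s distribution. The specific means and standard deviations below are illustrative guesses only (the slides ask you to make your own), and assume class-conditional Gaussians with independent features:

```python
import random

# Hypothetical class-conditional parameters (illustrative guesses only):
# heights in cm, weights in kg, each as (mean, std dev).
PARAMS = {
    "male":   {"height": (178.0, 7.0), "weight": (84.0, 12.0)},
    "female": {"height": (164.0, 7.0), "weight": (68.0, 11.0)},
}

def sample_person(label, rng):
    """Draw one (height, weight) pair from the class-conditional Gaussians."""
    h_mu, h_sigma = PARAMS[label]["height"]
    w_mu, w_sigma = PARAMS[label]["weight"]
    return (rng.gauss(h_mu, h_sigma), rng.gauss(w_mu, w_sigma))

# Simulate the stochastic process: 100 labeled examples per class.
rng = random.Random(0)
data = [(sample_person(lbl, rng), lbl)
        for lbl in ("male", "female") for _ in range(100)]
```

Note that treating height and weight as independent within a class is itself a (questionable) model assumption; a full joint Gaussian with a covariance term would capture their correlation.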
General idea
•Find probability distribution that describes classes of data
•Find decision surface in terms of those probability distributions
•What would be a good rule?
Recall: Bayes optimality
•For 0/1 loss (accuracy), we showed that the optimal decision is (Lecture 7, Feb 7):
  ω* = argmax_{ω ∈ Ω} Pr[ω | x]
•Equivalently, it’s sometimes useful to use the log odds ratio test:
  choose ω1 if ln( Pr[ω1 | x] / Pr[ω2 | x] ) > 0, else ω2
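The log odds ratio test can be sketched for two 1-d Gaussian class-conditional densities (assuming equal priors unless `log_prior_ratio` says otherwise; the function names and parameter values here are illustrative, not from the slides). Working in log space avoids underflow for points far from both means:

```python
import math

def log_gauss_pdf(x, mu, sigma):
    """log N(x; mu, sigma^2), computed directly in log space for stability."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def classify(x, mu1, sigma1, mu2, sigma2, log_prior_ratio=0.0):
    """Return 1 if the log odds favor class 1, else 2 (Bayes rule, 0/1 loss).

    log_prior_ratio = ln(Pr[w1] / Pr[w2]); 0.0 means equal priors, in which
    case the posterior odds reduce to the likelihood ratio.
    """
    log_odds = (log_gauss_pdf(x, mu1, sigma1)
                - log_gauss_pdf(x, mu2, sigma2)
                + log_prior_ratio)
    return 1 if log_odds > 0 else 2
```

With equal variances the test picks whichever mean is closer, e.g. `classify(160, 164, 7, 178, 7)` returns 1 and `classify(180, 164, 7, 178, 7)` returns 2.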
Bayesian learning process
•So where do the probability distributions come from?
•The art of Bayesian data modeling is:
•Deciding what probability models to use
•Figuring out how to find the parameters
•In Bayesian learning, the “learning” is (almost) all in finding the parameters
Prior knowledge
•Gaussian (a.k.a. normal or bell curve) is a reasonable assumption for this data
•Other distributions better for other data
•Can make reasonable guesses about means
•Probably not -3 kg or 2 million lightyears
•Assumptions like these are called:
•Model assumptions (Gaussian)
•Parameter priors (means)
•How do we incorporate these into learning?
5 minutes of math...
•Our friend the Gaussian distribution
•In 1 dimension:
  p(x) = 1/(σ √(2π)) exp( −(x − μ)² / (2σ²) )
•Mean: μ
•Std deviation: σ
•Both parameters scalar
•Usually, we talk about variance rather than std dev: σ²
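The 1-d density is short enough to implement and sanity-check directly; a crude Riemann sum over ±10σ should come out very close to 1 (the parameter values below are arbitrary):

```python
import math

def gauss_pdf(x, mu, sigma):
    """1-d Gaussian density with mean mu and std deviation sigma."""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Sanity check: the density should integrate to ~1 (crude Riemann sum,
# step dx = 0.01 over roughly mu +/- 10 sigma).
mu, sigma = 0.0, 2.0
dx = 0.01
total = sum(gauss_pdf(mu + k * dx, mu, sigma) for k in range(-2000, 2000)) * dx
```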
5 minutes of math...
•In d dimensions:
  p(x) = 1/((2π)^(d/2) |Σ|^(1/2)) exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )
•Where:
•Mean vector: μ ∈ ℝ^d
•Covariance matrix: Σ (d × d, symmetric, positive definite)
•Determinant of covariance: |Σ|
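For the 2-d case the determinant and inverse have closed forms, so the density can be written out without a linear-algebra library. A minimal sketch (assuming a valid symmetric positive-definite covariance; no input validation):

```python
import math

def gauss_pdf_2d(x, mu, cov):
    """Density of a 2-d Gaussian at point x.

    cov is a 2x2 symmetric positive-definite matrix given as nested lists;
    its inverse and determinant are computed with the explicit 2x2 formulas.
    """
    a, b = cov[0]
    c, d = cov[1]
    det = a * d - b * c                       # |Sigma|
    inv = [[d / det, -b / det],
           [-c / det, a / det]]               # Sigma^{-1}
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Mahalanobis term: (x - mu)^T Sigma^{-1} (x - mu)
    maha = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    # Normalizer for d = 2: (2 pi)^{d/2} |Sigma|^{1/2} = 2 pi sqrt(det)
    return math.exp(-0.5 * maha) / (2 * math.pi * math.sqrt(det))
```

At x = μ the exponent vanishes, so the peak value is 1/(2π √|Σ|), which gives a quick correctness check.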
Exercise:
•For the 1-d Gaussian:
•Given two classes, with means μ1 and μ2 and std devs σ1 and σ2
•Find a description of the decision point if the std devs are the same, but diff means
•And if means are the same, but std devs are diff
•For the d-dim Gaussian,
•What shapes are the isopotentials? Why?
•Repeat above exercise for d-dim Gaussian
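One way to build intuition for the equal-variance case (without giving away the closed form) is to scan numerically for where the two class densities cross; the μ and σ values below are arbitrary illustrative choices, and equal priors are assumed:

```python
import math

def gauss_pdf(x, mu, sigma):
    """1-d Gaussian density with mean mu and std deviation sigma."""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Two classes with different means but the same std dev (equal priors assumed).
mu1, mu2, sigma = 164.0, 178.0, 7.0

# Brute-force scan for the x where the two densities are closest to equal.
xs = [150.0 + 0.001 * k for k in range(40000)]   # grid over [150, 190)
crossing = min(xs, key=lambda x: abs(gauss_pdf(x, mu1, sigma)
                                     - gauss_pdf(x, mu2, sigma)))
```

With these numbers the scan lands halfway between the means, which is a useful hint for deriving the general answer by setting the two densities equal and solving. Repeating the scan with unequal σ shows the crossing shifts toward the narrower class (and a second crossing appears).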