11/16: after sanity test


Post on 24-Feb-2016






11/16: After Sanity Test. Post-mortem Project presentations in the last 2-3 classes Start of Statistical Learning. Sanity Test. Max: 52.5 Min: 6 Avg : 24.6 Stdev : 14.8 Including those sitting-in: Avg : 24.42; Stdev : 13.8 70+: 0 60-70: 0 50-60: 2 40-50: 0 30-40: 3 - PowerPoint PPT Presentation



11/16: After Sanity Test
  Post-mortem
  Project presentations in the last 2-3 classes
  Start of Statistical Learning

Sanity Test
  Max: 52.5   Min: 6   Avg: 24.6   Stdev: 14.8
  Including those sitting in: Avg: 24.42; Stdev: 13.8

  Score range   Count
  70+           0
  60-70         0
  50-60         2
  40-50         0
  30-40         3
  20-30         4
  10-20         5
  0-10          2

  Students with low scores have to redo the test at home (with access to notes, the web, etc.). An eventual score of less than 45 will be viewed as failing on the content.

P(h_i) is called the hypothesis prior. There is nothing special about learning: it is just vanilla probabilistic inference.

Where is the hypothesis prior?

i.i.d. assumption

How did this prediction come about? Which hypothesis did we use?

The analogy with diagnosis

Medical diagnosis
  Given the symptoms of a patient, predict whether she will have other symptoms (such as death).
  We can try predicting directly from the symptoms (which is what we did before the advent of medicine).
  But we normally assume that diseases cause symptoms. Thus we want to first figure out the disease and then predict the other symptoms.
  Diseases have prior probabilities (in fact, the "ignored prior" fallacy is the main reason for internet-induced hypochondria).
  Given the symptoms, we compute the posterior on the diseases, and then use that to predict other symptoms.

Full Bayesian learning
  Given training data, predict test data.
  We can try predicting test data directly from the training data (e.g. k-NN).
  But we normally assume that hypotheses explain the data. Thus we want to first figure out the hypotheses causing the data and then use them to predict the test data.
  Hypotheses have prior probabilities (as to how likely they are, independent of the data being seen right now).
  Given the data, we compute the posterior on the hypotheses, and then use that to predict the test data.
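The posterior-then-predict recipe above can be sketched in a few lines of Python. This is only an illustration: the five hypotheses (each fixing the fraction of lime candies in a bag) and their prior values are assumptions introduced here, not numbers taken from the slides.

```python
# Hypothetical setup (an assumption, not from the slides): five hypotheses
# about the fraction of lime candies in a bag, with an assumed prior.
lime_fraction = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i), assumed
prior = [0.1, 0.2, 0.4, 0.2, 0.1]             # P(h_i), assumed


def posterior(observations, prior, lime_fraction):
    """Return P(h_i | d) for i.i.d. observations ('lime' or 'cherry')."""
    unnormalized = []
    for p_h, p_lime in zip(prior, lime_fraction):
        likelihood = 1.0
        for obs in observations:
            likelihood *= p_lime if obs == "lime" else (1.0 - p_lime)
        unnormalized.append(p_h * likelihood)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def predict_lime(observations, prior, lime_fraction):
    """Full Bayesian prediction: average every hypothesis's prediction,
    weighted by that hypothesis's posterior probability."""
    post = posterior(observations, prior, lime_fraction)
    return sum(p * f for p, f in zip(post, lime_fraction))


post_after_one = posterior(["lime"], prior, lime_fraction)
p_next_lime = predict_lime(["lime"], prior, lime_fraction)
print(post_after_one)
print(p_next_lime)
```

Note that no single hypothesis is chosen: the prediction is a posterior-weighted average over all of them, which is what distinguishes full Bayesian learning from picking one best hypothesis.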

How many

Why should P(h_i) be low for complex hypotheses?
  --connection to the MDL principle
  Equivalently, minimize -log P(d|h_i) - log P(h_i)
    -log P(h_i): bits required to specify h_i
    -log P(d|h_i): additional bits required to specify d
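As a worked step, the link between picking the most probable hypothesis and minimizing description length can be written out as follows. The symbol h_MAP and the choice of base-2 logarithms (so that the quantities read as bits) are notational assumptions made here.

```latex
\begin{align*}
h_{\mathrm{MAP}} &= \arg\max_{h_i} \; P(d \mid h_i)\, P(h_i) \\
                 &= \arg\min_{h_i} \; \bigl[ -\log_2 P(d \mid h_i) \;-\; \log_2 P(h_i) \bigr]
\end{align*}
```

The second line follows because the logarithm is monotonic and negation turns a maximization into a minimization. A complex hypothesis needs more bits to specify, which corresponds to a lower prior P(h_i).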

--because statisticians distrust priors (and want the data to speak for itself)
When will ML (maximum likelihood) hit a roadblock? Small data.
Should AI also distrust priors? Priors can encode background knowledge. (There is even evidence that the human brain uses priors.)

http://web.mit.edu/cocosci/Papers/significance.pdf

A technical headache with priors: what if the posterior keeps changing parametrically? Conjugate priors.

11/18

Density Estimation
  The general task of learning a probability model, given data that are assumed to be generated from that model.
  Given data D whose instances are made up of attributes x that are distributed according to P*(x), we want to learn an estimate P of P* such that the distance between P* and P is minimized.
  Distance between distributions is typically measured using KL divergence:
    D(P*||P) = E_P*[log (P*(x)/P(x))]
             = E_P*[log P*(x)] - E_P*[log P(x)]
  But alas, we don't know P*.
  The first term is constant and can be ignored when comparing two estimates P and P'.
  But how do we get the second term? Since the data instances are drawn from P*, a P that maximizes their log likelihood is the best.
  If the data are drawn i.i.d., then their joint likelihood is a product, and the log terms over all data instances can be summed.
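A small numeric sketch may help show why the first term of the KL decomposition can be ignored when comparing estimates. The distributions below are made-up values for illustration only, and P* is treated as known here purely so the two rankings can be compared.

```python
import math


def kl_divergence(p_star, p):
    """D(P*||P) = sum_x P*(x) * log(P*(x) / P(x)), skipping zero-mass outcomes."""
    return sum(ps * math.log(ps / p[x]) for x, ps in p_star.items() if ps > 0)


def expected_log_likelihood(p_star, p):
    """E_{P*}[log P(x)], the second term of the decomposition."""
    return sum(ps * math.log(p[x]) for x, ps in p_star.items() if ps > 0)


# Hypothetical distributions over two outcomes (illustrative values).
p_star = {"cherry": 0.7, "lime": 0.3}
estimate_a = {"cherry": 0.6, "lime": 0.4}
estimate_b = {"cherry": 0.2, "lime": 0.8}

kl_a = kl_divergence(p_star, estimate_a)
kl_b = kl_divergence(p_star, estimate_b)
ell_a = expected_log_likelihood(p_star, estimate_a)
ell_b = expected_log_likelihood(p_star, estimate_b)

# E_{P*}[log P*(x)] is the same for both estimates, so the difference in KL
# equals the negated difference in expected log likelihood.
print(kl_a - kl_b, ell_b - ell_a)
```

In practice P* is unknown, so the expectation is replaced by an average of log P(x) over the observed data instances, which is the log-likelihood criterion the slide arrives at.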

Bias-Variance Tradeoff (learning with bias vs. regularization)
  So, we want to get the distribution P that maximizes E_P*[log P(x)]. Learning is just optimization!
  But how do we select the candidate space of distributions (hypotheses)?
    [Bias problem] If the class of distributions we consider is too small/inflexible (highly biased), then the best we get may still be too far from P*.
    [Variance problem] If the class of distributions considered is too large/expressive, then small random fluctuations in the choice of data can radically change the properties of the model, thus exhibiting high variance on the test data.
  Standard solutions:
    Limit attention to a reasonable class of distributions (e.g. Naive Bayes).
    Allow a large class of distributions, but penalize the more complex ones (also called regularization).
    A combination of both.
  A class of distributions can be defined by restricting the class of graphical models (e.g. only naive Bayes models) or CPDs (only noisy-or or conditional Gaussians) allowed in the hypothesis space.

Generative vs. Discriminative Learning
  Often, we are really more interested in predicting only a subset of the attributes given the rest.
    E.g. we have data attributes split into subsets X and Y, and we are interested in predicting Y given the values of X.
  You can do this either by learning the joint distribution P(X, Y) [generative learning] or by learning just the conditional distribution P(Y|X) [discriminative learning].
  Often a given classification problem can be handled either generatively or discriminatively.
    E.g. Naive Bayes and Logistic Regression
  Which is better?

Generative vs. Discriminative

Generative Learning
  More general (after all, if you have P(Y, X) you can predict Y given X as well as do other inferences).
    You can predict jokes as well as make them up (or predict spam mails as well as generate them).
  In trying to learn P(Y, X), we are often forced to make many independence assumptions, both in Y and in X, and these may be wrong.
  Interestingly, this type of high bias can help generative techniques when there is too little data.
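The remark that a learned joint P(Y, X) also answers P(Y|X) queries can be sketched as follows. The joint table is a made-up toy example (a binary label and a single binary feature), and the function names are hypothetical.

```python
# Hypothetical joint distribution P(Y, X) over a label Y ("spam" / "ham")
# and one binary feature X ("offer" / "no_offer"). Values are made up.
joint = {
    ("spam", "offer"): 0.30,
    ("spam", "no_offer"): 0.10,
    ("ham", "offer"): 0.05,
    ("ham", "no_offer"): 0.55,
}


def marginal_x(joint, x):
    """P(X = x), obtained by summing the joint over Y."""
    return sum(p for (_, xv), p in joint.items() if xv == x)


def conditional_y_given_x(joint, y, x):
    """P(Y = y | X = x) = P(y, x) / P(x)."""
    return joint[(y, x)] / marginal_x(joint, x)


p_offer = marginal_x(joint, "offer")
p_spam_given_offer = conditional_y_given_x(joint, "spam", "offer")
print(p_offer, p_spam_given_offer)
```

A discriminative learner would estimate only the conditional table and never represent P(X) at all, which is why it cannot be used to generate new X values.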

Discriminative Learning
  More to the point (if what you want is P(Y|X), why bother with P(Y, X), which is after all P(Y|X) * P(X) and thus models the dependencies among the Xs as well?)
  Since we don't need to model dependencies among the Xs, we don't need to make any independence assumptions among them. So, we can merrily use highly correlated features.
  Interestingly, this freedom can hurt discriminative learners when there is too little data (as overfitting is easy).
  Bayes networks are not well suited for discriminative learning; Markov networks are.
    --thus Conditional Random Fields are basically MNs doing discriminative learning
    --Logistic regression can be seen as a simple CRF
  P(y)P(x|y) = P(y,x) = P(x)P(y|x)

Taxonomy of (statistical) Learning Tasks
  Model constraints
    Type of network being learned
      Bayes network vs. Markov network
    Topology given; CPTs to be learned
    Only relevant attributes are given; need to learn topology as well as CPTs
      The tricky part for MLE is that increasing the connectivity of a network cannot reduce the likelihood
    We don't know what the relevant attributes are
  Observability of data
    Complete data
      Each data instance gives the values of each of the attributes
    Incomplete data
      Some of the data instances might be missing the values for some of the attributes
    Hidden attributes (variables)
      None of the data instances have values for some of the attributes (which often correspond to intermediate concepts that help improve the sparsity of the network, e.g. syndromes that connect symptoms to diseases, or class variables in mixture models)

Sample complexity varies linearly with the number of parameters to be learned, and the number of parameters varies exponentially with the number of edges in the graphical model.

11/23
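To make the parameter-count claim concrete, here is a minimal sketch that counts the free parameters of one tabular CPT. It assumes discrete variables and full tables (no noisy-or or other compact CPD); under that assumption the count is exponential in the number of parents of a node, that is, in its incoming edges. The function name is hypothetical.

```python
def cpt_parameter_count(node_arity, parent_arities):
    """Number of independent parameters in a tabular CPT: one distribution
    (node_arity - 1 free values) for every joint setting of the parents."""
    parent_settings = 1
    for arity in parent_arities:
        parent_settings *= arity
    return (node_arity - 1) * parent_settings


# A binary node with 0, 1, 2, 3 and 4 binary parents.
counts = [cpt_parameter_count(2, [2] * k) for k in range(5)]
print(counts)
```

Each additional binary parent doubles the size of the table, so dense networks need far more data to estimate reliably than sparse ones.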

Steps in ML-based learning
  Write down an expression for the likelihood of the data as a function of the parameter(s)
    Assume the data are i.i.d.
  Write down the derivative of the log likelihood with respect to each parameter
  Find the parameter values such that the derivatives are zero
    There are two ways this step can become complex:
      Individual (partial) derivatives lead to non-linear functions (this depends on the type of distribution the parameters are controlling; binomial is a very easy case)
      Individual (partial) derivatives involve more than one parameter (thus leading to simultaneous equations)
    In general, we will need to use continuous function optimization techniques
      One idea is to use gradient descent to find the point where the derivative goes to zero. But for gradient descent to find the global optimum, we need to know for sure that the function we are optimizing has a single optimum. (This is why convex functions are important: if the negative log likelihood is a convex function, then gradient descent is guaranteed to find the global minimum.)
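The steps above can be sketched for the easy binomial case: a single parameter theta, with c successes observed in N i.i.d. trials. The closed-form answer and a plain gradient-ascent loop are shown side by side. The step size, iteration count, and starting point are arbitrary choices made for this illustration.

```python
def log_likelihood_gradient(theta, successes, trials):
    """Derivative of log L = c*log(theta) + (N - c)*log(1 - theta)."""
    return successes / theta - (trials - successes) / (1.0 - theta)


def mle_closed_form(successes, trials):
    """Setting the derivative to zero and solving gives theta = c / N."""
    return successes / trials


def mle_gradient_ascent(successes, trials, theta=0.5, step=0.005, iters=2000):
    """Gradient ascent on the log likelihood (the same thing as gradient
    descent on the negative log likelihood)."""
    eps = 1e-6
    for _ in range(iters):
        theta += step * log_likelihood_gradient(theta, successes, trials)
        theta = min(max(theta, eps), 1.0 - eps)  # keep theta inside (0, 1)
    return theta


theta_exact = mle_closed_form(3, 10)
theta_numeric = mle_gradient_ascent(3, 10)
print(theta_exact, theta_numeric)
```

Here the iterative search is unnecessary because the derivative equation has a closed-form solution; it becomes useful when the derivatives are non-linear or couple several parameters.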

Note that for us, data are 2-attribute tuples [Flavor, Wrapper]

No entanglement of parameters for complete data, for Bayes nets with known topology and tabular CPTs
  Specifically, each partial derivative contains only one parameter, so you are solving single-variable equations rather than simultaneous equations.
  This doesn't hold for Markov nets; it also doesn't hold for Bayes nets where the CPDs induce direct parameter dependencies.

Celebrating ease of learning for Bayes nets with complete data!
  So we just noted that if we know the topology of the Bayes net, and we have complete data, then the parameters are un-entangled and can be learned separately from just data counts.
  Questions: How big a deal is this?
    Can we have complete data?
    Can we have known topology?
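Learning "from just data counts" can be sketched for the two-attribute [Flavor, Wrapper] data. The network topology Flavor -> Wrapper, the parameter names, and the data values below are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical complete data: (flavor, wrapper) tuples, made up for illustration.
data = [
    ("cherry", "red"), ("cherry", "red"), ("cherry", "green"),
    ("lime", "green"), ("lime", "green"), ("lime", "red"),
    ("cherry", "red"), ("lime", "green"),
]


def learn_parameters(data):
    """ML estimates for the assumed network Flavor -> Wrapper.
    Each parameter is a ratio of counts and is estimated independently."""
    flavor_counts = Counter(flavor for flavor, _ in data)
    pair_counts = Counter(data)
    theta = flavor_counts["cherry"] / len(data)                         # P(Flavor = cherry)
    theta1 = pair_counts[("cherry", "red")] / flavor_counts["cherry"]  # P(red | cherry)
    theta2 = pair_counts[("lime", "red")] / flavor_counts["lime"]      # P(red | lime)
    return theta, theta1, theta2


theta, theta1, theta2 = learn_parameters(data)
print(theta, theta1, theta2)
```

No equation solving or iterative optimization is involved: each estimate uses only the counts relevant to its own CPT entry, which is what the un-entangled parameters buy.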

Learning the parameters of a Gaussian
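For reference, applying the same likelihood, derivative, and zero-setting steps to a one-dimensional Gaussian gives the standard result below. The notation (N data points x_1, ..., x_N assumed i.i.d., parameters mu and sigma) is chosen here for illustration.

```latex
\begin{align*}
\log L(\mu, \sigma) &= \sum_{j=1}^{N} \left[ -\log\bigl(\sigma\sqrt{2\pi}\bigr) - \frac{(x_j - \mu)^2}{2\sigma^2} \right] \\
\frac{\partial \log L}{\partial \mu} = 0 \;&\Rightarrow\; \mu = \frac{1}{N}\sum_{j=1}^{N} x_j \\
\frac{\partial \log L}{\partial \sigma} = 0 \;&\Rightarrow\; \sigma^2 = \frac{1}{N}\sum_{j=1}^{N} (x_j - \mu)^2
\end{align*}
```

That is, the maximum-likelihood mean is the sample mean and the maximum-likelihood variance is the average squared deviation from it.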

Case Study: Learning Bayes Net models for relational database tables
  Consider a relational table in an RDBMS with n attributes
    Say, an employee table giving the age, position, salary etc. of each employee
  Suppose we want to learn the generative model underlying it
  Suppose we were able to hypothesize the topology
    We might be able to do so if (a) we know the domain or (b) we know some of the causal dependencies in the data
  If the relational table is complete, i.e. every tuple gives the value for every attribute (which is the standard RDBMS model), then learning the parameters of this network is easy!

Now, suppose the table is slightly dirty, in that there are tuples that have missing values for some of the attributes
  Say, some of the employee tuples are missing age information, others are missing salary information, etc.
  If only a small percent of the tuples are incomplete, then