# Logistic Regression

Post on 14-Feb-2016


Logistic Regression. Ram Akella. Lecture 3, February 2, 2011. UC Berkeley Silicon Valley Center/SC.


## Overview

- Motivating example
- Why not ordinary linear regression?
- The logistic formulation
- Probability of success
- Odds of success
- Logit of success
- The logistic regression model
- Running the model
- Interpreting the output
- Evaluating the goodness of fit

## The Aim of Classification Methods

Classification models are similar to ordinary regression models, except that the response, Y, is categorical.

- Y indicates the group membership of each observation (each category is a group): Y = C1, C2, …
- The predictors X1, X2, … are continuous and/or categorical.

Aims:

- Profiling (explanatory): What are the differences, in terms of X1, X2, …, between the various groups indicated by Y?
- Classification (prediction): Predict Y (group membership) on the basis of X1, X2, …

## Example 1: Classifying Firm Status

Financial analysts are interested in predicting the future solvency of firms. To predict whether a firm will go bankrupt in the near future, it is useful to look at ratio measures of financial health such as:

- Cash_Debt: cash flow / total debt
- ROA: net income / total assets
- Current: current assets / current liabilities
- Assets_Sales: current assets / net sales
- Status: bankrupt / solvent

## Example 2: Profiling Customers by Beer Preference

A beer-maker would like to know the characteristics that distinguish customers who prefer light beer from those who prefer regular beer. The beer-maker would like to use this information to predict any customer's beer preference based on gender, marital status, income and age.

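## Example: Beer Preference

Consider the data on the beer preferences (light or regular) of 100 customers, along with their age, income, gender and marital status. Suppose we code the response variable as

$$Y = \begin{cases} 1 & \text{if the customer prefers light beer} \\ 0 & \text{if the customer prefers regular beer} \end{cases}$$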

Now we fit the multiple regression model

$$Y = b_0 + b_1\,\text{Gender} + b_2\,\text{Married} + b_3\,\text{Income} + b_4\,\text{Age} + e$$

Model assumptions:

- Observations/residuals are independent
- Residuals are normally distributed
- Linear model is adequate
- Variance of residuals is constant

Which assumptions are violated? What about predictions from this model?
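To make the question about predictions concrete, here is a minimal sketch of fitting an ordinary least-squares line to a 0/1 response. The ages and responses are hypothetical values made up for illustration (they are not the lecture's beer data), and `fit_line` is a helper name introduced here.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data (an assumption, not from the lecture): age and a 0/1 response.
ages = [20, 25, 30, 35, 40, 45, 50, 55]
prefers_light = [1, 1, 1, 1, 0, 0, 0, 0]

intercept, slope = fit_line(ages, prefers_light)
fitted = [intercept + slope * a for a in ages]
print([round(f, 3) for f in fitted])
```

With data like these, the fitted values at the youngest and oldest ages fall outside the interval from 0 to 1, so they cannot be read as probabilities.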

## Different Formulation

Let $\pi = \text{Prob}(Y=1)$. In the beer example, $\pi$ is the probability that a customer prefers light beer. It follows that $\text{Prob}(Y=0) = 1 - \pi$.

To get rid of the 0/1 values, we can look at a function of $\pi$ and treat it as the response of the model.

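## Logistic Regression

Logistic regression learns the conditional distribution $P(y \mid x)$. We will assume two classes, $y = 0$ and $y = 1$. The parametric form for $P(y = 1 \mid x, w)$ and $P(y = 0 \mid x, w)$ is:

$$P(y = 1 \mid x, w) = \frac{1}{1 + e^{-w \cdot x}}, \qquad P(y = 0 \mid x, w) = 1 - P(y = 1 \mid x, w) = \frac{e^{-w \cdot x}}{1 + e^{-w \cdot x}}$$

where $w = [w_1, w_2, \ldots, w_k]$ is the parameter vector.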

We can represent the log of the probability ratio as a linear combination of the features:

$$\log \frac{P(y = 1 \mid x, w)}{P(y = 0 \mid x, w)} = w \cdot x$$

This is known as the log odds (logit).

A linear function $w \cdot x$, which ranges over $(-\infty, \infty)$, can be transformed to the range $(0, 1)$ using the sigmoid function $g(x, w)$:

$$g(x, w) = \frac{1}{1 + e^{-w \cdot x}}$$
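The sigmoid transformation can be sketched in a few lines. This is an illustrative implementation, not the lecture's code; the function names and the list representation of `x` and `w` are assumptions.

```python
import math

def sigmoid(z):
    """Map a real number z into the unit interval, avoiding overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def prob_y1(x, w):
    """P(y = 1 | x, w) = g(x, w), with x and w given as equal-length lists."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

# A linear score of 0 corresponds to a probability of one half.
print(prob_y1([1.0, 2.0], [0.5, -0.25]))
```

The two branches in `sigmoid` compute the same function; splitting on the sign of `z` only keeps `math.exp` from receiving a large positive argument.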

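Given $P(y \mid x)$, we predict $\hat{y} = 1$ if the expected loss of predicting 0 is greater than the expected loss of predicting 1. Here $L(0,1)$ is the loss of predicting 0 when the true class is 1, and $L(1,0)$ is the loss of predicting 1 when the true class is 0. For now assume $L(0,1) = L(1,0)$. Then

$$\hat{y} = 1 \iff P(y = 1 \mid x) > P(y = 0 \mid x) \iff \log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} > 0 \iff w \cdot x > 0$$

This derivation assumed $L(0,1) = L(1,0)$. A similar derivation can be done for arbitrary $L(0,1)$ and $L(1,0)$, for example when we decide that one class is more important to detect than the other.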
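The decision rule with unequal losses can be sketched as follows. This sketch assumes that L(0,1) is the loss of predicting 0 when the true class is 1, that L(1,0) is the loss of predicting 1 when the true class is 0, and that correct predictions cost nothing; the function and parameter names are illustrative.

```python
def predict(p1, loss_01=1.0, loss_10=1.0):
    """Return 1 if predicting 0 has the larger expected loss, otherwise 0.

    p1      -- P(y = 1 | x)
    loss_01 -- assumed meaning: cost of predicting 0 when the true class is 1
    loss_10 -- assumed meaning: cost of predicting 1 when the true class is 0
    """
    expected_loss_predict_0 = loss_01 * p1
    expected_loss_predict_1 = loss_10 * (1.0 - p1)
    return 1 if expected_loss_predict_0 > expected_loss_predict_1 else 0

print(predict(0.3))               # equal losses: threshold at 0.5
print(predict(0.3, loss_01=5.0))  # missing class 1 is five times as costly
```

With equal losses the rule reduces to comparing the probability with 0.5; raising `loss_01` lowers the probability at which class 1 is predicted.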

## Maximum Likelihood Learning

- The likelihood function is the probability of the data $(x, y)$ given the parameters $w$: $p(x, y \mid w)$.
- It is a function of the parameters.
- Maximum likelihood learning finds the parameters that maximize this likelihood function.
- A common trick is to work with the log-likelihood, i.e., take the logarithm of the likelihood function: $\log p(x, y \mid w)$.

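## Computing the Likelihood

In our framework, we assume each training example $(x_i, y_i)$ is drawn independently from the same (but unknown) distribution $P(x, y)$ (the i.i.d. assumption). Hence, for $N$ training examples, we can write:

$$\log p(x, y \mid w) = \log \prod_{i=1}^{N} P(x_i, y_i \mid w) = \sum_{i=1}^{N} \log P(x_i, y_i \mid w)$$

This is the function that we will maximize.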

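Further, $P(x \mid w) = P(x)$ because $x$ does not depend on $w$, so:

$$\arg\max_w \sum_{i=1}^{N} \log P(x_i, y_i \mid w) = \arg\max_w \sum_{i=1}^{N} \log \big[ P(y_i \mid x_i, w)\, P(x_i) \big] = \arg\max_w \sum_{i=1}^{N} \log P(y_i \mid x_i, w)$$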

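Each term of the log-likelihood can be written as:

$$\log P(y_i \mid x_i, w) = y_i \log P(y = 1 \mid x_i, w) + (1 - y_i) \log P(y = 0 \mid x_i, w)$$

Then the objective function for learning is:

$$\ell(w) = \sum_{i=1}^{N} \big[ y_i \log P(y = 1 \mid x_i, w) + (1 - y_i) \log P(y = 0 \mid x_i, w) \big]$$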
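As a hedged illustration, the conditional log-likelihood of a logistic model can be computed as below. The sketch assumes the standard Bernoulli form (each example contributes log P(y = 1 | x, w) when y = 1 and log P(y = 0 | x, w) when y = 0); the data are hypothetical and the first feature is a constant 1 acting as an intercept.

```python
import math

def log_likelihood(w, xs, ys):
    """Sum over examples of y*log P(y=1|x,w) + (1-y)*log P(y=0|x,w)."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = sum(wj * xj for wj, xj in zip(w, x))
        p1 = 1.0 / (1.0 + math.exp(-z))
        total += y * math.log(p1) + (1 - y) * math.log(1.0 - p1)
    return total

# Hypothetical training data (an assumption for illustration only).
xs = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]
ys = [1, 0, 1]

print(log_likelihood([0.0, 0.0], xs, ys))  # all-zero weights: every p1 is 0.5
print(log_likelihood([0.0, 1.0], xs, ys))  # weights that fit these labels better
```

A larger (less negative) value means the parameters explain the observed labels better, which is what maximum likelihood learning searches for.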

## Fitting the Logistic Regression with Gradient Ascent

## Gradient Ascent Algorithm
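The slide content for this algorithm did not survive in the transcript, so the following is only a sketch of the standard batch version, not necessarily the lecture's formulation. It assumes the usual gradient of the logistic log-likelihood, $\partial \ell / \partial w_j = \sum_i \big(y_i - P(y = 1 \mid x_i, w)\big)\, x_{ij}$, and the update $w \leftarrow w + \eta \nabla \ell(w)$. The step size, iteration count, and data are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_ascent(xs, ys, step=0.1, iterations=2000):
    """Batch gradient ascent on the logistic log-likelihood, starting from w = 0."""
    w = [0.0] * len(xs[0])
    for _ in range(iterations):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            p1 = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += (y - p1) * xj  # contribution of one example
        w = [wj + step * gj for wj, gj in zip(w, grad)]
    return w

# Hypothetical, non-separable data; the first feature is a constant 1 (intercept).
xs = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
ys = [0, 0, 1, 0, 1, 1]

w_hat = gradient_ascent(xs, ys)
print(w_hat)
```

The data are deliberately not linearly separable; with separable data the maximum likelihood weights grow without bound and the iteration never settles.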

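## Multi-class Case

Choose class $K$ to be the reference class and represent the log odds of each other class $k$ versus class $K$ as a linear function of the features:

$$\log \frac{P(y = k \mid x, w)}{P(y = K \mid x, w)} = w_k \cdot x, \qquad k = 1, \ldots, K-1$$

The conditional probability for a class $k \neq K$ can be computed as:

$$P(y = k \mid x, w) = \frac{e^{w_k \cdot x}}{1 + \sum_{j=1}^{K-1} e^{w_j \cdot x}}$$

For class $K$ the conditional probability is:

$$P(y = K \mid x, w) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{w_j \cdot x}}$$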
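A minimal sketch of the multi-class probabilities, assuming the usual reference-class (baseline-category) form in which each non-reference class k has its own weight vector and class K has none. The function name and the list-of-lists layout for the weights are illustrative choices.

```python
import math

def class_probabilities(x, weights):
    """Return K probabilities given K-1 weight vectors.

    weights -- one weight vector per non-reference class (classes 1..K-1)
    The last entry of the result is the reference class K.
    """
    scores = [math.exp(sum(wj * xj for wj, xj in zip(w, x))) for w in weights]
    denom = 1.0 + sum(scores)
    return [s / denom for s in scores] + [1.0 / denom]

# Hypothetical three-class example with two features.
probs = class_probabilities([1.0, 2.0], [[0.5, -0.2], [-1.0, 0.3]])
print(probs, sum(probs))
```

Because the reference class contributes the constant 1 in the denominator, the K probabilities always sum to one.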

## Example

A 1959 article presents data on the proportion of coal miners who exhibit symptoms of severe pneumoconiosis and the number of years of exposure. Here y is the proportion of miners who have severe symptoms.

| # years | # severe cases | # of miners | Proportion of severe cases (y) |
|---|---|---|---|
| 5.8 | 0 | 98 | 0 |
| 15.0 | 1 | 54 | 0.0185 |
| 21.5 | 3 | 43 | 0.0698 |
| 27.5 | 8 | 48 | 0.1667 |
| 33.5 | 9 | 51 | 0.1765 |
| 39.5 | 8 | 38 | 0.2105 |
| 46.0 | 10 | 28 | 0.3571 |
| 51.5 | 5 | 11 | 0.4545 |

The fitted model is

$$\hat{y} = \frac{1}{1 + e^{4.7964 - 0.093x}}$$

where $x$ is the number of years of exposure.

The covariance matrix is:

Logistic regression table:

| Predictor | w | Z | P | Odds Ratio | Conf. Interval |
|---|---|---|---|---|---|
| Constant | -4.7964 | -8.44 | | | |
| Years | 0.093 | 6.06 | 0.00 | 1.10 | 1.07-1.13 |

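## Interpretation of the Parameters

Consider a model with a single regressor $x_i$ and an intercept. The predicted log odds at $x_i$ are

$$\eta(x_i) = w_0 + w_1 x_i$$

If we increase the value of the regressor by one unit, then:

$$\eta(x_i + 1) = w_0 + w_1 (x_i + 1)$$

The difference between the two predicted values is:

$$\eta(x_i + 1) - \eta(x_i) = w_1$$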

## The Odds Ratio

The odds ratio can be interpreted as the multiplicative change in the odds of success associated with a one-unit change in the value of the predictor variable. It is defined as:

$$O_R = \frac{\text{odds}_{x_i + 1}}{\text{odds}_{x_i}} = e^{w_1}$$

## Example

For the pneumoconiosis data, the fitted model is:

$$\hat{y} = \frac{1}{1 + e^{4.7964 - 0.093x}}$$

The resulting odds ratio is

$$O_R = e^{0.093} \approx 1.10$$

This implies that every additional year of exposure increases the odds of contracting a severe case of pneumoconiosis by about 10%.
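The odds-ratio arithmetic can be checked with a short sketch. The coefficients below are the values as read from the logistic regression table in this transcript (constant -4.7964, years 0.093); the function names are illustrative.

```python
import math

W0 = -4.7964  # constant, as read from the regression table
W1 = 0.093    # coefficient for years of exposure, as read from the table

def prob_severe(years):
    """Fitted probability of severe pneumoconiosis after the given exposure."""
    return 1.0 / (1.0 + math.exp(-(W0 + W1 * years)))

def odds(p):
    return p / (1.0 - p)

odds_ratio = math.exp(W1)
print(round(odds_ratio, 2))

# The ratio of fitted odds one year apart equals the odds ratio at any exposure.
print(odds(prob_severe(31.0)) / odds(prob_severe(30.0)))
```

Note that the 10% increase applies to the odds, not to the probability itself; the change in probability per year depends on where on the curve the miner starts.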

## Overall Usefulness of the Model

For maximum likelihood estimation, the fit of a model is measured by its deviance D (similar to the sum of squared errors in least-squares estimation).

We compare the deviance of a model to the deviance of the naive model (no explanatory variables: simply classify each observation as belonging to the majority class).

If the ratio $D/(n-p)$, where $p$ is the number of predictors and $n$ is the number of samples, is much greater than unity, then the current model is not adequate.

Note: This test is similar in intent to the F-test for overall usefulness of a linear regression model.
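The transcript does not preserve a formula for the deviance, so the sketch below assumes the standard deviance for grouped binomial data, $D = 2 \sum_i \big[ y_i \ln \frac{y_i}{n_i \hat{p}_i} + (n_i - y_i) \ln \frac{n_i - y_i}{n_i (1 - \hat{p}_i)} \big]$, with the convention $0 \ln 0 = 0$. The counts are the pneumoconiosis data and the coefficients are the values as read from the regression table; all function names are illustrative.

```python
import math

YEARS = [5.8, 15.0, 21.5, 27.5, 33.5, 39.5, 46.0, 51.5]
CASES = [0, 1, 3, 8, 9, 8, 10, 5]
MINERS = [98, 54, 43, 48, 51, 38, 28, 11]
W0, W1 = -4.7964, 0.093  # as read from the regression table

def fitted_prob(years):
    return 1.0 / (1.0 + math.exp(-(W0 + W1 * years)))

def _term(observed, expected):
    """observed * log(observed / expected), using the convention 0 * log 0 = 0."""
    return observed * math.log(observed / expected) if observed > 0 else 0.0

def binomial_deviance(cases, miners, probs):
    """Assumed standard grouped-binomial deviance of fitted probabilities."""
    total = 0.0
    for y, n, p in zip(cases, miners, probs):
        total += _term(y, n * p) + _term(n - y, n * (1.0 - p))
    return 2.0 * total

probs = [fitted_prob(t) for t in YEARS]
deviance = binomial_deviance(CASES, MINERS, probs)
print(deviance)
```

Dividing the printed value by the degrees of freedom gives the ratio discussed above; a value near one suggests an adequate fit, and a value far above one suggests a poor one.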

## Usefulness of Individual Predictors

Each estimated coefficient, $b_j$, has a standard error, $s_{b_j}$, associated with it. To conduct the hypothesis test

$$H_0: \beta_j = 0 \quad \text{vs.} \quad H_a: \beta_j \neq 0$$

use the test statistic $b_j / s_{b_j}$ (called the Wald statistic). The associated p-value indicates the statistical significance of the predictor $x_j$, or the significance of the contribution of this predictor beyond the other predictors.
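A sketch of the Wald test for a single coefficient, assuming the statistic is compared with a standard normal distribution and a two-sided p-value is wanted. The standard error 0.0154 used in the example is an assumption, back-computed from the coefficient and Z value shown for Years in the regression table (0.093 / 6.06); it is not stated in the transcript.

```python
import math

def wald_test(coef, std_err):
    """Return (z, two-sided p-value) for H0: coefficient = 0, using a normal reference."""
    z = coef / std_err
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Years coefficient from the table; the standard error is an assumed value.
z, p = wald_test(0.093, 0.0154)
print(round(z, 2), p)
```

A p-value this small is consistent with the 0.00 reported for Years in the table.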

## Evaluating & Comparing Classifiers

Evaluation of a classifier is necessary for two different purposes:

- To obtain the complete specification of a particular model, i.e., to obtain numerical values of the parameters of a particular method.
- To compare two or more fully specified classifiers in order to select the best classifier.

Useful criteria:

- Reasonableness
- Accuracy
- Cost measures

## Evaluation Criterion 1: Reasonableness

As in regression, time series, and other models, we expect the model to be reasonable:

- Based on the analyst's domain knowledge, is there a reasonable basis for a causal relationship between the predictor variables and Y (group membership)?
- Are the predictor variables actually available for prediction in the future?
- If the classifier implies a certain order of importance among the predictor variables (indicated by p-values for specific predictors), is this order reasonable?

## Evaluation Criterion 2: Accuracy Measures

The idea is to compare the predictions with the actual responses (like forecast errors in time series, or residuals in regression models). In regression, time series, etc. we displayed these as 3 columns (actual values, predicted/fitted values, errors) or plotted them on a graph.

In classification, the predictions and actual values are displayed in a compact format called a classification/confusion matrix. This can be done for the training and/or validation set.

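## Classification/Confusion Matrix

Example with two groups, Y = C1 or C2. Each cell of the matrix counts the observations with a given actual group and a given predicted group; for example, one diagonal cell holds the number of observations that were classified correctly as group C1.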
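As an illustration, a two-group confusion matrix can be tallied from actual and predicted labels as below. The layout (rows are actual groups, columns are predicted groups) is an assumption, since the slide's own table is not preserved, and the labels are made-up data.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels=("C1", "C2")):
    """Rows are actual groups, columns are predicted groups (assumed layout)."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Hypothetical labels for seven observations.
actual = ["C1", "C1", "C1", "C2", "C2", "C2", "C2"]
predicted = ["C1", "C1", "C2", "C2", "C2", "C1", "C2"]

matrix = confusion_matrix(actual, predicted)
print(matrix)
```

Under this layout the top-left cell is the number of observations classified correctly as group C1, and the off-diagonal cells are the two kinds of misclassification.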

## Example: Beer Preference

The following classification matrix results from using a certain classifier on the data:

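## Classification Measures

Based on the classification matrix there are 5 popular measures.

The overall accuracy of a classifier is

$$\text{accuracy} = \frac{\text{number of correctly classified observations}}{\text{total number of observations}}$$

The overall error rate of a classifier is

$$\text{error rate} = 1 - \text{accuracy}$$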

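## Accuracy Measures (cont.)

The base accuracy of a dataset is the accuracy of the naive rule:

$$\text{base accuracy} = \text{proportion of the majority class}$$

The base error rate is

$$\text{base error rate} = 1 - \text{base accuracy}$$

The lift of a classifier (also known as its improvement) measures how much the classifier improves on the naive rule.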
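The accuracy and base-rate measures can be computed from a classification matrix along these lines. The matrix is hypothetical, rows are assumed to be actual groups and columns predicted groups, and the lift is left out because its formula is not preserved in the transcript.

```python
def classification_measures(matrix):
    """Accuracy and base-rate measures from a square classification matrix.

    Assumes rows are actual groups and columns are predicted groups.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    accuracy = correct / total
    # The naive rule assigns every observation to the largest actual group.
    base_accuracy = max(sum(row) for row in matrix) / total
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "base_accuracy": base_accuracy,
        "base_error_rate": 1.0 - base_accuracy,
    }

# Hypothetical matrix: 40 actual C1 (30 predicted correctly), 60 actual C2 (55 correct).
measures = classification_measures([[30, 10], [5, 55]])
print(measures)
```

Comparing `accuracy` with `base_accuracy` shows whether the classifier does better than always predicting the majority class.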

Suppose the two groups are asymmetric in that it is more important to correctly predict membership in C1 than in C2. For example, in a bankruptcy setting, it may be more important to correctly predict a firm that is going bankrupt than to correctly predict a firm that is going to stay solvent. The classifier is essentially used as a system for detecting or signaling C1.

In such a case, the overall accuracy is not a good measure for evaluating the classifier.

## Accuracy Measures for Unequal Importance of Groups

Sensitivity of a classifier = its ability to correctly detect the i