

What Statistics Books Try To Teach You But Don't

    Joe King

    University of Washington


Contents

1 Introduction to Statistics
1.1 Variables
1.1.1 Types of Variables
1.1.2 Sample vs. Population
1.2 Terminology
1.3 Hypothesis Testing
1.3.1 Assumptions
1.3.2 Type I & II Error
1.3.3 What does Rejecting Mean?
1.4 Writing in APA Style
1.5 Final Thoughts

2 Description of A Single Variable
2.1 Where's the Middle?
2.2 Variation
2.3 Skew and Kurtosis
2.4 Testing for Normality
2.5 Data
2.6 Final Thoughts

3 Correlations and Mean Testing
3.1 Covariance
3.2 Pearson's Correlation
3.3 R Squared
3.4 Point Biserial Correlation
3.5 Spurious Relationships
3.6 Final Thoughts

4 Means Testing
4.1 Assumptions
4.2 T-Test
4.2.1 Independent Samples
4.2.2 Dependent Samples
4.2.3 Effect Size
4.3 Analysis of Variance

5 Regression: The Basics
5.1 Foundational Concepts
5.2 Final Thoughts
5.3 Bibliographic Note

6 Linear Regression
6.1 Basics of Linear Regression
6.1.1 Sums of Squares
6.2 Model
6.2.1 Simple Linear Regression
6.2.2 Multiple Linear Regression
6.3 Interpretation of Parameter Estimates
6.3.1 Continuous
6.3.1.1 Transformation of Continuous Variables
6.3.1.1.1 Natural Log of Variables
6.3.2 Categorical
6.3.2.1 Nominal Variables
6.3.2.2 Ordinal Variables
6.4 Model Comparisons
6.5 Assumptions
6.6 Diagnostics
6.6.1 Residuals
6.6.1.1 Normality of Residuals
6.6.1.1.1 Tests
6.6.1.1.2 Plots
6.7 Final Thoughts

7 Logistic Regression
7.1 The Basics
7.2 Regression Modeling Binomial Outcomes
7.2.1 Estimation
7.2.2 Regression for Binary Outcomes
7.2.2.1 Logit
7.2.2.2 Probit
7.2.2.3 Logit or Probit?
7.2.3 Model Selection
7.3 Further Reading
7.4 Conclusions

Chapter 1

    Introduction to Statistics

Statistics is scary to most students, but it does not have to be. The trick is to build up your knowledge base one step at a time, so that you have the building blocks necessary to understand the more advanced statistics. This paper will go from a very simple understanding of variables and statistics to more complex analysis for describing data. This mini-book of statistics will give several formulas to calculate parameters, yet rarely will you have to calculate these on paper or type the numbers into a spreadsheet equation. This first chapter will look at some of the basic principles of statistics and some of the basic concepts needed to understand statistical inference. These may seem simple, and many may already be familiar with some of them, but it is best to start any work of statistics with the basic principles as a strong foundation.

    1.1 Variables

First we start with the basics. What is a variable? Essentially, a variable is a construct we observe. There are two kinds of variables: manifest (or observed) variables and latent variables. Latent variables are ones we cannot measure directly; we infer them by measuring other manifest variables (socio-economic status is a classic example). Manifest variables we measure directly and can model, or we can use them to construct more complex latent variables. For example, we may measure parents' education and parents' income and combine those into the construct of socio-economic status.

    1.1.1 Types of Variables

There are four primary categories of manifest variables: nominal, ordinal, interval, and ratio. The first two are categorical variables. Nominal variables are strictly categorical and have no discernible hierarchy or order to them; examples include race, religion, or states. Ordinal variables are also categorical but have a natural order. Likert scales (strongly disagree, disagree, neutral, agree, strongly agree) are one of the most common examples of an ordinal variable. Other examples include class status (freshman, sophomore, junior, senior) and level of education obtained (high school, bachelor's, master's, etc.).

Definition 1.1 (Nominal). A nominal variable is a categorical variable with no natural order to its categories. Race is a common example.

The continuous variables are interval and ratio. These are not categorical with a set number of values; they can take any value between two values. An example of a continuous variable is an exam score: your score may take any value from 0% to 100%. An interval scale has no absolute zero, so we cannot make judgements about the ratio of two values. Temperature is a good example: Celsius and Fahrenheit have no true zero among the temperatures we experience, so we cannot say 30 degrees Fahrenheit is twice as warm as 15 degrees Fahrenheit. A ratio scale is still continuous but has an absolute zero, so we can make judgements about ratios. I can say a student who got an 80% on the exam did twice as well as the student who got a 40% on their exam.
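To make the four types concrete, here is a small R sketch (R is the software used for the examples later in this book); the values are made up purely for illustration.

# Hypothetical values illustrating the four variable types in R
religion  <- factor(c("Catholic", "Muslim", "None", "Hindu"))        # nominal: unordered factor
agreement <- factor(c("disagree", "neutral", "agree", "agree"),
                    levels = c("strongly disagree", "disagree", "neutral",
                               "agree", "strongly agree"),
                    ordered = TRUE)                                   # ordinal: ordered factor
temp_f    <- c(30, 15, 72, 101)    # interval: numeric, but no absolute zero
exam_pct  <- c(80, 40, 95, 60)     # ratio: numeric with a true zero

str(religion)
str(agreement)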


    1.1.2 Sample vs. Population

One of the primary interests in statistics is to try to generalize from our sample to a population. A population does not always have to be the population of a state or nation as we usually think of the word. Let's say, for example, the head of UW Medicine came to me and asked me to do a workplace climate survey of all the nursing staff at UW Medical Center. While there are a lot of nurses there, I could conceivably give my survey to each and every one of them. This would mean I would not have a problem of generalizability, because I know the attitudes of my entire population. Unfortunately, statistics is rarely this clean, and you will usually not have access to an entire population. Therefore I must collect data that is representative of the population I want to study; this is a sample. This matters because different notation is used for samples versus populations. For example, \bar{x} is generally the sample mean, while \mu is used for the population mean. Rarely will you be able to know the population mean, which is where this becomes a huge issue. Many books on statistics put all the notation at the beginning of the book, but I feel this is not a good idea. I will introduce notation as it becomes relevant, and specifically discuss it when it is necessary. Do not be alarmed if you find yourself coming back to chapters to remember notation; it happens to everyone, and committing this to memory is truly a lifelong affair.

    1.2 Terminology

There is also the matter of terminology. This will be discussed before the primary methods because the terminology can get confusing. Unfortunately, statistics tends to change its terminology and to have multiple words for the same concept, which differ between journals, disciplines, and coursework. One area where this is most true is when talking about types of variables. We classified variables by how they are measured above, but how they fit into our research question is a different matter. Basic statistics books still talk about variables as independent or dependent variables. Although these terms have fallen out of favor in a lot of disciplines, especially the methodology literature, they still bear weight, so they will be discussed. We will identify which variables are independent and dependent based on the models we run when we get to those models, but in general, the dependent variable is the one we are interested in knowing about. In short, we want to know how our independent variables influence our dependent variable(s). While the terms independent and dependent variable are widely used, there are different names for each, based mostly on field of study, personal convention, the venue you are publishing in, and so on. The dependent variable is the one with the least confusion and is generally called the outcome variable, which seems justified given it is the outcome we are studying. The independent variable is where there is less consistency in terminology. In many cases it is called the regressor, predictor, or covariate. I prefer the second term and do not like the third. The first seems too tied to regression modelling and not as general as predictor. Covariate has different meanings for different tests, so in my opinion it can be confusing. Predictor can also be confusing because some people may conflate it with causation, which would be a very wrong assumption to make. I will usually use the terms independent variable or predictor, for lack of better terms and because these are the more common ones you will see in the literature.

    1.3 Hypothesis Testing

The basis from which we start our research is the null hypothesis. This simply says there is no relationship between the variables we are studying. When we reject the null hypothesis, we accept the alternative hypothesis, which says the null hypothesis is not true and there is a significant relationship between the variable(s) we are studying.

    1.3.1 Assumptions

There are many types of assumptions that we must make in our analysis in order for our coefficients to be unbiased.

    1.3.2 Type I & II Error

So we have a hypothesis associated with a research question. This mini-book will look at ways to explore hypotheses and how we can either support or not support them. First we must establish a few basics about hypothesis testing. We have to have some basis for determining whether the questions we are testing are true or not, yet we also don't want to make hasty judgements about whether our hypothesis is correct. This leads us to committing errors in our judgements. There are two primary errors in this context. Type I error is rejecting the null hypothesis when it is correct. Type II error is failing to reject the null hypothesis when it is wrong. While we attempt to avoid both types of errors, the latter is more acceptable than the former. This is because we do not want to make a hasty decision and claim an important relationship between variables when none exists. If we say there is no relationship when in fact there is one, this is a more conservative mistake that, hopefully, future research will correct.

    1.3.3 What does Rejecting Mean?

When we try to reject the null hypothesis, first we must determine our critical value, which is generally 0.05. This is done by convention, and whether it is still of practical use given today's computing technology is currently debated. When we reject the null hypothesis, all we are saying is that the chance of finding a result as large or larger is less than the significance level. This does not mean that your research question merits any major practical effect. Rejecting the null hypothesis may be important, but so can failing to reject it. For example, if there was a school where lower-income and higher-income groups were performing significantly differently on exams 5 years ago, and I came in and tested again and found no statistically significant differences, I would find that to be highly important. It would mean there was a change in the test scores and there is now some relative parity. The next concern is practical significance. My result may be statistically significant, yet there may not be any real reason to think it is going to make a difference if implemented in policy or clinical settings. This is where other measures come into play, like effect sizes, which will be discussed later. One should also note that larger sample sizes can make even a very small statistic statistically significant, and a small sample size can mask a significant result. All of these must be considerations. One should not take a black-and-white approach to answering research questions; something is not simply significant or not.


    1.4 Writing in APA Style

One thing to be cautious about is how to write up your results and present them in a manner which is both ethical and concise. This includes graphics, tables, and paragraphs. These should make the main points of what you want to say while not misrepresenting your results. If you are going to do a lot of writing for publication, you should pick up a copy of the APA Manual (American Psychological Association, 2009).

    1.5 Final Thoughts

A lot was discussed in this first part. These concepts will be revisited in later sections as we begin to implement them. Many books and articles have been written which expand on these concepts further. I ask that you constantly keep an open mind as researchers and realize that statistics can never tell us truth; it can only hint at it, or point us in the right direction, and the process of scientific inquiry never ends.

Chapter 2

    Description of A Single Variable

When we have variables, we want to understand their nature. Our first job is to describe our data before we can start to do any test. There are two things we want to know about our data. First, we want to know where the center of the mass of the data is, and second, how far from that center our data are spread. The middle is calculated by the measures of central tendency (discussed momentarily); how far the data fall from that middle tells us how much variability there is in our data. This is also called uncertainty or the dispersion parameter. These concepts are more generally known as the location and scale parameters. Location is the middle of the distribution: where on the real number line the middle lies. Scale is how far away from the middle our data go. These concepts are common to all statistical distributions, although for now our focus is on the normal distribution. This is also known as the Gaussian distribution and is widely used in statistics for its satisfying mathematical properties and because it allows us to run many types of analyses.

2.1 Where's the Middle?

The best way to describe data is to use a measure of central tendency, or what is the middle of a set of values. This includes the mean, median, and mode. The equation to find the mean is in 2.1. The equation has some notation which requires some discussion, as you will see it in a lot of formulas. The \Sigma is the summation sign, which tells us to sum everything written to its right. The i = 1 below the summation sign simply means start at the first value of the variable, and the N at the top means go all the way to the end (N being the number of responses seen in that variable).

\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}  (2.1)

If we apply this to an example vector x = [1, 2, 3, 4, 5], we get 2.2.

\bar{x} = \frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3  (2.2)

Our mean is influenced by all the numbers equally, so our example variable y = [1, 1, 2, 3, 4, 5] gives a different mean, as shown in 2.3.

\bar{y} = \frac{1 + 1 + 2 + 3 + 4 + 5}{6} = \frac{16}{6} = 2.67  (2.3)


The addition of the extra one weighed our mean down. As we will see, values can have dramatic effects on our mean, especially when the number of values we have is low. Finally, we represent the mean in several ways: the Greek letter \mu represents the population mean, while the mean of a sample is denoted with a flat bar on top, so we would say \bar{x} = 3. The mean is also known as the expected value, so we can write it as E(x) = 3. There are two other useful measures. The first is the median, which is simply the middle number of a set, as in 2.4.

\text{Median} = 1,\ 2,\ \underbrace{3}_{\text{Median}=3},\ 4,\ 5  (2.4)

If there is an even number of values, we take the mean of the two middle values, as in 2.5.

\text{Median} = 1,\ 1,\ \underbrace{2,\ 3}_{\text{Median}=2.5},\ 4,\ 5  (2.5)

The mode is simply the most common number in a set; in the last example, 1 is the mode since it occurs twice while the others occur once. You may get bi-modal data where two numbers (or even more) tie for the most occurrences. These last two measures of the middle of a distribution are of most interest for categorical data. The mode is rarely useful for interval data, although the median can be of help for ordinal data. The mean is most relevant for continuous data and is the one that will be used a lot in statistics; it is more commonly referred to as the average and is computed by taking the sum of all the values and dividing by the number of values.
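As a quick R illustration of these three measures (base R has no built-in function for the mode of a data set, so a small helper is written here; the vectors are the x and y examples above):

x <- c(1, 2, 3, 4, 5)
y <- c(1, 1, 2, 3, 4, 5)

mean(x)      # 3
mean(y)      # 2.666667
median(y)    # 2.5, the mean of the two middle values

# Count how often each value occurs and return the most frequent one
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}
stat_mode(y) # 1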

    2.2 Variation

We now know how to get the mean, but much of the time we also want to know how much variation is in our data. When we talk about variation we are talking about why we get different values in the data set. For our previous example of [1, 2, 3, 4, 5], we want to know why we got these values and not all 3s or 4s. A more practical example is why one student scores a 40 on an exam while another scores 80, another 90, another 50, and so on. This measure of variation is called the variance. It is also called the dispersion parameter in the statistics literature, and the word dispersion will come up in discussion of other models. To compute the variance, first find the difference between each value and the sample mean. Then square those differences, sum them, and divide by the number of observations, as is done below for the variance of x. Taking the square root of the variance gives the standard deviation. Formula 2.6 shows the equation for this.

\mathrm{Var}(x) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}  (2.6)

Formula 2.7 below shows how we take the formula above and use our previous variable x to calculate the sample variance.

\mathrm{Var}(x) = \frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{5}
               = \frac{(-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2}{5}
               = \frac{4 + 1 + 0 + 1 + 4}{5} = \frac{10}{5} = 2  (2.7)

A plot of the normal distribution with lines marking the distances covered by 1, 2, and 3 standard deviations is shown in 2.1.

Figure 2.1: Normal Distribution, with regions marking 1 standard deviation (68.2%), 2 standard deviations (95.4%), and 3 standard deviations (99.7%)

Now is when we start getting into the discussion of distributions, specifically the normal distribution. The standard deviation is one property of the normal distribution. The standard deviation is a great way to understand how data are spread out and gives us an idea of how close to the mean our sample is. The rule for the normal distribution is that 68% of the population will be within one standard deviation of the mean, 95% will be within two standard deviations, and 99.7% will be within three standard deviations. This is shown in Figure 2.1, which has a mean of 20 and a standard deviation of two. There is another form of variation that is good to know: the interquartile range. This shows the middle 50% of the data, going from the 75th percentile down to the 25th percentile. One good graphing technique for this is a box and whisker plot, shown in Figure 2.2. The line in the middle is the middle of the distribution, the box is the interquartile range, the horizontal lines are two standard deviations out, and the dots outside those are outliers (data points more than two standard deviations from the mean).
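These measures of spread are one-liners in R. One caution: var() and sd() divide by N - 1 (the usual sample formulas), while equation 2.6 divides by N, so the two differ slightly. A sketch using the example vector:

x <- c(1, 2, 3, 4, 5)

var(x)                            # 2.5, sample variance (divides by N - 1)
sd(x)                             # 1.581139, sample standard deviation
sum((x - mean(x))^2) / length(x)  # 2, the population form in equation 2.6

quantile(x, c(0.25, 0.75))        # 25th and 75th percentiles
IQR(x)                            # interquartile range
boxplot(x)                        # box and whisker plot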

    2.3 Skew and Kurtosis

Two other concepts which help us evaluate a single variable are skew and kurtosis. These are not talked about as much, but they are still important. Skew is when more of the sample lies on one side of the mean than the other. Negative skew is where the peak of the curve is to the right of the mean (the tail going to the left). Positive skew is where the peak of the distribution is to the left and the tail goes to the right. Kurtosis is how flat or peaked a distribution looks. A distribution with a more peaked shape is called leptokurtic, and a flatter shape is called platykurtic. Although skewness and kurtosis can make a distribution violate normality, they do not always.
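Base R does not ship skewness or kurtosis functions (add-on packages such as e1071 or moments provide them), but the moment-based versions can be computed directly. A sketch with simulated data:

set.seed(42)
x <- rnorm(1000, mean = 20, sd = 2)   # simulated, roughly normal data

z <- (x - mean(x)) / sd(x)
skew <- mean(z^3)          # near 0 for a symmetric distribution
kurt <- mean(z^4) - 3      # excess kurtosis: > 0 leptokurtic, < 0 platykurtic

skew
kurt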

    2.4 Testing for Normality

Can we test for normality? We can, and should. One way is to use descriptive statistics and to look at a histogram. Below you can see a histogram of a normal distribution. We can overlay a normal curve on it and see whether the data look normal. This is not a test per se, but it gives us a good idea of what our data look like. This is shown in 2.3.

Figure 2.3: A histogram of the normal distribution above with the normal curve overlaid

We could also examine a P-P plot. This is a plot with a line at a 45 degree angle going from the bottom left to the upper right; the closer the points are to the line, the closer to normality the distribution is. The same principle lies behind a Q-Q plot (Q meaning quantiles).
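In R, the histogram with an overlaid normal curve, the Q-Q plot, and one common formal test (Shapiro-Wilk) might look like the following sketch with simulated data:

set.seed(1)
x <- rnorm(200, mean = 20, sd = 2)   # simulated data to check
m <- mean(x)
s <- sd(x)

hist(x, freq = FALSE, main = "Histogram with normal curve")
curve(dnorm(x, mean = m, sd = s), add = TRUE)   # overlay the fitted normal density

qqnorm(x)    # points close to the 45 degree line suggest normality
qqline(x)

shapiro.test(x)   # a non-significant p-value is consistent with normality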


    2.5 Data

I will try to give examples of data analysis and its interpretation. One good data set is on cars released in 1993 (Lock, 1993); the names of the variables and more information on the data set can be found in Appendix ??.

    2.6 Final Thoughts

A lot of the concepts discussed here are necessary for a basic understanding of statistics, but do not feel you have to have this entire chapter memorized. You may need to come back to these concepts from time to time. Do not focus on memorizing formulas either; focus on what the formulas tell you about the concept. With today's computing power, your concern will be understanding what the output is telling you and how to connect that to your research question. While it is good to know how the numbers are calculated, the point is to understand how to use them in your test.


Chapter 3

    Correlations and Mean Testing

In the first part of this book we just looked at describing variables. Now we look at how they are related and how to test the strength of those relationships. This is a difficult task; it takes time to master not only the concepts but also their implementation. Course homework is actually the easiest way to do statistics: you are given a research question, told what to run, and asked to report your results. In real analysis you will have to decide for yourself what test best fits your data and your research question. While I will provide some equations, it is best to look at them just to see what they are doing and what they mean; it is less important to memorize them. This first part will look at basic correlations and testing of means (t-tests and ANOVA). Much of statistics is correlational research: research where we look at how one variable changes when another changes, yet causal inferences will not be assessed. It is very tempting to use the word cause or to imply some directionality in your research, but you need to refrain from it unless you have a lot of evidence to justify it, as the ethical standard for determining causality is high. If you wish to learn more about causality, see (J. Pearl, 2009; Judea Pearl, 2009).

    3.1 Covariance

Before discussing correlations we have to discuss the idea of a covariance. One of the most basic ways to associate variables is to get a covariance matrix. A matrix is like a spreadsheet, with a value in each cell. The diagonal going from upper left to lower right holds the variance of each variable (since it is the same variable in the row as in the column). The other cells hold the covariance between two variables. The idea of covariance is similar to variance, except we want to know how one variable varies with another: if one changes in one direction, how will the other variable change? Note that we are only talking about continuous variables here (for the most part, interval and ratio scales are treated the same and the distinction is rarely made in statistical testing, so when I mention continuous it may be either interval or ratio without compromising the analysis). The formula for covariance is in 3.1.

\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}  (3.1)

As one can see, it takes the deviations from the mean, multiplies them together, and divides by the sample size. This gives a good measure of the relationship between the two variables. While this concept is necessary and a bedrock of many statistical tools, it is not very intuitive: it is not standardized in any way that lets us make quick sense of the relationship. This is what leads us to correlations.
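A small R sketch of formula 3.1, next to the built-in cov() (which, like var(), divides by N - 1); the y values here are made up:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 7)

sum((x - mean(x)) * (y - mean(y))) / length(x)   # covariance as in 3.1 (divides by N)
cov(x, y)                                        # sample covariance (divides by N - 1)
cov(cbind(x, y))                                 # covariance matrix: variances on the diagonal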

3.2 Pearson's Correlation

A correlation is essentially a standardized covariance. We take the covariance and divide it by the standard deviations, as in 3.2:

r_{x,y} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}}  (3.2)

If we dissect this formula, it is not as scary as it looks. The top of the equation is simply the covariance. The bottom is the variance of x and the variance of y multiplied by each other; taking the square root converts that to a standard deviation. This puts the correlation coefficient on a metric of -1 to 1. A correlation of 0 means no association whatsoever. A correlation of 1 is a perfect correlation. So let's say we are looking at the association of temperatures between two cities: if city A's temperature went up by one degree, city B's would also go up by one degree if their correlation were 1 (this phrasing assumes the two are measured in the same units). If the correlation is -1, it is a perfect inverse correlation, so if the temperature of city A goes up one degree, city B's will go DOWN one degree. In social science, correlations are never this clean or this clear to understand. Since the metrics can differ between variables, one must be careful about when you compute a correlation and how you interpret it. Also remember that a correlation is non-directional: if we have a correlation of .5, and a one-degree rise in city A goes with a half-degree rise in city B, then a one-degree rise in city B likewise goes with a half-degree rise in city A. Pearson's correlations are reported with an r, then the coefficient, followed by the significance level; for example, r = 0.5, p < .05 if significant.
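In R, cor() gives the coefficient and cor.test() adds the significance test needed for the r = ..., p < ... reporting format; a sketch with made-up temperatures for two cities:

city_a <- c(55, 60, 65, 70, 75, 80, 85)   # hypothetical daily highs, city A
city_b <- c(52, 63, 61, 72, 74, 83, 88)   # hypothetical daily highs, city B

cor(city_a, city_b)         # Pearson's r
cor.test(city_a, city_b)    # r with its t test and p-value
cor(city_a, city_b)^2       # r squared, the proportion of variance explained (section 3.3)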

    3.3 R Squared

When we get a Pearson's correlation coefficient, we can take the square of that value, and that is what is called the percentage of variance explained. So if we get a correlation of .5, the square of that is .25, and we can say that 25% of the variation in one variable is accounted for by the other variable. Of course, as the correlation increases, so does the amount of variance explained.

    3.4 Point Biserial Correlation

One special case where a categorical variable can be used with Pearson's r is the point-biserial correlation. If you have a binary variable, you can calculate its correlation with another variable, provided that other variable is continuous. This is similar to the t-test we will examine later. The test looks at whether or not there is a significant difference between the two groups of the dichotomous variable. When we ask whether it is significant, we want to determine whether or not the difference is due to random chance. We already know there is going to be random variability in any sample we take, but we want to know whether the difference between the two groups is due to this randomness or to a genuine difference between the groups.

    3.5 Spurious Relationships

So let's say we get a Pearson's r = .5; what now? Can we say there is a direct relationship between the variables? No, because we don't know whether the relationship is direct or not. There are many examples of spurious relationships. For example, if I looked at the rate of illness students report to the health center at their university and the relative timing of exams, I would most likely find a decent (probably moderate) correlation. Now, before any student starts using this statement as a reason to cancel tests, there is no reason to believe your exams are causing you to get sick! Well, what is it then? Something we DIDN'T measure: stress! Stress weakens the immune system, and stress is higher during periods of examinations, so you are more likely to get ill. If we just looked at correlations we would only be looking at the surface, so take the results but use them with caution, as they may not be telling the whole story.

    3.6 Final Thoughts

This may seem like a short chapter given the heavy use of correlations, but much of the material in this chapter will be used in future statistical analysis. One of the primary points to take away is that this is not in any way measuring causality, and this point cannot be stressed enough. Correlations are a good way of looking at associations, but that is all; still, they are a good way to help us explore data and work towards more advanced statistical models which can help us support or not support our hypotheses. While correlations can be used, use them with caution.


Chapter 4

    Means Testing

This chapter goes a bit further into exploring the differences between groups. If we have a nominal or ordinal variable and we want to see whether its categories are statistically different on a continuous variable, there are several tests we can do. We already looked at the point-biserial correlation, which is one test. This chapter examines the t-test, which gives a bit more detail, and Analysis of Variance (ANOVA), which covers the case where the number of groups is greater than 2 (the letter denoting the number of groups is generally k, as n denotes sample size, so ANOVA applies when k > 2, i.e. k >= 3). Here we want to know whether the difference in the means is statistically significant.

    4.1 Assumptions

The first assumption we will make is that the continuous variable we are measuring is normally distributed, and we learned how to test that earlier. Another assumption we must make is called homogeneity of variance. This means the variance is the same for both groups (it doesn't have to be exactly the same, just similar; it will differ somewhat due to randomness, but is it different enough to be statistically different?). If this assumption is untenable, we will have to correct the degrees of freedom, which will influence whether our t-statistic is significant or not. This can be seen in the two figures below. Figure 4.1 shows a difference in the means (means of 10 and 20) but with the same variance of 4.

Figure 4.1: Same Variance

Figure 4.2 has the same means, but one variance is 4 and the other is 16 (a standard deviation of 4).

    4.2 T-Test

The t-test is similar to the point-biserial correlation in that we want to know whether two groups are statistically different.


Figure 4.2: Different Variances

Let us look at the first equation, 4.1. The numerator is the difference between the means. The denominator is the standard error of that difference: the variance of each sample is denoted s^2, and n is the sample size for that group.

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}  (4.1)

The degrees of freedom are given by 4.2.

df = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}  (4.2)

The above equations assume unequal sample sizes and variances. The equations simplify if you have the same variance or the same sample size in each group, although this generally occurs only in experimental settings where sample size and other parameters can be strictly controlled. In the end we want to see whether there is a statistical difference between groups. If we look at data from the National Education Longitudinal Study of 1988 (base year), we can see how this works. If we look at the difference between genders on science scores, we can do a t-test, and we find there is a significant mean difference. The means by gender are in Table 4.1.

         Mean      SD
Male     52.1055   10.42897
Female   51.1838   10.03476

Table 4.1: Means and Standard Deviations of Male and Female Test Scores

Our analysis shows t(10963) = 4.712, p < .05. However, the test of whether the variances are the same is significant, F = 13.2, p < .05, so we have to use the results with equal variances not assumed. This changes our results to t(10687.3) = 4.701, p < .05. You can see the main difference is that our degrees of freedom dropped, and thus our t-statistic dropped. This time it didn't matter; our sample size was so large that both values were significant, but in some tests this may not be the case. If the test assuming equal variances rejects the null hypothesis but the test assuming unequal variances does not, even if Levene's test is not significant, you should be cautious about how you write it up.
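The NELS:88 data are not reproduced here, but the same analysis in R looks like the sketch below with simulated scores. Note that t.test() defaults to the unequal-variance (Welch) test; var.equal = TRUE gives the classic equal-variance test.

set.seed(88)
male   <- rnorm(500, mean = 52.1, sd = 10.4)   # simulated science scores
female <- rnorm(500, mean = 51.2, sd = 10.0)

t.test(male, female, var.equal = TRUE)   # equal variances assumed
t.test(male, female)                     # Welch test, equal variances not assumed

# Paired version for dependent samples (section 4.2.2 below), e.g. pre-test vs. post-test
pre  <- rnorm(30, mean = 50, sd = 10)
post <- pre + rnorm(30, mean = 2, sd = 5)
t.test(post, pre, paired = TRUE)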

    4.2.1 Independent Samples

The above example was an independent samples t-test. This means the participants are independent of each other, and so their responses will be too.

    4.2.2 Dependent Samples

This is a slightly different version of the t-test where you still have two means, but the samples are not independent of each other. A classic example of this is a pre-test, post-test design, as is longitudinal data where a measure was collected in one year and then measured with the same test at a later date.

    4.2.3 Effect Size

    The effect size r is used in this part. The equation for this is in 4.3:

r = \sqrt{\frac{t^2}{t^2 + df}}  (4.3)
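Given a t statistic and its degrees of freedom, equation 4.3 is a one-line computation in R; for example, using the values reported for the gender comparison above:

t_stat <- 4.712
df     <- 10963

r <- sqrt(t_stat^2 / (t_stat^2 + df))   # equation 4.3
r                                       # about 0.045, a very small effect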

    4.3 Analysis of Variance

Analysis of Variance (ANOVA) is used when you have more than two groups. Here we will look at what happens with race and standardized test scores. The problem we will encounter is determining which groups are significantly different. ANOVA adds some steps to the analysis. First, all of the means are compared (the equations for this are quite complex, so we will just go through the analysis steps); this checks whether any of the means are statistically different. It is called an omnibus test and follows the F distribution (the F and t distributions are similar to the normal but have fatter tails, which allows for more outliers, though this is of little consequence to the applied analysis). We get an F statistic for both Levene's test and the omnibus test. In this analysis we get five group means, shown in Table 4.2:

Race                        Mean    SD
Asian, Pacific Islander     56.83   10.69
Hispanic                    46.72    8.53
Black, Not Hispanic         45.44    8.29
White, Not Hispanic         52.91   10.03
American Indian, Alaskan    45.91    8.13

Table 4.2: Means and Standard Deviations of Race Groups' Test Scores

Table 4.3 shows the mean differences. After we reject the omnibus test, we need to see where the significant differences between groups lie. We do this with post-hoc tests. For simplicity I have put the results in a matrix where the numbers are the differences between the groups; those with an asterisk (*) beside them are statistically significant. This is not how it is presented in SPSS, which gives the results in rows, but the matrix is easily made. There are many post-hoc tests one can do. The ones done below are Tukey and Games-Howell, and both reject the same mean-difference groups. There are a lot more post-hoc tests, but these two do different things: Tukey adjusts for different sample sizes, and Games-Howell corrects for heterogeneity of variance. If you do a few types of post-hoc tests and the result is the same, this gives credence to your conclusion. If not, you should go back to see whether there is a real difference, or re-examine your assumptions.

Race Groups   Asian-PI    Hispanic    Black      White     AI-Alaskan
Asian-PI      0
Hispanic      10.1092*    0
Black         11.3907*    1.2815*     0
White         3.9193*     -6.1899*    -7.4714*   0
AI-Alaskan    10.9178*    0.8086      -0.4729    6.9985*   0
Note: PI = Pacific Islander; AI = American Indian

Table 4.3: Mean Differences Among Race Groups
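In R, the same workflow (omnibus F test followed by post-hoc comparisons) takes only a few lines. Tukey's HSD is built in; Games-Howell needs an add-on package, so only Tukey is sketched here, with simulated scores standing in for the real data:

set.seed(7)
score <- c(rnorm(100, 56.8, 10.7), rnorm(100, 46.7, 8.5),
           rnorm(100, 45.4, 8.3),  rnorm(100, 52.9, 10.0),
           rnorm(100, 45.9, 8.1))
race  <- factor(rep(c("Asian-PI", "Hispanic", "Black", "White", "AI-Alaskan"),
                    each = 100))

fit <- aov(score ~ race)   # omnibus ANOVA
summary(fit)               # overall F test

TukeyHSD(fit)              # pairwise mean differences with adjusted p-values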

Chapter 5

    Regression: The Basics

Regression techniques make up a major portion of social science statistical inference. Regression models are also called linear models (this will be generalized later, but for now we will stick with linear models), as we try to fit a line to our data. These methods allow us to create models to predict certain variables of interest. This section will be quite deep, since regression requires a lot of concepts to consider, but as in past portions of this book, we will take it one step at a time, starting with basic principles and moving to more advanced ones. The principle of regression is that we have a set of variables (known as predictors, or independent variables) that we want to use to predict an outcome (known as the dependent variable, a term that has fallen out of favor in more advanced statistics classes and works). We then have a slope for each independent variable, which tells us the relationship between that predictor and the outcome. If you find yourself not understanding something, come back to the more fundamental portions of regression and it will sink in. This type of method is so diverse that people spend careers learning and using it, so it is not expected that you pick it up in one quarter; we are just laying the foundations for its use.

    5.1 Foundational Concepts

So how do we try to predict an outcome? It comes back to the concept of variance. Remember, early in this book we looked at variance as simply variation in a variable: there are different values for different cases (i.e. different scores on a test for different students). Regression allows us to use a set of predictors to explain the variation in our outcome. Now we will look at the equations themselves and the notation we will use. The basic equation of a regression model (or linear model) is 5.1.

y = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \epsilon  (5.1)

This basic equation may look scary, but it is not. There are some basic parts to the equation which will be relevant to future understanding of these models. Let us go left to right. The y is our outcome variable, the variable whose behavior we want to predict. The \beta_0 is the intercept of the model (where the regression line crosses the y-axis on a coordinate plane). The \beta_i x_i term is actually two components together: the x_i are the predictor variables, and the \beta_i are the slopes for each predictor, which tell us the relationship between that predictor and the outcome variable. The summation sign is there, yet unlike other times it has been used, at the top is the letter p instead of n. This is because p stands for the number of predictors; we are not summing over the number of cases. The \epsilon is the error term, which accounts for the variability in the model that the predictors do not explain.


    5.2 Final Thoughts

This brief chapter introduces regression as a concept, or more generally linear modeling. I don't say linear regression (which is the next chapter), as that is just one form of regression; many more types of regression will be covered in future chapters. There are many books on regression, and at the end of each chapter I will note very good ones. One extraordinary one is A. Gelman and Hill (2007), which I refer to a lot in creating this chapter.

    5.3 Bibliographic Note

Many books have been written on regression. I have used many as inspiration and references for this work, although much of the information is freely available online. On top of A. Gelman and Hill (2007) for doing regression, there are the books Everitt, Hothorn, and Group (2010), Chatterjee and Hadi (2006), the free book Faraway (2002), and other excellent books available for purchase (Faraway, 2004; Faraway, 2005). More theory-based books are Venables and Ripley (2002), Andersen and Skovgaard (2010), Bingham and Fry (2010), Rencher and Schaalje (2008), and Sheather (2009). As you can tell, most of these books use R, which is my preferred statistical package. Some books are focused on SPSS and do a good job at that, one notable one being Field (2009); more advanced books that are still very good are Tabachnick and Fidell (2006) and Stevens (2009). Stevens (2009) would not make a good textbook but is an excellent reference, including SPSS and SAS instructions and syntax for almost all multivariate applications in the social sciences, and is a necessary reference for any social scientist.

Chapter 6

    Linear Regression

Let's focus for a while on one type of regression: linear regression. This requires us to have an outcome variable that is continuous and normally distributed. When we have a continuous, normally distributed outcome, we can use least squares to calculate the parameter estimates. Other forms of regression use maximum likelihood, which will be discussed in later chapters, although for this model the least squares estimates are also the maximum likelihood estimates.

    6.1 Basics of Linear Regression

This first regression technique we will learn, and the most common one used, is for an outcome that is continuous in nature (interval or ratio, it does not matter). Linear regression uses an analytic technique called least squares. We will see how this works graphically and then how the equations give us the numbers for our analysis. What linear regression does is look at the plot of x and y and try to fit the straight line that is closest to all of the points. Figure 6.1 shows how this is done: I randomly drew values for both x and y, and the line is the regression line that best fits the data. As the plot shows, the line doesn't fit perfectly; it is just the best fitting line. The differences between the actual data and the line are what are termed residuals, as they are what is not being captured by the model. The better the line fits and the smaller the residuals, the more strongly the predictor predicts the outcome.

Figure 6.1: Simple Regression Plot


    6.1.1 Sums of Squares

When discussing the sums of squares we get two main equations. The first is the sums of squares for the model (the regression sums of squares) in 6.1. This is the sum of the squared differences between our predicted values and the mean, and it reflects how much of the variation our model captures. We want this number to be as high as possible.

SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2  (6.1)

The second is the sums of squares for the error (the residual sums of squares), the sum of squared differences between the predicted and actual values of the outcome. This we want to be as low as possible; it is shown in 6.2.

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2  (6.2)

The total sums of squares can be found by summing the SSR and SSE, or by 6.3.

SST = \sum_{i=1}^{n} (y_i - \bar{y})^2  (6.3)

Table 6.1 shows how these pieces can be arranged. We commonly report the sums of squares and degrees of freedom along with the F statistic; the mean squares are less important but are shown for the purposes of the examples in this book.

                   Sums of Squares   DF          Mean Square               F Ratio
Regression         SSR               p           MSR = SSR / p             F = MSR / MSE
Residual (Error)   SSE               n - p - 1   MSE = SSE / (n - p - 1)
Total              SST               n - 1

Table 6.1: ANOVA Table
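These pieces can be pulled straight out of a fitted linear model in R; a sketch with simulated data:

set.seed(123)
x <- rnorm(50)
y <- 2 + 1.5 * x + rnorm(50)      # simulated outcome

fit  <- lm(y ~ x)
yhat <- fitted(fit)

SSR <- sum((yhat - mean(y))^2)    # regression sums of squares, equation 6.1
SSE <- sum(residuals(fit)^2)      # error sums of squares, equation 6.2
SST <- sum((y - mean(y))^2)       # total sums of squares, equation 6.3

all.equal(SST, SSR + SSE)         # TRUE: the pieces add up
anova(fit)                        # the table laid out as in Table 6.1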

    6.2 Model

First let's look at the simplest model: with one predictor we have a simple linear regression, 6.4. As shown, \beta_0 is the intercept parameter of the model, also called the y-intercept; it is where the regression line crosses the y-axis on the coordinate plane when x = 0. The \beta_1 is the parameter estimate for the predictor beside it, the x. This shows the magnitude and direction of the relationship to the outcome variable. Finally, \epsilon is the residual: how much the data deviate from the regression line. This is also called the error term; it is the difference between the predicted values of the outcome and the actual values.

y = \beta_0 + \beta_1 x_1 + \epsilon  (6.4)


With more than one predictor we have multiple linear regression; a model with two or more predictors looks like 6.5. Note that the subscript p stands for the number of predictors, so there is a \beta and an x for each independent variable.

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon  (6.5)

    6.2.1 Simple Linear Regression

If we have the raw data, we can find the estimates by hand. In the era of very high speed computers it is rare that you will have to compute these statistics manually, but we should still look at the equations to see how the slopes are derived. Equation 6.6 shows how to calculate the beta coefficient for a simple linear regression. We square the deviations so we get a measure of distance from the best fitting line: if we just added them up, some would be below the line and some above, giving negative and positive values respectively, so they would sum to zero (as is one of the assumptions about the error term). Squaring removes this issue.

\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}  (6.6)

Equation 6.7 shows how the intercept parameter is calculated in a simple linear regression. This is where the regression line crosses the y-axis when x = 0.

\beta_0 = \bar{y} - \beta_1 \bar{x}  (6.7)

Finally we come to our residuals. When we plug values of x into the equation, we get the fitted values, that is, the values predicted by the regression equation, signified by \hat{y}. We subtract the predicted (fitted) value from the actual outcome value, as in 6.8. This shows how far our actual values fall from the line and gives us an idea of which values are furthest from the regression line.

\epsilon = y - \hat{y}  (6.8)

We can also find how much of the variability in our outcome is being explained by our predictors. When we run this model we will get a Pearson's correlation coefficient (r). We can square this number (as we did for the correlation) to get the amount of variance explained. This can be computed in several ways; see 6.9.

r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}  (6.9)

We do need to adjust our r squared value to account for the complexity of the model. Whenever we add a predictor, we will always explain more variance; the question is whether it is truly explaining variance for theoretical reasons or just randomly adding to the explained variation. The adjusted r squared should be comparable to the non-adjusted value; if they are substantially different, you should look at your model more closely. The adjusted r squared can be particularly sensitive to sample size, so smaller samples will show larger differences between the two values. It is best to report both if they differ by a non-trivial amount.

\text{Adjusted } r^2 = 1 - \frac{SSE/(n - p - 1)}{SST/(n - 1)} = 1 - (1 - r^2)\,\frac{n - 1}{n - p - 1}  (6.10)

We can look at an example with data. Let's return to our cars example and see if we can predict the price of a vehicle based on its miles per gallon (MPG) of fuel used while driving in the city; call the fitted model mod1. A sketch of the code is below.
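The cars data set itself is not printed in this book, but its variable names match the Cars93 data shipped with R's MASS package, so a plausible sketch of the fit is:

library(MASS)   # assuming the 1993 cars data is MASS::Cars93, which has
                # Price (in $1000s), MPG.city, and Fuel.tank.capacity

mod1 <- lm(Price ~ MPG.city, data = Cars93)   # simple linear regression
summary(mod1)   # slope for MPG.city, r-squared, and the overall F test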


For this model, r^2 = 0.3535 and the adjusted r^2 = 0.3464. While the adjusted value is slightly lower, it is not a major difference, so we can trust this value.

    6.2.2 Multiple Linear Regression

Multiple linear regression is similar to simple regression except that we place more than one predictor in the equation. This is how most models in social science are run, since we expect more than one variable to be related to our outcome.

y = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \epsilon  (6.11)

Let's go back to the data and add to our model above not only miles per gallon in the city but also fuel tank capacity.

> mod3 <- lm(Price ~ MPG.city + Fuel.tank.capacity)
> summary(mod3)

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)

    (Intercept) 10.1104 11.6462 0.868 0.38763

    MPG.city -0.4608 0.2395 -1.924 0.05747 .

    Fuel.tank.capacity 1.1825 0.4104 2.881 0.00495 **

    ---

    Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

    Residual standard error: 7.514 on 90 degrees of freedom

    Multiple R-squared: 0.4081, Adjusted R-squared: 0.395

    F-statistic: 31.03 on 2 and 90 DF, p-value: 5.635e-11

We find we reject the null hypothesis with F(2, 90) = 31.03, p < .05. We have r^2 = 0.408 and adjusted r^2 = 0.395, so this model is fitting well and we can explain around 40% of the variance with these two parameter estimates. Interestingly, miles per gallon fails to remain significant in the model, \beta = -0.4608, t(90) = -1.924, p = 0.057. This is one of those times where significance is close, and most people who hold rigidly to the alpha of .05 would say it isn't important. I don't hold such views; while this predictor seems less important than in the last model, it is still worth mentioning as a possible predictor, but in the presence of fuel tank capacity it has less predictive power. Fuel tank capacity is strongly related to price, \beta = 1.1825, t(90) = 2.881, p < .05. The relationship here is positive, so the greater the fuel tank capacity, the higher the price. We could speculate that larger vehicles, with larger capacity, will be more expensive. We have also seen consistently that miles per gallon in the city is inversely related to price, which may also relate to size: larger vehicles may get less fuel efficiency but be more expensive, while smaller cars may be more fuel efficient and yet cheaper. I am not an expert on vehicle pricing, so we will just trust the data from this small sample.

    6.3 Interpretation of Parameter Estimates

    6.3.1 Continuous

When a variable is continuous, interpretation is generally relatively straightforward. We interpret the coefficient to mean that a one unit increase in the predictor corresponds to an increase in y of \beta units. So let's say you have the equation y = \beta_0 + 2x + \epsilon. Here the 2 is the parameter estimate (\beta), so we say that for each unit increase in x, y increases by 2 units. When saying the word unit we are referring to the original measurements of the individual variables. So if x is income in thousands of dollars and y is test scores, then each one thousand dollar increase in income (x) means a 2 point greater score on the exam. This changes if we transform our variables. If we standardize our x values, we would say that for each standard deviation increase in x, y increases by two units. If we standardize both y and x, we would say a one standard deviation increase in x means a two standard deviation increase in y. If we log our outcome, then we would say that a one thousand dollar increase in income means a 2 log-unit increase in y. One thing to note is that when statisticians (or almost all scientists) say log, they mean the natural log. To transform back to the original units you take the exponential function, e^y, if you had taken the log of the outcome (reasons for doing this will be discussed in testing assumptions). If we take the log of both y and x, then we can talk about percents: a one percent increase in x means a 2 percent increase in y, although to get back to the original units exponentiation is still necessary. If we look at our models above, in the simple linear regression model with just MPG in the city, for each increase of one MPG in the city, the price goes down by 1.0219 thousand dollars; the coefficient is negative, so the relationship is inverse. In our multiple regression model, we see that for each gallon increase in fuel tank capacity, the price increases by 1.1825 thousand dollars; this coefficient is positive.
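Standardizing or logging variables before fitting is straightforward in R with scale() and log(); a sketch, again assuming the Cars93 data from MASS:

library(MASS)

# Predictor standardized: the slope is the change in price (in $1000s)
# per one standard deviation increase in city MPG.
mod_zx  <- lm(Price ~ scale(MPG.city), data = Cars93)

# Both standardized: the slope is in standard deviation units.
mod_zxy <- lm(scale(Price) ~ scale(MPG.city), data = Cars93)

# Log of the outcome, as in the model shown next.
mod_log <- lm(log(Price) ~ MPG.city, data = Cars93)

coef(mod_zx); coef(mod_zxy); coef(mod_log)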

6.3.1.1 Transformation of Continuous Variables

Sometimes it is necessary to transform our variables. This can be done to make interpretation easier or more relevant to our research question, or to allow our model to meet assumptions.

6.3.1.1.1 Natural Log of Variables Here we will explore what happens when we take the log of continuous variables.

> mod2 <- lm(log(Price) ~ MPG.city)
> summary(mod2)

Residuals:
     Min       1Q   Median       3Q      Max
-0.58391 -0.19678 -0.04151  0.19854  1.06634

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 4.15282 0.13741 30.223 < 2e-16 ***

    MPG.city -0.05756 0.00596 -9.657 1.33e-15 ***

    ---

    Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

    Residual standard error: 0.3213 on 91 degrees of freedom

    Multiple R-squared: 0.5061, Adjusted R-squared: 0.5007

    F-statistic: 93.26 on 1 and 91 DF, p-value: 1.33e-15

Here we have taken the natural logarithm of our outcome variable. This will be shown later to be advantageous when looking at our assumptions and violations of them. It can also make model interpretation different and sometimes easier. Now, instead of the original units, the outcome is in log units, so we would say that for each one MPG increase, the log of price decreases by 0.0576 (roughly a 5.6 percent drop in price). The coefficient is negative, so the relationship is still inverse. Notice that the percent of variance explained increased dramatically, from 35% to 50%; this is due to the transformation process.

> mod3

Coefficients:
            Estimate Std. Error t value Pr(>|t|)

    (Intercept) 7.5237 0.4390 17.14


This model looks at what happens when we take the natural log of both the outcome and the predictor. The interpretation changes again, but now both sides are in percent terms: for each one percent increase in MPG in the city, the price decreases by 1.512 percent. The parameter estimates have also changed because of the transformation.
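The call that produced mod3 is not printed above. A minimal sketch of an equivalent log-log fit, again assuming the data are MASS::Cars93 (the exact call used in the text may differ):

library(MASS)
mod3_sketch <- lm(log(Price) ~ log(MPG.city), data = Cars93)  # log outcome, log predictor
summary(mod3_sketch)
# The slope is an elasticity: a 1% increase in city MPG is associated with a
# (slope) percent change in price.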

    6.3.2 Categorical

When our predictors are categorical, we need to be careful about how they are modeled. They cannot simply be entered as numerical codes or as words; doing so would produce wrong estimates, because the model would treat the variable as continuous.

    6.3.2.1 Nominal Variables

For nominal variables we must recode the levels of the factor. One way to do this is dummy coding, where each dummy variable codes one level of the factor as 1 and all other levels as 0. If we denote the number of levels as k, then the total number of dummy variables we can enter for that factor is k - 1. For example, if we are coding sporting events such as football, basketball, soccer, and baseball, the total number of dummy variables we can have is 3. The coding is done automatically in most statistics programs, or you can recode the variables yourself. Our sports example would look like Table 6.2.

Factor Levels   Dummy 1   Dummy 2   Dummy 3
Football              1         0         0
Basketball            0         1         0
Soccer                0         0         1
Baseball              0         0         0

    Table 6.2: How Nominal Variables are Recoded in Regression Models using Dummy Coding

As you can see, the baseball level of our sports variable is all zeros. This is the baseline group, against which the other groups are compared. Dummy coding works well when there is a natural baseline group (like the control group in a treatment vs. control medical study). Our example does not have a natural baseline, so we can use another type of coding called contrast coding.

Factor Levels   Dummy 1   Dummy 2   Dummy 3
Football             -1         0         0
Basketball            0        -1         0
Soccer                0         0        -1
Baseball              1         1         1

    Table 6.3: How Nominal Variables are Recoded in Regression Models using Contrast Coding

As you can see, each column of the coding sums to 0. Of course, in real data sets the levels (or groups) of a factor may not be evenly sized; if there were 25 participants who played football and only 23 baseball players, finding contrast values that sum to zero becomes more difficult. Luckily, many software programs can produce this type of coding automatically, as sketched below.
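A minimal R sketch of the two coding schemes, using a hypothetical sports factor like the one in the tables (R's contr.sum flips the signs relative to Table 6.3, but the idea is the same):

# Hypothetical factor with the four sports from Tables 6.2 and 6.3.
sport <- factor(c("Football", "Basketball", "Soccer", "Baseball"))

contrasts(sport)                          # default dummy (treatment) coding
sport <- relevel(sport, ref = "Baseball")
contrasts(sport)                          # dummy coding with Baseball as the baseline

contrasts(sport) <- contr.sum(nlevels(sport))
contrasts(sport)                          # sum-to-zero (contrast) coding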

  • 6.4. MODEL COMPARISIONS 33

If these coded variables were our only predictors, the model would be equivalent to an analysis of variance, and the intercept would be the mean of the baseline group. This is no longer the case once additional predictors are added, which gives an analysis of covariance.

    6.3.2.2 Ordinal Variables

For ordinal variables, we can generally enter them in the model as a single variable and do not require dummy coding. This is because the assumption of linearity is relatively tenable: we expect the categories to be naturally ordered and increasing. The interpretation is that as you go up one category, y changes by the amount of the parameter estimate (the beta coefficient for that variable).
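A small simulated sketch of this; the education variable and its values here are hypothetical, made up purely for illustration:

# Hypothetical ordinal predictor entered as a single numeric variable.
set.seed(1)
edu   <- sample(1:5, 100, replace = TRUE)      # ordered categories 1-5, treated as numbers
score <- 50 + 3 * edu + rnorm(100, sd = 5)     # simulated outcome
coef(lm(score ~ edu))                          # slope: change in score per one-category step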

6.4 Model Comparisons

In much research we want to know how much the fit of a model changes when we add or remove one or more predictors. When we do model comparisons, we must ensure the models are nested: one model contains all of the predictors of the other plus the predictor(s) being tested, and both are fit to the same data. For example, in the models above we used MPG and fuel tank capacity. We might want to know how much adding fuel capacity improves the fit over the MPG-only model, or how adding MPG improves the fit of the model that already contains fuel capacity. We cannot directly compare a simple regression containing only fuel capacity with another containing only MPG, because neither is nested within the other.
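In R, nested linear models can be compared with an F test via anova(); a minimal sketch, again assuming the data are MASS::Cars93:

library(MASS)
small <- lm(Price ~ MPG.city, data = Cars93)
big   <- lm(Price ~ MPG.city + Fuel.tank.capacity, data = Cars93)   # small is nested in big
anova(small, big)   # F test for whether adding fuel tank capacity improves the fit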

    6.5 Assumptions

The assumptions for regression depend on the kind of regression being used. For continuous outcomes, the assumptions are that the errors are homoscedastic and normally distributed, that the outcome is linearly related to the predictors, and that the observations are independent of one another. We first look at the assumptions of linear regression and how to test them, and then discuss corrections when they are violated.

    6.6 Diagnostics

We need to make sure our model meets its assumptions, and we need to see whether we can correct for the cases in which the assumptions are violated.

    6.6.1 Residuals

First we need to look at our residuals. Remember that a residual is the observed y value minus the predicted y value. For this exercise I will use the cars data from above, as it is a good data set for discussing regression. For the purpose of examining our assumptions, let us stick with the simple regression in which vehicle price is the outcome and miles per gallon in the city is the predictor. Here I will provide R commands and output along with discussion of them.
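The calls that created mod1, mod2, and mod3 are not printed above; a sketch consistent with the summaries shown earlier, assuming the data are MASS::Cars93, would be:

library(MASS)
mod1 <- lm(Price ~ MPG.city, data = Cars93)            # raw units
mod2 <- lm(log(Price) ~ MPG.city, data = Cars93)       # logged outcome
mod3 <- lm(log(Price) ~ log(MPG.city), data = Cars93)  # logged outcome and predictor

head(residuals(mod1))   # residual = observed Price minus fitted Price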

    6.6.1.1 Normality of Residuals

First let's look at our assumption of normality. We assume our errors are normally distributed with mean 0 and some unknown variance. We can test this with my preferred test, the Shapiro-Wilk test, which works for sample sizes from 3 to 5000 (Shapiro and Wilk, 1965).


6.6.1.1.1 Tests Let's look at the models above and see whether the normality assumption is met. First we test mod1, which uses the variables in their original form.

    > shapiro.test(residuals (mod1))

    Shapiro-Wilk normality test

    data: residuals(mod1)

    W = 0.8414, p-value = 1.434e-08

As you can see, the results aren't pretty: we reject the null hypothesis, W = 0.8414, p < .05, which means there is enough evidence to say that the sample deviates from the theoretical normal distribution the test expects. For this test the null hypothesis is that the sample does conform to a normal distribution, so unlike most tests, we do not want to reject it.

    > shapiro.test(residuals (mod2))

    Shapiro-Wilk normality test

    data: residuals(mod2)

    W = 0.9675, p-value = 0.02022

A second model with the log of the outcome helps some, but we still cannot say our assumption is tenable, W = 0.9675, p < .05.

    > shapiro.test(residuals (mod3))

    Shapiro-Wilk normality test

    data: residuals(mod3)

    W = 0.9779, p-value = 0.1154

This time we cannot reject the null hypothesis, W = 0.9779, p > .05, so taking the log of both our outcome and predictor lets the residuals approximate the normal distribution; at the very least, there is not enough evidence to say our distribution differs significantly from the theoretical (expected) normal distribution.

While logging variables has its advantages, logging both the outcome and the predictor as we did in the last example is not common. The non-normality here may be attributable to our sample of only 93 cars. Be cautious when doing transformations like this, as they change the interpretation of the coefficients.


6.6.1.1.2 Plots Now let's look at plots. Two plots are important: a histogram and a Q-Q plot. The histogram lets us look at the distribution of the residual values, and the Q-Q plot plots our residuals against what we would expect from a theoretical normal distribution. In the Q-Q plot, the line represents where we want our residuals to fall; points on the line match the theoretical normal distribution.
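A minimal sketch of the commands that produce these two plots for mod1 (assuming mod1 was fit as in the sketch above):

library(MASS)
mod1 <- lm(Price ~ MPG.city, data = Cars93)
r <- residuals(mod1)

hist(r, freq = FALSE, main = "Distribution of Residuals", xlab = "residuals(mod1)")
qqnorm(r, main = "Normal Q-Q Plot")
qqline(r)   # reference line: where normally distributed residuals would fall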

[Figure 6.2: Histogram of Studentized Residuals for Model 1, with Normal Q-Q Plot]

The first set of plots shows us what we expected from the statistics above. Our residuals do not conform to a normal distribution: we can see heavy right skew in the histogram, and the Q-Q plot is very non-normal at the extremes.


[Figure 6.3: Histogram of Studentized Residuals for Model 2, with Normal Q-Q Plot]

As we saw in the test statistics, taking the log of our outcome made things better, but not quite enough to make our assumption of normality tenable. We are still seeing too much right skew in the distribution.

[Figure 6.4: Histogram of Studentized Residuals for Model 3, with Normal Q-Q Plot]

This looks much better: the distribution of the residuals is far more normal. The Q-Q plot still shows some deviation at the top and bottom, but the Shapiro-Wilk test did not give us enough evidence to reject normality, so the assumption is tenable here.


    6.7 Final Thoughts

Linear regression is used very widely in statistics, most notably because of the pleasing mathematical properties of the normal distribution. Its ease of interpretation and wide implementation in software packages add to its appeal. One should nevertheless be careful in using it, checking in particular that the model's errors are approximately normally distributed.


Chapter 7

    Logistic Regression

Now we begin to discuss the situation in which our outcome is not continuous. Logistic regression deals with the case in which the outcome is binary, that is, it can take on only one of two values (almost universally coded 0 and 1). This has many applications: graduate or not graduate, contract an illness or not, get a job or not, and so on. It does pose problems for interpretation at times, because the results are not as easy to read directly.

    7.1 The Basics

We have to model events that take on values of 0 or 1. The problem with linear regression here is that it fits a straight line, which cannot respect the fact that our values are bounded. This means we must turn to a distribution other than the normal distribution.

    7.2 Regression Modeling Binomial Outcomes

Contingency tables are useful when we have one categorical covariate, but they are not possible when we have a continuous predictor or multiple predictors. Even when there is a single variable of interest in relation to the outcome, researchers still try to control for the effects of other covariates. This leads to the use of a regression model to test the relationship between a binary outcome and one or several predictors.

    7.2.1 Estimation

The basic regression model taught in introductory statistics classes is linear regression. This has a continuous outcome, and estimation is typically done using least squares, which was discussed in 6.1. With a binomial outcome we cannot use this estimation technique: the binomial model estimates proportions, which are bounded between 0 and 1, and a least squares model may give estimates outside these bounds. Therefore we turn to maximum likelihood and a class of models known as Generalized Linear Models (GLM)1.

E(y) = β0 + Σ(i = 1 to p) βi xi    (7.1)

Here E(y) is the random component and β0 + Σ βi xi is the systematic component.

The random component is the outcome variable; it is called the random component because we want to know why there is variation in this variable. The systematic component is the linear combination of our covariates and the parameter estimates. When our outcome is continuous we do not have to worry about establishing a linear relationship, as we assume one exists if the covariates are related to the outcome. When we have a categorical outcome we cannot have this linear relationship directly, so GLMs provide a link function that allows a linear relationship between the covariates and a transformation of the outcome.

1 For SPSS users, do not confuse this with the General Linear Model, which performs ANOVA, ANCOVA, and MANOVA.

2 Some authors use α to denote the intercept term, although most still use β0, which will be used here.



    7.2.2 Regression for Binary Outcomes

Two of the most common link functions are the logit and probit functions. These allow us to look at a linear relationship between our outcome and our covariates. In Figure 7.1 you can see there is not a lot of difference between logit and probit; the difference is in the interpretation of the coefficients (discussed below). The green line shows why a traditional regression line is not an appropriate fit: the fitted line runs outside the 0 to 1 range of the data (the blue dots). The logit and probit fits model the probability of a success. The figure also shows that there is little difference in actual model fit between the two, and logit and probit models will be very similar in the substantive conclusions they support; the primary difference is in the interpretation of the results. While we do not have a true r2 coefficient, there is a pseudo r2, created by Nagelkerke (1992), which gives a general sense of how much variation is being explained by the predictors.

[Figure 7.1: Logit, Probit and OLS regression lines; data simulated from R]


    7.2.2.1 Logit

The most common model in education research is the logit model, also known as logistic regression. There are two equations we can work with: equation 7.2 gives the log odds of a positive response (a success).

logit[π(x)] = log( π(x) / (1 - π(x)) ) = β0 + βp xp    (7.2)

The probability of a positive response is calculated from equation 7.3.

π(x) = exp(β0 + βp xp) / (1 + exp(β0 + βp xp))    (7.3)

Fitted values (either log odds or probabilities) are what statistical programs usually report, using the covariate values in the sample. A researcher can also plug in covariate values for hypothetical participants, and the equation will give a probability for those values. One caution is to make sure the values you plug in are within the range of the data (i.e., if your sample's ages are 18-24, do not solve the equation for a 26-year-old, since the model was fitted with data that did not include that age range).
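A minimal sketch of a logit fit and of using equation 7.3 to turn a fitted log odds into a probability. The binary outcome used in this chapter's output is not shown, so a hypothetical one (price above the sample median, made up purely for illustration) stands in; the data are again assumed to be MASS::Cars93.

library(MASS)
expensive <- as.numeric(Cars93$Price > median(Cars93$Price))   # hypothetical 0/1 outcome
fit <- glm(expensive ~ MPG.city, family = binomial(link = "logit"), data = Cars93)

eta <- coef(fit)[1] + coef(fit)[2] * 20   # log odds for a car getting 20 MPG in the city
plogis(eta)                               # probability: exp(eta) / (1 + exp(eta))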

    7.2.2.2 Probit

The probit function is similar, but it assumes an underlying latent normal distribution; the link, shown in equation 7.4, is the inverse of the standard normal cumulative distribution function, so a probit model converts probabilities into z scores. Agresti (2007, p. 72) gives the example of a probability of 0.05, whose probit is -1.645, that is, 1.645 standard deviations below the mean.

Φ⁻¹(π(x)) = β0 + βp xp, or equivalently π(x) = Φ(β0 + βp xp)    (7.4)

    7.2.2.3 Logit or Probit?

As can be seen in Figure 7.1, the model fit for logistic and probit regression is very similar, and this is usually true. It is also possible to rescale coefficients from logit to probit or vice versa: Amemiya (1981) showed that dividing a logit coefficient by about 1.6 gives the corresponding probit coefficient, Gelman (2006) ran simulations and found factors between 1.6 and 1.8 to work well, and Agresti (2007) likewise mentions the scaling being between 1.6 and 1.8.
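A quick sketch of that rescaling, fitting the same hypothetical model with both links and comparing the slopes (the ratio typically lands around 1.6 to 1.8):

library(MASS)
expensive  <- as.numeric(Cars93$Price > median(Cars93$Price))   # same hypothetical outcome as above
logit_fit  <- glm(expensive ~ MPG.city, family = binomial(link = "logit"),  data = Cars93)
probit_fit <- glm(expensive ~ MPG.city, family = binomial(link = "probit"), data = Cars93)

coef(logit_fit)["MPG.city"] / coef(probit_fit)["MPG.city"]   # usually about 1.6-1.8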

7.2.2.4 An Example with the Car Data

    This is an example from the car data set we have been using.

    > mod1


Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.5657 -1.0525 -0.8482  1.2144  1.4507

    Coefficients:

    Estimate Std. Error z value Pr(>|z|)

    (Intercept) -2.44532 1.01996 -2.397 0.0165 *

    MPG.city 0.10721 0.04545 2.359 0.0183 *

    ---

    Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance: 128.83 on 92 degrees of freedom

    Residual deviance: 122.09 on 91 degrees of freedom

    AIC: 126.09

    Number of Fisher Scoring iterations: 4

> mod2

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)

    (Intercept) 1.20897 1.10317 1.096 0.273

    Fuel.tank.capacity -0.07649 0.06512 -1.175 0.240

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance: 128.83 on 92 degrees of freedom

    Residual deviance: 127.42 on 91 degrees of freedom

    AIC: 131.42

    Number of Fisher Scoring iterations: 4

    > mod3


    Deviance Residuals:

    Min 1Q Median 3Q Max

    -1.7426 -1.0539 -0.7408 1.1435 1.6613

    Coefficients:

    Estimate Std. Error z value Pr(>|z|)

    (Intercept) -9.29693 4.14605 -2.242 0.0249 *

    MPG.city 0.24208 0.09519 2.543 0.0110 *

    Fuel.tank.capacity 0.23209 0.13229 1.754 0.0794 .

    ---

    Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance: 128.83 on 92 degrees of freedom

    Residual deviance: 118.74 on 90 degrees of freedom

    AIC: 124.74

    Number of Fisher Scoring iterations: 4

    7.2.3 Model Selection

Researchers tend to fit multiple models to find the best-fitting model consistent with their theoretical framework, and there are several ways to evaluate which model fits best. Sequential model building, in which predictors are added to a regression model one step at a time, is a technique frequently used, and the same framework applies to other regression models as well. In linear regression, nested models are compared with an F test; models estimated by maximum likelihood are compared with the likelihood ratio test, which follows a chi-squared distribution. Shmueli (2010) examines the differences between building a model to explain the relationship of predictors to an outcome and building a model to predict an outcome in future data, and the article also discusses information criteria such as the AIC and BIC measures used to assess model fit.
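A minimal sketch of both approaches for nested logistic models, using the same hypothetical outcome as before (the chapter's actual outcome variable is not shown):

library(MASS)
expensive <- as.numeric(Cars93$Price > median(Cars93$Price))
m1 <- glm(expensive ~ MPG.city,                      family = binomial, data = Cars93)
m3 <- glm(expensive ~ MPG.city + Fuel.tank.capacity, family = binomial, data = Cars93)

anova(m1, m3, test = "Chisq")   # likelihood ratio (chi-squared) test for the added predictor
AIC(m1, m3)                     # lower AIC suggests the better trade-off of fit and complexity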

    7.3 Further Reading

This chapter borrows heavily from Alan Agresti (2007), who is well known and respected for his work in categorical data analysis. Some books that cover many statistical models yet still do a good job with logistic regression are Tabachnick and Fidell (2006) and Stevens (2009); the first works well as a textbook, while Stevens is dense but includes both SPSS syntax and SAS code and is a must-have reference. Gelman and Hill (2007) is rapidly becoming a classic in statistical inference; its computation is focused on R, which has not yet hit mainstream academia, but the book includes supplemental material for other programs at the end. For those with an interest in R, another great book is Faraway (2005). Andy Field (2009) has a classic book, Discovering Statistics Using SPSS, which blends SPSS and statistical concepts very nicely and is good at explaining difficult statistical concepts. Students who wish to explore categorical data analysis conceptually have a few good options: I recommend Agresti (2002), a different book from his 2007 text, with a focus on theory yet still a lot of great examples of application. Long's (1997) book explores maximum likelihood methods focusing on categorical outcomes, combining conceptual and mathematical treatments of maximum likelihood. Finally, McCullagh and Nelder (1989) is the seminal work on generalized linear models (the citation here is to their well-known second edition).

    7.4 Conclusions

This chapter treated logistic regression in an introductory manner. There is more to analyzing binomial outcomes, and reading some of the works above can help, which is especially important for researchers whose outcomes will be binomial. These principles will also act as a starting point for learning about other categorical outcomes, such as nominal outcomes with more than two categories or ordinal outcomes (often used for Likert scales).

Bibliography

Agresti, A. (2002). Categorical Data Analysis. Hoboken, NJ: Wiley-Interscience.

Agresti, A. (2007, March). An Introduction to Categorical Data Analysis. Hoboken, NJ: Wiley-Blackwell. doi:10.1002/0470114754

Amemiya, T. (1981). Qualitative response models: a survey. Journal of Economic Literature, 19(4), 1483-1536. doi:10.2298/EKA0772055N

American Psychological Association. (2009). Publication Manual of the American Psychological Association, Sixth Edition. American Psychological Association.

Andersen, P. K. & Skovgaard, L. T. (2010). Regression with Linear Predictors. Statistics for Biology and Health. New York, NY: Springer New York.

Bingham, N. H. & Fry, J. M. (2010). Regression: Linear Models in Statistics. Springer Undergraduate Mathematics Series. London: Springer London. doi:10.1007/978-1-84882-969-5

Chatterjee, S. & Hadi, A. S. (2006). Regression Analysis by Example (4th ed.). Hoboken, NJ: Wiley-Interscience.

Everitt, B. S. & Hothorn, T. (2010). A Handbook of Statistical Analyses Using R, Second Edition. Boca Raton, FL: Chapman and Hall/CRC.

Faraway, J. J. (2002). Practical Regression and ANOVA Using R.

Faraway, J. J. (2004). Linear Models with R (Chapman & Hall/CRC Texts in Statistical Science). Boca Raton, FL: Chapman and Hall/CRC.

Faraway, J. J. (2005). Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models (Chapman & Hall/CRC Texts in Statistical Science). Boca Raton, FL: Chapman and Hall/CRC.

Field, A. (2009). Discovering Statistics Using SPSS (Introducing Statistical Methods). Thousand Oaks, CA: Sage Publications Ltd.

Gelman, A. & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. New York: Cambridge University Press.

Gelman, A. (2006). Take logit coefficients and divide by approximately 1.6 to get probit coefficients. Retrieved from http://www.andrewgelman.com/2006/06/take%5C_logit%5C_coef/

Lock, R. (1993). 1993 new car data. Journal of Statistics Education, 1(1). Retrieved from http://www.amstat.org/PUBLICATIONS/JSE/v1n1/datasets.lock.html

Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: SAGE Publications.

McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models, Second Edition (Chapman & Hall/CRC Monographs on Statistics & Applied Probability). Boca Raton, FL: Chapman and Hall/CRC.




Nagelkerke, N. J. D. (1992). Maximum Likelihood Estimation of Functional Relationships. New York, NY: Springer-Verlag.

Pearl, J. (2009). Causal inference in statistics: An overview. Statistics Surveys, 3, 96-146.

Pearl, J. (2009). Causality: Models, Reasoning and Inference. Cambridge University Press.

Rencher, A. & Schaalje, B. (2008). Linear Models in Statistics (2nd ed.). Wiley-Interscience.

Shapiro, S. S. & Wilk, M. B. (1965, December). An analysis of variance test for normality (complete samples). Biometrika, 52(3-4), 591-611. doi:10.1093/biomet/52.3-4.591

Sheather, S. J. (2009). A Modern Approach to Regression with R. New York, NY: Springer Verlag. Retrieved from http://www.springerlink.com/content/978-0-387-09607-0

Shmueli, G. (2010, August). To Explain or to Predict? Statistical Science, 25(3), 289-310.

Stevens, J. P. (2009). Applied Multivariate Statistics for the Social Sciences, Fifth Edition. New York, NY: Routledge Academic.

Tabachnick, B. G. & Fidell, L. S. (2006). Using Multivariate Statistics (5th ed.). Upper Saddle River, NJ: Allyn & Bacon.

Venables, W. N. & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). New York, NY: Springer.

