lecture notes 1 - statistics homepagevollmer/stat307pdfs/ln1_2017.pdflecture notes 1: terminology...


Upload: vuongque

Post on 24-May-2018


Lecture Notes 1: Terminology and Statistical Studies

Outline:
- What is statistics?
- What is data?
- Variables
- Samples
- Inference
- Parameters & statistics
- Sampling and bias
- Types of studies
- Confounding variables
- Blinding & double blinding


What Is Statistics?

• Statistics is the science of how data is collected, analyzed, and interpreted.

• Data is information (this is an intentionally vague definition).

• Typically data takes the form of observed measurements (e.g. height, temperature) or descriptions (e.g. dog, cat, blue, female)


• Less dry definition: statistics as a discipline concerns how we use data to say things about the real world. We want to turn raw information into useful knowledge!

• We might want to summarize data. This can be done visually (“here’s a graph of last week’s temperatures”) or with numbers (“the average temperature last week was 42 degrees Fahrenheit”).

• We might want to use data to learn about a quantity that is unknown or practically unknowable.

For instance, how many trees are there in Rocky Mountain National Park? No one knows! But maybe we can gather some data and get an approximate idea.


Wise words:

“I believe that it would be worth trying to learn something about the world even if in trying to do so we should merely learn that we do not know much. This state of learned ignorance might be a help in many of our troubles. It might be well for all of us to remember that, while differing widely in the various little bits we know, in our infinite ignorance we are all equal.” -Karl Popper “Knowledge Without Authority”, 1960


• Uncertainty is rampant! We use statistics to answer questions about things that are unknown (e.g. the number of trees in Rocky Mountain National Park), but just as importantly, we use statistics to get an idea of how sure we are of our answers. Sometimes it is useful to know that you don’t know!

• Uncertainty will be a theme of this class. “We can’t know for sure” will be the answer to a lot of questions. Statistical tools are useful for making the most out of limited information.


Some statistical statements

• “I sleep seven hours per night, on average”

• “Cam Newton set NFL rookie records for passing yards in a game, yards in a season, yards in back to back games, and touchdowns in a season”

• “48% of likely voters intend to vote for the incumbent, with a margin of error of +/- 3%”

• “Students who attend class regularly tend to get higher grades than those who do not”


Terminology

• The next group of slides will cover commonly used terminology found in the discipline of statistics.

• Most of these terms are best understood in the context of a research study.

• For the purpose of illustration, we will suppose that a study is being done for the purpose of estimating the average height of all adult females living in the U.S.


Variables

• Variables are items of interest (usually represented by letters or symbols) which can take on different values.

• You can usually tell what a variable of interest is by looking at data and noting the type of measurement being taken.

• For example, if we have collected data on the heights of adult females living in the US, our variable is simply height.

• If we denote height as “X”, our data might look like:

X1   X2   X3   X4   X5   X6   X7
65”  70”  61”  67”  64”  67”  63”


Population

• A population is the entire overall group we are interested in.

• So, if we are trying to get an idea of the average height of US adult females, then the population of interest is all US adult females. Sometimes this is called the target population.

• Populations can be large or small. Usually they are large.


Sample

• A sample is a subset of the entire population that we collect data on. The variable(s) of interest is/are measured on each member of the sample.

• A single member of a sample is called an observation.

• We take samples because the entire population of interest is usually not available to us.

• For instance, if we are studying the height of US adult females, we will collect height data from a sample. We are not going to measure the height of every woman in the United States.


Census Data

• In the rare case that measurements are obtained from every member of a population, we say we have a census.

• Every 10 years, the federal government runs a census, and the goal is to collect information on every single person in the country.

• If our population is small, we may have census data. If it is large, we usually don’t. We aren’t going to measure the height of every woman in the US, or count all the trees in Rocky Mountain National Park.


Two fields of statistics

• There are, broadly speaking, two major fields of statistics: descriptive and inferential.

• Descriptive statistics involve describing a dataset: we could make a graph of it, or tell you its average and how spread out it is. We could tell you any interesting features about it.

• However, in descriptive statistics, we limit ourselves to describing the data itself. We do not generalize facts about the dataset to a larger group.


Inference

• If we generalize from a sample to a population, we are performing inferential statistics.

• These sorts of generalizations are called inferences, and the investigator is said to “draw inference” on the population, using the sample.

• e.g. we can take a sample of 100 women and use their average height to draw inference on the average height for the entire country.

• Statistical inferences always contain uncertainty. Our estimate for average height may be wrong!


Parameters

• A parameter is a numeric characteristic pertaining to a population. In our example, the parameter of interest is the average height of all US adult females.

• Most often, parameters cannot be determined exactly because we do not have census data, so we can’t know their values with certainty. However, we can still draw inference on a parameter using a statistic.


Statistics

• A statistic is any number you calculate using data. We use statistics to estimate parameters.

• In our example, if we use the average height of the women in our sample to estimate the average height of women in the country, then this sample average is our statistic.

• But remember, the statistic is just an estimate! We can’t ever know the true parameter value unless we are able to measure the entire population.
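As a small illustration of computing a statistic, the sketch below takes the seven example heights listed on the Variables slide and averages them. The variable names (`heights`, `x_bar`) are chosen here for illustration and do not come from the original notes.

```python
from statistics import mean

# The seven example heights (in inches) from the Variables slide.
heights = [65, 70, 61, 67, 64, 67, 63]

# The sample average is a statistic: a number calculated from the data.
x_bar = mean(heights)

print(f"sample size n = {len(heights)}")
print(f"sample mean   = {x_bar:.2f} inches")
```

The result (about 65.3 inches) describes only these seven women. Used as an estimate of the average height of all US adult females, it carries uncertainty, because the parameter itself remains unknown.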


Putting it all together

• So, in our example, we can take a sample of US women from the population of all US women and measure their height (the variable of interest), which gives us our data. Each individual woman in our data set is an observation. We can then calculate the average height of the women in our sample, which is our statistic. We use this statistic to draw inference about the average height of all women in the country, which is our parameter.


Major concerns in statistical inference

• We have to be careful when generalizing from a sample to a population. The descriptive characteristics of your sample may not accurately reflect the characteristics of the population you are studying.

• Two important questions in statistical inference are:
- Is my sample big enough?
- Is my sample representative of the population of interest?


Small sample sizes

• If our sample is small, we can’t confidently draw inference to the population.

• For example, if we measure the height of three women, their average height might not be a good estimate for the true average height of all women. Maybe we happened to sample three tall women. Or three short women.

• Smaller samples mean greater uncertainty when drawing inference to a population.
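The effect of sample size can be sketched with a small simulation. Everything numeric below is an assumption made for illustration: the population is a made-up list of 100,000 heights centered near 64 inches, and the "spread" is measured with the standard deviation, a summary these notes have not yet introduced. The sketch repeatedly draws random samples of size 3 and of size 100 and compares how much the sample averages vary.

```python
import random
from statistics import mean, stdev

rng = random.Random(307)  # fixed seed so repeated runs give the same numbers

# Hypothetical population of 100,000 heights in inches (made-up values).
population = [rng.gauss(64, 2.5) for _ in range(100_000)]

def spread_of_sample_means(n, repetitions=1000):
    """Draw many random samples of size n and return the standard
    deviation of their averages (how much the estimate jumps around)."""
    sample_means = [mean(rng.sample(population, n)) for _ in range(repetitions)]
    return stdev(sample_means)

spread_n3 = spread_of_sample_means(3)
spread_n100 = spread_of_sample_means(100)

print(f"spread of sample means, n = 3:   {spread_n3:.2f} inches")
print(f"spread of sample means, n = 100: {spread_n100:.2f} inches")
```

Under these assumptions the averages of three-person samples vary far more from sample to sample than the averages of hundred-person samples, which is the sense in which smaller samples carry greater uncertainty.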


Bias

• A statistic is said to be biased if it is created in such a way that we would expect it to differ systematically from the population parameter that it is meant to estimate.

• For example, if our sample consists entirely of the CSU volleyball team, then we will probably end up considerably overestimating the true average height of all US women.

• This is an example of sampling bias, which is a type of bias that arises when your sample is not representative of the population of interest.
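The volleyball example can be mimicked with made-up numbers. In the sketch below, the population sizes, the height distributions, and the sample size of 15 are all assumptions chosen for illustration, not values from the notes. It compares an estimate built only from the team with one built from a random draw of the whole population.

```python
import random
from statistics import mean

rng = random.Random(2017)  # fixed seed for repeatability

# Hypothetical population (made-up numbers): 9,900 non-athletes and
# 100 volleyball players, who tend to be taller.
non_athletes = [rng.gauss(64, 2.5) for _ in range(9_900)]
volleyball_players = [rng.gauss(71, 2.0) for _ in range(100)]
population = non_athletes + volleyball_players

# The parameter. It is known here only because we built the population.
true_mean = mean(population)

biased_mean = mean(rng.sample(volleyball_players, 15))  # team members only
random_mean = mean(rng.sample(population, 15))          # drawn from everyone

print(f"true population mean:       {true_mean:.1f}")
print(f"estimate from team only:    {biased_mean:.1f}")
print(f"estimate from random draw:  {random_mean:.1f}")
```

The team-only estimate sits well above the true average no matter how many times the sketch is rerun with different seeds; that systematic gap, rather than random scatter, is what makes it biased.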


Bias

• Another example of sampling bias comes from political polling. Polling firms used to contact potential voters using only telephones on land lines. However, with the advent of cell phones, many people do not have land lines.

• This can lead to sampling bias because people who have land lines differ systematically from people who only use cell phones with respect to age. Younger people tend to use only cell phones, and younger people tend to vote differently than older people.

• If a polling firm only calls land lines, its sample will be biased toward the opinions of older voters.


Bias

• In addition to sampling bias, we will consider two other common forms of bias: self-selection bias and non-response bias.

• Self-selection bias can occur when people choose whether they want to be included in a sample. If the reason for their choosing to be in the sample is related to what is being measured, bias can result.


Self-selection bias

• For example, during the last Republican presidential primary debates, many websites posted polls where people could vote on who won the debate.

• In nearly every poll, Ron Paul came out as the winner. But in the actual primary elections, Ron Paul rarely came close to winning.

• This is because Ron Paul’s supporters were well organized on the internet, and made coordinated efforts to vote in these online polls.


Self-selection bias

• Self-selection bias also occurs in online reviews on websites like amazon.com or yelp.com.

• Often you will see that most reviews are either 5 stars or 1 star. Perhaps this means that most people have sharply differing views regarding the thing being reviewed. But perhaps self-selection bias is at play.

• Who is most motivated to write a review? Those who feel lukewarm, or those who feel passionate?


Non-response bias

• Non-response bias can occur when certain types of respondents are more or less likely to answer a survey.

• For instance, if a company’s human resources team is trying to determine what proportion of their workforce feel as though they are overworked, sending out a survey may result in an underestimate of this proportion. After all, people who are overworked are less likely to take the time to fill out a survey!
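The human resources example can be sketched as a simulation. The workforce size, the 40% overworked share, and the two response rates below are made-up assumptions; the point is only to show the direction of the error when overworked employees respond less often.

```python
import random

rng = random.Random(24)  # fixed seed for repeatability

# Hypothetical workforce: 1,000 employees, 40% of whom feel overworked.
employees = [True] * 400 + [False] * 600  # True = feels overworked

# Assumed response rates (made up): overworked staff answer less often.
RESPONSE_RATE = {True: 0.2, False: 0.6}

# Keep only the answers of employees who actually return the survey.
responses = [overworked for overworked in employees
             if rng.random() < RESPONSE_RATE[overworked]]

true_proportion = sum(employees) / len(employees)
survey_proportion = sum(responses) / len(responses)

print(f"true proportion overworked:   {true_proportion:.2f}")
print(f"proportion among respondents: {survey_proportion:.2f}")
```

Under these assumed rates the survey understates the overworked share, matching the underestimate described above. Sending the survey to more people would not fix this, because the distortion comes from who responds rather than from how many are asked.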


Simple Random Sample

• To guard against bias, samples should be collected randomly.

• A simple random sample (SRS) is a sample of the population where every unit has an equal opportunity to be selected, as in drawing names from a hat.

• Valid inferences can be drawn using data obtained via an SRS. There are more complicated sampling methods that also yield valid inferences, but we will not cover them in this class.
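Drawing names from a hat can be sketched in a few lines. The helper name `simple_random_sample` and the list of 500 ID labels are hypothetical, introduced only for this example; the selection itself uses Python's standard `random.sample`, which picks distinct items without replacement.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Return n distinct units chosen so that every unit is equally likely
    to be selected, like drawing names from a hat without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical sampling frame: ID labels for 500 people.
frame = [f"person_{i:03d}" for i in range(1, 501)]

sample = simple_random_sample(frame, 10, seed=1)
print(sample)
```

Note that this presumes a complete list of the population is available to draw from, which is often the hard part in practice.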


Why sample randomly?

• Self-selection bias can be overcome by making sure to select the members of your sample randomly. If your sample is taken randomly, then people do not get to choose to be a part of it.

• Sampling bias can also be overcome via random sampling from the whole population. For instance, if we take an SRS of US adult women, chances are that very few of them will be volleyball players.

• Sometimes taking an SRS is difficult. How does a polling firm make sure that every likely voter has an equal chance of being sampled?


Why sample randomly?

• Random samples are more likely to be representative of the population of interest than non-random samples.

• This does not mean, however, that a random sample is necessarily representative of the population (maybe, just by chance, we sampled lots of volleyball players!). It just means that we are not introducing bias via the way we collect the sample.

• In other words, we might still get a non-representative sample. But if we do it will have been by random chance, rather than by design.


Studies and Experiments

• When we talk about using data to learn something about the world, we are usually talking about performing a study or an experiment.

• These come in two major flavors: observational studies and controlled experiments.


Observational Study

• In observational studies, variable values are observed and recorded from already existing data.

• e.g. a survey of a hospital’s death records, where the variable of interest might be “Cause of Death”, is an observational study.

• Studies involving anything from the past are necessarily observational.


Controlled Experiment

• In a controlled experiment the researcher gets to assign members of the study to different groups, which are then subjected to different experimental conditions (or “treatments”).

• e.g. if we want to determine if a medical procedure is effective, we can randomly assign people to either undergo the procedure (the treatment group) or not (the control group). We can then see if we observe differences between the groups with respect to what the procedure was designed to treat.

Controlled Experiment

• In a controlled experiment, the researcher can try to “control” for factors that could cause bias.

• The idea is to ensure that the treatment and control groups are as similar as possible to one another before the experiment, so that if differences are observed afterward, they can be attributed to the treatment rather than to some outside factor.

Controlled experiment vs observational study

• For example, if I want to determine whether weight loss pills are effective, I could use the following observational method: I could get a sample of people who are trying to lose weight, split them into those who have used a pill and people who have not, and record how much each person’s weight has changed in the last 6 months.

Controlled Experiment

• Let’s suppose I observe that the people who took a pill lost more weight than the people who didn’t. This may seem like solid evidence that the pill works, but it is not.

• Problem: I have not controlled for other possible influences on a person’s weight.

• Maybe those who are taking pills are following a healthier diet and exercising more than the people who are not taking weight loss pills. In this case, the observed difference in weight loss might not be due to the pill, but to other outside factors.

Controlled Experiment

• To control for these other factors, I should randomly assign some participants to take a pill and others to not take a pill. Perhaps I can also either make sure they are on a similar diet and exercise regimen, or at least collect data on their diet and exercise regimens.

• If I still see that people who took the pill lost more weight, then I might have evidence that the pill caused the weight loss.

• Controlling for other factors eliminates them as possible explanations for any observed differences between the two groups.
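Random assignment to treatment and control groups can be sketched as a shuffle followed by a split. The function name `randomly_assign`, the 20 participant labels, and the even split are assumptions made for this illustration; the notes do not prescribe a particular procedure.

```python
import random

def randomly_assign(participants, seed=None):
    """Shuffle the participants and split them into two groups of
    (nearly) equal size: treatment and control."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the input list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participants in the weight loss pill study.
participants = [f"participant_{i}" for i in range(1, 21)]

treatment, control = randomly_assign(participants, seed=34)
print("takes the pill:   ", treatment)
print("does not take it: ", control)
```

Because chance alone decides who lands in which group, habits such as diet and exercise have no systematic way to line up with taking the pill, although any single assignment can still be unbalanced by luck.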

Side note: the placebo effect

• If we are studying the effectiveness of a pill, we should also compare it to a placebo. It is often the case that the act of receiving a treatment that a person believes will be beneficial will itself result in a beneficial effect, regardless of whether or not the treatment is otherwise effective.

• In controlled experiments, the group that gets a placebo might be called the control group. Sometimes there is a “no treatment” control group in addition to a placebo control group.

Controlled Experiment

• In a controlled experiment, we can be reasonably comfortable inferring that the treatment applied to one group but not to the other caused whatever difference we observe between the two.

• But otherwise we must remember that correlation does not imply causation!

• Variables are correlated if they “move together”, i.e. when one variable changes, the other one tends to change as well.

• Examples: temperature is correlated with time of day; human height is correlated with weight.


Correlation and causation

• If changing one variable causes a change in another, they will be correlated. But the converse is not necessarily true.

• If two variables are correlated, it is not necessarily the case that one has caused the other to change.

• For instance, there is a statistical correlation between marriage and income. People who are married tend to make more money than people who are not married. Does this suggest that getting married causes an increase in income, or can you think of another explanation?


Confounding variable

• A confounding variable is a variable that helps explain the data but is not accounted for in the study.

• Another example: the number of shark attacks during a given month is positively correlated with the number of ice cream cones sold in that month.

• Can you think of another variable that might be related to ice cream sales and shark attacks?
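One way to see how a confounder can produce a correlation is to simulate it. The sketch below assumes, purely for illustration, that monthly temperature is the lurking variable: the made-up data let temperature drive both ice cream sales and shark attacks, while neither outcome depends on the other. The Pearson correlation coefficient used to measure how strongly the two "move together" is a standard summary, though these notes have not defined it, and all coefficients and noise levels are invented.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

rng = random.Random(38)  # fixed seed for repeatability

# Made-up monthly data: temperature drives both outcomes;
# neither outcome depends on the other.
temperature = [rng.uniform(40, 95) for _ in range(120)]
ice_cream_sales = [100 + 10 * t + rng.gauss(0, 50) for t in temperature]
shark_attacks = [0.1 * t + rng.gauss(0, 1) for t in temperature]

r = pearson_r(ice_cream_sales, shark_attacks)
print(f"correlation between ice cream sales and shark attacks: {r:.2f}")
```

The two series come out strongly correlated even though the code never lets one influence the other, which is the pattern a confounding variable creates.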


More fun with confounding variables

• The size of a child’s vocabulary is correlated with the number of cavities he or she has. Learning how to speak causes cavities!

• Retail stores report higher sales volumes when their bathrooms are dirty. Better not clean the bathrooms or it will hurt sales!


More fun with confounding variables

• This is something to really look out for in politics. Suppose the unemployment rate has increased during a certain presidential administration. Did that president cause the unemployment rate to increase?

• We see “correlation = causation” type reasoning in debates about gun control, taxation, immigration, education, etc. But how can we know that a certain government policy caused a certain outcome? It is difficult to do!


Blinding

• Blinding and double blinding are methods that are used to try and eliminate bias that occurs when researchers and subjects are aware of which group is which in a study.

• To motivate this concept, consider the “Pepsi challenge”. This is a taste test that was used by the Pepsi corporation, in which participants were asked to compare a cup of Pepsi to a cup of Coca-Cola.


Blinding

• If the participants are handed cups labeled “Pepsi” and “Coke”, then their responses may be biased because they may have preconceived notions of which product they prefer.

• In order to eliminate this potential source of bias, the study should be blinded.

• A blinded study is one in which participants do not know which group they are in, or which treatment is which. In the context of a taste test, the study is blinded if the cups are not labeled and the participants do not know which cup contains which product.


Double Blinding

• However, even if the study is blinded, there could still be a problem. If the experimenter knows which product is in which cup, then he or she could subtly (and possibly unintentionally) influence the outcome of the experiment.

• To avoid bias being introduced by the experimenter, the study can be double blinded. A double blinded study is one in which neither the participant nor the experimenter knows which group is which. In our example, the experimenter should not know which cup contains which product.
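One possible way to set up a double blinded taste test is to label the cups with neutral codes and keep the code-to-product key away from both the participant and the experimenter until the results are recorded. The sketch below is an assumption about how this could be organized, not a procedure described in the notes; the function `blind_cups` and the "Cup A" / "Cup B" labels are hypothetical.

```python
import random

def blind_cups(products, seed=None):
    """Randomly pair products with neutral cup labels.

    Returns the labels (all that the experimenter and participant see)
    and the key mapping each label to its product (to be held by a
    third party until the responses are recorded)."""
    rng = random.Random(seed)
    shuffled = list(products)  # copy so the input list is left untouched
    rng.shuffle(shuffled)
    labels = [f"Cup {chr(ord('A') + i)}" for i in range(len(shuffled))]
    key = dict(zip(labels, shuffled))
    return labels, key

labels, sealed_key = blind_cups(["Pepsi", "Coca-Cola"], seed=43)
print("shown during the test:", labels)
# sealed_key is consulted only after every participant has answered.
```

Because the pairing is random and the key is stored separately, neither person in the room can tell from the labels which product is in which cup.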


Conclusion

• This has been a quick overview of the major concepts we will be invoking throughout the semester.

• In the next set of notes, we will look at common ways of displaying data.