Department of Statistics, TEXAS A&M UNIVERSITY. STAT 211. Instructor: Keith Hatfield


Post on 05-Jan-2016


Topic 1: Data collection and summarization

• Populations and samples

• Frequency distributions

• Histograms

• Mean, median, variance and standard deviation

• Quartiles, interquartile range

• Boxplots


What is Statistics?

• What do you think of when you hear the word “statistics”? (sports, boring, not applicable to my field of study)

• Statistics: The science of collecting, classifying, and interpreting data.

• Anticipated learning outcomes:
– appreciate and apply basic statistical methods in an everyday life setting (election polls, clinical trials, “lies, big lies & statistics”)
– appreciate and apply basic statistical methods in their scientific field


Collecting data

• Observational study
– Observe a group and measure quantities of interest.
– This is passive data collection in that one does not attempt to influence the group.
– The purpose of the study is to describe the group.

• Experimental study
– Deliberately impose treatments on groups in order to observe responses.
– The purpose is to study whether the treatments cause a change in the responses.


Observational Study Terms

• Population: The entire group of interest

• Sample: A part of the population selected to draw conclusions about the entire population

• Census: A sample that attempts to include the entire population

• Parameter: A numerical characteristic of the population, usually unknown

• Statistic: A number produced from a sample that estimates a population parameter


Horry County, SC, Murder Case

• Do juries properly represent the racial makeup of Horry County, which is 13% African American?

• What is the population parameter of interest?

• What sample statistic could be used to estimate the parameter and does the sample support the claim?

• 295 jurors summoned, 22 were African American
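
The natural sample statistic here is the sample proportion of summoned jurors who were African American. A minimal Python sketch of that arithmetic, using only the figures given above (the variable names are illustrative):

```python
# Figures from the slide: 295 jurors summoned, 22 of them African American.
summoned = 295
african_american = 22

# Sample statistic: the sample proportion.
p_hat = african_american / summoned

# Population share stated on the slide (13% of Horry County).
county_share = 0.13

print(f"sample proportion = {p_hat:.3f}")   # about 0.075
print(f"county share      = {county_share:.3f}")
print(f"difference        = {county_share - p_hat:.3f}")
```

The sample proportion (about 7.5%) is well below 13%. Deciding whether a gap of that size could arise by chance alone requires inference methods beyond this topic.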


Experiment Terms

• Experimental Group: A collection of experimental units subjected to a difference in treatment, imposed by the experimenter.

• Control Group: A collection of experimental units subjected to the same conditions as those in an experimental group except that no treatment is imposed.

• This design helps control for potential confounding effects.


What are “confounding” effects?

• When you have multiple factors in a study and you can’t tell which factor causes a change in the variable of interest.

• Example: Does going to church make you live longer? Not necessarily. There are too many other factors, or “lurking variables”, discussed later.

• Best to set up study with everything else constant and have only one factor changed. That way, you’re more apt to identify that the change in the variable is due to the change you instituted in the study.


NCTR study (National Center for Toxicological Research)

• A large-scale study was conducted to see if a new drug might have potential toxic effects. They used rats for the experiment.

• Dose groups of 0, 100, 200, and 400 ppg were evaluated for liver tumors at the end of a two-week exposure to the drug. (Which is the control group and which are the experimental groups?)

• What comparisons would you want to make?

• Should you evaluate each group on consecutive days at the end of the study?


Analyzing data with StatCrunch

• StatCrunch is a statistical software package that runs through a Web browser.

• You can access StatCrunch once you have registered and created an account ($$). See the information tab in eCampus for details.

• No tutorials for StatCrunch, but demonstrations of how to perform basic tasks and tests will be done in class.

• Note that the homework uses StatCrunch. Several datasets will be given in the homework and in class examples. I don’t advise using your calculator for this purpose as it can be tedious and lead to input errors.


All about variables

• Variable: Any characteristic or quantity to be measured on units in a study

• Categorical variable: Places a unit into one of several categories
– Examples: Gender, race, political party

• Quantitative variable: Takes on numerical values for which arithmetic makes sense
– Examples: SAT score, number of siblings, cost of textbooks

• Univariate data has one variable.

• Bivariate data has two variables.

• Multivariate data has three or more variables.


Cereal data

mfr        A = American Home; G = General Mills; K = Kelloggs; N = Nabisco; P = Post; Q = Quaker Oats; R = Ralston Purina
type       cold or hot
calories   calories per serving
protein    grams of protein
fat        grams of fat
sodium     milligrams of sodium
fiber      grams of dietary fiber
carbo      grams of complex carbohydrates
sugars     grams of sugars
potass     milligrams of potassium
vitamins   vitamins and minerals - 0, 25, or 100, indicating the typical percentage of FDA recommended
shelf      display shelf (1, 2, or 3, counting from the floor)
weight     weight in ounces of one serving
cups       number of cups in one serving
rating     a rating of the cereal


Summarizing a single categorical variable

• Frequency - number of times the value occurs in the data

• Relative frequency - proportion of the data with the value

Cereal data

mfr   Frequency   Relative Frequency
A      1          0.012987013
G     22          0.2857143
K     23          0.2987013
N      6          0.077922076
P      9          0.116883114
Q      8          0.103896104
R      8          0.103896104
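
The relative frequencies in the cereal table are each frequency divided by the total count. A short Python sketch that reproduces them from the frequencies (the dictionary layout is just one convenient way to hold the counts):

```python
# Manufacturer counts copied from the frequency table above.
counts = {"A": 1, "G": 22, "K": 23, "N": 6, "P": 9, "Q": 8, "R": 8}

total = sum(counts.values())  # number of cereals in the data set

# Relative frequency = frequency / total number of observations.
rel_freq = {mfr: n / total for mfr, n in counts.items()}

for mfr, n in counts.items():
    print(f"{mfr}  {n:>3}  {rel_freq[mfr]:.4f}")
```

The relative frequencies always sum to 1, which is a quick check on a hand-built table.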


Analyzing a single quantitative variable

• Consider the concentration data which contains the concentration of suspended solids in parts per million at 50 locations along a river.

• What is a typical concentration? (Generally characterized by the center of the data)

• How much spread is there in the concentrations along the river? (Generally, the relative “width” of the data: how dispersed they are around the center.)

– Wide versus narrow, and the inherent good and bad things about spread.

– Discuss how the typical value and the spread would differ if measurements were taken at a single point on the river versus at several points along the river.


Histograms

• Histogram - bar graph of binned or grouped data where the height of the bar above each bin denotes the frequency (relative frequency) of values in the bin

• Typical concentration?

• Spread?

• Roughly how many concentrations below 50?
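
To make the binning step concrete, here is a small Python sketch that counts values per bin and prints a text histogram. The concentration values below are hypothetical stand-ins, not the course data set, and `histogram_counts` is an illustrative helper rather than a StatCrunch function.

```python
# Hypothetical concentrations in ppm (NOT the course data set).
concentrations = [12, 18, 25, 31, 33, 38, 41, 44, 47, 52, 55, 58, 63, 71, 86]

def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each bin [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

counts = histogram_counts(concentrations, start=0, width=25, n_bins=4)
for k, c in enumerate(counts):
    low, high = 25 * k, 25 * (k + 1)
    print(f"[{low:>3}, {high:>3})  {'#' * c}  ({c})")

# "Roughly how many below 50?" is the sum of the bars to the left of 50.
below_50 = counts[0] + counts[1]
print("below 50:", below_50)
```

Reading a count off a histogram works the same way: add up the heights of the bars that lie below the cutoff.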


Choosing the number of histogram bins

• General rule:

– Most stat packages will do this for you, but sometimes you may want to change the number of bins or categories, depending on what you want the data to convey.

• Following is a sample of historical geyser eruptions from Old Faithful in Yellowstone National Park.

Demonstration done in class; the typical outputs on the next two slides show the same data from different perspectives.

Old Faithful data

# of bins # of observations
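
One widely used rule of thumb, assumed here for illustration and not necessarily the one used in class, sets the number of bins near the square root of the number of observations. A minimal Python sketch (the sample sizes are hypothetical):

```python
import math

def sqrt_rule_bins(n_observations):
    """Square-root rule of thumb: number of bins is about sqrt(n), rounded up."""
    return math.ceil(math.sqrt(n_observations))

# Hypothetical sample sizes, for illustration only.
for n in (25, 50, 100, 272):
    print(n, sqrt_rule_bins(n))
```

Any such rule is only a starting point; changing the bin count can change the impression a histogram gives.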


Data presented from an alarmist point of view


Data presented from a “calming” point of view


Describing the shape of quantitative data

• Symmetric data has roughly the same mirror image on each side of a center value.

• Skewed data has one side (either right or left) which is much longer than the other relative to the mode (peak value).
– The above definitions are most useful when describing data with a single mode.

• Multimodal data has more than one mode.

• Beware of outliers when describing shape.

• Shape of the concentration data?


States data from 1996

• Define the shape of each variable.

POVERTY   percentage of the state population living in poverty
CRIME     violent crime rate per 100,000 population
COLLEGE   percentage of the state's population who are enrolled in college
METRO     percentage of the state population living in a metropolitan area
INCOME    median household income in 1996 dollars


Shapes of states data – Percentage living in poverty


Shapes of states data – Violent crime rates per 100K


Shapes of states data - % living in metro area


Shapes of states data – Income


Summary statistics for quantitative data

• Measures of central tendency (typical)

– The sample median is the middle observation if the values are arranged in increasing order.

– The sample mean of n observations is the average, the sum of the values divided by n.

$X_1, \ldots, X_n$ represents the data values

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
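
As a quick check of the two definitions, the following Python sketch computes the sample mean and sample median of the ten values that appear in the sample-variance calculation later in these slides. It uses the standard-library `statistics` module rather than StatCrunch.

```python
import statistics

# Ten data values, taken from the sample-variance calculation slide.
x = [5, 4, 3, 2, 2, 5, 7, 3, 4, 9]

n = len(x)
mean = sum(x) / n               # sum of the values divided by n
median = statistics.median(x)   # middle of the sorted values

print(mean)    # 4.4
print(median)  # 4.0
```

With an even number of observations there is no single middle value, so `statistics.median` averages the two middle values (both are 4 here).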


Summary statistics for quantitative data

• pth percentile - the value such that p×100% of values are below it and (1−p)×100% are above it, with p written as a proportion (for example, p = 0.25 for the 25th percentile). To find the value: multiply p by the # of observations, round up if necessary, and take the value at that position in the sorted data.
– first quartile (Q1) is the 25th percentile
– second quartile (Q2) is the 50th percentile (median)
– third quartile (Q3) is the 75th percentile

• 5-number summary: Min, Q1, Q2, Q3, Max
– Boxplots: Stacking boxplots can be very useful for comparing multiple groups (you’ll see in 2 slides).
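
The "multiply and round up" rule can be sketched in Python as follows. The `percentile` helper is illustrative and applies the rule literally; software packages may use interpolation rules and report slightly different quartiles. The data are the ten values from the sample-variance slide.

```python
import math

def percentile(values, p):
    """Value at position ceil(p * n) in the sorted data (1-based), per the rule above."""
    ordered = sorted(values)
    position = max(1, math.ceil(p * len(ordered)))
    return ordered[position - 1]

x = [5, 4, 3, 2, 2, 5, 7, 3, 4, 9]   # same ten values as the variance example

five_number = (min(x), percentile(x, 0.25), percentile(x, 0.50),
               percentile(x, 0.75), max(x))
print(five_number)  # (2, 3, 4, 5, 9)
```

Here Q1 = 3 and Q3 = 5, so the interquartile range is 5 − 3 = 2.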


• From the boxplot above
– Are more than 75% of the values below 80?
– Are more than 75% of the values above 40?
– What percentage of values fall roughly between 45 and 70?
– Is the data symmetrical?
– What are the approximate maximum and minimum values?


Summary statistics for quantitative data

• Measures of spread:
– Interquartile range, IQR = Q3 − Q1, the range of the middle 50% of the data
– sample variance, $s^2$, is the sum of squared deviations from the sample mean divided by $n-1$
– sample standard deviation, $s$, is the square root of the sample variance. Preferred because it has the same units as the data.

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$


Calculation of sample variance (partial from data)

Obs      x    xbar   (x-xbar)   (x-xbar)^2   x^2
1        5    4.4      0.6         0.4        25
2        4    4.4     -0.4         0.2        16
3        3    4.4     -1.4         2           9
4        2    4.4     -2.4         5.8         4
5        2    4.4     -2.4         5.8         4
6        5    4.4      0.6         0.4        25
7        7    4.4      2.6         6.8        49
8        3    4.4     -1.4         2           9
9        4    4.4     -0.4         0.2        16
10       9    4.4      4.6        21.2        81
Totals  44             0          44.4       238
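
The table stops at the column totals. A short Python sketch that finishes the calculation is given below. The squared deviations in the table are displayed rounded to one decimal (for example, 0.6² = 0.36 appears as 0.4), while the total of 44.4 is exact. The shortcut using the x^2 column is an inference about why that column is included, not something stated on the slide.

```python
import math
import statistics

x = [5, 4, 3, 2, 2, 5, 7, 3, 4, 9]   # the x column of the table
n = len(x)
x_bar = sum(x) / n                    # 4.4

# Definition: sum of squared deviations divided by n - 1.
ss = sum((xi - x_bar) ** 2 for xi in x)   # about 44.4
s2 = ss / (n - 1)
s = math.sqrt(s2)

# Shortcut using the x^2 column: (sum of x^2 - (sum of x)^2 / n) / (n - 1).
s2_shortcut = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)

print(round(s2, 3), round(s, 3))   # 4.933 2.221
```

Both routes give the same sample variance, 44.4 / 9 ≈ 4.933, and a sample standard deviation of about 2.221 in the units of the data.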


Cereal data

• Compare rating across shelf…
– Numerically using StatCrunch “Summary Stats”


Cereal info – Comparative boxplots

• Boxplot/outliers – An example of comparative boxplots.

– Graphically using StatCrunch “Graphics>Boxplots”


Comparing measures of central tendency and spread

• The sample mean and the sample standard deviation are good measures of center and spread, respectively, for symmetric data

• If the data set is skewed or has outliers, the sample median and the interquartile range are more commonly used.

• Note about trimmed mean.
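
To see why the median is preferred when outliers are present, the Python sketch below changes one value in a small data set and compares the summaries. The outlier is hypothetical, and `trimmed_mean` is an illustrative helper that drops the k smallest and k largest values before averaging, which is one common definition of a trimmed mean.

```python
import statistics

x = [5, 4, 3, 2, 2, 5, 7, 3, 4, 9]
x_outlier = [5, 4, 3, 2, 2, 5, 7, 3, 4, 90]   # hypothetical: 9 mistyped as 90

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values (illustrative helper)."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

print(statistics.mean(x), statistics.median(x))                  # 4.4 4.0
print(statistics.mean(x_outlier), statistics.median(x_outlier))  # 12.5 4.0
print(trimmed_mean(x_outlier, 1))                                # 4.125
```

A single extreme value moves the mean from 4.4 to 12.5 while the median stays at 4; the trimmed mean sits between the two approaches by discarding the extremes before averaging.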


Case Study: Salary data

• A fictitious large university decides to study the salaries of its graduates. A survey was conducted of 2232 recent graduates from engineering and education majors.

• The salary data consists of three variables:
– Gender: Male or Female
– Major: Education or Engineering
– Salary: Reported in $

• What types of variables do we have?


Salary data by major

• Are both majors equally represented in the survey?

• Do salaries differ across major?


Salary data by gender

• Are both genders equally represented in the survey?

• Do salaries differ across gender? Discrimination?

Summary statistics for Salary: Group by: Gender

Gender   n       Mean     Variance     Std. Dev.   Median   Min      Max
Female   1,088   41,108   97,633,984   9,881       36,369   33,070   64,279
Male     1,144   50,589   86,189,224   9,284       54,471   29,027   61,533


Salary data by gender within each major

• How do male and female salaries compare in engineering?

• How do male and female salaries compare in education?

Please read the additional file for Topic 1 for more info

Summary statistics for Salary: Where: Major=Education, Group by: Gender

Gender   n     Mean     Variance    Std. Dev.   Median   Min      Max
Female   856   36,009   1,004,212   1,002       36,009   33,070   39,411
Male     220   31,971   1,238,722   1,113       32,002   29,027   35,608

Summary statistics for Salary: Where: Major=Engineering, Group by: Gender

Gender   n     Mean     Variance    Std. Dev.   Median   Min      Max
Female   232   59,921   3,900,454   1,975       59,994   53,598   64,279
Male     924   55,022   4,146,587   2,036       55,019   48,019   61,533
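
Within each major the female mean is higher, yet the overall female mean is lower. This reversal is an instance of what is commonly called Simpson's paradox, and it arises here because most women in the survey are education majors (856 of 1,088) while most men are engineering majors (924 of 1,144). The Python sketch below recovers the overall means as weighted averages of the within-major means, using only the n and Mean columns from the tables; the dictionary layout and function name are illustrative.

```python
# (n, mean salary) per gender within each major, copied from the tables above.
groups = {
    "Female": {"Education": (856, 36009), "Engineering": (232, 59921)},
    "Male":   {"Education": (220, 31971), "Engineering": (924, 55022)},
}

def overall_mean(by_major):
    """Overall mean = weighted average of the within-major means, weights = n."""
    total_n = sum(n for n, _ in by_major.values())
    total_salary = sum(n * mean for n, mean in by_major.values())
    return total_salary / total_n

overall = {gender: overall_mean(by_major) for gender, by_major in groups.items()}
for gender, value in overall.items():
    print(gender, round(value))   # Female 41108, then Male 50589
```

The results match the overall means in the gender-only summary (41,108 and 50,589), which shows that major acts as a lurking variable in the comparison by gender alone.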