population mean. problem. notation - michigan state … mean. problem. notation . populati ... ti...
TRANSCRIPT
RECALL: In the last class, we learned statistical inference for the population mean.
Problem. Notation
Notation   Meaning
𝜇          The population mean
X̄          The sample mean
𝜎          The population standard deviation
s          The sample standard deviation
n          The sample size
RECALL:
Point estimation: $\bar{X}$ (the sample mean). Distribution of $\bar{X}$.
Confidence Interval
One-sample z-interval (population SD is known)
One-sample t-interval (only sample SD is known): $\bar{X} \pm t^*_{n-1} \frac{s}{\sqrt{n}}$
Remark:
1. The t-interval needs the normal assumption.
2. $t^*_{n-1}$, which is related to n-1 and C%, can be obtained from the t-table.
RECALL: Hypothesis Testing about 𝜇
Z-Test (population SD is known)
Test statistic: $z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$
Null Hypothesis H0 vs. Alternative Hypothesis HA
H0: $\mu = \mu_0$ vs.
HA: $\mu \neq \mu_0$ (two-sided)
HA: $\mu > \mu_0$ (one-sided)
HA: $\mu < \mu_0$ (one-sided)
P-value:
Alternative Hypothesis HA            P-value formula
HA: $\mu \neq \mu_0$ (two-sided)     P-value=2P(Z>|z|)
HA: $\mu > \mu_0$ (one-sided)        P-value=P(Z>z)
HA: $\mu < \mu_0$ (one-sided)        P-value=P(Z<z)
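The three P-value formulas above, 2P(Z>|z|), P(Z>z) and P(Z<z), can be sketched in Python. This is a minimal illustration, assuming the usual one-sample statistic z = (x̄ - 𝜇0)/(𝜎/√n). The function name `z_test`, the `alternative` labels and the numbers in the usage lines are hypothetical and are not taken from the slides.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """Return (z, p_value) for a one-sample z-test about the mean.

    alternative is one of "two-sided", "greater" or "less"
    (labels chosen for this sketch only).
    """
    z = (xbar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()  # standard normal: mean 0, SD 1
    if alternative == "two-sided":   # P-value = 2P(Z > |z|)
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    elif alternative == "greater":   # P-value = P(Z > z)
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":      # P-value = P(Z < z)
        p_value = std_normal.cdf(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, p_value


# Hypothetical numbers: sample mean 52 from n = 36, testing mu0 = 50 with sigma = 6.
z, p_two_sided = z_test(xbar=52.0, mu0=50.0, sigma=6.0, n=36)
_, p_greater = z_test(xbar=52.0, mu0=50.0, sigma=6.0, n=36, alternative="greater")
_, p_less = z_test(xbar=52.0, mu0=50.0, sigma=6.0, n=36, alternative="less")
print(z, p_two_sided, p_greater, p_less)
```

With these made-up inputs z equals 2, so the two-sided P-value is twice the upper-tail P-value, and the two one-sided P-values add up to 1.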
RECALL: Hypothesis Testing about 𝜇
T-Test (only sample SD s is known)
Test statistic: $t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$
Null Hypothesis H0 vs. Alternative Hypothesis HA
H0: $\mu = \mu_0$ vs.
HA: $\mu \neq \mu_0$ (two-sided)
HA: $\mu > \mu_0$ (one-sided)
HA: $\mu < \mu_0$ (one-sided)
P-value (df=n-1):
Alternative Hypothesis HA            P-value formula
HA: $\mu \neq \mu_0$ (two-sided)     Two-tail prob. of |t|
HA: $\mu > \mu_0$ (one-sided)        One-tail prob. to the right of t
HA: $\mu < \mu_0$ (one-sided)        One-tail prob. to the left of t
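As a rough sketch of the t-test statistic and the t-interval recalled above, the Python below computes t = (x̄ - 𝜇0)/(s/√n), its degrees of freedom, and x̄ ± t* s/√n. The sample values and the function names are hypothetical. The critical value t* is passed in by hand because it comes from the t-table; 2.776 is the tabled value for df = 4 at 95% confidence.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(sample, mu0):
    """Return (t, df) for a one-sample t-test of H0: mu = mu0."""
    n = len(sample)
    xbar = mean(sample)
    s = stdev(sample)  # sample SD (divides by n - 1)
    t = (xbar - mu0) / (s / sqrt(n))
    return t, n - 1


def t_interval(sample, t_star):
    """Return (low, high) for the one-sample t-interval.

    t_star is the critical value read from the t-table for df = n - 1.
    """
    n = len(sample)
    xbar = mean(sample)
    margin = t_star * stdev(sample) / sqrt(n)
    return xbar - margin, xbar + margin


# Hypothetical sample, not taken from the slides.
sample = [12, 15, 11, 14, 13]
t, df = t_statistic(sample, mu0=12)
low, high = t_interval(sample, t_star=2.776)  # t-table value for df = 4, 95%
print(t, df, low, high)
```

The P-value itself still has to be read from the t-table or the calculator, since the Python standard library has no t distribution.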
RECALL:
TI commands (under STAT, TESTS):
T-interval: use 8:TInterval
T-Test: use 2:T-Test
Decisions:
If p-value < alpha level, reject H0, and we say the test is statistically significant at this alpha level.
If p-value > alpha level, fail to reject H0, and we say the test is not statistically significant at this alpha level.
Errors:
Type I error: decide to reject H0, but actually H0 is true.
Type II error: decide to retain H0, but actually H0 is false.
P(Type I error) = alpha level.
Exploring Relationships Between Variables
Chapter 7: Scatterplots, Association, and Correlation
Chapter 8: Linear Regression
WHERE ARE WE GOING?
People might ask the following questions in real life:
1. Is the price of sneakers related to how long they last?
2. Is smoking related to lung cancer?
3. Do baseball teams that score more runs sell more tickets to their games?
Chapter 7 will look at relationships between two quantitative variables X and Y.
Scatterplot
Correlation
TERM 1: SCATTERPLOTS Is the price of sneakers related to how long they last?
The following table shows some data collected for sneakers:
Years   Price($)
1       20.00
2       21.99
3       23.29
4       25.99
5       29.99
6       34.99
7       39.99
8       44.99
9       49.99
10      59.99
[Figure: scatterplot of Price versus Years]
This is an example of a scatterplot. The x-axis represents the variable Years and the y-axis represents Price.
TERM 1: SCATTERPLOT
Scatterplots may be the most common and most effective display for paired data.
Scatterplots are the best way to start observing the relationship and the ideal way to picture associations between two quantitative variables.
[Figure: scatterplot of Price versus Years]
X-axis: Years, the explanatory variable, which explains or influences changes in the other variable.
Y-axis: Price, the response variable, which measures an outcome of a study.
TERM 1: SCATTERPLOTS
How do we describe the scatterplot? Or, what information about the relationship of the two variables can we get by looking at the scatterplot?
Please look at the scatterplot of the sneakers example, and think about what you can tell about the relationship between years and price.
[Figure: scatterplot of Price versus Years]
We are going to describe the relationship from four different aspects:
1) Direction
2) Form
3) Strength
4) Unusual features
TERM 1: SCATTERPLOT
Look for direction: What’s my sign: positive, negative or neither?
Negative: A pattern that runs from the upper left to the lower right is said to be negative. The Y variable decreases as the X variable increases.
Positive: A pattern running the other way is called positive. The Y variable increases as the X variable increases.
[Figures: two scatterplots of Y versus X illustrating the two directions]
TERM 1: SCATTERPLOT
The example in the text shows a negative association between central pressure and maximum wind speed.
As the central pressure increases, the maximum wind speed decreases.
TERM 1: SCATTERPLOTS
Look for form: straight, curved, something exotic, or no pattern?
[Figures: three scatterplots of Y versus X, captioned "Straight line, linear", "Curved" and "No pattern"]
In this part, we are more interested in the linear pattern.
TERM 1: SCATTERPLOTS
Look for strength: how much scatter? Or, how strong is the relationship?
Strong: the points appear tightly clustered in a single stream.
Weak: the swarm of points seems to form a vague cloud through which we can barely discern any trend or pattern.
[Figures: four scatterplots of Y versus X with different amounts of scatter]
TERM 1: SCATTERPLOTS
Look for unusual features: Are there outliers or subgroups?
[Figures: two scatterplots of Y versus X. In the first, the circled point is a potential outlier. In the second, there are two clusters.]
TERM 1: SCATTERPLOT-ROLES FOR VARIABLES
It is important to determine which of the two quantitative variables goes on the x-axis and which on the y-axis.
This determination is made based on the roles played by the variables.
When the roles are clear, the explanatory or predictor variable goes on the x-axis, and the response variable goes on the y-axis.
TERM 1: SCATTERPLOTS Summary
A Scatterplot shows the relationship between two quantitative variables measured on the same individual.
The variable that is designated the X variable is called the explanatory variable
The variable that is designated the Y variable is called the response variable
Always plot the explanatory variable on the horizontal (x) axis
Always plot the response variable on the vertical (y) axis
In examining scatterplots, look for an overall pattern showing the form, direction and strength of the relationship
Look also for outliers or other deviations from this pattern
TERM 1: SCATTERPLOT
Example: Fast food is often considered unhealthy because much of it is high in fat. Are fat and calories related? Here are the fat and calorie contents of several brands of burgers. Analyze the association between fat content and calories.
Fat(g)     20   30   35   36   40   40   44
Calories   410  580  590  570  640  680  660
[Figure: scatterplot of Calories versus Fat]
Comment on the scatterplot:
1) Direction: Positive
2) Form: Roughly linear
3) Strength: Moderately strong
4) Unusual features: None
TERM 2: CORRELATION
From scatterplots, we can look for the relationship between two quantitative variables and whether the relationship is strong or weak. But how strong is it?
The correlation coefficient (or simply correlation) is a quantitative measure of the linear relationship (association) between two quantitative variables.
Finding the correlation coefficient, denoted by r, by hand:
$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n-1)\, s_x s_y}$
where $s_x$ and $s_y$ are the standard deviations of X and Y, respectively.
Remarks: Before you use correlation, you must check several conditions:
Quantitative Variables Condition
Straight Enough Condition
Outlier Condition
TERM 2: CORRELATION
(Revisit the calories example) Here are the fat and calorie contents of several brands of burgers.
X: Fat(g)     20   30   35   36   40   40   44
Y: Calories   410  580  590  570  640  680  660
What is the correlation coefficient of x (fat) and y (calories)?
Solution: The means are 35 (fat) and 590 (calories), and the standard deviations are 7.98 and 89.81.
Deviations in x   Deviations in y   Product
20-35=-15         410-590=-180      (-15)*(-180)=2700
30-35=-5          580-590=-10       (-5)*(-10)=50
35-35= 0          590-590= 0        0*0=0
36-35= 1          570-590=-20       1*(-20)=-20
40-35= 5          640-590= 50       5*50=250
40-35= 5          680-590= 90       5*90=450
44-35= 9          660-590= 70       9*70=630
Add up the products: 2700+50+0+(-20)+250+450+630=4060
Correlation: r=4060/{(7-1)*7.98*89.81}=0.9442
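The hand calculation above can be checked with a short Python sketch. It follows the same steps (deviations, products, division by (n-1) times the two standard deviations) on the burger data from the slide. The helper names `sample_sd` and `correlation` are made up for this illustration.

```python
from math import sqrt

fat = [20, 30, 35, 36, 40, 40, 44]               # x, grams
calories = [410, 580, 590, 570, 640, 680, 660]   # y


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    m = sum(values) / n
    return sqrt(sum((v - m) ** 2 for v in values) / (n - 1))


def correlation(x, y):
    """Correlation r = sum of deviation products / ((n - 1) * s_x * s_y)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    products = [(xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)]
    return sum(products) / ((n - 1) * sample_sd(x) * sample_sd(y))


sum_of_products = sum((x - 35) * (y - 590) for x, y in zip(fat, calories))
r = correlation(fat, calories)
print(sum_of_products)  # 4060
print(round(r, 4))      # 0.9442
```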
TERM 2: CORRELATION
CORRELATION PROPERTIES
The sign of a correlation coefficient gives the direction of the linear association.
Positive sign: positive linear association
Negative sign: negative linear association
Correlation is always between -1 and +1.
Correlation can be exactly equal to -1 or +1, but these values are unusual in real data because they mean that all the data points fall exactly on a single straight line.
A correlation near zero corresponds to a weak linear association.
Example: The correlation between fat and calories, 0.9442, indicates a strong positive linear association between them.
TERM 2: CORRELATION
Cautions about correlation:
Quantitative Variables Condition: Correlation applies only to quantitative variables.
Straight Enough Condition: Correlation measures the strength only of the linear association.
Outlier Condition: Outliers can distort the correlation dramatically.
[Figures: three scatterplots of y versus x. The first is labeled r=0.92 and the second r=0.098. For the third: with the outlier, r=0.795; without the outlier, r=0.938.]
TERM 2: CORRELATION
Correlation≠Causation
Fast food is often considered unhealthy because much of it is high in fat. Are fat and calories related? Based on the fat and calorie contents of several brands of burgers, the correlation between them is r=0.9442. Which conclusion is most accurate?
A. More fat in the burgers causes higher calories.
B. The burgers containing more fat tend to have higher calories.
Comment: Even though A sounds all right, it is not a conclusion that can be derived from the correlation. Correlation is an objective storyteller of the linear association between two variables. It cannot establish causation.
CORRELATION PROPERTIES (CONT.)
Correlation treats x and y symmetrically: The correlation of x with y is the same as the correlation of y with x.
Correlation has no units.
Correlation is not affected by shifting and rescaling of either variable. Correlation depends only on the z-scores, and they are unaffected by changes in center or scale, i.e. corr(aX+b,cY+d)=corr(X,Y), where a, b, c, d are constants and a and c are positive. (If exactly one of a and c is negative, the sign of the correlation flips.)
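These properties can be checked numerically. The sketch below computes the correlation from z-scores for the burger data, then recomputes it after swapping the variables, after an arbitrary shift and rescale, and after flipping the sign of one variable. The constants 0.5, 3, 2 and -100 are hypothetical values chosen only for the demonstration.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Correlation as the sum of z-score products divided by n - 1."""
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    zx = [(v - mx) / sx for v in x]
    zy = [(v - my) / sy for v in y]
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


fat = [20, 30, 35, 36, 40, 40, 44]
calories = [410, 580, 590, 570, 640, 680, 660]

r_original = correlation(fat, calories)

# Symmetry: swapping x and y gives the same value.
r_swapped = correlation(calories, fat)

# Shift and rescale with positive multipliers (hypothetical constants).
fat_rescaled = [0.5 * v + 3 for v in fat]
calories_rescaled = [2 * v - 100 for v in calories]
r_rescaled = correlation(fat_rescaled, calories_rescaled)

# A negative multiplier on one variable flips the sign of r.
r_flipped = correlation([-v for v in fat], calories)

print(r_original, r_swapped, r_rescaled, r_flipped)
```

The first three values agree up to floating-point rounding, and the last is their negative.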
TERM 2: CORRELATION
Example: Here are several scatterplots. The calculated correlations are -0.923, -0.487, 0.006 and 0.777. Which is which?
[Figures: four scatterplots of Y versus X, panels (a) to (d), annotated with the values -0.923, 0.006, 0.777 and -0.487]
QUESTION: CAN WE DO MORE?
Scatterplots and correlation are useful tools that help us learn about the (linear) association between two quantitative variables.
Can we answer the following question: Fast food is often considered unhealthy because much of it is high in fat. What is the calorie content of a kind of fast food with 28g fat?
[Figure: scatterplot of Calories versus Fat]
If we want to estimate an unknown value based on the known values, this is called a prediction. One way to do the prediction is by constructing a linear model.
TERM 3: LINEAR MODEL
Let’s look at the burger example again.
Fat(g)     20   30   35   36   40   40   44
Calories   410  580  590  570  640  680  660
[Figure: scatterplot "BURGERS" of CALORIES versus FAT with a red line drawn through the points]
The red line does not go through all the points, but it can summarize the general pattern with only a couple of parameters: Calories = a+b*fat. This model can be used to predict the calories based on the fat content.
Explanatory variable: Fat
Response variable: Calories
TERM 3: LINEAR MODEL
[Figure: scatterplot "BURGERS" of CALORIES versus FAT with the fitted line, marking a prediction and its residual]
Predicted value: We call the estimate made from a model the predicted value, denoted as $\hat{y}$.
Residual: The difference between the observed value and its associated predicted value is called the residual: residual = $y - \hat{y}$.
The line of best fit is the line for which the sum of the squared residuals is smallest. It is called the least squares line.
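To illustrate what "smallest sum of squared residuals" means, the sketch below computes that sum for the burger data under three candidate lines. The first is the line Predicted calories = 217.95 + 10.63*fat that these slides use for the burger data. The other two lines are hypothetical alternatives invented for comparison.

```python
fat = [20, 30, 35, 36, 40, 40, 44]
calories = [410, 580, 590, 570, 640, 680, 660]


def sum_squared_residuals(intercept, slope, x, y):
    """Sum of (observed - predicted)^2 for the line intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Line from the slides (coefficients rounded to two decimals).
sse_fitted = sum_squared_residuals(217.95, 10.63, fat, calories)

# Two hypothetical alternative lines, chosen only for comparison.
sse_alt_1 = sum_squared_residuals(200.0, 11.0, fat, calories)
sse_alt_2 = sum_squared_residuals(250.0, 10.0, fat, calories)

print(sse_fitted, sse_alt_1, sse_alt_2)
```

Both alternative lines give a larger sum of squared residuals than the line from the slides, which is what the least squares criterion requires of any competing line.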
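TERM 3: LINEAR MODEL
X: Fat(g)     20   30   35   36   40   40   44
Y: Calories   410  580  590  570  640  680  660
Q1: Please construct a linear regression model to predict the calories based on fat.
Fat: $\bar{x} = 35$, $s_x = 7.98$
Calories: $\bar{y} = 590$, $s_y = 89.81$
Correlation: r=0.9442
Slope: $b = r \frac{s_y}{s_x} = 0.9442 \times \frac{89.81}{7.98} = 10.63$
Intercept: $a = \bar{y} - b\bar{x} = 590 - 10.63 \times 35 = 217.95$
Linear model: $\hat{y} = 217.95 + 10.63x$
Q2: What is the predicted calorie content when the fat is 30g?
When x=30, $\hat{y} = 217.95 + 10.63 \times 30 = 536.85$
Q3: What is the residual for the burger with 30g fat?
When x=30, the residual is $y - \hat{y} = 580 - 536.85 = 43.15$
[Figure: scatterplot "BURGERS" of CALORIES versus FAT with the fitted line]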
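One way to work through Q1 to Q3 in code is sketched below, assuming the standard least squares formulas: slope = r * s_y / s_x and intercept = ȳ - slope * x̄. The variable names are chosen for this illustration. Because the code keeps full precision, its intercept comes out near 218.01 rather than the 217.95 obtained when the slope is first rounded to 10.63.

```python
from statistics import mean, stdev

fat = [20, 30, 35, 36, 40, 40, 44]
calories = [410, 580, 590, 570, 640, 680, 660]

n = len(fat)
x_bar, y_bar = mean(fat), mean(calories)
s_x, s_y = stdev(fat), stdev(calories)

# Correlation from the sum of deviation products.
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(fat, calories)) / ((n - 1) * s_x * s_y)

# Q1: least squares coefficients (standard formulas, assumed here).
slope = r * s_y / s_x
intercept = y_bar - slope * x_bar

# Q2: predicted calories for a burger with 30 g of fat.
predicted_30 = intercept + slope * 30

# Q3: residual = observed - predicted for the 30 g burger (observed 580).
residual_30 = 580 - predicted_30

print(round(slope, 2), round(intercept, 2))
print(round(predicted_30, 2), round(residual_30, 2))
```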
TERM 3: LINEAR MODEL
Remarks: Since regression and correlation are closely related, we need to check the same conditions for regression as we did for correlation:
Quantitative Variables Condition
Straight Enough Condition
Outlier Condition
TERM 3: LINEAR MODEL (PARAMETERS)
We write a and b for the intercept and slope of the line. They are called the coefficients of the linear model.
The coefficient b is the slope, which tells us how rapidly the predicted value ($\hat{y}$) changes with respect to x. As the value of x increases by 1 unit, the predicted value of y changes by b units.
The coefficient a is the intercept, which tells where the line hits (intercepts) the y-axis. In other words, the intercept a is the predicted value of y when x=0.
Intercept and Slope (examples)
Fast food is often considered unhealthy because much of it is high in fat. Are fat and calories related? Here are the fat and calorie contents of several brands of burgers. To analyze the association between fat content and calories, the equation of the regression model is:
Predicted calories=217.95+10.63*fat
For this linear equation, slope=10.63, intercept=217.95
Q1: What does the slope 10.63 mean?
A1: An increase in fat of 1 gram is associated with an increase of 10.63 in predicted calories.
Q2: If the fat increases by 2 grams, how many more calories are expected to be contained in the burger?
A2: 2*10.63=21.26
Q3: What does the intercept 217.95 mean here?
A3: Theoretically, it means: when the burger contains no fat at all, the predicted amount of calories is 217.95.
TERM 4: RESIDUAL PLOT
After you construct the linear model, you have to check whether the linear model makes sense or not.
A residual plot can be used to check the appropriateness of the linear model.
A residual plot is the scatterplot of the residuals versus the x-values.
If a linear model is appropriate, then the residual plot shouldn’t have any interesting features, like a direction or shape.
It should stretch horizontally, with about the same amount of scatter throughout.
It should show no bends, and it should have no outliers.
[Figure: residual plot of Residuals versus X]
TERM 4: RESIDUAL SCATTERPLOT
Now, let’s try to diagnose the model for the calorie and fat example.
Fat(g): x             20     30     35    36     40     40     44
Calories: y           410    580    590   570    640    680    660
Predicted calories:   430.6  536.9  590   600.6  643.2  643.2  685.7
Residual:             -20.6  43.1   0     -30.6  -3.2   36.8   -25.7
[Figure: residual plot of residuals versus fat]
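The predicted values and residuals in the table above can be reproduced with a few lines of Python. This sketch assumes the model Predicted calories = 217.95 + 10.63*fat given in these slides. Small differences in the last digit against the table come from rounding.

```python
fat = [20, 30, 35, 36, 40, 40, 44]
calories = [410, 580, 590, 570, 640, 680, 660]

INTERCEPT = 217.95  # coefficients as given in the slides
SLOPE = 10.63

predicted = [INTERCEPT + SLOPE * x for x in fat]
residuals = [y - y_hat for y, y_hat in zip(calories, predicted)]

# Pairs (x, residual) are the points that a residual plot would display.
residual_points = list(zip(fat, residuals))

for x, y, y_hat, e in zip(fat, calories, predicted, residuals):
    print(f"fat={x:2d}  calories={y}  predicted={y_hat:.2f}  residual={e:.2f}")

print(f"sum of residuals = {sum(residuals):.2f}")
```

For this line the residuals add up to zero (up to floating-point rounding), because the line passes through the point of means (35, 590).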
TERM 4: RESIDUAL PLOT
Example: Tell what each of the residual plots below indicates about the appropriateness of the linear model that was fit to the data.
[Figures: three residual plots, panel (a) y1 versus x1, panel (b) y2 versus x2, panel (c) y3 versus x3]
(a)
(b)
(c)
TI for correlation and regression equation
The first time you do this:
Press 2nd, CATALOG (above 0)
Scroll down to DiagnosticOn
Press ENTER, ENTER
Read “Done”
Your calculator will remember this setting even when turned off
Enter predictor (x) values in L1
Enter response (y) values in L2
Pairs must line up
There must be the same number of predictor and response values
Press STAT, > (to CALC)
Scroll down to 8:LinReg(a+bx), press ENTER, ENTER
Read intercept a, slope b and correlation r on the screen
IMPORTANT NOTES:
The take-home quiz is due on Monday. No late submission will be accepted.
Keep the ID assignment and bring it to class on Monday.
A sample exam will be handed out on Monday. We will discuss the questions on Wednesday.
Suggested Problem Set 4 will be collected next Thursday.
The final exam will be next Thursday: 2 hours in class. Please prepare a one-page, A4-size cheat sheet (one-sided) on your own. A formula sheet will not be provided in the final exam. The cheat sheet will be collected together with the final exam.