Stats of Engineers, Lecture 8
Confidence interval width
A 95% confidence interval for the mean resistance of a component was constructed using a random sample of size n = 10. Which of the following conditions would probably NOT lead to a wider confidence interval (bigger error bar)?
(can click more than one)
1. If the sample mean was larger
2. If you increased your confidence level
3. If you increased your sample size
4. If the population standard deviation was larger
[Poll chart for options 1 to 4: 50%, 8%, 75%, 17%]
Recap: Confidence Intervals for the mean
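If $\sigma^2$ is unknown, we need to make two changes:
(i) Estimate $\sigma^2$ by $s^2$, the sample variance;
(ii) replace $z$ by $t_{n-1}$, the value obtained from t-tables.
The confidence interval for $\mu$ if we measure a sample mean $\bar{X}$ and sample variance $s^2$ is $\bar{X} \pm t_{n-1}\sqrt{s^2/n}$, where $t_{n-1}$ is the t-table value with $\nu = n-1$ degrees of freedom (d.o.f.).
A confidence interval for $\mu$ if we measure a sample mean $\bar{X}$ and already know $\sigma^2$ is $\bar{X} \pm z\sqrt{\sigma^2/n}$, where $z$ is the normal-table value for the chosen confidence level (e.g. $Q(z) = 0.975$ for 95% confidence).
Normal data, variance known or large data sample: use normal tables
Normal data, variance unknown: use t-distribution tables
[Figure: t-distribution for $\nu = n-1 = 1$, $5$ and $50$ compared with the Normal distribution]
For large $n$ the t-distribution tends to the Normal; in general it has broader tails.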
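As a rough illustration of the recap, the sketch below computes the half-width $z\sqrt{\sigma^2/n}$ of a known-variance interval using only Python's standard library. The numbers (sigma = 2.0, n = 10) and the helper name `half_width` are made up for illustration and are not taken from the lecture.

```python
from math import sqrt
from statistics import NormalDist


def half_width(sigma, n, confidence=0.95):
    """Half-width z * sqrt(sigma^2 / n) of a known-variance confidence interval."""
    # Two-sided interval: z is chosen so that Q(z) = (1 + confidence) / 2,
    # e.g. Q(z) = 0.975 for 95% confidence.
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return z * sqrt(sigma**2 / n)


# Hypothetical values, chosen only for illustration.
base = half_width(sigma=2.0, n=10)
more_confident = half_width(sigma=2.0, n=10, confidence=0.99)
bigger_sample = half_width(sigma=2.0, n=40)
noisier = half_width(sigma=4.0, n=10)

print(f"baseline 95%, n=10: +/- {base:.3f}")
print(f"99% confidence:     +/- {more_confident:.3f}")
print(f"n=40:               +/- {bigger_sample:.3f}")
print(f"sigma doubled:      +/- {noisier:.3f}")
```

The sample mean does not appear in `half_width` at all, so a different sample mean shifts the interval without changing its width.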
Linear regression
Linear regression: fitting a straight line to the mean value of $y$ as a function of $x$
[Figure: scatter plot of $y$ (axis 100 to 250) against $x$ (axis 0 to 40)]
We measure a response variable $y$ at various values of a controlled variable $x$
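Least-squares estimates $\hat{a}$ and $\hat{b}$:
$\hat{b} = \frac{S_{xy}}{S_{xx}}$ and $\hat{a} = \bar{y} - \hat{b}\bar{x}$
$S_{xx} = \sum_i x_i^2 - \frac{(\sum_i x_i)^2}{n} = \sum_i (x_i - \bar{x})^2$
$S_{xy} = \sum_i x_i y_i - \frac{\sum_i x_i \sum_i y_i}{n} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$
where $\bar{x}$ and $\bar{y}$ are the sample means.
Equation of the fitted line is $\hat{y} = \hat{a} + \hat{b}x$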
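Estimating $\sigma^2$: variance of $y$ about the fitted line
$\hat{\sigma}^2 = \frac{1}{n-2}\sum_i (y_i - \hat{a} - \hat{b}x_i)^2 = \frac{S_{yy} - \hat{b}S_{xy}}{n-2}$, where $S_{yy} = \sum_i (y_i - \bar{y})^2$
Quantifying the goodness of the fit
Residual sum of squares: $\sum_i (y_i - \hat{a} - \hat{b}x_i)^2 = S_{yy} - \hat{b}S_{xy}$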
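The shortcut $S_{yy} - \hat{b}S_{xy}$ for the residual sum of squares follows from substituting the least-squares estimates. A short derivation, writing $S_{yy} = \sum_i (y_i - \bar{y})^2$ by analogy with $S_{xx}$:

```latex
\begin{align*}
\sum_i (y_i - \hat{a} - \hat{b}x_i)^2
  &= \sum_i \left[(y_i - \bar{y}) - \hat{b}(x_i - \bar{x})\right]^2
     && \text{using } \hat{a} = \bar{y} - \hat{b}\bar{x} \\
  &= S_{yy} - 2\hat{b}S_{xy} + \hat{b}^2 S_{xx} \\
  &= S_{yy} - \hat{b}S_{xy}
     && \text{using } \hat{b}S_{xx} = S_{xy}
\end{align*}
```

Dividing by $n-2$ (two parameters, $\hat{a}$ and $\hat{b}$, have been estimated from the data) gives the estimate of the variance about the fitted line.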
Predictions
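[Figure: scatter plot of $y$ (axis 100 to 250) against $x$ (axis 0 to 40)]
For given $x$ of interest, what is mean $y$?
Predicted mean value: $\hat{y} = \hat{a} + \hat{b}x$.
What is the error bar?
It can be shown that $\mathrm{var}(\hat{y}) = \sigma^2\left(\frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}}\right)$
Confidence interval for mean $y$ at given $x$: $\hat{a} + \hat{b}x \pm t_{n-2}\sqrt{\hat{\sigma}^2\left(\frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}}\right)}$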
Example: The data y has been observed for various values of x, as follows:
y: 240   181   193   155   172   110   113   75    94
x: 1.6   9.4   15.5  20.0  22.0  35.5  43.0  40.5  33.0
Fit the simple linear regression model using least squares.
Answer: $\hat{y} = 234.1 - 3.509x$
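The fit can be checked numerically from the least-squares formulas. The sketch below uses only plain Python and the nine data points of the example; the variable names are illustrative.

```python
# Example data from the lecture.
x = [1.6, 9.4, 15.5, 20.0, 22.0, 35.5, 43.0, 40.5, 33.0]
y = [240, 181, 193, 155, 172, 110, 113, 75, 94]
n = len(x)

# Sums of squares and products, using the shortcut forms
# S_xx = sum(x^2) - (sum x)^2 / n and S_xy = sum(xy) - (sum x)(sum y) / n.
s_xx = sum(xi**2 for xi in x) - sum(x) ** 2 / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

# Least-squares slope and intercept.
b_hat = s_xy / s_xx
a_hat = sum(y) / n - b_hat * sum(x) / n

print(f"S_xx = {s_xx:.2f}, S_xy = {s_xy:.1f}")
print(f"fitted line: y = {a_hat:.1f} + ({b_hat:.3f}) x")
```

Run as written, this should reproduce the quoted intercept and slope to the precision given in the example.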
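Example: Using the previous data, what is the mean value of $y$ at $x = 30$ and the 95% confidence interval?
Recall fit was $\hat{y} = 234.1 - 3.509x$
Need $\hat{\sigma}^2 = \frac{S_{yy} - \hat{b}S_{xy}}{n-2}$
95% confidence for $Q = 0.975$, with $n - 2 = 7$ d.o.f. ⇒ $t_7 = 2.365$.
Confidence interval is $\hat{a} + \hat{b}x \pm t_{n-2}\sqrt{\hat{\sigma}^2\left(\frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}}\right)}$
Hence confidence interval for mean $y$ is ≈ 129 ± 17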
Extrapolation: predictions outside the range of the original data
What is the prediction for mean $y$ at $x_{new}$?
$\hat{y}_{new} = \hat{a} + \hat{b}x_{new}$
[Figure: fitted line extended to $x_{new}$ outside the data range. In one case the prediction $y_{new}$ looks OK; in another it is quite wrong.]
Extrapolation is often unreliable unless you are sure a straight line is a good model
We previously calculated the confidence interval for the mean: if we average over many data samples of $y$ at $x$, this tells us the interval we expect the average to lie in.
What about the distribution of future data points themselves?
Confidence interval for a prediction
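Two effects:
- Variance on our estimate of mean $y$ at $x$: $\hat{\sigma}^2\left(\frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}}\right)$
- Variance of individual points about the mean: $\hat{\sigma}^2$
Confidence interval for a single response (measurement of $y$ at $x$) is $\hat{a} + \hat{b}x \pm t_{n-2}\sqrt{\hat{\sigma}^2\left(1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}}\right)}$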
Example: Using the previous data, what is the 95% confidence interval for a new measurement of $y$ at $x = 30$?
Answer: ≈ 129 ± 50
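Both error bars can be reproduced from the example data. The sketch below is plain Python; it assumes the point of interest is x = 30 (the value consistent with a predicted mean of about 129) and hard-codes the t-table value 2.365 for Q = 0.975 with 7 d.o.f. rather than computing it. The names `leverage`, `mean_half_width` and `single_half_width` are illustrative.

```python
from math import sqrt

# Example data from the lecture.
x = [1.6, 9.4, 15.5, 20.0, 22.0, 35.5, 43.0, 40.5, 33.0]
y = [240, 181, 193, 155, 172, 110, 113, 75, 94]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b_hat = s_xy / s_xx
a_hat = y_bar - b_hat * x_bar
sigma2_hat = (s_yy - b_hat * s_xy) / (n - 2)  # variance about the fitted line

T_7 = 2.365  # t-table value for Q = 0.975 with n - 2 = 7 d.o.f. (looked up, not computed)
x0 = 30.0    # assumed point of interest

y0 = a_hat + b_hat * x0
leverage = 1 / n + (x0 - x_bar) ** 2 / s_xx

# Error bar on the mean of y at x0, and on a single new measurement at x0.
mean_half_width = T_7 * sqrt(sigma2_hat * leverage)
single_half_width = T_7 * sqrt(sigma2_hat * (1 + leverage))

print(f"mean y at x0:       {y0:.0f} +/- {mean_half_width:.0f}")
print(f"new measurement:    {y0:.0f} +/- {single_half_width:.0f}")
```

The only difference between the two intervals is the extra 1 inside the square root, which is the scatter of individual points about the mean; here it dominates the width.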
A linear regression line is fit to measured engine efficiency $y$ as a function of external temperature $T$ (in Celsius) at several values of $T$. Which of the following statements is most likely to be incorrect?
1. The confidence interval for a new measurement of $y$ at a temperature in the middle of the measured range is narrower than at a temperature at the end of the range
2. Adding a new data point at one temperature would decrease the confidence interval width at another temperature
3. If $T$ and $y$ accurately have a linear regression model, adding more data points at the two ends of the temperature range would be better than adding more in the middle
4. The mean engine efficiency at T = -20 will lie within the 95% confidence interval at T = -20 roughly 95% of the time
[Poll chart for options 1 to 4: 6%, 61%, 11%, 22%]
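Answer
Confidence interval for mean $y$ at given $x$: $\hat{a} + \hat{b}x \pm t_{n-2}\sqrt{\hat{\sigma}^2\left(\frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}}\right)}$
Confidence interval for a single response (measurement of $y$ at $x$): $\hat{a} + \hat{b}x \pm t_{n-2}\sqrt{\hat{\sigma}^2\left(1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}}\right)}$
- Confidence interval narrower in the middle ($x = \bar{x}$)
- Adding new data decreases uncertainty in fit, so confidence intervals narrower ($n$ larger)
- If linear regression model accurate, get better handle on the slope by adding data at both ends (bigger $S_{xx}$, so smaller confidence interval)
- Extrapolation often unreliable, e.g. linear model may well not hold at below-freezing temperatures. Confidence interval unreliable at T = -20, so statement 4 is the one most likely to be incorrect.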
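The first three points can be checked with the factor $\frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}}$ that appears in both intervals. The sketch below uses hypothetical temperatures (15 to 35 Celsius in steps of 5), which are an assumption for illustration and not the lecture's values; the helper name `leverage` is also illustrative. It ignores changes in $\hat{\sigma}^2$ and $t_{n-2}$, so it only shows the geometric part of the effect.

```python
def leverage(x0, xs):
    """Factor 1/n + (x0 - mean)^2 / S_xx scaling the variance of the fitted mean at x0."""
    n = len(xs)
    mean = sum(xs) / n
    s_xx = sum((xi - mean) ** 2 for xi in xs)
    return 1 / n + (x0 - mean) ** 2 / s_xx


# Hypothetical measurement temperatures in Celsius (assumed, for illustration only).
temps = [15, 20, 25, 30, 35]

middle = leverage(25, temps)  # at the mean temperature
end = leverage(35, temps)     # at the edge of the measured range

# Add two extra measurements, either at the two ends or in the middle,
# and compare the factor at the edge of the range.
with_ends = leverage(35, temps + [15, 35])
with_middle = leverage(35, temps + [25, 25])

print(f"middle: {middle:.3f}, end: {end:.3f}")
print(f"end after adding points at the ends:   {with_ends:.3f}")
print(f"end after adding points in the middle: {with_middle:.3f}")
```

Either choice of extra points reduces the factor (larger n), but points at the ends also increase $S_{xx}$ and so reduce it further.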
Correlation
Regression tries to model the linear relation between mean y and x.
Correlation measures the strength of the linear association between y and x.
[Figure: two scatter plots of $y$ (axis 10 to 60) against $x$ (axis 0 to 20). Panel A: weak correlation. Panel B: strong correlation.]
- same linear regression fit (with different confidence intervals)
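If x and y are positively correlated:
- if x is high ($x > \bar{x}$) y is mostly high ($y > \bar{y}$)
- if x is low ($x < \bar{x}$) y is mostly low ($y < \bar{y}$)
⇒ on average $(x - \bar{x})(y - \bar{y})$ is positive
If x and y are negatively correlated:
- if x is high ($x > \bar{x}$) y is mostly low ($y < \bar{y}$)
- if x is low ($x < \bar{x}$) y is mostly high ($y > \bar{y}$)
⇒ on average $(x - \bar{x})(y - \bar{y})$ is negative
⇒ can use $S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$ to quantify the correlation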
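More convenient if the result is independent of units (dimensionless number).
Define the Pearson product-moment correlation coefficient
$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$
If $x \to ax$ with $a > 0$, then $r$ is unchanged ($S_{xy} \to aS_{xy}$, $S_{xx} \to a^2S_{xx}$). Similarly for $y$: stretching the plot does not affect $r$.
Range: $-1 \le r \le 1$
r = 1: there is a line with positive slope going through all the points;
r = -1: there is a line with negative slope going through all the points;
r = 0: there is no linear association between y and x.
Example: from the previous data: $S_{xx} = 1651.42$, $S_{yy} = 23116.9$, $S_{xy} = -5794.1$
Hence $r = \frac{-5794.1}{\sqrt{1651.42 \times 23116.9}} \approx -0.938$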
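The correlation coefficient for the example data, and its independence of units, can be checked with a few lines of plain Python. The helper name `pearson_r` and the rescaling factor are illustrative choices.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    s_yy = sum((yi - y_bar) ** 2 for yi in ys)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    return s_xy / sqrt(s_xx * s_yy)


# Example data from the lecture.
x = [1.6, 9.4, 15.5, 20.0, 22.0, 35.5, 43.0, 40.5, 33.0]
y = [240, 181, 193, 155, 172, 110, 113, 75, 94]

r = pearson_r(x, y)

# Change of units: rescale x by a positive factor and shift it.
r_rescaled = pearson_r([1000 * xi + 5 for xi in x], y)

print(f"r = {r:.3f}, after rescaling x: {r_rescaled:.3f}")
```

A negative rescaling factor would flip the sign of $r$ but leave its magnitude unchanged.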
Notes:
- magnitude of $r$ measures how noisy the data is, but not the slope
- finding $r = 0$ only means that there is no linear relationship, and does not imply the variables are independent
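A small made-up example of the last note: below, y is completely determined by x, yet the correlation coefficient vanishes because the relationship is not linear. The data are hypothetical and chosen to be symmetric about x = 0.

```python
from math import sqrt

# Hypothetical data: y depends on x exactly, but quadratically rather than linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi**2 for xi in x]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = s_xy / sqrt(s_xx * s_yy)
print(f"r = {r:.3f}")
```

The positive and negative halves of the data contribute equal and opposite terms to $S_{xy}$, so they cancel even though knowing x fixes y.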
Correlation
A researcher found that r = +0.92 between the high temperature of the day and the number of ice cream cones sold in Brighton. What does this information tell us?
[Poll chart for options 1 to 4: 8%, 69%, 23%, 0%]
1. Higher temperatures cause people to buy more ice cream.
2. Buying ice cream causes the temperature to go up.
3. Some extraneous variable causes both high temperatures and high ice cream sales
4. Temperature and ice cream sales have a strong positive linear relationship.
Question from Murphy et al.
Correlation r
Error on the estimated correlation coefficient?
- not easy; possibilities include subdividing the points and assessing the spread in r values.
Causation?
- a non-zero $r$ does not imply that changes in x cause changes in y; additional types of evidence are needed to see if that is true.
J Polit Econ. 2008; 116(3): 499–532. http://www.journals.uchicago.edu/doi/abs/10.1086/589524
Strong evidence for a 2-3% correlation between height and earnings.
- this doesn't mean being tall causes you to earn more (though it could)
Correlation
Which of the following scatter plots shows data with the most negative correlation $r$?
[Figure: four scatter plots labelled 1 to 4]
[Poll chart for options 1 to 4: 50%, 13%, 25%, 13%]
Answer: 1. No correlation; 2. Correct; 3. Not large; 4. Positive