A Crash Course in Statistics - Handouts


Upload: nida-sohail-chaudhary

Post on 12-Jan-2016


DESCRIPTION

A summary of the statistical skills needed for any 100-level university course in EMFSS.

TRANSCRIPT

Page 1: A Crash Course in Statistics - Handouts


Hand-out # 1 –

To construct histograms:

1. Data are first organised into a table which arranges the data into class intervals (also called bins), that is, subdivisions of the total range of values which the variable takes. In principle, bins do not have to be of equal width, but for simplicity use equal widths wherever possible. As a guide, six or seven bins should be sufficient, but remember to exercise common sense.

2. For each class interval, the corresponding frequency is determined, that is, the number of observations of the variable which fall in that interval.

3. Make two more columns for frequency density (frequency/class width) and cumulative frequency. Note the final column is not required for a histogram per se, although computation of cumulative frequencies may be useful when determining medians and quartiles (to be discussed later in this chapter).

4. Adjacent bars are drawn over the respective class intervals such that the area of each bar is proportional to the interval frequency. This explains why equal bin widths are desirable, since this reduces the problem to making the heights proportional to the interval frequency. However, you may be told to use a particular number of bins or bin widths, such that bins will not all be of equal width. In such cases, you will need to compute the frequency density as outlined above.

Key points to note:

- All bars are centred on the midpoints of their class intervals.

- Informative labels on the histogram, i.e. title and axis labels, must be provided!

- Because area represents frequency, it follows that the dimension of bar heights is number per unit class interval, hence the y-axis should be labelled ‘frequency density’ rather than ‘frequency’.

- The histogram must be drawn in pen on graph paper.
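The table described in steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration: the class intervals and frequencies below are hypothetical (they are not taken from the handout), and the function name `histogram_table` is made up for this example. It shows that with unequal bin widths the bar height is the frequency density, so that area (density × width) recovers the frequency.

```python
# Hypothetical class intervals (lower, upper) and frequencies; not from the handout.
bins = [(0, 10), (10, 20), (20, 40), (40, 80)]
frequencies = [5, 12, 16, 8]


def histogram_table(bins, frequencies):
    """Return rows of (interval, width, frequency, frequency density, cumulative frequency)."""
    rows = []
    cumulative = 0
    for (lower, upper), freq in zip(bins, frequencies):
        width = upper - lower
        cumulative += freq
        # Frequency density = frequency / class width (the bar height).
        rows.append(((lower, upper), width, freq, freq / width, cumulative))
    return rows


table = histogram_table(bins, frequencies)
for interval, width, freq, density, cum in table:
    print(f"{interval[0]:>3}-{interval[1]:<3} width={width:<3} f={freq:<3} "
          f"density={density:<5.2f} cum={cum}")
```

Note that the two widest bins have the lowest bars even though the third bin holds the most observations, which is exactly why the y-axis is labelled ‘frequency density’.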


Zone A, 2011 (solutions are overleaf)


Choice of the stem involves determining a major component of a typical data item, for example the ‘10s’ unit, or if data are of the form 1.4, 1.8, 2.1, 2.9 . . ., then the integer part would be appropriate.

The remainder of the data value plays the role of ‘leaf’. A ‘leaf’ is always a single digit! Applied to the weekly production dataset, we obtain the stem-and-leaf diagram below.

Note the following points:

- These stems are equivalent to using the (discrete) class intervals 350-359, 360-369, 370-379, ...

- Leaves are vertically aligned.

- A key MUST be provided.

- The leaves are placed in order of magnitude within the stems, so it is a good idea to sort the raw data into ascending order first.

- Unlike the histogram, the actual data values are preserved. This is advantageous if we want to calculate various (descriptive or summary) statistics.

- Note the informative title and labels for the stem and leaf.

Key:

45 | 4 = 454
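As a rough sketch of the construction, the Python below splits three-digit values into a ‘10s’ stem and a single-digit leaf. The data values are hypothetical (the weekly production dataset itself is not reproduced in this transcript), and `stem_and_leaf` is a name invented for this example.

```python
# Hypothetical three-digit observations; not the handout's dataset.
data = [354, 351, 362, 368, 360, 375]


def stem_and_leaf(values):
    """Group values by stem (value // 10) with sorted single-digit leaves (value % 10)."""
    diagram = {}
    for value in sorted(values):  # sorting first keeps leaves in order of magnitude
        stem, leaf = divmod(value, 10)
        diagram.setdefault(stem, []).append(leaf)
    return diagram


diagram = stem_and_leaf(data)
for stem, leaves in diagram.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("Key: 35 | 1 = 351")
```

Because every leaf is kept, the original observations can be read back off the diagram, which is the advantage over the histogram noted above.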


Zone B, 2011

Solution


Hand-out # 2 –

In a box plot, the middle horizontal line is the median and the upper and lower ends of the box are the upper and lower quartiles, respectively.

The ‘whiskers’ are drawn from the quartiles to the observations furthest from the median, provided these lie no more than one-and-a-half times the IQR beyond the quartiles (i.e. excluding outliers).

The whiskers are terminated by horizontal lines.

Any extreme points beyond the whiskers are plotted individually. An example of a (generic) box plot is given below.
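The quantities needed to draw a box plot can be sketched as follows. The data are hypothetical, and the quartiles are computed with Python's `statistics.quantiles` using the `"inclusive"` method; this is an assumption, since several quartile conventions exist and the handout's own convention may give slightly different values.

```python
import statistics

# Hypothetical sample with one extreme value; not from the handout.
data = sorted([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Quartiles (the "inclusive" interpolation rule is an assumed convention).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Whiskers may extend at most 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Points beyond the whiskers are plotted individually.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

Here the whiskers stop at the most extreme observations inside the fences, and the value 30 would be plotted as an individual point.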

Zone A, 2013 (solution is provided overleaf)


Hand-out # 3 –

Question 5 – Zone A, 2013


Solution 1

Solution 2

Solution 3


Solution 4

Solution 5


Hand-out # 4 –

Question 3

Question 4

Question 5


Solution 1

Solution 2

Solution 3


Hand-out # 5 –

Question 4

Question 5


Solution 1

Solution 2

Solution 3


Solution 4

Solution 5


Hand-out # 6 –

Suppose a simple random sample of 50 households is taken from a population of 1,000 households in an area of a town. The sample mean and standard deviation of weekly expenditure on alcoholic beverages are £18 and £4, respectively.

How many more households should you sample if it is required that your final estimate should have a standard error less than £0.19?
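One way to approach this question is sketched below. It rests on two assumptions that the question does not state: the standard error is taken as s/√n with no finite population correction, and the sample standard deviation of £4 is assumed to stay the same when more households are sampled. The handout's own solution may differ.

```python
import math

s = 4.0            # sample standard deviation (pounds), from the question
target_se = 0.19   # required upper bound on the standard error (pounds)
current_n = 50     # households already sampled

# s / sqrt(n) < target_se  is equivalent to  n > (s / target_se) ** 2
threshold = (s / target_se) ** 2

# Smallest whole number strictly greater than the threshold.
required_n = math.floor(threshold) + 1
additional = required_n - current_n

print(round(threshold, 2), required_n, additional)
```

Under these assumptions a total of 444 households is needed, i.e. 394 more than the 50 already sampled.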


Suppose the reaction time of a patient to a certain stimulus is known to have a standard deviation of 0.05 seconds.

How large a sample of measurements must a psychologist take in order to be

a) 95% and

b) 99% confident that the error in his estimate of the mean reaction time will not exceed 0.01 seconds?
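A minimal sketch of the calculation, assuming the usual normal-based rule n ≥ (z·σ/e)² with a known standard deviation σ and tolerable error e. The function name `sample_size` is invented for this example, and the z-values come from Python's `statistics.NormalDist` rather than from printed tables.

```python
import math
from statistics import NormalDist


def sample_size(sigma, max_error, confidence):
    """Smallest n with z * sigma / sqrt(n) <= max_error (known sigma, normal-based)."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / max_error) ** 2)


n_95 = sample_size(sigma=0.05, max_error=0.01, confidence=0.95)
n_99 = sample_size(sigma=0.05, max_error=0.01, confidence=0.99)
print(n_95, n_99)
```

Rounding up is deliberate: rounding down would leave the error bound slightly above 0.01 seconds.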


Hand-out # 7 –

Paired-sample methods are used in special cases when the two samples are not statistically independent. For our purposes, such paired data are likely to involve observations on the same individuals in two different states, specifically ‘before’ and ‘after’ some intervening event.

A paired-sample experimental design is advantageous since it allows researchers to determine whether or not significant changes have occurred as a result of the intervening event, free from bias from other factors, since these have been controlled for by observing the same individuals.

A necessary, but not sufficient, indicator for the presence of paired-sample data is that n1 = n2, in order to have ‘pairs’ of data values.

This scenario is easy to analyse as the paired data can simply be reduced to a ‘one sample’ analysis by working with differenced data. That is, suppose two samples generated sample values x1, x2, . . . , xn and y1, y2, . . . , yn respectively (note the same number of observations, n, in each sample).

Compute the differences, i.e. d1 = x1 − y1, d2 = x2 − y2, . . . , dn = xn − yn.

Using the differences to compute a confidence interval for μd gives the required confidence interval for μx − μy.
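The reduction to a one-sample problem can be sketched as below. The ‘before’ and ‘after’ values are hypothetical, and the critical value 2.776 (the upper 2.5% point of the t distribution with 4 degrees of freedom) is entered by hand from tables, because the Python standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical paired observations on the same five individuals.
x = [10, 12, 9, 11, 13]   # e.g. 'before'
y = [8, 11, 9, 9, 10]     # e.g. 'after'
t_crit = 2.776            # t table value for 95% confidence, n - 1 = 4 d.f.

# Work with the differences d_i = x_i - y_i as a single sample.
d = [xi - yi for xi, yi in zip(x, y)]
n = len(d)
d_bar = mean(d)
s_d = stdev(d)            # sample standard deviation of the differences

margin = t_crit * s_d / sqrt(n)
lower, upper = d_bar - margin, d_bar + margin
print(f"95% CI for mu_d: ({lower:.3f}, {upper:.3f})")
```

The interval for μd is read directly as the interval for μx − μy; no separate two-sample formula is needed.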


Hand-out # 8 –

Question 4


Solution 1

Solution 2

Solution 3

Solution 4


Hand-out # 9 –

We choose between two statements about the value of a parameter based on evidence obtained from sample data.

Our objective is to choose between these two conflicting statements about the population, where these statements are known as hypotheses. By convention these are denoted by H0 and H1.

The null hypothesis, H0, will always denote the parameter value with equality (=), for example H0 : μ = 0. In contrast, the alternative hypothesis, H1, will take one of three forms, using ≠, < or >, that is H1 : μ ≠ 0 or H1 : μ < 0 or H1 : μ > 0. Note that only one of these forms will be used per test.

H1 : μ ≠ 0   two-tailed test; use α/2

H1 : μ < 0   lower-tailed (one-sided) test; use α, with the critical value taking a negative sign

H1 : μ > 0   upper-tailed (one-sided) test; use α, with the critical value taking a positive sign

Always assume the null hypothesis, H0, is true; it is the working hypothesis.

Type I error: Rejecting H0 when it is true. This can be thought of as a ‘false positive’. Denote the probability of this type of error by α.

Type II error: Failing to reject H0 when it is false. This can be thought of as a ‘false negative’. Denote the probability of this type of error by β.

Steps of conducting a hypothesis test

1. Define the hypotheses.

2. State the test statistic and compute its value.

3. Define the critical region for the given significance level, α.

4. Choose a hypothesis:

- Upper-tailed test: reject the null hypothesis if test statistic > critical value.

- Lower-tailed test: reject the null hypothesis if test statistic < critical value.

- Two-tailed test: reject the null hypothesis if test statistic > + critical value or test statistic < - critical value.

5. Retest at appropriate levels.

6. Draw conclusions.
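The steps above can be sketched for a two-tailed test of a mean. All numbers here are hypothetical, and the sketch assumes a large enough sample for the normal (z) approximation with the sample standard deviation standing in for σ; a t-test would be the usual alternative for small samples.

```python
from math import sqrt
from statistics import NormalDist

# Step 1: H0: mu = 50 against H1: mu != 50 (hypothetical values throughout).
n, sample_mean, sample_sd, mu0, alpha = 36, 52.0, 6.0, 50.0, 0.05

# Step 2: test statistic.
z = (sample_mean - mu0) / (sample_sd / sqrt(n))

# Step 3: two-tailed critical value uses alpha / 2 in each tail.
critical = NormalDist().inv_cdf(1 - alpha / 2)

# Step 4: reject H0 if z > +critical or z < -critical.
reject = abs(z) > critical

# Two-sided p-value: probability of a statistic at least this extreme under H0.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, critical = ±{critical:.3f}, reject H0: {reject}, p = {p_value:.4f}")
```

The critical-value rule and the p-value always agree: H0 is rejected at level α exactly when the p-value is below α.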

P-value

[Worked example lost in extraction; surviving values: 100, 1570, 120, 1600.]


(solution overleaf)


Hand-out # 10 –

Question 3

Question 4

Question 5 - Zone A, 2013


Solution 1

Solution 2

Solution 3

Solution 4


Solution 5


Hand-out # 11 – -

This type of test examines the null hypothesis that two factors (or attributes) are not associated, against the alternative hypothesis that they are associated.

Each data unit we sample has one level (or ‘type’ or ‘variety’) of each factor.

Suppose that we are sampling people, and that one factor of interest is hair colour (blonde, brown, black, etc.) while another factor of interest is eye colour (blue, brown, green, etc.).

We wish to test whether or not these factors are associated.

Hence,

H0 : No association between hair colour and eye colour.

H1 : There is association between hair colour and eye colour.

So under H0 the distribution of eye colour is the same for blondes as it is for brunettes etc., whereas if H1 is true it may be attributable to blonde-haired people having a (significantly) higher proportion of blue eyes, say.

In three areas of a city a record has been kept of the numbers of burglaries, robberies and car thefts that take place in a year. The total number of offences was 150, and they were divided into the various categories as shown in the following contingency table:

The cell frequencies are known as observed frequencies.

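The expected frequency for the cell in row i and column j is:

E_ij = [(row i total)(column j total)] / grand total

The test statistic is χ² = Σ (O_ij − E_ij)² / E_ij, summed over all cells, where O_ij are the observed frequencies. Under H0 it has a χ² distribution with (r − 1)(c − 1) degrees of freedom, where r and c are the numbers of rows and columns.

Here r = c = 3, so the degrees of freedom are (r − 1)(c − 1) = (3 − 1)(3 − 1) = 4.

For α = 0.01, the χ² critical value is 13.277.

At the 1% significance level, 23.13 > 13.277, so H0 is rejected.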
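The calculation can be sketched as follows. The 3 × 3 table of counts below is hypothetical: it sums to 150 offences, but it is not the handout's table, which did not survive in this transcript. Only the critical value 13.277 (1% level, 4 degrees of freedom) is taken from the text.

```python
# Hypothetical observed counts: rows = areas, columns = burglary, robbery, car theft.
observed = [
    [30, 19, 6],
    [12, 23, 14],
    [8, 18, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency: (row i total)(column j total) / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum of (O - E)^2 / E over all cells.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(row_totals) - 1) * (len(col_totals) - 1)
critical_1pct = 13.277  # chi-squared table value quoted in the text for 4 d.f.

print(f"chi-squared = {chi_squared:.2f}, d.f. = {df}, reject H0: {chi_squared > critical_1pct}")
```

The expected frequencies always add up to the grand total, which is a quick check on the arithmetic when the calculation is done by hand.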


(Solution overleaf)


Hand-out # 12 –


Solution 1

Solution 2

Solution 3


Hand-out # 13 –

Correlation and regression enable us to see the connection between the actual dimensions of two or more variables.


The sample correlation coefficient is calculated using:

r = Σ(xi − x̄)(yi − ȳ) / √[ Σ(xi − x̄)² Σ(yi − ȳ)² ]
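A short sketch of the calculation, assuming the usual product-moment form r = Sxy / √(Sxx·Syy), where Sxy, Sxx and Syy are sums of products of deviations from the means. The data and the function name `sample_correlation` are hypothetical.

```python
from math import sqrt

# Hypothetical paired observations; not from the handout.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def sample_correlation(x, y):
    """Product-moment correlation r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


r = sample_correlation(x, y)
print(f"r = {r:.4f}")
```

Swapping the roles of x and y leaves r unchanged, since the formula is symmetric in the two variables.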

The correlation coefficient (ρ) lies between −1 and +1. A value of ρ = 1 indicates perfect positive linear correlation and a value of ρ = −1 indicates perfect negative linear correlation.

The simple linear regression model is y = α + βx, where α and β are the parameters of the line.
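As an illustration of fitting such a line, the sketch below uses the standard least-squares estimates b = Sxy / Sxx for the slope and a = ȳ − b·x̄ for the intercept. Treating these as the intended estimators is an assumption, since the handout's own derivation did not survive in this transcript; the data and the name `least_squares_line` are hypothetical.

```python
# Hypothetical paired observations; not from the handout.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def least_squares_line(x, y):
    """Return (a, b) for the fitted line y = a + b * x using least squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # estimated slope
    a = y_bar - b * x_bar    # estimated intercept
    return a, b


a, b = least_squares_line(x, y)
print(f"fitted line: y = {a:.2f} + {b:.2f}x")

# Predictions are only trustworthy within the observed range of x (here 1 to 5).
print(f"prediction at x = 3: {a + b * 3:.2f}")
```

The fitted line always passes through the point of means (x̄, ȳ), which gives a simple check on a hand calculation.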


Hand-out # 14 –


Solution 1

Solution 2
