descriptive statistics - wordpress.com · descriptive statistics maths 4th eso josÉ jaime noguera...

Post on 18-Aug-2020


DESCRIPTIVE STATISTICS

MATHS 4TH ESO

JOSÉ JAIME NOGUERA


INTRODUCTION

Statistics is used to collect, organize, analyze and present data.

• POPULATION: the whole group of entities (individuals) that you want to study.

• SAMPLE: a subset of the population, chosen so that it represents the entire population.


EXAMPLE

Spanish general election:

WHICH PARTY ARE YOU GOING TO VOTE FOR?

– POPULATION: all Spanish citizens aged 18 or over.

– SAMPLE: the people you actually ask the question.

Random Variables

A random variable or statistical variable is the characteristic that we want to study in the population.

According to the answer to your question you can classify the variables as:

Random variable

• QUALITATIVE: the answer is not a number.

• QUANTITATIVE: the answer is a number.

– DISCRETE: the answer is an integer.

– CONTINUOUS: the answer is a decimal number.

Examples

• We want to study the number of brothers or sisters of a population.

– Discrete quantitative variable.

• We want to study the hair colour of a population.

– Qualitative variable.

• We want to study the height of a population.

– Continuous quantitative variable.


Organizing data: frequency tables and charts


A complete example for a discrete quantitative variable

Study: we ask 25 students how many brothers and sisters they have. The answers are:

1, 3, 0, 1, 2, 2, 0, 1, 2, 1, 2, 1, 1, 0, 1, 0, 1, 0, 1, 1, 3, 2, 2, 1, 1

YOU NEED TO KNOW

• N = the number of data values.

• 𝑥𝑖 = the i-th value of the variable.

• 𝑓𝑖 = absolute frequency: the number of times that 𝑥𝑖 appears in the answers.

• ℎ𝑖 = relative frequency: $h_i = \frac{f_i}{N}$

• 𝐹𝑖 = absolute cumulative frequency: $F_i = f_1 + f_2 + \cdots + f_i$

• 𝐻𝑖 = relative cumulative frequency: $H_i = h_1 + h_2 + \cdots + h_i = \frac{F_i}{N}$

• 𝐻𝑖 as % = relative cumulative frequency as a percentage: $H_i \cdot 100\,\%$

| 𝒙𝒊 | 𝒇𝒊 | 𝒉𝒊 | 𝑭𝒊 | 𝑯𝒊 | 𝑯𝒊 as % |
|---|---|---|---|---|---|
| 0 | 5 | 5/25 = 0.2 | 5 | 5/25 = 0.2 | 0.2·100 = 20% |
| 1 | 12 | 12/25 = 0.48 | 5+12 = 17 | 17/25 = 0.68 | 0.68·100 = 68% |
| 2 | 6 | 6/25 = 0.24 | 5+12+6 = 23 | 23/25 = 0.92 | 0.92·100 = 92% |
| 3 | 2 | 2/25 = 0.08 | 5+12+6+2 = 25 | 25/25 = 1 | 1·100 = 100% |

N=25

FREQUENCY TABLE
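The columns of the frequency table can also be computed programmatically. The following is a minimal Python sketch, not part of the original slides; the function name `frequency_table` and the row layout are illustrative assumptions.

```python
from collections import Counter

# Answers of the 25 students (number of brothers and sisters).
data = [1, 3, 0, 1, 2, 2, 0, 1, 2, 1, 2, 1, 1,
        0, 1, 0, 1, 0, 1, 1, 3, 2, 2, 1, 1]


def frequency_table(values):
    """Return one row (x_i, f_i, h_i, F_i, H_i) per distinct value, sorted by x_i."""
    n = len(values)
    counts = Counter(values)          # absolute frequencies f_i
    rows = []
    cumulative = 0                    # running total F_i
    for x in sorted(counts):
        f = counts[x]
        cumulative += f
        rows.append((x, f, f / n, cumulative, cumulative / n))
    return rows


table = frequency_table(data)
for x, f, h, F, H in table:
    # Columns: x_i, f_i, h_i, F_i, H_i, H_i as a percentage
    print(f"{x}  {f:2d}  {h:.2f}  {F:2d}  {H:.2f}  {H:.0%}")
```

Each row is built in a single pass, so 𝐹𝑖 is simply the running sum of the 𝑓𝑖 seen so far.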

CHARTS

[Figure: BAR CHART. Horizontal axis: the values 0, 1, 2, 3. Vertical axis: absolute frequency (scale from 0 to 14).]

[Figure: FREQUENCY POLYGON. Horizontal axis: the values 0, 1, 2, 3. Vertical axis: absolute frequency (scale from 0 to 14).]

[Figure: PIE CHART. Sectors: 0 (20%), 1 (48%), 2 (24%), 3 (8%). The angle of each sector is ℎ𝑖 · 360°; for the value 1 it is 0.48 · 360° = 172.8°.]

A complete example for a continuous quantitative variable

If we have too many different values of xi we have to group the values into class intervals.

Example: We know the weight of 30 students:

52 63 71 68 72 69

73 81 53 80 71 72

77 61 83 78 55 60

73 53 66 90 80 96

67 70 82 83 71 61


Choosing the length (or amplitude) of the intervals

Here we have several options:

• If the problem gives the number of intervals, for instance “group the data into 6 intervals”:

$$\text{length} = \frac{\text{Max. value} - \text{Min. value}}{6}$$

• If the problem says nothing:

$$\text{length} = \frac{\text{Max. value} - \text{Min. value}}{\sqrt{N}}$$

In our case the problem says nothing, therefore:

$$\text{length} = \frac{96 - 52}{\sqrt{30}} = 8.03$$

The length should be an integer, and we always round up to the next integer: 8.03 → length = 9

The starting point of the first interval is also a matter of choice. Sometimes the intervals are centred on the data, but we will simply start at the minimum value of our data.
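As a rough illustration of this rule (range divided by the given number of intervals, or by the square root of N when nothing is specified, then rounded up), here is a short Python sketch. The names `interval_length` and `interval_edges` are hypothetical helpers, not something defined in the slides.

```python
import math

# Weights of the 30 students from the example.
weights = [52, 63, 71, 68, 72, 69,
           73, 81, 53, 80, 71, 72,
           77, 61, 83, 78, 55, 60,
           73, 53, 66, 90, 80, 96,
           67, 70, 82, 83, 71, 61]


def interval_length(values, n_intervals=None):
    """Length of the class intervals, rounded up to the next integer.

    If n_intervals is None, the range is divided by sqrt(N).
    """
    spread = max(values) - min(values)
    divisor = n_intervals if n_intervals is not None else math.sqrt(len(values))
    return math.ceil(spread / divisor)


def interval_edges(values, length):
    """Interval boundaries, starting at the minimum value of the data."""
    edges = [min(values)]
    while edges[-1] < max(values):
        edges.append(edges[-1] + length)
    return edges


length = interval_length(weights)
edges = interval_edges(weights, length)
print(length, edges)
```

Starting at the minimum is the choice made in the slides; centring the intervals on the data would only change the first element of `edges`.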

Intervals:

[52,61)

[61,70)

[70,79)

[79,88)

[88,97]

Pay attention to the last interval: it is closed at both ends.

The class mark 𝑥𝑖 is the midpoint of each interval. For the first interval, $\frac{52 + 61}{2} = 56.5$.

| Intervals | Class mark 𝒙𝒊 |
|---|---|
| [52,61) | 56.5 |
| [61,70) | 65.5 |
| [70,79) | 74.5 |
| [79,88) | 83.5 |
| [88,97] | 92.5 |

| Intervals | Class mark 𝒙𝒊 | 𝒇𝒊 |
|---|---|---|
| [52,61) | 56.5 | 5 |
| [61,70) | 65.5 | 7 |
| [70,79) | 74.5 | 10 |
| [79,88) | 83.5 | 6 |
| [88,97] | 92.5 | 2 |

| Intervals | Class mark 𝒙𝒊 | 𝒇𝒊 | 𝒉𝒊 | 𝑭𝒊 | 𝑯𝒊 | 𝑯𝒊 as % |
|---|---|---|---|---|---|---|
| [52,61) | 56.5 | 5 | 5/30 = 0.17 | 5 | 5/30 = 0.17 | 17% |
| [61,70) | 65.5 | 7 | 0.23 | 5+7 = 12 | 12/30 = 0.4 | 40% |
| [70,79) | 74.5 | 10 | 0.33 | 22 | 0.73 | 73% |
| [79,88) | 83.5 | 6 | 0.2 | 28 | 0.93 | 93% |
| [88,97] | 92.5 | 2 | 0.07 | 30 | 1 | 100% |

N=30
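The grouping step can be sketched in Python as follows. This is only an illustration under the conventions used in the example: intervals are closed on the left and open on the right, except the last one, which is closed at both ends. The function name `grouped_table` is an assumption.

```python
# Weights of the 30 students and the interval boundaries chosen in the example.
weights = [52, 63, 71, 68, 72, 69,
           73, 81, 53, 80, 71, 72,
           77, 61, 83, 78, 55, 60,
           73, 53, 66, 90, 80, 96,
           67, 70, 82, 83, 71, 61]
edges = [52, 61, 70, 79, 88, 97]


def grouped_table(values, edges):
    """Return rows ((a, b), class mark, f_i, F_i) for the intervals [a, b).

    The last interval is treated as [a, b], closed at both ends.
    """
    rows = []
    cumulative = 0
    last = len(edges) - 2             # index of the last interval
    for i, (a, b) in enumerate(zip(edges, edges[1:])):
        if i == last:
            f = sum(1 for v in values if a <= v <= b)
        else:
            f = sum(1 for v in values if a <= v < b)
        cumulative += f
        rows.append(((a, b), (a + b) / 2, f, cumulative))
    return rows


rows = grouped_table(weights, edges)
for (a, b), mark, f, F in rows:
    print(f"[{a},{b})  {mark}  {f}  {F}")
```

The class mark is computed as the midpoint (a + b) / 2 of each interval, and the relative columns follow by dividing 𝑓𝑖 and 𝐹𝑖 by N.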

[Figure: HISTOGRAM. Horizontal axis: interval boundaries 52, 61, 70, 79, 88, 97. Vertical axis: absolute frequency (scale from 0 to 12).]

Exercise

In a clothing store, the number of garments sold per day is:

a) Make a frequency table grouping the data into 6 class intervals.

b) Draw the appropriate chart.

Statistical Parameters


STATISTICAL CONCENTRATION PARAMETERS

They are also known as CENTRAL TENDENCY MEASURES:

• MEAN: $\bar{x} = \frac{x_1 f_1 + x_2 f_2 + \cdots + x_n f_n}{N} = \frac{\sum_{i=1}^{n} x_i f_i}{N}$

• MODE: Mo is the 𝑥𝑖 with the greatest 𝑓𝑖.

• MEDIAN: Me is the value in the middle of the data when they are sorted in order.

Discrete quantitative variable

• Using the data of our previous example:

$$\bar{x} = \frac{0 \cdot 5 + 1 \cdot 12 + 2 \cdot 6 + 3 \cdot 2}{25} = \frac{30}{25} = 1.2$$

Mo= 1 (because its 𝒇𝒊 is the greatest one)

Me= 1

This is because 50% of N=25 is 0.5·25=12.5, and the first 𝐹𝑖 greater than or equal to 12.5 is 𝐹2=17, which corresponds to 𝑥2=1.

𝒙𝒊 𝒇𝒊 𝑭𝒊

0 5 5

1 12 17

2 6 23

3 2 25

N=25
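The three central tendency measures can be read straight off the (𝑥𝑖, 𝑓𝑖) columns. Below is a small Python sketch of that reading; the function names are illustrative, and the median uses the rule from this slide (the first 𝐹𝑖 that reaches 50% of N).

```python
# (x_i, f_i) pairs from the frequency table, sorted by x_i.
table = [(0, 5), (1, 12), (2, 6), (3, 2)]


def mean(table):
    """Weighted mean: sum of x_i * f_i divided by N."""
    n = sum(f for _, f in table)
    return sum(x * f for x, f in table) / n


def mode(table):
    """The x_i with the greatest absolute frequency f_i."""
    return max(table, key=lambda row: row[1])[0]


def median(table):
    """The x_i of the first cumulative frequency F_i that reaches 50% of N."""
    n = sum(f for _, f in table)
    cumulative = 0
    for x, f in table:
        cumulative += f
        if 2 * cumulative >= n:       # same as F_i >= 0.5 * N, without floats
            return x


print(mean(table), mode(table), median(table))
```

The same functions work for grouped data if each 𝑥𝑖 is replaced by the class mark of its interval.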


Continuous quantitative variable

• Using the data of our previous example:

$$\bar{x} = \frac{56.5 \cdot 5 + 65.5 \cdot 7 + 74.5 \cdot 10 + 83.5 \cdot 6 + 92.5 \cdot 2}{30} = \frac{2172}{30} = 72.4$$

Mo= 74.5, or modal interval= [70,79)

Me= 74.5, or median class interval=[70,79)

This is because 50% of N=30 is 0.5·30=15, and the first 𝐹𝑖 greater than or equal to 15 is 𝐹3=22, which corresponds to 𝑥3=74.5.

Intervals 𝒙𝒊 𝒇𝒊 𝑭𝒊

[52,61) 56.5 5 5

[61,70) 65.5 7 12

[70,79) 74.5 10 22

[79,88) 83.5 6 28

[88,97] 92.5 2 30

N= 30


STATISTICAL POSITION PARAMETERS

• QUARTILES: the points that divide the ordered data into four equal parts:

– 𝑄1: first quartile. 25% of the data lie below 𝑄1.

– 𝑄2 = Me: second quartile. 50% of the data lie below 𝑄2.

– 𝑄3: third quartile. 75% of the data lie below 𝑄3.

• PERCENTILES: 𝑃𝑘 is the value below which k% of the data lie.

Discrete quantitative variable

• Using the data of our previous example:

𝒙𝒊 𝒇𝒊 𝑭𝒊

0 5 5

1 12 17

2 6 23

3 2 25

N=25

• 𝑄1 → 0.25 · 25 = 6.25 The first 𝐹𝑖 greater than or equal to 6.25 is 17=𝐹2 which corresponds with 𝑥2=1. Hence 𝑄1=1

• 𝑄2 → 0.5 · 25 = 12.5 The first 𝐹𝑖 greater than or equal to 12.5 is 17=𝐹2 which corresponds with 𝑥2=1. Therefore 𝑄2=1=Me

• 𝑄3 → 0.75 · 25 = 18.75 The first 𝐹𝑖 greater than or equal to 18.75 is 23=𝐹3 which corresponds with 𝑥3=2. Then 𝑄3=2

• 𝑃95 → 0.95 · 25 = 23.75 The first 𝐹𝑖 greater than or equal to 23.75 is 25=𝐹4 which corresponds with 𝑥4=3. Hence 𝑃95=3
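The four calculations above all follow one rule: find the first cumulative frequency 𝐹𝑖 that is greater than or equal to k% of N. A minimal Python sketch of that rule, with the hypothetical helper name `quantile`, could look like this:

```python
# (x_i, f_i) pairs from the frequency table, sorted by x_i.
table = [(0, 5), (1, 12), (2, 6), (3, 2)]


def quantile(table, k):
    """Return the x_i of the first F_i that is >= k% of N."""
    n = sum(f for _, f in table)
    cumulative = 0
    for x, f in table:
        cumulative += f
        # F_i >= (k / 100) * N, written with integers to avoid rounding issues
        if cumulative * 100 >= k * n:
            return x


q1, q2, q3, p95 = (quantile(table, k) for k in (25, 50, 75, 95))
print(q1, q2, q3, p95)  # 1 1 2 3
```

For a continuous variable the same function can be applied to the class marks and their frequencies.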


Continuous quantitative variable

• Using the data of our previous example:

• 𝑄1 → 0.25 · 30 = 7.5 The first 𝐹𝑖 greater than or equal to 7.5 is 12=𝐹2 which corresponds with 𝑥2=65.5. Hence 𝑄1=65.5

• 𝑄2 → 0.5 · 30 = 15 The first 𝐹𝑖 greater than or equal to 15 is 22=𝐹3 which corresponds with 𝑥3=74.5. Therefore 𝑄2=Me=74.5

• 𝑄3 → 0.75 · 30 = 22.5 The first 𝐹𝑖 greater than or equal to 22.5 is 28=𝐹4 which corresponds with 𝑥4=83.5. Then 𝑄3=83.5

• 𝑃30 → 0.30 · 30 = 9 The first 𝐹𝑖 greater than or equal to 9 is 12=𝐹2 which corresponds with 𝑥2=65.5. Hence 𝑃30=65.5

Intervals 𝒙𝒊 𝒇𝒊 𝑭𝒊

[52,61) 56.5 5 5

[61,70) 65.5 7 12

[70,79) 74.5 10 22

[79,88) 83.5 6 28

[88,97] 92.5 2 30

N= 30


Box and whisker plot


Discrete quantitative variable

• We know that:

– Minimum value=0

– 𝑄1=1

– 𝑄2 = 𝑀𝑒=1

– 𝑄3=2

– Maximum value=3


Continuous quantitative variable

• We know that:

– Minimum value = 52

– 𝑄1= 65.5

– 𝑄2 = 𝑀𝑒= 74.5

– 𝑄3= 83.5

– Maximum value = 97

Improving the quartile calculation

Once you know the basics of quartiles, let’s look at a special case:

• Calculate the quartiles :

1 , 1, 1, 2, 3, 3, 4, 4, 4, 5, 5, 5

The frequency table is:


𝒙𝒊 𝒇𝒊 𝑭𝒊

1 3 3

2 1 4

3 2 6

4 3 9

5 3 12

N=12

If you want to calculate the median:

• 𝑄2 → 0.5 · 12 = 6 The first 𝐹𝑖 greater than or equal to 6 is 6=𝐹3 which corresponds with 𝑥3=3. Therefore 𝑄2=3=Me

But this is not right, because if we look at the ordered data the median should be the average of the two middle values:

1, 1, 1, 2, 3, 3, 4, 4, 4, 5, 5, 5

$$Me = \frac{3 + 4}{2} = 3.5$$

To solve this drawback, we simply calculate the median (or any other quartile) as

$$\frac{x_i + x_{i+1}}{2}$$

when we find an 𝐹𝑖 exactly equal to 0.5·N (or k% of N). In other cases we calculate the quartiles as usual. From now on we will calculate the quartiles as explained in this slide.
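A possible way to express this refined rule in Python is sketched below. The function name `quantile` and the handling of the very last row (where there is no next value to average with) are assumptions made for the illustration.

```python
# (x_i, f_i) pairs for the data 1, 1, 1, 2, 3, 3, 4, 4, 4, 5, 5, 5.
table = [(1, 3), (2, 1), (3, 2), (4, 3), (5, 3)]


def quantile(table, k):
    """Quantile at k% using cumulative frequencies.

    If some F_i is exactly k% of N, return the average of x_i and x_(i+1);
    otherwise return the x_i of the first F_i greater than k% of N.
    """
    n = sum(f for _, f in table)
    cumulative = 0
    for i, (x, f) in enumerate(table):
        cumulative += f
        exact = cumulative * 100 == k * n          # F_i == k% of N
        if exact and i + 1 < len(table):
            return (x + table[i + 1][0]) / 2
        if cumulative * 100 > k * n:
            return x
    return table[-1][0]                            # assumption: k = 100 gives the maximum


print(quantile(table, 50))  # 3.5
```

With the 25 answers of the earlier example no 𝐹𝑖 coincides exactly with 0.5·N, so this function returns the same median as before.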


Dispersion (spread) Parameters

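• Range: R = Max. value − Min. value

• Average deviation: $D_{\bar{x}} = \frac{f_1 |x_1 - \bar{x}| + f_2 |x_2 - \bar{x}| + \cdots + f_n |x_n - \bar{x}|}{N}$

• Variance: $\sigma^2 = \frac{f_1 (x_1 - \bar{x})^2 + f_2 (x_2 - \bar{x})^2 + \cdots + f_n (x_n - \bar{x})^2}{N}$

• Standard deviation: $\sigma = \sqrt{\sigma^2}$

• Coefficient of variation: $CV = \frac{\sigma}{\bar{x}}$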

Example

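We know that:

$\bar{x} = 1.2$

• Range: R = 3 − 0 = 3

• Average deviation: $D_{\bar{x}} = \frac{5\,|0 - 1.2| + 12\,|1 - 1.2| + 6\,|2 - 1.2| + 2\,|3 - 1.2|}{25} = 0.67$

• Variance: $\sigma^2 = \frac{5\,(0 - 1.2)^2 + 12\,(1 - 1.2)^2 + 6\,(2 - 1.2)^2 + 2\,(3 - 1.2)^2}{25} = 0.72$

• Standard deviation: $\sigma = \sqrt{\sigma^2} = \sqrt{0.72} = 0.85$

• Coefficient of variation: $CV = \frac{\sigma}{\bar{x}} = \frac{0.85}{1.2} = 0.71$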
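To check this kind of calculation by machine, the dispersion parameters can be computed from the (𝑥𝑖, 𝑓𝑖) pairs of the brothers-and-sisters example. The snippet below is a sketch with illustrative variable names; it divides by N, as in the formulas of this section.

```python
import math

# (x_i, f_i) pairs from the frequency table of the 25 students.
table = [(0, 5), (1, 12), (2, 6), (3, 2)]

n = sum(f for _, f in table)
mean = sum(x * f for x, f in table) / n

xs = [x for x, _ in table]
value_range = max(xs) - min(xs)

# Average deviation: weighted mean of the absolute distances to the mean.
avg_dev = sum(f * abs(x - mean) for x, f in table) / n

# Variance: weighted mean of the squared distances to the mean.
variance = sum(f * (x - mean) ** 2 for x, f in table) / n

std_dev = math.sqrt(variance)
cv = std_dev / mean

for name, value in [("range", value_range), ("average deviation", avg_dev),
                    ("variance", variance), ("standard deviation", std_dev),
                    ("CV", cv)]:
    print(f"{name}: {value:.2f}")
```

Replacing `table` with the class marks and frequencies of the weights example gives the spread parameters for the continuous variable.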


Exercise

• Calculate the spread parameters:

$\bar{x} = 72.4$

Interpreting the spread measures

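A: $\bar{x} = 3$, $\sigma = 1.03$, $CV = \frac{\sigma}{\bar{x}} = 0.34$

B: $\bar{x} = 3$, $\sigma = 1.68$, $CV = \frac{\sigma}{\bar{x}} = 0.56$

C: $\bar{x} = 30$, $\sigma = 16.8$, $CV = \frac{\sigma}{\bar{x}} = 0.56$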

• A and B have the same mean, but the data dispersion is greater in B because its 𝜎 is greater (and so is its CV).

• The CV is useful when we compare two sets of data whose units are different.

• In A and B the CV contains the same information as 𝜎 because the units are the same (1, 2, 3, 4, 5). But if we compare B and C, 𝜎 is not useful: it seems that the data dispersion is greater in C (its 𝜎 is greater), but that is not true, because the units are different, so we have to use the CV. In fact, the data spread in B and C is the same (because they have the same CV).
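The data sets behind A, B and C are not reproduced in this transcript, so the following Python sketch uses hypothetical values to show the same effect: multiplying every value by 10 multiplies 𝜎 by 10 but leaves the CV unchanged.

```python
import math


def mean_sd_cv(values):
    """Return (mean, standard deviation, coefficient of variation)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, sd, sd / mean


# Hypothetical data on the scale 1..5 (NOT the data of the slide).
b = [1, 1, 2, 3, 4, 5, 5]
# The same data expressed in units ten times smaller.
c = [10 * v for v in b]

mean_b, sd_b, cv_b = mean_sd_cv(b)
mean_c, sd_c, cv_c = mean_sd_cv(c)

print(f"B: mean={mean_b:.2f} sd={sd_b:.2f} CV={cv_b:.2f}")
print(f"C: mean={mean_c:.2f} sd={sd_c:.2f} CV={cv_c:.2f}")
```

Because both the mean and 𝜎 scale by the same factor, their ratio does not depend on the units.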


Dispersion diagrams


If we have pairs of data (𝑥𝑖 , 𝑦𝑖) and we plot them as points, we obtain a dispersion diagram (also called a scatter plot).

Example:


Correlation

• If the point cloud is near a straight line, there is linear correlation.

• If the point cloud is near another type of curve, there is correlation, but it is nonlinear.

• If the point cloud is not near any curve, there is no correlation.

Example

• In a laboratory, we give some mice three medicines, A, B and C, in order to cure a disease. For each substance we plot the quantity of substance (X axis) against the number of dead mice (Y axis) in a dispersion diagram:

[Figure: three dispersion diagrams, labelled A, B and C.]

• In A there is positive linear correlation: as X increases, Y tends to increase. This is not a good choice because the medicine kills the mice.

• In B there is negative linear correlation: as X increases, Y decreases. B is a good medicine because it reduces the number of dead mice.

• In C there is no correlation. C does not have anything to do with the disease.
