
Page 1: An approach to Descriptive Statistics through real situationsoptimierung.mathematik.uni-kl.de/.../ver_texte/desc_english.pdf · An approach to Descriptive Statistics through real

MaMaEuSch

Management Mathematics for European Schools

http://www.mathematik.uni-kl.de/~mamaeusch

An approach to Descriptive Statistics through real situations

Paula Lagares Barreiro (1)

Federico Perea Rojas-Marcos (1)

Justo Puerto Albandoz (1)

MaMaEuSch (2)

Management Mathematics for European Schools
94342 - CP - 1 - 2001 - 1 - DE - COMENIUS - C21

(1) University of Seville

(2) This project has been carried out with the partial support of the European Community in the framework of the Sokrates programme. The content does not necessarily reflect the position of the European Community, nor does it involve any responsibility on the part of the European Community.


Contents

1 One-dimensional Descriptive Statistics
  1.1 Objectives
  1.2 The example: an opinion poll
  1.3 Population and samples
  1.4 Types of statistical variables: quantitative (discrete and continuous) and qualitative
  1.5 Frequency tables: absolute, relative and percentage frequencies
  1.6 Graphical methods
    1.6.1 Bar graph
    1.6.2 Histogram
    1.6.3 Frequency polygon
    1.6.4 Pie Chart
    1.6.5 Pictogram
    1.6.6 Stem and leaf plot
    1.6.7 Some remarks
  1.7 Measures of central tendency: mean, median, mode, quantiles
  1.8 Measures of variability: Range, variance, standard deviation
  1.9 Joint use of the mean and the standard deviation: Tchebicheff’s theorem, Pearson’s coefficient of variation, z-scores
    1.9.1 Tchebicheff’s theorem
    1.9.2 Pearson’s coefficient of variation
    1.9.3 Z-scores

2 Analysis of the opinion poll
  2.1 Conclusions

3 Two-dimensional Descriptive Statistics
  3.1 Objectives
  3.2 The example: an opinion poll
  3.3 Introduction and simple tables
  3.4 Frequency tables, marginal distributions and conditional distributions
  3.5 Scatterplots
  3.6 Functional dependence and statistical dependence
  3.7 Covariance
  3.8 Linear correlation
  3.9 Regression lines

Chapter 1

One-dimensional Descriptive Statistics

We are going to study an opinion poll. You will fill in a questionnaire, so that we can see what you think about a number of topics and study some characteristics such as height, number of brothers/sisters, etc. We will check whether your opinions coincide with those of your classmates, and whether many people in your classroom have characteristics similar to yours. For instance, how many of your classmates are taller than you? And how many of them have the same number of brothers/sisters as you? Before continuing, we state the main objectives that we want to achieve in this chapter.

1.1 Objectives

• To distinguish the different types of statistics.

• To determine which type of statistical procedure to use, depending on the type of data that we are studying.

• To get to know the concepts of central tendency and variability of a set of data.

• To determine the parameters of a statistical distribution.

• To study the coefficient of variation.

• To motivate through information given in examples and exercises about social, ecological and economic topics, etc.

1.2 The example: an opinion poll

From now on, we will work with an opinion poll. We want to know some things about the students in your class. We will ask you for some personal data, and then you will give us some information and opinions about many topics, such as sports, food, etc. Our poll will be anonymous, so that each of you can feel free to answer without worrying about who will read those opinions later. With these data, we will pose some interesting questions about ourselves as a group, which we may be able to use as a guide to answer other questions about a wider group of people. For instance,

• What is the most frequent height in your class?

• Can you consider your weekly pay normal compared with that of your classmates?

• How many of you practice sports often? How many have breakfast before coming to the high school?

• What kind of food do you eat most: fruit, milk, coffee, fish, . . . ?

We will see that, by analyzing the answers we get in the poll, you will be able to answer all the questions we have posed. By the end of this chapter we should have all the answers. But first of all, we are going to present the concepts that you will need.

1.3 Population and samples

Before answering all those questions, we have to clarify some things. Who do we want to get information about? We have already said that we want to know things about the students of your level, so our population will not be only the students of this class, but all the students of your level. But it would take too long to ask all those students, so we have decided to take a representative group from all the classrooms of your level, which in this case is your class. So you are the sample. Furthermore, each member of the population is called a data point.

Let us make some comments about what we have just said. First of all, we may want to study some characteristic of animals, plants or things, for instance the battery life of mobile phones; in this case the population is not "human", but consists of the different types of mobile phones. Moreover, there are situations in which the use of sampling is even more justified than in our case, for different reasons. If we want to know how all Spanish people will vote, we cannot ask all the inhabitants older than 18, because those are millions of people and that means lots of time and money. To study, for example, the average life of light bulbs, we cannot test all of them, because each test means that a bulb is blown; this is an example of a situation in which sampling means destroying a data point. Therefore, sampling is justified in many situations by reasons of time, money or destruction of the data points.

Exercise 1.3.1 The poll on the demand for university studies in Andalusia was carried out in 2001 to find out what the 65356 high school students wanted to study and why. To that end, data from 8500 students from all over Andalusia were collected. Could you say which are the sample and the population in this example? What are the reasons for choosing a sample in this example?

1.4 Types of statistical variables: quantitative (discrete and continuous) and qualitative

In order to answer many of our questions correctly, the first thing we have to do is decide what kind of method we want to apply to our data. Notice that not all the data we can collect are of the same kind. For instance, we can think about the answers to three questions of our poll:

1. The answer to the question sex (male or female).

2. The answer to the question number of brothers/sisters.

3. The answer to the question height.

The first thing we notice is that the answer to the first question is not numerical, whereas the answers to questions two and three are numerical. The characteristic corresponding to the answer to the first question is called qualitative, whereas the ones related to the answers to questions two and three are called quantitative. It is easy to see that quantitative variables allow operations that we cannot perform with qualitative characteristics. We call the different possibilities of a qualitative variable categories, and those of a quantitative variable values.

Let us now see what the differences between variables 2 and 3 are, because this is a little more complicated. The variable number of brothers/sisters takes numerical values that we can call "isolated", 0, 1, 2, 3, . . . , but it cannot take any value between two of those; for instance, it cannot take the value 3.5. This does not happen with the variable height. In fact, height can take any value between certain limits, since we can measure height as precisely as we want. We say that height can take any value in an interval. So the variable in case 2 is called discrete and the variable in case 3 is called continuous.

Exercise 1.4.1 Decide whether these variables are qualitative or quantitative and, if they are quantitative, whether they are discrete or continuous:

1. Number of babies born in a day.

2. Blood group of a person.

3. Time needed to solve a problem.

4. Number of questions in an exam.

5. Temperature of a person.

6. Political party voted for in the last elections.

7. Number of goals scored by a player in a season.

1.5 Frequency tables: absolute, relative and percentage frequencies

It is now time to start processing the data we have collected with our poll. The data that we have about the number of brothers/sisters are

0 1 3 2 0 1 0 1 1 2 2 3 1 2 1 1 1 1 0 0 4 2 3 1 2 1 2 1 1 0

while for the weights (in kg) we have

52 66 54 70 46 62 59 68 49 50 77 57 63 67 58 54 52 47 74 72 80 82 60 75 53 55 69 67 50 52

We can pose a lot of questions: how many of my classmates have the same number of brothers/sisters as I have? How many of them have more than me? And fewer than me? How many of my classmates weigh more than me? And less than me? To answer these questions, we have to count how many times each answer appears. Let us start by counting the answers about the number of brothers/sisters. This is what we have:

0   ||||| |            → 6
1   ||||| ||||| |||    → 13
2   ||||| ||           → 7
3   |||                → 3
4   |                  → 1

So we now know that there are 13 people who have 1 brother/sister. This number is called the absolute frequency and we denote it by $n_i$. And how many people have at most 1 brother/sister? In our case, the people who have 0 or 1 brother/sister, that is, 6 + 13 = 19. This number is called the cumulative absolute frequency and we denote it by $N_i$. We can now write the table of absolute and cumulative absolute frequencies:

N. bro/sis   absolute fr.   cum. absolute fr.
0            6              6
1            13             13 + 6 = 19
2            7              13 + 6 + 7 = 26
3            3              13 + 6 + 7 + 3 = 29
4            1              13 + 6 + 7 + 3 + 1 = 30
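The counting above can also be sketched in a few lines of Python. This is only an illustration under our own choice of variable names (`siblings`, `absolute`, `cumulative`); it uses the brothers/sisters data from the text.

```python
from collections import Counter
from itertools import accumulate

# Number of brothers/sisters of the 30 students (data from the text).
siblings = [0, 1, 3, 2, 0, 1, 0, 1, 1, 2, 2, 3, 1, 2, 1,
            1, 1, 1, 0, 0, 4, 2, 3, 1, 2, 1, 2, 1, 1, 0]

counts = Counter(siblings)
values = sorted(counts)                   # values in increasing order
absolute = [counts[v] for v in values]    # absolute frequencies n_i
cumulative = list(accumulate(absolute))   # cumulative frequencies N_i

for v, n_i, N_i in zip(values, absolute, cumulative):
    print(v, n_i, N_i)
```

The last cumulative frequency should always equal the number of data, which is a quick way to check the table.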

It is important to put the values of the characteristic in order from the smallest to the biggest if we want to calculate the cumulative frequencies correctly. We are now going to define other kinds of frequencies, because it is interesting to know the proportion of the total that a particular value represents: that is how we can compare it with other populations. For instance, in our case there are 6 students who have 0 brothers/sisters. Suppose that we have also asked a group of 50 people, and that 9 of them have 0 brothers/sisters. In which of the two groups is there a bigger proportion of people with no brothers/sisters? It is easy to see that the proportions are

$$\frac{6}{30} = 0.2 \quad \text{and} \quad \frac{9}{50} = 0.18$$

So the proportion is bigger in our group of 30 people. This proportion is called the relative frequency and we denote it by $f_i$. If we express it as a percentage (multiplying by 100) we get the percentage frequency, which in our case is 20% and 18% respectively. We denote these frequencies by $p_i$. We now add all these frequencies to our table and we get

Bro/sis   absolute fr.   relative fr.      percentage fr.   cum. abs. fr.              cum. rel. fr.
0         6              6/30 = 0.2        20%              6                          0.2
1         13             13/30 ≈ 0.433     43.3%            13 + 6 = 19                0.633
2         7              7/30 ≈ 0.233      23.3%            13 + 6 + 7 = 26            0.867
3         3              3/30 = 0.1        10%              13 + 6 + 7 + 3 = 29        0.967
4         1              1/30 ≈ 0.033      3.3%             13 + 6 + 7 + 3 + 1 = 30    1
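As a rough sketch, the relative, percentage and cumulative relative frequencies can be computed from the absolute frequencies as follows. Exact fractions are used to avoid rounding; the variable names and the comparison group `other_group` (9 out of 50, as in the example above) are our own labels.

```python
from fractions import Fraction

# Absolute frequencies n_i of the number of brothers/sisters (from the table).
absolute = {0: 6, 1: 13, 2: 7, 3: 3, 4: 1}
n = sum(absolute.values())

relative = {v: Fraction(n_i, n) for v, n_i in absolute.items()}   # f_i
percentage = {v: float(100 * f) for v, f in relative.items()}     # p_i

# Cumulative relative frequencies, adding the f_i in increasing order of value.
cum_relative = {}
running = Fraction(0)
for v in sorted(relative):
    running += relative[v]
    cum_relative[v] = running

# Proportion of people with no brothers/sisters in the second group of 50.
other_group = Fraction(9, 50)
print(float(relative[0]), float(other_group))
```

Comparing `relative[0]` with `other_group` reproduces the comparison of 0.2 with 0.18 made in the text.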

Let us now analyze the weight data. We count the different values:

46 | → 1
47 | → 1
49 | → 1
50 || → 2
52 ||| → 3
53 | → 1
54 || → 2
55 | → 1
57 | → 1
58 | → 1
59 | → 1
60 | → 1
62 | → 1
63 | → 1
66 | → 1
67 || → 2
68 | → 1
69 | → 1
70 | → 1
72 | → 1
74 | → 1
75 | → 1
77 | → 1
80 | → 1
82 | → 1

As you can see, most of the values have frequency 1 and our variable takes 25 different values. Those are too many different values to show in a table (even more so when we have only 30 data points). How can we get a table that better represents the distribution of the data? It seems logical to group similar data in intervals. There is a complete theory about how to group data properly. These are the main points we want to stress:

• The number of classes should be neither too high (around 6 to 8 is the maximum number we usually work with) nor too low (it makes no sense to group in 2 or 3 classes, because we would lose a lot of information).

• Except perhaps for the extreme classes, all the intervals should have the same width, because otherwise the information can be misinterpreted.

Can you imagine which intervals we are looking for? You can start by thinking about the number of classes you want to have. Note that between the highest value (82) and the lowest value (46) there is a difference of 36 kg. For instance, if we want to group in 6 classes, the width of each interval should be $\frac{36}{6} = 6$. So we obtain the following intervals: [46,52], (52,58], (58,64], (64,70], (70,76], (76,82]. Now we have one possible classification though, of course, there are many more. In some analyses you may find that the first interval is of the kind "smaller than 52" and the last interval "greater than 76". For the purpose of calculation, such intervals are treated as having the same width as the others. Once we have decided how to group the data, we can calculate the frequencies:

Weight    absolute fr.   relative fr.   percentage fr.   cum. abs. fr.   cum. rel. fr.
[46,52]   8              0.267          26.7%            8               0.267
(52,58]   6              0.2            20%              14              0.467
(58,64]   4              0.133          13.3%            18              0.6
(64,70]   6              0.2            20%              24              0.8
(70,76]   3              0.1            10%              27              0.9
(76,82]   3              0.1            10%              30              1
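The grouping can be sketched in Python as follows. This is a minimal illustration, assuming (as in the example) that the first class is closed and the others are open on the left; the function name `group_in_classes` is hypothetical.

```python
import math

# Weights in kg of the 30 students (data from the text).
weights = [52, 66, 54, 70, 46, 62, 59, 68, 49, 50, 77, 57, 63, 67, 58,
           54, 52, 47, 74, 72, 80, 82, 60, 75, 53, 55, 69, 67, 50, 52]

def group_in_classes(data, k):
    """Split [min, max] into k classes of equal width and count the data.

    The first class is [L0, L1]; the others are (L_{j-1}, L_j].
    Returns the limits L0..Lk and the absolute frequency of each class.
    """
    lowest, highest = min(data), max(data)
    width = (highest - lowest) / k
    limits = [lowest + j * width for j in range(k + 1)]
    freqs = [0] * k
    for x in data:
        # ceil(...) - 1 sends a value equal to an upper limit to the lower
        # class; max(0, ...) keeps the minimum itself in the first class.
        # Exact for this data; other widths may need care with rounding.
        j = max(0, math.ceil((x - lowest) / width) - 1)
        freqs[j] += 1
    return limits, freqs

limits, freqs = group_in_classes(weights, 6)
print(limits, freqs)
```

Changing the second argument lets you try a different number of classes and see how much detail is lost or kept.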

Moreover, when we work with grouped data we need to choose a representative of each interval. We call it the class mark, and it is the midpoint of the interval (the lower limit of the interval plus the upper limit, divided by 2).
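For instance, applying this definition to the six weight intervals obtained above would give the following class marks (the computation is ours, using the notation $c_i$ introduced later in the text):

$$c_1 = \frac{46 + 52}{2} = 49, \quad c_2 = \frac{52 + 58}{2} = 55, \quad c_3 = \frac{58 + 64}{2} = 61,$$

$$c_4 = \frac{64 + 70}{2} = 67, \quad c_5 = \frac{70 + 76}{2} = 73, \quad c_6 = \frac{76 + 82}{2} = 79.$$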

Exercise 1.5.1 Make the frequency table for the variable "answers to question 1.3" and for the answers to the question "height", deciding beforehand whether or not it is necessary to group the data in intervals.

1.6 Graphical methods

Once we have the frequency tables, imagine that your teacher asks you to present the conclusions you have obtained to the rest of the students. You can present your frequency tables and talk about the main conclusions, but is there a way of presenting the data so that the main conclusions can be seen more easily? As you may suppose, the answer to this question is yes. You may have seen in books, or mainly in the media, that data are usually presented graphically, so that they are more attractive to people and also easier to analyze. In this section we want to show the different types of graphs, and we are going to stress how important it is to choose the right type of graph for the data we are working with. Now that we have the frequency tables for the variables weight and number of brothers/sisters, we are going to use them to introduce the different graphs.

1.6.1 Bar graph

The first kind of graph we are going to study is the bar graph. This graph is used for qualitative variables and for discrete variables that are not grouped in intervals. We already know that number of brothers/sisters is a discrete variable, so let us see how to build a bar graph using those data. On the OX axis we place the categories, if we have a qualitative variable, or the values, if we have a discrete variable; in our example, those values are 0, 1, 2, 3 and 4. Over each of these values we place a rectangle or bar, all with equal bases, whose height is proportional to the corresponding frequency. In our case, we get a graph like this:

Figure 1.1: brothers/sisters (vertical bars)

Sometimes this graph is also presented with horizontal bars, like this:

Figure 1.2: brothers/sisters (horizontal bars)

1.6.2 Histogram

A histogram is a graph very similar to the bar graph, but it is used for variables grouped in intervals. We are going to build a histogram for the variable weight. As before, it is built by representing the intervals on the OX axis and placing over each of them a rectangle whose base has the same width as the interval and whose height is such that the area of the rectangle is proportional to the frequency of the interval. In this kind of graph the areas of the rectangles are very important, because a bar does not correspond to a single point: its width represents our interval. So, if all our intervals have the same width, the height can simply be the frequency; if not, we have to modify the height in order to keep the area proportional to the frequency. Our histogram for the variable weight, which we have already grouped, is:

Figure 1.3: weight (histogram)

We can represent it also with horizontal rectangles:

Figure 1.4: weight (histogram)

Surely you have seen a population pyramid in the media at some time. Notice that a population pyramid is in fact two horizontal histograms (one for women and another for men) in which we represent the number of inhabitants grouped by age.
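To illustrate how the height has to be adjusted when the intervals do not all have the same width, here is a minimal Python sketch. The regrouping used (merging the last two weight classes into (70,82]) is a hypothetical example that does not appear in the text, and the function name `bar_heights` is ours.

```python
# Each class is (lower limit, upper limit, absolute frequency).
# The last class is a hypothetical merge of (70,76] and (76,82],
# introduced only to obtain intervals of unequal width.
classes = [(46, 52, 8), (52, 58, 6), (58, 64, 4), (64, 70, 6), (70, 82, 6)]

def bar_heights(classes):
    """Height of each rectangle so that width * height equals the frequency."""
    return [freq / (upper - lower) for lower, upper, freq in classes]

heights = bar_heights(classes)

# Check: the area of each rectangle recovers the frequency of its class.
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, classes)]
print(heights, areas)
```

With these heights the merged class, although it has frequency 6, is drawn lower than the class (64,70], which has the same frequency but half the width.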

1.6.3 Frequency polygon

The next type of graph that we are going to define is the frequency polygon. This graph is used for quantitative variables, discrete or continuous. In order to draw it, we start from the histogram or from the bar graph, depending on whether the variable is grouped or not. We join with line segments the midpoints of the tops of the bars in the bar graph or the histogram. In our two examples, for the number of brothers/sisters we get the following graph:

Figure 1.5: brothers/sisters (frequency polygon)

The case of the weight is a little different. In this situation, the area under the line represents the data we have, as in the histogram, because we are dealing with the whole width of each interval. The graph looks like this:

Figure 1.6: weight (frequency polygon)

All the graphs that we have seen so far can also be drawn for relative frequencies and for cumulative frequencies.

1.6.4 Pie Chart

The next type of graph that we are going to present is a well-known one, the pie chart. In a pie chart, we assign to each category or value a sector of a circle whose area is proportional to the frequency. This graph is usually used for qualitative variables and for discrete variables that are not grouped.

Figure 1.7: brothers/sisters (pie chart)
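Since the area of a sector is proportional to its central angle, one way to draw the pie chart is to give each value an angle of 360° times its relative frequency. The symbol $\alpha_i$ for the angle is our own notation, not the text's. For the brothers/sisters data this would give:

$$\alpha_i = 360^\circ \cdot f_i = 360^\circ \cdot \frac{n_i}{n}$$

$$360^\circ \cdot \frac{6}{30} = 72^\circ, \quad 360^\circ \cdot \frac{13}{30} = 156^\circ, \quad 360^\circ \cdot \frac{7}{30} = 84^\circ, \quad 360^\circ \cdot \frac{3}{30} = 36^\circ, \quad 360^\circ \cdot \frac{1}{30} = 12^\circ$$

for the values 0, 1, 2, 3 and 4 respectively. The five angles add up to 360°, as they should.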

1.6.5 Pictogram

Pictograms are a kind of graph that is very frequent in the media. They are graphs in which a picture related to the variable is used to represent the frequencies. But we have to stress something again: the size (and not only the height) of each picture has to be proportional to the frequency that we want to represent. It is usual to write the frequency next to the picture as well, to avoid mistakes.

1.6.6 Stem and leaf plot

There is a representation that lies between a graph and a tally of the data: the stem and leaf plot. We are going to see how to make one using the weight example. Recall that the data we had are:

52 66 54 70 46 62 59 68 49 50 77 57 63 67 58 54 52 47 74 72 80 82 60 75 53 55 69 67 50 52

In a stem and leaf plot, the first thing we have to do is write in a column the different tens digits that we can find in the data. In our example, as our values range between 46 and 82, we have to write 4, 5, 6, 7 and 8 in the following way:

4
5
6
7
8

Next, we take the first observation, 52, and we place its units digit next to the corresponding tens digit, that is,

4
5 2
6
7
8

We keep placing the units digits next to the tens digits for the rest of the data. What we get is something like this:

4 697
5 249078423502
6 62837097
7 07425
8 02

Notice that we have something similar (but not equal) to a bar graph or a histogram. Obviously we could have made it vertically, and we would have something like this:

    2
    0
    5
    3
    2 7
    4 9
    8 0
    7 7 5
    0 3 2
  7 9 8 4
  9 4 2 7 2
  6 2 6 0 0
  4 5 6 7 8

That looks like a histogram or a bar graph, though it is not one. But the stem and leaf plot can be taken as an approximation to the distribution of the data. In fact, we have only divided the data by tens (from 40 to 49, from 50 to 59, . . . ), but we could divide them in groups of 5 (from 40 to 44, from 45 to 49, from 50 to 54, . . . ) just by writing each tens digit twice: next to the first one we place the units digits between 0 and 4, and next to the second one the units digits between 5 and 9. In our example, for the horizontal case, we would have:

4
4 697
5 24042302
5 9785
6 230
6 68797
7 042
7 75
8 02
8
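The construction of the horizontal plot can be sketched in Python as below. This is only an illustration for two-digit whole numbers like the weights; the function name `stem_and_leaf` and its `split` parameter are our own, and the leaves are kept in the order in which the data appear, as in the text.

```python
# Weights in kg of the 30 students (data from the text).
weights = [52, 66, 54, 70, 46, 62, 59, 68, 49, 50, 77, 57, 63, 67, 58,
           54, 52, 47, 74, 72, 80, 82, 60, 75, 53, 55, 69, 67, 50, 52]

def stem_and_leaf(data, split=False):
    """Return the lines of a stem and leaf plot (stems = tens, leaves = units).

    With split=True every stem is written twice: first with the leaves
    0 to 4, then with the leaves 5 to 9.
    """
    rows = {}
    for x in data:
        stem, leaf = divmod(x, 10)
        half = 1 if split and leaf >= 5 else 0
        rows.setdefault((stem, half), []).append(str(leaf))
    halves = (0, 1) if split else (0,)
    lines = []
    for stem in range(min(data) // 10, max(data) // 10 + 1):
        for half in halves:
            leaves = "".join(rows.get((stem, half), []))
            lines.append(f"{stem} {leaves}".rstrip())
    return lines

print("\n".join(stem_and_leaf(weights)))
print("\n".join(stem_and_leaf(weights, split=True)))
```

The two calls should reproduce the plot by tens and the plot by groups of 5 shown above.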


1.6.7 Some remarks

Imagine that you see the following two graphs showing the profits of a company. Which one would you choose to be your company?

Figure 1.8: profits (company 1 and company 2)

Most of you would probably choose company 2, because it looks better than company 1, but in fact the data in the two graphs are the same. We have only changed the scale of the OY axis. We make some remarks before starting the next section. Graphs are a very useful tool and they make it easier to draw conclusions from our data, but they must be drawn correctly in order to avoid mistakes. It is very important to keep the proportions among the pictures we draw and to make sure that the axis scales are kept proportional too, because small changes in the scales make big differences in appearance and the graph can be misunderstood.

1.7 Measures of central tendency: mean, median, mode, quantiles

Let us suppose now that we are planning a trip with the whole class and we want to earn some money, so we have decided to sell t-shirts, but we do not know what the appropriate price is. The only thing we know is that we pay 4 euros for each of them. We would like to make a profit, but we cannot set a high price because we want everybody to buy our t-shirts. We think that the weekly pay is a good reference for what the students can afford. So we are going to use the weekly pay data that we have:

6 8 10 5 15 20 9 10 9 9 20 15 12 6 15 12 10 25 20 30 15 12 9 20 6 9 10 25 9 9

We have 30 values, but we need only one value to represent them all. Which value can we choose? A first solution might be to choose an intermediate value among all the data we have. To get it, we add up all the numbers and divide the sum by the total number of data, so we have:

$$\bar{x} = \frac{6 + 8 + 10 + 5 + 15 + 20 + 9 + 10 + 9 + 9 + 20 + 15 + 12 + 6 + 15 + 12 + 10 + 25 + 20 + 30 + 15 + 12 + 9 + 20 + 6 + 9 + 10 + 25 + 9 + 9}{30} = \frac{390}{30} = 13$$

Now we have the first possible price, 13 euros. The number we have just calculated is called the mean. But there are more possibilities; for instance, we can choose the most frequent value to represent our data. In our example, the most frequent value is 9, which can also be a good choice for a price. We call the most frequent value the mode. But neither of those two numbers says anything about the number of people who can afford the t-shirt. So we have another idea. Let us sort the data we have:

5 6 6 6 8 9 9 9 9 9 9 9 10 10 10 10 12 12 12 15 15 15 15 20 20 20 20 25 25 30

Now we want to find the value that leaves half of the data on each side. The values in positions 15 and 16 leave 14 values on each side; as both of them are equal to 10, we can consider that 10 is the value that leaves half of the data on each side. This number is called the median. Just as we have proposed a value that leaves 50% of the data on each side, we can look for a price that 75% of the class can afford, that is, the value that leaves 25% of the data on the left (this means that only 25% of the data are lower than that value), or use any other percentage. These numbers are called quantiles.
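These three candidate prices can be checked with Python's standard `statistics` module. The helper `value_at_fraction` is our own, hypothetical function: it implements one simple convention for quantiles of raw data (the value in position ⌈p·n⌉ of the sorted list), and other conventions exist that can give slightly different results.

```python
import math
import statistics

# Weekly pay in euros of the 30 students (data from the text).
pay = [6, 8, 10, 5, 15, 20, 9, 10, 9, 9, 20, 15, 12, 6, 15,
       12, 10, 25, 20, 30, 15, 12, 9, 20, 6, 9, 10, 25, 9, 9]

mean = statistics.mean(pay)      # sum of the data divided by 30
mode = statistics.mode(pay)      # most frequent value
median = statistics.median(pay)  # mean of the values in positions 15 and 16

def value_at_fraction(data, p):
    """Value in position ceil(p * n) of the sorted data (one simple convention)."""
    ordered = sorted(data)
    position = max(1, math.ceil(p * len(ordered)))
    return ordered[position - 1]

print(mean, mode, median, value_at_fraction(pay, 0.25))
```

With this convention the value that leaves about 25% of the data on its left is 9, so a price of 9 euros would be affordable for roughly three quarters of the class.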

We can now choose any of those three values, depending on what we intend in each case or on which value best represents the whole data set. Those three values are not always valid in every case, but they can help us to see where the center of the distribution is. These are the main measures of central tendency. We are now going to define formally the concepts that we have presented. From now on we speak about variables.

Let us suppose that we have observed a variable in $n$ data points and we got $k$ different values, $x_1, x_2, \dots, x_k$, with frequencies $n_1, n_2, \dots, n_k$, where $n_i$ is the absolute frequency of the value $x_i$. We denote by $N_i = \sum_{j \le i} n_j$ the cumulative absolute frequency of the value $x_i$ and by $f_i = \frac{n_i}{n}$ the relative frequency. If the values of the variable are grouped, we can suppose we have $h$ intervals that we denote by

$$(L_0, L_1],\ (L_1, L_2],\ \dots,\ (L_{h-1}, L_h]$$

whose class marks are $c_1, c_2, \dots, c_h$. In this case, the absolute frequencies are denoted by $n_1, n_2, \dots, n_h$, the cumulative absolute frequencies by $N_1, N_2, \dots, N_h = n$ and the relative frequencies by $f_1, f_2, \dots, f_h$.

Then, the mean is defined as follows:

$$\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n}$$

for variables that are not grouped. If we have a grouped variable, we use the class marks $c_i$ instead of the values $x_i$. The main characteristics of the mean are the following:

• It is the center of gravity of the distribution and it is unique.

• When there are extreme or scarcely representative values (too big or too small), the mean may not be representative.

• It makes no sense to calculate the mean for a qualitative variable, or for grouped data if any of the intervals is unbounded.

• For grouped data, we use the class mark of each interval to calculate the mean.

Moreover, the mean has the following properties:

• If a constant is added to each value, the mean increases by that constant as well.

• If we multiply all the values by a constant, the mean is also multiplied by the same constant.
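The definition and the two properties can be tried out with a short Python sketch. The function name `mean_from_table` is hypothetical; the frequency table of the weekly pay is obtained by counting the sorted data, and the class marks of the weights are the midpoints of the intervals used in section 1.5.

```python
def mean_from_table(values, frequencies):
    """Mean of a frequency table: sum of x_i * n_i divided by n."""
    n = sum(frequencies)
    return sum(x * n_i for x, n_i in zip(values, frequencies)) / n

# Weekly pay: different values x_i and their absolute frequencies n_i.
pay_values = [5, 6, 8, 9, 10, 12, 15, 20, 25, 30]
pay_freqs = [1, 3, 1, 7, 4, 3, 4, 4, 2, 1]
pay_mean = mean_from_table(pay_values, pay_freqs)

# Grouped weights: the class marks replace the values x_i, so the result
# is an approximation based on the grouped table.
class_marks = [49, 55, 61, 67, 73, 79]
weight_freqs = [8, 6, 4, 6, 3, 3]
weight_mean = mean_from_table(class_marks, weight_freqs)

# The two properties: adding 2 to every value, multiplying every value by 3.
shifted = mean_from_table([x + 2 for x in pay_values], pay_freqs)
scaled = mean_from_table([3 * x for x in pay_values], pay_freqs)
print(pay_mean, weight_mean, shifted, scaled)
```

The constants 2 and 3 are arbitrary choices made only for the illustration.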

The mode is usually defined as the most frequent value. For a variable that is not grouped, it is the value that appears most often. For a variable grouped in intervals of the same width, we look for the interval with the highest frequency (the modal class or modal interval) and approximate the mode through the formula:

$$Mo = L_{i-1} + \frac{n_i - n_{i-1}}{(n_i - n_{i-1}) + (n_i - n_{i+1})} \cdot c_i$$

where:
$L_{i-1}$ is the lower limit of the modal interval.
$n_i$ is the absolute frequency of the modal interval.
$n_{i-1}$ is the absolute frequency of the interval preceding the modal interval.
$n_{i+1}$ is the absolute frequency of the interval following the modal interval.
$c_i$ is the width of the interval (here the letter does not denote the class mark).

The mode verifies that:

• There can be more than one mode in a distribution. In that case, we say that we have a bimodal, trimodal, . . . distribution, depending on the number of values that have the highest absolute frequency.

• The mode is usually a worse representative than the mean, except in the case of qualitative data.

• If the intervals have different widths, we have to look for the interval with the highest frequency density (that is, the result of dividing the absolute frequency by the width of the interval, $\frac{n_i}{c_i}$) and then use the preceding formula.

The median is, for a non-grouped variable and once we have sorted our data, the central value if there is an odd number of observations, and the mean of the two central values if there is an even number of observations. If we have a grouped variable, we have to look for the central interval (the one that contains the central value), that is to say, the first interval whose cumulative absolute frequency $N_i$ is bigger than $\frac{n}{2}$, and then we can apply the formula:

$$Me = L_{i-1} + \frac{\frac{n}{2} - N_{i-1}}{n_i} \cdot c_i$$

where
$L_{i-1}$ is the lower limit of the central interval.
$n_i$ is the absolute frequency of the central interval.
$N_{i-1}$ is the cumulative absolute frequency of the interval before the central interval.
$n$ is the number of data.
$c_i$ is the width of the interval.

Moreover, the quantiles are position measures that generalize the concept of median. We are now going to define the centiles or percentiles, the quartiles and the deciles. We suppose that we have sorted our data.

The centiles or percentiles are the values of the variable that leave a given percentage of the data on their left side. We denote them by $P_h$ or $C_h$, where $h$ is the percentage, $h = 1, 2, \ldots, 99$. If we have a grouped variable, once we have the interval that contains the centile, we apply the following formula:

$$P_h = C_h = L_{i-1} + \frac{\frac{h \cdot n}{100} - N_{i-1}}{n_i} \cdot c_i$$

where the different elements have the same meaning as in the median case.

The quartiles are the values that, once we have sorted the data, divide them into 4 equal groups, each of which contains 25% of the data points. We denote them by $Q_1$, $Q_2$ and $Q_3$, and they verify that $Q_1 = C_{25}$, $Q_2 = C_{50} = Me$, $Q_3 = C_{75}$.

The deciles are the values that, once we have sorted our data, divide them into 10 equal groups, in such a way that between any two consecutive deciles there is 10% of the data points. We denote them by $D_1, D_2, D_3, \ldots, D_9$. They verify that $D_1 = C_{10}$, $D_2 = C_{20}$, $D_3 = C_{30}$, . . . , $D_9 = C_{90}$.
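Since the median, quartiles and deciles are all particular centiles, a single function covers them for grouped data. The following is a minimal sketch; the function name `grouped_percentile` and the four intervals with their frequencies are hypothetical, introduced only to illustrate the interpolation formula.

```python
def grouped_percentile(h, lower_limits, widths, freqs):
    """Centile C_h for grouped data, using the interpolation formula."""
    n = sum(freqs)
    target = h * n / 100
    cumulative = 0  # N_{i-1}: cumulative frequency before the current interval
    for lower, width, n_i in zip(lower_limits, widths, freqs):
        if cumulative + n_i > target:  # first interval where N_i exceeds h*n/100
            return lower + (target - cumulative) / n_i * width
        cumulative += n_i
    raise ValueError("h must be smaller than 100")


# Hypothetical grouped data: intervals [40,50), [50,60), [60,70), [70,80)
lower_limits = [40, 50, 60, 70]
widths = [10, 10, 10, 10]
freqs = [4, 10, 12, 4]

q1 = grouped_percentile(25, lower_limits, widths, freqs)      # Q1 = C25
median = grouped_percentile(50, lower_limits, widths, freqs)  # Me = C50
d9 = grouped_percentile(90, lower_limits, widths, freqs)      # D9 = C90
print(q1, round(median, 2), d9)  # 53.5 60.83 72.5
```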

Exercise 1.7.1 For the data on number of brothers/sisters and weight, calculate the mean, mode, median and the quantiles Q1, Q3, C30, C74, D4, D9.

1.8 Measures of variability: Range, variance, standard deviation

Imagine that we have 3 different data sets about the weights of certain people and we know that in the 3 cases the mean of the variable weight is 55. Does this mean that the 3 sets are equal or similar? We get the data and we find that the observations are:

Set 1: 55 55 55 55 55 55 55
Set 2: 47 51 54 55 56 59 63
Set 3: 39 47 53 55 57 63 71

We can see that, though they have the same mean, the data sets are very different. Look at their stem and leaf plots:


Set 1:
3 |
4 |
5 | 5 5 5 5 5 5 5
6 |
7 |

Set 2:
3 |
4 | 7
5 | 1 4 5 6 9
6 | 3
7 |

Set 3:
3 | 9
4 | 7
5 | 3 5 7
6 | 3
7 | 1

Then, how can we detect those differences among the data sets? It seems that the measures of central tendency do not give us enough information in every situation, so we have to look for other measures that tell us how far the data are from the mean. In other words, we need the concept of variability of the data. The first thing we notice is that in the first case all the data are equal, in the second one there is some difference between the biggest and the smallest values, and in the third case this difference is even larger. Precisely, we have that

55 − 55 = 0
63 − 47 = 16
71 − 39 = 32

These numbers are called the range of the data. Although the range is very easy to calculate, it is not used very much, because a single very small or very big value in our data changes it a lot, so it is not a useful measure for every situation. How can we find a number that gives us an approximation to the distance between the data and the mean? We can calculate the distance from every data point to the mean (in absolute value) and then calculate the mean of those distances. This is what we call the mean deviation. Let us calculate the mean deviation for the second group of data:

$$\frac{|47-55| + |51-55| + |54-55| + |55-55| + |56-55| + |59-55| + |63-55|}{7} = \frac{8 + 4 + 1 + 0 + 1 + 4 + 8}{7} = \frac{26}{7} = 3.714$$

Nevertheless, we usually use a different measure of variability: the mean of the squared deviations of the data from the mean, in which the biggest deviations have a greater influence. We now present the formal definition of all these concepts.

The range is the difference between the biggest and the smallest value of the variable, if it is not grouped. If we have a grouped variable, we calculate the difference between the upper limit of the last interval and the lower limit of the first interval.
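The range and the mean deviation of the three weight sets above can be checked with a few lines of Python. This is only a sketch; the helper names `data_range` and `mean_deviation` are introduced here for illustration.

```python
sets = {
    "Set 1": [55, 55, 55, 55, 55, 55, 55],
    "Set 2": [47, 51, 54, 55, 56, 59, 63],
    "Set 3": [39, 47, 53, 55, 57, 63, 71],
}


def data_range(values):
    """Difference between the biggest and the smallest value."""
    return max(values) - min(values)


def mean_deviation(values):
    """Mean of the absolute deviations from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


for name, values in sets.items():
    print(name, data_range(values), round(mean_deviation(values), 3))
# Set 1 0 0.0
# Set 2 16 3.714
# Set 3 32 7.429
```

All three sets share the mean 55, but both measures grow from Set 1 to Set 3.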

The range only depends on the biggest and the smallest elements, and not on the rest of the data. For instance, we could have the following two data sets with the same range:

Figure 1.9: range

It is easy to see that the difference between $x_k$ and $x_1$ is the same in both situations, but the two sets are very different.

The interquartile range is the difference between the third and the first quartiles, and it gives us a zone that contains the central 50% of the distribution.

The mean deviation is the mean of the deviations of the data from the mean. We call deviation from the mean the absolute value of the difference between a value of the variable and the mean, $|x_i - \bar{x}|$, so the definition of the mean deviation is

$$DM = \frac{\sum_{i=1}^{k} |x_i - \bar{x}| \cdot n_i}{n}$$

This measure is not used very often, because the absolute value function makes it difficult to work with. In any case, a small mean deviation means that the data are highly concentrated around the mean. We can also define the median deviation, though it is even less usual. The definition is:

$$D = \frac{\sum_{i=1}^{k} |x_i - Me| \cdot n_i}{n}$$

The variance is the mean of the squared deviations of the data from the mean. We denote it by $S^2$ and its expression is

$$S^2 = \frac{\sum_{i=1}^{k} (x_i - \bar{x})^2 \cdot n_i}{n} = \frac{\sum_{i=1}^{k} x_i^2 \cdot n_i}{n} - \bar{x}^2$$
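The second form of the variance (mean of the squares minus square of the mean) follows from expanding the square. A short derivation, using only $\bar{x} = \frac{1}{n}\sum_{i=1}^{k} x_i n_i$ and $n = \sum_{i=1}^{k} n_i$:

```latex
\[
\frac{1}{n}\sum_{i=1}^{k} (x_i - \bar{x})^2 n_i
  = \frac{1}{n}\sum_{i=1}^{k} x_i^2 n_i
    - 2\bar{x}\cdot\frac{1}{n}\sum_{i=1}^{k} x_i n_i
    + \bar{x}^2\cdot\frac{1}{n}\sum_{i=1}^{k} n_i
  = \frac{1}{n}\sum_{i=1}^{k} x_i^2 n_i - 2\bar{x}^2 + \bar{x}^2
  = \frac{1}{n}\sum_{i=1}^{k} x_i^2 n_i - \bar{x}^2
\]
```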

The variance verifies that:

• As we are taking the square of the deviations, the bigger ones have more influence on the result.

• The measurement units of $S^2$ are not the same as those of the sample, because we have the square of the deviations.

• Variance is never negative. It is 0 only when all the values coincide with the mean.

We define the quasivariance as

$$s^2 = \frac{\sum_{i=1}^{k} (x_i - \bar{x})^2 \cdot n_i}{n - 1}$$

Its relation to the variance is $S^2 = \frac{n-1}{n} s^2$. This is a very useful measure when we work with inference. Sometimes it is also denoted by $S_c^2$.

The standard deviation is the square root of the variance. We denote it by $S$ and its expression is


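$$S = +\sqrt{\frac{\sum_{i=1}^{k} (x_i - \bar{x})^2 \cdot n_i}{n}} = +\sqrt{\frac{\sum_{i=1}^{k} x_i^2 \cdot n_i}{n} - \bar{x}^2} = +\sqrt{\overline{x^2} - \bar{x}^2}$$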

Its main properties are

• It is the most usual measure of variability.

• It has the same measurement units as the sample.

• Standard deviation is always positive or 0.

Moreover, variance and standard deviation verify:

• If we add a constant to all the data, the variance and the standard deviation stay the same.

• If we multiply all the values by a positive constant, the variance is multiplied by the square of the constant, and the standard deviation is multiplied by the constant.
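These definitions and properties can be checked numerically. The sketch below uses Set 2 from the beginning of this section; the function names and the constants 10 and 3 are arbitrary choices made here for illustration.

```python
import math


def variance(values):
    """Mean of the squared deviations from the mean (divisor n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def quasivariance(values):
    """Same sum of squares with divisor n - 1, i.e. S^2 * n / (n - 1)."""
    n = len(values)
    return variance(values) * n / (n - 1)


data = [47, 51, 54, 55, 56, 59, 63]  # Set 2 from the text
shifted = [v + 10 for v in data]     # add a constant to every value
scaled = [3 * v for v in data]       # multiply every value by a positive constant

print(round(variance(data), 3))             # 23.143
print(round(quasivariance(data), 3))        # 27.0
print(round(math.sqrt(variance(data)), 3))  # 4.811

# Adding a constant leaves the variance unchanged ...
print(math.isclose(variance(shifted), variance(data)))     # True
# ... and multiplying by 3 multiplies it by 3 squared.
print(math.isclose(variance(scaled), 9 * variance(data)))  # True
```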

1.9 Joint use of the mean and the standard deviation: Tchebicheff's theorem, Pearson's coefficient of variation, z-scores

1.9.1 Tchebicheff’s theorem

We have already found measures that give us the center of the data and their variability, but we still need more information. Let us recall the data about the number of brothers/sisters:

Num brothers   absolute fr.
0              6
1              13
2              7
3              3
4              1

so we have that

$$\bar{x} = 1.3333, \quad S^2 = 1.022, \quad S = 1.011$$

How many people are there around the mean? Are there many students that have 1 or 2 brothers/sisters? Let us take an interval centered on the mean, that is, $(\bar{x} - a, \bar{x} + a)$. We know that variance and standard deviation measure variability, so we will try to use them now. Which one would you use? We should reject the variance, because we cannot add it to the mean: they have different measurement units. Let us take then the standard deviation, $a = S$. We get the interval (1.3333 − 1.011, 1.3333 + 1.011) = (0.3223, 2.3443). Inside this interval we find the students having 1 or 2 brothers/sisters. These are 20 of the 30 students, i.e., 66% of them. What happens if we use 2S instead of S? We get the interval (1.3333 − 2.022, 1.3333 + 2.022) = (−0.6887, 3.3553).

Inside this interval we have 29 of the 30 students, i.e., 96% of them. Obviously, if we calculate the interval for 3S we find that all the data are inside it. But the next question is: does this always happen? Are these concentrations of data always the same? Let us see another example using the weekly pay. We have that

$$\bar{x} = 13, \quad S^2 = 39.2, \quad S = 6.26$$

Then,

(13 − 6.26, 13 + 6.26) = (6.74, 19.26) → contains 19 data (63%)
(13 − 12.52, 13 + 12.52) = (0.48, 25.52) → contains 29 data (96%)
(13 − 18.78, 13 + 18.78) = (−5.78, 31.78) → contains 30 data (100%)

As you can see, we get very similar results. This is no coincidence: there is a theorem that guarantees that these intervals contain a certain percentage of the data. Precisely, the theorem states that an interval of the form $(\bar{x} - aS, \bar{x} + aS)$ contains at least $100\left(1 - \frac{1}{a^2}\right)\%$ of the data. This statement is known as Tchebicheff's theorem.
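The bound can be compared with the observed proportions for the brothers/sisters data. The following is a minimal sketch; the helper name `share_within` is introduced here for illustration, and the bound is only informative for a > 1 (for a = 1 it is 0).

```python
import math

# Number of brothers/sisters -> absolute frequency (table from the text)
freq = {0: 6, 1: 13, 2: 7, 3: 3, 4: 1}

n = sum(freq.values())
mean = sum(x * f for x, f in freq.items()) / n
sd = math.sqrt(sum((x - mean) ** 2 * f for x, f in freq.items()) / n)


def share_within(a):
    """Fraction of the data strictly inside (mean - a*sd, mean + a*sd)."""
    inside = sum(f for x, f in freq.items() if abs(x - mean) < a * sd)
    return inside / n


for a in (1, 2, 3):
    bound = 1 - 1 / a ** 2  # Tchebicheff's lower bound
    print(a, round(share_within(a), 3), ">=", round(bound, 3))
# 1 0.667 >= 0.0
# 2 0.967 >= 0.75
# 3 1.0 >= 0.889
```

The observed proportions are well above the guaranteed minimum, which is typical: the theorem gives a lower bound valid for any distribution, not an estimate.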

1.9.2 Pearson’s coefficient of variation

We are now going to work with height and weight data. For the weight we have

$$\bar{x} = 60.8, \quad S^2 = 99.56, \quad S = 9.97$$

while for the heights we have

$$\bar{x} = 1.7133, \quad S^2 = 0.0128, \quad S = 0.1132$$

In which case do we have more variability? We could think that it is the weight data, because the variance and the standard deviation are bigger, but look what happens if we calculate the same measures for the heights in centimeters:

$$\bar{x} = 171.33, \quad S^2 = 128.35, \quad S = 11.32$$

If we repeat the question now, what would you answer? In fact, we can compare neither standard deviations nor variances, because they depend on the units, just like the mean. We should find a dimensionless measure. So far, we only know that the mean and the standard deviation have the same measurement units, so how can we get a dimensionless measure from them? We can divide them, and then we get Pearson's coefficient of variation:

$$CV = \frac{S}{\bar{x}}$$

We can calculate it for our examples. For the weight we have that

$$CV = \frac{9.97}{60.8} = 0.164$$

and for the height

$$CV = \frac{11.32}{171.33} = \frac{0.1132}{1.7133} = 0.066$$

so there is more variability in the weights than in the heights.
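A short sketch makes the unit independence explicit: the coefficient of variation of the heights is the same in meters and in centimeters. It uses the summary values quoted above; the function name is chosen here for illustration.

```python
def coefficient_of_variation(sd, mean):
    """Pearson's coefficient of variation: standard deviation over mean."""
    return sd / mean


weight_cv = coefficient_of_variation(9.97, 60.8)
height_cv_m = coefficient_of_variation(0.1132, 1.7133)   # heights in meters
height_cv_cm = coefficient_of_variation(11.32, 171.33)   # heights in centimeters

print(round(weight_cv, 3))                               # 0.164
print(round(height_cv_m, 3), round(height_cv_cm, 3))     # 0.066 0.066
print(weight_cv > height_cv_m)                           # True
```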

1.9.3 Z-scores

We can still find more information in our data. Imagine that your height is 1.74 and you have a friend in another class whose height is the same. Inside each class, which of you is relatively taller? How can we compare these two values if we only know that the mean in your friend's class is 1.708 and the standard deviation is 0.1253? There is a way to change these two values into "comparable" ones. These are what we call z-scores: a z-score is calculated as the difference between the value and its mean, divided by the standard deviation. With this, the two new values belong to distributions with mean 0 and standard deviation 1, and so we can compare them.

In our example we have the following z-scores

$$z_1 = \frac{1.74 - 1.7133}{0.1132} = 0.236$$

$$z_2 = \frac{1.74 - 1.708}{0.1253} = 0.255$$

And we conclude that your friend is taller than you (each one inside his or her class) because the z-score is bigger. The formula for the z-score related to the value $x_i$ is

$$z_i = \frac{x_i - \bar{x}}{S}$$
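The two z-scores of the example can be reproduced as follows. This is a minimal sketch using the means and standard deviations that appear in the two formulas above (all in meters); the function name `z_score` is chosen here for illustration.

```python
def z_score(value, mean, sd):
    """Standardized value: distance to the mean in units of standard deviation."""
    return (value - mean) / sd


z_you = z_score(1.74, mean=1.7133, sd=0.1132)
z_friend = z_score(1.74, mean=1.708, sd=0.1253)
print(round(z_you, 3), round(z_friend, 3))  # 0.236 0.255
```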


Chapter 2

Analysis of the opinion poll

We are now going to make a deeper analysis of some of the items in the opinion poll. We have chosen 3 items:

2.1 You smoke
2.3 You read books other than school books
3.1 You practice some sport outside of high school

The data we have from question 2.1 are
1 3 5 5 5 5 5 1 1 5 1 3 3 1 5 1 5 5 5 5 5 5 1 5 1 5 4 4 3 5

from question 2.3 we have
1 1 1 2 2 2 3 4 4 4 1 3 2 4 1 2 1 3 2 1 1 1 2 1 1 1 1 2 2 4

and from question 3.1
3 1 3 5 3 4 2 1 3 3 3 5 5 1 2 1 2 3 5 1 2 5 3 2 4 1 5 5 4 3

The first thing we are going to do is to calculate the frequencies in all cases in order to have the frequency tables for all of them. For question 2.1 we have that

Answer (2.1)   abs fr   rel fr   perc fr   cum abs fr   cum rel fr
1              8        0.26̂     26.6̂%     8            0.26̂
2              0        0        0%        8            0.26̂
3              4        0.13̂     13.3̂%     12           0.4
4              2        0.06̂     6.6̂%      14           0.46̂
5              16       0.53̂     53.3̂%     30           1

For question 2.3 we have the following frequency table

Answer (2.3)   abs fr   rel fr   perc fr   cum abs fr   cum rel fr
1              13       0.43̂     43.3̂%     13           0.43̂
2              9        0.3      30%       22           0.73̂
3              3        0.1      10%       25           0.83̂
4              5        0.16̂     16.6̂%     30           1
5              0        0        0%        30           1

and finally, the frequency table for question 3.1 is

Answer (3.1)   abs fr   rel fr   perc fr   cum abs fr   cum rel fr
1              6        0.2      20%       6            0.2
2              5        0.16̂     16.6̂%     11           0.36̂
3              9        0.3      30%       20           0.6̂
4              3        0.1      10%       23           0.76̂
5              7        0.23̂     23.3̂%     30           1
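Tables like these can be built mechanically from the raw answers. The sketch below does it for question 2.1 with Python's standard library; the function name `frequency_table` and the column layout of the printout are choices made here for illustration.

```python
from collections import Counter

answers_2_1 = [1, 3, 5, 5, 5, 5, 5, 1, 1, 5, 1, 3, 3, 1, 5,
               1, 5, 5, 5, 5, 5, 5, 1, 5, 1, 5, 4, 4, 3, 5]


def frequency_table(data, categories):
    """Rows of (value, absolute, relative, cumulative absolute, cumulative relative)."""
    counts = Counter(data)
    n = len(data)
    rows, cumulative = [], 0
    for value in categories:
        absolute = counts[value]  # Counter returns 0 for answers nobody gave
        cumulative += absolute
        rows.append((value, absolute, absolute / n, cumulative, cumulative / n))
    return rows


table = frequency_table(answers_2_1, range(1, 6))
for value, ab, rel, cum_ab, cum_rel in table:
    print(f"{value}  {ab:2d}  {rel:.3f}  {100 * rel:5.1f}%  {cum_ab:2d}  {cum_rel:.3f}")
```

Passing the full list of possible answers (1 to 5) as `categories` keeps the rows with frequency 0, such as answer 2 here.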

Just looking at the data in the tables, we can notice that the three variables are very different. We will now try to see graphically how these variables are distributed, and then we will discuss the first conclusions.

As you can notice, we have three discrete variables, so we are going to use the bar graph and the pie chart. These are the graphs for question 2.1:

Figure 2.1: answers to question 2.1

Let us now show the graphs for question 2.3:

Figure 2.2: answers to question 2.3

And here are the ones for question 3.1:

Figure 2.3: answers to question 3.1

We can now discuss the first conclusions. It is quite obvious that for question 2.1 the most frequent values are the extreme ones, 1 and 5. That is because there is a tendency to relate number 1 with the people that don't smoke and number 5 with the people that do smoke. In any case, most of the data are placed in the bigger values (3, 4 and 5). On the contrary, in question 2.3 we can see that the most frequent values are the smaller ones, so we can say that reading is not a very "popular" hobby. The answers to the third question are a little more "spread" over all the values.

It is also interesting in this example to draw a bar graph with the cumulative absolute frequencies. We show you the three graphs, in which you can see that the frequencies are more gradually distributed in the third case:

Figure 2.4: cumulative bar graphs

In any case, we are now going to confirm what we see by calculating the main measures of central tendency. We present them in a table, in order to make them easier to compare:

         Mean   Median   Mode
Q. 2.1   3.6    5        5
Q. 2.3   2      2        1
Q. 3.1   3      3        3

This table gives us some interesting information. It is quite simple to see that, though the mean for question 2.1 is 3.6, most of the data are bigger than the mean, because both the median and the mode are 5. For question 2.3 the situation is very different: we can see that most of the data are around the smallest values, and even the mode is the smallest one. In question 3.1 we can notice that the 3 values coincide, so number 3 is the best one to represent our data.

Let us now calculate the main measures of variability, and then we will try to see which variable is more spread.

         Range   Variance   Standard deviation
Q. 2.1   4       3          1.73
Q. 2.3   3       1.24       1.11
Q. 3.1   4       2.06       1.43

In our example, the range is not very relevant, because all the answers range between 1 and 5. The only thing we can learn from the fact that in question 2.3 the range is 3 (smaller than the others) is that one of the extreme values has frequency 0 (in this case, value 5). The range does not detect every empty value, however: in question 2.1 the frequency of value 2 is also 0, and the range is still 4. From the standard deviation we can conclude that the answers to question 2.1 are very spread. This is true because, if you take a look at the data, you can find that most of them are extreme values, 1 or 5. The other two variables are a bit more concentrated around the mean, especially the answers to question 2.3.

Let us now check whether the mean is representative in our variables. We shall calculate the coefficient of variation in each case. We have that

         Coefficient of variation
Q. 2.1   0.48
Q. 2.3   0.55
Q. 3.1   0.47

So the mean is representative for the three cases we are studying.
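The measures in these tables can be recomputed from the raw answers with Python's `statistics` module. This is only a sketch for checking the arithmetic. One observation, offered with caution: the variances and standard deviations tabulated above appear to match the n − 1 divisor (the quasivariance, `statistics.variance`) rather than the divisor n (`statistics.pvariance`), so the sketch prints both.

```python
import statistics

answers = {
    "2.1": [1, 3, 5, 5, 5, 5, 5, 1, 1, 5, 1, 3, 3, 1, 5,
            1, 5, 5, 5, 5, 5, 5, 1, 5, 1, 5, 4, 4, 3, 5],
    "2.3": [1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 1, 3, 2, 4, 1,
            2, 1, 3, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 4],
    "3.1": [3, 1, 3, 5, 3, 4, 2, 1, 3, 3, 3, 5, 5, 1, 2,
            1, 2, 3, 5, 1, 2, 5, 3, 2, 4, 1, 5, 5, 4, 3],
}

for question, data in answers.items():
    mean = statistics.mean(data)
    summary = {
        "mean": mean,
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "range": max(data) - min(data),
        "variance (n)": statistics.pvariance(data),
        "variance (n-1)": statistics.variance(data),
        "sd (n-1)": statistics.stdev(data),
        "CV (n-1)": statistics.stdev(data) / mean,
    }
    print(question, {name: round(value, 3) for name, value in summary.items()})
```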

2.1 Conclusions

In this last section of the analysis, it is important to stress the meaning of the data we are studying. Until now, we have been talking about the statistical characteristics of the data, but we cannot forget that all those data have their own meaning.

We can notice that smoking is very popular among young people. More than half of this class says that they smoke every day, and only 8 people say that they never smoke. If we add the frequencies of the students that smoke at least sometimes, we get 22 of you, almost three quarters of the total.

On the contrary, there is very little interest in reading. 22 of you say that you never or rarely read a book other than the ones you need for school. This is maybe one of the biggest contrasts we can get from the poll. None of you says that you read every day, though there are 5 people that say they read often.

Sports are the middle ground. This is maybe because many of you practice some sport on the weekends or when the weather is good, while the ones that practice sports very often balance the ones that almost never practice any sport.


Chapter 3

Two-dimensional Descriptive Statistics

In the previous chapter, we worked with the data we got from a poll and we obtained the first conclusions. But we want to know more, because we can extract more information from those data with certain methods that we are going to study from now on. Before going on, we state our objectives in this chapter.

3.1 Objectives

• To represent and analyze data on two variables through a scatterplot.

• To identify as a two-dimensional distribution a data set on two variables given in a table or by a scatterplot.

• To analyze the relationship between two variables through their scatterplot, establishing by intuition whether this relationship is positive or negative, whether it is functional or not and, if it is, whether it is close to a line.

• To compare global tasks of several distributions through their scatterplots.

• To assign given scatterplots to different situations.

• To determine the relationship between the different means through the scatterplot.

• To find, in a graphical way, a line that fits the scatterplot.

• To estimate the correlation coefficient from a scatterplot.

• To analyze the degree of the relationship between two variables when the correlation coefficient is known.

• To calculate the correlation coefficient in two-dimensional distributions and the regression lines.

• To make predictions from the regression line.

3.2 The example: an opinion poll

In this chapter we will go deeper into the analysis of the opinion poll we have been working with. From the information that we already have, we will try to answer questions like

• Is there any relationship between the pay you receive and the number of brothers/sisters you have?

• Does the sport you practice have any influence on how much you smoke or how much alcohol you drink?

• Can we measure these relationships precisely?

Throughout this chapter we will try to answer these questions and many more. From now on we present the concepts that will be necessary to get these answers.

3.3 Introduction and simple tables

We can think about many variables that can have influence over many others. For instance, we can think that the older you are, the bigger the pay you get. We are going to see if that is really true. So, as you already know from the previous chapter, the first thing we have to do is to organize our data. We recall that the data about ages and pays that we had are the following:

Age   Pay      Age   Pay
16    6        17    12
16    8        16    10
16    10       18    25
16    5        18    20
17    15       18    30
18    20       19    15
16    9        17    12
17    10       16    9
17    9        19    20
17    9        16    6
19    20       16    9
16    15       16    10
17    12       17    25
16    6        16    9
17    15       16    9

These are the pairs of data that we have. Let us start by grouping the pairs that are equal. We get the following table

Age   Pay   Number
16    5     1
16    6     3
16    8     1
16    9     5
16    10    3
16    15    1
17    9     2
17    10    1
17    12    3
17    15    2
17    25    1
18    20    2
18    25    1
18    30    1
19    15    1
19    20    2
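Grouping equal pairs is exactly what a counter does. A minimal sketch, with the 30 (age, pay) pairs listed in the order in which they appear above (the variable names are chosen here for illustration):

```python
from collections import Counter

ages = [16, 16, 16, 16, 17, 18, 16, 17, 17, 17, 19, 16, 17, 16, 17,
        17, 16, 18, 18, 18, 19, 17, 16, 19, 16, 16, 16, 17, 16, 16]
pays = [6, 8, 10, 5, 15, 20, 9, 10, 9, 9, 20, 15, 12, 6, 15,
        12, 10, 25, 20, 30, 15, 12, 9, 20, 6, 9, 10, 25, 9, 9]

# Count how many times each (age, pay) pair occurs
simple_table = Counter(zip(ages, pays))

for (age, pay), number in sorted(simple_table.items()):
    print(age, pay, number)
```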

The table we have just built is called a simple table, and it will be the starting point for our analysis.

3.4 Frequency tables, marginal distributions and conditional distributions

Is it simple for you to draw conclusions from the previous table? Can we find any other way to represent our data? The idea is to avoid the repeated values that we can see in the column of ages and also in the column of pays. We can group our data in the following way

Pay \ Age   16   17   18   19
5            1    0    0    0
6            3    0    0    0
8            1    0    0    0
9            5    2    0    0
10           3    1    0    0
12           0    3    0    0
15           1    2    0    1
20           0    0    2    2
25           0    1    1    0
30           0    0    1    0


This table gives us a more global view of the distribution of the frequencies, and the more different values we have, the more useful the table is. We call it a table on two variables when we are representing two quantitative variables, and a contingency table when we have two qualitative variables. But from these tables, can we obtain the total number of people whose pay is 12 euros? And the total number of people whose age is 17? Obviously, the answer is yes. Notice that you can add all the frequencies appearing in the row related to value 12 of the pay, and so get the number of people whose pay is 12. In the same way, we can add all the frequencies in the column related to value 17 of the age, and we will have the total number of people that are 17. We add these numbers to our table and we have

Pay \ Age   16   17   18   19   Tot
5            1    0    0    0     1
6            3    0    0    0     3
8            1    0    0    0     1
9            5    2    0    0     7
10           3    1    0    0     4
12           0    3    0    0     3
15           1    2    0    1     4
20           0    0    2    2     4
25           0    1    1    0     2
30           0    0    1    0     1
Tot         14    9    4    3    30

In fact, what you have just obtained are the values of each of the two variables considered independently of the other. These values are called the marginal distributions of the variables. To obtain the whole marginal distribution of the variable age we take the first and the last row,

Age         16   17   18   19
frequency   14    9    4    3
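The marginal totals can be obtained by adding the simple-table counts over one of the two variables. A minimal sketch, with the simple table written as a dictionary (the variable names are introduced here for illustration):

```python
from collections import defaultdict

# Simple table from the text: (age, pay) -> number of students
simple_table = {
    (16, 5): 1, (16, 6): 3, (16, 8): 1, (16, 9): 5, (16, 10): 3, (16, 15): 1,
    (17, 9): 2, (17, 10): 1, (17, 12): 3, (17, 15): 2, (17, 25): 1,
    (18, 20): 2, (18, 25): 1, (18, 30): 1,
    (19, 15): 1, (19, 20): 2,
}

age_marginal = defaultdict(int)
pay_marginal = defaultdict(int)
for (age, pay), number in simple_table.items():
    age_marginal[age] += number  # column totals of the two-way table
    pay_marginal[pay] += number  # row totals of the two-way table

print(dict(sorted(age_marginal.items())))  # {16: 14, 17: 9, 18: 4, 19: 3}
print(pay_marginal[12], age_marginal[17])  # 3 9
```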

We can do this also for the variable pay, taking the first and the last column.

Exercise 3.4.1 Can you build that similar table for the variable pay?

In general, a table on two variables is defined as follows:

X \ Y   y1    y2    ...   yp    ...   ym    Tot
x1      n11   n12   ...   n1p   ...   n1m   n1∗
x2      n21   n22   ...   n2p   ...   n2m   n2∗
...     ...   ...   ...   ...   ...   ...   ...
xs      ns1   ns2   ...   nsp   ...   nsm   ns∗
...     ...   ...   ...   ...   ...   ...   ...
xk      nk1   nk2   ...   nkp   ...   nkm   nk∗
Tot     n∗1   n∗2   ...   n∗p   ...   n∗m   n

where the values or characteristics of $X$ are $x_1, x_2, \ldots, x_k$ and the ones of $Y$ are $y_1, y_2, \ldots, y_m$; $n_{ij}$ is the number of data points presenting characteristic $x_i$ for the variable $X$ and $y_j$ for the variable $Y$. Moreover, $n_{i*}$ denotes the number of data points presenting the characteristic $x_i$ and $n_{*j}$ the number of data points presenting the characteristic $y_j$. $n$ is the total number of elements of the population or the sample.

Once we know the marginal distributions, we can calculate the mean and the standard deviation of each of them as if they were one-dimensional variables. Their expressions are:

$$\bar{x} = \frac{\sum_{i=1}^{k} x_i n_{i*}}{n} \qquad S_x = \sqrt{\frac{\sum_{i=1}^{k} (x_i - \bar{x})^2 n_{i*}}{n}}$$

$$\bar{y} = \frac{\sum_{j=1}^{m} y_j n_{*j}}{n} \qquad S_y = \sqrt{\frac{\sum_{j=1}^{m} (y_j - \bar{y})^2 n_{*j}}{n}}$$

Exercise 3.4.2 Which are the mean and the standard deviation of the pay and the age?

One of your classmates has a question. He is 17 and he wants to know whether his pay is among the higher or the lower ones, in order to ask for a raise if it is too low. To find out, he wants to compare himself with all the other students of his age, so he takes out the data of the students that are 17:

Pay        5   6   8   9   10   12   15   20   25   30
Age = 17   0   0   0   2   1    3    2    0    1    0

As this boy has a pay of 10 euros, he sees that most of his classmates have a higher pay than him, so he is going to ask for a raise.

What we have just calculated is the conditional distribution of the variable pay for a fixed value of the age, in this case 17. We have again a one-dimensional variable, for which we can calculate the measures of central tendency and of variability that we already know.
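A conditional distribution is obtained by keeping only the rows of the simple table with the fixed value. The sketch below reproduces the comparison made by the 17-year-old student; the function name `conditional_pay` is introduced here for illustration.

```python
# Simple table from the text: (age, pay) -> number of students
simple_table = {
    (16, 5): 1, (16, 6): 3, (16, 8): 1, (16, 9): 5, (16, 10): 3, (16, 15): 1,
    (17, 9): 2, (17, 10): 1, (17, 12): 3, (17, 15): 2, (17, 25): 1,
    (18, 20): 2, (18, 25): 1, (18, 30): 1,
    (19, 15): 1, (19, 20): 2,
}


def conditional_pay(table, age):
    """Distribution of the pay restricted to the students of the given age."""
    return {pay: number for (a, pay), number in sorted(table.items()) if a == age}


pay_at_17 = conditional_pay(simple_table, 17)
total = sum(pay_at_17.values())
higher = sum(number for pay, number in pay_at_17.items() if pay > 10)

print(pay_at_17)            # {9: 2, 10: 1, 12: 3, 15: 2, 25: 1}
print(higher, "of", total)  # 6 of 9
```

Six of the nine 17-year-olds receive more than 10 euros, which supports the student's conclusion.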

Exercise 3.4.3 Calculate the frequency table for the variable age for pay=15 euros.

Exercise 3.4.4 Calculate the frequency table, with the marginal frequencies, for the weight and the answer to question 3.1

3.5 Scatterplots

As usually happens with one-dimensional variables, data are more easily analyzed if we represent them in a graph. The situation now is different, though, because we need to represent two variables together with their frequencies. To do that we use a graph called a scatterplot. We are now going to explain how to draw it: we represent the variable pay on the OX axis and the variable age on the OY axis. For each pair of values we either draw a point whose size is proportional to its frequency, or we draw as many points as the frequency indicates.

Figure 3.1: scatterplot

The shape of the cloud of points in the scatterplot can give us an idea of the possible dependence between the variables, as we will see in what follows.

Exercise 3.5.1 Draw the scatterplot of the variables weight and the answer to the question 3.1

3.6 Functional dependence and statistical dependence

Suppose that you are studying the following variables:

• The height and the size of the foot of a person

• The weekly pay and the height

• The number of members of a family and the number of rooms of their house.

• The height from where we throw something and the time until it gets to the floor.

• The weight and the number of brothers/sisters

For each of the situations, we would like to know whether there is any relationship between the variables that we study, that is, whether the value of one of them has influence over the other. Case 4 (the falling object) is, for instance, very clear. We have learnt in physics that there is a functional relationship between those variables, an equation relating both. In other cases we can think that there is no relation, as in cases 2 and 5, but in cases 1 and 3 there is a possible relation that we cannot be sure of.

Scatterplots can have very different shapes and can help us to see how the variables behave. We will use them as a first approach, though later we will use more rigorous methods to decide whether two variables are related.

As we have just seen, there are several levels in the relationship between variables. We say that there is a functional dependence if we are in a situation similar to case 4 above, that is, $Y$ depends functionally on $X$ when we can assign to each value $x_i$ a unique value $y_j$ in such a way that $y_j = f(x_i)$. This means that a value of one variable determines exactly the value of the other one. The functional dependence is linear when all the pairs lie on a line; it is curvilinear when they lie on a curve defined by the function $y = f(x)$.

Two variables $X$ and $Y$ are said to be independent if the value of one of them has no influence over the other one. This means that the relative conditional distributions coincide.

In the rest of the situations we talk about statistical dependence or relation. This dependence can be stronger or weaker depending on the situation. We can get an idea of how strong (or weak) it is through the scatterplot, taking into account that it is stronger the closer the data are to the graph of a function.

Scatterplots in which we can see linear or curvilinear dependence are:

Figure 3.2: linear dependence

Figure 3.3: curvilinear dependence

Exercise 3.6.1 Can you draw any conclusion about the possible dependence between the weight and the answer to question 3.1 from the scatterplot you drew in the previous section?


3.7 Covariance

Recall the scatterplot of the two variables we are studying. It is not easy to conclude which kind of relationship there is between them. But, for instance, do you think that the pay grows when the age grows? Do you think it happens the other way round? We are now trying to find a number that lets us decide whether the relationship is direct or inverse. For that we will use the covariance, which is defined as follows:

$$S_{xy} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{m} (x_i - \bar{x})(y_j - \bar{y})\, n_{ij}}{n} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{m} x_i\, y_j\, n_{ij}}{n} - \bar{x}\,\bar{y}$$

The covariance is also known as the joint variance of the two variables. If the covariance is positive, the relationship is direct, and if the covariance is negative, the relationship is inverse. As we know that the average age is 16.86̂ and the average pay is 13, we obtain that $S_{xy}$ = 4.53̂, and so the relationship is direct.
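Both forms of the covariance can be evaluated directly on the simple table of ages and pays. This is a minimal sketch for checking the arithmetic (the variable names are chosen here for illustration):

```python
# Simple table from the text: (age, pay) -> number of students
simple_table = {
    (16, 5): 1, (16, 6): 3, (16, 8): 1, (16, 9): 5, (16, 10): 3, (16, 15): 1,
    (17, 9): 2, (17, 10): 1, (17, 12): 3, (17, 15): 2, (17, 25): 1,
    (18, 20): 2, (18, 25): 1, (18, 30): 1,
    (19, 15): 1, (19, 20): 2,
}

n = sum(simple_table.values())
mean_age = sum(age * c for (age, pay), c in simple_table.items()) / n
mean_pay = sum(pay * c for (age, pay), c in simple_table.items()) / n

# Definition: mean of the products of the deviations
cov_definition = sum((age - mean_age) * (pay - mean_pay) * c
                     for (age, pay), c in simple_table.items()) / n

# Shortcut: mean of the products minus the product of the means
cov_shortcut = (sum(age * pay * c for (age, pay), c in simple_table.items()) / n
                - mean_age * mean_pay)

print(round(mean_age, 3), mean_pay)                       # 16.867 13.0
print(round(cov_definition, 3), round(cov_shortcut, 3))   # 4.533 4.533
```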

You can notice that in the expression of the covariance, its sign depends on the difference (xi−x)and (yj − y). Let us see what happens with the covariance in certain situations. We represent 3scatterplots, in which we mark the point (x, y) that is the gravity center of the distributions (seefigure 3.4).

Figure 3.4: covariance

We can see that in graph number 2 we have a large covariance because the differences $(x_i - \bar{x})$ and $(y_j - \bar{y})$ always have the same sign (the points $(x_i, y_j)$ always lie in the first and third quadrants defined by the axes centered on $(\bar{x}, \bar{y})$). As the products of these differences are positive, they all contribute in a positive way to the sum.

In the other 2 cases there is no linear relationship, so the sum contains both positive and negative terms, because there are data points in all four quadrants; some terms cancel out others and the result can be close to 0.

Notice that the covariance is a measure that depends on the measurement units, as happened with the variance and the standard deviation, so we shall look for another, dimensionless measure that allows us to compare distributions.
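To make the dependence on the units concrete, suppose that every value of X is multiplied by a constant $a > 0$ and every value of Y by a constant $b > 0$ (for instance, $b = 100$ when a pay in euros is rewritten in cents). The constants $a$ and $b$ are introduced here only for illustration. The means are multiplied by the same constants, so from the definition of the covariance:

$$ S_{(aX)(bY)} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{m}(a x_i - a\bar{x})(b y_j - b\bar{y})\,n_{ij}}{n} = a\,b\,S_{xy} $$

The same data therefore give a covariance 100 times larger when the pay is expressed in cents, although the relationship between the variables has not changed.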

3.8 Linear correlation

We are now looking for a measure that tells us the degree of relationship existing between two variables (in a direct or inverse way). We also want to use it to measure the linear relationship between them.

We start from the covariance that we have just presented, which depends on the product of the measurement units of the two variables, because $(x_i - \bar{x})$ depends on the measurement units of X and $(y_j - \bar{y})$ depends on the measurement units of Y, while $n_{ij}$ and $n$ are dimensionless. We should divide $S_{xy}$ by a quantity chosen so that those two measurement units disappear. If you remember, the variance depends on the square of the measurement units of the variable, so we cannot use it, but the standard deviation depends on the measurement units of the variable itself. This means that the product $S_x S_y$ depends on the product of the measurement units of X and Y, and this is what we were looking for. So, we define the linear correlation coefficient as follows:

$$ r = \frac{S_{xy}}{S_x\,S_y} $$

Let us calculate it in our example. We know that $S_{xy} = 4{,}5\hat{3}$, $S_x = 1{,}008$ and $S_y = 6{,}368$, so $r = 0{,}706$, but what does this mean?

The value of r is always between −1 and 1. If the value of r is near −1 or 1, then the linear dependence between the variables is strong: direct if it is near 1 and inverse if it is near −1.

If the value of r is near 0, the linear dependence, if it exists at all, is weak. If the value of r is exactly 1 or −1, the dependence is linear and all the points lie on a line.

In our example, then, the value r = 0,706 confirms that the relationship is direct and shows that it is quite strong.
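The calculation above, and the fact that r does not depend on the measurement units, can be checked with a short Python sketch. The first part uses only the values quoted in the text; the list `euros` is a hypothetical set of (age, pay) observations invented for illustration, and the helper names `summary` and `correlation` are assumptions of this sketch.

```python
from math import sqrt

def summary(pairs):
    """Return (S_xy, S_x, S_y) for a list of (x, y) observations, dividing by n."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x, _ in pairs) / n)
    s_y = sqrt(sum((y - mean_y) ** 2 for _, y in pairs) / n)
    return s_xy, s_x, s_y

def correlation(pairs):
    """Linear correlation coefficient r = S_xy / (S_x * S_y)."""
    s_xy, s_x, s_y = summary(pairs)
    return s_xy / (s_x * s_y)

# Values quoted in the text: S_xy = 4.5333..., S_x = 1.008, S_y = 6.368.
r_text = (4 + 8 / 15) / (1.008 * 6.368)

# Hypothetical observations (age, pay in euros), NOT the survey data.
euros = [(16, 10), (16, 12), (17, 10), (17, 20), (18, 18), (18, 20)]
# The same observations with the pay expressed in cents.
cents = [(age, 100 * pay) for age, pay in euros]

cov_euros = summary(euros)[0]
cov_cents = summary(cents)[0]
r_euros = correlation(euros)
r_cents = correlation(cents)
print(round(r_text, 3), cov_euros, cov_cents, r_euros, r_cents)
```

Changing the units multiplies the covariance by 100 but leaves r unchanged, which is the reason for dividing by $S_x S_y$.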

Exercise 3.8.1 Calculate the linear correlation coefficient of the variables weight and answer to question 3.1. What can we say about the relationship between them?

3.9 Regression lines

Let us suppose that you know that a boy from the high school has a pay of 18 euros, but you don't know his age. We could think about predicting the value that the variable age should have for this boy. How could we do this? Throughout this chapter we have been discussing the possible relationship between the variables, so this is the moment to use it. If we were able to write the equation that relates the age and the pay, we would only have to substitute and we would have the value that we want.

But, unfortunately, this is not so simple. As we know that the linear correlation between the two variables is quite high, we can try to find the line that best fits the points and then substitute the value of the pay in order to get the value of the age. This line is called the regression line. Let us define it and later we will calculate the one for our example.

Let X, Y be two variables. We define the regression line as the line that minimizes the sum of the squares of the distances between the data points and the estimated points.

For the regression line of Y over X, which has the form $y = ax + b$, we have to minimize the sum of the squares of the distances between the values $y_j$ and the values expected for them, $a x_i + b$. The equation of this line is:

$$ Y - \bar{y} = \frac{S_{xy}}{S_x^2}\,(X - \bar{x}) $$

We will use this line when we want to estimate the value of Y from the value of X.

In the case of the regression line of X over Y, which has the form $x = cy + d$, we minimize the sum of the squares of the distances between the values $x_i$ and the predictions for those values, $c y_j + d$. The equation of this line is:

$$ X - \bar{x} = \frac{S_{xy}}{S_y^2}\,(Y - \bar{y}) $$

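We will use this line when we want to predict the value of X from the value of Y.

Let us calculate the regression line for our example. Since we want to predict the age from the pay, we now take the pay as X and the age as Y (note that this swaps the roles the two variables had in the previous sections), so we have to calculate the line of Y over X. We have that: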

$$ \bar{x} = 13 \qquad \bar{y} = 16{,}8\hat{6} \qquad S_{xy} = 4{,}5\hat{3} \qquad S_x = 6{,}368 \qquad S_x^2 = 40{,}551 $$

so the line we are looking for is

$$ Y - 16{,}8\hat{6} = \frac{4{,}5\hat{3}}{40{,}551}\,(X - 13) $$

or equivalently

$$ Y - 16{,}8\hat{6} = 0{,}111\,(X - 13) \;\Rightarrow\; Y = 0{,}111\,X + 15{,}413 $$

so, if the pay of this boy is x = 18 euros, his age should be

$$ Y = 0{,}111 \cdot 18 + 15{,}413 = 17{,}41 $$

i.e., this boy should be 17 years old.

We have to make some remarks about the regression line. The first is that the two regression lines (X over Y and Y over X) intersect at the point $(\bar{x}, \bar{y})$, except when the linear correlation is 1 or −1, in which case the two lines coincide.
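The worked example and the remark about the intersection point can be reproduced with a short Python sketch. It uses only the summary values quoted in the text, written with decimal points (pay as X, age as Y); the function name `regression_line` is an assumption of this sketch. Because the slope is not rounded to 0,111 here, the prediction may differ slightly in the second decimal from the value in the text.

```python
# Summary values quoted in the text, with the pay as X and the age as Y.
mean_pay = 13.0
mean_age = 16 + 13 / 15      # 16.8666...
s_xy = 4 + 8 / 15            # 4.5333...
var_pay = 6.368 ** 2         # S_x^2, about 40.551
var_age = 1.008 ** 2         # S_y^2

def regression_line(mean_from, mean_to, cov, var_from):
    """Slope and intercept of the line that predicts 'to' from 'from'."""
    slope = cov / var_from
    intercept = mean_to - slope * mean_from
    return slope, intercept

# Line of Y over X: age = a * pay + b.
a, b = regression_line(mean_pay, mean_age, s_xy, var_pay)
# Line of X over Y: pay = c * age + d.
c, d = regression_line(mean_age, mean_pay, s_xy, var_age)

# Predicted age for a pay of 18 euros.
predicted_age = a * 18 + b
print(round(a, 3), round(b, 3), round(predicted_age, 2))

# Both lines pass through the point (mean_pay, mean_age).
print(a * mean_pay + b, c * mean_age + d)
```

The last line prints the mean age and the mean pay again, which is the numerical counterpart of saying that the two lines meet at $(\bar{x}, \bar{y})$.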

If we want to make predictions using the regression line, we should check that we are in one of the following situations:

• We can conclude from the scatterplot that there is a possible linear relationship between the variables.

• The linear correlation coefficient is near 1 or −1.

• Common sense tells us that there is a possible relationship between the variables.

An alternative way of expressing the regression lines is the following:

• For the regression line of Y over X, this is $y = ax + b$, where

$$ a = \frac{S_{xy}}{S_x^2} \qquad b = \bar{y} - \frac{S_{xy}}{S_x^2}\,\bar{x} $$

• For the regression line of X over Y, this is $x = cy + d$, where

$$ c = \frac{S_{xy}}{S_y^2} \qquad d = \bar{x} - \frac{S_{xy}}{S_y^2}\,\bar{y} $$
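A sketch of where these coefficients come from, for the line of Y over X, assuming that the sum of squares is weighted by the frequencies $n_{ij}$ in the same way as the covariance (the name $F$ is introduced here only for this derivation). We look for the values of $a$ and $b$ that minimize

$$ F(a, b) = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{m}(y_j - a x_i - b)^2\,n_{ij} $$

Setting the derivative with respect to $b$ equal to zero gives

$$ \bar{y} - a\,\bar{x} - b = 0 \;\Rightarrow\; b = \bar{y} - a\,\bar{x} $$

Setting the derivative with respect to $a$ equal to zero and substituting this value of $b$ gives

$$ \left(\frac{\sum_{i=1}^{k}\sum_{j=1}^{m} x_i\,y_j\,n_{ij}}{n} - \bar{x}\,\bar{y}\right) - a\left(\frac{\sum_{i=1}^{k}\sum_{j=1}^{m} x_i^2\,n_{ij}}{n} - \bar{x}^2\right) = 0 \;\Rightarrow\; S_{xy} - a\,S_x^2 = 0 \;\Rightarrow\; a = \frac{S_{xy}}{S_x^2} $$

The coefficients $c$ and $d$ of the line of X over Y follow in the same way, exchanging the roles of the two variables.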

Exercise 3.9.1 Calculate the regression lines for the variables weight and answer to question 3.1. If a student weighs 67 kg, can you predict what the answer to question 3.1 would be?
