Statistics I Chapter 2: Analysis of univariate data

Upload: vankhue

Post on 16-Mar-2018


Page 1: Statistics I Chapter 2: Analysis of univariate · PDF fileI Bar and pie charts, pictograms, histograms, frequency polygons. Other graphs. Lying with graphs. 2.Numerical summary:


Chapter 2: Analysis of univariate data

Contents

1. Representations and graphs

- Frequency tables.
- Bar and pie charts, pictograms, histograms, frequency polygons. Other graphs. Lying with graphs.

2. Numerical summary:

- Central tendency (mean, median, mode)
- Location (quartiles and percentiles). Box plots.
- Spread (variance, standard deviation, range, IQR, coefficient of variation)
- Shape (coefficients of skewness and kurtosis)

Chapter 2: Analysis of univariate data

Recommended reading

- Peña, D., Romo, J. Introducción a la Estadística para las Ciencias Sociales (1997).
  - Chapters 2, 3, 4 and 5.
- Newbold, P. Statistics for Business and Economics (2008).
  - Chapters 1 and 2.

Description of qualitative variables

- Sample: 46 professionals of a computer company in the United States.
- Variable: EDUC: education level (1=High School; 2=College; 3=Advanced Degree)
- Variable: MGT: position of responsibility (1=yes; 0=no)

To extract information from the data:

How can we summarize the raw data in a more useful way that allows a quick visual interpretation?

Description of qualitative variables: frequency tables and bar charts

Education level    Number of employees    Proportion of employees
High School        14                     0.304
College            19                     0.413
Advanced Degree    13                     0.283
Total              46                     1

Description of qualitative variables: general outline of a frequency table

Class, $c_i$    Absolute freq., $n_i$    Relative freq., $f_i$
$c_1$           $n_1$                    $f_1 = n_1/n$
$c_2$           $n_2$                    $f_2 = n_2/n$
...             ...                      ...
$c_k$           $n_k$                    $f_k = n_k/n$
Total           $n$                      1

Note:

- $n_i$ = number of observations of class $c_i$ in the sample, $f_i = n_i/n$
- $0 \le f_i \le 1$
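As a minimal sketch of how such a table can be built from raw data, the following Python snippet counts each class and divides by the sample size. The raw list `educ` is hypothetical: it is simply arranged to reproduce the counts of the education-level table (14, 19, 13), since the original observations are not listed here.

```python
from collections import Counter

# Hypothetical raw EDUC codes, arranged to match the counts in the table
# (1 = High School, 2 = College, 3 = Advanced Degree).
educ = [1] * 14 + [2] * 19 + [3] * 13
labels = {1: "High School", 2: "College", 3: "Advanced Degree"}

n = len(educ)                                        # sample size n
absolute = Counter(educ)                             # n_i: observations per class
relative = {c: absolute[c] / n for c in absolute}    # f_i = n_i / n

for c in sorted(absolute):
    print(f"{labels[c]:<16} {absolute[c]:>3} {relative[c]:.3f}")
print(f"{'Total':<16} {n:>3} {sum(relative.values()):.3f}")
```

The relative frequencies always add up to 1 (up to rounding), which is a quick sanity check on any frequency table.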

Description of qualitative variables: the bar chart

- Bars are of the same width and equally spaced, with heights corresponding to frequencies
- There are gaps between bars
- Bars are labeled with class names

Other graphics: the pie chart

- Each slice is a fraction of the total size of the pie
- Many software programs order the slices alphabetically
- Although 'pretty', pie charts are harder to interpret than bar charts
- Avoid 3D pie charts: slices in the background appear smaller than slices in the foreground

The pie chart: example

[Figure: pie chart built in Excel from a pivot table (Tabla dinámica)]

Sample: the first 568 episodes of The Simpsons

Variable: character performing the leading role (the one who speaks the most) in an episode

Note: You can obtain the same chart from the raw data (without using a pivot table). Check the supplementary materials about Excel usage.

Other graphics: the Pareto chart

- Bar chart in which the categories of the variable are ranked in decreasing order of frequency.
- Applies only to nominal qualitative variables.
- Useful for detecting the most significant “reasons” (a few options account for almost all the purchasing frequency)

Pareto Principle (80-20 rule)
Based on empirical knowledge, Pareto stated in 1896 that society was divided into two groups in an 80-20 proportion, the “few of many” and the “many of few”:

- A minority group made up of 20 % of the population who owned 80 % of something.
- A majority group made up of 80 % of the population who owned the remaining 20 %.

The Pareto chart: example

- Sample: among the 1100 visitors to the art exhibition “Turner and the Masters” (Prado Museum, June 22 to September 19, 2010), those who bought their tickets online (20.3 %). Source: Institute for Tourism Studies (Instituto de Estudios Turísticos), report “Turner y los Maestros”.
- Variable: main reason for buying the ticket online

Table 7. Visitors by whether they had to wait to enter the exhibition
Filter: bought the ticket at the ticket office

                                                        %
Had to wait                                          12.1
Did not have to wait                                 87.9
Total                                               100.0

Table 8. Visitors by activities carried out while waiting to enter the exhibition
Filter: bought the ticket at the ticket office and had to wait between buying the ticket and entering the exhibition

                                                        %
Visiting the Museum's collection                     16.6
Visiting or staying in the Museum café                7.7
Visiting the Museum shop                             28.1
Staying in or visiting other Museum spaces
that hold no collection                              33.0
Waiting outside the Museum                           27.5

Table 9. Visitors by main reason for buying the ticket online
Filter: bought the ticket online

                                                        %
Convenience                                          60.5
Speed                                                10.1
I can choose the day and time of the visit           14.0
I do not have to queue at the ticket office           9.5
The ticket is cheaper                                 4.3
24-hour availability                                  1.2
I had heard good things about the service             0.4
Total                                               100.0
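The ordering behind a Pareto chart can be sketched in a few lines of Python. The percentages are those of Table 9 (main reason for buying the ticket online); the English category labels are our own shortened translations, and the cumulative column is an addition commonly drawn on Pareto charts rather than something shown in the source table.

```python
# Percentages from Table 9 (labels are shortened translations).
reasons = {
    "Convenience": 60.5,
    "Speed": 10.1,
    "Can choose day and time": 14.0,
    "No queue at the ticket office": 9.5,
    "Cheaper ticket": 4.3,
    "24-hour availability": 1.2,
    "Had heard good things": 0.4,
}

# Pareto ordering: categories ranked by decreasing frequency.
ranked = sorted(reasons.items(), key=lambda item: item[1], reverse=True)

# Running total of the percentages, in the ranked order.
pareto = []
cumulative = 0.0
for reason, pct in ranked:
    cumulative += pct
    pareto.append((reason, pct, round(cumulative, 1)))

for reason, pct, cum in pareto:
    print(f"{reason:<32} {pct:>5.1f} {cum:>6.1f}")
```

In this example the first two reasons already account for about three quarters of the online purchases, which is the kind of concentration the chart is meant to reveal.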


The Pareto chart: example

Other graphics: pictograms

- Sample: 70 university students from Madrid
- Variable: preferred political party

Preferred political party    Number of students    Proportion of students
PSOE                         23                    0.33
PP                           15                    0.21
Unidos Podemos               20                    0.29
Ciudadanos                    7                    0.10
Others                        5                    0.07
Total                        70                    1

The area of each pictogram is proportional to the frequency.

Exercise
Results from a survey conducted among 15-20 year-olds about their favorite leisure activity

- What is the variable and who are the individuals?
- For what percentage of young people is reading the preferred leisure activity?

Exercise

From a test taken by a group of students, graded between 1 and 8, the following table was obtained:

Grade, $c_i$    $n_i$    $f_i$
1               4        0.08
2               4        ?
3               ?        0.16
4               7        0.14
5               5        ?
6               10       ?
7               7        0.14
8               ?        ?

- How many students took the test?
- What percentage of students obtained a grade greater than or equal to 6?

Exercise

In a survey about health habits, 30 randomly chosen students were asked about the sport they usually practice. The results are shown in the following table:

Sport, $c_i$    $n_i$    $f_i$
Basketball      12       0.4
Swimming         3       0.1
Football         9       0.3
None             6       0.2
Total           30       1

Which of the following charts corresponds to the data above?

Exercise

[Figure: four candidate bar charts, labeled a) to d), each titled “Deporte” (Sport), with a vertical axis from 0 to 14 and the categories Baloncesto (Basketball), Natación (Swimming), Fútbol (Football) and Ningún deporte (None)]

Description of discrete quantitative variables: table of frequencies

- Sample: 100 shopping malls in which a promotion of a certain service was launched last November.
- Variable: number of new customers gained due to the promotion.

         Absolute            Relative            Cumulative absolute    Cumulative relative
$c_i$    frequency $n_i$     frequency $f_i$     frequency $N_i$        frequency $F_i$
0         1                  0.01                  1                    0.01
1         4                  0.04                  5                    0.05
2         7                  0.07                 12                    0.12
3         8                  0.08                 20                    0.20
4         8                  0.08                 28                    0.28
5        16                  0.16                 44                    0.44
6        18                  0.18                 62                    0.62
7        14                  0.14                 76                    0.76
8        10                  0.10                 86                    0.86
9        11                  0.11                 97                    0.97
10        3                  0.03                100                    1
Total   100                  1

Description of discrete quantitative variables: table of frequencies

- What percentage of the sampled malls gained exactly 5 new customers?
- How many malls attracted at least 3 new customers?
- How many malls attracted fewer than 6 new customers?
- What percentage of the sampled malls gained between 4 and 8 new customers?
- What percentage of malls gained at most 7 new customers?
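Questions of this kind reduce to reading or subtracting cumulative frequencies. The sketch below rebuilds $N_i$ and $F_i$ from the absolute frequencies of the shopping-mall table and answers each question; it assumes that “between 4 and 8” includes both end values, which the question does not state explicitly.

```python
from itertools import accumulate

# Absolute frequencies n_i for c_i = 0, 1, ..., 10 new customers (from the table).
n_i = [1, 4, 7, 8, 8, 16, 18, 14, 10, 11, 3]

n = sum(n_i)                    # sample size: 100 malls
N_i = list(accumulate(n_i))     # cumulative absolute frequencies
F_i = [N / n for N in N_i]      # cumulative relative frequencies

pct_exactly_5 = 100 * n_i[5] / n            # relative frequency of c_i = 5
at_least_3 = n - N_i[2]                     # everything above c_i = 2
fewer_than_6 = N_i[5]                       # c_i = 0, ..., 5
pct_4_to_8 = 100 * (N_i[8] - N_i[3]) / n    # assumes 4 and 8 are both included
pct_at_most_7 = 100 * N_i[7] / n            # cumulative frequency at c_i = 7

print(pct_exactly_5, at_least_3, fewer_than_6, pct_4_to_8, pct_at_most_7)
```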

Description of discrete quantitative variables: the bar chart

Bar charts can also be created for discrete data if there are not too many different values.

Description of discrete quantitative variables: general format of the table

                Absolute         Relative          Cumulative absolute     Cumulative relative
Class, $c_i$    freq., $n_i$     freq., $f_i$      freq., $N_i$            freq., $F_i$
$c_1$           $n_1$            $f_1 = n_1/n$     $N_1 = n_1$             $F_1 = f_1$
$c_2$           $n_2$            $f_2 = n_2/n$     $N_2 = N_1 + n_2$       $F_2 = F_1 + f_2$
...             ...              ...               ...                     ...
$c_k$           $n_k$            $f_k = n_k/n$     $N_k = n$               $F_k = 1$
Total           $n$              1

Note:

- $c_1 < c_2 < \dots < c_k$
- $n_i$ = number of individuals in the sample in class $c_i$, $f_i = n_i/n$
- $N_i = N_{i-1} + n_i$, $F_i = F_{i-1} + f_i$
- $0 \le f_i, F_i \le 1$
- $F_i$ and $N_i$ also make sense for qualitative ordinal variables

Qualitative ordinal variables: cumulative frequencies

We can also include cumulative frequencies in the table.

- Sample: 901 employees.
- Variable: levels of satisfaction (S=satisfied, V=very, U=unsatisfied)

         Absolute     Relative     Cumulative absolute    Cumulative relative
Class    frequency    frequency    frequency              frequency
VU        62          0.07          62                    0.07
U        108          0.12         170                    0.19
S        319          0.35         489                    0.54
VS       412          0.46         901                    1
Total    901          1

Qualitative ordinal variables: bar charts with cumulative frequencies

Beware! Many software programs list the classes in alphabetical order when the variable is qualitative. If the variable is ordinal, the classes must instead be listed in their natural ascending order.
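The effect of the ordering can be sketched with the satisfaction data. Here `natural_order` is a list we supply by hand to encode the ascending order of the levels; cumulative frequencies are only meaningful once the classes are sorted that way.

```python
# Absolute frequencies from the satisfaction table, keyed by class label.
counts = {"S": 319, "U": 108, "VS": 412, "VU": 62}

alphabetical = sorted(counts)               # default ordering in many programs
natural_order = ["VU", "U", "S", "VS"]      # ascending satisfaction, set by hand
ordered = sorted(counts, key=natural_order.index)

# Cumulative absolute frequencies in the natural order.
cumulative = []
running = 0
for level in ordered:
    running += counts[level]
    cumulative.append((level, running))

print(alphabetical)   # order that mixes satisfied and unsatisfied levels
print(cumulative)
```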

Bar charts for discrete data

- Sample: 46 professionals of a computer company in the United States.
- Variable: EXPRNC: number of years working in the company

Experience, $c_i$    Absolute freq., $n_i$    Relative freq., $f_i$
1                    5                        0.109
2                    4                        0.087
3                    4                        0.087
4                    4                        0.087
5                    3                        0.065
6                    4                        0.087
7                    1                        0.022
8                    4                        0.087
10                   4                        0.087
11                   2                        0.043
12                   2                        0.043
13                   2                        0.043
14                   1                        0.022
15                   1                        0.022
16                   3                        0.065
17                   1                        0.022
20                   1                        0.022
Total                46                       1


Description of discrete quantitative variables: the bar chart

Too many different values.

Description of continuous quantitative variables

- Sample: 46 professionals of a computer company in the United States.
- Variable: EXPRNC: years of experience
- Variable: SALARY: annual gross income (in US dollars)

Grouping by class intervals: continuous (or discrete) data

Class interval        Midpoint                           $n_i$    $f_i$    $N_i$    $F_i$
$[l_0, l_1]$          $c_1 = \frac{l_0 + l_1}{2}$        $n_1$    $f_1$    $N_1$    $F_1$
$(l_1, l_2]$          $c_2 = \frac{l_1 + l_2}{2}$        $n_2$    $f_2$    $N_2$    $F_2$
...                   ...                                ...      ...      ...      ...
$(l_{k-1}, l_k]$      $c_k = \frac{l_{k-1} + l_k}{2}$    $n_k$    $f_k$    $n$      1
Total                                                    $n$      1

Note:

- The left end-point is excluded, but the right end-point is included in Excel (it is a convention)
- The reverse end-point convention can also be applied; check your software for its definition
- Useful for tabulating discrete data if X takes many values

Grouping by class intervals

- Very often class intervals have the same width
- Determine the width $w$ of each interval by

$$w = \frac{\text{largest number} - \text{smallest number}}{\text{number of desired intervals}}$$

- How many intervals? Roughly between 5 and 20. Practice and experience provide the best guidelines (from Newbold):

Sample size        Number of classes
Fewer than 50      5–7
50 to 100          7–8
101 to 500         8–10
501 to 1000        10–11
1001 to 5000       11–14
More than 5000     14–20

- Intervals never overlap
- Round up the interval width to get desirable interval endpoints

Grouping by class intervals: histogram and frequency polygon

- Find the range: $20 - 1 = 19$
- Select the number of classes: say $k = \sqrt{46} = 6.78 \approx 7$
- Compute the interval width: $19/7 = 2.71 \Rightarrow 3$.
- Determine the end-points (beginning before the first value and ending after the last one): $[0, 3], (3, 6], \dots, (18, 21]$
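The four steps above can be sketched in Python using the EXPRNC frequency table. The choice of 0 as the first end-point follows the slide; the helper `interval_index` is our own illustration of the convention “first interval closed, the rest open on the left and closed on the right”.

```python
import math

# EXPRNC frequency table: years of experience -> absolute frequency.
freq = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4, 11: 2,
        12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}

n = sum(freq.values())                    # 46 observations
data_range = max(freq) - min(freq)        # 20 - 1 = 19
k = round(math.sqrt(n))                   # sqrt(46) = 6.78 -> 7 classes
width = math.ceil(data_range / k)         # 19 / 7 = 2.71 -> 3
edges = [i * width for i in range(k + 1)] # start at 0, as on the slide

def interval_index(x):
    """Index of the interval containing x: [l0, l1], (l1, l2], ..., (l_{k-1}, l_k]."""
    if x == edges[0]:
        return 0
    return math.ceil((x - edges[0]) / width) - 1

counts = [0] * k
for value, n_value in freq.items():
    counts[interval_index(value)] += n_value

for lo, hi, c in zip(edges, edges[1:], counts):
    print(f"({lo:>2}, {hi:>2}]  {c}")
```

The counts per interval are exactly the bar heights of the histogram with absolute frequencies.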

Description of quantitative variables: histogram and frequency polygon

- There are no gaps between the bars/bins
- Bin widths = widths of class intervals (identical); class boundaries are marked on the horizontal axis
- Bin heights = frequencies (here, absolute)
- Bin areas are proportional to the frequencies


Quantitative variables: the histogram

Description of quantitative variables: histogram and frequency polygon

Other graphics: cartograms (INE, Encuesta de Turismo de Residentes)

Average expenditure per person on trips during the third quarter of 2016

Average expenditure per person on excursions during the third quarter of 2016


Other graphics: pictograms

Other graphics: time series

INE, Encuesta de Población Activa (Labour Force Survey)

How to lie with pictograms

Published in “La Voz de Galicia” on October 24, 2010.

- Making the height (rather than the area) of a pictogram proportional to the frequency gives a false impression.
- Is there anything else you don't like?

Lying with graphs
Improper use of scales: the coordinate origin is not 0


Lying with graphs

Lying with graphs
The vertical axis scale is upside down


Lying with Statistics

A classic book: How to Lie with Statistics, by Darrell Huff, 1954.

Available online: https://archive.org/details/HowToLieWithStatistics

Numerical summary

Central tendency    Location       Spread                  Shape
mean                quartiles      range                   coeff. of skewness
median              percentiles    interquartile range     coeff. of kurtosis
mode                               variance
                                   standard deviation
                                   coeff. of variation

Descriptive statistics

- Why are they useful?
- Can we calculate them for all types of variables?
- Which are the most useful in each case?
- How can we compute them with a calculator or Excel?

Measures of central tendency

- The mean
- The median
- The mode

Central tendency: the (arithmetic) mean

The (arithmetic) mean
The mean is the average of all the data

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \dots + x_n}{n}$$

- It is the most common measure of location
- It is the center of gravity of the data
- It can be calculated only for quantitative variables

The mean: example
For the experience of the 46 professionals of a computer company, what is the mean?

$$\bar{x} = \frac{1 + 1 + 1 + 1 + 1 + 2 + 2 + 2 + 2 + \dots + 17 + 20}{46} = 7.5 \text{ years}$$

With Excel: function PROMEDIO(numero1; [numero2]; ...) (AVERAGE in English-language Excel)

The mean: example
How can we calculate the mean from the absolute frequency table? And from the relative frequency table?
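One way to answer both questions is to weight each value by its frequency: $\bar{x} = \frac{1}{n}\sum_i c_i n_i = \sum_i c_i f_i$. The sketch below checks this on the EXPRNC frequency table and should reproduce the mean of 7.5 years obtained from the raw data.

```python
import math

# EXPRNC frequency table: years of experience c_i -> absolute frequency n_i.
freq = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4, 11: 2,
        12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}

n = sum(freq.values())

# From absolute frequencies: sum of c_i * n_i, divided by n.
mean_from_absolute = sum(c * n_c for c, n_c in freq.items()) / n

# From relative frequencies: sum of c_i * f_i, with f_i = n_i / n.
rel = {c: n_c / n for c, n_c in freq.items()}
mean_from_relative = sum(c * f for c, f in rel.items())

print(mean_from_absolute, mean_from_relative)
```

Both routes give the same value up to floating-point rounding; with relative frequencies rounded to three decimals, as printed in the table, a small discrepancy would be expected.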

The mean with grouped data

The formula is the same, but each interval is represented by its midpoint.
For the salary of the 46 professionals of a computer company, what is the mean?

Note: the mean salary computed from the raw data equals 17250.413
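The salary intervals are not reproduced in this transcript, so the sketch below illustrates the grouped-data mean with the EXPRNC variable instead, grouped into the width-3 intervals $[0, 3], (3, 6], \dots, (18, 21]$. The interval counts are our own tally from the EXPRNC frequency table and should be read as an assumption of this example.

```python
# Class intervals of width 3 and the number of EXPRNC observations in each
# (tallied from the frequency table; an assumption of this sketch).
edges = [0, 3, 6, 9, 12, 15, 18, 21]
counts = [13, 11, 5, 8, 4, 4, 1]

# Midpoint c_i = (l_{i-1} + l_i) / 2 represents every observation in interval i.
midpoints = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]

n = sum(counts)
grouped_mean = sum(c * n_c for c, n_c in zip(midpoints, counts)) / n

print(midpoints)
print(round(grouped_mean, 4))
```

The grouped mean differs somewhat from the raw-data mean of 7.5 years because grouping replaces each observation by its interval midpoint; the same effect explains why a grouped salary mean need not equal 17250.413.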

The mean: properties

- Linearity: if $Y = a + bX \Rightarrow \bar{y} = a + b\bar{x}$; if $Z = X + Y \Rightarrow \bar{z} = \bar{x} + \bar{y}$

If the 46 professionals' salaries increase by 2 %, how does the mean salary change?

If the salary is reduced by 100 dollars, what is the new mean salary?

If the salary is increased by a productivity bonus recorded in variable $Y$, with mean $\bar{y}$, what is the new mean salary?

- Disadvantage: affected by extreme values (outliers)

Example: $X$: 3, 1, 5, 4, 2; $Y$: 3, 1, 5, 4, 200

$$\bar{x} = \frac{3 + 1 + 5 + 4 + 2}{5} = 3 \qquad \bar{y} = \frac{3 + 1 + 5 + 4 + 200}{5} = 42.6\,!$$

When the data is skewed, an alternative robust measure of central tendency is more appropriate
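The linearity property can be checked numerically. The salaries and bonuses below are hypothetical values chosen only for illustration (the 46 actual salaries are not listed here); the check itself does not depend on them.

```python
import math
from statistics import mean

# Hypothetical salaries and productivity bonuses, for illustration only.
salary = [15000, 16000, 17000, 18000, 20000]
bonus = [500, 0, 250, 1000, 750]

# Y = a + bX: a 2 % raise (b = 1.02, a = 0) and a 100-dollar cut (b = 1, a = -100).
raised = [1.02 * x for x in salary]
reduced = [x - 100 for x in salary]

# Z = X + Y: salary plus bonus.
total = [x + y for x, y in zip(salary, bonus)]

print(mean(salary), mean(raised), mean(reduced), mean(total))
```

In each case the new mean can be obtained from the old one without revisiting the individual observations.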

Central tendency: the median
... it is the most central datum

1 1 1 3 3 5 5 7 8 8 9 ($n = 11$, odd) $\Rightarrow M = 5$

1. Order the data from smallest to largest
2. Include repetitions
3. The median is the physical centre

1 1 1 3 3 5 5 7 8 8 ($n = 10$, even) $\Rightarrow M = \frac{3 + 5}{2} = 4$

Median
Ordered list from smallest to largest: $x_{(1)}, x_{(2)}, \dots, x_{(n)}$

$$M = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\[4pt] \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases}$$

With Excel: function MEDIANA(numero1; [numero2]; ...) (MEDIAN in English-language Excel)
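The definition translates directly into code. The function below is a minimal sketch of the odd/even rule, with the 1-based positions of the formula converted to Python's 0-based indices; the standard library's `statistics.median` could be used instead.

```python
def median(data):
    """Median of a list of numbers, following the odd/even definition."""
    xs = sorted(data)              # step 1: order the data (repetitions kept)
    n = len(xs)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]             # x_((n+1)/2) in 1-based notation
    return (xs[mid - 1] + xs[mid]) / 2   # average of x_(n/2) and x_(n/2+1)

print(median([1, 1, 1, 3, 3, 5, 5, 7, 8, 8, 9]))
print(median([1, 1, 1, 3, 3, 5, 5, 7, 8, 8]))
```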

Finding the median from a frequency table

Experience, $x_i$    $n_i$    $f_i$    $N_i$    $F_i$
1                    5        0.109     5       0.109
2                    4        0.087     9       0.196
3                    4        0.087    13       0.283
4                    4        0.087    17       0.370
5                    3        0.065    20       0.435  < 0.5
6                    4        0.087    24       0.522  > 0.5  ⇒ M = 6
7                    1        0.022    25       0.543
8                    4        0.087    29       0.630
9                    0        0        29       0.630
10                   4        0.087    33       0.717
11                   2        0.043    35       0.761
12                   2        0.043    37       0.804
13                   2        0.043    39       0.848
14                   1        0.022    40       0.870
15                   1        0.022    41       0.891
16                   3        0.065    44       0.957
17                   1        0.022    45       0.978
18                   0        0        45       0.978
19                   0        0        45       0.978
20                   1        0.022    46       1.000
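The lookup in the table can be sketched as follows: find the value at which the cumulative absolute frequency first reaches the central position(s). The helper names are our own; the rule for even $n$ averages the two central observations, in line with the definition of the median.

```python
from itertools import accumulate

# EXPRNC frequency table: years of experience -> absolute frequency.
freq = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4, 11: 2,
        12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}

def median_from_table(freq):
    values = sorted(freq)
    n = sum(freq.values())
    cumulative = list(accumulate(freq[v] for v in values))   # N_i

    def value_at(position):
        # First value whose cumulative frequency reaches the 1-based position.
        for v, N in zip(values, cumulative):
            if N >= position:
                return v

    # Central positions: equal when n is odd, consecutive when n is even.
    low, high = (n + 1) // 2, n // 2 + 1
    return (value_at(low) + value_at(high)) / 2

print(median_from_table(freq))
```

With $n = 46$ the central positions are 23 and 24, both of which fall in the class $x_i = 6$ because $N_5 = 20 < 23$ and $N_6 = 24$.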

The median: properties

- Linearity: if $Y = a + bX$ with $b > 0 \Rightarrow M_y = a + bM_x$

If the 46 professionals' salaries are increased by 2 %, how does the median salary change?

Afterwards the salary is reduced by 100 dollars. What is the final median salary?

- Can we calculate the median with the education level data?

Can we calculate the median with the 0-1 position of responsibility variable?

- Advantage: not affected by outliers

Example: $X$: 3, 1, 5, 4, 2; $Y$: 3, 1, 5, 4, 200

$M_x = 3 \qquad M_y = 4$

When the data is skewed, the median is a better measure of central tendency than the mean.
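The contrast between the two measures can be reproduced with Python's standard `statistics` module, using the same $X$ and $Y$ as in the example.

```python
from statistics import mean, median

x = [3, 1, 5, 4, 2]
y = [3, 1, 5, 4, 200]    # same data with one extreme value

# A single outlier moves the mean a long way but barely moves the median.
print("X:", mean(x), median(x))
print("Y:", mean(y), median(y))
```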

The median and the mean for asymmetric (skewed) data
Annual gross salary in 2014, Encuesta de Estructura Salarial 2014 (Wage Structure Survey), INE

“The difference between the mean salary and the median salary arises because very high salaries strongly influence the calculation of the mean, even though they correspond to few workers.” (INE press release of 28 October 2016)

Central tendency: the mode

... it is the most frequent value

The mode of the variable experience in the 46 professionals example is 1 year, with an absolute frequency of 5 employees.

The values 2, 3, 4, 8 and 10 have an absolute frequency of 4 employees.

Central tendency: the mode

Does this definition make sense with the education level data?

Does this definition make sense with the 0-1 position of responsibility variable?


Central tendency: the mode

Does this definition make sense with continuous data? ⇒ modal interval

The mode: properties

- It can be calculated for both qualitative and quantitative variables. Indeed, it is the only one of the three measures of central tendency (mean, median, mode) that makes sense for nominal qualitative variables.
- Not affected by outliers
- There may be no mode.
- There can be more than one mode: bimodal, trimodal, multimodal

What does it indicate?
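A short sketch with `statistics.multimode` (available from Python 3.8) shows the mode for a quantitative variable, for a qualitative one, and for a sample with two modes. The raw lists are rebuilt from the frequency tables of the examples; the last sample is made up to show a tie.

```python
from statistics import multimode

# EXPRNC rebuilt from its frequency table (value repeated n_i times).
freq = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4, 11: 2,
        12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}
exprnc = [value for value, n_i in freq.items() for _ in range(n_i)]

# EDUC rebuilt from its frequency table: a qualitative variable.
educ = ["High School"] * 14 + ["College"] * 19 + ["Advanced Degree"] * 13

print(multimode(exprnc))            # most frequent number of years
print(multimode(educ))              # most frequent education level
print(multimode([1, 1, 2, 2, 3]))   # made-up sample with two modes
```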

Bimodal distribution
Time (in minutes) to complete a marathon. Data from an open marathon (everybody can participate).

[Figure: histogram “Tiempo en correr un maratón: histograma” (Time to run a marathon: histogram); horizontal axis from 133 to 238 minutes, vertical axis (frequency) from 0 to 160; bars in two groups, blue and green]

- What do you think is happening?
- Can you guess which types of runners make up the blue and the green groups?
- Would you expect to observe the same histogram shape if the data came instead from a marathon at the Olympic Games?

Location measures

- Quartiles
- Percentiles

Location measures: quartiles and percentiles

- Quartiles split the ranked data into four segments with an equal number of values per segment.
- Percentiles split the ranked data into a hundred segments with an equal number of values per segment.

1. Order the data from smallest to largest
2. Include repetitions
3. Select each quartile (percentile) according to its position:
   - The first quartile $Q_1$ has position $\frac{1}{4}(n + 1)$.
   - The second quartile $Q_2$ (= median) has position $\frac{1}{2}(n + 1)$.
   - The third quartile $Q_3$ has position $\frac{3}{4}(n + 1)$.
   - The $k$-th percentile $P_k$ has position $k(n + 1)/100$, $k = 1, \dots, 99$.

Quartiles and percentiles with Excel

Note:

In most cases the fractions $\frac{1}{4}(n + 1)$, $\frac{3}{4}(n + 1)$ and $\frac{k}{100}(n + 1)$ are not integers ⇒ to get the (integer) position of the given quartile (or percentile) a rounding criterion must be used.

With Excel, the functions are:

- CUARTIL.INC(matriz;cuartil) (QUARTILE.INC in English-language Excel), with cuartil: 1 = first quartile, 2 = median, 3 = third quartile
- PERCENTIL.INC(matriz;p) (PERCENTILE.INC in English-language Excel), with $p = \frac{k}{100} \in (0, 1)$ for the $k$-th percentile
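The position rule can be sketched in Python. Linear interpolation between neighbouring observations is used here as one possible way to handle non-integer positions; it is an assumption of this sketch, not the only rounding criterion. The sample is an arbitrary illustrative one with $n = 8$. For comparison, `statistics.quantiles` is also called: its default `"exclusive"` method works with positions $p(n+1)$, while its `"inclusive"` method works with positions $1 + p(n-1)$, which we believe is the rule behind spreadsheet `.INC` functions (check your software).

```python
import math
from statistics import quantiles

def quantile_by_position(data, p):
    """Value at 1-based position p * (n + 1) in the ordered data,
    interpolating linearly when the position is not an integer."""
    xs = sorted(data)
    n = len(xs)
    position = min(max(p * (n + 1), 1), n)   # keep the position inside 1..n
    lower = math.floor(position)
    fraction = position - lower
    if lower >= n:
        return xs[-1]
    return xs[lower - 1] + fraction * (xs[lower] - xs[lower - 1])

sample = [11, 12, 13, 16, 16, 17, 18, 21]    # illustrative data, n = 8

q1, q2, q3 = (quantile_by_position(sample, p) for p in (0.25, 0.5, 0.75))
print(q1, q2, q3)

print(quantiles(sample, n=4))                        # positions p * (n + 1)
print(quantiles(sample, n=4, method="inclusive"))    # positions 1 + p * (n - 1)
```

The two conventions agree on the median but give slightly different first and third quartiles, which is why results from different programs may not match exactly.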

Measures of spread

- The range and the interquartile range
- The variance and the standard deviation
- The coefficient of variation

Variation: range and interquartile range (IQR)

- The range is the simplest measure of variation

$$R = x_{\max} - x_{\min}$$

- It ignores the way the data is distributed
- It is sensitive to outliers

Example: given observations 3, 1, 5, 4, 2, $R = 5 - 1 = 4$
Example: given observations 3, 1, 5, 4, 100, $R = 100 - 1 = 99$

- The interquartile range (IQR) can eliminate some outlier problems. It discards the highest and lowest observations and computes the range of the middle 50 % of the data

$$IQR = \text{3rd quartile} - \text{1st quartile} = Q_3 - Q_1$$

Variation: interquartile range and boxplot

- Outliers are observations that fall
  - below the value of $Q_1 - 1.5 \cdot IQR$
  - above the value of $Q_3 + 1.5 \cdot IQR$
- For extreme outliers, replace 1.5 by 3 in the above definition

[Figure: box plot with $x_{\min} = 12$, $Q_1 = 24$, median ($Q_2$) $= 31$, $Q_3 = 42$, $x_{\max} = 58$; each of the four segments contains 25 % of the data; $IQR = 18$]
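The outlier criterion can be sketched with the five values of the box plot. The function name `is_outlier` and the test values 70 and 100 are our own, chosen for illustration.

```python
# Five-number summary read from the box plot.
x_min, q1, med, q3, x_max = 12, 24, 31, 42, 58

iqr = q3 - q1                        # interquartile range: 42 - 24 = 18

lower_fence = q1 - 1.5 * iqr         # observations below this are outliers
upper_fence = q3 + 1.5 * iqr         # observations above this are outliers

def is_outlier(x, k=1.5):
    """True if x lies outside [Q1 - k*IQR, Q3 + k*IQR]; use k=3 for extreme outliers."""
    return x < q1 - k * iqr or x > q3 + k * iqr

print(iqr, lower_fence, upper_fence)
print(is_outlier(x_min), is_outlier(x_max))    # neither extreme is an outlier here
print(is_outlier(70), is_outlier(70, k=3))     # outlier, but not an extreme one
```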

Box-Plot

- It shows five location measures (minimum, $Q_1$, median, $Q_3$, maximum).
- It shows a robust dispersion measure.
- It allows the study of the symmetry of the data.
- It gives a criterion to detect outliers.
- It is very useful to compare different datasets.
- Variation: when several box-plots are depicted in the same chart, the box widths can be proportional to the sample sizes.

Homer and his antagonists
Homer Simpson has two main antagonists: Flanders and Mr. Burns.

In those episodes in which at least one of them appears, how is Homer's prominence distributed?

Homer and his antagonists
You must use the filter variable defined in Exercise 5 (Problem set 1)

1) Create 4 variables with the values of the column (variable) “Homer” for each of the following cases: Homer&Burns, Homer&Flanders, Homer&Both, Homer&None

2) Select all cases and insert a Diagrama de Cajas y Bigotes (box-and-whisker plot)


Measure of variation: variance

I Average of squared deviations of values from the mean

I Sample variance

$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{\overbrace{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}^{\text{faster to calculate}}}{n} \quad \Leftarrow \text{divided by } n$$

I Sample quasi-variance (corrected sample variance)

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} \quad \Leftarrow \text{divided by } n - 1$$

I They are related via

$$\sigma^2 = \frac{n - 1}{n}\, s^2$$

I If a, b are real numbers and y = a + bx, then $s_y^2 = b^2 s_x^2$
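The two definitions and their properties can be checked numerically. The sketch below uses Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n − 1; the sample `data` and the constants `a`, `b` are hypothetical values chosen only for illustration.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical sample, mean = 5
n = len(data)

sigma2 = statistics.pvariance(data)  # sample variance: divided by n
s2 = statistics.variance(data)       # quasi-variance: divided by n - 1

# Relation between the two: sigma^2 = (n - 1) / n * s^2
print(math.isclose(sigma2, (n - 1) / n * s2))

# Linear transformation y = a + b*x: the shift a drops out, b enters squared
a, b = 10, -3                        # hypothetical constants
y = [a + b * x for x in data]
print(math.isclose(statistics.variance(y), b**2 * s2))
```

Both checks print `True`: adding a constant does not change the spread, while rescaling by b multiplies the variance by b².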


Measure of variation: standard deviation (SD)

I The most-commonly used measure of spread

I Population standard deviation, sample standard deviation and sample quasi-standard deviation are respectively

$$\sigma = \sqrt{\sigma^2} \qquad s = \sqrt{s^2}$$

I Measures variation about the mean

I Has the same units as the original data, whilst variance is in squared units

I Variance and SD are both affected by outliers


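Calculating variance and standard deviation

Example: X : 11, 12, 13, 16, 16, 17, 18, 21; Y : 14, 15, 15, 15, 16, 16, 16, 17; Z : 11, 11, 11, 12, 19, 20, 20, 20

$$\bar{x} = \frac{124}{8} = 15.5 \qquad \bar{y} = \frac{124}{8} = 15.5 \qquad \bar{z} = \frac{124}{8} = 15.5$$

$$\sum_{i=1}^{n} x_i^2 = 11^2 + 12^2 + \ldots + 21^2 = 2000$$

$$\sum_{i=1}^{n} y_i^2 = 14^2 + 15^2 + \ldots + 17^2 = 1928$$

$$\sum_{i=1}^{n} z_i^2 = 11^2 + 11^2 + \ldots + 20^2 = 2068$$

$$s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{2000 - 8(15.5)^2}{8 - 1} = \frac{78}{7} = 11.1429 \;\Rightarrow\; s_x = 3.3381$$

$$s_y^2 = \frac{1928 - 8(15.5)^2}{8 - 1} = \frac{6}{7} = 0.8571 \;\Rightarrow\; s_y = 0.9258$$

$$s_z^2 = \frac{2068 - 8(15.5)^2}{8 - 1} = \frac{146}{7} = 20.8571 \;\Rightarrow\; s_z = 4.5670$$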
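The worked example can be reproduced with a short Python sketch of the shortcut formula. The helper name `quasi_variance` is an illustrative choice.

```python
import math


def quasi_variance(values):
    """Quasi-variance via the shortcut formula (sum x_i^2 - n * mean^2) / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum(v * v for v in values)
    return (sum_of_squares - n * mean**2) / (n - 1)


samples = {
    "X": [11, 12, 13, 16, 16, 17, 18, 21],
    "Y": [14, 15, 15, 15, 16, 16, 16, 17],
    "Z": [11, 11, 11, 12, 19, 20, 20, 20],
}

for name, values in samples.items():
    s2 = quasi_variance(values)
    mean = sum(values) / len(values)
    print(f"{name}: mean = {mean}, s^2 = {s2:.4f}, s = {math.sqrt(s2):.4f}")
```

The shortcut form is convenient by hand, but in floating-point arithmetic it can lose precision when the values are large relative to their spread; the deviation form (or a library routine such as `statistics.variance`) is the safer choice there.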


Calculating variance and standard deviation with Excel


Comparing standard deviations

Example cont.: X : 11, 12, 13, 16, 16, 17, 18, 21; Y : 14, 15, 15, 15, 16, 16, 16, 17; Z : 11, 11, 11, 12, 19, 20, 20, 20

[Figure: dot plots of X, Y and Z on a common axis from 11 to 21]

$\bar{x} = 15.5$, $s_x = 3.3$

$\bar{y} = 15.5$, $s_y = 0.9$

$\bar{z} = 15.5$, $s_z = 4.6$


Measure of variation: coefficient of variation (CV)

I Measures relative variation and is defined as

$$CV = \frac{s}{|\bar{x}|}$$

I Is a unitless number (sometimes given in %’s)

I Shows variation relative to the mean

Example:

Stock A: Average price last year = 50, Standard deviation = 5

Stock B: Average price last year = 100, Standard deviation = 5

$$CV_A = \frac{5}{50} = 0.10 \qquad CV_B = \frac{5}{100} = 0.05$$

Both stocks have the same SD, but stock B is less variable relative to its mean price
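The stock comparison can be written as a minimal Python sketch. The function name `coefficient_of_variation` is illustrative; the guard against a zero mean is an added assumption, since the ratio is undefined in that case.

```python
def coefficient_of_variation(sd, mean):
    """CV = s / |mean|; undefined when the mean is zero."""
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    return sd / abs(mean)


cv_a = coefficient_of_variation(5, 50)    # Stock A
cv_b = coefficient_of_variation(5, 100)   # Stock B

print(f"CV_A = {cv_a:.2f}, CV_B = {cv_b:.2f}")
print("Stock B is less variable relative to its mean:", cv_b < cv_a)
```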


| Summary measures | SDG 4 | SDG 5 | SDG 8 | SDG 12 | SDG 16 |
|---|---|---|---|---|---|
| Mean | 72.3411911 | 59.8682135 | 61.9165165 | 68.9414671 | 63.3261936 |
| Standard error | 1.8162398 | 1.31663947 | 1.45068384 | 0.99827484 | 1.01989315 |
| Median | 80.2378311 | 63.8331375 | 61.8484726 | 73.0971451 | 63.0161781 |
| Mode | #N/A | #N/A | #N/A | #N/A | #N/A |
| Standard deviation | 22.7574195 | 16.4974452 | 18.1770164 | 12.5083478 | 12.7792246 |
| Sample variance | 517.900142 | 272.165699 | 330.403924 | 156.458766 | 163.308581 |
| Kurtosis | 0.80070785 | -0.47804046 | -1.0086797 | 0.64222018 | -0.3081343 |
| Skewness | -1.22872549 | -0.49130289 | -0.10955689 | -1.05249387 | 0.21842549 |
| Range | 95.9346478 | 78.4421329 | 78.7104588 | 69.2612934 | 61.1629505 |
| Minimum | 3.90777469 | 14.1622066 | 17.0483456 | 24.3055172 | 31.2056255 |
| Maximum | 99.8424225 | 92.6043396 | 95.7588043 | 93.5668106 | 92.368576 |
| Sum | 11357.567 | 9399.30952 | 9720.89308 | 10823.8103 | 9942.21239 |
| Count | 157 | 157 | 157 | 157 | 157 |

Z-scores. In which SDG is Spain performing best?

SDG 4: Quality education, Spain: 88.9

SDG 5: Gender equality and women’s empowerment, Spain: 80.6

SDG 8: Decent work and economic growth, Spain: 80.9

SDG 12: Responsible consumption and production, Spain: 60.8

SDG 16: Peace, justice and strong institutions, Spain: 69.5


Numerical summaries and frequency tables. Standardization.

I To standardize variable x means to calculate

$$\frac{x - \bar{x}}{s}$$

I If you apply this formula to all observations x1, . . . , xn and call the transformed ones z1, . . . , zn, then the mean of the z’s is zero and their standard deviation is one

I Standardization = finding z-score
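As a quick check of the zero-mean, unit-SD property, the following Python sketch standardizes the sample X from the earlier variance example. It assumes s is the quasi-standard deviation (divisor n − 1), which is what `statistics.stdev` computes; the function name `standardize` is illustrative.

```python
import statistics


def standardize(values):
    """Return the z-scores (x_i - mean) / s, with s computed using n - 1."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - mean) / s for v in values]


x = [11, 12, 13, 16, 16, 17, 18, 21]
z = standardize(x)

print([round(v, 3) for v in z])
print("mean of z:", round(statistics.mean(z), 10))
print("SD of z:  ", round(statistics.stdev(z), 10))
```

Up to floating-point rounding, the z-scores have mean 0 and standard deviation 1 regardless of the original units.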


| Summary measures | SDG 4 | SDG 5 | SDG 8 | SDG 12 | SDG 16 |
|---|---|---|---|---|---|
| Mean | 72.3411911 | 59.8682135 | 61.9165165 | 68.9414671 | 63.3261936 |
| Standard error | 1.8162398 | 1.31663947 | 1.45068384 | 0.99827484 | 1.01989315 |
| Median | 80.2378311 | 63.8331375 | 61.8484726 | 73.0971451 | 63.0161781 |
| Mode | #N/A | #N/A | #N/A | #N/A | #N/A |
| Standard deviation | 22.7574195 | 16.4974452 | 18.1770164 | 12.5083478 | 12.7792246 |
| Sample variance | 517.900142 | 272.165699 | 330.403924 | 156.458766 | 163.308581 |
| Kurtosis | 0.80070785 | -0.47804046 | -1.0086797 | 0.64222018 | -0.3081343 |
| Skewness | -1.22872549 | -0.49130289 | -0.10955689 | -1.05249387 | 0.21842549 |
| Range | 95.9346478 | 78.4421329 | 78.7104588 | 69.2612934 | 61.1629505 |
| Minimum | 3.90777469 | 14.1622066 | 17.0483456 | 24.3055172 | 31.2056255 |
| Maximum | 99.8424225 | 92.6043396 | 95.7588043 | 93.5668106 | 92.368576 |
| Sum | 11357.567 | 9399.30952 | 9720.89308 | 10823.8103 | 9942.21239 |
| Count | 157 | 157 | 157 | 157 | 157 |
| Spain | 88.9 | 80.6 | 80.9 | 60.8 | 69.5 |
| Difference from the mean | 16.5588089 | 20.7317865 | 18.9834835 | -8.14146713 | 6.17380644 |
| Accounting for variability (z-score) | 0.72762243 | 1.25666648 | 1.04436741 | -0.65088269 | 0.48311276 |

Z-scores. In which SDG is Spain performing best?
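The last two rows of the table can be recomputed with a short Python sketch. The means, standard deviations and Spain's scores are copied from the table; the dictionary names are illustrative.

```python
means = {"SDG4": 72.3411911, "SDG5": 59.8682135, "SDG8": 61.9165165,
         "SDG12": 68.9414671, "SDG16": 63.3261936}
sds = {"SDG4": 22.7574195, "SDG5": 16.4974452, "SDG8": 18.1770164,
       "SDG12": 12.5083478, "SDG16": 12.7792246}
spain = {"SDG4": 88.9, "SDG5": 80.6, "SDG8": 80.9, "SDG12": 60.8, "SDG16": 69.5}

# z-score: difference from the mean, measured in standard deviations
z_scores = {goal: (spain[goal] - means[goal]) / sds[goal] for goal in spain}

for goal, z in z_scores.items():
    print(f"{goal}: z = {z:.4f}")

best = max(z_scores, key=z_scores.get)
print("Highest z-score:", best)
```

Spain's raw score is highest on SDG 4, but once each score is measured against the spread of its own indicator, the largest z-score is the one for SDG 5.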


Measures of shape

X Fisher–Pearson coefficient of skewness

X Fisher coefficient of kurtosis

X Empirical rule


Measures of shape: Skewness

Be careful not to decide on the shape of the data merely by comparing the mean, the median and the mode.

Fisher–Pearson coefficient of skewness

$$\gamma_1 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$

With Excel: COEFICIENTE.ASIMETRIA(numero1; numero2; ...) (SKEW in English-language Excel) computes

$$\frac{n}{(n - 1)(n - 2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$
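Both versions of the skewness coefficient can be compared in a small Python sketch. It assumes s is the quasi-standard deviation (divisor n − 1) in both formulas; the function names are illustrative, and the data are the sample X from the variance example.

```python
import statistics


def standardized_cubes(values):
    """Sum of ((x_i - mean) / s)**3, with s computed using n - 1."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return sum(((v - mean) / s) ** 3 for v in values)


def skewness(values):
    """gamma_1 = (1/n) * sum of standardized cubes."""
    return standardized_cubes(values) / len(values)


def skewness_adjusted(values):
    """Adjusted version n / ((n-1)(n-2)) * sum of standardized cubes."""
    n = len(values)
    return n / ((n - 1) * (n - 2)) * standardized_cubes(values)


x = [11, 12, 13, 16, 16, 17, 18, 21]
print(f"gamma_1  = {skewness(x):.4f}")
print(f"adjusted = {skewness_adjusted(x):.4f}")
```

Both coefficients are positive for X (a slightly longer right tail) and both are zero for perfectly symmetric data; they differ only by a factor that depends on n.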


Measures of shape: kurtosis

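Fisher’s coefficient of kurtosis

$$\gamma_2 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - 3$$

With Excel: CURTOSIS(numero1; numero2; ...) (KURT in English-language Excel) computes

$$\frac{n(n + 1)}{(n - 1)(n - 2)(n - 3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n - 1)^2}{(n - 2)(n - 3)}$$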
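The two kurtosis formulas can be sketched in Python in the same way. As before, s is assumed to be the quasi-standard deviation (divisor n − 1), the function names are illustrative, and the data are the sample X from the variance example.

```python
import statistics


def standardized_fourths(values):
    """Sum of ((x_i - mean) / s)**4, with s computed using n - 1."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return sum(((v - mean) / s) ** 4 for v in values)


def kurtosis(values):
    """gamma_2 = (1/n) * sum of standardized fourth powers - 3."""
    return standardized_fourths(values) / len(values) - 3


def kurtosis_adjusted(values):
    """Adjusted version with the small-sample correction factors."""
    n = len(values)
    factor = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return factor * standardized_fourths(values) - correction


x = [11, 12, 13, 16, 16, 17, 18, 21]
print(f"gamma_2  = {kurtosis(x):.4f}")
print(f"adjusted = {kurtosis_adjusted(x):.4f}")
```

Both values are negative for X, which points to tails lighter than those of a normal distribution; the adjusted formula needs at least four observations.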


Measures of shape: skewness and kurtosis

Excel function


“Data Analysis” (Análisis de Datos) in Excel: “Descriptive Statistics” (Estadística descriptiva)

[Figure: histogram of the average PISA score across Maths/Reading/Science (0-600), OECD countries only. x-axis: class marks (406 to 539 in steps of 19); y-axis: frequency (0 to 18)]

Data source: SDG Index & Dashboards Report 2017, http://www.sdgindex.org/

Spain: 491.4

| [OECD-only] Average PISA score across Maths/Reading/Science (0-600) | Value |
|---|---|
| Mean | 491.9848408 |
| Standard error | 4.407032995 |
| Median | 496.9519786 |
| Mode | #N/A |
| Standard deviation | 26.0723588 |
| Sample variance | 679.7678935 |
| Kurtosis | 1.905272727 |
| Skewness | -1.319879232 |
| Range | 113.2610878 |
| Minimum | 415.6699466 |
| Maximum | 528.9310344 |
| Sum | 17219.46943 |
| Count | 35 |


Empirical rule

If the data are bell-shaped (normal), that is, symmetric with light tails, the following rule holds:

I 68 % of the data are in $(\bar{x} - 1s, \bar{x} + 1s)$

I 95 % of the data are in $(\bar{x} - 2s, \bar{x} + 2s)$

I 99.7 % of the data are in $(\bar{x} - 3s, \bar{x} + 3s)$

Note: This rule is also known as the 68–95–99.7 rule

Example: We know that for a sample of 100 observations, the mean is 40 and the quasi-standard deviation is 5. Assuming that the data are bell-shaped, give the limits of an interval that captures 95 % of the observations.

95 % of the $x_i$’s are in: $(\bar{x} \pm 2s) = (40 \pm 2(5)) = (30, 50)$
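The same calculation as a minimal Python sketch. The function name `empirical_interval` and the lookup table `COVERAGE` are illustrative; the rule is only an approximation and applies only when the data are roughly bell-shaped.

```python
# Approximate share of bell-shaped data within k standard deviations of the mean
COVERAGE = {1: 0.68, 2: 0.95, 3: 0.997}


def empirical_interval(mean, s, k):
    """Return ((mean - k*s, mean + k*s), approximate coverage) for k in {1, 2, 3}."""
    if k not in COVERAGE:
        raise ValueError("the empirical rule is stated only for k = 1, 2, 3")
    return (mean - k * s, mean + k * s), COVERAGE[k]


interval, coverage = empirical_interval(mean=40, s=5, k=2)
print(f"About {coverage:.0%} of the observations lie in {interval}")
```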