BEIO: Boletín de Estadística e Investigación Operativa
Official Journal of the Sociedad de Estadística e Investigación Operativa
Volume 33, Number 3
November 2017
ISSN: 2387-1725
J. Vidal-Puga. Editorial. 183
M. Alcañiz, M. Santolino, Ll. Ramón. A comparative analysis of tree-based models classifying imbalanced breath alcohol data. 189
E. Köbis. Robust approaches to uncertain optimization. 224
L. Esteban. Applying the generic statistical business process model (GSBPM) to the business register: the Spanish experience. 258
R. Ibar-Alonso, C. Cosculluela-Martínez. Positive effects on the least motivated students of the highly motivated ones. 276
R. Cao. Ingenuas reflexiones de un estadístico en la era del Big Data. 295
M. Diéguez, R. Manín, P. Blanco, S. Vázquez. Premios incubadora de sondeos y experimentos. 322
BEIO (Boletín de Estadística e Investigación Operativa) is a journal that publishes, every four months, scientific dissemination articles on Statistics and Operations Research. The articles aim to address topics relevant to a large majority of professionals in Statistics and Operations Research, prioritizing the dissemination purpose without neglecting scientific rigour in the treatment of the subject matter. The journal comprises the following sections: Estadística, Investigación Operativa, Estadística Oficial, Historia y Enseñanza, and Opiniones sobre la Profesión.

BEIO was founded in 1985 as the Boletín Informativo of the SEIO (Sociedad de Estadística e Investigación Operativa). Over the years it has undergone a continuous evolution. The first scientific article appeared in 1994 and, since then, the number of scientific articles published has grown steadily; in 2008 the informative contents were split off from the Boletín, and the journal began to take shape as a dissemination journal of Statistics and Operations Research.

Articles published in BEIO are indexed in Scopus, MathSciNet, Biblioteca Digital Española de Matemáticas, Dialnet (Documat), Current Index to Statistics, The Electronic Library of Mathematics (ELibM), COMPLUDOC and Catálogo Cisne Complutense.

The journal is available online at www.seio.es/BEIO.
Editors
Salvador Naya Fernández, Universidade da Coruña
[email protected]
Mª Teresa Santos Martín, Universidad de Salamanca

Associate Editors
Estadística: Rosa M. Crujeiras Casais, Universidade de Santiago de Compostela ([email protected])
Investigación Operativa: César Gutiérrez Vaquero, Universidad de Valladolid ([email protected])
Estadística Oficial: Pedro Revilla Novella, Instituto Nacional de Estadística
Historia y Enseñanza: Mª Carmen Escribano Ródenas, Universidad CEU San Pablo de Madrid

Technical Editors
Antonio Elías Fernández, Universidad Carlos III de Madrid
María Jesús Gisbert Francés, Universidad Miguel Hernández de Elche ([email protected])
Guidelines for submitting articles

Articles should be sent by e-mail to the corresponding associate editor or to the journal editor. They should be written in the LaTeX article style. Each article must contain the title, abstract and keywords in English, without a Spanish translation. Templates in both Spanish and English, which authors must use to prepare their articles, can be downloaded from the journal's website.
Copyright © 2017 SEIO

No part of this journal may be reproduced, stored or transmitted in any form or by any means, electronic, mechanical or otherwise, without the prior permission of SEIO. Published articles represent the opinions of their authors, and the journal BEIO does not necessarily agree with the opinions expressed in them.

Submitting an article for publication in BEIO implies the transfer of its copyright to SEIO. The author(s) will therefore sign the acceptance of the copyright conditions once the article has been accepted for publication in the journal.
Published by SEIO
Facultad de CC. Matemáticas
Universidad Complutense de Madrid, Plaza de Ciencias 3, 28040 Madrid
ISSN: 2387-1725
BEIO: Official Journal of the Sociedad de Estadística e Investigación Operativa
Boletín de Estadística e Investigación Operativa, Vol. 33, No. 3, November 2017

Contents
Editorial (183)
Juan Vidal-Puga

Estadística (189)
A comparative analysis of tree-based models classifying imbalanced breath alcohol data
Manuela Alcañiz, Miguel Santolino and Lluís Ramón

Investigación Operativa (224)
Robust approaches to uncertain optimization
Elisabeth Köbis

Estadística Oficial (258)
Applying the generic statistical business process model (GSBPM) to the business register: the Spanish experience
Luis Esteban Barbado Miguel

Historia y Enseñanza (276)
Positive effects on the least motivated students of the highly motivated ones
Raquel Ibar-Alonso and Carolina Cosculluela-Martínez
Opiniones sobre la profesión (295)
Ingenuas reflexiones de un estadístico en la era del Big Data
Ricardo Cao Abad

Special Section (322)
Premios incubadora de sondeos y experimentos
Milagros Diéguez Taboada, Roberto Manín Gutiérrez, Paula Blanco Mosquera and Sabela Vázquez
Boletín de Estadística e Investigación Operativa, Vol. 33, No. 3, November 2017, pp. 183-188

Editorial

Juan Vidal-Puga
Economía, Sociedad y Territorio (ECOSOT) and Departamento de Estadística e Investigación Operativa, Universidade de Vigo
The Game Theory Working Group was founded during the XXIV Congress of the Sociedad de Estadística e Investigación Operativa (SEIO), held in Almería in October 1998. We are therefore less than a year away from celebrating our 20th anniversary. At its formation, the group was coordinated by Ignacio García Jurado, who had been a founding member and president (1992-1997) of the Sociedade Galega para a Promoción da Estatística e a Investigación de Operacións and who would later become President of the SEIO (2006-2012) and Editor in Chief of TOP (2001-2006), the Society's Operations Research journal. At every SEIO Congress, the group meets to take stock of the activities carried out, to renew its coordination and to plan the activities for the next period.

The purpose of the group is to promote communication and research among the SEIO members working in game theory and, by extension, among all Spanish game theorists. Accordingly, the group is open to anyone interested in game theory, whether as a SEIO member or as an external collaborator.
Game theory is a tool aimed at optimizing processes in which the criterion of what is optimal is not the same for all the agents involved. With this definition, which generalizes the notion of decision problems to the case of more than one decision maker, its place within mathematics in general, and operations research in particular, is particularly obvious. However, game theory was not born exclusively within mathematics, but as a union between economics, which provided the motivation, and mathematics, which provided the rigour.

More precisely, game theory is considered to have been born as a discipline with the publication of the book "Theory of Games and Economic Behavior" in 1944, by John von Neumann (a mathematician) and Oskar Morgenstern (an economist).

John von Neumann was a mathematical genius. He was, without doubt, one of the most brilliant mathematicians of the last century, and he had enough ambition to tackle a challenge of obvious complexity: how to use mathematics to model human behaviour.

At this point, I cannot resist making an aside to recall my first contact with game theory. Curiously, it was not at the Universidade de Santiago de Compostela, where I was studying for my degree and where Ignacio García Jurado taught the subject, but during my Erasmus year in the United Kingdom, at the University of Southampton. Half out of curiosity, I enrolled in a course on the topic. The course was basically about two-player zero-sum games. In these games, two agents compete in such a way that one agent's gain is the other's loss (hence the "zero sum"). I remember being struck by the solution of a generalization of the game "pares o nones" (odds and evens), in which two agents (say, A and B) must simultaneously announce a number, zero or one. If both announce the same number, B pays x to A.
If they announce different numbers, A pays y to B. This is a game without a winning strategy, since if one existed, the opponent would anticipate it and would always win, contradicting the fact that it was winning. However, there does exist an optimal strategy for each of the agents, which consists of being unpredictable: both A and B should choose 0 with probability 0.5 and 1 with probability 0.5, which leads to an expected payoff of (x − y)/2 for agent A (since the game is zero-sum, agent B receives an expected payoff of −(x − y)/2, which there is no need to mention). This single number summarizes the whole game.

A far from obvious result is that the existence and uniqueness of this number is guaranteed in any zero-sum game with two agents, regardless of the number of strategies they have. This was John von Neumann's first great contribution, known as the Minimax Theorem.

Continuing with my personal story, I should say that at that point I thought I understood why game theory belonged to the area of Statistics and Operations Research. It had to be because of the use of probability theory in the computation of optimal strategies. And there may be some truth in that, bearing in mind that the first scientific journal specifically devoted to game theory, the "International Journal of Game Theory", is catalogued, among other areas, under Statistics and Probability.

The Minimax Theorem had already been proved by von Neumann quite a few years earlier, in 1928, but its importance had gone almost unnoticed. It was Oskar Morgenstern's merit to guide his co-author towards putting this result in the economists' spotlight, a result that von Neumann himself refined and extended to games with imperfect information and with more than two players. The Minimax Theorem was for von Neumann, so to speak, what cogito ergo sum, "I think, therefore I am", was for René Descartes.
Once that first "truth" had been discovered, it should be possible to build the whole edifice of knowledge upon it. Von Neumann also considered that the existence of the Minimax Theorem for the simplest case suggested that a general result was possible in non-zero-sum situations, so common in real life, in which all agents can come out ahead if they cooperate.

We see, therefore, that Oskar Morgenstern served as a guide, steering von Neumann's genius in those directions that respond to relevant challenges within economics. Another example, which I cannot resist mentioning, is the relationship between preferences and utilities. Preferences are order relations that describe the agents' priorities. The use of preferences is very reasonable but, owing to their ordinal nature, the mathematical tools we can use with them are also very limited. Utilities, by contrast, assign a numerical value to the agents' preferences, and this makes it possible to apply the full power of mathematical analysis to their study. Today, the so-called von Neumann-Morgenstern utility functions are the basis for introducing the concept of risk into microeconomic models.

As might be expected, the contribution of game theory to economics continued beyond von Neumann and Morgenstern, although for a long time it was strongly focused on military applications in the United States. This was a relationship that, in the words of Guillermo Owen, brought more limitations than benefits to the development of the discipline. A highly recommendable work describing that period is "The Strategy of Conflict" by Thomas Schelling (Harvard University Press), whose third edition was published in 1990.

In recent years, game theory can be said to have lived through a golden age, recognized by the award of several Nobel prizes: John Nash, Reinhard Selten and John Harsanyi in 1994
for their analyses of equilibria in the theory of non-cooperative games; Robert Aumann and Thomas Schelling in 2005 for enhancing our understanding of conflict and cooperation through game-theoretic analysis; Leonid Hurwicz, Eric Maskin and Roger Myerson in 2007 for laying the foundations of mechanism design theory, which determines when markets are working efficiently; and Alvin Roth and Lloyd Shapley in 2012 for their work on assignment problems and market design. But the influence of game theory goes even further. In behavioural economics, which earned Richard Thaler this year's Nobel prize, one cannot fail to sense the essence of game theory in its approach.

Operations research also played its part in this journey. Prestigious economists such as the Nobel laureates Robert Aumann, Roger Myerson, Alvin Roth and Lloyd Shapley, and the Rey Juan Carlos I Prize winner Andreu Mas-Colell, among others, have published several of their game theory papers in Operations Research journals.

Even so, the potential of game theory is, in my view, much broader and, although this is already happening, it can be applied even more widely to many other disciplines, such as political science, public management or biology.

With the idea of showing the usefulness of game theory in different areas, the SEIO Game Theory Group organized this year, in Pontevedra, the course Modelos de Investigación Operativa en Teoría de Juegos, taught by Ignacio García Jurado and Joaquín Sánchez Soriano. This course was the third in a series whose objectives are to help strengthen the ties between doctoral students writing their theses in Spain on topics close to game theory, and to encourage the exchange of ideas between doctoral students and researchers
that may give rise to new research perspectives.

In these editions, the target audience is students and researchers for whom game theory may be useful in their lines of research, at both the theoretical and the practical level. For the reasons given above, I consider that this includes a wide range of areas, such as economics, business administration, operations research, mathematics, engineering, logistics, political science and public management, among others. In fact, in addition to the SEIO, the course was funded by an economics group (ECOSOT, Economía, Sociedad y Territorio), a statistics and operations research group (SiDOR), an economics and business association (ECOBAS), and a doctoral programme focused on creativity, innovation and sustainability (CREA S2i).

I hope that the spirit of these courses will be maintained in future editions and that it may even be extended to a more international setting. At a much more ambitious level, perhaps we can be the Morgenstern who helps a future von Neumann, from whatever field, to reach new heights in the knowledge of Humanity.
Boletín de Estadística e Investigación Operativa, Vol. 33, No. 3, November 2017, pp. 189-222

Estadística
A comparative analysis of tree-based models
classifying imbalanced breath alcohol data
Manuela Alcañiz and Miguel Santolino
Department of Econometrics
University of Barcelona
[email protected], [email protected]
Lluís Ramón
Data Scientist
Digital Origin
Abstract
When applied to binary data, most classification
algorithms behave well provided the dataset is balanced.
However, when one single class includes the majority of cases,
a good predictive performance for the minority class is not
easy to achieve. We examine the strengths and weaknesses
of three tree-based models when dealing with imbalanced
data. We also explore sampling and cost sensitive methods
as strategies for improving machine learning algorithms. An
application to a large dataset of breath alcohol content tests
performed in Catalonia (Spain) to detect drunk drivers is
shown. The Random Forest method proved to be the model of
choice if a high performance is required, while down-sampling
strategies resulted in a significant reduction in computing
time. When predicting alcohol impairment, the area of
control (built-up or not), hour of day and driver’s age were
the most relevant variables for classification.
Keywords: Imbalanced data, positive, drunk driving, police,
checkpoint, machine learning.
1. Introduction
Tree-based models have attracted the increasing attention of
researchers in recent years; however, analyses of the use of such
models when there is a highly unequal distribution between classes
are scarce. This is particularly true of binary data where one class
includes the majority of cases and the other represents just a small
portion. Imbalanced datasets of this kind are very common in such
disciplines as medical diagnosis, on-line advertising, fraud detection,
network intrusion, road safety, etc.
Many classification algorithms behave well for balanced datasets;
yet, when applied to imbalanced data, model fitting may be biased
towards the majority class. As a result, the model may provide a
poor predictive performance for the minority class, which is usually
the most interesting one. Kumar and Sheshadri [20], He and
Garcia [16] and Chawla [9] review problems of class imbalance and
alternative solutions. Here, the performance of two strategies for dealing with imbalanced data (sampling methods and cost-sensitive methods) is compared, and the interpretability of their respective results is discussed.
Specifically, we illustrate the performance and features of
tree-based models by applying them to the classification of
alcohol-impaired drivers in Catalonia (Spain). When testing
for breath alcohol content (BrAC) over the legal limits, highly
imbalanced results are obtained –clearly, most drivers are not
alcohol-impaired and so BrAC tests are largely negative.
The identification and deterrence of potential alcohol-impaired
drivers is a priority for traffic authorities the world over ([24])
and while a downward trend in drunk driving has been observed
in many countries, there is still room for improvement ([32], [24],
[34]). For example, in 2014, 24.8% of deaths among drivers
in Catalonia were related to alcohol. In order to tackle drunk
driving effectively, appropriate policies need to be adopted. In this
paper three tree-based models are studied and their application to
the classification of drivers with a BrAC over the legal limit on
Catalan roads is explored. Specifically, we examine the use of the
Classification and Regression Tree, Tree Bagging and the Random
Forest models to classify positive BrAC tests.
Several studies have been conducted in Catalonia with regard
to drinking habits and driving. Alcañiz et al. [1] estimated the prevalence of alcohol-impaired driving in Catalonia in 2012. They found that it was 1.29% for the general population of drivers, 1.90% on Saturdays and 4.29% on Sundays. Chuliá, Guillén and Llatje [10] studied seasonal and time-trend variation by gender of alcohol-impaired drivers at preventive sobriety checkpoints. Alcañiz, Santolino and Ramón ([2], [3]) studied age-drinking patterns and drinking behavior in Catalonia and analyzed different strategies at sobriety checkpoints. They suggested that non-random breath tests were effective primarily for detecting binge drinking, whereas random breath tests were better suited to detecting other drinking-and-driving profiles in the population.
To our knowledge, classification models to identify drunk drivers
have not been previously applied to Catalan road data.
The rest of this paper is structured as follows. Following on
from this introduction, in Section 2, three tree-based models are
introduced along with their properties and variants, and various
approaches to tackling the class imbalance problem are described.
Section 3 is devoted to explaining the dataset obtained from police
preventive checkpoints. The results obtained after fitting the
tree-based models to the data and several variants are reported in
Section 4. Concluding remarks and discussion are outlined in Section
5.
2. Methods
In this section three tree-based models are introduced and their
properties discussed. Specifically, we analyze the Classification and
Regression Tree, the Tree Bagging and the Random Forest models (the CART and Random Forest trademarks are licensed exclusively to Salford Systems).
A number of extensions employing other types of response data and
alternative implementations are also detailed. Finally, we investigate
how to deal with the class imbalance problem.
2.1. Classification and Regression Trees
Classification and Regression Trees (CART) were first introduced
by Breiman et al. [8]. The CART model partitions the predictor
space in a recursive way so as to create groups in the response
variable that are as homogeneous as possible. The CART algorithm
begins by splitting the dataset into two disjoint subsets (known as
nodes or leaves). For each predictor, splits are computed for all
possible cut-off values and the one that maximizes the homogeneity
(and minimizes the impurity) of the resulting disjoint subsets is
chosen. This process is recursively repeated for each node.
An impurity measure, quite commonly the Gini index, is used
to choose the best split, with the split impurity being calculated by
aggregating the impurity of the subnodes. For a two-class problem,
the Gini index for a given node is defined as p1(1− p1) + p2(1− p2),
where p1 and p2 are the class 1 and class 2 probabilities, respectively
[19]. Alternative measures to the Gini index exist. For instance,
the information gain measure can be used, although differences
are frequently not significant [27]. To avoid the overfitting of the
CART model, the subtree is selected based on a cost complexity
tuning, where a complexity parameter cp penalizes the size of
the tree. In fact, the subtree that minimizes Impurity(subtree) + cp × (number of terminal nodes) is selected. The cp value, the
hyperparameter, is normally selected using cross-validation (CV).
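To make the cost-complexity step concrete, the following is a minimal sketch in R (not the authors' code): a deep rpart tree is grown and then pruned at the cp value with the lowest cross-validated error. The data frame brac, containing the binary factor positive and the predictors, is a hypothetical stand-in for the dataset described later in Section 3.

# Minimal sketch with a hypothetical data frame `brac`; not the authors' code.
library(rpart)

set.seed(1)
deep_tree <- rpart(positive ~ ., data = brac, method = "class",
                   control = rpart.control(cp = 1e-6, xval = 10))  # grow a deep tree, 10-fold CV

printcp(deep_tree)  # table of candidate cp values with their cross-validated error (xerror)
best_cp <- deep_tree$cptable[which.min(deep_tree$cptable[, "xerror"]), "CP"]

pruned_tree <- prune(deep_tree, cp = best_cp)  # cost-complexity pruning at the selected cp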
CART models have the advantage of being easy to interpret
and rapid to compute, of allowing missing values to be dealt with
and of facilitating feature selection. An important characteristic of
these models is that variable importance can be assessed. This is
achieved by retaining the reduction in the Gini index at each split
and aggregating these values for every predictor. Predictors that
either appear at the beginning of the tree or which are used in several
splits are more important. Note that variable importance can be
biased when there are many missing values or there are categorical
variables with many levels ([30], [21]). The main disadvantages of
CART models concern the instability of their results.
In practice, a large number of alternative implementations of
tree models exist. Different approaches have been proposed for their
use with survival data [5], multivariate regression [11], clustering
[29] and unbiased models ([17], [21]). Hyafil and Rivest [18] show
that constructing optimal binary decision trees is an infeasible
task. Grubinger, Zeileis, and Pfeiffer [14] propose evolutionary
algorithms to improve accuracy, while Loh [22] compares a set of
alternative implementations in terms of their capabilities, strengths,
and weaknesses.
2.2. Tree Bagging
Bagging, or Bootstrap aggregating, also introduced by Breiman
[6], involves generating several predictions and combining them to
obtain an aggregated predictor. Here, predictions are generated
by applying a model to different bootstrap replicas of the dataset.
These replicas are made by replacement and are as large as the
dataset itself. The aggregate is the majority vote of all models.
Each tree used in the tree bagging is computed as described in Section 2.1
above. The only difference is that there is no pruning step. The
aggregating step neutralizes the overfitting error of the trees.
The number of trees to be used is defined by the user and, in
practice, a small number of replicas usually proves sufficient [19].
Although the error decreases with the number of trees, the trees are
highly correlated, so the margin of improvement associated with each
additional tree decreases with the number of replicas. Compared
with CART models, the advantage of tree-bagging models is their
stability, which reduces the risk of overfitting. On the other hand,
these models are computationally more intensive than CART models
and their interpretation more complex.
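As an illustration, here is a hedged sketch using ipred::bagging (one of the packages listed in Section 4); the training and test data frames brac and brac_test are hypothetical names.

library(ipred)

set.seed(1)
bag_fit <- bagging(positive ~ ., data = brac, nbagg = 50, coob = TRUE)  # 50 bootstrap trees; out-of-bag estimate

bag_fit$err                                                    # out-of-bag misclassification error
p_bag <- predict(bag_fit, newdata = brac_test, type = "prob")  # class probabilities aggregated over the trees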
2.3. Random Forest
In common with the two models outlined above, the Random
Forest (RF) model was proposed by Breiman [7]. RF involves
generating bootstrap replicas of the original dataset and creating
trees for each replica as in Bagging. However, RF seeks to create
uncorrelated trees to improve predictions. To create trees that are
as different as possible, at each split the trees can only use a limited
number of random variables. Hence, the trees tend to be very
different and provide different information when aggregated.
As in Tree Bagging, the number of trees to compute has first to
be specified. The number of variables that might be split at each
node (referred to as mtry) must also be defined. A common selection
is the square root of the number of variables [19]. In common with
the previous models, the minimum number of nodes can also be
determined. The higher this number is, the smaller and faster the
trees will be. As with the Tree Bagging models, the advantages of RF models are that performance is enhanced and the overfitting
risk reduced. Furthermore, RF models are robust to outliers.
Their disadvantages include the complexity of interpretation and
the lengthy computation time.
Indeed, the computation time of the original RF can be
prohibitive in the case of a large mtry and/or a high number of trees.
Therefore, less time-consuming alternatives are useful. Here, we use ranger, an efficient RF implementation that reduced the computing time by a factor of 12 compared with the original RF. An
additional feature of ranger is that it uses a variant for probability
estimation. Each tree provides the proportion of positives as opposed
to its classification. The probability is obtained by averaging this
proportion for all the trees. In doing so, the model performance is
generally improved [23].
Sometimes categorical variables can be interpreted as ordered
categorical variables (for instance, colors ordered according to their
intensity or type of roads based on their traffic capacity). This
strategy can significantly reduce the computation time of RF. To
split a categorical variable of n categories, the algorithm checks all
2(n−1) − 1 possible combinations. However, since the categories are
sorted in the case of ordered categorical variables, the impurity is
calculated between each category, and the threshold that gives the
best split is chosen. This is much quicker to compute as only one
variable has to be checked.
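The sketch below shows, under assumed object names (data frame brac, test set brac_test), how such a probability forest can be fitted with ranger; the respect.unordered.factors = "order" option implements the ordered-category device just described, and the default mtry (the square root of the number of variables) is kept.

library(ranger)

set.seed(1)
rf_fit <- ranger(positive ~ ., data = brac,
                 num.trees = 500,
                 probability = TRUE,                   # average per-tree class proportions
                 respect.unordered.factors = "order",  # treat factors as ordered, avoiding 2^(n-1) - 1 splits
                 importance = "permutation")

p_rf <- predict(rf_fit, data = brac_test)$predictions[, "yes"]  # estimated probability of a positive test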
RF models can assess variable importance in three ways. The
simplest way is to count the number of times that a variable is
selected in all the trees. The second way involves computing the
aggregate reduction in impurity obtained at each split in all the trees.
Finally, a third way is to measure the permutation importance. For
each tree, the prediction performance on the out-of-bag (OOB) samples (the observations not included in the corresponding bootstrap sample) is recorded. This performance is again computed but here using the
values of one randomly permuted variable. The drop in performance
resulting from this permutation is averaged over all the trees. This
is carried out for each variable and provides a measure of variable
importance in the RF [15]. When variables are highly correlated or if categorical and continuous variables are combined, the variable importance indicator needs to be considered with caution [31].
RF models have been extensively applied. For instance,
generalizations of RF models have been proposed to provide
conditional quantiles and confidence intervals ([25], [33]). Segal [28]
demonstrates that RF can overfit datasets with large numbers of
noisy inputs. To deal with this, alternative extended RFs have been
proposed ([35], [4]).
2.4. Class Imbalance
It is relatively common to find imbalanced datasets, where
the majority of cases present negative outcomes. For example,
only a small percentage of observations show positive outcomes
in datasets of BrAC tests. Many classification algorithms have
been designed specifically for balanced datasets and so a poor
predictive performance may be obtained when applied to imbalanced
data. Two strategies for dealing with unbalanced data are sampling
methods and cost sensitive methods.
Sampling methods involve modifying the original dataset to
obtain a balanced dataset and they can be divided into the following
categories: down-sampling, i.e., excluding some instances of the
majority class by random sampling; up-sampling, i.e., incorporating
more instances of the minority class by random sampling with
replacement; and, hybrid methods, i.e., combining both up- and
down-sampling methods. Note that sampling methods apply only to
training data and not to testing data. Cost-sensitive methods involve
applying different costs of misclassification to each class in the model
fitting process. By specifying a higher cost to the misclassification
of a minority instance than that to a majority instance, the machine
learning algorithm makes fewer errors with the minority class, as
it is more expensive. This would counteract the bias towards the
majority class.
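A hedged sketch of both strategies in R follows: caret::downSample balances the training data, and an rpart loss matrix implements a cost-sensitive tree. The object brac_train and the cost of 20 are illustrative, and the loss matrix assumes the factor levels are ordered c("no", "yes").

library(caret)
library(rpart)

# Down-sampling: keep all positives and a random sample of negatives of the same size
set.seed(1)
down_train <- downSample(x = brac_train[, setdiff(names(brac_train), "positive")],
                         y = brac_train$positive, yname = "positive")

# Cost-sensitive CART: misclassifying a positive ("yes") as negative is 20 times more costly
loss <- matrix(c(0,  1,   # true "no":  0 if predicted "no", 1 if predicted "yes"
                 20, 0),  # true "yes": 20 if predicted "no", 0 if predicted "yes"
               nrow = 2, byrow = TRUE)
cost_tree <- rpart(positive ~ ., data = brac_train, method = "class",
                   parms = list(loss = loss))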
An additional problem presented by class imbalance is how best
to assess classifiers. The usual classification metric is accuracy, computed, for instance, from the confusion matrix. However, in the case of imbalanced data, this measure may be inadequate. In addition, other techniques for comparing tree-based models, such as leave-one-out cross-validation, can be computationally very expensive for large datasets. To overcome these limitations, receiver operating
characteristic (ROC) curves are used. The ROC curve presents a
binary classifier performance when its threshold varies. It is formed
by plotting the true positive rate (TPR) against the false positive
rate (FPR) at various threshold settings. Any point on the diagonal
of the ROC curve is a random guess classifier, while any points below
the diagonal are worse than a random guess. A complete description
of ROC analysis can be found in Fawcett [12].
To compare the performance of different classifiers directly, we
use the area under the ROC Curve (AUC). This indicator aggregates
all the information provided by the ROC curve in a single scalar
expression. A higher AUC indicates a better average performance. Note, however, that a classifier with a higher AUC may still perform worse than another classifier in a specific region of the ROC curve. An interesting property is
that the AUC of a classifier is equivalent to the probability that the
classifier will rank a randomly chosen positive instance higher than
a randomly chosen negative instance [12].
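For instance, with the pROC package (also listed among the packages in Section 4) the ROC curve and its AUC can be obtained from a vector of predicted probabilities; the objects p_hat and brac_test in this sketch are hypothetical.

library(pROC)

# p_hat: predicted probability of a positive; brac_test$positive: observed class (no/yes)
roc_obj <- roc(response = brac_test$positive, predictor = p_hat,
               levels = c("no", "yes"), direction = "<")
auc(roc_obj)    # area under the ROC curve
plot(roc_obj)   # the diagonal corresponds to a random-guess classifier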
3. Data
3.1. Drunk driving legislation
Statutory blood-alcohol limits for driving differ across the
countries of Europe. Spanish legislation differentiates between
administrative and criminal positives, according to the level of
alcohol concentration in the breath (or blood). Drivers with BrAC
levels between 0.25 and 0.60 mg/l (0.15 and 0.60 mg/l for novice
and professional drivers) face administrative penalties if detected.
When the BrAC level is over 0.60 mg/l, drivers are deemed to have
committed a criminal offence and, therefore, face more stringent
legal sanctions, including temporary suspension of the driving license
and imprisonment.
The police are allowed to perform a BrAC test on any driver, even
if the driver does not show any symptoms of alcohol impairment.
The standard procedure is to conduct a BrAC test using a portable
breathalyzer while the driver is seated in their car. If negative, the
driver is allowed to continue on their journey; if positive, given that
the breathalyzer has no legal validity, an evidential breath test is
performed in the officer’s vehicle.
3.2. Variables
The database comprises 439,699 preventive BrAC tests carried
out at checkpoints by traffic authorities in 2014 in Catalonia. These
tests represent almost 95% of the total number of BrAC tests,
while the remaining 5% includes tests conducted on drivers showing
visible signs of alcohol intoxication or after committing a traffic
violation or on drivers involved in a traffic accident. Preventive
BrAC tests performed on cyclists or pedestrians were removed from
the database. Observations with missing information were also
removed. The final database comprises 408,936 BrAC tests.
Information recorded by traffic officers, including the location of
the checkpoint, specific hour of day, driver characteristics and vehicle
type, is available. Information about location differentiates between
interurban and urban areas and records the region and subregion
in which the checkpoint was set up. The territory of Catalonia is
divided into four administrative units and is recorded here as the
variable region. However, there is a more detailed administrative
division composed of 41 subregions. The traffic police in Catalonia
include both the regional police (Mossos d’Esquadra) and the local
police. There is a traffic police administrative division, known as
ART, which comprises eight levels and corresponds to the scale
between that of the regions and subregions.
The variable roadType records the type of road on which the
BrAC test was performed (Highway1 denotes toll highways and Highway2 toll-free highways). Information about the hour, day, week
and month when the test was performed is also available. As
drinking habits are closely associated with leisure, factors identifying
bank holidays (holiday), the eve of such holidays (holidayEve) and
long weekends (longWeekend) were created. Finally, driver and
vehicle characteristics were also recorded.
The description of variables is as follows.
- positive (dependent variable): BrAC level above the legal limit (yes/no).
- builtUp: Interurban area or Urban area.
- region: Barcelona, Girona, Lleida and Tarragona.
- subregion: name of the subregion, 41 categories.
- policeType: Regional police or Local police.
- ART: police territorial division, eight categories.
- roadType: Highway1, Highway2, Conventional road, Rural road and Urban road.
- hour: specific hour of day (number 1-24) when the BrAC test was performed.
- day: day when the BrAC test was performed.
- month: month when the BrAC test was performed.
- week: week when the BrAC test was performed, as a number (1-52).
- weekday: day of the week when the BrAC test was performed, as a number (1-7, Sunday being 7).
- dayType: Mon-Thu, Fri, Sat and Sun.
- workingDay: 1 if it was a working day, 0 otherwise.
- timePeriod: morning (6:00 to 13:59), afternoon (14:00 to 21:59) or night (22:00 to 5:59).
- holiday: bank holiday (yes/no).
- holidayEve: eve of a bank holiday (yes/no).
- longWeekend: long weekend (yes/no).
- sex: driver's sex.
- age: driver's age.
- licenseYear: year in which the driver obtained the license.
- spanish: driver Spanish or foreign.
- vehType: type of vehicle (Car, Van, Motorcycle, Moped, Light truck, Heavy truck, Bus and Other).
Tree-based model algorithms implement an implicit variable selection, so the strategy adopted was to include all the variables in the models. Table 1 presents the number of tests, the number of positives and the percentage of positives for the main variables and their levels. Additional tables for variables comprising many levels (ART, Table A.1; month, Table A.2; and hour of day, Table A.3) are included in the appendix.
3.3. BrAC outcomes above legal limit
The positive response variable is highly skewed. Of the 408,936 BrAC tests carried out, only 16,494 (approximately 4%) were
positive. Figure 1 shows the percentage of BrAC tests above the legal
limit by subregion. The map shows a non-homogeneous percentage
of positives throughout the territory, with values being particularly
high in the north-east and along the coast.
Figure 2 shows the percentage of BrAC tests above the legal limit
according to a specific set of variables. In winter there are fewer
positives, while from June to September there is a greater number.
Urban areas are associated with a higher prevalence of positives than
are interurban areas. During the week there is a 2% positive rate,
while on weekends it is between 5 and 7%. Positive rates on Fridays
(3.5%) are halfway between weekday and weekend prevalences. A
similar percentage of positives is observed for both men and women;
however, non-Spanish men record a slightly higher positive rate,
while non-Spanish women present the lowest rate. Driver age is also
informative. The prevalence of alcohol peaks at age 20 with more
than 7% of positives and falls after that age. The final plot analyzes
the relationship between the prevalence of alcohol and the hour of the day and the driver's age.
Variable      Level               # tests   # positives   % positive
builtUp       Interurban area     267,117   10,149        3.8
              Urban area          141,819   6,345         4.5
region        Barcelona           225,019   9,944         4.4
              Girona              50,145    2,610         5.2
              Lleida              61,868    1,020         1.6
              Tarragona           71,904    2,920         4.1
policeType    Regional police     266,029   10,155        3.8
              Local police        142,907   6,339         4.4
roadType      Highway1            30,149    1,213         4.0
              Highway2            45,735    2,247         4.9
              Conventional road   190,744   6,674         3.5
              Rural road          489       15            3.1
              Urban road          141,819   6,345         4.5
dayType       Mon-Thu             180,635   4,007         2.2
              Fri                 58,093    2,089         3.6
              Sat                 85,250    4,637         5.4
              Sun                 84,958    5,761         6.8
workingDay    Working day         206,126   5,277         2.6
              Non-working day     202,810   11,217        5.5
timePeriod    Morning             101,590   3,576         3.5
              Afternoon           86,982    985           1.1
              Night               220,364   11,933        5.4
sex           Man                 332,411   13,430        4.0
              Woman               76,525    3,064         4.0
age3l         [15,30]             133,713   7,732         5.8
              (30,45]             171,145   6,023         3.5
              (45,100]            104,078   2,739         2.6
licenseYear   [1932,1994)         138,129   3,964         2.9
              [1994,2004)         115,267   4,154         3.6
              [2004,2012)         131,088   7,043         5.4
              [2012,2015)         24,452    1,333         5.5
spanish       Spanish             350,444   14,035        4.0
              Non-Spanish         58,492    2,459         4.2
vehType       Car                 316,530   14,332        4.5
              Van                 25,229    436           1.7
              Motorcycle          29,717    1,264         4.3
              Moped               8,876     334           3.8
              Light truck         6,117     25            0.4
              Heavy truck         19,361    78            0.4
              Bus                 2,490     12            0.5
              Other               616       13            2.1

Table 1: Number of tests, positives and percentage of positives for the main variables.
Figure 1: Percentage of positives by subregion.
This highlights a black spot in the
early morning for drivers in the young age group when 15% of BrAC
positives are recorded. All age groups present a high positive rate
between 9pm and 3am. In the afternoon, this percentage increases
with age. Finally, a black spot occurs at 13h in the 55 to 65 age
group.

Figure 2: Percentage of positives by hour of day and age group.

4. Results

To assess the performance of the tree-based models, the data were randomly split into training and test sets. The division was made preserving the distribution of positives-negatives and of the
other variables. The training set contained 70% of the data and was
used to fit the models; the test set contained the remaining 30%
of the data and was used to validate the models. All categorical
variables were included in the models as binary variables; that
is, each category was converted into a dichotomous variable. The
performance of all the models was based on the AUC from the test
set. All models were performed with R version 3.2.3 [26]. Packages
used were caret, randomForest, ranger, pROC, e1071, rpart, ipred,
plyr and dplyr.
When a hyperparameter had to be adjusted, a ten-fold
cross-validation (10-CV) was used; that is, the training dataset was
randomly split into ten partitions. The model/hyperparameter was
trained with nine of the ten original partitions. The remaining
partition was used to obtain the validation performance of the model.
This step was repeated ten times and a different partition was used
each time for validation. The model/hyperparameter performance
was thus obtained as an average of all the validations. The metric
for hyperparameter tuning was the AUC value. The hyperparameter with the highest AUC was selected (alternatives exist for selecting the tuning parameter, such as the one-standard-error rule or a tolerance rule, which choose the simplest model within one standard error or within a defined tolerance of the best model, respectively [16]). Once the hyperparameter was
adjusted, the model was fitted to the whole dataset.
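A sketch of this workflow with caret is given below (assumed object names, not the authors' code): a stratified 70/30 split, followed by ten-fold cross-validation over a grid of 50 cp values with the AUC (reported as "ROC" by caret) as the tuning metric.

library(caret)

set.seed(1)
idx        <- createDataPartition(brac$positive, p = 0.7, list = FALSE)  # stratified 70/30 split
brac_train <- brac[idx, ]
brac_test  <- brac[-idx, ]

ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)  # AUC per fold

cart_cv <- train(positive ~ ., data = brac_train, method = "rpart",
                 metric = "ROC", tuneLength = 50, trControl = ctrl)
cart_cv$bestTune   # cp value with the highest cross-validated AUC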
4.1. Classification and Regression Tree model
Tree models contain one hyperparameter, the complexity parameter (cp). A grid of 50 cp values was used. The best cross-validated cp value was 6.9897·10^-6, with an AUC of 0.7472. The left panel of Figure 3 shows that the AUC value increases as cp decreases.
Note that the adjusted cross-validated cp value was very small.
Figure 3: CART models. The model with the best AUC is shown in red. Left panel: CV AUC as a function of cp. Right panel: tree depth as a function of cp.
The fitted trees need to be very deep in order to appreciate
differences between the two classes. The right panel of Figure 3 shows the tree depth as a function of cp. Note that the highest AUC was obtained with trees of 30 levels. The interpretation of deep trees is more complex. Using the adjusted cp value, a final model was fitted with all the training data. A membership probability was
obtained from the test set. The test AUC value was 0.7498.
These previous models do not take into account the fact that the
data are imbalanced. Therefore, two approaches for dealing with
imbalanced data were applied. First, down-sampling was performed
and so the training data were reduced to a down-sampled training
dataset. This contained the same number of observations from each
class. Our results improved in comparison to our previous outcomes.
The best cross-validated cp value was 4.9310·10^-4, with an AUC of
0.7499. Note that this cp value is 50 times higher than the previous
cp. The fitted tree has a depth of 17 levels and the AUC associated
with the test set was 0.7577. Thus, using a subset of the dataset
resulted in a better performance.
Second, up-sampling was performed. To achieve a balanced
dataset, items from the minority class were added until the dataset
contained the same number of positives as negatives. This led to severe overfitting in cross-validation: to obtain a balanced dataset, many instances from the minority class had to be copied, so the fitted tree contained the same observations in its leaves as in the validation set. This resulted in a nearly perfect cross-validated performance, but
when tested with new data, a very poor performance was obtained.
Although the cross-validated AUC value was almost 1, when the
model was validated with the test data, its AUC was less than 0.5,
i.e., a random guess.
Finally, a cost-sensitive method was applied. The cost value first had to be selected. We used cost values that balanced
the difference between classes. The dataset contains one positive for
every 20 negatives; thus, the tree model performance was analyzed
by applying a cost of 10, 20 and 30 for misclassification. Table 2
shows the cp value, the cross-validated AUC, the test AUC and the
depth for each cost value.
Cost   Best cp    CV AUC   Test AUC   Tree depth
10     0.000277   0.7483   0.7570     21
20     0.000311   0.7560   0.7663     17
30     0.000242   0.7545   0.7630     28

Table 2: Model results by the misclassification cost used.
The best model performance was obtained when a
misclassification cost of 20 was applied. Compared to the
base tree, the cp values were much higher and the trees were less
complex. Yet, they were still too deep to be visually interpretable.
If an interpretable tree is desired in our context, a larger cp value needs to be chosen as a trade-off between interpretability and
predictive performance.
4.2. Tree Bagging model
Bagging consists of generating several bootstrap replicas from
the original dataset and modeling the deepest possible tree for each
replica. Whereas bagging has no hyperparameters to tune, the
number of bootstrap replicas does have to be defined. In our case,
the number of bagging trees was 50 and the test AUC was 0.7267.
Figure 4 (a) shows that increasing the number of replicas did not
improve the test AUC. Note that after 40 replicas, the performance
of the model increases very slowly. When a sufficiently high number
of trees had been used, adding another tree did not provide any
additional information, since it was highly correlated with some
other previous tree.
Class imbalance strongly affected bagging performance. To
predict a new observation, class predictions were obtained for each
tree and the predicted probability was obtained from the frequency
of all individual tree predictions. This can be explained by the
fact that each tree in the bagging provides a classification, not
a probability. For example, a leaf with five negatives and four
positives would be classified as negative, just as would a leaf with all
negatives. As in the case of the tree model, a sampling approach was
adopted. Here, only the down-sampling method was used. Bagging
was applied with 50 trees and a test AUC of 0.7675 was obtained.
Finally, a cost sensitive approach was performed. A cost of 20 was
applied to the bagging building step and a test AUC value of 0.7737
was obtained. Note that using different costs affects how the splits
are chosen in the tree building step. As bagging builds trees that are
as deep as possible, the final leaves tend to be more homogeneous so
as to avoid misclassification costs. This limitation does not occur in
the base bagging model. Figure 4 (b) shows the ROC curve of the
base Tree Bagging model and the down-sampling and cost sensitive
Tree Bagging models.
To conclude, we should stress that the Bagging Models were
computationally much more intensive than the Classification and
Regression Tree models. Indeed, in some cases the model fitting
took more than twelve hours.
4.3. Random Forest
The efficient Random Forest implementation ranger was used
and categorical variables were considered as ordered categorical
variables. Compared to the RF model that does not modify
categorical variables, the AUC values were not statistically
significantly different (the CV AUC of the RF with the original categorical variables was 0.7886, s.d. = 0.0065, versus 0.7820, s.d. = 0.0064, with the converted categorical variables); however, the computation time was halved.
Figure 4: ROC curves and number of bootstrap replicas. Left panel: test AUC by the number of bootstrap replicas. Right panel: ROC curves of the base, cost-sensitive and down-sampling bagging models.
Intuitively, it seems that performance is not markedly affected when ordered categorical variables are used. This might be because
some categorical variables are directly considered as ordered
(dayType, timePeriod) or, at least, are categorized with a certain
order. For instance, the variable roadType has a certain order,
beginning with road types that have higher speed limits and
terminating with those with a lower speed limit.
With ten-fold CV, a large number of different mtry values were considered for selection. Figure 5 shows that the CV AUC increased as
the number of mtry decreased. The highest CV AUC was obtained
with an mtry equal to two. It had a CV AUC of 0.7849 and a test
AUC of 0.7932. A low mtry means that trees are very different from
each other, so each provides information for the aggregation step.
A low mtry could be problematic in the case of a high number of non-informative variables, which does not seem to be the case here.
Once the mtry was selected, the number of trees to be used
was analyzed. Figure 6 shows model performance as a function of
the number of trees. When the forest was small, adding new trees
substantially improved the model performance. However, the test
AUC value did not increase after approximately 400 trees.

Figure 6: Test AUC as a function of the number of trees. Left panel: using fewer than 150 trees. Right panel: using fewer than 1500 trees.
Finally, the down-sampling strategy was adopted to deal with
class imbalance problems. The down-sampled performance of the
model was slightly worse than when using all the dataset. The
optimal mtry was three with an associated CV AUC value of 0.7753
and a test AUC value of 0.7871. Compared with the previous models,
the standard deviation was much higher. As each fold used fewer
data, the AUC results were more dispersed. In terms of speed, the down-sampled model was fifteen times faster than when using
all the data. The cost sensitive approach was not performed.
Figure 5: CV AUC as a function of mtry.
Variable importance
A major advantage of the RF model is that variable importance
can be assessed. Here, we evaluate variable importance by means
of the RF built-in permutation variable importance measure, which
compares the increase in the prediction error after permuting all
elements of a variable. Here, categorical variables were not converted
to ordered categorical variable but to dummy variables in order to
facilitate interpretation.
Table 3 shows the 20 variables with the highest values on the
permutation variable importance measure. The variable with the
highest value was Local police. The correlated categories of Urban
area (builtUp) and Urban road (roadType) were in third and fourth
positions. This means that the behavior of the Local police and the
Regional police was considered to be different by the RF algorithm.
As expected, the hour and the time period-night were relevant for the
classification of observations. The most important characteristics of
the driver profile were age and experience (number of years holding
a driver’s license) which are both ranked in the top ten variables by
importance. The remaining variables in the top 20 were road type,
some regions/subregions and police divisions, and variables related
to the weekday and week of the year. Notice that sex and vehicle
type do not figure in the top 20.
Variable      Category              Importance
policeType    Local police          100.00
hour                                 62.84
builtUp       Urban area             61.69
roadType      Urban road             57.54
timePeriod    Night                  44.63
age                                  38.04
licenseYear                          38.02
roadType      Conventional road      26.90
weekday                              19.25
subregion     Barcelonès             19.12
week                                 17.30
ART           ART Metropolitana N    16.85
timePeriod    Afternoon              16.12
month                                15.69
workingDay    Non-working days       12.56
region        Lleida                 12.20
day                                   8.49
dayType       Sun                     7.99
ART           ART Tarragona           7.65
roadType      Highway2                7.40
Table 3: Top 20 variables by importance.
4.4. Comparison of tree-based models
To conclude, summary results are shown in Table 4. All the
tree-based models discussed in the article are compared in terms of
classification performance and computation intensity.
Tree-based model                         Test AUC   Computation time
CART                                     0.7498     Low
Down-sampling CART                       0.7577     Low
Up-sampling CART                         <0.5       Low/middle
Cost-sensitive CART                      0.7663     Low
Bagging                                  0.7267     Very high
Down-sampling Bagging                    0.7675     High
Cost-sensitive Bagging                   0.7737     Very high
Efficient Random Forest                  0.7932     Middle/high
Down-sampling efficient Random Forest    0.7871     Middle

Table 4: Performance and computation-time comparison of the tree-based models.
5. Discussion
This paper compares three tree-based models used in classification problems, in this specific case applied to BrAC
test results in excess of the legal limit in Catalonia (Spain). Drunk
driving data are deeply imbalanced since most drivers are not
alcohol impaired. Additionally, the performances of two alternative
strategies for dealing with imbalanced data (sampling methods and cost-sensitive methods) are compared. Unlike up-sampling, down-sampling methods proved preferable to fitting the models on the original imbalanced data.
The results following the application of down-sampling methods
were often slightly worse, but the reduction in computing time was
significant. As such, down-sampling techniques may be used to
obtain a rapid overview of model performance. In our case more
data did not improve model performance substantially. In the case of
imbalanced datasets, quality may be more important than quantity.
A comparison of the tree-based methods showed that the Random
Forest model performed best, which means it can be considered the
model of choice if a high performance model is wanted. If rapid
computation is required, however, the (CART) tree model with
misclassification costs should be used. Finally, when compared to
these two methods, Tree Bagging offered no modeling advantages in
the context described here.
In terms of the number of nodes, trees were in general very deep,
hindering the direct interpretation of variables. According to the
Random Forest variable importance indicators, the most important
variables were those of the area of control, the hour of day and the
driver’s age, findings that are in line with previous studies ([1], [13],
[3],[2], [10]). Built-up/non-built-up areas was the most important
variable in the classification. As for the implications of our findings
for road safety, it is clear that different enforcement strategies are
required to address drunk driving in each of the two areas. An
interesting application of tree-based methods is their utility for
helping in-situ police officers select the drivers that should be tested
when the checkpoint is set up. This application could be extended to
drug testing since the unitary cost of drug tests is high in comparison
to that of alcohol tests.
Future areas of research include distinguishing between
administrative and criminal offenses. In this highly imbalanced
scenario it would be interesting to analyze whether similar results
were obtained regarding the performance of tree-based models.
Additionally, other supervised classification techniques could be
applied such as linear discriminant analysis, naive Bayes or support
vector machine. Finally, a promising approach to explore in
the future in order to cut down the computation time is to
apply dimension reduction techniques, such as principal component
analysis or partial least squares.
Acknowledgements
We wish to express our gratitude to Servei Catala de Transit for
providing the data and the Mossos d’Esquadra and Local Police for
carrying out the fieldwork. The authors acknowledge the support
of the Spanish Ministry for grants ECO2013-48326-C2-1-P and
ECO2015-66314-R.
Declaration of interests
The authors report no conflicts of interest. The authors alone
are responsible for the content and writing of the paper.
References
[1] Alcaniz, M., Guillen, M., Santolino, M., Sanchez-Moscona, D.,
Llatje, O. and Ramon, L. (2014). Prevalence of alcohol-impaired
drivers based on random breath tests in a roadside survey in
Catalonia (Spain), Accident Analysis & Prevention, 65:131-141.
[2] Alcaniz, M., Santolino, M. and Ramon, L. (2016). Circular
con tasa de alcohol superior a la legal: caracterizacion del
conductor segun la vıa de circulacion, Revista Espanola de
Drogodependencias, 41(3):59-71.
[3] Alcaniz, M., Santolino, M. and Ramon, L. (2016). Drinking
patterns and drunk-driving behaviour in Catalonia, Spain: a
comparative study, Transportation Research Part F: Traffic
Psychology and Behaviour, 42, 522-531.
[4] Amaratunga, D., Cabrera, J. and Lee, Y.-S. (2008). Enriched
random forests, Bioinformatics, 24(18):2010-2014.
[5] Bou-Hamad, I., Larocque, D., Ben-Ameur, H. et al. (2011). A
review of survival trees, Statistics Surveys, 5:44-71.
[6] Breiman, L. (1996). Bagging predictors, Machine learning,
24(2):123-140.
[7] Breiman, L. (2001). Random Forest, Machine learning,
45(1):5-32.
[8] Breiman, L., Friedman, J., Stone, C. J. and Olshen, R. A.
(1984). Classification and regression trees, CRC press.
[9] Chawla, N. V. (2005). Data mining for imbalanced datasets: An
overview. In: Data mining and knowledge discovery handbook,
853-867, Springer.
[10] Chulia, H., Guillen, M. and Llatje, O. (2016). Seasonal and
Time-Trend Variation by Gender of Alcohol-Impaired Drivers
at Preventive Sobriety Checkpoints, Journal of Studies on
Alcohol and Drugs, 77(3):413-420.
[11] De'Ath, G. (2002). Multivariate regression trees: a new technique for modeling species-environment relationships, Ecology, 83(4):1105-1117.
[12] Fawcett, T. (2006). An introduction to ROC analysis, Pattern
recognition letters, 27(8):861-874.
[13] Font-Ribera, L., Garcia-Continente, X., Perez, A., Torres, R.,
Sala, N., Espelt, A. and Nebot, M. (2013). Driving under the
influence of alcohol or drugs among adolescents: the role of
urban and rural environments, Accident Analysis & Prevention,
60:1-4.
[14] Grubinger, T., Achim Zeileis, A. and Pfeiffer, K.-P. (2014).
Evolutionary Learning of Globally Optimal Classification and
Regression Trees in R, Journal of Statistical Software, 61(1).
[15] Hastie, T., Tibshirani, R. and Friedman, J. (2009). The
Elements of Statistical Learning: Data Mining, Inference, and
Prediction, Second Edition, Springer Series in Statistics.
[16] He, H. and Garcia, E. A. (2009). Learning from imbalanced
data, Knowledge and Data Engineering, IEEE Transactions on,
21(9):1263-1284.
[17] Hothorn, T., Hornik, K. and Zeileis, A. (2006). Unbiased
recursive partitioning: A conditional inference framework,
Journal of Computational and Graphical statistics,
15(3):651-674.
[18] Hyafil, L. and Rivest, R. L. (1976). Constructing optimal binary
decision trees is NP-complete, Information Processing Letters,
5(1):15-17.
[19] Kuhn, M. and Johnson, K. (2013). Applied predictive modeling,
Springer.
[20] Kumar, M. and Sheshadri, H. (2012). On the classification
of imbalanced datasets, International Journal of Computer
Applications, 44.
[21] Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection, Statistica Sinica, pp. 361-386.
[22] Loh, W.-Y. (2011). Classification and regression trees, WIREs
Data Mining Knowl. Discov., 1(1):14-23.
[23] Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G. and
Ziegler, A. (2012). Probability machines: consistent probability
estimation using nonparametric learning machines, Methods of
Information in Medicine, 51(1):74.
[24] Mathijssen, M. (2005). Drink driving policy and road safety in
the Netherlands: a retrospective analysis, Transportation
research part E: logistics and transportation review,
41(5):395-408.
[25] Meinshausen, N. (2006). Quantile regression forests, The
Journal of Machine Learning Research, 7:983-999.
[26] R Core Team (2016). R: A Language and Environment for
Statistical Computing, R Foundation for Statistical Computing,
Vienna, Austria.
[27] Raileanu, L. E. and Stoffel, K. (2004). Theoretical comparison
between the gini index and information gain criteria, Annals of
Mathematics and Artificial Intelligence, 41(1):77-93.
[28] Segal, M. R. (2004). Machine learning benchmarks and random forest regression, Center for Bioinformatics & Molecular Biostatistics.
[29] Sela, R. J. and Simonoff, J. S. (2011). RE-EM trees: a data
mining approach for longitudinal and clustered data, Mach.
Learn., 86(2):169-207.
[30] Strobl, C., Boulesteix, A.-L. and Augustin, T. (2007). Unbiased
split selection for classification trees based on the Gini index,
Computational Statistics & Data Analysis, 52(1):483-501.
[31] Strobl, C., Boulesteix, A.-L., Zeileis, A. and Hothorn, T.
(2007). Bias in random forest variable importance measures:
Illustrations, sources and a solution, BMC bioinformatics,
8(1):1.
[32] Vanlaar, W., Robertson, R., Marcoux, K., Mayhew, D., Brown,
S. and Boase, P. (2012). Trends in alcohol-impaired driving in
Canada, Accident Analysis & Prevention, 48:297-302.
[33] Wager, S., Hastie, T. and Efron, B. (2014). Confidence intervals for
random forests: The jackknife and the infinitesimal jackknife,
The Journal of Machine Learning Research, 15(1):1625-1651.
[34] Williams, A. F. (2006). Alcohol-impaired driving and its
consequences in the United States: the past 25 years, Journal
of safety research, 37(2):123-138.
[35] Xu, B., Huang, J. Z., Williams, G., Wang, Q. and Ye, Y.
(2012). Classifying very high-dimensional data with random
forests built from small subspaces, International Journal of
Data Warehousing and Mining (IJDWM), 8(2):44-63.
Appendix
ART                     # tests    # positives    (%)
ART Girona               50,143        2,610      5.2
ART Manresa Central      44,917        1,656      3.7
ART Metropolitana N     142,719        6,730      4.7
ART Metropolitana S      37,983        1,582      4.2
ART Pirineu Lleida       20,344          495      2.4
ART Ponent Lleida        41,524          525      1.3
ART Tarragona            45,711        2,141      4.7
ART Terres Ebre          25,595          755      2.9

Table A.1: Number of tests, positives and percentage of positives by Police Territorial Division (ART).
Month    # tests    # positives    (%)
1         32,286        1,046      3.2
2         38,231        1,446      3.8
3         41,161        1,749      4.2
4         29,485        1,162      3.9
5         34,485        1,487      4.3
6         41,897        1,916      4.6
7         27,521        1,373      5.0
8         28,788        1,386      4.8
9         29,319        1,402      4.8
10        38,298        1,126      2.9
11        31,182        1,271      4.1
12        36,283        1,130      3.1

Table A.2: Number of tests, positives and percentage of positives by month of the year.
Hour    # tests    # positives    (%)
1        22,656        1,069      4.7
2         9,777          761      7.8
3        35,935        2,677      7.4
4        25,562        2,161      8.5
5         7,043          954     13.5
6        21,499        1,958      9.1
7        22,746        1,094      4.8
8        14,282          293      2.1
9         8,801           73      0.8
10        8,752           42      0.5
11       11,966           47      0.4
12       10,524           46      0.4
13        3,020           23      0.8
14        1,160           21      1.8
15       19,151          136      0.7
16       23,296          234      1.0
17       11,790          148      1.3
18        6,243           64      1.0
19       10,755          139      1.3
20       11,999          157      1.3
21        2,588           86      3.3
22        2,057          123      6.0
23       28,448          750      2.6
24       88,886        3,438      3.9

Table A.3: Number of tests, positives and percentage of positives by hour of the day.
Boletín de Estadística e Investigación Operativa, Vol. 33, No. 3, Noviembre 2017, pp. 224-257

Investigación Operativa
Robust Approaches to Uncertain Optimization
Elisabeth Köbis
Institute of Mathematics
Martin–Luther–University Halle–Wittenberg
Abstract
This paper gives an overview of recent results on
robust optimization via unifying approaches using a
nonlinear scalarization concept and methods from vector
and set optimization. First, we consider scalar uncertain
optimization concepts. We distinguish between a finite
and infinite uncertainty set and show that a prominent
scalarizing functional as well as methods from vector and
set optimization play a crucial role for the representation of
robust optimization models. Then we present a notion of
robust solutions of uncertain vector optimization problems
along with linear and nonlinear scalarization results.
Keywords: Uncertain vector optimization, Robust
optimization, Set optimization, Vector optimization,
Scalarization.
AMS Subject classifications: 90C29, 68T37.
© 2017 SEIO
1. Introduction
Uncertain data contaminate most optimization problems in
various applications ranging from science and engineering to finance
and thus represent an essential component in optimization. From a
mathematical point of view, many problems can be modeled as an
optimization problem and be solved, but in real life, having exact
data is very rare and seems almost impossible. Due to a lack of
complete information, uncertain data can highly affect solutions and
thus influence the decision making process. Hence, it is crucial to
address this important issue in optimization theory.
Potential applications of uncertain optimization include supply
and inventory management, since demand and tools needed for
the production process can easily be exposed to uncertain changes.
Further examples for uncertain data in optimization problems can
be found in the field of market analysis, share prices, transportation
science, timetabling and location theory (see, for example, [4] and
the references therein, and [16, 51]).
As was recently observed in [26, 27], robust multiobjective
optimization is an important application of set optimization. In
case uncertainties are present during an optimization process,
the decision maker generally has two modeling options: Using
stochastic optimization approaches, solutions are desired that are
likely to satisfy the given requirements (optimality and constraints).
Alternatively, robust optimization searches for solutions which are
of good quality in the worst-case scenarios, regardless of how likely
this event may be. Robust multiobjective optimization with a fixed
ordering structure was examined in [26, 27]. Results on robust
multiobjective optimization using a variable order relation can be
found in [32].
In this paper, we give an overview of recent results in robust
optimization using concepts from vector and set optimization, in
particular by means of a nonlinear scalarizing functional. Section 2
is devoted to recalling some notation and preliminary results. In
Section 3, we describe approaches to uncertain scalar optimization,
where we distinguish between a finite and an infinite uncertainty
set. Section 4 presents a concept for robustness for uncertain
vector optimization problems and collects scalarization results. The
concluding Section 5 proposes some avenues for future research.
2. Preliminaries
In this section, we recall some notation of uncertain
multiobjective optimization introduced in Ehrgott et al. [11] (see
also [27, 29]). Throughout this work, let Yf be a real linear
topological space, X be a linear space, and let an uncertainty set
∅ ≠ U ⊆ RN be given, where N ∈ N\{0}. Consider f : X×U → Yf , ξ ∈ U and let f(·, ξ) : X → Yf be the function that is to be minimized on a feasible set ∅ ≠ X(ξ) ⊆ X. The feasible set is defined as
X (ξ) := {x ∈ X | Fi(x, ξ) ≤ 0, i = 1, . . . ,m}
with ξ ∈ U and Fi : X × U → R, i = 1, . . . ,m.
For a fixed ξ ∈ U , the deterministic vector optimization problem
is denoted by
f(x, ξ) → inf_{x ∈ X(ξ)}.    (P(ξ))

The family of all problems ⋃_{ξ∈U} (P(ξ)) is denoted by (P(U)). We
call ξ ∈ U a scenario and (P (ξ)) an instance of (P (U)).
ξ models the parameters which are uncertain, and the uncertainty
set U contains all the possible parameter values that the uncertain
parameter may attain. Such uncertainties occur in many real-world
optimization problems and can e.g. be caused by measuring errors,
modeling assumptions or simply because a future parameter is not
known prior to solving an optimization problem. Consequently, it
is necessary to treat some of the input data as uncertain and it is
important to find a way to handle uncertain data in optimization
problems. Throughout this paper we assume that the actual
outcome of the parameters ξ is unknown, but that ξ stems from an
uncertainty set U that is nonempty, compact and known a priori.
This is a common assumption in the context of robust optimization.
Examples include interval based uncertainties (e.g. [7]), polyhedral
uncertainties (e.g. [44]), or ellipsoidal uncertainty sets (e.g. [4]).
Let the set of robust solutions be denoted as
A := {x ∈ X | ∀ξ ∈ U : Fi(x, ξ) ≤ 0, i = 1, . . . , m} = ⋂_{ξ∈U} X(ξ),    (2.1)
which we assume to be nonempty.
We define for x ∈ A
fU (x) := {f(x, ξ)| ξ ∈ U} (2.2)
the image of f under U. Note that fU(x) ≠ ∅ for all x ∈ A, since U ≠ ∅.

Our goal is to obtain solutions that are robust, i.e., that perform
well even in the worst-case scenario. For the scalar case Yf = R,
this would mean to minimize the functional supξ∈U f(x, ξ) on A. Of
course, if f is vector-valued, this scalar approach cannot be easily
transferred to vector optimization. Due to the absence of a total
order on Yf , we need to define the meaning of a robust solution that
satisfies some kind of optimality.
In order to determine robust solutions (where the term robustness
needs to be defined), sets fU (x) need to be compared. For the
comparison of sets, usually, a cone is added to one set and both
sets are then compared w.r.t. that given cone, which represents
the ordering structure. Let Y be a real linear topological space.
Recall that C ⊆ Y is called a cone if c ∈ C implies that λc ∈ C
for every λ ≥ 0. The dual cone of a cone C is defined as the set
C∗ := {y∗ ∈ Y ∗ | ∀c ∈ C : y∗(c) ≥ 0}, where Y ∗ denotes the
topological dual space of Y . A cone C ⊆ Y is called pointed if
C ∩ (−C) = {0}. For two nonempty subsets A, B of Y , we denote
the Minkowski sum of sets by
A+B := {a+ b | a ∈ A, b ∈ B}.
The cone C ⊆ Y is convex if C + C ⊆ C. We say that a nonempty
set B ⊂ Y is proper if B ≠ {0} and B ≠ Y. A cone C ⊆ Y induces
a binary relation ≤C by
y1 ≤C y2 :⇐⇒ y1 ∈ y2 − C (⇐⇒ y2 ∈ y1 + C).
see, for example, [28]. If the cone C ⊆ Y is proper (i.e., {0} ≠ C ≠ Y), pointed and convex, then the binary relation ≤C induced by C
is a (partial) order relation (i.e., a binary relation which is reflexive,
transitive and antisymmetric), see, for example, [28]. In the below
definition, we recall a widely used binary relation among nonempty
subsets of Y , namely, the so-called upper set less order relation.
Definition 2.1 (Upper Set Less Order Relation, see [35, 36]). Let
C ⊆ Y be a cone. Then the upper set less order relation is given
for two nonempty sets A,B ⊂ Y as
A ⪯uC B :⇐⇒ A ⊆ B − C.
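As a small illustration (our own, not taken from the references), the upper set less order relation can be checked numerically for finite sets and the ordering cone C = R2+; the following Python sketch encodes the containment A ⊆ B − C as a componentwise dominance test.

# Hedged sketch: upper set less relation for finite sets in R^2 with C = R^2_+.
import numpy as np

def upper_set_less(A, B):
    """Check A ⊆ B - R^k_+ for finite sets given as arrays of row vectors."""
    return all(any(np.all(a <= b) for b in B) for a in A)

A = np.array([[1.0, 2.0], [0.0, 3.0]])
B = np.array([[2.0, 3.0], [1.0, 4.0]])
print(upper_set_less(A, B))   # True:  each point of A is componentwise below some point of B
print(upper_set_less(B, A))   # False: (2, 3) is not dominated by any point of A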
In the following, it will be important to identify minimal elements
of a nonempty subset F of Y .
Definition 2.2 (Minimality). Let F ⊆ Y be a nonempty set and let
α be a binary relation on Y. F ∈ F is a minimal element of F w.r.t. α if
for all G ∈ F : G α F =⇒ F α G.
In the above definition, if instead of Y we consider the power set 2^Y, then
this definition encompasses the usual minimality notion in set
optimization (see Jahn [28, Definition 14.5]). If F ⊆ Y and the
relation α is induced by a convex cone C ⊂ Y , then the definition
describes the standard minimality notion in vector optimization
(compare, for example, [28, Definition 4.1]). Indeed, F ∈ F is a
minimal element of F w.r.t. α if and only if (F − C) ∩ F ⊆ F + C.
Definition 2.3 (Weak Minimality in Vector Optimization). Let
F ⊆ Y and consider the binary relation α =≤C on Y , where C is
a proper closed and convex cone with nonempty topological interior.
Then F ∈ F is called a weakly minimal element of F w.r.t. α if
(F − int(C)) ∩ F = ∅,
where int(C) denotes the topological interior of C. Note that
minimality implies weak minimality.
Now let k be a non-zero element in the real linear topological
space Y . In addition, let B be a nonempty closed proper subset of
Y satisfying the inclusion
B + [0,+∞) · k ⊂ B. (2.3)
Then we recall the functional zB,k : Y → R ∪ {±∞} =: R̄ defined
by
zB,k(y) := inf{t ∈ R|y ∈ tk −B} for all y ∈ Y. (2.4)
By convention, let inf ∅ = +∞. The functional zB,k was originally
introduced as separation functional in vector optimization by
Gerstewitz [18], see also Gerth and Weidner [19], Pascoletti and
Serafini [41] and Gopfert et al. [20]. It is interesting to notice
that the construction in (2.4) was mentioned by Krasnosel’skiı
[34] (see Rubinov [42]) in the context of operator theory. Using
this scalarizing functional we can define the following minimization
problem which will be used later on to represent the concept of
robust optimization.
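As a simple illustration before we proceed (our own example, not taken from the cited references): for Y = R2, B = R2+ and k = (1, 1)T, inclusion (2.3) clearly holds and zB,k(y) = inf{t ∈ R | y ∈ t·(1, 1)T − R2+} = inf{t ∈ R | y1 ≤ t and y2 ≤ t} = max{y1, y2}; for instance, zB,k((3, −1)T) = 3.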
In the following definition, we denote the set of feasible elements
of Y by F .
Definition 2.4. Let ∅ ≠ F ⊆ Y and let zB,k be defined as in (2.4).
An element F ∈ F is a minimal element of F in Y w.r.t. zB,k if
zB,k(F ) ≤ zB,k(G) ∀G ∈ F ,
i.e., F solves the scalar optimization problem
zB,k(F) → inf_{F ∈ F}.    (Pk,B,F)
We remark that many scalarization schemes that are
suggested in the literature are special cases of the above
nonlinear scalarization concept. For example, in the case of
(finite-dimensional) multiobjective optimization, this scalarization
method comprises weighted-sum (see Gass and Saaty [17],
or Zadeh [52]), Tschebyscheff- (Steuer and Choo [46]) and
ε-constraint-scalarizations (Haimes et al. [23]), and many others
(for an overview, see [48]). The functional zB,k possesses various
interesting properties, some of which we collect below in the case
that B is a proper closed convex cone in Y with nonempty interior
and k ∈ intB.
Lemma 2.1 ([20]). Let B be a proper closed convex cone with
nonempty interior in the real linear topological space Y and k ∈ intB. Then zB,k, defined by (2.4), is a finite-valued, continuous,
sublinear, B-monotone (i.e., y1 ∈ y2−B =⇒ zB,k(y1) ≤ zB,k(y2))
and strictly (intB)-monotone (i.e., y1 ∈ y2− intB =⇒ zB,k(y1) <
zB,k(y2)) functional such that
∀y ∈ Y, ∀r ∈ R : zB,k(y) ≤ r ⇐⇒ y ∈ rk −B,
∀y ∈ Y, ∀r ∈ R : zB,k(y) < r ⇐⇒ y ∈ rk − intB.
It is interesting to mention that the functional zB,k has been recently
defined for linear spaces that are not endowed with a topology.
Several properties of zB,k under non-topological assumptions are
studied in [22] and the references therein.
3. Robust Approaches to Uncertain Scalar
Optimization
In this section, we study the problem (P (ξ)) for the case Yf = R,
i.e., we consider scalar optimization problems (P (ξ)) which depend
on uncertain parameters ξ ∈ U ⊆ RN . Thus, for fixed parameters
ξ ∈ U , the problem to be solved is given as
f(x, ξ)→ inf
s.t. Fi(x, ξ) ≤ 0, i = 1, . . . ,m,
x ∈ X,    (P(ξ))
where f : X × U → R, Fi : X × U → R, i = 1, . . . ,m.
Now the question arises how one should handle the family of all
problems ⋃_{ξ∈U} (P(ξ)), denoted by (P(U)). Typically, the problem
(P (U)) is replaced by a deterministic counterpart problem, called
robust counterpart. Now we will formally recall the most prominent
robustness concept from the literature. It has been first mentioned
by Soyster [45] and then formalized and analyzed by Ben-Tal, El
Ghaoui, and Nemirovski in numerous publications, see e.g. [6, 14] for
early contributions and [4] for an extensive collection of results. The
idea is that the worst possible objective function value is minimized
in order to get a solution that is “good enough” even in the worst
case scenario. Furthermore, constraints have to be satisfied for
every scenario ξ∈ U . Then the robust counterpart of the uncertain
optimization problem (P (U)) is defined by
sup_{ξ∈U} f(x, ξ) → inf
s.t. x ∈ A,    (RC)
where A is defined in (2.1). We call a feasible solution of (RC)
robust. The intuition behind this approach is the following: A
risk-averse decision-maker is interested in obtaining robust solutions,
i.e., solutions that hedge against the possibility of the worst case
scenario. Moreover, the given constraints have to be satisfied for
any scenario. Of course, this is an extremely conservative approach that must be handled with great care, since it needs to be ensured that the set A is indeed nonempty. In the literature, there
exist numerous extensions and modifications of this concept (see,
for instance, [29] and the references therein). For example, the
reliably robust counterpart (compare [5]) relaxes the constraints, and
the lightly robust counterpart (see [44]) minimizes upper bounds in
the constraints, and where deviations from the optimal value at a
nominal scenario are allowed. The following two subsections are
devoted to investigating the problem (RC) in case of a finite and
infinite uncertainty set U , respectively.
3.1. Finite Uncertainty Set
In this section, we assume that U := {ξ1, . . . , ξq}, i.e., ξ ∈ U can
take on q different values. This assumption is of particular interest
in practical applications concerning computations, as most data can
only be handled in a discrete manner. We will now show how
the robust counterpart (RC) can be expressed using the nonlinear
scalarizing functional zB,k given by (2.4) under the assumption that
U is finite.
Theorem 3.1 ([29, Theorem 3]). Let Y = Rq. For B := Rq+, where
Rq+ denotes the nonnegative orthant in Rq, k := 1q := (1, . . . , 1)T
and F := {(f(x, ξ1), . . . , f(x, ξq))T |x ∈ A}, problem (Pk,B,F ) is
equivalent to problem (RC) in the following sense:
inf_{F∈F} zB,k(F) = inf_{x∈A} sup_{ξ∈U} f(x, ξ).
Proof. Since B = Rq+ and k ∈ intRq+, (2.3) is fulfilled and then the
functional zB,k is well-defined. The following reformulations hold:
inf_{F∈F} zB,k(F) = inf_{F∈F} inf{t ∈ R | F ∈ tk − B}
                 = inf_{F∈F} inf{t ∈ R | F − tk ∈ −B}
                 = inf_{x∈A} inf{t ∈ R | (f(x, ξ1), . . . , f(x, ξq))T − t · (1, . . . , 1)T ≤Rq+ 0q}
                 = inf_{x∈A} inf{t ∈ R | (f(x, ξ1), . . . , f(x, ξq))T ≤Rq+ t · (1, . . . , 1)T}
                 = inf{ sup_{ξ∈U} f(x, ξ) | x ∈ A },
which completes the proof.
Note that the selection of k = 1q reflects the choice of each
objective function: It means that every scenario ξ ∈ U (or each
objective function f(x, ξ), ξ ∈ U) is treated equally, i.e., no objective
function is preferred to another one.
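The identity in Theorem 3.1 is easy to verify numerically for a toy problem. The following Python sketch (our own illustration, with purely fictitious objective values) computes zB,k of the scenario vector of each candidate, which for B = Rq+ and k = 1q is simply its maximum, and confirms that minimizing it coincides with the min-max robust counterpart (RC).

# Hedged sketch: Theorem 3.1 for a finite uncertainty set with toy data.
import numpy as np

# rows = candidate solutions x, columns = scenarios ξ1..ξq (fictitious objective values)
F = np.array([[3.0, 5.0, 4.0],
              [6.0, 2.0, 1.0],
              [4.0, 4.0, 4.0]])

z = F.max(axis=1)            # zB,k(F(x)) = max_i f(x, ξi) when B = R^q_+ and k = 1_q
print(z.min(), z.argmin())   # inf_x zB,k = 4.0, attained by the third candidate
print(F.max(axis=1).min())   # inf_x sup_ξ f(x, ξ): the same value, as Theorem 3.1 states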
Remark 3.1. Since B = Rq+ is a proper closed convex cone
and k ∈ intRq+, Lemma 2.1 can be applied, and the functional
zB,k is continuous, finite-valued, convex, Rq+-monotone, strictly
(intRq+)-monotone and subadditive.
Remark 3.2. The concept of robustness is described by the
Tschebyscheff scalarization with the origin as reference point as a
special case of functional (2.4). Therefore, Theorem 3.1 verifies
that (RC) can be interpreted as a max-ordering problem as defined
in multiobjective optimization, see [10]. Note that this relation has
already been observed by Kouvelis and Sayin [33, 43], where it was
used to calculate the set of efficient solutions of discrete bicriteria
optimization problems. Additionally, this concept is equivalent to
a reference point approach of Wierzbicki [49] using the origin as
reference point, and in the case that f(x, ξ) ≥ 0 for all ξ ∈ U and
x ∈ A, to a weighted Tschebyscheff scalarization, see Steuer and
Choo [46].
The preceding result has shown that the problem (RC) can be
regarded as a scalarized problem of a multiobjective optimization
problem, where every scenario ξl ∈ U , l = 1, . . . , q, yields
its own objective function hl(x) := f(x, ξl), with h : A → Rq and h := (h1, . . . , hq)T. Therefore, it would be quite
natural to consider the multiobjective optimization problem (as a
deterministic multiobjective counterpart problem) in more detail.
The multiobjective robust counterpart to (P (U)) is defined by
h(x) → inf_{x∈A},    (RC′)
where F = h[A] = ∪x∈Ah(x) (see (2.1) for the definition of A).
Using Theorem 3.1 together with Lemma 2.1, we can conclude that
problem (Pk,B,F ) is a scalarization of the multiobjective counterpart
(RC′), and the following corollary holds due to the monotonicity
properties of the functional zB,k.
Corollary 3.1 ([29, Corollary 4]). Let Y , B, k and F be given as in
Theorem 3.1, C := Rq+ and let α denote the order relation induced
by C. Then for a given F ∈ F , we have the following implications:
[∀G ∈ F \ {F} : zB,k(F ) < zB,k(G)]⇒ F is a minimal element
of F w.r.t. α,
[∀G ∈ F : zB,k(F ) ≤ zB,k(G)]⇒ F is a weakly minimal element
of F w.r.t. α.
It is shown in [29] that several different kinds of robust
counterpart problems known from the literature can be obtained
by considering the problem (Pk,B,F ) (as scalarization of (RC′))
with different input parameters k,B and F . Additionally, it is
interesting to mention that it is possible to include the constraints
Fi, i = 1, . . . ,m as objective functions in the objective vector h.
In this way, more concepts of robustness can be represented and
further evaluated (for example, reliable and light robustness, see
[29]). Moreover, depending on a decision-maker’s preferences, it is
now possible to find completely new concepts of robustness (i.e.,
different robust counterpart problems) by modifying the involved
parameters k and B.
3.2. Infinite Uncertainty Set
If U = {ξ1, . . . , ξq} is finite, each scenario can be interpreted as
an objective function, as we have seen in Section 3.1. For a robust
solution x ∈ A, we then obtain a vector Fx ∈ Rq which contains
f(x, ξi) in its ith coordinate. In order to compare two solutions
x and y, order relations for the vectors Fx and Fy are used. In
this way, many concepts of robust optimization and of stochastic
programming can be characterized using multiobjective counterpart
problems, see [29]. If U is not a finite set, we obtain not vectors
but functions, i.e., Fx : U → R, where Fx(ξ) = f(x, ξ) contains
the objective function value of x in scenario ξ, ξ ∈ U . In order to
compare two solutions x and y, we hence need order relations in the
real linear space RU of all mappings F : U → R. Throughout this
subsection, we assume that U is not necessarily a finite set. In this
case, we propose three different approaches to the problem (RC):
• the vector approach,
• the set approach,
• and the nonlinear scalarization approach.
The idea of using these three approaches to dealing with problem
(RC) stems from [30], and most of the results presented within this
subsection are taken from [30]. We start by describing the vector
approach. Let Y = RU be the space of all functions F : U → R. For
a fixed solution x ∈ A, we define
Fx ∈ Y : Fx(ξ) := f(x, ξ).
In order to compare elements of Y , we consider different order
relations on the space Y which are denoted by α. In the context of
vector optimization, (partial) order relations are the binary relations
≤C induced by pointed convex cones.
Such a cone C induces an order relation α :=≤C by
y1 ≤C y2 :⇐⇒ y1 ∈ y2 − C (⇐⇒ y2 ∈ y1 + C).
Whenever we are working with the interior of an ordering cone, we
assume that Y = C(U ,R), i.e., that the functions Fx = f(x, ξ)
are continuous in ξ for all feasible values of x. A particular order
relation, which will be of interest later, is given in the next definition.
Definition 3.1. The natural order relation α is given by the cone
Y + := {F ∈ Y |∀ξ ∈ U : F (ξ) ≥ 0}
inducing for all F,G ∈ Y that
F α G ⇐⇒ G ∈ F + Y +
⇐⇒ F (ξ) ≤ G(ξ) for all ξ ∈ U .
Given an order relation α and a set F ⊆ Y , the vector
optimization problem asks for minimal elements of F w.r.t.
α. It is shown in [30] that various concepts for uncertain
optimization can be interpreted as solving such a vector optimization
problem, and conversely, every order α induces a concept for
handling uncertainty. While not all such concepts necessarily
have a meaningful interpretation in the context of uncertain
optimization, this relationship provides a coherent means of devising
and understanding deterministic counterparts of an uncertain
optimization problem. For a systematic approach to different
concepts for handling uncertainty in the context of vector and set
optimization, we refer to [30].
Remark 3.3. In the case of the natural order relation α of Y
introduced in Definition 3.1, an element F ∈ F is a minimal element
of F w.r.t. α if and only if ∄ G ∈ F \ {F} such that
∀ξ ∈ U : (G − F)(ξ) ≤ 0,
or, in equivalent terms, if and only if ∄ G ∈ F such that
∀ξ ∈ U : (G − F)(ξ) ≤ 0 and ∃ ξ ∈ U : (G − F)(ξ) < 0.
If Y = C(U ,R), then int(Y +) = {F ∈ Y |∀ξ ∈ U : F (ξ) > 0} (see
Jahn [28] and Winkler [50]), and an element F ∈ F is a weakly
minimal element of F w.r.t. α if and only if
∄ G ∈ F : (G − F)(ξ) < 0  ∀ξ ∈ U.    (3.1)
The robust counterpart (RC) can be formulated as a vector
optimization problem in the space Y = RU as follows. We denote
the set of robust outcome functions in Y by
F := {Fx ∈ Y | x ∈ A},
where A is defined in (2.1). Let two functions Fx, Fy ∈ Y be given.
We consider the following order relation on Y :
Fx αsup Fy :⇐⇒ sup_{ξ∈U} Fx(ξ) ≤ sup_{ξ∈U} Fy(ξ).
As in the finite dimensional case, the sup-order relation αsup is
not compatible with addition, i.e., for three elements Fx, Fy, Fz ∈ Y, Fx αsup Fy does not necessarily imply (Fx + Fz) αsup (Fy + Fz).
Consequently, αsup cannot be represented by an ordering cone.
Nevertheless, it has the following properties.
Remark 3.4. αsup is reflexive and transitive. Furthermore, αsup is
a total preorder.
The following theorem shows that the order relation αsup makes it possible to represent the robust optimization problem (RC) as a vector optimization problem.
Theorem 3.2 ([30, Theorem 1]). A solution x ∈ A is an optimal
solution to (RC) if and only if Fx is a minimal element of F w.r.t.
the sup-order relation αsup.
Proof. Let x ∈ A. Then
x is an optimal solution to (RC) ⇔ sup_{ξ∈U} f(x, ξ) ≤ sup_{ξ∈U} f(x′, ξ) for all x′ ∈ A
⇔ sup_{ξ∈U} Fx(ξ) ≤ sup_{ξ∈U} Fx′(ξ) for all x′ ∈ A
⇔ Fx αsup Fx′ for all x′ ∈ A
⇔ Fx αsup G for all G ∈ F,
and the result follows since αsup is a total preorder.
This means that optimal solutions of the robust
counterpart (RC) correspond to outcome functions whose suprema
are minimal.
We now analyze the relation between the sup-order relation αsup
and the natural order relation α introduced in Definition 3.1.
Remark 3.5. F α G =⇒ F αsup G for all F,G ∈ Y .
The implication stated in Remark 3.5 does not generally imply
that every minimal element w.r.t. αsup is also a minimal element
w.r.t. α, or vice versa. When there are two scenarios ξ1, ξ2 and
under some additional assumptions, Iancu and Trichakis [25] have
shown that there exist optimal solutions to (RC) which are minimal
w.r.t. C = R2+, and call them PRO robust solutions. However, in
this general setting, we are able to formulate the following relation
between minimal elements.
Lemma 3.1 ([30, Lemma 2]). Let Y = C(U ,R). If F ∈ F is
a minimal element of F w.r.t. αsup, then F is a weakly minimal
element of F w.r.t. the natural order relation α.
Proof. Let F ∈ F be a minimal element of F w.r.t. αsup. Since
αsup is a total preorder, this means that
sup_{ξ∈U} F(ξ) ≤ sup_{ξ∈U} G(ξ) for all G ∈ F.    (3.2)
Now suppose that F is not a weakly minimal element of F w.r.t.
the natural order relation α of Y . Thus, there exists G ∈ F s.t.
∀ ξ ∈ U : G(ξ) < F (ξ),
see (3.1). Since U was assumed to be compact, G attains its supremum on U at some ξ0 ∈ U. This means that
sup_{ξ∈U} G(ξ) = G(ξ0) < F(ξ0) ≤ sup_{ξ∈U} F(ξ),
a contradiction to (3.2).
Using this relation together with Theorem 3.2, we obtain that
Fx is a weakly minimal element w.r.t. the natural order relation α,
for all optimal solutions x to (RC).
Corollary 3.2 ([30, Corollary 1]). Let Y = C(U ,R) and let the
worst case be attained for every solution x ∈ A. Then for every
optimal solution x to the robust counterpart (RC), Fx is a weakly
minimal element of the set of robust outcome functions F w.r.t. the
natural order relation α in Y .
Now we will consider the problem (RC) by using the set
approach. In particular, we will show that it is possible to interpret
the robust counterpart (RC) as a set-valued optimization problem.
Let the power set of Yf = R be denoted by Z := 2^R. Furthermore,
we define for each x ∈ A
Bx := fU (x) := {f(x, ξ) | ξ ∈ U} ⊆ R.
We denote the set of robust outcome sets in Z by
B := {Bx ∈ Z| x ∈ A}.
Let R+ denote the set of nonnegative real numbers. For Bx, By ∈ Z,
the upper-type set-relation βsup is defined as
Bx βsup By :⇐⇒ Bx ⊆ By − R+
           ⇐⇒ sup Bx ≤ sup By,
see Kuroiwa [35, 36] and Kuroiwa et al. [39].
Remark 3.6. βsup is reflexive and transitive. Furthermore, it is a
total preorder.
We obtain the following relation between βsup and αsup.
Lemma 3.2 ([30, Lemma 3]). Let x, y ∈ A and let Fx, Fy be their corresponding outcome functions in Y and Bx, By their corresponding outcome sets in Z. Then
Bx βsup By ⇐⇒ Fx αsup Fy.
Proof.
Bx βsup By ⇐⇒ sup Bx ≤ sup By
           ⇐⇒ sup{Fx(ξ) | ξ ∈ U} ≤ sup{Fy(ξ) | ξ ∈ U}
           ⇐⇒ Fx αsup Fy,
and the proof finishes.
The order relation βsup makes it possible to represent the robust optimization problem (RC) as a set-valued optimization problem, as the next theorem verifies.
Theorem 3.3 ([30, Theorem 2]). A solution x ∈ A is an optimal
solution to (RC) if and only if Bx is a minimal element of B w.r.t.
the order relation βsup.
Proof. We know from Theorem 3.2 that x ∈ A is an optimal solution to (RC) if and only if Fx αsup Fx′ for all x′ ∈ A. According to Lemma 3.2 this is equivalent to Bx βsup Bx′ for all x′ ∈ A, and the result follows.
We finally represent the robust counterpart (RC) using the
nonlinear scalarizing functional (2.4).
Theorem 3.4 ([30, Theorem 3]). Let Y = RU , B := Y +, and
k :≡ 1 ∈ Y . Then x ∈ A is an optimal solution to (RC) if and only
if Fx solves problem (Pk,B,F ).
Proof. B + [0,+∞) · k ⊂ B holds, thus inclusion (2.3) is satisfied
and the functional zB,k can be defined. Furthermore, we have
zB,k(Fx) = inf{t ∈ R | Fx ∈ tk − B}
         = inf{t ∈ R | Fx − tk ∈ −Y +}
         = inf{t ∈ R | ∀ξ ∈ U : Fx(ξ) ≤ t}
         = sup_{ξ∈U} f(x, ξ).
Thus, Fx is a solution for (Pk,B,F ) if and only if x ∈ A minimizes
supξ∈U f(x, ξ), i.e., if and only if x is an optimal solution to (RC).
Remark 3.7. If Y = C(U ,R) and k ∈ int(Y +), we have the
following properties. Since B = Y + is a proper closed convex cone
and k ∈ int(Y +), Lemma 2.1 implies that the functional zB,k is
continuous, finite-valued, Y +-monotone, strictly (intY +)-monotone
and sublinear, and
∀ F ∈ Y, ∀ t ∈ R : zB,k(F ) ≤ t ⇐⇒ F ∈ tk − Y +,
∀ F ∈ Y, ∀ t ∈ R : zB,k(F ) < t ⇐⇒ F ∈ tk − int(Y +).
Note that in the special case of a discrete uncertainty set U =
{ξ1, . . . , ξq}, Theorem 3.4 simplifies to Theorem 3.1.
4. Robust Approaches to Uncertain Vector
Optimization
This section is devoted to developing solution concepts for uncertain vector optimization problems; specifically, our goal is to obtain robust solutions. Only a few approaches to uncertain
vector optimization have been mentioned in the literature, of which
we briefly summarize the following. Hughes [24] presented a first
concept of dealing with uncertain multiobjective optimization by
computing the expected value of the errors that occur in the
objective functions. The vector of expected errors is then used in
the classical concept of Pareto optimality. Teich [47] generalized the
concept of Pareto optimality in a probabilistic nature for uncertain
vector-valued problems where the objective values are constrained
by intervals. Another idea was presented by Li et al. [40]
who develop solution procedures that compare the performance of
solutions regarding optimality and its robustness. They propose
a biobjective optimization problem, one of the objective functions
being a fitness value and the other one containing a robustness
index. The considered method in [40] may be beneficial for obtaining
solutions that satisfy certain optimality and robustness criteria, and
a decision maker may choose depending on his preferences toward
uncertainty. Another approach was presented by Deb and Gupta
[9] who used an idea by Branke [8], and defined robustness as
a kind of sensitivity against perturbations in the decision-space.
Branke [8] proposes to replace the objective function f by a mean function which maps any point x to the average value of f in a pre-defined neighborhood of x. A minimizer of this mean function is then more
robust in the sense that the function values in its neighborhood
do not change too much. Based on this idea for single objective
optimization problems, Deb and Gupta [9] introduced two concepts
of robustness for vector-valued optimization problems. The first one
replaces all objective functions by their mean functions. Efficient
solutions to the resulting optimization problem are called robust
solutions of the original problem. Deb and Gupta’s second concept
minimizes the original objective functions but adds constraints to
the problem that restrict the variation between the original objective
functions and a perturbed function value (that can be chosen as their
mean functions) to a pre-defined limit. This approach proves to be
more pragmatic and enables the user to control the desired level of
robustness.
Barrico and Antunes [2, 3] consider a multiobjective optimization
problem with perturbations in the decision space. In [2, 3], a solution
is called robust if small perturbations in the decision-space only yield
small perturbations in the objective-space. The authors in [2, 3]
define a degree of robustness that allows the decision maker to specify
the level of robustness of the solution. Specifically, the user is able to
determine the size of the neighborhood that the solution belongs to.
Furthermore, Barrico and Antunes [1] extend the concept of degree of
robustness to the space of the objective function coefficients, where
perturbations are treated in a similar manner as in [2, 3]. For more
results on this line of research, compare [15, 21].
The first scenario-based approach to uncertain vector-valued
problems was introduced by Kuroiwa and Lee [37] who directly
transferred the main idea of scalar robust optimization, meaning
minimizing the worst-case objective function, to a multicriteria
setting. For (P (U)), i.e., for the family of deterministic vector
optimization problems (P (ξ)), and Yf = Rk, Kuroiwa and Lee [37]
introduce a multiobjective problem
h(x) → inf_{x∈X}    (4.1)
with
h(x) := ( sup_{ξ∈U1} f1(x, ξ), . . . , sup_{ξ∈Uk} fk(x, ξ) )T,
fi : Rn × Ui → R, i = 1, . . . , k, and X ⊆ Rn. The authors in [37]
call (weakly) minimal elements of the set ∪x∈Xh(x) (weakly) robust
efficient. The special case for convex functions fi, i = 1, . . . , k,
is studied in [38]. This approach is a rather direct transferral
from scalar robust optimization. For some cases, this concept
may, however, not be sufficient to describe robust solutions of
multiobjective optimization problems, as the point h(x) may never
be attained (if one considers the sets fU (x) given in (2.2)), but
solutions are compared w.r.t. the point h(x). Problem (4.1) still
is beneficial and was recently used by Ehrgott et al. [11] to obtain
solutions that they call robust in a slightly different setting. The
authors in [11] generalize the above approach from Kuroiwa and Lee
[37] by considering the whole set that is obtained when analyzing
a possible solution x. They call a solution x0 robust efficient if its
set fU (x0) is not dominated by any other set fU (x). The authors
in [11] observe that (weakly) minimal elements of the set ∪x∈Xh(x)
(related to the above problem (4.1)) are also (weakly) robust efficient
solutions within their definition of robust efficiency, and the reverse
implication holds under the requirement that the uncertainty set
takes the form U := U1 × . . . × Uk, i.e., if the uncertainties are
independent of each other. The robustness concept introduced in
[11] implicitly uses a set order relation to compare solution sets.
We will show in this section that this approach is closely connected
to set optimization, because the objective map considered here is
set-valued.
In the literature, two main ways of treating a set-valued
optimization problem are reported: Using a vector concept, one
wishes to obtain single elements that satisfy a certain minimality
condition (possibly similar to Pareto minimality) for the union of all
sets in the objective space. Since having one element that is optimal
in some sense does not reveal any information about the performance
of the remaining elements in that particular solution set, it can be
argued that this approach may not be useful enough in practical
applications. The second concept deals with obtaining solution sets
out of all possible sets in the objective space. The authors in [11]
use the latter approach to define robust solutions to an uncertain
multiobjective optimization problem.
In this section, we consider the family of deterministic vector
optimization problems (P (ξ)), denoted as (P (U)). Let A be defined
as in (2.1) and let C be a convex cone in the objective space Yf ,
which is assumed to be a real linear topological space.
Definition 4.1. A solution x0 ∈ A of (P(U)) is called robust if there is no x ∈ A \ {x0} such that fU(x) ⪯uC fU(x0), which is equivalent to
∄ x ∈ A \ {x0} : fU(x) ⊆ fU(x0) − C.
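For finite uncertainty sets and C = Rk+, Definition 4.1 can be checked directly, as the following minimal Python sketch (our own illustration with toy outcome sets) shows: x0 is robust precisely when no other candidate has all of its outcome vectors componentwise dominated by outcome vectors of x0.

# Hedged sketch: robustness in the sense of Definition 4.1 for finite scenario sets, C = R^k_+.
import numpy as np

def dominated_setwise(FX, FX0):
    """fU(x) ⊆ fU(x0) - R^k_+ for finite outcome sets stored as arrays of row vectors."""
    return all(any(np.all(y <= y0) for y0 in FX0) for y in FX)

def is_robust(candidate, outcome_sets):
    """outcome_sets maps each feasible x to its (finite) outcome set fU(x)."""
    FX0 = outcome_sets[candidate]
    return not any(dominated_setwise(FX, FX0)
                   for x, FX in outcome_sets.items() if x != candidate)

outcomes = {                        # two objectives, two scenarios per candidate (toy data)
    "x0": np.array([[2.0, 2.0], [3.0, 1.0]]),
    "x1": np.array([[1.0, 1.5], [2.5, 0.5]]),
}
print(is_robust("x0", outcomes))    # False: every outcome of x1 is dominated by some outcome of x0
print(is_robust("x1", outcomes))    # True in this toy example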
For the special case Yf = Rk, X = Rn, C = Rk+ and |U| = 1,
i.e., in the deterministic multiobjective case, Definition 4.1 coincides
with the definition of strict minimality (compare [10, Definition
2.24]). Accordingly, one can define weaker notions of robustness,
as it is done in [27, Definition 6], which we skip here for the sake
of brevity. Moreover, if Yf = R, X = Rn, C = R+, then the above
notion of robustness reduces to the classical one given in (RC) for
unique solutions, meaning that x0∈ A is a robust solution of (P (U))
if and only if for all x ∈ A \ {x0}, it holds that supξ∈U f(x, ξ) >
supξ∈U f(x0, ξ). The following scalarization result gives a sufficient
condition of robustness for a feasible element x ∈ A.
Theorem 4.1 ([27, Theorem 1]). Let y∗ ∈ C∗ \ {0} be given. If for
some x0 ∈ A
sup_{ξ∈U} y∗(f(x0, ξ)) < sup_{ξ∈U} y∗(f(x, ξ)),  ∀x ∈ A \ {x0}    (4.2)
holds true, then x0 is robust for (P (U)).
Proof. Suppose to the contrary that x0 is not robust. Then there
exists an element x ∈ A \ {x0} such that
fU (x) ⊆ fU (x0)− C.
This implies
∀ ξ ∈ U ∃ η ∈ U : f(x, ξ) ∈ f(x0, η)− C.
Choose now y∗ ∈ C∗ \ {0}. This implies
∀ ξ ∈ U ∃ η ∈ U : y∗(f(x, ξ)) ≤ y∗(f(x0, η))
=⇒ ∀ ξ ∈ U : y∗(f(x, ξ)) ≤ sup_{η∈U} y∗(f(x0, η))
=⇒ sup_{ξ∈U} y∗(f(x, ξ)) ≤ sup_{η∈U} y∗(f(x0, η)).
But this is a contradiction to (4.2).
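The sufficient condition of Theorem 4.1 also lends itself to a quick numerical check when U is finite and C = Rk+. In the following sketch (our own toy example), y∗ is a nonnegative weight vector in C∗ \ {0}, and a candidate that strictly minimizes the worst-case weighted objective is certified as robust.

# Hedged sketch: the scalarized sufficient condition (4.2) with a finite uncertainty set.
import numpy as np

y_star = np.array([0.5, 0.5])               # y* ∈ C* \ {0} for C = R^2_+ (an arbitrary choice)

outcomes = {                                 # fU(x) for each candidate, rows = scenarios (toy data)
    "x0": np.array([[2.0, 2.0], [3.0, 1.0]]),
    "x1": np.array([[1.0, 4.0], [2.5, 3.5]]),
    "x2": np.array([[3.0, 2.5], [2.0, 4.0]]),
}

worst = {x: (FX @ y_star).max() for x, FX in outcomes.items()}   # sup_ξ y*(f(x, ξ))
print(worst)                                 # {'x0': 2.0, 'x1': 3.0, 'x2': 3.0}
best = min(worst, key=worst.get)
strict = all(worst[best] < v for x, v in worst.items() if x != best)
print(best, strict)                          # x0 strictly minimizes, hence x0 is robust by Theorem 4.1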
Under convexity and closedness assumptions of the set fU (x0)−C, it is possible to derive the converse statement of the implication
given in Theorem 4.1. The following result is a particular case of
[32, Theorem 3.2], where the cone C is not fixed, but depends on the
decision variable. The following theorem requires the objective space
Yf to be locally convex, where local convexity of a real topological
linear space is given in [28, Definition 1.33].
Theorem 4.2. Assume that the objective space Yf is locally convex.
Suppose that x0 is robust and that the set fU (x0)− C is closed and
convex. Then there does not exist an element x ∈ A \ {x0} such that
for every y∗ ∈ C∗
sup_{ξ∈U} y∗(f(x, ξ)) ≤ sup_{ξ∈U} y∗(f(x0, ξ)).
Proof. Assume that x0 ∈ A is robust. This is equivalent to
∄ x ∈ A \ {x0} : fU(x) ⊆ fU(x0) − C
⇐⇒ ∀x ∈ A \ {x0} : fU(x) ⊈ fU(x0) − C
⇐⇒ ∀x ∈ A \ {x0} ∃ ξx ∈ U : f(x, ξx) ∉ fU(x0) − C.
Since fU (x0)−C is closed and convex, we use a classical separation
argument (see, for instance, [28, Theorem 3.18]) such that we get
∀x ∈ A \ {x0} ∃ ξx ∈ U, ∃ y∗ ∈ Yf∗ \ {0}, α ∈ R :
y∗(f(x, ξx)) > α ≥ y∗(y)  ∀y ∈ fU(x0) − C,    (4.3)
and this yields
∀x ∈ A \ {x0} ∃ y∗ ∈ Yf∗ \ {0}, α ∈ R :
sup_{ξ∈U} y∗(f(x, ξ)) > α ≥ sup_{y∈fU(x0)−C} y∗(y).
To show that y∗ ∈ C∗, suppose that y∗ /∈ C∗, which means that
there is c ∈ C such that y∗(c) < 0. With (4.3), we obtain for any
ξ ∈ U, c ∈ C and some λ ≥ 0
α ≥ y∗(f(x0, ξ) − λc) = y∗(f(x0, ξ)) − λ y∗(c) → +∞ as λ → +∞,
a contradiction. Furthermore,
sup_{y∈fU(x0)−C} y∗(y) = sup_{ξ∈U} y∗(f(x0, ξ)) + sup_{c∈−C} y∗(c) = sup_{ξ∈U} y∗(f(x0, ξ)).
Altogether, we conclude with
∀x ∈ A \ {x0} ∃ y∗ ∈ C∗ : sup_{ξ∈U} y∗(f(x, ξ)) > sup_{ξ∈U} y∗(f(x0, ξ)),
which is equivalent to
∄ x ∈ A \ {x0} ∀y∗ ∈ C∗ : sup_{ξ∈U} y∗(f(x, ξ)) ≤ sup_{ξ∈U} y∗(f(x0, ξ)),
which completes the proof.
Remark 4.1. It is interesting to mention that Ehrgott et al. [11]
propose a vectorization approach for computing robust solutions for
the special case Yf = Rk, X = Rn, C = Rk+, i.e., they reduce
the problem (P (U)) to the vector optimization problem (4.1) (called
objective-wise worst case).
In [31], it is shown that the nonlinear scalarizing functional zB,k
(see (2.4)) can be used to characterize several set order relations.
From [31, Theorem 3.3], we obtain the following result.
Corollary 4.1. Let x0 be given and suppose that there exists some k ∈ C \ {0} such that for all x ∈ A \ {x0}, inf_{y0∈fU(x0)} zC,k(y − y0) is attained for all y ∈ fU(x). Then x0 is robust for (P(U)) if and
only if
∄ x ∈ A \ {x0} : sup_{y∈fU(x)} inf_{y0∈fU(x0)} zC,k(y − y0) ≤ 0.
5. Conclusions
This paper gives an overview of robust approaches to uncertain scalar and vector-valued optimization problems. In
robust optimization, one traditionally hedges against perturbations
in the worst-case scenarios. Robust solutions are then immunized
against perturbations, and thus this approach is applicable if a
decision maker acts risk averse. In uncertain vector optimization,
this situation can be modeled by using the upper set less order
relation. This paper explores this concept and gives some
scalarization results. An interesting topic that is presently given
a lot of attention in the literature (see [12, 13, 32]) is a deeper
analysis of the ordering structure. Moreover, based on the
proposed scalarization techniques, it is now possible to derive
efficient algorithms for finding robust solutions of uncertain vector
optimization problems.
Acknowledgements
The author expresses her gratitude to the two anonymous
reviewers for their helpful suggestions which helped to improve the
manuscript significantly.
References
[1] Barrico C., and Antunes C.H. (2006). Robustness analysis
in evolutionary multiobjective optimization - with a case
study in electrical distribution networks. Presented at the II
European-Latin-American Workshop on Engineering Systems
(SELASI II), Porto, Portugal.
[2] Barrico C., and Antunes C.H. (2006). Robustness analysis
in multiobjective optimization using a degree of robustness
concept. In IEEE Congress on Evolutionary Computation (CEC
2006), pages 1887–1892. IEEE Computer Society.
[3] Barrico C., and Antunes C.H. (2006). A new approach to
robustness analysis in multi-objective optimization. Proceedings
of the 7th International Conference on Multi-Objective
Programming and Goal Programming (MOPGP), Loire Valley
(Tours), France.
[4] Ben-Tal A., El Ghaoui L., and Nemirovski A. (2009). Robust
Optimization, Princeton University Press, Princeton.
[5] Ben-Tal A., and Nemirovski A. (2000). Robust solutions of
linear programming problems contaminated with uncertain
data, Math. Program., 88, 411–424.
[6] Ben-Tal A., and Nemirovski A. (1998). Robust convex
optimization, Math. Oper. Res., 23(4), 769–805.
[7] Bertsimas D., and Sim, M. (2004). The price of robustness,
Oper. Res., 52(1), 35–53.
[8] Branke J. (1998). Creating robust solutions by means of
evolutionary algorithms. In E.A. Eiben, T. Back, M. Schenauer,
and H.-P. Schwefel, editors, Parallel Problem Solving from
Nature – PPSNV, volume 1498 of Lecture Notes in Computer
Science, pages 119–128. Springer, Berlin, Heidelberg.
[9] Deb K., and Gupta H. (2006). Introducing robustness in
multiobjective optimization, Evol. Comput., 14, 463–494.
[10] Ehrgott M. (2005). Multicriteria Optimization, Springer, New
York.
[11] Ehrgott M., Ide J., and Schobel A. (2014). Minmax robustness
for multi-objective optimization problems, European J. Oper.
Res., 239(1), 17–31.
[12] Eichfelder G., and Pilecka, M. (2016). Set approach for set
optimization with variable ordering structures Part I: Set
relations and relationship to vector approach, J. Optim. Theory
Appl., 171(3), 931–946.
[13] Eichfelder G., and Pilecka M. (2016). Set approach for
set optimization with variable ordering structures Part II:
Scalarization approaches, J. Optim. Theory Appl., 171(3),
947–963.
[14] El Ghaoui L., and Lebret H. (1997). Robust solutions to
least-squares problems with uncertain data, SIAM J. Matrix
Anal. Appl., 18, 1034–1064.
[15] Erfani T., and Utyuzhnikov S. (2012). Control of robust design
in multiobjective optimization under uncertainties, Struct.
Multidiscip. Optim., 45, 247–256.
[16] Fischetti M., Salvagnin D., and Zanette A. (2009). Fast
approaches to improve the robustness of a railway timetable,
Transportation Sci., 43(3), 321–335.
[17] Gass S., and Saaty T. (1955). The computational algorithm for the parametric objective function, Naval Res. Logistics Quarterly, 2, 39–45.
[18] Gerstewitz (Tammer) Chr. (1983). Nichtkonvexe Dualitat in der
Vektoroptimierung, Wiss. Zeitschr. TH Leuna-Merseburg, 25,
357–364.
[19] Gerth (Tammer) Chr., and Weidner P. (1990). Nonconvex
separation theorems and some applications in vector
optimization, J. Optim. Theory Appl., 67, 297–320.
[20] Gopfert A., Riahi H., Tammer Chr., and Zalinescu C. (2003).
Variational Methods in Partially Ordered Spaces, CMS Books
in Mathematics, Springer, New York.
[21] Gunawan S., and Azarm S. (2005). Multi-objective robust
optimization using a sensitivity region concept, Struct.
Multidiscip. Optim., 29, 50–60.
[22] Gutierrez C., Novo V., Rodenas-Pedregosa J.L., and Tanaka
T. (2016). Nonconvex separation functional in linear spaces
with applications to vector equilibria, SIAM J. Optim., 26,
2677–2695.
[23] Haimes Y., Lasdon L.S., and Wismer D.A. (1971). On a
bicriterion formulation of the problems of integrated system
identification and system optimization, IEEE Trans. Syst.,
Man, Cybern., Syst., 1, 296–297.
[24] Hughes E.J. (2001). Evolutionary multi-objective ranking with
uncertainty and noise. Proceedings of the First International
Conference on Evolutionary Multi-Criterion Optimization
(EMO-2001), 329–343.
[25] Iancu D.A., and Trichakis N. (2014). Pareto efficiency in robust
optimization, Manag. Sci., 60, 130–147.
[26] Ide J., and Kobis E. (2014). Concepts of efficiency for uncertain
multi-objective optimization problems based on set order
relations, Math. Method Oper. Res., 80, 99–127.
[27] Ide J., Kobis E., Kuroiwa D., Schobel A., and Tammer
Chr. (2014). The relationship between multicriteria robustness
concepts and set-valued optimization, Fixed Point Theory Appl.
DOI: 10.1186/1687-1812-2014-83.
[28] Jahn J. (2011). Vector Optimization - Introduction, Theory, and
Extensions, Springer, Berlin, Heidelberg.
[29] Klamroth K., Kobis E., Schobel A., and Tammer Chr. (2013).
A unified approach for different concepts of robustness and
stochastic programming via nonlinear scalarizing functionals,
Optimization, 62(5), 649–671.
[30] Klamroth K., Kobis E., Schobel A., and Tammer Chr. (2017). A
unified approach to uncertain optimization, European J. Oper.
Res., 260, 403–420.
[31] Kobis E., and Kobis M.A. (2016). Treatment of set order
relations by means of a nonlinear scalarization functional: A
full characterization, Optimization, 65(10), 1805–1827.
[32] Kobis E., and Tammer Chr. (2017). Robust vector optimization
with a variable domination structure, Carpathian J. Math.,
33(3), 343-351.
[33] Kouvelis P., and Sayin S. (2006). Algorithm robust for the
bicriteria discrete optimization problem, Ann. Oper. Res., 147,
71–85.
[34] Krasnosel’skiı M.A. (1964). Positive solutions of operator
equations. Translated from the Russian by Richard E. Flaherty;
edited by Leo F. Boron. P. Noordhoff Ltd. Groningen.
[35] Kuroiwa D. (1999). Some duality theorems of set-valued
optimization with natural criteria. In Proceedings of the
International Conference on Nonlinear Analysis and Convex
Analysis. World Scientific, 221–228.
[36] Kuroiwa D. (1997). The natural criteria in set-valued
optimization, Surikaisekikenkyusho Kokyuroku, 1031:85–90,
Research on nonlinear analysis and convex analysis, Kyoto.
[37] Kuroiwa D., and Lee G. M. (2012). On robust multiobjective
optimization, Vietnam J. Math., 40(2&3), 305–317
[38] Kuroiwa D., and Lee G. M. (2014). On robust convex
multiobjective optimization, J. Nonlinear Convex Anal., 15,
1125–1136.
[39] Kuroiwa D., Tanaka T., and Duc Ha T.X. (1997). On
cone convexity of set-valued maps, Nonlinear Anal., 30(3),
1487–1496.
[40] Li M., Azarm S., and Aute V. (2005). A multi-objective
genetic algorithm for robust design optimization. In Proceedings
of the Genetic and Evolutionary Computation Conference
(GECCO’05), 771–778.
[41] Pascoletti A., and Serafini P. (1984). Scalarizing vector
optimization problems, J. Optim. Theory Appl., 42, 499–524.
[42] Rubinov A.M. (1977). Sublinear operators and their
applications, Uspehi Mat. Nauk, 32(4(196)), 113–174.
[43] Sayin S., and Kouvelis P. (2005). The multiobjective
discrete optimization problem: A weighted min-max two-stage
optimization approach and a bicriteria algorithm, Manag. Sci.,
51, 1572–1581.
[44] Schobel A. (2014). Generalized light robustness and the
trade-off between robustness and nominal quality, Math.
Methods Oper. Res., 80(2), 161–191.
[45] Soyster A.L. (1973). Convex programming with set-inclusive
constraints and applications to inexact linear programming,
Oper. Res., 21, 1154–1157.
[46] Steuer R.E., and Choo E.U. (1983). An interactive weighted
Tchebycheff procedure for multiple objective programming,
Math. Program., 26, 326–344.
[47] Teich J. (2001). Pareto-front exploration with uncertain
objectives. In Proceedings of the First International Conference
on Evolutionary Multi-Criterion Optimization (EMO-2001),
314–328.
[48] Weidner P. (1990). Ein Trennungskonzept und seine
Anwendung auf Vektoroptimierungsverfahren. Habilitation
thesis, Martin-Luther-University Halle-Wittenberg.
[49] Wierzbicki A.P. (1986). On the completeness and
constructiveness of parametric characterizations to vector
optimization problems, OR Spectrum, 8, 73–87.
[50] Winkler K. (2003). Aspekte Mehrkriterieller Optimierung
C(T )-wertiger Abbildungen. Dissertation thesis,
Martin-Luther-University Halle-Wittenberg.
[51] Yan Y, Meng Q., Wang S., and Guo X. (2012). Robust
optimization model of schedule design for a fixed bus route,
Transp. Res. Part C: Emerg. Technol., 25, 113–121.
[52] Zadeh L. (1963). Optimality and non-scalar-valued performance
criteria, IEEE Trans. Automat. Control, 8, 59–60.
About the author
Elisabeth Köbis is a postdoctoral researcher at the Institute of Mathematics in Halle, and her research interests include robust optimization, vector optimization and set optimization.
Boletín de Estadística e Investigación Operativa, Vol. 33, No. 3, Noviembre 2017, pp. 258-275

Estadística Oficial
Applying the Generic Statistical Business Process
Model (GSBPM) to the Business Register; the
Spanish experience
Luis Esteban Barbado Miguel
Department of Methodology of the statistical production
National Statistical Institute
Abstract
The Generic Statistical Business Process Model (GSBPM)
is a reference framework to describe the statistical processes
in a coherent way, making them comparable within and
between different Organizations. The application of the
GSBPM to the management of the Spanish Business Register
was carried out by the NSI during 2015. This paper provides
a first assessment of the work done, focusing on the selected
approach for the description of the GSBPM phases and the
criteria adopted for a proper assignment of the core parts of
our business process. The main restrictions found and the
potential value added of this exercise are also pointed out.
Keywords: Business Register, business process, GSBPM,
BPMN, interoperability.
© 2017 SEIO
1. About the GSBPM
The Generic Statistical Business Process Model (GSBPM) is a
reference framework developed by the United Nations Economic
Commission for Europe (UNECE) and the conference of European
Statisticians Steering Group on Statistical Metadata. Its basic
aim is to define and describe the statistical processes in a
coherent way, making them comparable within and between
different Organizations. This tool provides a standard framework
and harmonised terminology to help statistical organizations to
modernise their production processes as well as to share methods
and components.
The GSBPM is closely connected to data quality management,
providing a framework for its assessment. It comprises four levels:
Level 0 (the statistical business process), Level 1 (the nine phases of
the statistical business process), Level 2 (the sub-processes within
each phase) and Level 3 (a description of those sub-processes).
Levels 1 and 2 are illustrated in Figure 1. Although this standard
was conceived for the description of any statistical operation, the
production of a National Business Register has its own specificities.
Building on the clear benefits offered by this framework, some
adaptations to the GSBPM structure have been made for this specific
exercise.
The National Statistical Institute (NSI) of Spain has adopted
this standard as a core element for the implementation of the Quality
Assurance Framework of the European Statistical System. In fact,
the national standard starts from the GSBPM structure, to which a
more detailed level of information has been added. For the
above-mentioned reasons, this additional level has not been used in
the description of the NBR.
Figure 1: Levels 1 and 2 of the GSBPM
2. GSBPM and Business Registers; general
context
The management of National Business Registers (NBRs) for
statistical purposes is a strategic action, usually incorporated within
the official plans of the Statistical Offices. The key role of
these infrastructural elements in data production, the increasing
complexity of the related data architecture and the need for a
continuous adaptation to international standards and methodologies
are challenging issues undertaken by the daily work of the NBR
teams.
Since 1992, the DIRCE (the name of the Spanish Business
Register) has been the central reference as a sampling frame for
official business surveys, which are carried out by the NSI and other
Government Departments with statistical competences. In the last
year, more than 400,000 units were provided and investigated through
more than 20 surveys.
The NSI of Spain is currently working under an explicit mandate
of its Board of Directors, which is encouraging a progressive use
of the GSBPM in all statistical domains identified in the national
statistics plan.
Communication 404/2009, generally referred to as the Vision
document, proposes several strategic principles for future statistics.
Among them, the need for a re-engineering of the current production
methods is particularly relevant, moving from a system based on
parallel processes to a more integrated production model. In this
line, Eurostat launched the 4-year initiative European System of
Business Registers (ESBR, 2013-2017), with the aim to improve the
relevance of these tools and reinforce their role as the backbone for
the European Statistical System.
The Euro Groups Register (EGR) is the Statistical Register
of the European Communities on multinational enterprise groups.
The EGR is the authentic core of the ESBR system and includes
information of the most influential multinationals operating in the
EU and EFTA countries. It is built and maintained under a strict
collaborative model involving all relevant stakeholders, mainly NSIs
and Central Banks.
In the latest developments of the afore-mentioned initiative,
a specific Business Architecture and its materialization through
an Interoperability Frame will be available for NBRs and EGR.
In this context, the application of the GSBPM to Business
Registers becomes highly relevant because it will favour a mutual
understanding of national procedures, the circulation of good
practices and the identification of areas where efficiency can be
gained. A wide application of this standard will also make future
benchmarking tasks possible as well as the definition of a minimum
set of interoperability requirements.
This descriptive process also needs to cover interactions with
the EGR production, referring to the stages of the process where
national data extractions and flows between the NBR and the EGR
take place. Experience in applying this standard to the Business
Register domain is still quite recent. In the scope of the ESBR
project, Eurostat launched a grant with this purpose and the NSI
of Spain participated in this action. A general assessment of this
innovative experience is provided in the following paragraphs.
3. Evaluation of the work done
Preliminary activities focused on building capacity for using
GSBPM. From 13 to 17 of October 2014, a training course was set
up by national experts on standards and methodology. The DIRCE
team participated in this cooperative action aiming to create basic
knowledge. The relevant parts of our business process were identified
and a proposal of allocation in the GSBPM structure was discussed.
Regarding the format adopted for this exercise, a combination of
human and modelling language has been used. Among the possible
options, the Business Process Modelling Notation version 2 was
selected. The most important parts of the DIRCE business process
have been graphically represented with this notation.
BPMN v2 offers possibilities to create several pools used for the
representation of the donor Organizations and the actions carried out
by the DIRCE Unit during the whole maintenance cycle. When an
input source is received, it is subject to different kinds of processing
and data quality programs. In order to provide a structured
representation of all actions, three different internal areas have been
considered:
� Source, where the input sources are received and evaluated.
� Intermediate, where the sources are edited and transformed
into statistical databases.
� Final, where the integration and maintenance of the DIRCE
is carried out.
The Source and intermediate areas are closely related with the
Collect phase. The Final area is highly relevant in the Process phase.
The main results acquired through this experience will be
examined below in detail. The selected approach for the different
GSBPM phases, the identification of the main register processes,
their allocation to the standardized structure and the level of
granularity adopted will be described. In addition, some gaps in
relevance and restrictions found in this exercise will also be pointed
out.
3.1. Specify Needs, Design and Build phases
Management of NBRs has a long tradition in the majority of
NSIs. User needs, output objectives and methods of production
are continuously changing and being adapted to new emerging
challenges. This context has a clear impact on the design and build
of the respective data models and derived uses, which need to be
continuously aligned with the new requirements.
For this reason, the approach followed for the description of these
phases has been based on a historical dimension. This criterion
makes it easier to understand the current state and the
fundamentals of our national model. The methodological basis
adopted for the DIRCE management is known as the PIDE Project
(Proyecto de Integración de Directorios Económicos, Figure 2). This
initiative started at the end of the 1980s and was formulated under
a modular approach, involving several components developed in
successive steps and with different contributions to the maintenance
of the DIRCE.
The PIDE project is always considered open, because it is based
on a continuous evaluation of the current and potential statistical
needs, the development of specific actions to fulfill those needs and
their definitive incorporation into the production model.
The documentation of the Design and Build phases has been
conceived as a single part, focusing on the main milestones
consolidated under a time-based perspective. From the beginning,
both steps of the business process were jointly undertaken with a
large degree of overlap. In addition, the description of the related
sub-processes is not especially significant for our business case.
For the remaining phases, a static dimension has been adopted.
All actions refer to the most recent cycles of maintenance for DIRCE
and interactions with the EGR. More specifically, the description
of the PIDE components is allocated in the Collect phase and all
successive integration procedures are described in the Process phase.
Figure 2: High level perspective of the PIDE project
3.2. Collect phase
This part is highly relevant for our business case and can easily
be generalized to the vast majority of Statistical Offices. It mainly
involves an analysis of the suitability of the data sources used for
the management of the NBR. It also includes all the actions needed
to ensure their successful and stable reception at the NSI.
The production of the DIRCE is cyclical with annual periodicity
and is based on an intense use of data sources. Due to the diversity
of the selected sources, the institutional and operative actions for
their acquisition and processing were arranged in different modules,
directly linked to the typology of sources (AT = Tax files, SS = Social
Security files, PR = Private files, ...) and making up the dynamics of
the PIDE project.
When a source is received, a data quality program is applied
according to its features and its specific role in the business process.
Some sources are core elements for the detection of new units or the
removal of previously existing ones, while other sources are relevant
for the maintenance of specific variables.
According to these parameters, the information provided in this
phase basically refers to:
- The inter-departmental context created to allow stable
access to input data (4.2 set up collection). Initially,
institutional actions were addressed to the Tax and
Social Security Authorities and formally established in the
corresponding collaboration agreements. In other cases, specific
service contracts are available for the acquisition of private databases.
- The channels adopted for the reception of the data sources
(4.2 set up collection). Diverse procedures are described, the
most relevant being several IT tools that meet the security
requirements for data interchange. In other cases, direct
download from official websites is applied.
- The list of input sources used in each production cycle (4.3/4.4
run and finalize collection), classified by nature, in line with the
components of the PIDE project.
All input sources are numbered, this system being critical for the
data integration and the DIRCE maintenance. A set of structured
information is given for each database, including:
- Basic metadata: denomination, Managing Organization,
reception date, timetable for data processing, elementary
observation unit and data structure.
- Validation rules.
- Editing and micro-validation processes.
- Transformation processes and adoption of statistical standards.
- Production of statistical databases.
Validation rules are designed to support a decision about the
source: acceptance or rejection. If rejected, the source is returned
to the Managing Organization together with a report of the errors
to be corrected. If accepted, a set of specific procedures is applied.
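As a purely illustrative, hedged sketch (field names, thresholds and the report format are hypothetical assumptions, not the actual NSI implementation), an acceptance/rejection rule of this kind could be expressed as follows:

```python
# Hypothetical sketch of a source acceptance/rejection rule; field names,
# the threshold and the report structure are illustrative assumptions only.
from dataclasses import dataclass, field

@dataclass
class ValidationReport:
    accepted: bool
    errors: list = field(default_factory=list)

def validate_source(records, expected_fields, max_missing_id_rate=0.01):
    """Accept the source only if mandatory fields are present and the share
    of records without a national ID stays below a threshold."""
    errors = []
    missing_fields = (expected_fields - set(records[0].keys())) if records else expected_fields
    if missing_fields:
        errors.append(f"missing fields: {sorted(missing_fields)}")
    missing_id = sum(1 for r in records if not r.get("national_id"))
    if records and missing_id / len(records) > max_missing_id_rate:
        errors.append(f"too many records without national ID: {missing_id}")
    return ValidationReport(accepted=not errors, errors=errors)

# Toy example: a two-record extract in which one record lacks the national ID.
sample = [{"national_id": "A123", "turnover": 10.5},
          {"national_id": "", "turnover": 3.2}]
print(validate_source(sample, {"national_id", "turnover"}))
```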
The lack of adaptation to statistical standards or the low quality
of particular variables can be pointed out as classic restrictions of
administrative data. These problems must be detected in the
preliminary design of the project.
Afterwards, the solution of these types of problems is
normally the result of close cooperation between statisticians
and administrative managers, within the scope of the institutional
context created. This is normally materialized by means of specific
editing, micro-validation or transformation processes.
The GSBPM offers possibilities for assigning this information in
the next phase. However, in order to facilitate the understanding of
the complete production chain, it has been decided to link all these
register processes to the input sources in the collect phase.
3.3. Process phase
This phase is the core part of our business process. All the statistical
databases produced in the previous phase are used as input for the
integration processes, the generation of the updated statistical units
forming the data model and all the related characteristics linked to
them. In summary, this is where the updating of the DIRCE takes
place.
The integration procedures (5.1 integrate data) are mainly
carried out by record linkage routines based on the universal presence
of unique national IDs (a minimal linkage sketch is shown after the
list of steps below). During this action, several frozen DIRCE frames
are generated, with different levels of quality and different uses. The
features of the frames of reference year t are described as a timeline
over the year t+1, as the main result of the following iterative steps:
1. INT 1 1 produces a preliminary updated version of enterprises,
based on the new data sources.
2. INT 1 2 produces a second version of enterprises and local
units fully consistent with the year t-1. The data quality
is higher due to the incorporation of validated statistical
information. This frame is used for sample selection in the
STS domain and official dissemination of results.
3. INT 1 3 produces the definitive version of enterprises and
local units incorporating the last updating of basic variables.
In addition, specialized databases containing information on
monetary characteristics are received during the last quarter
of the year and this information is also incorporated.
4. INT 2 produces an updated version of enterprise groups based
on private, tax and statistical sources.
5. INT F produces the definitive updated system, integrating
the results obtained in INT 1 3 and INT 2. Three levels of
information formed by enterprises, local units and enterprise
groups are available and fully consistent.
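As announced above, the following is a minimal, hypothetical sketch of ID-based record linkage between two frames; the column names and data are illustrative assumptions and do not reflect the real DIRCE data model:

```python
# Hedged sketch of ID-based record linkage between two annual frames;
# columns and values are toy examples, not the DIRCE model.
import pandas as pd

frame_t_minus_1 = pd.DataFrame(
    {"national_id": ["A1", "A2", "A3"], "employees": [10, 5, 120]}
)
new_source = pd.DataFrame(
    {"national_id": ["A2", "A3", "A4"], "employees_new": [6, 118, 2]}
)

# Outer join on the national ID: matched units can be updated, units only
# present in the new source are candidate births, units only present in the
# old frame are candidate deaths.
linked = frame_t_minus_1.merge(new_source, on="national_id", how="outer", indicator=True)
births = linked[linked["_merge"] == "right_only"]["national_id"].tolist()
deaths = linked[linked["_merge"] == "left_only"]["national_id"].tolist()
print(linked)
print("candidate births:", births, "candidate deaths:", deaths)
```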
In the last part of sub-process 5.1, the interactions with the EGR
cycle are described. Due to the complexity of the EGR model
and the need to give efficient answers from the NBRs, the uses
of the frozen frames are described according to the different EGR
interchange flows (extraction and delivery of resident/non-resident
units to the EGR Identification Service, extraction and delivery of
the core files on legal units and control relationships for the EGR
system, extraction and delivery of the enterprise data to the EGR
system, repair action and data exchange on Ultimate Controlling
Units). Figure 3 shows these interactions as a timeline.
During the integration procedures, a definitive classification of all
statistical units (5.2 classify and code) is also provided. For the core
classification variables, a predefined set of decision rules is described,
according to their presence in data sources, and their reliability.
New variables are also derived and systematically maintained
based on information available or specific data sources (5.5 derive
new variables and units). Ad hoc estimation procedures or
deterministic rules are described for the delimitation of the number
of persons employed, the institutional sector code or monetary
variables like turnover, import and export.
The main restrictions were found in the documentation of
review, validation, edit and imputation as separate sub-processes.
Figure 3: DIRCE frames and interactions with the EGR cycle
As previously mentioned, all these practices are undertaken from
the beginning of the cycle and they are allocated in the related
sub-processes in order to facilitate the understanding of the whole
production chain.
3.4. Analyse phase
The increasing demand for better and more detailed business
statistics has put the focus on the NBRs and their key role in the
statistical production chain. Originally, these tools were conceived
as a vital component of statistical infrastructure, supporting data
collection, monitoring the response burden and giving grossing up
indicators for the production of aggregates. All these tasks, closely
related with the use of NBR as the survey frame, will be jointly
considered in the application of the GSBPM to business surveys.
In recent decades, user demands have diversified and the role
of the NBR as a source of data production has become more and
more relevant. This aspect has been the approach adopted for the
documentation of this phase and the following one. In the Spanish
NSI, the DIRCE is the key data source for the statistical analyses of
business activity from both a static and dynamic perspective. Two
main references linked to the DIRCE macro-data are documented:
- Statistical Analysis of the DIRCE. A standard publication of
results directly obtained from the updated frame.
- Harmonized Business Demography. A product specifically
elaborated to cover the national needs in this domain.
Its production is fully consistent with the OECD-Eurostat
methodology.
Both statistical operations incorporate the same metadata: type
of operation, data source, periodicity, starting and ending dates
of the processes, press release, presence in the National Statistical Plan /
Statistical Operations Inventory and methodological basis.
3.5. Dissemination phase
Dissemination of NBRs can take place at the micro-data or
macro-data level. The first option is normally constrained by
the confidentiality provisions of the national legal framework.
This is the case in Spain, where access to the DIRCE microdata
is restricted to the national authorities in charge of official statistics.
Dissemination of macro-data refers to the statistical operations
previously mentioned. The main operational steps carried out up
until their definitive publication are documented in this phase. Joint
meetings involving the DIRCE and Dissemination teams are held in
the last part of each cycle. Information about the dynamic of the
processes, the date foreseen for the generation of the aggregates and
the innovations incorporated, form the basis for a proper adaptation
of the output system (7.1 update output systems).
For the second stage, all components related to each operation
are documented (7.2 produce dissemination products). They mainly
refer to the list of data tables, metadata, standard methodological
report, complementary reports, graphic annex and press release.
The external impact of these statistics is very relevant. A recent
study of the number of web accesses to INEBase, the generic
brand for statistical information on the NSI website comprising 185
statistical operations, shows that the DIRCE statistics rank among
the top 20.
Since its first year of publication, the DIRCE has also provided a
tailor-made service through the direct use of register data. Requests
from Public Administrations, Private Companies, Organizations,
Professionals and Researchers are continuously increasing. The
queries registered are very diverse in form and content, and they
are managed according to a specific protocol defined by the NSI
dissemination policy (7.5 manage user support).
3.6. Evaluation phase
This phase is closely related to the quality policy implemented
and the incorporation of successive improvements to our NBR. Two
main orientations have been outlined:
- Internal evaluation, by developing a complete diagnosis of
the processes carried out during each annual cycle.
- External evaluation, by using the feedback of business
statistics producers as a basic element for the improvement of the NBR.
4. Final remarks
This has been a challenging and very positive experience for the
DIRCE team. Although the GSBPM seems better adapted to
a typical survey, this standard can also be applied to Statistical
Business Registers. However, the presence of national specificities
in the management model sometimes makes the allocation to
sub-processes difficult.
Different approaches can be adopted for the description of
the phases, from a dynamic to a static perspective. Generally
speaking, this decision should be made considering the current level
of implementation and the innovations foreseen for the business
process. In the case of the management of BRs, which has a
long-standing tradition in Statistical Offices, the historical dimension
could be more appropriate for the first phases, while the static
information linked to the most recent production cycle would be the
appropriate approach for the remaining phases.
The GSBPM proposes a multi-focal description of the business
process, allocating uniform parts to separate sub-processes. This
exercise can be appropriate when actions are addressed to the same
dynamic database throughout the production chain. However, this
philosophy can impose serious restrictions on projects involving a
great amount and variety of data sources, for which specific actions
must be designed. In the Spanish case, the longitudinal description
of each input data source has been predominant, in order to properly
understand how our model actually works.
On an international scale, the results of these experiences will
have to be jointly evaluated. As a starting point, the development
of benchmarking activities will need to be undertaken. Expected
results should lead to some agreements towards more coordinated
and consistent production cycles. In addition, this context should
facilitate the identification of specific tools for a common use or a
preliminary identification of Data Quality Program for all BRs of
the European System.
This progressive integration within an interoperable system
will mean veritable added value for all statistical actors and an
opportunity to modernise the production process of official statistics.
References
[1] Applying the Generic Statistical Business Process Model
to business register maintenance. Economic Commission for
Europe, Conference of European Statisticians. Paris, September
2011.
[2] COM (2009) 404 final- Communication from the Commission
to the European Parliament and the Council on the production
method of EU statistics. http://eur-lex.europa.eu/
LexUriServ/LexUriServ.do?uri=COM:2009:0404:FIN:EN:PDF
[3] Directive 2012/17/EU of the European Parliament and of the
Council on interconnecting Business Registers.
http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?
uri=OJ:L:2012:156:0001:0009:en:PDF
[4] http://www1.unece.org/stat/platform/display/metis/
The+Generic+Statistical+Business+Process+Model
[5] http://www.ine.es/inebmenu/mnu_empresas.htm
About the author
Luis Esteban Barbado Miguel is a senior statistician at the
National Statistics Institute (INE) of Spain. A public official of the
Senior Corps of Statisticians, he has broad experience in official
statistics. He is currently Deputy Director of the Department of
Methodology and Development of Statistical Production and the
Spanish representative in international working groups on Business
Registers for statistical purposes and on Statistical Units. He has
participated in training seminars for statisticians in Latin America,
focusing on the fundamentals of the development and management
of Business Registers. He holds a university degree in Mathematics,
specializing in Statistics.
Boletín de Estadística e Investigación Operativa. Vol. 33, No. 3, Noviembre 2017, pp. 276-294
Historia y Enseñanza
Positive effects on the least motivated students of
the highly motivated ones1
Raquel Ibar-Alonso
Departamento Interfacultativo de Matemáticas y Estadística
Universidad CEU San Pablo
Carolina Cosculluela-Martínez
Departamento de Economía Aplicada I
Universidad Rey Juan Carlos
Abstract
The teaching methodology affects the motivation of the
students. Unmotivated students can influence the most
motivated ones and vice versa. Continuously adjusting the
methodology to the less motivated students would be possible
if there were information on what really motivates them,
that is, if they could decide how they want to be taught.
A group of students from two different universities in Madrid,
a public and a private one, answered a survey after a month
attending a Statistics course. The three clusters found are
motivated by different methodological tools. The lowest motivation values of the
1Positive effects on the less motivated students.
© 2017 SEIO
30% most motivated people of group 3 (percentile 70) have
been compared with the highest motivation values of the 30%
least motivated students (percentile 30) and with those of the
30% most motivated students (percentile 70) in group 1. Thus,
how can the most motivated group influence the least motivated
one in a non-linear way?
Keywords: Motivation, teaching methodology, profiles,
non-linear influences, just in time adaptability, sociological
study.
AMS Subject classifications: 62J02.
1. Introduction
The motivation of students has been a constant matter
of study since the last century. Some reached the conclusion
that, in order to help unmotivated students, it is important to do
exercises, socialize, and involve them in a process called "attribution
retraining" (Lumsden, 1994). Others assert that the way to motivate
is to ask students to demonstrate what they have learned and to
participate in class, not simply to display their ability to memorize
and answer questions (Chuska, 1995).
Different students can be motivated in many different ways.
The comfort zone of one group of students could motivate them but
be stressful for others. The goal of this paper is to analyze the
effect of the most motivated group of students in a class on
the least motivated group. Quantifying this effect is a prerequisite
for influencing motivation by adapting the methodology to what the
students themselves believe is the most motivating way of teaching.
The way in which they are going to be motivated is beyond the scope
of this paper.
The motivation of a student can be elicited in several ways, and
it is never clear which one is best. Motivation is based on the results
that the student expected to obtain and on what they finally achieved
in the past. In this paper, motivation is elicited in a particular way:
the student identifies the evolution of their motivation graphically.
Thus, the latent variable motivation is built from the graph. The
student's perception of the evolution of their own motivation reflects
its real nature, in a more realistic way than asking students to rank
their motivation on a given scale.
To achieve the main goal, it is necessary to classify
students according to their teaching preferences and their
characteristics. The proposed methodology, built on previous work
(Cosculluela-Martinez and Ibar-Alonso, 2016), is a cluster analysis
and the description of the groups obtained from an on-line survey
of 15 questions answered by more than 200 people studying in two
different types of universities: a private Catholic one and a public
one, both in Madrid. The motivation is computed from one graphical
question and is based on the factorial scores obtained for each group.
This technique is the novelty of Ibar's thesis (2014) and has been
followed by Martinez and Ibar-Alonso (2015). Thus, the answer is
intuitive and quantifiable: a perception of expected motivation
determines the evolution of how the student thinks they are going
to be motivated in the future, together with their view of their
motivation in the past.
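As a purely illustrative sketch of the classification step (with simulated answers, and with k-means as one possible clustering technique; the authors do not specify the algorithm used), the grouping could look as follows:

```python
# Hedged sketch: grouping survey respondents into three clusters.
# K-means is used here only as one possible clustering technique and the
# survey data are simulated, not the authors' responses.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
n_respondents, n_items = 200, 15                 # 15 survey questions, 200+ answers
answers = rng.integers(1, 6, size=(n_respondents, n_items)).astype(float)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(answers)
print("cluster sizes:", np.bincount(labels))
```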
The relationship between the motivation scores of the most
and least motivated students in each group will be estimated and
analyzed to determine the influence of one group on the other.
Our main hypothesis is that there is a strong positive influence
of the less motivated students on the motivation of the highly
motivated ones. Thus, the professor can influence the motivation
of the whole class by adjusting the methodology to what this group
feels will be more motivating for them.
The rest of the paper is organized as follows. Section 2 shows
the methodology used to obtain the data. Section 3 presents the
previous analysis, the relationships between the variables and the
main assumptions. Section 4 discusses the results obtained from the
empirical estimations. Finally, Section 5 provides the concluding
remarks.
2. Methodology
The Google Drive on-line survey,
https://goo.gl/forms/Vk70LlCFLmsCQ6qo1, has 15 questions, from which the cluster
analysis determines that three groups with the same characteristics
can be found in both universities, the same conclusion reached
by Arias et al. (2000). The two universities chosen were
Rey Juan Carlos, a public university, and CEU San Pablo, a private
university. The reason for choosing them is that in both of them
professors teach the same subject, with the same main bibliography
and syllabus. There was a large number of responses: more than 200
students answered the questionnaire. Besides, the questionnaire is a
follow-up survey, meaning that the students are asked to answer it
every year, so the people responding differ from year to year.
One of the questions makes it possible to calculate, for each
group and each period of time (21 periods), the percentiles of the
students' motivation, the starting point being the factorial score of
each group. Thus, the evolution can be a line, a parabolic function, an
exponential function or any other pattern that the student feels their
motivation has followed and is going to follow.
From the profiles of the students, the percentiles of the
motivation have been calculated. The 30% least motivated students
in each group have a different upper bound of their motivation score
for every period, represented by percentile 30 of groups 3 and 1,
hereafter 30CL3 and 30CL1, while the 30% most motivated have
a different lower bound of their motivation score for every period,
represented by percentile 70 of groups 1 and 3, hereafter 70CL1 and
70CL3.
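As a hedged illustration of this percentile construction (with simulated factorial scores, not the authors' survey data), the following sketch computes 30CLg and 70CLg for each group g and each period:

```python
# Hedged sketch: 30th and 70th percentiles of simulated motivation scores
# for each group and period; the data are illustrative, not the real survey.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_periods = 80, 21
scores = rng.normal(size=(n_students, n_periods))   # factorial score per student and period
groups = rng.integers(1, 4, size=n_students)        # cluster labels 1, 2, 3

percentiles = {}
for g in (1, 2, 3):
    block = scores[groups == g]
    percentiles[f"30CL{g}"] = np.percentile(block, 30, axis=0)  # highest value of the 30% least motivated
    percentiles[f"70CL{g}"] = np.percentile(block, 70, axis=0)  # lowest value of the 30% most motivated
print({k: v[:3].round(3) for k, v in percentiles.items()})
```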
The relationship between the evolution of those variables will be
estimated and analyzed according to the pattern that they follow,
either a linear or a non-linear relationship. The study has been
extended to the relationships between those percentiles in cluster
2 and the rest. Thus, equations (2.1) and (2.2) will be estimated,
depending on the results obtained in the Ramsey test and the
graphical analysis:
Y = β0 + β1·X1 + β2·X2 + ... + βn·Xn,    (2.1)
Y = β0 + β1·Xi + β2·Xi^2 + ... + βn·Xi^n,    (2.2)
where:
Y is the value of the percentile to be estimated,
Xi are the independent percentiles related to Y, and
βi are the parameters to be estimated, for i = 1 to n.
A minimal sketch of how these specifications can be fitted and tested is given below.
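The following sketch, under stated assumptions, shows how specifications (2.1)-(2.2) and a Ramsey RESET-type test of linearity can be computed; only the first few values of Table 1 are used, and the test implementation is a generic one, not necessarily the routine used by the authors:

```python
# Hedged sketch: OLS fit of 70CL1 on powers of 30CL3 and a hand-rolled
# Ramsey RESET test of linearity, using a few Table 1 values for illustration.
import numpy as np
from scipy import stats

y = np.array([0.635933, 0.641701, 0.616466, 0.664164, 0.693985, 0.673634])  # 70CL1
x = np.array([-0.5066, -0.5516, -0.5866, -0.664675, -0.74211, -0.74711])     # 30CL3

def ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def ramsey_reset(x, y, power=2):
    """F-test of linearity: add powers of the fitted values to the linear model."""
    X0 = np.column_stack([np.ones_like(x), x])
    b0, e0 = ols(X0, y)
    fitted = X0 @ b0
    X1 = np.column_stack([X0] + [fitted**p for p in range(2, power + 1)])
    _, e1 = ols(X1, y)
    q = X1.shape[1] - X0.shape[1]
    df2 = len(y) - X1.shape[1]
    F = ((e0 @ e0 - e1 @ e1) / q) / (e1 @ e1 / df2)
    return F, stats.f.sf(F, q, df2)

print("RESET F and p-value:", ramsey_reset(x, y, power=2))
# Polynomial specification (2.2), here of degree 3, fitted by least squares:
Xpoly = np.column_stack([np.ones_like(x), x, x**2, x**3])
print("polynomial coefficients:", ols(Xpoly, y)[0].round(3))
```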
Next, the empirical analysis.
3. Analysis
The variables for a time span of 21 periods that have been
selected for the study are:
- 30CL1: highest motivation value for each period of the 30%
with the lowest motivation in group 1.
- 30CL3: highest motivation value for each period of the 30%
with the lowest motivation in group 3.
- 70CL1: lowest motivation value for each period of the 30% with
the highest motivation in group 1.
- 70CL3: lowest motivation value for each period of the 30% with
the highest motivation in group 3.
The data is shown in Table 1. The extension of the analysis with
cluster 2 has been taken to the appendix.
PERIOD 30 CL1 70 CL1 30 CL2 70 CL2 30 CL3 70 CL3
1 -0.447919 0.635933 -0.682659 0.623431 -0.5066 0.62642
2 -0.400638 0.641701 -0.770779 0.624117 -0.5516 0.64763
3 -0.292498 0.616466 -0.836687 0.575844 -0.5866 0.64763
4 -0.249899 0.664164 -0.747312 0.59406 -0.664675 0.64763
5 -0.193041 0.693985 -0.828881 0.591281 -0.74211 0.64763
6 -0.201437 0.673634 -0.819367 0.520423 -0.74711 0.64763
7 -0.265068 0.619635 -0.833527 0.543977 -0.829805 0.64763
8 -0.302225 0.640624 -0.779835 0.569971 -0.914805 0.64763
9 -0.296863 0.599538 -0.831482 0.599809 -0.989805 0.64763
10 -0.312473 0.589619 -0.85975 0.609343 -1.053195 0.64763
11 -0.296863 0.602055 -0.904235 0.619343 -1.053195 0.64763
12 -0.293863 0.589442 -0.903235 0.662087 -1.053195 0.69003
13 -0.275659 0.524072 -0.902879 0.749999 -1.053195 0.69003
14 -0.293308 0.534936 -0.922674 0.659999 -1.05083 0.69003
15 -0.249227 0.576226 -0.922674 0.595511 -0.98083 0.69003
16 -0.278802 0.65001 -0.922674 0.615064 -0.89083 0.69003
17 -0.271151 0.607836 -0.923247 0.632576 -0.78083 0.655395
18 -0.326156 0.684989 -0.955928 0.622341 -0.65083 0.577485
19 -0.388955 0.739893 -0.993438 0.69653 -0.56242 0.577485
20 -0.461889 0.766856 -1.034884 0.776419 -0.52742 0.577485
21 -0.590638 0.925424 -1.118959 0.857497 -0.48242 0.577485
Table 1: Percentiles 30 and 70 of the motivation in groups 1, 2 and 3 (the least and most motivated students in each group).
The preliminary analysis will determine the methodology
according to the type of relationship between them.
3.1. Previous analysis
Percentiles 70 of group 1 (70CL1) and group 3 (70CL3) and
percentiles 30 of group 1 (30CL1) and group 3 (30CL3), together
with their relationships, are represented in Figures 1 to 3.
Figure 1: Relationship between the lowest values of the 30% most motivated people of group 1 and the highest values of the 30% less motivated ones of group 3. Source: Data from our own survey obtained in 2015.
Figure 1 shows the relationship between the lowest values of the
30% most motivated people of group 1 and the highest values of the
less motivated ones of group 3. As can be appreciated, there is no
linearity describing this influence. The Ramsey test rejects the null
hypothesis of linearity at the 90% level with n = 4.
Figure 2: Relationship between the lowest values of the 30% most motivated people of group 1 and group 3. Source: Data from our own survey obtained in 2015.
Figure 2 shows the relationship between the lowest values of the
30% most motivated people of group 1 and group 3. Graphically,
the relationship can be linear. The Ramsey test does not reject the
null hypothesis of linearity.
Figure 3: Relationship between the highest values of the 30% less motivated people of group 1 and group 3. Source: Data from our own survey obtained in 2015.
Figure 3 shows the relationship between the highest values of
the 30% less motivated people of group 1 and group 3. Graphically,
the relationship can be parabolic. The Ramsey test rejects the null
hypothesis of linearity with n = 2 at the 99% level. According to this
preliminary analysis, (2.1) and (2.2) are estimated.
4. Estimations and Results
The final estimations obtained are:
First, the effects of group 3 on group 1 are represented in
equations (4.1), (4.2) and (4.3).
70CL1 = 367.87 [224.97] · 30CL3 + 954.54 [616.74] · 30CL3^2 + 1220.58 [830.49] · 30CL3^3 + 769.51 [549.54] · 30CL3^4 + 191.55 [143.05] · 30CL3^5 + 56.55 [32.25]    (4.1)
70CL1 = −1.82 [0.32] · 70CL3 + 1.82 [0.21]    (4.2)
30CL1 = −16.33 [2.99] · 30CL3 − 4.82 [0.73] − 19.07 [3.94] · 30CL3^2 − 7.26 [1.68] · 30CL3^3    (4.3)
The residuals are white noise. The ADF test was applied to the
residuals, and all the coefficients were examined, to assess the
goodness of fit of the models.
From (4.1) to (4.3) it can be said that there is a positive
relationship between the 30% least motivated students of group 3
and the 30% most motivated ones in group 1. There is a negative
relationship between the 30% most and least motivated students in
both groups.
Graphically, the models explain the evolution of the percentiles
quite accurately. The 30% least motivated students in group 3 have
a strong influence on the least motivated ones of group 1. The 30%
most motivated students in group 3 have a strong influence on the
most motivated ones, who are in group 1.
Figure 4: Relationship between the highest values of the 30% less motivated people of group 1 and group 3 and the one estimated by the model.
5. Conclusions
As motivation is not always a positive value, and its evolution
depends on previous experience, the relationship between the
students' motivation in each of the groups into which they can be
classified according to their profiles is difficult to analyze. An
approximation is to estimate the relationship between the evolution
of the score that a fixed percentage of the people classified in one
group think they have and the evolution of the score that another,
or the same, percentage of the people classified in another group
think they have.
Figure 5: Relationship between the lowest values of the 30% most motivated people of group 1 and group 3 and the one estimated by the model.
Figure 6: Relationship between the lowest values of the 30% most motivated people of group 1 and the highest values of the less motivated ones of group 3. Source: Data from our own survey obtained in 2015.
Thus, the technique used to quantify the influence of a
one-percentage-point increase (hereafter, the shock) in the motivation
of each of the groups on the others is to estimate the following
nine pairwise relationships: the relationship between the lowest
values of the 30% most motivated people of one group and the
highest values of the 30% least motivated ones of another group;
the relationship between the lowest values of the 30% most
motivated people of each pair of groups; and the relationship between
the highest values of the 30% least motivated people of each pair of
groups.
First, positive effects have been found when increasing the motivation
of group 2 by one percentage point: if the increase is in the 30% most
motivated, the effect on the 30% most motivated of group 1 is 26.34%;
if the increase is in the 30% least motivated, the effect on the 30%
least motivated in group 1 is 0.07%, the effect on the 30% least
motivated of group 3 is 2.71% and the effect on the 30% most
motivated of group 3 is 3.16%. On the other hand, an increase in
the 30% less motivated of group 2 decreases the motivation of the
30% least motivated of group 1 by 0.82%.
Second, raising the motivation of the most motivated people in
group 3 by one percentage point decreases the motivation of the most
motivated ones in group 1 by 1.05%. Raising the motivation of the
30% least motivated in group 3 decreases the motivation of the least
motivated in group 1 by 1.47% and raises the motivation of the
30% most motivated ones in group 1 by 2.71%.
References
[1] Arias, A. V. (2000). Enfoques de aprendizaje en estudiantes
universitarios. Psicothema, 12(3), 368-375.
[2] Chuska, K. R. (1995). Improving classroom questions: A
teacher’s guide to increasing Student Motivation, Participation,
and Higher-Level Thinking, Phi Delta Kappa Educational
Foundation ERIC Publications, Bloomington (Indiana, US).
[3] Cosculluela-Martínez, C. and Ibar-Alonso, R. (2016).
Retroalimentación como fuente de mejora de la calidad docente:
caso real. Congreso XXIV Jornadas de ASEPUMA, Granada
(Spain).
[4] Ibar-Alonso, R. (2014). Nueva metodología de recogida
de información para su tratamiento a través del Análisis
Multivariante y los Modelos de Ecuaciones Estructurales.
Aplicación en el ámbito universitario. Tesis Doctoral, Madrid
(Spain).
[5] Lumsden, L. S. (1994). Student motivation to learn. ERIC
Publications, number 92, Washington, DC. Available at
https://eric.ed.gov/?id=ED370200.
[6] Martinez, M. S., and Ibar-Alonso, R. (2015). Convergence and
interaction in the new media: Typologies of prosumers among
university students. Comunicacion y Sociedad, 28(2), 87.
Appendix
Estimations of the relationships between cluster 2 and the other
clusters.
A1. Clusters 1 and 2
70CL1 = −374.75 · 30CL2 − 632.00 · 30CL2^2 − 466.67 · 30CL2^3 − 127.19 · 30CL2^4 − 81.52    (5.1)
70CL1 = 0.27 · 70CL2 + 1.06 · 70CL1(−1) − 0.20    (5.2)
30CL1 = 0.98 · 30CL2 − 0.64 · 30CL2(−1)    (5.3)
Figure 7: The relationship between the lowest values of the 30% most motivated people of group 1 and group 2; the relationship between the lowest values of the 30% most motivated people of group 1 and the highest values of the 30% less motivated people of group 2; and the relationship between the highest values of the 30% less motivated people of group 1 and group 2. Source: Data from our own survey obtained in 2015.
Figure 8: Estimated and real values.
A.2. Clusters 2 and 3
30CL3 = 3.57 · 30CL2 + 7.42 · 30CL2^2 + 3.73 · 30CL2^3 + 1.06 · 30CL3(−1)    (5.4)
No satisfactory fit was obtained with either the linear or the non-linear model.
70CL3 = −0.26 · 30CL2^2 + 0.78 · 70CL3(−1) − 0.39 · 30CL2    (5.5)
Figure 9: The relationship between the highest values of the 30% less motivated people of group 3 and group 2; the relationship between the lowest values of the 30% most motivated people of group 3 and group 2; and the relationship between the lowest values of the 30% most motivated people of group 3 and the highest values of the 30% less motivated people of group 2. Source: Data from our own survey obtained in 2015.
Figure 10: Estimated and real values.
About the authors
Ibar-Alonso, R. holds a PhD with distinction in Economic and
Business Sciences from San Pablo University of Madrid (USP-CEU).
She obtained her degree in Mathematical Sciences from the Complutense
University of Madrid (UCM). Faculty staff of the Mathematical
and Statistic Department at the USP-CEU University. Member of
the research group of Media Convergence (INCIRTV) and of the
Research Project Smart Cities: Accessibility problems to digital
information of the older citizens. She has multidisciplinary research
lines in Multivariate Statistic Analysis, social behavior, and new
ways of collecting qualitative and quantitative information. Visiting
scholar (2016) at Regional Economic Applications Laboratory
Illinois University.
Cosculluela-Martínez, C. Faculty Staff at URJC. PhD with
distinction in Statistics for Economics from UNED. Enrique Fuentes
Quintana (2010) and Ramón Areces (2011) Prizes. Coordinator
of the Business Administration and Tourism Branch. Public
Press Conferences and Coordinated General Directorates at the
Vice-Council Office of Economics and Employment of the Madrid
Regional Department. Senior Risk Analyst in Avalmadrid S.G.R,
external at A.E.M.S.A. Participation in several projects of the EU,
Education Ministry, Madrid Regional Education and Employment
Department and Town-halls. Referee of WOS indexed Journals.
Visiting scholar and Visiting Professor (2011, 2016) at Regional
Economic Applications Laboratory Illinois University.
Boletín de Estadística e Investigación Operativa. Vol. 33, No. 3, Noviembre 2017, pp. 295-321
Opiniones sobre la profesión
Ingenuas reflexiones de un estadístico en la era del Big Data
Ricardo Cao Abad
Grupo de investigación MODES, Departamento de Matemáticas,
Centro de Investigación en Tecnologías de la
Información y las Comunicaciones (CITIC)
Instituto Tecnológico de Matemática Industrial (ITMATI)
Universidade da Coruña,
Campus de Elviña, s/n, 15071 A Coruña, Spain
Abstract
This article presents some reflections of the author (a
statistician) about the role of Statistics in the Big Data era.
The paper goes from the change of paradigms (asymptotic
properties versus huge sample sizes and statistical efficiency
versus computation time), to the more than probable presence
of bias in Big Data. It also makes a tour through subsampling
methods and ‘divide and conquer’ strategies. All these issues
are examined under a very personal (possibly naive) view of
the author.
Keywords: Biased data, big data, bootstrap, divide and conquer,
magnifying glass subsample, minimalist replication bootstrap,
subsampling.
AMS Subject classifications: 62G08, 62G09, 62G10, 62G20,
68T05.
© 2017 SEIO
1. Motivacion
Algo que no siempre esta claro para todos los usuarios de la
estadıstica es que los datos son la informacion recogida en la muestra
observada y un aspecto muy importante es el modo en que se ha
decidido recogerlos. Ası, no solo importan sus valores concretos sino
tambien el procedimiento (normalmente aleatorio) a partir del cual
se obtuvieron esos datos concretos de la poblacion. En ese sentido,
el modelo aleatorio generador de los datos cobra mas importancia,
si cabe, que los datos en sı mismos. De ahı la importancia de
los metodos de muestreo y del concepto de muestra aleatoria. La
conexion de ese concepto de muestra, puramente matematico, con la
realidad es la muestra observada, ya formada por valores obtenidos
del mundo real: los datos.
Hasta hace aproximadamente un siglo la obtencion de datos era
un proceso muy laborioso. Por ese motivo, la mayorıa de los metodos
estadısticos propuestos a finales del siglo XIX y principios del XX
fueron pensados para situaciones en las que el tamano muestral
era pequeno. Un ejemplo de ello es el artıculo de Pearson (1900)
en el que se introduce el estadıstico χ2 para realizar contrastes de
bondad de ajuste. A raız de las ideas expresadas en el artıculo de
Pearson y, sobre todo, en el de William S. Gosset, en el que introdujo
la distribucion t de Student (Student (1908)), fue haciendose mas
evidente la necesidad de disenar procedimientos estadısticos que
tuviesen muy en cuenta el hecho de que, para tamanos muestrales
pequenos, la distribucion de probabilidad de muchos estadısticos
difiere bastante de la que tienen cuando el tamaño muestral tiende
a infinito: la llamada distribución asintótica del estadístico.
La introduccion del metodo bootstrap por Efron (1979) fue un
paso de gigante en ese sentido. El metodo proporciona una filosofıa
general para aproximar la distribucion de un estadıstico para un
tamano muestral finito concreto. Eso sı, la gran utilizacion del
metodo bootstrap no hubiera sido posible de no haber dispuesto de
cada vez mas agiles ordenadores que hoy permiten simular millones
de replicas bootstrap de muchos de los estadısticos mas frecuentes en
unos pocos segundos. Este auge de las tecnologıas de la informacion
(y tambien de la sensorica y las comunicaciones) hace que los propios
ordenadores y dispositivos electronicos pasen de ser una valiosa
herramienta para analizar datos a ser fuentes inagotables de datos.
Esos datos son ahora de tamano muestral ingente y frecuentemente
muy complejos y de alta dimension. Esto ha dado lugar al campo
conocido actualmente como Big Data, sobre el que reflexionare desde
el punto de vista estadıstico, prestando atencion a los cambios de
paradigma que, a mi juicio, se avecinan. Un artıculo muy interesante
en el que el autor reflexiona sobre cual ha de ser el papel de la
estadıstica (y de las personas que nos dedicamos a esta ciencia) en
este campo emergente de los Big Data es el escrito por Pena (2014),
publicado tambien en BEIO.
2. Cambio de paradigmas
El Big Data ha traido consigo la generacion y necesidades de
procesamiento y analisis de bases de datos de gran volumen (en
ocasiones desestructuradas). El sentido en que estas bases de datos
son grandes frecuentemente varıa. Dicho coloquialmente, podemos
hablar de grande a lo ancho (gran numero de variables en la base
de datos) o a lo largo (tamano muestral muy elevado) o en ambos
sentidos. En Cao (2015) se hace un recorrido por diversas situaciones
reales en las que se presentan alguna de estas caracterısticas
de gran tamano (a lo ancho, a lo largo o en ambos sentidos),
incluyendo reflexiones sobre el tratamiento de gandes volumenes de
datos procedentes de imagenes y vıdeos, ası como la perspectiva
infinito-dimensional que proporciona el analisis de datos funcionales.
En el caso en que el conjunto de datos sea grande debido
al elevado tamano muestral, n, a mi juicio se intuye un cambio
de paradigma en lo tocante a la disyuntiva entre el uso (y el
interes) de las propiedades asintoticas como contraposicion a las
propiedades obtenidas para n fijo. Tambien se adivinan cambios
en los criterios de optimalidad de los procedimientos de analisis de
datos que podrıan tener en cuenta ya no solo la eficiencia estadıstica
de los procedimientos, sino su coste computacional y escalabilidad.
Veamoslo a continuacion.
2.1. Propiedades asintoticas y para tamano muestral finito
Cuando uno dispone de una muestra con un tamano muy grande,
los resultados asintoticos deben estar muy cerca de lo que la
muestra nos ofrece. Como consecuencia, las propiedades asintoticas
de los metodos estadısticos deben jugar un papel fundamental en
estos dıas. Asimismo, los resultados que son de mucho interes
para tamanos muestrales pequenos posiblemente dejaran de tenerlo
en estos contextos de Big Data a lo largo. Ası, por ejemplo,
con un tamaño muestral de n = 1 000 000, la distribución $\chi^2_{n-1}$
para el estadıstico estudentizado que permite hacer inferencia mas
precisa sobre la varianza de una poblacion normal, sera poco util
y la mera aproximacion de la distribucion de dicho estadıstico
gracias al Teorema Central del Lımite y la consistencia de los
momentos muestrales sera un resultado mucho mas interesante
en este caso. Igualmente, con un tamano muestral tan grande,
posiblemente no sea necesaria la utilizacion del metodo bootstrap
para hacer inferencia sobre la varianza de la poblacion. De nuevo,
la aproximacion normal para dicha distribucion sera un resultado
mucho mas util para ese caso. Es evidente que eso representa un gran
alivio computacional, pues realizar procedimientos bootstrap que
requieran simular decenas de miles de replicas con tamano muestral
del orden de millones es un proceso que consume mucho tiempo de
CPU y, en ocasiones, una gran cantidad de memoria.
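A modo de esbozo ilustrativo (con datos simulados y valores supuestos, no tomados del artículo), el siguiente código compara el intervalo de confianza para la varianza basado en la aproximación normal con uno bootstrap cuando n es muy grande:

```python
# Esbozo ilustrativo (datos simulados): intervalo de confianza para la
# varianza con n grande, comparando la aproximacion normal basada en el TCL
# con un bootstrap percentil con pocas replicas (puede tardar unos segundos).
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.normal(loc=2.0, scale=3.0, size=n)

s2 = x.var(ddof=1)
# Aproximacion normal: Var(S^2) se estima a partir del cuarto momento muestral.
m4 = np.mean((x - x.mean()) ** 4)
se = np.sqrt((m4 - s2**2) / n)
ic_normal = (s2 - 1.96 * se, s2 + 1.96 * se)

# Bootstrap percentil con B pequeno.
B = 200
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot[b] = x[idx].var(ddof=1)
ic_boot = tuple(np.percentile(boot, [2.5, 97.5]))

print("IC normal:   ", ic_normal)
print("IC bootstrap:", ic_boot)
```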
Por otra parte, al disponer de un tamano muestral tan grande
cabe preguntarse lo cercana que ya esta la informacion muestral
de las caracterısticas poblacionales y hasta que punto podemos
simplemente olvidarnos de los errores de estimacion y considerar
que las estimaciones obtenidas a partir de la muestra son ya valores
muy cercanos a sus analogos poblacionales. Esto puede resultar
razonable en problemas sencillos, pero quiza no tanto en otros mas
complejos y especialmente en aquellos que traen consigo una elevada
dimension del objeto poblacional de interes. Ası, por ejemplo, si
disponemos de n = 1 000 000 observaciones en la muestra y para
cada una hemos registrado los valores de 1 000 variables, un elemento
poblacional que puede resultar interesante (por ejemplo, para llevar
a cabo un analisis de componentes principales) es la matriz de
varianzas-covarianzas. Esa matriz tiene dimension 1 000 ×1 000 y en
ella se hallan (1000 · 999) /2 + 1000 = 500 500 elementos distintos
a estimar. Aunque un millon de datos puede parecer mucho, al
tener que estimar alrededor de medio millon de parametros es muy
probable que en alguno de ellos el error de estimacion sea realmente
grande y que eso distorsione las conclusiones posteriores. Por ello
resulta interesante el poder controlar los errores de estimacion
conjuntos de tal ingente cantidad de parametros. Obviamente, si en
lugar de 1 000 se tratase de 2 000 variables, el numero de parametros
a estimar (2 001 000 elementos de la matriz de varianzas-covarianzas)
harıa que el problema fuese inabordable con “tan solo” un millon de
datos.
Una forma de abordar el problema de estimar las componentes
principales con solo un millon de datos en presencia de 2 000 variables
podrıa ser el considerar como componentes principales factibles
aquellas combinaciones lineales de las variables de partida que tenga,
a lo sumo, “tan solo” 100 coeficientes no nulos. De esta manera, si
consideramos todas las 1 000 potenciales componentes principales,
solo necesitarıamos estimar 100 000 coeficientes, que aunque es un
numero elevado es considerablemente menor que el tamano muestral.
La idea anteriormente expuesta: considerar modelos dispersos, es
decir, con un numero relativamente pequeno de coeficientes no nulos,
mucho menor que el gran numero, d, de variables explicativas, ha
sido y continua siendo muy utilizada en el contexto de Big Data a lo
ancho. Tambien lo es en casos no necesariamente de Big Data pero
cuando simplemente d > n. Entre los trabajos pioneros en esta lınea
se encuentran el de Tibshirani (1996) y el de Efron et al. (2004).
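Como esbozo meramente ilustrativo de la idea de modelo disperso (usando el lasso de scikit-learn con datos simulados y parámetros supuestos, no ligados a ningún ejemplo del artículo):

```python
# Esbozo ilustrativo: modelo disperso tipo lasso con mas variables que
# observaciones (Big Data "a lo ancho"); datos y parametros simulados.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, d = 500, 2000                      # d > n
X = rng.normal(size=(n, d))
beta = np.zeros(d)
beta[:10] = rng.normal(size=10)       # solo 10 coeficientes realmente no nulos
y = X @ beta + rng.normal(scale=0.5, size=n)

modelo = Lasso(alpha=0.05).fit(X, y)
print("coeficientes no nulos estimados:", int(np.sum(modelo.coef_ != 0)))
```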
Tambien es muy frecuente en Big Data a lo ancho que sea
necesario examinar la validez de un enorme numero de hipotesis.
Por ejemplo, contrastar si cada una de las d variables potencialmente
explicativas realmente aporta algo de explicacion en un modelo de
regresion o de clasificacion. Una situacion semejante se da cuando
se manejan modelos con un enorme numero de coeficientes y se
desean contrastar las hipotesis simplificadoras de que cada uno de
esos coeficientes es cero. Para dar respuesta a ese tipo de situaciones
surgieron a finales del pasado siglo y principios del presente (ver
Benjamini and Hochberg (1995) y Benjamini and Yekutieli (2001),
entre otros) diversos metodos encaminados a controlar la tasa de
falsos positivos (o FDR, del ingles false discovery rate) ası como la
tasa de error conjunta (o FWER, del ingles familywise error rate).
Obviamente estos metodos son esenciales en Big Data a lo ancho.
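Un esbozo ilustrativo del procedimiento de Benjamini-Hochberg para controlar la FDR, con p-valores simulados:

```python
# Esbozo ilustrativo del procedimiento de Benjamini-Hochberg (control de la
# FDR) aplicado a un vector de p-valores simulados.
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Devuelve un vector booleano con las hipotesis rechazadas al nivel FDR q."""
    p = np.asarray(pvalues)
    m = len(p)
    orden = np.argsort(p)
    umbrales = q * np.arange(1, m + 1) / m
    por_debajo = p[orden] <= umbrales
    rechazadas = np.zeros(m, dtype=bool)
    if por_debajo.any():
        k = np.max(np.where(por_debajo)[0])   # mayor i con p_(i) <= q*i/m
        rechazadas[orden[: k + 1]] = True
    return rechazadas

rng = np.random.default_rng(3)
pvals = np.concatenate([rng.uniform(size=950), rng.uniform(0, 0.001, size=50)])
print("numero de rechazos:", int(benjamini_hochberg(pvals, q=0.05).sum()))
```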
2.2. Eficiencia estadıstica y eficiencia computacional
En general, cuando tenemos que analizar estadısticamente
conjuntos de gran volumen de datos, un asunto muy relevante es
el del tiempo de calculo necesario para llevar a cabo tales analisis.
Esto plantea la necesidad (o, al menos, la conveniencia) de tener
en cuenta el tiempo de computacion necesario dentro del criterio de
optimalidad del metodo de estimacion. Lo habitual en estadıstica
es considerar medidas de eficiencia, como la inversa del error
cuadratico medio de un estimador, que son utiles para comparar
distintos procedimientos estadısticos atendiendo tan solo al error de
estimacion que cometen. Sin embargo, no es infrecuente que metodos
que proporcionan menos error de estimacion precisen de un mucho
mayor numero de calculos, con lo que el tiempo de computo de los
mismos sera tambien mucho mayor. Eso da pie a tener en cuenta
la llamada eficiencia computacional a la hora de comparar metodos
estadısticos.
Cuando disponemos de un numero de datos moderado, la
eficiencia estadística es el criterio primordial (si no el único) para
determinar la optimalidad de un procedimiento de analisis de datos.
La eficiencia computacional suele considerarse como una propiedad
complementaria deseable del metodo. A veces llega a fijarse cierto
umbral para el tiempo de computacion que no debe rebasar el
metodo de analisis, normalmente debido a requisitos tecnicos, como
el tiempo maximo permisible para poner en practica medidas
correctivas, si tras el analisis de los datos se concluye que dichas
medidas son necesarias. En este sentido, se tratarıa de encontrar
el procedimiento mas eficiente estadısticamente (por ejemplo el
estimador con menor error cuadratico medio) dentro de los que
requieren un tiempo de computo menor o igual que un umbral fijado.
Por el contrario, en algunas aplicaciones crıticas en Big Data es
simplemente necesario producir estimaciones que conlleven un error
estadıstico no mayor que un umbral prefijado pero, dentro de ellas,
resulta crucial poder poner en practica el metodo mas rapido desde el
punto de vista computacional. Esto ocurre, por ejemplo, a la hora de
poner en el mercado productos y tecnologıas con alto valor anadido,
cuando el tiempo de respuesta es un factor decisivo para imponerse
a otros competidores.
Aunque actualmente no es muy frecuente, resulta muy razonable
esperar que en el futuro se utilicen criterios de optimalidad mixtos
que combinen la eficiencia medida desde el punto de vista estadıstico
con la eficiencia computacional, tanto en tiempo de procesado como
en memoria requerida para llevar a cabo el procedimiento de analisis.
Es ası concebible que los procedimientos de analisis opten en el futuro
por elegir un metodo u otro en funcion del peso que reciban el coste
del error de estimacion, el tiempo de computo necesario y la memoria
requerida para la implementacion del metodo, entre otros aspectos.
Ello, ademas, puede depender de la arquitectura computacional a
utilizar, ya que el grado de paralelizacion de los distintos metodos
de analisis de datos puede ser un factor decisivo al evaluar este tipo
de criterios de optimalidad conjuntos estadıstico-computacionales.
Ası, no sera sorprendente en el futuro que una rutina de analisis de
grandes volumenes de datos opte por llevar a cabo un procedimiento
u otro en funcion de la arquitectura computacional en la que se
ejecute.
Por ultimo, un aspecto crucial que ya esta muy presente en el
campo de los Big Data es el de la escalabilidad. En el contexto que
nos ocupa, la escalabilidad podrıamos definirla como la capacidad del
metodo de analisis de datos para adaptarse a situaciones de mayor
volumen (mayor tamano muestral, mayor dimension del modelo,
mayor numero de variables en el mismo) sin perder su calidad.
Frecuentemente la escalabilidad de un procedimiento estadıstico se
evalua examinando como crece el numero de operaciones necesarias
(o su tiempo de ejecucion) al ir aumentando el tamano muestral
o la dimension del problema. Tambien es importante analizar la
escalabilidad desde el punto de vista de la memoria RAM y de la
capacidad necesaria de disco duro para llevar a cabo el procedimiento
estadístico. Obviamente, las limitaciones en memoria RAM pueden suplirse con una utilización más intensiva del disco; sin embargo, esto produce un enlentecimiento considerable del tiempo necesario para poder completar el procedimiento de análisis. Un ejemplo de
procedimiento estadıstico poco escalable desde el punto de vista del
tiempo de ejecucion es el de la obtencion del parametro de suavizado
mediante un criterio del tipo cross validation para la estimacion no
parametrica de la funcion de regresion. Este ejemplo lo trataremos
precisamente en la siguiente seccion.
En las secciones siguientes expondre en mayor detalle algunos
procedimientos estadısticos o problemas concretos que tienen
especial relevancia en el contexto de los datos de gran volumen.
Todo ello siempre con una vision muy personal.
3. Submuestreos lupa
Entre las herramientas de analisis exploratorio de datos
mas utilizadas estan los procedimientos graficos. Frecuentemente
es muy recomendable comenzar a explorar los datos mediante
representaciones graficas que nos permitan simplemente resumir
la informacion, detectar datos atıpicos o establecer patrones
iniciales que luego se validaran, contrastaran o ajustaran mediante
procedimientos estadísticos, a veces sofisticados. De entre las
representaciones graficas uno de los tipos mas usados y utiles son
los diagramas de dispersion de pares de variables y las matrices
de dichas graficas de dispersion. Mediante este tipo de graficas se
pretende examinar, por ejemplo, la posible relación de dependencia
entre variables relevantes del problema. Ello puede ayudar a decidir
que tipos de modelos de regresion formular inicialmente.
Recientemente, en el contexto de un trabajo fin de máster (TFM) que he dirigido, nos encontramos con el problema, tan trivial pero limitante a la vez, de no poder distinguir los patrones subyacentes en un simple gráfico de dispersión entre dos variables. ¿Cómo es eso posible?, se preguntará
el lector. ¿No se arreglo el problema cambiando el tipo de puntos
utilizado, o el tamano, color o forma de los mismos? Pues no, ya
que el grafico de dispersion constaba de algo mas de 800 000 datos,
correspondientes a otros tantos clientes de una entidad financiera.
El “apelotonamiento” de datos era tal que resultaba muy difıcil
distinguir zonas de muy alta concentracion de puntos de otras en
las que la concentración era simplemente alta, o incluso moderada.
Obviamente, uno podría prescindir de la construcción de tales tipos de representaciones gráficas y pasar directamente a algún método de análisis (como la construcción de una estimación no paramétrica de la regresión) que luego pueda ser representado gráficamente sin encontrarse con el problema antedicho. De todas formas, no hay por qué renunciar a esas exploraciones gráficas; simplemente podemos aplicar el principio de “más vale que sobre que no que falte”, ya que de donde sobra se puede quitar.
La forma en que “resolvimos el problema” de no poder distinguir
los patrones en el grafico de dispersion consistio en obtener una
submuestra aleatoria de algo ası como 1 000 datos y representar
graficamente “solo” esos 1 000 datos en un diagrama de dispersion.
Aunque es poco probable, podría suceder que esa submuestra no
fuese muy representativa de la muestra original, ası que repetimos el
procedimiento unas pocas veces (obteniendo submuestras aleatorias
independientes), lo que nos permitio corroborar el patron observado
para la submuestra inicial. De esta forma, el submuestreo actuo a
modo de lupa (o quiza microscopio) permitiendonos ver donde antes
todo estaba enmaranado.
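La idea del submuestreo como lupa puede esbozarse así (ejemplo ilustrativo con datos simulados, ajeno al TFM citado):

```python
# Boceto ilustrativo del "submuestreo lupa": varios diagramas de dispersion
# sobre submuestras aleatorias pequenas de una muestra masiva (datos simulados).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 800_000
x = rng.normal(size=n)
y = np.sin(2 * x) + rng.normal(scale=0.8, size=n)    # patron que el "apelotonamiento" oculta

fig, ejes = plt.subplots(1, 3, figsize=(12, 4), sharex=True, sharey=True)
for eje in ejes:
    idx = rng.choice(n, size=1_000, replace=False)   # submuestra aleatoria de 1 000 datos
    eje.scatter(x[idx], y[idx], s=5)
fig.suptitle("Tres submuestras aleatorias independientes de 1 000 datos")
plt.show()
```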
Este procedimiento de submuestreo puede resultar tambien muy
util cuando se trata de llevar a cabo otros procedimientos estadısticos
que pueden ser poco escalables (muy lentos de ejecucion para
tamanos muestrales relativamente grandes). Precisamente, en el
ejemplo antes citado, tras el analisis visual de algunas submuestras
llegamos a la conclusion de que resultaba conveniente realizar una
estimacion no parametrica (tipo nucleo) de la funcion de regresion.
Uno de los requisitos necesarios para ello es la eleccion del parametro
de suavizado (o ventana), que juega un papel fundamental a la hora
de aplicar esta tecnica. Entrando en cierto detalle, a la hora de hacer
una gráfica del estimador de Nadaraya-Watson, $m_h$, de la función
de regresion hemos de evaluar dicho estimador en una particion
suficientemente fina y para la construccion del mismo, debemos
elegir el parametro de suavizado, h. Uno de los primeros metodos
propuestos (y, aun ası, usado en la actualidad) para la seleccion del
parámetro h consiste en encontrar aquel valor, $h_{CV}$, que minimiza la función por validación cruzada (o cross validation) dada por
$$CV(h) = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - m_h^{-i}(X_i)\right)^2, \qquad (3.1)$$
siendo n el tamaño muestral, $(X_1, Y_1), \ldots, (X_n, Y_n)$ la muestra, $m_h^{-i}(x)$ el estimador evaluado en x y calculado eliminando de la muestra la i-ésima observación, $m_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) Y_i$ el estimador tipo núcleo de Nadaraya-Watson calculado con toda la muestra, K la función núcleo utilizada y h el parámetro de suavizado o ventana. Es inmediato razonar que cada evaluación de
la función dada en (3.1) requiere del orden de $n(n-1)$ operaciones elementales, es decir, ese número de evaluaciones de la función núcleo y otras tantas sumas. En el caso que nos ocupa, con n = 800 000, eso supone que el cálculo de CV(h) para cada h requiere del orden de 6,4 × 10^11 operaciones, es decir, algo más de medio billón de evaluaciones de la función núcleo y otras tantas sumas. Suponiendo que pudiésemos llevar a cabo 100 millones de operaciones por segundo y que cada evaluación de la función K llevase consigo del orden de 10 operaciones, entonces las aproximadamente 7,04 × 10^12 operaciones necesarias requerirían unos 70 400 segundos, es decir, algo más de 19 horas de tiempo de ejecución. Si para encontrar una buena aproximación numérica del valor $h_{CV}$ hiciesen falta
unas 10 evaluaciones de la funcion CV (h), entonces necesitarıamos
algo mas de 8 dıas para disponer del valor del parametro de
suavizado a utilizar. Obviamente, 8 dıas para poder conocer un
parametro auxiliar a utilizar para llevar a cabo un procedimiento de
estimación es un tiempo prohibitivo. La ejecución del procedimiento se demoraría más de dos años si la muestra constase de n = 8 000 000 de datos.
Al reflexionar sobre el ejemplo anterior podrıamos pensar
que algunos procedimientos estadısticos (como la estimacion no
parametrica de la regresion) son practicamente inutilizables con
datos de gran volumen. Sin embargo, ello no tiene por qué ser así; simplemente hemos de utilizar nuestra imaginación para solventar
esos problemas de escalabilidad.
En el contexto de la estimacion no parametrica tipo nucleo
de Nadaraya-Watson de la funcion de regresion, es conocido que
el parametro de suavizado optimo (por ejemplo, en el sentido
de minimizar el error cuadratico promediado medio, MASE) es
asintóticamente de la forma $h_{opt} \simeq c_0\, n^{-1/5}$, para cierta constante
c0 que depende de caracterısticas poblacionales (como la propia
funcion de regresion desconocida, la de densidad de la variable
explicativa y algunas derivadas de ambas). De hecho, bajo
algunas condiciones sabemos que muchos metodos de seleccion
del parametro de suavizado, como el de validacion cruzada,
proporcionan procedimientos consistentes. En concreto, se tiene que
$$\frac{h_{CV,n}}{h_{opt}} \longrightarrow 1,$$
en probabilidad o de forma casi segura. Así pues, para tamaños muestrales relativamente grandes es previsible esperar que $h_{CV,n} \simeq c_0\, n^{-1/5}$. Si no fuese porque $c_0$ es desconocido, podríamos utilizar esa fórmula asintótica para aproximar el valor de $h_{CV,n}$. Sin embargo,
ese problema puede resolverse aplicando el procedimiento estadıstico
con una submuestra de tamano mucho menor (aunque grande aun).
Por ejemplo, tomando una submuestra de tamano m el calculo de
la ventana de validacion cruzada para dicha submuestra requerirıa
del orden de $m(m-1)$ operaciones elementales y este número puede ser mucho menor que $n(n-1)$ eligiendo m adecuadamente.
Supongamos que en nuestro ejemplo tomamos m = 8 000, es decir, $m = n/100$. Así, el cálculo de la ventana de validación cruzada basada en esta submuestra, $h_{CV,m}$, será unas $100^2 = 10\,000$ veces más rápido que con la muestra original completa. Eso significa que requeriría unos 70 segundos de tiempo de ejecución. Ahora, como $h_{CV,m} \simeq c_0\, m^{-1/5}$, se tiene que $c_0 \simeq h_{CV,m}\, m^{1/5}$ y, por tanto, $h_{CV,n} \simeq h_{CV,m}\, m^{1/5}\, n^{-1/5} = h_{CV,m}\left(\frac{m}{n}\right)^{1/5}$. En resumen, la utilización
del mismo procedimiento de seleccion de la ventana pero para una
submuestra aleatoria de la muestra Big Data y una mera correccion
por el tamano muestral permitirıan obtener en poco mas de un
minuto el parametro de suavizado que, de seleccionarlo utilizando la
muestra completa, requerirıa mas de 8 dıas de calculos. Obviamente
esta forma de proceder provoca que el valor obtenido dependa de
la submuestra concreta elegida, pero podemos repetir el proceso
para unas cuantas submuestras y considerar el valor promedio del
$c_0$ aproximado mediante $h_{CV,m}\, m^{1/5}$.
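Un posible esbozo en Python de esta estrategia (puramente ilustrativo, no es el código del trabajo citado) sería el siguiente; por simplicidad se usa la versión autonormalizada del estimador de Nadaraya-Watson, un núcleo gaussiano y una submuestra menor que la del ejemplo para que la matriz de distancias quepa con holgura en memoria:

```python
# Boceto ilustrativo: seleccionar la ventana por validacion cruzada en una submuestra
# de tamano m y extrapolarla a n mediante h_n = h_m * (m/n)^(1/5). Datos simulados.
import numpy as np

def cv_ventana(x, y, rejilla_h):
    """Validacion cruzada dejando-uno-fuera para un estimador tipo Nadaraya-Watson."""
    dif = x[:, None] - x[None, :]                 # matriz m x m de diferencias
    mejor_h, mejor_cv = None, np.inf
    for h in rejilla_h:
        K = np.exp(-0.5 * (dif / h) ** 2)         # nucleo gaussiano
        np.fill_diagonal(K, 0.0)                  # se excluye la propia observacion
        pred = (K @ y) / K.sum(axis=1)            # version autonormalizada del estimador
        cv = np.mean((y - pred) ** 2)
        if cv < mejor_cv:
            mejor_h, mejor_cv = h, cv
    return mejor_h

rng = np.random.default_rng(2)
n, m = 800_000, 2_000          # en el articulo m = 8 000; aqui algo menor por sencillez
x_total = rng.uniform(0, 1, size=n)
y_total = np.sin(2 * np.pi * x_total) + rng.normal(scale=0.3, size=n)

idx = rng.choice(n, size=m, replace=False)        # submuestra aleatoria
h_m = cv_ventana(x_total[idx], y_total[idx], np.linspace(0.01, 0.2, 20))
h_n = h_m * (m / n) ** (1 / 5)                    # correccion asintotica h ~ c0 * n^(-1/5)
print(f"h_CV,m = {h_m:.4f}  ->  h_CV,n aproximada = {h_n:.4f}")
```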
En el ejemplo anterior, la clave para poder reducir el numero de
operaciones estuvo en conocer la expresion asintotica del parametro
auxiliar (de suavizado) que deseamos elegir. Eso permite “extrapolar”
el valor obtenido para el tamano muestral m al que corresponderıa
para otro tamano, n, mucho mayor. Para otros parametros o en
otros contextos es posible que no dispongamos de resultados teoricos
que permitan llevar a cabo razonamientos como ese. En tal caso,
siempre serıa factible llevar a cabo el procedimiento para unos pocos
valores del tamano de las submuestras: m1 < m2 < · · · < mk,
mucho menores que n, y luego formular un modelo que permita
relacionar el parametro en cuestion con el tamano de la submuestra.
Dicho modelo podrıa utilizarse acto seguido para predecir el valor
que deberıa tomar dicho parametro para el tamano muestral original.
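Un esbozo ilustrativo de esa idea, con valores ficticios del parámetro para varios tamaños de submuestra y un ajuste potencial por mínimos cuadrados en escala logarítmica, podría ser:

```python
# Boceto ilustrativo: ajustar h(m) ~ c * m^gamma con varios tamanos de submuestra
# y extrapolar al tamano muestral original n. Los valores de h son ficticios.
import numpy as np

m_valores = np.array([2_000, 4_000, 8_000, 16_000])
h_valores = np.array([0.120, 0.104, 0.091, 0.079])   # parametro estimado en cada submuestra

gamma, log_c = np.polyfit(np.log(m_valores), np.log(h_valores), deg=1)  # regresion log-log
n = 800_000
h_prevista = np.exp(log_c) * n ** gamma
print(f"exponente estimado: {gamma:.3f}; h prevista para n = {n}: {h_prevista:.4f}")
```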
4. Divide y venceras
Frecuentemente el tamano del conjunto de datos provoca que
los procedimientos clasicos de analisis conlleven un numero de
operaciones demasiado elevado. Como consecuencia, en esos casos
parece razonable modificar el procedimiento para que el tiempo
de ejecucion del mismo sea factible. Un simple ejemplo es el
calculo de la funcion de distribucion empırica para una muestra
de gran número de datos. Imaginemos, por ejemplo, que estamos registrando la temperatura cada segundo en cada tienda de una cadena comercial con mil establecimientos en todo el mundo. Así, si pretendemos construir la función de distribución empírica con los datos de todos los establecimientos correspondientes a los tres últimos años, nos encontramos con que deberemos utilizar n = 9,5 × 10^10 datos de temperaturas. El cálculo de la distribución
empírica requiere esencialmente ordenar la muestra, lo cual puede hacerse mediante algoritmos eficientes, como el quicksort, en $n \log_2 n$ operaciones. En este ejemplo, dado el valor de n, eso requerirá de unas 3,5 × 10^12 operaciones. Si nuestro ordenador pudiese realizar 100 millones de operaciones por segundo, necesitaríamos unos 35 000 segundos (es decir, casi 10 horas) para llevar a cabo el cálculo
de dicha funcion de distribucion empırica. Este puede resultar un
tiempo excesivo, si lo que se desea es tomar decisiones en corto
espacio de tiempo.
Supongamos ahora que simplemente deseamos calcular la temperatura mediana de esos n = 9,5 × 10^10 datos. En lugar de las $n \log_2 n$ operaciones que requeriría la ordenación de todos los datos, podríamos “romper” la muestra en m submuestras de tamaño n/m,
ordenar cada una de esas submuestras y luego utilizar las medianas
de esas submuestras para construir un estimador de la mediana
poblacional. Para hacer esto ultimo, una posibilidad serıa calcular
la mediana de todas esas medianas submuestrales. De proceder de
esta forma, el numero de operaciones necesarias para ordenar las
m submuestras sería $m \cdot \frac{n}{m}\log_2\frac{n}{m} = n \log_2\frac{n}{m}$, mientras que para ordenar esas m medianas submuestrales necesitaríamos $m \log_2 m$ operaciones. Así pues, en total harían falta un número de operaciones igual a
$$g(m) = n \log_2\frac{n}{m} + m \log_2 m = n \log_2 n - (n - m)\log_2 m.$$
En nuestro caso, utilizando que n = 9,5 × 10^10 y minimizando en m la función g(m), es fácil obtener el valor óptimo para m, que resulta ser m = 4,1 × 10^9 submuestras. Eso significa que el número de operaciones a realizar sería 5,6 × 10^11, es decir, unas 6 veces menor que el necesario con la muestra original completa. La forma de esta función puede verse en la Figura 1.
Figura 1: Número de operaciones, g(m), para el cálculo del estimador “divide y vencerás” en función del número de submuestras elegidas.
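La minimización numérica de g(m) es inmediata; el siguiente boceto (ilustrativo, ajeno al artículo) reproduce aproximadamente los valores citados en el texto:

```python
# Boceto ilustrativo: buscar el numero de submuestras m que minimiza
# g(m) = n*log2(n/m) + m*log2(m) con n = 9.5e10 (el ejemplo del texto).
import numpy as np

n = 9.5e10
m = np.logspace(3, np.log10(n), num=200_000)     # rejilla logaritmica de valores de m
g = n * np.log2(n / m) + m * np.log2(m)
i = np.argmin(g)
print(f"m optimo ~ {m[i]:.2e}, g(m) ~ {g[i]:.2e}, frente a n*log2(n) ~ {n * np.log2(n):.2e}")
# Deberia obtenerse m ~ 4.1e9 y g(m) ~ 5.6e11, unas 6 veces menos operaciones.
```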
En esta situación resulta interesante comparar estos dos estimadores, θ1 (mediana muestral basada en los n datos) y θ2 (mediana de las m medianas submuestrales), desde el punto de vista de su eficiencia estadística. Si denotamos por θ0 la mediana poblacional, es bien conocido que la distribución asintótica de θ1 es una $N\!\left(\theta_0, \frac{1}{4nf(\theta_0)}\right)$, siendo f la función de densidad de la población. Por su parte, la distribución de la mediana submuestral con submuestras de tamaño n/m sería aproximadamente una $N\!\left(\theta_0, \frac{1}{4\frac{n}{m}f(\theta_0)}\right)$ y, consecuentemente, la distribución asintótica de la mediana de las medianas submuestrales, θ2, viene dada por una $N\!\left(\theta_0, \frac{1}{4m\,g_{n,m}(\theta_0)}\right)$, siendo $g_{n,m}(x)$ la función de densidad de una distribución $N\!\left(\theta_0, \frac{1}{4\frac{n}{m}f(\theta_0)}\right)$. Utilizando la expresión de esta
densidad se obtiene
$$g_{n,m}(\theta_0) = \frac{1}{\frac{1}{2\sqrt{\frac{n}{m}}\sqrt{f(\theta_0)}}\,\sqrt{2\pi}}\,\exp\!\left(-\frac{(\theta_0-\theta_0)^2}{\frac{2}{4\frac{n}{m}f(\theta_0)}}\right) = \frac{2\sqrt{\frac{n}{m}}\sqrt{f(\theta_0)}}{\sqrt{2\pi}},$$
con lo cual la distribución asintótica de θ2 resulta ser una $N\!\left(\theta_0, \frac{\sqrt{2\pi}}{8\sqrt{mn}\,\sqrt{f(\theta_0)}}\right)$. En particular, ambos estimadores son asintóticamente insesgados y sus varianzas asintóticas resultan
$$Var\!\left(\theta_1\right) \simeq \frac{1}{4nf(\theta_0)}, \qquad Var\!\left(\theta_2\right) \simeq \frac{\sqrt{2\pi}}{8\sqrt{mn}\,\sqrt{f(\theta_0)}}.$$
Así, θ2 es asintóticamente más eficiente que θ1 si y solo si $m > \frac{1}{2}\pi n f(\theta_0)$. Para el caso de una población normal de desviación típica σ (su media será θ0, igual a su mediana, por ser la normal simétrica), esta condición resulta $m > \frac{\sqrt{\pi}}{2\sqrt{2}\,\sigma}\, n$, que viene a imponer que el número de submuestras a elegir para que el nuevo estimador sea asintóticamente más eficiente que el clásico no puede ser excesivamente pequeño, en términos del tamaño muestral de la muestra original y de la desviación típica de la población. Si σ es un valor grande, estamos dando mucha libertad para la elección de m, pero si σ es pequeño la situación es la contraria. Así, por ejemplo, si $\sigma < \frac{\sqrt{\pi}}{2\sqrt{2}} = 0{,}62666$ el estimador θ2 no sería
asintoticamente mas eficiente que θ1 para ninguna eleccion de m,
supuesta una distribucion poblacional normal. No obstante, θ2 sı
serıa mas eficiente computacionalmente que θ1 y posiblemente su
varianza no se vería muy afectada. Para nuestro ejemplo, con n = 9,5 × 10^10, m = 4,1 × 10^9 y considerando una población normal, las
desviaciones típicas asintóticas de θ1 y θ2 serían
$$\sqrt{Var\!\left(\theta_1\right)} \simeq 2{,}5683 \times 10^{-6}\,\sqrt{\sigma}, \qquad \sqrt{Var\!\left(\theta_2\right)} \simeq 5{,}0136 \times 10^{-6}\,\sqrt[4]{\sigma}.$$
Puede verse entonces que, para σ = 1 la desviacion tıpica del
estimador θ2 es alrededor del doble que la de θ1, pero ambas son
realmente muy pequenas. Si consideramos otros casos mas extremos,
como σ = 0,0001 o σ = 10000, obtenemos que, en el primero de ellos, la desviación típica de θ2 es alrededor de 20 veces mayor que la de θ1 (pero ambas muy pequeñas, de órdenes $10^{-7}$ y $10^{-8}$), mientras que en el segundo, la desviación típica de θ2 es alrededor de 5 veces menor que la de θ1, siendo ambas bastante pequeñas, de órdenes $10^{-5}$ y $10^{-4}$, respectivamente.
Otra forma razonable de proceder para “integrar” la informacion
de las m medianas submuestrales, serıa calcular la media (en
lugar de la mediana) de las medianas submuestrales. Denotemos
dicho estimador por θ3. Dado que la distribución de la mediana submuestral es aproximadamente una $N\!\left(\theta_0, \frac{1}{4\frac{n}{m}f(\theta_0)}\right)$, la varianza asintótica de θ3 (media muestral de las medianas submuestrales) resulta
$$\frac{\frac{1}{4\frac{n}{m}f(\theta_0)}}{m} = \frac{1}{4nf(\theta_0)}$$
y, por tanto, θ3 sigue aproximadamente una $N\!\left(\theta_0, \frac{1}{4nf(\theta_0)}\right)$, es decir, la misma distribución asintótica que θ1,
sin que para ello influya (asintoticamente, al menos) la eleccion de m.
Sin embargo el numero de operaciones necesarias para calcular θ3 es
incluso un poco menor que para calcular θ2, es decir, bastante menor
que para calcular θ1. Esto da pie a concluir que en una situacion
como esta sería muy razonable dividir el conjunto de n = 9,5 × 10^10 datos en m = 4,1 × 10^9 submuestras, calcular con cada una la
mediana submuestral y finalmente calcular la media de todas esas
m medianas.
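Los tres estimadores comparados en esta sección pueden esbozarse así (ejemplo ilustrativo con datos simulados y tamaños reducidos respecto a los del artículo):

```python
# Boceto ilustrativo de la estrategia "divide y venceras" para la mediana:
# theta1 = mediana global, theta2 = mediana de medianas, theta3 = media de medianas.
import numpy as np

rng = np.random.default_rng(3)
n, m = 10_000_000, 1_000                 # tamanos reducidos respecto al ejemplo del texto
datos = rng.normal(loc=5.0, scale=1.0, size=n)

theta1 = np.median(datos)                # mediana calculada sobre toda la muestra
bloques = datos.reshape(m, n // m)       # m submuestras de tamano n/m
medianas = np.median(bloques, axis=1)    # mediana de cada submuestra
theta2 = np.median(medianas)             # mediana de las m medianas submuestrales
theta3 = np.mean(medianas)               # media de las m medianas submuestrales

print(f"theta1 = {theta1:.5f}, theta2 = {theta2:.5f}, theta3 = {theta3:.5f}")
```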
5. Bootstrap con replicas minimalistas
Recientemente, en una colaboracion de nuestro grupo de
investigacion con un grupo de oncologos nos vimos en la necesidad
de utilizar un metodo bootstrap para contrastar la significacion
de variables relacionadas con metilaciones en datos procedentes de
sarcomas. Como es bien sabido, aunque los metodos bootstrap son
computacionalmente costosos, la potencia actual de los ordenadores
permite ejecutarlos en muy pocos segundos. En nuestro caso, el
numero de datos, n = 300, era moderado, sin embargo, en el
problema que nos ocupaba el numero de potenciales variables
explicativas era cercano a las 400 000, lo cual provocaba un factor de
enlentecimiento tal, que los analisis necesarios podrıan demorarse
durante anos. ¿Como completar entonces todo el analisis en un
tiempo razonable? Veamoslo con un poco mas de detalle.
Dado que pretendıamos realizar contrastes de significacion para
cada una de las k = 400 000 variables, siendo el numero de contrastes
de hipotesis tan elevado, se impone utilizar una tecnica que controle
la tasa de falsos positivos (FDR). En concreto utilizamos los metodos
de Benjamini and Hochberg (1995) y Benjamini and Yekutieli (2001).
Estos metodos se basan en ordenar los p-valores de menor a mayor:
$p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(k)}$ y compararlos con diversos umbrales calculados a partir del nivel de significación global prefijado, α, mediante la condición $p_{(i)} \leq \frac{\alpha}{k-i+1}$, encontrando cuál es el máximo valor de i que cumple dicha condición. Como consecuencia, el p-valor más pequeño ha de compararse con $\frac{\alpha}{k}$, que en nuestro caso, usando α = 0,05 y teniendo en cuenta que k = 400 000, resultaba ser $\frac{\alpha}{k} = 1{,}25 \times 10^{-7}$, un valor muy pequeño. Como además necesitábamos
utilizar el metodo bootstrap para aproximar estos p-valores, ello
significaba tener que simular un gran numero de replicas bootstrap
y con ellas calcular la proporcion de veces en las que la version
bootstrap del estadıstico ofrecıa un valor menor que el estadıstico en
la muestra original, siendo dicha proporcion precisamente el p-valor
aproximado por bootstrap. Como se ha de hacer (entre otros cientos de miles) la comparación $p_{(1)} \leq 1{,}25 \times 10^{-7}$, es evidente que el número de réplicas bootstrap necesarias ha de ser algo mayor que $\frac{1}{1{,}25 \times 10^{-7}} = 8{,}0 \times 10^{6}$, es decir, algo mayor que 8 millones. Por ejemplo, unos 100 millones de réplicas bootstrap serían un número razonable. Sin embargo, dada la complejidad del estadístico (de orden cuadrático en el tamaño muestral, n) y el número de variables sobre las que implementar el bootstrap (k = 400 000), el número de operaciones necesarias sería del orden de $300^2 \times 400\,000 \times 10^8 = 3{,}6 \times 10^{18}$, que en un ordenador que realice unos 1000 millones de operaciones por segundo llevaría un tiempo de ejecución de unos... ¡114 años!
En realidad, el motivo de lanzar un número tan grande ($10^8$) de réplicas bootstrap viene de la necesidad de hacer comparaciones del tipo $p_{(i)} \leq \frac{\alpha}{k-i+1}$; pero, obviamente, cuando con unas pocas réplicas bootstrap (pongamos 10) el p-valor estimado por bootstrap sea al menos 0,1, es obvio que para ese índice, i, no va a ocurrir que $p_{(i)} \leq \frac{\alpha}{k-i+1}$, sino todo lo contrario. Esto significa que podríamos hacer una primera ronda de procedimientos bootstrap, para cada una de las k variables, con solo B = 10 réplicas bootstrap, y solo aumentar el valor de B (por ejemplo, al valor B = 100) para aquellas variables para las que el p-valor obtenido por bootstrap con solo 10 réplicas haya sido 0 (pues en las demás habrá sido de al menos 0,1).
Es de esperar que la inmensa mayoría de las variables (pongamos 380 000, el 95 % de ellas) estén en esa situación de que p ≥ 0,1,
y “solo” para las 20 000 restantes habrıa que aumentar el numero
de replicas, pongamos al valor B = 100. Ahora procederıamos
a calcular los nuevos p-valores para esas (supongamos) 20 000
variables, teniendo presente que, como ya hay unos 380 000 p-valores mayores que α = 0,05, estos otros 20 000 p-valores, calculados con B = 100 réplicas bootstrap, han de compararse con umbrales de la forma $\frac{\alpha}{k-i+1}$, que para $i = 1, \ldots, 20\,000$ son a lo sumo $\frac{0{,}05}{400\,000-20\,000+1} = 1{,}32 \times 10^{-7}$. De esta manera, es de esperar que en torno al 95 % de esos p-valores sean al menos 0,01 (con lo cual, mucho mayores que $1{,}32 \times 10^{-7}$), teniendo que aumentar el valor de B (pongamos a B = 1000) para los 1000 p-valores restantes. Continuando con
un procedimiento de esta forma y denotando por $\ell$ el número de variables que resultaran finalmente significativas (para las cuales sí necesitaríamos del orden de $10^8$ réplicas bootstrap), el número de réplicas bootstrap totales necesarias sería aproximadamente del orden de
$$380\,000 \cdot 10 + 19\,000 \cdot 100 + 950 \cdot 1\,000 + 47 \cdot 10\,000 + 2 \cdot 100\,000 + \ell \cdot 10^8$$
$$= 7{,}32 \times 10^6 + \ell \cdot 10^8 = (\ell + 0{,}0732) \cdot 10^8 \simeq 10^8 \ell,$$
con lo que el número de operaciones necesarias sería $300^2 \times 10^8 \ell = 9 \times 10^{12}\ell$. Así pues, el cociente entre el número de operaciones necesarias con el método standard y el número de operaciones necesarias con este bootstrap “minimalista en réplicas” resulta $\frac{3{,}6 \times 10^{18}}{9 \times 10^{12}\ell} = \frac{4 \times 10^5}{\ell}$.
Si, por ejemplo, finalmente resultase que hay $\ell = 100$ de las 400 000 variables significativas, entonces este nuevo método requeriría 4000
veces menos operaciones que el metodo clasico, lo cual supone que, en
un ordenador que realice 1000 millones de operaciones por segundo,
los 114 anos de tiempo de ejecucion del metodo standard pasen
a ser de solo algo mas de 10 dıas en este metodo optimizado. La
repercusion de este cambio es absolutamente crucial.
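La lógica de esta estrategia de réplicas crecientes puede esbozarse de forma genérica; el siguiente boceto es ilustrativo y no corresponde al código real del estudio, y la función pvalor_bootstrap es hipotética (representaría el cálculo del p-valor bootstrap de una variable con B réplicas):

```python
# Boceto ilustrativo del bootstrap "minimalista en replicas": solo se aumenta B
# para las variables cuyo p-valor estimado sigue siendo 0 en la ronda anterior.
# La funcion 'pvalor_bootstrap' es hipotetica; no pertenece al estudio citado.

def bootstrap_minimalista(variables, pvalor_bootstrap,
                          rondas_B=(10, 100, 1_000, 10_000, 100_000, 10**8)):
    """Devuelve un diccionario variable -> p-valor aproximado por bootstrap."""
    pendientes = list(variables)
    pvalores = {}
    for B in rondas_B:
        todavia_cero = []
        for v in pendientes:
            p = pvalor_bootstrap(v, B)   # proporcion de replicas mas extremas que el estadistico original
            pvalores[v] = p
            if p == 0.0:                 # p-valor aun indistinguible de 0: necesita mas replicas
                todavia_cero.append(v)
        pendientes = todavia_cero
        if not pendientes:
            break
    return pvalores

# Ejemplo minimo de uso con un calculo de p-valor simulado
import random
resultado = bootstrap_minimalista(range(20), lambda v, B: sum(random.random() < 0.3 for _ in range(B)) / B)
```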
6. Big Data . . . Big Bias
Segun nos adentramos en la era de los Big Data esta
imponiendose la (a menudo falsa) idea de que los conjuntos masivos
de datos reflejan la verdad absoluta. Brooks (2013) califica esta
concepcion como el “datismo”. Sin embargo frecuentemente los
datos contienen sesgos ocultos que a menudo provienen de su
procedimiento de recogida, especialmente para los metodos de
muestreo en los que los individuos de la muestra se autoseleccionan.
Un caso citado por Crawford (2013) es la base de datos de mas
de 20 millones de tweets, originados por el huracan Sandy en
octubre-noviembre de 2012. Un analisis combinado de los datos de
Twitter y los procedentes de Foursquare permitio obtener algunas
conclusiones previsibles, como un aumento de gastos en alimentacion
la noche previa a la tormenta, y otras mas sorprendentes, como un
incremento en la vida nocturna el dıa siguiente al huracan. Este
es un caso en el que los datos no son una muestra “insesgada”
de la poblacion que estamos estudiando. Ası, la gran mayorıa de
los tweets sobre Sandy provinieron de Manhattan, debido al alto
numero de propietarios de telefonos inteligentes en Nueva York. En
las zonas mas afectadas por el desastre se originaron pocos mensajes.
No solo por la menor penetracion del mercado de smartphones en
esas zonas, sino, sobre todo, porque los cortes electricos en esas areas
mas afectadas provocaron muchos problemas de acceso a internet y
provocaron tambien que muchos de esos telefonos se quedaran sin
baterıa en las horas posteriores a la tormenta.
Otro ejemplo muy interesante mencionado por Crawford (2013)
es el de los datos recopilados en la ciudad de Boston a partir de
StreetBump, la aplicacion para telefonos inteligentes que detecta,
de forma pasiva, la existencia de baches a partir de los registros
de los acelerometros de los smartphones y de los datos del GPS
durante la conduccion de un automovil. Los datos se envıan al
Departamento de Trafico de la Ciudad de Boston, que ası puede
planificar con eficiencia la reparacion de los baches, optimizando
recursos y ahorrando tiempo. En este caso, uno de los problemas
observados al poner en marcha el proyecto fue que algunos segmentos
de la poblacion de la ciudad de Boston (como las clases menos
favorecidas) tienen una baja tasa de uso de telefonos inteligentes.
Ademas esa tasa es aun menor para grupos de edad avanzada, con
lo cual esos datos proporcionan una muestra muy sesgada (aunque
grande, en numero) de la poblacion de baches existentes en la
ciudad. Eso provoca una infraestimacion del numero de baches en
determinados barrios de la ciudad, con la consiguiente deficiencia
sobre la planificacion optima.
Recientemente Cao and Borrajo (2018) han considerado el
problema de estimacion de la media, en un contexto no parametrico,
cuando disponemos de datos de gran volumen (Big Data) pero
sesgados, proponiendo metodos para corregir el problema causado
por el sesgo en dos situaciones diferentes: (i) cuando existe la
posibilidad de obtener una muestra aleatoria simple (de mucho
menor tamano) de la poblacion original y (ii) cuando el mecanismo
que provoca el sesgo puede replicarse sobre la poblacion ya sesgada
de forma que se dispone de una segunda muestra doblemente sesgada
y de pequeno tamano.
7. Conclusiones
Algunas conclusiones (muy personales) sobre lo que se ha abordado en este artículo son las siguientes:
1. Los paradigmas clasicos de la estadıstica merecen ser
revisados/actualizados en la era de los Big Data. Por ejemplo,
la teorıa asintotica puede ahora verse muy reflejada en los
datos. Por el contrario, los metodos de remuestreo podrıan
ser menos utilizados que hasta ahora, aunque posiblemente los
metodos de submuestreo tendran un gran auge. Ademas, esta
nueva realidad probablemente hara que se centre mas atencion
en los procedimientos recursivos y las tecnicas de reduccion de
la dimension, como ya ocurrio antano.
2. Posiblemente sera necesario introducir nuevos paradigmas,
como el analisis de la complejidad de los metodos de
inferencia, los requisitos de memoria de los mismos (eficiencia
computacional de los metodos estadısticos), la facilidad de
paralelizacion de los procedimientos y el uso de estrategias
del tipo “divide y venceras”. En resumen, la “escalabilidad” de
los metodos de analisis estadıstico cobrara una mucho mayor
importancia.
3. Los metodos de submuestreo se anticipan como una potente
herramienta de visualizacion y analisis en situaciones en las
que el desbordante tamano muestral haga impracticables las
operaciones y gráficas más sencillas. Por otra parte, en casos en los que los métodos modernos de inferencia estadística más computacionalmente costosos (como el bootstrap) aún resulten necesarios (por ejemplo, en situaciones de Big Data en horizontal pero no en vertical), será necesaria su optimización
computacional, a efectos de ahorrar muchas operaciones donde
los resultados ya son concluyentes con calculos muy poco
costosos.
4. El sesgo en los datos puede ser un problema muy considerable y
frecuente en el contexto de los Big Data. Muchos de esos datos
se autoseleccionan, con lo cual no existe un procedimiento
de muestreo, controlado por el experimentador, que permita
garantizar la representatividad del conjunto de datos de gran
volumen. Conviene detectar la presencia de esos posibles sesgos
y, si es que existen, corregirlos en la fase de analisis de datos.
En resumen, la estadıstica se dispone a afrontar nuevos retos (de
hecho lo esta haciendo ya) de la mano de la computacion: las bases de
datos, la inteligencia artificial y la computacion de altas prestaciones.
Para ello es muy importante que nosotros, los estadısticos, tomemos
la iniciativa y juguemos un papel muy relevante en esta nueva
disciplina que se esta dando en llamar la Ciencia de Datos. Creo
que la creacion de tıtulos universitarios de grado en Ciencia e
Ingenierıa de Datos, en diversos lugares de Espana, es una magnıfica
oportunidad para poner en practica esta actitud proactiva que creo
enormemente necesaria.
Referencias
[1] Benjamini Y. and Hochberg Y. (1995). Controlling the false
discovery rate: a practical and powerful approach to multiple
testing. J. R. Stat. Soc.B, 57, 289-300.
[2] Benjamini Y. and Yekutieli D. (2001). The control of the false
discovery rate in multiple testing under dependency. Ann. Stat.,
29, 1165-1188.
[3] Brooks D. (2013). The Philosophy of Data. The New York
Times, 5th of February, p. A23.
[4] Cao R. (2015). Inferencia estadıstica con datos de gran volumen.
La Gaceta de la RSME, 18, 1001-1025.
[5] Cao R. and Borrajo L. (2018) Nonparametric mean estimation
for big-but-biased data. In: E. Gil et al. (Eds.) The Mathematics
of the Uncertain. A Tribute to Pedro Gil. Studies in Systems,
Decision and Control, Springer (in press).
[6] Crawford K. (2013). The Hidden Biases in Big
Data. Harvard Business Review, 1st of April. In:
https://hbr.org/2013/04/the-hidden-biases-in-big-data
[7] Efron B. (1979). Bootstrap methods: another look at the
Jackknife. Ann. Statist., 7, 1-26.
[8] Efron B., Hastie T., Johnstone I. and Tibshirani R. (2004). Least angle regression. Ann. Stat., 32, 407-451.
[9] Pearson K. (1900). On the criterion that a given system of
deviations from the probable in the case of a correlated system
of variables is such that it can be reasonably supposed to have
arisen from random sampling. Philosophical Magazine Series,
5, 50, 157–175.
[10] Pena D. (2014). Big Data and Statistics: Trend or Change?
Boletın de Estadıstica e Investigacion Operativa, 30, 313-324.
[11] Student (1908). The probable error of a mean. Biometrika, 6,
1-25.
[12] Tibshirani R. (1996). Regression shrinkage and selection via the
Lasso. J. R. Stat. Soc.B, 58, 267-288.
Acerca del autor
Ricardo Cao es Catedratico de Estadıstica e Investigacion
Operativa en el Departamento de Matematicas de la Universidade
da Coruna, donde coordina el grupo de investigacion MODES
(modelizacion, optimizacion e inferencia estadıstica). Sus lıneas de
investigacion abarcan la inferencia no parametrica, los metodos de
remuestreo, el analisis de supervivencia, la verosimilitud empırica,
los metodos estadısticos para Big Data, el analisis de datos
funcionales y los metodos estadısticos en genomica, neurociencia,
malherbologıa y riesgo de credito. Es miembro de la Bernoulli
Society, de la Sociedad Espanola de Biometrıa y de la Sociedad
Espanola de Estadıstica e Investigacion Operativa (SEIO), a cuyo
Consejo Academico pertenecio. Es Co-Editor Jefe de la revista
Computational Statistics (2016-actualidad) y ha sido Editor Jefe
de la revista TEST (2009-2012) y previamente Editor Asociado
de la misma. Actualmente es ademas Editor Asociado de las
revistas Computational Statistics & Data Analysis y Journal
of Nonparametric Statistics. Ricardo Cao ha sido Presidente de European Courses in Advanced Statistics (ECAS) (2009-2014) y también su Vicepresidente (2007-2009 y 2014-2015). Es Miembro
Electo del International Statistical Institute. Fue Coordinador de
Matematicas en la Agencia Nacional de Evaluacion y Prospectiva
(ANEP) del Ministerio de Ciencia e Innovacion (2008-2011) y
Vicerrector de Investigacion y Transferencia de la Universidad de A
Coruna (enero 2012 - enero 2016). Ha dedicado parte de su trabajo
a labores de transferencia a sectores como el sanitario, el naval, el
comercial y el industrial. Es autor de siete libros docentes y mas de
150 publicaciones cientıficas de investigacion. De ellas mas de 90 son
artıculos en revistas internacionales recogidas en ISI Web of Science.
Ha dirigido diez tesis doctorales ya defendidas y en la actualidad está dirigiendo otras cuatro más. Ha sido investigador principal de algo
mas de una docena de proyectos de investigacion en convocatorias
competitivas y de once contratos de investigacion con empresas.
Boletín de Estadística e Investigación Operativa. Vol. 33, No. 3, Noviembre 2017, pp. 322-346
Special Section
Premios Incubadora de Sondeos y Experimentos
Milagros Dieguez Taboada
C.P.I. As Revoltas
Paula Blanco Mosquera
C.P.I. de San Vicente
Roberto Manın Gutierrez
I.E.S Galileo Galilei
Sabela Vazquez
I.E.S. Ibaialde de Burlada
Abstract
This new section of the journal BEIO presents the winning works of the prize “Incubadora de Sondeos y Experimentos”, organized by the Society of Statistics and Operational Research (SEIO). This contest aims to promote the teaching and learning of statistics at non-university educational levels. The aim of this publication is to disseminate these works; for this reason, the tutors have been invited to make them public, with the intention that they can serve as support material for other teachers. This first publication collects the first prizes of each of the phases of the 2016-2017 course. More information on the awarded works and their authors can be found at http://www.seio.es/Incubadora/premiados-2016-17.html.
1. Probability games and Law of Large Numbers
1.1. El proyecto
En este proyecto intentamos averiguar si nuestros companeros
de instituto colaboran cuando se les pide ayuda o vienen por
contraprestaciones que puedan obtener. Para dar respuesta a nuestra
pregunta construımos una serie de juegos de probabilidad y les
pedimos que colaborasen con nosotros haciendo los experimentos
para despues presentar los resultados a un concurso. La conclusion
fue claramente que, amigos sı, pero si hay recompensa, es decir,
creemos que nuestros companeros responden mejor si hay premio.
El proyecto sigue dos ramas bien diferenciadas, por un lado
hicimos un sondeo para analizar la respuesta del alumnado del centro
ante nuestra solicitud de ayuda y por otro lado, un plan experimental
en el que tratamos de comprobar empıricamente como la frecuencia
relativa tiende a la probabilidad.
Sondeo
Partimos de una población de 94 alumnos de secundaria y 140 de primaria. Los experimentos se realizaron por separado: mientras que con los de secundaria se llevó a cabo la investigación durante varios días, a los de primaria les dedicamos dos, uno simplemente invitándolos a colaborar y un segundo día con premio por la participación.
Sondeo secundaria
El estudio del comportamiento del alumnado de secundaria esta
dividido en tres fases:
• Primera fase: Pusimos un cartel en la puerta del aula donde estaban los juegos y solicitamos colaboración personalmente. Se realizaron los experimentos durante 16 recreos y nos ayudaron únicamente en 25 ocasiones.
• Segunda fase: Una de las alumnas salió al patio con un altavoz pidiendo colaboración y ofreciendo regalos a los que nos ayudasen. A esta fase le dedicamos 6 recreos y pasamos de tener una media diaria de 3.1, en la primera fase, a una media de 18.3 en esta segunda fase.
• Tercera fase (premio a la constancia): Aquellos que colaborasen durante diez recreos entrarían en el sorteo de una tarjeta Google Play. En esta ocasión fueron 16 recreos los que dedicamos a los experimentos y obtuvimos que aumentó de nuevo el número de veces que se realizaron estos. Pero en este caso lo que nos interesaba estudiar era si aumentaba el alumnado que colaboraba con nosotros o bien eran siempre los mismos alumnos que venían en más ocasiones.
Los resultados que obtuvimos se reflejan en la Figura 1.
En la segunda fase aumentó notablemente la afluencia de gente
pero la frecuencia con la que acudıan sigue siendo como maximo
3 mientras que en la tercera fase, aunque disminuye el numero de
alumnos que nos visitan se ve un aumento notable de las frecuencias,
destacando el alumno 215 que nos visita en 12 ocasiones. Por los
resultados obtenidos concluimos que el alumnado de secundaria no
estaba muy dispuesto a colaborar altruıstamente.
Figura 1: Observando las tres gráficas podemos ver claramente cómo evolucionó el comportamiento de los/as alumnos/as de secundaria. En la primera fase, además de venir muy pocos, la frecuencia con la que acudían era muy baja.
Sondeo primaria
Con el alumnado de primaria realizamos los experimentos
durante dos recreos y el comportamiento de estos no se manifesto
tan interesado como en el caso anterior.
• Primera fase: Simplemente pidiéndoles que colaborasen nos ayudaron 31 personas.
• Segunda fase: A aquellos que colaborasen con nosotros les regalábamos gominolas y, aunque aumentó el número de colaboradores, que en esta ocasión ascendió a 46, descendió la participación del alumnado de segundo y cuarto curso.
Por ello consideramos que aunque el alumnado de primaria tambien
estaba influenciado por los premios no era tan notable como en el
caso de secundaria.
Plan experimental
Elaboramos un total de trece juegos, entre los que había urnas con bolas para extracciones con y sin remplazamiento, barajas de cartas para estudiar tanto las frecuencias relativas de distintos sucesos elementales como de la unión e intersección de estos, una urna con calcetines para estudiar la frecuencia de sacar en dos extracciones un par concreto o cualquier par, chinchetas y tabas como ejemplo de sucesos elementales no equiprobables, la aguja de Buffon para demostrar cómo el doble del inverso de la frecuencia de corte tiende hacia π, dados y urnas como ejemplo de experimento compuesto, las puertas de Monty Hall, cruzar el río (suma de las dos caras superiores de dos dados)...
Resultados
Exponemos a continuacion los resultados de algunos de los
experimentos realizados.
Urnas. Extracciones sin remplazamiento.
El experimento consistía en extraer una bola de una urna compuesta por 4 bolas amarillas, 12 azules, 5 rojas y 8 verdes, comprobar el color y devolverla a la urna. Se hizo el experimento en 1714 ocasiones y las aproximaciones que obtuvimos se reflejan en la gráfica de la izquierda.
Monty Hall
Construimos 3 puertas, detrás de 2 de las cuales escondimos los dibujos de sendas cabras y tras la otra un coche; una vez que el jugador escoge una de las puertas, la monitora del juego le muestra otra en la que se esconde una cabra y a continuación le ofrece la posibilidad de cambiar su elección.
Se realizo este experimento 1426 veces y obtuvimos:
f(ganar coche/cambiar puerta) = 0,57
lejos de los 2/3 buscados, pero en todo caso lo que conseguimos
probar es que
f(ganar/cambiar) > f(ganar/no cambiar).
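Como complemento ilustrativo ajeno al trabajo premiado, la frecuencia teórica de 2/3 al cambiar de puerta puede comprobarse con una pequeña simulación:

```python
# Boceto ilustrativo: simulacion del problema de Monty Hall (no forma parte del trabajo premiado).
import random

def gana_cambiando():
    puertas = [0, 1, 2]
    premio = random.choice(puertas)
    eleccion = random.choice(puertas)
    # La monitora abre una puerta sin premio y distinta de la elegida
    abierta = random.choice([p for p in puertas if p != premio and p != eleccion])
    # El jugador cambia a la unica puerta restante
    nueva = next(p for p in puertas if p not in (eleccion, abierta))
    return nueva == premio

N = 100_000
frecuencia = sum(gana_cambiando() for _ in range(N)) / N
print(f"Frecuencia de ganar cambiando: {frecuencia:.3f} (valor teorico 2/3 = 0.667)")
```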
Cruzar el rıo
Para este juego elaboramos un tablero con 12 casillas para cada
uno de los dos jugadores, se lanzan dos dados y se mueve la ficha
colocada en la casilla que indica la suma de las caras superiores a la
correspondiente casilla de su oponente. Teníamos dos versiones: en la primera, cada jugador disponía de 12 fichas y debería cruzar todas ellas al otro lado del río; después de algunas tiradas le hacíamos ver al jugador la imposibilidad del suceso 1 y continuábamos con la segunda versión, en la que cada jugador apostaba por un número y el primero que obtenía ese resultado con los dados cruzaba el río y ganaba la partida.
Se hicieron un total de 1380 tiradas y obtuvimos los resultados
de los graficos de la pagina siguiente.
1.2. Conclusion
Espero que los tres ejemplos anteriormente expuestos puedan transmitirles en qué consistió nuestro proyecto. Para esta que escribe, después de años trabajando la estadística con proyectos, este fue sin duda el más completo, gratificante y con el que más disfrutaron y aprendieron las alumnas.
2. Why doesn’t my mother like white chocolate?
A statistical research in 7th grade
2.1. Objetivos
El objetivo principal de este proyecto era confirmar o rechazar la
hipotesis de que nuestras preferencias de alimentos mas o menos
amargos, picantes, salados, dulces o acidos, evolucionan con la
edad. Para llevar a cabo este trabajo nos propusimos los siguientes
objetivos especıficos:
1. Analizar si existen diferencias significativas entre las
preferencias de los alimentos entre ninos y adultos.
2. Analizar posibles diferencias por sexos.
3. De ser cierto que haya diferencias significativas entre las
preferencias de los alimentos entre ninos y adultos, analizar los
grupos de edades en las que estas se producen.
2.2. Sondeo 1: sondeo entre los alumnos y adultos del
centro escolar
En el primer sondeo se eligieron tres alimentos representando a
cada gusto: amargo, picante, salado, dulce y acido. La gente deberıa
escoger un producto de los tres ofertados. La población de este primer estudio la constituyen los alumnos de infantil, primaria y secundaria, profesores y personal no docente del CPI San Vicente de A Baña. Las variables que se estudian son: Conguitos preferido, Pringles preferidas, Pipas preferidas, Limonada dulce preferida y Limonada ácida preferida. Entre los estudiantes se contó con 93 chicos y 92 chicas, y en la población adulta la proporción de mujeres fue mucho mayor, contando con 27 mujeres y 8 hombres. La muestra usada para el estudio fueron aquellos que voluntariamente se ofrecieron a participar, que quedó en 80 alumnos y 82 alumnas, y en el caso de adultos se redujo a 5 hombres y 18 mujeres.
Resultados del primer sondeo
Las diferencias entre el chocolate favorito elegido por niños y niñas pueden observarse en la Figura 1.
Del estudio del primer sondeo se obtienen las siguientes
conclusiones:
• Claramente, el chocolate blanco es el chocolate favorito de los niños (49 %) y, sin embargo, a los adultos apenas les gusta (16.3 %).
• Mientras el chocolate favorito de los adultos es el negro (63.3 %), en los niños la predilección por este es escasa (25.8 %).
• No se aprecian demasiadas diferencias en el chocolate con leche.
• Por sexos, vemos que a los niños les gusta más el amargo que a las niñas (la segunda opción favorita de los niños es el chocolate negro; la de las niñas, el chocolate con leche).
En el caso de las patatas se concluyó lo siguiente:
• Las patatas favoritas de los niños son las muy picantes (35.8 %); sin embargo, es la opción minoritaria entre los adultos.
• Los adultos prefieren las patatas con un grado intermedio de picante (40.9 %), que es la opción menos escogida entre los niños.
• Por sexos, vemos que a los niños les gusta más el picante que a las niñas (la opción “muy picante” es su opción mayoritaria, mientras que la de las niñas es “poco picante”).
Para las pipas de girasol:
• Las diferencias entre los adultos y los niños son enormes: los niños prefieren las pipas más saladas (51 %), y las que menos eligen son las poco saladas (23,6 %), justo al revés que los adultos, que eligen mayoritariamente las que tienen poca cantidad de sal (42,9 %) y minoritariamente las muy saladas (23,8 %).
• Por sexos, vemos que a los niños les gusta más la sal que a las niñas (aunque ambos escogen mayoritariamente la opción “muy saladas”, la segunda opción de las niñas es “poco saladas”).
En el caso de la limonada dulce, los resultados fueron:
• Tanto niños como adultos prefieren la limonada muy dulce. Los porcentajes además son muy parecidos (Niños: 53,5 %; Adultos: 59,1 %). Hay pocos niños a los que les guste la limonada con poco dulce (12,7 %), pero los adultos la escogieron un 22,7 % de las veces.
• A muy pocas niñas les gusta la limonada con poco azúcar (solo un 7 % de niñas la elige, frente a un 18 % de niños).
En el caso de la limonada ácida, los resultados fueron:
• Los adultos prefieren la limonada más ácida (50 %), la opción menos valorada por los niños (25 %).
• Los niños prefieren una limonada con una acidez media (40,6 %); en cambio, entre los adultos es la opción que menos les gustó (22,7 %).
• No se aprecian grandes diferencias por sexos.
2.3. Sondeo 2: Sondeo entre la poblacion del Municipio de
A Bana
El primer sondeo evidenció la diferencia de gustos entre niños y adultos, pero no mostraba el momento en que se producían esos cambios, por lo que se amplió la población del estudio a todo el municipio de A Baña. Los resultados pueden observarse en la Figura 2.
Se eligieron los mismos productos que en el Sondeo 1, pero suprimiendo las limonadas por la dificultad de transportarlas, conservarlas y fabricarlas. Como población se tomó la de más de 4 años que vive en A Baña que, según datos del IGE (Instituto Galego de Estatística), es de 3583 personas. Se estimó, utilizando la calculadora online de la facultad de medicina de la Universidad Nacional del Nordeste de Argentina, que necesitaríamos una muestra de 186 personas para obtener un nivel de confianza del 95 % y un margen de error del 7 %.
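Presumiblemente la calculadora emplea la fórmula clásica del tamaño muestral para una proporción con población finita (esto es una suposición nuestra, no consta en el trabajo); con ella se reproduce el valor citado:

```python
# Boceto ilustrativo: tamano muestral para estimar una proporcion con poblacion finita.
# Suposiciones: p = 0.5 (caso mas desfavorable), z = 1.96 (95 % de confianza), e = 0.07.
N, z, p, e = 3583, 1.96, 0.5, 0.07
n = (N * z**2 * p * (1 - p)) / (e**2 * (N - 1) + z**2 * p * (1 - p))
print(round(n))   # ~186, el valor citado en el trabajo
```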
Finalmente, el numero total de datos recabados fue de 395,
de los que 185 fueron hombres y 215 mujeres. Como herramienta
informatica se uso la hoja de calculo de Google Drive.
Resultados del segundo sondeo
Finalmente, para detectar donde hay un cambio de gusto, lo
que hicimos fue dividir las edades en tramos quinquenales y marcar
unicamente la opcion mayoritaria de ese tramo (si habıa empate,
se senalaba con una E). De este estudio se concluyo que, para el
Chocolate:
A partir de los 35 anos, el gusto por el chocolate blanco ya no es
mayoritario.
El chocolate con leche tiene mucha mas aceptacion entre las
mujeres.
A las mujeres les gusta menos el chocolate blanco que a los
hombres.
En el caso de las patatas, se concluyó que a partir de los 50 años se prefieren las patatas sin picante. A los hombres les gusta más el picante que a las mujeres.
Para el caso de las pipas, solo los menores de 25 años prefieren mayoritariamente las pipas muy saladas; por otra parte, a los hombres les gustan más las pipas saladas, mientras que a las mujeres les gustan más las que tienen poca sal.
3. Statistical analysis of video game results
Nuestro proyecto consiste en un estudio de calculo mental usando
dos videojuegos de fabricacion propia. Una de las partes fue la
creacion de un juego usando Scratch en el que unos animales salıan
y entraban de una casa y los alumnos tenıan que contar los que
quedaban dentro de la casa al final. Se fue realizando la prueba
por las distintas clases de los cursos en nuestro instituto para poder
analizar los datos segun sexo y edad.
La otra parte está basada en la recreación de una consola casera estilo años 80 con un juego de cálculo mental programado con GameMaker. Esta consta de varios niveles en los que la dificultad
va aumentando a medida que avanzas en el juego. En cada uno
de ellos hay una serie de operaciones matematicas basicas: sumas y
restas en los primeros niveles, multiplicaciones y divisiones en niveles
mas avanzados. El jugador debe decidir si la operacion es correcta o
no. La puntuacion final de cada jugador consiste en un sistema de
calificacion en base a los aciertos y los fallos totales y se guarda en
un fichero de datos junto con su edad y sexo.
En nuestro trabajo hemos analizado y estudiado estadısticamente
los resultados obtenidos.
Las hipótesis planteadas fueron varias:
• ¿Habría diferencia por edad o sexo en los resultados?
• ¿Según avanzamos en nivel académico los resultados serán mejores?
• ¿Los grupos bilingües o con mejores resultados académicos también tendrán mejores resultados en cálculo mental?
• Las generaciones que tuvieron menor exposición a las nuevas tecnologías, ¿tendrán mejores resultados en cálculo mental que las nuevas generaciones?
Todas esas preguntas tratarıamos de resolverlas con los dos
videojuegos disenados, si bien el principal objetivo del proyecto era
medir las habilidades y el nivel de calculo mental del alumnado de
nuestro instituto y de los visitantes de la feria de la ciencia que se
viene desarrollando estos ultimos anos en nuestro IES.
Los resultados han sido comparados por totales, por sexos, por
cursos y por grupos de edad.
Los materiales empleados para la construccion de la consola
fueron:
• La carcasa de un VHS del almacén del instituto y unos altavoces
• Un teclado antiguo
• Un ordenador obsoleto de la Escuela 2.0
• Cables del taller de tecnología
• Una fuente de alimentación de un ordenador sin uso del taller de IMA
• Tres botones ARCADE
• Una pantalla sin uso del departamento de matemáticas
• Una placa microcontroladora del taller de tecnología
• Espray y pinturas
Todos los materiales fueron reutilizados o se devolvieron a los
departamentos a excepcion de los botones ARCADE, que fueron el
unico gasto real del proyecto.
Las principales conclusiones de nuestro trabajo fueron las
siguientes:
Con respecto al test:
• La media de aciertos es en general ascendente según avanzamos por niveles dentro del instituto.
• Los promedios fueron superiores para los hombres tanto en aciertos totales como en aciertos consecutivos.
Con respecto a la consola:
• Tanto para el porcentaje de aciertos como para la media de puntos, los mejores resultados se obtuvieron entre los 30 y los 40 años.
• Los promedios de los hombres fueron superiores, aunque no los consideramos significativos, ya que realizamos más de 16000 simulaciones en el ordenador y apenas el 4 % superaba la diferencia de medias recogida en la consola.
En general:
• Los promedios en ambos juegos fueron superiores en los hombres que en las mujeres.
• El cálculo mental obtiene los mejores resultados con los de 30 a 40 años.
Con estas conclusiones pretendemos dar respuesta a las hipótesis planteadas inicialmente:
• Es posible que haya diferencia entre sexos y también por edades en nuestro IES, pero no consideramos tales diferencias significativas, especialmente por nivel.
• Los resultados son ligeramente superiores por nivel, pero muy levemente, a excepción de primero de bachillerato, lo que achacamos principalmente a la muestra seleccionada.
• Los grupos con mejores resultados académicos no mostraron unos resultados superiores al resto, luego no podemos establecer diferencias a la hora de realizar cálculo mental. Probablemente sus resultados académicos dependan de otros factores como pueden ser la motivación, el esfuerzo, el interés... Nos han demostrado que son igualmente hábiles a la hora de calcular mentalmente.
• Sí que se aprecia un mejor cálculo por aquellas generaciones que están en su madurez y dependieron en menor medida de las nuevas tecnologías, ya que los mejores resultados se obtuvieron entre los 30 y los 40 años.
4. Too much homework in Burlada?
4.1. Objetivos
El objetivo principal del trabajo es conocer si el alumnado de
Burlada tiene el problema de no tener suficiente tiempo libre debido
al tiempo dedicado a los deberes, estudiar o extraescolares y si una
huelga estarıa justificada.
Para dar respuesta a lo anterior se pretende:
• Saber cuánto tiempo libre tiene el alumnado de Burlada, diferenciando entre los que hacen extraescolares y los que no, y conocer la opinión del profesorado sobre este tema. Ver además si existen diferencias entre etapas educativas.
• Saber el tiempo medio que dedica el alumnado de Burlada a tareas y estudio, diferenciarlo por etapas y ver si coincide con la opinión del profesorado. Ver además si existen diferencias entre los que hacen extraescolares y los que no las hacen.
Para recoger los datos se elaboro una encuesta con un formulario
de Google Drive que se envio por correo electronico a los centros
educativos de la localidad. Se obtuvo una muestra de 153 personas.
4.2. Sondeo
Los datos se analizaron con la hoja de calculo de Google y se
sacaron las siguientes conclusiones:
1. Solo un 14.4 % del alumnado estudia mas de 2 horas al dıa
y solo un 4,9 % le dedica a los deberes mas de 2 horas.
2. El alumnado de Burlada dedica de media 1,2 horas al dıa
a hacer los deberes y 1,51 horas al dıa a estudiar. En total dedican
2,71 horas al dıa a hacer tareas y estudiar.
3. Si diferenciamos por etapas educativas, se aprecia que
cuando se avanza de etapa se emplea menos tiempo en la realizacion
de deberes y se incrementan las horas de estudio. El tiempo medio
para hacer tareas al dıa es de 1,4 horas en primaria, 1,07 horas
en secundaria y 1,06 horas en bachillerato. Mientras que el tiempo
medio de horas al dıa dedicadas a estudiar es de 1,47 horas en
primaria, 1,5 horas en secundaria y 1,79 horas en bachillerato.
4. El alumnado dice que dedican mas tiempo al estudio y
deberes del que el profesorado cree que hace, sobre todo en el estudio,
en el que hay mas de una hora de diferencia. Se observa ademas que
esto ocurre en todas las etapas.
5. El 73.17 % de alumnos/as hacen extraescolares. Si
diferenciamos por etapas, la mayor concentracion de alumnos/as que
hacen extraescolares pertenece a educacion primaria. En educacion
primaria casi el 100 % hacen extraescolares, de educacion secundaria
obligatoria mas del 50 % y de bachillerato casi el 75 %.
6. La media de horas dedicadas a la semana por el alumnado
de Burlada a la realizacion de extraescolares es de 3,42 horas.
7. El alumnado que hace extraescolares dedica a la semana
9,34 horas a estudiar y a hacer las tareas y 3,42 a hacer las
extraescolares, 12,76 horas a la semana en total. Mientras que el
que no hace extraescolares dedica a la semana 11,87 horas solo a
hacer tareas y estudiar. Tienen por tanto casi el mismo tiempo libre
a la semana.
8. Solo en la etapa de educación secundaria el alumnado que no hace extraescolares le dedica más tiempo a estudiar y a hacer las tareas que el alumnado que sí las hace. En el resto de las etapas es al revés.
9. La media de horas libres al dıa que tienen los encuestados
es de 2,9 h. Si diferenciamos por etapas, los que mas tiempo libre
tienen son los de educacion secundaria obligatoria con 3,38 horas de
tiempo libre al dıa.
10. El 64 % de los profesores creen que sus alumnos/as no
necesitan mas tiempo libre y el 67,74 % de los alumnos/as dice lo
mismo.
Figura 2: Comparación entre los tiempos de estudio y deberes por nivel educativo.
Las conclusiones anteriores se podrían sintetizar en las siguientes:
• El alumnado de Burlada dedica de media al día 1 h 15 min a la realización de tareas y 1 h 30 min al estudio, aproximadamente.
• A mayor etapa educativa se dedica más tiempo al estudio y menos a las tareas.
• El alumnado que realiza extraescolares dedica una media de 3 h 30 min por semana a las mismas.
• El alumnado que realiza extraescolares y el que no tiene el mismo tiempo libre a la semana, ya que los primeros dedican menos a sus estudios y tareas.
• Tanto profesores como alumnos opinan que hay suficiente tiempo libre.
CONCLUSION FINAL:
El alumnado de Burlada tiene suficiente tiempo libre incluso si
hace extraescolares.