BEIO: Boletín de Estadística e Investigación Operativa
Official Journal of the Sociedad de Estadística e Investigación Operativa
Volume 33, Number 3
November 2017
ISSN: 2387-1725
J. Vidal-Puga. Editorial. 183
M. Alcañiz, M. Santolino, Ll. Ramón. A comparative analysis of tree-based models classifying imbalanced breath alcohol data. 189
E. Köbis. Robust approaches to uncertain optimization. 224
L. Esteban. Applying the generic statistical business process model (GSBPM) to the business register: the Spanish experience. 258
R. Ibar-Alonso, C. Cosculluela-Martínez. Positive effects on the least motivated students of the highly motivated ones. 276
R. Cao. Ingenuas reflexiones de un estadístico en la era del Big Data. 295
M. Diéguez, R. Manín, P. Blanco, S. Vázquez. Premios incubadora de sondeos y experimentos. 322
BEIO (Boletín de Estadística e Investigación Operativa) is a journal that publishes, every four months, scientific dissemination articles on Statistics and Operations Research. The articles aim to address topics relevant to a large majority of professionals in Statistics and Operations Research, prioritizing the dissemination purpose without neglecting scientific rigour in the treatment of the subject matter. The journal comprises the following sections: Estadística, Investigación Operativa, Estadística Oficial, Historia y Enseñanza, and Opiniones sobre la Profesión.

BEIO was founded in 1985 as the Boletín Informativo of the SEIO (Sociedad de Estadística e Investigación Operativa). Over the years it has undergone a continuous evolution. The first scientific article appeared in 1994 and, since then, the number of scientific articles published has grown steadily; in 2008 the informative contents were split off from the Boletín, and the journal began to take shape as a dissemination journal of Statistics and Operations Research.

Articles published in BEIO are indexed in Scopus, MathSciNet, Biblioteca Digital Española de Matemáticas, Dialnet (Documat), Current Index to Statistics, The Electronic Library of Mathematics (ELibM), COMPLUDOC and Catálogo Cisne Complutense.

The journal is available online at www.seio.es/BEIO.
Editors
Salvador Naya Fernández, Universidade da Coruña
[email protected]
Mª Teresa Santos Martín, Universidad de Salamanca

Associate Editors
Estadística: Rosa M. Crujeiras Casais, Universidade de Santiago de Compostela ([email protected])
Investigación Operativa: César Gutiérrez Vaquero, Universidad de Valladolid ([email protected])
Estadística Oficial: Pedro Revilla Novella, Instituto Nacional de Estadística
Historia y Enseñanza: Mª Carmen Escribano Ródenas, Universidad CEU San Pablo de Madrid

Technical Editors
Antonio Elías Fernández, Universidad Carlos III de Madrid
María Jesús Gisbert Francés, Universidad Miguel Hernández de Elche ([email protected])
Guidelines for submitting articles

Articles should be sent by e-mail to the corresponding associate editor or to the journal editor. They should be written in the LaTeX article style. Each article must contain the title, abstract and keywords in English, without a Spanish translation. Templates in both Spanish and English, which authors must use to prepare their articles, can be downloaded from the journal's website.
Copyright © 2017 SEIO

No part of this journal may be reproduced, stored or transmitted in any form or by any means, electronic, mechanical or otherwise, without the prior permission of SEIO. Published articles represent the opinions of their authors, and the journal BEIO does not necessarily agree with the opinions expressed in them.

Submitting an article for publication in BEIO implies the transfer of its copyright to SEIO. The author(s) will therefore sign the acceptance of the copyright conditions once the article has been accepted for publication in the journal.
Published by SEIO
Facultad de CC. Matemáticas
Universidad Complutense de Madrid, Plaza de Ciencias 3, 28040 Madrid
ISSN: 2387-1725
BEIO: Official Journal of the Sociedad de Estadística e Investigación Operativa
Boletín de Estadística e Investigación Operativa, Vol. 33, No. 3, November 2017

Contents
Editorial (183)
Juan Vidal-Puga

Estadística (189)
A comparative analysis of tree-based models classifying imbalanced breath alcohol data
Manuela Alcañiz, Miguel Santolino and Lluís Ramón

Investigación Operativa (224)
Robust approaches to uncertain optimization
Elisabeth Köbis

Estadística Oficial (258)
Applying the generic statistical business process model (GSBPM) to the business register: the Spanish experience
Luis Esteban Barbado Miguel

Historia y Enseñanza (276)
Positive effects on the least motivated students of the highly motivated ones
Raquel Ibar-Alonso and Carolina Cosculluela-Martínez
Opiniones sobre la profesión (295)
Ingenuas reflexiones de un estadístico en la era del Big Data
Ricardo Cao Abad

Special Section (322)
Premios incubadora de sondeos y experimentos
Milagros Diéguez Taboada, Roberto Manín Gutiérrez, Paula Blanco Mosquera and Sabela Vázquez
Boletín de Estadística e Investigación Operativa, Vol. 33, No. 3, November 2017, pp. 183-188

Editorial

Juan Vidal-Puga
Economía, Sociedad y Territorio (ECOSOT) and Departamento de Estadística e Investigación Operativa, Universidade de Vigo
The Game Theory Working Group was founded during the XXIV Congress of the Sociedad de Estadística e Investigación Operativa (SEIO), held in Almería in October 1998. We are therefore less than a year away from celebrating our 20th anniversary. At its formation, the group was coordinated by Ignacio García Jurado, who had been a founding member and president (1992-1997) of the Sociedade Galega para a Promoción da Estatística e a Investigación de Operacións and who would later become President of the SEIO (2006-2012) and Editor in Chief of TOP (2001-2006), the Society's Operations Research journal. At every SEIO Congress, the group meets to take stock of the activities carried out, to renew its coordination and to plan the activities for the next period.

The purpose of the group is to promote communication and research among the SEIO members working in game theory and, by extension, among all Spanish game theorists. Accordingly, the group is open to anyone interested in game theory, whether as a SEIO member or as an external collaborator.
Game theory is a tool aimed at optimizing processes in which the criterion of what is optimal is not the same for all the agents involved. With this definition, which generalizes the notion of decision problems to the case of more than one decision maker, its place within mathematics in general, and operations research in particular, is particularly obvious. However, game theory was not born exclusively within mathematics, but as a union between economics, which provided the motivation, and mathematics, which provided the rigour.

More precisely, game theory is considered to have been born as a discipline with the publication of the book "Theory of Games and Economic Behavior" in 1944, by John von Neumann (a mathematician) and Oskar Morgenstern (an economist).

John von Neumann was a mathematical genius. He was, without doubt, one of the most brilliant mathematicians of the last century, and he had enough ambition to tackle a challenge of obvious complexity: how to use mathematics to model human behaviour.

At this point, I cannot resist making an aside to recall my first contact with game theory. Curiously, it was not at the Universidade de Santiago de Compostela, where I was studying for my degree and where Ignacio García Jurado taught the subject, but during my Erasmus year in the United Kingdom, at the University of Southampton. Half out of curiosity, I enrolled in a course on the topic. The course was basically about two-player zero-sum games. In these games, two agents compete in such a way that one agent's gain is the other's loss (hence the "zero sum"). I remember being struck by the solution of a generalization of the game "pares o nones" (odds and evens), in which two agents (say, A and B) must simultaneously announce a number, zero or one. If both announce the same number, B pays x to A.
If they announce different numbers, A pays y to B. This is a game without a winning strategy, since if one existed, the opponent would anticipate it and would always win, contradicting the fact that it was winning. However, there does exist an optimal strategy for each of the agents, which consists of being unpredictable: both A and B should choose 0 with probability 0.5 and 1 with probability 0.5, which leads to an expected payoff of (x − y)/2 for agent A (since the game is zero-sum, agent B receives an expected payoff of −(x − y)/2, which there is no need to mention). This single number summarizes the whole game.

A far from obvious result is that the existence and uniqueness of this number is guaranteed in any zero-sum game with two agents, regardless of the number of strategies they have. This was John von Neumann's first great contribution, known as the Minimax Theorem.

Continuing with my personal story, I should say that at that point I thought I understood why game theory belonged to the area of Statistics and Operations Research. It had to be because of the use of probability theory in the computation of optimal strategies. And there may be some truth in that, bearing in mind that the first scientific journal specifically devoted to game theory, the "International Journal of Game Theory", is catalogued, among other areas, under Statistics and Probability.

The Minimax Theorem had already been proved by von Neumann quite a few years earlier, in 1928, but its importance had gone almost unnoticed. It was Oskar Morgenstern's merit to guide his co-author towards putting this result in the economists' spotlight, a result that von Neumann himself refined and extended to games with imperfect information and with more than two players. The Minimax Theorem was for von Neumann, so to speak, what cogito ergo sum, "I think, therefore I am", was for René Descartes.
Once that first "truth" had been discovered, it should be possible to build the whole edifice of knowledge upon it. Von Neumann also considered that the existence of the Minimax Theorem for the simplest case suggested that a general result was possible in non-zero-sum situations, so common in real life, in which all agents can come out ahead if they cooperate.

We see, therefore, that Oskar Morgenstern served as a guide, steering von Neumann's genius in those directions that respond to relevant challenges within economics. Another example, which I cannot resist mentioning, is the relationship between preferences and utilities. Preferences are order relations that describe the agents' priorities. The use of preferences is very reasonable but, owing to their ordinal nature, the mathematical tools we can use with them are also very limited. Utilities, by contrast, assign a numerical value to the agents' preferences, and this makes it possible to apply the full power of mathematical analysis to their study. Today, the so-called von Neumann-Morgenstern utility functions are the basis for introducing the concept of risk into microeconomic models.

As might be expected, the contribution of game theory to economics continued beyond von Neumann and Morgenstern, although for a long time it was strongly focused on military applications in the United States. This was a relationship that, in the words of Guillermo Owen, brought more limitations than benefits to the development of the discipline. A highly recommendable work describing that period is "The Strategy of Conflict" by Thomas Schelling (Harvard University Press), whose third edition was published in 1990.

In recent years, game theory can be said to have lived through a golden age, recognized by the award of several Nobel prizes: John Nash, Reinhard Selten and John Harsanyi in 1994
for their analyses of equilibria in the theory of non-cooperative games; Robert Aumann and Thomas Schelling in 2005 for enhancing our understanding of conflict and cooperation through game-theoretic analysis; Leonid Hurwicz, Eric Maskin and Roger Myerson in 2007 for laying the foundations of mechanism design theory, which determines when markets are working efficiently; and Alvin Roth and Lloyd Shapley in 2012 for their work on assignment problems and market design. But the influence of game theory goes even further. In behavioural economics, which earned Richard Thaler this year's Nobel prize, one cannot fail to sense the essence of game theory in its approach.

Operations research also played its part in this journey. Prestigious economists such as the Nobel laureates Robert Aumann, Roger Myerson, Alvin Roth and Lloyd Shapley, and the Rey Juan Carlos I Prize winner Andreu Mas-Colell, among others, have published several of their game theory papers in Operations Research journals.

Even so, the potential of game theory is, in my view, much broader and, although this is already happening, it can be applied even more widely to many other disciplines, such as political science, public management or biology.

With the idea of showing the usefulness of game theory in different areas, the SEIO Game Theory Group organized this year, in Pontevedra, the course Modelos de Investigación Operativa en Teoría de Juegos, taught by Ignacio García Jurado and Joaquín Sánchez Soriano. This course was the third in a series whose objectives are to help strengthen the ties between doctoral students writing their theses in Spain on topics close to game theory, and to encourage the exchange of ideas between doctoral students and researchers
that may give rise to new research perspectives.

In these editions, the target audience is students and researchers for whom game theory may be useful in their lines of research, at both the theoretical and the practical level. For the reasons given above, I consider that this includes a wide range of areas, such as economics, business administration, operations research, mathematics, engineering, logistics, political science and public management, among others. In fact, in addition to the SEIO, the course was funded by an economics group (ECOSOT, Economía, Sociedad y Territorio), a statistics and operations research group (SiDOR), an economics and business association (ECOBAS), and a doctoral programme focused on creativity, innovation and sustainability (CREA S2i).

I hope that the spirit of these courses will be maintained in future editions and that it may even be extended to a more international setting. At a much more ambitious level, perhaps we can be the Morgenstern who helps a future von Neumann, from whatever field, to reach new heights in the knowledge of Humanity.
Boletín de Estadística e Investigación Operativa, Vol. 33, No. 3, November 2017, pp. 189-222

Estadística
A comparative analysis of tree-based models
classifying imbalanced breath alcohol data
Manuela Alcañiz and Miguel Santolino
Department of Econometrics
University of Barcelona
[email protected], [email protected]
Lluís Ramón
Data Scientist
Digital Origin
Abstract
When applied to binary data, most classification
algorithms behave well provided the dataset is balanced.
However, when one single class includes the majority of cases,
a good predictive performance for the minority class is not
easy to achieve. We examine the strengths and weaknesses
of three tree-based models when dealing with imbalanced
data. We also explore sampling and cost sensitive methods
as strategies for improving machine learning algorithms. An
application to a large dataset of breath alcohol content tests
performed in Catalonia (Spain) to detect drunk drivers is
shown. The Random Forest method proved to be the model of
choice if a high performance is required, while down-sampling
strategies resulted in a significant reduction in computing
time. When predicting alcohol impairment, the area of
control (built-up or not), hour of day and driver’s age were
the most relevant variables for classification.
Keywords: Imbalanced data, positive, drunk driving, police,
checkpoint, machine learning.
1. Introduction
Tree-based models have attracted the increasing attention of
researchers in recent years; however, analyses of the use of such
models when there is a highly unequal distribution between classes
are scarce. This is particularly true of binary data where one class
includes the majority of cases and the other represents just a small
portion. Imbalanced datasets of this kind are very common in such
disciplines as medical diagnosis, on-line advertising, fraud detection,
network intrusion, road safety, etc.
Many classification algorithms behave well for balanced datasets;
yet, when applied to imbalanced data, model fitting may be biased
towards the majority class. As a result, the model may provide a
poor predictive performance for the minority class, which is usually
the most interesting one. Kumar and Sheshadri [20], He and
Garcia [16] and Chawla [9] review problems of class imbalance and
alternative solutions. Here, the performance of two strategies for dealing with imbalanced data (sampling methods and cost-sensitive methods) is compared, and the interpretability of their respective results is discussed.
Specifically, we illustrate the performance and features of
tree-based models by applying them to the classification of
alcohol-impaired drivers in Catalonia (Spain). When testing
for breath alcohol content (BrAC) over the legal limits, highly
imbalanced results are obtained –clearly, most drivers are not
alcohol-impaired and so BrAC tests are largely negative.
The identification and deterrence of potential alcohol-impaired
drivers is a priority for traffic authorities the world over ([24])
and while a downward trend in drunk driving has been observed
in many countries, there is still room for improvement ([32], [24],
[34]). For example, in 2014, 24.8% of deaths among drivers
in Catalonia were related to alcohol. In order to tackle drunk
driving effectively, appropriate policies need to be adopted. In this
paper three tree-based models are studied and their application to
the classification of drivers with a BrAC over the legal limit on
Catalan roads is explored. Specifically, we examine the use of the
Classification and Regression Tree, Tree Bagging and the Random
Forest models to classify positive BrAC tests.
Several studies have been conducted in Catalonia with regard
to drinking habits and driving. Alcañiz et al. [1] estimated the prevalence of alcohol-impaired driving in Catalonia in 2012. They found that it was 1.29% for the general population of drivers, 1.90% on Saturdays and 4.29% on Sundays. Chuliá, Guillén and Llatje [10] studied seasonal and time-trend variation by gender of alcohol-impaired drivers at preventive sobriety checkpoints. Alcañiz, Santolino and Ramón ([2], [3]) studied age-drinking patterns and drinking behavior in Catalonia and analyzed different strategies at sobriety checkpoints. They suggested that non-random breath tests were effective primarily for detecting binge drinking, whereas random breath tests were better suited to detecting other drinking-and-driving profiles in the population.
To our knowledge, classification models to identify drunk drivers
have not been previously applied to Catalan road data.
The rest of this paper is structured as follows. Following on
from this introduction, in Section 2, three tree-based models are
introduced along with their properties and variants, and various
approaches to tackling the class imbalance problem are described.
Section 3 is devoted to explaining the dataset obtained from police
preventive checkpoints. The results obtained after fitting the
tree-based models to the data and several variants are reported in
Section 4. Concluding remarks and discussion are outlined in Section
5.
2. Methods
In this section three tree-based models are introduced and their
properties discussed. Specifically, we analyze the Classification and
Regression Tree, the Tree Bagging and the Random Forest models (the CART and Random Forest trademarks are licensed exclusively to Salford Systems).
A number of extensions employing other types of response data and
alternative implementations are also detailed. Finally, we investigate
how to deal with the class imbalance problem.
2.1. Classification and Regression Trees
Classification and Regression Trees (CART) were first introduced
by Breiman et al. [8]. The CART model partitions the predictor
space in a recursive way so as to create groups in the response
variable that are as homogeneous as possible. The CART algorithm
begins by splitting the dataset into two disjoint subsets (known as
nodes or leaves). For each predictor, splits are computed for all
possible cut-off values and the one that maximizes the homogeneity
(and minimizes the impurity) of the resulting disjoint subsets is
chosen. This process is recursively repeated for each node.
An impurity measure, quite commonly the Gini index, is used
to choose the best split, with the split impurity being calculated by
aggregating the impurity of the subnodes. For a two-class problem,
the Gini index for a given node is defined as p1(1− p1) + p2(1− p2),
where p1 and p2 are the class 1 and class 2 probabilities, respectively
[19]. Alternative measures to the Gini index exist. For instance,
the information gain measure can be used, although differences
are frequently not significant [27]. To avoid the overfitting of the
CART model, the subtree is selected based on a cost complexity
tuning, where a complexity parameter cp penalizes the size of
the tree. In fact, the subtree that minimizes Impurity(subtree) + cp × (number of terminal nodes) is selected. The cp value, the
hyperparameter, is normally selected using cross-validation (CV).
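To make the cost-complexity step concrete, the following is a minimal sketch in R (not the authors' code): a deep rpart tree is grown and then pruned at the cp value with the lowest cross-validated error. The data frame brac, containing the binary factor positive and the predictors, is a hypothetical stand-in for the dataset described later in Section 3.

# Minimal sketch with a hypothetical data frame `brac`; not the authors' code.
library(rpart)

set.seed(1)
deep_tree <- rpart(positive ~ ., data = brac, method = "class",
                   control = rpart.control(cp = 1e-6, xval = 10))  # grow a deep tree, 10-fold CV

printcp(deep_tree)  # table of candidate cp values with their cross-validated error (xerror)
best_cp <- deep_tree$cptable[which.min(deep_tree$cptable[, "xerror"]), "CP"]

pruned_tree <- prune(deep_tree, cp = best_cp)  # cost-complexity pruning at the selected cp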
CART models have the advantage of being easy to interpret
and rapid to compute, of allowing missing values to be dealt with
and of facilitating feature selection. An important characteristic of
these models is that variable importance can be assessed. This is
achieved by retaining the reduction in the Gini index at each split
and aggregating these values for every predictor. Predictors that
either appear at the beginning of the tree or which are used in several
splits are more important. Note that variable importance can be
biased when there are many missing values or there are categorical
variables with many levels ([30], [21]). The main disadvantages of
CART models concern the instability of their results.
In practice, a large number of alternative implementations of
tree models exist. Different approaches have been proposed for their
use with survival data [5], multivariate regression [11], clustering
[29] and unbiased models ([17], [21]). Hyafil and Rivest [18] show
that constructing optimal binary decision trees is an infeasible
task. Grubinger, Zeileis, and Pfeiffer [14] propose evolutionary
algorithms to improve accuracy, while Loh [22] compares a set of
alternative implementations in terms of their capabilities, strengths,
and weaknesses.
2.2. Tree Bagging
Bagging, or Bootstrap aggregating, also introduced by Breiman
[6], involves generating several predictions and combining them to
obtain an aggregated predictor. Here, predictions are generated
by applying a model to different bootstrap replicas of the dataset.
These replicas are made by replacement and are as large as the
dataset itself. The aggregate is the majority vote of all models.
Each tree used in the tree bagging is computed as described in Section 2.1
above. The only difference is that there is no pruning step. The
aggregating step neutralizes the overfitting error of the trees.
The number of trees to be used is defined by the user and, in
practice, a small number of replicas usually proves sufficient [19].
Although the error decreases with the number of trees, the trees are
highly correlated, so the margin of improvement associated with each
additional tree decreases with the number of replicas. Compared
with CART models, the advantage of tree-bagging models is their
stability, which reduces the risk of overfitting. On the other hand,
these models are computationally more intensive than CART models
and their interpretation more complex.
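As an illustration, here is a hedged sketch using ipred::bagging (one of the packages listed in Section 4); the training and test data frames brac and brac_test are hypothetical names.

library(ipred)

set.seed(1)
bag_fit <- bagging(positive ~ ., data = brac, nbagg = 50, coob = TRUE)  # 50 bootstrap trees; out-of-bag estimate

bag_fit$err                                                    # out-of-bag misclassification error
p_bag <- predict(bag_fit, newdata = brac_test, type = "prob")  # class probabilities aggregated over the trees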
2.3. Random Forest
In common with the two models outlined above, the Random
Forest (RF) model was proposed by Breiman [7]. RF involves
generating bootstrap replicas of the original dataset and creating
trees for each replica as in Bagging. However, RF seeks to create
uncorrelated trees to improve predictions. To create trees that are
as different as possible, at each split the trees can only use a limited
number of random variables. Hence, the trees tend to be very
different and provide different information when aggregated.
As in Tree Bagging, the number of trees to compute has first to
be specified. The number of variables that might be split at each
node (referred to as mtry) must also be defined. A common selection
is the square root of the number of variables [19]. In common with
the previous models, the minimum number of nodes can also be
determined. The higher this number is, the smaller and faster the
trees will be. As with the Tree Bagging models, the advantages of RF models are that performance is enhanced and the overfitting
risk reduced. Furthermore, RF models are robust to outliers.
Their disadvantages include the complexity of interpretation and
the lengthy computation time.
Indeed, the computation time of the original RF can be
prohibitive in the case of a large mtry and/or a high number of trees.
Therefore, less time-consuming alternatives are useful. Here, we use ranger, an efficient RF implementation that reduced the computing time by a factor of 12 compared with the original RF. An
additional feature of ranger is that it uses a variant for probability
estimation. Each tree provides the proportion of positives as opposed
to its classification. The probability is obtained by averaging this
proportion for all the trees. In doing so, the model performance is
generally improved [23].
Sometimes categorical variables can be interpreted as ordered
categorical variables (for instance, colors ordered according to their
intensity or type of roads based on their traffic capacity). This
strategy can significantly reduce the computation time of RF. To
split a categorical variable of n categories, the algorithm checks all
2(n−1) − 1 possible combinations. However, since the categories are
sorted in the case of ordered categorical variables, the impurity is
calculated between each category, and the threshold that gives the
best split is chosen. This is much quicker to compute as only one
variable has to be checked.
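The sketch below shows, under assumed object names (data frame brac, test set brac_test), how such a probability forest can be fitted with ranger; the respect.unordered.factors = "order" option implements the ordered-category device just described, and the default mtry (the square root of the number of variables) is kept.

library(ranger)

set.seed(1)
rf_fit <- ranger(positive ~ ., data = brac,
                 num.trees = 500,
                 probability = TRUE,                   # average per-tree class proportions
                 respect.unordered.factors = "order",  # treat factors as ordered, avoiding 2^(n-1) - 1 splits
                 importance = "permutation")

p_rf <- predict(rf_fit, data = brac_test)$predictions[, "yes"]  # estimated probability of a positive test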
RF models can assess variable importance in three ways. The
simplest way is to count the number of times that a variable is
selected in all the trees. The second way involves computing the
aggregate reduction in impurity obtained at each split in all the trees.
Finally, a third way is to measure the permutation importance. For
each tree, the prediction performance on the out-of-bag (OOB) samples (the observations not included in the corresponding bootstrap sample) is recorded. This performance is again computed but here using the
values of one randomly permuted variable. The drop in performance
resulting from this permutation is averaged over all the trees. This
is carried out for each variable and provides a measure of variable
importance in the RF [15]. When variables are highly correlated or if categorical and continuous variables are combined, the variable importance indicator needs to be considered with caution [31].
RF models have been extensively applied. For instance,
generalizations of RF models have been proposed to provide
conditional quantiles and confidence intervals ([25], [33]). Segal [28]
demonstrates that RF can overfit datasets with large numbers of
noisy inputs. To deal with this, alternative extended RFs have been
proposed ([35], [4]).
2.4. Class Imbalance
It is relatively common to find imbalanced datasets, where
the majority of cases present negative outcomes. For example,
only a small percentage of observations show positive outcomes
in datasets of BrAC tests. Many classification algorithms have
been designed specifically for balanced datasets and so a poor
predictive performance may be obtained when applied to imbalanced
data. Two strategies for dealing with unbalanced data are sampling
methods and cost sensitive methods.
Sampling methods involve modifying the original dataset to
obtain a balanced dataset and they can be divided into the following
categories: down-sampling, i.e., excluding some instances of the
majority class by random sampling; up-sampling, i.e., incorporating
more instances of the minority class by random sampling with
replacement; and, hybrid methods, i.e., combining both up- and
down-sampling methods. Note that sampling methods apply only to
training data and not to testing data. Cost-sensitive methods involve
applying different costs of misclassification to each class in the model
fitting process. By specifying a higher cost to the misclassification
of a minority instance than that to a majority instance, the machine
learning algorithm makes fewer errors with the minority class, as
it is more expensive. This would counteract the bias towards the
majority class.
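A hedged sketch of both strategies in R follows: caret::downSample balances the training data, and an rpart loss matrix implements a cost-sensitive tree. The object brac_train and the cost of 20 are illustrative, and the loss matrix assumes the factor levels are ordered c("no", "yes").

library(caret)
library(rpart)

# Down-sampling: keep all positives and a random sample of negatives of the same size
set.seed(1)
down_train <- downSample(x = brac_train[, setdiff(names(brac_train), "positive")],
                         y = brac_train$positive, yname = "positive")

# Cost-sensitive CART: misclassifying a positive ("yes") as negative is 20 times more costly
loss <- matrix(c(0,  1,   # true "no":  0 if predicted "no", 1 if predicted "yes"
                 20, 0),  # true "yes": 20 if predicted "no", 0 if predicted "yes"
               nrow = 2, byrow = TRUE)
cost_tree <- rpart(positive ~ ., data = brac_train, method = "class",
                   parms = list(loss = loss))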
An additional problem presented by class imbalance is how best
to assess classifiers. The usual classification metric is accuracy, computed, for instance, from the confusion matrix. However, in the case of imbalanced data, this measure may be inadequate. In addition, other techniques for comparing tree-based models, such as leave-one-out cross-validation, can be computationally very expensive for large datasets. To overcome these limitations, receiver operating
characteristic (ROC) curves are used. The ROC curve presents a
binary classifier performance when its threshold varies. It is formed
by plotting the true positive rate (TPR) against the false positive
rate (FPR) at various threshold settings. Any point on the diagonal
of the ROC curve is a random guess classifier, while any points below
the diagonal are worse than a random guess. A complete description
of ROC analysis can be found in Fawcett [12].
To compare the performance of different classifiers directly, we
use the area under the ROC Curve (AUC). This indicator aggregates
all the information provided by the ROC curve in a single scalar
expression. A higher AUC indicates a better average performance. Note, however, that a classifier with a higher AUC may still perform worse than another classifier in a specific region of the ROC curve. An interesting property is
that the AUC of a classifier is equivalent to the probability that the
classifier will rank a randomly chosen positive instance higher than
a randomly chosen negative instance [12].
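For instance, with the pROC package (also listed among the packages in Section 4) the ROC curve and its AUC can be obtained from a vector of predicted probabilities; the objects p_hat and brac_test in this sketch are hypothetical.

library(pROC)

# p_hat: predicted probability of a positive; brac_test$positive: observed class (no/yes)
roc_obj <- roc(response = brac_test$positive, predictor = p_hat,
               levels = c("no", "yes"), direction = "<")
auc(roc_obj)    # area under the ROC curve
plot(roc_obj)   # the diagonal corresponds to a random-guess classifier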
3. Data
3.1. Drunk driving legislation
Statutory blood-alcohol limits for driving differ across the
countries of Europe. Spanish legislation differentiates between
administrative and criminal positives, according to the level of
alcohol concentration in the breath (or blood). Drivers with BrAC
levels between 0.25 and 0.60 mg/l (0.15 and 0.60 mg/l for novice
and professional drivers) face administrative penalties if detected.
When the BrAC level is over 0.60 mg/l, drivers are deemed to have
committed a criminal offence and, therefore, face more stringent
legal sanctions, including temporary suspension of the driving license
and imprisonment.
The police are allowed to perform a BrAC test on any driver, even
if the driver does not show any symptoms of alcohol impairment.
The standard procedure is to conduct a BrAC test using a portable
breathalyzer while the driver is seated in their car. If negative, the
driver is allowed to continue on their journey; if positive, given that
the breathalyzer has no legal validity, an evidential breath test is
performed in the officer’s vehicle.
3.2. Variables
The database comprises 439,699 preventive BrAC tests carried
out at checkpoints by traffic authorities in 2014 in Catalonia. These
tests represent almost 95% of the total number of BrAC tests,
while the remaining 5% includes tests conducted on drivers showing
visible signs of alcohol intoxication or after committing a traffic
violation or on drivers involved in a traffic accident. Preventive
BrAC tests performed on cyclists or pedestrians were removed from
the database. Observations with missing information were also
removed. The final database comprises 408,936 BrAC tests.
Information recorded by traffic officers, including the location of
the checkpoint, specific hour of day, driver characteristics and vehicle
type, is available. Information about location differentiates between
interurban and urban areas and records the region and subregion
in which the checkpoint was set up. The territory of Catalonia is
divided into four administrative units and is recorded here as the
variable region. However, there is a more detailed administrative
division composed of 41 subregions. The traffic police in Catalonia
include both the regional police (Mossos d’Esquadra) and the local
police. There is a traffic police administrative division, known as
ART, which comprises eight levels and corresponds to the scale
between that of the regions and subregions.
The variable roadType records the type of road on which the
BrAC test was performed (Highway1 denotes toll highways and Highway2 toll-free highways). Information about the hour, day, week
and month when the test was performed is also available. As
drinking habits are closely associated with leisure, factors identifying
bank holidays (holiday), the eve of such holidays (holidayEve) and
long weekends (longWeekend) were created. Finally, driver and
vehicle characteristics were also recorded.
The description of variables is as follows.
- positive (dependent variable): BrAC level above the legal limit (yes/no).
- builtUp: Interurban area or Urban area.
- region: Barcelona, Girona, Lleida and Tarragona.
- subregion: name of the subregion, 41 categories.
- policeType: Regional police or Local police.
- ART: police territorial division, eight categories.
- roadType: Highway1, Highway2, Conventional road, Rural road and Urban road.
- hour: specific hour of day (number 1-24) when the BrAC test was performed.
- day: day when the BrAC test was performed.
- month: month when the BrAC test was performed.
- week: week when the BrAC test was performed, as a number (1-52).
- weekday: day of the week when the BrAC test was performed, as a number (1-7, Sunday being 7).
- dayType: Mon-Thu, Fri, Sat and Sun.
- workingDay: 1 if it was a working day, 0 otherwise.
- timePeriod: morning (6:00 to 13:59), afternoon (14:00 to 21:59) or night (22:00 to 5:59).
- holiday: bank holiday (yes/no).
- holidayEve: eve of a bank holiday (yes/no).
- longWeekend: long weekend (yes/no).
- sex: driver's sex.
- age: driver's age.
- licenseYear: year in which the driver obtained the license.
- spanish: driver Spanish or foreign.
- vehType: type of vehicle (Car, Van, Motorcycle, Moped, Light truck, Heavy truck, Bus and Other).
Tree-based model algorithms implement an implicit variable selection, so the strategy adopted was to include all the variables in the models. Table 1 presents the number of tests, the number of positives and the percentage of positives for the main variables and their levels. Additional tables for variables comprising many levels (ART, Table A.1; month, Table A.2; and hour of day, Table A.3) are included in the appendix.
3.3. BrAC outcomes above legal limit
The positive response variable is highly skewed. Of the 408,936 BrAC tests carried out, only 16,494 (approximately 4%) were
positive. Figure 1 shows the percentage of BrAC tests above the legal
limit by subregion. The map shows a non-homogeneous percentage
of positives throughout the territory, with values being particularly
high in the north-east and along the coast.
Figure 2 shows the percentage of BrAC tests above the legal limit
according to a specific set of variables. In winter there are fewer
positives, while from June to September there is a greater number.
Urban areas are associated with a higher prevalence of positives than
are interurban areas. During the week there is a 2% positive rate,
while on weekends it is between 5 and 7%. Positive rates on Fridays
(3.5%) are halfway between weekday and weekend prevalences. A
similar percentage of positives is observed for both men and women;
however, non-Spanish men record a slightly higher positive rate,
while non-Spanish women present the lowest rate. Driver age is also
informative. The prevalence of alcohol peaks at age 20 with more
than 7% of positives and falls after that age. The final plot analyzes
the relationship between the prevalence of alcohol and the hour of the day and the driver's age.
Variable      Level               # tests   # positives   % positive
builtUp       Interurban area     267,117   10,149        3.8
              Urban area          141,819   6,345         4.5
region        Barcelona           225,019   9,944         4.4
              Girona              50,145    2,610         5.2
              Lleida              61,868    1,020         1.6
              Tarragona           71,904    2,920         4.1
policeType    Regional police     266,029   10,155        3.8
              Local police        142,907   6,339         4.4
roadType      Highway1            30,149    1,213         4.0
              Highway2            45,735    2,247         4.9
              Conventional road   190,744   6,674         3.5
              Rural road          489       15            3.1
              Urban road          141,819   6,345         4.5
dayType       Mon-Thu             180,635   4,007         2.2
              Fri                 58,093    2,089         3.6
              Sat                 85,250    4,637         5.4
              Sun                 84,958    5,761         6.8
workingDay    Working day         206,126   5,277         2.6
              Non-working day     202,810   11,217        5.5
timePeriod    Morning             101,590   3,576         3.5
              Afternoon           86,982    985           1.1
              Night               220,364   11,933        5.4
sex           Man                 332,411   13,430        4.0
              Woman               76,525    3,064         4.0
age3l         [15,30]             133,713   7,732         5.8
              (30,45]             171,145   6,023         3.5
              (45,100]            104,078   2,739         2.6
licenseYear   [1932,1994)         138,129   3,964         2.9
              [1994,2004)         115,267   4,154         3.6
              [2004,2012)         131,088   7,043         5.4
              [2012,2015)         24,452    1,333         5.5
spanish       Spanish             350,444   14,035        4.0
              Non-Spanish         58,492    2,459         4.2
vehType       Car                 316,530   14,332        4.5
              Van                 25,229    436           1.7
              Motorcycle          29,717    1,264         4.3
              Moped               8,876     334           3.8
              Light truck         6,117     25            0.4
              Heavy truck         19,361    78            0.4
              Bus                 2,490     12            0.5
              Other               616       13            2.1

Table 1: Number of tests, positives and percentage of positives for the main variables.
Figure 1: Percentage of positives by subregion.
This highlights a black spot in the
early morning for drivers in the young age group when 15% of BrAC
positives are recorded. All age groups present a high positive rate
between 9pm and 3am. In the afternoon, this percentage increases
with age. Finally, a black spot occurs at 13h in the 55 to 65 age
group.

Figure 2: Percentage of positives by hour of day and age group.

4. Results

To assess the performance of the tree-based models, the data were randomly split into training and test sets. The division was made preserving the distribution of positives-negatives and of the
other variables. The training set contained 70% of the data and was
used to fit the models; the test set contained the remaining 30%
of the data and was used to validate the models. All categorical
variables were included in the models as binary variables; that
is, each category was converted into a dichotomous variable. The
performance of all the models was based on the AUC from the test
set. All models were performed with R version 3.2.3 [26]. Packages
used were caret, randomForest, ranger, pROC, e1071, rpart, ipred,
plyr and dplyr.
When a hyperparameter had to be adjusted, a ten-fold
cross-validation (10-CV) was used; that is, the training dataset was
randomly split into ten partitions. The model/hyperparameter was
trained with nine of the ten original partitions. The remaining
partition was used to obtain the validation performance of the model.
This step was repeated ten times and a different partition was used
each time for validation. The model/hyperparameter performance
was thus obtained as an average of all the validations. The metric
for hyperparameter tuning was the AUC value. The hyperparameter with the highest AUC was selected (alternatives exist for selecting the tuning parameter, such as the one-standard-error rule or a tolerance rule, which choose the simplest model within one standard error or within a defined tolerance of the best model, respectively [16]). Once the hyperparameter was
adjusted, the model was fitted to the whole dataset.
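A sketch of this workflow with caret is given below (assumed object names, not the authors' code): a stratified 70/30 split, followed by ten-fold cross-validation over a grid of 50 cp values with the AUC (reported as "ROC" by caret) as the tuning metric.

library(caret)

set.seed(1)
idx        <- createDataPartition(brac$positive, p = 0.7, list = FALSE)  # stratified 70/30 split
brac_train <- brac[idx, ]
brac_test  <- brac[-idx, ]

ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)  # AUC per fold

cart_cv <- train(positive ~ ., data = brac_train, method = "rpart",
                 metric = "ROC", tuneLength = 50, trControl = ctrl)
cart_cv$bestTune   # cp value with the highest cross-validated AUC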
4.1. Classification and Regression Tree model
Tree models contain one hyperparameter, the complexity parameter (cp). A grid of 50 cp values was used. The best cross-validated cp value was 6.9897·10^-6, with an AUC of 0.7472. The left panel of Figure 3 shows that the AUC value increases as cp decreases.
Note that the adjusted cross-validated cp value was very small.
Figure 3: CART models. The model with the best AUC is shown in red. Left panel: CV AUC as a function of cp. Right panel: tree depth as a function of cp.
The fitted trees need to be very deep in order to appreciate
differences between the two classes. The right panel of Figure 3 shows the tree depth as a function of cp. Note that the highest AUC was obtained with trees of 30 levels. The interpretation of deep trees is more complex. Using the adjusted cp value, a final model was fitted with all the training data. A membership probability was
obtained from the test set. The test AUC value was 0.7498.
These previous models do not take into account the fact that the
data are imbalanced. Therefore, two approaches for dealing with
imbalanced data were applied. First, down-sampling was performed
and so the training data were reduced to a down-sampled training
dataset. This contained the same number of observations from each
class. Our results improved in comparison to our previous outcomes.
The best cross-validated cp value was 4.9310·10^-4, with an AUC of
0.7499. Note that this cp value is 50 times higher than the previous
cp. The fitted tree has a depth of 17 levels and the AUC associated
with the test set was 0.7577. Thus, using a subset of the dataset
resulted in a better performance.
Second, up-sampling was performed. To achieve a balanced
dataset, items from the minority class were added until the dataset
contained the same number of positives as negatives. This led to severe overfitting in cross-validation: to obtain a balanced dataset, many instances from the minority class had to be copied, so the fitted tree contained the same observations in its leaves as in the validation set. This resulted in a nearly perfect cross-validated performance, but
when tested with new data, a very poor performance was obtained.
Although the cross-validated AUC value was almost 1, when the
model was validated with the test data, its AUC was less than 0.5,
i.e., a random guess.
Finally, a cost-sensitive method was applied. The cost value first had to be selected. We used cost values that balanced
the difference between classes. The dataset contains one positive for
every 20 negatives; thus, the tree model performance was analyzed
by applying a cost of 10, 20 and 30 for misclassification. Table 2
shows the cp value, the cross-validated AUC, the test AUC and the
depth for each cost value.
Cost   Best cp    CV AUC   Test AUC   Tree depth
10     0.000277   0.7483   0.7570     21
20     0.000311   0.7560   0.7663     17
30     0.000242   0.7545   0.7630     28

Table 2: Model results by the misclassification cost used.
The best model performance was obtained when a
misclassification cost of 20 was applied. Compared to the
base tree, the cp values were much higher and the trees were less
complex. Yet, they were still too deep to be visually interpretable.
If an interpretable tree is desired in our context, a larger cp value needs to be chosen as a trade-off between interpretability and
predictive performance.
4.2. Tree Bagging model
Bagging consists of generating several bootstrap replicas from
the original dataset and modeling the deepest possible tree for each
replica. Whereas bagging has no hyperparameters to tune, the
number of bootstrap replicas does have to be defined. In our case,
the number of bagging trees was 50 and the test AUC was 0.7267.
Figure 4 (a) shows that increasing the number of replicas did not
improve the test AUC. Note that after 40 replicas, the performance
of the model increases very slowly. When a sufficiently high number
of trees had been used, adding another tree did not provide any
additional information, since it was highly correlated with some
other previous tree.
Class imbalance strongly affected bagging performance. To
predict a new observation, class predictions were obtained for each
tree and the predicted probability was obtained from the frequency
of all individual tree predictions. This can be explained by the
fact that each tree in the bagging provides a classification, not
a probability. For example, a leaf with five negatives and four
positives would be classified as negative, just as would a leaf with all
negatives. As in the case of the tree model, a sampling approach was
adopted. Here, only the down-sampling method was used. Bagging
was applied with 50 trees and a test AUC of 0.7675 was obtained.
Finally, a cost sensitive approach was performed. A cost of 20 was
applied to the bagging building step and a test AUC value of 0.7737
was obtained. Note that using different costs affects how the splits
are chosen in the tree building step. As bagging builds trees that are
as deep as possible, the final leaves tend to be more homogeneous so
as to avoid misclassification costs. This limitation does not occur in
the base bagging model. Figure 4 (b) shows the ROC curve of the
base Tree Bagging model and the down-sampling and cost sensitive
Tree Bagging models.
To conclude, we should stress that the Bagging Models were
computationally much more intensive than the Classification and
Regression Tree models. Indeed, in some cases the model fitting
took more than twelve hours.
4.3. Random Forest
The efficient Random Forest implementation ranger was used
and categorical variables were considered as ordered categorical
variables. Compared to the RF model that does not modify
categorical variables, the AUC values were not statistically
significantly different (the CV AUC of the RF with the original categorical variables was 0.7886, s.d. = 0.0065, versus 0.7820, s.d. = 0.0064, with the converted categorical variables); however, the computation time was halved.
Figure 4: ROC curves and number of bootstrap replicas. Left panel: test AUC by the number of bootstrap replicas. Right panel: ROC curves of the base, cost-sensitive and down-sampling bagging models.
Intuitively, it seems that performance is not markedly affected when ordered categorical variables are used. This might be because
some categorical variables are directly considered as ordered
(dayType, timePeriod) or, at least, are categorized with a certain
order. For instance, the variable roadType has a certain order,
beginning with road types that have higher speed limits and
terminating with those with a lower speed limit.
With ten-fold CV, a large number of different mtry values were considered for selection. Figure 5 shows that the CV AUC increased as
the number of mtry decreased. The highest CV AUC was obtained
with an mtry equal to two. It had a CV AUC of 0.7849 and a test
AUC of 0.7932. A low mtry means that trees are very different from
each other, so each provides information for the aggregation step.
A low mtry could be problematic in the case of a high number of non-informative variables, which does not seem to be the case here.
Once the mtry was selected, the number of trees to be used
was analyzed. Figure 6 shows model performance as a function of
the number of trees. When the forest was small, adding new trees
substantially improved the model performance. However, the test
AUC value did not increase after approximately 400 trees.

Figure 6: Test AUC as a function of the number of trees. Left panel: using fewer than 150 trees. Right panel: using fewer than 1500 trees.
Finally, the down-sampling strategy was adopted to deal with
class imbalance problems. The down-sampled performance of the
model was slightly worse than when using all the dataset. The
optimal mtry was three with an associated CV AUC value of 0.7753
and a test AUC value of 0.7871. Compared with the previous models,
the standard deviation was much higher. As each fold used fewer
data, the AUC results were more dispersed. In terms of speed, the down-sampled model was fifteen times faster than when using
all the data. The cost sensitive approach was not performed.
Figure 5: CV AUC as a function of mtry.
Variable importance
A major advantage of the RF model is that variable importance
can be assessed. Here, we evaluate variable importance by means
of the RF built-in permutation variable importance measure, which
compares the increase in the prediction error after permuting all
elements of a variable. Here, categorical variables were not converted
to ordered categorical variable but to dummy variables in order to
facilitate interpretation.
Table 3 shows the 20 variables with the highest values on the
permutation variable importance measure. The variable with the
highest value was Local police. The correlated categories of Urban
area (builtUp) and Urban road (roadType) were in third and fourth
positions. This means that the behavior of the Local police and the
Regional police was considered to be different by the RF algorithm.
As expected, the hour and the time period-night were relevant for the
classification of observations. The most important characteristics of
the driver profile were age and experience (number of years holding
a driver’s license) which are both ranked in the top ten variables by
importance. The remaining variables in the top 20 were road type,
some regions/subregions and police divisions, and variables related
to the weekday and week of the year. Notice that sex and vehicle
type do not figure in the top 20.
Variable      Category              Importance
policeType    Local police          100.00
hour                                 62.84
builtUp       Urban area             61.69
roadType      Urban road             57.54
timePeriod    Night                  44.63
age                                  38.04
licenseYear                          38.02
roadType      Conventional road      26.90
weekday                              19.25
subregion     Barcelonès             19.12
week                                 17.30
ART           ART Metropolitana N    16.85
timePeriod    Afternoon              16.12
month                                15.69
workingDay    Non-working days       12.56
region        Lleida                 12.20
day                                   8.49
dayType       Sun                     7.99
ART           ART Tarragona           7.65
roadType      Highway2                7.40
Table 3: Top 20 variables by importance.
4.4. Comparison of tree-based models
To conclude, summary results are shown in Table 4. All the
tree-based models discussed in the article are compared in terms of
classification performance and computation intensity.
Tree-based model                         Test AUC   Computation time
CART                                     0.7498     Low
Down-sampling CART                       0.7577     Low
Up-sampling CART                         <0.5       Low/middle
Cost-sensitive CART                      0.7663     Low
Bagging                                  0.7267     Very high
Down-sampling Bagging                    0.7675     High
Cost-sensitive Bagging                   0.7737     Very high
Efficient Random Forest                  0.7932     Middle/high
Down-sampling efficient Random Forest    0.7871     Middle

Table 4: Performance and computation-time comparison of the tree-based models.
5. Discussion
This paper compares three tree-based models used in classification problems, in this specific case applied to BrAC
test results in excess of the legal limit in Catalonia (Spain). Drunk
driving data are deeply imbalanced since most drivers are not
alcohol impaired. Additionally, the performances of two alternative
strategies for dealing with imbalanced data (sampling methods and cost-sensitive methods) are compared. Unlike up-sampling, down-sampling methods proved preferable to fitting the models on the original imbalanced data.
The results following the application of down-sampling methods
were often slightly worse, but the reduction in computing time was
significant. As such, down-sampling techniques may be used to
obtain a rapid overview of model performance. In our case more
data did not improve model performance substantially. In the case of
imbalanced datasets, quality may be more important than quantity.
A comparison of the tree-based methods showed that the Random
Forest model performed best, which means it can be considered the
model of choice if a high performance model is wanted. If rapid
computation is required, however, the (CART) tree model with
misclassification costs should be used. Finally, when compared to
these two methods, Tree Bagging offered no modeling advantages in
the context described here.
In terms of the number of nodes, trees were in general very deep,
hindering the direct interpretation of variables. According to the
Random Forest variable importance indicators, the most important
variables were those of the area of control, the hour of day and the
driver’s age, findings that are in line with previous studies ([1], [13],
[3],[2], [10]). Built-up/non-built-up areas was the most important
variable in the classification. As for the implications of our findings
for road safety, it is clear that different enforcement strategies are
required to address drunk driving in each of the two areas. An
interesting application of tree-based methods is their utility for
helping in-situ police officers select the drivers that should be tested
when the checkpoint is set up. This application could be extended to
drug testing since the unitary cost of drug tests is high in comparison
to that of alcohol tests.
Future areas of research include distinguishing between
administrative and criminal offenses. In this highly imbalanced
scenario it would be interesting to analyze whether similar results
were obtained regarding the performance of tree-based models.
Additionally, other supervised classification techniques could be
applied such as linear discriminant analysis, naive Bayes or support
vector machine. Finally, a promising approach to explore in
the future in order to cut down the computation time is to
apply dimension reduction techniques, such as principal component
analysis or partial least squares.
Acknowledgements
We wish to express our gratitude to Servei Catala de Transit for
providing the data and the Mossos d’Esquadra and Local Police for
carrying out the fieldwork. The authors acknowledge the support
of the Spanish Ministry for grants ECO2013-48326-C2-1-P and
ECO2015-66314-R.
Declaration of interests
The authors report no conflicts of interest. The authors alone
are responsible for the content and writing of the paper.
References
[1] Alcaniz, M., Guillen, M., Santolino, M., Sanchez-Moscona, D.,
Llatje, O. and Ramon, L. (2014). Prevalence of alcohol-impaired
drivers based on random breath tests in a roadside survey in
Catalonia (Spain), Accident Analysis & Prevention, 65:131-141.
[2] Alcaniz, M., Santolino, M. and Ramon, L. (2016). Circular
con tasa de alcohol superior a la legal: caracterizacion del
conductor segun la vıa de circulacion, Revista Espanola de
Drogodependencias, 41(3):59-71.
[3] Alcaniz, M., Santolino, M. and Ramon, L. (2016). Drinking
patterns and drunk-driving behaviour in Catalonia, Spain: a
comparative study, Transportation Research Part F: Traffic
Psychology and Behaviour, 42, 522-531.
[4] Amaratunga, D., Cabrera, J. and Lee, Y.-S. (2008). Enriched
random forests, Bioinformatics, 24(18):2010-2014.
[5] Bou-Hamad, I., Larocque, D., Ben-Ameur, H. et al. (2011). A
review of survival trees, Statistics Surveys, 5:44-71.
[6] Breiman, L. (1996). Bagging predictors, Machine learning,
24(2):123-140.
[7] Breiman, L. (2001). Random Forest, Machine learning,
45(1):5-32.
[8] Breiman, L., Friedman, J., Stone, C. J. and Olshen, R. A.
(1984). Classification and regression trees, CRC press.
[9] Chawla, N. V. (2005). Data mining for imbalanced datasets: An
overview. In: Data mining and knowledge discovery handbook,
853-867, Springer.
[10] Chulia, H., Guillen, M. and Llatje, O. (2016). Seasonal and
Time-Trend Variation by Gender of Alcohol-Impaired Drivers
at Preventive Sobriety Checkpoints, Journal of Studies on
Alcohol and Drugs, 77(3):413-420.
[11] De'Ath, G. (2002). Multivariate regression trees: a new technique for modeling species-environment relationships, Ecology, 83(4):1105-1117.
[12] Fawcett, T. (2006). An introduction to ROC analysis, Pattern
recognition letters, 27(8):861-874.
[13] Font-Ribera, L., Garcia-Continente, X., Perez, A., Torres, R.,
Sala, N., Espelt, A. and Nebot, M. (2013). Driving under the
influence of alcohol or drugs among adolescents: the role of
urban and rural environments, Accident Analysis & Prevention,
60:1-4.
[14] Grubinger, T., Achim Zeileis, A. and Pfeiffer, K.-P. (2014).
Evolutionary Learning of Globally Optimal Classification and
Regression Trees in R, Journal of Statistical Software, 61(1).
[15] Hastie, T., Tibshirani, R. and Friedman, J. (2009). The
Elements of Statistical Learning: Data Mining, Inference, and
Prediction, Second Edition, Springer Series in Statistics.
[16] He, H. and Garcia, E. A. (2009). Learning from imbalanced
data, Knowledge and Data Engineering, IEEE Transactions on,
21(9):1263-1284.
[17] Hothorn, T., Hornik, K. and Zeileis, A. (2006). Unbiased
recursive partitioning: A conditional inference framework,
Journal of Computational and Graphical statistics,
15(3):651-674.
[18] Hyafil, L. and Rivest, R. L. (1976). Constructing optimal binary
decision trees is NP-complete, Information Processing Letters,
5(1):15-17.
[19] Kuhn, M. and Johnson, K. (2013). Applied predictive modeling,
Springer.
[20] Kumar, M. and Sheshadri, H. (2012). On the classification
of imbalanced datasets, International Journal of Computer
Applications, 44.
[21] Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection, Statistica Sinica, pp. 361-386.
[22] Loh, W.-Y. (2011). Classification and regression trees, WIREs
Data Mining Knowl. Discov., 1(1):14-23.
[23] Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G. and
Ziegler, A. (2012). Probability machines: consistent probability
estimation using nonparametric learning machines, Methods of
Information in Medicine, 51(1):74.
[24] Mathijssen, M. (2005). Drink driving policy and road safety in
the Netherlands: a retrospective analysis, Transportation
research part E: logistics and transportation review,
41(5):395-408.
[25] Meinshausen, N. (2006). Quantile regression forests, The
Journal of Machine Learning Research, 7:983-999.
[26] R Core Team (2016). R: A Language and Environment for
Statistical Computing, R Foundation for Statistical Computing,
Vienna, Austria.
[27] Raileanu, L. E. and Stoffel, K. (2004). Theoretical comparison
between the gini index and information gain criteria, Annals of
Mathematics and Artificial Intelligence, 41(1):77-93.
[28] Segal, M. R. (2004). Machine learning benchmarks and random forest regression, Center for Bioinformatics & Molecular Biostatistics.
[29] Sela, R. J. and Simonoff, J. S. (2011). RE-EM trees: a data
mining approach for longitudinal and clustered data, Mach.
Learn., 86(2):169-207.
[30] Strobl, C., Boulesteix, A.-L. and Augustin, T. (2007). Unbiased
split selection for classification trees based on the Gini index,
Computational Statistics & Data Analysis, 52(1):483-501.
[31] Strobl, C., Boulesteix, A.-L., Zeileis, A. and Hothorn, T.
(2007). Bias in random forest variable importance measures:
Illustrations, sources and a solution, BMC bioinformatics,
8(1):1.
[32] Vanlaar, W., Robertson, R., Marcoux, K., Mayhew, D., Brown,
S. and Boase, P. (2012). Trends in alcohol-impaired driving in
Canada, Accident Analysis & Prevention, 48:297-302.
[33] Wager, S., Hastie, T. and Efron, B. (2014). Confidence intervals for
random forests: The jackknife and the infinitesimal jackknife,
The Journal of Machine Learning Research, 15(1):1625-1651.
[34] Williams, A. F. (2006). Alcohol-impaired driving and its
consequences in the United States: the past 25 years, Journal
of safety research, 37(2):123-138.
[35] Xu, B., Huang, J. Z., Williams, G., Wang, Q. and Ye, Y.
(2012). Classifying very high-dimensional data with random
forests built from small subspaces, International Journal of
Data Warehousing and Mining (IJDWM), 8(2):44-63.
Appendix
ART                     # tests    # positives    (%)
ART Girona               50,143        2,610      5.2
ART Manresa Central      44,917        1,656      3.7
ART Metropolitana N     142,719        6,730      4.7
ART Metropolitana S      37,983        1,582      4.2
ART Pirineu Lleida       20,344          495      2.4
ART Ponent Lleida        41,524          525      1.3
ART Tarragona            45,711        2,141      4.7
ART Terres Ebre          25,595          755      2.9

Table A.1: Number of tests, positives and percentage of positives by Police Territorial Division (ART).
Month    # tests    # positives    (%)
1         32,286        1,046      3.2
2         38,231        1,446      3.8
3         41,161        1,749      4.2
4         29,485        1,162      3.9
5         34,485        1,487      4.3
6         41,897        1,916      4.6
7         27,521        1,373      5.0
8         28,788        1,386      4.8
9         29,319        1,402      4.8
10        38,298        1,126      2.9
11        31,182        1,271      4.1
12        36,283        1,130      3.1

Table A.2: Number of tests, positives and percentage of positives by month of the year.
Hour    # tests    # positives    (%)
1        22,656        1,069      4.7
2         9,777          761      7.8
3        35,935        2,677      7.4
4        25,562        2,161      8.5
5         7,043          954     13.5
6        21,499        1,958      9.1
7        22,746        1,094      4.8
8        14,282          293      2.1
9         8,801           73      0.8
10        8,752           42      0.5
11       11,966           47      0.4
12       10,524           46      0.4
13        3,020           23      0.8
14        1,160           21      1.8
15       19,151          136      0.7
16       23,296          234      1.0
17       11,790          148      1.3
18        6,243           64      1.0
19       10,755          139      1.3
20       11,999          157      1.3
21        2,588           86      3.3
22        2,057          123      6.0
23       28,448          750      2.6
24       88,886        3,438      3.9

Table A.3: Number of tests, positives and percentage of positives by hour of the day.
Boletín de Estadística e Investigación Operativa, Vol. 33, No. 3, Noviembre 2017, pp. 224-257

Investigación Operativa
Robust Approaches to Uncertain Optimization
Elisabeth Köbis
Institute of Mathematics
Martin–Luther–University Halle–Wittenberg
Abstract
This paper gives an overview of recent results on
robust optimization via unifying approaches using a
nonlinear scalarization concept and methods from vector
and set optimization. First, we consider scalar uncertain
optimization concepts. We distinguish between a finite
and infinite uncertainty set and show that a prominent
scalarizing functional as well as methods from vector and
set optimization play a crucial role for the representation of
robust optimization models. Then we present a notion of
robust solutions of uncertain vector optimization problems
along with linear and nonlinear scalarization results.
Keywords: Uncertain vector optimization, Robust
optimization, Set optimization, Vector optimization,
Scalarization.
AMS Subject classifications: 90C29, 68T37.
© 2017 SEIO
1. Introduction
Uncertain data contaminate most optimization problems in
various applications ranging from science and engineering to finance
and thus represent an essential component in optimization. From a
mathematical point of view, many problems can be modeled as an
optimization problem and be solved, but in real life, having exact
data is very rare and seems almost impossible. Due to a lack of
complete information, uncertain data can highly affect solutions and
thus influence the decision making process. Hence, it is crucial to
address this important issue in optimization theory.
Potential applications of uncertain optimization include supply
and inventory management, since demand and tools needed for
the production process can easily be exposed to uncertain changes.
Further examples for uncertain data in optimization problems can
be found in the field of market analysis, share prices, transportation
science, timetabling and location theory (see, for example, [4] and
the references therein, and [16, 51]).
As was recently observed in [26, 27], robust multiobjective
optimization is an important application of set optimization. In
case uncertainties are present during an optimization process,
the decision maker generally has two modeling options: Using
stochastic optimization approaches, solutions are desired that are
likely to satisfy the given requirements (optimality and constraints).
Alternatively, robust optimization searches for solutions which are
of good quality in the worst-case scenarios, regardless of how likely
this event may be. Robust multiobjective optimization with a fixed
ordering structure was examined in [26, 27]. Results on robust
multiobjective optimization using a variable order relation can be
found in [32].
In this paper, we give an overview of recent results in robust
optimization using concepts from vector and set optimization, in
particular by means of a nonlinear scalarizing functional. Section 2
is devoted to recalling some notation and preliminary results. In
Section 3, we describe approaches to uncertain scalar optimization,
where we distinguish between a finite and an infinite uncertainty
set. Section 4 presents a concept for robustness for uncertain
vector optimization problems and collects scalarization results. The
concluding Section 5 proposes some avenues for future research.
2. Preliminaries
In this section, we recall some notation of uncertain
multiobjective optimization introduced in Ehrgott et al. [11] (see
also [27, 29]). Throughout this work, let Yf be a real linear
topological space, X be a linear space, and let an uncertainty set
∅ ≠ U ⊆ RN be given, where N ∈ N\{0}. Consider f : X×U → Yf , ξ ∈ U and let f(·, ξ) : X → Yf be the function that is to be minimized on a feasible set ∅ ≠ X(ξ) ⊆ X. The feasible set is defined as
X (ξ) := {x ∈ X | Fi(x, ξ) ≤ 0, i = 1, . . . ,m}
with ξ ∈ U and Fi : X × U → R, i = 1, . . . ,m.
For a fixed ξ ∈ U , the deterministic vector optimization problem
is denoted by
f(x, ξ) → inf_{x ∈ X(ξ)}.    (P(ξ))

The family of all problems ⋃_{ξ∈U} (P(ξ)) is denoted by (P(U)). We
call ξ ∈ U a scenario and (P (ξ)) an instance of (P (U)).
ξ models the parameters which are uncertain, and the uncertainty
set U contains all the possible parameter values that the uncertain
parameter may attain. Such uncertainties occur in many real-world
optimization problems and can e.g. be caused by measuring errors,
modeling assumptions or simply because a future parameter is not
known prior to solving an optimization problem. Consequently, it
is necessary to treat some of the input data as uncertain and it is
important to find a way to handle uncertain data in optimization
problems. Throughout this paper we assume that the actual
outcome of the parameters ξ is unknown, but that ξ stems from an
uncertainty set U that is nonempty, compact and known a priori.
This is a common assumption in the context of robust optimization.
Examples include interval based uncertainties (e.g. [7]), polyhedral
uncertainties (e.g. [44]), or ellipsoidal uncertainty sets (e.g. [4]).
Let the set of robust solutions be denoted as
A := {x ∈ X | ∀ξ ∈ U : Fi(x, ξ) ≤ 0, i = 1, . . . , m} = ⋂_{ξ∈U} X(ξ),    (2.1)
which we assume to be nonempty.
We define for x ∈ A
fU (x) := {f(x, ξ)| ξ ∈ U} (2.2)
the image of f under U. Note that fU(x) ≠ ∅ for all x ∈ A, since U ≠ ∅.

Our goal is to obtain solutions that are robust, i.e., that perform
well even in the worst-case scenario. For the scalar case Yf = R,
this would mean to minimize the functional supξ∈U f(x, ξ) on A. Of
course, if f is vector-valued, this scalar approach cannot be easily
transferred to vector optimization. Due to the absence of a total
order on Yf , we need to define the meaning of a robust solution that
satisfies some kind of optimality.
In order to determine robust solutions (where the term robustness
needs to be defined), sets fU (x) need to be compared. For the
comparison of sets, usually, a cone is added to one set and both
sets are then compared w.r.t. that given cone, which represents
the ordering structure. Let Y be a real linear topological space.
Recall that C ⊆ Y is called a cone if c ∈ C implies that λc ∈ C
for every λ ≥ 0. The dual cone of a cone C is defined as the set
C∗ := {y∗ ∈ Y ∗ | ∀c ∈ C : y∗(c) ≥ 0}, where Y ∗ denotes the
topological dual space of Y . A cone C ⊆ Y is called pointed if
C ∩ (−C) = {0}. For two nonempty subsets A, B of Y , we denote
the Minkowski sum of sets by
A+B := {a+ b | a ∈ A, b ∈ B}.
The cone C ⊆ Y is convex if C + C ⊆ C. We say that a nonempty
set B ⊂ Y is proper if B ≠ {0} and B ≠ Y. A cone C ⊆ Y induces
a binary relation ≤C by
y1 ≤C y2 :⇐⇒ y1 ∈ y2 − C (⇐⇒ y2 ∈ y1 + C).
see, for example, [28]. If the cone C ⊆ Y is proper (i.e., {0} ≠ C ≠ Y), pointed and convex, then the binary relation ≤C induced by C
is a (partial) order relation (i.e., a binary relation which is reflexive,
transitive and antisymmetric), see, for example, [28]. In the below
definition, we recall a widely used binary relation among nonempty
subsets of Y , namely, the so-called upper set less order relation.
Definition 2.1 (Upper Set Less Order Relation, see [35, 36]). Let
C ⊆ Y be a cone. Then the upper set less order relation is given
for two nonempty sets A,B ⊂ Y as
A ⪯uC B :⇐⇒ A ⊆ B − C.
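As a small illustration (our own, not taken from the references), the upper set less order relation can be checked numerically for finite sets and the ordering cone C = R2+; the following Python sketch encodes the containment A ⊆ B − C as a componentwise dominance test.

# Hedged sketch: upper set less relation for finite sets in R^2 with C = R^2_+.
import numpy as np

def upper_set_less(A, B):
    """Check A ⊆ B - R^k_+ for finite sets given as arrays of row vectors."""
    return all(any(np.all(a <= b) for b in B) for a in A)

A = np.array([[1.0, 2.0], [0.0, 3.0]])
B = np.array([[2.0, 3.0], [1.0, 4.0]])
print(upper_set_less(A, B))   # True:  each point of A is componentwise below some point of B
print(upper_set_less(B, A))   # False: (2, 3) is not dominated by any point of A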
In the following, it will be important to identify minimal elements
of a nonempty subset F of Y .
Definition 2.2 (Minimality). Let F ⊆ Y be a nonempty set and let
α be a binary relation on Y. F ∈ F is a minimal element of F w.r.t. α if
for all G ∈ F : G α F =⇒ F α G.
In the above definition, if instead of Y we consider the power set 2^Y, then
this definition encompasses the usual minimality notion in set
optimization (see Jahn [28, Definition 14.5]). If F ⊆ Y and the
relation α is induced by a convex cone C ⊂ Y , then the definition
describes the standard minimality notion in vector optimization
(compare, for example, [28, Definition 4.1]). Indeed, F ∈ F is a
minimal element of F w.r.t. α if and only if (F − C) ∩ F ⊆ F + C.
Definition 2.3 (Weak Minimality in Vector Optimization). Let
F ⊆ Y and consider the binary relation α =≤C on Y , where C is
a proper closed and convex cone with nonempty topological interior.
Then F ∈ F is called a weakly minimal element of F w.r.t. α if
(F − int(C)) ∩ F = ∅,
where int(C) denotes the topological interior of C. Note that
minimality implies weak minimality.
Now let k be a non-zero element in the real linear topological
space Y . In addition, let B be a nonempty closed proper subset of
Y satisfying the inclusion
B + [0,+∞) · k ⊂ B. (2.3)
Then we recall the functional zB,k : Y → R ∪ {±∞} =: R̄ defined
by
zB,k(y) := inf{t ∈ R|y ∈ tk −B} for all y ∈ Y. (2.4)
By convention, let inf ∅ = +∞. The functional zB,k was originally
introduced as separation functional in vector optimization by
Gerstewitz [18], see also Gerth and Weidner [19], Pascoletti and
Serafini [41] and Gopfert et al. [20]. It is interesting to notice
that the construction in (2.4) was mentioned by Krasnosel’skiı
[34] (see Rubinov [42]) in the context of operator theory. Using
this scalarizing functional we can define the following minimization
problem which will be used later on to represent the concept of
robust optimization.
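As a simple illustration before we proceed (our own example, not taken from the cited references): for Y = R2, B = R2+ and k = (1, 1)T, inclusion (2.3) clearly holds and zB,k(y) = inf{t ∈ R | y ∈ t·(1, 1)T − R2+} = inf{t ∈ R | y1 ≤ t and y2 ≤ t} = max{y1, y2}; for instance, zB,k((3, −1)T) = 3.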
In the following definition, we denote the set of feasible elements
of Y by F .
Definition 2.4. Let ∅ ≠ F ⊆ Y and let zB,k be defined as in (2.4).
An element F ∈ F is a minimal element of F in Y w.r.t. zB,k if
zB,k(F ) ≤ zB,k(G) ∀G ∈ F ,
i.e., F solves the scalar optimization problem
zB,k(F) → inf_{F ∈ F}.    (Pk,B,F)
We remark that many scalarization schemes that are
suggested in the literature are special cases of the above
nonlinear scalarization concept. For example, in the case of
(finite-dimensional) multiobjective optimization, this scalarization
method comprises weighted-sum (see Gass and Saaty [17],
or Zadeh [52]), Tschebyscheff- (Steuer and Choo [46]) and
ε-constraint-scalarizations (Haimes et al. [23]), and many others
(for an overview, see [48]). The functional zB,k possesses various
interesting properties, some of which we collect below in the case
that B is a proper closed convex cone in Y with nonempty interior
and k ∈ intB.
Lemma 2.1 ([20]). Let B be a proper closed convex cone with
nonempty interior in the real linear topological space Y and k ∈ intB. Then zB,k, defined by (2.4), is a finite-valued, continuous,
sublinear, B-monotone (i.e., y1 ∈ y2−B =⇒ zB,k(y1) ≤ zB,k(y2))
and strictly (intB)-monotone (i.e., y1 ∈ y2− intB =⇒ zB,k(y1) <
zB,k(y2)) functional such that
∀y ∈ Y, ∀r ∈ R : zB,k(y) ≤ r ⇐⇒ y ∈ rk −B,
∀y ∈ Y, ∀r ∈ R : zB,k(y) < r ⇐⇒ y ∈ rk − intB.
It is interesting to mention that the functional zB,k has been recently
defined for linear spaces that are not endowed with a topology.
Several properties of zB,k under non-topological assumptions are
studied in [22] and the references therein.
3. Robust Approaches to Uncertain Scalar
Optimization
In this section, we study the problem (P (ξ)) for the case Yf = R,
i.e., we consider scalar optimization problems (P (ξ)) which depend
on uncertain parameters ξ ∈ U ⊆ RN . Thus, for fixed parameters
ξ ∈ U , the problem to be solved is given as
f(x, ξ)→ inf
s.t. Fi(x, ξ) ≤ 0, i = 1, . . . ,m,
x ∈ X,    (P(ξ))
where f : X × U → R, Fi : X × U → R, i = 1, . . . ,m.
Now the question arises how one should handle the family of all
problems ⋃_{ξ∈U} (P(ξ)), denoted by (P(U)). Typically, the problem
(P (U)) is replaced by a deterministic counterpart problem, called
robust counterpart. Now we will formally recall the most prominent
robustness concept from the literature. It has been first mentioned
by Soyster [45] and then formalized and analyzed by Ben-Tal, El
Ghaoui, and Nemirovski in numerous publications, see e.g. [6, 14] for
early contributions and [4] for an extensive collection of results. The
idea is that the worst possible objective function value is minimized
in order to get a solution that is “good enough” even in the worst
case scenario. Furthermore, constraints have to be satisfied for
every scenario ξ∈ U . Then the robust counterpart of the uncertain
optimization problem (P (U)) is defined by
sup_{ξ∈U} f(x, ξ) → inf
s.t. x ∈ A,    (RC)
where A is defined in (2.1). We call a feasible solution of (RC)
robust. The intuition behind this approach is the following: A
risk-averse decision-maker is interested in obtaining robust solutions,
i.e., solutions that hedge against the possibility of the worst case
scenario. Moreover, the given constraints have to be satisfied for
any scenario. Of course, this is an extremely conservative approach that must be handled with great care, since it needs to be ensured that the set A is indeed nonempty. In the literature, there
exist numerous extensions and modifications of this concept (see,
for instance, [29] and the references therein). For example, the
reliably robust counterpart (compare [5]) relaxes the constraints, and
the lightly robust counterpart (see [44]) minimizes upper bounds in
the constraints, and where deviations from the optimal value at a
nominal scenario are allowed. The following two subsections are
devoted to investigating the problem (RC) in case of a finite and
infinite uncertainty set U , respectively.
3.1. Finite Uncertainty Set
In this section, we assume that U := {ξ1, . . . , ξq}, i.e., ξ ∈ U can
take on q different values. This assumption is of particular interest
in practical applications concerning computations, as most data can
only be handled in a discrete manner. We will now show how
the robust counterpart (RC) can be expressed using the nonlinear
scalarizing functional zB,k given by (2.4) under the assumption that
U is finite.
Theorem 3.1 ([29, Theorem 3]). Let Y = Rq. For B := Rq+, where
Rq+ denotes the nonnegative orthant in Rq, k := 1q := (1, . . . , 1)T
and F := {(f(x, ξ1), . . . , f(x, ξq))T |x ∈ A}, problem (Pk,B,F ) is
equivalent to problem (RC) in the following sense:
inf_{F∈F} zB,k(F) = inf_{x∈A} sup_{ξ∈U} f(x, ξ).
Proof. Since B = Rq+ and k ∈ intRq+, (2.3) is fulfilled and then the
functional zB,k is well-defined. The following reformulations hold:
inf_{F∈F} zB,k(F) = inf_{F∈F} inf{t ∈ R | F ∈ tk − B}
                 = inf_{F∈F} inf{t ∈ R | F − tk ∈ −B}
                 = inf_{x∈A} inf{t ∈ R | (f(x, ξ1), . . . , f(x, ξq))T − t · (1, . . . , 1)T ≤Rq+ 0q}
                 = inf_{x∈A} inf{t ∈ R | (f(x, ξ1), . . . , f(x, ξq))T ≤Rq+ t · (1, . . . , 1)T}
                 = inf{ sup_{ξ∈U} f(x, ξ) | x ∈ A },
which completes the proof.
Note that the selection of k = 1q reflects the choice of each
objective function: It means that every scenario ξ ∈ U (or each
objective function f(x, ξ), ξ ∈ U) is treated equally, i.e., no objective
function is preferred to another one.
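The identity in Theorem 3.1 is easy to verify numerically for a toy problem. The following Python sketch (our own illustration, with purely fictitious objective values) computes zB,k of the scenario vector of each candidate, which for B = Rq+ and k = 1q is simply its maximum, and confirms that minimizing it coincides with the min-max robust counterpart (RC).

# Hedged sketch: Theorem 3.1 for a finite uncertainty set with toy data.
import numpy as np

# rows = candidate solutions x, columns = scenarios ξ1..ξq (fictitious objective values)
F = np.array([[3.0, 5.0, 4.0],
              [6.0, 2.0, 1.0],
              [4.0, 4.0, 4.0]])

z = F.max(axis=1)            # zB,k(F(x)) = max_i f(x, ξi) when B = R^q_+ and k = 1_q
print(z.min(), z.argmin())   # inf_x zB,k = 4.0, attained by the third candidate
print(F.max(axis=1).min())   # inf_x sup_ξ f(x, ξ): the same value, as Theorem 3.1 states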
Remark 3.1. Since B = Rq+ is a proper closed convex cone
and k ∈ intRq+, Lemma 2.1 can be applied, and the functional
zB,k is continuous, finite-valued, convex, Rq+-monotone, strictly
(intRq+)-monotone and subadditive.
Remark 3.2. The concept of robustness is described by the
Tschebyscheff scalarization with the origin as reference point as a
special case of functional (2.4). Therefore, Theorem 3.1 verifies
that (RC) can be interpreted as a max-ordering problem as defined
in multiobjective optimization, see [10]. Note that this relation has
already been observed by Kouvelis and Sayin [33, 43], where it was
used to calculate the set of efficient solutions of discrete bicriteria
optimization problems. Additionally, this concept is equivalent to
a reference point approach of Wierzbicki [49] using the origin as
reference point, and in the case that f(x, ξ) ≥ 0 for all ξ ∈ U and
x ∈ A, to a weighted Tschebyscheff scalarization, see Steuer and
Choo [46].
The preceding result has shown that the problem (RC) can be
regarded as a scalarized problem of a multiobjective optimization
problem, where every scenario ξl ∈ U , l = 1, . . . , q, yields
its own objective function hl(x) := f(x, ξl), with h : A → Rq and h := (h1, . . . , hq)T. Therefore, it would be quite
natural to consider the multiobjective optimization problem (as a
deterministic multiobjective counterpart problem) in more detail.
The multiobjective robust counterpart to (P (U)) is defined by
h(x) → inf_{x∈A},    (RC′)
where F = h[A] = ∪x∈Ah(x) (see (2.1) for the definition of A).
Using Theorem 3.1 together with Lemma 2.1, we can conclude that
problem (Pk,B,F ) is a scalarization of the multiobjective counterpart
(RC′), and the following corollary holds due to the monotonicity
properties of the functional zB,k.
Corollary 3.1 ([29, Corollary 4]). Let Y , B, k and F be given as in
Theorem 3.1, C := Rq+ and let α denote the order relation induced
by C. Then for a given F ∈ F , we have the following implications:
[∀G ∈ F \ {F} : zB,k(F ) < zB,k(G)]⇒ F is a minimal element
of F w.r.t. α,
[∀G ∈ F : zB,k(F ) ≤ zB,k(G)]⇒ F is a weakly minimal element
of F w.r.t. α.
It is shown in [29] that several different kinds of robust
counterpart problems known from the literature can be obtained
by considering the problem (Pk,B,F ) (as scalarization of (RC′))
with different input parameters k,B and F . Additionally, it is
interesting to mention that it is possible to include the constraints
Fi, i = 1, . . . ,m as objective functions in the objective vector h.
In this way, more concepts of robustness can be represented and
further evaluated (for example, reliable and light robustness, see
[29]). Moreover, depending on a decision-maker’s preferences, it is
now possible to find completely new concepts of robustness (i.e.,
different robust counterpart problems) by modifying the involved
parameters k and B.
3.2. Infinite Uncertainty Set
If U = {ξ1, . . . , ξq} is finite, each scenario can be interpreted as
an objective function, as we have seen in Section 3.1. For a robust
solution x ∈ A, we then obtain a vector Fx ∈ Rq which contains
f(x, ξi) in its ith coordinate. In order to compare two solutions
x and y, order relations for the vectors Fx and Fy are used. In
this way, many concepts of robust optimization and of stochastic
programming can be characterized using multiobjective counterpart
problems, see [29]. If U is not a finite set, we obtain not vectors
but functions, i.e., Fx : U → R, where Fx(ξ) = f(x, ξ) contains
the objective function value of x in scenario ξ, ξ ∈ U . In order to
compare two solutions x and y, we hence need order relations in the
real linear space RU of all mappings F : U → R. Throughout this
subsection, we assume that U is not necessarily a finite set. In this
case, we propose three different approaches to the problem (RC):
• the vector approach,
• the set approach,
• and the nonlinear scalarization approach.
The idea of using these three approaches to dealing with problem
(RC) stems from [30], and most of the results presented within this
subsection are taken from [30]. We start by describing the vector
approach. Let Y = RU be the space of all functions F : U → R. For
a fixed solution x ∈ A, we define
Fx ∈ Y : Fx(ξ) := f(x, ξ).
In order to compare elements of Y , we consider different order
relations on the space Y which are denoted by α. In the context of
vector optimization, (partial) order relations are the binary relations
≤C induced by pointed convex cones.
Such a cone C induces an order relation α :=≤C by
y1 ≤C y2 :⇐⇒ y1 ∈ y2 − C (⇐⇒ y2 ∈ y1 + C).
Whenever we are working with the interior of an ordering cone, we
assume that Y = C(U ,R), i.e., that the functions Fx = f(x, ξ)
are continuous in ξ for all feasible values of x. A particular order
relation, which will be of interest later, is given in the next definition.
Definition 3.1. The natural order relation α is given by the cone
Y + := {F ∈ Y |∀ξ ∈ U : F (ξ) ≥ 0}
inducing for all F,G ∈ Y that
F α G ⇐⇒ G ∈ F + Y +
⇐⇒ F (ξ) ≤ G(ξ) for all ξ ∈ U .
Given an order relation α and a set F ⊆ Y , the vector
optimization problem asks for minimal elements of F w.r.t.
α. It is shown in [30] that various concepts for uncertain
optimization can be interpreted as solving such a vector optimization
problem, and conversely, every order α induces a concept for
handling uncertainty. While not all such concepts necessarily
have a meaningful interpretation in the context of uncertain
optimization, this relationship provides a coherent means of devising
and understanding deterministic counterparts of an uncertain
optimization problem. For a systematic approach to different
concepts for handling uncertainty in the context of vector and set
optimization, we refer to [30].
Remark 3.3. In the case of the natural order relation α of Y
introduced in Definition 3.1, an element F ∈ F is a minimal element
of F w.r.t. α if and only if ∄ G ∈ F \ {F} such that
∀ξ ∈ U : (G − F)(ξ) ≤ 0,
or, in equivalent terms, if and only if ∄ G ∈ F such that
∀ξ ∈ U : (G − F)(ξ) ≤ 0 and ∃ ξ ∈ U : (G − F)(ξ) < 0.
If Y = C(U ,R), then int(Y +) = {F ∈ Y |∀ξ ∈ U : F (ξ) > 0} (see
Jahn [28] and Winkler [50]), and an element F ∈ F is a weakly
minimal element of F w.r.t. α if and only if
∄ G ∈ F : (G − F)(ξ) < 0  ∀ξ ∈ U.    (3.1)
The robust counterpart (RC) can be formulated as a vector
optimization problem in the space Y = RU as follows. We denote
the set of robust outcome functions in Y by
F := {Fx ∈ Y | x ∈ A},
where A is defined in (2.1). Let two functions Fx, Fy ∈ Y be given.
We consider the following order relation on Y :
Fx αsup Fy :⇐⇒ sup_{ξ∈U} Fx(ξ) ≤ sup_{ξ∈U} Fy(ξ).
As in the finite dimensional case, the sup-order relation αsup is
not compatible with addition, i.e., for three elements Fx, Fy, Fz ∈ Y, Fx αsup Fy does not necessarily imply (Fx + Fz) αsup (Fy + Fz).
Consequently, αsup cannot be represented by an ordering cone.
Nevertheless, it has the following properties.
Remark 3.4. αsup is reflexive and transitive. Furthermore, αsup is
a total preorder.
The following theorem shows that the order relation αsup makes it possible to represent the robust optimization problem (RC) as a vector optimization problem.
Theorem 3.2 ([30, Theorem 1]). A solution x ∈ A is an optimal
solution to (RC) if and only if Fx is a minimal element of F w.r.t.
the sup-order relation αsup.
Proof. Let x ∈ A. Then
x is an optimal solution to (RC) ⇔ sup_{ξ∈U} f(x, ξ) ≤ sup_{ξ∈U} f(x′, ξ) for all x′ ∈ A
⇔ sup_{ξ∈U} Fx(ξ) ≤ sup_{ξ∈U} Fx′(ξ) for all x′ ∈ A
⇔ Fx αsup Fx′ for all x′ ∈ A
⇔ Fx αsup G for all G ∈ F,
and the result follows since αsup is a total preorder.
This means that optimal solutions of the robust
counterpart (RC) correspond to outcome functions whose suprema
are minimal.
We now analyze the relation between the sup-order relation αsup
and the natural order relation α introduced in Definition 3.1.
Remark 3.5. F α G =⇒ F αsup G for all F,G ∈ Y .
The implication stated in Remark 3.5 does not generally imply
that every minimal element w.r.t. αsup is also a minimal element
w.r.t. α, or vice versa. When there are two scenarios ξ1, ξ2 and
under some additional assumptions, Iancu and Trichakis [25] have
shown that there exist optimal solutions to (RC) which are minimal
w.r.t. C = R2+, and call them PRO robust solutions. However, in
this general setting, we are able to formulate the following relation
between minimal elements.
Lemma 3.1 ([30, Lemma 2]). Let Y = C(U ,R). If F ∈ F is
a minimal element of F w.r.t. αsup, then F is a weakly minimal
element of F w.r.t. the natural order relation α.
Proof. Let F ∈ F be a minimal element of F w.r.t. αsup. Since
αsup is a total preorder, this means that
sup_{ξ∈U} F(ξ) ≤ sup_{ξ∈U} G(ξ) for all G ∈ F.    (3.2)
Now suppose that F is not a weakly minimal element of F w.r.t.
the natural order relation α of Y . Thus, there exists G ∈ F s.t.
∀ ξ ∈ U : G(ξ) < F (ξ),
see (3.1). Since U was assumed to be compact, G attains its supremum on U at some ξ0 ∈ U. This means that
sup_{ξ∈U} G(ξ) = G(ξ0) < F(ξ0) ≤ sup_{ξ∈U} F(ξ),
a contradiction to (3.2).
Using this relation together with Theorem 3.2, we obtain that
Fx is a weakly minimal element w.r.t. the natural order relation α,
for all optimal solutions x to (RC).
Corollary 3.2 ([30, Corollary 1]). Let Y = C(U ,R) and let the
worst case be attained for every solution x ∈ A. Then for every
optimal solution x to the robust counterpart (RC), Fx is a weakly
minimal element of the set of robust outcome functions F w.r.t. the
natural order relation α in Y .
Now we will consider the problem (RC) by using the set
approach. In particular, we will show that it is possible to interpret
the robust counterpart (RC) as a set-valued optimization problem.
Let the power set of Yf = R be denoted by Z := 2^R. Furthermore,
we define for each x ∈ A
Bx := fU (x) := {f(x, ξ) | ξ ∈ U} ⊆ R.
We denote the set of robust outcome sets in Z by
B := {Bx ∈ Z| x ∈ A}.
Let R+ denote the set of nonnegative real numbers. For Bx, By ∈ Z,
the upper-type set-relation βsup is defined as
Bx βsup By :⇐⇒ Bx ⊆ By − R+
           ⇐⇒ sup Bx ≤ sup By,
see Kuroiwa [35, 36] and Kuroiwa et al. [39].
Remark 3.6. βsup is reflexive and transitive. Furthermore, it is a
total preorder.
We obtain the following relation between βsup and αsup.
Lemma 3.2 ([30, Lemma 3]). Let x, y ∈ A and let Fx, Fy be their corresponding outcome functions in Y and Bx, By their corresponding outcome sets in Z. Then
Bx βsup By ⇐⇒ Fx αsup Fy.
Proof.
Bx βsup By ⇐⇒ sup Bx ≤ sup By
           ⇐⇒ sup{Fx(ξ) | ξ ∈ U} ≤ sup{Fy(ξ) | ξ ∈ U}
           ⇐⇒ Fx αsup Fy,
and the proof finishes.
The order relation βsup makes it possible to represent the robust optimization problem (RC) as a set-valued optimization problem, as the next theorem verifies.
Theorem 3.3 ([30, Theorem 2]). A solution x ∈ A is an optimal
solution to (RC) if and only if Bx is a minimal element of B w.r.t.
the order relation βsup.
Proof. We know from Theorem 3.2 that x ∈ A is an optimal solution to (RC) if and only if Fx αsup Fx′ for all x′ ∈ A. According to Lemma 3.2 this is equivalent to Bx βsup Bx′ for all x′ ∈ A, and the result follows.
We finally represent the robust counterpart (RC) using the
nonlinear scalarizing functional (2.4).
Theorem 3.4 ([30, Theorem 3]). Let Y = RU , B := Y +, and
k :≡ 1 ∈ Y . Then x ∈ A is an optimal solution to (RC) if and only
if Fx solves problem (Pk,B,F ).
Proof. B + [0,+∞) · k ⊂ B holds, thus inclusion (2.3) is satisfied
and the functional zB,k can be defined. Furthermore, we have
zB,k(Fx) = inf{t ∈ R | Fx ∈ tk − B}
         = inf{t ∈ R | Fx − tk ∈ −Y +}
         = inf{t ∈ R | ∀ξ ∈ U : Fx(ξ) ≤ t}
         = sup_{ξ∈U} f(x, ξ).
Thus, Fx is a solution for (Pk,B,F ) if and only if x ∈ A minimizes
supξ∈U f(x, ξ), i.e., if and only if x is an optimal solution to (RC).
Remark 3.7. If Y = C(U ,R) and k ∈ int(Y +), we have the
following properties. Since B = Y + is a proper closed convex cone
and k ∈ int(Y +), Lemma 2.1 implies that the functional zB,k is
continuous, finite-valued, Y +-monotone, strictly (intY +)-monotone
and sublinear, and
∀ F ∈ Y, ∀ t ∈ R : zB,k(F ) ≤ t ⇐⇒ F ∈ tk − Y +,
∀ F ∈ Y, ∀ t ∈ R : zB,k(F ) < t ⇐⇒ F ∈ tk − int(Y +).
Note that in the special case of a discrete uncertainty set U =
{ξ1, . . . , ξq}, Theorem 3.4 simplifies to Theorem 3.1.
4. Robust Approaches to Uncertain Vector
Optimization
This section is devoted to developing solution concepts for uncertain vector optimization problems; specifically, our goal is to obtain robust solutions. Only a few approaches to uncertain
vector optimization have been mentioned in the literature, of which
we briefly summarize the following. Hughes [24] presented a first
concept of dealing with uncertain multiobjective optimization by
computing the expected value of the errors that occur in the
objective functions. The vector of expected errors is then used in
the classical concept of Pareto optimality. Teich [47] generalized the
concept of Pareto optimality in a probabilistic nature for uncertain
vector-valued problems where the objective values are constrained
by intervals. Another idea was presented by Li et al. [40]
who develop solution procedures that compare the performance of
solutions regarding optimality and its robustness. They propose
a biobjective optimization problem, one of the objective functions
being a fitness value and the other one containing a robustness
index. The considered method in [40] may be beneficial for obtaining
solutions that satisfy certain optimality and robustness criteria, and
a decision maker may choose depending on his preferences toward
uncertainty. Another approach was presented by Deb and Gupta
[9] who used an idea by Branke [8], and defined robustness as
a kind of sensitivity against perturbations in the decision-space.
Branke [8] proposes to replace the objective function f by a mean function which maps any point x to the average value of f in a pre-defined neighborhood of x. A minimizer of this mean function is then more
robust in the sense that the function values in its neighborhood
do not change too much. Based on this idea for single objective
optimization problems, Deb and Gupta [9] introduced two concepts
of robustness for vector-valued optimization problems. The first one
replaces all objective functions by their mean functions. Efficient
solutions to the resulting optimization problem are called robust
solutions of the original problem. Deb and Gupta’s second concept
minimizes the original objective functions but adds constraints to
the problem that restrict the variation between the original objective
functions and a perturbed function value (that can be chosen as their
mean functions) to a pre-defined limit. This approach proves to be
more pragmatic and enables the user to control the desired level of
robustness.
Barrico and Antunes [2, 3] consider a multiobjective optimization
problem with perturbations in the decision space. In [2, 3], a solution
is called robust if small perturbations in the decision-space only yield
small perturbations in the objective-space. The authors in [2, 3]
define a degree of robustness that allows the decision maker to specify
the level of robustness of the solution. Specifically, the user is able to
determine the size of the neighborhood that the solution belongs to.
Furthermore, Barrico and Antunes [1] extend the concept of degree of
robustness to the space of the objective function coefficients, where
perturbations are treated in a similar manner as in [2, 3]. For more
results on this line of research, compare [15, 21].
The first scenario-based approach to uncertain vector-valued
problems was introduced by Kuroiwa and Lee [37] who directly
transferred the main idea of scalar robust optimization, meaning
minimizing the worst-case objective function, to a multicriteria
setting. For (P (U)), i.e., for the family of deterministic vector
optimization problems (P (ξ)), and Yf = Rk, Kuroiwa and Lee [37]
introduce a multiobjective problem
h(x) → inf_{x∈X}    (4.1)
with
h(x) := ( sup_{ξ∈U1} f1(x, ξ), . . . , sup_{ξ∈Uk} fk(x, ξ) )T,
fi : Rn × Ui → R, i = 1, . . . , k, and X ⊆ Rn. The authors in [37]
call (weakly) minimal elements of the set ∪x∈Xh(x) (weakly) robust
efficient. The special case for convex functions fi, i = 1, . . . , k,
is studied in [38]. This approach is a rather direct transferral
from scalar robust optimization. For some cases, this concept
may, however, not be sufficient to describe robust solutions of
multiobjective optimization problems, as the point h(x) may never
be attained (if one considers the sets fU (x) given in (2.2)), but
solutions are compared w.r.t. the point h(x). Problem (4.1) still
is beneficial and was recently used by Ehrgott et al. [11] to obtain
solutions that they call robust in a slightly different setting. The
authors in [11] generalize the above approach from Kuroiwa and Lee
[37] by considering the whole set that is obtained when analyzing
a possible solution x. They call a solution x0 robust efficient if its
set fU (x0) is not dominated by any other set fU (x). The authors
in [11] observe that (weakly) minimal elements of the set ∪x∈Xh(x)
(related to the above problem (4.1)) are also (weakly) robust efficient
solutions within their definition of robust efficiency, and the reverse
implication holds under the requirement that the uncertainty set
takes the form U := U1 × . . . × Uk, i.e., if the uncertainties are
independent of each other. The robustness concept introduced in
[11] implicitly uses a set order relation to compare solution sets.
We will show in this section that this approach is closely connected
to set optimization, because the objective map considered here is
set-valued.
In the literature, two main ways of treating a set-valued
optimization problem are reported: Using a vector concept, one
wishes to obtain single elements that satisfy a certain minimality
condition (possibly similar to Pareto minimality) for the union of all
sets in the objective space. Since having one element that is optimal
in some sense does not reveal any information about the performance
of the remaining elements in that particular solution set, it can be
argued that this approach may not be useful enough in practical
applications. The second concept deals with obtaining solution sets
out of all possible sets in the objective space. The authors in [11]
use the latter approach to define robust solutions to an uncertain
multiobjective optimization problem.
In this section, we consider the family of deterministic vector
optimization problems (P (ξ)), denoted as (P (U)). Let A be defined
as in (2.1) and let C be a convex cone in the objective space Yf ,
which is assumed to be a real linear topological space.
Definition 4.1. A solution x0 ∈ A of (P(U)) is called robust if there is no x ∈ A \ {x0} such that fU(x) ⪯uC fU(x0), which is equivalent to
∄ x ∈ A \ {x0} : fU(x) ⊆ fU(x0) − C.
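For finite uncertainty sets and C = Rk+, Definition 4.1 can be checked directly, as the following minimal Python sketch (our own illustration with toy outcome sets) shows: x0 is robust precisely when no other candidate has all of its outcome vectors componentwise dominated by outcome vectors of x0.

# Hedged sketch: robustness in the sense of Definition 4.1 for finite scenario sets, C = R^k_+.
import numpy as np

def dominated_setwise(FX, FX0):
    """fU(x) ⊆ fU(x0) - R^k_+ for finite outcome sets stored as arrays of row vectors."""
    return all(any(np.all(y <= y0) for y0 in FX0) for y in FX)

def is_robust(candidate, outcome_sets):
    """outcome_sets maps each feasible x to its (finite) outcome set fU(x)."""
    FX0 = outcome_sets[candidate]
    return not any(dominated_setwise(FX, FX0)
                   for x, FX in outcome_sets.items() if x != candidate)

outcomes = {                        # two objectives, two scenarios per candidate (toy data)
    "x0": np.array([[2.0, 2.0], [3.0, 1.0]]),
    "x1": np.array([[1.0, 1.5], [2.5, 0.5]]),
}
print(is_robust("x0", outcomes))    # False: every outcome of x1 is dominated by some outcome of x0
print(is_robust("x1", outcomes))    # True in this toy example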
For the special case Yf = Rk, X = Rn, C = Rk+ and |U| = 1,
i.e., in the deterministic multiobjective case, Definition 4.1 coincides
with the definition of strict minimality (compare [10, Definition
2.24]). Accordingly, one can define weaker notions of robustness,
as it is done in [27, Definition 6], which we skip here for the sake
of brevity. Moreover, if Yf = R, X = Rn, C = R+, then the above
notion of robustness reduces to the classical one given in (RC) for
unique solutions, meaning that x0∈ A is a robust solution of (P (U))
if and only if for all x ∈ A \ {x0}, it holds that supξ∈U f(x, ξ) >
supξ∈U f(x0, ξ). The following scalarization result gives a sufficient
condition of robustness for a feasible element x ∈ A.
Theorem 4.1 ([27, Theorem 1]). Let y∗ ∈ C∗ \ {0} be given. If for
some x0 ∈ A
sup_{ξ∈U} y∗(f(x0, ξ)) < sup_{ξ∈U} y∗(f(x, ξ)),  ∀x ∈ A \ {x0}    (4.2)
holds true, then x0 is robust for (P (U)).
Proof. Suppose to the contrary that x0 is not robust. Then there
exists an element x ∈ A \ {x0} such that
fU (x) ⊆ fU (x0)− C.
This implies
∀ ξ ∈ U ∃ η ∈ U : f(x, ξ) ∈ f(x0, η)− C.
Choose now y∗ ∈ C∗ \ {0}. This implies
∀ ξ ∈ U ∃ η ∈ U : y∗(f(x, ξ)) ≤ y∗(f(x0, η))
=⇒ ∀ ξ ∈ U : y∗(f(x, ξ)) ≤ sup_{η∈U} y∗(f(x0, η))
=⇒ sup_{ξ∈U} y∗(f(x, ξ)) ≤ sup_{η∈U} y∗(f(x0, η)).
But this is a contradiction to (4.2).
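The sufficient condition of Theorem 4.1 also lends itself to a quick numerical check when U is finite and C = Rk+. In the following sketch (our own toy example), y∗ is a nonnegative weight vector in C∗ \ {0}, and a candidate that strictly minimizes the worst-case weighted objective is certified as robust.

# Hedged sketch: the scalarized sufficient condition (4.2) with a finite uncertainty set.
import numpy as np

y_star = np.array([0.5, 0.5])               # y* ∈ C* \ {0} for C = R^2_+ (an arbitrary choice)

outcomes = {                                 # fU(x) for each candidate, rows = scenarios (toy data)
    "x0": np.array([[2.0, 2.0], [3.0, 1.0]]),
    "x1": np.array([[1.0, 4.0], [2.5, 3.5]]),
    "x2": np.array([[3.0, 2.5], [2.0, 4.0]]),
}

worst = {x: (FX @ y_star).max() for x, FX in outcomes.items()}   # sup_ξ y*(f(x, ξ))
print(worst)                                 # {'x0': 2.0, 'x1': 3.0, 'x2': 3.0}
best = min(worst, key=worst.get)
strict = all(worst[best] < v for x, v in worst.items() if x != best)
print(best, strict)                          # x0 strictly minimizes, hence x0 is robust by Theorem 4.1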
Under convexity and closedness assumptions of the set fU (x0)−C, it is possible to derive the converse statement of the implication
given in Theorem 4.1. The following result is a particular case of
[32, Theorem 3.2], where the cone C is not fixed, but depends on the
decision variable. The following theorem requires the objective space
Yf to be locally convex, where local convexity of a real topological
linear space is given in [28, Definition 1.33].
Theorem 4.2. Assume that the objective space Yf is locally convex.
Suppose that x0 is robust and that the set fU (x0)− C is closed and
convex. Then there does not exist an element x ∈ A \ {x0} such that
for every y∗ ∈ C∗
sup_{ξ∈U} y∗(f(x, ξ)) ≤ sup_{ξ∈U} y∗(f(x0, ξ)).
Proof. Assume that x0 ∈ A is robust. This is equivalent to
∄ x ∈ A \ {x0} : fU(x) ⊆ fU(x0) − C
⇐⇒ ∀x ∈ A \ {x0} : fU(x) ⊈ fU(x0) − C
⇐⇒ ∀x ∈ A \ {x0} ∃ ξx ∈ U : f(x, ξx) ∉ fU(x0) − C.
Since fU (x0)−C is closed and convex, we use a classical separation
argument (see, for instance, [28, Theorem 3.18]) such that we get
∀x ∈ A \ {x0} ∃ ξx ∈ U, ∃ y∗ ∈ Yf∗ \ {0}, α ∈ R :
y∗(f(x, ξx)) > α ≥ y∗(y)  ∀y ∈ fU(x0) − C,    (4.3)
and this yields
∀x ∈ A \ {x0} ∃ y∗ ∈ Yf∗ \ {0}, α ∈ R :
sup_{ξ∈U} y∗(f(x, ξ)) > α ≥ sup_{y∈fU(x0)−C} y∗(y).
To show that y∗ ∈ C∗, suppose that y∗ /∈ C∗, which means that
there is c ∈ C such that y∗(c) < 0. With (4.3), we obtain for any
ξ ∈ U, c ∈ C and some λ ≥ 0
α ≥ y∗(f(x0, ξ) − λc) = y∗(f(x0, ξ)) − λ y∗(c) → +∞ as λ → +∞,
a contradiction. Furthermore,
sup_{y∈fU(x0)−C} y∗(y) = sup_{ξ∈U} y∗(f(x0, ξ)) + sup_{c∈−C} y∗(c) = sup_{ξ∈U} y∗(f(x0, ξ)).
Altogether, we conclude with
∀x ∈ A \ {x0} ∃ y∗ ∈ C∗ : sup_{ξ∈U} y∗(f(x, ξ)) > sup_{ξ∈U} y∗(f(x0, ξ)),
which is equivalent to
∄ x ∈ A \ {x0} ∀y∗ ∈ C∗ : sup_{ξ∈U} y∗(f(x, ξ)) ≤ sup_{ξ∈U} y∗(f(x0, ξ)),
which completes the proof.
Remark 4.1. It is interesting to mention that Ehrgott et al. [11]
propose a vectorization approach for computing robust solutions for
the special case Yf = Rk, X = Rn, C = Rk+, i.e., they reduce
the problem (P (U)) to the vector optimization problem (4.1) (called
objective-wise worst case).
In [31], it is shown that the nonlinear scalarizing functional zB,k
(see (2.4)) can be used to characterize several set order relations.
From [31, Theorem 3.3], we obtain the following result.
Corollary 4.1. Let x0 be given and suppose that there exists some k ∈ C \ {0} such that for all x ∈ A \ {x0}, inf_{y0∈fU(x0)} zC,k(y − y0) is attained for all y ∈ fU(x). Then x0 is robust for (P(U)) if and
only if
∄ x ∈ A \ {x0} : sup_{y∈fU(x)} inf_{y0∈fU(x0)} zC,k(y − y0) ≤ 0.
5. Conclusions
This paper gives an overview of robust approaches to uncertain scalar and vector-valued optimization problems. In
robust optimization, one traditionally hedges against perturbations
in the worst-case scenarios. Robust solutions are then immunized
against perturbations, and thus this approach is applicable if a
decision maker acts risk averse. In uncertain vector optimization,
this situation can be modeled by using the upper set less order
relation. This paper explores this concept and gives some
scalarization results. An interesting topic that is presently given
a lot of attention in the literature (see [12, 13, 32]) is a deeper
analysis of the ordering structure. Moreover, based on the
proposed scalarization techniques, it is now possible to derive
efficient algorithms for finding robust solutions of uncertain vector
optimization problems.
Acknowledgements
The author expresses her gratitude to the two anonymous
reviewers for their helpful suggestions which helped to improve the
manuscript significantly.
References
[1] Barrico C., and Antunes C.H. (2006). Robustness analysis
in evolutionary multiobjective optimization - with a case
study in electrical distribution networks. Presented at the II
European-Latin-American Workshop on Engineering Systems
(SELASI II), Porto, Portugal.
[2] Barrico C., and Antunes C.H. (2006). Robustness analysis
in multiobjective optimization using a degree of robustness
concept. In IEEE Congress on Evolutionary Computation (CEC
2006), pages 1887–1892. IEEE Computer Society.
[3] Barrico C., and Antunes C.H. (2006). A new approach to
robustness analysis in multi-objective optimization. Proceedings
of the 7th International Conference on Multi-Objective
Programming and Goal Programming (MOPGP), Loire Valley
(Tours), France.
[4] Ben-Tal A., El Ghaoui L., and Nemirovski A. (2009). Robust
Optimization, Princeton University Press, Princeton.
[5] Ben-Tal A., and Nemirovski A. (2000). Robust solutions of
linear programming problems contaminated with uncertain
data, Math. Program., 88, 411–424.
[6] Ben-Tal A., and Nemirovski A. (1998). Robust convex
optimization, Math. Oper. Res., 23(4), 769–805.
[7] Bertsimas D., and Sim, M. (2004). The price of robustness,
Oper. Res., 52(1), 35–53.
[8] Branke J. (1998). Creating robust solutions by means of
evolutionary algorithms. In E.A. Eiben, T. Back, M. Schenauer,
and H.-P. Schwefel, editors, Parallel Problem Solving from
Nature – PPSNV, volume 1498 of Lecture Notes in Computer
Science, pages 119–128. Springer, Berlin, Heidelberg.
[9] Deb K., and Gupta H. (2006). Introducing robustness in
multiobjective optimization, Evol. Comput., 14, 463–494.
[10] Ehrgott M. (2005). Multicriteria Optimization, Springer, New
York.
[11] Ehrgott M., Ide J., and Schobel A. (2014). Minmax robustness
for multi-objective optimization problems, European J. Oper.
Res., 239(1), 17–31.
[12] Eichfelder G., and Pilecka, M. (2016). Set approach for set
optimization with variable ordering structures Part I: Set
relations and relationship to vector approach, J. Optim. Theory
Appl., 171(3), 931–946.
[13] Eichfelder G., and Pilecka M. (2016). Set approach for
set optimization with variable ordering structures Part II:
Scalarization approaches, J. Optim. Theory Appl., 171(3),
947–963.
[14] El Ghaoui L., and Lebret H. (1997). Robust solutions to
least-squares problems with uncertain data, SIAM J. Matrix
Anal. Appl., 18, 1034–1064.
[15] Erfani T., and Utyuzhnikov S. (2012). Control of robust design
in multiobjective optimization under uncertainties, Struct.
Multidiscip. Optim., 45, 247–256.
[16] Fischetti M., Salvagnin D., and Zanette A. (2009). Fast
approaches to improve the robustness of a railway timetable,
Transportation Sci., 43(3), 321–335.
[17] Gass S., and Saaty T. (1955). The computational algorithm for the parametric objective function, Naval Res. Logistics Quarterly, 2, 39–45.
[18] Gerstewitz (Tammer) Chr. (1983). Nichtkonvexe Dualitat in der
Vektoroptimierung, Wiss. Zeitschr. TH Leuna-Merseburg, 25,
357–364.
[19] Gerth (Tammer) Chr., and Weidner P. (1990). Nonconvex
separation theorems and some applications in vector
optimization, J. Optim. Theory Appl., 67, 297–320.
[20] Gopfert A., Riahi H., Tammer Chr., and Zalinescu C. (2003).
Variational Methods in Partially Ordered Spaces, CMS Books
in Mathematics, Springer, New York.
[21] Gunawan S., and Azarm S. (2005). Multi-objective robust
optimization using a sensitivity region concept, Struct.
Multidiscip. Optim., 29, 50–60.
[22] Gutierrez C., Novo V., Rodenas-Pedregosa J.L., and Tanaka
T. (2016). Nonconvex separation functional in linear spaces
with applications to vector equilibria, SIAM J. Optim., 26,
2677–2695.
[23] Haimes Y., Lasdon L.S., and Wismer D.A. (1971). On a
bicriterion formulation of the problems of integrated system
identification and system optimization, IEEE Trans. Syst.,
Man, Cybern., Syst., 1, 296–297.
[24] Hughes E.J. (2001). Evolutionary multi-objective ranking with
uncertainty and noise. Proceedings of the First International
Conference on Evolutionary Multi-Criterion Optimization
(EMO-2001), 329–343.
[25] Iancu D.A., and Trichakis N. (2014). Pareto efficiency in robust
optimization, Manag. Sci., 60, 130–147.
[26] Ide J., and Kobis E. (2014). Concepts of efficiency for uncertain
multi-objective optimization problems based on set order
relations, Math. Method Oper. Res., 80, 99–127.
[27] Ide J., Kobis E., Kuroiwa D., Schobel A., and Tammer
Chr. (2014). The relationship between multicriteria robustness
concepts and set-valued optimization, Fixed Point Theory Appl.
DOI: 10.1186/1687-1812-2014-83.
[28] Jahn J. (2011). Vector Optimization - Introduction, Theory, and
Extensions, Springer, Berlin, Heidelberg.
[29] Klamroth K., Kobis E., Schobel A., and Tammer Chr. (2013).
A unified approach for different concepts of robustness and
stochastic programming via nonlinear scalarizing functionals,
Optimization, 62(5), 649–671.
[30] Klamroth K., Kobis E., Schobel A., and Tammer Chr. (2017). A
unified approach to uncertain optimization, European J. Oper.
Res., 260, 403–420.
[31] Kobis E., and Kobis M.A. (2016). Treatment of set order
relations by means of a nonlinear scalarization functional: A
full characterization, Optimization, 65(10), 1805–1827.
[32] Kobis E., and Tammer Chr. (2017). Robust vector optimization
with a variable domination structure, Carpathian J. Math.,
33(3), 343-351.
[33] Kouvelis P., and Sayin S. (2006). Algorithm robust for the
bicriteria discrete optimization problem, Ann. Oper. Res., 147,
71–85.
[34] Krasnosel’skiı M.A. (1964). Positive solutions of operator
equations. Translated from the Russian by Richard E. Flaherty;
edited by Leo F. Boron. P. Noordhoff Ltd. Groningen.
[35] Kuroiwa D. (1999). Some duality theorems of set-valued
optimization with natural criteria. In Proceedings of the
International Conference on Nonlinear Analysis and Convex
Analysis. World Scientific, 221–228.
[36] Kuroiwa D. (1997). The natural criteria in set-valued
optimization, Surikaisekikenkyusho Kokyuroku, 1031:85–90,
Research on nonlinear analysis and convex analysis, Kyoto.
[37] Kuroiwa D., and Lee G. M. (2012). On robust multiobjective
optimization, Vietnam J. Math., 40(2&3), 305–317
[38] Kuroiwa D., and Lee G. M. (2014). On robust convex
multiobjective optimization, J. Nonlinear Convex Anal., 15,
1125–1136.
[39] Kuroiwa D., Tanaka T., and Duc Ha T.X. (1997). On
cone convexity of set-valued maps, Nonlinear Anal., 30(3),
1487–1496.
[40] Li M., Azarm S., and Aute V. (2005). A multi-objective
genetic algorithm for robust design optimization. In Proceedings
of the Genetic and Evolutionary Computation Conference
(GECCO’05), 771–778.
[41] Pascoletti A., and Serafini P. (1984). Scalarizing vector
optimization problems, J. Optim. Theory Appl., 42, 499–524.
[42] Rubinov A.M. (1977). Sublinear operators and their
applications, Uspehi Mat. Nauk, 32(4(196)), 113–174.
[43] Sayin S., and Kouvelis P. (2005). The multiobjective
discrete optimization problem: A weighted min-max two-stage
optimization approach and a bicriteria algorithm, Manag. Sci.,
51, 1572–1581.
[44] Schobel A. (2014). Generalized light robustness and the
trade-off between robustness and nominal quality, Math.
Methods Oper. Res., 80(2), 161–191.
[45] Soyster A.L. (1973). Convex programming with set-inclusive
constraints and applications to inexact linear programming,
Oper. Res., 21, 1154–1157.
[46] Steuer R.E., and Choo E.U. (1983). An interactive weighted
Tchebycheff procedure for multiple objective programming,
Math. Program., 26, 326–344.
[47] Teich J. (2001). Pareto-front exploration with uncertain
objectives. In Proceedings of the First International Conference
on Evolutionary Multi-Criterion Optimization (EMO-2001),
314–328.
[48] Weidner P. (1990). Ein Trennungskonzept und seine
Anwendung auf Vektoroptimierungsverfahren. Habilitation
thesis, Martin-Luther-University Halle-Wittenberg.
[49] Wierzbicki A.P. (1986). On the completeness and
constructiveness of parametric characterizations to vector
optimization problems, OR Spectrum, 8, 73–87.
[50] Winkler K. (2003). Aspekte Mehrkriterieller Optimierung
C(T )-wertiger Abbildungen. Dissertation thesis,
Martin-Luther-University Halle-Wittenberg.
[51] Yan Y, Meng Q., Wang S., and Guo X. (2012). Robust
optimization model of schedule design for a fixed bus route,
Transp. Res. Part C: Emerg. Technol., 25, 113–121.
[52] Zadeh L. (1963). Optimality and non-scalar-valued performance
criteria, IEEE Trans. Automat. Control, 8, 59–60.
About the author
Elisabeth Köbis is a postdoctoral researcher at the Institute of Mathematics in Halle, and her research interests include robust optimization, vector optimization and set optimization.
Boletín de Estadística e Investigación Operativa, Vol. 33, No. 3, Noviembre 2017, pp. 258-275

Estadística Oficial
Applying the Generic Statistical Business Process
Model (GSBPM) to the Business Register; the
Spanish experience
Luis Esteban Barbado Miguel
Department of Methodology of the statistical production
National Statistical Institute
Abstract
The Generic Statistical Business Process Model (GSBPM)
is a reference framework to describe the statistical processes
in a coherent way, making them comparable within and
between different Organizations. The application of the
GSBPM to the management of the Spanish Business Register
was carried out by the NSI during 2015. This paper provides
a first assessment of the work done, focusing on the selected
approach for the description of the GSBPM phases and the
criteria adopted for a proper assignment of the core parts of
our business process. The main restrictions found and the
potential value added of this exercise are also pointed out.
Keywords: Business Register, business process, GSBPM,
BPMN, interoperability.
© 2017 SEIO
1. About the GSBPM
The Generic Statistical Business Process Model (GSBPM) is a
reference framework developed by the United Nations Economic
Commission for Europe (UNECE) and the conference of European
Statisticians Steering Group on Statistical Metadata. Its basic
aim is to define and describe the statistical processes in a
coherent way, making them comparable within and between
different Organizations. This tool provides a standard framework
and harmonised terminology to help statistical organizations to
modernise their production processes as well as to share methods
and components.
The GSBPM is closely connected to data quality management,
providing a framework for its assessment. It comprises four levels:
Level 0 (the statistical business process), Level 1 (the nine phases of
the statistical business process), Level 2 (the sub-processes within
each phase) and Level 3 (a description of those sub-processes).
Levels 1 and 2 are illustrated in Figure 1. Although this standard
was conceived for the description of any statistical operation, the
production of a National Business Register has its own specificities.
Building on the clear benefits offered by this framework, some
adaptations to the GSBPM structure have been made for this specific
exercise.
The National Statistical Institute (NSI) of Spain has adopted
this standard as a core element for the implementation of the Quality
Assurance Framework of the European Statistical System. In fact,
the national standard starts from the GSBPM structure, to which a
more detailed level of information has been added. For the
above-mentioned reasons, this additional level has not been used in
the description of the NBR.
Figure 1: Levels 1 and 2 of the GSBPM
2. GSBPM and Business Registers; general
context
The management of National Business Registers (NBRs) for
statistical purposes is a strategic action, usually incorporated within
the official plans of the Statistical Offices. The key role of
these infrastructural elements in data production, the increasing
complexity of the related data architecture and the need for a
continuous adaptation to international standards and methodologies
are challenging issues undertaken by the daily work of the NBR
teams.
Since 1992, the DIRCE (the name of the Spanish Business
Register) has been the central reference as a sampling frame for
official business surveys, which are carried out by the NSI and other
Government Departments with statistical competences. In the last
year, more than 400,000 units were provided and investigated through
more than 20 surveys.
The NSI of Spain is currently working under an explicit mandate
of its Board of Directors, which is encouraging a progressive use
of the GSBPM in all statistical domains identified in the national
statistics plan.
Communication 404/2009, generally referred to as the Vision
document, proposes several strategic principles for future statistics.
Among them, the need for a re-engineering of the current production
methods is particularly relevant, moving from a system based on
parallel processes to a more integrated production model. In this
line, Eurostat launched the 4-year initiative European System of
Business Registers (ESBR, 2013-2017), with the aim to improve the
relevance of these tools and reinforce their role as the backbone for
the European Statistical System.
The Euro Groups Register (EGR) is the Statistical Register
of the European Communities on multinational enterprise groups.
The EGR is the authentic core of the ESBR system and includes
information of the most influential multinationals operating in the
EU and EFTA countries. It is built and maintained under a strict
collaborative model involving all relevant stakeholders, mainly NSIs
and Central Banks.
In the latest developments of the afore-mentioned initiative,
a specific Business Architecture and its materialization through
an Interoperability Frame will be available for NBRs and EGR.
In this context, the application of the GSBPM to Business
Registers becomes highly relevant because it will favour a mutual
understanding of national procedures, the circulation of good
practices and the identification of areas where efficiency can be
gained. A wide application of this standard will also make future
benchmarking tasks possible as well as the definition of a minimum
set of interoperability requirements.
This descriptive process also needs to cover interactions with
the EGR production, referring to the stages of the process where
national data extractions and flows between the NBR and the EGR
take place. Experience in applying this standard to the Business
Register domain is still quite recent. In the scope of the ESBR
project, Eurostat launched a grant with this purpose and the NSI
of Spain participated in this action. A general assessment of this
innovative experience is provided in the following paragraphs.
3. Evaluation of the work done
Preliminary activities focused on building capacity for using
GSBPM. From 13 to 17 of October 2014, a training course was set
up by national experts on standards and methodology. The DIRCE
team participated in this cooperative action aiming to create basic
knowledge. The relevant parts of our business process were identified
and a proposal of allocation in the GSBPM structure was discussed.
Regarding the format adopted for this exercise, a combination of
human and modelling language has been used. Among the possible
options, the Business Process Modelling Notation version 2 was
selected. The most important parts of the DIRCE business process
have been graphically represented with this notation.
BPMN v2 offers possibilities to create several pools used for the
representation of the donor Organizations and the actions carried out
by the DIRCE Unit during the whole maintenance cycle. When an
input source is received, it is subject to different kinds of processing
and data quality programs. In order to provide a structured
representation of all actions, three different internal areas have been
considered:
� Source, where the input sources are received and evaluated.
� Intermediate, where the sources are edited and transformed
into statistical databases.
� Final, where the integration and maintenance of the DIRCE
is carried out.
The Source and intermediate areas are closely related with the
Collect phase. The Final area is highly relevant in the Process phase.
The main results acquired through this experience will be
examined below in detail. The selected approach for the different
GSBPM phases, the identification of the main register processes,
their allocation to the standardized structure and the level of
granularity adopted will be described. In addition, some gaps in
relevance and restrictions found in this exercise will also be pointed
out.
3.1. Specify Needs, Design and Build phases
Management of NBRs has a long tradition in the majority of
NSIs. User needs, output objectives and methods of production
are continuously changing and being adapted to new emerging
challenges. This context has a clear impact on the design and build
of the respective data models and derived uses, which need to be
continuously aligned with the new requirements.
For this reason, the approach followed for the description of these
phases has been based on a historical dimension. This criterion
makes it easier to understand the current state and the
fundamentals of our national model. The methodological basis
adopted for the DIRCE management is known as the PIDE Project
(Proyecto de Integración de Directorios Económicos, Figure 2). This
initiative started at the end of the 1980s and was formulated under
a modular approach, involving several components developed in
successive steps and with different contributions to the maintenance
of the DIRCE.
The PIDE project is always considered open, because it is based
on a continuous evaluation of the current and potential statistical
needs, the development of specific actions to fulfill those needs and
their definitive incorporation into the production model.
The documentation of the Design and Build phases has been
conceived as a single part, focusing on the main milestones
consolidated under a time-based perspective. From the beginning,
both steps of the business process were jointly undertaken with a
large degree of overlap. In addition, the description of the related
sub-processes is not especially significant for our business case.
For the remaining phases, a static dimension has been adopted.
All actions refer to the most recent cycles of maintenance for DIRCE
and interactions with the EGR. More specifically, the description
of the PIDE components is allocated in the Collect phase and all
successive integration procedures are described in the Process phase.
Figure 2: High level perspective of the PIDE project
3.2. Collect phase
This part is highly relevant for our business case and can easily
be generalized to the vast majority of Statistical Offices. It mainly
involves an analysis of the suitability of the data sources used for
the management of the NBR. It also includes all the actions needed
to ensure their successful and stable reception at the NSI.
The production of the DIRCE is cyclical with annual periodicity
and is based on an intense use of data sources. Due to the diversity
of the selected sources, the institutional and operative actions for
their acquisition and processing were arranged in different modules,
directly linked to the typology of sources (AT = Tax files, SS = Social
Security files, PR = Private files, ...) and making up the dynamics of
the PIDE project.
When a source is received, a data quality program is applied
according to its features and its specific role in the business process.
Some sources are core elements for the detection of new units or the
removal of previously existing ones, while other sources are relevant
for the maintenance of specific variables.
According to these parameters, the information provided in this
phase basically refers to:
- The inter-departmental context created to allow stable
access to input data (4.2 set up collection). Initially,
institutional actions were addressed to the Tax and
Social Security Authorities and formally established in the
corresponding collaboration agreements. In other cases, specific
service contracts are available for the acquisition of private databases.
- The channels adopted for the reception of the data sources
(4.2 set up collection). Diverse procedures are described, the
most relevant being several IT tools that meet the security
requirements for data interchange. In other cases, direct
download from official websites is applied.
- The list of input sources used in each production cycle (4.3/4.4
run and finalize collection), classified by nature, in line with the
components of the PIDE project.
All input sources are numbered, this system being critical for the
data integration and the DIRCE maintenance. A set of structured
information is given for each database, including:
- Basic metadata: denomination, Managing Organization,
reception date, timetable for data processing, elementary
observation unit and data structure.
- Validation rules.
- Editing and micro-validation processes.
- Transformation processes and adoption of statistical standards.
- Production of statistical databases.
Validation rules are designed to support a decision about the
source: acceptance or rejection. If rejected, the source is returned
to the Managing Organization together with a report of the errors
to be corrected. If accepted, a set of specific procedures is applied.
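As a purely illustrative, hedged sketch (field names, thresholds and the report format are hypothetical assumptions, not the actual NSI implementation), an acceptance/rejection rule of this kind could be expressed as follows:

```python
# Hypothetical sketch of a source acceptance/rejection rule; field names,
# the threshold and the report structure are illustrative assumptions only.
from dataclasses import dataclass, field

@dataclass
class ValidationReport:
    accepted: bool
    errors: list = field(default_factory=list)

def validate_source(records, expected_fields, max_missing_id_rate=0.01):
    """Accept the source only if mandatory fields are present and the share
    of records without a national ID stays below a threshold."""
    errors = []
    missing_fields = (expected_fields - set(records[0].keys())) if records else expected_fields
    if missing_fields:
        errors.append(f"missing fields: {sorted(missing_fields)}")
    missing_id = sum(1 for r in records if not r.get("national_id"))
    if records and missing_id / len(records) > max_missing_id_rate:
        errors.append(f"too many records without national ID: {missing_id}")
    return ValidationReport(accepted=not errors, errors=errors)

# Toy example: a two-record extract in which one record lacks the national ID.
sample = [{"national_id": "A123", "turnover": 10.5},
          {"national_id": "", "turnover": 3.2}]
print(validate_source(sample, {"national_id", "turnover"}))
```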
The lack of adaptation to statistical standards or the low quality
of particular variables can be pointed out as classic restrictions of
administrative data. These problems must be detected in the
preliminary design of the project.
Afterwards, the solution of these types of problems is
normally the result of close cooperation between statisticians
and administrative managers, within the scope of the institutional
context created. This is normally materialized by means of specific
editing, micro-validation or transformation processes.
The GSBPM offers possibilities for assigning this information in
the next phase. However, in order to facilitate the understanding of
the complete production chain, it has been decided to link all these
register processes to the input sources in the collect phase.
3.3. Process phase
This phase is the core part of our business process. All the statistical
databases produced in the previous phase are used as input for the
integration processes, the generation of the updated statistical units
forming the data model and all the related characteristics linked to
them. In summary, this is where the updating of the DIRCE takes
place.
The integration procedures (5.1 integrate data) are mainly
carried out by record linkage routines based on the universal presence
of unique national IDs (a minimal linkage sketch is shown after the
list of steps below). During this action, several frozen DIRCE frames
are generated, with different levels of quality and different uses. The
features of the frames of reference year t are described as a timeline
over the year t+1, as the main result of the following iterative steps:
1. INT 1 1 produces a preliminary updated version of enterprises,
based on the new data sources.
2. INT 1 2 produces a second version of enterprises and local
units fully consistent with the year t-1. The data quality
is higher due to the incorporation of validated statistical
information. This frame is used for sample selection in the
STS domain and official dissemination of results.
3. INT 1 3 produces the definitive version of enterprises and
local units incorporating the last updating of basic variables.
In addition, specialized databases containing information on
monetary characteristics are received during the last quarter
of the year and this information is also incorporated.
4. INT 2 produces an updated version of enterprise groups based
on private, tax and statistical sources.
5. INT F produces the definitive updated system, integrating
the results obtained in INT 1 3 and INT 2. Three levels of
information formed by enterprises, local units and enterprise
groups are available and fully consistent.
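As announced above, the following is a minimal, hypothetical sketch of ID-based record linkage between two frames; the column names and data are illustrative assumptions and do not reflect the real DIRCE data model:

```python
# Hedged sketch of ID-based record linkage between two annual frames;
# columns and values are toy examples, not the DIRCE model.
import pandas as pd

frame_t_minus_1 = pd.DataFrame(
    {"national_id": ["A1", "A2", "A3"], "employees": [10, 5, 120]}
)
new_source = pd.DataFrame(
    {"national_id": ["A2", "A3", "A4"], "employees_new": [6, 118, 2]}
)

# Outer join on the national ID: matched units can be updated, units only
# present in the new source are candidate births, units only present in the
# old frame are candidate deaths.
linked = frame_t_minus_1.merge(new_source, on="national_id", how="outer", indicator=True)
births = linked[linked["_merge"] == "right_only"]["national_id"].tolist()
deaths = linked[linked["_merge"] == "left_only"]["national_id"].tolist()
print(linked)
print("candidate births:", births, "candidate deaths:", deaths)
```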
In the last part of sub-process 5.1, the interactions with the EGR
cycle are described. Due to the complexity of the EGR model
and the need to give efficient answers from the NBRs, the uses
of the frozen frames are described according to the different EGR
interchange flows (extraction and delivery of resident/non-resident
units to the EGR Identification Service, extraction and delivery of
the core files on legal units and control relationships for the EGR
system, extraction and delivery of the enterprise data to the EGR
system, repair action and data exchange on Ultimate Controlling
Units). Figure 3 shows these interactions as a timeline.
During the integration procedures, a definitive classification of all
statistical units (5.2 classify and code) is also provided. For the core
classification variables, a predefined set of decision rules is described,
according to their presence in data sources, and their reliability.
New variables are also derived and systematically maintained
based on information available or specific data sources (5.5 derive
new variables and units). Ad hoc estimation procedures or
deterministic rules are described for the delimitation of the number
of persons employed, the institutional sector code or monetary
variables like turnover, import and export.
The main restrictions were found in the documentation of
review, validation, edit and imputation as separate sub-processes.
Figure 3: DIRCE frames and interactions with the EGR cycle
As previously mentioned, all these practices are undertaken from
the beginning of the cycle and they are allocated in the related
sub-processes in order to facilitate the understanding of the whole
production chain.
3.4. Analyse phase
The increasing demand for better and more detailed business
statistics has put the focus on the NBRs and their key role in the
statistical production chain. Originally, these tools were conceived
as a vital component of statistical infrastructure, supporting data
collection, monitoring the response burden and giving grossing up
indicators for the production of aggregates. All these tasks, closely
related with the use of NBR as the survey frame, will be jointly
considered in the application of the GSBPM to business surveys.
In recent decades, user demands have diversified and the role
of the NBR as a source of data production has become more and
more relevant. This aspect has been the approach adopted for the
documentation of this phase and the following one. In the Spanish
NSI, the DIRCE is the key data source for the statistical analyses of
business activity from both a static and dynamic perspective. Two
main references linked to the DIRCE macro-data are documented:
- Statistical Analysis of the DIRCE. A standard publication of
results directly obtained from the updated frame.
- Harmonized Business Demography. A product specifically
elaborated to cover the national needs in this domain.
Its production is fully consistent with the OECD-Eurostat
methodology.
Both statistical operations incorporate the same metadata: type
of operation, data source, periodicity, starting and ending dates
of the processes, press release, presence in the National Statistical Plan /
Statistical Operations Inventory and methodological basis.
3.5. Dissemination phase
Dissemination of NBRs can take place at the micro-data or
macro-data level. The first option is normally constrained by
the confidentiality provisions of the national legal framework.
This is the case in Spain, where access to the DIRCE microdata
is restricted to the national authorities in charge of official statistics.
Dissemination of macro-data refers to the statistical operations
previously mentioned. The main operational steps carried out up
until their definitive publication are documented in this phase. Joint
meetings involving the DIRCE and Dissemination teams are held in
the last part of each cycle. Information about the dynamic of the
processes, the date foreseen for the generation of the aggregates and
the innovations incorporated, form the basis for a proper adaptation
of the output system (7.1 update output systems).
For the second stage, all components related to each operation
are documented (7.2 produce dissemination products). They mainly
refer to the list of data tables, metadata, standard methodological
report, complementary reports, graphic annex and press release.
The external impact of these statistics is very relevant. A recent
study of the number of web accesses to INEBase, the generic
brand for statistical information on the NSI website comprising 185
statistical operations, shows that the DIRCE statistics rank among
the top 20.
Since its first year of publication, the DIRCE has also provided a
tailor-made service through the direct use of register data. Requests
from Public Administrations, Private Companies, Organizations,
Professionals and Researchers are continuously increasing. The
queries registered are very diverse in form and content, and they
are managed according to a specific protocol defined by the NSI
dissemination policy (7.5 manage user support).
3.6. Evaluation phase
This phase is closely related to the quality policy implemented
and the incorporation of successive improvements to our NBR. Two
main orientations have been outlined:
- Internal evaluation, by developing a complete diagnosis of
the processes carried out during each annual cycle.
- External evaluation, by using the feedback of business
statistics producers as a basic element for the improvement of the NBR.
4. Final remarks
This has been a challenging and very positive experience for the
DIRCE team. Although the GSBPM seems better adapted to
a typical survey, this standard can also be applied to Statistical
Business Registers. However, the presence of national specificities
in the management model sometimes makes the allocation to
sub-processes difficult.
Different approaches can be adopted for the description of
the phases, from a dynamic to a static perspective. Generally
speaking, this decision should be made considering the current level
of implementation and the innovations foreseen for the business
process. In the case of the management of BRs, which has a
long-standing tradition in Statistical Offices, the historical dimension
could be more appropriate for the first phases, while the static
information linked to the most recent production cycle would be the
appropriate approach for the remaining phases.
The GSBPM proposes a multi-focal description of the business
process, allocating uniform parts to separate sub-processes. This
exercise can be appropriate when actions are addressed to the same
dynamic database throughout the production chain. However, this
philosophy can impose serious restrictions on projects involving a
great amount and variety of data sources, for which specific actions
must be designed. In the Spanish case, the longitudinal description
of each input data source has been predominant, in order to properly
understand how our model actually works.
On an international scale, the results of these experiences will
have to be jointly evaluated. As a starting point, the development
of benchmarking activities will need to be undertaken. Expected
results should lead to some agreements towards more coordinated
and consistent production cycles. In addition, this context should
facilitate the identification of specific tools for a common use or a
preliminary identification of Data Quality Program for all BRs of
the European System.
This progressive integration within an interoperable system
will mean veritable added value for all statistical actors and an
opportunity to modernise the production process of official statistics.
References
[1] Applying the Generic Statistical Business Process Model
to business register maintenance. Economic Commission for
Europe, Conference of European Statisticians. Paris, September
2011.
[2] COM (2009) 404 final- Communication from the Commission
to the European Parliament and the Council on the production
method of EU statistics. http://eur-lex.europa.eu/
LexUriServ/LexUriServ.do?uri=COM:2009:0404:FIN:EN:PDF
[3] Directive 2012/17/EU of the European Parliament and of the
Council on interconnecting Business Registers.
http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?
uri=OJ:L:2012:156:0001:0009:en:PDF
[4] http://www1.unece.org/stat/platform/display/metis/
The+Generic+Statistical+Business+Process+Model
[5] http://www.ine.es/inebmenu/mnu_empresas.htm
About the author
Luis Esteban Barbado Miguel is a senior statistician at the
National Statistics Institute (INE) of Spain. A public official of the
Senior Corps of Statisticians, he has broad experience in official
statistics. He is currently Deputy Director of the Department of
Methodology and Development of Statistical Production and the
Spanish representative in international working groups on Business
Registers for statistical purposes and on Statistical Units. He has
participated in training seminars for statisticians in Latin America,
focusing on the fundamentals of the development and management
of Business Registers. He holds a university degree in Mathematics,
specializing in Statistics.
Boletín de Estadística e Investigación Operativa. Vol. 33, No. 3, Noviembre 2017, pp. 276-294
Historia y Enseñanza
Positive effects on the least motivated students of
the highly motivated ones1
Raquel Ibar-Alonso
Departamento Interfacultativo de Matemáticas y Estadística
Universidad CEU San Pablo
Carolina Cosculluela-Martínez
Departamento de Economía Aplicada I
Universidad Rey Juan Carlos
Abstract
The teaching methodology affects the motivation of the
students. Unmotivated students can influence the most
motivated ones and vice versa. Continuously adjusting the
methodology to the less motivated students would be possible
if there were information on what really motivates them,
that is, if they could decide how they want to be taught.
A group of students from two different universities in Madrid,
a public and a private one, answered a survey after a month
attending a Statistics course. The three clusters found are
motivated by different methodological tools. The lowest motivation values of the
1Positive effects on the less motivated students.
© 2017 SEIO
30% most motivated people of group 3 (percentile 70) have
been compared with the highest motivation values of the 30%
least motivated students (percentile 30) and with those of the
30% most motivated students (percentile 70) in group 1. Thus,
how can the most motivated group influence the least motivated
one in a non-linear way?
Keywords: Motivation, teaching methodology, profiles,
non-linear influences, just in time adaptability, sociological
study.
AMS Subject classifications: 62J02.
1. Introduction
The motivation of students has been a constant matter
of study since the last century. Some reached the conclusion
that, in order to help unmotivated students, it is important to do
exercises, socialize, and involve them in a process called "attribution
retraining" (Lumsden, 1994). Others assert that the way to motivate
is to ask students to demonstrate what they have learned and to
participate in class, not simply to display their ability to memorize
and answer questions (Chuska, 1995).
Different students can be motivated in many different ways.
The comfort zone of one group of students could motivate them but
be stressful for others. The goal of this paper is to analyze the
effect of the most motivated group of students in a class on
the least motivated group. Quantifying this effect is a prerequisite
for influencing motivation by adapting the methodology to what the
students themselves believe is the most motivating way of teaching.
The way in which they are going to be motivated is beyond the scope
of this paper.
The motivation of a student can be elicited in several ways, and
it is never clear which one is best. Motivation is based on the results
that the student expected to obtain and on what they finally achieved
in the past. In this paper, motivation is elicited in a particular way:
the student identifies the evolution of their motivation graphically.
Thus, the latent variable motivation is built from the graph. The
student's perception of the evolution of their own motivation reflects
its real nature, in a more realistic way than asking students to rank
their motivation on a given scale.
To achieve the main goal, it is necessary to classify
students according to their teaching preferences and their
characteristics. The proposed methodology, built on previous work
(Cosculluela-Martinez and Ibar-Alonso, 2016), is a cluster analysis
and the description of the groups obtained from an on-line survey
of 15 questions answered by more than 200 people studying in two
different types of universities: a private Catholic one and a public
one, both in Madrid. The motivation is computed from one graphical
question and is based on the factorial scores obtained for each group.
This technique is the novelty of Ibar's thesis (2014) and has been
followed by Martinez and Ibar-Alonso (2015). Thus, the answer is
intuitive and quantifiable: a perception of expected motivation
determines the evolution of how the student thinks they are going
to be motivated in the future, together with their view of their
motivation in the past.
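As a purely illustrative sketch of the classification step (with simulated answers, and with k-means as one possible clustering technique; the authors do not specify the algorithm used), the grouping could look as follows:

```python
# Hedged sketch: grouping survey respondents into three clusters.
# K-means is used here only as one possible clustering technique and the
# survey data are simulated, not the authors' responses.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
n_respondents, n_items = 200, 15                 # 15 survey questions, 200+ answers
answers = rng.integers(1, 6, size=(n_respondents, n_items)).astype(float)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(answers)
print("cluster sizes:", np.bincount(labels))
```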
The relationship between the motivation scores of the most
and least motivated students in each group will be estimated and
analyzed to determine the influence of one group on the other.
Our main hypothesis is that there is a strong positive influence
of the less motivated students on the motivation of the highly
motivated ones. Thus, the professor can influence the motivation
of the whole class by adjusting the methodology to what this group
feels will be more motivating for them.
The rest of the paper is organized as follows. Section 2 shows
the methodology used to obtain the data. Section 3 presents the
previous analysis, the relationships between the variables and the
main assumptions. Section 4 discusses the results obtained from the
empirical estimations. Finally, Section 5 provides the concluding
remarks.
2. Methodology
The Google Drive on-line survey,
https://goo.gl/forms/Vk70LlCFLmsCQ6qo1, has 15 questions, from which the cluster
analysis determines that three groups with the same characteristics
can be found in both universities, the same conclusion reached
by Arias et al. (2000). The two universities chosen were
Rey Juan Carlos, a public university, and CEU San Pablo, a private
university. The reason for choosing them is that in both of them
professors teach the same subject, with the same main bibliography
and syllabus. There was a large number of responses: more than 200
students answered the questionnaire. Besides, the questionnaire is a
follow-up survey, meaning that the students are asked to answer it
every year, so the people responding differ from year to year.
One of the questions makes it possible to calculate, for each
group and each period of time (21 periods), the percentiles of the
students' motivation, the starting point being the factorial score of
each group. Thus, the evolution can be a line, a parabolic function, an
exponential function or any other pattern that the student feels their
motivation has followed and is going to follow.
From the profiles of the students, the percentiles of the
motivation have been calculated. The 30% least motivated students
in each group have a different upper bound of their motivation score
for every period, represented by percentile 30 of groups 3 and 1,
hereafter 30CL3 and 30CL1, while the 30% most motivated have
a different lower bound of their motivation score for every period,
represented by percentile 70 of groups 1 and 3, hereafter 70CL1 and
70CL3.
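As a hedged illustration of this percentile construction (with simulated factorial scores, not the authors' survey data), the following sketch computes 30CLg and 70CLg for each group g and each period:

```python
# Hedged sketch: 30th and 70th percentiles of simulated motivation scores
# for each group and period; the data are illustrative, not the real survey.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_periods = 80, 21
scores = rng.normal(size=(n_students, n_periods))   # factorial score per student and period
groups = rng.integers(1, 4, size=n_students)        # cluster labels 1, 2, 3

percentiles = {}
for g in (1, 2, 3):
    block = scores[groups == g]
    percentiles[f"30CL{g}"] = np.percentile(block, 30, axis=0)  # highest value of the 30% least motivated
    percentiles[f"70CL{g}"] = np.percentile(block, 70, axis=0)  # lowest value of the 30% most motivated
print({k: v[:3].round(3) for k, v in percentiles.items()})
```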
The relationship between the evolution of those variables will be
estimated and analyzed according to the pattern that they follow,
either a linear or a non-linear relationship. The study has been
extended to the relationships between those percentiles in cluster
2 and the rest. Thus, equations (2.1) and (2.2) will be estimated,
depending on the results obtained in the Ramsey test and the
graphical analysis:
Y = β0 + β1·X1 + β2·X2 + ... + βn·Xn,    (2.1)
Y = β0 + β1·Xi + β2·Xi^2 + ... + βn·Xi^n,    (2.2)
where:
Y is the value of the percentile to be estimated,
Xi are the independent percentiles related to Y, and
βi are the parameters to be estimated, for i = 1 to n.
A minimal sketch of how these specifications can be fitted and tested is given below.
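The following sketch, under stated assumptions, shows how specifications (2.1)-(2.2) and a Ramsey RESET-type test of linearity can be computed; only the first few values of Table 1 are used, and the test implementation is a generic one, not necessarily the routine used by the authors:

```python
# Hedged sketch: OLS fit of 70CL1 on powers of 30CL3 and a hand-rolled
# Ramsey RESET test of linearity, using a few Table 1 values for illustration.
import numpy as np
from scipy import stats

y = np.array([0.635933, 0.641701, 0.616466, 0.664164, 0.693985, 0.673634])  # 70CL1
x = np.array([-0.5066, -0.5516, -0.5866, -0.664675, -0.74211, -0.74711])     # 30CL3

def ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def ramsey_reset(x, y, power=2):
    """F-test of linearity: add powers of the fitted values to the linear model."""
    X0 = np.column_stack([np.ones_like(x), x])
    b0, e0 = ols(X0, y)
    fitted = X0 @ b0
    X1 = np.column_stack([X0] + [fitted**p for p in range(2, power + 1)])
    _, e1 = ols(X1, y)
    q = X1.shape[1] - X0.shape[1]
    df2 = len(y) - X1.shape[1]
    F = ((e0 @ e0 - e1 @ e1) / q) / (e1 @ e1 / df2)
    return F, stats.f.sf(F, q, df2)

print("RESET F and p-value:", ramsey_reset(x, y, power=2))
# Polynomial specification (2.2), here of degree 3, fitted by least squares:
Xpoly = np.column_stack([np.ones_like(x), x, x**2, x**3])
print("polynomial coefficients:", ols(Xpoly, y)[0].round(3))
```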
Next, the empirical analysis.
3. Analysis
The variables for a time span of 21 periods that have been
selected for the study are:
- 30CL1: highest motivation value for each period of the 30%
with the lowest motivation in group 1.
- 30CL3: highest motivation value for each period of the 30%
with the lowest motivation in group 3.
- 70CL1: lowest motivation value for each period of the 30% with
the highest motivation in group 1.
- 70CL3: lowest motivation value for each period of the 30% with
the highest motivation in group 3.
The data is shown in Table 1. The extension of the analysis with
cluster 2 has been taken to the appendix.
PERIOD 30 CL1 70 CL1 30 CL2 70 CL2 30 CL3 70 CL3
1 -0.447919 0.635933 -0.682659 0.623431 -0.5066 0.62642
2 -0.400638 0.641701 -0.770779 0.624117 -0.5516 0.64763
3 -0.292498 0.616466 -0.836687 0.575844 -0.5866 0.64763
4 -0.249899 0.664164 -0.747312 0.59406 -0.664675 0.64763
5 -0.193041 0.693985 -0.828881 0.591281 -0.74211 0.64763
6 -0.201437 0.673634 -0.819367 0.520423 -0.74711 0.64763
7 -0.265068 0.619635 -0.833527 0.543977 -0.829805 0.64763
8 -0.302225 0.640624 -0.779835 0.569971 -0.914805 0.64763
9 -0.296863 0.599538 -0.831482 0.599809 -0.989805 0.64763
10 -0.312473 0.589619 -0.85975 0.609343 -1.053195 0.64763
11 -0.296863 0.602055 -0.904235 0.619343 -1.053195 0.64763
12 -0.293863 0.589442 -0.903235 0.662087 -1.053195 0.69003
13 -0.275659 0.524072 -0.902879 0.749999 -1.053195 0.69003
14 -0.293308 0.534936 -0.922674 0.659999 -1.05083 0.69003
15 -0.249227 0.576226 -0.922674 0.595511 -0.98083 0.69003
16 -0.278802 0.65001 -0.922674 0.615064 -0.89083 0.69003
17 -0.271151 0.607836 -0.923247 0.632576 -0.78083 0.655395
18 -0.326156 0.684989 -0.955928 0.622341 -0.65083 0.577485
19 -0.388955 0.739893 -0.993438 0.69653 -0.56242 0.577485
20 -0.461889 0.766856 -1.034884 0.776419 -0.52742 0.577485
21 -0.590638 0.925424 -1.118959 0.857497 -0.48242 0.577485
Table 1: Percentiles 30 and 70 of the motivation in groups 1, 2 and 3 (the least and most motivated students in each group).
The preliminary analysis will determine the methodology
according to the type of relationship between them.
3.1. Previous analysis
Percentiles 70 of group 1 (70CL1) and group 3 (70CL3) and
percentiles 30 of group 1 (30CL1) and group 3 (30CL3), together
with their relationships, are represented in Figures 1 to 3.
Figure 1: Relationship between the lowest values of the 30% most motivated people of group 1 and the highest values of the 30% less motivated ones of group 3. Source: Data from our own survey obtained in 2015.
Figure 1 shows the relationship between the lowest values of the
30% most motivated people of group 1 and the highest values of the
less motivated ones of group 3. As can be appreciated, there is no
linearity describing this influence. The Ramsey test rejects the null
hypothesis of linearity at the 90% level with n = 4.
Figure 2: Relationship between the lowest values of the 30% most motivated people of group 1 and group 3. Source: Data from our own survey obtained in 2015.
Figure 2 shows the relationship between the lowest values of the
30% most motivated people of group 1 and group 3. Graphically,
the relationship can be linear. The Ramsey test does not reject the
null hypothesis of linearity.
Figure 3: Relationship between the highest values of the 30% less motivated people of group 1 and group 3. Source: Data from our own survey obtained in 2015.
Figure 3 shows the relationship between the highest values of
the 30% less motivated people of group 1 and group 3. Graphically,
the relationship can be parabolic. The Ramsey test rejects the null
hypothesis of linearity with n = 2 at the 99% level. According to this
preliminary analysis, (2.1) and (2.2) are estimated.
4. Estimations and Results
The final estimations obtained are:
First, the effects of group 3 on group 1 are represented in
equations (4.1), (4.2) and (4.3).
70CL1 = 367.87 [224.97] · 30CL3 + 954.54 [616.74] · 30CL3^2 + 1220.58 [830.49] · 30CL3^3 + 769.51 [549.54] · 30CL3^4 + 191.55 [143.05] · 30CL3^5 + 56.55 [32.25]    (4.1)
70CL1 = −1.82 [0.32] · 70CL3 + 1.82 [0.21]    (4.2)
30CL1 = −16.33 [2.99] · 30CL3 − 4.82 [0.73] − 19.07 [3.94] · 30CL3^2 − 7.26 [1.68] · 30CL3^3    (4.3)
The residuals are white noise. The ADF test was applied to the
residuals, and all the coefficients were examined, to assess the
goodness of fit of the models.
From (4.1) to (4.3) it can be said that there is a positive
relationship between the 30% least motivated students of group 3
and the 30% most motivated ones in group 1. There is a negative
relationship between the 30% most and least motivated students in
both groups.
Graphically, the models explain the evolution of the percentiles
quite accurately. The 30% least motivated students in group 3 have
a strong influence on the least motivated ones of group 1. The 30%
most motivated students in group 3 have a strong influence on the
most motivated ones, who are in group 1.
Figure 4: Relationship between the highest values of the 30% less motivated people of group 1 and group 3 and the one estimated by the model.
5. Conclusions
As motivation is not always a positive value, and its evolution
depends on previous experience, the relationship between the
students' motivation in each of the groups into which they can be
classified according to their profiles is difficult to analyze. An
approximation is to estimate the relationship between the evolution
of the score that a fixed percentage of the people classified in one
group think they have and the evolution of the score that another,
or the same, percentage of the people classified in another group
think they have.
Figure 5: Relationship between the lowest values of the 30% most motivated people of group 1 and group 3 and the one estimated by the model.
Figure 6: Relationship between the lowest values of the 30% most motivated people of group 1 and the highest values of the less motivated ones of group 3. Source: Data from our own survey obtained in 2015.
Thus, the technique used to quantify the influence of a
one-percentage-point increase (hereafter, the shock) in the motivation
of each of the groups on the others is to estimate the following
nine pairwise relationships: the relationship between the lowest
values of the 30% most motivated people of one group and the
highest values of the 30% least motivated ones of another group;
the relationship between the lowest values of the 30% most
motivated people of each pair of groups; and the relationship between
the highest values of the 30% least motivated people of each pair of
groups.
First, positive effects have been found when increasing the motivation
of group 2 by one percentage point: if the increase is in the 30% most
motivated, the effect on the 30% most motivated of group 1 is 26.34%;
if the increase is in the 30% least motivated, the effect on the 30%
least motivated in group 1 is 0.07%, the effect on the 30% least
motivated of group 3 is 2.71% and the effect on the 30% most
motivated of group 3 is 3.16%. On the other hand, an increase in
the 30% less motivated of group 2 decreases the motivation of the
30% least motivated of group 1 by 0.82%.
Second, raising the motivation of the most motivated people in
group 3 by one percentage point decreases the motivation of the most
motivated ones in group 1 by 1.05%. Raising the motivation of the
30% least motivated in group 3 decreases the motivation of the least
motivated in group 1 by 1.47% and raises the motivation of the
30% most motivated ones in group 1 by 2.71%.
References
[1] Arias, A. V. (2000). Enfoques de aprendizaje en estudiantes
universitarios. Psicothema, 12(3), 368-375.
[2] Chuska, K. R. (1995). Improving classroom questions: A
teacher’s guide to increasing Student Motivation, Participation,
and Higher-Level Thinking, Phi Delta Kappa Educational
Foundation ERIC Publications, Bloomington (Indiana, US).
[3] Cosculluela-Martínez, C. and Ibar-Alonso, R. (2016).
Retroalimentación como fuente de mejora de la calidad docente:
caso real. Congreso XXIV Jornadas de ASEPUMA, Granada
(Spain).
[4] Ibar-Alonso, R. (2014). Nueva metodología de recogida
de información para su tratamiento a través del Análisis
Multivariante y los Modelos de Ecuaciones Estructurales.
Aplicación en el ámbito universitario. Tesis Doctoral, Madrid
(Spain).
[5] Lumsden, L. S. (1994). Student motivation to learn. ERIC
Publications, number 92, Washington, DC. Available at
https://eric.ed.gov/?id=ED370200.
[6] Martinez, M. S., and Ibar-Alonso, R. (2015). Convergence and
interaction in the new media: Typologies of prosumers among
university students. Comunicacion y Sociedad, 28(2), 87.
Appendix
Estimations of the relationships between cluster 2 and the other
clusters.
A1. Clusters 1 and 2
70CL1 = −374.75 · 30CL2 − 632.00 · 30CL2^2 − 466.67 · 30CL2^3 − 127.19 · 30CL2^4 − 81.52    (5.1)
70CL1 = 0.27 · 70CL2 + 1.06 · 70CL1(−1) − 0.20    (5.2)
30CL1 = 0.98 · 30CL2 − 0.64 · 30CL2(−1)    (5.3)
Figure 7: The relationship between the lowest values of the 30% most motivated people of group 1 and group 2; the relationship between the lowest values of the 30% most motivated people of group 1 and the highest values of the 30% less motivated people of group 2; and the relationship between the highest values of the 30% less motivated people of group 1 and group 2. Source: Data from our own survey obtained in 2015.
Figure 8: Estimated and real values.
A.2. Clusters 2 and 3
30CL3 = 3.57 · 30CL2 + 7.42 · 30CL2^2 + 3.73 · 30CL2^3 + 1.06 · 30CL3(−1)    (5.4)
No satisfactory fit was obtained with either the linear or the non-linear model.
70CL3 = −0.26 · 30CL2^2 + 0.78 · 70CL3(−1) − 0.39 · 30CL2    (5.5)
Figure 9: The relationship between the highest values of the 30% less motivated people of group 3 and group 2; the relationship between the lowest values of the 30% most motivated people of group 3 and group 2; and the relationship between the lowest values of the 30% most motivated people of group 3 and the highest values of the 30% less motivated people of group 2. Source: Data from our own survey obtained in 2015.
Figure 10: Estimated and real values.
About the authors
Ibar-Alonso, R. holds a PhD with distinction in Economic and
Business Sciences from San Pablo University of Madrid (USP-CEU).
She obtained her degree in Mathematical Sciences from the Complutense
University of Madrid (UCM). Faculty staff of the Mathematical
and Statistic Department at the USP-CEU University. Member of
the research group of Media Convergence (INCIRTV) and of the
Research Project Smart Cities: Accessibility problems to digital
information of the older citizens. She has multidisciplinary research
lines in Multivariate Statistic Analysis, social behavior, and new
ways of collecting qualitative and quantitative information. Visiting
scholar (2016) at Regional Economic Applications Laboratory
Illinois University.
Cosculluela-Martínez, C. Faculty Staff at URJC. PhD with
distinction in Statistics for Economics from UNED. Enrique Fuentes
Quintana (2010) and Ramón Areces (2011) Prizes. Coordinator
of the Business Administration and Tourism Branch. Public
Press Conferences and Coordinated General Directorates at the
Vice-Council Office of Economics and Employment of the Madrid
Regional Department. Senior Risk Analyst in Avalmadrid S.G.R,
external at A.E.M.S.A. Participation in several projects of the EU,
Education Ministry, Madrid Regional Education and Employment
Department and Town-halls. Referee of WOS indexed Journals.
Visiting scholar and Visiting Professor (2011, 2016) at Regional
Economic Applications Laboratory Illinois University.
Boletín de Estadística e Investigación Operativa. Vol. 33, No. 3, Noviembre 2017, pp. 295-321
Opiniones sobre la profesión
Ingenuas reflexiones de un estadístico en la era del Big Data
Ricardo Cao Abad
Grupo de investigación MODES, Departamento de Matemáticas,
Centro de Investigación en Tecnologías de la
Información y las Comunicaciones (CITIC)
Instituto Tecnológico de Matemática Industrial (ITMATI)
Universidade da Coruña,
Campus de Elviña, s/n, 15071 A Coruña, Spain
Abstract
This article presents some reflections of the author (a
statistician) about the role of Statistics in the Big Data era.
The paper goes from the change of paradigms (asymptotic
properties versus huge sample sizes and statistical efficiency
versus computation time), to the more than probable presence
of bias in Big Data. It also makes a tour through subsampling
methods and ‘divide and conquer’ strategies. All these issues
are examined under a very personal (possibly naive) view of
the author.
Keywords: Biased data, big data, bootstrap, divide and conquer,
magnifying glass subsample, minimalist replication bootstrap,
subsampling.
AMS Subject classifications: 62G08, 62G09, 62G10, 62G20,
68T05.
© 2017 SEIO
1. Motivacion
Algo que no siempre esta claro para todos los usuarios de la
estadıstica es que los datos son la informacion recogida en la muestra
observada y un aspecto muy importante es el modo en que se ha
decidido recogerlos. Ası, no solo importan sus valores concretos sino
tambien el procedimiento (normalmente aleatorio) a partir del cual
se obtuvieron esos datos concretos de la poblacion. En ese sentido,
el modelo aleatorio generador de los datos cobra mas importancia,
si cabe, que los datos en sı mismos. De ahı la importancia de
los metodos de muestreo y del concepto de muestra aleatoria. La
conexion de ese concepto de muestra, puramente matematico, con la
realidad es la muestra observada, ya formada por valores obtenidos
del mundo real: los datos.
Hasta hace aproximadamente un siglo la obtencion de datos era
un proceso muy laborioso. Por ese motivo, la mayorıa de los metodos
estadısticos propuestos a finales del siglo XIX y principios del XX
fueron pensados para situaciones en las que el tamano muestral
era pequeno. Un ejemplo de ello es el artıculo de Pearson (1900)
en el que se introduce el estadıstico χ2 para realizar contrastes de
bondad de ajuste. A raız de las ideas expresadas en el artıculo de
Pearson y, sobre todo, en el de William S. Gosset, en el que introdujo
la distribucion t de Student (Student (1908)), fue haciendose mas
evidente la necesidad de disenar procedimientos estadısticos que
tuviesen muy en cuenta el hecho de que, para tamanos muestrales
pequenos, la distribucion de probabilidad de muchos estadısticos
difiere bastante de la que tienen cuando el tamaño muestral tiende
a infinito: la llamada distribución asintótica del estadístico.
La introduccion del metodo bootstrap por Efron (1979) fue un
paso de gigante en ese sentido. El metodo proporciona una filosofıa
general para aproximar la distribucion de un estadıstico para un
tamano muestral finito concreto. Eso sı, la gran utilizacion del
metodo bootstrap no hubiera sido posible de no haber dispuesto de
cada vez mas agiles ordenadores que hoy permiten simular millones
de replicas bootstrap de muchos de los estadısticos mas frecuentes en
unos pocos segundos. Este auge de las tecnologıas de la informacion
(y tambien de la sensorica y las comunicaciones) hace que los propios
ordenadores y dispositivos electronicos pasen de ser una valiosa
herramienta para analizar datos a ser fuentes inagotables de datos.
Esos datos son ahora de tamano muestral ingente y frecuentemente
muy complejos y de alta dimension. Esto ha dado lugar al campo
conocido actualmente como Big Data, sobre el que reflexionare desde
el punto de vista estadıstico, prestando atencion a los cambios de
paradigma que, a mi juicio, se avecinan. Un artıculo muy interesante
en el que el autor reflexiona sobre cual ha de ser el papel de la
estadıstica (y de las personas que nos dedicamos a esta ciencia) en
este campo emergente de los Big Data es el escrito por Pena (2014),
publicado tambien en BEIO.
2. Cambio de paradigmas
El Big Data ha traido consigo la generacion y necesidades de
procesamiento y analisis de bases de datos de gran volumen (en
ocasiones desestructuradas). El sentido en que estas bases de datos
son grandes frecuentemente varıa. Dicho coloquialmente, podemos
hablar de grande a lo ancho (gran numero de variables en la base
de datos) o a lo largo (tamano muestral muy elevado) o en ambos
sentidos. En Cao (2015) se hace un recorrido por diversas situaciones
reales en las que se presentan alguna de estas caracterısticas
de gran tamano (a lo ancho, a lo largo o en ambos sentidos),
incluyendo reflexiones sobre el tratamiento de gandes volumenes de
datos procedentes de imagenes y vıdeos, ası como la perspectiva
infinito-dimensional que proporciona el analisis de datos funcionales.
En el caso en que el conjunto de datos sea grande debido
al elevado tamano muestral, n, a mi juicio se intuye un cambio
de paradigma en lo tocante a la disyuntiva entre el uso (y el
interes) de las propiedades asintoticas como contraposicion a las
propiedades obtenidas para n fijo. Tambien se adivinan cambios
en los criterios de optimalidad de los procedimientos de analisis de
datos que podrıan tener en cuenta ya no solo la eficiencia estadıstica
de los procedimientos, sino su coste computacional y escalabilidad.
Veamoslo a continuacion.
2.1. Propiedades asintoticas y para tamano muestral finito
Cuando uno dispone de una muestra con un tamano muy grande,
los resultados asintoticos deben estar muy cerca de lo que la
muestra nos ofrece. Como consecuencia, las propiedades asintoticas
de los metodos estadısticos deben jugar un papel fundamental en
estos dıas. Asimismo, los resultados que son de mucho interes
para tamanos muestrales pequenos posiblemente dejaran de tenerlo
en estos contextos de Big Data a lo largo. Ası, por ejemplo,
con un tamaño muestral de n = 1 000 000, la distribución $\chi^2_{n-1}$
para el estadıstico estudentizado que permite hacer inferencia mas
precisa sobre la varianza de una poblacion normal, sera poco util
y la mera aproximacion de la distribucion de dicho estadıstico
gracias al Teorema Central del Lımite y la consistencia de los
momentos muestrales sera un resultado mucho mas interesante
en este caso. Igualmente, con un tamano muestral tan grande,
posiblemente no sea necesaria la utilizacion del metodo bootstrap
para hacer inferencia sobre la varianza de la poblacion. De nuevo,
la aproximacion normal para dicha distribucion sera un resultado
mucho mas util para ese caso. Es evidente que eso representa un gran
alivio computacional, pues realizar procedimientos bootstrap que
requieran simular decenas de miles de replicas con tamano muestral
del orden de millones es un proceso que consume mucho tiempo de
CPU y, en ocasiones, una gran cantidad de memoria.
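A modo de esbozo ilustrativo (con datos simulados y valores supuestos, no tomados del artículo), el siguiente código compara el intervalo de confianza para la varianza basado en la aproximación normal con uno bootstrap cuando n es muy grande:

```python
# Esbozo ilustrativo (datos simulados): intervalo de confianza para la
# varianza con n grande, comparando la aproximacion normal basada en el TCL
# con un bootstrap percentil con pocas replicas (puede tardar unos segundos).
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.normal(loc=2.0, scale=3.0, size=n)

s2 = x.var(ddof=1)
# Aproximacion normal: Var(S^2) se estima a partir del cuarto momento muestral.
m4 = np.mean((x - x.mean()) ** 4)
se = np.sqrt((m4 - s2**2) / n)
ic_normal = (s2 - 1.96 * se, s2 + 1.96 * se)

# Bootstrap percentil con B pequeno.
B = 200
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot[b] = x[idx].var(ddof=1)
ic_boot = tuple(np.percentile(boot, [2.5, 97.5]))

print("IC normal:   ", ic_normal)
print("IC bootstrap:", ic_boot)
```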
Por otra parte, al disponer de un tamano muestral tan grande
cabe preguntarse lo cercana que ya esta la informacion muestral
de las caracterısticas poblacionales y hasta que punto podemos
simplemente olvidarnos de los errores de estimacion y considerar
que las estimaciones obtenidas a partir de la muestra son ya valores
muy cercanos a sus analogos poblacionales. Esto puede resultar
razonable en problemas sencillos, pero quiza no tanto en otros mas
complejos y especialmente en aquellos que traen consigo una elevada
dimension del objeto poblacional de interes. Ası, por ejemplo, si
disponemos de n = 1 000 000 observaciones en la muestra y para
cada una hemos registrado los valores de 1 000 variables, un elemento
poblacional que puede resultar interesante (por ejemplo, para llevar
a cabo un analisis de componentes principales) es la matriz de
varianzas-covarianzas. Esa matriz tiene dimension 1 000 ×1 000 y en
ella se hallan (1000 · 999) /2 + 1000 = 500 500 elementos distintos
a estimar. Aunque un millon de datos puede parecer mucho, al
tener que estimar alrededor de medio millon de parametros es muy
probable que en alguno de ellos el error de estimacion sea realmente
grande y que eso distorsione las conclusiones posteriores. Por ello
resulta interesante el poder controlar los errores de estimacion
conjuntos de tal ingente cantidad de parametros. Obviamente, si en
lugar de 1 000 se tratase de 2 000 variables, el numero de parametros
a estimar (2 001 000 elementos de la matriz de varianzas-covarianzas)
harıa que el problema fuese inabordable con “tan solo” un millon de
datos.
Una forma de abordar el problema de estimar las componentes
principales con solo un millon de datos en presencia de 2 000 variables
podrıa ser el considerar como componentes principales factibles
aquellas combinaciones lineales de las variables de partida que tenga,
a lo sumo, “tan solo” 100 coeficientes no nulos. De esta manera, si
consideramos todas las 1 000 potenciales componentes principales,
solo necesitarıamos estimar 100 000 coeficientes, que aunque es un
numero elevado es considerablemente menor que el tamano muestral.
La idea anteriormente expuesta: considerar modelos dispersos, es
decir, con un numero relativamente pequeno de coeficientes no nulos,
mucho menor que el gran numero, d, de variables explicativas, ha
sido y continua siendo muy utilizada en el contexto de Big Data a lo
ancho. Tambien lo es en casos no necesariamente de Big Data pero
cuando simplemente d > n. Entre los trabajos pioneros en esta lınea
se encuentran el de Tibshirani (1996) y el de Efron et al. (2004).
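Como esbozo meramente ilustrativo de la idea de modelo disperso (usando el lasso de scikit-learn con datos simulados y parámetros supuestos, no ligados a ningún ejemplo del artículo):

```python
# Esbozo ilustrativo: modelo disperso tipo lasso con mas variables que
# observaciones (Big Data "a lo ancho"); datos y parametros simulados.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, d = 500, 2000                      # d > n
X = rng.normal(size=(n, d))
beta = np.zeros(d)
beta[:10] = rng.normal(size=10)       # solo 10 coeficientes realmente no nulos
y = X @ beta + rng.normal(scale=0.5, size=n)

modelo = Lasso(alpha=0.05).fit(X, y)
print("coeficientes no nulos estimados:", int(np.sum(modelo.coef_ != 0)))
```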
Tambien es muy frecuente en Big Data a lo ancho que sea
necesario examinar la validez de un enorme numero de hipotesis.
Por ejemplo, contrastar si cada una de las d variables potencialmente
explicativas realmente aporta algo de explicacion en un modelo de
regresion o de clasificacion. Una situacion semejante se da cuando
se manejan modelos con un enorme numero de coeficientes y se
desean contrastar las hipotesis simplificadoras de que cada uno de
esos coeficientes es cero. Para dar respuesta a ese tipo de situaciones
surgieron a finales del pasado siglo y principios del presente (ver
Benjamini and Hochberg (1995) y Benjamini and Yekutieli (2001),
entre otros) diversos metodos encaminados a controlar la tasa de
falsos positivos (o FDR, del ingles false discovery rate) ası como la
tasa de error conjunta (o FWER, del ingles familywise error rate).
Obviamente estos metodos son esenciales en Big Data a lo ancho.
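Un esbozo ilustrativo del procedimiento de Benjamini-Hochberg para controlar la FDR, con p-valores simulados:

```python
# Esbozo ilustrativo del procedimiento de Benjamini-Hochberg (control de la
# FDR) aplicado a un vector de p-valores simulados.
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Devuelve un vector booleano con las hipotesis rechazadas al nivel FDR q."""
    p = np.asarray(pvalues)
    m = len(p)
    orden = np.argsort(p)
    umbrales = q * np.arange(1, m + 1) / m
    por_debajo = p[orden] <= umbrales
    rechazadas = np.zeros(m, dtype=bool)
    if por_debajo.any():
        k = np.max(np.where(por_debajo)[0])   # mayor i con p_(i) <= q*i/m
        rechazadas[orden[: k + 1]] = True
    return rechazadas

rng = np.random.default_rng(3)
pvals = np.concatenate([rng.uniform(size=950), rng.uniform(0, 0.001, size=50)])
print("numero de rechazos:", int(benjamini_hochberg(pvals, q=0.05).sum()))
```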
2.2. Eficiencia estadıstica y eficiencia computacional
En general, cuando tenemos que analizar estadısticamente
conjuntos de gran volumen de datos, un asunto muy relevante es
el del tiempo de calculo necesario para llevar a cabo tales analisis.
Esto plantea la necesidad (o, al menos, la conveniencia) de tener
en cuenta el tiempo de computacion necesario dentro del criterio de
optimalidad del metodo de estimacion. Lo habitual en estadıstica
es considerar medidas de eficiencia, como la inversa del error
cuadratico medio de un estimador, que son utiles para comparar
distintos procedimientos estadısticos atendiendo tan solo al error de
estimacion que cometen. Sin embargo, no es infrecuente que metodos
que proporcionan menos error de estimacion precisen de un mucho
mayor numero de calculos, con lo que el tiempo de computo de los
mismos sera tambien mucho mayor. Eso da pie a tener en cuenta
la llamada eficiencia computacional a la hora de comparar metodos
estadısticos.
Cuando disponemos de un numero de datos moderado, la
eficiencia estadística es el criterio primordial (si no el único) para
determinar la optimalidad de un procedimiento de analisis de datos.
La eficiencia computacional suele considerarse como una propiedad
complementaria deseable del metodo. A veces llega a fijarse cierto
umbral para el tiempo de computacion que no debe rebasar el
metodo de analisis, normalmente debido a requisitos tecnicos, como
el tiempo maximo permisible para poner en practica medidas
correctivas, si tras el analisis de los datos se concluye que dichas
medidas son necesarias. En este sentido, se tratarıa de encontrar
el procedimiento mas eficiente estadısticamente (por ejemplo el
estimador con menor error cuadratico medio) dentro de los que
requieren un tiempo de computo menor o igual que un umbral fijado.
Por el contrario, en algunas aplicaciones crıticas en Big Data es
simplemente necesario producir estimaciones que conlleven un error
estadıstico no mayor que un umbral prefijado pero, dentro de ellas,
resulta crucial poder poner en practica el metodo mas rapido desde el
punto de vista computacional. Esto ocurre, por ejemplo, a la hora de
poner en el mercado productos y tecnologıas con alto valor anadido,
cuando el tiempo de respuesta es un factor decisivo para imponerse
a otros competidores.
Aunque actualmente no es muy frecuente, resulta muy razonable
esperar que en el futuro se utilicen criterios de optimalidad mixtos
que combinen la eficiencia medida desde el punto de vista estadıstico
con la eficiencia computacional, tanto en tiempo de procesado como
en memoria requerida para llevar a cabo el procedimiento de analisis.
Es ası concebible que los procedimientos de analisis opten en el futuro
por elegir un metodo u otro en funcion del peso que reciban el coste
del error de estimacion, el tiempo de computo necesario y la memoria
requerida para la implementacion del metodo, entre otros aspectos.
Ello, ademas, puede depender de la arquitectura computacional a
utilizar, ya que el grado de paralelizacion de los distintos metodos
de analisis de datos puede ser un factor decisivo al evaluar este tipo
de criterios de optimalidad conjuntos estadıstico-computacionales.
Ası, no sera sorprendente en el futuro que una rutina de analisis de
grandes volumenes de datos opte por llevar a cabo un procedimiento
u otro en funcion de la arquitectura computacional en la que se
ejecute.
Por ultimo, un aspecto crucial que ya esta muy presente en el
campo de los Big Data es el de la escalabilidad. En el contexto que
nos ocupa, la escalabilidad podrıamos definirla como la capacidad del
metodo de analisis de datos para adaptarse a situaciones de mayor
volumen (mayor tamano muestral, mayor dimension del modelo,
mayor numero de variables en el mismo) sin perder su calidad.
Frecuentemente la escalabilidad de un procedimiento estadıstico se
evalua examinando como crece el numero de operaciones necesarias
(o su tiempo de ejecucion) al ir aumentando el tamano muestral
o la dimension del problema. Tambien es importante analizar la
escalabilidad desde el punto de vista de la memoria RAM y de la
capacidad necesaria de disco duro para llevar a cabo el procedimiento
estadístico. Obviamente, las limitaciones en memoria RAM pueden suplirse con una utilización más intensiva del disco; sin embargo, esto produce un enlentecimiento considerable del tiempo necesario para poder completar el procedimiento de análisis. Un ejemplo de
procedimiento estadıstico poco escalable desde el punto de vista del
tiempo de ejecucion es el de la obtencion del parametro de suavizado
mediante un criterio del tipo cross validation para la estimacion no
parametrica de la funcion de regresion. Este ejemplo lo trataremos
precisamente en la siguiente seccion.
En las secciones siguientes expondre en mayor detalle algunos
procedimientos estadısticos o problemas concretos que tienen
especial relevancia en el contexto de los datos de gran volumen.
Todo ello siempre con una vision muy personal.
3. Submuestreos lupa
Entre las herramientas de analisis exploratorio de datos
mas utilizadas estan los procedimientos graficos. Frecuentemente
es muy recomendable comenzar a explorar los datos mediante
representaciones graficas que nos permitan simplemente resumir
la informacion, detectar datos atıpicos o establecer patrones
iniciales que luego se validaran, contrastaran o ajustaran mediante
procedimientos estadísticos, a veces sofisticados. De entre las
representaciones graficas uno de los tipos mas usados y utiles son
los diagramas de dispersion de pares de variables y las matrices
de dichas graficas de dispersion. Mediante este tipo de graficas se
pretende examinar, por ejemplo, la posible relación de dependencia
entre variables relevantes del problema. Ello puede ayudar a decidir
que tipos de modelos de regresion formular inicialmente.
Recientemente, en el contexto de un trabajo fin de máster (TFM) que he dirigido, nos encontramos con el problema, tan trivial pero limitante a la vez, de no poder distinguir los patrones subyacentes en un simple gráfico de dispersión entre dos variables. ¿Cómo es eso posible?, se preguntará
el lector. ¿No se arreglo el problema cambiando el tipo de puntos
utilizado, o el tamano, color o forma de los mismos? Pues no, ya
que el grafico de dispersion constaba de algo mas de 800 000 datos,
correspondientes a otros tantos clientes de una entidad financiera.
El “apelotonamiento” de datos era tal que resultaba muy difıcil
distinguir zonas de muy alta concentracion de puntos de otras en
las que la concentración era simplemente alta, o incluso moderada.
Obviamente, uno podría prescindir de la construcción de tales tipos de representaciones gráficas y pasar directamente a algún método de análisis (como la construcción de una estimación no paramétrica de la regresión) que luego pueda ser representado gráficamente sin encontrarse con el problema antedicho. De todas formas, no hay por qué renunciar a esas exploraciones gráficas; simplemente podemos aplicar el principio de “más vale que sobre que no que falte”, ya que de donde sobra se puede quitar.
La forma en que “resolvimos el problema” de no poder distinguir
los patrones en el grafico de dispersion consistio en obtener una
submuestra aleatoria de algo ası como 1 000 datos y representar
graficamente “solo” esos 1 000 datos en un diagrama de dispersion.
Aunque es poco probable, podría suceder que esa submuestra no
fuese muy representativa de la muestra original, ası que repetimos el
procedimiento unas pocas veces (obteniendo submuestras aleatorias
independientes), lo que nos permitio corroborar el patron observado
para la submuestra inicial. De esta forma, el submuestreo actuo a
modo de lupa (o quiza microscopio) permitiendonos ver donde antes
todo estaba enmaranado.
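La idea del submuestreo como lupa puede esbozarse así (ejemplo ilustrativo con datos simulados, ajeno al TFM citado):

```python
# Boceto ilustrativo del "submuestreo lupa": varios diagramas de dispersion
# sobre submuestras aleatorias pequenas de una muestra masiva (datos simulados).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 800_000
x = rng.normal(size=n)
y = np.sin(2 * x) + rng.normal(scale=0.8, size=n)    # patron que el "apelotonamiento" oculta

fig, ejes = plt.subplots(1, 3, figsize=(12, 4), sharex=True, sharey=True)
for eje in ejes:
    idx = rng.choice(n, size=1_000, replace=False)   # submuestra aleatoria de 1 000 datos
    eje.scatter(x[idx], y[idx], s=5)
fig.suptitle("Tres submuestras aleatorias independientes de 1 000 datos")
plt.show()
```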
Este procedimiento de submuestreo puede resultar tambien muy
util cuando se trata de llevar a cabo otros procedimientos estadısticos
que pueden ser poco escalables (muy lentos de ejecucion para
tamanos muestrales relativamente grandes). Precisamente, en el
ejemplo antes citado, tras el analisis visual de algunas submuestras
llegamos a la conclusion de que resultaba conveniente realizar una
estimacion no parametrica (tipo nucleo) de la funcion de regresion.
Uno de los requisitos necesarios para ello es la eleccion del parametro
de suavizado (o ventana), que juega un papel fundamental a la hora
de aplicar esta tecnica. Entrando en cierto detalle, a la hora de hacer
una gráfica del estimador de Nadaraya-Watson, $m_h$, de la función
de regresion hemos de evaluar dicho estimador en una particion
suficientemente fina y para la construccion del mismo, debemos
elegir el parametro de suavizado, h. Uno de los primeros metodos
propuestos (y, aun ası, usado en la actualidad) para la seleccion del
parámetro h consiste en encontrar aquel valor, $h_{CV}$, que minimiza la función por validación cruzada (o cross validation) dada por
$$CV(h) = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - m_h^{-i}(X_i)\right)^2, \qquad (3.1)$$
siendo n el tamaño muestral, $(X_1, Y_1), \ldots, (X_n, Y_n)$ la muestra, $m_h^{-i}(x)$ el estimador evaluado en x y calculado eliminando de la muestra la i-ésima observación, $m_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) Y_i$ el estimador tipo núcleo de Nadaraya-Watson calculado con toda la muestra, K la función núcleo utilizada y h el parámetro de suavizado o ventana. Es inmediato razonar que cada evaluación de
la función dada en (3.1) requiere del orden de $n(n-1)$ operaciones elementales, es decir, ese número de evaluaciones de la función núcleo y otras tantas sumas. En el caso que nos ocupa, con n = 800 000, eso supone que el cálculo de CV(h) para cada h requiere del orden de 6,4 × 10^11 operaciones, es decir, algo más de medio billón de evaluaciones de la función núcleo y otras tantas sumas. Suponiendo que pudiésemos llevar a cabo 100 millones de operaciones por segundo y que cada evaluación de la función K llevase consigo del orden de 10 operaciones, entonces las aproximadamente 7,04 × 10^12 operaciones necesarias requerirían unos 70 400 segundos, es decir, algo más de 19 horas de tiempo de ejecución. Si para encontrar una buena aproximación numérica del valor $h_{CV}$ hiciesen falta
unas 10 evaluaciones de la funcion CV (h), entonces necesitarıamos
algo mas de 8 dıas para disponer del valor del parametro de
suavizado a utilizar. Obviamente, 8 dıas para poder conocer un
parametro auxiliar a utilizar para llevar a cabo un procedimiento de
estimación es un tiempo prohibitivo. La ejecución del procedimiento se demoraría más de dos años si la muestra constase de n = 8 000 000 de datos.
Al reflexionar sobre el ejemplo anterior podrıamos pensar
que algunos procedimientos estadısticos (como la estimacion no
parametrica de la regresion) son practicamente inutilizables con
datos de gran volumen. Sin embargo, ello no tiene por qué ser así; simplemente hemos de utilizar nuestra imaginación para solventar
esos problemas de escalabilidad.
En el contexto de la estimacion no parametrica tipo nucleo
de Nadaraya-Watson de la funcion de regresion, es conocido que
el parametro de suavizado optimo (por ejemplo, en el sentido
de minimizar el error cuadratico promediado medio, MASE) es
asintóticamente de la forma $h_{opt} \simeq c_0\, n^{-1/5}$, para cierta constante
c0 que depende de caracterısticas poblacionales (como la propia
funcion de regresion desconocida, la de densidad de la variable
explicativa y algunas derivadas de ambas). De hecho, bajo
algunas condiciones sabemos que muchos metodos de seleccion
del parametro de suavizado, como el de validacion cruzada,
proporcionan procedimientos consistentes. En concreto, se tiene que
$$\frac{h_{CV,n}}{h_{opt}} \longrightarrow 1,$$
en probabilidad o de forma casi segura. Así pues, para tamaños muestrales relativamente grandes es previsible esperar que $h_{CV,n} \simeq c_0\, n^{-1/5}$. Si no fuese porque $c_0$ es desconocido, podríamos utilizar esa fórmula asintótica para aproximar el valor de $h_{CV,n}$. Sin embargo,
ese problema puede resolverse aplicando el procedimiento estadıstico
con una submuestra de tamano mucho menor (aunque grande aun).
Por ejemplo, tomando una submuestra de tamano m el calculo de
la ventana de validacion cruzada para dicha submuestra requerirıa
del orden de $m(m-1)$ operaciones elementales y este número puede ser mucho menor que $n(n-1)$ eligiendo m adecuadamente.
Supongamos que en nuestro ejemplo tomamos m = 8 000, es decir, $m = n/100$. Así, el cálculo de la ventana de validación cruzada basada en esta submuestra, $h_{CV,m}$, será unas $100^2 = 10\,000$ veces más rápido que con la muestra original completa. Eso significa que requeriría unos 70 segundos de tiempo de ejecución. Ahora, como $h_{CV,m} \simeq c_0\, m^{-1/5}$, se tiene que $c_0 \simeq h_{CV,m}\, m^{1/5}$ y, por tanto, $h_{CV,n} \simeq h_{CV,m}\, m^{1/5}\, n^{-1/5} = h_{CV,m}\left(\frac{m}{n}\right)^{1/5}$. En resumen, la utilización
del mismo procedimiento de seleccion de la ventana pero para una
submuestra aleatoria de la muestra Big Data y una mera correccion
por el tamano muestral permitirıan obtener en poco mas de un
minuto el parametro de suavizado que, de seleccionarlo utilizando la
muestra completa, requerirıa mas de 8 dıas de calculos. Obviamente
esta forma de proceder provoca que el valor obtenido dependa de
la submuestra concreta elegida, pero podemos repetir el proceso
para unas cuantas submuestras y considerar el valor promedio del
$c_0$ aproximado mediante $h_{CV,m}\, m^{1/5}$.
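Un posible esbozo en Python de esta estrategia (puramente ilustrativo, no es el código del trabajo citado) sería el siguiente; por simplicidad se usa la versión autonormalizada del estimador de Nadaraya-Watson, un núcleo gaussiano y una submuestra menor que la del ejemplo para que la matriz de distancias quepa con holgura en memoria:

```python
# Boceto ilustrativo: seleccionar la ventana por validacion cruzada en una submuestra
# de tamano m y extrapolarla a n mediante h_n = h_m * (m/n)^(1/5). Datos simulados.
import numpy as np

def cv_ventana(x, y, rejilla_h):
    """Validacion cruzada dejando-uno-fuera para un estimador tipo Nadaraya-Watson."""
    dif = x[:, None] - x[None, :]                 # matriz m x m de diferencias
    mejor_h, mejor_cv = None, np.inf
    for h in rejilla_h:
        K = np.exp(-0.5 * (dif / h) ** 2)         # nucleo gaussiano
        np.fill_diagonal(K, 0.0)                  # se excluye la propia observacion
        pred = (K @ y) / K.sum(axis=1)            # version autonormalizada del estimador
        cv = np.mean((y - pred) ** 2)
        if cv < mejor_cv:
            mejor_h, mejor_cv = h, cv
    return mejor_h

rng = np.random.default_rng(2)
n, m = 800_000, 2_000          # en el articulo m = 8 000; aqui algo menor por sencillez
x_total = rng.uniform(0, 1, size=n)
y_total = np.sin(2 * np.pi * x_total) + rng.normal(scale=0.3, size=n)

idx = rng.choice(n, size=m, replace=False)        # submuestra aleatoria
h_m = cv_ventana(x_total[idx], y_total[idx], np.linspace(0.01, 0.2, 20))
h_n = h_m * (m / n) ** (1 / 5)                    # correccion asintotica h ~ c0 * n^(-1/5)
print(f"h_CV,m = {h_m:.4f}  ->  h_CV,n aproximada = {h_n:.4f}")
```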
En el ejemplo anterior, la clave para poder reducir el numero de
operaciones estuvo en conocer la expresion asintotica del parametro
auxiliar (de suavizado) que deseamos elegir. Eso permite “extrapolar”
el valor obtenido para el tamano muestral m al que corresponderıa
para otro tamano, n, mucho mayor. Para otros parametros o en
otros contextos es posible que no dispongamos de resultados teoricos
que permitan llevar a cabo razonamientos como ese. En tal caso,
siempre serıa factible llevar a cabo el procedimiento para unos pocos
valores del tamano de las submuestras: m1 < m2 < · · · < mk,
mucho menores que n, y luego formular un modelo que permita
relacionar el parametro en cuestion con el tamano de la submuestra.
Dicho modelo podrıa utilizarse acto seguido para predecir el valor
que deberıa tomar dicho parametro para el tamano muestral original.
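Un esbozo ilustrativo de esa idea, con valores ficticios del parámetro para varios tamaños de submuestra y un ajuste potencial por mínimos cuadrados en escala logarítmica, podría ser:

```python
# Boceto ilustrativo: ajustar h(m) ~ c * m^gamma con varios tamanos de submuestra
# y extrapolar al tamano muestral original n. Los valores de h son ficticios.
import numpy as np

m_valores = np.array([2_000, 4_000, 8_000, 16_000])
h_valores = np.array([0.120, 0.104, 0.091, 0.079])   # parametro estimado en cada submuestra

gamma, log_c = np.polyfit(np.log(m_valores), np.log(h_valores), deg=1)  # regresion log-log
n = 800_000
h_prevista = np.exp(log_c) * n ** gamma
print(f"exponente estimado: {gamma:.3f}; h prevista para n = {n}: {h_prevista:.4f}")
```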
4. Divide y venceras
Frecuentemente el tamano del conjunto de datos provoca que
los procedimientos clasicos de analisis conlleven un numero de
operaciones demasiado elevado. Como consecuencia, en esos casos
parece razonable modificar el procedimiento para que el tiempo
de ejecucion del mismo sea factible. Un simple ejemplo es el
calculo de la funcion de distribucion empırica para una muestra
de gran número de datos. Imaginemos, por ejemplo, que estamos registrando la temperatura cada segundo en cada tienda de una cadena comercial con mil establecimientos en todo el mundo. Así, si pretendemos construir la función de distribución empírica con los datos de todos los establecimientos correspondientes a los tres últimos años, nos encontramos con que deberemos utilizar n = 9,5 × 10^10 datos de temperaturas. El cálculo de la distribución
empírica requiere esencialmente ordenar la muestra, lo cual puede hacerse mediante algoritmos eficientes, como el quicksort, en $n \log_2 n$ operaciones. En este ejemplo, dado el valor de n, eso requerirá de unas 3,5 × 10^12 operaciones. Si nuestro ordenador pudiese realizar 100 millones de operaciones por segundo, necesitaríamos unos 35 000 segundos (es decir, casi 10 horas) para llevar a cabo el cálculo
de dicha funcion de distribucion empırica. Este puede resultar un
tiempo excesivo, si lo que se desea es tomar decisiones en corto
espacio de tiempo.
Supongamos ahora que simplemente deseamos calcular la temperatura mediana de esos n = 9,5 × 10^10 datos. En lugar de las $n \log_2 n$ operaciones que requeriría la ordenación de todos los datos, podríamos “romper” la muestra en m submuestras de tamaño n/m,
ordenar cada una de esas submuestras y luego utilizar las medianas
de esas submuestras para construir un estimador de la mediana
poblacional. Para hacer esto ultimo, una posibilidad serıa calcular
la mediana de todas esas medianas submuestrales. De proceder de
esta forma, el numero de operaciones necesarias para ordenar las
m submuestras sería $m \cdot \frac{n}{m}\log_2\frac{n}{m} = n \log_2\frac{n}{m}$, mientras que para ordenar esas m medianas submuestrales necesitaríamos $m \log_2 m$ operaciones. Así pues, en total harían falta un número de operaciones igual a
$$g(m) = n \log_2\frac{n}{m} + m \log_2 m = n \log_2 n - (n - m)\log_2 m.$$
En nuestro caso, utilizando que n = 9,5 × 10^10 y minimizando en m la función g(m), es fácil obtener el valor óptimo para m, que resulta ser m = 4,1 × 10^9 submuestras. Eso significa que el número de operaciones a realizar sería 5,6 × 10^11, es decir, unas 6 veces menor que el necesario con la muestra original completa. La forma de esta función puede verse en la Figura 1.
Figura 1: Número de operaciones, g(m), para el cálculo del estimador “divide y vencerás” en función del número de submuestras elegidas.
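La minimización numérica de g(m) es inmediata; el siguiente boceto (ilustrativo, ajeno al artículo) reproduce aproximadamente los valores citados en el texto:

```python
# Boceto ilustrativo: buscar el numero de submuestras m que minimiza
# g(m) = n*log2(n/m) + m*log2(m) con n = 9.5e10 (el ejemplo del texto).
import numpy as np

n = 9.5e10
m = np.logspace(3, np.log10(n), num=200_000)     # rejilla logaritmica de valores de m
g = n * np.log2(n / m) + m * np.log2(m)
i = np.argmin(g)
print(f"m optimo ~ {m[i]:.2e}, g(m) ~ {g[i]:.2e}, frente a n*log2(n) ~ {n * np.log2(n):.2e}")
# Deberia obtenerse m ~ 4.1e9 y g(m) ~ 5.6e11, unas 6 veces menos operaciones.
```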
En esta situación resulta interesante comparar estos dos estimadores, θ1 (mediana muestral basada en los n datos) y θ2 (mediana de las m medianas submuestrales), desde el punto de vista de su eficiencia estadística. Si denotamos por θ0 la mediana poblacional, es bien conocido que la distribución asintótica de θ1 es una $N\!\left(\theta_0, \frac{1}{4nf(\theta_0)}\right)$, siendo f la función de densidad de la población. Por su parte, la distribución de la mediana submuestral con submuestras de tamaño n/m sería aproximadamente una $N\!\left(\theta_0, \frac{1}{4\frac{n}{m}f(\theta_0)}\right)$ y, consecuentemente, la distribución asintótica de la mediana de las medianas submuestrales, θ2, viene dada por una $N\!\left(\theta_0, \frac{1}{4m\,g_{n,m}(\theta_0)}\right)$, siendo $g_{n,m}(x)$ la función de densidad de una distribución $N\!\left(\theta_0, \frac{1}{4\frac{n}{m}f(\theta_0)}\right)$. Utilizando la expresión de esta
densidad se obtiene
$$g_{n,m}(\theta_0) = \frac{1}{\frac{1}{2\sqrt{\frac{n}{m}}\sqrt{f(\theta_0)}}\,\sqrt{2\pi}}\,\exp\!\left(-\frac{(\theta_0-\theta_0)^2}{\frac{2}{4\frac{n}{m}f(\theta_0)}}\right) = \frac{2\sqrt{\frac{n}{m}}\sqrt{f(\theta_0)}}{\sqrt{2\pi}},$$
con lo cual la distribución asintótica de θ2 resulta ser una $N\!\left(\theta_0, \frac{\sqrt{2\pi}}{8\sqrt{mn}\,\sqrt{f(\theta_0)}}\right)$. En particular, ambos estimadores son asintóticamente insesgados y sus varianzas asintóticas resultan
$$Var\!\left(\theta_1\right) \simeq \frac{1}{4nf(\theta_0)}, \qquad Var\!\left(\theta_2\right) \simeq \frac{\sqrt{2\pi}}{8\sqrt{mn}\,\sqrt{f(\theta_0)}}.$$
Así, θ2 es asintóticamente más eficiente que θ1 si y solo si $m > \frac{1}{2}\pi n f(\theta_0)$. Para el caso de una población normal de desviación típica σ (su media será θ0, igual a su mediana, por ser la normal simétrica), esta condición resulta $m > \frac{\sqrt{\pi}}{2\sqrt{2}\,\sigma}\, n$, que viene a imponer que el número de submuestras a elegir para que el nuevo estimador sea asintóticamente más eficiente que el clásico no puede ser excesivamente pequeño, en términos del tamaño muestral de la muestra original y de la desviación típica de la población. Si σ es un valor grande, estamos dando mucha libertad para la elección de m, pero si σ es pequeño la situación es la contraria. Así, por ejemplo, si $\sigma < \frac{\sqrt{\pi}}{2\sqrt{2}} = 0{,}62666$ el estimador θ2 no sería
asintoticamente mas eficiente que θ1 para ninguna eleccion de m,
supuesta una distribucion poblacional normal. No obstante, θ2 sı
serıa mas eficiente computacionalmente que θ1 y posiblemente su
varianza no se vería muy afectada. Para nuestro ejemplo, con n = 9,5 × 10^10, m = 4,1 × 10^9 y considerando una población normal, las
desviaciones típicas asintóticas de θ1 y θ2 serían
$$\sqrt{Var\!\left(\theta_1\right)} \simeq 2{,}5683 \times 10^{-6}\,\sqrt{\sigma}, \qquad \sqrt{Var\!\left(\theta_2\right)} \simeq 5{,}0136 \times 10^{-6}\,\sqrt[4]{\sigma}.$$
Puede verse entonces que, para σ = 1 la desviacion tıpica del
estimador θ2 es alrededor del doble que la de θ1, pero ambas son
realmente muy pequenas. Si consideramos otros casos mas extremos,
como σ = 0,0001 o σ = 10000, obtenemos que, en el primero de ellos, la desviación típica de θ2 es alrededor de 20 veces mayor que la de θ1 (pero ambas muy pequeñas, de órdenes $10^{-7}$ y $10^{-8}$), mientras que en el segundo, la desviación típica de θ2 es alrededor de 5 veces menor que la de θ1, siendo ambas bastante pequeñas, de órdenes $10^{-5}$ y $10^{-4}$, respectivamente.
Otra forma razonable de proceder para “integrar” la informacion
de las m medianas submuestrales, serıa calcular la media (en
lugar de la mediana) de las medianas submuestrales. Denotemos
dicho estimador por θ3. Dado que la distribución de la mediana submuestral es aproximadamente una $N\!\left(\theta_0, \frac{1}{4\frac{n}{m}f(\theta_0)}\right)$, la varianza asintótica de θ3 (media muestral de las medianas submuestrales) resulta
$$\frac{\frac{1}{4\frac{n}{m}f(\theta_0)}}{m} = \frac{1}{4nf(\theta_0)}$$
y, por tanto, θ3 sigue aproximadamente una $N\!\left(\theta_0, \frac{1}{4nf(\theta_0)}\right)$, es decir, la misma distribución asintótica que θ1,
sin que para ello influya (asintoticamente, al menos) la eleccion de m.
Sin embargo el numero de operaciones necesarias para calcular θ3 es
incluso un poco menor que para calcular θ2, es decir, bastante menor
que para calcular θ1. Esto da pie a concluir que en una situacion
como esta sería muy razonable dividir el conjunto de n = 9,5 × 10^10 datos en m = 4,1 × 10^9 submuestras, calcular con cada una la
mediana submuestral y finalmente calcular la media de todas esas
m medianas.
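Los tres estimadores comparados en esta sección pueden esbozarse así (ejemplo ilustrativo con datos simulados y tamaños reducidos respecto a los del artículo):

```python
# Boceto ilustrativo de la estrategia "divide y venceras" para la mediana:
# theta1 = mediana global, theta2 = mediana de medianas, theta3 = media de medianas.
import numpy as np

rng = np.random.default_rng(3)
n, m = 10_000_000, 1_000                 # tamanos reducidos respecto al ejemplo del texto
datos = rng.normal(loc=5.0, scale=1.0, size=n)

theta1 = np.median(datos)                # mediana calculada sobre toda la muestra
bloques = datos.reshape(m, n // m)       # m submuestras de tamano n/m
medianas = np.median(bloques, axis=1)    # mediana de cada submuestra
theta2 = np.median(medianas)             # mediana de las m medianas submuestrales
theta3 = np.mean(medianas)               # media de las m medianas submuestrales

print(f"theta1 = {theta1:.5f}, theta2 = {theta2:.5f}, theta3 = {theta3:.5f}")
```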
5. Bootstrap con replicas minimalistas
Recientemente, en una colaboracion de nuestro grupo de
investigacion con un grupo de oncologos nos vimos en la necesidad
de utilizar un metodo bootstrap para contrastar la significacion
de variables relacionadas con metilaciones en datos procedentes de
sarcomas. Como es bien sabido, aunque los metodos bootstrap son
computacionalmente costosos, la potencia actual de los ordenadores
permite ejecutarlos en muy pocos segundos. En nuestro caso, el
numero de datos, n = 300, era moderado, sin embargo, en el
problema que nos ocupaba el numero de potenciales variables
explicativas era cercano a las 400 000, lo cual provocaba un factor de
enlentecimiento tal, que los analisis necesarios podrıan demorarse
durante anos. ¿Como completar entonces todo el analisis en un
tiempo razonable? Veamoslo con un poco mas de detalle.
Dado que pretendıamos realizar contrastes de significacion para
cada una de las k = 400 000 variables, siendo el numero de contrastes
de hipotesis tan elevado, se impone utilizar una tecnica que controle
la tasa de falsos positivos (FDR). En concreto utilizamos los metodos
de Benjamini and Hochberg (1995) y Benjamini and Yekutieli (2001).
Estos metodos se basan en ordenar los p-valores de menor a mayor:
$p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(k)}$ y compararlos con diversos umbrales calculados a partir del nivel de significación global prefijado, α, mediante la condición $p_{(i)} \leq \frac{\alpha}{k-i+1}$, encontrando cuál es el máximo valor de i que cumple dicha condición. Como consecuencia, el p-valor más pequeño ha de compararse con $\frac{\alpha}{k}$, que en nuestro caso, usando α = 0,05 y teniendo en cuenta que k = 400 000, resultaba ser $\frac{\alpha}{k} = 1{,}25 \times 10^{-7}$, un valor muy pequeño. Como además necesitábamos
utilizar el metodo bootstrap para aproximar estos p-valores, ello
significaba tener que simular un gran numero de replicas bootstrap
y con ellas calcular la proporcion de veces en las que la version
bootstrap del estadıstico ofrecıa un valor menor que el estadıstico en
la muestra original, siendo dicha proporcion precisamente el p-valor
aproximado por bootstrap. Como se ha de hacer (entre otros cientos de miles) la comparación $p_{(1)} \leq 1{,}25 \times 10^{-7}$, es evidente que el número de réplicas bootstrap necesarias ha de ser algo mayor que $\frac{1}{1{,}25 \times 10^{-7}} = 8{,}0 \times 10^{6}$, es decir, algo mayor que 8 millones. Por ejemplo, unos 100 millones de réplicas bootstrap serían un número razonable. Sin embargo, dada la complejidad del estadístico (de orden cuadrático en el tamaño muestral, n) y el número de variables sobre las que implementar el bootstrap (k = 400 000), el número de operaciones necesarias sería del orden de $300^2 \times 400\,000 \times 10^8 = 3{,}6 \times 10^{18}$, que en un ordenador que realice unos 1000 millones de operaciones por segundo llevaría un tiempo de ejecución de unos... ¡114 años!
En realidad, el motivo de lanzar un número tan grande ($10^8$) de réplicas bootstrap viene de la necesidad de hacer comparaciones del tipo $p_{(i)} \leq \frac{\alpha}{k-i+1}$; pero, obviamente, cuando con unas pocas réplicas bootstrap (pongamos 10) el p-valor estimado por bootstrap sea al menos 0,1, es obvio que para ese índice, i, no va a ocurrir que $p_{(i)} \leq \frac{\alpha}{k-i+1}$, sino todo lo contrario. Esto significa que podríamos hacer una primera ronda de procedimientos bootstrap, para cada una de las k variables, con solo B = 10 réplicas bootstrap, y solo aumentar el valor de B (por ejemplo, al valor B = 100) para aquellas variables para las que el p-valor obtenido por bootstrap con solo 10 réplicas haya sido 0 (pues en las demás habrá sido de al menos 0,1).
Es de esperar que la inmensa mayoría de las variables (pongamos 380 000, el 95 % de ellas) estén en esa situación de que p ≥ 0,1,
y “solo” para las 20 000 restantes habrıa que aumentar el numero
de replicas, pongamos al valor B = 100. Ahora procederıamos
a calcular los nuevos p-valores para esas (supongamos) 20 000
variables, teniendo presente que, como ya hay unos 380 000 p-valores mayores que α = 0,05, estos otros 20 000 p-valores, calculados con B = 100 réplicas bootstrap, han de compararse con umbrales de la forma $\frac{\alpha}{k-i+1}$, que para $i = 1, \ldots, 20\,000$ son a lo sumo $\frac{0{,}05}{400\,000-20\,000+1} = 1{,}32 \times 10^{-7}$. De esta manera, es de esperar que en torno al 95 % de esos p-valores sean al menos 0,01 (con lo cual, mucho mayores que $1{,}32 \times 10^{-7}$), teniendo que aumentar el valor de B (pongamos a B = 1000) para los 1000 p-valores restantes. Continuando con
un procedimiento de esta forma y denotando por $\ell$ el número de variables que resultaran finalmente significativas (para las cuales sí necesitaríamos del orden de $10^8$ réplicas bootstrap), el número de réplicas bootstrap totales necesarias sería aproximadamente del orden de
$$380\,000 \cdot 10 + 19\,000 \cdot 100 + 950 \cdot 1\,000 + 47 \cdot 10\,000 + 2 \cdot 100\,000 + \ell \cdot 10^8$$
$$= 7{,}32 \times 10^6 + \ell \cdot 10^8 = (\ell + 0{,}0732) \cdot 10^8 \simeq 10^8 \ell,$$
con lo que el número de operaciones necesarias sería $300^2 \times 10^8 \ell = 9 \times 10^{12}\ell$. Así pues, el cociente entre el número de operaciones necesarias con el método standard y el número de operaciones necesarias con este bootstrap “minimalista en réplicas” resulta $\frac{3{,}6 \times 10^{18}}{9 \times 10^{12}\ell} = \frac{4 \times 10^5}{\ell}$.
Si, por ejemplo, finalmente resultase que hay $\ell = 100$ de las 400 000 variables significativas, entonces este nuevo método requeriría 4000
veces menos operaciones que el metodo clasico, lo cual supone que, en
un ordenador que realice 1000 millones de operaciones por segundo,
los 114 anos de tiempo de ejecucion del metodo standard pasen
a ser de solo algo mas de 10 dıas en este metodo optimizado. La
repercusion de este cambio es absolutamente crucial.
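La lógica de esta estrategia de réplicas crecientes puede esbozarse de forma genérica; el siguiente boceto es ilustrativo y no corresponde al código real del estudio, y la función pvalor_bootstrap es hipotética (representaría el cálculo del p-valor bootstrap de una variable con B réplicas):

```python
# Boceto ilustrativo del bootstrap "minimalista en replicas": solo se aumenta B
# para las variables cuyo p-valor estimado sigue siendo 0 en la ronda anterior.
# La funcion 'pvalor_bootstrap' es hipotetica; no pertenece al estudio citado.

def bootstrap_minimalista(variables, pvalor_bootstrap,
                          rondas_B=(10, 100, 1_000, 10_000, 100_000, 10**8)):
    """Devuelve un diccionario variable -> p-valor aproximado por bootstrap."""
    pendientes = list(variables)
    pvalores = {}
    for B in rondas_B:
        todavia_cero = []
        for v in pendientes:
            p = pvalor_bootstrap(v, B)   # proporcion de replicas mas extremas que el estadistico original
            pvalores[v] = p
            if p == 0.0:                 # p-valor aun indistinguible de 0: necesita mas replicas
                todavia_cero.append(v)
        pendientes = todavia_cero
        if not pendientes:
            break
    return pvalores

# Ejemplo minimo de uso con un calculo de p-valor simulado
import random
resultado = bootstrap_minimalista(range(20), lambda v, B: sum(random.random() < 0.3 for _ in range(B)) / B)
```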
6. Big Data . . . Big Bias
Segun nos adentramos en la era de los Big Data esta
imponiendose la (a menudo falsa) idea de que los conjuntos masivos
de datos reflejan la verdad absoluta. Brooks (2013) califica esta
concepcion como el “datismo”. Sin embargo frecuentemente los
datos contienen sesgos ocultos que a menudo provienen de su
procedimiento de recogida, especialmente para los metodos de
muestreo en los que los individuos de la muestra se autoseleccionan.
Un caso citado por Crawford (2013) es la base de datos de mas
de 20 millones de tweets, originados por el huracan Sandy en
octubre-noviembre de 2012. Un analisis combinado de los datos de
Twitter y los procedentes de Foursquare permitio obtener algunas
conclusiones previsibles, como un aumento de gastos en alimentacion
la noche previa a la tormenta, y otras mas sorprendentes, como un
incremento en la vida nocturna el dıa siguiente al huracan. Este
es un caso en el que los datos no son una muestra “insesgada”
de la poblacion que estamos estudiando. Ası, la gran mayorıa de
los tweets sobre Sandy provinieron de Manhattan, debido al alto
numero de propietarios de telefonos inteligentes en Nueva York. En
las zonas mas afectadas por el desastre se originaron pocos mensajes.
No solo por la menor penetracion del mercado de smartphones en
esas zonas, sino, sobre todo, porque los cortes electricos en esas areas
mas afectadas provocaron muchos problemas de acceso a internet y
provocaron tambien que muchos de esos telefonos se quedaran sin
baterıa en las horas posteriores a la tormenta.
Otro ejemplo muy interesante mencionado por Crawford (2013)
es el de los datos recopilados en la ciudad de Boston a partir de
StreetBump, la aplicacion para telefonos inteligentes que detecta,
de forma pasiva, la existencia de baches a partir de los registros
de los acelerometros de los smartphones y de los datos del GPS
durante la conduccion de un automovil. Los datos se envıan al
Departamento de Trafico de la Ciudad de Boston, que ası puede
planificar con eficiencia la reparacion de los baches, optimizando
recursos y ahorrando tiempo. En este caso, uno de los problemas
observados al poner en marcha el proyecto fue que algunos segmentos
de la poblacion de la ciudad de Boston (como las clases menos
favorecidas) tienen una baja tasa de uso de telefonos inteligentes.
Ademas esa tasa es aun menor para grupos de edad avanzada, con
lo cual esos datos proporcionan una muestra muy sesgada (aunque
grande, en numero) de la poblacion de baches existentes en la
ciudad. Eso provoca una infraestimacion del numero de baches en
determinados barrios de la ciudad, con la consiguiente deficiencia
sobre la planificacion optima.
Recientemente Cao and Borrajo (2018) han considerado el
problema de estimacion de la media, en un contexto no parametrico,
cuando disponemos de datos de gran volumen (Big Data) pero
sesgados, proponiendo metodos para corregir el problema causado
por el sesgo en dos situaciones diferentes: (i) cuando existe la
posibilidad de obtener una muestra aleatoria simple (de mucho
menor tamano) de la poblacion original y (ii) cuando el mecanismo
que provoca el sesgo puede replicarse sobre la poblacion ya sesgada
de forma que se dispone de una segunda muestra doblemente sesgada
y de pequeno tamano.
7. Conclusiones
Algunas conclusiones (muy personales) sobre lo que se ha abordado en este artículo son las siguientes:
1. Los paradigmas clasicos de la estadıstica merecen ser
revisados/actualizados en la era de los Big Data. Por ejemplo,
la teorıa asintotica puede ahora verse muy reflejada en los
datos. Por el contrario, los metodos de remuestreo podrıan
ser menos utilizados que hasta ahora, aunque posiblemente los
metodos de submuestreo tendran un gran auge. Ademas, esta
nueva realidad probablemente hara que se centre mas atencion
en los procedimientos recursivos y las tecnicas de reduccion de
la dimension, como ya ocurrio antano.
2. Posiblemente sera necesario introducir nuevos paradigmas,
como el analisis de la complejidad de los metodos de
inferencia, los requisitos de memoria de los mismos (eficiencia
computacional de los metodos estadısticos), la facilidad de
paralelizacion de los procedimientos y el uso de estrategias
del tipo “divide y venceras”. En resumen, la “escalabilidad” de
los metodos de analisis estadıstico cobrara una mucho mayor
importancia.
3. Los metodos de submuestreo se anticipan como una potente
herramienta de visualizacion y analisis en situaciones en las
que el desbordante tamano muestral haga impracticables las
operaciones y gráficas más sencillas. Por otra parte, en casos en los que los métodos modernos de inferencia estadística más computacionalmente costosos (como el bootstrap) aún resulten necesarios (por ejemplo, en situaciones de Big Data en horizontal pero no en vertical), será necesaria su optimización
computacional, a efectos de ahorrar muchas operaciones donde
los resultados ya son concluyentes con calculos muy poco
costosos.
4. El sesgo en los datos puede ser un problema muy considerable y
frecuente en el contexto de los Big Data. Muchos de esos datos
se autoseleccionan, con lo cual no existe un procedimiento
de muestreo, controlado por el experimentador, que permita
garantizar la representatividad del conjunto de datos de gran
volumen. Conviene detectar la presencia de esos posibles sesgos
y, si es que existen, corregirlos en la fase de analisis de datos.
En resumen, la estadıstica se dispone a afrontar nuevos retos (de
hecho lo esta haciendo ya) de la mano de la computacion: las bases de
datos, la inteligencia artificial y la computacion de altas prestaciones.
Para ello es muy importante que nosotros, los estadısticos, tomemos
la iniciativa y juguemos un papel muy relevante en esta nueva
disciplina que se esta dando en llamar la Ciencia de Datos. Creo
que la creacion de tıtulos universitarios de grado en Ciencia e
Ingenierıa de Datos, en diversos lugares de Espana, es una magnıfica
oportunidad para poner en practica esta actitud proactiva que creo
enormemente necesaria.
Referencias
[1] Benjamini Y. and Hochberg Y. (1995). Controlling the false
discovery rate: a practical and powerful approach to multiple
testing. J. R. Stat. Soc.B, 57, 289-300.
[2] Benjamini Y. and Yekutieli D. (2001). The control of the false
discovery rate in multiple testing under dependency. Ann. Stat.,
29, 1165-1188.
[3] Brooks D. (2013). The Philosophy of Data. The New York
Times, 5th of February, p. A23.
[4] Cao R. (2015). Inferencia estadıstica con datos de gran volumen.
La Gaceta de la RSME, 18, 1001-1025.
[5] Cao R. and Borrajo L. (2018) Nonparametric mean estimation
for big-but-biased data. In: E. Gil et al. (Eds.) The Mathematics
of the Uncertain. A Tribute to Pedro Gil. Studies in Systems,
Decision and Control, Springer (in press).
[6] Crawford K. (2013). The Hidden Biases in Big
Data. Harvard Business Review, 1st of April. In:
https://hbr.org/2013/04/the-hidden-biases-in-big-data
[7] Efron B. (1979). Bootstrap methods: another look at the
Jackknife. Ann. Statist., 7, 1-26.
[8] Efron B., Hastie T., Johnstone I. and Tibshirani R. (2004). Least angle regression. Ann. Stat., 32, 407-451.
[9] Pearson K. (1900). On the criterion that a given system of
deviations from the probable in the case of a correlated system
of variables is such that it can be reasonably supposed to have
arisen from random sampling. Philosophical Magazine Series,
5, 50, 157–175.
[10] Pena D. (2014). Big Data and Statistics: Trend or Change?
Boletın de Estadıstica e Investigacion Operativa, 30, 313-324.
[11] Student (1908). The probable error of a mean. Biometrika, 6,
1-25.
[12] Tibshirani R. (1996). Regression shrinkage and selection via the
Lasso. J. R. Stat. Soc.B, 58, 267-288.
Acerca del autor
Ricardo Cao es Catedratico de Estadıstica e Investigacion
Operativa en el Departamento de Matematicas de la Universidade
da Coruna, donde coordina el grupo de investigacion MODES
(modelizacion, optimizacion e inferencia estadıstica). Sus lıneas de
investigacion abarcan la inferencia no parametrica, los metodos de
remuestreo, el analisis de supervivencia, la verosimilitud empırica,
los metodos estadısticos para Big Data, el analisis de datos
funcionales y los metodos estadısticos en genomica, neurociencia,
malherbologıa y riesgo de credito. Es miembro de la Bernoulli
Society, de la Sociedad Espanola de Biometrıa y de la Sociedad
Espanola de Estadıstica e Investigacion Operativa (SEIO), a cuyo
Consejo Academico pertenecio. Es Co-Editor Jefe de la revista
Computational Statistics (2016-actualidad) y ha sido Editor Jefe
de la revista TEST (2009-2012) y previamente Editor Asociado
de la misma. Actualmente es ademas Editor Asociado de las
revistas Computational Statistics & Data Analysis y Journal
of Nonparametric Statistics. Ricardo Cao ha sido Presidente de European Courses in Advanced Statistics (ECAS) (2009-2014) y también su Vicepresidente (2007-2009 y 2014-2015). Es Miembro
Electo del International Statistical Institute. Fue Coordinador de
Matematicas en la Agencia Nacional de Evaluacion y Prospectiva
(ANEP) del Ministerio de Ciencia e Innovacion (2008-2011) y
Vicerrector de Investigacion y Transferencia de la Universidad de A
Coruna (enero 2012 - enero 2016). Ha dedicado parte de su trabajo
a labores de transferencia a sectores como el sanitario, el naval, el
comercial y el industrial. Es autor de siete libros docentes y mas de
150 publicaciones cientıficas de investigacion. De ellas mas de 90 son
artıculos en revistas internacionales recogidas en ISI Web of Science.
Ha dirigido diez tesis doctorales ya defendidas y en la actualidad está dirigiendo otras cuatro más. Ha sido investigador principal de algo
mas de una docena de proyectos de investigacion en convocatorias
competitivas y de once contratos de investigacion con empresas.
Boletín de Estadística e Investigación Operativa. Vol. 33, No. 3, Noviembre 2017, pp. 322-346
Special Section
Premios Incubadora de Sondeos y Experimentos
Milagros Dieguez Taboada
C.P.I. As Revoltas
Paula Blanco Mosquera
C.P.I. de San Vicente
Roberto Manın Gutierrez
I.E.S Galileo Galilei
Sabela Vazquez
I.E.S. Ibaialde de Burlada
Abstract
This new section of the journal BEIO presents the winning works of the prize “Incubadora de Sondeos y Experimentos”, organized by the Society of Statistics and Operational Research (SEIO). This contest aims to promote the teaching and learning of statistics at non-university educational levels. The aim of this publication is to disseminate these works; for this reason, the tutors have been invited to make them public, with the intention that they can serve as support material for other teachers. This first publication collects the first prizes of each of the phases of the 2016-2017 course. More information on the awarded works and their authors can be found at http://www.seio.es/Incubadora/premiados-2016-17.html.
1. Probability games and Law of Large Numbers
1.1. El proyecto
En este proyecto intentamos averiguar si nuestros companeros
de instituto colaboran cuando se les pide ayuda o vienen por
contraprestaciones que puedan obtener. Para dar respuesta a nuestra
pregunta construımos una serie de juegos de probabilidad y les
pedimos que colaborasen con nosotros haciendo los experimentos
para despues presentar los resultados a un concurso. La conclusion
fue claramente que, amigos sı, pero si hay recompensa, es decir,
creemos que nuestros companeros responden mejor si hay premio.
El proyecto sigue dos ramas bien diferenciadas, por un lado
hicimos un sondeo para analizar la respuesta del alumnado del centro
ante nuestra solicitud de ayuda y por otro lado, un plan experimental
en el que tratamos de comprobar empıricamente como la frecuencia
relativa tiende a la probabilidad.
Sondeo
Partimos de una población de 94 alumnos de secundaria y 140 de primaria. Los experimentos se realizaron por separado: mientras que con los de secundaria se llevó a cabo la investigación durante varios días, a los de primaria les dedicamos dos, uno simplemente invitándolos a colaborar y un segundo día con premio por la participación.
Sondeo secundaria
El estudio del comportamiento del alumnado de secundaria esta
dividido en tres fases:
• Primera fase: Pusimos un cartel en la puerta del aula donde estaban los juegos y solicitamos colaboración personalmente. Se realizaron los experimentos durante 16 recreos y nos ayudaron únicamente en 25 ocasiones.
• Segunda fase: Una de las alumnas salió al patio con un altavoz pidiendo colaboración y ofreciendo regalos a los que nos ayudasen. A esta fase le dedicamos 6 recreos y pasamos de tener una media diaria de 3.1, en la primera fase, a una media de 18.3 en esta segunda fase.
• Tercera fase (premio a la constancia): Aquellos que colaborasen durante diez recreos entrarían en el sorteo de una tarjeta Google Play. En esta ocasión fueron 16 recreos los que dedicamos a los experimentos y obtuvimos que aumentó de nuevo el número de veces que se realizaron estos. Pero en este caso lo que nos interesaba estudiar era si aumentaba el alumnado que colaboraba con nosotros o bien eran siempre los mismos alumnos que venían en más ocasiones.
Los resultados que obtuvimos se reflejan en la Figura 1.
En la segunda fase aumentó notablemente la afluencia de gente
pero la frecuencia con la que acudıan sigue siendo como maximo
3 mientras que en la tercera fase, aunque disminuye el numero de
alumnos que nos visitan se ve un aumento notable de las frecuencias,
destacando el alumno 215 que nos visita en 12 ocasiones. Por los
resultados obtenidos concluimos que el alumnado de secundaria no
estaba muy dispuesto a colaborar altruıstamente.
Figura 1: Observando las tres gráficas podemos ver claramente cómo evolucionó el comportamiento de los/as alumnos/as de secundaria. En la primera fase, además de venir muy pocos, la frecuencia con la que acudían era muy baja.
Sondeo primaria
Con el alumnado de primaria realizamos los experimentos
durante dos recreos y el comportamiento de estos no se manifesto
tan interesado como en el caso anterior.
• Primera fase: Simplemente pidiéndoles que colaborasen nos ayudaron 31 personas.
• Segunda fase: A aquellos que colaborasen con nosotros les regalábamos gominolas y, aunque aumentó el número de colaboradores, que en esta ocasión ascendió a 46, descendió la participación del alumnado de segundo y cuarto curso.
Por ello consideramos que aunque el alumnado de primaria tambien
estaba influenciado por los premios no era tan notable como en el
caso de secundaria.
Plan experimental
Elaboramos un total de trece juegos, entre los que había urnas con bolas para extracciones con y sin remplazamiento, barajas de cartas para estudiar tanto las frecuencias relativas de distintos sucesos elementales como de la unión e intersección de estos, una urna con calcetines para estudiar la frecuencia de sacar en dos extracciones un par concreto o cualquier par, chinchetas y tabas como ejemplo de sucesos elementales no equiprobables, la aguja de Buffon para demostrar cómo el doble del inverso de la frecuencia de corte tiende hacia π, dados y urnas como ejemplo de experimento compuesto, las puertas de Monty Hall, cruzar el río (suma de las dos caras superiores de dos dados)...
Resultados
Exponemos a continuacion los resultados de algunos de los
experimentos realizados.
Urnas. Extracciones sin remplazamiento.
El experimento consistía en extraer una bola de una urna compuesta por 4 bolas amarillas, 12 azules, 5 rojas y 8 verdes, comprobar el color y devolverla a la urna. Se hizo el experimento en 1714 ocasiones y las aproximaciones que obtuvimos se reflejan en la gráfica de la izquierda.
Monty Hall
Construimos 3 puertas, detrás de 2 de las cuales escondimos los dibujos de sendas cabras y tras la otra un coche; una vez que el jugador escoge una de las puertas, la monitora del juego le muestra otra en la que se esconde una cabra y a continuación le ofrece la posibilidad de cambiar su elección.
Se realizo este experimento 1426 veces y obtuvimos:
f(ganar coche/cambiar puerta) = 0,57
lejos de los 2/3 buscados, pero en todo caso lo que conseguimos
probar es que
f(ganar/cambiar) > f(ganar/no cambiar).
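Como complemento ilustrativo ajeno al trabajo premiado, la frecuencia teórica de 2/3 al cambiar de puerta puede comprobarse con una pequeña simulación:

```python
# Boceto ilustrativo: simulacion del problema de Monty Hall (no forma parte del trabajo premiado).
import random

def gana_cambiando():
    puertas = [0, 1, 2]
    premio = random.choice(puertas)
    eleccion = random.choice(puertas)
    # La monitora abre una puerta sin premio y distinta de la elegida
    abierta = random.choice([p for p in puertas if p != premio and p != eleccion])
    # El jugador cambia a la unica puerta restante
    nueva = next(p for p in puertas if p not in (eleccion, abierta))
    return nueva == premio

N = 100_000
frecuencia = sum(gana_cambiando() for _ in range(N)) / N
print(f"Frecuencia de ganar cambiando: {frecuencia:.3f} (valor teorico 2/3 = 0.667)")
```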
Cruzar el rıo
Para este juego elaboramos un tablero con 12 casillas para cada
uno de los dos jugadores, se lanzan dos dados y se mueve la ficha
colocada en la casilla que indica la suma de las caras superiores a la
correspondiente casilla de su oponente. Teníamos dos versiones: en la primera, cada jugador disponía de 12 fichas y debería cruzar todas ellas al otro lado del río; después de algunas tiradas le hacíamos ver al jugador la imposibilidad del suceso 1 y continuábamos con la segunda versión, en la que cada jugador apostaba por un número y el primero que obtenía ese resultado con los dados cruzaba el río y ganaba la partida.
Se hicieron un total de 1380 tiradas y obtuvimos los resultados
de los graficos de la pagina siguiente.
1.2. Conclusion
Espero que los tres ejemplos anteriormente expuestos puedan transmitirles en qué consistió nuestro proyecto. Para esta que escribe, después de años trabajando la estadística con proyectos, este fue sin duda el más completo, gratificante y con el que más disfrutaron y aprendieron las alumnas.
2. Why doesn’t my mother like white chocolate?
A statistical research in 7th grade
2.1. Objetivos
El objetivo principal de este proyecto era confirmar o rechazar la
hipotesis de que nuestras preferencias de alimentos mas o menos
amargos, picantes, salados, dulces o acidos, evolucionan con la
edad. Para llevar a cabo este trabajo nos propusimos los siguientes
objetivos especıficos:
1. Analizar si existen diferencias significativas entre las
preferencias de los alimentos entre ninos y adultos.
2. Analizar posibles diferencias por sexos.
3. De ser cierto que haya diferencias significativas entre las
preferencias de los alimentos entre ninos y adultos, analizar los
grupos de edades en las que estas se producen.
2.2. Sondeo 1: sondeo entre los alumnos y adultos del
centro escolar
En el primer sondeo se eligieron tres alimentos representando a
cada gusto: amargo, picante, salado, dulce y acido. La gente deberıa
escoger un producto de los tres ofertados. La población de este primer estudio la constituyen los alumnos de infantil, primaria y secundaria, profesores y personal no docente del CPI San Vicente de A Baña. Las variables que se estudian son: Conguitos preferido, Pringles preferidas, Pipas preferidas, Limonada dulce preferida y Limonada ácida preferida. Entre los estudiantes se contó con 93 chicos y 92 chicas, y en la población adulta la proporción de mujeres fue mucho mayor, contando con 27 mujeres y 8 hombres. La muestra usada para el estudio fueron aquellos que voluntariamente se ofrecieron a participar, que quedó en 80 alumnos y 82 alumnas, y en el caso de adultos se redujo a 5 hombres y 18 mujeres.
Resultados del primer sondeo
Las diferencias entre el chocolate favorito elegido por niños y niñas pueden observarse en la Figura 1.
Del estudio del primer sondeo se obtienen las siguientes
conclusiones:
• Claramente, el chocolate blanco es el chocolate favorito de los niños (49 %) y, sin embargo, a los adultos apenas les gusta (16.3 %).
• Mientras el chocolate favorito de los adultos es el negro (63.3 %), en los niños la predilección por este es escasa (25.8 %).
• No se aprecian demasiadas diferencias en el chocolate con leche.
• Por sexos, vemos que a los niños les gusta más el amargo que a las niñas (la segunda opción favorita de los niños es el chocolate negro; la de las niñas, el chocolate con leche).
En el caso de las patatas se concluyó lo siguiente:
• Las patatas favoritas de los niños son las muy picantes (35.8 %); sin embargo, es la opción minoritaria entre los adultos.
• Los adultos prefieren las patatas con un grado intermedio de picante (40.9 %), que es la opción menos escogida entre los niños.
• Por sexos, vemos que a los niños les gusta más el picante que a las niñas (la opción “muy picante” es su opción mayoritaria, mientras que la de las niñas es “poco picante”).
Para las pipas de girasol:
• Las diferencias entre los adultos y los niños son enormes: los niños prefieren las pipas más saladas (51 %), y las que menos eligen son las poco saladas (23,6 %), justo al revés que los adultos, que eligen mayoritariamente las que tienen poca cantidad de sal (42,9 %) y minoritariamente las muy saladas (23,8 %).
• Por sexos, vemos que a los niños les gusta más la sal que a las niñas (aunque ambos escogen mayoritariamente la opción “muy saladas”, la segunda opción de las niñas es “poco saladas”).
En el caso de la limonada dulce, los resultados fueron:
• Tanto niños como adultos prefieren la limonada muy dulce. Los porcentajes además son muy parecidos (Niños: 53,5 %; Adultos: 59,1 %). Hay pocos niños a los que les guste la limonada con poco dulce (12,7 %), pero los adultos la escogieron un 22,7 % de las veces.
• A muy pocas niñas les gusta la limonada con poco azúcar (solo un 7 % de niñas la elige, frente a un 18 % de niños).
En el caso de la limonada ácida, los resultados fueron:
• Los adultos prefieren la limonada más ácida (50 %), la opción menos valorada por los niños (25 %).
• Los niños prefieren una limonada con una acidez media (40,6 %); en cambio, entre los adultos es la opción que menos les gustó (22,7 %).
• No se aprecian grandes diferencias por sexos.
2.3. Sondeo 2: Sondeo entre la poblacion del Municipio de
A Bana
El primer sondeo evidenció la diferencia de gustos entre niños y adultos, pero no mostraba el momento en que se producían esos cambios, por lo que se amplió la población del estudio a todo el municipio de A Baña. Los resultados pueden observarse en la Figura 2.
Se eligieron los mismos productos que en el Sondeo 1, pero suprimiendo las limonadas por la dificultad de transportarlas, conservarlas y fabricarlas. Como población se tomó la de más de 4 años que vive en A Baña que, según datos del IGE (Instituto Galego de Estatística), es de 3583 personas. Se estimó, utilizando la calculadora online de la facultad de medicina de la Universidad Nacional del Nordeste de Argentina, que necesitaríamos una muestra de 186 personas para obtener un nivel de confianza del 95 % y un margen de error del 7 %.
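Presumiblemente la calculadora emplea la fórmula clásica del tamaño muestral para una proporción con población finita (esto es una suposición nuestra, no consta en el trabajo); con ella se reproduce el valor citado:

```python
# Boceto ilustrativo: tamano muestral para estimar una proporcion con poblacion finita.
# Suposiciones: p = 0.5 (caso mas desfavorable), z = 1.96 (95 % de confianza), e = 0.07.
N, z, p, e = 3583, 1.96, 0.5, 0.07
n = (N * z**2 * p * (1 - p)) / (e**2 * (N - 1) + z**2 * p * (1 - p))
print(round(n))   # ~186, el valor citado en el trabajo
```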
Finalmente, el numero total de datos recabados fue de 395,
de los que 185 fueron hombres y 215 mujeres. Como herramienta
informatica se uso la hoja de calculo de Google Drive.
Resultados del segundo sondeo
Finalmente, para detectar donde hay un cambio de gusto, lo
que hicimos fue dividir las edades en tramos quinquenales y marcar
unicamente la opcion mayoritaria de ese tramo (si habıa empate,
se senalaba con una E). De este estudio se concluyo que, para el
Chocolate:
A partir de los 35 anos, el gusto por el chocolate blanco ya no es
mayoritario.
El chocolate con leche tiene mucha mas aceptacion entre las
mujeres.
A las mujeres les gusta menos el chocolate blanco que a los
hombres.
En el caso de las patatas, se concluyó que a partir de los 50 años se prefieren las patatas sin picante. A los hombres les gusta más el picante que a las mujeres.
Para el caso de las pipas, solo los menores de 25 años prefieren mayoritariamente las pipas muy saladas; por otra parte, a los hombres les gustan más las pipas saladas, mientras que a las mujeres les gustan más las que tienen poca sal.
3. Statistical analysis of video game results
Nuestro proyecto consiste en un estudio de calculo mental usando
dos videojuegos de fabricacion propia. Una de las partes fue la
creacion de un juego usando Scratch en el que unos animales salıan
y entraban de una casa y los alumnos tenıan que contar los que
quedaban dentro de la casa al final. Se fue realizando la prueba
por las distintas clases de los cursos en nuestro instituto para poder
analizar los datos segun sexo y edad.
La otra parte está basada en la recreación de una consola casera estilo años 80 con un juego de cálculo mental programado con GameMaker. Esta consta de varios niveles en los que la dificultad
va aumentando a medida que avanzas en el juego. En cada uno
de ellos hay una serie de operaciones matematicas basicas: sumas y
restas en los primeros niveles, multiplicaciones y divisiones en niveles
mas avanzados. El jugador debe decidir si la operacion es correcta o
no. La puntuacion final de cada jugador consiste en un sistema de
calificacion en base a los aciertos y los fallos totales y se guarda en
un fichero de datos junto con su edad y sexo.
En nuestro trabajo hemos analizado y estudiado estadısticamente
los resultados obtenidos.
Las hipótesis planteadas fueron varias:
• ¿Habría diferencia por edad o sexo en los resultados?
• ¿Según avanzamos en nivel académico los resultados serán mejores?
• ¿Los grupos bilingües o con mejores resultados académicos también tendrán mejores resultados en cálculo mental?
• Las generaciones que tuvieron menor exposición a las nuevas tecnologías, ¿tendrán mejores resultados en cálculo mental que las nuevas generaciones?
Todas esas preguntas tratarıamos de resolverlas con los dos
videojuegos disenados, si bien el principal objetivo del proyecto era
medir las habilidades y el nivel de calculo mental del alumnado de
nuestro instituto y de los visitantes de la feria de la ciencia que se
viene desarrollando estos ultimos anos en nuestro IES.
Los resultados han sido comparados por totales, por sexos, por
cursos y por grupos de edad.
Los materiales empleados para la construccion de la consola
fueron:
• La carcasa de un VHS del almacén del instituto y unos altavoces
• Un teclado antiguo
• Un ordenador obsoleto de la Escuela 2.0
• Cables del taller de tecnología
• Una fuente de alimentación de un ordenador sin uso del taller de IMA
• Tres botones ARCADE
• Una pantalla sin uso del departamento de matemáticas
• Una placa microcontroladora del taller de tecnología
• Espray y pinturas
Todos los materiales fueron reutilizados o se devolvieron a los
departamentos a excepcion de los botones ARCADE, que fueron el
unico gasto real del proyecto.
Las principales conclusiones de nuestro trabajo fueron las
siguientes:
Con respecto al test:
• La media de aciertos es en general ascendente según avanzamos por niveles dentro del instituto.
• Los promedios fueron superiores para los hombres tanto en aciertos totales como en aciertos consecutivos.
Con respecto a la consola:
• Tanto para el porcentaje de aciertos como para la media de puntos, los mejores resultados se obtuvieron entre los 30 y los 40 años.
• Los promedios de los hombres fueron superiores, aunque no los consideramos significativos, ya que realizamos más de 16000 simulaciones en el ordenador y apenas el 4 % superaba la diferencia de medias recogida en la consola.
En general:
• Los promedios en ambos juegos fueron superiores en los hombres que en las mujeres.
• El cálculo mental obtiene los mejores resultados con los de 30 a 40 años.
Con estas conclusiones pretendemos dar respuesta a las hipótesis planteadas inicialmente:
• Es posible que haya diferencia entre sexos y también por edades en nuestro IES, pero no consideramos tales diferencias significativas, especialmente por nivel.
• Los resultados son ligeramente superiores por nivel, pero muy levemente, a excepción de primero de bachillerato, lo que achacamos principalmente a la muestra seleccionada.
• Los grupos con mejores resultados académicos no mostraron unos resultados superiores al resto, luego no podemos establecer diferencias a la hora de realizar cálculo mental. Probablemente sus resultados académicos dependan de otros factores como pueden ser la motivación, el esfuerzo, el interés... Nos han demostrado que son igualmente hábiles a la hora de calcular mentalmente.
• Sí que se aprecia un mejor cálculo por aquellas generaciones que están en su madurez y dependieron en menor medida de las nuevas tecnologías, ya que los mejores resultados se obtuvieron entre los 30 y los 40 años.
4. Too much homework in Burlada?
4.1. Objetivos
El objetivo principal del trabajo es conocer si el alumnado de
Burlada tiene el problema de no tener suficiente tiempo libre debido
al tiempo dedicado a los deberes, estudiar o extraescolares y si una
huelga estarıa justificada.
Para dar respuesta a lo anterior se pretende:
• Saber cuánto tiempo libre tiene el alumnado de Burlada, diferenciando entre los que hacen extraescolares y los que no, y conocer la opinión del profesorado sobre este tema. Ver además si existen diferencias entre etapas educativas.
• Saber el tiempo medio que dedica el alumnado de Burlada a tareas y estudio, diferenciarlo por etapas y ver si coincide con la opinión del profesorado. Ver además si existen diferencias entre los que hacen extraescolares y los que no las hacen.
Para recoger los datos se elaboro una encuesta con un formulario
de Google Drive que se envio por correo electronico a los centros
educativos de la localidad. Se obtuvo una muestra de 153 personas.
4.2. Sondeo
Los datos se analizaron con la hoja de calculo de Google y se
sacaron las siguientes conclusiones:
1. Solo un 14.4 % del alumnado estudia mas de 2 horas al dıa
y solo un 4,9 % le dedica a los deberes mas de 2 horas.
2. El alumnado de Burlada dedica de media 1,2 horas al dıa
a hacer los deberes y 1,51 horas al dıa a estudiar. En total dedican
2,71 horas al dıa a hacer tareas y estudiar.
3. Si diferenciamos por etapas educativas, se aprecia que
cuando se avanza de etapa se emplea menos tiempo en la realizacion
de deberes y se incrementan las horas de estudio. El tiempo medio
para hacer tareas al dıa es de 1,4 horas en primaria, 1,07 horas
en secundaria y 1,06 horas en bachillerato. Mientras que el tiempo
medio de horas al dıa dedicadas a estudiar es de 1,47 horas en
primaria, 1,5 horas en secundaria y 1,79 horas en bachillerato.
4. El alumnado dice que dedican mas tiempo al estudio y
deberes del que el profesorado cree que hace, sobre todo en el estudio,
en el que hay mas de una hora de diferencia. Se observa ademas que
esto ocurre en todas las etapas.
5. El 73.17 % de alumnos/as hacen extraescolares. Si
diferenciamos por etapas, la mayor concentracion de alumnos/as que
hacen extraescolares pertenece a educacion primaria. En educacion
primaria casi el 100 % hacen extraescolares, de educacion secundaria
obligatoria mas del 50 % y de bachillerato casi el 75 %.
6. La media de horas dedicadas a la semana por el alumnado
de Burlada a la realizacion de extraescolares es de 3,42 horas.
7. El alumnado que hace extraescolares dedica a la semana
9,34 horas a estudiar y a hacer las tareas y 3,42 a hacer las
extraescolares, 12,76 horas a la semana en total. Mientras que el
que no hace extraescolares dedica a la semana 11,87 horas solo a
hacer tareas y estudiar. Tienen por tanto casi el mismo tiempo libre
a la semana.
8. Solo en la etapa de educación secundaria el alumnado que no hace extraescolares le dedica más tiempo a estudiar y a hacer las tareas que el alumnado que sí las hace. En el resto de las etapas es al revés.
9. La media de horas libres al dıa que tienen los encuestados
es de 2,9 h. Si diferenciamos por etapas, los que mas tiempo libre
tienen son los de educacion secundaria obligatoria con 3,38 horas de
tiempo libre al dıa.
10. El 64 % de los profesores creen que sus alumnos/as no
necesitan mas tiempo libre y el 67,74 % de los alumnos/as dice lo
mismo.
Figura 2: Comparación entre los tiempos de estudio y deberes por nivel educativo.
Las conclusiones anteriores se podrían sintetizar en las siguientes:
• El alumnado de Burlada dedica de media al día 1 h 15 min a la realización de tareas y 1 h 30 min al estudio, aproximadamente.
• A mayor etapa educativa se dedica más tiempo al estudio y menos a las tareas.
• El alumnado que realiza extraescolares dedica una media de 3 h 30 min por semana a las mismas.
• El alumnado que realiza extraescolares y el que no tiene el mismo tiempo libre a la semana, ya que los primeros dedican menos a sus estudios y tareas.
• Tanto profesores como alumnos opinan que hay suficiente tiempo libre.
CONCLUSION FINAL:
El alumnado de Burlada tiene suficiente tiempo libre incluso si
hace extraescolares.