UNIVERSIDADE NOVA DE LISBOA
EXTREMUM ESTIMATORS AND
STOCHASTIC OPTIMIZATION
METHODS
by
MIGUEL DE CARVALHO
Submitted in partial fulfillment of the Requirements for the Degree of
PhD in Mathematics, in the Speciality of Statistics
in the
Faculdade de Ciências e Tecnologia
Departamento de Matemática
May 2009
“Thought is only a flash between two long nights, but this flash is everything.”
Henri Poincaré
UNIVERSIDADE NOVA DE LISBOA
Abstract
Faculdade de Ciências e Tecnologia
Departamento de Matemática
Doctor of Philosophy
by MIGUEL DE CARVALHO
Extremum estimators form one of the broadest classes of statistical methods for obtaining consistent and asymptotically normal estimates. The Ordinary Least Squares (OLS), the Generalized Method of Moments (GMM), and the Maximum Likelihood (ML) methods are all given as solutions to an optimization problem of interest, and thus are particular instances of extremum estimators. One major concern regarding the computation of estimates of this type relates to the convergence features of the method used to find the optimal solution. In fact, if the method employed can converge to a local solution, the consistency of the extremum estimator is no longer ensured. This thesis is concerned with the application of global stochastic search and optimization methods to the computation of estimates based on extremum estimators. For this purpose, a stochastic search algorithm is proposed and shown to be convergent. We provide applications to classical test functions, as well as to a problem of variance component estimation in a mixed linear model.
UNIVERSIDADE NOVA DE LISBOA
Resumo
Faculdade de Ciências e Tecnologia
Departamento de Matemática
Doctor of Philosophy
by MIGUEL DE CARVALHO
Os estimadores extremais (extremum estimators) são uma das classes mais amplas de métodos estatísticos utilizados para a obtenção de estimativas consistentes e assimptoticamente normais. O método dos mínimos quadrados, o método generalizado dos momentos, bem como os métodos de máxima verosimilhança resultam da solução de um problema de optimização, sendo consequentemente especificações particulares de estimadores extremais. Um problema relevante no cálculo de estimativas deste tipo está relacionado com as propriedades de convergência do método utilizado para obter a solução óptima. De facto, se o método utilizado convergir, eventualmente, para uma solução local, a consistência do estimador extremal deixa de ser garantida. Esta tese incide na aplicação de métodos estocásticos de pesquisa e optimização global para o cálculo de estimativas baseadas em estimadores extremais. Para o efeito, é proposto um algoritmo estocástico de pesquisa que provamos ser convergente. Neste sentido são providenciadas aplicações a funções de teste clássicas, bem como a um problema de estimação em componentes de variância num modelo linear misto.
Acknowledgements
I would like to express my hearty thanks to my advisors, Professor Tiago Mexia and Professor Manuel Esquível. Their continuous generosity and unbounded encouragement gave me the strength to carry on converging to this thesis.
I would also like to record my indebtedness to my family and friends, who have contributed to the realization of this thesis.
Unfortunately, some persons whose deeds contributed to this work will most probably never be aware of it. To my grandparents Miguel and Helena, who no longer live among us, and to Piotr Il'yich Tchaikovsky, whom I never had the opportunity to meet.
I gratefully acknowledge the financial support from FCT (Fundação para a Ciência e a Tecnologia).1
The only errors which are not observable here are the error terms of the models.
1Advanced research scholarship with reference SFRH/BD/1569/2004.
Contents
Abstract ii
Resumo iii
Acknowledgements iv
Glossary of Notation vii
List of Figures viii
List of Tables ix
1 Introduction 1
1.1 Prefatory Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Formulation and Research Goals . . . . . . . . . . . . . . . . . . 3
1.3 Contribution of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Synopsis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Stochastic Preliminaries 6
2.1 Overture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Modes of Stochastic Convergence and Stochastic Orders . . . . . . . . . . 9
2.3 Consistency of Point Estimators . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Asymptotic Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Extremum Estimators 25
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Consistency Results for Extremum Estimators . . . . . . . . . . . . . . . . 28
3.2.1 Consistency Under Compactness Assumptions . . . . . . . . . . . 29
3.2.2 Consistency for Maximum Likelihood Methods . . . . . . . . . . . 31
3.2.3 Consistency Without Compactness Assumption . . . . . . . . . . . 33
3.3 Convergence in Distribution of Extremum Estimators . . . . . . . . . . . 35
3.3.1 Interior Point Proviso . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.2 Asymptotic Normality for Maximum Likelihood Methods . . . . . 36
3.3.3 Boundary Point Proviso . . . . . . . . . . . . . . . . . . . . . . . . 37
4 Global Optimization by Stochastic Search Methods 42
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 An Overview of Random Search Techniques . . . . . . . . . . . . . . . . . 44
4.3 Recasting the Solis and Wets Framework . . . . . . . . . . . . . . . . . . . 45
4.3.1 Preliminaries and Notation . . . . . . . . . . . . . . . . . . . . . . 45
4.3.2 The Master Method . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.3 Stochastic Zigzag Methods . . . . . . . . . . . . . . . . . . . . . . 51
4.3.4 A Matrix Formulation of the Master Method . . . . . . . . . . . . 52
4.4 Convergence of the Master Method . . . . . . . . . . . . . . . . . . . . . . 57
4.5 A Note on the Construction of Confidence Intervals . . . . . . . . . . . . . 64
5 Estimation in the Mixed Model via Stochastic Optimization 66
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2.1 The Benchmark Case . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3 The Dimension Reduction Technique . . . . . . . . . . . . . . . . . . . . . 70
5.4 A Stochastic Optimization Study of the Dimension Reduction Technique . 72
6 Summary and Conclusions 74
6.1 Closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2 Open Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
A A Short Note Regarding the Essential Supremum 78
B van der Corput’s Sublevel Set Estimates 81
C Construction of Confidence Intervals for the Minimum 83
C.1 Tables for de Haan’s Method . . . . . . . . . . . . . . . . . . . . . . . . . 83
C.2 Functional Form of the Classical Test Functions . . . . . . . . . . . . . . . 88
Bibliography 89
Glossary of Notation
N  N ∪ {∞}
∂(A)  boundary of the set A
int(A)  interior of the set A
cl(A)  closure of the set A
Df(x)  gradient of the vector function f(x)
D²f(x)  Hessian of the vector function f(x)
Cⁿ(A)  set of n-times continuously differentiable functions defined on a set A
In  identity matrix of size n
0n×m  null matrix of size n × m
E[X]  expectation of the random vector X
V[X]  covariance matrix of the random vector X
N(µ, V)  normal random vector with mean vector µ and covariance matrix V
un = O(1)  the sequence of real numbers un is bounded
un = o(1)  the sequence of real numbers un converges to zero
Un →p u  the sequence of random variables Un converges in probability to u
Un →D u  the sequence of random variables Un converges in distribution to u
Un →a.s. u  the sequence of random variables Un converges almost surely to u
Un = Op(1)  the sequence of random variables Un is stochastically bounded
Un = op(1)  the sequence of random variables Un converges in probability to 0
a ∨ b  max{a, b}
a ∧ b  min{a, b}
‖ · ‖r  Lr-norm
‖ · ‖∞  infinity norm
List of Figures
2.1 Uniform Convergence in Probability . . . . . . . . . . . . . . . . . . . . . 10
2.2 The logical relation between the stochastic convergence concepts introduced. 15
3.1 Assumptions (1, 2, 3) imply consistency of the extremum estimator. . . . . 30
3.2 The picture behind Proposition 3.4 . . . . . . . . . . . . . . . . . . . . . . 33
4.1 A sketch which portrays the reasoning involved in the proof of Theorem 4.5. This instance is used just to provide guidance, and it is not part of the proof. . . . . 47
4.2 The initialization of the stochastic zigzag method. In the picture on the left we start by finding points a and b which initialize the algorithm. The second picture illustrates that in Step 1 we collect a random sample (c = 10) from the line which passes through a and b. The remaining picture depicts Steps 2 and 3, wherein after the maximum of the first line we generate another seed and start by extracting a sample from the new line which passes by such points. . . . . 52
4.3 The application of the stochastic zigzag method to the Styblinski–Tang test function. This function pertains to a class of test functions which are typically used to assess the performance of an optimization algorithm (see e.g. Spall [50]). The functional form of this function is given in formula (4.6). . . . . 53
List of Tables
5.1 Estimates of the variance components in Model I. . . . . . . . . . . . . . . 72
5.2 Estimates of the variance components in Model II. . . . . . . . . . . . . . 73
5.3 Estimates of the variance components in Model III. . . . . . . . . . . . . . 73
C.1 Search domains of the test functions used and their corresponding global minimum value, denoted by m∗ . . . . 83
C.2 (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 10,000 observations . . . . 84
C.3 Sample variances for the upper and lower bounds of the (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 10,000 observations . . . . 84
C.4 (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 20,000 observations . . . . 85
C.5 Sample variances for the lower and upper bounds of the (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 20,000 observations . . . . 85
C.6 (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 100,000 observations . . . . 86
C.7 Sample variances for the lower and upper bounds of the (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 100,000 observations . . . . 86
C.8 (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 500,000 observations . . . . 87
C.9 Sample variances for the lower and upper bounds of the (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 500,000 observations . . . . 87
Dedicated to Vanda. . .
Chapter 1
Introduction
1.1 Prefatory Remarks
A broad class of estimators can be formulated as the solution to an optimization problem of interest. This class of estimators is known in the literature as extremum estimators (see Amemiya [1]; Newey and McFadden [35]). The classical Ordinary Least Squares (OLS), the Generalized Method of Moments (GMM), and the Maximum Likelihood (ML) methods are all instances of extremum estimators. These instances of extremum estimators, as well as their corresponding affinity, will become clearer in Chapter 3. For now, we keep the discussion at a perfunctory level.
One of the advantages of considering this general class of estimators is that it is possible
to build an elegant asymptotic theory which reduces to a set of general results. Concerning consistency, we will examine which assumptions have to be made in order to establish the weak consistency of an extremum estimator. As we shall see below, only mild requirements are needed to achieve this result. Strong consistency can also be obtained under slightly more demanding conditions. In addition, if further assumptions are made, this class of estimators can also be ensured to be asymptotically normal.
Despite their appealing features, in many cases of practical interest these estimators are not analytically tractable. Stated differently, in some relevant instances of extremum estimators we frequently lack a closed-form solution for the estimates. A solution frequently adopted is to rely on some optimization algorithm to obtain the estimates. Two questions then arise. First, is there a method which outperforms all the others? Second, what type of algorithm should we use to perform the optimization? Some brief comments regarding these questions are in order. Concerning the first question, an answer is provided by the No Free Lunch
Theorem: an impossibility theorem which precludes the existence of a general-purpose strategy, robust a priori to any type of optimization problem (Wolpert and Macready [58]). Concerning the second question, a major concern regarding the computation of estimates of this type relates to the convergence features of the method used to find the optimal solution. In fact, if the method employed converges to a local solution, the consistency of the extremum estimator is no longer ensured (cf. [19, 35]). Hence, one should avoid relying on optimization methods which can converge to a local solution, given that it is the global solution that possesses the noteworthy asymptotic features. Loosely speaking, two types of optimization algorithms are typically adopted to tackle this problem, namely deterministic and stochastic optimization methods. The former include the Newton–Raphson and steepest descent methods, among many others (see [36] and references therein). In this thesis, however, the focus will be placed on stochastic optimization algorithms. Specifically, this thesis is concerned with the application of global stochastic search and optimization methods to the computation of estimates based on extremum estimators. Here, we follow Spall [50] in referring to stochastic search and optimization algorithms under the conditions that follow.
Stochastic Search and Optimization
I. There is some random noise in the measurements of the objective function;
(and/or)
II. There is a random choice in the search direction as the algorithm iterates
toward a solution.
It is important to underline that this thesis is exclusively concerned with item II. Namely, the focus is placed on optimization algorithms wherein the search direction is randomly dictated. Such algorithms include the pure random search (Solis and Wets [49]), the simulated annealing technique (Bohachevsky et al. [5]), genetic algorithms (Goldberg [21]), the conditional martingale algorithm (see Esquível [17]), etc. Stochastic search and optimization algorithms find application in a wide variety of scenarios. The scope of this topic is broad enough to comprise applications ranging from game theory (Pakes and McGuire [40]) to the clustering of multivariate data (Booth et al. [8]). A nice overview of stochastic search and optimization methods can be found, for instance, in Duflo [16] or in Spall [50]. While the former is more oriented toward the theoretical features of the methods, and more apropos for the reader acquainted with the French language, the latter is written in English and maintains an interesting balance between technical attributes and applications.
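To fix ideas, the pure random search mentioned above can be sketched in a few lines. The sketch below is ours and only illustrative: the test function (Styblinski–Tang, a standard choice), the box domain, and the iteration budget are our own assumptions, not details taken from the references; the scheme simply draws candidates uniformly at random and keeps the best point seen so far, in the spirit of the conceptual algorithm of Solis and Wets [49].

```python
import numpy as np

def styblinski_tang(x):
    """Styblinski-Tang test function; in 2D the global minimum is
    approximately -78.332, attained near x_i = -2.9035."""
    return 0.5 * np.sum(x**4 - 16.0 * x**2 + 5.0 * x)

def pure_random_search(f, lower, upper, n_iter=20000, seed=0):
    """Type-II stochastic search: the candidate points (and hence the
    search direction) are chosen at random; we keep the incumbent best."""
    rng = np.random.default_rng(seed)
    best_x, best_f = None, np.inf
    for _ in range(n_iter):
        x = rng.uniform(lower, upper)   # uniform draw from the box domain
        fx = f(x)
        if fx < best_f:                 # accept only improvements
            best_x, best_f = x, fx
    return best_x, best_f

x_star, f_star = pure_random_search(styblinski_tang,
                                    lower=np.array([-5.0, -5.0]),
                                    upper=np.array([5.0, 5.0]))
```

With a modest budget the incumbent minimum already lands close to the global one; of course, by the No Free Lunch Theorem, no such default configuration can be expected to work uniformly well across problems.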
The next section introduces the specific problem which is located at the core of our
interests, as well as the research goals undertaken in this thesis.
1.2 Problem Formulation and Research Goals
Any estimator which can be formulated as a solution to the optimization problem stated
below is tantamount to an extremum estimator.
Problem Formulation
max_{θ ∈ Θ} T_n(θ).  (1.1)
Some remarks regarding notation: Θ ⊆ Rk denotes a parameter space; n is the sample size; and the objective function which yields the extremum estimator is denoted by Tn. In Chapter 3, we elaborate on some of the properties shared by the broad class of estimators which solve this unconstrained optimization problem. In the sequel, the research goals are summarized.
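As a concrete illustration of (1.1), the OLS estimator solves the problem with one standard choice of objective, T_n(θ) = −(1/n)‖y − Xθ‖². The snippet below is a sketch on simulated data (all names and constants are ours): it computes the closed-form least squares solution and checks numerically that no random perturbation of it attains a larger value of T_n.

```python
import numpy as np

# OLS as an extremum estimator: with T_n(theta) = -(1/n)||y - X theta||^2,
# the maximizer of (1.1) is the least squares solution. Simulated data.
rng = np.random.default_rng(42)
n, k = 200, 3
X = rng.normal(size=(n, k))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=n)

def T_n(theta):
    """Sample objective function of the extremum estimator."""
    r = y - X @ theta
    return -(r @ r) / n

# Closed-form maximizer of T_n (the OLS estimate).
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# T_n is concave, so theta_hat beats any randomly perturbed parameter value.
assert all(T_n(theta_hat) >= T_n(theta_hat + rng.normal(scale=0.5, size=k))
           for _ in range(100))
```

In cases such as this one, the maximizer is available in closed form; the stochastic search methods of Chapter 4 are aimed precisely at the instances of (1.1) where it is not.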
Research Goals
The general goal of this research is to develop a general stochastic optimization method
for solving the optimization problem (1.1). Notwithstanding, the procedures to be developed throughout this thesis can also be applied in other contexts of interest, where an optimization problem identical to (1.1) arises. A stochastic optimization method in which
we are particularly interested is the stochastic zigzag method. This method is largely
inspired by the seminal work of Mexia et al. [32]. The idea is to give a stochastic optimization structure to a method which has proved to be successful in a broad variety of contexts, namely the 'classical' zigzag algorithm (see e.g. Oliveira et al. [38] and references therein). It is not our intention here to elaborate on the latter (this was already done elsewhere; see e.g. Nunes et al. [37]), but rather on its stochastic variant.
After the convergence of the stochastic zigzag method has been addressed, it can be
useful to raise the level of abstraction.1 In fact, if we concentrate on the features of the
method which are really important to establish convergence, we may be able to obtain
the convergence under a more general setting. Of course there is a tradeoff between the
level of abstraction that one is willing to accept, and the possibility of being (un)able to
establish convergence, which one should try to control in a proper manner. A battery
of practical implementations should also be conducted as a means to assess the perfor-
mance of the developed method. Preferably, the method should be applied in the
context of problem (1.1). However, as mentioned above, other optimization problems of
interest can also be used to evaluate the functioning of the algorithm.
1It should be stressed that, for the sake of exposition, in this thesis the results are presented in the opposite way. Hence, we establish the convergence of the master method, from which the convergence of its particular cases follows directly.
1.3 Contribution of this Thesis
In this thesis we contribute by proposing a master method which includes several other
stochastic optimization algorithms. The method proposed here is broad enough to en-
compass the conceptual algorithm of Solis and Wets [49], as a particular case. A note-
worthy embodiment of the master method in which we are particularly interested is
the stochastic zigzag method, an optimization algorithm which is based on the prominent work of Mexia et al. [32]. The master method is presented in a twofold manner.
First, we present an algorithmic version of the method. Second, we provide a matrix
formulation. The latter form of conceptualization of the method brings into light some
features which are less noticeable in the algorithmic version. At the core of the matrix
formulation lies the iterative matrix Z which can also be used as a masterplan for prac-
tical implementations. Each row of the iterative matrix is composed of a seed and a course. Roughly speaking, the seed is a point chosen at random from the domain of the function, and the course holds the remaining iterations of the method (arranged by order of extraction). In the particular case of the stochastic zigzag method, we show how to take advantage of this matrix framework in order to reduce the burden of implementation. In this spirit, a result which is remarkably easy to establish, the Kronecker–zigzag decomposition, allows us to easily construct an entire course of the iterative matrix. The convergence of the master method also plays a part in the contribution of this thesis. It is important to note that the convergence results established here start from hypotheses identical to the ones considered in the literature concerning the convergence of the random search method (cf. Solis and Wets [49]; Esquível
[17]). Given that the stochastic zigzag method is a particular instance of the master
method, the convergence of this algorithm also follows as a consequence. In addition,
we show how to construct confidence intervals for the maximum of the function, starting from the first column of the iterative matrix Z. The construction is direct, being based on a known result in extreme value theory due to de Haan [14].
An application is then provided concerning estimation in the mixed linear model, through the Maximum Likelihood (hereinafter ML) methods. Before we move on to the optimization procedure, we establish that in a mixed linear model the ML problem can be rewritten as a much simpler optimization problem (the simplified problem) in which the search domain is a compact set whose size depends only on the number of variance components. As can be readily noticed, from the estimation standpoint this feature is extremely advantageous, and the simplified problem will allow us to solve the ML problem with large computational savings. The results show an overall good performance of some instances of the master method proposed here.
1.4 Synopsis
This thesis is written as a fugue, interweaving the themes of extremum estimators and
stochastic optimization. Below we offer a plan of the work developed in the pages that
follow.
In Chapter 2 we consider some preliminary concepts and results from asymptotic theory.
The exposition starts with the introduction of convergence modes of stochastic interest.
These concepts will provide us with the basic framework for the introduction of the several
types of consistency of point estimators, as well as for the study of other large sample
properties.
In Chapter 3, we concern ourselves with the introduction of extremum estimators. In
this part of the thesis we also shed some light on the large sample properties of ex-
tremum estimators. The focus is placed on the circumstances under which one can
ensure consistency and asymptotic normality of this broad class of estimators.
A general stochastic optimization method is proposed in Chapter 4. Here, we start
by recasting the theoretical underpinning of Solis and Wets [49]. Following this, we
introduce the algorithmic version of the master method, and stress some of its instances.
The algorithmic version of the master method is then counterpointed with a matrix
version. After a thorough exposition of the method, we establish its convergence. This
chapter ends with a brief note on the construction of the confidence intervals for the
extremum departing from the first column of the iterative matrix.
Chapter 5 is devoted to the particular case of computing maximum likelihood estimates of variance components in mixed linear models. In this framework, we show that the maximum likelihood problem can be rewritten as a much simpler optimization problem in which the search domain is a compact set whose size depends only on the number of variance components. This result will prove to be particularly useful for the application of specific instances of the master method to the computation of ML estimates of the variance components.
The thesis closes in Chapter 6 with a short conclusion and with the proposal of some
future directions for our research agenda.
Chapter 2
Stochastic Preliminaries
This chapter is devoted to the introduction of some preliminary concepts and results suited for our purposes. The exposition in the sequel was largely inspired by the prominent textbooks of Shao [48] and Sen and Singer [46]. For the sake of completeness, in the first section we formally define the structures of interest for our exposition. Furthermore, we also provide some relevant comments on the notation to be used hereafter.
We note that in the exposition that follows, we will frequently assign a name to a theorem. We are aware that some historical inaccuracies may arise as a consequence of adopting this approach, much in the spirit of Stigler's law, whereby almost nothing is named after the person who invented it, but we do it for the sake of exposition, and we believe that the benefits of doing so outweigh the costs.
2.1 Overture
The exposition opens with the introduction of the primitive structures of interest. We
start with the definition of linear space. Roughly speaking, a linear space is a collection
of vectors or matrices which is closed under linear combinations. Formally, we have:
Definition 2.1. Consider a field K. A collection of vectors or matrices X is a linear
space if for all x,y ∈ X, and κ ∈ K, the following conditions hold
1. x + y ∈ X,
2. κx ∈ X.
If we drop the first condition of the linear space definition and consider the particular case K = R+, we get another well-known structure. Namely, a collection of vectors
or matrices Λ is a cone if it is closed under the multiplication by a positive real scalar.
Formally, we have the following definition.
Definition 2.2. A collection of vectors or matrices Λ is a cone, if for all λ ∈ Λ, and
κ ∈ R+, the following condition holds
κλ ∈ Λ.
Throughout the exposition, we will also make use of the concept of a norm.
Definition 2.3. A function ‖ · ‖ from a linear space X to R+0 is a norm, if the following
conditions hold:
i) ‖x‖ ≥ 0, for all x ∈ X, with ‖x‖ = 0 if and only if x = 0;
ii) ‖κx‖ = |κ| ‖x‖, for all x ∈ X and all scalars κ;
iii) ‖x + y‖ ≤ ‖x‖ + ‖y‖, for all x, y ∈ X.
In the sequel, we provide some instances of norms which will prove to be useful in the exposition that follows. For this purpose, let x = [ x1 · · · xn ]. First, we introduce the Lr-norm

‖x‖r := [ ∑_{i=1}^{n} |xi|^r ]^{1/r}.
In the particular case r = 2, we get the well-known Euclidean norm. For ease of notation, we will omit the subscript if r = 2, i.e., we take ‖x‖ := ‖x‖2. Further, we introduce the quadratic norm

‖x‖W := (xᵀW x)^{1/2},

where W is a given positive definite matrix. Note that ‖x‖I = ‖x‖, i.e., the particular choice W = I leads to the Euclidean norm.
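The norms above can be sketched numerically as follows (the helper names are ours); the quadratic norm is taken here as (xᵀWx)^{1/2}, so that the choice W = I indeed recovers the Euclidean norm.

```python
import numpy as np

def lr_norm(x, r):
    """L_r-norm of a vector: (sum_i |x_i|^r)^(1/r)."""
    return np.sum(np.abs(x) ** r) ** (1.0 / r)

def quadratic_norm(x, W):
    """Quadratic norm (x^T W x)^(1/2) for positive definite W."""
    return float(np.sqrt(x @ W @ x))

x = np.array([3.0, -4.0])
assert np.isclose(lr_norm(x, 2), 5.0)     # Euclidean norm: sqrt(9 + 16)
assert np.isclose(lr_norm(x, 1), 7.0)     # L_1-norm: |3| + |-4|
assert np.isclose(quadratic_norm(x, np.eye(2)), np.linalg.norm(x))  # W = I
```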
If X = [xi,j ] is a matrix1 of order m × n, then we analogously define the Lr-norm as

‖X‖r := [ ∑_{i=1}^{m} ∑_{j=1}^{n} |xi,j|^r ]^{1/r}.  (2.1)

Again, to simplify notation, we will omit the subscript if r = 2, i.e., we take ‖X‖ := ‖X‖2. In other contexts of interest, we will make use of the sup-norm, defined on a
1Given that the collection of all m × n matrices Mm,n is a vector space, Definition 2.3 can be applied. However, if m = n it is often required that the matrix norm should also satisfy the property ‖AB‖ ≤ ‖A‖‖B‖, ∀A, B ∈ Mm,n. See Rao and Rao [42] for further details.
linear space Θ ⊂ Rk, i.e.,

‖f‖∞ := sup_{θ ∈ Θ} |f(θ)|,
where f is a real-valued bounded function defined over the parameter space. In our exposition, we will make use of the 'little-oh' and 'big-oh' notations. Recall that in classical calculus we have the notation xn = o(yn), which formally means that

∀ε > 0, ∃ p(ε) ∈ N : n ≥ p(ε) ⇒ |xn/yn| < ε.

We will also borrow from classical calculus the notation xn = O(yn), to mean that

∃ M > 0, ∃ p(M) ∈ N : n ≥ p(M) ⇒ |xn/yn| ≤ M.
We highlight the particular case where yn = 1, ∀n ∈ N. Under those circumstances, xn = o(1) means that xn → 0 as n → ∞. Further, xn = O(1) means that xn is bounded.
Given the way the ‘little-oh’ and ‘big-oh’ are defined, well-known properties can be
derived (see e.g. Khuri [28], p. 66; Figueira [18], p. 264).
1. λO(xn) = O(xn), λ ∈ R;
2. O(xn + yn) = O(xn) +O(yn);
3. O(xnyn) = O(xn)O(yn);
4. o(xnyn) = O(xn)o(yn).
To get some insight on these results, suppose that xn = yn = 1, and that an and bn
are sequences of real numbers such that an = O(1) and bn = O(1). The first three
properties yield that an + bn = O(1) and anbn = O(1). In words, the sum and product of two bounded sequences yield another bounded sequence. To illustrate property 4, suppose that bn = o(1). Then anbn = O(1)o(1) = o(1), i.e., the product of a bounded sequence and a sequence converging to zero also converges to zero.
The concept of cone stated above (Definition 2.2) will allow us to introduce the following
definition due to Chernoff [12], as well as a corresponding generalization. This defini-
tion has well-known applications in statistics (see for instance [3, 51]), and it will be
particularly useful in the next chapter.
Definition 2.4. Consider a cone Λ ⊆ Rk and a set Θ ⊆ Rk.

1. The set Θ is locally approximated by the cone Λ if the following conditions hold:

(a) inf_{λ ∈ Λ} ‖λ − θ‖ = o(‖θ‖), ∀θ ∈ Θ;
(b) inf_{θ ∈ Θ} ‖λ − θ‖ = o(‖λ‖), ∀λ ∈ Λ.

2. The sequence {Θn} of subsets of Rk is locally approximated by the cone Λ ⊆ Rk if it holds that

(a) inf_{λ ∈ Λ} ‖λ − θn‖ = o(‖θn‖), ∀θn ∈ Θn : ‖θn‖ = o(1);
(b) inf_{θn ∈ Θn} ‖λn − θn‖ = o(‖λn‖), ∀λn ∈ Λ : ‖λn‖ = o(1).
Broader definitions of cone and local approximation with vertex θ0 ∈ Rk can be found
elsewhere (Stram and Lee [51]). Notwithstanding, the aforementioned definitions will
prove to be sufficient for our purposes.
2.2 Modes of Stochastic Convergence and Stochastic Orders
In the sequel we will develop our exposition in a normed linear space framework. This will prove to be sufficient for our purposes. For an exposition suited to metric spaces, see van der Vaart and Wellner [54] and Billingsley [6].
Definition 2.5. Consider a sequence of random vectors Xn, with cumulative dis-
tribution functions Fn(x). Further, consider a random vector X, with cumulative
distribution function F (x).
1. We say that Xn converges almost surely (a.s.) to X, if
∀ε, η > 0, ∃ p(ε, η) ∈ N : n ≥ p(ε, η)⇒ P[‖XN −X‖ > ε, for some N ≥ n] < η.
2. We say that Xn converges in probability to X, if
∀ε, η > 0,∃ p(ε, η) ∈ N : n ≥ p(ε, η)⇒ P[‖Xn −X‖ > ε] < η.
3. We say that Xn converges to X in Lr, if

∀ε > 0, ∃ p(ε) ∈ N : n ≥ p(ε) ⇒ E[‖Xn − X‖^r] < ε.
4. We say that Xn converges in distribution to X, if
∀ ε > 0, ∃ p(ε) : n ≥ p(ε)⇒ |Fn(x)− F (x)| < ε,
at any point of continuity x of F .
Remark 2.6. For each of the above-defined types of stochastic convergence, we will respectively use the notations Xn →a.s. X, Xn →p X, Xn →Lr X, and Xn →D X. Further, we emphasize that we will avoid the use of the expression 'weak convergence' to designate convergence in distribution. As stressed by Williams [56], this terminology is unfortunate given that the concept does not coincide with the corresponding concept in functional analysis (for a definition of the functional analysts' concept see Luenberger [31]).
In the following we introduce the concept of uniform convergence in probability.
Definition 2.7. Let T be a function from Θ ⊂ Rk to R. Further, let Tn denote
a sequence of functions from Θ ⊂ Rk to R. We say that the sequence Tn converges
uniformly in probability to T if
∀ε, η > 0, ∃ p(ε, η) ∈ N : n ≥ p(ε, η)⇒ P [‖Tn − T‖∞ > ε] < η,
where
‖Tn − T‖∞ = supθ∈Θ|Tn(θ)− T (θ)|.
Figure 2.1: Uniform Convergence in Probability
This latter type of convergence will prove to be particularly useful as regards the statement of large sample results for extremum estimators, to be provided in the next
Chapter. Roughly speaking, this type of convergence demands that, from a certain order
onwards, the graph of Tn(θ) lie in the ‘ε-sleeve’ with probability arbitrarily close to
one. In Figure 2.1 we provide an illustration of this idea.
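The ‘ε-sleeve’ idea can also be visualized numerically. The following is an illustrative sketch (ours, not part of the thesis’s development): for the toy choices Tn(θ) = θ X̄n and T(θ) = θµ on the compact set Θ = [0, 1], the sup-norm over Θ reduces to |X̄n − µ|, which collapses towards 0 as n grows.

```python
import random

# Illustrative sketch (toy choices, not from the thesis): with
# T_n(theta) = theta * Xbar_n and T(theta) = theta * mu on Theta = [0, 1],
# sup_{theta} |T_n(theta) - T(theta)| = |Xbar_n - mu|.
random.seed(1)
mu = 0.5  # mean of the Uniform(0, 1) draws below

def sup_norm(n):
    xbar = sum(random.random() for _ in range(n)) / n
    return abs(xbar - mu)  # the sup over Theta = [0, 1] is attained at theta = 1

small_n, large_n = sup_norm(100), sup_norm(100_000)
print(small_n, large_n)  # the sup-norm is much smaller for the larger sample
```

For large n the graph of Tn lies inside any fixed ε-sleeve around T with overwhelming probability, which is exactly the content of Definition 2.7.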
Note that each of the above-defined modes of stochastic convergence can also be rewritten
in the following manner:
Xn −→a.s. X ⇔ Xn −→ X, as n → ∞, a.s.; (2.2)
Xn −→Lr X ⇔ E[‖Xn − X‖^r] −→ 0, as n → ∞; (2.3)
Xn −→p X ⇔ P[‖Xn − X‖ > ε] −→ 0, as n → ∞, ∀ε > 0. (2.4)
Similarly, in what concerns convergence in distribution, we have
Xn −→D X ⇔ Fn(x) −→ F(x), as n → ∞, (2.5)
for each point of continuity x of the cumulative distribution function F. Further, note
that
‖Tn − T‖∞ = sup_{θ∈Θ} |Tn(θ) − T(θ)| −→p 0, (2.6)
is equivalent to saying that Tn converges uniformly in probability to T.
We now introduce the stochastic counterparts of the ‘little-oh’ and ‘big-oh,’ often
referred to in the literature as the stochastic orders of random variables (vectors).
Definition 2.8. Consider a sequence of random vectors Xn and a sequence of random
variables Yn.
1. We say that Xn = O(Yn), a.s., if
P[‖Xn‖ = O(|Yn|)] = 1.
2. We say that Xn = o(Yn), a.s., if
Xn/Yn −→a.s. 0.
3. We say that Xn = Op(Yn), if
∀ε > 0, ∃Cε > 0 : sup_n P[‖Xn‖ ≥ Cε|Yn|] < ε.
4. We say that Xn = op(Yn), if
Xn/Yn −→p 0.
These definitions warrant some remarks. As mentioned above, xn = O(1) means that the
sequence xn is bounded. In the same spirit, Xn is said to be stochastically bounded
(or bounded in probability) if Xn = Op(1). Note that if Xn is bounded in the sense
that
∃M > 0 : P[‖Xn‖ ≤M ] = 1, ∀n ∈ N,
then it is trivially stochastically bounded. Note however, that in general the converse is
not true.
In the following, we provide some well-known properties of the stochastic ‘little-oh’ and
‘big-oh’ (see e.g. van der Vaart [55], pp. 12–13):
op(1) + op(1) = op(1);
op(1) + Op(1) = Op(1);
Op(1) op(1) = op(1);
(1 + op(1))^{−1} = Op(1);
op(Xn) = Xn op(1);
Op(Xn) = Xn Op(1);
op(Op(1)) = op(1).
In the next lemma we establish a connection between the classical calculus ‘little-oh’
and its stochastic counterpart.
Lemma 2.9.
Let Q be a function defined on Rk, such that Q(0) = 0. Consider a sequence of random
k-vectors Xn such that Xn = op(1). If Q(x) = o(‖x‖^r) as ‖x‖ → 0, then
Q(Xn) = op(‖Xn‖^r), for any given r > 0.
Proof. In this proof we will make use of the continuous mapping theorem (Theorem
2.20) which, for the sake of exposition, is only introduced later.2 The following auxiliary
function will also prove to be useful:
q(x) = (Q(x)/‖x‖^r) I(x ≠ 0),
2We suggest that the reader who is less acquainted with the continuous mapping theorem return to this proof after the introduction of Theorem 2.20.
where I(·) denotes the indicator function. Consider the function q(·) evaluated at Xn,
q(Xn) = (Q(Xn)/‖Xn‖^r) I(Xn ≠ 0).
Observe that by construction q(·) is continuous at 0. Hence, by Theorem 2.20 it holds
that
q(Xn) = op(1).
Consequently, we have
Q(Xn) = ‖Xn‖^r op(1) = op(‖Xn‖^r).
Even though equations (2.2) to (2.6) provide just a direct set of translations of the
definitions stated above, there are equivalent characterizations of the concepts of
stochastic convergence introduced above which can be more suitable in certain cases. The
next set of results is introduced in this spirit, providing alternative characterizations
of some of the modes of convergence defined above. In what concerns almost sure
convergence we have the following result.
Theorem 2.10.
Consider the sequence of random vectors Xn, and a random vector X. Xn converges
almost surely to X, if and only if
P[⋃_{m=n}^{∞} {‖Xm − X‖ > ε}] −→ 0, as n → ∞, ∀ε > 0.
Proof. See [48] p. 51.
Before we are ready to provide other equivalent characterizations, recall that the
characteristic function of X is defined as
Φ_X(t) = E[exp(i tᵀX)],
where i denotes the imaginary unit, i.e., i = √−1.
The following theorem sheds some light on the usefulness of the characteristic functions.
Theorem 2.11. (Lévy’s theorem)
Consider a random variable X, such that there exist k ∈ N and δ ∈ ]0, 1[ for which
E[|X|^{k+δ}] < ∞.
Then we have that
Φ_X(t) = 1 + Σ_{j=1}^{k} ((it)^j / j!) E[X^j] + R_k(t),
where i denotes the imaginary unit, and the remainder R_k(t) is bounded as follows:
|R_k(t)| ≤ c |t|^{k+δ} E[|X|^{k+δ}].
Proof. See [46] pp. 26–27.
As a consequence of this classical result, we have that each characteristic function
Φ_X(t) corresponds to one, and only one, cumulative distribution function F_X(x). The
next result highlights that characteristic functions can also be used to provide
equivalent characterizations of convergence in distribution.
Theorem 2.12.
Consider the sequence of random vectors Xn, and let Φ_{Xn} denote the sequence of
corresponding characteristic functions. The following condition is necessary and
sufficient for Xn −→D X:
Φ_{Xn}(t) −→ Φ_X(t), as n → ∞, ∀t ∈ Rk.
Proof. See [48] p. 57.
The next theorem establishes some further equivalent characterizations of convergence
in distribution. Claim 2 of the following theorem is frequently referred to in the
literature as the Cramér–Wold device.
Theorem 2.13.
Consider the sequence of random vectors Xn, and a random vector X. Each of the
following conditions is necessary and sufficient for Xn −→D X:
1. E[g(Xn)] −→ E[g(X)], as n → ∞, for every bounded continuous function g(·);
2. xᵀXn −→D xᵀX, ∀x ∈ Rk.
Proof. See [48] pp. 56–57.
One question that naturally arises is how the above-defined concepts of convergence
relate. We note that a partial answer to this question was already given by the
characterization of almost sure convergence provided in Theorem 2.10. In fact,
if a sequence of random vectors Xn converges a.s., then by Theorem 2.10, it holds that
for every ε > 0,
P[⋃_{m=n}^{∞} {‖Xm − X‖ > ε}] −→ 0, as n → ∞,
which implies that
P[‖Xn − X‖ > ε] −→ 0, as n → ∞,
and so a.s. convergence implies convergence in probability.
Theorem 2.14.
Consider the sequence of random vectors Xn and the random vector X. The following
conditions hold:
1. Almost sure convergence implies convergence in probability, i.e.:
Xn −→a.s. X ⇒ Xn −→p X.
2. Convergence in Lr (r > 0) implies convergence in probability, i.e.:
Xn −→Lr X ⇒ Xn −→p X.
3. Convergence in probability implies convergence in distribution, i.e.:
Xn −→p X ⇒ Xn −→D X.
Proof. Claim 1 was established above. For a proof of the remaining claims, see [48]
p. 53.
Figure 2.2: The logical relation between the stochastic convergence concepts introduced.
In Figure 2.2, we provide a schematic representation summarizing the previous theorem,
which establishes the logical relation between the stochastic convergence concepts
introduced.
The following result establishes that stochastic boundedness is a necessary condition for
convergence in distribution.
Theorem 2.15.
Let Xn be a sequence of random vectors. Convergence in distribution implies stochastic
boundedness, i.e.,
Xn −→D X ⇒ Xn = Op(1).
Proof. See [46] pp. 106–107.
Since all the remaining modes of stochastic convergence defined above (with the
exception of uniform convergence in probability) imply convergence in distribution, we
have the simple corollary that stochastic boundedness is necessary for each of the
above-defined modes of stochastic convergence. Note that this is in accordance with what
we have in the study of deterministic sequences in classical calculus. Hence, even though
the convergence and boundedness concepts are different in this context, we still have a
property of the type ‘convergence implies boundedness.’
It remains unanswered whether, under some special provisos, one can obtain a richer
portrait than the one provided in Figure 2.2. The next theorems shed some light on this
issue. We start by introducing the celebrated Skorohod’s theorem.
Theorem 2.16. (Skorohod’s theorem)
Consider the sequence of random vectors Xn and the random vector X. If Xn −→D X,
then there exist a sequence of random vectors Yn, and a random vector Y, such that
Φ_X(t) = Φ_Y(t), ∀t ∈ Rk,
Φ_{Xn}(t) = Φ_{Yn}(t), ∀t ∈ Rk, ∀n ∈ N.
Further, it holds that
Yn −→a.s. Y.
Proof. See Billingsley [6] pp. 399–402.
The next theorem introduces some results in the same spirit. We underscore the
importance of Claim 3 (see below), which is particularly useful for obtaining a prime
example of convergence in probability, namely Khintchine’s weak law of large numbers.
Theorem 2.17.
Consider the sequence of random vectors Xn and the random vector X. The following
conditions hold:
1. Suppose that for every ε > 0, we have
Σ_{i=1}^{∞} P[‖Xi − X‖ ≥ ε] < ∞. (2.7)
Then it holds that
Xn −→a.s. X.
2. Suppose that Xn −→p X. Then, there exists a subsequence {Xnj : j ∈ N}, such
that
Xnj −→a.s. X, as j → ∞.
3. Suppose that Xn −→D X, and that P[X = x] = 1, where x is a constant vector.
Then, it holds that
Xn −→p x.
4. Suppose that Xn −→D X. Then
E[‖Xn‖^r] −→ E[‖X‖^r] < ∞, as n → ∞,
if and only if ‖Xn‖^r is uniformly integrable, i.e.:
sup_n E[‖Xn‖^r I(‖Xn‖^r > t)] = o(1), as t → ∞, (2.8)
where I(·) denotes the indicator function.
Proof. See [48] pp. 53–54.
The last theorem warrants some discussion. Claim 1 establishes that as long as
P[‖Xn − X‖ ≥ ε] converges sufficiently fast to 0, the concepts of convergence in
probability and almost sure convergence are identical. Further, we note that if a
sequence of random vectors Xn verifies (2.7), then it is said to be completely
convergent. In this spirit,
Claim 1 of the latter theorem argues that complete convergence implies a.s. convergence.
In what regards Claims 2 and 3, these are partial converses to Theorem 2.14. Finally, in
Claim 4, we conclude that convergence in probability combined with uniform integrability
implies convergence in Lr.
We now provide an example of convergence in probability. For the sake of simplicity, we
restrict our attention to random variables rather than to random vectors.
Example 2.1. (Khintchine’s weak law of large numbers)
Consider a sequence of independent and identically distributed random variables Xk,
such that E[X1] = µ, and let X̄n = (1/n) Σ_{k=1}^{n} Xk denote the sample mean. We thus
have that
Φ_{X̄n}(t) = E[exp(it Σ_{k=1}^{n} Xk/n)] = Π_{k=1}^{n} E[exp(it Xk/n)] = [Φ_{X1}(t/n)]^n.
Now, notice that applying Lévy’s theorem to the preceding equality yields
Φ_{X̄n}(t) = [1 + (it/n) E[X1] + o(1/n)]^n,
thus implying that
Φ_{X̄n}(t) −→ exp(itµ), as n → ∞.
Hence, as a consequence of Claim 3 of Theorem 2.17, we have that
X̄n −→p µ.
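The conclusion of the example can be checked by simulation. The following is a Monte Carlo sketch (the distribution, ε, and replication counts are our own illustrative choices): the frequency of replications with |X̄n − µ| > ε shrinks towards 0 as n grows, which is exactly convergence in probability.

```python
import random

# Monte Carlo sketch of Khintchine's WLLN (illustrative toy choices): estimate
# P[|Xbar_n - mu| > eps] by the frequency of exceedances over many replications.
random.seed(2)
mu, eps, reps = 0.5, 0.05, 2000  # Uniform(0, 1) draws, so mu = 1/2

def exceedance_freq(n):
    count = 0
    for _ in range(reps):
        xbar = sum(random.random() for _ in range(n)) / n
        count += abs(xbar - mu) > eps
    return count / reps

f_small, f_large = exceedance_freq(10), exceedance_freq(1000)
print(f_small, f_large)  # the exceedance frequency collapses as n grows
```

For n = 10 the deviation |X̄n − µ| exceeds ε frequently, while for n = 1000 it essentially never does, in line with X̄n −→p µ.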
Given that in many circumstances of interest one cannot assume the random variables to
be identically distributed, the laws of large numbers stated above do not apply. To
overcome this barrier, the following weak and strong laws are suitably modified in order
to drop the requirement of ‘identical distribution’ of the random variables.
Theorem 2.18. Let Xn be a sequence of random variables.
1. (Weak law of large numbers)
Suppose that the following condition holds:
∃α ∈ [1, 2] : (1/n^α) Σ_{i=1}^{n} E[|Xi|^α] = o(1).
Then, it holds that
(1/n) Σ_{i=1}^{n} (Xi − E[Xi]) = op(1).
2. (Strong law of large numbers)
Suppose that the following condition holds:
∃α ∈ [1, 2] : Σ_{i=1}^{n} E[|Xi|^α]/i^α = O(1).
Then, it holds that
(1/n) Σ_{i=1}^{n} (Xi − E[Xi]) = o(1), a.s.
Proof. See [50] p. 65.
We now introduce a prime example of uniform convergence in probability. For such
purpose, let A(X, θ) be a matrix of functions of a realization of the p-vector X and the
parameter θ. The law of large numbers stated below will be particularly useful in the
next chapter. The statement of this result will make use of the norm defined in (2.1),
in order to give meaning to ‖A(·, ·)‖.
Theorem 2.19. (Uniform weak law of large numbers)
Let Θ be a compact subset of Rk which denotes a parameter set of interest, and let
Xn be a sequence of random p-vectors. Suppose that A(Xi,θ) is continuous for all
θ ∈ Θ, with probability one. Further, suppose that there exists a nonnegative function
d(·) defined on Rp, such that:
1. ‖A(X,θ)‖ ≤ d(X), ∀θ ∈ Θ;
2. E[d(X)] <∞.
Then, the following conditions hold:
1. E[A(X,θ)] is continuous;
2. sup_{θ∈Θ} ‖(1/n) Σ_{i=1}^{n} A(Xi, θ) − E[A(X, θ)]‖ = op(1).
Proof. See [30].
It is also possible to establish a uniform strong law of large numbers, under a set of
assumptions which somewhat resembles the proviso considered here (Theorem 6.13 in
[7]; for a proof, see pp. 172–174 of the same reference). The large sample result stated
in Theorem 2.19 will, however, prove to be sufficient for our purposes.
One issue of general interest is whether it is possible to assess the consistency of a
transformation of statistics whose large sample behavior is known. The next theorem
sheds some light on this issue.
Theorem 2.20. (Continuous mapping theorem)
Consider the sequence of random vectors Xn, and a constant vector x. Let g : Rk −→ R
be a function continuous at x. Then the following claims hold:
1. If Xn −→a.s. x, then g(Xn) −→a.s. g(x).
2. If Xn −→p x, then g(Xn) −→p g(x).
Proof. See [46], pp. 58–59.
Roughly speaking, the previous theorem ensures that a.s. convergence and convergence
in probability are preserved under continuous transformations. We will later introduce
a result (Sverdrup’s theorem) which mimics the latter theorem in what concerns
convergence in distribution.
2.3 Consistency of Point Estimators
We now introduce some of the most common concepts of consistency of point estimators.
Definition 2.21. Let θn be an estimator of the parameter θo.
1. We say that the estimator θn is strongly consistent, if
θn −→a.s. θo.
2. We say that the estimator θn is Lr-consistent, if
θn −→Lr θo.
3. We say that the estimator θn is weakly consistent, if
θn −→p θo.
4. Let xn be a sequence of real numbers such that xn > 0, ∀n ∈ N, and xn −→ ∞,
as n → ∞. We say that the estimator θn is xn-consistent, if
xn(θn − θo) = Op(1).
Remark 2.22. Note that, with the exception of xn-consistency, all the above definitions
can be restated through the use of the ‘oh’ notations introduced above. Hence θn is
strongly consistent if
θn − θo = o(1), a.s.,
Lr-consistent if
E[‖θn − θo‖^r] = o(1),
and weakly consistent if
θn − θo = op(1).
We emphasize that consistency is a mild reliability demand for an estimator: roughly
speaking, it states that in a large sample the estimator θn should be close to the true
parameter θo. The concept of consistency of an estimator is also compatible with the
basic idea that from more data we should, on average, be able to extract more
information regarding the unknown population about which we intend to infer.
2.4 Asymptotic Normality
From the inference standpoint, asymptotic normality is another desirable property an
estimator should have. In fact, asymptotic normality plays a central role in the
construction of confidence zones and hypothesis tests. The definition is given below.
Definition 2.23. An estimator θn is said to be asymptotically normal if there exist
an increasing function u(n) and a positive definite matrix Σ such that
u(n)(θn − θo) −→D N(0, Σ).
Remark 2.24. Some remarks about the foregoing definition:
• the variance Σ of the limiting distribution is referred to as the asymptotic variance
of θn;
• in many cases of practical interest, u(n) = √n.
In a large number of cases of interest, asymptotic normality of the estimator is achieved
by the construction of suitable decompositions of the sampling error (θn − θo). This
motivates the definition of an asymptotically linear estimator.
Definition 2.25. An estimator θn is asymptotically linear if there exists a function
ψ(x) such that
E[ψ(x)] = 0,
E[ψ(x)[ψ(x)]ᵀ] < ∞,
√n(θn − θo) = (1/√n) Σ_{i=1}^{n} ψ(Xi) + op(1).
Remark 2.26. The function ψ(x) is the so-called influence function, and it can be
used to assess the impact that a single observation can exert on the estimation, up to a
remainder which vanishes in probability.
The theorem stated below is perhaps the cornerstone of statistical asymptotic theory.
We take it as the prime example of the concept of convergence in distribution.
Theorem 2.27. (Classical central limit theorem)
Let Xn be a sequence of independent and identically distributed random vectors.
Further, let Σ = V[X1] < ∞. Then, the following large sample result holds:
(1/√n) Σ_{i=1}^{n} (Xi − E[X1]) −→D N(0, Σ).
Proof. See Billingsley [6].
Note that we have the following simple corollary to the central limit theorem:
√n(X̄n − E[X1]) = Op(1), where X̄n = (1/n) Σ_{i=1}^{n} Xi,
as a consequence of Theorem 2.15.
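A quick Monte Carlo sketch of the theorem (the Uniform(0, 1) distribution and all tuning constants are our own illustrative choices): the standardized sum (1/√n) Σ (Xi − 1/2) should be approximately N(0, 1/12), so its empirical mean and variance should be near 0 and 1/12, respectively.

```python
import math
import random

# Monte Carlo sketch of the classical CLT (illustrative toy choices): for i.i.d.
# Uniform(0, 1) draws, the standardized sum should be approximately N(0, 1/12).
random.seed(3)
n, reps = 200, 5000
draws = []
for _ in range(reps):
    s = sum(random.random() - 0.5 for _ in range(n))
    draws.append(s / math.sqrt(n))  # (1/sqrt(n)) * sum_i (X_i - E[X_1])

mean = sum(draws) / reps
var = sum((d - mean) ** 2 for d in draws) / reps
print(mean, var)  # mean near 0, variance near 1/12
```

The corollary above is also visible here: the standardized sums stay within a bounded range across replications, i.e. they are Op(1).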
In the sequel we provide an important result which is useful for deriving the asymptotic
distribution of a transformation of a statistic with a known asymptotic distribution. As
aforementioned, this theorem mimics the continuous mapping theorem stated above in
what concerns convergence in distribution.
Theorem 2.28. (Sverdrup’s theorem)
Let Xn be a sequence of random vectors, such that Xn −→D X. Further, let
g : Rk −→ R be a continuous function. Then, it holds that
g(Xn) −→D g(X).
Proof. See [46] p. 106.
This theorem implies that convergence in distribution is preserved under continuous
transformations. However, it remains unanswered how to obtain the convergence in
distribution of the sum and the product of random vectors. In this context, we have the
following result, which plays a central role in statistical asymptotic theory.
Theorem 2.29. (Slutsky’s theorem)
Let Xn and Yn be sequences of random p-vectors such that
Xn −→D X, Yn = op(1).
Consider further a sequence of random (w × p) matrices Wn such that
tr[(Wn − W)ᵀ(Wn − W)] = op(1),
where W is a nonstochastic matrix, and tr(·) denotes the trace operator. Then:
1. Xn + Yn −→D X;
2. WnXn −→D WX.
Proof. See [46] pp. 130–131.
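A standard use of Slutsky’s theorem is the studentized mean, which we sketch below with our own toy choices: the CLT term converges in distribution to a normal, the scale estimate converges in probability to a constant, and Slutsky’s theorem gives that the product is still asymptotically N(0, 1).

```python
import math
import random

# Illustrative sketch of Slutsky's theorem in action (toy choices throughout):
# sqrt(n)*(Xbar - mu)/s = [sigma/s] * [sqrt(n)*(Xbar - mu)/sigma], a product of a
# term converging in probability to 1 and a term converging in distribution to
# N(0, 1); the product should therefore have variance near 1.
random.seed(4)
n, reps = 400, 4000
tstats = []
for _ in range(reps):
    xs = [random.random() for _ in range(n)]       # Uniform(0, 1), mean 1/2
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)  # consistent for 1/12
    tstats.append(math.sqrt(n) * (xbar - 0.5) / math.sqrt(s2))

var = sum(t * t for t in tstats) / reps
print(var)  # near 1, the variance of the N(0, 1) limit
```

Note that replacing the unknown variance by a consistent estimate is legitimate precisely because of Claim 2 of the theorem.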
This chapter closes with the introduction of a tool of broad use in obtaining the
distribution of transformations, namely the δ-method.
Theorem 2.30. (δ-method)
Let Xn be a sequence of random vectors, and let Y be a random vector with distribution
N(0, Σ), such that
xn(Xn − x) −→D Y,
where x is a constant vector, and xn is a sequence of real numbers such that xn > 0,
∀n ∈ N, and xn −→ ∞, as n → ∞. Further, let g(·) be a continuously differentiable
function from Rk to R. Then, the following large sample result holds:
xn[g(Xn) − g(x)] −→D N(0, [Dg(x)]ᵀ Σ Dg(x)).
Proof. See Shao [48] p. 61.
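The δ-method can be checked numerically. The following is a sketch with our own toy instance (not the thesis’s): Xi ~ Uniform(0, 1), xn = √n, x = 1/2, g(x) = x², so Dg(1/2) = 1 and Σ = 1/12, and the limiting variance of √n(X̄n² − 1/4) should be [Dg(1/2)]² · 1/12 = 1/12.

```python
import math
import random

# Numerical sketch of the delta-method (illustrative toy setup): with
# X_i ~ Uniform(0, 1) and g(x) = x^2, sqrt(n)*(Xbar_n^2 - 1/4) is roughly
# N(0, [g'(1/2)]^2 * 1/12) = N(0, 1/12).
random.seed(5)
n, reps = 500, 4000
vals = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    vals.append(math.sqrt(n) * (xbar ** 2 - 0.25))

mean = sum(vals) / reps
var = sum((v - mean) ** 2 for v in vals) / reps
print(var)  # near [Dg(1/2)]^2 * 1/12 = 1/12
```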
Chapter 3
Extremum Estimators
In this chapter, we concern ourselves with the introduction of extremum estimators. In
this part of the thesis we address large sample properties of these estimators, and discuss
the provisos under which one can ensure consistency and asymptotic normality of this
class of estimators.
3.1 Introduction
A broad class of estimators can be obtained by solving an optimization problem of
interest. Such estimators are known as extremum estimators (see [1, 33–35]). One of
the advantages of considering this general class of estimators is that it is possible to
build an elegant asymptotic theory with a set of general results. In the sequel, let
Θ ⊂ Rk denote the parameter space. In the definition that follows, we formally define
an extremum estimator.
Definition 3.1. An estimator θn is an extremum estimator if there exists a function
Tn(θ), such that:
Tn(θn) = sup_{θ∈Θ} Tn(θ) + op(1). (3.1)
Note that an alternative definition, often considered in the literature, is to define an
extremum estimator as
θn = arg max_{θ∈Θ} Tn(θ).
From the conceptual standpoint, the definition provided in (3.1) is preferable, given
that it only demands that Tn(θn) be within op(1) of the global maximum of Tn(θ).
This overcomes the question of existence, and it is also more suitable for computational
purposes (cf. Andrews [3]). Notwithstanding, from the pragmatical stance, the latter
definition is preferred.
Note further that an M-estimator is an extremum estimator for which the estimator
objective function is given by a sample average, i.e., Tn(θ) takes the form
Tn(θ) = (1/n) Σ_{i=1}^{n} q(Xi, θ),
where q is some function of the data and the parameter. Here and below we will make use
of {Xi}_{i=1}^{n} to denote a random sample of size n. The Xi are assumed to be
distributed identically to the random variable x. In the sequel we provide some simple
examples of extremum estimators. Other examples can be found elsewhere (e.g. Newey and
McFadden [35]; Andrews [3]; Mexia and Corte Real [34]).
Example 3.1. (Ordinary least squares—simple linear model)
Consider the simple linear model
y = Xβo + ε, (3.2)
where y is the n-vector of observations, X is the design matrix of size n × k, βo is a
k-vector of unknown regression parameters, and ε is an (n × 1)-vector of unobserved
errors. The estimator objective function assigned to the Ordinary Least Squares (OLS)
estimator is given by
Tn(β) = −‖y − Xβ‖, β ∈ Θ = Rk.
The OLS estimator is thus defined as
βOLS = arg min_{β∈Rk} ‖y − Xβ‖.
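The extremum-estimator view of OLS can be made concrete with a short sketch (our own toy data and tuning constants, not the thesis’s computations): minimizing ‖y − Xβ‖² with a generic descent method applied to the estimator objective function recovers the closed-form normal-equations solution, here worked out by hand for k = 2.

```python
import random

# Sketch of OLS as an extremum estimator (illustrative toy data): gradient
# descent on ||y - X beta||^2 versus the closed-form normal-equations solution.
random.seed(6)
n = 100
X = [(1.0, random.random()) for _ in range(n)]  # intercept plus one regressor
y = [2.0 * a - 1.0 * b + random.gauss(0.0, 0.1) for a, b in X]

# Closed form: solve (X^T X) beta = X^T y with 2x2 Cramer's rule.
s11 = sum(a * a for a, _ in X); s12 = sum(a * b for a, b in X)
s22 = sum(b * b for _, b in X)
t1 = sum(a * yi for (a, _), yi in zip(X, y))
t2 = sum(b * yi for (_, b), yi in zip(X, y))
det = s11 * s22 - s12 * s12
beta_cf = ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)

# Gradient descent on the same objective, started from the origin.
b0, b1 = 0.0, 0.0
for _ in range(2000):
    g0 = g1 = 0.0
    for (a, c), yi in zip(X, y):
        r = yi - (b0 * a + b1 * c)
        g0 -= 2.0 * r * a
        g1 -= 2.0 * r * c
    b0 -= 1e-3 * g0
    b1 -= 1e-3 * g1

print(beta_cf, (b0, b1))  # the two solutions essentially coincide
```

Since ‖y − Xβ‖² is globally convex, any convergent descent scheme reaches the unique global optimum here; the concern raised in the thesis arises precisely when the objective is not so well behaved.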
We underscore the example that follows, given that it will be reconsidered in Chapter 5.
Example 3.2. (Maximum likelihood methods—mixed linear models)
The mixed linear model is an extension of the simple linear model (3.2) which accounts
for more than one source of error; for a reference see Christensen [13]. The model takes
the following form:
y = Xβo + Σ_{i=1}^{w−1} Xi ζi + ε, (3.3)
where (y, X, βo, ε) are defined as in (3.2), the Xi are design matrices of size n × ki,
and the ζi are ki-vectors of unobserved random effects. Following classical assumptions,
we take the random effects ζi to be independent and normally distributed with null mean
vectors and covariance matrices σ²_{oi} I_{ki}, for i = 1, . . . , w − 1. Further, we also
take ε to be normally distributed with null mean vector and covariance matrix σ²_{ow} In,
independently of the ζi, for i = 1, . . . , w − 1. The model has the following mean vector
and covariance matrix:
E[y|X] = Xβo, (3.4)
Σ_{σ²o} ≡ V[y|X] = Σ_{i=1}^{w−1} σ²_{oi} Xi Xiᵀ + σ²_{ow} In,
where σ²o ≡ [σ²_{o1} · · · σ²_{ow}]. Given the current framework, we have that
y|X ∼ N(Xβo, Σ_{σ²o}),
and thus the density for the model is given by
f_{y|X}(y) = exp(−(1/2)(y − Xβo)ᵀ Σ_{σ²o}⁻¹ (y − Xβo)) / √((2π)^n det(Σ_{σ²o})).
Now, let θ ≡ [βᵀ σ²]. The estimator objective function assigned to the maximum
likelihood estimator is given by the loglikelihood of the aforementioned mixed linear
model, i.e.:
Tn(θ) = Tn([βᵀ σ²]) = −(n/2) ln(2π) − (1/2) ln(det(Σ_{σ²})) − (1/2)(y − Xβ)ᵀ Σ_{σ²}⁻¹ (y − Xβ).
The maximum likelihood estimators of the true regression parameter and the model
variance components, respectively denoted by βML and σ²ML, are thus given by
[βᵀML σ²ML] = arg max_{[βᵀ σ²]} Tn([βᵀ σ²]) = arg max_{θ∈Θ} Tn(θ), (3.5)
where the parameter set Θ is a bounded subset of R^{k+w} which restricts the elements θi
to be nonnegative, for i = k + 1, . . . , k + w, i.e.
Θ ≡ {θ ∈ R^{k+w} : θi ≥ 0, i = k + 1, . . . , k + w}.
Example 3.3. (Generalized method of moments—simple linear model)
Consider the following alternative representation of the simple linear model stated
above:
yi = xiᵀβo + εi, i = 1, . . . , n.
Here yi and εi respectively denote the i-th elements of the vectors y and ε; further, xiᵀ
represents the i-th row of the design matrix X. This example puts forward another
important instance of extremum estimators—the so-called Generalized Method of Moments
(GMM) estimators.
estimators. In GMM-based estimators the objective function is given by
Tn(θ) =
∣∣∣∣∣∣∣∣∣∣ 1n
n∑i=1
g(yi,xi,θ)
∣∣∣∣∣∣∣∣∣∣W
=
[1
n
n∑i=1
g(yi,xi,θ)
]T
W
[1
n
n∑i=1
g(yi,xi,θ)
],
where ‖ · ‖W denotes the quadratic norm introduced in (2.1), g is a k-vector function,1
and W is a symmetric positive definite matrix (possibly dependent on the sample) of
size k × k; further details on this class of estimators, including large sample
properties, can be found in Hayashi [22].
The examples provided above evidence the breadth of the class of extremum estimators.
In the next section we provide some consistency results for extremum estimators.
3.2 Consistency Results for Extremum Estimators
Under a set of mild assumptions it is possible to characterize the large sample behavior
of the broad class of estimators defined in (3.1). In order to establish the consistency
of this broad class of estimators, we first introduce the provisos which will allow us
to do so. Here and in the sequel, a convention regarding the numbering of assumptions
will be made: an assumption numbered using roman numerals (e.g. I, II, . . .) denotes an
alternative setup to the one considered by the assumption with the same arabic numeral
(i.e. 1, 2, . . .). This convention will prove to be useful over the exposition.
1The k-vector function should obey certain orthogonality conditions which are not relevant for our purposes, and which can be found elsewhere [22].
3.2.1 Consistency Under Compactness Assumptions
We start by establishing the consistency of extremum estimators under a set of
assumptions which we state in the sequel. Note that here and below, the main goal will
be to provide a set of conditions under which we are able to assure that the maximizer
of the sequence of functions Tn converges either in probability or almost surely to the
maximizer of the estimator objective function T.
Assumption 1: Weierstrass Framework
• Θ ⊂ Rk is compact.
• T : Θ ⊂ Rk −→ R is continuous.
Later, we will consider an alternative to Assumption 1 (Assumption I). The reason why
we refer to this assumption as the Weierstrass framework is that this is the set of
assumptions under which Weierstrass’s theorem is classically derived. Namely,
Assumption 1 implies the existence of θ* ∈ Θ such that the following condition holds:
θ* ∈ arg max_{θ∈Θ} T(θ). (3.6)
Note that this simple consequence of the well-known Weierstrass’s theorem implies that
the possibly set-valued map arg max_{θ∈Θ} T(θ) is non-empty.
Further, we also have to assume that the extremum estimator objective function Tn
converges uniformly in probability to T, in the sense of Definition 2.7.
Assumption 2: Uniform Convergence in Probability
• ‖Tn − T‖∞ = sup_{θ∈Θ} |Tn(θ) − T(θ)| = op(1).
Note that the uniform weak law of large numbers presented in the previous chapter
(Theorem 2.19) provides a set of sufficient conditions implying Assumptions 1 and 2,
if the parameter set Θ is compact; other laws of large numbers suited for extremum
estimators are also known in the literature (see e.g. Mexia and Corte Real [33]).
Later we will consider the alternative assumptions of a.s. uniform convergence
(Assumption II) and pointwise convergence in probability (Assumption II*).
In order to be able to identify the true parameter θo, we will have to assume that
arg max_{θ∈Θ} T(θ) is not set-valued, thus implying that T(θ) has a unique maximum.
Assumption 3: Identification
• θo = arg max_{θ∈Θ} T(θ).
It turns out that the Weierstrass framework, combined with uniform convergence in
probability and the identification assumption, is sufficient for the weak consistency of
the class of estimators defined in (3.1). In fact, under these conditions the argument
of the maximum of the sequence of functions Tn converges in probability to the maximizer
of the estimator objective function T. This point is formalized in the theorem that
follows, and in Figure 3.1 we give a graphical representation that provides some insight
into this result.
Figure 3.1: Assumptions (1, 2, 3) imply consistency of the extremum estimator.
Theorem 3.2. (Weak consistency of extremum estimators)
Suppose that assumptions (1, 2, 3) hold. Then,
θn − θo = op(1).
Proof. cf. [35], pp. 2121–2122.
It is easy to verify that Theorem 3.2 carries over to the ‘min’ case simply by replacing
T by −T .
In order to move towards a strong consistency result, we will have to focus on a more
demanding proviso.
Assumption II: a.s. Uniform Convergence
• ‖Tn − T‖∞ = sup_{θ∈Θ} |Tn(θ) − T(θ)| = o(1), a.s.
If we now replace Assumption 2 by this stronger form of uniform convergence, we get the
following strong consistency result.
Theorem 3.3. (Strong consistency of extremum estimators)
Suppose that assumptions (1,II,3) hold. Then,
θn − θo = o(1), a.s.
Proof. cf. [35], pp. 2121–2122.
The foregoing theorem ensures that every extremum estimator will be strongly consis-
tent, as long as the aforementioned requirements are fulfilled. Note that this result is
indeed quite general, given that it provides a set of sufficient conditions under which
the broad class of estimators defined in (3.1) converges a.s. to θo, the true value of the
relevant parameter.
In the next subsection we focus our attention on a well-known particular case of
extremum estimators, namely maximum likelihood methods.
3.2.2 Consistency for Maximum Likelihood Methods
Maximum Likelihood Estimation (MLE) methods are among the main standard
techniques for yielding parameter estimates of a statistical model of particular
interest. Large sample results for this M-estimation methodology were established long
ago in the literature (cf. Wald [57]). We emphasize that it is not our intention to
address here general consistency results for maximum likelihood methods, but rather to
provide a characterization of consistency in the style of the previous section. In this
spirit, we will illustrate in the sequel how we can rely on a proviso more suited to the
particular case of maximum likelihood methods.
For such purpose, suppose that the following identification property holds:
θ ≠ θo ⇒ f(x|θ) ≠ f(x|θo),
where θo is the true value of the parameter. As a consequence of the strict Jensen
inequality, we can establish that
T(θo) − T(θ) = E[ln f(x|θo) − ln f(x|θ)]
= E[−ln(f(x|θ)/f(x|θo))]
> −ln E[f(x|θ)/f(x|θo)]
= −ln ∫ (f(x|θ)/f(x|θo)) f(x|θo) dx
= 0.
Thus, we have established the following result.
Theorem 3.4.
Suppose that the following conditions are verified:
1. θ ≠ θo ⇒ f(x|θ) ≠ f(x|θo);
2. E[|ln f(x|θ)|] < ∞.
Then it holds that
θo = arg max_{θ∈Θ} T(θ),
where T(θ) = E[ln f(x|θ)].
Proof. See above.
The last theorem states sufficient conditions for Assumption 3, made above. In Figure
3.2 we provide a graphical representation which is useful for getting a clear portrait
of the reasoning leading to Theorem 3.4.
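The identification argument behind Theorem 3.4 can also be sketched numerically for a toy instance of our own choosing, a N(θo, 1) model: the sample analogue of T(θ) = E[ln f(x|θ)] is maximized, over a grid of candidate values, at (essentially) the true θo.

```python
import math
import random

# Monte Carlo sketch of the identification result for a N(theta_o, 1) model
# (illustrative toy instance): the sample analogue of E[ln f(x|theta)] peaks
# at the grid point closest to theta_o.
random.seed(8)
theta_o, n = 1.0, 20_000
xs = [random.gauss(theta_o, 1.0) for _ in range(n)]
const = -0.5 * math.log(2.0 * math.pi)

def T_hat(theta):
    # sample analogue of E[ln f(x|theta)] for the N(theta, 1) density
    return const - 0.5 * sum((x - theta) ** 2 for x in xs) / n

grid = [i / 10.0 for i in range(-10, 31)]  # candidate thetas from -1.0 to 3.0
best = max(grid, key=T_hat)
print(best)  # the grid point closest to theta_o = 1.0
```

This is precisely the picture of Figure 3.2: T(θ) separates θo from every θ ≠ θo, which is what Assumption 3 requires.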
Theorem 3.5. (Consistency of maximum likelihood methods)
Suppose that the following conditions hold:
1. Θ is a compact set;
2. θ ≠ θo ⇒ f(x|θ) ≠ f(x|θo);
3. ln f(x|θ) is continuous for every θ ∈ Θ, with probability one;
4. E[‖ln f(x|θ)‖∞] < ∞.
Figure 3.2: The picture behind Theorem 3.4
Then it holds that
θn − θo = op(1),
where
θo = arg max_{θ∈Θ} E[ln f(x|θ)].
Proof. See [35], p. 2131.
3.2.3 Consistency Without Compactness Assumption
The hypothesis of compactness included in Assumption 1 is somewhat restrictive. Since
we are considering the parameter set Θ to be a subset of Rk, compactness of the
parameter set is equivalent to Θ being closed and bounded. Thus, the compactness
condition imposes either that bounds on the true parameter θo are known, or at least
that their existence be somehow assured.
As an alternative to Assumption 1, we now consider Assumption I.
Assumption I
• Θ ⊂ Rk is convex.
• T : Θ ⊂ Rk −→ R is concave.
From the inspection of the latter assumption, we can notice that the compactness
assumption was abandoned in favor of convexity of the parameter space Θ. Further,
whereas in Assumption 1 we required the estimator objective function T to be continuous,
we now demand that it be concave.
Under this framework, we can consider the following weaker form of convergence in
probability.
Assumption II*: Pointwise Convergence in Probability
• Tn(θ)− T (θ) = op(1), ∀ θ ∈ Θ.
Further, we also assume that the true parameter θo is interior to the parameter set Θ.
Assumption 4: Interior Parameter
• θo ∈ int(Θ), where int(·) denotes the interior of a set.
We emphasize that Assumption 4 is a standard assumption made in the literature in
order to obtain the asymptotic distribution of an estimator. This assumption will be
dropped later.
Theorem 3.6. (Consistency without compactness requirements)
Suppose that assumptions (I, II*, 3, 4) hold. Then, with probability approaching one, there exists θn such that

θn − θo = op(1).
Proof. See [35], p. 2133.
In the next section we will address the issue of convergence in distribution of extremum
estimators.
3.3 Convergence in Distribution of Extremum Estimators
3.3.1 Interior Point Proviso
In this subsection we are particularly interested in establishing convergence in distribution results for extremum estimators, under the interior point condition of Assumption 4.
Further, we will have to assume weak consistency of the extremum estimator. Note that, as a consequence of Theorem 3.2, assumptions (1, 2, 3) are sufficient for Assumption 5; other sufficient conditions for Assumption 5 can be found elsewhere (Andrews [3]).
Assumption 5 - Weak Consistency
• θn = θo + op(1).
Here and in the sequel, we will make use of the following notation:

DT_n(θ) = ∂T_n(θ)/∂θ = [∂T_n/∂θ_1  ∂T_n/∂θ_2  · · ·  ∂T_n/∂θ_k]^T,

and D²T_n(θ) = ∂²T_n(θ)/∂θ∂θ^T denotes the k × k Hessian matrix, whose (i, j)-th entry is ∂²T_n/∂θ_i∂θ_j, that is,

D²T_n(θ) =
[ ∂²T_n/∂θ_1²     ∂²T_n/∂θ_1∂θ_2  · · ·  ∂²T_n/∂θ_1∂θ_k ]
[ ∂²T_n/∂θ_2∂θ_1  ∂²T_n/∂θ_2²     · · ·  ∂²T_n/∂θ_2∂θ_k ]
[       ⋮               ⋮          ⋱           ⋮        ]
[ ∂²T_n/∂θ_k∂θ_1  ∂²T_n/∂θ_k∂θ_2  · · ·  ∂²T_n/∂θ_k²    ].
Assumption 6: Regularity Conditions

• ∃ δ > 0 such that T_n is twice differentiable in B_δ(θo) ≡ {θ ∈ Θ : ‖θ − θo‖ < δ};

• √n DT_n(θo) →_D N(0, Σ);

• There exists a continuous function H(·) such that

sup_{θ∈B_δ(θo)} ‖D²T_n(θ) − H(θ)‖ →_p 0.
The aforementioned assumptions will now allow us to establish the asymptotic normality
of extremum estimators.
Theorem 3.7. (Asymptotic normality of extremum estimators)
Suppose that assumptions (4,5,6) hold. Then
√n(θn − θo) →_D N(0, H⁻¹ΣH⁻¹).
Proof. See [35], p. 2143.
Note that, bearing in mind the prior comment regarding the sufficiency of (1, 2, 3) for Assumption 5, the latter theorem could also have been stated using assumptions (1, 2, 3, 4, 6).
As aforementioned, Assumption 4 is a standard assumption made in the literature for obtaining the asymptotic distribution of an estimator. There is, however, a substantial number of cases of interest in which the true parameter θo lies on the boundary ∂(Θ) of the parameter space Θ.²
3.3.2 Asymptotic Normality for Maximum Likelihood Methods
Similarly to what we have done previously, we now restrict our attention to the particular case of maximum likelihood methods. The following theorem is a version of the general Theorem 3.7, using a set of assumptions which are more suitable for maximum likelihood methods.
Theorem 3.8. (Asymptotic normality for maximum likelihood methods)
Consider a sequence of independent and identically distributed random vectors {X_i}_{i=1}^n. Suppose that the conditions of Theorem 3.5 hold. Further, suppose that the following conditions are verified:
1. θo ∈ int(Θ);
2. f(x|θ) is twice continuously differentiable and there exists δ > 0 such that f(x|θ) > 0 for all θ ∈ B_δ(θo), where B_δ(θo) ≡ {θ ∈ Θ : ‖θ − θo‖ < δ};

² Such an occurrence takes place, for instance, when one intends to test the nullity of the variance components σ²_o ≡ [σ²_{o1} · · · σ²_{ow}] in a mixed linear model (5.1), using the standard likelihood-ratio test. In fact, under such circumstances, the parameter space is A = {σ²_o ∈ R^w : σ²_{oi} ≥ 0, i = 1, . . . , w}, so 0_w ∈ ∂(A) and thus 0_w ∉ int(A); Stram and Lee [51] provide an insightful discussion of this issue.
3. The following bounding conditions are satisfied:

∫ sup_{θ∈B_δ(θo)} ‖Df(x|θ)‖ dx < ∞;   ∫ sup_{θ∈B_δ(θo)} ‖D²f(x|θ)‖ dx < ∞;

4. J ≡ E[D ln f(x|θo) (D ln f(x|θo))^T] is nonsingular;

5. E[ sup_{θ∈B_δ(θo)} ‖D² ln f(x|θ)‖ ] < ∞.
Then

√n(θn − θo) →_D N(0, J⁻¹).
Proof. See [35].
This theorem requires some discussion regarding the assumptions made, in comparison with the hypotheses of the general result provided by Theorem 3.7. First, notice that, similarly to what was done in the general result 3.7, we are also considering the true parameter θo to be an interior point of the parameter space Θ. Further, note that since we are assuming the conditions of Theorem 3.5, we have consistency of the ML estimator, so Assumption 5 of the general result 3.7 holds. All the remaining conditions are sufficient for Assumption 6.
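For a concrete feel for Theorem 3.8, consider the exponential model f(x|θ) = θe^{−θx}, for which the ML estimator is 1/x̄ and J = 1/θ²o, so that √n(θn − θo) →_D N(0, θ²o). The Monte Carlo sketch below is merely an illustration of this prediction, not part of the theory; the sample size, replication count and function names are arbitrary choices.

```python
import math
import random

def exp_mle_draws(theta_o, n, reps, seed=1):
    """Monte Carlo draws of sqrt(n) * (theta_hat - theta_o) for the
    exponential model f(x|theta) = theta * exp(-theta x), whose ML
    estimator is theta_hat = 1 / (sample mean)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(theta_o) for _ in range(n)) / n
        draws.append(math.sqrt(n) * (1.0 / xbar - theta_o))
    return draws

if __name__ == "__main__":
    z = exp_mle_draws(theta_o=2.0, n=200, reps=1000)
    m = sum(z) / len(z)
    v = sum((x - m) ** 2 for x in z) / len(z)
    # Theorem 3.8 predicts a limiting N(0, J^{-1}) law, with
    # J^{-1} = theta_o^2 = 4 in this model
    print(round(m, 2), round(v, 2))
```

The empirical mean and variance of the draws should be close to 0 and θ²o, respectively, up to Monte Carlo error.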
In the next subsection, we abandon Assumption 4 and move again to a more general framework. The exposition of the next subsection is largely inspired by the prominent work of Andrews [3].
3.3.3 Boundary Point Proviso
In the prior subsections we have restricted our attention to establishing convergence in distribution results for extremum estimators under the assumption that the true parameter θo is an interior point of the parameter space Θ. In this subsection we allow the true parameter θo to be a closure point. Thus, the true parameter can now be either in the interior of the parameter set, int(Θ), or on the boundary of the parameter set, ∂(Θ). Here and in the sequel, the idea will be to approximate the parameter space Θ by a cone Λ, in the sense of Definition 2.2. This type of approximation allows the boundary of the parameter space ∂(Θ) to be linear or curved, and possibly to have kinks.
We thus have the following alternative to Assumption 4.

Assumption IV: Boundary Parameter

• θo ∈ cl(Θ) ≡ int(Θ) ∪ ∂(Θ), where cl(·) denotes the closure of a set, and ∂(·) denotes its boundary.

Obviously, even though we allow the true parameter θo to be an interior point, the main interest here lies in the cases wherein θo is on the boundary of the parameter set.
We consider the case where the estimator objective function T_n(θ) has a quadratic expansion around the true parameter θo.

Assumption 7: Quadratic Expansion with Bounded Remainder

• T_n(θ) has a quadratic expansion around θo,

T_n(θ) = T_n(θo) + [DT_n(θo)]^T(θ − θo) + (1/2)(θ − θo)^T D²T_n(θo)(θ − θo) + R_n(θ),   (3.7)

where

sup_{θ∈Θ: ‖B_n(θ−θo)‖≤γ} |R_n(θ)| = op(1),

and B_n is a sequence of matrices such that λ_min(B_n) → ∞ as n → ∞, where λ_min(·) denotes the smallest eigenvalue.
For the current purposes, the quadratic expansion of the estimator objective function around the true parameter can be restated in a more suitable manner. This is established in the following lemma.
Lemma 3.9. The quadratic expansion stated in equation (3.7) can be rewritten as

T_n(θ) = T_n(θo) + (1/2) q_n(0) − (1/2) q_n(B_n(θ − θo)) + R_n(θ),   (3.8)

where

q_n(λ) ≡ (λ − Z_n)^T F_n (λ − Z_n),  λ ∈ Rk,   (3.9)

and

F_n ≡ −[B_n⁻¹]^T D²T_n(θo) B_n⁻¹,   Z_n ≡ F_n⁻¹ [B_n⁻¹]^T DT_n(θo).
Proceeding as in the previous sections, we now state the remaining assumptions under
which it will be possible to obtain the large sample results of interest. We start with the
introduction of some further regularity conditions for the quadratic expansion provided
in (3.8).
Assumption 8

• Assume that the following large sample results hold:

[B_n⁻¹]^T DT_n(θo) →_D G,   F_n →_D F,

where G is a random k-vector, and F is a symmetric k × k matrix which is nonsingular with probability one.
Note that, as a consequence of Theorem 2.15, the latter condition can also be interpreted
as a bounding condition. Further, given Assumption 8, we are able to define a limiting
version of (3.9), given by

q(λ) ≡ (λ − Z)^T F (λ − Z),  λ ∈ Rk,   (3.10)

where Z ≡ F⁻¹G, and F and G follow from Assumption 8.
Further, we have to impose another boundedness condition on the argument of q_n in the quadratic expansion (3.8). Namely, in Assumption 9, we will assume B_n(θn − θo) to be bounded in probability.
Assumption 9: Stochastic Boundedness Condition
• Bn(θn − θo) is stochastically bounded, i.e.
Bn(θn − θo) = Op(1).
The next assumption is concerned with the local approximation of the parameter space by a cone. Here and in the sequel, by the shifted and rescaled parameter space we will mean the sequence of sets

B_n(Θ − θo)/b_n = { B_n(θ − θo)/b_n : θ ∈ Θ },  n ∈ N,

where b_n is a sequence of real numbers such that b_n → ∞ as n → ∞, and

b_n ≤ c λ_min(B_n),

for some positive real number c.
The following assumption makes use of the generalization of the definition of local ap-
proximation by a cone, which was introduced in Definition 2.4.
Assumption 10: Local Approximation by a Cone
• The sequence of sets Bn(Θ− θo)/bn is locally approximated by a cone Λ.
Before stating the last assumption of interest, consider the following possibly set-valued mappings:

λ_n ⇒ arg inf_{λ∈cl(Λ)} q_n(λ),   (3.11)

where q_n(λ) is defined as in (3.9). Similarly, consider the limiting version of λ_n,

λ ⇒ arg inf_{λ∈cl(Λ)} q(λ),   (3.12)

where q(λ) is defined as in (3.10).
In order to ensure that the aforementioned mappings are not set-valued, we will have to assume convexity of the cone Λ used to approximate the shifted and rescaled parameter space.

Assumption 11: Convexity of the Cone

• Λ is convex.

As a consequence of the latter assumption, we are able to define

λ_n ≡ arg inf_{λ∈cl(Λ)} q_n(λ)

instead of (3.11). Similarly, we are able to define

λ ≡ arg inf_{λ∈cl(Λ)} q(λ)

instead of (3.12).
The assumptions stated above allow us to reach the following large sample result.
Theorem 3.10.
Suppose that assumptions (IV, 5, 7, 8, 9, 10, 11) hold. Then:

1. B_n(θn − θo) − λ_n = op(1);
2. λ_n →_D λ;
3. B_n(θn − θo) →_D λ.
Proof. See [3] pp. 1378–1379.
Thus far we have focused our attention on the large sample properties of the broad class of estimators defined in equation (3.1). We have not yet commented on the computational aspects of extremum estimators. This issue will be addressed in the next chapter.
Chapter 4
Global Optimization by
Stochastic Search Methods
4.1 Introduction
As discussed in the previous chapter, any estimator which can be formulated through an optimization problem of interest is tantamount to an extremum estimator. Thus far we have concerned ourselves with the circumstances under which it is possible to establish the consistency and asymptotic normality of such estimators. We have not yet focused on the computational aspects of extremum estimators. In fact, in a plurality of cases of practical interest, these estimators are not analytically tractable, and so we frequently lack a closed-form solution for obtaining the estimates. A nice overview of some standard numerical procedures which can be employed to compute the estimates can be found, for instance, in Hayashi [22] and Judge et al. [24]. Given that the quest for an input value which fulfills some determined output criterion is a problem of interest in a wide variety of scenarios, there is nowadays a broad class of methods available. Notwithstanding, when choosing which algorithm to rely on, one should avoid iterative maximization procedures which can converge to a local solution. We emphasize that the fact of an extremum estimator being achieved as a global solution to a determined optimization problem has deep implications regarding consistency. As becomes clear from inspection of the large sample results stated in the previous chapter, consistency is only ensured to hold for the global solutions. This point is peculiarly important when the optimization problem at hand is analytically intractable.
Broadly speaking, two types of numerical procedures are typically adopted to tackle such a problem, namely deterministic and stochastic optimization algorithms. The former include the Newton–Raphson method and the steepest descent method, among many others (see [36] and references therein). In this thesis the focus will, however, be placed on stochastic optimization algorithms. These include the pure random search (Solis and Wets [49]), the simulated annealing technique (see Bohachevsky et al. [5]), the conditional martingale algorithm (Esquível [17]), etc. As mentioned in Chapter 1, we follow Spall [50] in referring to stochastic search and optimization algorithms in the following terms.
Stochastic Search and Optimization
I. There is some random noise in the measurements of Tn(θ); (and/or)
II. There is a random choice in the search direction as the algorithm iterates
toward a solution.
In this thesis we are exclusively concerned with item II, and hence the focus is placed
on optimization algorithms wherein the search direction is randomly dictated.
In this chapter, we contribute by proposing a master method of which several stochastic optimization algorithms are particular cases. The generality of the proposed method is broad enough to include the conceptual algorithm of Solis and Wets [49] as a particular case. Furthermore, we also establish the convergence of the proposed method under
a set of fairly mild assumptions. An important instance of this method, to which we also devote some attention, is given by the stochastic zigzag method, an algorithm largely inspired by the prominent work of Mexia et al. [32]. We point out that even though the crux of our analysis lies in the optimization problem

max_{θ∈Θ} T_n(θ),   (4.1)

for a fixed n, the procedures developed here carry over mutatis mutandis to other unconstrained optimization problems of interest.
The remainder of this chapter is organized as follows. In the next section we provide an overview of random search techniques as a starting point towards our method. In §4.3 we recast the meta-approach of Solis and Wets [49] through the introduction of a master method which includes several other stochastic optimization algorithms as particular cases. The convergence of the master method is established in §4.4. Finally, in §4.5 we offer a short note regarding the construction of confidence intervals for the maximum, departing from the general algorithm previously introduced.
4.2 An Overview of Random Search Techniques
Suppose that one has available a random sample of size n from a population of interest. With such a sample at hand, we intend to solve the optimization problem (4.1), as a means to obtain estimates of θo. It is worth noting that, from the conceptual standpoint, for a fixed n, one can also think of the graph of T_n as a population of interest from which one intends to consistently estimate the parameters¹

( arg max_{θ∈Θ} T_n(θ),  max_{θ∈Θ} T_n(θ) ).

In order to do so, suppose that we collect a random sample {(θ_i, T_n(θ_i))}_{i=1}^p from such a population. Hence, for each sampled value θ_i, we also inquire its corresponding image value T_n(θ_i). Assume that such a sample is collected sequentially and that during each extraction period we compute

θ̄_i = θ_0 ⇐ i = 0;   θ̄_i = θ̄_{i−1} I{T_n(θ̄_{i−1}) ≥ T_n(θ_i)} + θ_i I{T_n(θ̄_{i−1}) < T_n(θ_i)} ⇐ i ∈ N.   (4.2)
As we shall see below, the procedure described above contains the essence of the classical
pure random search algorithm.
Classical Random Search Algorithm

1. Choose an initial value of θ, say θ_0 ∈ Θ, either randomly or deterministically. Set i = 0 and θ̄_0 = θ_0.

2. Generate a new independent value θ_{i+1} from a probability distribution f, with support over Θ. If T_n(θ_{i+1}) > T_n(θ̄_i), set θ̄_{i+1} = θ_{i+1}. Else set θ̄_{i+1} = θ̄_i.
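The two steps above can be sketched as a short routine; the toy objective and the uniform proposal below are illustrative choices, not part of the algorithm itself.

```python
import random

def pure_random_search(T, sample, iters, seed=0):
    """Classical pure random search: keep theta_bar, the best draw
    seen so far, replacing it whenever a fresh draw improves T."""
    rng = random.Random(seed)
    theta_bar = sample(rng)          # Step 1: initial value
    for _ in range(iters):           # Step 2, repeated
        cand = sample(rng)
        if T(cand) > T(theta_bar):
            theta_bar = cand
    return theta_bar

if __name__ == "__main__":
    # toy objective with a global maximum at theta = 1, over [-4, 4]
    T = lambda th: -(th - 1.0) ** 2
    best = pure_random_search(T, lambda rng: rng.uniform(-4, 4), 5000)
    print(round(best, 2))
```

With 5000 uniform draws over [−4, 4], the returned point is, with very high probability, close to the global maximizer.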
The convergence of the algorithm stated above was established in the seminal work of Solis and Wets [49]. The crux of their work lies in the introduction of a conceptual algorithm which includes, among others, the algorithm stated above. In fact, as a means to shed some light on some standard variants of the classical random search algorithm, observe that:
- other types of processes can be used in lieu of (4.2);
- independence in the choice of the values of θ_i is sometimes dropped;
- the probability distribution f can be allowed to have support defined over Rk ⊇ Θ.²
¹ Recall that the graph of T is defined as gr(T) = {(θ, T(θ)) : θ ∈ Θ}.
² Obviously, due adaptations are entailed, otherwise some hindrances can arise in Step 2. For instance, the lapse of such modifications may preclude the computation of the image for certain values of θ which are not included in the domain of T.
4.3 Recasting the Solis and Wets Framework
4.3.1 Preliminaries and Notation
A Brief Note on Notation
Here and below we make use of the following shorthand notation:

∇(t) = {θ ∈ Θ : T_n(θ) < t},  ∇̄(t) = {θ ∈ Θ : T_n(θ) ≤ t},
∆(t) = {θ ∈ Θ : T_n(θ) > t},  ∆̄(t) = {θ ∈ Θ : T_n(θ) ≥ t}.
Further, we make use of the almost universal quantifier, which we introduce below.
Definition 4.1. Let P(x) denote a proposition which depends on a variable x taking values in a specified domain D. We say that gx P(x) if P(x) is true for all the values of D\N, where N is a null-measure set.
Remark 4.2. The universal quantifier ∀ and the existential quantifier ∃ are often used to select the elements of a set wherein some property holds. The almost universal quantifier g is introduced for selecting the elements of a set wherein some property holds outside a null-measure set. Whereas the universal quantifier ∀ should be read as "for every element", the almost universal quantifier should be read as "for almost every element".³ For the sake of illustration, using classical measure-theoretic notation we say that functions f and g are equivalent if

f = g, a.e.

With our notation we write

f(θ) = g(θ), gθ ∈ Θ.
In the sequel we introduce some definitions which are necessary for the presentation of the convergence result of a general algorithm stated below. First we introduce the concept of essential supremum, which is deeply related to the maximum. It turns out that the concept of essential supremum is more suited to computational purposes than the maximum itself. Second, we present the concept of optimality region.

Definition 4.3. Let T_n : Θ → R be a measurable function. The essential supremum is defined as

ess sup_{θ∈Θ} T_n(θ) ≡ inf{t : T_n(θ) ≤ t, gθ ∈ Θ}.
³ The logical grounds for this quantifier are far beyond the scope of this work. We leave this for future work.
Similarly, we define the essential infimum as

ess inf_{θ∈Θ} T_n(θ) ≡ sup{t : T_n(θ) ≥ t, gθ ∈ Θ}.
Remark 4.4. Observe that the essential supremum can be equivalently rewritten as

ess sup_{θ∈Θ} T_n(θ) = inf{t : λ(∆(t)) = 0} = sup{t : λ(∆(t)) > 0},   (4.3)

where λ(·) denotes the Lebesgue measure. Note that the last equality follows by a similar argument as in Williams [56] (p. 34), and by noting that for every positive h it holds that ∆(t + h) ⊆ ∆(t). By a similar reasoning, the essential infimum definition is tantamount to

ess inf_{θ∈Θ} T_n(θ) = inf{t : λ(∇(t)) > 0}.   (4.4)

This type of representation is actually preferred by Solis and Wets [49].
The concepts of essential supremum and essential infimum are also used in the context of stochastic differential equations (see, for instance, Øksendal [39]). To gain some insight into the mechanics of these definitions, replace the almost universal quantifier g in the definition of the ess sup by ∀. We would then be computing the infimum of the set

{t : T_n(θ) ≤ t, ∀θ ∈ Θ},

and this would yield the least majorant of the range of T_n. A similar reasoning applies when we consider the universal quantifier in lieu of the almost universal quantifier in the definition of the essential infimum.
For the sake of completeness, below we state some basic results regarding the essential supremum; the corresponding proofs are included in the appendix.

Fundamental Properties of the Essential Supremum

1. T_n(θ) ≤ ess sup_{x∈Θ} T_n(x), gθ ∈ Θ.
2. ess sup_{x∈Θ} (T_n(x) + R_n(x)) ≤ ess sup_{x∈Θ} T_n(x) + ess sup_{x∈Θ} R_n(x).
3. ess sup_{x∈Θ} T_n(x) ≤ sup_{x∈Θ} T_n(x).
It is important to underscore that all the results stated above are valid as long as the measurability of T_n and R_n is verified. As a means to ease notation, in the sequel we use τ̄ and τ to denote the essential supremum and the essential infimum, respectively. Let us observe that if the maximizer of T_n is unique, and T_n is continuous, the essential supremum τ̄ coincides with the maximum.
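A small numerical illustration of the gap between the supremum and the essential supremum, assuming Θ = [0, 1] and a function with a spike on a null set (all names below are illustrative):

```python
import random

def T(theta):
    # equals theta, except on the null set {0.5}, where it spikes to 7;
    # hence sup T = 7 while ess sup T = 1
    return 7.0 if theta == 0.5 else theta

def mc_ess_sup(T, draws=20000, seed=0):
    """Monte Carlo proxy for the essential supremum over [0, 1]: the
    largest value of T at uniformly drawn points, which almost surely
    never land on a null set."""
    rng = random.Random(seed)
    return max(T(rng.random()) for _ in range(draws))

if __name__ == "__main__":
    print(mc_ess_sup(T))   # close to 1, not 7
```

This also hints at why the essential supremum is the computationally relevant notion for random search: sampled points never see the null set.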
Theorem 4.5.
Let T_n be continuous and suppose that θn = arg max_{θ∈Θ} T_n(θ). Then the essential supremum τ̄ coincides with the maximum of the estimator objective function, i.e.

τ̄ = T_n(θn).

Proof. As a consequence of Property 3 (see above and Proposition A.4 in the appendix), we have that τ̄ ≤ T_n(θn). Thus, we only have to prove that τ̄ ≥ T_n(θn). Let ε > 0 be given. There exists θ_ε ∈ Θ such that

T_n(θn) − ε < T_n(θ_ε) ≤ T_n(θn).

Since we are assuming T_n to be continuous, there exists δ > 0 such that for every θ ∈ B_δ(θ_ε) ≡ {θ ∈ Θ : ‖θ − θ_ε‖ < δ} we still have

T_n(θn) − ε < T_n(θ) ≤ T_n(θn).

Consequently, we get

λ(∆(T_n(θn) − ε)) ≥ λ(B_δ(θ_ε)) > 0,

and so (4.3) implies that τ̄ ≥ T_n(θn) − ε. Since ε is arbitrary, we have τ̄ ≥ T_n(θn).
Figure 4.1: A sketch which portrays the reasoning involved in the proof of Theorem 4.5. This instance is used just to provide guidance, and it is not part of the proof.
We now formally define the concept of optimality zone.

Definition 4.6. Let τ̄ and τ denote the essential supremum and the essential infimum of T_n, respectively. The optimality zone for the maximand of T_n is given by the set-valued function O : R²₊ ⇒ Θ defined as

O_{ε,M} = {θ ∈ Θ : T_n(θ) > τ̄ − ε} ⇐ τ̄ ∈ R;   O_{ε,M} = {θ ∈ Θ : T_n(θ) > M} ⇐ τ̄ = +∞.

Similarly, we define the optimality zone for the minimand as

O_{ε,M} = {θ ∈ Θ : T_n(θ) < τ + ε} ⇐ τ ∈ R;   O_{ε,M} = {θ ∈ Θ : T_n(θ) < M} ⇐ τ = −∞.
Remark 4.7. Note that, making use of the shorthand notation defined above, we can restate the optimality zones as follows:

O_{ε,M} = ∆(τ̄ − ε) ⇐ τ̄ ∈ R, ∆(M) ⇐ τ̄ = +∞;   O_{ε,M} = ∇(τ + ε) ⇐ τ ∈ R, ∇(M) ⇐ τ = −∞.
It remains to discuss the magnitude of the optimality zone. Below we provide a rough upper bound for the measure of the optimality region of the minimand, which makes use of van der Corput's sublevel set bound (see Appendix B).⁴
Theorem 4.8.
Let Θ = [a, b] denote some parameter space of interest. Suppose that T_n : Θ → R₊ is k-times differentiable on int(Θ), with k ≥ 1, and that |T_n^{(k)}(θ)| ≥ ζ > 0. Then it holds that

λ(O_{ε,M}) ≤ c_k [ ((τ + ε)/ζ)^{1/k} I(τ ∈ R) + (M/ζ)^{1/k} I(τ = −∞) ],

where c_k = (k! 2^{2k−1})^{1/k}, and I(·) denotes the indicator function.

Proof. The proof follows from a direct application of van der Corput's sublevel set bound (see Appendix B). If τ ∈ R, then by the van der Corput sublevel set bound

λ(O_{ε,M}) ≤ c_k ((τ + ε)/ζ)^{1/k}.

Similarly, if τ = −∞, we have the sublevel set bound λ(O_{ε,M}) ≤ c_k (M/ζ)^{1/k}.
⁴ Some modern versions of this result can be found, for instance, in Rogers [44] and Carber et al. [10]. Note that this classical result still plays an active role in contemporaneous mathematical research.
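As a sanity check of the bound, under the reading c_k = (k! 2^{2k−1})^{1/k} adopted here, take T_n(θ) = θ² on Θ = [−1, 1], so that k = 2, ζ = 2 and τ = 0, and the optimality zone {θ : θ² < ε} has Lebesgue measure 2√ε. The sketch below is an illustration, not part of the proof:

```python
import math

def vdc_bound(eps, tau_inf, zeta, k):
    """Upper bound of Theorem 4.8 for the minimand optimality zone
    when the essential infimum is finite, reading the constant as
    c_k = (k! * 2**(2k - 1))**(1/k)."""
    c_k = (math.factorial(k) * 2 ** (2 * k - 1)) ** (1.0 / k)
    return c_k * ((tau_inf + eps) / zeta) ** (1.0 / k)

if __name__ == "__main__":
    # T_n(theta) = theta^2 on [-1, 1]: |T_n''| = 2, so k = 2 and
    # zeta = 2; the exact measure of the zone is 2 * sqrt(eps)
    for eps in (0.01, 0.1, 0.5):
        exact = 2.0 * math.sqrt(eps)
        assert exact <= vdc_bound(eps, tau_inf=0.0, zeta=2.0, k=2)
    print("bound holds")
```

Here the bound equals 2√2·√ε, slack by a factor √2 relative to the exact measure 2√ε, consistent with the remark below that the bound is rough.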
Even though appealing from the theoretical standpoint, this result is of limited use from a practical stance: even with strong requirements on the function T_n, we are unable to obtain an upper bound which is independent of τ.
The next definition closes our conceptual framework.
Definition 4.9. A function C : Θ × Rk → Θ is a compass function if

(T_n ∘ C)(θ_a, θ_b) ≥ T_n(θ_a), ∀(θ_a, θ_b) ∈ Θ × Rk;
(T_n ∘ C)(θ_a, θ_b) ≥ T_n(θ_b), ∀(θ_a, θ_b) ∈ Θ × Θ.

Example 4.1. A simple example of a compass function is given by the mapping

C(θ_a, θ_b) = θ_a I{θ_a ∈ ∆̄(T_n(θ_b))} + θ_b I{θ_a ∈ ∇(T_n(θ_b))}.

If we define θ̄_{i+1} = C(θ̄_i, θ_{i+1}), then it holds that

θ̄_{i+1} = θ̄_i I{T_n(θ̄_i) ≥ T_n(θ_{i+1})} + θ_{i+1} I{T_n(θ̄_i) < T_n(θ_{i+1})},

and so we recover the above-mentioned probabilistic recursive translation of the pure random search algorithm.
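In code, the compass of Example 4.1 amounts to a comparison of objective values; the sketch below (with an illustrative objective) mirrors the recursion θ̄_{i+1} = C(θ̄_i, θ_{i+1}):

```python
def compass(T, theta_a, theta_b):
    """The compass of Example 4.1: return whichever of the two points
    attains the larger objective value (ties keep theta_a)."""
    return theta_a if T(theta_a) >= T(theta_b) else theta_b

if __name__ == "__main__":
    T = lambda th: -abs(th)            # illustrative objective
    theta_bar = 3.0                    # current best point
    for cand in (2.0, 5.0, -1.0, 4.0):
        theta_bar = compass(T, theta_bar, cand)
    print(theta_bar)   # -1.0, the best candidate seen
```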
Even though the definition of compass function given above is for maximization problems, it can easily be adapted to minimization problems. It is worth mentioning that we refer to this function as the compass, since this is the mapping that guides the process of selection of the extremes.
Given that our main interest lies in the optimization problem max_{θ∈Θ} T_n(θ), for a fixed n, hereinafter we focus on maximization.
In the next section we introduce the master method, a broad algorithm which includes several other optimization algorithms.
4.3.2 The Master Method

We open this subsection with the introduction of a general method of which several other algorithms are particular cases. The modus operandi of such a method is given below.

Modus Operandi of the Master Method (c ∈ N)

0. Set i, j = 1. Find a, b ∈ Θ, and set θ_0 and θ̄_0 equal to arg max_{x∈{a,b}} T_n(x). Further, set z_1 and Z_{1,1} equal to arg min_{x∈{a,b}} T_n(x).

1. If c > 1, generate Z_{i,j} from the probability space (Rk, B(Rk), P_{i,j}), and set Z_{i,j+1} = Z_{i,j}. Else, go to Step 2.

2. If j < c − 1, increment j, and return to Step 1. Otherwise, set θ̄_i = C(θ̄_{i−1}, θ_i), where θ_i = arg max_{q∈{1,...,c}} T_n(Z_{i,q}), and set j = 1.

3. Generate z_i from the probability space (Rk, B(Rk), P_i), set Z_{i,1} = z_i, increment i and j, and return to Step 1.
Some comments concerning this general algorithm:

• The parameter c can be defined a priori by the user, and it can take any positive integer value. As a rule of thumb, we suggest taking c as random (e.g. drawn from a discrete uniform distribution U{1, . . . , k}).

• Observe that Step 0 simply initiates the algorithm. If we repeat Step 1 for a fixed i, we construct the iterates Z_{i,1}, Z_{i,2}, . . . , Z_{i,c−1}. In Step 2 we update the compass and obtain the next 'candidate' argument of the maximum, as proposed by the algorithm. The repetition of Step 3 yields z_2, z_3, etc.

• Here and below, we refer to each z_i as a seed. If the seeds are independent and identically distributed, we refer to the master method as pure. If the probability measure P_p depends on some probability measure(s) P_q, with q < p, then the master method will be called adaptive. Further, we refer to each Z_{i,j} as an iterate. For each i we will refer to the sequence Z_{i,1}, . . . , Z_{i,c} as a course. In light of this terminology, we can say that the consecutive repetition of Step 1 builds a course. Similarly, if we rerun Step 3 serially we obtain a sequence of seeds.

• The mechanics of the algorithm is perhaps better understood through the law of movement of the iterates, which can be written as

Z_{i,j} = z_i I(j = 1 ∨ c = 1) + Z_{i,j−1} I(j ∈ {2, . . . , c} ∧ c > 1).   (4.5)
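A minimal sketch of the master method for a real-valued parameter; the helper names draw_seed, course and fresh_course are illustrative, and with the fresh_course choice below the method reduces to pure random search:

```python
import random

def master_method(T, draw_seed, course, c, r, seed=0):
    """Sketch of the master method: r seeds, a course of c iterates
    per seed, and a compass update keeping the best point found.
    draw_seed(rng) plays the role of the measures P_i, while
    course(rng, theta_bar, z, c) builds the iterates Z_{i,1..c}."""
    rng = random.Random(seed)
    a, b = draw_seed(rng), draw_seed(rng)        # Step 0
    theta_bar = max((a, b), key=T)
    z = min((a, b), key=T)
    for _ in range(r):
        iterates = course(rng, theta_bar, z, c)  # Step 1
        cand = max(iterates, key=T)              # Step 2: compass update
        if T(cand) > T(theta_bar):
            theta_bar = cand
        z = draw_seed(rng)                       # Step 3: fresh seed
    return theta_bar

def fresh_course(rng, theta_bar, z, c):
    # with this course the method reduces to pure random search
    return [z] + [rng.uniform(-4, 4) for _ in range(c - 1)]

if __name__ == "__main__":
    T = lambda th: -(th - 1.0) ** 2   # toy objective, maximum at 1
    best = master_method(T, lambda rng: rng.uniform(-4, 4),
                         fresh_course, c=5, r=1000)
    print(round(best, 2))
```

Swapping the course generator is all that is needed to obtain other instances, such as the stochastic zigzag method of §4.3.3.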
To gain some insight into the mechanics of the method, consider the case wherein c = 1. Throughout this chapter, this benchmark case will be invoked frequently. In such a case we have that

θ_i = arg max_{q∈{1}} T_n(Z_{i,q}) = Z_{i,1} = z_i,

and hence θ_i = z_i = Z_{i,1}. Additionally, j becomes inactive in the algorithm, given that under these circumstances Step 1 is never activated. Consequently, for c = 1 the algorithm can be equivalently rewritten as follows.
Modus Operandi of the Master Method (c = 1)

0. Set i = 1. Find θ_1 ∈ Θ, and set θ̄_0 = θ_1.
1. Set θ̄_i = C(θ̄_{i−1}, θ_i), and increment i.
2. Generate θ_i from the probability space (Rk, B(Rk), P_i), and return to Step 1.
Hence, the classical Solis and Wets conceptual algorithm [49] is a particular case of our master method with c = 1. We emphasize that the master method is simply a generalization of this method which follows a course between any two seeds. These and other features of the master method will become clearer after the introduction of a matrix formulation of the master method, which we present in §4.3.4.
4.3.3 Stochastic Zigzag Methods

We will be particularly interested in the following instance of the master method.

Modus Operandi of the Stochastic Zigzag Method (c ∈ N)

0. Set i, j = 1. Find a, b ∈ Θ, and set θ_0 and θ̄_0 equal to arg max_{x∈{a,b}} T_n(x). Further, set z_1 and Z_{1,1} equal to arg min_{x∈{a,b}} T_n(x).

1. If c > 1, generate α_{i,j} from the probability space (R, B(R), P_{i,j}), and set Z_{i,j+1} = α_{i,j} θ̄_{i−1} + (1 − α_{i,j}) z_i. Else, go to Step 2.

2. If j < c − 1, increment j, and return to Step 1. Otherwise, set θ̄_i = C(θ̄_{i−1}, θ_i), where θ_i = arg max_{q∈{1,...,c}} T_n(Z_{i,q}), and set j = 1.

3. Generate z_i from the probability space (Rk, B(Rk), P_i), set Z_{i,1} = z_i, increment i and j, and return to Step 1.
Essentially the layout of the algorithm is the following. In Step 0 we initialize the algorithm; the consecutive application of Step 1 then collects a random sample of c points from the line which passes through the points θ̄_0 and z_1. In Step 2, we refresh the compass function C, obtaining the next candidate for the argument of the maximum yielded by the algorithm. We then move to Step 3, wherein a new seed is generated. Again, we sample the line which passes through the argument of the maximum of the previous line and the newly generated seed, and proceed by repeating the procedure described above (possibly ad infinitum).
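The steps above can be sketched as follows, assuming a one-dimensional parameter; the seed and α distributions below are illustrative choices:

```python
import random

def zigzag(T, draw_seed, draw_alpha, c, r, seed=0):
    """Sketch of the stochastic zigzag method for a scalar parameter:
    each course holds the seed z plus c - 1 points on the line through
    the current best point theta_bar and z."""
    rng = random.Random(seed)
    a, b = draw_seed(rng), draw_seed(rng)   # Step 0
    theta_bar = max((a, b), key=T)
    z = min((a, b), key=T)
    for _ in range(r):
        course = [z]                        # Step 1: build a course
        for _ in range(c - 1):
            al = draw_alpha(rng)
            course.append(al * theta_bar + (1 - al) * z)
        cand = max(course, key=T)           # Step 2: compass update
        if T(cand) > T(theta_bar):
            theta_bar = cand
        z = draw_seed(rng)                  # Step 3: fresh seed
    return theta_bar

if __name__ == "__main__":
    # one-dimensional analogue of (4.6): we maximize the negative of
    # the Styblinski–Tang function, whose maximizer is close to -2.90
    def T(x):
        return -0.5 * (x ** 4 - 16 * x ** 2 + 5 * x)
    best = zigzag(T, lambda rng: rng.uniform(-5, 5),
                  lambda rng: rng.uniform(-0.5, 1.5), c=10, r=500)
    print(round(best, 2))   # close to -2.90
```

Allowing α outside [0, 1] lets the course extrapolate beyond the segment joining θ̄ and the seed, rather than merely interpolate.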
For the sake of illustration, below we provide an application of the stochastic zigzag method to the classical test function

L(x_1, x_2) = (1/2)[x_1⁴ − 16x_1² + 5x_1 + x_2⁴ − 16x_2² + 5x_2].   (4.6)

Figure 4.2: The initialization of the stochastic zigzag method. In the picture on the left we start by finding points a and b which initialize the algorithm. The second picture illustrates that in Step 1 we collect a random sample (c = 10) from the line which passes through a and b. The remaining picture depicts Steps 2 and 3, wherein after the maximum of the first line we generate another seed and start by extracting a sample from the new line which passes through such points.
It is important to underscore that other variants of the stochastic zigzag method are also included in the general method, but we preferred to focus on the one stated above for its simplicity and appealing ease of implementation (see Theorem 4.10 in §4.3.4). We could have considered an alternative shape for the line, and the robustness of the master method is such that we are even able to take a different type of line for each different course.
4.3.4 A Matrix Formulation of the Master Method

In this subsection, we shed some light on the matrix formulation of the master method. This conceptual framework will help us to clarify some features of the method. Additionally, as we shall see later, such a representation can substantially reduce the burden of implementation. In order to present this formulation, we need to consider a stopping time of the method, which we denote by r. From the theoretical stance, one can consider, for instance, the time of entry into the optimality zone. In fact, this can be defined, for every ε, M > 0, as

r_{ε,M} = inf{i ∈ N : θ̄_i ∈ O_{ε,M}}.
Figure 4.3: The application of the stochastic zigzag method to the Styblinski–Tang test function. This function pertains to a class of test functions which are typically used to assess the performance of an optimization algorithm (see e.g. Spall [50]). The functional form of this function is given in formula (4.6).
It can easily be shown that this is a stopping time with respect to the natural filtration F_i = σ(θ̄_1, θ̄_2, . . . , θ̄_i). Analogous stopping times can be found even in introductory textbooks (e.g. [56]), so we skip the details. The crux of the proof is given by observing that

{r_{ε,M} ≤ i} = ⋃_{p=1}^{i} {θ̄_p ∈ O_{ε,M}} ∈ F_i, ∀i ∈ N.
The law of movement of the iterates (4.5) allows us to describe the mechanics of the method in matrix form by defining the iterative matrix Z as the (r × kc)-matrix

Z ≡
[ Z_{1,1} Z_{1,2} · · · Z_{1,c} ]
[ Z_{2,1} Z_{2,2} · · · Z_{2,c} ]
[    ⋮       ⋮     ⋱      ⋮    ]
[ Z_{r,1} Z_{r,2} · · · Z_{r,c} ]
=
[ z_1 Z_{1,1} · · · Z_{1,c−1} ]
[ z_2 Z_{2,1} · · · Z_{2,c−1} ]
[  ⋮     ⋮     ⋱      ⋮      ]
[ z_r Z_{r,1} · · · Z_{r,c−1} ]
=
[ Z_1 ]
[ Z_2 ]
[  ⋮  ]
[ Z_r ],   (4.7)

where Z_i denotes the i-th course, written as a row vector.
Further, we will refer to the map-iterative matrix T_Z as the (r × c)-matrix

T_Z ≡
[ T(Z_{1,1}) T(Z_{1,2}) · · · T(Z_{1,c}) ]
[ T(Z_{2,1}) T(Z_{2,2}) · · · T(Z_{2,c}) ]
[     ⋮          ⋮       ⋱        ⋮     ]
[ T(Z_{r,1}) T(Z_{r,2}) · · · T(Z_{r,c}) ],

whose i-th row gathers the images of the i-th course Z_i.
For the sake of illustration, let us rethink the case wherein c = 1. Then the iterative matrix Z and the map-iterative matrix T_Z become

Z = [ z_1 · · · z_r ]^T,   T_Z = [ T(z_1) · · · T(z_r) ]^T.

Hence, the affinity between the Solis and Wets conceptual algorithm [49] and the master method introduced above is now clearer. In fact, in the particular case wherein c = 1, the iterative matrix degenerates into a matrix composed solely of seeds, i.e., random draws generated from the probability spaces (Rk, B(Rk), P_i).
In the particular case of the stochastic zigzag method, the following matrix also finds
application:

α ≡ [ α1,1  α1,2  · · ·  α1,c−1 ]
    [ α2,1  α2,2  · · ·  α2,c−1 ]
    [  ⋮     ⋮     ⋱       ⋮   ]
    [ αr,1  αr,2  · · ·  αr,c−1 ].
In the result that follows we show how the matrix representation of the stochastic
zigzag method can bring inherent implementation advantages.

Theorem 4.10. (Kronecker–zigzag decomposition)
The i-th zigzag course can be rewritten as

zi = [ zi ⋮ αi ⊗ θi−1 + (1^T_{c−1} − αi) ⊗ zi ],     (4.8)

for i = 1, . . . , r, and where θi−1 is defined according to the formulation of the stochastic
zigzag method given above.
Proof. Just note that

zi = [ zi ⋮ αi,1 θi−1 + (1 − αi,1) zi  · · ·  αi,c−1 θi−1 + (1 − αi,c−1) zi ]
   = [ zi ⋮ αi,1 θi−1  · · ·  αi,c−1 θi−1 ] + [ 0 ⋮ (1 − αi,1) zi  · · ·  (1 − αi,c−1) zi ]
   = [ zi ⋮ αi ⊗ θi−1 + (1^T_{c−1} − αi) ⊗ zi ].
The latter result warrants some comments. Roughly speaking, it states that the law
of movement of each iterate can be readily extended to describe the law of movement
of a whole zigzag course, simply by replacing the usual scalar product by the Kronecker
product (and by performing the necessary scalar-to-vector adaptations). Note that the
latter result allows us to build very simple computational implementations, since it only
requires a loop which can be easily stated in pseudocode.
Pseudocode Implementation of the Stochastic Zigzag Method

• rand:
    - seeds;
    - alpha.
• for i = 1 to r:
    - compute theta_{i-1};
    - compute z_i via the decomposition (4.8);
    - increment i.
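The loop above can be made concrete in NumPy, with np.kron building each course in a single step. The sketch below is ours and rests on assumptions of our own choosing: minimization of the Styblinski–Tang function, α weights drawn uniformly on (0, 1), and a best-so-far compass rule.

```python
import numpy as np

def styblinski_tang(x):
    """Styblinski-Tang test function, evaluated row-wise."""
    x = np.atleast_2d(x)
    return 0.5 * np.sum(x**4 - 16.0 * x**2 + 5.0 * x, axis=1)

def zigzag(rng, r=50, c=4, k=2, lo=-5.0, hi=5.0):
    """Pure stochastic zigzag: r courses of c points in R^k (minimizing).

    Each course is built in one step via the Kronecker decomposition
    z_i = [ z_i : a_i (x) theta_{i-1} + (1 - a_i) (x) z_i ]."""
    theta = rng.uniform(lo, hi, size=k)              # theta_0
    best = styblinski_tang(theta)[0]
    for _ in range(r):
        z = rng.uniform(lo, hi, size=k)              # seed of the course
        a = rng.uniform(0.0, 1.0, size=c - 1)        # mixing weights
        course = np.concatenate(
            [z, np.kron(a, theta) + np.kron(1.0 - a, z)]
        ).reshape(c, k)                              # whole course at once
        vals = styblinski_tang(course)
        j = int(np.argmin(vals))
        if vals[j] < best:                           # compass update
            best, theta = vals[j], course[j]
    return theta, best

rng = np.random.default_rng(1)
theta, best = zigzag(rng)
```

By construction the best value is monotone along the run, mirroring the compass update rule of the master method.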
Making use of the binary operation ⊗, we are able to build in a single step the whole first
line of the iterative matrix Z. In the following we introduce an example. The example
should by no means be considered for optimization purposes, but merely as an
illustration which clarifies the conceptual framework introduced above.
Example 4.2. (Minimizing the Styblinski–Tang function)
We randomly generate the following matrices

α = [ α1 ]   [ 2/3   1/3 ]       [ −4  1 ]
    [ α2 ] = [ −1   −1/3 ] ; z = [  0  0 ] ; θ0 = [ −1  4 ].
    [ α3 ]   [ −2    −1  ]       [  2  0 ]
Using the binary operation ⊗, we build in a single step the whole first line of the iterative
matrix Z,

z1 = [ z1 ⋮ α1 ⊗ θ0 + (1^T_2 − α1) ⊗ z1 ]
   = [ z1 ⋮ (2/3)θ0 + (1/3)z1  (1/3)θ0 + (2/3)z1 ]
   = [ −4  1  −2  3  −3  2 ].
This yields

Tz1 = [ −15  −53  −58 ],

and so θ1 = [ −3  2 ]. Similarly, we build the second and third lines of the iterative matrix
Z,

z2 = [ z2 ⋮ α2 ⊗ θ1 + (1^T_2 − α2) ⊗ z2 ] = [ 0  0  3  −2  1  −2/3 ].

This yields

Tz2 = [ 0  −53  −10.12 ],
and so θ2 = [ 3  −2 ]. Finally, we have

z3 = [ z3 ⋮ α3 ⊗ θ2 + (1^T_2 − α3) ⊗ z3 ] = [ 2  0  0  4  1  2 ],

implying that

Tz3 = [ −19  10  −24 ],
and so θ3 = [ 1  2 ]. Thus, we have the following iterative matrix Z and corresponding
map-iterative matrix TZ:

Z = [ −4  1  −2   3  −3    2  ]        [ −15  −53    −58  ]
    [  0  0   3  −2   1  −2/3 ] ; TZ = [   0  −53  −10.12 ]
    [  2  0   0   4   1    2  ]        [ −19   10    −24  ]
4.4 Convergence of the Master Method

This section establishes the convergence of the general algorithm introduced above.
We start this journey with some preliminary considerations. First, note that as a
consequence of the compass update rule of the master algorithm, θi = C(θi−1, θ̃i), where
θ̃i denotes the best point of the i-th course, the sequence {Tn(θi)}i∈N is increasing. In
fact, we have that

Tn(θi) = (Tn ◦ C)(θi−1, θ̃i) ≥ Tn(θi−1).     (4.9)

This reasoning can be easily extended by induction, so that for every positive integer κ

Tn(θi+κ) ≥ Tn(θi).

This simple fact will play an important role in the establishment of the following trinity
of elementary results.
Proposition 4.11.
For every positive integer κ, we have that:

1. If θ̃i ∈ Oε,M , then θi+κ ∈ Oε,M ;
2. If θi ∈ Oε,M , then θi+κ ∈ Oε,M ;
3. {θκ ∈ Oᶜε,M} ⊆ {θ̃1, . . . , θ̃κ−1 ∈ Oᶜε,M} ∩ {θ1, . . . , θκ−1 ∈ Oᶜε,M}.
Proof.

1. We will just deal with the case where the essential supremum is finite, because the
case wherein τ = ∞ is similar. Given that the sequence {Tn(θi)}i∈N is increasing,
it holds that for every positive integer κ

Tn(θi+κ) ≥ Tn(θi) = Tn(C(θi−1, θ̃i)) ≥ Tn(θ̃i).     (4.10)

Further, since by assumption θ̃i ∈ Oε,M , it holds that

Tn(θ̃i) > τ − ε.     (4.11)

The final result now follows by combining inequalities (4.10) and (4.11).
2. We will only consider the case in which τ ∈ R, given that the case wherein τ = ∞
is similar. Since by assumption we have that θi ∈ Oε,M , it holds that

Tn(θi) > τ − ε.     (4.12)

The final result follows directly as a consequence of the sequence {Tn(θi)}i∈N being
increasing.

3. As a consequence of Claims 1 and 2, we have that for every positive integer κ

(θ̃κ−1 ∈ Oε,M ∨ θκ−1 ∈ Oε,M) ⇒ θκ ∈ Oε,M .     (4.13)

Applying the contrapositive law to (4.13) yields

{θκ ∈ Oᶜε,M} ⇒ {θ̃κ−1 ∈ Oᶜε,M and θκ−1 ∈ Oᶜε,M}
            ⇒ {θ̃1, . . . , θ̃κ−1 ∈ Oᶜε,M and θ1, . . . , θκ−1 ∈ Oᶜε,M}.

The last implication follows directly from Claims 1 and 2.
Claims 1 and 2 of the foregoing proposition translate the idea that if an iterate of the
algorithm falls in the optimal zone, then it remains there forever. Claim 3 will be
particularly useful in the proof of convergence of the general algorithm proposed above.
With the next theorem we start the study of the convergence of the master method.
Theorem 4.12. (Convergence of the pure master method—Part I)

1. Suppose that Tn is bounded from above. Further, suppose that the master method
is pure, and that the following condition holds:

∀B ∈ B(Θ), λ(B) > 0 ⇒ P[z1 ∈ B] > 0.     (4.14)

Then

P[θi ∈ Oᶜε,M ] = o(1).

2. Suppose that Tn is bounded from above. Then

Tn(θi) − T = o(1), a.s.,     (4.15)

where T is a random variable such that P[T = τ ] = 1.
Proof.

1. As a consequence of Proposition 4.11 it holds that

P[θi ∈ Oᶜε,M ] ≤ P[ ⋂_{1≤p≤i−1} {θ̃p ∈ Oᶜε,M} ∩ {θp ∈ Oᶜε,M} ] ≤ P[ ⋂_{1≤p≤i−1} {θ̃p ∈ Oᶜε,M} ].     (4.16)

Observe now that since by definition θ̃p = arg max_{q∈{1,...,c}} Tn(Zp,q), it holds that
{θ̃p ∈ Oᶜε,M} ⊆ {zp ∈ Oᶜε,M}, ∀p ∈ N. This latter observation combined with (4.16)
yields

P[θi ∈ Oᶜε,M ] ≤ P[ ⋂_{1≤p≤i−1} {zp ∈ Oᶜε,M} ] = P[z1 ∈ Oᶜε,M ]^{i−1}.

The final result now holds since by assumption P[z1 ∈ Oᶜε,M ] < 1.
2. Start by noting that {Tn(θi), Fi}i∈N is a submartingale, where Fi = σ(θ1, θ2, . . . , θi)
denotes the natural filtration; just observe that

E[Tn(θi) | Fi−1] = E[(Tn ◦ C)(θi−1, θ̃i) | Fi−1] ≥ E[Tn(θi−1) | Fi−1] = Tn(θi−1), a.s.

Given that this submartingale is bounded from above, it is a.s. convergent to a
random variable T.⁵ Observe now that as a consequence of the preceding claim it
holds that

P[Tn(θi) < τ ] = o(1),     (4.17)

given that ε is arbitrary. Further, Fatou’s lemma yields

P[T < τ ] = P[ lim inf_{i→∞} {Tn(θi) < τ} ] ≤ lim sup_{i→∞} P[Tn(θi) < τ ] = 0,

where the last equality follows by (4.17). Furthermore, as a consequence of Propo-
sition A.2 in the appendix, it holds that

Tn(x) ≤ τ , ∀x ∈ Θ.

⁵Since by assumption Tn is bounded from above, it holds that sup_i E[Tn(θi)] < ∞. Consequently,
Doob’s martingale convergence theorem can be applied, hence establishing the a.s. convergence to T.
In particular, this implies that for every positive integer i we have P[Tn(θi) > τ ] = 0.
Consequently it holds that

P[Tn(θi) > τ ] = o(1).     (4.18)

Therefore, again by Fatou’s lemma it holds that

P[T > τ ] = P[ lim inf_{i→∞} {Tn(θi) > τ} ] ≤ lim sup_{i→∞} P[Tn(θi) > τ ] = 0,

where the last equality holds as a consequence of (4.18).
The latter result warrants some general remarks. Roughly speaking, Claim 1 states that
the probability of the algorithm missing the optimality region approaches 0 as the number
of iterates increases. Further, Claim 2 ensures that the sequence {Tn(θi)}i∈N converges
a.s. to a random variable T which is indistinguishable from the essential supremum.
Observe that the proof of the second claim is entirely robust to both the pure stochastic
method and the adaptive master method. Hence, the second result also holds in what
concerns the adaptive master method. A question then arises: is the first claim of
the previous theorem also extendable to the adaptive master method? This issue lies
at the heart of the next theorem.
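The geometric bound in Claim 1 is easy to visualise numerically. The sketch below is an illustration of ours: it abstracts the optimal zone to an event of assumed per-seed probability p_hit (which is all the geometric bound uses), and compares the Monte Carlo miss frequency for i.i.d. seeds with (1 − p_hit)^(i−1):

```python
import numpy as np

def miss_probability(i, p_hit, n_trials, rng):
    """Monte Carlo frequency of the event 'none of the first i-1 seeds
    hits the optimal zone', the zone being abstracted to an event of
    probability p_hit per seed."""
    hits = rng.random((n_trials, i - 1)) < p_hit
    return float(np.mean(~hits.any(axis=1)))

rng = np.random.default_rng(0)
est = miss_probability(i=20, p_hit=0.1, n_trials=200_000, rng=rng)
bound = (1.0 - 0.1) ** (20 - 1)     # geometric bound, about 0.135
```

For independent uniform seeds the bound holds with equality, so the estimate should track the bound up to Monte Carlo error.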
Theorem 4.13. (Convergence of the adaptive master method: Part I)
Suppose that Tn is bounded from above. Further, suppose that the master method is
adaptive, and that the following condition holds:

inf_{1≤p≤i−1} P[zp ∈ Oᶜε,M ] = o(1).     (4.19)

Then

P[θi ∈ Oᶜε,M ] = o(1).
Proof. Our line of attack is similar to the previous proof. Just note that by a similar
reasoning, it holds that

P[θi ∈ Oᶜε,M ] ≤ P[ ⋂_{1≤p≤i−1} {zp ∈ Oᶜε,M} ] ≤ inf_{1≤p≤i−1} P[zp ∈ Oᶜε,M ],

from where the final result follows directly.
It is important to underscore that the hypothesis considered here in order to establish
the convergence of the adaptive stochastic method is known in the literature. Condition
(4.19) is tantamount to the one adopted by Esquível [17]. Note, however, that whereas
Esquível used this condition as a means to establish the convergence of the adaptive
random search, here it is used in the more general context of the adaptive master method.
In the sequel, we evaluate how far we can reach assuming that the identification of the true
parameter holds under the Weierstrass framework.⁶
Theorem 4.14. (Convergence of the pure master method: Part II)
Suppose that Tn is bounded from above. Further, suppose that the master method is pure,
and that the following condition holds:

∀B ∈ B(Θ), λ(B) > 0 ⇒ P[z1 ∈ B] > 0.     (4.20)

Further, suppose that Tn(θ) is continuous and that θn = arg max_{θ∈Θ} Tn(θ). Then, it holds
that

Tn(θi) − Tn(θn) = o(1), a.s.     (4.21)

If furthermore Θ ⊂ Rk is compact, then

θi − θn = o(1), a.s.     (4.22)
Proof.
The proof is split into two claims. The first claim establishes that Tn(θi) − Tn(θn) =
o(1), a.s. The second claim shows that θi − θn = o(1), a.s.

1. Let us first show that the sequence {Tn(θi)}i∈N converges in probability to Tn(θn).
Consider ε > 0. Start by noting that

P[|Tn(θi) − Tn(θn)| ≥ ε] = P[{Tn(θi) ≤ Tn(θn) − ε} ∪ {Tn(θi) ≥ Tn(θn) + ε}].     (4.23)

Observe now that by Theorem 4.5 the essential supremum and the maximum
coincide. Hence, by definition of essential supremum it holds that
P[Tn(θi) ≥ Tn(θn) + ε] = 0. This implies that (4.23) can be rewritten as

P[|Tn(θi) − Tn(θn)| ≥ ε] = P[Tn(θi) ≤ Tn(θn) − ε] = P[θi ∈ Oᶜε,M ].

Now, observe that as a consequence of Proposition 4.11, it holds that

P[θi ∈ Oᶜε,M ] ≤ P[θ̃1, . . . , θ̃i−1 ∈ Oᶜε,M ] ≤ P[z1, . . . , zi−1 ∈ Oᶜε,M ] = (P[z1 ∈ Oᶜε,M ])^{i−1}.
⁶These are, respectively, Assumptions 3 and 1 from the previous chapter.
Given that by assumption P[z1 ∈ Oᶜε,M ] < 1, the last inequality establishes that
Tn(θi) − Tn(θn) = op(1). The remaining part of the proof follows by a standard
argument, given that the sequence {Tn(θi)}i∈N is increasing. This implies that the
sequence of events Ei,ε = {|Tn(θi) − Tn(θn)| ≤ ε} is expanding, i.e., it is such that
Ei,ε ⊆ Ei+1,ε, for every i ∈ N and ε > 0. Consequently, by a standard argument,⁷
convergence in probability implies that

P[ lim_{i→∞} |Tn(θi) − Tn(θn)| ≤ ε ] = 1, ∀ε > 0.     (4.24)

Given that ε is arbitrary, we get that

P[ lim_{i→∞} |Tn(θi) − Tn(θn)| = 0 ] = 1,

from where the final result follows.
2. Let us now suppose that Θ is compact, and suppose by contradiction that (4.22)
does not hold. Then, for every ω on a set Ω of positive probability,

∃ε > 0 ∀p ∈ N ∃i > p : |θi(ω) − θn| > ε.     (4.25)

Now, for all ω ∈ Ω the sequence {θi(ω)}i∈N is a sequence of points in the compact
set Θ, and by the Bolzano–Weierstrass theorem there is a convergent subsequence
{θiκ(ω)}κ∈N of {θi(ω)}i∈N. This subsequence must converge to θn, because if
the limit were some θa then, by the continuity of Tn, the sequence
{Tn(θiκ(ω))}κ∈N would converge to Tn(θa) = Tn(θn). Now, as θn is the unique
maximizer of Tn in Θ, we certainly have θa = θn. Finally, observe that the subsequence
{θiκ(ω)}κ∈N also verifies the condition expressed in (4.25) for κ large enough, which
yields the desired contradiction.
A similar result can be established if the master method is adaptive. Again, the proviso
has to be suitably accommodated, making use of Esquível’s [17] hypothesis.

⁷Recall that when a sequence of events Ei is either expanding or contracting it holds that
lim_{i→∞} P[Ei] = P[lim_{i→∞} Ei]. Thus, under such circumstances one can interchange the limit
with the measure. See e.g. Proposition 1.1.1 in Ross [45].
Theorem 4.15. (Convergence of the adaptive master method: Part II)
Suppose that Tn is bounded from above. Further, suppose that the master method is
adaptive, and that the following condition holds:

inf_{1≤p≤i−1} P[zp ∈ Oᶜε,M ] = o(1).     (4.26)

Further, suppose that Tn(θ) is continuous and that θn = arg max_{θ∈Θ} Tn(θ). Then, it holds
that, as i → ∞,

Tn(θi) − Tn(θn) = o(1), a.s.

If furthermore Θ ⊂ Rk is compact, then, as i → ∞,

θi − θn = o(1), a.s.
Proof. By a similar reasoning to the proof of Theorem 4.13, we get that

P[θi ∈ Oᶜε,M ] ≤ P[θ̃1, . . . , θ̃i−1 ∈ Oᶜε,M ]
             ≤ P[z1, . . . , zi−1 ∈ Oᶜε,M ]
             ≤ inf_{1≤p≤i−1} P[zp ∈ Oᶜε,M ].

This establishes that Tn(θi) − Tn(θn) = op(1). The a.s. convergence can now be achieved
by the same argument used in the proof of Theorem 4.14, and the remaining part of the
proof is the same as above.

In the next section we provide a brief note regarding the construction of confidence
intervals for the extremum of a function. For this purpose the seeds will play an
important role.
4.5 A Note on the Construction of Confidence Intervals

This section is devoted to the construction of confidence intervals for the maximum of
a function, through the use of the image of the first column of the iterative matrix Z.
In fact, as we shall see below, if the master method is pure, and the seeds are uniformly
distributed over Θ, then it is possible to make use of a result on extreme value theory
due to de Haan [14]. In the sequel, let Tz(1) ≤ Tz(2) ≤ · · · ≤ Tz(r) denote the order
statistics of the sequence of the images of the seeds, where r denotes a finite (possibly
degenerate) stopping time.
Theorem 4.16. (Confidence Intervals for the Maximum—de Haan [14])
Suppose that the identification of the true parameter θo holds, i.e., θo = arg max_{θ∈Θ} T(θ).
Consider a sequence of independent and identically distributed zi with uniform distri-
bution over Θ. Further, consider the auxiliary correspondence Ξ : N × [0; 1] ⇒ R defined
as follows,

Ξ(i, p) = ] Tz(i) ; Tz(i) + (Tz(i) − Tz(i−1)) / ((1 − p)^{−2/k} − 1) [.     (4.27)

The following large sample result holds:

P[T(θo) ∈ Ξ(i, p)] − (1 − p) = o(1), i → ∞.

Proof. See de Haan [14], pp. 467–469.
Proof. See de Haan [14], pp. 467–469.
Remark 4.17. It is worth emphasizing that the proof of this result relies on the applica-
tion of asymptotic results from extreme value theory (Galambos [20]). Given that the
proof of the theorem is relegated to a reference, it is important to underscore that there
are some small typos in the original paper of de Haan [14] that can generate confusion
and misleading conclusions. Making use of de Haan’s [14] notation, we call attention to
the following points.

• In line 21 of p. 467, it is written an α−1. Instead it should be written an α.

• The formula for constructing the confidence intervals is given in the last line of
page 467. In lieu of such formula it should be written

] Y1 − (Y2 − Y1) / ((1 − p)^{−1/α} − 1) ; Y1 [.     (4.28)

Transliterated into our notation, Y1 and Y2 respectively mean Tz(1) and Tz(2),
and α denotes k/2. Note that formula (4.28) is for the minimum of
a function, whereas in Theorem 4.16 it is adapted for the maximum (cf. formula
(4.29) stated below).
It should also be pointed out that hypothesis tests based on this method have been
developed. Veall [53] developed a statistical procedure suited for testing whether a solution
achieved is a global maximum. Hence, in the same spirit, if the method is pure and the
seeds are uniformly distributed, Veall’s test can also be implemented here, making use
of the first column of the iterative matrix Z.
It is worth noting that the method is extremely easy to apply, making use of the following
inputs: two order statistics (Tz(r), Tz(r−1)), the level of significance (p), and the dimen-
sion of the optimization problem at hand (k). If we intend to construct a confidence interval
for the minimum of a function, then the following set-valued correspondence should be
used:

Ψ(i, p) = ] Tz(1) − (Tz(2) − Tz(1)) / ((1 − p)^{−2/k} − 1) ; Tz(1) [.     (4.29)
In the appendix we report some computational experience with de Haan’s method. The
remainder of this section provides brief guidelines on the construction of the corresponding
tables. Monte Carlo simulations were considered for several (degenerate) stopping times,
r = 10,000, 20,000, 100,000 and 500,000. Given that we ran several Monte Carlo simulations,
as a means to distinguish the several order statistics, let Tz(i),j denote the i-th order
statistic from the j-th trial. Further, define the set-valued function Ψ : N × [0; 1] ⇒ R,

Ψ(r, p) = ] r⁻¹ Σ_{j=1}^{r} ( Tz(1),j − (Tz(2),j − Tz(1),j) / ((1 − p)^{−2/k} − 1) ) ; r⁻¹ Σ_{j=1}^{r} Tz(1),j [.     (4.30)
Moreover, when reporting the computational experience with de Haan’s method, we
make use of the following notation:

σ²LB(p) = (r − 1)⁻¹ Σ_{j=1}^{r} ( Tz(1),j − (Tz(2),j − Tz(1),j)/((1 − p)^{−2/k} − 1)
          − r⁻¹ Σ_{j=1}^{r} ( Tz(1),j − (Tz(2),j − Tz(1),j)/((1 − p)^{−2/k} − 1) ) )²,

σ²UB = (r − 1)⁻¹ Σ_{j=1}^{r} ( Tz(1),j − r⁻¹ Σ_{j=1}^{r} Tz(1),j )².     (4.31)

That is, σ²LB(p) and σ²UB denote the sample variances, across trials, of the lower and
upper interval endpoints.
The number of trials considered was 1,000. It is important to underscore that the test
functions used are classical in the literature on testing the performance of optimization
algorithms.
Chapter 5
Estimation in the Mixed Model
via Stochastic Optimization
5.1 Introduction

Maximum likelihood is one of the main standard techniques for yielding parameter
estimates of a statistical model of particular interest. Large sample results for this
M-estimation methodology were established long ago in the literature (Wald [57]). Despite
its attractive features, there are circumstances under which the application of such an
estimator becomes prohibitive. In fact, in a plurality of cases of practical interest, the
estimator is not analytically tractable. In this chapter we are interested in a particular
case where such an occurrence takes place, namely in the MLE for normal linear mixed
models—a model which was briefly discussed in Chapter 3 (see Example 3.2).
A possible approach to overcome this general lack of a closed-form analytic solution
is given through the application of well-suited global optimization methods. Before
carrying out the optimization, it is sometimes convenient to inspect whether it is possible to
simplify the problem at hand. For instance, as noted by Carvalho et al. [11], if the
model has a commutative orthogonal block structure, then a closed-form solution for the MLE
in normal linear mixed models can be found. Beyond these very special instances, there is
no hope of achieving an explicit form for the solution of this maximum likelihood problem.
Notwithstanding, in this chapter we show that in a linear mixed model, the maximum
likelihood problem can be rewritten as a much simpler optimization problem (henceforth
the simplified problem) where the search domain is a compact set whose dimension depends
only on the number of variance components. The original maximum likelihood problem is
thus reduced to a simplified problem which presents at least two main advantages: the
number of variables in the simplified problem is considerably lower; the domain of search
of the simplified problem is a compact set. Whereas the former advantage avoids the
so-called ‘curse of dimensionality’, the latter permits the use of simple stochastic search
optimization methods.¹ As can be readily noticed, from the estimation standpoint, these
features are extremely advantageous. This simplified problem will allow us to obtain the
estimates of the variance components with large computational savings. Furthermore,
given that the domain of search of the simplified problem is a compact set, we can use
simple random search methods—which are a particular instance of the master method
presented in the previous chapter. Other variants of the master method introduced
above can also be used to solve the problem of interest.
This chapter is organized as follows. In the next section we introduce the model of
interest. In §5.3 we introduce a main result which yields the simplified problem, and to
assess the performance of our approach we conduct a Monte Carlo simulation study and
report the results in §5.4.
5.2 Model

In this section, we present the model of interest. We bring to mind that some points
regarding the mixed linear model were introduced in Chapter 3 (recall Example 3.2).
Roughly speaking, one can think of the mixed linear model as an extension of the
simple linear model (3.2), designed to account for more than one source of error. The
model takes the following form:

y = Xβo + Σ_{i=1}^{w−1} Xi ζi + ε,     (5.1)

where (y, X, βo, ε) are defined as in (3.2), the Xi are design matrices of size n × ki, and
the ζi are ki-vectors of unobserved random effects. Following classical assumptions,
we take the random effects ζi to be independent and normally distributed with null
mean vectors and covariance matrices σ²oi Iki , for i = 1, . . . , w − 1. Further, we also take
ε to be normally distributed with null mean vector and covariance matrix σ²ow In,
independently of the ζi, for i = 1, . . . , w − 1.
A general overview of topics related to the estimation of, and inference in, these models
can be respectively found in Searle et al. [47] and Khuri et al. [29].
The model has the following mean vector and covariance matrix:

E[y | X] = Xβo,     (5.2)

¹This expression was introduced by the prominent mathematician Richard Bellman. A detailed
analysis of this issue is far beyond the scope of this thesis. See Pakes and McGuire [40] for a discussion.

Σσ²o ≡ V[y | X] = Σ_{i=1}^{w−1} σ²oi Xi Xiᵀ + σ²ow In,

where σ²o ≡ [ σ²o1 · · · σ²ow ]. Given the current framework, we have that

y | X ∼ N( Xβo ; Σ_{i=1}^{w−1} σ²oi Xi Xiᵀ + σ²ow In ),

and thus the density for the model is given by

fy|X(y) = exp( −(1/2) (y − Xβo)ᵀ Σ⁻¹σ²o (y − Xβo) ) / √( (2π)ⁿ det(Σσ²o) ).
Now, let θ ≡ [ βᵀ σ² ]. The estimator objective function assigned to the maximum
likelihood estimator is given by the loglikelihood of the aforementioned mixed linear
model, i.e.:

Tn(θ) = Tn([ βᵀ σ² ])
      = −(n/2) ln(2π) − (1/2) ln( det(Σσ²) ) − (1/2) (y − Xβ)ᵀ Σ⁻¹σ² (y − Xβ).

The maximum likelihood estimators of the true regression parameter and the model
variance components, respectively denoted by βML and σ²ML, are thus given by

[ βᵀML σ²ML ] = arg max_{[ βᵀ σ² ]} Tn([ βᵀ σ² ]) = arg max_{θ∈Θ} Tn(θ),     (5.3)

where the parameter set Θ is a subset of R^{k+w} which restricts the elements
θi to be nonnegative, for i = k + 1, . . . , k + w, i.e.

Θ ≡ { θ ∈ R^{k+w} : θi ≥ 0, i = k + 1, . . . , k + w }.
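To fix ideas, Tn(θ) is straightforward to evaluate numerically. The sketch below is ours; the one-way layout (three groups of two observations, intercept-only mean) and the parameter values are illustrative assumptions, not part of the model specification above.

```python
import numpy as np

def loglik(beta, s2, y, X, Zs):
    """Mixed-model loglikelihood T_n(theta); s2 = [s2_1, ..., s2_w] are
    the variance components, Zs = [X_1, ..., X_{w-1}] the random-effect
    design matrices, and s2_w multiplies the identity."""
    n = len(y)
    Sigma = s2[-1] * np.eye(n)
    for s, Z in zip(s2[:-1], Zs):
        Sigma += s * (Z @ Z.T)
    resid = y - X @ beta
    logdet = np.linalg.slogdet(Sigma)[1]
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet
                   + resid @ np.linalg.solve(Sigma, resid))

# toy one-way layout: 3 groups of 2 observations, intercept-only mean
rng = np.random.default_rng(0)
Z1 = np.kron(np.eye(3), np.ones((2, 1)))       # group membership
X = np.ones((6, 1))
y = X @ np.array([1.0]) + Z1 @ rng.normal(size=3) + 0.5 * rng.normal(size=6)
ll = loglik(np.array([1.0]), np.array([1.0, 0.25]), y, X, [Z1])
```

The point of the chapter is precisely that maximizing this function directly over (β, σ²) is costly, which motivates the dimension reduction of §5.3.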
In the following section, we consider a special case wherein it is possible to obtain a
closed-form solution for the ML problem. It should be emphasized that cases like the
one presented below represent the exception, rather than the rule. Notwithstanding, this
instance is introduced here as a benchmark case.
5.2.1 The Benchmark Case

In this subsection, we consider a particular case wherein it is possible to get a closed-
form solution for the problem of interest. As we shall see below, this is gained at the
cost of the introduction of some structure in the covariance matrix Σσ²o. Specifically,
we consider the case wherein the covariance matrix can be decomposed as

Σσ²o = Σ_{j=1}^{w} ηj Qj.

If the Qj are orthogonal projection matrices such that Qj Qj′ = 0 for j ≠ j′, and if T,
the orthogonal projection matrix on the range space of X, is such that

T Qj = Qj T, j = 1, . . . , w,

the model is said to have commutative orthogonal block structure. Here, η = [η1 · · · ηw]ᵀ
is the vector of the so-called canonical variance components, and it is determined by the
equation Bη = σ²o, where B is a known nonsingular matrix. In this case, we can rewrite
the density of the model as

fy|X(y) = exp( −(1/2) (y − Xβo)ᵀ ( Σ_{j=1}^{w} ηj⁻¹ Qj ) (y − Xβo) ) / √( (2π)ⁿ Π_{j=1}^{w} ηj^{gj} ),     (5.4)

where gj is the rank of the matrix Qj.

In this particular instance, it can be shown (see Carvalho et al. [11]) that the following
estimators solve the optimization problem (5.3):

β̂ = (XᵀX)⁻¹Xᵀy,     η̂j = yᵀ(I − T)Qj(I − T)y / gj.

When the above-mentioned conditions on the covariance matrix and X do not hold, a
closed-form analytical expression for the MLE is not typically obtainable.
Before we move on to the application of stochastic search methods to the optimization
problem (5.3), we provide a result which reduces the number of variables over which
we perform the optimization.
5.3 The Dimension Reduction Technique

The next result establishes that in a mixed linear model, the maximum likelihood prob-
lem can be rewritten as a simplified problem where the search domain is a compact
set whose dimension depends exclusively on the number of variance components. This
result will prove useful for computing maximum likelihood estimates of the variance
components at a much lower computational effort.
Theorem 5.1. Consider the aforementioned mixed model,

y = Xβo + Σ_{i=1}^{w−1} Xi ζi + ε.

The maximum likelihood estimators of the true regression parameter and the model vari-
ance components, respectively denoted by βML and σ²ML, given by

[ βᵀML σ²ML ] = arg max_{[ βᵀ σ² ]} ℓn([ βᵀ σ² ] | y),     (5.5)

can be alternatively achieved by solving the following optimization problem:

min_{γ ∈ [0; π/2]^{w−1}} (fn ◦ p)(γ),     (5.6)

where

fn(α) = ln( A(α)ⁿ det(Σα) ),     (5.7)

A(α) = yᵀ ( I − X(XᵀΣα⁻¹X)⁻¹XᵀΣα⁻¹ )ᵀ Σα⁻¹ ( I − X(XᵀΣα⁻¹X)⁻¹XᵀΣα⁻¹ ) y,     (5.8)

p(γ) = q1 Π_{j=1}^{w−1} cos(γj) + Σ_{l=2}^{w−1} ql sin(γl−1) Π_{j=l}^{w−1} cos(γj) + qw sin(γw−1),     (5.9)

and {qi}_{i=1}^{w} denotes the canonical basis of Rʷ.
From the inspection of Theorem 5.1, we can ascertain at least two major advantages of
the simplified problem relative to the original problem, namely: whereas the original
maximum likelihood problem has dimension w + k, the simplified equivalent problem only
has dimension w − 1; additionally, the search domain of the simplified problem is a compact
set—contrary to what is verified in the original problem. This latter advantage
permits the use of simple random search methods, adjusting a multivariate uniform
distribution over the new search domain [0; π/2]^{w−1}. The proof is given below.
Proof. (Dimension reduction technique)
Consider the loglikelihood of the aforementioned mixed linear model,

ℓn([ βᵀ σ² ] | y) = −(n/2) ln(2π) − (1/2) ln(det(Σσ²)) − (1/2) (y − Xβ)ᵀ Σ⁻¹σ² (y − Xβ).

Observe that maximizing ℓn([ βᵀ σ² ] | y) is equivalent to minimizing

ℓ*([ βᵀ σ² ] | y) = ln(det(Σσ²)) + (y − Xβ)ᵀ Σ⁻¹σ² (y − Xβ).     (5.10)

Now define σ² = cα, with c > 0 and ‖α‖ = 1. Making use of the first order conditions
of the ML problem we get that

β̂ = (XᵀΣ⁻¹σ² X)⁻¹ XᵀΣ⁻¹σ² y.

Hence, we can rewrite (5.10), evaluated at β̂, as

ℓ* = n ln(c) + ln(det(Σα)) + c⁻¹A(α),

where A is defined in (5.8). Now, observe that

∂ℓ*/∂c = n c⁻¹ − c⁻²A(α) = 0 ⇔ c = A(α)/n,

and

∂²ℓ*/∂c² = 2c⁻³A(α) − n c⁻²,

so that

∂²ℓ*/∂c² |_{c = A(α)/n} = n³/A(α)² > 0,

whence c = A(α)/n is in fact an absolute minimum. Hence (5.10), up to an additive
constant, simplifies into n ln(A(α)) + ln(det(Σα)), which we define as fn(α) (see above
in (5.7)). Next, we transform α through the polar coordinate transformation p(γ)
(see e.g. Kendall [27]) defined in (5.9). This entails writing the w components of α
through w − 1 components in γ, as follows:

α1 = cos(γ1) cos(γ2) · · · cos(γw−1),
α2 = sin(γ1) cos(γ2) · · · cos(γw−1),
⋮
αw = sin(γw−1).     (5.11)
5.4 A Stochastic Optimization Study of the Dimension Reduction Technique

In this Monte Carlo simulation study, we considered three one-way random models. The
first model is unbalanced, with a total of 72 observations and 8 groups. The distribution
of the observations across groups can be described through the following vector,

[3 6 7 8 9 10 11 18],

whose i-th component denotes the number of elements considered in the i-th group.
Several possible true values of the variance components were considered. We then con-
ducted a Monte Carlo simulation from which we report the average of the several
results achieved. In every run of the simulation, the optimization problem was solved
using the dimension reduction technique introduced in the previous section, and the
pure random search method—a particular instance of the master method introduced in
the previous chapter.
Variance component    0.0    0.1    0.5    0.7    1.0    1.5    2.0    5.0   10.0
Estimate            0.016  0.073  0.425  0.609  0.869  1.311  1.759  4.657  9.421

Table 5.1: Estimates of the variance components in Model I.
We now provide some guidelines regarding the interpretation of Table 5.1. In the first
line we present the true values of the variance components σ²o. The second line contains
the solution provided by the recurrent application of pure random search methods to
the optimization problem (5.3). Thus, for instance, when the “true” variance component
was 0.5, the application of stochastic optimization methods and the dimension reduction
technique presented above yielded 0.425. Further, observe that, except when the true
value of the variance component is null, the true values always dominate the estimated
values.
Next, we considered a quasi-balanced model with 66 observations. The distribution of the
observations was now the following:

[6 6 6 6 6 6 6 6 6 5 7].

A Monte Carlo simulation was once more conducted. No changes were made regard-
ing the true values of the variance components considered. The same applies in what
concerns the methods used to perform the optimization step. The results produced are
reported in Table 5.2.
Variance component    0.0    0.1    0.5    0.7    1.0    1.5    2.0    5.0   10.0
Estimate            0.023  0.086  0.448  0.633  0.893  1.344  1.850  4.611  9.091

Table 5.2: Estimates of the variance components in Model II.
Hence, when the true value of the variance component was 0.5, its estimate obtained
by maximum likelihood methods was 0.448. It should be emphasized that, again, with
the exception of the case wherein the true value of the variance component was 0, in all
the remaining cases the true value of the variance component was above its estimate.
A final model was then considered. The number of observations considered was again 72.
Observations were now grouped as follows:

[2 2 3 3 4 4 15 15 24].

The results are summarized in Table 5.3.
Variance component    0.0    0.1    0.5    0.7    1.0    1.5    2.0    5.0   10.0
Estimate            0.012  0.074  0.426  0.600  0.852  1.364  1.780  4.711  9.929

Table 5.3: Estimates of the variance components in Model III.
As can be readily noted from the inspection of Tables 5.1, 5.2 and 5.3, there is a slight
bias present in the estimates produced through maximum likelihood. This is in accord
with what is known in the literature, and there exist some methods, such as Restricted
Maximum Likelihood (REML), which can be used to compensate for such bias (see
Harville [23] and Searle et al. [47]).
Chapter 6
Summary and Conclusions
6.1 Closure
The instigation which drove us through the research culminating in this thesis
has now reached a final stage. It is now time for a reflection regarding the work
developed thus far. In this spirit, we summarize below the inquiry carried on over this
thesis and provide some concluding remarks.
As discussed above, this thesis was written as a fugue, interweaving the themes of ex-
tremum estimators and stochastic optimization. In what concerns extremum estimators,
we started by discussing some of the provisos under which the consistency and asymp-
totic normality of such estimators can be ensured. During this part of the thesis, we
presented some cornerstone results which establish strong consistency and asymptotic
normality under a fairly mild set of assumptions. Given that any estimator which
can be formulated through an optimization problem is tantamount to an extremum es-
timator, the broadness of such results is in effect astonishing. The melody played in
this regard was strongly inspired by the seminal works of Newey and McFadden [35] and
Andrews [3]. After the analysis of the large sample results of extremum estimators, we
started considering their computational features. In this regard, we emphasized the need
to carefully choose which numerical procedure to use when carrying out the optimization.
Several hindrances may arise if one does not pay specific attention to this point. In
fact, given that the global solution is the only one which inherits the noteworthy asymptotic
features, one should avoid methods which may eventually converge to a local solution.
Some standard numerical procedures were then introduced as a means to start giving
a voice to stochastic optimization algorithms. The presentation of some deterministic
methods worked as a bridge linking Chapters 2 and 3, allowing us to lay the groundwork
for the study of stochastic optimization algorithms.
Extremum estimators are then counterpointed with the introduction of the master method, a general algorithm which comprises several other stochastic optimization algorithms. The generality of the master method is in fact considerable, including, for instance, the conceptual algorithm of Solis and Wets [49] as a particular case. Another specific embodiment of the master method is provided by the stochastic zigzag method, an optimization algorithm based on the prominent work of Mexia et al. [32]. During this phase of the thesis, we also offer a simple matrix formulation of the algorithm. The matrix formulation not only brings new insights into the general method, but can also diminish the burden of implementation. In fact, we achieve a result, the Kronecker–zigzag decomposition (Theorem 4.8), which allows us to easily obtain an entire course. Hence, this result entails inherent simplifications at the implementation level. We then move to the analysis of the large sample behavior of the method. The stochastic convergence of the master method is here achieved under a fairly mild proviso. It is important to underscore that we relied on assumptions which are identical to the ones adopted by Solis and Wets [49] and by Esquível [17]. We also discuss how to make use of the master method in order to construct confidence intervals for the maximum, through an asymptotic result on extreme value theory due to de Haan [14].
A dimension reduction technique was also developed in order to achieve, with large computational savings, maximum likelihood estimates of the mixed model parameters and the variance components. The original maximum likelihood problem was reduced to a simplified problem which presents at least two main advantages: the number of variables in the simplified problem is considerably lower, and the domain of search of the simplified problem is a compact set. Whereas the former advantage avoids the so-called 'curse of dimensionality', the latter permits the use of the instances of the master method developed here.
The next section presents some topics on our research agenda.
6.2 Open Problems
In some of the preceding parts of the thesis, we ascertained new claims which have contributed to the state of the art. In this part of the thesis we put the emphasis on the new questions which came hand in hand with such claims. Below we trace a roadmap of future research directions. It is important to underscore that it is not our intention to elaborate here on these ideas; the exposition below is thus necessarily perfunctory.
• It would be interesting to extend de Haan's [14] method in order to make use of all the information contained in the iterative matrix Zr×c. The work of Dorea [15] can be a natural starting point to address such an issue. In fact, the proposal made in §4.5 relies directly on de Haan's result, and hence it only makes use of the first column of the iterative matrix. If the number of seeds is sufficiently large, so that some asymptotic results from extreme value theory can be invoked (see e.g. Galambos [20], pp. 111–119), only meager accuracy gains can be expected from using any further information. Notwithstanding, serious hindrances may arise if the number of seeds r is small. In fact, beyond being a waste of resources, it can cause some embarrassment in some cases. Our experience with the method leads us to believe that if the number of seeds is too small, then the optimal value yielded by the master method may lie outside the confidence interval built by the master method. It should be emphasized that such pathological cases only occur when we consider a small number of seeds. Specifically, this tends to occur more frequently when, besides r being small, the ratio r/(c − 1) is also small.
• It remains unanswered what the rate of convergence of the master method is in general, and of the stochastic zigzag method in particular. In this regard, a natural point of departure would be to reconsider and try to extend some results of Pflug [41] (see pg. 24) which are related to the rate of convergence of the pure random search method. The precursory analysis of Solis and Wets [49] concerning the rate of convergence of the conceptual algorithm can also be taken into account in the analysis of such a matter. It is our opinion that this issue should be carefully addressed in future inquiries. In fact, cognizance of the rate of convergence is pivotal in applications. Given the generality of the master method, it would also be important to point out which factors have an influence on the rate of convergence. For instance, can c have an effect on the rate of convergence of the master method? If so, which values of c can lead to a higher rate of convergence?
• Is it possible to establish the convergence of an extended master method which, instead of considering c as fixed, allows c to be defined through some stopping criteria of interest? For the sake of illustration consider the following sequence
z1, Z_{1,1}, . . . , Z_{1,τ1}, Z_{2,1}, . . . , Z_{2,τ2}, Z_{3,1}, . . . , Z_{3,τ3}, . . . (6.1)
where τ1, τ2, τ3, . . . denote finite stopping times. Several questions now arise. How robust is the framework developed so far to such an extended method? In what concerns the matrix formulation, we note that unless we consider the degenerate case wherein τ1 = τ2 = τ3 = · · · = c, it cannot be applied. It can, however, be adopted as a benchmark framework. In what regards the large sample behavior of the new method, we conjecture that, with the due adaptations, it may be possible to characterize its large sample behavior along the lines of what was done in §4.4. However, further inspection should be carried out before any conclusion can be drawn.
• It remains to inspect whether a dimension reduction technique of the type developed in the previous chapter is robust to other M-estimation methods as well.
• Last, we intend to consider the development of the logical grounds for the almost universal quantifier.
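To make the rate-of-convergence question above concrete, the following sketch estimates the empirical rate of pure random search on a smooth two-dimensional objective. The N^(−2/d) rate it checks is the classical one for nondegenerate smooth minima and is stated here as a background assumption, not as a result of this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pure random search on the 2-D sphere function f(x) = ||x||^2 over [-1, 1]^2:
# keep the best of N uniform draws. For a nondegenerate smooth minimum the
# expected optimality gap decays like N**(-2/d); here d = 2, so like 1/N.
d, reps = 2, 200
Ns = [100, 1_000, 10_000]
gaps = []
for N in Ns:
    best = [np.min(np.sum(rng.uniform(-1.0, 1.0, size=(N, d)) ** 2, axis=1))
            for _ in range(reps)]
    gaps.append(np.mean(best))   # average gap to the true minimum 0

slope = np.polyfit(np.log(Ns), np.log(gaps), 1)[0]   # log-log slope, about -2/d
```

A similar experiment, with the sphere replaced by an instance of the master method, could serve as a first empirical probe of the open question.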
Appendix A
A Short Note Regarding the Essential Supremum
This appendix includes some general results concerning the essential supremum. All the results which we state and prove here are valid for measurable functions. Additionally, these results also have an essential infimum counterpart. The exposition made below is largely based on the textbook of Capinski and Kopp [9]. Here we make use of the almost universal quantifier which was already introduced above (see §4.3, Definition 4.1).
We start by showing that if the essential supremum of a measurable function equals
−∞, then the function attains the essential supremum a.e.
Proposition A.1.
Suppose that f : Θ −→ R is a measurable function such that
ess sup_{x∈Θ} f(x) = −∞.
Then
f(θ) = −∞, for almost every θ ∈ Θ.
Proof. Just observe that, by the definition of the essential supremum and by assumption, we must have that
f(x) ≤ −n, for almost every x ∈ Θ, for every n ∈ N.
Since a countable union of null sets is null, it follows that f(θ) = −∞ for almost every θ ∈ Θ.
We now show that a measurable function cannot take values above its essential supremum, except on a set of null measure.
Proposition A.2.
Suppose that f : Θ −→ R is a measurable function. Then it holds that
f(θ) ≤ ess sup_{x∈Θ} f(x), for almost every θ ∈ Θ. (A.1)
Proof. Start by defining the following sequence of sets
A_n = {θ ∈ Θ : ess sup_{x∈Θ} f(x) < f(θ) − 1/n}, n ∈ N,
each of which has null measure, by the definition of the essential supremum. It can be easily shown that (A_n) is expansive, i.e., A_n ⊂ A_{n+1}. Hence
A := ⋃_{n=1}^{∞} A_n = {θ ∈ Θ : ess sup_{x∈Θ} f(x) < f(θ)}.
As a consequence of Boole's inequality it holds that
λ(A) = λ(⋃_{n=1}^{∞} A_n) ≤ Σ_{n=1}^{∞} λ(A_n) = 0,
from where the final result follows.
Observe that if we assume continuity of f, a stronger form of (A.1) can be established, wherein the almost universal quantifier is replaced by the universal quantifier. In fact, the same reasoning used in the proof of Theorem 4.5 can be used to establish that, if f is continuous, then
f(θ) ≤ ess sup_{x∈Θ} f(x), for all θ ∈ Θ.
Next we show how the essential supremum of a sum relates to the sum of the essential suprema.
Proposition A.3.
Suppose that f1 and f2 are measurable functions. Then
ess sup_{x∈Θ} (f1(x) + f2(x)) ≤ ess sup_{x∈Θ} f1(x) + ess sup_{x∈Θ} f2(x).
Proof. By Proposition A.2, we have that
f1(θ) + f2(θ) ≤ ess sup_{x∈Θ} f1(x) + ess sup_{x∈Θ} f2(x), for almost every θ ∈ Θ.
Consequently
ess sup_{x∈Θ} f1(x) + ess sup_{x∈Θ} f2(x) ∈ {t : f1(θ) + f2(θ) ≤ t, for almost every θ ∈ Θ},
which implies that
ess sup_{x∈Θ} f1(x) + ess sup_{x∈Θ} f2(x) ≥ inf{t : f1(θ) + f2(θ) ≤ t, for almost every θ ∈ Θ} = ess sup_{x∈Θ} (f1(x) + f2(x)).
Last, but not least, we show that the essential supremum is at most equal to the supremum.
Proposition A.4.
Suppose that f is a measurable function. Then it holds that
ess sup_{x∈Θ} f(x) ≤ sup_{x∈Θ} f(x).
Proof. If sup_{θ∈Θ} f(θ) = ∞, the proof is trivial. Suppose instead that sup_{θ∈Θ} f(θ) = S, where S is finite. Then it holds that
f(x) ≤ S, for all x ∈ Θ ⇒ f(x) ≤ S, for almost every x ∈ Θ
⇒ S ∈ {t : f(x) ≤ t, for almost every x ∈ Θ}
⇒ S ≥ inf{t : f(x) ≤ t, for almost every x ∈ Θ} = ess sup_{x∈Θ} f(x), (A.2)
from where the final result follows.
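A simple numerical illustration of why the inequality in Proposition A.4 can be strict: the maximum over a large i.i.d. uniform sample tracks the essential supremum, and never 'sees' a modification made on a null set. The particular function below is, of course, our own choice.

```python
import numpy as np

rng = np.random.default_rng(7)

# f equals sin(pi*x) on [0, 1], except for a spike on the null set {1/2}:
# sup f = 10, yet ess sup f = 1, so Proposition A.4 holds strictly here.
def f(x):
    return np.where(x == 0.5, 10.0, np.sin(np.pi * x))

x = rng.uniform(0.0, 1.0, size=1_000_000)
sampled_max = f(x).max()     # i.i.d. uniform draws hit {1/2} with probability 0
assert sampled_max <= 1.0    # tracks ess sup f = 1, not sup f = 10
assert f(np.array([0.5]))[0] == 10.0
```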
Appendix B
van der Corput’s Sublevel Set
Estimates
The goal of this appendix is to ponder over van der Corput's sublevel set estimate used in Chapter 4, a result due to Rogers [44]. This result is stated and proved below.
Theorem B.1. (van der Corput's sublevel set estimate)
Suppose that f : [a, b] → R is n times differentiable on ]a, b[, with n ≥ 1, and that |f^(n)(x)| ≥ ζ > 0. Then it holds that
λ({x ∈ [a, b] : |f(x)| ≤ τ}) ≤ (n! 2^{2n−1} τ/ζ)^{1/n}.
For a multidimensional version of this result see Carbery et al. [10]. Before we sketch a proof of Theorem B.1, we have to recall two keynote topics: a generalized version of Lagrange's theorem, and basic rudiments of Chebyshev polynomials. We start with the former, whose proof can be found elsewhere (e.g. [44]).
Theorem B.2. (Generalized Lagrange's Theorem)
Consider f : [a, b] → R, such that f is n times differentiable, with n ≥ 1, and consider points x0 < x1 < · · · < xn contained in ]a, b[. Then it holds that
∃ c ∈ ]a, b[ : f^(n)(c) = n! Σ_{j=0}^{n} (−1)^{j+n} f(x_j) / ∏_{k≠j} |x_k − x_j|.
Proof. See Rogers [44].
Now we bring to mind some rudimentary features of Chebyshev polynomials. Recall that the Chebyshev polynomial of order n is defined as T_n(x) = cos(n cos^{−1}(x)), for n ∈ N0 and x ∈ [−1, 1]. Note that T_n is indeed a polynomial of degree n, even though this may not appear obvious on a prima facie analysis (hence, for instance, T_1(x) = x, T_2(x) = 2x² − 1, and so on).
Elementary Properties of Chebyshev Polynomials
1. The term of degree n in T_n(x) has coefficient 2^{n−1}.
2. The extrema of the Chebyshev polynomial T_n(x) are attained at ρ_k = cos(kπ/n), for k = 0, 1, . . . , n.
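Both properties, and the derivative identity discussed next, can be checked numerically, e.g. with NumPy's Chebyshev utilities:

```python
import math
import numpy as np
from numpy.polynomial import chebyshev as C

n = 5
coeffs = C.cheb2poly([0] * n + [1])          # T_5 in the monomial basis
assert coeffs[-1] == 2 ** (n - 1)            # property 1: leading coefficient 2^(n-1)

rho = np.cos(np.arange(n + 1) * np.pi / n)   # property 2: extrema at cos(k*pi/n)
vals = np.polynomial.polynomial.polyval(rho, coeffs)
assert np.allclose(np.abs(vals), 1.0)        # |T_n| attains 1 at each extremum

dn = C.chebder([0] * n + [1], n)             # n-th derivative of T_n is constant
assert np.isclose(C.chebval(0.0, dn), math.factorial(n) * 2 ** (n - 1))
```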
Observe that it follows directly from property 1 that T_n^(n)(x) = n! 2^{n−1}. These and other elementary properties of Chebyshev polynomials can be found in introductory textbooks on numerical analysis (see [52], pp. 242–243). We are now in a position to give a proof of Theorem B.1. The proof given below mirrors the reasoning in [44].
Proof. (van der Corput’s sublevel set estimate)
Let O = x ∈ [a, b] : |f(x)| ≤ τ. The case wherein λ(O) = 0 is trivial. Hence, suppose
that λ(O) > 0. We start by mapping O to the interval O such that λ(O) = λ(O), and
which preserves the distance. Now, map O into the interval [−1, 1] by centering at the
origin and scaling by 2/λ(O). Through an application of Theorem B.2 to the Chebyshev
extrema it holds thatn∑j=0
∏k 6=j|ρk − ρj |−1 = 2n−1.
If we map back to O, there exists x0, . . . , xn ∈ O such that
n∑j=0
∏k 6=j|xk − xj |−1 ≤ 2n−1 2n
[λ(O)]n=
22n−1
[λ(O)]n. (B.1)
Consequently, it holds that
ζ ≤
∣∣∣∣∣∣n!
n∑j=0
(−1)j+nf(xj)∏k 6=j |xk − xj |
∣∣∣∣∣∣ ≤ n!n∑j=0
∏k 6=j|xk − xj |−1τ ≤ n!22n−1τ
[λ(O)]n. (B.2)
Some justifications regarding (B.2). From left to right: the first inequality holds by
assumption (|f (k)(x)| ≥ ζ > 0), and as a consequence of Theorem B.2; the second
inequality holds by triangular inequality and by assumption (|f(x)| ≤ τ); the last in-
equality is a consequence of the (B.1). The final result is now a consequence of inequality
(B.2).
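As a quick numerical sanity check of Theorem B.1 (not part of Rogers' argument), take f(x) = x² on [−1, 1], so that n = 2 and ζ = 2; the sublevel set has exact measure 2√τ, while the bound evaluates to 2√(2τ):

```python
import numpy as np
from math import factorial

# f(x) = x^2 on [-1, 1]: |f''| = 2 everywhere, so Theorem B.1 applies
# with n = 2 and zeta = 2; the exact sublevel measure is 2*sqrt(tau).
a, b, n, zeta = -1.0, 1.0, 2, 2.0
x = np.linspace(a, b, 2_000_001)
dx = (b - a) / (len(x) - 1)

for tau in (0.01, 0.1, 0.5):
    measure = np.count_nonzero(x**2 <= tau) * dx                  # grid estimate
    bound = (factorial(n) * 2 ** (2 * n - 1) * tau / zeta) ** (1.0 / n)
    assert measure <= bound       # 2*sqrt(tau) <= 2*sqrt(2*tau), as expected
```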
Appendix C
Construction of Confidence Intervals for the Minimum
C.1 Tables for de Haan’s Method
This appendix reports some computational experience with de Haan's [14] method. Additional details regarding the design of the simulation can be found in §4.5. The test functions used are classical in the literature; even so, for the sake of completeness, we include their functional forms here (§C.2). In Table C.1, we summarize useful information regarding the search domains used, as well as the global minima over the respective domains.
Test Function     Search Domain    m∗
Beale             [−4.5, 4.5]²     0
Easom             [−100, 100]²     −1
Griewank          [−600, 600]²     0
Rastrigin         [−5.12, 5.12]²   0
Rosenbrock        [−5, 10]²        0
Styblinski–Tang   [−8, 8]²         −78.33
Table C.1: Search domains of the test functions used and their corresponding global minimum value, denoted by m∗.
The next pages summarize the results obtained. We recall that the notation used below was already defined in foregoing chapters (formulas (4.30) and (4.31) from §4.5).
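The construction behind Ψn(p) can be sketched as follows. This is a reconstruction from de Haan's [14] limit law: for a smooth objective on a k-dimensional domain, the ratio (y(1) − m)/(y(2) − y(1)) of the two smallest of n uniform-design evaluations is asymptotically distributed with P(ratio > t) = (1 + t)^(−a), a = k/2. The exact notation of (4.30)–(4.31) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

def rastrigin(x):
    # 2-D Rastrigin test function; global minimum m = 0 at the origin
    return 10.0 * x.shape[-1] + np.sum(x**2 - 10.0 * np.cos(2 * np.pi * x), axis=-1)

# Evaluate the objective at n points drawn uniformly on the search domain
n, k = 10_000, 2
y = np.sort(rastrigin(rng.uniform(-5.12, 5.12, size=(n, k))))
y1, y2 = y[0], y[1]

# Inverting P(ratio > t) = (1 + t)^(-a) at level p, with a = k/2, gives the
# (1 - p) confidence interval  ] y1 - (y2 - y1)*(p**(-1/a) - 1) ; y1 [  for m.
a = k / 2.0
lowers = [y1 - (y2 - y1) * (p ** (-1.0 / a) - 1.0) for p in (0.10, 0.05, 0.01)]
```

Note how the interval widens as p decreases, matching the pattern of the tables below.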
Test Function     Ψn(p)
Beale             ]−0.0409 // −0.0918 // −0.4985; 0.0048[
Easom             ]−2.3266 // −4.5481 // −22.3196; −0.3273[
Griewank          ]−0.7654 // −1.8815 // −10.8104; 0.2391[
Rastrigin         ]−2.6359 // −6.1850 // −34.5774; 0.5582[
Rosenbrock        ]−0.5828 // −1.3078 // −7.1151; 0.0704[
Styblinski–Tang   ]−79.4701 // −80.8843 // −92.1976; −78.1974[
Table C.2: (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 10,000 observations.
Test Function     σ²LB(0.10)   σ²LB(0.05)   σ²LB(0.01)   σ²UB
Beale             0.0022       0.0096       0.2580       3e-05
Easom             5.9262       23.5227      585.4046     0.0908
Griewank          0.8137       3.4069       88.7540      0.0142
Rastrigin         8.8619       36.4579      936.5940     0.1605
Rosenbrock        0.4335       1.9305       52.4967      0.0049
Styblinski–Tang   1.5938       7.0524       191.2889     0.0196
Table C.3: Sample variances for the lower bounds and upper bound of the (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 10,000 observations.
Test Function     Ψn(p)
Beale             ]−0.0195 // −0.0440 // −0.2403; 0.0026[
Easom             ]−2.8362 // −5.4301 // −26.1817; −0.5016[
Griewank          ]−0.5536 // −1.3559 // −7.7745; 0.1685[
Rastrigin         ]−2.1453 // −4.8858 // −26.8100; 0.3212[
Rosenbrock        ]−0.2898 // −0.6521 // −3.5507; 0.0363[
Styblinski–Tang   ]−78.8997 // −79.6046 // −85.2518; −78.2634[
Table C.4: (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 20,000 observations.
Test Function     σ²LB(0.10)   σ²LB(0.05)   σ²LB(0.01)   σ²UB
Beale             5e-04        0.0023       0.0608       6e-06
Easom             4.8064       19.4017      490.4135     0.0841
Griewank          0.3970       1.6416       42.3840      0.0080
Rastrigin         5.1644       21.9313      577.5952     0.0855
Rosenbrock        0.1135       0.5022       13.6278      0.0015
Styblinski–Tang   0.4113       1.8156       49.1538      0.0049
Table C.5: Sample variances for the lower bounds and upper bound of the (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 20,000 observations.
Test Function     Ψn(p)
Beale             ]−0.0041 // −0.0092 // −0.0499; 5e-04[
Easom             ]−2.0988 // −3.4918 // −14.6353; −0.8452[
Griewank          ]−0.2619 // −0.6373 // −3.6409; 0.0760[
Rastrigin         ]−0.5316 // −1.1991 // −6.5390; 0.0691[
Rosenbrock        ]−0.0584 // −0.1310 // −0.7124; 0.007[
Styblinski–Tang   ]−78.4469 // −78.5898 // −79.7329; −78.3183[
Table C.6: (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 100,000 observations.
Test Function     σ²LB(0.10)   σ²LB(0.05)   σ²LB(0.01)   σ²UB
Beale             2e-06        1e-04        0.0026       2e-07
Easom             1.3030       5.6331       150.2689     0.0183
Griewank          0.0864       0.3614       9.4083       0.0015
Rastrigin         0.4265       1.8842       51.0218      0.0046
Rosenbrock        0.0043       0.0192       0.5227       5e-05
Styblinski–Tang   0.0167       0.0735       1.9868       2e-04
Table C.7: Sample variances for the lower bounds and upper bound of the (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 100,000 observations.
Test Function     Ψn(p)
Beale             ]−8e-04 // −0.0018 // −0.0097; 1e-04[
Easom             ]−1.2893 // −1.6514 // −4.5486; −0.9633[
Griewank          ]−0.1191 // −0.2883 // −1.6418; 0.0332[
Rastrigin         ]−0.0118 // −0.0265 // −0.1440; 0.0014[
Rosenbrock        ]−0.0110 // −0.0248 // −0.1356; 0.0015[
Styblinski–Tang   ]−78.3537 // −78.3807 // −78.5960; −78.3295[
Table C.8: (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 500,000 observations.
Test Function     σ²LB(0.10)   σ²LB(0.05)   σ²LB(0.01)   σ²UB
Beale             8e-07        4e-06        1e-04        8e-09
Easom             0.1045       0.4600       12.4314      0.014
Griewank          0.0183       0.0769       2.0045       3e-04
Rastrigin         2e-04        7e-04        0.0193       2e-06
Rosenbrock        2e-04        8e-04        0.204        2e-06
Styblinski–Tang   5e-04        0.0024       0.0643       8e-06
Table C.9: Sample variances for the lower bounds and upper bound of the (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 500,000 observations.
C.2 Functional Form of the Classical Test Functions
We now report the functional forms of the classical test functions used above.
• Beale function
L(x1, x2) = (1.5 − x1 + x1x2)² + (2.25 − x1 + x1x2²)² + (2.625 − x1 + x1x2³)².
• Easom function
L(x1, x2) = −cos(x1) cos(x2) exp(−(x1 − π)² − (x2 − π)²).
• Griewank function
L(x1, . . . , xk) = Σ_{i=1}^{k} x_i²/4000 − ∏_{i=1}^{k} cos(x_i/√i) + 1.
• Generalized Rastrigin function
L(x1, . . . , xk) = 10k + Σ_{i=1}^{k} (x_i² − 10 cos(2πx_i)).
• Rosenbrock function
L(x1, . . . , xk) = Σ_{i=1}^{k−1} [100(x_i² − x_{i+1})² + (x_i − 1)²].
• Styblinski–Tang function
L(x1, x2) = (1/2) [x1⁴ − 16x1² + 5x1 + x2⁴ − 16x2² + 5x2].
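For reference, a direct transcription of these test functions, checked at their well-known global minimizers; the minimizer locations are standard but are not stated in the text above.

```python
import numpy as np

def beale(x1, x2):
    return ((1.5 - x1 + x1*x2)**2 + (2.25 - x1 + x1*x2**2)**2
            + (2.625 - x1 + x1*x2**3)**2)

def easom(x1, x2):
    return -np.cos(x1)*np.cos(x2)*np.exp(-(x1 - np.pi)**2 - (x2 - np.pi)**2)

def griewank(x):
    x = np.asarray(x, dtype=float)
    i = np.arange(1, len(x) + 1)
    return np.sum(x**2 / 4000.0) - np.prod(np.cos(x / np.sqrt(i))) + 1.0

def rastrigin(x):
    x = np.asarray(x, dtype=float)
    return 10.0*len(x) + np.sum(x**2 - 10.0*np.cos(2*np.pi*x))

def rosenbrock(x):
    x = np.asarray(x, dtype=float)
    return np.sum(100.0*(x[:-1]**2 - x[1:])**2 + (x[:-1] - 1.0)**2)

def styblinski_tang(x1, x2):
    return 0.5*(x1**4 - 16*x1**2 + 5*x1 + x2**4 - 16*x2**2 + 5*x2)

# Values at the standard global minimizers agree with the m* column of Table C.1
assert np.isclose(beale(3.0, 0.5), 0.0)
assert np.isclose(easom(np.pi, np.pi), -1.0)
assert np.isclose(griewank([0.0, 0.0]), 0.0)
assert np.isclose(rastrigin([0.0, 0.0]), 0.0)
assert np.isclose(rosenbrock([1.0, 1.0]), 0.0)
assert np.isclose(styblinski_tang(-2.903534, -2.903534), -78.332, atol=1e-2)
```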
Bibliography
[1] Amemiya, T. (1985) Advanced Econometrics, Cambridge: Harvard University
Press.
[2] Andrews, D. (1997) "Estimation When a Parameter is on a Boundary of a Parameter Space: Part II," Mimeo, Yale University.
[3] Andrews, D. (1999) "Estimation When a Parameter is on a Boundary," Econometrica, 67, 1341–1383.
[4] Bazaraa, M., Sherali, H., and Shetty, C. (1992) Nonlinear Programming:
Theory and Algorithms, Hoboken: Wiley.
[5] Bohachevsky, I. O., Johnson, M. E., and Stein, M. L. (1986) “Generalized
Simulated Annealing for a Function Optimization,” Technometrics, 28, 209–217.
[6] Billingsley, P. Probability and Measure, New York: Wiley.
[7] Bierens, H. J. (2005) Introduction to the Mathematical and Statistical Foundations
of Econometrics, New York: Cambridge University Press.
[8] Booth, J., Casella, G., and Hobert, J. (2008) “Clustering using Objective
Functions and Stochastic Search,” Journal of the Royal Statistical Society, Ser. B,
70, 119–139.
[9] Capinski, M., Kopp, E. (1998) Measure, Integral and Probability, New York:
Springer.
[10] Carbery, A., Christ, M., and Wright, J. (2006) “Multidimensional Van Der
Corput and Sublevel Set Estimates,” Journal of the American Mathematical Society,
4, 981–1015.
[11] Carvalho, F., Oliveira, M., and Mexia, J. (2007) “Maximum Likelihood Es-
timator in Models with Commutative Orthogonal Block Structure,” Proceedings of
the 56th ISI Session, Lisbon, 223–226.
[12] Chernoff, H. (1954) “On the Distribution of the Likelihood Ratio,” Annals of
Mathematical Statistics, 25, 573–578.
[13] Christensen, R. (2002) Plane Answers to Complex Questions, New York:
Springer.
[14] de Haan, L. (1981) “Estimation of the Minimum of a Function Using Order Statis-
tics,” Journal of the American Statistical Association, 76, 467–469.
[15] Dorea, C. (1987) "Estimation of the Extreme Value and Extreme Points," Annals of the Institute of Statistical Mathematics, 39, 37–48.
[16] Duflo, M. (1996) Algorithmes Stochastiques, Berlin: Springer.
[17] Esquível, M. L. (2006) "A Conditional Gaussian Martingale Algorithm for Global
Optimization,” Lecture Notes in Computer Science, 3982, 813–823.
[18] Figueira, M. (1997) Fundamentos de Analise Infinitesimal, Textos de Matematica,
Universidade de Lisboa, Faculdade de Ciencias, Departamento de Matematica.
[19] Gan, L., Jiang, J. (1999) “A Test for a Global Optimum,” Journal of the Amer-
ican Statistical Association, 94, 847–854.
[20] Galambos, J. (1978) The Asymptotic Theory of Extreme Order Statistics, New York: Wiley.
[21] Goldberg, D. (1989) Genetic Algorithms in Search, Optimization, and Machine
Learning, Reading: Addison-Wesley.
[22] Hayashi, F. (2000) Econometrics, New Jersey: Princeton University Press.
[23] Harville, D. (1974) “Bayesian Inference for Variance Components Using Only
Errors Contrasts,” Biometrika, 61, 383–385.
[24] Judge, G., Hill, R., Griffiths, W., Lutkepohl, H., and Lee, T.-C. (1985)
The Theory and Practice of Econometrics, New York: Wiley.
[25] Ho, Y.-C., Pepyne, D. L. (2002) “Simple Explanation of the No Free Lunch
Theorem and its Implications,” Journal of Optimization Theory and Applications,
3, 549–570.
[26] Kelley, C. T. (1999) Iterative Methods for Optimization, Philadelphia: SIAM.
[27] Kendall, M. (1961) A Course in the Geometry of n Dimensions, New York:
Hafner.
[28] Khuri, A. (2003) Advanced Calculus with Applications in Statistics, Hoboken: Wiley.
[29] Khuri, A., Mathew, T., and Sinha, B. (1998) Statistical Tests for Mixed Linear
Models, New York: Wiley.
[30] LeCam, L. (1953) “On Some Asymptotic Properties of Maximum Likelihood Es-
timates and Related Bayes Estimates,” University of California Publications in
Statistics, 1, 277–328.
[31] Luenberger, D. G. (1969) Optimization by Vector Space Methods, New York:
Wiley.
[32] Mexia, J., Pereira, D., and Baeta, J. (1999) “L2 Environmental Indexes,”
Listy Biometryczne—Biometrical Letters, 36, 2, 137–143.
[33] Mexia, J., Corte Real, P. (2001) “Strong Law of Large Numbers for Additive
Extremum Estimators,” Discussiones Mathematicae, Probability and Statistics, 21,
81–88.
[34] Mexia, J., Corte Real, P. (2003) “Compact Hypothesis and Extremal Set Es-
timators,” Discussiones Mathematicae, Probability and Statistics, 23, 103–121.
[35] Newey, W., McFadden, D. (1994) “Large Sample Estimation and Hypothesis
Testing,” Handbook of Econometrics, 4, 2111–2245.
[36] Nocedal, J., Wright, S. (1999) Numerical Optimization, New York: Springer.
[37] Nunes, S., Mexia, J., and Minder, C. (2004) “Logit Model for Tuberculosis
Incidence in Europe (1995–2000). Analysis by Sex and Age Group,” Colloquium
Biometryczne, 34, 147–159.
[38] Oliveira, M., Nunes, S., Ramos, L., and Mexia, J. (2006) “Ajustamento de
Modelos Espacio-Temporais para a Sida Utilizando Mınimos Quadrados Estrutu-
rados,” Actas do XII Congresso da Sociedade Portuguesa de Estatıstica, 519–526.
[39] Øksendal, B. (1998) Stochastic Differential Equations: An Introduction with Applications, Berlin: Springer.
[40] Pakes, A., McGuire, P. (2001) “Stochastic Algorithms, Symmetric Markov Per-
fect Equilibrium and the Curse of Dimensionality,” Econometrica, 69, 1261–1282.
[41] Pflug, G. Ch. (1996) Optimization of Stochastic Models: The Interface Between
Simulation and Optimization, Boston: Kluwer.
[42] Rao, C., Rao, M. (1998) Matrix Algebra and its Applications to Statistics and
Econometrics, Singapore: World Scientific Publishing.
[43] Renyi, A. (1970) Probability Theory, Amsterdam: Elsevier.
[44] Rogers, K. M. (2005) “Sharp Van Der Corput Estimates and Minimal Divided
Differences,” Proceedings of the American Mathematical Society, 133, 3543–3550.
[45] Ross, S. (1996) Stochastic Processes, New York: Wiley.
[46] Sen, P. K., Singer, J. M. (1993) Large Sample Methods in Statistics, An Intro-
duction with Applications, Boca Raton: Chapman & Hall.
[47] Searle, S., Casella, G., and McCulloch, C. (1992) Variance Components, New York: Wiley.
[48] Shao, J. (2005) Mathematical Statistics, New York: Springer.
[49] Solis, F. J., Wets, R. J-B. (1981) “Minimization by Random Search Tech-
niques,” Mathematics of Operations Research, 6, 19–30.
[50] Spall, J. C. (2003) Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control, Hoboken: Wiley.
[51] Stram, D. O., Lee J. W. (1994) “Variance Components Testing in the Longitu-
dinal Mixed Effects Model,” Biometrics, 50, 1171–1177.
[52] Suli, E., Mayers, D. (2003) An Introduction to Numerical Analysis, Cambridge:
Cambridge University Press.
[53] Veall, M. R. (1990) “Testing for a Global Minimum in an Econometric Context,”
Econometrica, 58, 1459–1465.
[54] van der Vaart, A. W., Wellner, J. A. (1996) Weak Convergence and Empir-
ical Processes with Applications to Statistics, New York: Springer.
[55] van der Vaart, A. W. (1998) Asymptotic Statistics, New York: Cambridge University Press.
[56] Williams, D. (1991) Probability with Martingales, New York: Cambridge Univer-
sity Press.
[57] Wald, A. (1949) “Note on the Consistency of the Maximum Likelihood Estimate,”
Annals of Mathematical Statistics, 20, 595–601.
[58] Wolpert, D. H., Macready, W. G. (1997) “No Free Lunch Theorems for Op-
timization”, IEEE Transactions on Evolutionary Computation, 1, 67–82.