UNIVERSIDADE NOVA DE LISBOA
EXTREMUM ESTIMATORS AND
STOCHASTIC OPTIMIZATION
METHODS
by
MIGUEL DE CARVALHO
Submitted in partial fulfillment of the Requirements for the Degree of
PhD in Mathematics, in the Speciality of Statistics
in the
Faculdade de Ciências e Tecnologia
Departamento de Matemática
May 2009
“Thought is only a flash between two long nights, but this flash is everything.”
Henri Poincaré
UNIVERSIDADE NOVA DE LISBOA
Abstract
Faculdade de Ciências e Tecnologia
Departamento de Matemática
Doctor of Philosophy
by MIGUEL DE CARVALHO
Extremum estimators form one of the broadest classes of statistical methods for obtaining consistent and asymptotically normal estimates. The Ordinary Least Squares (OLS), the Generalized Method of Moments (GMM), and the Maximum Likelihood (ML) methods are all given as solutions to an optimization problem of interest, and thus are particular instances of extremum estimators. One major concern regarding the computation of estimates of this type relates to the convergence features of the method used to find the optimal solution. In fact, if the method employed can converge to a local solution, the consistency of the extremum estimator is no longer ensured. This thesis is concerned with the application of global stochastic search and optimization methods to the computation of estimates based on extremum estimators. For this purpose, a stochastic search algorithm is proposed and shown to be convergent. We provide applications to classical test functions, as well as to a problem of variance component estimation in a mixed linear model.
UNIVERSIDADE NOVA DE LISBOA
Resumo
Faculdade de Ciências e Tecnologia
Departamento de Matemática
Doctor of Philosophy
by MIGUEL DE CARVALHO
Os estimadores extremais (extremum estimators) são uma das classes mais amplas de métodos estatísticos utilizados para a obtenção de estimativas consistentes e assimptoticamente normais. O método dos mínimos quadrados, o método generalizado dos momentos, bem como os métodos de máxima verosimilhança resultam da solução de um problema de optimização, sendo consequentemente especificações particulares de estimadores extremais. Um problema relevante no cálculo de estimativas deste tipo está relacionado com as propriedades de convergência do método utilizado para obter a solução óptima. De facto, se o método utilizado convergir, eventualmente, para uma solução local, a consistência do estimador extremal deixa de ser garantida. Esta tese incide na aplicação de métodos estocásticos de pesquisa e optimização global para o cálculo de estimativas baseadas em estimadores extremais. Para o efeito, é proposto um algoritmo estocástico de pesquisa que provamos ser convergente. Neste sentido são providenciadas aplicações a funções de teste clássicas, bem como a um problema de estimação em componentes de variância num modelo linear misto.
Acknowledgements
I would like to express my hearty thanks to my advisors, Professor Tiago Mexia and Professor Manuel Esquível. Their continuous generosity and unbounded encouragement gave me the strength to carry on converging to this thesis.
I would also like to record my indebtedness to my family and friends, who have contributed to the realization of this thesis.
Unfortunately, some persons whose deeds contributed to this work will most probably never be aware of it. To my grandparents Miguel and Helena, who no longer live among us, and to Piotr Il'yich Tchaikovsky, whom I never had the opportunity to meet.
I gratefully acknowledge the financial support from FCT (Fundação para a Ciência e a Tecnologia).1
The only errors which are not observable here are the error terms of the models.
1Advanced research scholarship with reference SFRH/BD/1569/2004.
Contents
Abstract ii
Resumo iii
Acknowledgements iv
Glossary of Notation vii
List of Figures viii
List of Tables ix
1 Introduction 1
1.1 Prefatory Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Formulation and Research Goals . . . . . . . . . . . . . . . . . . 3
1.3 Contribution of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Synopsis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Stochastic Preliminaries 6
2.1 Overture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Modes of Stochastic Convergence and Stochastic Orders . . . . . . . . . . 9
2.3 Consistency of Point Estimators . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Asymptotic Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Extremum Estimators 25
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Consistency Results for Extremum Estimators . . . . . . . . . . . . . . . . 28
3.2.1 Consistency Under Compactness Assumptions . . . . . . . . . . . 29
3.2.2 Consistency for Maximum Likelihood Methods . . . . . . . . . . . 31
3.2.3 Consistency Without Compactness Assumption . . . . . . . . . . . 33
3.3 Convergence in Distribution of Extremum Estimators . . . . . . . . . . . 35
3.3.1 Interior Point Proviso . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.2 Asymptotic Normality for Maximum Likelihood Methods . . . . . 36
3.3.3 Boundary Point Proviso . . . . . . . . . . . . . . . . . . . . . . . . 37
4 Global Optimization by Stochastic Search Methods 42
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 An Overview of Random Search Techniques . . . . . . . . . . . . . . . . . 44
4.3 Recasting the Solis and Wets Framework . . . . . . . . . . . . . . . . . . . 45
4.3.1 Preliminaries and Notation . . . . . . . . . . . . . . . . . . . . . . 45
4.3.2 The Master Method . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.3 Stochastic Zigzag Methods . . . . . . . . . . . . . . . . . . . . . . 51
4.3.4 A Matrix Formulation of the Master Method . . . . . . . . . . . . 52
4.4 Convergence of the Master Method . . . . . . . . . . . . . . . . . . . . . . 57
4.5 A Note on the Construction of Confidence Intervals . . . . . . . . . . . . . 64
5 Estimation in the Mixed Model via Stochastic Optimization 66
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2.1 The Benchmark Case . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3 The Dimension Reduction Technique . . . . . . . . . . . . . . . . . . . . . 70
5.4 A Stochastic Optimization Study of the Dimension Reduction Technique . 72
6 Summary and Conclusions 74
6.1 Closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2 Open Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
A A Short Note Regarding the Essential Supremum 78
B van der Corput’s Sublevel Set Estimates 81
C Construction of Confidence Intervals for the Minimum 83
C.1 Tables for de Haan’s Method . . . . . . . . . . . . . . . . . . . . . . . . . 83
C.2 Functional Form of the Classical Test Functions . . . . . . . . . . . . . . . 88
Bibliography 89
Glossary of Notation
N  N ∪ {∞}
∂(A)  boundary of the set A
int(A)  interior of the set A
cl(A)  closure of the set A
Df(x)  gradient of the vector function f(x)
D²f(x)  Hessian of the vector function f(x)
Cⁿ(A)  set of n-times continuously differentiable functions defined on a set A
In  identity matrix of size n
0n×m  null matrix of size n × m
E[X]  expectation of the random vector X
V[X]  covariance matrix of the random vector X
N(µ, V)  normal random vector with mean vector µ and covariance matrix V
un = O(1)  the sequence of real numbers un is bounded
un = o(1)  the sequence of real numbers un converges to zero
Un →p u  the sequence of random variables Un converges in probability to u
Un →D u  the sequence of random variables Un converges in distribution to u
Un →a.s. u  the sequence of random variables Un converges almost surely to u
Un = Op(1)  the sequence of random variables Un is stochastically bounded
Un = op(1)  the sequence of random variables Un converges in probability to 0
a ∨ b  max{a, b}
a ∧ b  min{a, b}
‖ · ‖r  Lr-norm
‖ · ‖∞  infinity norm
List of Figures
2.1 Uniform Convergence in Probability . . . . . . . . . . . . . . . . . . . . . 10
2.2 The logical relation between the stochastic convergence concepts introduced. 15
3.1 Assumptions (1, 2, 3) imply consistency of the extremum estimator. . . . . 30
3.2 The picture behind Proposition 3.4 . . . . . . . . . . . . . . . . . . . . . . 33
4.1 A sketch which portrays the reasoning involved in the proof of Theorem 4.5. This instance is used just to provide guidance, and it is not part of the proof. . . . . 47
4.2 The initialization of the stochastic zigzag method. In the picture on the left we start by finding points a and b which initialize the algorithm. The second picture illustrates that in Step 1 we collect a random sample (c = 10) from the line which passes through a and b. The remaining picture depicts Steps 2 and 3, wherein after the maximum of the first line we generate another seed and start by extracting a sample from the new line which passes by such points. . . . . 52
4.3 The application of the stochastic zigzag method to the Styblinski–Tang test function. This function pertains to a class of test functions which are typically used to assess the performance of an optimization algorithm (see e.g. Spall [50]). The functional form of this function is given in formula (4.6). . . . . 53
List of Tables
5.1 Estimates of the variance components in Model I. . . . . . . . . . . . . . . 72
5.2 Estimates of the variance components in Model II. . . . . . . . . . . . . . 73
5.3 Estimates of the variance components in Model III. . . . . . . . . . . . . . 73
C.1 Search domains of the test functions used and their corresponding global minimum value, denoted by m∗ . . . . 83
C.2 (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 10,000 observations . . . . 84
C.3 Sample variances for the upper and lower bounds of the (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 10,000 observations . . . . 84
C.4 (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 20,000 observations . . . . 85
C.5 Sample variances for the lower and upper bounds of the (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 20,000 observations . . . . 85
C.6 (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 100,000 observations . . . . 86
C.7 Sample variances for the lower and upper bounds of the (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 100,000 observations . . . . 86
C.8 (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 500,000 observations . . . . 87
C.9 Sample variances for the lower and upper bounds of the (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 500,000 observations . . . . 87
Dedicated to Vanda. . .
Chapter 1
Introduction
1.1 Prefatory Remarks
A broad class of estimators can be formulated as the solution to an optimization problem of interest. This class of estimators is known in the literature as extremum estimators (see Amemiya [1]; Newey and McFadden [35]). The classical Ordinary Least Squares (OLS), the Generalized Method of Moments (GMM), and the Maximum Likelihood (ML) methods are all instances of extremum estimators. These instances of extremum estimators, as well as their corresponding affinity, will become clearer in Chapter 3. For now, we keep the discussion at a perfunctory level.
One of the advantages of considering this general class of estimators is that it is possible
to build an elegant asymptotic theory which reduces to a set of general results. Concerning consistency, we will examine which assumptions have to be made in order to establish the weak consistency of an extremum estimator. As we shall see below, only mild requirements are needed to achieve this result. Strong consistency can also be obtained under slightly more demanding conditions. In addition, if further assumptions are made, this class of estimators can also be ensured to be asymptotically normal.
Despite their appealing features, in many cases of practical interest these estimators are not analytically tractable. Stated differently, in some relevant instances of extremum estimators we frequently lack a closed-form solution for the estimates. A solution frequently adopted is to rely on some optimization algorithm to obtain the estimates. Two questions then arise. First, is there a method which outperforms all the others? Second, what type of algorithm should we use to perform the optimization? Some brief comments regarding these questions are in order. Concerning the first question, an answer is provided by the No Free Lunch
Theorem: an impossibility theorem which precludes the existence of a general-purpose strategy, robust a priori to any type of optimization problem (Wolpert and Macready [58]). Concerning the second question, a major concern regarding the computation of estimates of this type relates to the convergence features of the method used to find the optimal solution. In fact, if the method employed converges to a local solution, the consistency of the extremum estimator is no longer ensured (cf. [19, 35]). Hence, one should avoid relying on optimization methods which can converge to a local solution, given that it is the global solution that possesses the noteworthy asymptotic features. Loosely speaking, two types of optimization algorithms are typically adopted to tackle this problem, namely deterministic and stochastic optimization methods. The former include the Newton–Raphson and steepest descent methods, among many others (see [36] and references therein). In this thesis, however, the focus will be placed on stochastic optimization algorithms. Specifically, this thesis is concerned with the application of global stochastic search and optimization methods to the computation of estimates based on extremum estimators. Here, we follow Spall [50] in referring to stochastic search and optimization algorithms under the conditions that follow.
Stochastic Search and Optimization
I. There is some random noise in the measurements of the objective function;
(and/or)
II. There is a random choice in the search direction as the algorithm iterates
toward a solution.
It is important to underline that this thesis is exclusively concerned with item II. Namely, the focus is placed on optimization algorithms wherein the search direction is randomly dictated. Such algorithms include the pure random search (Solis and Wets [49]), the simulated annealing technique (Bohachevsky et al. [5]), genetic algorithms (Goldberg [21]), the conditional martingale algorithm (see Esquível [17]), etc. Stochastic search and optimization algorithms find application in a wide variety of scenarios. The scope of this topic is broad enough to comprise applications ranging from game theory (Pakes and McGuire [40]) to the clustering of multivariate data (Booth et al. [8]). A nice overview of stochastic search and optimization methods can be found, for instance, in Duflo [16] or in Spall [50]. While the former is more oriented toward the theoretical features of the methods, and more apropos for the reader acquainted with the French language, the latter is written in English and maintains an interesting balance between technical attributes and applications.
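To fix ideas, the pure random search mentioned above can be sketched in a few lines. The sketch below is ours and only illustrative: the test function (Styblinski–Tang, a standard choice), the box domain, and the iteration budget are our own assumptions, not details taken from the references; the scheme simply draws candidates uniformly at random and keeps the best point seen so far, in the spirit of the conceptual algorithm of Solis and Wets [49].

```python
import numpy as np

def styblinski_tang(x):
    """Styblinski-Tang test function; in 2D the global minimum is
    approximately -78.332, attained near x_i = -2.9035."""
    return 0.5 * np.sum(x**4 - 16.0 * x**2 + 5.0 * x)

def pure_random_search(f, lower, upper, n_iter=20000, seed=0):
    """Type-II stochastic search: the candidate points (and hence the
    search direction) are chosen at random; we keep the incumbent best."""
    rng = np.random.default_rng(seed)
    best_x, best_f = None, np.inf
    for _ in range(n_iter):
        x = rng.uniform(lower, upper)   # uniform draw from the box domain
        fx = f(x)
        if fx < best_f:                 # accept only improvements
            best_x, best_f = x, fx
    return best_x, best_f

x_star, f_star = pure_random_search(styblinski_tang,
                                    lower=np.array([-5.0, -5.0]),
                                    upper=np.array([5.0, 5.0]))
```

With a modest budget the incumbent minimum already lands close to the global one; of course, by the No Free Lunch Theorem, no such default configuration can be expected to work uniformly well across problems.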
The next section introduces the specific problem which is located at the core of our
interests, as well as the research goals undertaken in this thesis.
1.2 Problem Formulation and Research Goals
Any estimator which can be formulated as a solution to the optimization problem stated
below is tantamount to an extremum estimator.
Problem Formulation
max_{θ ∈ Θ} T_n(θ).  (1.1)
Some remarks regarding notation: Θ ⊆ Rk denotes a parameter space; n is the sample size; and the objective function which yields the extremum estimator is denoted by Tn. In Chapter 3, we elaborate on some of the properties shared by the broad class of estimators which solve this unconstrained optimization problem. In the sequel, the research goals are summarized.
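As a concrete illustration of (1.1), the OLS estimator solves the problem with one standard choice of objective, T_n(θ) = −(1/n)‖y − Xθ‖². The snippet below is a sketch on simulated data (all names and constants are ours): it computes the closed-form least squares solution and checks numerically that no random perturbation of it attains a larger value of T_n.

```python
import numpy as np

# OLS as an extremum estimator: with T_n(theta) = -(1/n)||y - X theta||^2,
# the maximizer of (1.1) is the least squares solution. Simulated data.
rng = np.random.default_rng(42)
n, k = 200, 3
X = rng.normal(size=(n, k))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=n)

def T_n(theta):
    """Sample objective function of the extremum estimator."""
    r = y - X @ theta
    return -(r @ r) / n

# Closed-form maximizer of T_n (the OLS estimate).
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# T_n is concave, so theta_hat beats any randomly perturbed parameter value.
assert all(T_n(theta_hat) >= T_n(theta_hat + rng.normal(scale=0.5, size=k))
           for _ in range(100))
```

In cases such as this one, the maximizer is available in closed form; the stochastic search methods of Chapter 4 are aimed precisely at the instances of (1.1) where it is not.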
Research Goals
The general goal of this research is to develop a general stochastic optimization method
for solving the optimization problem (1.1). Notwithstanding, the procedures to be developed throughout this thesis can also be applied in other contexts of interest, where an optimization problem identical to (1.1) arises. A stochastic optimization method in which
we are particularly interested is the stochastic zigzag method. This method is largely
inspired by the seminal work of Mexia et al. [32]. The idea is to give a stochastic optimization structure to a method which has proved to be successful in a broad variety of contexts, namely the 'classical' zigzag algorithm (see e.g. Oliveira et al. [38] and references therein). It is not our intention here to elaborate on the latter (this was already done elsewhere; see e.g. Nunes et al. [37]), but rather on its stochastic variant.
After the convergence of the stochastic zigzag method has been addressed, it can be
useful to raise the level of abstraction.1 In fact, if we concentrate on the features of the
method which are really important to establish convergence, we may be able to obtain
the convergence under a more general setting. Of course there is a tradeoff between the
level of abstraction that one is willing to accept, and the possibility of being (un)able to
establish convergence, which one should try to control in a proper manner. A battery
of practical implementations should also be conducted as a means to assess the perfor-
mance of the developed method. Preferably, the method should be applied in the
context of problem (1.1). However, as mentioned above, other optimization problems of
interest can also be used to evaluate the functioning of the algorithm.
1It should be stressed that, for the sake of exposition, in this thesis the results are presented in the opposite way. Hence, we establish the convergence of the master method, from which the convergence of its particular cases follows directly.
1.3 Contribution of this Thesis
In this thesis we contribute by proposing a master method which includes several other
stochastic optimization algorithms. The method proposed here is broad enough to en-
compass the conceptual algorithm of Solis and Wets [49], as a particular case. A note-
worthy embodiment of the master method in which we are particularly interested is
the stochastic zigzag method, an optimization algorithm which is based on the prominent work of Mexia et al. [32]. The master method is presented in a twofold manner.
First, we present an algorithmic version of the method. Second, we provide a matrix
formulation. The latter form of conceptualization of the method brings into light some
features which are less noticeable in the algorithmic version. At the core of the matrix
formulation lies the iterative matrix Z which can also be used as a masterplan for prac-
tical implementations. Each row of the iterative matrix is composed of a seed and a course. Roughly speaking, the seed is a point chosen at random from the domain of the function, and the course holds the remaining iterations of the method (arranged by order of extraction). In the particular case of the stochastic zigzag method, we show how to take advantage of this matrix framework in order to reduce the burden of implementation. In this spirit, a result which is remarkably easy to establish, the Kronecker–zigzag decomposition, allows us to easily construct an entire course of the iterative matrix. The convergence of the master method also plays a part in the contribution of this thesis. It is important to note that the convergence results established here start from hypotheses identical to the ones considered in the literature concerning the convergence of the random search method (cf. Solis and Wets [49]; Esquível
[17]). Given that the stochastic zigzag method is a particular instance of the master
method, the convergence of this algorithm also follows as a consequence. In addition,
we show how to construct confidence intervals for the maximum of the function, starting from the first column of the iterative matrix Z. The construction is direct, being based on a known result in extreme value theory due to de Haan [14].
An application is then provided concerning estimation in the mixed linear model, through the Maximum Likelihood (hereinafter ML) methods. Before we move on to the optimization procedure, we establish that in a mixed linear model the ML problem can be rewritten as a much simpler optimization problem (the simplified problem) in which the search domain is a compact set whose size depends only on the number of variance components. As can be readily noticed, from the estimation standpoint this feature is extremely advantageous, and the simplified problem will allow us to solve the ML problem with large computational savings. The results show an overall good performance of some instances of the master method proposed here.
1.4 Synopsis
This thesis is written as a fugue, interweaving the themes of extremum estimators and
stochastic optimization. Below we offer a plan of the work developed in the pages that
follow.
In Chapter 2 we consider some preliminary concepts and results from asymptotic theory.
The exposition starts with the introduction of convergence modes of stochastic interest.
These concepts will provide us with the basic framework for the introduction of the several
types of consistency of point estimators, as well as for the study of other large sample
properties.
In Chapter 3, we concern ourselves with the introduction of extremum estimators. In
this part of the thesis we also shed some light on the large sample properties of ex-
tremum estimators. The focus is placed on the circumstances under which one can
ensure consistency and asymptotic normality of this broad class of estimators.
A general stochastic optimization method is proposed in Chapter 4. Here, we start
by recasting the theoretical underpinning of Solis and Wets [49]. Following this, we
introduce the algorithmic version of the master method, and stress some of its instances.
The algorithmic version of the master method is then counterpointed with a matrix
version. After a thorough exposition of the method, we establish its convergence. This
chapter ends with a brief note on the construction of the confidence intervals for the
extremum departing from the first column of the iterative matrix.
Chapter 5 is devoted to the particular case of computing maximum likelihood estimates of variance components in mixed linear models. In this framework, we show that the maximum likelihood problem can be rewritten as a much simpler optimization problem in which the search domain is a compact set whose size depends only on the number of variance components. This result will prove to be particularly useful for the application of specific instances of the master method to the computation of ML estimates of the variance components.
The thesis closes in Chapter 6 with a short conclusion and with the proposal of some
future directions for our research agenda.
Chapter 2
Stochastic Preliminaries
This chapter is devoted to the introduction of some preliminary concepts and results suited for our purposes. The exposition in the sequel was largely inspired by the prominent textbooks of Shao [48] and Sen and Singer [46]. For the sake of completeness, in the first section we formally define the structures of interest for our exposition. Furthermore, we also provide some relevant comments on the notation to be used hereafter.
We note that in the exposition that follows, we will frequently assign a name to a theorem. We are aware that some historical inaccuracies may arise as a consequence of adopting this approach, much in the spirit of Stigler's law, whereby almost nothing is named after the person who invented it, but we do it for the sake of exposition, and we believe that the benefits of doing so outweigh the costs.
2.1 Overture
The exposition opens with the introduction of the primitive structures of interest. We
start with the definition of linear space. Roughly speaking, a linear space is a collection
of vectors or matrices which is closed under linear combinations. Formally, we have:
Definition 2.1. Consider a field K. A collection of vectors or matrices X is a linear
space if for all x,y ∈ X, and κ ∈ K, the following conditions hold
1. x + y ∈ X,
2. κx ∈ X.
If we drop the first condition of the linear space definition and consider the particular case K = R+, we get another well-known structure. Namely, a collection of vectors
or matrices Λ is a cone if it is closed under the multiplication by a positive real scalar.
Formally, we have the following definition.
Definition 2.2. A collection of vectors or matrices Λ is a cone, if for all λ ∈ Λ, and
κ ∈ R+, the following condition holds
κλ ∈ Λ.
Throughout the exposition, we will also make use of the concept of a norm.
Definition 2.3. A function ‖ · ‖ from a linear space X to R+0 is a norm, if the following
conditions hold:
i) ‖x‖ ≥ 0, for all x ∈ X, with ‖x‖ = 0 if and only if x = 0;
ii) ‖κx‖ = |κ| ‖x‖, for all x ∈ X and all scalars κ;
iii) ‖x + y‖ ≤ ‖x‖ + ‖y‖, for all x, y ∈ X.
In the sequel, we provide some instances of norms which will prove to be useful in the exposition that follows. For this purpose, let x = [ x1 · · · xn ]. First, we introduce the Lr-norm

‖x‖r := [ ∑_{i=1}^{n} |xi|^r ]^{1/r}.
In the particular case r = 2, we get the well-known Euclidean norm. For ease of notation, we will omit the subscript if r = 2, i.e., we take ‖x‖ := ‖x‖2. Further, we introduce the quadratic norm

‖x‖W := (xᵀW x)^{1/2},

where W is a given positive definite matrix. Note that ‖x‖I = ‖x‖, i.e., the particular choice W = I leads to the Euclidean norm.
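The norms above can be sketched numerically as follows (the helper names are ours); the quadratic norm is taken here as (xᵀWx)^{1/2}, so that the choice W = I indeed recovers the Euclidean norm.

```python
import numpy as np

def lr_norm(x, r):
    """L_r-norm of a vector: (sum_i |x_i|^r)^(1/r)."""
    return np.sum(np.abs(x) ** r) ** (1.0 / r)

def quadratic_norm(x, W):
    """Quadratic norm (x^T W x)^(1/2) for positive definite W."""
    return float(np.sqrt(x @ W @ x))

x = np.array([3.0, -4.0])
assert np.isclose(lr_norm(x, 2), 5.0)     # Euclidean norm: sqrt(9 + 16)
assert np.isclose(lr_norm(x, 1), 7.0)     # L_1-norm: |3| + |-4|
assert np.isclose(quadratic_norm(x, np.eye(2)), np.linalg.norm(x))  # W = I
```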
If X = [xi,j ] is a matrix1 of order m × n, then we analogously define the Lr-norm as

‖X‖r := [ ∑_{i=1}^{m} ∑_{j=1}^{n} |xi,j|^r ]^{1/r}.  (2.1)

Again, to simplify notation, we will omit the subscript if r = 2, i.e., we take ‖X‖ := ‖X‖2. In other contexts of interest, we will make use of the sup-norm, defined on a
1Given that the collection of all m × n matrices Mm,n is a vector space, Definition 2.3 can be applied. However, if m = n it is often required that the matrix norm should also satisfy the property ‖AB‖ ≤ ‖A‖‖B‖, ∀A, B ∈ Mm,n. See Rao and Rao [42] for further details.
linear space Θ ⊂ Rk, i.e.,

‖f‖∞ := sup_{θ ∈ Θ} |f(θ)|,
where f is a real-valued bounded function defined over the parameter space. In our exposition, we will make use of the 'little-oh' and 'big-oh' notations. Recall that in classical calculus we have the notation xn = o(yn), which formally means that

∀ε > 0, ∃ p(ε) ∈ N : n ≥ p(ε) ⇒ |xn/yn| < ε.

We will also borrow from classical calculus the notation xn = O(yn), to mean that

∃ M > 0, ∃ p(M) ∈ N : n ≥ p(M) ⇒ |xn/yn| ≤ M.
We highlight the particular case where yn = 1, ∀n ∈ N. Under those circumstances, xn = o(1) means that xn → 0 as n → ∞. Further, xn = O(1) means that xn is bounded.
Given the way the ‘little-oh’ and ‘big-oh’ are defined, well-known properties can be
derived (see e.g. Khuri [28], p. 66; Figueira [18], p. 264).
1. λO(xn) = O(xn), λ ∈ R;
2. O(xn + yn) = O(xn) +O(yn);
3. O(xnyn) = O(xn)O(yn);
4. o(xnyn) = O(xn)o(yn).
To get some insight on these results, suppose that xn = yn = 1, and that an and bn
are sequences of real numbers such that an = O(1) and bn = O(1). The first three
properties yield that an + bn = O(1) and anbn = O(1). In words, the sum and product of two bounded sequences yield another bounded sequence. To illustrate property 4, suppose that bn = o(1). Then anbn = O(1)o(1) = o(1), i.e., the product of a bounded sequence and a sequence converging to zero also converges to zero.
The concept of cone stated above (Definition 2.2) will allow us to introduce the following
definition due to Chernoff [12], as well as a corresponding generalization. This defini-
tion has well-known applications in statistics (see for instance [3, 51]), and it will be
particularly useful in the next chapter.
Definition 2.4. Consider a cone Λ ⊆ Rk and a set Θ ⊆ Rk.

1. The set Θ is locally approximated by the cone Λ if the following conditions hold:

(a) inf_{λ ∈ Λ} ‖λ − θ‖ = o(‖θ‖), ∀θ ∈ Θ;
(b) inf_{θ ∈ Θ} ‖λ − θ‖ = o(‖λ‖), ∀λ ∈ Λ.

2. The sequence {Θn} of subsets of Rk is locally approximated by the cone Λ ⊆ Rk if it holds that

(a) inf_{λ ∈ Λ} ‖λ − θn‖ = o(‖θn‖), ∀θn ∈ Θn : ‖θn‖ = o(1);
(b) inf_{θn ∈ Θn} ‖λn − θn‖ = o(‖λn‖), ∀λn ∈ Λ : ‖λn‖ = o(1).
Broader definitions of cone and local approximation with vertex θ0 ∈ Rk can be found
elsewhere (Stram and Lee [51]). Notwithstanding, the aforementioned definitions will
prove to be sufficient for our purposes.
2.2 Modes of Stochastic Convergence and Stochastic Orders
In the sequel we will develop our exposition in a normed linear space framework. This will prove to be sufficient for our purposes. For an exposition suited to metric spaces, see van der Vaart and Wellner [54] and Billingsley [6].
Definition 2.5. Consider a sequence of random vectors Xn, with cumulative dis-
tribution functions Fn(x). Further, consider a random vector X, with cumulative
distribution function F (x).
1. We say that Xn converges almost surely (a.s.) to X, if
∀ε, η > 0, ∃ p(ε, η) ∈ N : n ≥ p(ε, η)⇒ P[‖XN −X‖ > ε, for some N ≥ n] < η.
2. We say that Xn converges in probability to X, if
∀ε, η > 0,∃ p(ε, η) ∈ N : n ≥ p(ε, η)⇒ P[‖Xn −X‖ > ε] < η.
3. We say that Xn converges to X in Lr, if

∀ε > 0, ∃ p(ε) ∈ N : n ≥ p(ε) ⇒ E[‖Xn − X‖^r] < ε.
4. We say that Xn converges in distribution to X, if
∀ ε > 0, ∃ p(ε) : n ≥ p(ε)⇒ |Fn(x)− F (x)| < ε,
at any point of continuity x of F .
Remark 2.6. For each of the above-defined types of stochastic convergence, we will respectively use the notations Xn →a.s. X, Xn →p X, Xn →Lr X, and Xn →D X. Further, we emphasize that we will avoid the use of the expression 'weak convergence' to designate convergence in distribution. As stressed by Williams [56], this terminology is unfortunate given that the concept does not coincide with the corresponding concept in functional analysis (for a definition of the functional analysts' concept see Luenberger [31]).
In the following we introduce the concept of uniform convergence in probability.
Definition 2.7. Let T be a function from Θ ⊂ Rk to R. Further, let Tn denote
a sequence of functions from Θ ⊂ Rk to R. We say that the sequence Tn converges
uniformly in probability to T if
∀ε, η > 0, ∃ p(ε, η) ∈ N : n ≥ p(ε, η)⇒ P [‖Tn − T‖∞ > ε] < η,
where
‖Tn − T‖∞ = supθ∈Θ|Tn(θ)− T (θ)|.
Figure 2.1: Uniform Convergence in Probability
This latter type of convergence will prove to be particularly useful as regards the statement of large sample results for extremum estimators, to be provided in the next
Chapter. Roughly speaking, this type of convergence demands that, from a certain order
onwards, the graph of Tn(θ) lie in the ‘ε-sleeve’ with probability arbitrarily close to
one. In Figure 2.1 we provide an illustration of this idea.
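The ‘ε-sleeve’ idea can also be visualized numerically. The following is an illustrative sketch (ours, not part of the thesis’s development): for the toy choices Tn(θ) = θ X̄n and T(θ) = θµ on the compact set Θ = [0, 1], the sup-norm over Θ reduces to |X̄n − µ|, which collapses towards 0 as n grows.

```python
import random

# Illustrative sketch (toy choices, not from the thesis): with
# T_n(theta) = theta * Xbar_n and T(theta) = theta * mu on Theta = [0, 1],
# sup_{theta} |T_n(theta) - T(theta)| = |Xbar_n - mu|.
random.seed(1)
mu = 0.5  # mean of the Uniform(0, 1) draws below

def sup_norm(n):
    xbar = sum(random.random() for _ in range(n)) / n
    return abs(xbar - mu)  # the sup over Theta = [0, 1] is attained at theta = 1

small_n, large_n = sup_norm(100), sup_norm(100_000)
print(small_n, large_n)  # the sup-norm is much smaller for the larger sample
```

For large n the graph of Tn lies inside any fixed ε-sleeve around T with overwhelming probability, which is exactly the content of Definition 2.7.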
Note that each of the above-defined modes of stochastic convergence can also be rewritten
in the following manner:
Xn −→a.s. X ⇔ Xn −→ X, as n → ∞, a.s.; (2.2)
Xn −→Lr X ⇔ E[‖Xn − X‖^r] −→ 0, as n → ∞; (2.3)
Xn −→p X ⇔ P[‖Xn − X‖ > ε] −→ 0, as n → ∞, ∀ε > 0. (2.4)
Similarly, in what concerns convergence in distribution, we have
Xn −→D X ⇔ Fn(x) −→ F(x), as n → ∞, (2.5)
for each point of continuity x of the cumulative distribution function F. Further, note
that
‖Tn − T‖∞ = sup_{θ∈Θ} |Tn(θ) − T(θ)| −→p 0, (2.6)
is equivalent to saying that Tn converges uniformly in probability to T.
We now introduce the stochastic counterparts of the ‘little-oh’ and ‘big-oh,’ often
referred to in the literature as the stochastic orders of random variables (vectors).
Definition 2.8. Consider a sequence of random vectors Xn and a sequence of random
variables Yn.
1. We say that Xn = O(Yn), a.s., if
P[‖Xn‖ = O(|Yn|)] = 1.
2. We say that Xn = o(Yn), a.s., if
Xn/Yn −→a.s. 0.
3. We say that Xn = Op(Yn), if
∀ε > 0, ∃Cε > 0 : sup_n P[‖Xn‖ ≥ Cε|Yn|] < ε.
4. We say that Xn = op(Yn), if
Xn/Yn −→p 0.
These definitions warrant some remarks. As mentioned above, xn = O(1) means that the
sequence xn is bounded. In the same spirit, Xn is said to be stochastically bounded
(or bounded in probability) if Xn = Op(1). Note that if Xn is bounded in the sense
that
∃M > 0 : P[‖Xn‖ ≤M ] = 1, ∀n ∈ N,
then it is trivially stochastically bounded. Note however, that in general the converse is
not true.
In the following, we provide some well-known properties of the stochastic ‘little-oh’ and
‘big-oh’ (see e.g. van der Vaart [55], pp. 12–13):
op(1) + op(1) = op(1);
op(1) + Op(1) = Op(1);
Op(1) op(1) = op(1);
(1 + op(1))^{−1} = Op(1);
op(Xn) = Xn op(1);
Op(Xn) = Xn Op(1);
op(Op(1)) = op(1).
In the next lemma we establish a connection between the classical calculus ‘little-oh’
and its stochastic counterpart.
Lemma 2.9.
Let Q be a function defined on Rk, such that Q(0) = 0. Consider a sequence of random
k-vectors Xn such that Xn = op(1). If Q(x) = o(‖x‖^r) as ‖x‖ → 0, then
Q(Xn) = op(‖Xn‖^r), for any given r > 0.
Proof. In this proof we will make use of the continuous mapping theorem (Theorem
2.20) which, for the sake of exposition, is only introduced later.2 The following auxiliary
function will also prove to be useful:
q(x) = (Q(x)/‖x‖^r) I(x ≠ 0),
2We suggest that the reader who is less acquainted with the continuous mapping theorem return to this proof after the introduction of Theorem 2.20.
where I(·) denotes the indicator function. Consider the function q(·) evaluated at Xn,
q(Xn) = (Q(Xn)/‖Xn‖^r) I(Xn ≠ 0).
Observe that by construction q(·) is continuous at 0. Hence, by Theorem 2.20 it holds
that
q(Xn) = op(1).
Consequently, we have
Q(Xn) = ‖Xn‖^r op(1) = op(‖Xn‖^r).
Even though equations (2.2) to (2.6) provide just a direct set of translations of the
definitions stated above, there are equivalent characterizations of the concepts of
stochastic convergence introduced above which can be more suitable in certain cases. The
next set of results is introduced in this spirit, providing alternative characterizations
of some of the modes of convergence defined above. In what concerns almost sure
convergence we have the following result.
Theorem 2.10.
Consider the sequence of random vectors Xn, and a random vector X. Xn converges
almost surely to X, if and only if
P[⋃_{m=n}^{∞} {‖Xm − X‖ > ε}] −→ 0, as n → ∞, ∀ε > 0.
Proof. See [48] p. 51.
Before we are ready to provide other equivalent characterizations, recall that the
characteristic function of X is defined as
Φ_X(t) = E[exp(i tᵀX)],
where i denotes the imaginary unit, i.e., i = √−1.
The following theorem sheds some light on the usefulness of the characteristic functions.
Theorem 2.11. (Lévy’s theorem)
Consider a random variable X, such that there exist k ∈ N and δ ∈ ]0, 1[ for which
E[|X|^{k+δ}] < ∞.
Then we have that
Φ_X(t) = 1 + Σ_{j=1}^{k} ((it)^j / j!) E[X^j] + R_k(t),
where i denotes the imaginary unit, and the remainder R_k(t) is bounded as follows:
|R_k(t)| ≤ c |t|^{k+δ} E[|X|^{k+δ}].
Proof. See [46] pp. 26–27.
As a consequence of this classical result, we have that each characteristic function
Φ_X(t) corresponds to one, and only one, cumulative distribution function F_X(x). The
next result highlights that characteristic functions can also be used to provide
equivalent characterizations of convergence in distribution.
Theorem 2.12.
Consider the sequence of random vectors Xn, and let Φ_{Xn} denote the sequence of
corresponding characteristic functions. The following condition is necessary and
sufficient for Xn −→D X:
Φ_{Xn}(t) −→ Φ_X(t), as n → ∞, ∀t ∈ Rk.
Proof. See [48] p. 57.
The next theorem establishes some further equivalent characterizations of convergence
in distribution. Claim 2 of the following theorem is frequently referred to in the
literature as the Cramér–Wold device.
Theorem 2.13.
Consider the sequence of random vectors Xn, and a random vector X. Each of the
following conditions is necessary and sufficient for Xn −→D X:
1. E[g(Xn)] −→ E[g(X)], as n → ∞, for every bounded continuous function g(·);
2. xᵀXn −→D xᵀX, ∀x ∈ Rk.
Proof. See [48] pp. 56–57.
One question that naturally arises is how the above-defined concepts of convergence
relate. We note that a partial answer to this question was already given by the
characterization of almost sure convergence provided in Theorem 2.10. In fact,
if a sequence of random vectors Xn converges a.s., then by Theorem 2.10, it holds that
for every ε > 0,
P[⋃_{m=n}^{∞} {‖Xm − X‖ > ε}] −→ 0, as n → ∞,
which implies that
P[‖Xn − X‖ > ε] −→ 0, as n → ∞,
and so a.s. convergence implies convergence in probability.
Theorem 2.14.
Consider the sequence of random vectors Xn and the random vector X. The following
conditions hold:
1. Almost sure convergence implies convergence in probability, i.e.:
Xn −→a.s. X ⇒ Xn −→p X.
2. Convergence in Lr (r > 0) implies convergence in probability, i.e.:
Xn −→Lr X ⇒ Xn −→p X.
3. Convergence in probability implies convergence in distribution, i.e.:
Xn −→p X ⇒ Xn −→D X.
Proof. Claim 1 was established above. For a proof of the remaining claims, see [48]
p. 53.
Figure 2.2: The logical relation between the stochastic convergence concepts introduced.
In Figure 2.2, we provide a schematic representation summarizing the previous theorem,
which establishes the logical relation between the stochastic convergence concepts
introduced.
The following result establishes that stochastic boundedness is a necessary condition for
convergence in distribution.
Theorem 2.15.
Let Xn be a sequence of random vectors. Convergence in distribution implies stochastic
boundedness, i.e.,
Xn −→D X ⇒ Xn = Op(1).
Proof. See [46] pp. 106–107.
Since all the remaining modes of stochastic convergence defined above (with the
exception of uniform convergence in probability) imply convergence in distribution, we
have the simple corollary that stochastic boundedness is necessary for each of the
above-defined modes of stochastic convergence. Note that this is in accordance with what
we have in the study of deterministic sequences in classical calculus. Hence, even though
the convergence and boundedness concepts are different in this context, we still have a
property of the type ‘convergence implies boundedness.’
It remains unanswered whether, under some special provisos, one can obtain a richer
portrait than the one provided in Figure 2.2. The next theorems shed some light on this
issue. We start by introducing the celebrated Skorohod’s theorem.
Theorem 2.16. (Skorohod’s theorem)
Consider the sequence of random vectors Xn and the random vector X. If Xn −→D X,
then there exist a sequence of random vectors Yn, and a random vector Y, such that
Φ_X(t) = Φ_Y(t), ∀t ∈ Rk,
Φ_{Xn}(t) = Φ_{Yn}(t), ∀t ∈ Rk, ∀n ∈ N.
Further, it holds that
Yn −→a.s. Y.
Proof. See Billingsley [6] pp. 399–402.
The next theorem introduces some results in the same spirit. We underscore the
importance of Claim 3 (see below), which is particularly useful for obtaining a prime
example of convergence in probability, namely Khintchine’s weak law of large numbers.
Theorem 2.17.
Consider the sequence of random vectors Xn and the random vector X. The following
conditions hold:
1. Suppose that for every ε > 0, we have
Σ_{i=1}^{∞} P[‖Xi − X‖ ≥ ε] < ∞. (2.7)
Then it holds that
Xn −→a.s. X.
2. Suppose that Xn −→p X. Then, there exists a subsequence {Xnj : j ∈ N}, such
that
Xnj −→a.s. X, as j → ∞.
3. Suppose that Xn −→D X, and that P[X = x] = 1, where x is a constant vector.
Then, it holds that
Xn −→p x.
4. Suppose that Xn −→D X. Then
E[‖Xn‖^r] −→ E[‖X‖^r] < ∞, as n → ∞,
if and only if ‖Xn‖^r is uniformly integrable, i.e.:
sup_n E[‖Xn‖^r I(‖Xn‖^r > t)] = o(1), as t → ∞, (2.8)
where I(·) denotes the indicator function.
Proof. See [48] pp. 53–54.
The last theorem warrants some discussion. Claim 1 establishes that as long as
P[‖Xn − X‖ ≥ ε] converges sufficiently fast to 0, the concepts of convergence in
probability and almost sure convergence are identical. Further, we note that if a
sequence of random vectors Xn verifies (2.7), then it is said to be completely
convergent. In this spirit,
Claim 1 of the latter theorem argues that complete convergence implies a.s. convergence.
In what regards Claims 2 and 3, these are partial converses to Theorem 2.14. Finally, in
Claim 4, we conclude that convergence in probability combined with uniform integrability
implies convergence in Lr.
We now provide an example of convergence in probability. For the sake of simplicity, we
restrict our attention to random variables rather than to random vectors.
Example 2.1. (Khintchine’s weak law of large numbers)
Consider a sequence of independent and identically distributed random variables Xk,
such that E[X1] = µ, and let X̄n = (1/n) Σ_{k=1}^{n} Xk denote the sample mean. We thus
have that
Φ_{X̄n}(t) = E[exp(it Σ_{k=1}^{n} Xk/n)] = Π_{k=1}^{n} E[exp(it Xk/n)] = [Φ_{X1}(t/n)]^n.
Now, notice that applying Lévy’s theorem to the preceding equality yields
Φ_{X̄n}(t) = [1 + (it/n) E[X1] + o(1/n)]^n,
thus implying that
Φ_{X̄n}(t) −→ exp(itµ), as n → ∞.
Hence, as a consequence of Claim 3 of Theorem 2.17, we have that
X̄n −→p µ.
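The conclusion of the example can be checked by simulation. The following is a Monte Carlo sketch (the distribution, ε, and replication counts are our own illustrative choices): the frequency of replications with |X̄n − µ| > ε shrinks towards 0 as n grows, which is exactly convergence in probability.

```python
import random

# Monte Carlo sketch of Khintchine's WLLN (illustrative toy choices): estimate
# P[|Xbar_n - mu| > eps] by the frequency of exceedances over many replications.
random.seed(2)
mu, eps, reps = 0.5, 0.05, 2000  # Uniform(0, 1) draws, so mu = 1/2

def exceedance_freq(n):
    count = 0
    for _ in range(reps):
        xbar = sum(random.random() for _ in range(n)) / n
        count += abs(xbar - mu) > eps
    return count / reps

f_small, f_large = exceedance_freq(10), exceedance_freq(1000)
print(f_small, f_large)  # the exceedance frequency collapses as n grows
```

For n = 10 the deviation |X̄n − µ| exceeds ε frequently, while for n = 1000 it essentially never does, in line with X̄n −→p µ.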
Given that in many circumstances of interest one cannot assume the random variables to
be identically distributed, the laws of large numbers stated above do not apply. To
overcome this barrier, the following weak and strong laws are suitably modified in order
to drop the requirement of ‘identical distribution’ of the random variables.
Theorem 2.18. Let Xn be a sequence of random variables.
1. (Weak law of large numbers)
Suppose that the following condition holds:
∃α ∈ [1, 2] : (1/n^α) Σ_{i=1}^{n} E[|Xi|^α] = o(1).
Then, it holds that
(1/n) Σ_{i=1}^{n} (Xi − E[Xi]) = op(1).
2. (Strong law of large numbers)
Suppose that the following condition holds:
∃α ∈ [1, 2] : Σ_{i=1}^{n} E[|Xi|^α]/i^α = O(1).
Then, it holds that
(1/n) Σ_{i=1}^{n} (Xi − E[Xi]) = o(1), a.s.
Proof. See [50] p. 65.
We now introduce a prime example of uniform convergence in probability. For such
purpose, let A(X, θ) be a matrix of functions of a realization of the p-vector X and the
parameter θ. The law of large numbers stated below will be particularly useful in the
next chapter. The statement of this result will make use of the norm defined in (2.1),
in order to give meaning to ‖A(·, ·)‖.
Theorem 2.19. (Uniform weak law of large numbers)
Let Θ be a compact subset of Rk which denotes a parameter set of interest, and let
Xn be a sequence of random p-vectors. Suppose that A(Xi,θ) is continuous for all
θ ∈ Θ, with probability one. Further, suppose that there exists a nonnegative function
d(·) defined on Rp, such that:
1. ‖A(X,θ)‖ ≤ d(X), ∀θ ∈ Θ;
2. E[d(X)] <∞.
Then, the following conditions hold:
1. E[A(X,θ)] is continuous;
2. sup_{θ∈Θ} ‖(1/n) Σ_{i=1}^{n} A(Xi, θ) − E[A(X, θ)]‖ = op(1).
Proof. See [30].
It is also possible to establish a uniform strong law of large numbers, under a set of
assumptions which somewhat resembles the proviso considered here (Theorem 6.13 in
[7]; for a proof, see pp. 172–174 of the same reference). The large sample result stated
in Theorem 2.19 will, however, prove to be sufficient for our purposes.
One issue of general interest is whether it is possible to assess the consistency of a
transformation of statistics whose large sample behavior is known. The next theorem
sheds some light on this issue.
Theorem 2.20. (Continuous mapping theorem)
Consider the sequence of random vectors Xn, and a constant vector x. Let g : Rk −→ R
be a function continuous at x. Then the following claims hold:
1. If Xn −→a.s. x, then g(Xn) −→a.s. g(x).
2. If Xn −→p x, then g(Xn) −→p g(x).
Proof. See [46], pp. 58–59.
Roughly speaking, the previous theorem ensures that a.s. convergence and convergence
in probability are preserved under continuous transformations. We will later introduce
a result (Sverdrup’s theorem) which mimics the latter theorem in what concerns
convergence in distribution.
2.3 Consistency of Point Estimators
We now introduce some of the most common concepts of consistency of point estimators.
Definition 2.21. Let θn be an estimator of the parameter θo.
1. We say that the estimator θn is strongly consistent, if
θn −→a.s. θo.
2. We say that the estimator θn is Lr-consistent, if
θn −→Lr θo.
3. We say that the estimator θn is weakly consistent, if
θn −→p θo.
4. Let xn be a sequence of real numbers such that xn > 0, ∀n ∈ N, and xn −→ ∞,
as n → ∞. We say that the estimator θn is xn-consistent, if
xn(θn − θo) = Op(1).
Remark 2.22. Note that, with the exception of xn-consistency, all the above definitions
can be restated through the use of the ‘oh’ notations introduced above. Hence θn is
strongly consistent if
θn − θo = o(1), a.s.,
Lr-consistent if
E[‖θn − θo‖^r] = o(1),
and weakly consistent if
θn − θo = op(1).
We emphasize that consistency is a mild reliability demand for an estimator: roughly
speaking, it states that in a large sample the estimator θn should be close to the true
parameter θo. The concept of consistency of an estimator is also compatible with the
basic idea that from more data we should, on average, be able to extract more
information regarding the unknown population about which we intend to infer.
2.4 Asymptotic Normality
From the inference standpoint, asymptotic normality is another desirable property an
estimator should have. In fact, asymptotic normality plays a central role in the
construction of confidence zones and hypothesis tests. The definition is given below.
Definition 2.23. An estimator θn is said to be asymptotically normal if there exist
an increasing function u(n) and a positive definite matrix Σ such that
u(n)(θn − θo) −→D N(0, Σ).
Remark 2.24. Some remarks about the foregoing definition:
• the variance Σ of the limiting distribution is referred to as the asymptotic variance
of θn;
• in many cases of practical interest, u(n) = √n.
In a large number of cases of interest, asymptotic normality of the estimator is achieved
by the construction of suitable decompositions of the sampling error (θn − θo). This
motivates the definition of an asymptotically linear estimator.
Definition 2.25. An estimator θn is asymptotically linear if there exists a function
ψ(x) such that
E[ψ(x)] = 0,
E[ψ(x)[ψ(x)]ᵀ] < ∞,
√n(θn − θo) = (1/√n) Σ_{i=1}^{n} ψ(Xi) + op(1).
Remark 2.26. The function ψ(x) is the so-called influence function, and it can be
used to assess the impact that a single observation can exert on the estimation, up to a
remainder which vanishes in probability.
The theorem stated below is perhaps the cornerstone of statistical asymptotic theory.
We take it as the prime example of the concept of convergence in distribution.
Theorem 2.27. (Classical central limit theorem)
Let Xn be a sequence of independent and identically distributed random vectors.
Further, let Σ = V[X1] < ∞. Then, the following large sample result holds:
(1/√n) Σ_{i=1}^{n} (Xi − E[X1]) −→D N(0, Σ).
Proof. See Billingsley [6].
Note that we have the following simple corollary to the central limit theorem:
√n(X̄n − E[X1]) = Op(1), where X̄n = (1/n) Σ_{i=1}^{n} Xi,
as a consequence of Theorem 2.15.
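A quick Monte Carlo sketch of the theorem (the Uniform(0, 1) distribution and all tuning constants are our own illustrative choices): the standardized sum (1/√n) Σ (Xi − 1/2) should be approximately N(0, 1/12), so its empirical mean and variance should be near 0 and 1/12, respectively.

```python
import math
import random

# Monte Carlo sketch of the classical CLT (illustrative toy choices): for i.i.d.
# Uniform(0, 1) draws, the standardized sum should be approximately N(0, 1/12).
random.seed(3)
n, reps = 200, 5000
draws = []
for _ in range(reps):
    s = sum(random.random() - 0.5 for _ in range(n))
    draws.append(s / math.sqrt(n))  # (1/sqrt(n)) * sum_i (X_i - E[X_1])

mean = sum(draws) / reps
var = sum((d - mean) ** 2 for d in draws) / reps
print(mean, var)  # mean near 0, variance near 1/12
```

The corollary above is also visible here: the standardized sums stay within a bounded range across replications, i.e. they are Op(1).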
In the sequel we provide an important result which is useful for deriving the asymptotic
distribution of a transformation of a statistic with a known asymptotic distribution. As
aforementioned, this theorem mimics the continuous mapping theorem stated above in
what concerns convergence in distribution.
Theorem 2.28. (Sverdrup’s theorem)
Let Xn be a sequence of random vectors, such that Xn −→D X. Further, let
g : Rk −→ R be a continuous function. Then, it holds that
g(Xn) −→D g(X).
Proof. See [46] p. 106.
This theorem implies that convergence in distribution is preserved under continuous
transformations. However, it remains unanswered how to obtain the convergence in
distribution of the sum and the product of random vectors. In this context, we have the
following result, which plays a central role in statistical asymptotic theory.
Theorem 2.29. (Slutsky’s theorem)
Let Xn and Yn be sequences of random p-vectors such that
Xn −→D X, Yn = op(1).
Consider further a sequence of random (w × p) matrices Wn such that
tr[(Wn − W)ᵀ(Wn − W)] = op(1),
where W is a nonstochastic matrix, and tr(·) denotes the trace operator. Then:
1. Xn + Yn −→D X;
2. WnXn −→D WX.
Proof. See [46] pp. 130–131.
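A standard use of Slutsky’s theorem is the studentized mean, which we sketch below with our own toy choices: the CLT term converges in distribution to a normal, the scale estimate converges in probability to a constant, and Slutsky’s theorem gives that the product is still asymptotically N(0, 1).

```python
import math
import random

# Illustrative sketch of Slutsky's theorem in action (toy choices throughout):
# sqrt(n)*(Xbar - mu)/s = [sigma/s] * [sqrt(n)*(Xbar - mu)/sigma], a product of a
# term converging in probability to 1 and a term converging in distribution to
# N(0, 1); the product should therefore have variance near 1.
random.seed(4)
n, reps = 400, 4000
tstats = []
for _ in range(reps):
    xs = [random.random() for _ in range(n)]       # Uniform(0, 1), mean 1/2
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)  # consistent for 1/12
    tstats.append(math.sqrt(n) * (xbar - 0.5) / math.sqrt(s2))

var = sum(t * t for t in tstats) / reps
print(var)  # near 1, the variance of the N(0, 1) limit
```

Note that replacing the unknown variance by a consistent estimate is legitimate precisely because of Claim 2 of the theorem.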
This chapter closes with the introduction of a tool of broad use in obtaining the
distribution of transformations, namely the δ-method.
Theorem 2.30. (δ-method)
Let Xn be a sequence of random vectors, and let Y be a random vector with distribution
N(0, Σ), such that
xn(Xn − x) −→D Y,
where x is a constant vector, and xn is a sequence of real numbers such that xn > 0,
∀n ∈ N, and xn −→ ∞, as n → ∞. Further, let g(·) be a continuously differentiable
function from Rk to R. Then, the following large sample result holds:
xn[g(Xn) − g(x)] −→D N(0, [Dg(x)]ᵀ Σ Dg(x)).
Proof. See Shao [48] p. 61.
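The δ-method can be checked numerically. The following is a sketch with our own toy instance (not the thesis’s): Xi ~ Uniform(0, 1), xn = √n, x = 1/2, g(x) = x², so Dg(1/2) = 1 and Σ = 1/12, and the limiting variance of √n(X̄n² − 1/4) should be [Dg(1/2)]² · 1/12 = 1/12.

```python
import math
import random

# Numerical sketch of the delta-method (illustrative toy setup): with
# X_i ~ Uniform(0, 1) and g(x) = x^2, sqrt(n)*(Xbar_n^2 - 1/4) is roughly
# N(0, [g'(1/2)]^2 * 1/12) = N(0, 1/12).
random.seed(5)
n, reps = 500, 4000
vals = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    vals.append(math.sqrt(n) * (xbar ** 2 - 0.25))

mean = sum(vals) / reps
var = sum((v - mean) ** 2 for v in vals) / reps
print(var)  # near [Dg(1/2)]^2 * 1/12 = 1/12
```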
Chapter 3
Extremum Estimators
In this chapter, we concern ourselves with the introduction of extremum estimators. In
this part of the thesis we address large sample properties of these estimators, and discuss
the provisos under which one can ensure consistency and asymptotic normality of this
class of estimators.
3.1 Introduction
A broad class of estimators can be obtained by solving an optimization problem of
interest. Such estimators are known as extremum estimators (see [1, 33–35]). One of
the advantages of considering this general class of estimators is that it is possible to
build an elegant asymptotic theory with a set of general results. In the sequel, let
Θ ⊂ Rk denote the parameter space. In the definition that follows, we formally define
an extremum estimator.
Definition 3.1. An estimator θn is an extremum estimator if there exists a function
Tn(θ), such that:
Tn(θn) = sup_{θ∈Θ} Tn(θ) + op(1). (3.1)
Note that an alternative definition, often considered in the literature, is to define an
extremum estimator as
θn = arg max_{θ∈Θ} Tn(θ).
From the conceptual standpoint, the definition provided in (3.1) is preferable, given
that it only demands that Tn(θn) be within op(1) of the global maximum of Tn(θ).
This overcomes the question of existence, and it is also more suitable for computational
purposes (cf. Andrews [3]). Notwithstanding, from the pragmatical stance, the latter
definition is preferred.
Note further that an M-estimator is an extremum estimator for which the estimator
objective function is given by a sample average, i.e., Tn(θ) takes the form
Tn(θ) = (1/n) Σ_{i=1}^{n} q(Xi, θ),
where q is some function of the data and the parameter. Here and below we will make use
of {Xi}_{i=1}^{n} to denote a random sample of size n. The Xi are assumed to be
distributed identically to the random variable x. In the sequel we provide some simple
examples of extremum estimators. Other examples can be found elsewhere (e.g. Newey and
McFadden [35]; Andrews [3]; Mexia and Corte Real [34]).
Example 3.1. (Ordinary least squares—simple linear model)
Consider the simple linear model
y = Xβo + ε, (3.2)
where y is the n-vector of observations, X is the design matrix of size n × k, βo is a
k-vector of unknown regression parameters, and ε is an (n × 1)-vector of unobserved
errors. The estimator objective function assigned to the Ordinary Least Squares (OLS)
estimator is given by
Tn(β) = −‖y − Xβ‖, β ∈ Θ = Rk.
The OLS estimator is thus defined as
βOLS = arg min_{β∈Rk} ‖y − Xβ‖.
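The extremum-estimator view of OLS can be made concrete with a short sketch (our own toy data and tuning constants, not the thesis’s computations): minimizing ‖y − Xβ‖² with a generic descent method applied to the estimator objective function recovers the closed-form normal-equations solution, here worked out by hand for k = 2.

```python
import random

# Sketch of OLS as an extremum estimator (illustrative toy data): gradient
# descent on ||y - X beta||^2 versus the closed-form normal-equations solution.
random.seed(6)
n = 100
X = [(1.0, random.random()) for _ in range(n)]  # intercept plus one regressor
y = [2.0 * a - 1.0 * b + random.gauss(0.0, 0.1) for a, b in X]

# Closed form: solve (X^T X) beta = X^T y with 2x2 Cramer's rule.
s11 = sum(a * a for a, _ in X); s12 = sum(a * b for a, b in X)
s22 = sum(b * b for _, b in X)
t1 = sum(a * yi for (a, _), yi in zip(X, y))
t2 = sum(b * yi for (_, b), yi in zip(X, y))
det = s11 * s22 - s12 * s12
beta_cf = ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)

# Gradient descent on the same objective, started from the origin.
b0, b1 = 0.0, 0.0
for _ in range(2000):
    g0 = g1 = 0.0
    for (a, c), yi in zip(X, y):
        r = yi - (b0 * a + b1 * c)
        g0 -= 2.0 * r * a
        g1 -= 2.0 * r * c
    b0 -= 1e-3 * g0
    b1 -= 1e-3 * g1

print(beta_cf, (b0, b1))  # the two solutions essentially coincide
```

Since ‖y − Xβ‖² is globally convex, any convergent descent scheme reaches the unique global optimum here; the concern raised in the thesis arises precisely when the objective is not so well behaved.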
We underscore the example that follows, given that it will be reconsidered in Chapter 5.
Example 3.2. (Maximum likelihood methods—mixed linear models)
The mixed linear model is an extension of the simple linear model (3.2) which accounts
for more than one source of error; for a reference see Christensen [13]. The model takes
the following form:
y = Xβo + Σ_{i=1}^{w−1} Xi ζi + ε, (3.3)
where (y, X, βo, ε) are defined as in (3.2), the Xi are design matrices of size n × ki,
and the ζi are ki-vectors of unobserved random effects. Following classical assumptions,
we take the random effects ζi to be independent and normally distributed with null mean
vectors and covariance matrices σ²_{oi} I_{ki}, for i = 1, . . . , w − 1. Further, we also
take ε to be normally distributed with null mean vector and covariance matrix σ²_{ow} In,
independently of the ζi, for i = 1, . . . , w − 1. The model has the following mean vector
and covariance matrix:
E[y|X] = Xβo, (3.4)
Σ_{σ²o} ≡ V[y|X] = Σ_{i=1}^{w−1} σ²_{oi} Xi Xiᵀ + σ²_{ow} In,
where σ²o ≡ [σ²_{o1} · · · σ²_{ow}]. Given the current framework, we have that
y|X ∼ N(Xβo, Σ_{σ²o}),
and thus the density for the model is given by
f_{y|X}(y) = exp(−(1/2)(y − Xβo)ᵀ Σ_{σ²o}⁻¹ (y − Xβo)) / √((2π)^n det(Σ_{σ²o})).
Now, let θ ≡ [βᵀ σ²]. The estimator objective function assigned to the maximum
likelihood estimator is given by the loglikelihood of the aforementioned mixed linear
model, i.e.:
Tn(θ) = Tn([βᵀ σ²]) = −(n/2) ln(2π) − (1/2) ln(det(Σ_{σ²})) − (1/2)(y − Xβ)ᵀ Σ_{σ²}⁻¹ (y − Xβ).
The maximum likelihood estimators of the true regression parameter and the model
variance components, respectively denoted by βML and σ²ML, are thus given by
[βᵀML σ²ML] = arg max_{[βᵀ σ²]} Tn([βᵀ σ²]) = arg max_{θ∈Θ} Tn(θ), (3.5)
where the parameter set Θ is a bounded subset of R^{k+w} which restricts the elements θi
to be nonnegative, for i = k + 1, . . . , k + w, i.e.
Θ ≡ {θ ∈ R^{k+w} : θi ≥ 0, i = k + 1, . . . , k + w}.
Example 3.3. (Generalized method of moments—simple linear model)
Consider the following alternative representation of the simple linear model stated
above:
yi = xiᵀβo + εi, i = 1, . . . , n.
Here yi and εi respectively denote the i-th elements of the vectors y and ε; further, xiᵀ
represents the i-th row of the design matrix X. This example puts forward another
important instance of extremum estimators—the so-called Generalized Method of Moments
(GMM) estimators.
estimators. In GMM-based estimators the objective function is given by
Tn(θ) =
∣∣∣∣∣∣∣∣∣∣ 1n
n∑i=1
g(yi,xi,θ)
∣∣∣∣∣∣∣∣∣∣W
=
[1
n
n∑i=1
g(yi,xi,θ)
]T
W
[1
n
n∑i=1
g(yi,xi,θ)
],
where ‖ · ‖W denotes the quadratic norm introduced in (2.1), g is a k-vector function,1
and W is a symmetric positive definite matrix (possibly dependent on the sample) of
size k × k; further details on this class of estimators, including large sample
properties, can be found in Hayashi [22].
The examples provided above evidence the breadth of the class of extremum estimators.
In the next section we provide some consistency results for extremum estimators.
3.2 Consistency Results for Extremum Estimators
Under a set of mild assumptions it is possible to characterize the large sample behavior
of the broad class of estimators defined in (3.1). In order to establish the consistency
of this broad class of estimators, we first introduce the provisos which will allow us
to do so. Here and in the sequel, a convention regarding the numbering of assumptions
will be made: an assumption numbered using roman numerals (e.g. I, II, . . .) denotes an
alternative setup to the one considered by the assumption with the same arabic numeral
(i.e. 1, 2, . . .). This convention will prove to be useful over the exposition.
1The k-vector function should obey certain orthogonality conditions which are not relevant for our purposes, and which can be found elsewhere [22].
3.2.1 Consistency Under Compactness Assumptions
We start by establishing the consistency of extremum estimators under a set of
assumptions which we state in the sequel. Note that here and below, the main goal will
be to provide a set of conditions under which we are able to assure that the maximizer
of the sequence of functions Tn converges either in probability or almost surely to the
maximizer of the estimator objective function T.
Assumption 1: Weierstrass Framework
• Θ ⊂ Rk is compact.
• T : Θ ⊂ Rk −→ R is continuous.
Later, we will consider an alternative to Assumption 1 (Assumption I). The reason why
we refer to this assumption as the Weierstrass framework is that this is the set of
assumptions under which Weierstrass’s theorem is classically derived. Namely,
Assumption 1 implies the existence of θ* ∈ Θ such that the following condition holds:
θ* ∈ arg max_{θ∈Θ} T(θ). (3.6)
Note that this simple consequence of the well-known Weierstrass’s theorem implies that
the possibly set-valued map arg max_{θ∈Θ} T(θ) is non-empty.
Further, we also have to assume that the extremum estimator objective function Tn
converges uniformly in probability to T, in the sense of Definition 2.7.
Assumption 2: Uniform Convergence in Probability
• ‖Tn − T‖∞ = sup_{θ∈Θ} |Tn(θ) − T(θ)| = op(1).
Note that the uniform weak law of large numbers presented in the previous chapter
(Theorem 2.19) provides a set of sufficient conditions implying Assumptions 1 and 2,
if the parameter set Θ is compact; other laws of large numbers suited for extremum
estimators are also known in the literature (see e.g. Mexia and Corte Real [33]).
Later we will consider the alternative assumptions of a.s. uniform convergence
(Assumption II) and pointwise convergence in probability (Assumption II*).
In order to be able to identify the true parameter θo, we will have to assume that
arg max_{θ∈Θ} T(θ) is not set-valued, thus implying that T(θ) has a unique maximum.
Assumption 3: Identification
• θo = arg max_{θ∈Θ} T(θ).
It turns out that the Weierstrass framework, combined with uniform convergence in
probability and the identification assumption, is sufficient for the weak consistency of
the class of estimators defined in (3.1). In fact, under these conditions the argument
of the maximum of the sequence of functions Tn converges in probability to the maximizer
of the estimator objective function T. This point is formalized in the theorem that
follows, and in Figure 3.1 we give a graphical representation that provides some insight
into this result.
Figure 3.1: Assumptions (1, 2, 3) imply consistency of the extremum estimator.
Theorem 3.2. (Weak consistency of extremum estimators)
Suppose that assumptions (1, 2, 3) hold. Then,
θn − θo = op(1).
Proof. cf. [35], pp. 2121–2122.
It is easy to verify that Theorem 3.2 carries over to the ‘min’ case simply by replacing
T by −T .
In order to move towards a strong consistency result, we will have to focus on a more
demanding proviso.
Assumption II: a.s. Uniform Convergence
• ‖Tn − T‖∞ = sup_{θ∈Θ} |Tn(θ) − T(θ)| = o(1), a.s.
If we now replace Assumption 2 by this stronger form of uniform convergence, we get the
following strong consistency result.
Theorem 3.3. (Strong consistency of extremum estimators)
Suppose that assumptions (1,II,3) hold. Then,
θn − θo = o(1), a.s.
Proof. cf. [35], pp. 2121–2122.
The foregoing theorem ensures that every extremum estimator will be strongly consis-
tent, as long as the aforementioned requirements are fulfilled. Note that this result is
indeed quite general, given that it provides a set of sufficient conditions under which
the broad class of estimators defined in (3.1) converges a.s. to θo, the true value of the
relevant parameter.
In the next subsection we focus our attention on a well-known particular case of
extremum estimators, namely maximum likelihood methods.
3.2.2 Consistency for Maximum Likelihood Methods
Maximum Likelihood Estimation (MLE) methods are among the main standard
techniques for yielding parameter estimates of a statistical model of particular
interest. Large sample results for this M-estimation methodology were established long
ago in the literature (cf. Wald [57]). We emphasize that it is not our intention to
address here general consistency results for maximum likelihood methods, but rather to
provide a characterization of consistency in the style of the previous section. In this
spirit, we will illustrate in the sequel how we can rely on a proviso more suited to the
particular case of maximum likelihood methods.
For such purpose, suppose that the following identification property holds:
θ ≠ θo ⇒ f(x|θ) ≠ f(x|θo),
where θo is the true value of the parameter. As a consequence of the strict Jensen
inequality, we can establish that
T(θo) − T(θ) = E[ln f(x|θo) − ln f(x|θ)]
= E[−ln(f(x|θ)/f(x|θo))]
> −ln E[f(x|θ)/f(x|θo)]
= −ln ∫ (f(x|θ)/f(x|θo)) f(x|θo) dx
= 0.
Thus, we have established the following result.
Theorem 3.4.
Suppose that the following conditions are verified:
1. θ ≠ θo ⇒ f(x|θ) ≠ f(x|θo);
2. E[|ln f(x|θ)|] < ∞.
Then it holds that
θo = arg max_{θ∈Θ} T(θ),
where T(θ) = E[ln f(x|θ)].
Proof. See above.
The last theorem states sufficient conditions for Assumption 3, made above. In Figure
3.2 we provide a graphical representation which is useful for getting a clear portrait
of the reasoning leading to Theorem 3.4.
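The identification argument behind Theorem 3.4 can also be sketched numerically for a toy instance of our own choosing, a N(θo, 1) model: the sample analogue of T(θ) = E[ln f(x|θ)] is maximized, over a grid of candidate values, at (essentially) the true θo.

```python
import math
import random

# Monte Carlo sketch of the identification result for a N(theta_o, 1) model
# (illustrative toy instance): the sample analogue of E[ln f(x|theta)] peaks
# at the grid point closest to theta_o.
random.seed(8)
theta_o, n = 1.0, 20_000
xs = [random.gauss(theta_o, 1.0) for _ in range(n)]
const = -0.5 * math.log(2.0 * math.pi)

def T_hat(theta):
    # sample analogue of E[ln f(x|theta)] for the N(theta, 1) density
    return const - 0.5 * sum((x - theta) ** 2 for x in xs) / n

grid = [i / 10.0 for i in range(-10, 31)]  # candidate thetas from -1.0 to 3.0
best = max(grid, key=T_hat)
print(best)  # the grid point closest to theta_o = 1.0
```

This is precisely the picture of Figure 3.2: T(θ) separates θo from every θ ≠ θo, which is what Assumption 3 requires.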
Theorem 3.5. (Consistency of maximum likelihood methods)
Suppose that the following conditions hold:
1. Θ is a compact set;
2. θ ≠ θo ⇒ f(x|θ) ≠ f(x|θo);
3. ln f(x|θ) is continuous for every θ ∈ Θ, with probability one;
4. E[‖ln f(x|θ)‖∞] < ∞.
Figure 3.2: The picture behind Theorem 3.4
Then it holds that
θn − θo = op(1),
where
θo = arg max_{θ∈Θ} E[ln f(x|θ)].
Proof. See [35], p. 2131.
3.2.3 Consistency Without Compactness Assumption
The hypothesis of compactness included in Assumption 1 is somewhat restrictive. Since
we are considering the parameter set Θ to be a subset of Rk, compactness of the
parameter set is equivalent to Θ being closed and bounded. Thus, the compactness
condition imposes either that bounds on the true parameter θo are known, or at least
that their existence be somehow assured.
As an alternative to Assumption 1, we now consider Assumption I.
Assumption I
• Θ ⊂ Rk is convex.
• T : Θ ⊂ Rk −→ R is concave.
From the inspection of the latter assumption, we can notice that the compactness
assumption was abandoned in favor of convexity of the parameter space Θ. Further,
whereas in Assumption 1 we required the estimator objective function T to be continuous,
we now demand that it be concave.
Under this framework, we can consider the following weaker form of convergence in
probability.
Assumption II*: Pointwise Convergence in Probability
• Tn(θ)− T (θ) = op(1), ∀ θ ∈ Θ.
Further, we also assume that the true parameter θo is interior to the parameter set Θ.
Assumption 4: Interior Parameter
• θo ∈ int(Θ), where int(·) denotes the interior of a set.
We emphasize that Assumption 4 is a standard assumption made in the literature in
order to obtain the asymptotic distribution of an estimator. This assumption will be
dropped later.
Theorem 3.6. (Consistency without compactness requirements)
Suppose that assumptions (I, II*, 3, 4) hold. Then, with probability approaching one, there exists θn such that

θn − θo = op(1).
Proof. See [35], p. 2133.
In the next section we will address the issue of convergence in distribution of extremum
estimators.
3.3 Convergence in Distribution of Extremum Estimators
3.3.1 Interior Point Proviso
In this subsection we are particularly interested in establishing convergence in distribution results for extremum estimators, under the interior point condition of Assumption 4.
Further, we will have to assume weak consistency of the extremum estimator. Note that, as a consequence of Theorem 3.2, assumptions (1, 2, 3) are sufficient for Assumption 5; other sufficient conditions for Assumption 5 can be found elsewhere (Andrews [3]).
Assumption 5 - Weak Consistency
• θn = θo + op(1).
Here and in the sequel, we will make use of the following notation:

DT_n(θ) = ∂T_n(θ)/∂θ = [∂T_n/∂θ_1  ∂T_n/∂θ_2  · · ·  ∂T_n/∂θ_k]^T,

and D²T_n(θ) = ∂²T_n(θ)/∂θ∂θ^T denotes the k × k Hessian matrix, whose (i, j)-th entry is ∂²T_n/∂θ_i∂θ_j, that is,

D²T_n(θ) =
[ ∂²T_n/∂θ_1²     ∂²T_n/∂θ_1∂θ_2  · · ·  ∂²T_n/∂θ_1∂θ_k ]
[ ∂²T_n/∂θ_2∂θ_1  ∂²T_n/∂θ_2²     · · ·  ∂²T_n/∂θ_2∂θ_k ]
[       ⋮               ⋮          ⋱           ⋮        ]
[ ∂²T_n/∂θ_k∂θ_1  ∂²T_n/∂θ_k∂θ_2  · · ·  ∂²T_n/∂θ_k²    ].
Assumption 6: Regularity Conditions

• ∃ δ > 0 such that T_n is twice differentiable in B_δ(θo) ≡ {θ ∈ Θ : ‖θ − θo‖ < δ};

• √n DT_n(θo) →_D N(0, Σ);

• There exists a continuous function H(·) such that

sup_{θ∈B_δ(θo)} ‖D²T_n(θ) − H(θ)‖ →_p 0.
The aforementioned assumptions will now allow us to establish the asymptotic normality
of extremum estimators.
Theorem 3.7. (Asymptotic normality of extremum estimators)
Suppose that assumptions (4,5,6) hold. Then
√n(θn − θo) →_D N(0, H⁻¹ΣH⁻¹).
Proof. See [35], p. 2143.
Note that, bearing in mind the prior comment regarding the sufficiency of (1, 2, 3) for Assumption 5, the latter theorem could also have been stated using assumptions (1, 2, 3, 4, 6).
As aforementioned, Assumption 4 is a standard assumption made in the literature for obtaining the asymptotic distribution of an estimator. There is, however, a substantial number of cases of interest in which the true parameter θo lies on the boundary ∂(Θ) of the parameter space Θ.²
3.3.2 Asymptotic Normality for Maximum Likelihood Methods
Similarly to what we have done previously, we now restrict our attention to the particular case of maximum likelihood methods. The following theorem is a version of the general Theorem 3.7, using a set of assumptions which are more suitable for maximum likelihood methods.
Theorem 3.8. (Asymptotic normality for maximum likelihood methods)
Consider a sequence of independent and identically distributed random vectors {X_i}_{i=1}^n. Suppose that the conditions of Theorem 3.5 hold. Further, suppose that the following conditions are verified:
1. θo ∈ int(Θ);
2. f(x|θ) is twice continuously differentiable and there exists δ > 0 such that f(x|θ) > 0 for all θ ∈ B_δ(θo), where B_δ(θo) ≡ {θ ∈ Θ : ‖θ − θo‖ < δ};

² Such an occurrence takes place, for instance, when one intends to test the nullity of the variance components σ²_o ≡ [σ²_{o1} · · · σ²_{ow}] in a mixed linear model (5.1), using the standard likelihood-ratio test. In fact, under such circumstances, the parameter space is A = {σ²_o ∈ R^w : σ²_{oi} ≥ 0, i = 1, . . . , w}, so 0_w ∈ ∂(A) and thus 0_w ∉ int(A); Stram and Lee [51] provide an insightful discussion of this issue.
3. The following bounding conditions are satisfied:

∫ sup_{θ∈B_δ(θo)} ‖Df(x|θ)‖ dx < ∞;   ∫ sup_{θ∈B_δ(θo)} ‖D²f(x|θ)‖ dx < ∞;

4. J ≡ E[D ln f(x|θo) (D ln f(x|θo))^T] is nonsingular;

5. E[ sup_{θ∈B_δ(θo)} ‖D² ln f(x|θ)‖ ] < ∞.
Then

√n(θn − θo) →_D N(0, J⁻¹).
Proof. See [35].
This theorem requires some discussion regarding the assumptions made, in comparison with the hypotheses of the general result provided by Theorem 3.7. First, notice that, similarly to what was done in the general result 3.7, we are also considering the true parameter θo to be an interior point of the parameter space Θ. Further, note that since we are assuming the conditions of Theorem 3.5, we have consistency of the ML estimator, so Assumption 5 of the general result 3.7 holds. All the remaining conditions are sufficient for Assumption 6.
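For a concrete feel for Theorem 3.8, consider the exponential model f(x|θ) = θe^{−θx}, for which the ML estimator is 1/x̄ and J = 1/θ²o, so that √n(θn − θo) →_D N(0, θ²o). The Monte Carlo sketch below is merely an illustration of this prediction, not part of the theory; the sample size, replication count and function names are arbitrary choices.

```python
import math
import random

def exp_mle_draws(theta_o, n, reps, seed=1):
    """Monte Carlo draws of sqrt(n) * (theta_hat - theta_o) for the
    exponential model f(x|theta) = theta * exp(-theta x), whose ML
    estimator is theta_hat = 1 / (sample mean)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(theta_o) for _ in range(n)) / n
        draws.append(math.sqrt(n) * (1.0 / xbar - theta_o))
    return draws

if __name__ == "__main__":
    z = exp_mle_draws(theta_o=2.0, n=200, reps=1000)
    m = sum(z) / len(z)
    v = sum((x - m) ** 2 for x in z) / len(z)
    # Theorem 3.8 predicts a limiting N(0, J^{-1}) law, with
    # J^{-1} = theta_o^2 = 4 in this model
    print(round(m, 2), round(v, 2))
```

The empirical mean and variance of the draws should be close to 0 and θ²o, respectively, up to Monte Carlo error.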
In the next subsection, we abandon Assumption 4 and move again to a more general framework. The exposition of the next subsection is largely inspired by the prominent work of Andrews [3].
3.3.3 Boundary Point Proviso
In the prior subsections we have restricted our attention to establishing convergence in distribution results for extremum estimators under the assumption that the true parameter θo is an interior point of the parameter space Θ. In this subsection we allow the true parameter θo to be a closure point. Thus, the true parameter can now be either in the interior of the parameter set, int(Θ), or on the boundary of the parameter set, ∂(Θ). Here and in the sequel, the idea will be to approximate the parameter space Θ by a cone Λ, in the sense of Definition 2.2. This type of approximation allows the boundary of the parameter space ∂(Θ) to be linear or curved, and possibly to have kinks.
We thus have the following alternative to Assumption 4.

Assumption IV: Boundary Parameter

• θo ∈ cl(Θ) ≡ int(Θ) ∪ ∂(Θ), where cl(·) denotes the closure of a set, and ∂(·) denotes its boundary.

Obviously, even though we allow the true parameter θo to be an interior point, the main interest here lies in the cases wherein θo is on the boundary of the parameter set.
We consider the case where the estimator objective function T_n(θ) has a quadratic expansion around the true parameter θo.

Assumption 7: Quadratic Expansion with Bounded Remainder

• T_n(θ) has a quadratic expansion around θo,

T_n(θ) = T_n(θo) + [DT_n(θo)]^T(θ − θo) + (1/2)(θ − θo)^T D²T_n(θo)(θ − θo) + R_n(θ),   (3.7)

where

sup_{θ∈Θ: ‖B_n(θ−θo)‖≤γ} |R_n(θ)| = op(1),

and B_n is a sequence of matrices such that λ_min(B_n) → ∞ as n → ∞, where λ_min(·) denotes the smallest eigenvalue.
For the current purposes, the quadratic expansion of the estimator objective function around the true parameter can be restated in a more suitable manner. This is established in the following lemma.
Lemma 3.9. The quadratic expansion stated in equation (3.7) can be rewritten as

T_n(θ) = T_n(θo) + (1/2) q_n(0) − (1/2) q_n(B_n(θ − θo)) + R_n(θ),   (3.8)

where

q_n(λ) ≡ (λ − Z_n)^T F_n (λ − Z_n),  λ ∈ Rk,   (3.9)

and

F_n ≡ −[B_n⁻¹]^T D²T_n(θo) B_n⁻¹,   Z_n ≡ F_n⁻¹ [B_n⁻¹]^T DT_n(θo).
Proceeding as in the previous sections, we now state the remaining assumptions under
which it will be possible to obtain the large sample results of interest. We start with the
introduction of some further regularity conditions for the quadratic expansion provided
in (3.8).
Assumption 8

• Assume that the following large sample results hold:

[B_n⁻¹]^T DT_n(θo) →_D G,   F_n →_D F,

where G is a random k-vector, and F is a symmetric k × k matrix which is nonsingular with probability one.
Note that, as a consequence of Theorem 2.15, the latter condition can also be interpreted
as a bounding condition. Further, given Assumption 8, we are able to define a limiting
version of (3.9), given by

q(λ) ≡ (λ − Z)^T F (λ − Z),  λ ∈ Rk,   (3.10)

where Z ≡ F⁻¹G, and F and G follow from Assumption 8.
Further, we have to impose another boundedness condition on the argument of q_n in the quadratic expansion (3.8). Namely, in Assumption 9, we will assume B_n(θn − θo) to be bounded in probability.
Assumption 9: Stochastic Boundedness Condition
• Bn(θn − θo) is stochastically bounded, i.e.
Bn(θn − θo) = Op(1).
The next assumption is concerned with the local approximation of the parameter space by a cone. Here and in the sequel, by the shifted and rescaled parameter space we will mean the sequence of sets

B_n(Θ − θo)/b_n = { B_n(θ − θo)/b_n : θ ∈ Θ },  n ∈ N,

where b_n is a sequence of real numbers such that b_n → ∞ as n → ∞, and

b_n ≤ c λ_min(B_n),

for some positive real number c.
The following assumption makes use of the generalization of the definition of local ap-
proximation by a cone, which was introduced in Definition 2.4.
Assumption 10: Local Approximation by a Cone
• The sequence of sets Bn(Θ− θo)/bn is locally approximated by a cone Λ.
Before stating the last assumption of interest, consider the following possibly set-valued mappings:

λ_n ⇒ arg inf_{λ∈cl(Λ)} q_n(λ),   (3.11)

where q_n(λ) is defined as in (3.9). Similarly, consider the limiting version of λ_n,

λ ⇒ arg inf_{λ∈cl(Λ)} q(λ),   (3.12)

where q(λ) is defined as in (3.10).
In order to ensure that the aforementioned mappings are not set-valued, we will have to assume convexity of the cone Λ used to approximate the shifted and rescaled parameter space.

Assumption 11: Convexity of the Cone

• Λ is convex.

As a consequence of the latter assumption, we are able to define

λ_n ≡ arg inf_{λ∈cl(Λ)} q_n(λ)

instead of (3.11). Similarly, we are able to define

λ ≡ arg inf_{λ∈cl(Λ)} q(λ)

instead of (3.12).
The assumptions stated above allow us to reach the following large sample result.
Theorem 3.10.
Suppose that assumptions (IV, 5, 7, 8, 9, 10, 11) hold. Then:

1. B_n(θn − θo) − λ_n = op(1);
2. λ_n →_D λ;
3. B_n(θn − θo) →_D λ.
Proof. See [3] pp. 1378–1379.
Thus far we have focused our attention on the large sample properties of the broad class of estimators defined in equation (3.1). We have not yet commented on the computational aspects of extremum estimators. This issue will be addressed in the next chapter.
Chapter 4
Global Optimization by
Stochastic Search Methods
4.1 Introduction
As discussed in the previous chapter, any estimator which can be formulated through an optimization problem of interest is tantamount to an extremum estimator. Thus far we have concerned ourselves with the circumstances under which it is possible to establish the consistency and asymptotic normality of such estimators. We have not yet focused on the computational aspects of extremum estimators. In fact, in a plurality of cases of practical interest, these estimators are not analytically tractable, and so we frequently lack a closed-form solution for obtaining the estimates. A nice overview of some standard numerical procedures which can be employed to compute the estimates can be found, for instance, in Hayashi [22] and Judge et al. [24]. Given that the quest for an input value which fulfills some determined output criterion is a problem of interest in a wide variety of scenarios, there is nowadays a broad class of methods available. Notwithstanding, when choosing which algorithm to rely on, one should avoid iterative maximization procedures which can converge to a local solution. We emphasize that the fact of an extremum estimator being achieved as a global solution to a determined optimization problem has deep implications regarding consistency. As becomes clear from inspection of the large sample results stated in the previous chapter, consistency is only ensured to hold for the global solutions. This point is peculiarly important when the optimization problem at hand is analytically intractable.
Broadly speaking, two types of numerical procedures are typically adopted to tackle such a problem, namely deterministic and stochastic optimization algorithms. The former include the Newton–Raphson method and the steepest descent method, among many others (see [36] and references therein). In this thesis the focus will, however, be placed on stochastic optimization algorithms. These include the pure random search (Solis and Wets [49]), the simulated annealing technique (see Bohachevsky et al. [5]), the conditional martingale algorithm (Esquível [17]), etc. As mentioned in Chapter 1, we follow Spall [50] in referring to stochastic search and optimization algorithms in the following terms.
Stochastic Search and Optimization
I. There is some random noise in the measurements of Tn(θ); (and/or)
II. There is a random choice in the search direction as the algorithm iterates
toward a solution.
In this thesis we are exclusively concerned with item II, and hence the focus is placed
on optimization algorithms wherein the search direction is randomly dictated.
In this chapter, we contribute by proposing a master method of which several stochastic optimization algorithms are particular cases. The generality of the proposed method is broad enough to include the conceptual algorithm of Solis and Wets [49] as a particular case. Furthermore, we also establish the convergence of the proposed method under
a set of fairly mild assumptions. An important instance of this method, to which we also devote some attention, is given by the stochastic zigzag method, an algorithm largely inspired by the prominent work of Mexia et al. [32]. We point out that even though the crux of our analysis lies in the optimization problem

max_{θ∈Θ} T_n(θ),   (4.1)

for a fixed n, the procedures developed here carry over mutatis mutandis to other unconstrained optimization problems of interest.
The remainder of this chapter is organized as follows. In the next section we provide an overview of random search techniques as a starting point towards our method. In §4.3 we recast the meta-approach of Solis and Wets [49] through the introduction of a master method which includes several other stochastic optimization algorithms as particular cases. The convergence of the master method is established in §4.4. Finally, in §4.5 we offer a short note regarding the construction of confidence intervals for the maximum, departing from the general algorithm previously introduced.
4.2 An Overview of Random Search Techniques
Suppose that one has available a random sample of size n from a population of interest. With such a sample at hand, we intend to solve the optimization problem (4.1), as a means to obtain estimates of θo. It is worth noting that, from the conceptual standpoint, for a fixed n, one can also think of the graph of T_n as a population of interest from which one intends to consistently estimate the parameters¹

( arg max_{θ∈Θ} T_n(θ),  max_{θ∈Θ} T_n(θ) ).

In order to do so, suppose that we collect a random sample {(θ_i, T_n(θ_i))}_{i=1}^p from such a population. Hence, for each sampled value θ_i, we also inquire its corresponding image value T_n(θ_i). Assume that such a sample is collected sequentially and that during each extraction period we compute

θ̄_i = θ_0 ⇐ i = 0;   θ̄_i = θ̄_{i−1} I{T_n(θ̄_{i−1}) ≥ T_n(θ_i)} + θ_i I{T_n(θ̄_{i−1}) < T_n(θ_i)} ⇐ i ∈ N.   (4.2)
As we shall see below, the procedure described above contains the essence of the classical
pure random search algorithm.
Classical Random Search Algorithm

1. Choose an initial value of θ, say θ_0 ∈ Θ, either randomly or deterministically. Set i = 0 and θ̄_0 = θ_0.

2. Generate a new independent value θ_{i+1} from a probability distribution f, with support over Θ. If T_n(θ_{i+1}) > T_n(θ̄_i), set θ̄_{i+1} = θ_{i+1}. Else set θ̄_{i+1} = θ̄_i.
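The two steps above can be sketched as a short routine; the toy objective and the uniform proposal below are illustrative choices, not part of the algorithm itself.

```python
import random

def pure_random_search(T, sample, iters, seed=0):
    """Classical pure random search: keep theta_bar, the best draw
    seen so far, replacing it whenever a fresh draw improves T."""
    rng = random.Random(seed)
    theta_bar = sample(rng)          # Step 1: initial value
    for _ in range(iters):           # Step 2, repeated
        cand = sample(rng)
        if T(cand) > T(theta_bar):
            theta_bar = cand
    return theta_bar

if __name__ == "__main__":
    # toy objective with a global maximum at theta = 1, over [-4, 4]
    T = lambda th: -(th - 1.0) ** 2
    best = pure_random_search(T, lambda rng: rng.uniform(-4, 4), 5000)
    print(round(best, 2))
```

With 5000 uniform draws over [−4, 4], the returned point is, with very high probability, close to the global maximizer.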
The convergence of the algorithm stated above was established in the seminal work of Solis and Wets [49]. The crux of their work lies in the introduction of a conceptual algorithm which includes, among others, the algorithm stated above. In fact, as a means to shed some light on some standard variants of the classical random search algorithm, observe that:
- other types of processes can be used in lieu of (4.2);
- independence in the choice of the values of θ_i is sometimes dropped;
- the probability distribution f can be allowed to have support defined over Rk ⊇ Θ.²
¹ Recall that the graph of T is defined as gr(T) = {(θ, T(θ)) : θ ∈ Θ}.
² Obviously, due adaptations are entailed, otherwise some hindrances can arise in Step 2. For instance, the lapse of such modifications may preclude the computation of the image for certain values of θ which are not included in the domain of T.
4.3 Recasting the Solis and Wets Framework
4.3.1 Preliminaries and Notation
A Brief Note on Notation
Here and below we make use of the following shorthand notation:

∇(t) = {θ ∈ Θ : T_n(θ) < t},  ∇̄(t) = {θ ∈ Θ : T_n(θ) ≤ t},
∆(t) = {θ ∈ Θ : T_n(θ) > t},  ∆̄(t) = {θ ∈ Θ : T_n(θ) ≥ t}.
Further, we make use of the almost universal quantifier, which we introduce below.
Definition 4.1. Let P(x) denote a proposition which depends on a variable x taking values in a specified domain D. We say that gx P(x) if P(x) is true for all the values of D\N, where N is a null-measure set.
Remark 4.2. The universal quantifier ∀ and the existential quantifier ∃ are often used to select the elements of a set wherein some property holds. The almost universal quantifier g is introduced for selecting the elements of a set wherein some property holds outside a null-measure set. Whereas the universal quantifier ∀ should be read as "for every element", the almost universal quantifier should be read as "for almost every element".³ For the sake of illustration, using classical measure-theoretic notation we say that functions f and g are equivalent if

f = g, a.e.

With our notation we write

f(θ) = g(θ), gθ ∈ Θ.
In the sequel we introduce some definitions which are necessary for the presentation of the convergence result of a general algorithm stated below. First we introduce the concept of essential supremum, which is deeply related to the maximum. It turns out that the concept of essential supremum is more suited to computational purposes than the maximum itself. Second, we present the concept of optimality region.

Definition 4.3. Let T_n : Θ → R be a measurable function. The essential supremum is defined as

ess sup_{θ∈Θ} T_n(θ) ≡ inf{t : T_n(θ) ≤ t, gθ ∈ Θ}.
³ The logical grounds for this quantifier are far beyond the scope of this work. We leave this for future work.
Similarly, we define the essential infimum as

ess inf_{θ∈Θ} T_n(θ) ≡ sup{t : T_n(θ) ≥ t, gθ ∈ Θ}.
Remark 4.4. Observe that the essential supremum can be equivalently rewritten as

ess sup_{θ∈Θ} T_n(θ) = inf{t : λ(∆(t)) = 0} = sup{t : λ(∆(t)) > 0},   (4.3)

where λ(·) denotes the Lebesgue measure. Note that the last equality follows by a similar argument as in Williams [56] (p. 34), and by noting that for every positive h it holds that ∆(t + h) ⊆ ∆(t). By a similar reasoning, the essential infimum definition is tantamount to

ess inf_{θ∈Θ} T_n(θ) = inf{t : λ(∇(t)) > 0}.   (4.4)

This type of representation is actually preferred by Solis and Wets [49].
The concepts of essential supremum and essential infimum are also used in the context of stochastic differential equations (see, for instance, Øksendal [39]). To gain some insight into the mechanics of these definitions, replace the almost universal quantifier g in the definition of the ess sup by ∀. We would then be computing the infimum of the set

{t : T_n(θ) ≤ t, ∀θ ∈ Θ},

and this would yield the least majorant of the range of T_n. A similar reasoning applies when we consider the universal quantifier in lieu of the almost universal quantifier in the definition of the essential infimum.
For the sake of completeness, below we state some basic results regarding the essential supremum; the corresponding proofs are included in the appendix.

Fundamental Properties of the Essential Supremum

1. T_n(θ) ≤ ess sup_{x∈Θ} T_n(x), gθ ∈ Θ.
2. ess sup_{x∈Θ} (T_n(x) + R_n(x)) ≤ ess sup_{x∈Θ} T_n(x) + ess sup_{x∈Θ} R_n(x).
3. ess sup_{x∈Θ} T_n(x) ≤ sup_{x∈Θ} T_n(x).
It is important to underscore that all the results stated above are valid as long as the measurability of T_n and R_n is verified. As a means to ease notation, in the sequel we use τ̄ and τ to denote the essential supremum and the essential infimum, respectively. Let us observe that if the maximizer of T_n is unique, and T_n is continuous, the essential supremum τ̄ coincides with the maximum.
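A small numerical illustration of the gap between the supremum and the essential supremum, assuming Θ = [0, 1] and a function with a spike on a null set (all names below are illustrative):

```python
import random

def T(theta):
    # equals theta, except on the null set {0.5}, where it spikes to 7;
    # hence sup T = 7 while ess sup T = 1
    return 7.0 if theta == 0.5 else theta

def mc_ess_sup(T, draws=20000, seed=0):
    """Monte Carlo proxy for the essential supremum over [0, 1]: the
    largest value of T at uniformly drawn points, which almost surely
    never land on a null set."""
    rng = random.Random(seed)
    return max(T(rng.random()) for _ in range(draws))

if __name__ == "__main__":
    print(mc_ess_sup(T))   # close to 1, not 7
```

This also hints at why the essential supremum is the computationally relevant notion for random search: sampled points never see the null set.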
Theorem 4.5.
Let T_n be continuous and suppose that θn = arg max_{θ∈Θ} T_n(θ). Then the essential supremum τ̄ coincides with the maximum of the estimator objective function, i.e.

τ̄ = T_n(θn).

Proof. As a consequence of Property 3 (see above and Proposition A.4 in the appendix), we have that τ̄ ≤ T_n(θn). Thus, we only have to prove that τ̄ ≥ T_n(θn). Let ε > 0 be given. There exists θ_ε ∈ Θ such that

T_n(θn) − ε < T_n(θ_ε) ≤ T_n(θn).

Since we are assuming T_n to be continuous, there exists δ > 0 such that for every θ ∈ B_δ(θ_ε) ≡ {θ ∈ Θ : ‖θ − θ_ε‖ < δ} we still have

T_n(θn) − ε < T_n(θ) ≤ T_n(θn).

Consequently, we get

λ(∆(T_n(θn) − ε)) ≥ λ(B_δ(θ_ε)) > 0,

and so (4.3) implies that τ̄ ≥ T_n(θn) − ε. Since ε is arbitrary, we have τ̄ ≥ T_n(θn).
Figure 4.1: A sketch which portrays the reasoning involved in the proof of Theorem 4.5. This instance is used just to provide guidance, and it is not part of the proof.
We now formally define the concept of optimality zone.

Definition 4.6. Let τ̄ and τ denote the essential supremum and the essential infimum of T_n, respectively. The optimality zone for the maximand of T_n is given by the set-valued function O : R²₊ ⇒ Θ defined as

O_{ε,M} = {θ ∈ Θ : T_n(θ) > τ̄ − ε} ⇐ τ̄ ∈ R;   O_{ε,M} = {θ ∈ Θ : T_n(θ) > M} ⇐ τ̄ = +∞.

Similarly, we define the optimality zone for the minimand as

O_{ε,M} = {θ ∈ Θ : T_n(θ) < τ + ε} ⇐ τ ∈ R;   O_{ε,M} = {θ ∈ Θ : T_n(θ) < M} ⇐ τ = −∞.
Remark 4.7. Note that, making use of the shorthand notation defined above, we can restate the optimality zones as follows:

O_{ε,M} = ∆(τ̄ − ε) ⇐ τ̄ ∈ R, ∆(M) ⇐ τ̄ = +∞;   O_{ε,M} = ∇(τ + ε) ⇐ τ ∈ R, ∇(M) ⇐ τ = −∞.
It remains to discuss the magnitude of the optimality zone. Below we provide a rough upper bound for the measure of the optimality region of the minimand, which makes use of van der Corput's sublevel set bound (see Appendix B).⁴
Theorem 4.8.
Let Θ = [a, b] denote some parameter space of interest. Suppose that T_n : Θ → R₊ is k-times differentiable on int(Θ), with k ≥ 1, and that |T_n^{(k)}(θ)| ≥ ζ > 0. Then it holds that

λ(O_{ε,M}) ≤ c_k [ ((τ + ε)/ζ)^{1/k} I(τ ∈ R) + (M/ζ)^{1/k} I(τ = −∞) ],

where c_k = (k! 2^{2k−1})^{1/k}, and I(·) denotes the indicator function.

Proof. The proof follows from a direct application of van der Corput's sublevel set bound (see Appendix B). If τ ∈ R, then by the van der Corput sublevel set bound

λ(O_{ε,M}) ≤ c_k ((τ + ε)/ζ)^{1/k}.

Similarly, if τ = −∞, we have the sublevel set bound λ(O_{ε,M}) ≤ c_k (M/ζ)^{1/k}.
⁴ Some modern versions of this result can be found, for instance, in Rogers [44] and Carber et al. [10]. Note that this classical result still plays an active role in contemporaneous mathematical research.
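As a sanity check of the bound, under the reading c_k = (k! 2^{2k−1})^{1/k} adopted here, take T_n(θ) = θ² on Θ = [−1, 1], so that k = 2, ζ = 2 and τ = 0, and the optimality zone {θ : θ² < ε} has Lebesgue measure 2√ε. The sketch below is an illustration, not part of the proof:

```python
import math

def vdc_bound(eps, tau_inf, zeta, k):
    """Upper bound of Theorem 4.8 for the minimand optimality zone
    when the essential infimum is finite, reading the constant as
    c_k = (k! * 2**(2k - 1))**(1/k)."""
    c_k = (math.factorial(k) * 2 ** (2 * k - 1)) ** (1.0 / k)
    return c_k * ((tau_inf + eps) / zeta) ** (1.0 / k)

if __name__ == "__main__":
    # T_n(theta) = theta^2 on [-1, 1]: |T_n''| = 2, so k = 2 and
    # zeta = 2; the exact measure of the zone is 2 * sqrt(eps)
    for eps in (0.01, 0.1, 0.5):
        exact = 2.0 * math.sqrt(eps)
        assert exact <= vdc_bound(eps, tau_inf=0.0, zeta=2.0, k=2)
    print("bound holds")
```

Here the bound equals 2√2·√ε, slack by a factor √2 relative to the exact measure 2√ε, consistent with the remark below that the bound is rough.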
Even though appealing from the theoretical standpoint, this result is of limited use from a practical stance: even with strong requirements on the function T_n, we are unable to obtain an upper bound which is independent of τ.
The next definition closes our conceptual framework.
Definition 4.9. A function C : Θ × Rk → Θ is a compass function if

(T_n ∘ C)(θ_a, θ_b) ≥ T_n(θ_a), ∀(θ_a, θ_b) ∈ Θ × Rk;
(T_n ∘ C)(θ_a, θ_b) ≥ T_n(θ_b), ∀(θ_a, θ_b) ∈ Θ × Θ.

Example 4.1. A simple example of a compass function is given by the mapping

C(θ_a, θ_b) = θ_a I{θ_a ∈ ∆̄(T_n(θ_b))} + θ_b I{θ_a ∈ ∇(T_n(θ_b))}.

If we define θ̄_{i+1} = C(θ̄_i, θ_{i+1}), then it holds that

θ̄_{i+1} = θ̄_i I{T_n(θ̄_i) ≥ T_n(θ_{i+1})} + θ_{i+1} I{T_n(θ̄_i) < T_n(θ_{i+1})},

and so we recover the above-mentioned probabilistic recursive translation of the pure random search algorithm.
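In code, the compass of Example 4.1 amounts to a comparison of objective values; the sketch below (with an illustrative objective) mirrors the recursion θ̄_{i+1} = C(θ̄_i, θ_{i+1}):

```python
def compass(T, theta_a, theta_b):
    """The compass of Example 4.1: return whichever of the two points
    attains the larger objective value (ties keep theta_a)."""
    return theta_a if T(theta_a) >= T(theta_b) else theta_b

if __name__ == "__main__":
    T = lambda th: -abs(th)            # illustrative objective
    theta_bar = 3.0                    # current best point
    for cand in (2.0, 5.0, -1.0, 4.0):
        theta_bar = compass(T, theta_bar, cand)
    print(theta_bar)   # -1.0, the best candidate seen
```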
Even though the definition of compass function given above is for maximization problems, it can easily be adapted to minimization problems. It is worth mentioning that we refer to this function as the compass, since this is the mapping that guides the process of selection of the extremes.
Given that our main interest lies in the optimization problem max_{θ∈Θ} T_n(θ), for a fixed n, hereinafter we focus on maximization.
In the next section we introduce the master method, a broad algorithm which includes several other optimization algorithms.
4.3.2 The Master Method

We open this subsection with the introduction of a general method of which several other algorithms are particular cases. The modus operandi of such a method is given below.

Modus Operandi of the Master Method (c ∈ N)

0. Set i, j = 1. Find a, b ∈ Θ, and set θ_0 and θ̄_0 equal to arg max_{x∈{a,b}} T_n(x). Further, set z_1 and Z_{1,1} equal to arg min_{x∈{a,b}} T_n(x).

1. If c > 1, generate Z_{i,j} from the probability space (Rk, B(Rk), P_{i,j}), and set Z_{i,j+1} = Z_{i,j}. Else, go to Step 2.

2. If j < c − 1, increment j, and return to Step 1. Otherwise, set θ̄_i = C(θ̄_{i−1}, θ_i), where θ_i = arg max_{q∈{1,...,c}} T_n(Z_{i,q}), and set j = 1.

3. Generate z_i from the probability space (Rk, B(Rk), P_i), set Z_{i,1} = z_i, increment i and j, and return to Step 1.
Some comments concerning this general algorithm:

• The parameter c can be defined a priori by the user, and it can take any positive integer value. As a rule of thumb, we suggest taking c as random (e.g. drawn from a discrete uniform distribution U{1, . . . , k}).

• Observe that Step 0 simply initiates the algorithm. If we repeat Step 1 for a fixed i, we construct the iterates Z_{i,1}, Z_{i,2}, . . . , Z_{i,c−1}. In Step 2 we update the compass and obtain the next 'candidate' argument of the maximum, as proposed by the algorithm. The repetition of Step 3 yields z_2, z_3, etc.

• Here and below, we refer to each z_i as a seed. If the seeds are independent and identically distributed, we refer to the master method as pure. If the probability measure P_p depends on some probability measure(s) P_q, with q < p, then the master method will be called adaptive. Further, we refer to each Z_{i,j} as an iterate. For each i we will refer to the sequence Z_{i,1}, . . . , Z_{i,c} as a course. In light of this terminology, we can say that the consecutive repetition of Step 1 builds a course. Similarly, if we rerun Step 3 serially we obtain a sequence of seeds.

• The mechanics of the algorithm is perhaps better understood through the law of movement of the iterates, which can be written as

Z_{i,j} = z_i I(j = 1 ∨ c = 1) + Z_{i,j−1} I(j ∈ {2, . . . , c} ∧ c > 1).   (4.5)
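A minimal sketch of the master method for a real-valued parameter; the helper names draw_seed, course and fresh_course are illustrative, and with the fresh_course choice below the method reduces to pure random search:

```python
import random

def master_method(T, draw_seed, course, c, r, seed=0):
    """Sketch of the master method: r seeds, a course of c iterates
    per seed, and a compass update keeping the best point found.
    draw_seed(rng) plays the role of the measures P_i, while
    course(rng, theta_bar, z, c) builds the iterates Z_{i,1..c}."""
    rng = random.Random(seed)
    a, b = draw_seed(rng), draw_seed(rng)        # Step 0
    theta_bar = max((a, b), key=T)
    z = min((a, b), key=T)
    for _ in range(r):
        iterates = course(rng, theta_bar, z, c)  # Step 1
        cand = max(iterates, key=T)              # Step 2: compass update
        if T(cand) > T(theta_bar):
            theta_bar = cand
        z = draw_seed(rng)                       # Step 3: fresh seed
    return theta_bar

def fresh_course(rng, theta_bar, z, c):
    # with this course the method reduces to pure random search
    return [z] + [rng.uniform(-4, 4) for _ in range(c - 1)]

if __name__ == "__main__":
    T = lambda th: -(th - 1.0) ** 2   # toy objective, maximum at 1
    best = master_method(T, lambda rng: rng.uniform(-4, 4),
                         fresh_course, c=5, r=1000)
    print(round(best, 2))
```

Swapping the course generator is all that is needed to obtain other instances, such as the stochastic zigzag method of §4.3.3.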
To gain some insight into the mechanics of the method, consider the case wherein c = 1. Throughout this chapter, this benchmark case will be invoked frequently. In such a case we have that

θ_i = arg max_{q∈{1}} T_n(Z_{i,q}) = Z_{i,1} = z_i,

and hence θ_i = z_i = Z_{i,1}. Additionally, j becomes inactive in the algorithm, given that under these circumstances Step 1 is never activated. Consequently, for c = 1 the algorithm can be equivalently rewritten as follows.
Modus Operandi of the Master Method (c = 1)

0. Set i = 1. Find θ_1 ∈ Θ, and set θ̄_0 = θ_1.
1. Set θ̄_i = C(θ̄_{i−1}, θ_i), and increment i.
2. Generate θ_i from the probability space (Rk, B(Rk), P_i), and return to Step 1.
Hence, the classical Solis and Wets conceptual algorithm [49] is a particular case of our master method with c = 1. We emphasize that the master method is simply a generalization of this method which follows a course between any two seeds. These and other features of the master method will become clearer after the introduction of a matrix formulation of the master method, which we present in §4.3.4.
4.3.3 Stochastic Zigzag Methods

We will be particularly interested in the following instance of the master method.

Modus Operandi of the Stochastic Zigzag Method (c ∈ N)

0. Set i, j = 1. Find a, b ∈ Θ, and set θ_0 and θ̄_0 equal to arg max_{x∈{a,b}} T_n(x). Further, set z_1 and Z_{1,1} equal to arg min_{x∈{a,b}} T_n(x).

1. If c > 1, generate α_{i,j} from the probability space (R, B(R), P_{i,j}), and set Z_{i,j+1} = α_{i,j} θ̄_{i−1} + (1 − α_{i,j}) z_i. Else, go to Step 2.

2. If j < c − 1, increment j, and return to Step 1. Otherwise, set θ̄_i = C(θ̄_{i−1}, θ_i), where θ_i = arg max_{q∈{1,...,c}} T_n(Z_{i,q}), and set j = 1.

3. Generate z_i from the probability space (Rk, B(Rk), P_i), set Z_{i,1} = z_i, increment i and j, and return to Step 1.
Essentially the layout of the algorithm is the following. In Step 0 we initialize the algorithm; the consecutive application of Step 1 then collects a random sample of c points from the line which passes through the points θ̄_0 and z_1. In Step 2, we refresh the compass function C, obtaining the next candidate for the argument of the maximum yielded by the algorithm. We then move to Step 3, wherein a new seed is generated. Again, we sample the line which passes through the argument of the maximum of the previous line and the newly generated seed, and proceed by repeating the procedure described above (possibly ad infinitum).
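The steps above can be sketched as follows, assuming a one-dimensional parameter; the seed and α distributions below are illustrative choices:

```python
import random

def zigzag(T, draw_seed, draw_alpha, c, r, seed=0):
    """Sketch of the stochastic zigzag method for a scalar parameter:
    each course holds the seed z plus c - 1 points on the line through
    the current best point theta_bar and z."""
    rng = random.Random(seed)
    a, b = draw_seed(rng), draw_seed(rng)   # Step 0
    theta_bar = max((a, b), key=T)
    z = min((a, b), key=T)
    for _ in range(r):
        course = [z]                        # Step 1: build a course
        for _ in range(c - 1):
            al = draw_alpha(rng)
            course.append(al * theta_bar + (1 - al) * z)
        cand = max(course, key=T)           # Step 2: compass update
        if T(cand) > T(theta_bar):
            theta_bar = cand
        z = draw_seed(rng)                  # Step 3: fresh seed
    return theta_bar

if __name__ == "__main__":
    # one-dimensional analogue of (4.6): we maximize the negative of
    # the Styblinski–Tang function, whose maximizer is close to -2.90
    def T(x):
        return -0.5 * (x ** 4 - 16 * x ** 2 + 5 * x)
    best = zigzag(T, lambda rng: rng.uniform(-5, 5),
                  lambda rng: rng.uniform(-0.5, 1.5), c=10, r=500)
    print(round(best, 2))   # close to -2.90
```

Allowing α outside [0, 1] lets the course extrapolate beyond the segment joining θ̄ and the seed, rather than merely interpolate.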
For the sake of illustration, below we provide an application of the stochastic zigzag method to the classical test function

L(x_1, x_2) = (1/2)[x_1⁴ − 16x_1² + 5x_1 + x_2⁴ − 16x_2² + 5x_2].   (4.6)

Figure 4.2: The initialization of the stochastic zigzag method. In the picture on the left we start by finding points a and b which initialize the algorithm. The second picture illustrates that in Step 1 we collect a random sample (c = 10) from the line which passes through a and b. The remaining picture depicts Steps 2 and 3, wherein after the maximum of the first line we generate another seed and start by extracting a sample from the new line which passes through such points.
It is important to underscore that other variants of the stochastic zigzag method are also included in the general method, but we preferred to focus on the one stated above for its simplicity and appealing ease of implementation (see Theorem 4.10 in §4.3.4). We could have considered an alternative shape for the line, and the robustness of the master method is such that we are even able to take a different type of line for each different course.
4.3.4 A Matrix Formulation of the Master Method

In this subsection, we shed some light on the matrix formulation of the master method. This conceptual framework will help us to clarify some features of the method. Additionally, as we shall see later, such a representation can substantially reduce the burden of implementation. In order to present this formulation, we need to consider a stopping time of the method, which we denote by r. From the theoretical stance, one can consider, for instance, the time of entry into the optimality zone. In fact, this can be defined, for every ε, M > 0, as

r_{ε,M} = inf{i ∈ N : θ̄_i ∈ O_{ε,M}}.
Figure 4.3: The application of the stochastic zigzag method to the Styblinski–Tang test function. This function pertains to a class of test functions which are typically used to assess the performance of an optimization algorithm (see e.g. Spall [50]). The functional form of this function is given in formula (4.6).
It can easily be shown that this is a stopping time with respect to the natural filtration F_i = σ(θ̄_1, θ̄_2, . . . , θ̄_i). Analogous stopping times can be found even in introductory textbooks (e.g. [56]), so we skip the details. The crux of the proof is given by observing that

{r_{ε,M} ≤ i} = ⋃_{p=1}^{i} {θ̄_p ∈ O_{ε,M}} ∈ F_i, ∀i ∈ N.
The law of movement of the iterates (4.5) allows us to describe the mechanics of the method in matrix form by defining the iterative matrix Z as the (r × kc)-matrix

Z ≡
[ Z_{1,1} Z_{1,2} · · · Z_{1,c} ]
[ Z_{2,1} Z_{2,2} · · · Z_{2,c} ]
[    ⋮       ⋮     ⋱      ⋮    ]
[ Z_{r,1} Z_{r,2} · · · Z_{r,c} ]
=
[ z_1 Z_{1,1} · · · Z_{1,c−1} ]
[ z_2 Z_{2,1} · · · Z_{2,c−1} ]
[  ⋮     ⋮     ⋱      ⋮      ]
[ z_r Z_{r,1} · · · Z_{r,c−1} ]
=
[ Z_1 ]
[ Z_2 ]
[  ⋮  ]
[ Z_r ],   (4.7)

where Z_i denotes the i-th course, written as a row vector.
Further, we will refer to the map-iterative matrix T_Z as the (r × c)-matrix

T_Z ≡
[ T(Z_{1,1}) T(Z_{1,2}) · · · T(Z_{1,c}) ]
[ T(Z_{2,1}) T(Z_{2,2}) · · · T(Z_{2,c}) ]
[     ⋮          ⋮       ⋱        ⋮     ]
[ T(Z_{r,1}) T(Z_{r,2}) · · · T(Z_{r,c}) ],

whose i-th row gathers the images of the i-th course Z_i.
For the sake of illustration, let us rethink the case wherein c = 1. Then the iterative matrix Z and the map-iterative matrix T_Z become

Z = [ z_1 · · · z_r ]^T,   T_Z = [ T(z_1) · · · T(z_r) ]^T.

Hence, the affinity between the Solis and Wets conceptual algorithm [49] and the master method introduced above is now clearer. In fact, in the particular case wherein c = 1, the iterative matrix degenerates into a matrix composed solely of seeds, i.e., random draws generated from the probability spaces (Rk, B(Rk), P_i).
In the particular case of the stochastic zigzag method, the following matrix also finds
application:

α ≡ [ α1,1  α1,2  · · ·  α1,c−1 ]
    [ α2,1  α2,2  · · ·  α2,c−1 ]
    [  ⋮     ⋮     ⋱       ⋮   ]
    [ αr,1  αr,2  · · ·  αr,c−1 ].
In the result that follows we show how the matrix representation of the stochastic
zigzag method can bring inherent implementation advantages.

Theorem 4.10. (Kronecker–zigzag decomposition)
The i-th zigzag course can be rewritten as

zi = [ zi ⋮ αi ⊗ θi−1 + (1^T_{c−1} − αi) ⊗ zi ],     (4.8)

for i = 1, . . . , r, and where θi−1 is defined according to the formulation of the stochastic
zigzag method given above.
Proof. Just note that

zi = [ zi ⋮ αi,1 θi−1 + (1 − αi,1) zi  · · ·  αi,c−1 θi−1 + (1 − αi,c−1) zi ]
   = [ zi ⋮ αi,1 θi−1  · · ·  αi,c−1 θi−1 ] + [ 0 ⋮ (1 − αi,1) zi  · · ·  (1 − αi,c−1) zi ]
   = [ zi ⋮ αi ⊗ θi−1 + (1^T_{c−1} − αi) ⊗ zi ].
The latter result warrants some comments. Roughly speaking, it states that the law
of movement of each iterate can be readily extended to describe the law of movement
of a whole zigzag course, simply by replacing the usual scalar product by the Kronecker
product (and by performing the necessary scalar-to-vector adaptations). Note that the
latter result allows us to build very simple computational implementations, since it only
requires a loop which can be easily stated in pseudocode.
Pseudocode Implementation of the Stochastic Zigzag Method

• rand:
    - seeds;
    - alpha.
• for i = 1 to r:
    - compute theta_{i-1};
    - compute z_i via the decomposition (4.8);
    - increment i.
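The loop above can be made concrete in NumPy, with np.kron building each course in a single step. The sketch below is ours and rests on assumptions of our own choosing: minimization of the Styblinski–Tang function, α weights drawn uniformly on (0, 1), and a best-so-far compass rule.

```python
import numpy as np

def styblinski_tang(x):
    """Styblinski-Tang test function, evaluated row-wise."""
    x = np.atleast_2d(x)
    return 0.5 * np.sum(x**4 - 16.0 * x**2 + 5.0 * x, axis=1)

def zigzag(rng, r=50, c=4, k=2, lo=-5.0, hi=5.0):
    """Pure stochastic zigzag: r courses of c points in R^k (minimizing).

    Each course is built in one step via the Kronecker decomposition
    z_i = [ z_i : a_i (x) theta_{i-1} + (1 - a_i) (x) z_i ]."""
    theta = rng.uniform(lo, hi, size=k)              # theta_0
    best = styblinski_tang(theta)[0]
    for _ in range(r):
        z = rng.uniform(lo, hi, size=k)              # seed of the course
        a = rng.uniform(0.0, 1.0, size=c - 1)        # mixing weights
        course = np.concatenate(
            [z, np.kron(a, theta) + np.kron(1.0 - a, z)]
        ).reshape(c, k)                              # whole course at once
        vals = styblinski_tang(course)
        j = int(np.argmin(vals))
        if vals[j] < best:                           # compass update
            best, theta = vals[j], course[j]
    return theta, best

rng = np.random.default_rng(1)
theta, best = zigzag(rng)
```

By construction the best value is monotone along the run, mirroring the compass update rule of the master method.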
Making use of the binary operation ⊗, we are able to build in a single step the whole first
line of the iterative matrix Z. In the following we introduce an example. The example
should by no means be considered for optimization purposes, but merely as an
illustration which clarifies the conceptual framework introduced above.
Example 4.2. (Minimizing the Styblinski–Tang function)
We randomly generate the following matrices

α = [ α1 ]   [ 2/3   1/3 ]       [ −4  1 ]
    [ α2 ] = [ −1   −1/3 ] ; z = [  0  0 ] ; θ0 = [ −1  4 ].
    [ α3 ]   [ −2    −1  ]       [  2  0 ]
Using the binary operation ⊗, we build in a single step the whole first line of the iterative
matrix Z,

z1 = [ z1 ⋮ α1 ⊗ θ0 + (1^T_2 − α1) ⊗ z1 ]
   = [ z1 ⋮ (2/3)θ0 + (1/3)z1  (1/3)θ0 + (2/3)z1 ]
   = [ −4  1  −2  3  −3  2 ].
This yields

Tz1 = [ −15  −53  −58 ],

and so θ1 = [ −3  2 ]. Similarly, we build the second and third lines of the iterative matrix
Z,

z2 = [ z2 ⋮ α2 ⊗ θ1 + (1^T_2 − α2) ⊗ z2 ] = [ 0  0  3  −2  1  −2/3 ].

This yields

Tz2 = [ 0  −53  −10.12 ],
and so θ2 = [ 3  −2 ]. Finally, we have

z3 = [ z3 ⋮ α3 ⊗ θ2 + (1^T_2 − α3) ⊗ z3 ] = [ 2  0  0  4  1  2 ],

implying that

Tz3 = [ −19  10  −24 ],
and so θ3 = [ 1  2 ]. Thus, we have the following iterative matrix Z and corresponding
map-iterative matrix TZ:

Z = [ −4  1  −2   3  −3    2  ]        [ −15  −53    −58  ]
    [  0  0   3  −2   1  −2/3 ] ; TZ = [   0  −53  −10.12 ]
    [  2  0   0   4   1    2  ]        [ −19   10    −24  ]
4.4 Convergence of the Master Method

This section establishes the convergence of the general algorithm introduced above.
We start this journey with some preliminary considerations. First, note that as a
consequence of the compass update rule of the master algorithm, θi = C(θi−1, θ̃i), where
θ̃i denotes the best point of the i-th course, the sequence {Tn(θi)}i∈N is increasing. In
fact, we have that

Tn(θi) = (Tn ◦ C)(θi−1, θ̃i) ≥ Tn(θi−1).     (4.9)

This reasoning can be easily extended by induction, so that for every positive integer κ

Tn(θi+κ) ≥ Tn(θi).

This simple fact will play an important role in the establishment of the following trinity
of elementary results.
Proposition 4.11.
For every positive integer κ, we have that:

1. If θ̃i ∈ Oε,M , then θi+κ ∈ Oε,M ;
2. If θi ∈ Oε,M , then θi+κ ∈ Oε,M ;
3. {θκ ∈ Oᶜε,M} ⊆ {θ̃1, . . . , θ̃κ−1 ∈ Oᶜε,M} ∩ {θ1, . . . , θκ−1 ∈ Oᶜε,M}.
Proof.

1. We will just deal with the case where the essential supremum is finite, because the
case wherein τ = ∞ is similar. Given that the sequence {Tn(θi)}i∈N is increasing,
it holds that for every positive integer κ

Tn(θi+κ) ≥ Tn(θi) = Tn(C(θi−1, θ̃i)) ≥ Tn(θ̃i).     (4.10)

Further, since by assumption θ̃i ∈ Oε,M , it holds that

Tn(θ̃i) > τ − ε.     (4.11)

The final result now follows by combining inequalities (4.10) and (4.11).
2. We will only consider the case in which τ ∈ R, given that the case wherein τ = ∞
is similar. Since by assumption we have that θi ∈ Oε,M , it holds that

Tn(θi) > τ − ε.     (4.12)

The final result follows directly as a consequence of the sequence {Tn(θi)}i∈N being
increasing.

3. As a consequence of Claims 1 and 2, we have that for every positive integer κ

(θ̃κ−1 ∈ Oε,M ∨ θκ−1 ∈ Oε,M) ⇒ θκ ∈ Oε,M .     (4.13)

Applying the contrapositive law to (4.13) yields

{θκ ∈ Oᶜε,M} ⇒ {θ̃κ−1 ∈ Oᶜε,M and θκ−1 ∈ Oᶜε,M}
            ⇒ {θ̃1, . . . , θ̃κ−1 ∈ Oᶜε,M and θ1, . . . , θκ−1 ∈ Oᶜε,M}.

The last implication follows directly from Claims 1 and 2.
Claims 1 and 2 of the foregoing proposition translate the idea that if an iterate of the
algorithm falls in the optimal zone, then it remains there forever. Claim 3 will be
particularly useful in the proof of convergence of the general algorithm proposed above.
With the next theorem we start the study of the convergence of the master method.
Theorem 4.12. (Convergence of the pure master method—Part I)

1. Suppose that Tn is bounded from above. Further, suppose that the master method
is pure, and that the following condition holds:

∀B ∈ B(Θ), λ(B) > 0 ⇒ P[z1 ∈ B] > 0.     (4.14)

Then

P[θi ∈ Oᶜε,M ] = o(1).

2. Suppose that Tn is bounded from above. Then

Tn(θi) − T = o(1), a.s.,     (4.15)

where T is a random variable such that P[T = τ ] = 1.
Proof.

1. As a consequence of Proposition 4.11 it holds that

P[θi ∈ Oᶜε,M ] ≤ P[ ⋂_{1≤p≤i−1} {θ̃p ∈ Oᶜε,M} ∩ {θp ∈ Oᶜε,M} ] ≤ P[ ⋂_{1≤p≤i−1} {θ̃p ∈ Oᶜε,M} ].     (4.16)

Observe now that since by definition θ̃p = arg max_{q∈{1,...,c}} Tn(Zp,q), it holds that
{θ̃p ∈ Oᶜε,M} ⊆ {zp ∈ Oᶜε,M}, ∀p ∈ N. This latter observation combined with (4.16)
yields

P[θi ∈ Oᶜε,M ] ≤ P[ ⋂_{1≤p≤i−1} {zp ∈ Oᶜε,M} ] = P[z1 ∈ Oᶜε,M ]^{i−1}.

The final result now holds since by assumption P[z1 ∈ Oᶜε,M ] < 1.
2. Start by noting that {Tn(θi), Fi}i∈N is a submartingale, where Fi = σ(θ1, θ2, . . . , θi)
denotes the natural filtration; just observe that

E[Tn(θi) | Fi−1] = E[(Tn ◦ C)(θi−1, θ̃i) | Fi−1] ≥ E[Tn(θi−1) | Fi−1] = Tn(θi−1), a.s.

Given that this submartingale is bounded from above, it is a.s. convergent to a
random variable T.⁵ Observe now that as a consequence of the preceding claim it
holds that

P[Tn(θi) < τ ] = o(1),     (4.17)

given that ε is arbitrary. Further, Fatou’s lemma yields

P[T < τ ] = P[ lim inf_{i→∞} {Tn(θi) < τ} ] ≤ lim sup_{i→∞} P[Tn(θi) < τ ] = 0,

where the last equality follows by (4.17). Furthermore, as a consequence of Propo-
sition A.2 in the appendix, it holds that

Tn(x) ≤ τ , ∀x ∈ Θ.

⁵Since by assumption Tn is bounded from above, it holds that sup_i E[Tn(θi)] < ∞. Consequently,
Doob’s martingale convergence theorem can be applied, hence establishing the a.s. convergence to T.
In particular, this implies that for every positive integer i we have P[Tn(θi) > τ ] = 0.
Consequently it holds that

P[Tn(θi) > τ ] = o(1).     (4.18)

Therefore, again by Fatou’s lemma it holds that

P[T > τ ] = P[ lim inf_{i→∞} {Tn(θi) > τ} ] ≤ lim sup_{i→∞} P[Tn(θi) > τ ] = 0,

where the last equality holds as a consequence of (4.18).
The latter result warrants some general remarks. Roughly speaking, Claim 1 states that
the probability of the algorithm missing the optimality region approaches 0 as the number
of iterates increases. Further, Claim 2 ensures that the sequence {Tn(θi)}i∈N converges
a.s. to a random variable T which is indistinguishable from the essential supremum.
Observe that the proof of the second claim is entirely robust to both the pure stochastic
method and the adaptive master method. Hence, the second result also holds in what
concerns the adaptive master method. A question then arises: is the first claim of
the previous theorem also extendable to the adaptive master method? This issue lies
at the heart of the next theorem.
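The geometric bound in Claim 1 is easy to visualise numerically. The sketch below is an illustration of ours: it abstracts the optimal zone to an event of assumed per-seed probability p_hit (which is all the geometric bound uses), and compares the Monte Carlo miss frequency for i.i.d. seeds with (1 − p_hit)^(i−1):

```python
import numpy as np

def miss_probability(i, p_hit, n_trials, rng):
    """Monte Carlo frequency of the event 'none of the first i-1 seeds
    hits the optimal zone', the zone being abstracted to an event of
    probability p_hit per seed."""
    hits = rng.random((n_trials, i - 1)) < p_hit
    return float(np.mean(~hits.any(axis=1)))

rng = np.random.default_rng(0)
est = miss_probability(i=20, p_hit=0.1, n_trials=200_000, rng=rng)
bound = (1.0 - 0.1) ** (20 - 1)     # geometric bound, about 0.135
```

For independent uniform seeds the bound holds with equality, so the estimate should track the bound up to Monte Carlo error.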
Theorem 4.13. (Convergence of the adaptive master method: Part I)
Suppose that Tn is bounded from above. Further, suppose that the master method is
adaptive, and that the following condition holds:

inf_{1≤p≤i−1} P[zp ∈ Oᶜε,M ] = o(1).     (4.19)

Then

P[θi ∈ Oᶜε,M ] = o(1).
Proof. Our line of attack is similar to the previous proof. Just note that by a similar
reasoning, it holds that

P[θi ∈ Oᶜε,M ] ≤ P[ ⋂_{1≤p≤i−1} {zp ∈ Oᶜε,M} ] ≤ inf_{1≤p≤i−1} P[zp ∈ Oᶜε,M ],

from where the final result follows directly.
It is important to underscore that the hypothesis considered here in order to establish
the convergence of the adaptive stochastic method is known in the literature. Condition
(4.19) is tantamount to the one adopted by Esquível [17]. Note, however, that whereas
Esquível used this condition as a means to establish the convergence of the adaptive
random search, here it is used in the more general context of the adaptive master method.
In the sequel, we evaluate how far we can reach assuming that the identification of the true
parameter holds under the Weierstrass framework.⁶
Theorem 4.14. (Convergence of the pure master method: Part II)
Suppose that Tn is bounded from above. Further, suppose that the master method is pure,
and that the following condition holds:

∀B ∈ B(Θ), λ(B) > 0 ⇒ P[z1 ∈ B] > 0.     (4.20)

Further, suppose that Tn(θ) is continuous and that θn = arg max_{θ∈Θ} Tn(θ). Then, it holds
that

Tn(θi) − Tn(θn) = o(1), a.s.     (4.21)

If furthermore Θ ⊂ Rk is compact, then

θi − θn = o(1), a.s.     (4.22)
Proof.
The proof is split into two claims. The first claim establishes that Tn(θi) − Tn(θn) =
o(1), a.s. The second claim shows that θi − θn = o(1), a.s.

1. Let us first show that the sequence {Tn(θi)}i∈N converges in probability to Tn(θn).
Consider ε > 0. Start by noting that

P[|Tn(θi) − Tn(θn)| ≥ ε] = P[{Tn(θi) ≤ Tn(θn) − ε} ∪ {Tn(θi) ≥ Tn(θn) + ε}].     (4.23)

Observe now that by Theorem 4.5 the essential supremum and the maximum
coincide. Hence, by definition of essential supremum it holds that
P[Tn(θi) ≥ Tn(θn) + ε] = 0. This implies that (4.23) can be rewritten as

P[|Tn(θi) − Tn(θn)| ≥ ε] = P[Tn(θi) ≤ Tn(θn) − ε] = P[θi ∈ Oᶜε,M ].

Now, observe that as a consequence of Proposition 4.11, it holds that

P[θi ∈ Oᶜε,M ] ≤ P[θ̃1, . . . , θ̃i−1 ∈ Oᶜε,M ] ≤ P[z1, . . . , zi−1 ∈ Oᶜε,M ] = (P[z1 ∈ Oᶜε,M ])^{i−1}.
⁶These are, respectively, Assumptions 3 and 1 from the previous chapter.
Given that by assumption P[z1 ∈ Oᶜε,M ] < 1, the last inequality establishes that
Tn(θi) − Tn(θn) = op(1). The remaining part of the proof follows by a standard
argument, given that the sequence {Tn(θi)}i∈N is increasing. This implies that the
sequence of events Ei,ε = {|Tn(θi) − Tn(θn)| ≤ ε} is expanding, i.e., it is such that
Ei,ε ⊆ Ei+1,ε, for every i ∈ N and ε > 0. Consequently, by a standard argument,⁷
convergence in probability implies that

P[ lim_{i→∞} |Tn(θi) − Tn(θn)| ≤ ε ] = 1, ∀ε > 0.     (4.24)

Given that ε is arbitrary, we get that

P[ lim_{i→∞} |Tn(θi) − Tn(θn)| = 0 ] = 1,

from where the final result follows.
2. Let us now suppose that Θ is compact, and suppose by contradiction that (4.22)
does not hold. Then, for every ω on a set Ω of positive probability,

∃ε > 0 ∀p ∈ N ∃i > p : |θi(ω) − θn| > ε.     (4.25)

Now, for all ω ∈ Ω the sequence {θi(ω)}i∈N is a sequence of points in the compact
set Θ, and by the Bolzano–Weierstrass theorem there is a convergent subsequence
{θiκ(ω)}κ∈N of {θi(ω)}i∈N. This subsequence must converge to θn, because if
the limit were some θa then, by the continuity of Tn, the sequence
{Tn(θiκ(ω))}κ∈N would converge to Tn(θa) = Tn(θn). Now, as θn is the unique
maximizer of Tn in Θ, we certainly have θa = θn. Finally, observe that the subsequence
{θiκ(ω)}κ∈N also verifies the condition expressed in (4.25) for κ large enough, which
yields the desired contradiction.
A similar result can be established if the master method is adaptive. Again, the proviso
has to be suitably accommodated, making use of Esquível’s [17] hypothesis.

⁷Recall that when a sequence of events Ei is either expanding or contracting it holds that
lim_{i→∞} P[Ei] = P[lim_{i→∞} Ei]. Thus, under such circumstances one can interchange the limit
with the measure. See e.g. Proposition 1.1.1 in Ross [45].
Theorem 4.15. (Convergence of the adaptive master method: Part II)
Suppose that Tn is bounded from above. Further, suppose that the master method is
adaptive, and that the following condition holds:

inf_{1≤p≤i−1} P[zp ∈ Oᶜε,M ] = o(1).     (4.26)

Further, suppose that Tn(θ) is continuous and that θn = arg max_{θ∈Θ} Tn(θ). Then, it holds
that, as i → ∞,

Tn(θi) − Tn(θn) = o(1), a.s.

If furthermore Θ ⊂ Rk is compact, then, as i → ∞,

θi − θn = o(1), a.s.
Proof. By a similar reasoning to the proof of Theorem 4.13, we get that

P[θi ∈ Oᶜε,M ] ≤ P[θ̃1, . . . , θ̃i−1 ∈ Oᶜε,M ]
             ≤ P[z1, . . . , zi−1 ∈ Oᶜε,M ]
             ≤ inf_{1≤p≤i−1} P[zp ∈ Oᶜε,M ].

This establishes that Tn(θi) − Tn(θn) = op(1). The a.s. convergence can now be achieved
by the same argument used in the proof of Theorem 4.14, and the remaining part of the
proof is the same as above.

In the next section we provide a brief note regarding the construction of confidence
intervals for the extremum of a function. For this purpose the seeds will play an
important role.
4.5 A Note on the Construction of Confidence Intervals

This section is devoted to the construction of confidence intervals for the maximum of
a function, through the use of the image of the first column of the iterative matrix Z.
In fact, as we shall see below, if the master method is pure, and the seeds are uniformly
distributed over Θ, then it is possible to make use of a result on extreme value theory
due to de Haan [14]. In the sequel, let Tz(1) ≤ Tz(2) ≤ · · · ≤ Tz(r) denote the order
statistics of the sequence of the images of the seeds, where r denotes a finite (possibly
degenerate) stopping time.
Theorem 4.16. (Confidence Intervals for the Maximum—de Haan [14])
Suppose that the identification of the true parameter θo holds, i.e., θo = arg max_{θ∈Θ} T(θ).
Consider a sequence of independent and identically distributed zi with uniform distri-
bution over Θ. Further, consider the auxiliary correspondence Ξ : N × [0; 1] ⇒ R defined
as follows,

Ξ(i, p) = ] Tz(i) ; Tz(i) + (Tz(i) − Tz(i−1)) / ((1 − p)^{−2/k} − 1) [.     (4.27)

The following large sample result holds:

P[T(θo) ∈ Ξ(i, p)] − (1 − p) = o(1), i → ∞.

Proof. See de Haan [14], pp. 467–469.
Proof. See de Haan [14], pp. 467–469.
Remark 4.17. It is worth emphasizing that the proof of this result relies on the applica-
tion of asymptotic results from extreme value theory (Galambos [20]). Given that the
proof of the theorem is relegated to a reference, it is important to underscore that there
are some small typos in the original paper of de Haan [14] that can generate confusion
and misleading conclusions. Making use of de Haan’s [14] notation, we call attention to
the following points.

• In line 21 of p. 467, it is written an α−1. Instead it should be written an α.

• The formula for constructing the confidence intervals is given in the last line of
page 467. In lieu of such formula it should be written

] Y1 − (Y2 − Y1) / ((1 − p)^{−1/α} − 1) ; Y1 [.     (4.28)

Transliterated into our notation, Y1 and Y2 respectively mean Tz(1) and Tz(2),
and α denotes k/2. Note that formula (4.28) is for the minimum of
a function, whereas in Theorem 4.16 it is adapted for the maximum (cf. formula
(4.29) stated below).
It should also be pointed out that hypothesis tests based on this method have been
developed. Veall [53] developed a statistical procedure suited for testing whether a solution
achieved is a global maximum. Hence, in the same spirit, if the method is pure and the
seeds are uniformly distributed, Veall’s test can also be implemented here, making use
of the first column of the iterative matrix Z.
It is worth noting that the method is extremely easy to apply, making use of the following
inputs: two order statistics (Tz(r), Tz(r−1)), the level of significance (p), and the dimen-
sion of the optimization problem at hand (k). If we intend to construct a confidence interval
for the minimum of a function, then the following set-valued correspondence should be
used:

Ψ(i, p) = ] Tz(1) − (Tz(2) − Tz(1)) / ((1 − p)^{−2/k} − 1) ; Tz(1) [.     (4.29)
In the appendix we report some computational experience with de Haan’s method. The
remainder of this section provides brief guidelines on the construction of the corresponding
tables. Monte Carlo simulations were considered for several (degenerate) stopping times,
r = 10,000, 20,000, 100,000 and 500,000. Given that we ran several Monte Carlo simulations,
as a means to distinguish the several order statistics, let Tz(i),j denote the i-th order
statistic from the j-th trial. Further, define the set-valued function Ψ : N × [0; 1] ⇒ R,

Ψ(r, p) = ] r⁻¹ Σ_{j=1}^{r} ( Tz(1),j − (Tz(2),j − Tz(1),j) / ((1 − p)^{−2/k} − 1) ) ; r⁻¹ Σ_{j=1}^{r} Tz(1),j [.     (4.30)
Moreover, when reporting the computational experience with de Haan’s method, we
make use of the following notation:

σ²LB(p) = (r − 1)⁻¹ Σ_{j=1}^{r} ( Tz(1),j − (Tz(2),j − Tz(1),j)/((1 − p)^{−2/k} − 1)
          − r⁻¹ Σ_{j=1}^{r} ( Tz(1),j − (Tz(2),j − Tz(1),j)/((1 − p)^{−2/k} − 1) ) )²,

σ²UB = (r − 1)⁻¹ Σ_{j=1}^{r} ( Tz(1),j − r⁻¹ Σ_{j=1}^{r} Tz(1),j )².     (4.31)

That is, σ²LB(p) and σ²UB denote the sample variances, across trials, of the lower and
upper interval endpoints.
The number of trials considered was 1,000. It is important to underscore that the test
functions used are classical in the literature on testing the performance of optimization
algorithms.
Chapter 5
Estimation in the Mixed Model
via Stochastic Optimization
5.1 Introduction

Maximum likelihood is one of the main standard techniques for yielding parameter
estimates of a statistical model of particular interest. Large sample results for this
M-estimation methodology were established long ago in the literature (Wald [57]). Despite
its attractive features, there are circumstances under which the application of such an
estimator becomes prohibitive. In fact, in a plurality of cases of practical interest, the
estimator is not analytically tractable. In this chapter we are interested in a particular
case where such an occurrence takes place, namely in the MLE for normal linear mixed
models—a model which was briefly discussed in Chapter 3 (see Example 3.2).
A possible approach to overcome this general lack of a closed-form analytic solution
is given through the application of well-suited global optimization methods. Before
carrying out the optimization, it is sometimes convenient to inspect whether it is possible to
simplify the problem at hand. For instance, as noted by Carvalho et al. [11], if the
model has a commutative orthogonal block structure, then a closed-form solution for the MLE
in normal linear mixed models can be found. Beyond these very special instances, there is
no hope of achieving an explicit form for the solution of this maximum likelihood problem.
Notwithstanding, in this chapter we show that in a linear mixed model, the maximum
likelihood problem can be rewritten as a much simpler optimization problem (henceforth
the simplified problem) where the search domain is a compact set whose dimension depends
only on the number of variance components. The original maximum likelihood problem is
thus reduced to a simplified problem which presents at least two main advantages: the
number of variables in the simplified problem is considerably lower; the domain of search
of the simplified problem is a compact set. Whereas the former advantage avoids the
so-called ‘curse of dimensionality’, the latter permits the use of simple stochastic search
optimization methods.¹ As can be readily noticed, from the estimation standpoint, these
features are extremely advantageous. This simplified problem will allow us to obtain the
estimates of the variance components with large computational savings. Furthermore,
given that the domain of search of the simplified problem is a compact set, we can use
simple random search methods—which are a particular instance of the master method
presented in the previous chapter. Other variants of the master method introduced
above can also be used to solve the problem of interest.
This chapter is organized as follows. In the next section we introduce the model of
interest. In §5.3 we introduce a main result which yields the simplified problem, and to
assess the performance of our approach we conduct a Monte Carlo simulation study and
report the results in §5.4.
5.2 Model

In this section, we present the model of interest. We bring to mind that some points
regarding the mixed linear model were introduced in Chapter 3 (recall Example 3.2).
Roughly speaking, one can think of the mixed linear model as an extension of the
simple linear model (3.2), designed to account for more than one source of error. The
model takes the following form:

y = Xβo + Σ_{i=1}^{w−1} Xi ζi + ε,     (5.1)

where (y, X, βo, ε) are defined as in (3.2), the Xi are design matrices of size n × ki, and
the ζi are ki-vectors of unobserved random effects. Following classical assumptions,
we take the random effects ζi to be independent and normally distributed with null
mean vectors and covariance matrices σ²oi Iki , for i = 1, . . . , w − 1. Further, we also take
ε to be normally distributed with null mean vector and covariance matrix σ²ow In,
independently of the ζi, for i = 1, . . . , w − 1.
A general overview of topics related to the estimation of, and inference in, these models
can be respectively found in Searle et al. [47] and Khuri et al. [29].
The model has the following mean vector and covariance matrix:

E[y | X] = Xβo,     (5.2)

¹This expression was introduced by the prominent mathematician Richard Bellman. A detailed
analysis of this issue is far beyond the scope of this thesis. See Pakes and McGuire [40] for a discussion.

Σσ²o ≡ V[y | X] = Σ_{i=1}^{w−1} σ²oi Xi Xiᵀ + σ²ow In,

where σ²o ≡ [ σ²o1 · · · σ²ow ]. Given the current framework, we have that

y | X ∼ N( Xβo ; Σ_{i=1}^{w−1} σ²oi Xi Xiᵀ + σ²ow In ),

and thus the density for the model is given by

fy|X(y) = exp( −(1/2) (y − Xβo)ᵀ Σ⁻¹σ²o (y − Xβo) ) / √( (2π)ⁿ det(Σσ²o) ).
Now, let θ ≡ [ βᵀ σ² ]. The estimator objective function assigned to the maximum
likelihood estimator is given by the loglikelihood of the aforementioned mixed linear
model, i.e.:

Tn(θ) = Tn([ βᵀ σ² ])
      = −(n/2) ln(2π) − (1/2) ln( det(Σσ²) ) − (1/2) (y − Xβ)ᵀ Σ⁻¹σ² (y − Xβ).

The maximum likelihood estimators of the true regression parameter and the model
variance components, respectively denoted by βML and σ²ML, are thus given by

[ βᵀML σ²ML ] = arg max_{[ βᵀ σ² ]} Tn([ βᵀ σ² ]) = arg max_{θ∈Θ} Tn(θ),     (5.3)

where the parameter set Θ is a subset of R^{k+w} which restricts the elements
θi to be nonnegative, for i = k + 1, . . . , k + w, i.e.

Θ ≡ { θ ∈ R^{k+w} : θi ≥ 0, i = k + 1, . . . , k + w }.
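To fix ideas, Tn(θ) is straightforward to evaluate numerically. The sketch below is ours; the one-way layout (three groups of two observations, intercept-only mean) and the parameter values are illustrative assumptions, not part of the model specification above.

```python
import numpy as np

def loglik(beta, s2, y, X, Zs):
    """Mixed-model loglikelihood T_n(theta); s2 = [s2_1, ..., s2_w] are
    the variance components, Zs = [X_1, ..., X_{w-1}] the random-effect
    design matrices, and s2_w multiplies the identity."""
    n = len(y)
    Sigma = s2[-1] * np.eye(n)
    for s, Z in zip(s2[:-1], Zs):
        Sigma += s * (Z @ Z.T)
    resid = y - X @ beta
    logdet = np.linalg.slogdet(Sigma)[1]
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet
                   + resid @ np.linalg.solve(Sigma, resid))

# toy one-way layout: 3 groups of 2 observations, intercept-only mean
rng = np.random.default_rng(0)
Z1 = np.kron(np.eye(3), np.ones((2, 1)))       # group membership
X = np.ones((6, 1))
y = X @ np.array([1.0]) + Z1 @ rng.normal(size=3) + 0.5 * rng.normal(size=6)
ll = loglik(np.array([1.0]), np.array([1.0, 0.25]), y, X, [Z1])
```

The point of the chapter is precisely that maximizing this function directly over (β, σ²) is costly, which motivates the dimension reduction of §5.3.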
In the following section, we consider a special case wherein it is possible to obtain a
closed-form solution for the ML problem. It should be emphasized that cases like the
one presented below represent the exception, rather than the rule. Notwithstanding, this
instance is introduced here as a benchmark case.
5.2.1 The Benchmark Case

In this subsection, we consider a particular case wherein it is possible to get a closed-
form solution for the problem of interest. As we shall see below, this is gained at the
cost of the introduction of some structure in the covariance matrix Σσ²o. Specifically,
we consider the case wherein the covariance matrix can be decomposed as

Σσ²o = Σ_{j=1}^{w} ηj Qj.

If the Qj are orthogonal projection matrices such that Qj Qj′ = 0 for j ≠ j′, and if T,
the orthogonal projection matrix on the range space of X, is such that

T Qj = Qj T, j = 1, . . . , w,

the model is said to have commutative orthogonal block structure. Here, η = [η1 · · · ηw]ᵀ
is the vector of the so-called canonical variance components, and it is determined by the
equation Bη = σ²o, where B is a known nonsingular matrix. In this case, we can rewrite
the density of the model as

fy|X(y) = exp( −(1/2) (y − Xβo)ᵀ ( Σ_{j=1}^{w} ηj⁻¹ Qj ) (y − Xβo) ) / √( (2π)ⁿ Π_{j=1}^{w} ηj^{gj} ),     (5.4)

where gj is the rank of the matrix Qj.

In this particular instance, it can be shown (see Carvalho et al. [11]) that the following
estimators solve the optimization problem (5.3):

β̂ = (XᵀX)⁻¹Xᵀy,     η̂j = yᵀ(I − T)Qj(I − T)y / gj.

When the above-mentioned conditions on the covariance matrix and X do not hold, a
closed-form analytical expression for the MLE is not typically obtainable.
Before we move on to the application of stochastic search methods to the optimization
problem (5.3), we provide a result which reduces the number of variables over which
we perform the optimization.
5.3 The Dimension Reduction Technique

The next result establishes that in a mixed linear model, the maximum likelihood prob-
lem can be rewritten as a simplified problem where the search domain is a compact
set whose dimension depends exclusively on the number of variance components. This
result will prove useful for computing maximum likelihood estimates of the variance
components at a much lower computational effort.
Theorem 5.1. Consider the aforementioned mixed model,

y = Xβo + Σ_{i=1}^{w−1} Xi ζi + ε.

The maximum likelihood estimators of the true regression parameter and the model vari-
ance components, respectively denoted by βML and σ²ML, given by

[ βᵀML σ²ML ] = arg max_{[ βᵀ σ² ]} ℓn([ βᵀ σ² ] | y),     (5.5)

can be alternatively achieved by solving the following optimization problem:

min_{γ ∈ [0; π/2]^{w−1}} (fn ◦ p)(γ),     (5.6)

where

fn(α) = ln( A(α)ⁿ det(Σα) ),     (5.7)

A(α) = yᵀ ( I − X(XᵀΣα⁻¹X)⁻¹XᵀΣα⁻¹ )ᵀ Σα⁻¹ ( I − X(XᵀΣα⁻¹X)⁻¹XᵀΣα⁻¹ ) y,     (5.8)

p(γ) = q1 Π_{j=1}^{w−1} cos(γj) + Σ_{l=2}^{w−1} ql sin(γl−1) Π_{j=l}^{w−1} cos(γj) + qw sin(γw−1),     (5.9)

and {qi}_{i=1}^{w} denotes the canonical basis of Rʷ.
From the inspection of Theorem 5.1, we can ascertain at least two major advantages of
the simplified problem relative to the original problem, namely: whereas the original
maximum likelihood problem has dimension w + k, the simplified equivalent problem only
has dimension w − 1; additionally, the search domain of the simplified problem is a compact
set—contrary to what is verified in the original problem. This latter advantage
permits the use of simple random search methods, adjusting a multivariate uniform
distribution over the new search domain [0; π/2]^{w−1}. The proof is given below.
Proof. (Dimension reduction technique)
Consider the loglikelihood of the aforementioned mixed linear model,

ℓn([ βᵀ σ² ] | y) = −(n/2) ln(2π) − (1/2) ln(det(Σσ²)) − (1/2) (y − Xβ)ᵀ Σ⁻¹σ² (y − Xβ).

Observe that maximizing ℓn([ βᵀ σ² ] | y) is equivalent to minimizing

ℓ*([ βᵀ σ² ] | y) = ln(det(Σσ²)) + (y − Xβ)ᵀ Σ⁻¹σ² (y − Xβ).     (5.10)

Now define σ² = cα, with c > 0 and ‖α‖ = 1. Making use of the first order conditions
of the ML problem we get that

β̂ = (XᵀΣ⁻¹σ² X)⁻¹ XᵀΣ⁻¹σ² y.

Hence, we can rewrite (5.10), evaluated at β̂, as

ℓ* = n ln(c) + ln(det(Σα)) + c⁻¹A(α),

where A is defined in (5.8). Now, observe that

∂ℓ*/∂c = n c⁻¹ − c⁻²A(α) = 0 ⇔ c = A(α)/n,

and

∂²ℓ*/∂c² = 2c⁻³A(α) − n c⁻²,

so that

∂²ℓ*/∂c² |_{c = A(α)/n} = n³/A(α)² > 0,

whence c = A(α)/n is in fact an absolute minimum. Hence (5.10), up to an additive
constant, simplifies into n ln(A(α)) + ln(det(Σα)), which we define as fn(α) (see above
in (5.7)). Next, we transform α through the polar coordinate transformation p(γ)
(see e.g. Kendall [27]) defined in (5.9). This entails writing the w components of α
through w − 1 components in γ, as follows:

α1 = cos(γ1) cos(γ2) · · · cos(γw−1),
α2 = sin(γ1) cos(γ2) · · · cos(γw−1),
⋮
αw = sin(γw−1).     (5.11)
5.4 A Stochastic Optimization Study of the Dimension Reduction Technique

In this Monte Carlo simulation study, we considered three one-way random models. The
first model is unbalanced, with a total of 72 observations and 8 groups. The distribution
of the observations across groups can be described through the following vector,

[3 6 7 8 9 10 11 18],

whose i-th component denotes the number of elements considered in the i-th group.
Several possible true values of the variance components were considered. We then con-
ducted a Monte Carlo simulation from which we report the average of the several
results achieved. In every run of the simulation, the optimization problem was solved
using the dimension reduction technique introduced in the previous section, and the
pure random search method—a particular instance of the master method introduced in
the previous chapter.
Variance component    0.0    0.1    0.5    0.7    1.0    1.5    2.0    5.0   10.0
Estimate            0.016  0.073  0.425  0.609  0.869  1.311  1.759  4.657  9.421

Table 5.1: Estimates of the variance components in Model I.
We now provide some guidelines regarding the interpretation of Table 5.1. In the first
line we present the true values of the variance components σ²o. The second line contains
the solution provided by the recurrent application of pure random search methods to
the optimization problem (5.3). Thus, for instance, when the “true” variance component
was 0.5, the application of stochastic optimization methods and the dimension reduction
technique presented above yielded 0.425. Further, observe that, except when the true
value of the variance component is null, the true values always dominate the estimated
values.
Next, we considered a quasi-balanced model with 66 observations. The distribution of the
observations was now the following:

[6 6 6 6 6 6 6 6 6 5 7].

A Monte Carlo simulation was once more conducted. No changes were made regard-
ing the true values of the variance components considered. The same applies in what
concerns the methods used to perform the optimization step. The results produced are
reported in Table 5.2.
Variance component    0.0    0.1    0.5    0.7    1.0    1.5    2.0    5.0   10.0
Estimate            0.023  0.086  0.448  0.633  0.893  1.344  1.850  4.611  9.091

Table 5.2: Estimates of the variance components in Model II.
Hence, when the true value of the variance component was 0.5, its estimate obtained
by maximum likelihood methods was 0.448. It should be emphasized that, again, with
the exception of the case wherein the true value of the variance component was 0, in all
the remaining cases the true value of the variance component was above its estimate.
A final model was then considered. The number of observations considered was again 72.
Observations were now grouped as follows:

[2 2 3 3 4 4 15 15 24].

The results are summarized in Table 5.3.
Variance component    0.0    0.1    0.5    0.7    1.0    1.5    2.0    5.0   10.0
Estimate            0.012  0.074  0.426  0.600  0.852  1.364  1.780  4.711  9.929

Table 5.3: Estimates of the variance components in Model III.
As can be readily noted from the inspection of Tables 5.1, 5.2 and 5.3, there is a slight
bias present in the estimates produced through maximum likelihood. This is in accord
with what is known in the literature, and there exist some methods, such as Restricted
Maximum Likelihood (REML), which can be used to compensate for such bias (see
Harville [23] and Searle et al. [47]).
Chapter 6
Summary and Conclusions
6.1 Closure
The instigation which drove us through the research culminating in this thesis
has now reached a final stage. It is now time for a reflection regarding the work
developed thus far. In this spirit, we summarize below the inquiry carried on over this
thesis and provide some concluding remarks.
As discussed above, this thesis was written as a fugue, interweaving the themes of ex-
tremum estimators and stochastic optimization. In what concerns extremum estimators,
we started by discussing some of the provisos under which the consistency and asymp-
totic normality of such estimators can be ensured. During this part of the thesis, we
presented some cornerstone results which establish strong consistency and asymptotic
normality under a fairly mild set of assumptions. Given that any estimator which
can be formulated through an optimization problem is tantamount to an extremum es-
timator, the broadness of such results is in effect astonishing. The melody played in
this regard was strongly inspired by the seminal works of Newey and McFadden [35] and
Andrews [3]. After the analysis of the large sample results of extremum estimators, we
started considering their computational features. In this regard, we emphasized the need
to carefully choose which numerical procedure to use when carrying out the optimization.
Several hindrances may arise if one does not pay specific attention to this point. In
fact, given that the global solution is the only one which inherits the noteworthy asymptotic
features, one should avoid methods which may eventually converge to a local solution.
Some standard numerical procedures were then introduced as a means to start giving
a voice to stochastic optimization algorithms. The presentation of some deterministic
methods worked as a bridge linking Chapters 2 and 3, allowing us to lay the groundwork
for the study of stochastic optimization algorithms.
Extremum estimators are then counterpointed with the introduction of the master method, a general algorithm which comprises several other stochastic optimization algorithms. The generality of the master method is in fact considerable, including, for instance, the conceptual algorithm of Solis and Wets [49] as a particular case. Another specific embodiment of the master method is provided by the stochastic zigzag method, an optimization algorithm based on the prominent work of Mexia et al. [32]. During this phase of the thesis, we also offer a simple matrix formulation of the algorithm. The matrix formulation not only brings new insights into the general method, but can also diminish the burden of implementation. In fact, we achieve a result, the Kronecker–zigzag decomposition (Theorem 4.8), which allows us to easily obtain an entire course. Hence, this result entails inherent simplifications at the implementation level. We then move to the analysis of the large sample behavior of the method. The stochastic convergence of the master method is here achieved under a fairly mild proviso. It is important to underscore that we relied on assumptions which are identical to the ones adopted by Solis and Wets [49] and by Esquível [17]. We also discuss how to make use of the master method in order to construct confidence intervals for the maximum, through an asymptotic result on extreme value theory due to de Haan [14].
A dimension reduction technique was also developed in order to achieve, with large computational savings, maximum likelihood estimates of the mixed model parameters and the variance components. The original maximum likelihood problem was reduced to a simplified problem which presents at least two main advantages: the number of variables in the simplified problem is considerably lower, and the domain of search of the simplified problem is a compact set. Whereas the former advantage avoids the so-called 'curse of dimensionality', the latter permits the use of the instances of the master method developed here.
The next section presents some topics on our research agenda.
6.2 Open Problems
In some of the preceding parts of the thesis, we ascertained new claims which have contributed to the state of the art. In this part of the thesis we put the emphasis on the new questions which came hand in hand with such claims. Below we trace a roadmap of future research directions. It is important to underscore that it is not our intention to elaborate here on these ideas; the exposition below is thus necessarily perfunctory.
• It would be interesting to extend de Haan's [14] method in order to make use of all the information contained in the iterative matrix Zr×c. The work of Dorea [15] can be a natural starting point to address such an issue. In fact, the proposal made in §4.5 relies directly on de Haan's result, and hence it only makes use of the first column of the iterative matrix. If the number of seeds is sufficiently large, so that some asymptotic results from extreme value theory can be invoked (see e.g. Galambos [20], pp. 111–119), only meager accuracy gains can be expected from using any further information. Notwithstanding, serious hindrances may arise if the number of seeds r is small. In fact, beyond being a waste of resources, it can cause some embarrassment in some cases. Our experience with the method leads us to believe that if the number of seeds is too small, then the optimal value yielded by the master method may lie outside the confidence interval built by the master method. It should be emphasized that such pathological cases only occur when we consider a small number of seeds. Specifically, this tends to occur more frequently when, besides r being small, the ratio r/(c − 1) is also small.
• It remains unanswered what the rate of convergence of the master method is in general, and of the stochastic zigzag method in particular. In this regard, a natural point of departure would be to reconsider and try to extend some results of Pflug [41] (see pg. 24) which are related to the rate of convergence of the pure random search method. The precursory analysis of Solis and Wets [49] concerning the rate of convergence of the conceptual algorithm can also be taken into account in the analysis of such a matter. It is our opinion that this issue should be carefully addressed in future inquiries. In fact, cognizance of the rate of convergence is pivotal in applications. Given the generality of the master method, it would also be important to point out which factors have an influence on the rate of convergence. For instance, can c have an effect on the rate of convergence of the master method? If so, which values of c can lead to a higher rate of convergence?
• Is it possible to establish the convergence of an extended master method which, instead of considering c as fixed, allows c to be defined through some stopping criteria of interest? For the sake of illustration consider the following sequence
z1, Z_{1,1}, . . . , Z_{1,τ1}, Z_{2,1}, . . . , Z_{2,τ2}, Z_{3,1}, . . . , Z_{3,τ3}, . . . (6.1)
where τ1, τ2, τ3, . . . denote finite stopping times. Several questions now arise. How robust is the framework developed so far to such an extended method? In what concerns the matrix formulation, we note that unless we consider the degenerate case wherein τ1 = τ2 = τ3 = · · · = c, it cannot be applied. It can, however, be adopted as a benchmark framework. In what regards the large sample behavior of the new method, we conjecture that, with the due adaptations, it may be possible to characterize its large sample behavior along the lines of what was done in §4.4. However, further inspection should be carried out before any conclusion can be drawn.
• It remains to inspect whether a dimension reduction technique of the type developed in the previous chapter is robust to other M-estimation methods as well.
• Last, we intend to consider the development of the logical grounds for the almost universal quantifier.
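To make the rate-of-convergence question above concrete, the following sketch estimates the empirical rate of pure random search on a smooth two-dimensional objective. The N^(−2/d) rate it checks is the classical one for nondegenerate smooth minima and is stated here as a background assumption, not as a result of this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pure random search on the 2-D sphere function f(x) = ||x||^2 over [-1, 1]^2:
# keep the best of N uniform draws. For a nondegenerate smooth minimum the
# expected optimality gap decays like N**(-2/d); here d = 2, so like 1/N.
d, reps = 2, 200
Ns = [100, 1_000, 10_000]
gaps = []
for N in Ns:
    best = [np.min(np.sum(rng.uniform(-1.0, 1.0, size=(N, d)) ** 2, axis=1))
            for _ in range(reps)]
    gaps.append(np.mean(best))   # average gap to the true minimum 0

slope = np.polyfit(np.log(Ns), np.log(gaps), 1)[0]   # log-log slope, about -2/d
```

A similar experiment, with the sphere replaced by an instance of the master method, could serve as a first empirical probe of the open question.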
Appendix A
A Short Note Regarding the Essential Supremum
This appendix includes some general results concerning the essential supremum. All the results which we state and prove here are valid for measurable functions. Additionally, these results also have an essential infimum counterpart. The exposition made below is largely based on the textbook of Capinski and Kopp [9]. Here we make use of the almost universal quantifier which was already introduced above (see §4.3, Definition 4.1).
We start by showing that if the essential supremum of a measurable function equals
−∞, then the function attains the essential supremum a.e.
Proposition A.1.
Suppose that f : Θ −→ R is a measurable function such that
ess sup_{x∈Θ} f(x) = −∞.
Then
f(θ) = −∞, for almost every θ ∈ Θ.
Proof. Just observe that, by the definition of the essential supremum and by assumption, we must have that
f(x) ≤ −n, for almost every x ∈ Θ, for every n ∈ N.
Since a countable union of null sets is null, it follows that f(θ) = −∞ for almost every θ ∈ Θ.
We now show that a measurable function cannot take values above its essential supremum, except on a set of null measure.
Proposition A.2.
Suppose that f : Θ −→ R is a measurable function. Then it holds that
f(θ) ≤ ess sup_{x∈Θ} f(x), for almost every θ ∈ Θ. (A.1)
Proof. Start by defining the following sequence of sets
A_n = {θ ∈ Θ : ess sup_{x∈Θ} f(x) < f(θ) − 1/n}, n ∈ N,
each of which has null measure, by the definition of the essential supremum. It can be easily shown that (A_n) is expansive, i.e., A_n ⊂ A_{n+1}. Hence
A := ⋃_{n=1}^{∞} A_n = {θ ∈ Θ : ess sup_{x∈Θ} f(x) < f(θ)}.
As a consequence of Boole's inequality it holds that
λ(A) = λ(⋃_{n=1}^{∞} A_n) ≤ Σ_{n=1}^{∞} λ(A_n) = 0,
from where the final result follows.
Observe that if we assume continuity of f, a stronger form of (A.1) can be established, wherein the almost universal quantifier is replaced by the universal quantifier. In fact, the same reasoning used in the proof of Theorem 4.5 can be used to establish that, if f is continuous, then
f(θ) ≤ ess sup_{x∈Θ} f(x), for all θ ∈ Θ.
Next we show how the essential supremum of a sum relates to the sum of the essential suprema.
Proposition A.3.
Suppose that f1 and f2 are measurable functions. Then
ess sup_{x∈Θ} (f1(x) + f2(x)) ≤ ess sup_{x∈Θ} f1(x) + ess sup_{x∈Θ} f2(x).
Proof. By Proposition A.2, we have that
f1(θ) + f2(θ) ≤ ess sup_{x∈Θ} f1(x) + ess sup_{x∈Θ} f2(x), for almost every θ ∈ Θ.
Consequently
ess sup_{x∈Θ} f1(x) + ess sup_{x∈Θ} f2(x) ∈ {t : f1(θ) + f2(θ) ≤ t, for almost every θ ∈ Θ},
which implies that
ess sup_{x∈Θ} f1(x) + ess sup_{x∈Θ} f2(x) ≥ inf{t : f1(θ) + f2(θ) ≤ t, for almost every θ ∈ Θ} = ess sup_{x∈Θ} (f1(x) + f2(x)).
Last, but not least, we show that the essential supremum is at most equal to the supremum.
Proposition A.4.
Suppose that f is a measurable function. Then it holds that
ess sup_{x∈Θ} f(x) ≤ sup_{x∈Θ} f(x).
Proof. If sup_{θ∈Θ} f(θ) = ∞, the proof is trivial. Suppose instead that sup_{θ∈Θ} f(θ) = S, where S is finite. Then it holds that
f(x) ≤ S, for all x ∈ Θ ⇒ f(x) ≤ S, for almost every x ∈ Θ
⇒ S ∈ {t : f(x) ≤ t, for almost every x ∈ Θ}
⇒ S ≥ inf{t : f(x) ≤ t, for almost every x ∈ Θ} = ess sup_{x∈Θ} f(x), (A.2)
from where the final result follows.
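A simple numerical illustration of why the inequality in Proposition A.4 can be strict: the maximum over a large i.i.d. uniform sample tracks the essential supremum, and never 'sees' a modification made on a null set. The particular function below is, of course, our own choice.

```python
import numpy as np

rng = np.random.default_rng(7)

# f equals sin(pi*x) on [0, 1], except for a spike on the null set {1/2}:
# sup f = 10, yet ess sup f = 1, so Proposition A.4 holds strictly here.
def f(x):
    return np.where(x == 0.5, 10.0, np.sin(np.pi * x))

x = rng.uniform(0.0, 1.0, size=1_000_000)
sampled_max = f(x).max()     # i.i.d. uniform draws hit {1/2} with probability 0
assert sampled_max <= 1.0    # tracks ess sup f = 1, not sup f = 10
assert f(np.array([0.5]))[0] == 10.0
```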
Appendix B
van der Corput’s Sublevel Set
Estimates
The goal of this appendix is to ponder over van der Corput's sublevel set estimate used in Chapter 4, a result due to Rogers [44]. This result is stated and proved below.
Theorem B.1. (van der Corput's sublevel set estimate)
Suppose that f : [a, b] → R is n times differentiable on ]a, b[, with n ≥ 1, and that |f^(n)(x)| ≥ ζ > 0. Then it holds that
λ({x ∈ [a, b] : |f(x)| ≤ τ}) ≤ (n! 2^{2n−1} τ/ζ)^{1/n}.
For a multidimensional version of this result see Carbery et al. [10]. Before we sketch a proof of Theorem B.1, we have to recall two keynote topics: a generalized version of Lagrange's theorem, and basic rudiments of Chebyshev polynomials. We start with the former, whose proof can be found elsewhere (e.g. [44]).
Theorem B.2. (Generalized Lagrange's Theorem)
Consider f : [a, b] → R, such that f is n times differentiable, with n ≥ 1, and consider points x0 < x1 < · · · < xn contained in ]a, b[. Then it holds that
∃ c ∈ ]a, b[ : f^(n)(c) = n! Σ_{j=0}^{n} (−1)^{j+n} f(x_j) / ∏_{k≠j} |x_k − x_j|.
Proof. See Rogers [44].
Now we bring to mind some rudimentary features of Chebyshev polynomials. Recall that the Chebyshev polynomial of order n is defined as T_n(x) = cos(n cos^{−1}(x)), for n ∈ N0 and x ∈ [−1, 1]. Note that T_n is indeed a polynomial of degree n, even though this may not appear obvious on a prima facie analysis (hence, for instance, T_1(x) = x, T_2(x) = 2x² − 1, and so on).
Elementary Properties of Chebyshev Polynomials
1. The term of degree n in T_n(x) has coefficient 2^{n−1}.
2. The extrema of the Chebyshev polynomial T_n(x) are attained at ρ_k = cos(kπ/n), for k = 0, 1, . . . , n.
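Both properties, and the derivative identity discussed next, can be checked numerically, e.g. with NumPy's Chebyshev utilities:

```python
import math
import numpy as np
from numpy.polynomial import chebyshev as C

n = 5
coeffs = C.cheb2poly([0] * n + [1])          # T_5 in the monomial basis
assert coeffs[-1] == 2 ** (n - 1)            # property 1: leading coefficient 2^(n-1)

rho = np.cos(np.arange(n + 1) * np.pi / n)   # property 2: extrema at cos(k*pi/n)
vals = np.polynomial.polynomial.polyval(rho, coeffs)
assert np.allclose(np.abs(vals), 1.0)        # |T_n| attains 1 at each extremum

dn = C.chebder([0] * n + [1], n)             # n-th derivative of T_n is constant
assert np.isclose(C.chebval(0.0, dn), math.factorial(n) * 2 ** (n - 1))
```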
Observe that it follows directly from property 1 that T_n^(n)(x) = n! 2^{n−1}. These and other elementary properties of Chebyshev polynomials can be found in introductory textbooks on numerical analysis (see [52], pp. 242–243). We are now in a position to give a proof of Theorem B.1. The proof given below mirrors the reasoning in [44].
Proof. (van der Corput’s sublevel set estimate)
Let O = x ∈ [a, b] : |f(x)| ≤ τ. The case wherein λ(O) = 0 is trivial. Hence, suppose
that λ(O) > 0. We start by mapping O to the interval O such that λ(O) = λ(O), and
which preserves the distance. Now, map O into the interval [−1, 1] by centering at the
origin and scaling by 2/λ(O). Through an application of Theorem B.2 to the Chebyshev
extrema it holds thatn∑j=0
∏k 6=j|ρk − ρj |−1 = 2n−1.
If we map back to O, there exists x0, . . . , xn ∈ O such that
n∑j=0
∏k 6=j|xk − xj |−1 ≤ 2n−1 2n
[λ(O)]n=
22n−1
[λ(O)]n. (B.1)
Consequently, it holds that
ζ ≤
∣∣∣∣∣∣n!
n∑j=0
(−1)j+nf(xj)∏k 6=j |xk − xj |
∣∣∣∣∣∣ ≤ n!n∑j=0
∏k 6=j|xk − xj |−1τ ≤ n!22n−1τ
[λ(O)]n. (B.2)
Some justifications regarding (B.2). From left to right: the first inequality holds by
assumption (|f (k)(x)| ≥ ζ > 0), and as a consequence of Theorem B.2; the second
inequality holds by triangular inequality and by assumption (|f(x)| ≤ τ); the last in-
equality is a consequence of the (B.1). The final result is now a consequence of inequality
(B.2).
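As a quick numerical sanity check of Theorem B.1 (not part of Rogers' argument), take f(x) = x² on [−1, 1], so that n = 2 and ζ = 2; the sublevel set has exact measure 2√τ, while the bound evaluates to 2√(2τ):

```python
import numpy as np
from math import factorial

# f(x) = x^2 on [-1, 1]: |f''| = 2 everywhere, so Theorem B.1 applies
# with n = 2 and zeta = 2; the exact sublevel measure is 2*sqrt(tau).
a, b, n, zeta = -1.0, 1.0, 2, 2.0
x = np.linspace(a, b, 2_000_001)
dx = (b - a) / (len(x) - 1)

for tau in (0.01, 0.1, 0.5):
    measure = np.count_nonzero(x**2 <= tau) * dx                  # grid estimate
    bound = (factorial(n) * 2 ** (2 * n - 1) * tau / zeta) ** (1.0 / n)
    assert measure <= bound       # 2*sqrt(tau) <= 2*sqrt(2*tau), as expected
```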
Appendix C
Construction of Confidence Intervals for the Minimum
C.1 Tables for de Haan’s Method
This appendix reports some computational experience with de Haan's [14] method. Additional details regarding the design of the simulation can be found in §4.5. The test functions used are classical in the literature; even so, for the sake of completeness, we include their functional forms here (§C.2). In Table C.1, we summarize useful information regarding the search domains used, as well as the global minima over the respective domains.
Test Function     Search Domain    m∗
Beale             [−4.5, 4.5]²     0
Easom             [−100, 100]²     −1
Griewank          [−600, 600]²     0
Rastrigin         [−5.12, 5.12]²   0
Rosenbrock        [−5, 10]²        0
Styblinski–Tang   [−8, 8]²         −78.33
Table C.1: Search domains of the test functions used and their corresponding global minimum value, denoted by m∗.
The next pages summarize the results obtained. We recall that the notation used below was already defined in foregoing chapters (formulas (4.30) and (4.31) from §4.5).
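The construction behind Ψn(p) can be sketched as follows. This is a reconstruction from de Haan's [14] limit law: for a smooth objective on a k-dimensional domain, the ratio (y(1) − m)/(y(2) − y(1)) of the two smallest of n uniform-design evaluations is asymptotically distributed with P(ratio > t) = (1 + t)^(−a), a = k/2. The exact notation of (4.30)–(4.31) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

def rastrigin(x):
    # 2-D Rastrigin test function; global minimum m = 0 at the origin
    return 10.0 * x.shape[-1] + np.sum(x**2 - 10.0 * np.cos(2 * np.pi * x), axis=-1)

# Evaluate the objective at n points drawn uniformly on the search domain
n, k = 10_000, 2
y = np.sort(rastrigin(rng.uniform(-5.12, 5.12, size=(n, k))))
y1, y2 = y[0], y[1]

# Inverting P(ratio > t) = (1 + t)^(-a) at level p, with a = k/2, gives the
# (1 - p) confidence interval  ] y1 - (y2 - y1)*(p**(-1/a) - 1) ; y1 [  for m.
a = k / 2.0
lowers = [y1 - (y2 - y1) * (p ** (-1.0 / a) - 1.0) for p in (0.10, 0.05, 0.01)]
```

Note how the interval widens as p decreases, matching the pattern of the tables below.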
Test Function     Ψn(p)
Beale             ]−0.0409 // −0.0918 // −0.4985; 0.0048[
Easom             ]−2.3266 // −4.5481 // −22.3196; −0.3273[
Griewank          ]−0.7654 // −1.8815 // −10.8104; 0.2391[
Rastrigin         ]−2.6359 // −6.1850 // −34.5774; 0.5582[
Rosenbrock        ]−0.5828 // −1.3078 // −7.1151; 0.0704[
Styblinski–Tang   ]−79.4701 // −80.8843 // −92.1976; −78.1974[
Table C.2: (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 10,000 observations.
Test Function     σ²LB(0.10)   σ²LB(0.05)   σ²LB(0.01)   σ²UB
Beale             0.0022       0.0096       0.2580       3e-05
Easom             5.9262       23.5227      585.4046     0.0908
Griewank          0.8137       3.4069       88.7540      0.0142
Rastrigin         8.8619       36.4579      936.5940     0.1605
Rosenbrock        0.4335       1.9305       52.4967      0.0049
Styblinski–Tang   1.5938       7.0524       191.2889     0.0196
Table C.3: Sample variances for the lower bounds and upper bound of the (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 10,000 observations.
Test Function     Ψn(p)
Beale             ]−0.0195 // −0.0440 // −0.2403; 0.0026[
Easom             ]−2.8362 // −5.4301 // −26.1817; −0.5016[
Griewank          ]−0.5536 // −1.3559 // −7.7745; 0.1685[
Rastrigin         ]−2.1453 // −4.8858 // −26.8100; 0.3212[
Rosenbrock        ]−0.2898 // −0.6521 // −3.5507; 0.0363[
Styblinski–Tang   ]−78.8997 // −79.6046 // −85.2518; −78.2634[
Table C.4: (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 20,000 observations.
Test Function     σ²LB(0.10)   σ²LB(0.05)   σ²LB(0.01)   σ²UB
Beale             5e-04        0.0023       0.0608       6e-06
Easom             4.8064       19.4017      490.4135     0.0841
Griewank          0.3970       1.6416       42.3840      0.0080
Rastrigin         5.1644       21.9313      577.5952     0.0855
Rosenbrock        0.1135       0.5022       13.6278      0.0015
Styblinski–Tang   0.4113       1.8156       49.1538      0.0049
Table C.5: Sample variances for the lower bounds and upper bound of the (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 20,000 observations.
Test Function     Ψn(p)
Beale             ]−0.0041 // −0.0092 // −0.0499; 5e-04[
Easom             ]−2.0988 // −3.4918 // −14.6353; −0.8452[
Griewank          ]−0.2619 // −0.6373 // −3.6409; 0.0760[
Rastrigin         ]−0.5316 // −1.1991 // −6.5390; 0.0691[
Rosenbrock        ]−0.0584 // −0.1310 // −0.7124; 0.007[
Styblinski–Tang   ]−78.4469 // −78.5898 // −79.7329; −78.3183[
Table C.6: (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 100,000 observations.
Test Function     σ²LB(0.10)   σ²LB(0.05)   σ²LB(0.01)   σ²UB
Beale             2e-06        1e-04        0.0026       2e-07
Easom             1.3030       5.6331       150.2689     0.0183
Griewank          0.0864       0.3614       9.4083       0.0015
Rastrigin         0.4265       1.8842       51.0218      0.0046
Rosenbrock        0.0043       0.0192       0.5227       5e-05
Styblinski–Tang   0.0167       0.0735       1.9868       2e-04
Table C.7: Sample variances for the lower bounds and upper bound of the (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 100,000 observations.
Test Function     Ψn(p)
Beale             ]−8e-04 // −0.0018 // −0.0097; 1e-04[
Easom             ]−1.2893 // −1.6514 // −4.5486; −0.9633[
Griewank          ]−0.1191 // −0.2883 // −1.6418; 0.0332[
Rastrigin         ]−0.0118 // −0.0265 // −0.1440; 0.0014[
Rosenbrock        ]−0.0110 // −0.0248 // −0.1356; 0.0015[
Styblinski–Tang   ]−78.3537 // −78.3807 // −78.5960; −78.3295[
Table C.8: (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 500,000 observations.
Test Function     σ²LB(0.10)   σ²LB(0.05)   σ²LB(0.01)   σ²UB
Beale             8e-07        4e-06        1e-04        8e-09
Easom             0.1045       0.4600       12.4314      0.014
Griewank          0.0183       0.0769       2.0045       3e-04
Rastrigin         2e-04        7e-04        0.0193       2e-06
Rosenbrock        2e-04        8e-04        0.204        2e-06
Styblinski–Tang   5e-04        0.0024       0.0643       8e-06
Table C.9: Sample variances for the lower bounds and upper bound of the (1 − p) Confidence Intervals for p = 0.10/0.05/0.01, based on 500,000 observations.
C.2 Functional Form of the Classical Test Functions
We now report the functional forms of the classical test functions used above.
• Beale function
L(x1, x2) = (1.5 − x1 + x1x2)² + (2.25 − x1 + x1x2²)² + (2.625 − x1 + x1x2³)².
• Easom function
L(x1, x2) = −cos(x1) cos(x2) exp(−(x1 − π)² − (x2 − π)²).
• Griewank function
L(x1, . . . , xk) = Σ_{i=1}^{k} x_i²/4000 − ∏_{i=1}^{k} cos(x_i/√i) + 1.
• Generalized Rastrigin function
L(x1, . . . , xk) = 10k + Σ_{i=1}^{k} (x_i² − 10 cos(2πx_i)).
• Rosenbrock function
L(x1, . . . , xk) = Σ_{i=1}^{k−1} [100(x_i² − x_{i+1})² + (x_i − 1)²].
• Styblinski–Tang function
L(x1, x2) = (1/2) [x1⁴ − 16x1² + 5x1 + x2⁴ − 16x2² + 5x2].
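For reference, a direct transcription of these test functions, checked at their well-known global minimizers; the minimizer locations are standard but are not stated in the text above.

```python
import numpy as np

def beale(x1, x2):
    return ((1.5 - x1 + x1*x2)**2 + (2.25 - x1 + x1*x2**2)**2
            + (2.625 - x1 + x1*x2**3)**2)

def easom(x1, x2):
    return -np.cos(x1)*np.cos(x2)*np.exp(-(x1 - np.pi)**2 - (x2 - np.pi)**2)

def griewank(x):
    x = np.asarray(x, dtype=float)
    i = np.arange(1, len(x) + 1)
    return np.sum(x**2 / 4000.0) - np.prod(np.cos(x / np.sqrt(i))) + 1.0

def rastrigin(x):
    x = np.asarray(x, dtype=float)
    return 10.0*len(x) + np.sum(x**2 - 10.0*np.cos(2*np.pi*x))

def rosenbrock(x):
    x = np.asarray(x, dtype=float)
    return np.sum(100.0*(x[:-1]**2 - x[1:])**2 + (x[:-1] - 1.0)**2)

def styblinski_tang(x1, x2):
    return 0.5*(x1**4 - 16*x1**2 + 5*x1 + x2**4 - 16*x2**2 + 5*x2)

# Values at the standard global minimizers agree with the m* column of Table C.1
assert np.isclose(beale(3.0, 0.5), 0.0)
assert np.isclose(easom(np.pi, np.pi), -1.0)
assert np.isclose(griewank([0.0, 0.0]), 0.0)
assert np.isclose(rastrigin([0.0, 0.0]), 0.0)
assert np.isclose(rosenbrock([1.0, 1.0]), 0.0)
assert np.isclose(styblinski_tang(-2.903534, -2.903534), -78.332, atol=1e-2)
```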
Bibliography
[1] Amemiya, T. (1985) Advanced Econometrics, Cambridge: Harvard University
Press.
[2] Andrews, D. (1997) "Estimation When a Parameter is on a Boundary of a Parameter Space: Part II," Mimeo, Yale University.
[3] Andrews, D. (1999) "Estimation When a Parameter is on a Boundary," Econometrica, 67, 1341–1383.
[4] Bazaraa, M., Sherali, H., and Shetty, C. (1992) Nonlinear Programming:
Theory and Algorithms, Hoboken: Wiley.
[5] Bohachevsky, I. O., Johnson, M. E., and Stein, M. L. (1986) “Generalized
Simulated Annealing for a Function Optimization,” Technometrics, 28, 209–217.
[6] Billingsley, P. Probability and Measure, New York: Wiley.
[7] Bierens, H. J. (2005) Introduction to the Mathematical and Statistical Foundations
of Econometrics, New York: Cambridge University Press.
[8] Booth, J., Casella, G., and Hobert, J. (2008) “Clustering using Objective
Functions and Stochastic Search,” Journal of the Royal Statistical Society, Ser. B,
70, 119–139.
[9] Capinski, M., Kopp, E. (1998) Measure, Integral and Probability, New York:
Springer.
[10] Carbery, A., Christ, M., and Wright, J. (2006) “Multidimensional Van Der
Corput and Sublevel Set Estimates,” Journal of the American Mathematical Society,
4, 981–1015.
[11] Carvalho, F., Oliveira, M., and Mexia, J. (2007) “Maximum Likelihood Es-
timator in Models with Commutative Orthogonal Block Structure,” Proceedings of
the 56th ISI Session, Lisbon, 223–226.
[12] Chernoff, H. (1954) “On the Distribution of the Likelihood Ratio,” Annals of
Mathematical Statistics, 25, 573–578.
[13] Christensen, R. (2002) Plane Answers to Complex Questions, New York:
Springer.
[14] de Haan, L. (1981) “Estimation of the Minimum of a Function Using Order Statis-
tics,” Journal of the American Statistical Association, 76, 467–469.
[15] Dorea, C. (1987) "Estimation of the Extreme Value and Extreme Points," Annals of the Institute of Statistical Mathematics, 39, 37–48.
[16] Duflo, M. (1996) Algorithmes Stochastiques, Berlin: Springer.
[17] Esquível, M. L. (2006) "A Conditional Gaussian Martingale Algorithm for Global
Optimization,” Lecture Notes in Computer Science, 3982, 813–823.
[18] Figueira, M. (1997) Fundamentos de Analise Infinitesimal, Textos de Matematica,
Universidade de Lisboa, Faculdade de Ciencias, Departamento de Matematica.
[19] Gan, L., Jiang, J. (1999) “A Test for a Global Optimum,” Journal of the Amer-
ican Statistical Association, 94, 847–854.
[20] Galambos, J. (1978) The Asymptotic Theory of Extreme Order Statistics, New York: Wiley.
[21] Goldberg, D. (1989) Genetic Algorithms in Search, Optimization, and Machine
Learning, Reading: Addison-Wesley.
[22] Hayashi, F. (2000) Econometrics, New Jersey: Princeton University Press.
[23] Harville, D. (1974) “Bayesian Inference for Variance Components Using Only
Errors Contrasts,” Biometrika, 61, 383–385.
[24] Judge, G., Hill, R., Griffiths, W., Lutkepohl, H., and Lee, T.-C. (1985)
The Theory and Practice of Econometrics, New York: Wiley.
[25] Ho, Y.-C., Pepyne, D. L. (2002) “Simple Explanation of the No Free Lunch
Theorem and its Implications,” Journal of Optimization Theory and Applications,
3, 549–570.
[26] Kelley, C. T. (1999) Iterative Methods for Optimization, Philadelphia: SIAM.
[27] Kendall, M. (1961) A Course in the Geometry of n Dimensions, New York:
Hafner.
[28] Khuri, A. (2003) Advanced Calculus with Applications in Statistics, Hoboken: Wiley.
[29] Khuri, A., Mathew, T., and Sinha, B. (1998) Statistical Tests for Mixed Linear
Models, New York: Wiley.
[30] LeCam, L. (1953) “On Some Asymptotic Properties of Maximum Likelihood Es-
timates and Related Bayes Estimates,” University of California Publications in
Statistics, 1, 277–328.
[31] Luenberger, D. G. (1969) Optimization by Vector Space Methods, New York:
Wiley.
[32] Mexia, J., Pereira, D., and Baeta, J. (1999) “L2 Environmental Indexes,”
Listy Biometryczne—Biometrical Letters, 36, 2, 137–143.
[33] Mexia, J., Corte Real, P. (2001) “Strong Law of Large Numbers for Additive
Extremum Estimators,” Discussiones Mathematicae, Probability and Statistics, 21,
81–88.
[34] Mexia, J., Corte Real, P. (2003) “Compact Hypothesis and Extremal Set Es-
timators,” Discussiones Mathematicae, Probability and Statistics, 23, 103–121.
[35] Newey, W., McFadden, D. (1994) “Large Sample Estimation and Hypothesis
Testing,” Handbook of Econometrics, 4, 2111–2245.
[36] Nocedal, J., Wright, S. (1999) Numerical Optimization, New York: Springer.
[37] Nunes, S., Mexia, J., and Minder, C. (2004) “Logit Model for Tuberculosis
Incidence in Europe (1995–2000). Analysis by Sex and Age Group,” Colloquium
Biometryczne, 34, 147–159.
[38] Oliveira, M., Nunes, S., Ramos, L., and Mexia, J. (2006) “Ajustamento de
Modelos Espacio-Temporais para a Sida Utilizando Mınimos Quadrados Estrutu-
rados,” Actas do XII Congresso da Sociedade Portuguesa de Estatıstica, 519–526.
[39] Øksendal, B. (1998) Stochastic Differential Equations: An Introduction with Applications, Berlin: Springer.
[40] Pakes, A., McGuire, P. (2001) “Stochastic Algorithms, Symmetric Markov Per-
fect Equilibrium and the Curse of Dimensionality,” Econometrica, 69, 1261–1282.
[41] Pflug, G. Ch. (1996) Optimization of Stochastic Models: The Interface Between
Simulation and Optimization, Boston: Kluwer.
[42] Rao, C., Rao, M. (1998) Matrix Algebra and its Applications to Statistics and
Econometrics, Singapore: World Scientific Publishing.
[43] Renyi, A. (1970) Probability Theory, Amsterdam: Elsevier.
[44] Rogers, K. M. (2005) “Sharp Van Der Corput Estimates and Minimal Divided
Differences,” Proceedings of the American Mathematical Society, 133, 3543–3550.
[45] Ross, S. (1996) Stochastic Processes, New York: Wiley.
[46] Sen, P. K., Singer, J. M. (1993) Large Sample Methods in Statistics, An Intro-
duction with Applications, Boca Raton: Chapman & Hall.
[47] Searle, S., Casella, G., and McCulloch, C. (1992) Variance Components, New York: Wiley.
[48] Shao, J. (2005) Mathematical Statistics, New York: Springer.
[49] Solis, F. J., Wets, R. J-B. (1981) “Minimization by Random Search Tech-
niques,” Mathematics of Operations Research, 6, 19–30.
[50] Spall, J. C. (2003) Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control, Hoboken: Wiley.
[51] Stram, D. O., Lee J. W. (1994) “Variance Components Testing in the Longitu-
dinal Mixed Effects Model,” Biometrics, 50, 1171–1177.
[52] Suli, E., Mayers, D. (2003) An Introduction to Numerical Analysis, Cambridge:
Cambridge University Press.
[53] Veall, M. R. (1990) “Testing for a Global Minimum in an Econometric Context,”
Econometrica, 58, 1459–1465.
[54] van der Vaart, A. W., Wellner, J. A. (1996) Weak Convergence and Empir-
ical Processes with Applications to Statistics, New York: Springer.
[55] van der Vaart, A. W. (1998) Asymptotic Statistics, New York: Cambridge University Press.
[56] Williams, D. (1991) Probability with Martingales, New York: Cambridge Univer-
sity Press.
[57] Wald, A. (1949) “Note on the Consistency of the Maximum Likelihood Estimate,”
Annals of Mathematical Statistics, 20, 595–601.
[58] Wolpert, D. H., Macready, W. G. (1997) “No Free Lunch Theorems for Op-
timization”, IEEE Transactions on Evolutionary Computation, 1, 67–82.