
Sampling from constrained domains

Carla Groenland

June 27, 2014

Bachelor thesis (18EC)

Supervisor: Taco Cohen

Bacheloropleiding Kunstmatige Intelligentie

Faculteit der Natuurwetenschappen, Wiskunde en Informatica

Universiteit van Amsterdam


Abstract

In this bachelor thesis, various sampling methods for a particular constrained domain are evaluated. When applying the Expectation Maximization (EM) algorithm in order to learn representations of SO(3), an integral is approximated using the Monte Carlo method. For this purpose, the need arises to sample from an unnormalized density on SO(3), which is a constrained domain when embedded in Euclidean space. Various sampling methods have been implemented and evaluated based on their approximations of the EM objective. A basic importance sampling algorithm appears to work best; however, this method has the disadvantages that it does not produce samples from the required distribution and that a large number of samples is required for a good approximation. The evaluated Markov chain Monte Carlo methods, on the other hand, are able to give reasonable approximations using a small number of samples, but require much more time for generating samples and give worse approximations overall. Furthermore, a new technique for lifting constraints is proposed. It is derived under which conditions samples from a density on an easier domain can be transformed to samples from the required density. This technique is applied to the problem of sampling from SO(3) by proving for a certain case that samples can be transformed from R^4 to S^3. The theory is also shown to work experimentally by applying the same sampling method to sample from both domains and comparing their estimates of an EM objective.

Title: Sampling from constrained domains
Author: Carla Groenland, 10208429
Supervisor: Taco Cohen, Machine Learning Group
Second grader: dr. Leo Dorst, Intelligent Systems Lab Amsterdam
Date: June 27, 2014

University of Amsterdam
Science Park 904, 1098 XH Amsterdam


Contents

1 Introduction
  1.1 Literature review
  1.2 Research question
  1.3 Method and overview
  1.4 Acknowledgements

2 Context
  2.1 Lie groups and irreducible representations
  2.2 Learning representations of Lie groups
    2.2.1 Toroidal Subgroup Analysis
    2.2.2 Computational framework for SO(3)

3 Sampling methods
  3.1 Definition of sampling methods
  3.2 Existing sampling methods
    3.2.1 Markov chains
    3.2.2 Metropolis algorithm
    3.2.3 Hybrid Monte Carlo
    3.2.4 Spherical HMC
  3.3 Investigating new sampling methods
    3.3.1 Conditions for changing domains
    3.3.2 Projection sampling
    3.3.3 Manifold sampling
    3.3.4 Variations on Gibbs sampling

4 Experiments
  4.1 Implemented sampling methods
    4.1.1 Existing methods
    4.1.2 Discretized Gibbs sampling on R^4
  4.2 Evaluation method
    4.2.1 Problems with the normalization constant
    4.2.2 Problems with HMC
  4.3 Results

5 Conclusion

Bibliography


1 Introduction

Research on sampling methods is conducted in a wide range of fields including artificial intelligence, econometrics, physics, statistics and computational chemistry. The necessity for sampling methods arises most frequently in artificial intelligence when a computationally intractable integral is approximated using the Monte Carlo method, that is,

∫ f(z) p(z) dz ≈ (1/L) ∑_{i=1}^{L} f(x^{(i)})

where x^{(1)}, . . . , x^{(L)} are sampled from the probability density function (pdf) p. Many formulas, for example expectations, can be written in the form ∫ f(z) p(z) dz. Monte Carlo methods tend to outperform numeric integration when approximating integrals on high-dimensional domains [3, Chapter 11].
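As a minimal illustration of this approximation (the density p, the function f and all names below are chosen purely for this sketch and do not come from the thesis), one can estimate ∫ f(z) p(z) dz for f(z) = z^2 with p a standard normal density, whose exact value is 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw L samples x^(1), ..., x^(L) from p (here: a standard normal)
# and approximate the integral of f(z) p(z) dz by the sample mean of f.
L = 100_000
samples = rng.standard_normal(L)
estimate = np.mean(samples ** 2)  # true value: E[Z^2] = Var(Z) = 1
```

With 100,000 samples the estimate typically lands within a few percent of the true value, illustrating the 1/√L convergence of the Monte Carlo method.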

Which sampling method works best is usually highly dependent on the specifics of the problem, such as the dimensionality of the domain or whether the gradient of the pdf is known. In this thesis, several sampling methods are evaluated on a particular problem in computer vision which arose when learning representations over the Lie group SO(3), in the attempt to continue the research initiated by Cohen and Welling [7]. An integral of the form ∫ f(z) p(z) dz is encountered in the EM algorithm, and this leads to the problem of sampling from SO(3), which is an instance of a manifold (it is a Lie group) containing the rotations of three-dimensional space. The domain is constrained when embedded in Euclidean space and many sampling methods are incapable of handling such constraints.

1.1 Literature review

Much research has been conducted on Markov chain Monte Carlo (MCMC) sampling methods, where a Markov chain is constructed which has the desired distribution as its stationary distribution [19]. The first proposed MCMC method was the Metropolis algorithm [13], which was created in 1953 and improved by Hastings in 1970 [10]. Gibbs sampling [9] is a special case of the Metropolis-Hastings algorithm which works with the univariate conditional distributions. An existing variation on this method updates blocks of variables simultaneously instead of updating all variables separately [11].

Hamiltonian Monte Carlo (also called Hybrid Monte Carlo or HMC) is a class of MCMC methods which alternate between Metropolis updates (using a state proposal computed according to Hamiltonian dynamics) and updates of auxiliary momentum variables. The equations of Hamiltonian dynamics are difficult to solve analytically and are therefore approximated by discretizing time. The leapfrog method is commonly used for this approximation [14]. The HMC methods improve the convergence rate of the Markov chains due to their ability to propose long-distance moves which retain a high acceptance probability.

Several approaches to handling constraints by using an adjusted HMC sampling method have been proposed. Neal [14] discusses a modified HMC method which enables the sampler to remain within prescribed boundaries due to an additional term in the potential energy which tends to infinity as the sampler approaches the boundaries. Spherical HMC [12] is able to handle constraints of the form ‖x‖_q ≤ C by transforming such a domain into a sphere S^n and sampling from this by splitting the Hamiltonian dynamics in such a way that both parts can be solved analytically.

Finally, the possibilities of applying HMC sampling on manifolds and Hilbert spaces have recently been investigated [5, 2]. Much research focuses on defining the Hamiltonian on such domains and explicitly adjusting the standard HMC algorithm for use on certain types of manifolds or Hilbert spaces.

1.2 Research question

The main goal of the project is to support the research on learning representations of SO(3) by implementing a sampling method which can be used to approximate a certain expectation encountered in the EM algorithm. The research question is "Which sampling method performs best when integrated in the EM algorithm used for the Lie group SO(3)?". The question is divided into three parts:

1. What kind of domain do we need to sample from and which restrictions do we need to take into account? Which existing methods can be applied naturally?

2. Which new methods can be used for this task? What are their properties, and can these be generalized to other domains?

3. Which of the new methods and existing methods performs best?

In particular, the feasibility of facilitating sampling from difficult domains by transforming samples taken from more agreeable domains is also examined.

1.3 Method and overview

In the chapter Context, the question What kind of domain do we need to sample from and which restrictions do we need to take into account? will be addressed by explaining the computational framework the sampling methods need to be integrated in and how the elements of SO(3) can be parametrized in this framework. Furthermore, the concepts "Lie group" and "fully reduced representation" will be explained, and motivation for the research described in this thesis will be given by explaining why learning representations over SO(3) is promising for research in computer vision and machine learning.

The chapter Sampling methods consists of three main parts:


• Definition of sampling. The intuition behind sampling and the precise mathematical definition are given.

• Existing sampling methods. Several basic methods are explained, as well as the more recent Spherical HMC method [12], in order to respond to the question Which existing methods can be applied naturally? The examination of why existing methods work assists the investigation of new sampling methods.

• Investigating new sampling methods. The results of the theoretical research conducted in order to address Which new methods can be used for this task? What are their properties and can these be generalized to other domains? are outlined. Failed attempts are stated, as well as the origin of their failure. General conditions are derived for sampling from difficult domains by transforming samples taken from more agreeable domains. It is shown how this approach enables more existing sampling methods to be applied to the task at hand.

In order to answer Which of the new methods and existing methods performs best?, various sampling methods are implemented and evaluated. In the chapter Experiments, the details of the implementation of these sampling methods and the experiments conducted to evaluate the methods are given, and the results are discussed.

1.4 Acknowledgements

I would like to thank my supervisor Taco Cohen for answering my questions, motivating me with his enthusiasm and for all his helpful suggestions. I am thankful to dr. Leo Dorst for agreeing to be second grader and to dr. Shiwei Lan for answering my questions about Spherical HMC and sending me the required adjustments I had to make. Finally, I would like to thank Josha Box, Wieger Hinderks and dr. Bert van Es for proofreading several theoretical parts of the thesis and for the discussions I have had with them.


2 Context

In this chapter, the approach described in [7] will be explained, together with the background needed to understand the approach and how to extend it. This is important for understanding the context of the sampling problem, which gives a purpose to the research described in this thesis and also helps clarify what needs to be done.

First of all, the notions of a Lie group and an irreducible representation will be explained, based on [8]. After that, the approach of [7] will be outlined and extensions of the approach will be discussed, where certain definitions from [3] are added for completeness.

2.1 Lie groups and irreducible representations

Group theory is a large field of study in mathematics. Before a Lie group can be introduced, the notions of a group and of a differentiable manifold (a notion from differential geometry) have to be introduced first.

Definition 2.1. A group G is a set of elements together with an operation ∘ : G × G → G and an element e ∈ G such that

• (x ∘ y) ∘ z = x ∘ (y ∘ z) for all x, y, z ∈ G;

• x ∘ e = x = e ∘ x for all x ∈ G;

• for all x ∈ G, there exists an x^{-1} ∈ G such that x ∘ x^{-1} = e = x^{-1} ∘ x.

It follows directly that this inverse x^{-1} is unique. Instead of x ∘ y, notations such as x + y, xy or x · y are also used.

Example 2.2. Many well-known sets form groups, for example the invertible real (n × n)-matrices, the orthogonal transformations and the rotations in 2D. The real numbers also form a group under addition, and R\{0} forms a group under multiplication.
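As a quick numerical illustration of these axioms (not part of the thesis; the helper `rot` and the chosen angles are purely illustrative), the 2D rotations can be checked to satisfy the group laws, with matrix multiplication as the operation, the identity matrix as e, and rot(−θ) as inverse:

```python
import numpy as np

def rot(theta):
    # 2D rotation matrix, an element of the rotation group in 2D
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

a, b, c = rot(0.7), rot(-1.2), rot(2.5)

closure = np.allclose(a @ b, rot(0.7 - 1.2))      # product is again a rotation
assoc = np.allclose((a @ b) @ c, a @ (b @ c))     # associativity
identity = np.allclose(a @ rot(0.0), a)           # e = rot(0)
inverse = np.allclose(a @ rot(-0.7), np.eye(2))   # x^{-1} = rot(-theta)
```

All four checks succeed, in line with the fact that the 2D rotations form a group (in fact a Lie group, as discussed below).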

The definition of a differentiable manifold is a bit more involved: it requires first introducing the notions of topology, open set, homeomorphism, second countable and Hausdorff space. The intuition behind the definition is that a manifold is locally a Euclidean space, meaning that when you zoom in far enough and adjust the space slightly, you see R^n for a certain n ∈ N. This means that R^n is an example of a manifold (to be precise, when R^n is taken with the Euclidean topology). It follows that the n × n-matrices form a manifold, since M_{n×n}(R) ≅ R^{n^2} by putting all matrix elements in a vector (if one defines a topology on M_{n×n}(R) using this identification). A Lie group is a combination of a group and a manifold.


Definition 2.3. A Lie group is a group that is also a differentiable manifold, such that the group operations are given by smooth maps.

The definition above means that the map (x, y) ↦ xy^{-1} must be C^∞ with respect to the topology on the Lie group as a differentiable manifold.

Example 2.4. Using the regular value theorem it follows that the set of invertible matrices GL_n(R) is also a manifold. Similar techniques show that the special orthogonal group SO(n) is a manifold as well. This gives us the first examples of Lie groups: SO(n) and GL_n(R) form Lie groups for all n ∈ N. The group operation is given by matrix multiplication, and one needs to verify that taking the inverse and multiplication are smooth operations in the topology induced on these manifolds as subspaces of M_{n×n}(R).

Since Lie groups have a topology defined on them, compact Lie groups can be defined. Compact sets have nice properties (such as being bounded), making them more tractable for computations.

A group representation is a mathematical notion which is abbreviated to a representation in this thesis.

Definition 2.5. A representation of a group (G, ∘_G) over a field k is a k-vector space V together with a group homomorphism ρ : G → GL(V), that is,

ρ(g) ∘ ρ(h) = ρ(g ∘_G h)  (for each g, h ∈ G)

and for each group element g ∈ G, ρ(g) : V → V is a linear and invertible map.

Example 2.6. A possible choice of field is k = R, and V = R^n can then be chosen as vector space over k. In this case GL(V) = GL_n(R) simply consists of the invertible matrices. The composition is given by matrix multiplication.

Seeing the elements of GL(V) as (invertible) matrices, a representation can be seen as a linear assignment of matrices to group elements ("linear" when "+" is seen as multiplication in the group).

One important property that a representation can have is called irreducibility.

Definition 2.7. A non-zero representation (ρ, V) of G is called irreducible if V has no non-trivial subrepresentations, that is, if W is a subspace of V such that ρ(g)(W) ⊂ W for all g ∈ G, then W = V or W = 0.

Example 2.8. A vector space V has only 0 and itself as subspaces if V is one-dimensional, which implies that all one-dimensional representations are irreducible.

The word irreducibility is used often in mathematics to denote elements which are in some sense primitive. An irreducible representation cannot be decomposed into smaller elements in some sense, and a logical next question for a mathematician is when a representation can be decomposed into irreducible representations (and whether there exists a unique way to do so). If this is possible, the representation is called completely reducible, fully reducible or semisimple.


Definition 2.9. A completely reducible representation of G is the direct sum of irreducible representations of G.

In the language of matrices, the matrices ρ(g) have a block diagonal form (with each block corresponding to an irreducible representation) if a representation is completely reducible.

The last notion of representation theory that will be required is that of an isomorphism of representations. This explains when two representations can be seen as equivalent.

Definition 2.10. Let (V_1, ρ_1) and (V_2, ρ_2) be two representations of a group G. An isomorphism of representations is an invertible, linear map φ : V_1 → V_2 such that φ(ρ_1(g)v) = ρ_2(g)φ(v) for all g ∈ G, v ∈ V_1.

This definition is equivalent to saying that the representations are related by a change of basis. Namely, let φ : V_1 → V_2 be an isomorphism of G-representations; then φ is a linear map such that φρ_1(g) = ρ_2(g)φ, or φρ_1(g)φ^{-1} = ρ_2(g), for all g ∈ G.

If a subspace U of a vector space V and its orthogonal complement U^⊥ are both invariant under G, then we can write

W ρ(g) W^{-1} = ( ρ_U(g)       0
                  0            ρ_{U^⊥}(g) )

for a certain matrix W which performs a basis change. The map corresponding to W is then an isomorphism of representations between ρ and ρ_U ⊕ ρ_{U^⊥}. If a representation is isomorphic to a completely reducible representation, then we will also call it completely reducible. All group representations of compact groups have this property.

2.2 Learning representations of Lie groups

The computational framework for learning fully reduced representations of SO(2) is explained in [6] and [7]. In this section the motivation for learning representations over SO(2), the chosen approach and the prospects of extending the approach to SO(3) are summarized. Finally, the computational framework in which the sampling methods will be implemented is described, and certain formulas that will become important later on are derived.

When a picture is rotated, the picture still depicts the same thing. Therefore, it is very useful in computer vision and machine learning to be able to compute features of an image which are invariant under rotation or other symmetries. In order to be able to speak of equivalence of images and individual properties of an image in a language, two insights of Felix Klein and Hermann Weyl are applied. The motivation behind learning irreducible representations of Lie groups is based on these two principles.

Klein tried to apply group theory to geometry by considering two figures equivalent if they are related by an element of the Euclidean group (which contains the rigid body motions). This can be extended by exchanging the Euclidean group for any other group of transformations, for example SO(2), the rotations in 2D. The chosen group is referred to as the symmetry group (because it describes the symmetries of a system) and will be taken to be a Lie group.

Besides features which are invariant (or covariant) under certain transformations, it is also useful to be able to put labels on them. The features need to have separate or independent meanings with respect to the other features. Weyl's principle helps in finding such independent, elementary components; his principle is that "the elementary components of a system are the irreducible representations of the symmetry group of the system" [7, page 3].

Hence, in order to learn the elementary components of a system, the symmetry group acting on the system needs to be known and the irreducible representations of this group need to be learned. A first step in this direction has been taken by proposing a probabilistic framework for learning fully reduced representations of the Lie group SO(2). This Lie group is a good starting point, since it is both commutative and compact, which makes it easy to work with. Since it is compact, all representations will be fully reducible, that is, they can be decomposed into irreducible representations. Furthermore, all irreducible representations are of complex dimension one (and hence real dimension two) due to the commutativity of SO(2). These irreducible representations are uniquely given by the weight of the representation, which is the number n ∈ N for which

R(θ) = exp( nθ ( 0  −1
                 1   0 ) ).
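This weight formula can be checked numerically. In the sketch below (illustrative only; `expm_series` is a hypothetical helper that approximates the matrix exponential by a truncated Taylor series), exp(nθJ) is verified to equal the rotation by nθ, and the homomorphism property R(θ_1)R(θ_2) = R(θ_1 + θ_2) is verified as well:

```python
import numpy as np

J = np.array([[0.0, -1.0], [1.0, 0.0]])

def expm_series(A, terms=40):
    # Truncated Taylor series for the matrix exponential (hypothetical helper).
    out, term = np.eye(2), np.eye(2)
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

def R(theta, n):
    # The weight-n irreducible representation of SO(2): exp(n*theta*J).
    return expm_series(n * theta * J)

n, theta = 2, 0.3
rotation = np.array([[np.cos(n * theta), -np.sin(n * theta)],
                     [np.sin(n * theta),  np.cos(n * theta)]])
is_rotation = np.allclose(R(theta, n), rotation)          # exp(n*theta*J) = rot(n*theta)
is_hom = np.allclose(R(0.3, n) @ R(0.5, n), R(0.8, n))    # homomorphism property
```

Both checks hold: the weight-n representation winds around the circle n times as θ traverses [0, 2π).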

2.2.1 Toroidal Subgroup Analysis

All finite-dimensional irreducible representations of SO(2)^J for J ∈ N are given by a (tensor) product of irreducible representations of SO(2). The algorithm for learning irreducible representations of SO(2), called Toroidal Subgroup Analysis (TSA), is able to learn representations of compact commutative subgroups of SO(n) for n ∈ N, called toroidal subgroups. All elements of such a toroidal group can be written in the form ρ_φ = W R(φ) W^t for W an orthogonal matrix and R(φ) a block-diagonal matrix with blocks of the form

R(φ_j) = ( cos(φ_j)  −sin(φ_j)
           sin(φ_j)   cos(φ_j) )

on the diagonal. Such a toroidal group can hence be parametrized by φ = (φ_1, . . . , φ_J) for some J ∈ N, and the name "toroidal" arises from the fact that all these components are periodic. This also gives an identification with SO(2)^J.

The probabilistic model relates each data pair (x, y) by a transformation with additional Gaussian noise, that is,

p(y|x, φ) = N(y | ρ_φ x, σ^2)  or  y = ρ_φ x + ε

for ε ∼ N(0, σ^2). The components φ_j are assumed to be marginally independent and von-Mises distributed. The definition of the von-Mises distribution requires Bessel functions, which are the canonical solutions y(x) of Bessel's differential equation

x^2 (d^2 y/dx^2) + x (dy/dx) + (x^2 − α^2) y = 0,

where α ∈ C is called the order of the Bessel function. Let I_0 be the modified Bessel function of order 0; then

p(φ_j) = exp(κ_j cos(φ_j − µ_j)) / (2π I_0(κ_j)),

where the mean µ and the precision κ are parameters of the von-Mises distribution. The von-Mises distribution is an example of an exponential family.

Definition 2.11. The exponential family of distributions over x, given parameters η, is defined to be the set of distributions of the form

p(x|η) = h(x) g(η) exp(η^t u(x)).

In this definition, η are called the natural parameters of the distribution. The "coefficient" g(η) ensures normalization. The von-Mises distribution can be expressed in natural parameters as

p(φ_j) = exp(η_j^t T(φ_j)) / (2π I_0(‖η_j‖))

for T(φ_j) = (cos(φ_j), sin(φ_j))^t the sufficient statistics.

Definition 2.12. A statistic T(X) is sufficient for the underlying parameter θ if P(X = x | T(X) = t, θ) = P(X = x | T(X) = t).
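As a sanity check (illustrative, not taken from the thesis code), the standard and natural-parameter forms of the von-Mises density can be compared numerically, and the normalization can be verified, using NumPy's modified Bessel function `np.i0`:

```python
import numpy as np

mu, kappa = 0.5, 2.0  # arbitrary illustrative parameter values

def vonmises(phi):
    # Standard form: exp(kappa*cos(phi - mu)) / (2*pi*I0(kappa)).
    return np.exp(kappa * np.cos(phi - mu)) / (2 * np.pi * np.i0(kappa))

def vonmises_natural(phi):
    # Natural-parameter form: eta = kappa*(cos mu, sin mu), T(phi) = (cos phi, sin phi),
    # so that eta^t T(phi) = kappa*cos(phi - mu) and ||eta|| = kappa.
    eta = kappa * np.array([np.cos(mu), np.sin(mu)])
    T = np.array([np.cos(phi), np.sin(phi)])
    return np.exp(eta @ T) / (2 * np.pi * np.i0(np.linalg.norm(eta)))

# Equally spaced grid over one period; the rectangle rule is highly accurate
# for smooth periodic integrands.
N = 4096
phi = -np.pi + 2 * np.pi * np.arange(N) / N
forms_agree = np.allclose(vonmises(phi), vonmises_natural(phi))
total_mass = np.mean(vonmises(phi)) * 2 * np.pi  # should be close to 1
```

The two parameterizations agree pointwise and the density integrates to one over a period, as the conjugacy arguments below rely on.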

A very important result shown in [7] is that the posterior p(φ|x, y) is again a product of von-Mises distributions; this is called a conjugacy relation.

Definition 2.13. A family F of prior distributions p(θ) is conjugate to a likelihood p(D|θ) if the resulting posterior p(θ|D) ∈ F.

The advantage of such a conjugacy relation is that the model can be updated iteratively, incorporating newly acquired data as it arrives. A similar conjugacy relation can be found for the coupled model, where the data is related by a transformation of the form

ρ_φ = W ( R(ω_1 φ)
                    . . .
                              R(ω_J φ) ) W^t

for ω_1, . . . , ω_J the weights of the irreducible representations mentioned previously. In this case, the generalized von-Mises is the right prior. The other model is incorrect in representing 2D image rotations due to the fact that all components of φ can separately turn the invariant subspaces (they are "uncoupled"), but that model appears to be easier to learn.

In both models, φ are latent variables, and the orthogonal matrix W is what needs to be learned (along with the weights ω_i for the uncoupled model). The learning of these parameters is done by maximizing the marginal likelihood. This can be done using the expectation maximization (EM) algorithm, where in the first step

Q(θ|θ^t) = E_{φ|x,y,θ^t}[log p_θ(y, φ|x)] = ∫ [log p_θ(y, φ|x)] p_{θ^t}(φ|x, y) dφ

is computed, and in the second step θ^{t+1} = argmax_θ Q(θ|θ^t) is chosen. Explicit formulas for Q(θ|θ^t) and its gradient are found for the TSA model. This is exceptional: it can be expected that such explicit formulas cannot be found for other Lie groups and that approximation techniques are required.

After learning the matrix W and computing u = W^t x and v = W^t y, the manifold distance between x and y can be computed as

d(x, y) = min_φ ‖y − ρ_φ x‖^2 = ∑_j ‖v_j − R(µ_j) u_j‖^2

for µ_j the mean of p(φ_j | x, y, κ_j = 0). This gives the distance between the orbits of x and y under the group {ρ_φ}_φ, such that elements from the same orbit have zero distance to each other. Such a distance makes the idea that "rotated pictures should be seen as the same picture" more concrete: all elements in the same orbit (that is, those that are related by a certain type of transformation, for example a rotation) can be seen as equivalent. In a machine learning task, this can be useful for labeling an entire orbit of elements at once. Experiments have shown that manifold distance is able to filter out variation by image rotation almost completely in a nearest neighbour classification task.
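The block-wise formula d(x, y) = ∑_j ‖v_j − R(µ_j)u_j‖^2 can be sketched in code. The setting below is hypothetical (J = 2 invariant 2D subspaces in a 4-dimensional space, with W taken to be the identity so that u = x and v = y); it only illustrates that elements of the same orbit get distance zero:

```python
import numpy as np

def rot2(t):
    # 2D rotation block R(t).
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s], [s, c]])

def manifold_distance_sq(u, v, mus):
    # Squared manifold distance: sum over 2D blocks of ||v_j - R(mu_j) u_j||^2.
    return sum(np.sum((v[2*j:2*j + 2] - rot2(m) @ u[2*j:2*j + 2]) ** 2)
               for j, m in enumerate(mus))

u = np.array([1.0, 0.0, 0.3, -0.2])
mus = [0.9, -0.4]  # hypothetical posterior means mu_j
# Build v by rotating each block of u by exactly mu_j: same orbit, distance 0.
v = np.concatenate([rot2(mus[j]) @ u[2*j:2*j + 2] for j in range(2)])
d_same_orbit = manifold_distance_sq(u, v, mus)       # ~ 0
d_off_orbit = manifold_distance_sq(u, v + 0.1, mus)  # > 0 once y leaves the orbit
```

This matches the interpretation in the text: the distance vanishes exactly on the orbit of x under {ρ_φ}_φ and grows as y moves away from it.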

2.2.2 Computational framework for SO(3)

The approach described in the previous section can be extended to more Lie groups. However, non-commutative groups have worse tractability properties. Certain formulas that could be found in closed form when learning distributions over representations of toroidal subgroups may be hard to compute analytically in the case of non-commutative Lie groups. Furthermore, the domain may be very high-dimensional, such that most numerical methods do not work well in general.

An example of a formula that is hard to compute analytically can already be found for the Lie group SO(3). The set-up for learning representations can be done in a similar fashion as has been done for SO(2), and a formula of the form

E_{φ|x,y}[log p(y, φ|x)] = ∫ [log p(y, φ|x)] p(φ|x, y) dφ

is encountered again in the EM algorithm. For toroidal subgroups, the integral above could be solved analytically, but the integral for SO(3) appears to be more difficult. A standard solution for approximating such integrals, which also generalizes well to higher-dimensional domains, is the Monte Carlo method, where φ^{(i)} ∼ p(φ|x, y) are sampled for i = 1, . . . , L and the integral is approximated as

(1/L) ∑_{i=1}^{L} log p(y, φ^{(i)}|x).

For each data pair (x, y), samples φ^{(i)} ∼ p(φ|x, y) are required. The Monte Carlo method and sampling methods will be discussed in the next chapter.

A computational disadvantage of the non-commutativity of SO(3) is that SO(3) has irreducible representations of various dimensions. Since SO(3) is compact, all representations can again be written as a direct sum of irreducible representations, but now these irreducible representations may have any (complex) dimension 2l + 1 for l ∈ N. A derivation of the irreducible representations of SO(3) can be found in [18, Chapter 8].

Recently, a computationally efficient method for working with these irreducible representations has been proposed [16]. Let V = L^2(S^2), the Hilbert space of the square-integrable functions on S^2. Define a representation ρ : SO(3) → GL(V) by

(ρ(g)(f))(x) = f(g^{-1}x),  x ∈ S^2, g ∈ SO(3).

We need to verify that ρ is a homomorphism:

(ρ(gh)(f))(x) = f((gh)^{-1}x) = f(h^{-1}g^{-1}x) = (ρ(h)f)(g^{-1}x) = (ρ(g)ρ(h)f)(x)

for all x ∈ S^2, and hence ρ(gh)f = ρ(g)ρ(h)f for all f ∈ V, which gives ρ(gh) = ρ(g)ρ(h) for all g, h ∈ SO(3) as desired. All irreducible representations can be found as subrepresentations of this representation by picking subspaces E_l generated by 2l + 1 real or complex spherical harmonic functions, which are simply well-chosen elements of V (but with quite complicated formulas).

The elements of SO(3) can be parametrized by Euler angles. The Euler angles represent rotation angles for rotations about one of the three axes in R^3. Various conventions exist, but in this thesis the (3,2,3) or (z, y, z) convention will be applied, where (α, β, γ) stands for the element

( cos γ  −sin γ  0 )( cos β  0  −sin β )( cos α  −sin α  0 )
( sin γ   cos γ  0 )( 0      1      0  )( sin α   cos α  0 )  ∈ SO(3).
( 0       0      1 )( sin β  0  cos β  )( 0       0      1 )
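The (z, y, z) parametrization can be verified numerically to produce elements of SO(3). The sketch below (illustrative; the helper names are not from the thesis) follows the sign convention of the matrices above, with −sin β in the top-right entry of the middle factor:

```python
import numpy as np

def Rz(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def Ry(t):
    # Middle factor with the text's sign convention: -sin(t) in the (1,3) entry.
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]])

def euler_zyz(alpha, beta, gamma):
    # (alpha, beta, gamma) maps to Rz(gamma) Ry(beta) Rz(alpha).
    return Rz(gamma) @ Ry(beta) @ Rz(alpha)

R = euler_zyz(0.4, 1.1, -0.8)
is_orthogonal = np.allclose(R @ R.T, np.eye(3))     # R^t R = I
has_det_one = np.isclose(np.linalg.det(R), 1.0)     # det R = 1
```

Orthogonality together with determinant one confirms the product lies in SO(3) for any choice of angles.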

It is shown in [16] that the matrix form of elements of SO(3) in the basis given by the l-th real spherical harmonics can be written as

[ρ|_{E_l}((α, β, γ))] = X_l(α) J_l X_l(β) J_l X_l(γ)

for J_l a fixed matrix and X_l(φ) a matrix consisting mainly of zeros, but with

(cos(−lφ), cos((−l + 1)φ), . . . , cos(lφ))

on the diagonal and

(sin(−lφ), sin((−l + 1)φ), . . . , sin(lφ))

on the anti-diagonal.

In the computational framework in which the sampling methods are integrated, the conventions above are used. In this case, φ = (α, β, γ) ∈ [0, 2π] × [0, π] × [0, 2π] is the form in which the elements need to be sampled and the density takes the form

p(φ|x, y) ∝ exp( −(1/σ^2) (y^t W) R(φ) (W^t x) )

where R(φ) = ρ(φ) is φ in some representation. The symbol "∝" stands for "proportional to": a ∝ b if there exists a fixed constant C such that a = Cb. The function p depends on the data, the matrix W that is to be learned and the representation, where for this thesis the dimensions of the irreducible representations which make up the representation ρ are prescribed. In a further version of the model, it would be desirable to learn these dimensions, or even the Lie group the representations are taken over, as well.

It is only possible to compute an unnormalized version of p(φ|x, y), since computation of the normalization constant would be expensive. This implies that computations with values from p can become too large, causing computation errors due to overflow. When probabilities need to be multiplied often, underflow or overflow can appear easily. For this reason, the code needs to work with log probabilities instead.
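Working in log space is standard practice here; the usual tool is the log-sum-exp trick, which normalizes log probabilities without ever exponentiating large values. The sketch below is illustrative and not part of the thesis code:

```python
import math

def log_sum_exp(log_vals):
    # log(sum(exp(v) for v in log_vals)), computed stably by
    # factoring out the maximum before exponentiating
    m = max(log_vals)
    return m + math.log(sum(math.exp(v - m) for v in log_vals))

# Unnormalized log probabilities whose exp() would overflow a float
log_p = [1000.0, 1001.0, 999.0]
log_Z = log_sum_exp(log_p)                    # log of the normalization constant
probs = [math.exp(v - log_Z) for v in log_p]  # normalized, overflow-free
```

Subtracting the log normalizer before exponentiating keeps every intermediate value in a safe range.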

The sampling domain can also be converted to S³. Mathematically speaking, there exists a double cover f : S³ → SO(3). This means that for every element φ ∈ SO(3), exactly two elements in S³ exist which are mapped to φ under f, in such a way that in a small area around either of these elements, f is a homeomorphism (a continuous bijection with a continuous inverse). Such a map is, however, hard to construct explicitly. Based on [1] and [15], formulas for converting unit quaternions to Euler angles are derived here, which will be useful later on. Here S³ ⊂ R⁴ is seen as the set of unit quaternions.

Assume a unit quaternion is written as q = w + ix + jy + kz; then the corresponding rotation matrix is

\[
\begin{pmatrix}
1 - 2y^2 - 2z^2 & 2xy - 2zw & 2xz + 2yw \\
2xy + 2zw & 1 - 2x^2 - 2z^2 & 2yz - 2xw \\
2xz - 2yw & 2yz + 2xw & 1 - 2x^2 - 2y^2
\end{pmatrix}.
\]

With the abbreviations c = cos and s = sin,

\[
\begin{pmatrix} c\phi & -s\phi & 0 \\ s\phi & c\phi & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} c\psi & 0 & -s\psi \\ 0 & 1 & 0 \\ s\psi & 0 & c\psi \end{pmatrix}
\begin{pmatrix} c\xi & -s\xi & 0 \\ s\xi & c\xi & 0 \\ 0 & 0 & 1 \end{pmatrix}
=
\begin{pmatrix} \ldots & \ldots & -s\psi\, c\phi \\ \ldots & \ldots & -s\psi\, s\phi \\ s\psi\, c\xi & -s\psi\, s\xi & c\psi \end{pmatrix}
\]

with expressions like cφ cξ − sφ cψ sξ on the dots. This shows that a problem arises when ψ ∈ {0, π}, since the matrix then takes the form

\[ \begin{pmatrix} a_1 & b_1 & 0 \\ a_2 & b_2 & 0 \\ 0 & 0 & 1 \end{pmatrix} \]

and many choices of φ, ξ will give this particular matrix. This is called gimbal lock. For ψ ∉ {0, π}, the solutions are

\[ \phi = \tan^{-1}(c_2/c_1) = \operatorname{atan2}(2yz - 2xw,\; 2xz + 2yw), \]
\[ \psi = \cos^{-1}(c_3) = \cos^{-1}(1 - 2x^2 - 2y^2), \]
\[ \xi = \tan^{-1}(b_3/(-a_3)) = \operatorname{atan2}(2yz + 2xw,\; -2xz + 2yw). \]
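These formulas translate directly into code. The sketch below is illustrative rather than the thesis implementation (the function name is mine); by the correspondence derived in this section, the returned triple (φ, ψ, ξ) corresponds to the Euler angles (γ, β, α):

```python
import math

def quaternion_to_euler(w, x, y, z):
    """Convert a unit quaternion to (phi, psi, xi) in the (z, y, z) convention.

    Implements the formulas above; assumes psi is not 0 or pi (no gimbal lock).
    """
    phi = math.atan2(2*y*z - 2*x*w, 2*x*z + 2*y*w)
    # clamp the argument of acos against floating-point rounding
    psi = math.acos(max(-1.0, min(1.0, 1.0 - 2*x*x - 2*y*y)))
    xi = math.atan2(2*y*z + 2*x*w, -2*x*z + 2*y*w)
    return phi, psi, xi

# A rotation about the y-axis by 1 radian: q = (cos(1/2), 0, sin(1/2), 0).
# The formulas put the whole rotation into the middle angle psi.
phi, psi, xi = quaternion_to_euler(math.cos(0.5), 0.0, math.sin(0.5), 0.0)
```

Using atan2 rather than tan⁻¹ of the quotient picks the correct quadrant automatically.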

The values (φ, ψ, ξ) correspond to (γ, β, α).

The derivative of these conversion formulas will also be required later on, when the need arises to compute the gradient of

\[ U(w, x, y, z) = -\log[p(f(w, x, y, z))] \]

for f the conversion formula and p the density defined in Euler angles. The chain rule gives U′(θ) = [− log p]′(f(θ)) Df(θ). Using Wolfram Alpha the partial derivatives can be computed, which together form the total derivative

\[
Df(w, x, y, z) =
\begin{pmatrix}
-\dfrac{z}{w^2+z^2} & \dfrac{y}{x^2+y^2} & -\dfrac{x}{x^2+y^2} & \dfrac{w}{w^2+z^2} \\[6pt]
0 & \dfrac{4x}{\sqrt{1-(1-2x^2-2y^2)^2}} & \dfrac{4y}{\sqrt{1-(1-2x^2-2y^2)^2}} & 0 \\[6pt]
-\dfrac{z}{w^2+z^2} & -\dfrac{y}{x^2+y^2} & \dfrac{x}{x^2+y^2} & \dfrac{w}{w^2+z^2}
\end{pmatrix}.
\]

The derivative of − log p can also be found reasonably easily. This is a function

\[ \phi \mapsto -V^t R(\phi) U \]

and has derivative −V^t (dR(φ)/dφ) U. Recall that

\[ R(\phi) = \bigoplus_{i=1}^r R_{l_i}(\alpha, \beta, \gamma) = \bigoplus_{i=1}^r X_{l_i}(\alpha) J_{l_i} X_{l_i}(\beta) J_{l_i} X_{l_i}(\gamma) \]

for certain l₁, . . . , l_r ∈ N. The direct sum notation implies that R(φ) is a block-diagonal matrix with blocks R_{l₁}(φ), . . . , R_{l_r}(φ), such that

\[
\frac{dR(\phi)}{d\phi} =
\begin{pmatrix}
\frac{dR_{l_1}(\phi)}{d\phi} & 0 & \ldots & 0 \\
0 & \frac{dR_{l_2}(\phi)}{d\phi} & \ldots & 0 \\
\vdots & & \ddots & \vdots \\
0 & \ldots & 0 & \frac{dR_{l_r}(\phi)}{d\phi}
\end{pmatrix}.
\]

A derivative dR_l(φ)/dφ is again given by the matrix of partial derivatives, where for example

\[ \frac{dR_l(\phi)}{d\alpha} = \frac{dX_l(\alpha)}{d\alpha} J_l X_l(\beta) J_l X_l(\gamma) \]

and dX_l(α)/dα may be computed by taking the derivative element-wise.


3 Sampling methods

Many applications, such as those using the EM algorithm, require the computation of an integral of the form

\[ E_p[f(z)] = \int f(z)\, p(z)\, dz \]

for some function f and some probability distribution p. The computation of E_p[f(z)] may be intractable, such that an approximation method is required. A possible method for this purpose is Monte Carlo.

Definition 3.1. If z₁, . . . , zₙ ∼ p i.i.d., then

\[ \mu_n = \frac{1}{n} \sum_{i=1}^n f(z_i) \]

is a (basic) Monte Carlo estimator of E_p[f(z)], where z ∼ p.

This requires a method for obtaining samples z₁, . . . , zₙ ∼ p. The meaning of "z₁, . . . , zₙ ∼ p" will be given in the next section, and various methods for obtaining such samples will be discussed in this chapter.

Remark 3.2. In this thesis, sampling methods for the particular problem explained in the previous chapter are evaluated. The need for a sampling method is, however, a standard problem which arises in many fields of study, and this chapter will therefore discuss sampling methods in a general framework.

Remark 3.3. The Monte Carlo method is supported by mathematical theory. The following properties hold.

• E[µₙ] = E[f(z)], that is, µₙ is an unbiased estimator.

• Assuming that the variance of f(z) is finite, µₙ is a consistent estimator: for all ε > 0,

\[ P(|\mu_n - E[f(z)]| < \varepsilon) \to 1 \quad (\text{as } n \to \infty). \]

This follows from the law of large numbers.

• The variance σ²(µₙ) = (1/n) σ²(f(z)) → 0. The mean squared error is defined as the squared bias plus the variance. Since µₙ is unbiased, the mean squared error equals the variance and goes to zero at rate 1/n, so the error itself converges at rate 1/√n.

• The probability distribution does not need to be normalized: if q(z) = (1/Z) p(z) for Z the normalization constant, then p ∝ q and both define the same distribution.

A well-known application of Monte Carlo is that it can be used to approximate π, known under the name "Buffon's needle problem".
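As an illustration, π can be estimated by simulating needle drops. The sketch below (with needle length and line spacing both set to 1, so the crossing probability is 2/π) is illustrative and not part of the thesis:

```python
import math
import random

def estimate_pi(n, seed=0):
    """Buffon's needle: drop n unit-length needles on lines spaced one unit apart.

    A needle crosses a line iff the distance from its midpoint to the nearest
    line is at most (1/2) sin(theta); the crossing probability is 2/pi.
    """
    rng = random.Random(seed)
    crossings = 0
    for _ in range(n):
        x = rng.uniform(0.0, 0.5)          # midpoint distance to nearest line
        theta = rng.uniform(0.0, math.pi)  # needle angle
        if x <= 0.5 * math.sin(theta):
            crossings += 1
    return 2.0 * n / crossings             # invert the crossing probability 2/pi

print(estimate_pi(200_000))
```

This is exactly the basic Monte Carlo estimator of Definition 3.1 applied to the indicator function of a crossing.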


3.1 Definition of sampling methods

In this section the intuition behind sampling methods and their precise mathematical definition are given, based on [17]. The formal definition is needed in Section 3.3, but Section 3.2 only requires some intuition.

The intuition behind sampling methods is that points need to be sampled "according to" the given density p.

Example 3.4. The way sampling methods ought to work becomes clearer when p is a probability density function for a discrete distribution. Assume the domain Ω consists of two elements, x and y. In this case, p is called a probability mass function. Assume p(x) = 0.2 and p(y) = 0.8, such that y gets four times the mass of x. A sampling method which samples "according to" the distribution should then (approximately) return y four times as often as x.

In the continuous case, the idea of sampling becomes a bit more complicated, since in theory every point has zero chance of being sampled (when the sampling domain is equipped with the usual measure). Still, the intuition holds that the larger the value that p assigns to a point, the more often it ought to be picked.

In order to understand the mathematical definition of sampling, some measure theory needs to be introduced first. Mathematicians have founded probability theory formally on measure theory. Start with a set Ω, called the sample space, and define a σ-algebra F on it.

Definition 3.5. A collection F of subsets of a sample space Ω is called a σ-field or a σ-algebra if ∅ ∈ F and F is closed under countable unions and taking complements.

Next, define a measure ν : F → [0, ∞] which has ν(∅) = 0 and is countably additive, that is, for all disjoint Aᵢ ∈ F,

\[ \nu\left( \bigcup_{i=1}^\infty A_i \right) = \sum_{i=1}^\infty \nu(A_i). \]

Definition 3.6. The triple (Ω,F , ν) is called a measure space.

A simple example is R with the Borel σ-algebra B (which consists of countable unions of intervals) and the Lebesgue measure (which assigns each interval its length and is defined appropriately when taking unions of intervals).

If ν(Ω) = 1, the measure ν is called a probability measure. When the measure in the triple of a measure space is left out, the remaining pair is called a measurable space. Another important definition is that of a measurable function.

Definition 3.7. A function f : Ω₁ → Ω₂ between measurable spaces (Ωᵢ, Fᵢ) is called measurable if f⁻¹(A) ∈ F₁ for all A ∈ F₂.

Continuous functions and indicator functions 1_A for A measurable are examples of measurable functions, and "continuous" combinations of measurable functions give measurable functions again.


In probability theory, a measurable function X is called a random element. If it also takes image in (R, B) or (R^k, B^k), then it is called a random variable or random k-vector. For a probability measure P and a random vector X, we call P_X = P ∘ X⁻¹ the distribution of X. The notation P ∘ X⁻¹ stands for the measure induced by X and is a measure on R^k defined by

\[ P \circ X^{-1}(B) = P(X \in B) = P(X^{-1}(B)) \quad \text{for } B \in \mathcal{B}. \]

Now that measures are defined, integrals and the Radon-Nikodym derivative can also be defined. Let λ, ν be two measures on (Ω, F). We call λ absolutely continuous with respect to ν if ν(A) = 0 ⟹ λ(A) = 0 for all A ∈ F. When ν is also σ-finite (Ω is the union of countably many sets with finite ν-measure), then according to the Radon-Nikodym theorem there exists a nonnegative (Borel) function f on Ω such that

\[ \lambda(A) = \int_A f \, d\nu, \quad A \in \mathcal{F}, \]

unique almost everywhere (a.e.) with respect to ν (any other function with the same properties may only take different values on "negligible" sets to which ν assigns zero measure). The function f is called the Radon-Nikodym derivative or density of λ with respect to ν and is denoted by dλ/dν. If ∫ f dν = 1 and f ≥ 0 a.e. ν, then λ is a probability measure and f is called its probability density function (pdf) with respect to ν. The definition above can be seen as a generalization of the usual definition of a pdf f by

\[ P(a \leq X \leq b) = P(X \in [a, b]) = \int_a^b f(x) \, dx. \]

It is also possible to define a product measure in the most obvious way (when working with σ-finite measures). This gives the possibility of defining a joint pdf with respect to a product measure on B^k for a random k-vector (X₁, . . . , X_k). The i-th marginal pdf can be found by integrating over the other variables:

\[ f_i(x) = \int_{\mathbb{R}^{k-1}} f(x_1, \ldots, x_{i-1}, x, x_{i+1}, \ldots, x_k) \, d\nu_1 \cdots d\nu_{i-1} \, d\nu_{i+1} \cdots d\nu_k. \]

We call X₁, . . . , Xₙ independent random variables if

\[ f(x_1, \ldots, x_n) = f_1(x_1) \cdots f_n(x_n). \]

Definition 3.8. Given a measure space (Ω, F, P) and a distribution P_X, we call X₁, . . . , Xₙ a sample of P_X of length n if X₁, . . . , Xₙ are independent random variables with distribution P_X.

Similarly, a sample X from a density f is a random variable which has f as probability density, that is, f is the Radon-Nikodym derivative of P_X with respect to the Lebesgue measure. We write X ∼ f. The symbol ∼ is often used in mathematics to denote an equivalence relation. In statistics it is used for the equivalence relation "has the same distribution as". Hence, X ∼ N(0, 1) means that the random variable has the same density function as N(0, 1) (where N(0, 1) is seen as a random variable). Note that the definition above presupposes that X₁, . . . , Xₙ are measurable functions to R^m for some m ∈ N, but a similar definition could also be given for manifolds in general. We usually call their realizations Xᵢ(ω) samples as well. In the context of sampling, the functions Xᵢ are taken to be the identity.

3.2 Existing sampling methods

In this section, an overview will be given of the basic sampling methods and a state-of-the-art method called Spherical HMC will be explained.

One of the easiest sampling methods is called rejection sampling. Given an unnormalized density function p(z), define a proposal distribution q(z) from which can be sampled, and find a constant K such that Kq(z) ≥ p(z) for all z. The algorithm works by sampling x ∼ q(z) and accepting x according to the difference between q and p, in order to compensate for the fact that the wrong distribution is being sampled from. The probability that x is accepted is p(x)/(Kq(x)), which is between 0 and 1 by definition of K. Accepting according to a probability can be achieved by sampling u_x ∼ Uniform[0, Kq(x)] and accepting x if p(x) ≥ u_x.

A problem with this method is that it requires a good proposal distribution. If the constant K needs to be chosen large, which happens when q and p differ a lot, many samples will be rejected.
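A minimal sketch of rejection sampling for an unnormalized density on an interval (illustrative only; the target, the uniform proposal and the constant K are chosen for this example, not taken from the thesis):

```python
import math
import random

def rejection_sample(p, a, b, K, n, seed=0):
    """Sample from an unnormalized density p on [a, b] with a Uniform[a, b] proposal.

    K must satisfy K * q(z) >= p(z) for all z, where q is the proposal density.
    """
    rng = random.Random(seed)
    q = 1.0 / (b - a)                      # uniform proposal density
    samples = []
    while len(samples) < n:
        x = rng.uniform(a, b)              # x ~ q
        u = rng.uniform(0.0, K * q)        # u_x ~ Uniform[0, K q(x)]
        if p(x) >= u:                      # accept with probability p(x)/(K q(x))
            samples.append(x)
    return samples

# Unnormalized standard normal restricted to [-4, 4]; since max p = 1 and
# q = 1/8, the envelope condition K q >= p requires K >= 8.
p = lambda z: math.exp(-z * z / 2.0)
xs = rejection_sample(p, -4.0, 4.0, K=8.0, n=20_000)
```

Note how the acceptance rate drops as K grows, which is exactly the problem described above.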

Importance sampling also requires a proposal distribution q(z) for p(z), this time with the property that p(z) = 0 ⟹ q(z) = 0. Note that

\[ E[f(z)] = \int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx \approx \frac{1}{n} \sum_{i=1}^n f(z_i)\, \frac{p(z_i)}{q(z_i)} \]

gives an approximation of the integral using samples zᵢ ∼ q. The importance weights p(zᵢ)/q(zᵢ) compensate for sampling from the wrong distribution. As with rejection sampling, it is important that the distribution q(z) is close to the distribution p(z): samples for which p(x)/q(x) is small contribute little to the expectation and lead to a bad estimation of the expectation of f. With a slight change, this sampling technique is also able to handle unnormalized distributions, see [3, page 533].
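A minimal sketch of importance sampling, using the self-normalized variant (dividing by the summed weights) so that p may be unnormalized; this is illustrative and not the thesis code:

```python
import math
import random

def importance_estimate(f, log_p, sample_q, q_pdf, n, seed=0):
    """Estimate E_p[f] from samples of q, with self-normalized importance weights.

    Self-normalization makes the estimator valid when p is only known
    up to a constant, cf. [3, page 533].
    """
    rng = random.Random(seed)
    zs = [sample_q(rng) for _ in range(n)]
    ws = [math.exp(log_p(z)) / q_pdf(z) for z in zs]   # importance weights p/q
    return sum(w * f(z) for w, z in zip(ws, zs)) / sum(ws)

# E[z^2] under an (unnormalized) standard normal, with a Uniform[-5, 5] proposal
est = importance_estimate(
    f=lambda z: z * z,
    log_p=lambda z: -z * z / 2.0,             # unnormalized log density
    sample_q=lambda rng: rng.uniform(-5.0, 5.0),
    q_pdf=lambda z: 0.1,
    n=100_000,
)
```

The estimate approaches 1, the variance of the standard normal, as n grows.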

3.2.1 Markov chains

A lot of sampling methods simulate a Markov chain which has the desired distribution as its so-called stationary distribution. This introduction to Markov chains is based on [19] and [3, Chapter 11.2.1]. Let Xₜ denote the value of a random variable X at time t. We call X a Markov process if

P (Xt+1 = sj |X0 = sk, . . . , Xt = si) = P (Xt+1 = sj |Xt = si) =: P (i→ j).


The sequence (X₀, . . . , Xₙ) is then called a Markov chain and the P(i → j) are called transition probabilities. Let πⱼ(t) = P(Xₜ = sⱼ) denote the probability that the chain is in state j at time t, and initialize the vector π(0). The Chapman-Kolmogorov equation describes the other values by

\[ \pi_i(t+1) = \sum_k P(k \to i)\, \pi_k(t). \]

Putting P(i → j) as the (i, j)-th entry of a matrix P, it follows that π(t) = π(0)Pᵗ. A distribution π* is called stationary if π* = π*P. The word 'stationary' in the definition can be explained by the fact that π(t) = π* implies π(t + 1) = π* when π* is a stationary distribution.

Another desirable property is that the convergence π(t) → π* is independent of the chosen initialization π(0). A sufficient condition for a unique stationary distribution is the existence of a t such that every entry of Pᵗ is strictly greater than zero, which is called ergodicity. For finite Markov chains, this is equivalent to irreducibility (meaning every point can be reached) and aperiodicity (meaning that there is no fixed period needed to walk from a point to itself, that is, the t for which Pᵗ(i → i) > 0 have greatest common divisor 1).

We want to construct Markov chains which have a certain probability distribution as their unique stationary distribution. A manner of verifying that this stationary distribution is the desired distribution is by means of the detailed balance property: if the equation

\[ P(j \to k)\, \pi^*_j = P(k \to j)\, \pi^*_k \]

holds for all k and j, this gives a sufficient condition for π* being the stationary distribution.
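These notions are easy to check numerically for a small chain. The sketch below iterates the Chapman-Kolmogorov equation for a hypothetical two-state chain (illustrative only):

```python
def step(pi, P):
    """One Chapman-Kolmogorov update: pi(t+1) = pi(t) P."""
    n = len(P)
    return [sum(pi[k] * P[k][i] for k in range(n)) for i in range(n)]

# A two-state ergodic chain: every entry of P is positive
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = [1.0, 0.0]            # arbitrary initialization pi(0)
for _ in range(200):       # pi(t) = pi(0) P^t converges to pi*
    pi = step(pi, P)

# pi converges to the stationary distribution (5/6, 1/6), which also
# satisfies detailed balance: 0.1 * 5/6 = 0.5 * 1/6 = 1/12
```

Because the chain is ergodic, the same limit is reached from any initialization π(0).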

3.2.2 Metropolis algorithm

The first proposed MCMC method is the Metropolis algorithm [13] from 1953, which was improved by Hastings in 1970 [10]. The method is able to draw samples from an unnormalized probability distribution p and requires the programmer to think of a good jumping distribution q(z|zₜ), which represents the probability of going from zₜ to a certain state z (when combined with a certain acceptance probability). This jumping distribution q has to be symmetric for the original Metropolis algorithm; Hastings improved the algorithm by adjusting it in such a way that this symmetry is no longer required.

The algorithm works as follows.

• Start with initial value z(0) such that p(z(0)) > 0.

• For t = 1, . . . , k, sample a candidate point z* from q(z|z^(t−1)) and accept it with probability

\[ \alpha(z^{(t-1)}, z^*) = \min\left( \frac{p(z^*)\, q(z^{(t-1)} | z^*)}{p(z^{(t-1)})\, q(z^* | z^{(t-1)})},\; 1 \right). \]


• When the stationary distribution has been reached, say at t = k, let (z^(k+1), . . . , z^(k+n)) be the required samples from p.
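A minimal sketch of the algorithm with a symmetric (random-walk Gaussian) jumping distribution, in which case the ratio of the q's in the acceptance probability cancels (illustrative; not the thesis implementation):

```python
import math
import random

def metropolis(log_p, z0, n, step=1.0, seed=0):
    """Random-walk Metropolis for an unnormalized log density log_p."""
    rng = random.Random(seed)
    z, samples = z0, []
    for _ in range(n):
        z_new = z + rng.gauss(0.0, step)           # candidate from symmetric q
        log_alpha = log_p(z_new) - log_p(z)        # log acceptance ratio
        if rng.random() < math.exp(min(0.0, log_alpha)):
            z = z_new                              # accept the candidate
        samples.append(z)
    return samples

# Unnormalized standard normal target: log p(z) = -z^2/2 + const
chain = metropolis(lambda z: -z * z / 2.0, z0=0.0, n=50_000)
chain = chain[5_000:]   # discard burn-in before assuming stationarity
```

Working with log p, as discussed in Chapter 2, avoids overflow in the acceptance ratio.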

In order to check the existence of a stationary distribution, the detailed balance condition needs to be verified. Inspection gives

\[ P(j \to k) = q(s_k | s_j)\, \alpha(s_j, s_k) \quad \text{and} \quad \pi^*_k = p(s_k) \]

and

\begin{align*}
q(z^* | z^{(t-1)})\, p(z^{(t-1)})\, \alpha(z^{(t-1)}, z^*)
&= q(z^* | z^{(t-1)})\, p(z^{(t-1)}) \min\left( \frac{p(z^*)\, q(z^{(t-1)} | z^*)}{p(z^{(t-1)})\, q(z^* | z^{(t-1)})},\; 1 \right) \\
&= \min\left( p(z^*)\, q(z^{(t-1)} | z^*),\; p(z^{(t-1)})\, q(z^* | z^{(t-1)}) \right) \\
&= p(z^*)\, q(z^{(t-1)} | z^*) \min\left( 1,\; \frac{p(z^{(t-1)})\, q(z^* | z^{(t-1)})}{p(z^*)\, q(z^{(t-1)} | z^*)} \right) \\
&= p(z^*)\, q(z^{(t-1)} | z^*)\, \alpha(z^*, z^{(t-1)}).
\end{align*}

This shows that the desired probability distribution p is the stationary distribution of the Markov chain that is created. However, it can be hard to know whether the stationary distribution has been reached, and many samples may be rejected if the jumping distribution is badly chosen. In the latter case, the chain is said to be poorly mixing.

The Gibbs sampler [9] is a special case of this Metropolis-Hastings algorithm which works with the univariate conditional distributions (the distribution when all of the random variables but one are assigned fixed values). At each step, a coefficient z*ᵢ of the candidate point is sampled from

\[ p(z_i \mid z_1 = z^*_1, \ldots, z_{i-1} = z^*_{i-1}, z_{i+1} = z^{(t)}_{i+1}, \ldots, z_n = z^{(t)}_n) \]

for i = 1, . . . , n, and the resulting candidate is always accepted (α(z^(t), z*) = 1).

3.2.3 Hybrid Monte Carlo

Hybrid Monte Carlo or Hamiltonian Monte Carlo (HMC) methods are a class of MCMC methods originating from physics. An intuitive picture is sketched in [14]: imagine a ball rolling over a landscape made up of hills. The ball has kinetic energy (from its rolling speed) and potential energy (from its height in the landscape). Due to the addition of the kinetic energy to the model, the ball is able to roll down a hill and up to the top of the next. A problem with Metropolis-Hastings is that either many candidates are rejected or only very small steps in the domain are made; HMC methods aim to solve this issue.

Assume p is the density function that needs to be sampled from. Let θ denote the position variable, which has potential energy U(θ) = − log p(θ), and introduce an auxiliary momentum variable v with kinetic energy K(v) = ‖v‖²/2. The momentum variables correspond to the rate of change of θ.


The Hamiltonian is defined by H(θ, v) = U(θ) + K(v). The Hamiltonian equations

\[ \frac{d\theta_i}{dt} = \frac{\partial H}{\partial v_i}, \qquad \frac{dv_i}{dt} = -\frac{\partial H}{\partial \theta_i} \]

describe how θ and v change over time. These equations define a mapping

\[ T_s : (\theta(t), v(t)) \mapsto (\theta(t+s), v(t+s)). \]

An important property of Hamiltonian dynamics is that T_s is invertible (the inverse can be obtained by negating the time derivatives in the Hamiltonian equations), which shows that the dynamics are reversible. Other important properties are that the dynamics leave the Hamiltonian invariant, that T_s is volume-preserving, and that the Hamiltonian dynamics are symplectic. These properties are maintained when the dynamics are approximated, which is crucial for implementation since time needs to be discretized. More information on these properties can be found in [14, Section 2.2].

In order to approximate the dynamics, a stepsize ε > 0 for time is introduced. The leapfrog method usually outperforms other approximation methods such as Euler's method [14, page 8]. The updates are given by

\[ v_i(t + \varepsilon/2) = v_i(t) - (\varepsilon/2) \frac{\partial U}{\partial \theta_i}(\theta(t)), \]
\[ \theta_i(t + \varepsilon) = \theta_i(t) + \varepsilon\, v_i(t + \varepsilon/2), \]
\[ v_i(t + \varepsilon) = v_i(t + \varepsilon/2) - (\varepsilon/2) \frac{\partial U}{\partial \theta_i}(\theta(t + \varepsilon)). \]

These updates are used in the HMC algorithm to generate proposals. Let q(θ, v) = (1/Z_p) exp(−H(θ, v)). The HMC algorithm consists of two steps, both of which leave q invariant:

• Draw new momentum variables vᵢ randomly from N(0, 1).

• Perform a Metropolis update for the variables (θ, v). A proposal is generated by performing the updates of the leapfrog method (given above) L times. The acceptance probability for a proposal (θ*, v*) is

\[ \min\{1, \exp(-H(\theta^*, v^*) + H(\theta, v))\}. \]

A proof that the detailed balance condition holds for HMC can be found in [14, page 13]. The stepsize ε or the number of leapfrog steps L is sometimes chosen randomly within a small interval, to ensure that Lε is not constantly equal to a periodicity in the density function.
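A minimal sketch of the leapfrog updates and the resulting HMC algorithm for a one-dimensional target (illustrative; in the thesis setting U and its gradient would come from the density of Chapter 2, whereas here a standard normal target is assumed):

```python
import math
import random

def leapfrog(theta, v, grad_U, eps, L):
    """Run L leapfrog steps; successive half momentum steps are merged."""
    v = v - 0.5 * eps * grad_U(theta)      # initial half step for the momentum
    for i in range(L):
        theta = theta + eps * v            # full step for the position
        if i < L - 1:
            v = v - eps * grad_U(theta)    # full momentum step between positions
    v = v - 0.5 * eps * grad_U(theta)      # final half step for the momentum
    return theta, v

def hmc(log_p, grad_U, theta0, n, eps=0.1, L=20, seed=0):
    """HMC with U = -log p and kinetic energy K(v) = v^2 / 2."""
    rng = random.Random(seed)
    U = lambda t: -log_p(t)
    theta, out = theta0, []
    for _ in range(n):
        v = rng.gauss(0.0, 1.0)                         # fresh momentum ~ N(0, 1)
        H0 = U(theta) + 0.5 * v * v
        theta_new, v_new = leapfrog(theta, v, grad_U, eps, L)
        H1 = U(theta_new) + 0.5 * v_new * v_new
        if rng.random() < math.exp(min(0.0, H0 - H1)):  # Metropolis correction
            theta = theta_new
        out.append(theta)
    return out

# Standard normal target: U(t) = t^2/2, grad U(t) = t
chain = hmc(lambda t: -t * t / 2.0, lambda t: t, theta0=0.0, n=10_000)
```

Because the leapfrog integrator nearly conserves H, almost all proposals are accepted while the chain still makes long-distance moves.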


3.2.4 Spherical HMC

Spherical HMC is an HMC method proposed in [12]. As any HMC method, it works by simulating Hamiltonian dynamics. The momentum variables, potential energy, kinetic energy and Hamiltonian are defined in the same way as for the regular HMC algorithm. Spherical HMC is aimed at sampling from distributions defined on constrained domains, where the constraints may be of the form ‖z‖_p ≤ C for 1 ≤ p ≤ ∞. Useful variants are

\[ \|z\|_1 = |z_1| + \ldots + |z_n|, \qquad \|z\|_2 = \sqrt{\sum_{i=1}^n |z_i|^2} \qquad \text{and} \qquad \|z\|_\infty = \max\{|z_1|, \ldots, |z_n|\}. \]

For example, the domain [0, 2π] × [0, π] × [0, 2π] of Euler angles can be transformed to [−1, 1]³, where the domain can be written as {z : ‖z‖_∞ ≤ 1}.

The idea of Lan et al. [12] is to map such a constrained domain to a ball {θ ∈ R^m : ‖θ‖ ≤ 1}. By defining θ_{m+1} = √(1 − ‖θ‖²) and θ̃ = (θ, θ_{m+1}), the ball can be mapped to the m-sphere S^m by θ ↦ θ̃. This gives a function f : S^m → Ω for Ω the original constrained domain. Spherical HMC samples from S^m and transforms these samples to samples from Ω according to this function (although the authors do not use this terminology). By splitting the Hamiltonian dynamics on S^m, analytical solutions for the updates can be found. Given the current position θ, an update is made in a similar fashion as in HMC, where only the generation of a proposal is done differently. The proposal is generated in Spherical HMC by running the Hamiltonian dynamics according to the solutions that were found analytically. The process of proposal generation requires the gradient of U (as a function on the ball). The algorithm that can be used for Spherical HMC with U defined on a sphere (as will be required later on) is outlined in Algorithm 1, where the adjustments made were recommended by Shiwei Lan, one of the designers of the Spherical HMC algorithm.

3.3 Investigating new sampling methods

In this section, the investigated sampling methods are summarized. All methods are aimed at sampling from the probability density p(φ|x, y) defined in Chapter 2, and the samples can be taken from either [0, 2π] × [0, π] × [0, 2π] or S³. Everything in this section is original, based on my own ideas or on those of my supervisor. The main idea proposed in this section is to sample from a different domain and transform those samples to the desired domain. Let M be a space on which the pdf p from which needs to be sampled is defined, and N a space on which sampling is more easily done than on M. The idea is to define a function f : N → M and a pdf q on N such that X ∼ q ⟹ f ∘ X =: f(X) ∼ p.

Why is measure theory introduced?
The need for a more formal approach to this setting arose during an attempt to prove X ∼ q ⟹ f(X) ∼ p for two different cases. In the first case, M = S³, N = R⁴ and f is the projection on the sphere (which is actually ill-defined at the origin, but this can be


Algorithm 1 Spherical HMC (for distributions defined on a sphere S^D)

Initialize θ^(1).
for i = 1, . . . , N do
    Sample a momentum variable v^(1) ∼ N(0, I_{D+1})
    Project it to the tangent space: v^(1) = v^(1) − θ^(1)(θ^(1))^T v^(1)
    Calculate H(θ^(1), v^(1)) = U(θ^(1)) + K(v^(1))
    for l = 1, . . . , L do
        \[ v^{(l+\frac{1}{2})} = v^{(l)} - \frac{\varepsilon}{2} \begin{pmatrix} I_D - \theta^{(l)}(\theta^{(l)})^T & -\theta^{(l)}\theta^{(l)}_{D+1} \\ -\theta^{(l)}_{D+1}(\theta^{(l)})^T & \|\theta^{(l)}\|^2 \end{pmatrix} \nabla U(\theta^{(l)}) \]
        \[ \theta^{(l+1)} = \theta^{(l)} \cos(\|v^{(l+\frac{1}{2})}\|\varepsilon) + \frac{v^{(l+\frac{1}{2})}}{\|v^{(l+\frac{1}{2})}\|} \sin(\|v^{(l+\frac{1}{2})}\|\varepsilon) \]
        \[ v^{(l+\frac{1}{2})} = -\theta^{(l)} \|v^{(l+\frac{1}{2})}\| \sin(\|v^{(l+\frac{1}{2})}\|\varepsilon) + v^{(l+\frac{1}{2})} \cos(\|v^{(l+\frac{1}{2})}\|\varepsilon) \]
        \[ v^{(l+1)} = v^{(l+\frac{1}{2})} - \frac{\varepsilon}{2} \begin{pmatrix} I_D - \theta^{(l+1)}(\theta^{(l+1)})^T & -\theta^{(l+1)}\theta^{(l+1)}_{D+1} \\ -\theta^{(l+1)}_{D+1}(\theta^{(l+1)})^T & \|\theta^{(l+1)}\|^2 \end{pmatrix} \nabla U(\theta^{(l+1)}) \]
    end for
    Calculate H(θ^(L+1), v^(L+1)) = U(θ^(L+1)) + K(v^(L+1))
    Calculate the acceptance probability α = exp{−H(θ^(L+1), v^(L+1)) + H(θ^(1), v^(1))}
    Sample u ∼ U[0, 1] uniformly
    if u ≤ α then
        θ^(1) = θ^(L+1)
    end if
    Save the acquired sample θ^(1)
end for


overlooked). A pdf p on M is given and a pdf q needs to be defined on R⁴. Intuitively, q should spread the density of p over R⁴. However, in order to have a finite integral for q, the spreading should be controlled. Therefore, a "spreading function" r can be introduced and q can be defined as p(f(x)) r(‖x‖). In order to verify the condition X ∼ q ⟹ f(X) ∼ p, it might seem most logical to verify that

\[ \int_{f^{-1}\{x\}} q(y)\, dy \propto p(x), \]

namely that for each point x ∈ S³, an amount proportionate to p(x) is spread over R⁴. In the second context, M = SO(3), N = S³ and f : N → M is a double cover (a two-to-one map with some nice properties). Since every point in M has exactly two points in N which are mapped to it, it seems that surely q(x) = ½ p(f(x)) must work. However, ∫_{f⁻¹{x}} q(y) dy = 0 in this case. It is not immediately trivial which condition it is that makes both cases work, although both are intuitively sound methods. In such cases, it is easier to know what it is exactly that needs to be proven, and for such reasons, I chose to introduce more mathematical definitions. Since such conditions were hard to find in the literature (except for the well-known change-of-variables formula), I introduce them here in a general context for arbitrary measurable domains. In particular, I merely derive the conditions by unraveling the definitions; these may be well-known in other contexts, although to my knowledge they have not yet been applied in the context of sampling.

Why is change-of-variables insufficient?
In its most basic form, change-of-variables can be stated as

\[ \int_M p \, d(f_*\mu) = \int_N (p \circ f) \, d\mu \]

for p a measurable function on M and f_*(μ) the pushforward of μ, which is defined as (f_*(μ))(B) = μ(f⁻¹(B)) (of course, f : N → M needs to be measurable). However, this formula is rather limited, since only the functions p and f and one of the measures can be varied. In the context described above, the function q does not necessarily have the form q = p ∘ f, and the measures on M and N do not have to be connected through f.

3.3.1 Conditions for changing domains

First of all, the function f needs to be measurable to ensure that f(X) is a random element again. In this setting, measures ν on M and λ on N are required. Remember that "p is a pdf" also means that p ≥ 0 a.e. ν and ∫_M p dν = 1. The latter property can be reduced to ∫_M p dν < ∞ for most purposes, since sampling algorithms usually work with unnormalized densities as well. The function q needs to be a pdf as well, hence it is required that q ≥ 0 a.e. λ and ∫_N q dλ = 1 (or finite). Saying that X ∼ q ⟹ f(X) ∼ p means that if

\[ P(X^{-1}(A)) = \int_A q \, d\lambda \quad \text{for all } A \in \mathcal{F}_N \]


then also

\[ P((X^{-1} \circ f^{-1})(B)) = \int_B p \, d\nu \quad \text{for all } B \in \mathcal{F}_M, \]

where P is a measure on the domain of X. Since f is measurable, we have f⁻¹(B) ∈ F_N for all B ∈ F_M (this is the definition of measurable). Call Y = f ∘ X and p_Y the pdf corresponding to Y. Then

\[ \int_B p_Y \, d\nu = P(Y^{-1}(B)) = P((X^{-1} \circ f^{-1})(B)) = \int_{f^{-1}(B)} q \, d\lambda. \]

It remains to show p_Y = p, and since p_Y is defined (almost uniquely) by the condition ∫_B p_Y dν = P(Y⁻¹(B)), it is sufficient to show ∫_B p dν = ∫_B p_Y dν for all B ∈ F_M. This leaves the following conditions.

• The functions f and q are measurable. This usually follows easily, since most functions encountered in practice, such as indicator functions on measurable sets and continuous functions, are measurable.

• The integral ∫_N q dλ is finite.

• The function q ≥ 0 on all sets with non-zero λ-measure.

• For all B ∈ F_M,

\[ \int_B p \, d\nu = \int_{f^{-1}(B)} q \, d\lambda. \]

This last condition is easiest understood in the case M = R, where it can be seen as "the cumulative distribution functions are the same". By a theorem in measure theory, it is sufficient to check this condition on a set which generates the σ-algebra, for example on all intervals or open sets when M = R.

Example 3.9. Let us look at a simple example. Take M = [0, 1] and N = [0, 1] ∪ [2, 3] and define f by f(x) = 1_{[0,1]}(x) x + 1_{[2,3]}(x)(x − 2), which is a combination of measurable functions and hence measurable again. Define p : [0, 1] → R : x ↦ 1 and q(x) = ½ (p ∘ f)(x). Obviously, q ≥ 0 holds everywhere. Give both M and N the Lebesgue measure μ. Then

\[ \int_N q \, d\mu = \int_N \tfrac{1}{2}(p \circ f) \, d\mu = 2 \int_{[0,1]} \tfrac{1}{2} p \, d\mu = 1 \]

and for all B ∈ (B ∩ [0, 1]), we have

\[ \int_{f^{-1}(B)} q \, d\mu = \int_B q \, d\mu + \int_{B+2} q \, d\mu = \int_B p \, d\mu. \]

This shows that we are allowed to sample from (N, μ, q) and translate these samples to (M, μ, p) using f.
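The example can also be checked empirically: sampling uniformly from N = [0, 1] ∪ [2, 3] (the uniform density on N is exactly q = 1/2, since N has total measure 2) and applying f should produce uniform samples on M = [0, 1]. A sketch:

```python
import random

def sample_N(rng):
    """Uniform on [0, 1] ∪ [2, 3], i.e. the density q = 1/2 on N."""
    u = rng.uniform(0.0, 2.0)
    return u if u <= 1.0 else u + 1.0   # shift the second half into [2, 3]

def f(x):
    """The map from Example 3.9: identity on [0, 1], x - 2 on [2, 3]."""
    return x if x <= 1.0 else x - 2.0

rng = random.Random(0)
ys = [f(sample_N(rng)) for _ in range(100_000)]
# ys behaves like Uniform[0, 1]: mean near 1/2, half the mass below 1/2
```

The pushed-forward samples match the target p, as the verified conditions predict.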


In the example above, f is called a covering map. The example below gives another case where the properties are easily verified.

Example 3.10. Let f : S¹ → S¹ be the double cover given by z ↦ z², let p be a pdf on S¹ and q = ½ (p ∘ f). Since f is continuous, it is also measurable. Furthermore,

\[ \frac{1}{2} \int_{[0,1]} p(f(e^{2\pi i \phi}))\, d\phi = \frac{1}{2} \int_{[0,1]} p((e^{2\pi i \phi})^2)\, d\phi = \frac{1}{2} \int_{[0,1]} p(e^{4\pi i \phi})\, d\phi = \int_{[0,1]} p(e^{2\pi i \psi})\, d\psi \]

using the change-of-coordinates formula for ψ = 2φ. A similar result holds for all subsets of [0, 1].

It can be expected that if f : N → M is a map such that each point in M has n inverses, then q = (1/n)(p ∘ f) will be a pdf on N satisfying X ∼ q ⟹ f(X) ∼ p. However, if only "each point in M has n inverses" is assumed, then the following situation may occur. If N has a measure ν such that ν({x}) = 1 for a certain x ∈ N and ν(N \ {x}) = 0, then the property "∫_B p dν = ∫_{f⁻¹(B)} q dλ" will obviously fail (in this case, the point x will always get all probability mass). The main idea is that the measures on M and N may be very different, and that these also decide whether X ∼ q ⟹ f(X) ∼ p holds or not.

In order to be able to use Spherical HMC to sample from SO(3), a new probability density function q on S³ should be defined, along with a function f : S³ → Euler angles with the property X ∼ q ⟹ f(X) ∼ p. The chosen conversion formula is given in Section 2.2 and q = ½ (p ∘ f) is taken, where in practice p is unnormalized and hence the half can be dropped. The intuition behind the choice of f and q is that there exists a double cover S³ → SO(3). The chosen f is written out such that the rotation represented by a unit quaternion x ∈ S³ is exactly the same as the rotation represented by f(x), and q is taken to be ½ p based on the previous two examples. In order to show the property ∫_A p dν = ∫_{f⁻¹(A)} q dλ, the change-of-variables formula

\[ \int_{f(U)} p(v)\, dv = \int_U (p \circ f)(u)\, |Df|\, du \]

can often be very useful. In this case, the formula cannot be applied, since the Euler angles are a subset of R³ and S³ ⊂ R⁴. However, S³ is a three-dimensional manifold, and the problem might be solved by parametrizing S³ with three variables, such that hopefully the new f will have |Df| = ½ (in fact, since unnormalized densities are used, |Df| constant is sufficient). Since the particular sampling domain S³ ⊂ R⁴ is required, such derivations would not help here, and proving the property X ∼ q ⟹ f(X) ∼ p will not follow from the change-of-variables formula in a trivial way. The conditions can, however, be verified quite easily for the other case they were derived for, as will be done in the next section.

3.3.2 Projection sampling

One of the first approaches to sampling from a restricted area is to sample from an unrestricted area and project the acquired sample to the restricted area. Let p be a probability density function which needs to be sampled from, and R the restricted domain on which p is defined. The idea is to define an area U, a (measurable) function π : U → R and a probability density function q on U such that q is easier to sample from and x ∼ q implies π(x) ∼ p. This would give the possibility of sampling from p by sampling from q and "projecting" these samples from U to R.

For spheres S^n := {x ∈ R^{n+1} : ‖x‖ = 1}, an obvious projection function

π : R^{n+1} \ {0} → S^n : x ↦ x/‖x‖

exists. The problem then is to define a probability density function q on R^{n+1}. An easy way to fulfill the property x ∼ q =⇒ π(x) ∼ p is by setting q = p ∘ π. However, this leads to the problem that ∫ q(x) dx is infinite.

A possible way to solve this is to “spread the density” to Rn+1 in a controlled manner.

Definition 3.11. Let p be a pdf on S^n and let r : R_{>0} → R_{≥0} be a pdf such that Z = ∫_{R_{>0}} r(ρ)ρ^n dρ < ∞. Define the pdf associated to p and r as

q : R^{n+1} \ {0} → R_{≥0} : x ↦ (1/Z) r(‖x‖) p(x/‖x‖).

As mentioned before, proving the validity of this setting was one of the main reasons to introduce more formal definitions. The remainder of this subsection is used to verify the conditions that were found in the previous section and to discuss the condition that ∫_{R_{>0}} r(ρ)ρ^n dρ < ∞, which may seem somewhat counterintuitive at first. This condition arises from the fact that the density is not spread in a "straight" manner: if a segment S of the sphere S^n is taken and all points in R^{n+1} which are projected to S are visualized, then the area takes the form of a fan instead of a rectangle.

Remark 3.12. In order to prove ∫_{R_{>0}} r(ρ)ρ^n dρ < ∞ it is sufficient to show that there exist M > 0 and C > 0 such that r(ρ) ≤ Cρ^{−(n+2)} for all ρ ≥ M, since then r(ρ)ρ^n ≤ Cρ^{−2} for ρ ≥ M and

lim_{N→∞} ∫_M^N ρ^{−2} dρ = lim_{N→∞} [−ρ^{−1}]_M^N = lim_{N→∞} (−N^{−1} + M^{−1}) = M^{−1} < ∞,

and

∫_0^M ρ^n r(ρ) dρ ≤ ∫_{R_{>0}} M^n r(ρ) dρ = M^n.
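As a quick numerical sanity check of this condition (a sketch, not part of the thesis code; the density r is taken to be that of N(1, 0.02), matching the choice in Subsection 4.1.2, and n = 3 corresponds to S^3 ⊂ R^4), the integral can be seen to stabilise as the upper integration limit grows:

```python
import numpy as np

def r(rho, mu=1.0, sigma=0.02):
    # Gaussian density; its tail decays faster than any polynomial,
    # so the sufficient condition of Remark 3.12 holds.
    return np.exp(-0.5 * ((rho - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def Z_upto(upper, n=3, num=400_000):
    # trapezoid approximation of the integral of r(rho) * rho^n over (0, upper]
    rho = np.linspace(1e-9, upper, num)
    y = r(rho) * rho ** n
    return float(np.sum((y[1:] + y[:-1]) * np.diff(rho)) / 2)

# The value stabilises as the upper limit grows, so Z < infinity; for
# N(1, 0.02) and n = 3 it equals E[rho^3] = 1 + 3 * 0.02^2 = 1.0012.
print(Z_upto(2.0), Z_upto(10.0), Z_upto(100.0))
```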

To see that not every density function has ∫_{R_{>0}} r(ρ)ρ^n dρ < ∞, the following example shows a nonnegative C^∞ function f which has integral 1, but where the property above already fails for n = 1, that is, ∫_{R_{>0}} f(ρ)ρ dρ = ∞.

Example 3.13. Construct a C^∞ function f_1 : R_{>0} → R_{≥0} with the property that f_1 is at least 1/2 on an interval B_1 of length at least 1/2, which is zero outside [2, 3] and has an integral of 1/2. Such a function can be constructed using the so-called bump functions which are also used in proving the existence of partitions of unity. Define

f_n(x) = (1/2^{n−1}) f_1(x − 2^n + 2),

a rescaled and translated version of f_1, such that f_n is at least 1/2^n on an interval B_n of length at least 1/2, which is zero outside [2^n, 2^n + 1] and has an integral of 1/2^n. Now g_N(x) := ∑_{n=1}^N f_n(x) is C^∞ as well and the pointwise limit g(x) = lim_{N→∞} g_N(x) obviously exists.

In order to see that the convergence is uniform, let ‖f‖_∞ := sup_{x ∈ R_{>0}} |f(x)|, which we allow to take the value ∞. It is easy to see that this satisfies the triangle inequality and hence

‖g_N − g‖_∞ = ‖∑_{n=N+1}^∞ f_n‖_∞ ≤ ∑_{n=N+1}^∞ ‖f_n‖_∞ ≤ ‖f_1‖_∞ ∑_{n=N+1}^∞ 2^{−(n−1)} = ‖f_1‖_∞ 2^{−(N−1)},

where ‖f_1‖_∞ < ∞ since f_1 is a continuous function with compact support. Hence {g_N}_{N∈N} converges uniformly to g and it follows that g is C^∞ as well by the Weierstrass theorem. Furthermore, since all f_i ≥ 0,

∫_{R_{>0}} g(ρ)ρ dρ = ∫_{R_{>0}} ∑_{n=1}^∞ f_n(ρ)ρ dρ ≥ ∑_{n=1}^N ∫_{[2^n, 2^n+1]} f_n(ρ)ρ dρ

for all N ∈ N, but

∫_{[2^n, 2^n+1]} f_n(ρ)ρ dρ ≥ ∫_{B_n} (1/2^n) ρ dρ ≥ ∫_{B_n} (1/2^n) 2^n dρ ≥ 1/2

for all n ∈ N, hence ∑_{n=1}^N ∫_{[2^n, 2^n+1]} f_n(ρ)ρ dρ → ∞.

The following two theorems show that the function q of Definition 3.11 is indeed a probability density function and that X ∼ q =⇒ X/‖X‖ ∼ p holds.

Theorem 3.14. Let p, q, r be given as in Definition 3.11. Then q has an integral of 1.

Proof. Let φ_1, …, φ_n, ρ denote the spherical coordinates for R^{n+1} \ {0} (see [4]), where ρ = ‖x‖ for x ∈ R^{n+1}, and let

f : [0, π]^{n−1} × [0, 2π) × R_{>0} → R^{n+1} \ {0}

be the change-of-coordinates map. Write Ω = (φ_1, …, φ_n) and

|det(Df)| = ρ^n sin^{n−1}(φ_1) ⋯ sin(φ_{n−1}) =: ρ^n g(Ω);

then

∫_{R^{n+1}} q(x) dx = ∫_{R_{>0}} (∫_Ω q(f(Ω, ρ)) ρ^n g(Ω) dΩ) dρ
= (1/Z) ∫_{R_{>0}} r(ρ)ρ^n dρ ∫_Ω p(f(Ω, 1)) g(Ω) dΩ
= (1/Z) ∫_{R_{>0}} r(ρ)ρ^n dρ ∫_{S^n} p(x) dx = 1

(the first step applies the change-of-variables formula and uses that {0} may be removed from the domain without changing the value of the integral; the second step fills in the definition of q, using q(f(Ω, ρ)) = (1/Z) r(ρ) p(f(Ω, 1)), and moves the constant out of the integral; the third step applies the change-of-variables formula again; the last step uses the definition of Z and the fact that p integrates to 1). Since the Lebesgue integral is equal to the Riemann integral in cases like these, the result follows.



Theorem 3.15. Let p, q, r be given as in Definition 3.11. Then z ∼ q implies z/‖z‖ ∼ p.

Proof. Define π : R^{n+1} \ {0} → S^n : x ↦ x/‖x‖ and let f : R_{>0} × Ω → R^{n+1} \ {0} be the change-of-coordinates map to spherical coordinates. For sets B ∈ B,

∫_{π^{-1}(B)} q dμ = ∫_{π^{-1}(B)} q(x) dx = ∫_{f^{-1}(π^{-1}(B))} |det(Df)| q(f(y)) dy
= (1/Z) ∫_{R_{>0} × B'} p(f(1, ω)) r(ρ) g(ω) ρ^n dρ dω
= (1/Z) ∫_{B'} p(f(1, ω)) g(ω) dω ∫_{R_{>0}} r(ρ)ρ^n dρ,

where it is used that the Lebesgue integral is equal to the Riemann integral, and the change-of-coordinates formula is applied. Also, for each x ∈ S^n, f^{-1}(π^{-1}(x)) = R_{>0} × {ω_x} for a single vector ω_x ∈ Ω, where ω_x ≠ ω_y whenever x ≠ y. This shows f^{-1}(π^{-1}(B)) = R_{>0} × B' for the set B' with f({1} × B') = B. Hence

∫_{π^{-1}(B)} q dμ = ∫_{B'} p(f(1, ω)) g(ω) dω = ∫_B p(x) dx.

Remark 3.16. In practice, unnormalized probability density functions p and q are used. It is not required to compute the exact value of ∫_{R_{>0}} r(ρ)ρ^n dρ (it only needs to be verified that it is finite), since an unnormalized density q and its rescaling Cq define the same distribution for any constant C > 0.

3.3.3 Manifold sampling

If the sampling domain is a k-dimensional manifold M, then there exist charts {(U_α, φ_α)}_{α∈A} which cover the manifold, with φ_α : U_α → R^k. If there is a small number of charts, these charts may be used to sample from R^k instead of from the manifold, which may be an easier task. This does, however, require a lot of additional computation for switching between the manifold and R^k.

A tool which comes in handy for manifolds is the so-called partition of unity. There exists a partition of unity {ρ_α} subordinate to each open cover {U_α}, which satisfies three conditions:

• ρ_α : M → [0, 1] is C^∞;

• ∑_{α∈A} ρ_α(x) = 1 for all x ∈ M;

• supp(ρ_α) ⊂ U_α.

Using these functions, the sampling problem can in principle be "transported" to R^k. If p : M → R_{≥0} is a density defined on the manifold, then a function on R^k can be defined by (assuming A = {1, …, n} is finite)

q(x) := ∑_{i=1}^n 1_{φ_i(U_i)}(x) ρ_i(φ_i^{-1}(x)) p(φ_i^{-1}(x)),

that is, for all i for which φ_i^{-1}(x) exists (φ_i is injective by definition), compute a weighted probability and sum those. The advantage of the use of partitions of unity is that q will approach zero when x approaches the boundary of ∪_i φ_i(U_i). It might do so, however, with such a sharp slope that sampling methods do not appreciate this fact in practice at all.

A problem with proving that the charts can be used to transport the sampling problem is that for every x ∈ R^k there may exist zero or several y ∈ M such that φ_i(y) = x for some i ∈ {1, …, n} (although at most one per chart). Such a y ∈ M can also lie in the intersection of multiple charts, say y ∈ U_i ∩ U_j. In such a case, it can happen that φ_i(y) ≠ φ_j(y), such that y has two points to which it would like to give its density.

The latter problem can be easily solved with the partitions of unity: each point y ∈ M can give ρ_j(y)p(y) to φ_j(y), and the partitions of unity spread the density nicely (C^∞) over R^k. It can be expected that the q defined above is indeed a density with the same integral as p (although this would still require a formal proof). The main problem lies in defining a manner of conversion f such that X ∼ q gives f(X) ∼ p. As noted before, each x ∈ R^k may have multiple y ∈ M such that y has given part of its density to x. The most obvious conversion would be to select all y ∈ M such that φ_i(y) = x for some i ∈ {1, …, n} and to define f(x) by a random selection procedure which chooses one of these y based on the amount of density they have given to x.

Besides the fact that this will be complicated to prove, such a method would have a tremendous amount of computational overhead and is therefore not suited for practice. The advantage is that it provides a manner of sampling from R^k for k the dimension of the manifold, which is as good as it can get. A (non-trivial) theorem in mathematics called the Whitney embedding theorem states that any smooth real m-dimensional manifold can be smoothly embedded in R^{2m} (if m > 0). For some manifolds (such as spheres) a lower-dimensional embedding can easily be found as well.

For this thesis, it would have been perfect to find a sampling method which is tuned to sampling from exponential families defined on (compact) Lie groups (or manifolds in general). It appears difficult to do this from a mathematical point of view (using the definition of a manifold).

3.3.4 Variations on Gibbs sampling

If the sampling domain is S^n, then the domain may be reduced to a lower-dimensional S^m using an approach similar to Gibbs sampling. Let a probability density p(z) = p(z_1, …, z_{n+1}) defined on S^n ⊂ R^{n+1} be given. The idea is to intersect S^n with an (m+1)-plane (an (m+1)-dimensional plane) of the form {z ∈ R^{n+1} : z_1 = x_1, …, z_{n−m} = x_{n−m}}, where x_1, …, x_{n−m} are values of a previous sample. The intersection will be an m-sphere in the generic case (almost always) but may be a single point in the case that the plane is contained in the tangent plane to a point of S^n. The idea of sampling blocks of variables instead of a single one is already present in the literature [11]. Only reduction to S^0 or S^1 will be discussed here.

For m = 0, the Gibbs sampling method would sample

z_i^{t+1} ∼ p(z_i | z_1^{t+1}, …, z_{i−1}^{t+1}, z_{i+1}^t, …, z_{n+1}^t).



Figure 3.1: Sampling from a circle by an adapted version of Gibbs. The picture shows that given a value (x, y) ∈ S^1 there are at most four other values that may be reached by this algorithm.

This is a one-dimensional distribution and the values of n variables are already given. Taking into account the restriction that the sample (z_1^{t+1}, …, z_i^{t+1}, z_{i+1}^t, …, z_{n+1}^t) should lie on the sphere, there can be only one or two possible choices for z_i^{t+1}, namely the intersections of a line with the sphere. This would lead to a very easy sampling method: at every step, the intersections x_1, x_2 of the line with the sphere need to be found (where x_1 = x_2 is allowed), and x_1 is picked with probability p(x_1)/(p(x_1) + p(x_2)) while x_2 is chosen otherwise.

Unfortunately, such a method is not valid. An easy way to see this is by looking at a circle S^1 ⊂ R^2. Given a sample (x_1, y_1), we create (x_2, y_1) by intersecting the line y = y_1 with S^1. One of the solutions is (x_1, y_1) and the other is opposite to it in the x-direction (and may be equal to it). Let us assume that a new value x_2 ≠ x_1 is reached in this step. In the next step, y_2 may be either y_1 or the value opposite to it on the line x = x_2 (in a similar way). Eventually, it becomes apparent that only the rectangle defined by specifying the sides x = x_1 and y = y_1 can be sampled from, see Figure 3.1.

Similarly, for n > 1 the sampling method is stuck in the intersection of the sphere S^n with an (n+1)-dimensional variant of a rectangle defined by z_1 = x_1, …, z_{n+1} = x_{n+1} (defined to have its corner points on the sphere).

What went wrong here? One of the requirements for Gibbs sampling is that an ergodicMarkov chain is created. Ergodicity basically means that any point in the space can bereached, which is obviously not true here.
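The lack of ergodicity is easy to demonstrate in a few lines (a sketch; the density p below is an arbitrary positive function on the circle): implementing the update just described and collecting the visited points shows that the chain never leaves the corners (±x_0, ±y_0) of the rectangle through the starting point.

```python
import numpy as np

rng = np.random.default_rng(1)
p = lambda x, y: np.exp(2 * x + y)      # arbitrary positive (unnormalized) density on S^1

x, y = np.sqrt(0.8), np.sqrt(0.2)       # starting point (x_0, y_0) on the unit circle
visited = set()
for _ in range(1000):
    # resample x given y: the line {y = const} meets S^1 in +/- sqrt(1 - y^2)
    c = np.sqrt(1 - y ** 2)
    x = c if rng.random() < p(c, y) / (p(c, y) + p(-c, y)) else -c
    # resample y given x, symmetrically
    c = np.sqrt(1 - x ** 2)
    y = c if rng.random() < p(x, c) / (p(x, c) + p(x, -c)) else -c
    visited.add((round(x, 12), round(y, 12)))

# only the corners (+/-x_0, +/-y_0) of the initial rectangle are ever reached
print(len(visited))
```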

Considering a reduction of S^n to S^1, the Gibbs method is able to reach any point of S^n. Sample

(z_i^{t+1}, z_{i+1}^{t+1}) ∼ p(z_i, z_{i+1} | z_1^{t+1}, …, z_{i−1}^{t+1}, z_{i+2}^t, …, z_{n+1}^t).



Figure 3.2: Gibbs sampling on an adjusted sample domain. The big red line is drawn toshow that the x-coordinate has a fixed value and the blue areas denote theplaces where the probability density may be reasonably high. Samples onlines tangent to S1 are seen as the same, which is depicted by the pink area.

To ensure that the new point lies on S^n again, intersect S^n with the plane

{x ∈ R^{n+1} : x_1 = z_1^{t+1}, …, x_{i−1} = z_{i−1}^{t+1}, x_{i+2} = z_{i+2}^t, …, x_{n+1} = z_{n+1}^t}.

This leaves

{(x_i, x_{i+1}) : x_i^2 + x_{i+1}^2 = 1 − ∑_{j≠i,i+1} z_j^2},

which is a circle (or the single point (0, 0) if ∑_{j≠i,i+1} z_j^2 = 1 already).
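One step of this S^1-block update can be sketched as follows (assuming the angle on the remaining circle is drawn by a discretized inverse-transform step, in the spirit of Subsection 4.1.2; the log-density used in the example is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

def block_gibbs_step(z, i, logp, n_grid=512):
    """Resample the pair (z[i], z[i+1]) of a point z on S^n, keeping the other
    coordinates fixed; the free pair lives on a circle of radius rad."""
    rad2 = 1.0 - sum(z[j] ** 2 for j in range(len(z)) if j not in (i, i + 1))
    if rad2 <= 0:                           # degenerate case: the intersection is a point
        return z
    rad = np.sqrt(rad2)
    theta = np.linspace(0, 2 * np.pi, n_grid, endpoint=False)
    cand = np.repeat(z[None, :], n_grid, axis=0)
    cand[:, i], cand[:, i + 1] = rad * np.cos(theta), rad * np.sin(theta)
    w = np.exp([logp(c) for c in cand])
    k = rng.choice(n_grid, p=w / w.sum())   # discretized sample of the angle
    return cand[k]

# example on S^3 with an arbitrary log-density
logp = lambda z: 3.0 * z[0] + z[2]
z = np.array([0.5, 0.5, 0.5, 0.5])
for i in range(3):
    z = block_gibbs_step(z, i, logp)
print(z, np.linalg.norm(z))
```

Since each update only redistributes mass between the two free coordinates on the circle left by the others, the result stays on the unit sphere by construction.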

The ergodicity problem which arises when Gibbs is applied can also be solved using a form of projection sampling. In Figure 3.2 the idea of Gibbs sampling on the adjusted domain is depicted for n = 1. The same idea can be applied for spheres S^n for any n ∈ N.



4 Experiments

This chapter describes which methods are implemented, how the evaluation procedure works and what the results are.

4.1 Implemented sampling methods

All sampling methods should sample from either p(φ|x, y) defined on [0, 2π] × [0, π] × [0, 2π] or from p(g(φ)|x, y) defined on S^3, where

g : S^3 → [0, 2π] × [0, π] × [0, 2π]

is the conversion function from quaternions to Euler angles. More details about the computational framework in which the methods are integrated can be found in Chapter 2.

All code is written in Python. It is easiest to run the code in Linux, since some of the code in the framework has C code integrated into the Python code, for which Windows has a hard time finding a compiler. All methods should be able to return samples in an array of shape (dimension, amount of data, number of samples). The dimension can be either 3 (Euler angles) or 4 (S^3 or R^4). For each data point, the density p(φ|x, y) which is sampled from is different; thus it is desirable that the sampling methods are able to sample from various distributions simultaneously. The function f for which E_p[f] is approximated is the log of the density corresponding to N(y|WR(φ)W^t x, σ), that is,

f_{x,y,W}(φ) = −(1/(2σ^2)) ‖y − WR(φ)W^t x‖^2 − log(√((2π)^n σ^{2n}))

for n the dimension of the data, and

log(√((2π)^n σ^{2n})) = (n/2) log(2π) + n log(σ).

The algorithm actually needs to approximate E_{φ|x,y}[log p(y|x, φ) + log p(φ)], but the prior of φ may be omitted (since the formula for p(φ) is independent of the parameters that are learned) and the constant term is also omitted in all experiments.
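In code, this objective may be sketched as follows (with the constant term written out separately; R(φ) is taken here to be a ZYZ Euler-angle rotation, one common convention, and W is taken square for illustration, whereas the thesis' exact conversion and representation matrices are given in Chapter 2):

```python
import numpy as np

def rot_z(a):
    return np.array([[np.cos(a), -np.sin(a), 0],
                     [np.sin(a),  np.cos(a), 0],
                     [0, 0, 1]])

def rot_y(b):
    return np.array([[ np.cos(b), 0, np.sin(b)],
                     [0, 1, 0],
                     [-np.sin(b), 0, np.cos(b)]])

def R(phi):
    # ZYZ Euler-angle rotation (an assumption for this sketch; see Section 2.2)
    a, b, c = phi
    return rot_z(a) @ rot_y(b) @ rot_z(c)

def f(phi, x, y, W, sigma):
    """log N(y | W R(phi) W^t x, sigma^2 I): the integrand of the EM objective."""
    n = len(y)
    resid = y - W @ R(phi) @ W.T @ x
    const = 0.5 * n * np.log(2 * np.pi) + n * np.log(sigma)  # omitted in the experiments
    return -resid @ resid / (2 * sigma ** 2) - const

# when y equals W R(phi) W^t x exactly, only the constant term remains
rng = np.random.default_rng(3)
W, x = rng.standard_normal((3, 3)), rng.standard_normal(3)
phi = (0.3, 1.1, -0.5)
y = W @ R(phi) @ W.T @ x
print(f(phi, x, y, W, sigma=0.1))
```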

4.1.1 Existing methods

The methods below are explained in more detail in Section 3.2.

• Importance sampling. Sample z_i uniformly from [0, 2π] × [0, π] × [0, 2π] and approximate the integral by (4π^3/L) ∑_{i=1}^L f(z_i)p(z_i), which is a form of importance sampling with proposal distribution q(z) = 1/(4π^3). Since the normalization constant of p is unknown, it needs to be approximated as well, by

C = ∫_{Euler angles} p(z) dz ≈ (4π^3/L)(p(z_1) + ⋯ + p(z_L)).

This gives the approximation

∫ f(z)p(z) dz ≈ (∑_{i=1}^L f(z_i)p(z_i)) / (∑_{i=1}^L p(z_i)).

• Metropolis-Hastings. The Metropolis-Hastings algorithm is implemented in the framework by Taco Cohen. A new proposal φ* based on the current value φ is generated by φ* = φ + u for u ∼ M(0, κ) × (1/2)M(0, κ) × M(0, κ).

• Importance sampling on S^3. This is included to verify experimentally whether the conversion function f from quaternions to Euler angles and the distribution on S^3 are chosen correctly. Quaternions are sampled randomly by sampling their spherical coordinates and converting these to elements of S^3. The samples are converted to samples for p(φ|x, y) by applying the conversion function f.

• Spherical HMC. The algorithm that is implemented is given in Algorithm 1 in Subsection 3.2.4. The computations for the gradient of U are given in Section 2.2. The samples are in the end converted to samples for p(φ|x, y) by applying the conversion function f.

• HMC. A simple HMC algorithm is implemented and used to sample from R^4. Due to the fact that the sampler makes steps in the direction of the gradient, this method cannot easily be applied to sample from S^3 or [0, 2π] × [0, π] × [0, 2π], since it might step out of the domain.
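The first method above amounts to self-normalized importance sampling with a uniform proposal; a minimal standalone version (with a toy unnormalized density standing in for p(φ|x, y)) looks like this:

```python
import numpy as np

rng = np.random.default_rng(4)

def estimate(f, p_unnorm, L):
    """Self-normalized importance sampling of E_p[f] with a uniform proposal
    on the Euler-angle box [0, 2pi] x [0, pi] x [0, 2pi]."""
    z = rng.uniform(0, 1, size=(L, 3)) * np.array([2 * np.pi, np.pi, 2 * np.pi])
    w = p_unnorm(z)                    # the volume 4*pi^3 cancels in the ratio below
    return np.sum(f(z) * w) / np.sum(w)

# toy stand-ins for the thesis' p(phi|x, y) and objective f
p_unnorm = lambda z: np.exp(np.cos(z[:, 0]) + 2 * np.cos(z[:, 1]))
f = lambda z: z[:, 1]
est = estimate(f, p_unnorm, L=100_000)
print(est)
```

Because the proposal is uniform, the factor 4π^3 appears in both the numerator and the estimated normalization constant and cancels, which is exactly the final formula of the importance-sampling bullet above.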

4.1.2 Discretized Gibbs sampling on R4

Define r as the probability density function of N(1, 0.02) and

q : R^4 → R_{≥0} : x ↦ r(‖x‖) p(x/‖x‖).

Note that q is ill-defined in the point x = 0, but in practice this will not be a problem, since the chance of needing to evaluate q in the point 0 is extremely slim. Gibbs sampling can now be applied on this adjusted domain, which results in the need to sample from one-dimensional distributions on a relatively small domain.

Denote z = (z_1, z_2, z_3, z_4) ∈ R^4, and assume that the need arises to sample a new value for x_1 from

q(x_1) := q(x_1, z_2, z_3, z_4).

A sample x1 ∼ q can be approximated as follows.



• Partition the domain into n segments: −2 = a0 ≤ a1 ≤ · · · ≤ an = 2.

• Evaluate q(a_i) for i = 0, …, n and compute b_i = (a_{i+1} − a_i)(q(a_i) + q(a_{i+1})) for i = 0, …, n − 1, as an estimate for twice the value of the integral of q on [a_i, a_{i+1}].

• Let B = ∑_{i=0}^{n−1} b_i and sample y uniformly from (0, 1). Keep summing values of b_i until ∑_{i=0}^{r−1} b_i ≤ By ≤ ∑_{i=0}^{r} b_i.

• Sample x_1 uniformly from [a_r, a_{r+1}].

This is a discretized form of inverse transform sampling. The partition of the domain can be done in various ways. Both uniform partitions and random partitions have been implemented and evaluated. A random partition is created by sampling n random numbers l_1, …, l_n and computing a normalization constant C = (1/4) ∑_{i=1}^n l_i. In this case, l_1/C, …, l_n/C will sum to four and the elements a_i used above are defined by a_0 = −2 and

a_i = a_{i−1} + l_i/C

for i = 1, …, n.
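The steps above, for a uniform partition, can be sketched as follows (a standalone version of the per-coordinate update; q1 is the one-dimensional conditional x_1 ↦ q(x_1, z_2, z_3, z_4), with illustrative values for z_2, z_3, z_4 and an arbitrary density for the direction):

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_conditional(q1, lo=-2.0, hi=2.0, n=400):
    """Discretized inverse-transform sampling from an unnormalized 1-D density q1
    on [lo, hi], using the uniform partition lo = a_0 <= ... <= a_n = hi."""
    a = np.linspace(lo, hi, n + 1)
    qa = q1(a)
    b = (a[1:] - a[:-1]) * (qa[:-1] + qa[1:])   # twice the trapezoid mass per segment
    cdf = np.cumsum(b) / b.sum()
    r = min(np.searchsorted(cdf, rng.random()), n - 1)  # segment holding the quantile
    return rng.uniform(a[r], a[r + 1])          # sample uniformly inside that segment

# example conditional: x1 -> q(x1, 0.3, 0.4, 0.5) with q(x) = r(||x||) p(x/||x||),
# r the N(1, 0.02) pdf (up to constants) and p an arbitrary density of the direction
def q1(x1):
    z = np.stack([x1, np.full_like(x1, 0.3), np.full_like(x1, 0.4), np.full_like(x1, 0.5)])
    rho = np.sqrt((z ** 2).sum(axis=0))
    return np.exp(-0.5 * ((rho - 1.0) / 0.02) ** 2) * np.exp(z[0] / rho)

samples = np.array([sample_conditional(q1) for _ in range(2000)])
# the mass sits near x1 = +/- sqrt(1 - 0.3^2 - 0.4^2 - 0.5^2) = +/- sqrt(0.5)
print(samples.mean())
```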

Another idea (which is not implemented) that can improve the estimation is the following. The function q will be large whenever r and p are large, and r is obviously large in this case when ‖x‖ is close to 1, that is, when x_1 is close to μ_1 = √(1 − z_2^2 − z_3^2 − z_4^2) or μ_2 = −√(1 − z_2^2 − z_3^2 − z_4^2). Using the standard deviation and these two middle points, the domain could be partitioned in such a way that the segments are smaller when they are closer to one of the μ_i.

Figure 4.1: A conditional of log q is shown in red, with its components log p in green and log r in blue, for p, q, r as in Definition 3.11.

The error in the discretization of the domain will probably be low, since the conditionals of the pdf q are quite smooth, as can be seen in Figure 4.1.



Figure 4.2: Level curves of a two-dimensional slice of the logarithm of a pdf p defined on Euler angles. Blue colours indicate low values of p, whereas red colours indicate high values. The arrows indicate the direction of the gradient, which points (as it should) to values where log p is higher.

4.2 Evaluation method

The sampling methods are mainly evaluated on how well they are able to approximate the integrals they were implemented for. For this purpose, a distribution p(φ|x, y) with toy data (x, y) is fixed and the various methods are used to draw samples from this particular distribution. These samples are used to approximate the integral, which is contrasted with a baseline approximation. The number of samples and the parameters of the methods are varied, as well as the fixed distribution. The methods are evaluated both on the number of samples per time unit and on the result per number of samples. During all the runs, the dimensions of the irreducible representations are predefined as three, five and seven.

The conversion function from quaternions to Euler angles is tested by showing that an element represents the same rotation after conversion as before. This is verified for a few random quaternions, where the quaternions are generated randomly by sampling spherical coordinates and converting those to quaternions. The idea of transforming samples is evaluated experimentally by sampling from S^3 and R^4 in various ways and transforming those samples to Euler angles. Estimates of the objective using samples taken in Euler angles and transformed samples are compared.

The implementation of the gradients is evaluated by plotting level curves of two-dimensional slices of the distribution, together with arrows pointing in the direction of the gradient. One of the plots can be found in Figure 4.2.



4.2.1 Problems with the normalization constant

Obtaining a good baseline for evaluation appeared to be the most difficult and time-consuming task. Intuitively, importance sampling seems like a good baseline. It approximates

∫_D f(z)p(z) dz = B ∫_D (1/B) f(z)p(z) dz ≈ (B/L) ∑_{i=1}^L f(z_i)p(z_i)

for D = [0, 2π] × [0, π] × [0, 2π], B = μ(D) and z_i ∼ U(D) uniform samples. Theoretically, the approximation error will tend to zero as the sample size tends to infinity. Since the domain is low-dimensional and "small" (it is bounded) and since the generated pdfs tend to be reasonably smooth (based on looking at plots), it can be expected that the number of samples required for a good approximation will be relatively small. Experiments also showed that the variation of the estimate of the integral when using importance sampling becomes very small as the sample size increases, as can be seen in Table 4.1. In every run of a sampling algorithm, different samples will be produced, resulting in

Number of samples    Mean estimate    Standard deviation    Mean error

10                   -651             139                   454
20                   -578             197                   380
50                   -521             164                   324
100                  -296             62                    99
200                  -283             51                    85
500                  -258             17                    61
1000                 -237             11                    40
2000                 -227             17                    30
5000                 -233             11                    36
10000                -213             13                    17

Baseline             -197             3                     -

Table 4.1: The mean estimate, standard deviation and mean error when applying importance sampling with various amounts of samples. The error is computed using a baseline created by taking the mean of fifty estimates, each obtained by applying importance sampling with 100.000 samples.

a different estimation of the integral. Due to the random results, the variation of this estimation seems to be a good indicator for whether the sampling method works well (with the current parameters, pdf and the chosen amount of samples). However, the results in the table also show that the mean estimates are "converging" in a strange way: an estimation using importance sampling with few samples gives a lower estimate of the objective than an estimation using importance sampling with many samples. This phenomenon is encountered in all runs of the algorithm on the problem at hand. Furthermore, if 100.000 uniform samples are taken and the objective is approximated using importance sampling with batches of various lengths (dividing these fixed samples into various batches), then the estimation shows this converging effect as well (in this case,



Batch size    All fixed         Fixed constant       Fixed samples

10            -1028.34720106    -0.111669519628      -1595.34096715
20            -1028.34720106    -762.607423827       -1433.80096625
50            -1028.34720106    -0.763794128348      -1289.88298997
100           -1028.34720106    -42.2942551534       -1221.29146655
200           -1028.34720106    -0.426344860118      -1174.71068312
500           -1028.34720106    -0.210725436807      -1126.37938479
1000          -1028.34720106    -0.0492779940323     -1101.05399175
2000          -1028.34720106    -0.137627568186      -1081.12267441
5000          -1028.34720106    -0.00129862813144    -1061.27717511
10000         -1028.34720106    -632.50308832        -1052.33376029
20000         -1028.34720106    -71.8705168278       -1045.63744677
50000         -1028.34720106    -0.0284389787768     -1037.57087215
100000        -1028.34720106    -40.8382662736       -1028.34720106

Table 4.2: The mean estimates when applying importance sampling with various amounts of samples. The estimation is done based on 100.000 samples distributed using various batch sizes. The normalization constant and the samples are either varied or held constant as the batch sizes vary.

the normalization constant is approximated anew for each batch based on the samples in the batch). This can be seen in Table 4.2. This table also shows that the estimation of the objective itself is independent of the batch size: when the normalization constant is precomputed based on all 100.000 samples, we find the exact same approximation of the integral, as expected. Finally, when the normalization constant is precomputed, but new samples are taken for each batch size, the results vary greatly. The density p from which is sampled is very steep and log p takes values well over a thousand, such that a precomputed normalization constant may be very high or low for the samples that are taken, which might explain this variation.

When the normalization constant and the objective are estimated based on different sample sets (using two sets of 100.000 samples), the converging effect becomes even worse. However, both the estimation of the (log) normalization constant and the estimation of the objective do not show this effect when they are computed separately. A similar phenomenon is encountered when approximating ∫ z^2 p(z) dz for unnormalized p ∼ N(0, 1), as can be seen in Figure 4.3. In this case, the approximation does converge to the true value of the integral.
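The phenomenon in Figure 4.3 can be reproduced in a few lines (a sketch): the integral ∫ z^2 p(z) dz = 1 for unnormalized p ∝ N(0, 1) is estimated with a uniform proposal on [−5, 5], once with the true normalization constant √(2π) and once with the constant estimated from the same samples.

```python
import numpy as np

rng = np.random.default_rng(6)
p = lambda z: np.exp(-0.5 * z ** 2)        # unnormalized N(0,1); true constant sqrt(2*pi)

def est_known_Z(L):                        # importance sampling, normalization known
    z = rng.uniform(-5, 5, L)
    return 10.0 / L * np.sum(z ** 2 * p(z)) / np.sqrt(2 * np.pi)

def est_estimated_Z(L):                    # normalization estimated from the same samples
    z = rng.uniform(-5, 5, L)
    return np.sum(z ** 2 * p(z)) / np.sum(p(z))

means = {L: (np.mean([est_known_Z(L) for _ in range(200)]),
             np.mean([est_estimated_Z(L) for _ in range(200)]))
         for L in (10, 100, 10_000)}
print(means)
```

The known-constant estimator is unbiased for every sample size, while the self-normalized (ratio) estimator carries a bias of order 1/L, which is consistent with the "converging" behaviour observed above.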

The integral can also be approximated numerically, for example by placing a dense grid over [0, 2π] × [0, π] × [0, 2π], sampling a point z_i randomly from each bucket B_i and approximating

∫ f(z)p̄(z) dz ≈ ∑_{i=1}^L f(z_i)p̄(z_i)μ(B_i)

for p̄ the normalized version of p. This approximation follows the definition of Riemann integrals and should reach the true value of the integral as μ(B_i) → 0



Figure 4.3: The integral ∫ z^2 p(z) dz is approximated for p ∼ N(0, 1) using Monte Carlo with 100.000 samples using various batch sizes, by sampling from N(0, 1) (red), applying importance sampling (yellow) and applying importance sampling while simultaneously approximating the normalization constant (green). The blue line indicates the true value.

(thus L → ∞). In this case, the normalization constant of p needs to be approximated as well, and this can be done using the same samples in a similar fashion as for importance sampling. If B is the total area of the domain, then μ(B_i) = B/L for each i and

∑_{i=1}^L f(z_i)p(z_i)μ(B_i) = (B/L) ∑_{i=1}^L f(z_i)p(z_i) ≈ (∑_{i=1}^L f(z_i)p(z_i)) / (∑_{i=1}^L p(z_i)),

such that the theory of Riemann integrals gives the same approximation as applying importance sampling, where only the way of acquiring samples is different. This approximation has been done for one-dimensional slices and for the full domain, and is plotted against the approximation of importance sampling in Figure 4.4.

Since both Riemann estimates and importance sampling give similar estimates for a large number of samples, the mean estimate of applying importance sampling fifty times with 100.000 samples is used as baseline. No explanation for the phenomenon of "monotonically converging" estimates has been found, although the simultaneous approximation of the normalization constant and the objective seems to be the cause.



Figure 4.4: The mean estimate is plotted against the number of samples for importance sampling (blue) and estimation by Riemann sums (red). The first graph shows the approximation made for the full domain and the second graph shows the approximation for a one-dimensional slice.



Figure 4.5: The four coordinates of samples acquired using HMC (without burn-in period) have been plotted.

4.2.2 Problems with HMC

In the first few test runs, both HMC and Spherical HMC performed very poorly: their approximations of the same objective differed greatly between various runs of the algorithm. This may have various causes: the algorithms may have problems with the particular type of distribution, there might be a mistake in the implementation of the algorithm or in the implementation of the gradient, and the parameters of the algorithm (the number and size of the leapfrog steps and the burn-in period) may need better tuning. As mentioned before, the implementation of the gradient is evaluated by looking at level plots, of which an example can be found in Figure 4.2. By tuning the parameters some improvement was made, but the desired results did not follow. For HMC methods, poor performance can also be recognized when the trajectories are plotted, by assessing whether the acceptance rate is high enough and whether the sampler moves around the entire domain. Neal [14, page 30] derived that an acceptance rate of 65% is optimal for HMC.
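For reference, the simple HMC method uses the standard leapfrog scheme; the sketch below shows one such update on a toy Gaussian target (the thesis instead applies it to q on R^4, with the gradients of Section 2.2):

```python
import numpy as np

rng = np.random.default_rng(7)

def hmc_step(x, log_p, grad_log_p, eps=0.1, n_leap=20):
    """One HMC step with leapfrog integration and a Metropolis accept/reject."""
    p0 = rng.standard_normal(x.shape)            # resample the momentum
    x_new = x.copy()
    p = p0 + 0.5 * eps * grad_log_p(x_new)       # initial half step for the momentum
    for _ in range(n_leap):
        x_new = x_new + eps * p                  # full step for the position
        p = p + eps * grad_log_p(x_new)          # full step for the momentum ...
    p = p - 0.5 * eps * grad_log_p(x_new)        # ... corrected to a final half step
    log_accept = (log_p(x_new) - 0.5 * p @ p) - (log_p(x) - 0.5 * p0 @ p0)
    return x_new if rng.random() < np.exp(min(0.0, log_accept)) else x

# toy target: a standard normal on R^2
log_p = lambda x: -0.5 * x @ x
grad_log_p = lambda x: -x
x = np.zeros(2)
samples = np.empty((2000, 2))
for t in range(2000):
    x = hmc_step(x, log_p, grad_log_p)
    samples[t] = x
print(samples.mean(axis=0), samples.var(axis=0))
```

On a steep target like the q of this chapter, the same scheme requires a much smaller step size, which is one of the tuning problems described above.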

In this case, one of the explanations for the poor performance of HMC is the steepness of the distribution. In Figure 4.5 a sample trajectory of HMC has been plotted. The four coordinates of the samples acquired using HMC are plotted in different colours. The acceptance rate seems to be quite high, but in most plots the sampler does not seem to move around the space (which is of course required to get a good estimation). In Figure 4.6 the results of letting HMC sample from a two-dimensional slice of the domain are plotted. As can be seen, the sampler walks to a place where q is very high and cannot get out. If the distribution is made less steep by hand, then the sampler is able to move around the space, as can be seen in Figure 4.7. This trajectory seems to indicate that



Figure 4.6: Three different sample trajectories of HMC (without burn-in period) have been plotted, where HMC is used to sample from a two-dimensional slice of the domain. The red dots indicate the initial positions of the trajectories. The various colours indicate the value of log q.

Figure 4.7: A sample trajectory of HMC (without burn-in period) has been plotted,where HMC is used to sample from a two-dimensional slice of the domainand log q has been divided by 100. A contour plot of log(q)/100 is given andthe initiation position is indicated with a red dot.

43

Page 44: Sampling from constrained domains - UvAthe Markov chains due to their ability to propose long distance moves which retain a high acceptance probability. Several approaches to handling

Figure 4.8: The sample trajectories of discretized Gibbs (blue) and HMC (green) areplotted on a level plot of a two-dimensional slice of log(q)/100. The bluedots are equally divided between both peeks in the distribution, whereas thegreen dots are stuck in one peek. Both methods were initiated on the reddot.

the sampling algorithm is working well, since it samples more often from areas where thedistribution is higher. HMC does, however, seem to have more difficulties with peaksthen discretized Gibbs. In Figure 4.8 it can be seen that discretized Gibbs is able tomove between both peaks in the distribution, whereas HMC is stuck in one of them.

4.3 Results

First of all, the implementations can be used to verify experimentally whether the chosen distribution p on S3 and the conversion function f satisfy X ∼ p =⇒ f(X) ∼ p. This is done by taking uniform samples in S3 and uniform samples in [0, 2π] × [0, π] × [0, 2π], transforming the samples in S3 to Euler angles and evaluating the objective. If the conversion function works, then both estimates should behave in similar ways. As can be seen in Figure 4.9, the estimates are very close to each other. This does not prove that the transformation is working well, but it is a strong indicator. Another result which can be seen in Figure 4.9 is that great care needs to be taken when defining a function for creating random quaternions. If four random numbers between 0 and 1 are sampled and the resulting vector is divided by its norm, then a random quaternion results. However, this quaternion is not sampled from the uniform distribution over S3, since elements of the form (x, x, x, x) will be sampled more often than (1, 0, 0, 0) (which can be seen when picturing a circle in a square). If this “wrong” sampling method for creating quaternions is used, the estimates do indeed differ from the estimates created using a “true” uniform sampling method (which can be done by sampling spherical coordinates, as mentioned previously).

Figure 4.9: Estimates using importance sampling with a uniform distribution over Euler angles (red), a uniform distribution over quaternions (blue) and an almost uniform distribution over quaternions (yellow), plotted against the sample size, where the mean estimate over 10 trials is taken.
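The difference between the naive and the correct way of generating random quaternions can be illustrated with a short numpy sketch (the function names are mine, not the thesis implementation; the Gaussian-normalization trick is an alternative to the spherical-coordinate method mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)

def wrong_uniform_quaternion(n):
    # Naive approach from the text: normalize points drawn uniformly
    # from [0, 1]^4.  The result is NOT uniform on S3: directions such
    # as (1, 1, 1, 1)/2 are over-represented relative to (1, 0, 0, 0),
    # and only the positive orthant is ever reached.
    v = rng.random((n, 4))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def uniform_quaternion(n):
    # Correct approach: normalize i.i.d. standard Gaussian samples.
    # The multivariate standard normal density is rotation invariant,
    # so the normalized vectors are exactly uniform on S3.
    v = rng.normal(size=(n, 4))
    return v / np.linalg.norm(v, axis=1, keepdims=True)
```

Comparing the two outputs makes the bias visible immediately: the naive sampler never produces a quaternion with a negative component.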

In order to test experimentally whether it is indeed possible to sample from R4 and convert the samples to samples for S3 (and then to Euler angles), HMC and discretized Gibbs have been applied to sample from the distribution q on R4. In this case, r ∼ N(0, 0.02) is taken, where plots have been used to derive a good value for the standard deviation.

Discretized Gibbs has been implemented for R4 in three versions: one version partitions the domain randomly and the other two uniformly, where one of those also saves the “in between updates”, i.e. the results after updating a single variable. In Table 4.3, the mean estimates of the EM objective are given when the three versions of discretized Gibbs are applied with various numbers of samples and partitions. Uniform discretization seems to work better than random discretization, which is probably due to the fact that the maximum size a partition segment can have is much larger for random discretization, so that the discretization error is larger. A disadvantage of the implementation of discretized Gibbs is that it takes very long to sample a point, since four conditionals need to be discretized for each sample. The number of partitions should be chosen as low as possible while still discretizing the domain sufficiently. Apparently, ten partition segments is too few, but the estimates are already close to the baseline for a hundred partition segments. Furthermore, a small number of samples already gives a reasonable estimate for the objective, which is very different from importance sampling. This effect can be expected from the theory, since discretized Gibbs is able to sample the points that count, namely those for which p is high, whereas importance sampling needs to compensate for sampling from the wrong distribution and approximate the normalization constant.

Samples  Partitions  Euler Gibbs  Random Gibbs  Uniform Gibbs  Adjusted Gibbs
   5         10         -1548        -2076          -1763          -1796
   5        100         -1183        -1386          -1162          -1154
   5        500         -1148        -1360          -1150          -1139
  10         10         -1545        -2110          -1826          -1774
  10        100         -1229        -1393          -1239          -1230
  10        500         -1136        -1530          -1237          -1233
  50         10         -1516        -2201          -1757          -1790
  50        100         -1114        -1223          -1190          -1182
  50        500         -1112        -1455          -1371          -1371
 100         10         -1490        -2164          -1756          -1805
 100        100         -1157        -1376          -1153          -1137
 100        500         -1140        -1378          -1246          -1245

Table 4.3: Mean estimates of the EM objective over ten trials when sampling using discretized Gibbs, where the domain is partitioned randomly or uniformly. Adjusted Gibbs partitions the domain uniformly and saves a sample as soon as a variable is updated (and thus has four times as many samples in each run). Euler Gibbs samples in Euler angles, whereas the other three methods sample from q as defined in Definition 3.11. The baseline estimate is -1033.
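The discretized Gibbs scheme described above can be sketched as follows (a minimal numpy version with a uniform grid, not the thesis implementation; the function name and defaults are mine):

```python
import numpy as np

def discretized_gibbs(log_q, x0, n_samples, lo=-1.0, hi=1.0, n_grid=100, seed=0):
    # Gibbs sampler whose full conditionals are discretized on a uniform
    # grid: each coordinate in turn is resampled from the normalized
    # conditional density evaluated at `n_grid` grid points.
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    grid = np.linspace(lo, hi, n_grid)
    samples = []
    for _ in range(n_samples):
        for i in range(x.size):
            # Evaluate log q(x_i | x_{-i}) on the grid.
            trial = np.repeat(x[None, :], n_grid, axis=0)
            trial[:, i] = grid
            logp = np.array([log_q(t) for t in trial])
            p = np.exp(logp - logp.max())   # stabilized before normalizing
            x[i] = rng.choice(grid, p=p / p.sum())
        samples.append(x.copy())
    return np.array(samples)
```

This makes the cost structure visible: every sample requires `dim × n_grid` density evaluations, which matches the observation that each point takes long to generate and that too few partition segments introduce a discretization error.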

Table 4.3 also contains the mean estimates of discretized Gibbs when applied to sample from Euler angles, where the domain is partitioned uniformly. If the transformation of samples is performed correctly, the estimates of discretized Gibbs applied to Euler angles and discretized Gibbs applied to R4 should behave in a similar way. Discretized Gibbs applied to Euler angles performs better, but does not reach the baseline either. Furthermore, it behaves similarly with respect to changes in the number of samples and the number of partitions. The slight difference in performance may be due to the fact that the sampling domain R4 is more difficult to sample from than Euler angles. The advantage of R4 lies mainly in the fact that more sampling methods can be applied to sample from SO(3); it does not necessarily give any computational improvements.

After tuning the parameters and fixing all bugs, Spherical HMC gives estimates very close to the baseline more than half the time, but may also give much lower estimates. A step size which is too large (for example ε = 0.2) results in a low acceptance rate (and bad estimates). For small step sizes (for example ε = 0.005), the acceptance rate becomes surprisingly high (it reaches 99%), although the method is then unable to propose long-distance moves and/or a large amount of computation time is required (depending on the chosen number of leapfrog steps). In Figure 4.10 the mean error is plotted against the number of samples for various numbers of leapfrog steps (the number of leapfrog steps is actually randomized between the given number and 1.25 times that number, to ensure the sampler does not get stuck in periodicities). Again, an increase in the number of samples does not seem to give much improvement in the estimates, as was also the case for discretized Gibbs sampling. As expected, more leapfrog steps result in better estimates. The computation time increases proportionally to the number of leapfrog steps (around 27 seconds for L = 10 and 274 seconds for L = 100 for producing all estimates) and to the number of samples. Based on the graph, it is most efficient to pick a sufficiently high L and use a small number of samples.

Figure 4.10: The mean error when Spherical HMC is used to estimate an objective, plotted against the number of samples for various numbers of leapfrog steps. The mean is taken over ten trials and the colours red, blue, green and yellow correspond to 5, 10, 50 and 100 leapfrog steps.
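For reference, the HMC leapfrog scheme with the randomized step count mentioned above can be sketched as follows (a minimal one-dimensional numpy version under my own naming, not the thesis code):

```python
import numpy as np

def hmc(logp, logp_grad, x0, n_samples, eps=0.1, n_leap=20, seed=0):
    # Basic Hamiltonian Monte Carlo for a scalar target with density
    # proportional to exp(logp(x)).
    rng = np.random.default_rng(seed)
    x = float(x0)
    out = []
    for _ in range(n_samples):
        p = rng.normal()
        # Randomize the number of leapfrog steps between L and 1.25 L,
        # as in the text, to avoid resonance with periodic trajectories.
        L = int(rng.integers(n_leap, int(1.25 * n_leap) + 1))
        xn, pn = x, p
        pn += 0.5 * eps * logp_grad(xn)        # half step for momentum
        for _ in range(L - 1):
            xn += eps * pn                     # full step for position
            pn += eps * logp_grad(xn)          # full step for momentum
        xn += eps * pn
        pn += 0.5 * eps * logp_grad(xn)        # final half step
        # Metropolis accept/reject on the change in the Hamiltonian.
        log_acc = (logp(xn) - 0.5 * pn**2) - (logp(x) - 0.5 * p**2)
        if np.log(rng.random()) < log_acc:
            x = xn
        out.append(x)
    return np.array(out)
```

The trade-off discussed above is visible here directly: the cost per sample is proportional to L, while too small an ε with a fixed L shortens the trajectory ε·L and thus the distance a proposal can travel.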

The estimates of Spherical HMC, HMC and Metropolis-Hastings for the same distribution as in Table 4.3 are given in Table 4.4. As can be seen, Metropolis-Hastings seems to benefit most from a burn-in period. This method is also the fastest of the MCMC methods, which can be expected since the HMC methods require leapfrog steps and the computation of the gradient, and Gibbs needs to sample all four conditionals (the Metropolis-Hastings implementation was also written by a more experienced programmer). The results seem reasonably good, but all samplers have problems moving around: Metropolis-Hastings has an extremely low acceptance rate (depending on the distribution, the number of samples and the run, often 0-10%) and the HMC methods move around only within their own peak.


Samples  Burn-in  Spherical HMC    HMC    Metropolis-Hastings
  10        0        -1241        -1299         -1595
  20        0        -1122        -1120         -1523
  50        0        -1209        -1219         -1286
  75        0        -1295        -1209         -1219
 100        0        -1178        -1125         -1228
 200        0        -1072        -1164         -1196
 350        0        -1145        -1059         -1183
 500        0        -1210        -1221         -1190
  10       10        -1092        -1181         -1374
  10       50        -1178        -1165         -1193
  10      100        -1148        -1158         -1196

Table 4.4: Mean estimates of the EM objective over ten trials when sampling using Spherical HMC, HMC or Metropolis-Hastings with various numbers of samples and burn-in samples. The baseline estimate is -1033. For the HMC methods, ε = 0.005 and L = 20.

In order to see the effect of the distribution on the estimates, the distribution is made less steep by hand by dividing log p by a hundred. The same experiment is conducted for Gibbs, HMC, Spherical HMC and Metropolis-Hastings, of which the results are presented in Table 4.5 and Table 4.6. Unexpectedly, Metropolis-Hastings and discretized Gibbs on Euler angles seem to benefit most. Apparently, for this particular distribution, the distribution on Euler angles benefits most from making the distribution less steep. Also note that this time some estimates below the baseline occur.

When the same experiment is conducted for different distributions (that is, by producing different data x, y on which the distribution depends, but still making the distribution less steep by hand), the results vary. In one case the HMC methods perform best and Metropolis-Hastings and Gibbs perform very poorly, and in the next the HMC methods perform poorly and Gibbs on Euler angles performs best. It thus depends on the particular distribution which method works best. Importance sampling produces the estimates with the least variance, and even though it requires many samples for doing so, it is still one of the fastest methods. Therefore, importance sampling seems like the best method for approximating the EM objective. If only a small number of samples is required that actually need to come from the specified distribution p, then one of the other methods can be applied, of which Gibbs on Euler angles seems to perform best in general.


Samples  Partitions  Euler Gibbs  Random Gibbs  Uniform Gibbs  Adjusted Gibbs
   5         10         -1601        -2690          -2318          -2201
   5        100         -1349        -1858          -1350          -1325
   5        500         -1281        -2134          -1450          -1343
  10         10         -1550        -2714          -2286          -2246
  10        100         -1280        -1832          -1431          -1297
  10        500         -1305        -1717          -1312          -1265
  50         10         -1580        -2714          -2236          -2194
  50        100         -1267        -1689          -1315          -1217
  50        500         -1261        -1617          -1414          -1352
 100         10         -1574        -2729          -2287          -2147
 100        100         -1261        -1659          -1334          -1249
 100        500         -1259        -1722          -1476          -1384

Table 4.5: Mean estimates of the EM objective over ten trials when sampling using discretized Gibbs, where the domain is partitioned randomly or uniformly. Adjusted Gibbs partitions the domain uniformly and saves a sample as soon as a variable is updated (and thus has four times as many samples in each run). Euler Gibbs samples in Euler angles, whereas the other three methods sample from q as defined in Definition 3.11. The baseline estimate is -1251.

Samples  Burn-in  Spherical HMC    HMC    Metropolis-Hastings
  10        0        -1452        -1484         -1595
  20        0        -1276        -1261         -1523
  50        0        -1358        -1364         -1286
  75        0        -1404        -1390         -1219
 100        0        -1256        -1301         -1228
 200        0        -1428        -1302         -1196
 350        0        -1233        -1250         -1183
 500        0        -1345        -1356         -1190
  10       10        -1310        -1306         -1374
  10       50        -1391        -1441         -1193
  10      100        -1345        -1343         -1196

Table 4.6: Mean estimates of the EM objective over ten trials when sampling using Spherical HMC, HMC or Metropolis-Hastings with various numbers of samples and burn-in samples. The baseline estimate is -1251. For the HMC methods, ε = 0.005 and L = 20.


5 Conclusion

Sampling methods are often required when integrals are approximated using the Monte Carlo method, that is,

∫ f(z) p(z) dz ≈ (1/L) ∑_{i=1}^{L} f(z_i)

for z_i ∼ p. Such integrals are encountered in artificial intelligence when the EM algorithm is used and the EM objective (which is an integral of the required form) is difficult to derive analytically. In this thesis, sampling methods are studied for evaluating the integral arising when the EM algorithm is applied to learn representations over SO(3). The required computational framework is explained in depth in order to explain the purpose of the research and to show how sampling methods can be integrated in the existing framework. Furthermore, various sampling methods are discussed and new sampling methods are investigated. A new sampling method is proposed which samples from S^n by sampling from a different density on R^(n+1) and translating these samples to samples for S^n. The validity of this method is shown by deriving the conditions that need to be proven in a general context and by proving that the conditions hold for this particular case.
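This Monte Carlo approximation can be demonstrated on a toy integral with a known answer (a numpy sketch under my own naming; here p = N(0, 1) and f(z) = z², so the integral is the variance of p, which equals 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of ∫ f(z) p(z) dz = E_p[f(z)] with p = N(0, 1)
# and f(z) = z^2.  The exact value is Var(z) = 1.
L = 100_000
z = rng.normal(size=L)          # z_i ~ p
estimate = (z ** 2).mean()      # (1/L) * sum of f(z_i)
```

The standard error of such an estimate shrinks like 1/√L, which is why the thesis needs many samples for stable estimates of the EM objective.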

The conditions are derived in the general context where p is a pdf on some space M and q is a pdf on N, in a situation where it is preferred to sample from q but samples from p are required. Furthermore, a measurable (for example continuous) function f : N → M is required, which is used to transform the samples. It is shown that the condition X ∼ q =⇒ f(X) ∼ p (which basically states that “the transformation is valid”) amounts to

∫_B p dν = ∫_{f⁻¹(B)} q dλ   for all B ∈ F_M.

In most cases in practice, the integrals will be Riemann integrals and it is sufficient to check the condition for all (generalizations of) intervals or for all open sets.
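The condition can be checked empirically for a simple instance of the construction (a numpy sketch under my own choices, not the thesis case): take q to be the standard normal on R² and f(x) = x/‖x‖, so that f(X) should be uniform on S¹. The sets B are arcs of the circle, and each arc should receive mass proportional to its length.

```python
import numpy as np

rng = np.random.default_rng(0)

# q: standard normal on R^2, which is rotation invariant, so the
# normalized samples f(X) = X / ||X|| are uniform on the circle S^1.
n = 20_000
x = rng.normal(size=(n, 2))
u = x / np.linalg.norm(x, axis=1, keepdims=True)
angles = np.arctan2(u[:, 1], u[:, 0])

# Empirical version of the condition: each arc B of S^1 (here 8 equal
# bins of angles) should receive the same share of the samples.
counts, _ = np.histogram(angles, bins=8, range=(-np.pi, np.pi))
```

This is exactly the kind of check that is sufficient in practice: verifying the integral condition on a generating family of “intervals” (arcs) rather than on all measurable sets.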

For the research on learning representations over Lie groups, it would have been ideal to find a sampling method designed for sampling from exponential families defined on Lie groups (which means both a manifold and a group structure are given). An attempt has been made at using properties of manifolds to ease sampling, but a new method has not been found.

Several sampling methods have been implemented and evaluated based on their approximation of the EM objective in the LieLearn framework. Finding a good baseline is difficult for this particular problem, since the normalization constant of p needs to be approximated as well. For a large number of samples, estimates by Riemann sums and importance sampling are similar, hence the mean estimate of applying importance sampling with 100,000 samples multiple times is used as baseline. Importance sampling is very easy to implement, requires little execution time and gives good estimates if enough samples are used, but it does not actually produce samples from the required distribution (which may be necessary for other purposes) and its performance degrades as the dimension of the sampling domain increases. In the framework, Metropolis-Hastings was already integrated as a sampling method and is able to give good estimates when many samples are used.
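The self-normalized form of importance sampling described here, in which the unknown normalization constant of p cancels, can be sketched as follows (a numpy illustration under my own naming, not the LieLearn implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def importance_estimate(f, log_p_unnorm, n, lo=-5.0, hi=5.0):
    # Self-normalized importance sampling with a uniform proposal on
    # [lo, hi].  It estimates E_p[f] for an UNNORMALIZED target density
    # exp(log_p_unnorm): both the normalization constant of p and the
    # constant proposal density cancel in the ratio of weighted sums.
    z = rng.uniform(lo, hi, size=n)
    w = np.exp(log_p_unnorm(z))
    return (w * f(z)).sum() / w.sum()
```

With target log p̃(z) = -z²/2 (an unnormalized N(0, 1)) and f(z) = z², the estimate converges to 1. The weights do all the work of correcting for sampling from the “wrong” distribution, which is why many samples are needed but each is extremely cheap to generate.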

The elements of SO(3) are parametrized using Euler angles. A conversion function f is given in this thesis for converting quaternions (elements of S3) to Euler angles. It is shown experimentally that samples from q = p ∘ f can be transformed to samples from p using f. This is done by applying importance sampling to both distributions, using f to transform the samples from q to p, and by comparing the estimates made using both sets of samples. The procedure of transforming samples in R^(n+1) to samples for S^n is also shown to work experimentally for n = 3. A discretized Gibbs method is used to sample from the distribution on R4 and from the distribution on Euler angles, and both methods make similar estimates as well. In this case, the sampling domain R4 seems to be more difficult for this method. The gradients of p and f are implemented in order to be able to use HMC methods. Both Spherical HMC and HMC perform poorly on this particular problem, which is shown to be partly due to the steepness of the distribution for HMC, using plots of the trajectories showing that all samples are taken from the same peak. Spherical HMC, HMC and Gibbs are able to give reasonable estimates using a small number of samples, whereas Metropolis-Hastings requires some burn-in time and is faster. All these MCMC methods do, however, have a large variance in their estimates. Since importance sampling gives steady estimates when a large number of samples is used and the samples for this method can be generated extremely fast, importance sampling seems to be the best method for this problem. When a small number of high-quality samples is required, Gibbs sampling seems to work best.

The following might be interesting for further research.

• The validity of the conversion function from S3 to Euler angles has only been shown experimentally. In theory, each Euler angle is reached exactly twice by this function. This property might be helpful for proving the required conditions analytically.

• The possibility of making the sampling domain (or pdf) more agreeable by using transformation of samples might be useful for more applications in artificial intelligence or other fields of study requiring sampling methods.

• There is still much work left in investigating sampling methods which use particular properties of their domain, for example the group or manifold structure.

• When the dimension of the Lie group (and hence of the sampling domain) increases, the performance of the various sampling methods can differ greatly from the results found in this thesis. Therefore, it is interesting to evaluate the sampling methods on higher-dimensional problems as well.

• The estimates of the objective using importance sampling or Riemann estimates show a “monotonically converging” effect, which is due to the fact that the normalization constant needs to be estimated as well. Further research can look into the cause of this, whether early estimates can predict the value that will be converged to, and how to determine whether enough samples have been taken to guarantee that the estimate is within a certain error margin.


Bibliography

[1] Martin John Baker. Maths - conversion quaternion to matrix, http://www.euclideanspace.com/maths/geometry/rotations/conversions/quaternionToMatrix/index.htm.

[2] A. Beskos, F.J. Pinski, J.M. Sanz-Serna, and A.M. Stuart. Hybrid Monte Carlo on Hilbert spaces. Stochastic Processes and their Applications, 121(10):2201–2230, 2011.

[3] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[4] L. E. Blumenson. A derivation of n-dimensional spherical coordinates. The American Mathematical Monthly, 67(1):63–66, 1960.

[5] Simon Byrne and Mark Girolami. Geodesic Monte Carlo on embedded manifolds. 2013.

[6] Taco Cohen. Learning transformation groups and their invariants. Master's thesis, University of Amsterdam, 2013.

[7] Taco Cohen and Max Welling. Learning the irreducible representations of commutative Lie groups. JMLR: W&CP, 32, 2014.

[8] Pavel Etingof, Oleg Golberg, Sebastian Hensel, Tiankai Liu, Alex Schwendner, Dmitry Vaintrob, and Elena Yudovina. Introduction to representation theory. Student Mathematical Library. American Mathematical Society, 2010.

[9] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.

[10] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

[11] Claus Skaanning Jensen. Blocking Gibbs sampling for inference in large and complex Bayesian networks with applications in genetics. 1997.

[12] Shiwei Lan, Bo Zhou, and Babak Shahbaba. Spherical Hamiltonian Monte Carlo for constrained target distributions. 2013.

[13] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

[14] Radford M. Neal. MCMC using Hamiltonian dynamics. 2011.

[15] P. E. Nikravesh. Lesson 8-A: Euler angles, http://www.u.arizona.edu/~pen/ame553/Notes/Lesson%2008-A.pdf.

[16] Didier Pinchon and Philip E. Hoggan. Rotation matrices for real spherical harmonics: general rotations of atomic orbitals in space-fixed axes. Journal of Physics A: Mathematical and Theoretical, 40(50):1597, 2007.

[17] Jun Shao. Mathematical statistics. Springer Texts in Statistics. 2003.

[18] Eef van Beveren. Some notes on group theory. 2012.

[19] B. Walsh. Markov chain Monte Carlo and Gibbs sampling, http://web.mit.edu/~wingated/www/introductions/mcmc-gibbs-intro.pdf, 2004.
