Faculdade de Ciências e Tecnologia
Departamento de Engenharia Electrotécnica e de Computadores
Bayesian inference for artificial perception using
OpenCL on FPGAs and GPUs
Rodrigo de Oliveira Lourenço Lopes
Supervisor:
Prof. Doutor Jorge Nuno de Almeida e Sousa Almada Lobo
Júri:
Prof. Doutor Rui Paulo Pinto da Rocha
Prof. Doutor Jorge Nuno de Almeida e Sousa Almada Lobo
Prof. Doutor Gabriel Falcão Paiva Fernandes
Dissertação de Mestrado Integrado em Engenharia Electrotécnica e de
Computadores
Coimbra, 2019
Abstract
This dissertation addresses the implementation of Bayesian inference on FPGAs and GPUs, following a top-down approach and using OpenCL. The target application of these Bayesian inference algorithms is artificial perception in robotics. The aim is to improve the power efficiency of Bayesian inference computations. Previous work at our university, in the scope of a European project, followed a bottom-up approach and developed a toolchain that enabled custom circuits for Bayesian inference on reconfigurable logic. These had better power efficiency than desktop solutions, but required more design effort. In this work the goal is to use already available vendor tools, namely the OpenCL support from Intel (formerly Altera), to explore the design space in search of power-efficient solutions. To achieve this, the same benchmark problem used in previous works is applied, tested at various dimensions in order to study scaling challenges. The main metrics analysed are nominal power, energy consumed, latency, and result precision. As expected, the results show a substantial gain in power efficiency in relation to desktop solutions, and comparable performance with previous works developed in the context of the BAMBI project, but with gains in floating-point precision, integration, and usability.
Resumo
This dissertation project addresses the implementation of a Bayesian inference algorithm on FPGAs and GPUs, following a top-down approach and using OpenCL. This Bayesian inference algorithm is focused on artificial perception applications for robotics. The objective is to improve the energy efficiency of Bayesian inference computations. Previous work at our university, in the scope of a European project, followed a bottom-up approach and developed a toolchain capable of producing custom circuits for Bayesian inference on reconfigurable logic. These had better energy efficiency than typical desktop solutions, but required significantly more design effort. In this work, the idea is to use commercially available tools, namely the OpenCL support currently provided by Intel (formerly by Altera), to explore the design space in search of solutions with low energy cost. To do so, the same benchmark problem used in previous works is applied, tested at various dimensions in order to study scaling problems. The main metrics used in the analysis are nominal power, energy consumed, latency, and precision of results. As expected, the results show a gain in energy efficiency in relation to desktop solutions, and performance comparable to previous works developed in the scope of the BAMBI project, but with gains in precision, integration, and usability.
Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives and Key Contributions
  1.3 Related Work
  1.4 Overview
2 Background on Bayesian Inference and Computation Hardware
  2.1 Performing Bayesian Inference
    2.1.1 Bayesian Inference
    2.1.2 The BAMBI Project
    2.1.3 ProBT Software
  2.2 Hardware for Computation of the Bayesian Inference
    2.2.1 GPUs and FPGAs
    2.2.2 OpenCL
    2.2.3 FPGA Workflow
3 Our Implementation of Bayesian Inference on OpenCL
  3.1 Work Methods
  3.2 Benchmark Model
    3.2.1 Boat Localization Inference Problem
    3.2.2 Generating the Bayesian Model
  3.3 Information Flow
    3.3.1 Host Side Control
    3.3.2 Bayesian Inference Kernel
  3.4 Double vs Single Precision Floating-Point Numbers
  3.5 Evaluation Metrics
    3.5.1 Precision
    3.5.2 CPU Power Estimation
    3.5.3 GPU Power Estimation
    3.5.4 FPGA Power Estimation
  3.6 Framework and Hardware
4 Experimental Results
  4.1 Finding the Position of the Boat
  4.2 Scalability and Resource Usage
  4.3 Results Precision
  4.4 Execution Times
  4.5 Power Consumption
    4.5.1 GPU Energy Consumption
    4.5.2 FPGA Energy Consumption
    4.5.3 CPU Energy Consumption
  4.6 Computing with Double Precision vs Single Precision
  4.7 Comparison With Previous Works
5 Conclusion
  5.1 Results Discussion
  5.2 Hardware and Software Limitations
  5.3 Future Work
List of Figures

2.1 Schema of BAMBI organization [9].
2.2 Island-style global FPGA architecture. A unit tile consists of Configurable Logic Block (CLB), Connect Box (CB) and Switch Box (SB).
2.3 Zoom-out of a general circuit logic present on a GPU card.
2.4 FPGA Workflow.
3.1 Boat Localization Problem [15].
3.2 Likelihood of the boat location for the first landmark on a 64 by 64 model, where (a) is the distance sensor model when the distance is 32, and (b) is the angle sensor model when the angle is 0 degrees [10].
3.3 Likelihoods LUT Access.
3.4 Boat location problem formal specification [15].
3.5 Single and double bit arrangement.
4.1 Finding the Positions of the Boat through Bayesian Inference.
4.2 FPGA Resource utilization for 16x16 grid dimensions.
4.3 FPGA Resource utilization for 32x32, 64x64, and 128x128 grid dimensions.
4.4 FPGA and GPU Execution Times for one Bayesian Inference Iteration on Device.
4.5 FPGA and GPU Power Consumption for one Bayesian Inference Iteration on Device.
4.6 KL divergence results of the boat example for different grid sizes and increasing energy consumption of the BM1 (background), OpenCL on FPGA (blue dotted) and GPU (red dotted), and our improved version of the CPU algorithm (pink dotted).
List of Tables

3.1 Information captured on the GPU hardware during the Bayesian inference algorithm execution
4.1 Kullback-Leibler divergence for each grid size
4.2 Posterior Probability Matrix Normalization Time on CPU
4.3 Bayesian Inference on GPU
4.4 Bayesian Inference on a Single FPGA
4.5 Bayesian Inference on CPU Using the ProBT Version for One Iteration
4.6 Bayesian Inference on CPU Using the Optimized C++ Version
4.7 Average Energy Input on Devices
4.8 Power Consumption of One Iteration of Bayesian Inference on GPU
4.9 Power Consumption of One Iteration of Bayesian Inference on FPGA
4.10 One Iteration of Bayesian Inference on CPU Using the Optimized C++ Version
4.11 Average Transfer Times to GPU
4.12 Using Single
4.13 Ratio Between Single and Double Transfer Times
4.14 Average Execution Times on GPU
Acronyms
ASIC Application-Specific Integrated Circuit
API Application Programming Interface
AI Artificial Intelligence
BAMBI Bottom-up Approaches to Machines dedicated to Bayesian Inference
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
DRAM Dynamic Random Access Memory
FPGA Field Programmable Gate Array
GPU Graphics Processing Unit
LUT Look-Up Table
OpenCL Open Computing Language
RTL Register-Transfer Level
Chapter 1
Introduction
1.1 Motivation
One of the main challenges in the path to full robotic automation is perception.
Perception is the organization, identification, and interpretation of sensory information in
order to represent and understand the presented information, or the environment [1].
Perception can be split into two processes [2]: the processing of the sensory input, which transforms low-level information into higher-level information, and the processing connected with a person's concepts and expectations, that is, the restorative and selective mechanisms that influence perception.
In the context of living beings, biological intelligence [3] arises from a remarkable duality: drawing conclusions based on the perception of patterns in our inputs (senses), and, conversely, drawing conclusions based on highly structured and rational decisions. Both forms are different, but ultimately complement each other. Machine-based intelligence also comes in two forms: deep-learning-based Artificial Intelligence [4], which interprets patterns in data to arrive at conclusions, mimicking the perception-based intelligence of our brain, and standard instruction-by-instruction computing.
As is explicit in the title of this dissertation, we focus on artificial perception, which deals with the capability of a computer system to interpret data in a manner similar to the way humans use their senses to relate to the world around them [5].
When developing algorithms capable of deriving perceptual meaning from sensors, we sometimes stumble upon problems whose complexity increases exponentially as inputs are added.
A probabilistic algorithm is a method where the result and/or the way the result is obtained depends on chance. In some applications the use of probabilistic algorithms is natural, e.g. simulating the behaviour of some existing or planned system over time; in this case the result is stochastic by nature. In other cases the problem to be solved is deterministic but can be transformed into a stochastic one and solved by applying a probabilistic algorithm (e.g. numerical integration, optimization). For these numerical applications the result obtained is always approximate, but its expected precision improves as the time available to run the algorithm increases [6][7].
Another way to improve precision is to use Bayesian inference, which derives the posterior probability from two antecedents: a prior probability, and a "likelihood function" derived from a statistical model for the observed data [8].
However, with the increasing cardinality of Bayesian inference computations, even simple perception problems can easily overload traditional (sequential) CPU implementations.

In this context, adapting the hardware to the specific constraints of probabilistic computation, using parallelism via multicore architectures, has been shown to achieve gains of several orders of magnitude.
Such probabilistic computations can be highly independent across the individual candidate solutions, making them excellent candidates for cluster computation because they require little communication and synchronization. It is possible to specify a common parallel control structure as a generic algorithm for probabilistic cluster computations. Such a generic parallel algorithm can be "glued" together with domain-specific sequential algorithms in order to derive approximate parallel solutions for different intractable problems.
The BAMBI EU FET project developed new computing machines largely inspired by biology, with a main focus on machines that perform probabilistic inference efficiently. In a long-term view, this might give rise to completely new hardware based on these principles [9].

The numerous simulations performed during the BAMBI project showed that these new highly parallel stochastic architectures could solve several inference problems very efficiently, including sensor fusion, Bayesian filters, parameter learning, and the approximation of highly intractable inference problems [9].
In this context, previous works in the field used reconfigurable logic (FPGAs) to emulate Bayesian gates and investigate the parallelization of probabilistic calculus on assemblies of gates. This dissertation takes another direction and performs exact Bayesian inference, instead of approximate inference, using a top-down approach with heterogeneous programming, namely OpenCL on FPGAs and GPUs, with the intention of comparing results. The main goal is to investigate the trade-off between energy consumption and precision in relation to previous works in the field. Having a working model of artificial perception hardware using Bayesian inference enables its use in robotics applications, further advancing towards autonomous robots able to operate in uncertain and ambiguous real-world situations.
1.2 Objectives and Key Contributions
Objectives
This dissertation explores the solution space in search of power-efficient methods for the application of Bayesian inference oriented towards artificial perception in robotics.

In previous works, a toolchain was used that enabled custom circuits for approximate Bayesian inference [10]. This engine (known as BM1) presented better power efficiency than desktop solutions, but required more design effort.

In this work, an OpenCL implementation of exact Bayesian inference is used, running on an FPGA and a GPU. An analysis and comparison is conducted between this method and previous works, together with a discussion of the best applications for each implementation.

A comparison between single- and double-precision floating-point implementations was also pursued, in order to further explore the solution space for Bayesian inference.
Key Contributions
The exact Bayesian inference implementation in OpenCL on the GPU presents significantly lower latency, total energy consumption, and development time, but requires considerably more nominal power than the FPGA approach. The OpenCL implementation (both GPU and FPGA) has much higher result precision and lower design time than the BM1 machine, but with higher energy costs and nominal power, as was initially predicted.

In the comparison between single- and double-precision floating-point implementations, both presented similar Bayesian inference computation times on both hardware platforms, but the single-precision version presented lower transfer times between host and device and occupied less device memory. The loss in precision from double to single precision in the benchmark problem was nearly negligible.
1.3 Related Work
During the development of this dissertation, some works were used either as a theoretical base or for orientation, due to overlapping topics, namely:

Ferreira et al. demonstrate a tractable implementation of exact Bayesian inference with a real-time OpenCL vision-based model to estimate gaze direction in human-robot interaction (HRI) using GPUs [11]. On the other hand, in a bottom-up approach [12], unconventional hardware architectures are explored to perform the required inference. This was the course of action of the BAMBI project: a generic toolchain was developed to generate unconventional hardware for probabilistic inference from stochastic building blocks, allowing bit-level parallelism [13]. During this project, a boat localization problem was used as a case study [14]; it is used in this dissertation as a performance benchmark. The results obtained in that project are used as a comparison to the ones in this dissertation, accompanied by a trade-off analysis. Later, Mira continued the work of the BAMBI project, integrating ROS and adding a PCIe interface to extract the full capabilities of the FPGA mini-cluster [15]; this same setup was used in this dissertation.
1.4 Overview
• In chapter 1 we describe the work being proposed, as well as the motivations behind it. We give context on the fields related to the project and review the work already conducted on the subject.
• In chapter 2 we briefly present the theoretical foundations of this dissertation.
• In chapter 3 we discuss the work methods used to approach the problems faced. This chapter also addresses the framework and tools used during development.
• In chapter 4 we present the results obtained, accompanied by a critical analysis discussing their relevance.
• In chapter 5 a brief conclusion is presented, with final thoughts about the thesis and some suggestions for future work.
Chapter 2
Background on Bayesian Inference
and Computation Hardware
2.1 Performing Bayesian Inference
2.1.1 Bayesian Inference
Bayesian inference is a method of statistical inference in which Bayes’ theorem is
used to update the probability for a hypothesis as more evidence or information becomes
available. Bayesian inference is an important technique in statistics, and especially in
mathematical statistics. It has found application in a wide range of activities, including
science, engineering, philosophy, medicine, sport, and law. In the philosophy of deci-
sion theory, Bayesian inference is closely related to subjective probability, often called
”Bayesian probability” [16].
In statistical inference, there are two broad categories of interpretations of probability: Bayesian inference and frequentist inference. These views often differ from each other on the fundamental nature of probability. Simply put, frequentist inference defines probability as the limit of an event's relative frequency in a large number of trials, and only in the context of experiments that are random and well-defined. Bayesian inference, on the other hand, can assign probabilities to any statement, even when a random process is not involved. In Bayesian inference, probability is a way to represent an individual's degree of belief in a statement given the evidence.
Formally put, Bayesian inference derives the posterior probability from two antecedents: a prior probability, and a "likelihood function" derived from a statistical model for the observed data. Bayesian inference computes the posterior probability according to Bayes' theorem:

$$P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)} \tag{2.1}$$
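For a quick numerical illustration (the numbers here are invented purely for exposition): if a grid cell has prior probability $P(H) = 0.01$ of containing the boat, a sensor reading has likelihood $P(E \mid H) = 0.9$ when the boat is in that cell, and the overall evidence is $P(E) = 0.05$, then

$$P(H \mid E) = \frac{0.9 \times 0.01}{0.05} = 0.18,$$

so a single observation raises that cell's probability eighteen-fold.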
However, in some applications Bayesian inference has been limited by the computational requirements of real-time implementations. These computational limitations sometimes come from the large amounts of data being handled, which can lead to tractability problems.

Put simply, a tractable problem is one whose solving time increases roughly linearly with a linear increase in inputs.

The intractability issues come from the fact that Bayesian programs generally include a large number of inter-dependent random variables; even when these have been discretized, the complexity of the Bayesian models results in an exponential increase in solving time with a linear increase in the number of variables.
There are two types of inference techniques: exact inference and approximate inference. Exact inference algorithms calculate the exact value of the probability P(X | Y). Algorithms in this class include the elimination algorithm, the message-passing algorithm, and the junction tree algorithms. Exact Bayesian inference is NP-hard, where NP stands for "non-deterministic polynomial time"; informally, this means that the problem is not known to be solvable in time polynomial in the size of the inputs. To tackle situations where exact inference problems have unreasonable solving times, approximate inference techniques are used, as they require less computation time at the cost of reduced accuracy. Approximate inference algorithms include stochastic simulation and sampling methods, Markov chain Monte Carlo methods, and variational algorithms.
2.1.2 The BAMBI Project
The BAMBI project [9] had the ambition to lay the foundations of new computing machines, largely inspired by biology and specially designed to efficiently perform probabilistic inference. To that end, the main effort focused on three research axes: to develop a new theory of probabilistic computation, to investigate how simple living organisms process information and perform basic inferences, and to build completely new hardware based on these principles.
In the first axis, a Bayesian algebra was developed, and Bayesian gates were proposed, which can be seen as extensions of Boolean algebra and of the logical gates that are the essential components of current computers. Then, new architectures capable of solving probabilistic inference problems of various complexities were detailed and simulated. A key feature of these architectures is that they rely on stochastic computation and on components that behave randomly. The numerous simulations performed during the project showed that these new architectures could solve several inference problems very efficiently, including sensor fusion, Bayesian filters, parameter learning, and the approximation of highly intractable inference problems.
In the second axis, the intention was to establish the link between general principles governing biological signal processing at various space-time scales and the basic principles of probabilistic computation. This enabled the successful development of a probabilistic inference model for biochemical kinetics, cast in the form of a massively parallel nonlinear finite state machine.
In the third axis, two tracks were followed. The first track is short-term: it is based on the use of existing fully integrated circuits, namely reconfigurable arrays of logic elements (FPGAs), which is the area we delve into the most in the development of this dissertation. The long-term track is based on existing components, the superparamagnetic tunnel junctions (SMTJs). The short-term track was very important, since it allowed real circuits implementing the new architectures simulated in the first axis to be produced. The long-term track is very promising since, as previously shown, its energy efficiency would be several orders of magnitude better than one could expect from existing hardware.
Figure 2.1 represents the three main axes (theory of probabilistic computation, probabilistic inference in biology, and hardware implementation), which are organized according to the bottom-up approach.

Figure (2.1) Schema of BAMBI organization [9].
2.1.3 ProBT Software
ProBT is a C++ library for developing efficient Bayesian software. The library consists of two main components: a friendly Application Programming Interface (API) for building Bayesian models, and a high-performance Bayesian inference and learning engine allowing execution of the probability calculus in exact or approximate ways. ProBT aims to provide a programming tool that facilitates the creation of Bayesian models and their reusability. Its main idea is to use "probability expressions" as basic bricks to build more complex probabilistic models. The numerical evaluation of these expressions is accomplished just-in-time: computation is done when numerical representations of the corresponding target distributions are required. This property allows the design of advanced features such as submodel reuse and distributed inference. Therefore, constructing symbolic representations of expressions is a central issue in ProBT [17].
In this dissertation, the ProBT software was used to build the sensor-fusion inference model of our boat localization problem, from which the inference likelihoods table is created, along with one iteration of the model's exact inference output vector, which is used as a precision benchmark against the inference employed in our work. This subject is discussed further in the methodology chapter.
2.2 Hardware for Computation of the Bayesian Inference
2.2.1 GPUs and FPGAs
Whether you’re talking about real-time stock trading, online searches, artificial per-
ception, faster results are among one of the top priorities. In this search for computational
speed, a debate on the best accelerators has emerged. A great portion of times this debate
has focused on FPGA’s vs GPU’s as these two are probably the most common hardware
solutions more widely used for task acceleration.
The FPGA stands for Field Programmable Gate Array and it is a type of device
that is widely used in electronic circuits. As represented in the figure 2.2 FPGAs are semi-
conductor devices that contain programmable logic blocks and interconnection circuits.
It can be programmed or reprogrammed to the required functionality after manufactur-
ing, hence the term ”field-programmable”. The FPGA configuration is generally specified
using a hardware description language, similar to that used for an application-specific in-
tegrated circuit (ASIC), which is a very design-intensive process.
Figure (2.2) Island-style global FPGA architecture. A unit tile consists of Configurable
Logic Block (CLB), Connect Box (CB) and Switch Box(SB).
A GPU is a heterogeneous chip multi-processor that is, in most cases, highly tuned for graphics. Although presenting significantly lower clock frequencies than CPUs, their highly parallelized structure makes them more efficient than general-purpose CPUs for algorithms that can be unfolded into several parallel threads across the various cores or compute units (CUs), as shown in figure 2.3. GPUs offer higher floating-point throughput, a data-parallelism-oriented architecture, and higher memory bandwidth than processors, characteristics that make them prime candidates for use as accelerators in heterogeneous programming.
Figure (2.3) Zoom-out of a general circuit logic present on a GPU card.
FPGA and GPU manufacturers continuously compare their products against CPUs, sometimes making it sound like they can take the place of CPUs. But it is important to note that neither FPGAs nor GPUs function on their own without a server, and neither can replace the server's CPU, with the exception of FPGAs with a system-on-chip, which have their own CPU and operating system, allowing them to function as a "server + accelerator" alone.

Depending on how fast you want or need your programs to run, the acceleration comes at a price. After the cost of acquisition, the final price also includes the amount of heat generated, the power required, and the large increase in the development time of applications to take full advantage of the new hardware [18].
The FPGA offers some clear advantages when it comes to heterogeneous computing: the ability to tweak and tune the underlying hardware architecture, and the use of software-defined processing, which allows FPGA-based platforms to deploy highly problem-specific solutions.

Also, the FPGA's low latency offers unique advantages for mission-critical applications, such as autonomous vehicles and manufacturing operations. This comes from the fact that the data flow in these applications may be in streaming form, requiring pipeline-oriented processing.
Power efficiency is another factor to take into great consideration in cases such as long-term use or energy-critical systems (e.g. battery-powered devices). FPGAs can have low power requirements and high performance per watt, since the logic is tailored to specific applications and workloads, and is thus extremely efficient in executing the application. For example, in an FPGA the control structure is hardwired, and there is no need to fetch and decode instructions.

Modern high-performance GPUs have very high-bandwidth DRAM interfaces, which typically outperform those of FPGAs implemented using comparable technologies. However, external DRAM access consumes a significant amount of power. This means that the data organization of FPGA-bound implementations must be carefully optimized to use on-chip resources, even for kernel-to-kernel communication.
2.2.2 OpenCL
With the increase in the need for computational power, the usage of accelerators has exploded in recent years. A typical weakness of these systems has been the much larger effort required to program them compared with conventional CPU-based routines, as they have separate memories that require additional buffers and memory transfers, different ways to launch code, the need to specify details that do not exist on CPUs, and an overall higher level of overhead. Another problem with most tools available for heterogeneous programming is that they are tied to a specific family of devices (like CUDA in the case of Nvidia GPUs), which severely restricts the portability of applications between systems, resulting in the rapid obsolescence of the code built on them and in higher design times for cross-platform programs.
OpenCL is an answer to this latter problem, as it is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units and accelerators such as GPUs, FPGAs, DSPs, and other processors. OpenCL is a parallel programming language developed by the Khronos Group, built upon C/C++, that provides a standard interface for parallel computing using task- and data-based parallelism [19]. OpenCL's portability gives it an edge over other accelerating frameworks such as CUDA, which can only be used to create programs for NVIDIA GPUs.
Despite all these benefits, one issue is that OpenCL does not offer automatic optimal performance portability across multiple devices. This means that even though a specific OpenCL code can be executed on multiple devices from different vendors, the framework does not ensure that the code will be compiled to the absolute best specifications of each hardware, and so different performance is to be expected. Some tweaking of an OpenCL code may be required when transitioning between devices to take full advantage of each architecture.
During the development of this work, we noticed that the biggest difference between optimizing code for a GPU and for an FPGA lies in the different target architectures. In the case of a GPU, the programmer tries to achieve the best mapping of a problem onto the fixed constraints of the hardware architecture. For an FPGA, on the other hand, the task is to guide the compiler (using pragmas, for example) to generate optimized memory and computing architectures for each kernel in the application.

Even though the optimization of code for an FPGA tends to require more effort, since the architecture must be adapted to the code and vice-versa, we will argue in this thesis that the advantages gained in reduced energy consumption justify the additional work.
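As a minimal sketch of this kind of compiler guidance (the kernel below is illustrative and not the exact kernel used in this work), a loop with a compile-time trip count can be exposed to the Intel FPGA offline compiler with #pragma unroll, which requests parallel hardware lanes on the FPGA while remaining a harmless hint on a GPU:

    #define NSENSORS 6

    __kernel void fuse_likelihoods(__global const float *restrict lut,
                                   __global float *restrict posterior,
                                   const int lutStride)
    {
        int cell = get_global_id(0);          // one work-item per grid cell
        float acc = posterior[cell];          // prior for this cell

        #pragma unroll                        // FPGA: NSENSORS parallel multipliers
        for (int s = 0; s < NSENSORS; s++)
            acc *= lut[cell * lutStride + s]; // fuse one likelihood per sensor

        posterior[cell] = acc;                // updated (unnormalized) posterior
    }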
2.2.3 FPGA Workflow
Figure 2.4 shows the general FPGA workflow. Usually, the first step is to analyse the intended algorithm and divide it into the processes that can be parallelized; those that are substantial enough to be worth transferring to an accelerator should be written as kernels in the OpenCL language, while the others should be kept on the host side.
Figure (2.4) FPGA Workflow
After this process is completed, we take the OpenCL code, verify its functional correctness, and build a software-based emulation. This emulation takes significantly less time to build than the synthesis of the code to be deployed on the FPGA, helping us ensure that the outputs produced by the simulation are correct before moving forward in the development. It is also common at this stage to add test benches to the host code, which provide the inputs to, and check the outputs from, the algorithm.
Once the functionality criteria are met, we can make a rough estimate of the performance of each individual kernel, the compute units, resource usage, etc., and thus, by also taking into account the target hardware specifications, get an early idea of the final performance gains or losses.
After this, we generate the hardware design of the OpenCL kernels. Numerous compute units can be instantiated and driven by the host code in order to boost performance. Each workgroup can execute its tasks in a pipelined fashion, by unrolling them, or both. In most cases pipelining provides the best cost/performance trade-off, but sometimes either the loops over the work-items or some of the loops nested within them can be further unrolled to improve performance. We then proceed to hardware emulation (also known as RTL simulation), which is used to certify that the hardware behaves correctly, to verify whether there is any resource limitation, and to obtain a more precise notion of the overall performance. The physical design of each compute unit needs to be generated for this hardware emulation, so this process is several times slower than the software emulation.

The last step consists of deploying the hardware design on the FPGA, running it, and verifying the real metrics, such as the time needed to run the code, the resources used, and the fidelity of the results.
Chapter 3
Our Implementation of Bayesian
Inference on OpenCL
3.1 Work Methods
One of this thesis's objectives is the comparison of the OpenCL Bayesian inference implementation of a well-known boat localization problem with a conventional CPU approach and with a heterogeneous computing platform running the Bayesian framework previously developed by BAMBI.

The OpenCL SDK libraries from Altera (now Intel FPGA) allow us to work on the program at a higher level of abstraction than if we were to use low-level hardware description languages such as VHDL. This factor reduces design times and costs, but the probable outcome is some loss in performance due to the general-purpose nature of the language.

To enable the benchmarking of the tests performed, from the beginning a boat localization problem was used as a performance measure (the same used in previous works, namely in the BAMBI EU FET project).
3.2 Benchmark Model
3.2.1 Boat Localization Inference Problem
One of this thesis's objectives is to compare the performance of the OpenCL Bayesian inference with a conventional CPU approach and with the Bayesian Machines previously developed during the BAMBI project. This test is not representative of all Bayesian inference problems, but it was the only one used in the BAMBI project for which an energy performance test was conducted.
Figure (3.1) Boat Localization Problem [15]
To benchmark the tests performed, a boat localization problem with variable complexity is used. This problem consists of determining the position of a boat in a discrete square grid of variable size. The user provides the boat coordinates (x, y) as input to a program working concurrently with the OpenCL host program; this data is pre-processed by a routine and mathematically converted into sensor values. These new data are fed to the inference engine for processing. As shown in figure 3.1, we consider the existence of three landmarks, from which the boat is able to detect its distance (D1, D2, D3) and bearing (B1, B2, B3).
(a) Distance sensor. (b) Bearing sensor.
Figure (3.2) Likelihood of the boat location for the first landmark on a 64 by 64 model, where (a) is the distance sensor model when the distance is 32, and (b) is the angle sensor model when the angle is 0 degrees [10].

Figure 3.2 illustrates the likelihood of the boat's position (in the middle of the grid) as captured by a distance and a bearing sensor, respectively. The final likelihood of the boat's position corresponds to the product of the likelihood maps of all sensors. The boat problem is scaled to several dimensions to test multiple inference engine sizes.
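As a rough sketch of the pre-processing step that converts the user-supplied coordinates into noisy discrete sensor readings (the function and variable names below are hypothetical, and the noise model follows the Gaussian forms of the specification in figure 3.4):

    #include <cmath>
    #include <random>

    // Hypothetical helper: given the boat position and one landmark, produce
    // noisy distance and bearing readings. The standard deviation grows with
    // the measured quantity (sigma = 5 + value/10, as in the model spec).
    struct Reading { int distance; int bearing; };

    Reading senseLandmark(double bx, double by, double lx, double ly,
                          std::mt19937 &rng)
    {
        const double PI = 3.14159265358979323846;
        double d = std::hypot(lx - bx, ly - by);              // true distance
        double b = std::atan2(ly - by, lx - bx) * 180.0 / PI; // true bearing (degrees)

        std::normal_distribution<double> dNoise(d, 5.0 + d / 10.0);
        std::normal_distribution<double> bNoise(b, 5.0 + b / 10.0);

        // Discretize to the integer sensor values used as LUT keys.
        return { static_cast<int>(std::lround(dNoise(rng))),
                 static_cast<int>(std::lround(bNoise(rng))) };
    }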
3.2.2 Generating the Bayesian Model
Before we can run the Bayesian inference algorithm we first have to generate the
fusion likelihoods.
To compute this file, the ProBT library is used in a program that computes the posterior probability distribution over the searched variable S (from the sensors), using the knowledge of the prior distribution P(S) and the set of conditional distributions P(K_i | S), where Z is a normalization constant:

$$P([S = s_j] \mid K_1, \ldots, K_N) = \frac{1}{Z}\, P([S = s_j]) \prod_{i=1}^{N} P(K_i \mid [S = s_j]) \tag{3.1}$$
After this operation, the computation tree is extracted and a set of probability distribution look-up tables (LUTs) is generated. The LUTs, along with other control variables, are preloaded onto the device; next, the inference is performed and the results are stored on the host machine.

As illustrated in figure 3.3, the fusion likelihoods file holds an entry for every possible value of each of the six sensors (3 distance sensors, 3 bearing sensors). This file holds the likelihood associated with each combination of cell, sensor, and (discrete) sensor value. The file itself is structured so it can be easily accessed as a look-up table (LUT), with cell number, sensor number, and sensor value functioning as the keys.

Figure (3.3) Likelihoods LUT Access

The fusion priors, in turn, represent the current likelihood of the presence of the boat in each of the cells of the grid. These two files are loaded into the device's memory during the execution of the Bayesian inference engine, with the fusion priors being updated as the outcome of the inference algorithm.
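A minimal sketch of the LUT addressing scheme described above (the layout matches the index arithmetic of the kernel in section 3.3.2, but the function and variable names are ours):

    // Flattened 3-D LUT lookup: likelihood of 'value' for sensor 'sensor'
    // when the boat is in cell 'cell'. 'span' is maxValue + 1, the number
    // of discrete values the sensor can take.
    inline float lutAt(const float *lut, int cell, int sensor, int value,
                       int numSensors, int span)
    {
        // Row-major layout: cell -> sensor -> value.
        int index = cell * numSensors * span + sensor * span + value;
        return lut[index];
    }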
To generate both these files, we use a Python program built on the ProBT libraries, in which the user introduces the size of the model as the number of bits of the lateral dimensions (8x8, 16x16, 32x32, 64x64, etc.). We also generate the exact inference distribution for given x and y coordinates, which is later used for benchmark precision comparisons with the approximate inferences tested on the other hardware.

From a high level of abstraction, this code performs the following steps:
• From the values introduced by the user, define the dimensions of the grid in which the model will be created.
• Define the probabilistic variables associated with location.
• Define the probabilistic variables associated with range and bearing.
• Define the function necessary to compute the mean and the standard deviation of the distance to each beacon, knowing the position X and Y.
• Define the function necessary to compute the mean and the standard deviation of the bearing to each beacon, knowing the position X and Y.
• Build the specification, and define the distributions related to the distances and bearings of the model.
• Generate the exact inference distribution.
The inference problem corresponds to computing the probability distribution over the X and Y coordinates, knowing the outputs of the six sensors. Therefore, consider:
• $M_m$ = position cell of index $m$
• $E_k$ = value of sensor $k$
• $P(M_m)$ = set of prior probabilities for each cell of index $m$
• $\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}} \iff P(\text{model} \mid \text{data}) = \frac{P(\text{data} \mid \text{model}) \times P(\text{model})}{P(\text{data})}$

From Bayes' theorem, we gather that $P(M \mid E) = \frac{P(E \mid M) P(M)}{P(E)}$. As the partition $M_m$ is finite, we can use the law of total probability to define $P(E) = \sum_m P(E \mid M_m) P(M_m)$.

From the above, we can define our new posterior ratio as:

$$P(M \mid E) = \frac{P(E \mid M) \times P(M)}{\sum_m P(E \mid M_m) \times P(M_m)} \tag{3.2}$$

Considering that $E$ is in fact a set of $k$ multiple independent sensors $E_k$, we have:

$$y_m = \frac{x_m \times a_m}{z} \iff P(M \mid E) = \frac{\prod_k P(E_k \mid M) \times P(M)}{\sum_m \prod_k P(E_k \mid M_m) \times P(M_m)}$$

with:
• $y_m$: posterior ratio.
• $a_m$: prior ratio.
• $x_m$: likelihood.
• $z$: normalization factor, a scalar independent of both $m$ and $k$.
We can formalize this problem as a Bayesian program in the ProBT language. The ProBT API is then used to generate a simplified computation tree. In addition, the probability distribution across all variables is calculated offline and stored in look-up tables (LUTs) as floating-point variables.
The Boat Locator Bayesian program is specified as follows (reconstructed from the formal specification of figure 3.4):

Description / Specifications:
  Input variables: $D_1, D_2, D_3, B_1, B_2, B_3$
  Decomposition:
    $P(D_1 D_2 D_3 B_1 B_2 B_3 \mid XY) = P(D_1 \mid XY)\,P(D_2 \mid XY)\,P(D_3 \mid XY)\,P(B_1 \mid XY)\,P(B_2 \mid XY)\,P(B_3 \mid XY)$
  Parametric forms:
    $P(D_i \mid XY) = \mathcal{N}\!\left(d_i, \left(5 + d_i/10\right)^2\right)$ for $i = 1, 2, 3$
    $P(B_i \mid XY) = \mathcal{N}\!\left(b_i, \left(5 + b_i/10\right)^2\right)$ for $i = 1, 2, 3$
Identification:
  All tables provided by the user.
Question:
  $P(X \wedge Y \mid [D_1 = d_1] \wedge [D_2 = d_2] \wedge [D_3 = d_3] \wedge [B_1 = b_1] \wedge [B_2 = b_2] \wedge [B_3 = b_3])$

Figure (3.4) Boat location problem formal specification [15].
The schema in figure 3.4 shows formally how the problem's model is specified, namely the variables used. It shows that the sensor values depend on the boat's coordinates, to which a random value drawn from a Gaussian distribution is added (to simulate the sensors' error margins); that the likelihood generated by each sensor follows a certain normal distribution (also visible in figure 3.2); and, finally, that the probability distribution of the boat's position is obtained as the product of all the sensors' likelihood maps.
3.3 Information Flow
3.3.1 Host Side Control
After these files are generated, we use two C++ functions. The first converts the user inputs controlling the boat position on the grid into sensor values of the model, and sends these values via a socket to a program running in parallel.

The second function is the OpenCL host program, which effectively manages the Bayesian inference computation. This program detects and initializes the available devices (FPGAs or GPUs), creates the context and queues for each device, and creates the buffers and loads them with the look-up tables and sensor data to be passed to the computing kernels. In addition, it partitions the inference problem and distributes it through the queues of each device. It is also required to pass the necessary control variables to each kernel, such as the number of iterations and the dimensions of the problem. When the number of iterations is completed, the device or devices return the computed posterior probability distribution, which is stored both as a heatmap and as a file, to be compared with the previously generated exact inference distribution file.
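A heavily abridged sketch of this host-side sequence is shown below (error handling is omitted, and all names, sizes, and the inline kernel are illustrative placeholders, not the actual host code of this work):

    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    int main()
    {
        // Pick the first available device; set up a context and a queue.
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, nullptr);
        cl_device_id device;
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);
        cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, nullptr);

        const int nCells = 64 * 64;              // e.g. a 64x64 grid
        std::vector<float> lut(nCells, 0.5f);    // placeholder likelihoods
        std::vector<float> posterior(nCells, 1.0f / nCells);
        std::vector<int> sensors(6, 0);          // six discrete sensor values

        // The likelihoods LUT and the prior are transferred once; only the
        // sensor values are refreshed on each turn.
        cl_mem dLut = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     lut.size() * sizeof(float), lut.data(), nullptr);
        cl_mem dPost = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                      posterior.size() * sizeof(float), posterior.data(), nullptr);
        cl_mem dSens = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                                      sensors.size() * sizeof(int), nullptr, nullptr);

        // On an FPGA the program would come from a precompiled binary
        // (clCreateProgramWithBinary); on a GPU it can be built from source.
        const char *src =
            "__kernel void update_posterior(__global const float *lut,"
            "                               __global float *post,"
            "                               __global const int *sens) {"
            "  int i = get_global_id(0);"
            "  post[i] *= lut[i];"   // stand-in for the real fusion
            "}";
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
        clBuildProgram(prog, 1, &device, "", nullptr, nullptr);
        cl_kernel k = clCreateKernel(prog, "update_posterior", nullptr);

        // One "turn": upload fresh sensor readings, launch one work-item
        // per cell, and read the updated posterior back to the host.
        clEnqueueWriteBuffer(q, dSens, CL_TRUE, 0, sensors.size() * sizeof(int),
                             sensors.data(), 0, nullptr, nullptr);
        clSetKernelArg(k, 0, sizeof(cl_mem), &dLut);
        clSetKernelArg(k, 1, sizeof(cl_mem), &dPost);
        clSetKernelArg(k, 2, sizeof(cl_mem), &dSens);
        size_t global = nCells;
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
        clEnqueueReadBuffer(q, dPost, CL_TRUE, 0, posterior.size() * sizeof(float),
                            posterior.data(), 0, nullptr, nullptr);
        std::printf("P(cell 0) = %g\n", posterior[0]);
        return 0;
    }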
All of this is done in real time, each time new values are sent from the interface controlled by the user. The host program loads the likelihoods table onto the device only the first time; on the following turns only the new sensor values are sent, since the table is already on the device. This saves transfer time from host to devices, as the likelihoods table is the biggest file by several orders of magnitude.
Throughout this process we also collect all the metrics relevant to the performance of such methods, such as time, total energy consumption, average energy consumption per unit of time, peak energy consumption, and area of resources used (FPGA). All these methods are tested on FPGAs and GPUs, for different model dimensions and different numbers of iterations, also comparing the performance of single- vs double-precision floating-point programs.
3.3.2 Bayesian Inference Kernel
Given that, in the inference algorithm for this specific case, the end product of the calculus is a likelihood attributed to each cell of the problem's grid, the parallelization was performed by assigning a cell to each work-item, since the data of each cell is independent and can be processed in isolation, without synchronization. So, when possible, the number of kernel instances should be the same as, or as close as possible to, the number of cells in the grid.

The kernel runs a simple function for every sensor value. This function, based on the inference model previously created, receives as inputs the pre-computed likelihoods, the dimensions of the grid, and the sensor values, and outputs the index position in the likelihood table loaded on the device. The product of the values returned by the likelihood table is then multiplied by the posterior probability in the kernel's slot, resulting in the updated slot probability.
Algorithm 1 Bayesian Inference Kernel
Input: N, likelihoodsVector[], distanceLikelihoodsTable[], bearingLikelihoodsTable[], distanceSensorValues[], bearingSensorValues[], maxDist, maxBear, nClocks;
Output: likelihoodsVector[];
1: y ← 0
2: i ← globalID()
3: count ← 0
4: while count < nClocks do
5:   currentLikelihood ← 1
6:   for s ← 0; s < 3 By 1 do
7:     index ← (i × 3 × (maxDist + 1)) + (s × (maxDist + 1)) + distanceSensorValues[s]
8:     currentLikelihood ← currentLikelihood × distanceLikelihoodsTable[index]
9:   for s ← 3; s < 6 By 1 do
10:    index ← (i × 3 × (maxBear + 1)) + ((s − 3) × (maxBear + 1)) + bearingSensorValues[s − 3]
11:    currentLikelihood ← currentLikelihood × bearingLikelihoodsTable[index]
12:  if count = 0 then
13:    y ← likelihoodsVector[i] × currentLikelihood
14:  else
15:    y ← y + (y × currentLikelihood)
16:  count ← count + 1
17: likelihoodsVector[i] ← y
18: return likelihoodsVector[]
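For concreteness, the following is our illustrative rendering of Algorithm 1 in OpenCL C (not the verbatim kernel shipped with the project):

    __kernel void bayesianInference(__global float *restrict likelihoodsVector,
                                    __global const float *restrict distanceLikelihoodsTable,
                                    __global const float *restrict bearingLikelihoodsTable,
                                    __global const int *restrict distanceSensorValues,
                                    __global const int *restrict bearingSensorValues,
                                    const int maxDist,
                                    const int maxBear,
                                    const int nClocks)
    {
        int i = get_global_id(0);                  // one work-item per grid cell
        float y = 0.0f;

        for (int count = 0; count < nClocks; count++) {
            float currentLikelihood = 1.0f;

            for (int s = 0; s < 3; s++) {          // three distance sensors
                int idx = i * 3 * (maxDist + 1) + s * (maxDist + 1)
                          + distanceSensorValues[s];
                currentLikelihood *= distanceLikelihoodsTable[idx];
            }
            for (int s = 0; s < 3; s++) {          // three bearing sensors
                int idx = i * 3 * (maxBear + 1) + s * (maxBear + 1)
                          + bearingSensorValues[s];
                currentLikelihood *= bearingLikelihoodsTable[idx];
            }

            y = (count == 0) ? likelihoodsVector[i] * currentLikelihood
                             : y + y * currentLikelihood;
        }

        likelihoodsVector[i] = y;   // unnormalized posterior for this cell
    }

Because each work-item owns exactly one cell, no synchronization between work-items is needed; the final normalization of the posterior vector is performed on the host.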
During the development of the kernels, we noticed that the seamless migration of OpenCL kernels from GPUs to FPGAs, and vice-versa, was not ideal, due to the hardware differences and the significant inefficiencies it would create. So, although the inference kernels used in the two situations perform the same tasks, they were created with slight modifications.

Another point to highlight is that the kernels were created with no vendor-specific optimizations, to enable migration of the kernels to other machines; however, in the FPGA kernels, pragmas were used. Pragmas in OpenCL are pre-processed directives that the compiler only follows if the target machine can execute them, which means these code specifications cause no hurdles when running on other machines for which the code was not specifically designed.
3.4 Double vs Single Precision Floating-Point Numbers
In this work, we also studied and implemented the use of double- vs single-precision (float in C/C++) variables in the Bayesian inference programs, to discover the most appropriate option in each scenario.

As figure 3.5 shows, the single-precision floating-point format uses 32 bits (4 bytes), of which 23 are used for the mantissa, and the remaining 9 are used for the number's sign (1 bit) and for the exponent (8 bits).

In the double-precision floating-point format, as the name suggests, double the number of bits is used (64 bits), where 52 are used for the mantissa, 11 for the exponent, and 1 for the number's sign.
Figure (3.5) Single and double bit arrangement.
Obviously, between the two approaches there is a precision trade-off. To find the number of decimal digits of precision we use the function $\log_{10}(2^N)$, with N the number of bits that codify the mantissa (including the implicit leading bit, N = 24 for single and N = 53 for double precision). Given this, we can determine that a single has approximately 7.22 decimal digits of precision, and a double 15.95. This is without even counting the exponent, which also gives a great edge to the double-precision variable: the exponent lets a double span roughly $10^{-308}$ to $10^{308}$, and a single roughly $10^{-38}$ to $10^{38}$. In both cases the precision and range are far more than needed for this specific application.
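One practical consequence of the narrower single-precision exponent is worth illustrating with a small self-contained demo (written here for exposition; it is not part of the benchmark code): repeatedly multiplying small likelihoods, as the inference kernel does, drives a float to zero long before a double.

    #include <cstdio>

    int main()
    {
        float pf = 1.0f;
        double pd = 1.0;
        // Multiply in a likelihood of 0.5 per fused reading and watch the
        // single-precision product underflow to zero while the double survives.
        for (int i = 1; i <= 400; i++) {
            pf *= 0.5f;
            pd *= 0.5;
            if (pf == 0.0f) {
                std::printf("float underflowed after %d products; double = %g\n", i, pd);
                return 0;
            }
        }
        return 0;
    }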
Given the problem being tested, moving from one approach to the other doubles the data transfer (from host to device and vice-versa) and the memory allocation on the device.

Since in our specific model the size of the likelihoods file depends only on the size of the grid, and the grid size depends solely on the number of bits used to codify it, given the rest of the model the file's size follows this equation:

$$\mathrm{FileSize} = N \times (\mathrm{MaxBearing} \times B_s + \mathrm{MaxDistance} \times D_s) \times S \tag{3.3}$$

where N represents the number of cells in the grid of the model, $B_s$ and $D_s$ respectively represent the number of bearing and distance sensors, S is the size in bytes of the variable type being used, and MaxDistance and MaxBearing are the maximum values obtainable by those sensors in the context of that model size. The equation therefore translates to the following form:

$$\mathrm{FileSize} = (2^{\mathrm{bits}})^2 \times (2^{\mathrm{bits}+2} \times 3 + 2^{\mathrm{bits}+1} \times 3) \times S \tag{3.4}$$

In the equation above, "bits" is the number of bits required to represent one of the coordinates on the grid (3 bits gives us 8x8, 4 bits 16x16, etc.). From this equation we see that the size of the likelihoods file depends on only two variables, the grid size and the variable type, as the following table shows:
Grid size    Single              Double
8x8          36 KBytes           72 KBytes
16x16        288 KBytes          576 KBytes
32x32        2 304 KBytes        4 608 KBytes
64x64        18 432 KBytes       36 864 KBytes
128x128      147 456 KBytes      294 912 KBytes
256x256      1 179 648 KBytes    2 359 296 KBytes
We can observe how important it is to save space on the device, given the exponential growth of the likelihoods table. The prior files are less of a problem, due to their linear growth and their much smaller size relative to the likelihoods file.
3.5 Evaluation Metrics
3.5.1 Precision
To compare the precision of the final results achieved in this work, we use the Kullback-Leibler divergence between the arrays of the exact inference distribution and the approximate inference distribution, in order to understand the loss in precision incurred by the second technique.

The Kullback-Leibler divergence is a measure from mathematical statistics that tells us how different two probability distributions are from each other [20].

Since we will be comparing values between two vectors, the definition we use is the discrete one:

$$D_{KL}(P \parallel Q) = \sum_{x \in X} P(x) \log\left(\frac{P(x)}{Q(x)}\right) \tag{3.5}$$

This means that the divergence is the sum, over all outcomes, of the exact probability times the logarithm of the exact value divided by the approximation. Thus, the closer the approximation, the closer the sum will be to zero [21].
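Equation 3.5 transcribes directly into a small helper (our own illustrative function; it assumes both vectors are normalized and that Q(x) > 0 wherever P(x) > 0):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Discrete Kullback-Leibler divergence D_KL(P || Q) between two
    // probability vectors of equal length. Terms with P(x) == 0 contribute 0.
    double klDivergence(const std::vector<double> &P, const std::vector<double> &Q)
    {
        double d = 0.0;
        for (std::size_t x = 0; x < P.size(); x++)
            if (P[x] > 0.0)
                d += P[x] * std::log(P[x] / Q[x]);
        return d;
    }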
3.5.2 CPU Power Estimation
In the CPU subcategory, we run the exact inference algorithm with the ProBT toolkit (Python), as well as a more lightweight C++ approach, for a fairer comparison with the GPU and FPGA algorithms.

An estimate of the power consumption of a CPU can be worked out through simple circuit theory. The CPU can be approximately modeled as a variable resistor whose resistance changes as the workload increases. Considering that the power supply to the CPU provides a constant voltage, the CPU will draw proportionally more current as its workload increases.

Given this fact, we can estimate power consumption from only a few values: the workload, and the CPU's resting and maximum power draw. The total energy consumed can then be found by multiplying the execution time of the algorithm by the CPU's power draw (estimated from the workload value):

$$E = P_{CPU} \times t \tag{3.6}$$
To find the execution time of the algorithm in both the Python code and the C++ code, we use functions that keep track of processor usage time, respectively time.clock() [22] and std::clock_t clock() [23].

For measuring workload, we have access to the command "top" in the terminal (on Linux), which gives us plenty of information on OS and hardware usage, including the workload percentage of each core. The CPU power information is retrieved from the specifications published by the vendor of the CPU on which the algorithm is run.
After collecting these measures, we use a linear interpolation between the average resting CPU power consumption and the maximum power consumption (without overclocking), with the CPU's frequency as the determining factor; this provides a decent estimate of the power used per unit of time [24].
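A sketch of this interpolation (names are illustrative; the idle and maximum power figures would come from the vendor's datasheet, and the load fraction from "top"):

    #include <cstdio>

    // Linear power model: interpolate between idle and maximum package power
    // according to the observed load fraction, then integrate over the runtime.
    double estimateEnergyJoules(double pIdleW, double pMaxW,
                                double loadFraction, double runtimeSeconds)
    {
        double powerW = pIdleW + loadFraction * (pMaxW - pIdleW); // P_CPU
        return powerW * runtimeSeconds;                           // E = P_CPU * t
    }

    int main()
    {
        // Example numbers only: 20 W idle, 84 W max, 75% load, 12 s runtime.
        std::printf("%.1f J\n", estimateEnergyJoules(20.0, 84.0, 0.75, 12.0));
        return 0;
    }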
3.5.3 GPU Power Estimation
For the GPU power estimates, a similar procedure is followed, yet we have available a much more accurate tool to perform them. The NVIDIA Management Library (NVML) is a C-based API for monitoring and managing various states of NVIDIA GPU devices. It provides direct access to the queries and commands exposed via nvidia-smi.

With this probe, several measurements relevant to power consumption and resource utilization are captured periodically:
power.draw The last measured power draw for the entire board, in watts. Only
available if power management is supported. This reading is accurate to
within +/- 5 watts. Requires Inforom PWR object version 3.0 or higher
or Kepler device.
clocks.gr Current frequency of graphics (shader) clock.
clocks.sm Current frequency of SM (Streaming Multiprocessor) clock.
utilization.gpu Percent of time over the past sample period during which one or more
kernels was executing on the GPU. The sample period may be between
1 second and 1/6 seconds depending on the product.
utilization.memory Percent of time over the past sample period during which global (device)
memory was being read or written. The sample period may be between
1 second and 1/6 second depending on the product.
clocks.mem Current frequency of memory clock.
Table (3.1) Information captured on the GPU hardware during the Bayesian inference
algorithm execution
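As an illustration of such a probe (a minimal sketch using real NVML entry points; the sampling interval and output format are our choices):

    #include <nvml.h>
    #include <cstdio>
    #include <unistd.h>

    // Poll the first GPU's board power draw once per 100 ms for one second.
    int main()
    {
        nvmlInit();
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(0, &dev);

        for (int i = 0; i < 10; i++) {
            unsigned int mW = 0;
            nvmlDeviceGetPowerUsage(dev, &mW);   // entire board, in milliwatts
            std::printf("power: %.1f W\n", mW / 1000.0);
            usleep(100 * 1000);
        }
        nvmlShutdown();
        return 0;
    }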
3.5.4 FPGA Power Estimation
The FPGA's power consumption is estimated using the power analysis tools provided by the Intel Quartus software. More specifically, the tool used is the Quartus Prime Power Analyzer, which estimates power consumption for a post-fit design, allowing guidelines for the power budget to be established.
It is up to the user to provide timing assignments for all clocks in the design (or to specify signal activity data for the power analysis), to specify the I/O standard on each device input and output, and to define the board trace model for each output in the design.
3.6 Framework and Hardware
In this dissertation we use the same hardware as the previous work, in order to maintain fair comparisons [14]. The requirements in terms of power, airflow, and size for installing the FPGA accelerator boards are identical to those of the server-grade GPUs for which the case was designed. The system has 64 GB of RAM and two Intel Xeon E5-2620 CPUs running at 2.6 GHz, each with 12 cores, tightly interconnected through two Intel QuickPath Interconnect (QPI) interfaces supporting 9.6 GT/s transfer speeds each. The FPGA used is a Gidel ProceV [25] board based on Altera's Stratix V family. The ProceV provides massive capacity (up to 952K LEs) and high memory and I/O performance: 8-lane PCIe Gen 3 and twenty-six 12.5/14.1 Gb/s transceivers provide external I/O of up to 366 Gb/s (full duplex). The embedded memory offers 8 TB/s of throughput, alongside 16 GB of ECC DDR3 and an optional 288 Mb of DDR2 SRAM.
To test the algorithms on the GPU and CPU, an Intel Core i5-4570 [26] quad-core processor is used, running at 3.20 GHz with 6 MB of cache and an 84 W TDP. The GPU is an NVIDIA GTX Titan [27] with 2688 cores, an 837 MHz graphics clock, a 6.0 Gbps memory clock, and 250 W of maximum power.
To run the algorithms on the GPU, Intel FPGA SDK for OpenCL 19.1 was used, and version 14.1 for the FPGA, due to the ProceV board's incompatibility with more recent versions.
Chapter 4
Experimental Results
4.1 Finding the Position of the Boat
Figure 4.1 shows the Bayesian inference engine running on the devices in real time, with the user controlling the boat position through a simple interface and receiving its likelihood position in return with very low latency.
Figure (4.1) Finding the Positions of the Boat through Bayesian Inference.
In the following tests, which measure various metrics such as transfer times, power consumption, and execution times of the algorithm, we add a new dimension by running the same algorithm a different number of times in each test. This approach was adopted because of the relatively short running times the algorithm had on each device, even when compared with just the transfer times of the tables from host to device and vice-versa. By extending the running times, we obtain more precise power consumption measurements, as well as more accurate execution times per iteration.
4.2 Scalability and Resource Usage
During the development of the project it was observed that, contrary to expectations, the hardware "footprint" stopped growing after the 32x32 grid size, with the subsequent 64x64 and 128x128 hardware designs presenting exactly the same resource allocations, as can be observed in figures 4.2 and 4.3. Several attempts were made to discover the factor limiting the scaling of the problem, but none was found, since the compilation reports showed no resource being maxed out.

This means that a limited number of kernels perform the work of several cells sequentially due to these constraints, which leads us to assume that the results presented may be sub-optimal, and that there is probably room for improvement in both execution time and power consumption.
Figure (4.2) FPGA Resource utilization for 16x16 grid dimensions.
Figure (4.3) FPGA Resource utilization for 32x32, 64x64, and 128x128 grid dimensions.
For the GPU, utilization was always near 50% and power consumption around 32% for all grid sizes under constant use. Thanks to the command nvidia-smi, the limiting factor was very clear: the global memory access rate was constantly at 100%. This could not be mitigated, since constant access to the likelihood lookup tables is an integral part of this algorithm; the room for improvement may therefore lie in a better choice of GPU for this kind of algorithm.
4.3 Results Precision
As previously mentioned in section 3.5.1, to check the fidelity of the results produced by the devices, these were compared directly with the output of the model. To achieve this, we first supply the ProBT model with some coordinates (x, y) and perform one inference iteration for a given model size. We then run the same process on the programs developed in this dissertation, and compare the final matrices value by value using the Kullback-Leibler divergence.
D_{KL}(P \| Q) = \sum_{x \in X} P(x) \log\left(\frac{P(x)}{Q(x)}\right) \qquad (4.1)
Since we are trying to obtain results as close as possible to those generated by the model, in this equation P(x) is the model's output and Q(x) represents our results.
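A minimal sketch of this value-by-value comparison (a helper of our own, assuming both grids are flattened and normalized to sum to one) follows equation (4.1) directly:

    #include <cmath>
    #include <vector>

    // D_KL(P || Q): p holds the ProBT model output, q the device output.
    double kl_divergence(const std::vector<double>& p, const std::vector<double>& q) {
        double d = 0.0;
        for (std::size_t i = 0; i < p.size(); ++i)
            if (p[i] > 0.0 && q[i] > 0.0)
                d += p[i] * std::log(p[i] / q[i]);
        return d;
    }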
Table (4.1) Kullback-Leibler divergence for each grid size

Grid Size    Kullback-Leibler divergence
16x16        3.89257 × 10^-16
32x32        1.41173 × 10^-15
64x64        7.36625 × 10^-16
128x128      4.68538 × 10^-15
38 4.4. EXECUTION TIMES
Table 4.1 presents the Kullback-Leibler divergence of the double-precision floating-point program for the various grid sizes. Doubles were used for this comparison because the output of the ProBT library, taken as the reference, is also stored in doubles. After comparing both matrices, we obtained a result of 3.82875 × 10^-16, which is virtually zero, thus verifying the expected outcome. The divergence is likely not exactly zero because of rounding approximations occurring somewhere in the underlying operations. As expected, it is indeed exact Bayesian inference being performed.
4.4 Execution Times
During the development and testing of the inference kernels on the several devices, it was noticed that the vast majority of the devices' time and energy was being spent on the normalization of the posterior probability matrix. This was due to the significant overhead required to synchronize all the kernel threads across the several steps of this procedure.
This disproportionate consumption of time and energy by the normalization operation, relative to the computation of the posterior probability itself, was also noted in a previous work in which both tasks were divided into two different kernels, and 93.75% of the device time was spent normalizing the matrix [15]. Test runs performed on the GPU show similar gains when the normalization is moved to the CPU.

After some testing, it became apparent that performing this final portion of the algorithm back on the host greatly reduced the time spent on the total Bayesian inference operation, and also somewhat reduced the total energy consumed (accounting for the energy used by the CPU to perform that operation); these times are shown in table 4.2. This holds for small matrix sizes such as the ones discussed in this dissertation. For matrices of higher dimensions, it becomes more time- and energy-efficient to perform the normalization on the devices, since the overhead required for kernel synchronization becomes increasingly negligible relative to that operation.
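A minimal sketch of this host-side step (the names are ours, not from the original host code): the unnormalized posterior read back from the device is rescaled by its total sum so that the grid sums to one.

    #include <numeric>
    #include <vector>

    void normalize_posterior(std::vector<double>& posterior) {
        // Sum every grid cell, then divide so the posterior is a distribution.
        double total = std::accumulate(posterior.begin(), posterior.end(), 0.0);
        if (total > 0.0)
            for (double& cell : posterior)
                cell /= total;
    }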
Table (4.2) Posterior Probability Matrix Normalization Time on CPU

16×16      32×32      64×64       128×128
1325 ns    5326 ns    21525 ns    85884 ns
Note that in the following tables of Bayesian inference times on the GPU and FPGA, the values refer only to the time spent on the devices. For a fair comparison with other approaches, the CPU normalization time of the posterior probability matrix has to be added.
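For example, for one iteration on the 128×128 grid on the GPU, the comparable figure is 172.3 µs (table 4.3) plus the 85884 ns ≈ 85.9 µs of host-side normalization (table 4.2), roughly 258 µs in total.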
Table (4.3) Bayesian Inference in GPU

No. of iterations    16×16       32×32       64×64       128×128
1                    60.1 µs     80.2 µs     93.3 µs     172.3 µs
10                   61.3 µs     87.7 µs     105 µs      177.6 µs
100                  83.8 µs     117.3 µs    115.6 µs    192.9 µs
1000                 108.8 µs    129 µs      188.2 µs    366.5 µs
Table (4.4) Bayesian Inference in a single FPGA

No. of iterations    16×16      32×32      64×64       128×128
1                    118 µs     231 µs     415 µs      1361 µs
10                   145 µs     265 µs     436 µs      1374 µs
100                  227 µs     505 µs     1346 µs     4900 µs
1000                 853 µs     3330 µs    11990 µs    47571 µs
[Figure: log-scale plot of Bayesian inference iteration time in seconds versus lateral grid size, for the GPU and FPGA.]
Figure (4.4) FPGA and GPU Execution Times for one Bayesian Inference Iteration on Device.
In the following tables we present the times obtained running exact inference with the ProBT library in Python and with the optimized C++ version.
Table (4.5) Bayesian Inference in CPU using ProBT version for one iteration

16×16        32×32         64×64         128×128
58.373 ms    184.744 ms    895.136 ms    5.998676 s
Table (4.6) Bayesian Inference in CPU using Optimized C++ version

No. of iterations    16×16         32×32          64×64          128×128
1                    332.6 µs      3061.3 µs      22373.66 µs    172216.2 µs
10                   1349 µs       16116 µs       241084.7 µs    1.840625 s
100                  17628.4 µs    203309.5 µs    2.68634 s      20.77593 s
1000                 1.622749 s    18.622719 s    102.94 s       523.70552 s
4.5 Power Consumption
Table (4.7) Average Power Input on Devices

        16×16       32×32       64×64      128×128
GPU     78.55 W     80.02 W     81.08 W    81.17 W
FPGA    12.584 W    12.564 W    12.65 W    12.618 W
The near-constant average power draw across problems of different dimensions, as shown above, again indicates the difficulty found in properly scaling the algorithm on the devices.
4.5.1 GPU Energy Consumption
Table (4.8) Energy Consumption of One Bayesian Inference Iteration on GPU

                                  16×16          32×32          64×64          128×128
Bayesian Inference (on device)    0.008546 mJ    0.01032 mJ     0.015259 mJ    0.02975 mJ
Kernel Normalization (host)       0.1113 mJ      0.447384 mJ    1.8081 mJ      7.214256 mJ
Total                             0.119846 mJ    0.457704 mJ    1.816259 mJ    7.244006 mJ
4.5.2 FPGA Energy Consumption
Table (4.9) Energy Consumption of One Bayesian Inference Iteration on FPGA

                                  16×16          32×32          64×64          128×128
Bayesian Inference (on device)    0.010734 mJ    0.041839 mJ    0.151731 mJ    0.600271 mJ
Kernel Normalization (host)       0.1113 mJ      0.447384 mJ    1.8081 mJ      7.214256 mJ
Total                             0.122034 mJ    0.489223 mJ    1.959831 mJ    7.814527 mJ
4.5.3 CPU Energy Consumption
Table (4.10) One Iteration Bayesian Inference in CPU using Optimized C++ version

16×16         32×32         64×64      128×128
11.3316 mJ    135.374 mJ    1.879 J    14.462 J
[Figure: log-scale plot of energy per Bayesian inference iteration in joules versus lateral grid size, for the GPU and FPGA.]
Figure (4.5) FPGA and GPU Energy Consumption for one Bayesian Inference Iteration on Device.
In the tables of GPU and FPGA energy consumption, we can observe that the total energy cost is very similar, because the normalization operation on the host is by far the most substantial energy drain. On the other hand, when comparing the operations on the device, the OpenCL implementation on the GPU has a clear energy advantage due to its much faster execution times, even though its average power input is almost 6.5 times larger. In summary, the FPGA implementation requires a smaller power supply, but has an overall higher energy consumption (which grows with the problem dimensions) and higher latency relative to the GPU implementation.
4.6 Computing with Double Precision vs Single Precision
Double vs Single Precision
The default programs were developed using double-precision variables, but another version was made for the GPU with single-precision variables in order to compare performance.
43 4.6. COMPUTING WITH DOUBLE PRECISION VS SINGLE PRECISION
In the single-precision version of the program, the Kullback-Leibler divergence relative to the output of the ProBT program was 2.84407 × 10^-10 for a 16x16 grid size. The higher divergence shown reflects the reduced size of the significand available in the single-precision representation.
This specific application does not require the precision provided by the double-precision programs in order to keep the results relevant or usable. When developing a new program, it is therefore important to consider whether the trade-off is worthwhile.
Transfer Times
The data presented in the following tables was obtained by measuring the interval of time the host program took to write into the GPU's memory all the information necessary for the execution of the program. This includes the likelihoods lookup table (which, as we have seen previously, grows exponentially with the size of the problem), the sensor values, the priors vector, and finally the control variables.
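One way to take such a measurement is sketched below using OpenCL event profiling (the queue is assumed to have been created with CL_QUEUE_PROFILING_ENABLE; the original measurements may instead have used host-side timers):

    #include <CL/cl.h>

    // Time a single blocking host-to-device write, in nanoseconds.
    cl_ulong time_write_ns(cl_command_queue queue, cl_mem device_buf,
                           const void* host_data, size_t bytes) {
        cl_event ev;
        clEnqueueWriteBuffer(queue, device_buf, CL_TRUE, 0, bytes,
                             host_data, 0, NULL, &ev);
        cl_ulong start = 0, end = 0;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        clReleaseEvent(ev);
        return end - start;
    }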
The times were recorded for four possible dimensions of the problem, 4, 5, 6, and 7 bits, representing grids of 16x16, 32x32, 64x64, and 128x128, respectively. To account for possible anomalies, each case was measured 5 times, and the table presents the averaged values.
Table (4.11) Average Transfer Times to GPU

          16×16       32×32        64×64        128×128
Single    388.2 µs    1616.4 µs    7182.8 µs    48155.6 µs
Double    537 µs      2225 µs      13196 µs     95522 µs
As we can see from tables 4.11 and 4.13, with the exponential increase of data being transferred to the device, the time consumed by overhead processes becomes more negligible, and the ratio between the two transfer times tends towards the ratio of the memory sizes being passed, as expected.
Table (4.13) Ratio Between Single and Double Transfer Times

16×16     32×32     64×64      128×128
0.7229    0.7264    0.54431    0.5041
Execution Times
To test the double versus single execution times, we counted the time from when the kernels were launched until the host received the final data from the device. It is important to note that these OpenCL kernels differed only in the type of variables they operated on.
The results were obtained by testing the 32x32 (5-bit) problem for several iteration counts, namely 1, 10, 100, and 1000. Due to the relative stability of the results obtained, only 5 samples were taken for each case.
Table (4.14) Average Execution Times on GPU

No. of iterations    1           10          100          1000
Single               128.6 µs    490.6 µs    3688.8 µs    37599.4 µs
Double               141.4 µs    528.6 µs    3684.2 µs    37424.8 µs
Since the kernels with double-precision variables send back twice as much data from the device, the prediction was that there would be some difference between the times of the two versions; however, this difference would have to remain constant across all the iteration counts if the kernels themselves had the same execution times.
We can see such slight gaps in the times of the 1- and 10-iteration sets, of 12.8 µs and 38 µs respectively, yet in the last two sets the double version of the kernels actually presents lower times. This is because the measured times had high variability.
From these values we can confirm that, for this inference algorithm on this commercial GPU, no significant execution time difference exists; it may be the case that on other architectures results would vary between single and double precision. As explained in the Background chapter, some vendors create GPU configurations specialized for one variable type. All of this information should be taken into account when choosing the design approach.
Even though the results were recorded while no other program beyond those necessary for the experiment was active, the values still show some fluctuation between the single- and double-precision approaches. This may be explained by extrinsic factors that are hard to isolate or control, such as GPU temperature or other processes running concurrently (such as the OS) over which the user has no control.
4.7 Comparison With Previous Works
[Figure: log-log plot of Kullback-Leibler divergence versus energy consumption in joules, with data points for GPU, FPGA, and CPU at grid sizes 16x16, 32x32, 64x64, and 128x128.]
Figure (4.6) KL divergence results of the boat example for different grid sizes and increasing energy consumption, for the BM1 (background), OpenCL on FPGA (blue dotted) and GPU (red dotted), and finally our improved version of the CPU algorithm (pink dotted).
As can be observed in figure 4.6, the background graph corresponds to the diagram from the previous work in this field [10], fitted inside the dimensions of our plot because the individual point data of that work could not be obtained at the time. The bold vertical lines indicate the energy used by a typical laptop computer to perform the corresponding exact inference with the old algorithm; the finer dotted lines correspond to the data obtained during this dissertation. In this graph, the OpenCL implementations on GPU and FPGA almost overlap due to the low granularity of the plot.
Despite the great improvement in precision at fairly similar power consumption, it is important to note that the BM1 inference engine and the OpenCL kernels used in this work perform different operations. This work achieves exact inference in a single iteration, because every possible sensor outcome of the Bayesian inference is calculated a priori. Our method is thus itself a trade-off between memory and circuit space on the device. Both approaches face difficulties with increased scale due to the exponential nature of the problem (BM1 with circuit space, and the a priori calculations with on-device memory), yet both represent a huge improvement over desktop computation in both energy consumption and latency.
The preference for each system depends on the intended outcome: for high-precision results, our method takes a decisive advantage in energy consumption on both FPGAs and GPUs, with the latter also presenting few scaling problems and great gains in execution times across all tested dimensions.

If precision is a less relevant factor, the BM1 engine might be a better-suited choice, since it enables very selective precision simply by regulating the number of inference iterations, and can thus yield great energy savings by relinquishing precision.
Although these inference problems become difficult to scale on a single device at large dimensions, it is also important to note that the algorithm presented in this work can easily be divided across several devices: since each kernel is assigned one cell of the grid, and each cell accesses only one specific portion of the likelihoods table, the algorithm can be partitioned by cells and memory across several devices.
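As a purely illustrative sketch of that partitioning (such a multi-device split was not implemented in this work, and all names are ours), contiguous ranges of cells and the matching likelihood-table slices could be assigned per device as follows:

    #include <algorithm>
    #include <cstddef>

    // Cells [cell_begin, cell_end) and the matching likelihood-table rows
    // handled by one device; rows_per_cell is the number of table entries
    // that belong to a single grid cell.
    struct DevicePartition {
        std::size_t cell_begin, cell_end;
        std::size_t table_begin, table_end;
    };

    DevicePartition partition(std::size_t n_cells, std::size_t rows_per_cell,
                              std::size_t n_devices, std::size_t device_id) {
        std::size_t per_dev = (n_cells + n_devices - 1) / n_devices; // ceiling
        std::size_t begin = std::min(device_id * per_dev, n_cells);
        std::size_t end = std::min(begin + per_dev, n_cells);
        return {begin, end, begin * rows_per_cell, end * rows_per_cell};
    }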
Another important factor to highlight is that the OpenCL implementation on the GPU is faster to implement, compile, and test than the FPGA approach, and GPU resources can be allocated dynamically during the execution of the algorithm. Because of this last point, for the FPGA it was necessary to create, compile (taking several hours), and deploy different binaries to the device for each grid size tested. The GPU implementation also consistently performed with lower latency, with the difference between the two approaches growing with increasing problem dimensions. As stated in section 4.5, the FPGA program had the advantage of requiring less power to operate (approximately 6.5 times less), but a higher total energy consumption for the Bayesian inference operation (from 1.25 times at the 16x16 size to 20 times at the 128x128 grid size).
Also, as expected even before the start of this dissertation, the OpenCL implementations hold clear advantages over the BAMBI toolchain and Hardware Description Languages (HDL): a much faster development cycle, less specialized development work, hardware flexibility that allows developing an algorithm not tied to any particular hardware, and easy migration between devices.
Chapter 5
Conclusion
5.1 Results Discussion
As stated in the previous chapter, it is clear that no Bayesian inference implementation outmatches the others in all categories. The solutions pursued in this work lose to the BM1 engine when an application's main priority is low energy consumption combined with low precision requirements, although for higher grid dimensions the difference appears to shrink.
Exact Bayesian Inference on OpenCL using FPGA
Performing exact Bayesian inference in OpenCL on an FPGA presents an advantage over the BM1 machine in terms of energy consumption when high precision is a requirement of the intended application, while also demanding significantly less design effort (the same can be said for the OpenCL GPU implementation). The increased design and energy costs relative to using a GPU for exact Bayesian inference may be worthwhile in cases where the available power supply is limited, as is the case for some robots and IoT devices.
Exact Bayesian Inference on OpenCL using GPU
The OpenCL implementation on the GPU presents significantly lower latency as the dimensions of the problem increase, as well as low overall energy consumption, making it a great candidate for solving Bayesian inference problems given its much reduced design and integration effort, especially when developing and testing new Bayesian solutions. It is also important to highlight its significant setback: it is by far the approach with the highest power requirement, which automatically disqualifies it for several applications.
5.2 Hardware and Software Limitations
Throughout the development of this project, several hardware and software issues and conflicts were found that had a significant impact on its progress. The time taken to resolve, mitigate, or work around these problems severely delayed the completion of the dissertation. The most prominent setbacks are the following:
The server where the work was developed integrated Gidel boards whose Board Support Package only supports version 14.1 of the Altera (now Intel) FPGA OpenCL SDK. This brought legacy issues when operating on newer OSs, as well as the impossibility of using more recent tools that could have shortened development times (e.g., compilers that synthesize FPGA designs faster). Another problem was the remarkably poor support and documentation available for this outdated support package.
Full FPGA compilations would take approximately 2 to 3 hours, making the process of debugging practical problems extremely slow. Emulating the same programs took only a few minutes, enabling us to debug virtually every logic problem before deployment.
Gidel's commercial BSP has poor documentation and exhibits various bugs in use. A related problem is that, being a commercial product, its source code is not available for correction. The most significant setback is that its PCIe drivers are extremely unstable: commonly, any attempt to reconfigure one or several FPGAs that had been configured not long before (a few hours) would freeze the host and force a reboot.
5.3 Future Work
In future work in this field, probably the most significant goal would be to develop other benchmarks to test the performance of Bayesian inference algorithms, since a single test cannot be representative of all kinds of Bayesian inference applications.

Another goal is to generalize and automate the toolchain so that it accepts a broader range of Bayesian inference problems, and to integrate it with ROS (Robot Operating System) to allow testing in real-world conditions.

Another interesting goal would be to create a Graphical User Interface (GUI); it should be relatively easy to adapt the toolchain inputs and the Python master script execution to such an interface.

Finally, the limitations on the scaling of the OpenCL implementations should be studied further.
Bibliography
[1] Schacter, Daniel (2011). Psychology. Worth Publishers.
[2] Bernstein, Douglas A. (5 March 2010). Essentials of Psychology. Cengage Learning.
pp. 123–124. ISBN 978-0-495-90693-3. Archived from the original on 2 January 2017.
Retrieved 25 March 2011.
[3] S. Legg; M. Hutter (2007). A Collection of Definitions of Intelligence. 157. pp. 17–24.
ISBN 9781586037581.
[4] Bengio, Yoshua; LeCun, Yann; Hinton, Geoffrey (2015). Deep Learning. Nature.
521 (7553): 436–444. Bibcode:2015Natur.521..436L. doi:10.1038/nature14539. PMID
26017442
[5] Malcolm Tatum, October 3, 2012. What is Machine Perception.
[6] Brassard, G. and P. Bratley: Algorithmics - Theory and Practice, Prentice-Hall, 1988.
(Chapter 8)
[7] Harel, D.: Algorithmics - The Spirit of Computing, 2nd Edition, Addison-Wesley, 1992.
(Chapter 11)
[8] Bayes’ Theorem. Stanford Encyclopedia of Philosophy. Plato.stanford.edu. Retrieved
2014-01-05
[9] Dr Jacques Droulez. Bottom-up Approaches to Machines dedicated to Bayesian Infer-
ence. Period covered: from 01/01/2014 to 31/12/2016, http://www.bambi-fet.eu.
[10] Alexandre Coninx, Pierre Bessière, Emmanuel Mazer, Jacques Droulez, Raphael Laurent, et al. Bayesian Sensor Fusion with Fast and Low Power Stochastic Circuits. IEEE International Conference on Rebooting Computing, Oct 2016, San Diego, United States. hal-01374910
[11] J. F. Ferreira, P. Lanillos, and J. Dias, Fast Exact Bayesian Inference for High-Dimensional Models. 2015.
[12] P. BAMBI, BAMBI D3.10: Emulation of a probabilistic computer on current reconfigurable logic. Tech. rep., 2017.
[13] A. Coninx, P. Bessière, E. Mazer, J. Droulez, R. Laurent, M. A. Aslam, and J. Lobo, Bayesian sensor fusion with fast and low power stochastic circuits. 2016 IEEE International Conference on Rebooting Computing, ICRC 2016 - Conference Proceedings, 2016.
[14] M. G. D. Mira, Using an FPGA Mini-Cluster to Implement Bayesian Application-Specific Integrated Circuits for Robotic Applications. 2017.
[15] Jose Carlos Direito, May 2018, Probabilistic Computing Using OpenCL on an FPGA Mini-cluster. Dissertation submitted to the Electrical and Computer Engineering Department of the Faculty of Science and Technology of the University of Coimbra in partial fulfillment of the requirements for the Degree of Master of Computer Science.
[16] FPGA Architecture for the Challenge. toronto.edu. University of Toronto.
[17] K. Mekhnacha, J.-M. Ahuactzin, P. Bessière, E. Mazer and L. Smail. A unifying framework for exact and approximate Bayesian inference. January 2006. Institut National de Recherche en Informatique et en Automatique. Retrieved from the URL: https://hal.inria.fr/inria-00070226/document
[18] Janet Morss. FPGAs vs. GPUs: A Tale of Two Accelerators. January 16th, 2019. Retrieved from URL: https://blog.dellemc.com/en-us/fpgas-vs-gpus-tale-two-accelerators/. Retrieved October 17th, 2019.
[19] The open standard for parallel programming of heterogeneous systems. Retrieved from URL: https://www.khronos.org/opencl/ on 26/02/2020.
[20] Kullback, S.; Leibler, R.A. (1951). On information and sufficiency. Annals of Math-
ematical Statistics. 22 (1): 79–86. MR 39968. doi:10.1214/aoms/1177729694
[21] Kullback, S. (1959), Information Theory and Statistics, John Wiley & Sons. Republished by Dover Publications in 1968; reprinted in 1978: ISBN 0-8446-5625-9.
[22] Measure Time in Python – time.time() vs time.clock(). Last updated: Friday 3rd May 2013. Retrieved from URL: https://www.pythoncentral.io/measure-time-in-python-time-time-vs-time-clock/. Retrieved December 8th, 2019.
[23] std::clock. Retrieved from URL: https://en.cppreference.com/w/cpp/chrono/c/clock. Retrieved December 8th, 2019.
[24] Matthew Travers (2015), CPU Power Consumption Experiments and Results Analysis of Intel i7-4820K. Systems Research Group, School of Electrical and Electronic Engineering. Technical Report Series NCL-EEE-MICRO-TR-2015-197. Chapter 2.3.
[25] ProceV-GX - PCIe based 100G+ ultra-low latency platform. https://www.intel.com/content/www/us/en/programmable/solutions/partners/partner-profile/gidel–inc-/board/procev-gx–pcie-based-100g–ultra-low-latency-platform.html. Retrieved 2020-02-28.
[26] Intel® Core™ i5-4570 Processor. https://ark.intel.com/content/www/us/en/ark/products/75043/intel-core-i5-4570-processor-6m-cache-up-to-3-60-ghz.html. Retrieved 2020-02-28.
[27] GeForce GTX TITAN | Specifications. https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan/specifications. Retrieved 2020-02-28.