


VLSI Architecture Design for Particle Filtering in Real-time

A. Pasciaroni∗†, J. A. Rodríguez†, F. Masson∗†, P. Julián∗†, E. Nebot‡

∗Dep. Ing. Eléctrica y Computadoras, Universidad Nacional del Sur, Av. Alem 1253, Bahía Blanca, Argentina

†CONICET, Argentina

‡Australian Centre for Field Robotics, University of Sydney, Australia

Abstract: The particle filter is an algorithm that provides system state estimation even for non-linear and non-Gaussian systems. For applications that require a large number of particles, the real-time constraint is hard to meet, since the algorithm is computationally expensive and the resampling step becomes a bottleneck. In this work, a VLSI architecture for particle filtering in real time is presented. The proposed design implements a fraction of the processing using piecewise linear functions and allocates them as global resources. In this way, a large number of processing elements (PEs) working in parallel can be instantiated in the design. An example based on range-only localization using Radio-Frequency Identification (RFID) tags is developed to illustrate the approach. The received signal strength indicator (RSSI) is used to estimate the distance between transmitter and receiver. A VHDL RTL model of the processing data flow is implemented and compared to Matlab simulations, showing similar results.

Index Terms: Particle Filter, VLSI Design, RFID, RTL.

I. INTRODUCTION

Particle Filters (PF) [1] are a method to perform statistical dynamic state estimation. The probability density function of a given state is represented by a set of weighted entities, or particles, which is updated iteratively according to sensor measurements and a dynamic system model. The three main steps of the particle filter are sampling, update and resampling. This last step presents high data dependency between particles, becoming the major bottleneck in the execution time of the filter.

There exist applications that require real-time estimation of non-linear and non-Gaussian systems, such as robot localization and visual tracking [2], [3], [4]. These applications are well suited for particle filtering, but a large number of particles is required to provide accurate estimates. Since the PF algorithm is computationally expensive and the resampling step cannot be fully parallelized, particle filter computation in real time is limited by the available computational resources. In this context, a VLSI implementation that exploits the data level parallelism of the algorithm will allow particle filtering in real time.

Previous works have addressed particle filter implementations for real-time applications [5], [6], [7], [8]. In [5] a PF architecture composed of multiple processing elements and a central unit is presented for the bearing-only tracking problem and implemented in FPGA. Particle filter steps are performed locally on each processing element (PE). After resampling, a central unit controls the particle exchange among processors in order to reduce performance degradation. Several communication schemes are introduced, including a fixed particle exchange among processors. In [7] a VLSI design of the processing element is presented, which also includes a pipelined dataflow that deals with logic blocks of variable latency. In [8] a central unit that performs the communication schemes introduced in [5] is designed for an architecture composed of four processing elements, and a VLSI implementation is presented. In [6] a parallel pipelined design is presented in which the number of replicated pipeline stages is variable. Taking into account the rate of each stage, an optimal number of replicated stages is determined. However, a VLSI implementation that takes full advantage of the data level parallelism present in the algorithm has not been developed yet.

In this work a VLSI architecture for particle filtering in real-time applications is presented. It is composed of processing clusters, each with one resampling module and an array of PEs. Each PE performs, in a pipelined fashion, the steps of the PF operation that do not present data dependency. Therefore, if more PEs can be instantiated in a given silicon area, more particles can be effectively processed in parallel, increasing the throughput. Afterwards, resampling modules gather the PE outputs so that the resampling is performed in groups. In addition, to reduce the PE area, a fraction of the PE data processing is time-multiplexed, so the hardware dedicated to this processing is instantiated once and can be shared by multiple PEs.

The application chosen to illustrate the approach is target tracking based on the Received Signal Strength Indicator (RSSI) of Radio Frequency Identification (RFID) devices.

The paper is organized as follows. Section II presents the localization framework and the RSSI sensor model. The architecture and microarchitecture design is presented in Section III. The execution time of the proposed architecture is presented in Section IV. Simulation results comparing the VHDL RTL and Matlab models are presented in Section V. Finally, Section VI is dedicated to the conclusions.

II. LOCALIZATION FRAMEWORK

In sensor networks, Radio Frequency based localization systems have gained importance in those environments where Global Positioning System (GPS) based solutions do not perform well due to poor satellite availability or multipath issues [9], [10]. This is a possible situation for the chosen target application: truck localization in opencast mining environments [9]. RFID technology comprises the receivers, antennas and RFID tags. The tags send their identification number to the receivers. Making use of the RSSI, it is possible to estimate the distance between a tag and a receiver, since RSSI values decrease with distance according to a known law. Due to several factors that affect the propagation of electromagnetic waves in a medium (refraction, reflection, scattering), the received power vs. distance relation varies with the obstacles in the environment, the height and direction of the antenna, and also the power of the transmitted signal. This results in a non-bijective and thus multimodal sensor function.

Fig. 1: Two Ray Model for a communication link of 433 MHz in a rural environment (average signal strength [dBm] vs. distance [m]).

Figure 1 shows a typical two-ray propagation model of RF signals [11] for a rural environment, a communication frequency of 433 MHz and transmitter and receiver heights of 2.5 m. It shows the average received signal strength versus distance. For a given distance, the distribution of the RF signal is considered Gaussian and its variance varies with the signal strength [9]. It is possible to observe that for a received power of −70 dBm there exist multiple distance values (8 m, 15.5 m, 20 m and 43.1 m), only one of which is the true distance to the tag. This example shows the multimodal probability density function associated with the RFID sensor.
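The multimodality described above can be reproduced with a small numerical sketch. The code below uses the textbook two-ray ground-reflection formula with a reflection coefficient of −1, unit antenna gains and a 0 dBm transmit power; these values, and the function names `two_ray_power_dbm` and `distances_for_power`, are assumptions made for illustration and are not taken from the paper, so absolute levels will not match Fig. 1.

```python
import cmath
import math

C = 299_792_458.0  # speed of light [m/s]

def two_ray_power_dbm(d, freq_hz=433e6, h_tx=2.5, h_rx=2.5,
                      p_tx_dbm=0.0, reflection=-1.0):
    """Average received power [dBm] at ground distance d [m].

    Assumed: unit antenna gains, 0 dBm transmit power, reflection = -1.
    """
    lam = C / freq_hz
    d_los = math.hypot(d, h_tx - h_rx)   # direct path length
    d_ref = math.hypot(d, h_tx + h_rx)   # ground-reflected path length
    phase = 2.0 * math.pi * (d_ref - d_los) / lam
    field = 1.0 / d_los + reflection * cmath.exp(-1j * phase) / d_ref
    gain = (lam / (4.0 * math.pi)) ** 2 * abs(field) ** 2
    return p_tx_dbm + 10.0 * math.log10(gain)

def distances_for_power(level_dbm, d_min=1.0, d_max=60.0, step=0.01):
    """Distances at which the model crosses level_dbm (grid sign changes)."""
    hits = []
    n = int(round((d_max - d_min) / step))
    prev_d = d_min
    prev_e = two_ray_power_dbm(prev_d) - level_dbm
    for k in range(1, n + 1):
        d = d_min + k * step
        e = two_ray_power_dbm(d) - level_dbm
        if prev_e * e < 0.0:             # the curve crossed the level
            hits.append(0.5 * (prev_d + d))
        prev_d, prev_e = d, e
    return hits
```

Calling `distances_for_power` with a level between a fade and its neighbouring peaks returns several candidate distances for a single power reading, which is the ambiguity the particle filter has to resolve.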

RSSI based localization can be performed using the particle filter algorithm. Consider a hypothetical scenario of one RFID tag moving in 2-D and one antenna located at the origin. Let $p_k^i$ denote the $i$th particle, where $p_k^i = [x, \dot{x}, y, \dot{y}]^T$. The target system evolution is given by

$$
f(p_{k-1}^i, v_x, v_y) =
\begin{bmatrix} 1 & \Delta T & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & \Delta T \\ 0 & 0 & 0 & 1 \end{bmatrix} \cdot p_{k-1}^i +
\begin{bmatrix} 0.5 \cdot \Delta T^2 & 0 \\ \Delta T & 0 \\ 0 & 0.5 \cdot \Delta T^2 \\ 0 & \Delta T \end{bmatrix} \cdot
\begin{bmatrix} v_x \\ v_y \end{bmatrix},
\tag{1}
$$

where $v_x$ and $v_y$ are drawn from a uniform distribution $U[0, Q]$. The pseudocode of the Particle Filter algorithm for the chosen application and for a set of $N$ particles is described below:

    random initialization of particles;
    for i ← 1 to N do
        p_k^i = f(p_{k−1}^i, v_x, v_y);                                         // sampling
        d_i = sqrt(p_k^i(1)^2 + p_k^i(3)^2);
        Pot_i = F_sensor(d_i);
        w_i = 1/(2π·σ^2) · exp(−(Pot_i − Pot_measurement)^2 / (2·σ^2));         // update
    end
    [w, p] = resampling(w, p_k);

where $Pot_{measurement}$ is the power measurement of the received signal, whose variance is $\sigma^2$, and $F_{sensor}(d)$ is the mathematical expression of the two-ray propagation model whose characteristic is shown in Figure 1. Depending on the obstacles present in the environment, a more complex sensor model can be used. For the resampling step there exist several algorithms [12], [5], [13]. The position estimate is computed by the following equation:

$$
\hat{x} = \sum_{i=1}^{N} p^i \cdot w^i \tag{2}
$$
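The pseudocode and Eq. (2) can be sketched in software as follows. This is only an illustrative floating-point model, not the hardware dataflow: the sensor function `log_distance_dbm` is a hypothetical stand-in for $F_{sensor}$, multinomial resampling via `random.choices` replaces the resampling algorithms cited above, and the weight uses the normalized Gaussian density (any constant factor cancels once the weights are normalized for the estimate).

```python
import math
import random

def log_distance_dbm(d):
    """Hypothetical stand-in for F_sensor (NOT the paper's two-ray model)."""
    return -40.0 - 20.0 * math.log10(max(d, 0.1))

def propagate(p, dt, q, rng):
    """Eq. (1): p = [x, vx, y, vy]; noise components drawn from U[0, q]."""
    x, vx, y, vy = p
    ax, ay = rng.uniform(0.0, q), rng.uniform(0.0, q)
    return [x + vx * dt + 0.5 * dt ** 2 * ax,
            vx + dt * ax,
            y + vy * dt + 0.5 * dt ** 2 * ay,
            vy + dt * ay]

def pf_step(particles, pot_meas, sigma, dt, q, rng, f_sensor=log_distance_dbm):
    """One sampling / update / resampling iteration plus the Eq. (2) estimate."""
    moved = [propagate(p, dt, q, rng) for p in particles]
    weights = []
    for x, _, y, _ in moved:
        pot = f_sensor(math.hypot(x, y))          # predicted power at distance d_i
        z = (pot - pot_meas) / sigma
        weights.append(math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi)))
    total = sum(weights)
    if total == 0.0:                              # all weights underflowed
        weights = [1.0] * len(moved)
        total = float(len(moved))
    # Eq. (2) with weights normalized to sum to one
    estimate = [sum(w * p[j] for w, p in zip(weights, moved)) / total
                for j in range(4)]
    resampled = rng.choices(moved, weights=weights, k=len(moved))
    return [list(p) for p in resampled], estimate
```

A call such as `pf_step(particles, pot_meas=-60.0, sigma=2.5, dt=0.1, q=1.0, rng=random.Random(1))` returns the resampled particle set and the weighted state estimate for one measurement.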

III. DESIGN

A. Architecture

The PF algorithm does not present data dependence between particles except in the resampling step. When the number of particles increases, the resampling execution time can become a bottleneck. A strategy to reduce the resampling execution time is to divide the total number of particles into groups so that the level of parallelism is increased [5]. Each particle group is processed by a dedicated processor. Since the resampling step is executed sequentially, there exists a trade-off between the number of processors and the estimation error: as the number of particle groups increases, so does the degradation of the filter [14]. In order to reduce this performance degradation, a particle exchange must be performed among processors. In [15] an optimization of the particle exchange procedure is presented. A formal analysis, applying the Kullback-Leibler divergence, proves that exchanging the particles with the largest weights between adjacent processors results in better accuracy than random particle mixing. In [14] this exchange is performed after resampling, thus the selection of the particles with the largest weights is avoided. The analysis of algorithm parallelization has been done in [14], allowing the selection of an optimal configuration. Once one filter iteration has been performed, the estimates of the processors are combined in order to provide a global estimate [15].

The system consists of two modules: the measurement unit and the processing unit. The system block diagram is shown in Fig. 2-a. The measurement unit sets up the RSSI value and computes the reciprocal of the noise variance. The processing unit performs the PF algorithm and provides an estimated position.

In order to process thousands of particles in real time, the processing unit architecture must exploit data level parallelism and at the same time take into account the strategy described above. A hierarchy of parallelism levels is adopted. The first level is obtained by introducing multiple processing elements (PEs), each one performing the PF algorithm steps that do not present data dependency. The second level consists of gathering PEs in clusters, so that the data input for the resampling step is made up of the processed particle and weight of each PE inside a cluster. For the final position estimate, the estimates of the clusters are combined as previously mentioned. Particle exchange among clusters is also performed.

The proposed VLSI design implements the most area-consuming operations in external (out of the array) look-up tables (LUTs). These LUTs are taken out of the processing element dataflow and placed as global resources. For each table there is a Broadcast module that sequentially reads the table and performs interpolation. The interpolated value and the interpolation address are broadcast to all PEs through buses. Each PE locally computes its required interpolation address and compares it with the current value on the bus. If they match, the corresponding data value is acquired by the PE.
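The broadcast-and-capture mechanism can be illustrated with a behavioural sketch. The function `broadcast_sweep` and its arguments are hypothetical names; the model assumes one address/value pair is placed on the buses per cycle and that each PE compares addresses with a bitwise XOR, which is one straightforward way to detect a match.

```python
def broadcast_sweep(table, wanted_addresses):
    """Simulate one full sweep of a shared table over the broadcast buses.

    table[a] is the value broadcast together with address a (one per cycle).
    wanted_addresses[pe] is the address that processing element pe waits for.
    Returns the value captured by each PE and the cycle in which it was captured.
    """
    captured = [None] * len(wanted_addresses)
    capture_cycle = [None] * len(wanted_addresses)
    for cycle, value in enumerate(table):          # cycle doubles as the bus address
        for pe, wanted in enumerate(wanted_addresses):
            # XOR of equal words is zero: the PE latches the data bus
            if captured[pe] is None and (cycle ^ wanted) == 0:
                captured[pe] = value
                capture_cycle[pe] = cycle
    return captured, capture_cycle
```

Because every PE listens to the same sweep, adding PEs does not add table hardware; the cost is that a PE may wait for up to a full sweep before its address appears.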

Figure 2-b shows a more detailed architecture of the processing unit. It has 4 clusters with 4 PEs each. The sensor measurement and the reciprocal of its variance, $1/\sigma^2$, are communicated to all PEs. Four global resources are introduced: the Square, Sqrt, Sensor and Normal LUTs. Each broadcast module has two independent buses: an interpolation address bus and a data bus. Resampling, pseudo-random number generator (PRNG) and Word-to-memory modules inside each cluster are also introduced. All modules are explained in the following subsections. Communication among clusters is not shown, to simplify the diagram.

Each cluster has its own local memory and works without data dependence on the others, except when the particle exchange is performed. Processing elements belonging to a cluster share local memory.

Regarding control logic, each cluster has its own control logic that manages main memory reading and writing as well as global control signals. Furthermore, each processing element in a cluster has a dataflow pipeline whose control is distributed. Since each pipeline stage has a variable delay that depends on the instant when the corresponding value is present on the data bus, global pipeline control is not feasible. Therefore, each stage has local control logic driven by data events.

B. Cluster Operation

Cluster operation proceeds as follows: while in execution, each PE inside a cluster reads a particle from memory. Each Broadcast module sequentially reads its corresponding LUT, interpolates, and broadcasts the interpolated value and the interpolation address to all PEs. Since the PE dataflow is pipelined, a single table read is used to process several particles. Main memory has two ports, so memory reading and writing are performed simultaneously.

Two arrays, one made up of particles and another one of processed weights from each PE, are the input for the Resampling module. Once the arrays have been completely updated, resampling is performed. The elements of the resampling arrays are processed sequentially. As soon as one element is resampled, it is immediately updated by the corresponding PE.

Once all data from local memory has been processed, communication among clusters is performed.

Fig. 2: a) Block diagram of the VLSI architecture for the proposed tracking system, b) architecture of the processing unit.

C. LUTs design

The functions implemented in the LUTs are: square, square root, two-ray propagation model (as shown in Fig. 1) and normal distribution. All of them are evaluated with a piecewise linear function with uniform segmentation. By performing interpolation, a reduction in table size is achieved. At a point $x \in [a, b]$, a linear interpolation is calculated as follows:

$$
f(x) = \frac{f(b) - f(a)}{b - a} \cdot (x - a) + f(a) \tag{3}
$$

This operation is performed by the broadcast module shown in Fig. 3. A counter generates $2^{N+M}$ words, where the $N$ most significant bits are used for LUT addressing and the remaining $M$ bits for interpolation.
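A floating-point sketch of Eq. (3) with this address split is given below. The helper names `build_lut` and `interpolate` are hypothetical, and the table here stores one extra end-point entry ($2^N + 1$ values) so that the last segment can be interpolated; how the hardware handles that end point is not stated in the text, so this is an assumption.

```python
def build_lut(func, x_min, x_max, n_bits):
    """Tabulate func at the 2**n_bits + 1 segment boundaries of [x_min, x_max]."""
    segments = 1 << n_bits
    step = (x_max - x_min) / segments
    return [func(x_min + k * step) for k in range(segments + 1)]

def interpolate(lut, word, m_bits):
    """Evaluate Eq. (3) for an (N + M)-bit counter word.

    The N most significant bits select the segment [a, b]; the M least
    significant bits give (x - a) / (b - a) in steps of 1 / 2**m_bits.
    """
    index = word >> m_bits
    frac = word & ((1 << m_bits) - 1)
    fa, fb = lut[index], lut[index + 1]
    return fa + (fb - fa) * frac / (1 << m_bits)
```

For example, with the Square setup of Table I ($N = 9$, $M = 2$, range [0, 40]), `interpolate(build_lut(lambda x: x * x, 0.0, 40.0, 9), word, 2)` approximates $x^2$ at $x = 40 \cdot word / 2^{11}$, before any output quantization is applied.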

The introduced dataflow is composed of several tabulated functions and interpolations in cascade. When the interpolated value from a broadcast module is captured by the corresponding pipeline stage, it becomes the interpolation address for the next tabulated function. It is desirable to find an appropriate word length for LUT addressing, interpolation and function value quantization. This length should maximise the ratio between the interpolation address word length and the interpolated value word length. At the same time, the approximation errors should be reduced, since they are propagated through the dataflow. In this regard, the accuracy analysis introduced in [16] for practical implementation of piecewise linear functions is adopted. Table I shows the setup chosen for each piecewise linear function implementation, where $N$, $M$ and $Q$ are the number of bits assigned to segmentation, interpolation and function value quantization, respectively. $R$ and $S$ are the output data resolution and the number of discarded input bits. The error introduced by each interpolation, which is calculated as the mean of the absolute relative error over one thousand samples of the evaluation interval, i.e.,

$$
error = \mathrm{mean}\left( \left| \frac{f(x) - f_{interp}(x)}{f(x)} \right| \right) \tag{4}
$$

is also included in the table.

Fig. 3: Broadcast Module

TABLE I: Piecewise Linear Function Setup

Function   N    M    Q    R    S    Size [Kbits]   Range X     Interp. Error
Square     9    2    14   17   -    7              [0, 40]     5 · 10^-4
Sqrt       10   2    11   13   5    11             [0, 3200]   3 · 10^-4
Sensor     10   2    10   12   1    10             [0, 113]    4 · 10^-4
Normal     9    1    10   11   3    5              [0, 5]      5 · 10^-3
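As a quick consistency check on Table I, the Size column appears to equal $2^N \cdot Q$ bits expressed in Kbits (1 Kbit = 1024 bits); this relation is inferred from the numbers and is not stated in the text. The sketch below checks it for the four rows and also gives a plain implementation of the error metric of Eq. (4); all names are hypothetical.

```python
# (N, Q, size in Kbits) copied from Table I
TABLE_I = {
    "Square": (9, 14, 7),
    "Sqrt": (10, 11, 11),
    "Sensor": (10, 10, 10),
    "Normal": (9, 10, 5),
}

def lut_size_kbits(n_bits, q_bits):
    """Storage of a table with 2**n_bits entries of q_bits each, in Kbits."""
    return (2 ** n_bits) * q_bits / 1024

def mean_relative_error(f, f_interp, xs):
    """Eq. (4): mean of |(f(x) - f_interp(x)) / f(x)| over the samples xs.

    Samples where f(x) == 0 are skipped to avoid dividing by zero.
    """
    terms = [abs((f(x) - f_interp(x)) / f(x)) for x in xs if f(x) != 0]
    return sum(terms) / len(terms)
```

Under this reading, the four tables together take 33 Kbits, and they are instantiated once rather than once per PE.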

The normal distribution implementation requires the evaluation of normal distributions with different values of variance. Any normal distribution can be obtained from the standard normal distribution. If a distribution with mean $\mu$ and variance $\sigma^2$ must be evaluated for a value $t$, the following equations allow the calculation using only the standard normal distribution function:

$$
z = \frac{t - \mu}{\sigma}, \tag{5}
$$

$$
p_{Normal} = \frac{1}{\sigma} \cdot p_{StandardNormal}(z), \tag{6}
$$

where

$$
p_{StandardNormal}(z) = \frac{1}{\sqrt{2 \cdot \pi}} \cdot \exp\left(-\frac{z^2}{2}\right). \tag{7}
$$

Moreover, as the function is symmetric around the mean, only half of the evaluation interval needs to be stored, reducing the LUT size even further.
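Equations (5) to (7) together with the symmetry argument can be sketched as below. The half-table covers $z \in [0, 5]$ with $2^9$ segments, following the Normal row of Table I; the function names, the floating-point arithmetic and the choice to return zero beyond the table range are assumptions of this sketch.

```python
import math

def std_normal_pdf(z):
    """Eq. (7): standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def build_half_table(z_max=5.0, n_bits=9):
    """Tabulate Eq. (7) on [0, z_max] only; negative z uses symmetry."""
    step = z_max / (1 << n_bits)
    return [std_normal_pdf(k * step) for k in range((1 << n_bits) + 1)], step

def normal_pdf_from_table(t, mu, inv_sigma, table, step):
    """Evaluate N(mu, sigma^2) at t using the half-table and 1/sigma."""
    z = abs((t - mu) * inv_sigma)                 # Eq. (5), folded by symmetry
    if z >= step * (len(table) - 1):
        return 0.0                                # outside the tabulated range
    k = min(int(z / step), len(table) - 2)        # segment index
    frac = z / step - k                           # position inside the segment
    value = table[k] + (table[k + 1] - table[k]) * frac
    return inv_sigma * value                      # Eq. (6)
```

Passing the reciprocal `inv_sigma` instead of `sigma` mirrors the fact that the measurement unit supplies a reciprocal, so the PE needs multipliers but no divider.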

The architecture comprises dual-port memories, thus the two values needed for interpolation can be obtained simultaneously.

Fig. 4: PE Micro-architecture

D. PE Micro Architecture

Each PE sequentially performs the two algorithm steps that do not present data dependency: sampling and update. Processing is divided into several modules in order to implement a module-level pipeline: Sampling, Acquisition Square Value, Acquisition Sqrt Value, Acquisition Sensor Value and Acquisition Normal Value. Figure 4 shows the pipelined dataflow microarchitecture.

1) Sampling Unit: The sampling unit processes data from the current main memory location. The memory word is 48 bits wide, with 12 bits for each particle component. The ranges for position and velocity are [−40, 40] m and [−25, 25] m/s, respectively. This unit performs a translation in the plane by using a simplified version of the dynamic model detailed in (1). This simplification allows a reduction in the number of multiplications. For this design the dynamic model is fixed, but future designs will consider a programmable model. The translated positions and velocities are computed as follows:

$$
p_x(k) = p_x(k-1) + v_x(k-1) \cdot \Delta T + \frac{1}{2} \cdot n_x \tag{8}
$$

$$
p_y(k) = p_y(k-1) + v_y(k-1) \cdot \Delta T + \frac{1}{2} \cdot n_y \tag{9}
$$

$$
v_x(k) = v_x(k-1) + n_x \tag{10}
$$

$$
v_y(k) = v_y(k-1) + n_y \tag{11}
$$

where $n_x$ and $n_y$ are drawn from a uniform distribution $U[0, W]$. Depending on the value of the $\Delta T$ parameter, the $W$ value should be adjusted in order to provide accelerations similar to those of the original model. The random noise is generated by a 16-bit linear feedback shift register (LFSR) [17] with internal XORs and a reconfigurable seed. This pseudo-random number generator is a shared resource inside a cluster. Each PE takes a number at its corresponding turn. The eight most significant bits are used for the $n_x$ component and the eight least significant bits for the $n_y$ component. Each noise component is pre-multiplied by the variance value $Q$. Both the $Q$ and $\Delta T$ registers are programmable and 8 bits long.

The output of the sampling unit has the same data width as its input.
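The sampling step of Eqs. (8) to (11) and the shared noise source can be modelled as in the sketch below. The LFSR feedback mask `0xB400` (polynomial $x^{16} + x^{14} + x^{13} + x^{11} + 1$) is a commonly used maximal-length choice picked here as an assumption, since the text does not give the polynomial; the scaling of the 8-bit fields to $[0, W]$ and all function names are likewise illustrative.

```python
def lfsr16_step(state, mask=0xB400):
    """Advance a 16-bit Galois LFSR (internal XORs) by one step.

    mask=0xB400 is an assumed maximal-length tap set, not taken from the paper.
    """
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= mask
    return state

def noise_pair(word, w_max):
    """Split a 16-bit word into (n_x, n_y), each scaled to [0, w_max]."""
    nx = (word >> 8) / 255.0 * w_max      # eight most significant bits
    ny = (word & 0xFF) / 255.0 * w_max    # eight least significant bits
    return nx, ny

def sample(p, dt, nx, ny):
    """Eqs. (8)-(11) for a particle p = (px, vx, py, vy)."""
    px, vx, py, vy = p
    return (px + vx * dt + 0.5 * nx,
            vx + nx,
            py + vy * dt + 0.5 * ny,
            vy + ny)
```

In a cluster, each PE would call `lfsr16_step` in turn on the shared state, so consecutive PEs see different noise words from a single generator.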

2) Acquisition Value Units: All acquisition units detect when their data input is equal to the current value on the interpolation address bus. This detection is performed with a bitwise XOR operation. When a match is detected, the data present on the data bus is acquired.

The Acquisition Square Value unit computes the sum of the squared inputs. When x or y is negative, its two's complement is taken. Thus |x| and |y| have an 11-bit word length and are compared to the interpolation address bus. Once the squared value has been captured for both components, the sum is computed with a 17-bit output data width. The broadcast module for the Sqrt function provides a 12-bit interpolation address bus. Therefore, the 5 least significant bits of $x^2 + y^2$ are discarded when the Acquisition Sqrt Value unit compares its data input with the value present on the interpolation address bus. The same occurs in the Acquisition Sensor Value block, where the least significant bit of the input word is discarded.

The Acquisition Normal Value unit generates a word using (5) with $\mu$ equal to $Pot_{measurement}$. Once a match is detected, the data present on the bus is acquired. Finally, it is multiplied by the reciprocal of the standard deviation as stated in (6). The reciprocal of the variance is 8 bits wide, as is the power measurement. In order to perform the subtraction in (5), the 5 least significant bits of the input word are discarded. The word length after this equation is 16 bits. According to Table I, the tabulated normal function requires 1 interpolation bit, therefore the 6 least significant bits are not taken into account, resulting in a 10-bit word length. Once the data value is captured by the PE, it is multiplied by the reciprocal of the variance, resulting in a 19-bit word length.

E. Resampling unit

The resampling algorithm selected for implementation is the modified Independent Metropolis-Hastings (IMH) algorithm [12], which replaces the division operation with a comparison and processes particles and their weights sequentially. The algorithm is summarized in the following pseudocode:

    w_prev = w_k^1;
    for i ← 2 to NUMPARTICLES do
        u ∼ U(0, 1);
        if (u · w_prev > w_k^i) then
            w_prev = w_prev; resample = 1;
        else
            w_prev = w_k^i; resample = 0;
        end
    end

Algorithm 1: Implemented resampling algorithm
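Algorithm 1 can be written as a short software model. The sketch keeps a "previous particle" register next to `w_prev`, following the description of the particle register in the text; the function name `imh_resample` and the use of Python's `random.Random` in place of the hardware LFSR are assumptions.

```python
import random

def imh_resample(particles, weights, rng):
    """Sequential IMH-style resampling without divisions (Algorithm 1).

    When u * w_prev > w_i, the previously accepted particle is replicated
    (resample = 1); otherwise the current particle is accepted and becomes
    the new reference (resample = 0).
    """
    prev_particle, prev_weight = particles[0], weights[0]
    out = [prev_particle]                 # the first particle is always kept
    for particle, weight in zip(particles[1:], weights[1:]):
        u = rng.random()                  # u ~ U(0, 1)
        if u * prev_weight > weight:
            out.append(prev_particle)     # resample = 1
        else:
            prev_particle, prev_weight = particle, weight
            out.append(particle)          # resample = 0
    return out
```

The comparison `u * w_prev > w_i` is equivalent to the usual acceptance test `u > w_i / w_prev` for positive weights, which is how the division is avoided. Each particle is visited once, so a cluster with P particles in its resampling arrays needs P sequential steps.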

Figure 5 shows the architecture of the resampling and word-to-memory modules. The particle array is filled with the output particles from the sampling unit. Both arrays, the particle array and the weight array, must be fully updated before the resampling operation starts. The first particle of the whole set is always resampled. Subsequent particles will be stored in memory depending on the comparison between their weight and $w_{prev}$. The random number generator is implemented with a 16-bit LFSR.

The resample signal controls the data stored in memory. If the value of resample is 1, the data present in the particle register is written to memory; otherwise the particle currently being processed is selected and $w_{prev}$ is updated.

In order to synchronize the translated particle with the pipeline time schedule, it must be delayed by as many stages as there are between the sampling unit and the Acquisition Normal Value unit. Each PE reads a particle from a memory location and, once the particle is resampled, the word-to-memory unit stores it at the same location. Since a dual-port memory is considered and the architecture is pipelined, memory reading and writing are done simultaneously. Control is achieved with a read address counter and a write address counter. The former depends on control signals from the sampling unit and the latter on control signals from the word-to-memory unit.

Fig. 5: Word-To-memory Architecture

IV. EXECUTION TIME

Since the execution time of each module is variable, each PE will complete its processing at a different time. The resampling module begins operation when the first PE has finished processing its particle. Figure 6 shows the execution time of the dataflow for a cluster made up of two processing elements. The pipeline delay between output data values is given by the slowest stage. In the presented design this corresponds to the stage with the widest interpolation address bus, since it takes $2^{N+M}$ cycles to acquire the last interpolated value. This is the case of the Sqrt function, for which $N + M = 12$. In the worst case, a new particle is processed every 4096 cycles. As resampling takes one cycle to process each particle, the number of cycles needed to finish the resampling operation depends on the number of PEs in a cluster. Therefore, the last element of the resampling array will be updated every $2^{N+M} + P$ cycles, where $P$ is the number of PEs in the cluster.
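The worst-case figures above can be turned into a rough throughput estimate. The sketch below assumes that every PE delivers one particle per $2^{N+M} + P$ cycles and that clusters run independently; the 100 MHz clock in the usage example is purely hypothetical, since no clock frequency is given in the text.

```python
def cycles_per_update(n_bits, m_bits, pes_per_cluster):
    """Worst-case cycles between updates: full table sweep plus resampling."""
    return 2 ** (n_bits + m_bits) + pes_per_cluster

def particles_per_second(n_bits, m_bits, pes_per_cluster, clusters, clock_hz):
    """Rough worst-case throughput under the assumptions stated above."""
    total_pes = pes_per_cluster * clusters
    return total_pes * clock_hz / cycles_per_update(n_bits, m_bits, pes_per_cluster)
```

For the 4 x 4 configuration of Fig. 2-b and the Sqrt table ($N = 10$, $M = 2$), `cycles_per_update(10, 2, 4)` gives 4100 cycles, so the sweep length dominates and adding PEs raises throughput almost linearly.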

V. RESULTS

A. Simulation Results

A VHDL RTL model of the processing element was developed. The implementation flow was the following: first, a fixed point Matlab implementation of the processing element described above was generated and compared to its floating point counterpart to verify its proper operation. Second, an RTL model that matches the fixed point Matlab implementation was developed. At this stage of the implementation, the RSSI measurement, the reciprocal of the noise standard deviation and the position estimate are generated off-line.

Fig. 6: Filter execution time.

Figure 7 shows the distribution of weights vs. distance for the floating point Matlab implementation and the RTL model, in the case where the measured power is −55.42 dBm and σ = 2.5 dBm. Since RSSI measurements are 8-bit quantized, the normal distribution is also quantized. It can be noticed that the RTL model provides results similar to those of the floating point Matlab implementation.

Fig. 7: Weights vs. distance (Matlab model vs. RTL model; weight vs. distance [m]).

A 2-D tracking scenario can be simulated to show the dynamic performance. In this case, the fixed point Matlab implementation is used instead of the RTL model in order to reduce simulation time. The scenario is composed of a unit moving at nearly constant velocity and three fixed antennas s1, s2 and s3 placed at positions [0, 0], [−20, 0] and [0, 20], respectively. The position of the target unit evolves with time according to (1).

The mobile initial state is x0 = [−8 m, 12 m/s, 10 m, −2 m/s] and ∆T = 0.1 s. The total number of particles used is 4096, which are uniformly distributed over a region delimited by the intervals [−20 m, 20 m] and [0, π] radians at the beginning of the simulation. Particle velocities have been randomly initialised with a uniform distribution in the interval [7, 17] for x and [−3, 7] for y.

Figure 8 shows the trajectory of the target unit (green line) and the simulation results for the floating point Matlab model and the fixed point Matlab model, which stands in for the RTL model here, in red and black lines, respectively. Both models provide very close results.

Fig. 8: Tracking of a moving target with three antennas (legend: floating point Matlab model, fixed point Matlab model, target trajectory, antenna).

TABLE II: Synthesis Results

Module               Area [µm²]
Sampling             37453
Acq. Square Value    6144
Acq. Sqrt Value      2268
Acq. Sensor Value    1932
Acq. Normal Value    13293
Total Area PE        87086

B. Synthesis Results

The RTL model of the processing element described in

Section III was synthesized using Synopsys Design Compiler and a 0.13 µm CMOS technology. Since the array is composed of several processing elements, it is desirable to know the area required for this basic unit. Table II shows the area of the

processing element and its modules.
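As a rough illustration of how the figures in Table II can be used for area budgeting, the sketch below computes the share of each listed module and scales the processing element area to an array. The module and total areas are those of Table II. The array size of 64 PEs is a hypothetical value, and the estimate covers replicated PEs only, ignoring global resources, memories and interconnect.

```python
# Areas in square micrometres, from Table II.
PE_AREA_UM2 = 87086
MODULE_AREA_UM2 = {
    "Sampling": 37453,
    "Acq. Square Value": 6144,
    "Acq. Sqrt Value": 2268,
    "Acq. Sensor Value": 1932,
    "Acq. Normal Value": 13293,
}

# Area of the PE that is not attributed to any module listed in the table.
listed_um2 = sum(MODULE_AREA_UM2.values())
other_um2 = PE_AREA_UM2 - listed_um2

for name, area in MODULE_AREA_UM2.items():
    print(f"{name:18s} {area:6d} um^2  {100.0 * area / PE_AREA_UM2:5.1f} %")
print(f"{'Other logic':18s} {other_um2:6d} um^2  "
      f"{100.0 * other_um2 / PE_AREA_UM2:5.1f} %")


def array_area_mm2(n_pe):
    """Area of n_pe replicated PEs in mm^2 (1 mm^2 = 1e6 um^2)."""
    return n_pe * PE_AREA_UM2 / 1e6


N_PE = 64  # hypothetical array size, not taken from the paper
print(f"{N_PE} PEs: {array_area_mm2(N_PE):.3f} mm^2")
```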

VI. CONCLUSIONS

A VLSI architecture for particle filtering in real time was

presented. This architecture exploits the data level parallelism

in the algorithm and also takes into account performance

degradation due to resampling parallelization. Introducing

global resources allows an increase in concurrent hardware.

Processing dataflow was described along with a piecewise linear function implementation. An RTL model of the proposed design was generated. Simulation shows that the architecture correctly implements the PF adapted to the specific application. Further work is needed to choose an optimal number of

PEs per cluster.

VII. ACKNOWLEDGMENTS

This work was partially supported by PICT 2010-2657, “3D Gigascale Integrated Circuits for Nonlinear Computation, Filter and Fusion with Applications in Industrial Field Robotics”, of Agencia Nacional de Promoción Científica y Tecnológica (ANPCyT) of the Argentine Ministry of Science and Technology (MINCYT).

REFERENCES

[1] N. Gordon, D. Salmond, and A. F. M. Smith, “Novel approach to nonlinear/non-Gaussian Bayesian state estimation,” IEE Proc. of Radar and Signal Processing, vol. 140, no. 2, pp. 107–113, 1993.

[2] M. Isard and A. Blake, “Condensation - conditional density propagation for visual tracking,” International Journal of Computer Vision, vol. 29, no. 1, pp. 5–28, 1998.

[3] D. Fox, “KLD-sampling: Adaptive particle filters and mobile robot localization,” in Advances in Neural Information Processing Systems 14, vol. 2, 2001, pp. 713–720.

[4] C. Kwok, D. Fox, and M. Meila, “Real-time particle filters,” Proceedings of the IEEE, vol. 92, no. 3, pp. 469–484, Mar 2004.

[5] M. Bolic, P. M. Djuric, and S. Hong, “Resampling algorithms and architectures for distributed particle filters,” IEEE Transactions on Signal Processing, vol. 53, no. 7, pp. 2442–2450, July 2005.

[6] A. C. Sankaranarayanan, A. Srivastava, and R. Chellappa, “Algorithmic and architectural optimizations for computationally efficient particle filtering,” IEEE Transactions on Image Processing, vol. 17, no. 5, pp. 737–748, May 2008.

[7] S.-S. Chin and S. Hong, “VLSI design of high-throughput processing element for real-time particle filtering,” in Signals, Circuits and Systems, vol. 2, 2003, pp. 617–620.

[8] S. Hong, S. S. Chin, M. Bolic, and P. M. Djuric, “Design and implementation of flexible resampling mechanism for high-speed parallel particle filters,” Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol. 44, pp. 47–62, 2006.

[9] G. Kloos, J. E. Guivant, E. M. Nebot, and F. Masson, “Range based localisation using RF and the application to mining safety,” in Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, Oct 2006, pp. 1304–1311.

[10] S. Sanudo and F. R. Masson, “Desempeño del filtro de partículas acotado en una aplicación de localización y seguimiento de camiones en una explotación minera,” in XIV Reunión de Trabajo en Procesamiento de la Información y Control, vol. 1, 2011, pp. 712–717.

[11] H. Xia, H. L. Bertoni, L. Maciel, A. Lindsay-Stewart, and R. Rowe, “Radio propagation characteristics for line-of-sight microcellular and personal communications,” IEEE Transactions on Antennas and Propagation, vol. 41, no. 10, pp. 1439–1447, Oct 1993.

[12] L. Miao, J. J. Zhang, C. Chakrabarti, and A. Papandreou-Suppappola, “Algorithm and parallel implementation of particle filtering and its use in waveform-agile sensing,” Signal Processing Systems, vol. 65, no. 2, pp. 211–227.

[13] M. Bolic, P. M. Djuric, and S. Hong, “Resampling algorithms for particle filters: A computational complexity perspective,” EURASIP Journal on Applied Signal Processing, vol. 15, pp. 2267–2277, 2004.

[14] A. Pasciaroni, S. Sanudo, J. Rodriguez, F. Masson, and P. Julian, “Modelling and analysis of parallel particle filters,” in XV Reunión de Trabajo en Procesamiento de la Información y Control, vol. 1, no. 1, 2013, pp. 1–6.

[15] B. Balasingam, M. Bolic, P. Djuric, and J. Miguez, “Efficient distributed resampling for particle filters,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 3772–3775.

[16] O. Lischitz, P. Julian, J. Rodriguez, and O. Agamennoni, “Accuracy analysis for an on-chip digital PWL realization,” in XIV Reunión de Trabajo en Procesamiento de la Información y Control, 2011, pp. 429–434.

[17] Z. Barzilai, D. Coppersmith, and A. L. Rosenberg, “Exhaustive generation of bit patterns with applications to VLSI self-testing,” IEEE Transactions on Computers, vol. C-32, no. 2, pp. 190–194, Feb 1983.