fusion between camera and lidar for autonomous surface...

Fusion between camera and lidar forautonomous surface vehicles

Vegard Kamsvåg

Master of Science in Cybernetics and Robotics

Supervisor: Edmund Førland Brekke, ITKCo-supervisor: Geir Hamre, DNV GL

Department of Engineering Cybernetics

Submission date: July 2018

Norwegian University of Science and Technology

Dedicated to my parents, for their unconditional love and support.

Problem Description

Autonomous surface vehicles (ASVs) can use a variety of exteroceptive sensors for sense-and-avoid and navigation purposes. For the sake of redundancy and robustness the ASVshould be equipped with both active (e.g., radar or lidar) and passive sensors (e.g., cam-era). While radar is required for larger vessels operating in open sea, lidar is a promisingalternative for smaller ASVs operating in confined environments, both due to smaller sizeand higher resolution.

The purpose of this project is to develop a sensor fusion system for lidar and optical cam-era to be used for sense-and-avoid purposes onboard vehicles such as DNV GLs ReVoltscale-model or NTNUs autonomous ferry. The project builds on a 5th year specializationproject written during the Autumn 2017, where data acquisition, registration and basicprocessing up towards measurement fusion were studied. The Masters thesis shall addressthe following tasks:

1. Installation of lidar and camera onboad test vehicle. DNV GLs ReVolt or MaritimeRobotics Telemetron are possible candidates.

2. Calibration of camera.

3. ROS-based software architecture for data reception from the two sensors, synchro-nized with the inertial navigation system of the vehicle.

4. Formulation of suitable measurement models for data from the two exteroceptivesensors.

5. Exploration of a multi-target tracking method that performs measurement-level fu-sion using measurements from both lidar and camera. This may use existing imple-mentations of JIPDA or MHT as its starting point.

6. Particular attention should be given to track initiation: examine to what extent twodifferent sensors can give more reliable track initiation decisions than one sensor.

7. Analyze strengths and weaknesses of the two sensors with basis in tracking results.

i

PrefaceThis thesis marks the conclusion of the authors Masters degree in Cybernetics and Roboticsat the Norwegian University of Science and Technology (NTNU). I would like to thank mysupervisors Edmund F. Brekke at NTNU and Geir Hamre at DNV GL for their guidance,support and helpful and enlightening discussions during the work of this thesis. In additionto my supervisors, I would like to thank Tom Arne Pedersen and Rune Green, both DNVGL, for driving the target boats during the data gathering, and my good friend and fellowstudent Albert Havnegjerde for his assistance with both the data gathering as well as beinga good discussion partner. I would also like to thank all the guys at the ITK workshop, foralways being helpful and friendly, and lending me tools and equipment when needed.

The original plan for this thesis was to cooperate with the Trondheim-based companyMaritime Robotics, and use their in-house developed 360 degree camera rig as a sensor.Due to a fire in their office spaces in early April where the camera rig was lost to theflames, the thesis was forced down a different path. While perhaps unfortunate, this justgoes to show how uncertainty is something we have to live with, which is also a centraltheme in this thesis. In this regard, I would like to thank Marco Leonardi at NTNU forbeing so kind as to lending me his camera on short notice, such that this thesis could becompleted.

This thesis is a continuation of the authors specialization project during the fall of 2017[1]. The convolutional neural network used for detecting boats in images is a direct imple-mentation of a network trained by Espen Tangstad during his masters thesis in the springof 2017 (see section 4.5). The target tracking system is based on a unpublished MATLABimplementation of the JIPDA filter, provided by Edmund Brekke. The MATLAB imple-mentation has been expanded upon by the author to include methods for track initiationand termination, and was implemented on real data gathered by the author. This thesiswork ties together the thus far unrelated works of Tangstad, the tracking implementation,and the navigation system on the ReVolt model ship in a full target tracking pipeline. Thederivations of the JPDA and JIPDA filters is partly based on unpublished lecture notesfor a future course on sensor fusion at NTNU, and can be provided by Edmund Brekkeupon request. This source is referenced as [2]. The sensor drivers and calibration softwareused are open-source ROS packages freely available online, and is not the authors originalwork. The author planned, organized and executed a series of experiments in a harbourenvironment, generating a data set which is not only used in this thesis, but hopefully willbe valuable in future research as well.

Vegard KamsvagTrondheim, July, 2018

iii

AbstractThe development of autonomous surface vessels (ASVs) has seen great progress in thelast few years, and are on the verge of becoming a reality. Sensing the environment in areliable way is a key element in making a ship fully autonomous. The sensors needed tomake a ship fully autonomous exist today, but the challenge remains to find the optimalway to combine them.

An ASV operating in a urban environment might need different exteroceptive sensors thana vessel operating at sea. Optical cameras and lidars (light detection and ranging) aresuitable candidates for close-range sensing of the environment. Different sensors havedifferent strengths and weaknesses, and in order to build a coherent world image on whiche.g. sense-and-avoid decisions can be based on, information from the different sensorsneed to be fused and included into the state estimation of surrounding vessels. This isdone using a target tracking system.

In this thesis, a target tracking framework based on the JIPDA filter is implemented andtested on real data gathered during a series of experiments in a harbour environment. Mea-surement models for both the lidar and the camera are formulated, and the sensors aregeometrically calibrated and integrated with the navigation system onboard the ReVoltmodel ship. The data from the lidar are clustered using a slightly modified version of theDBSCAN algorithm. Sensor measurements from a number of different scenarios with twomaneuvering targets are recorded, and the targets are tracked with the JIPDA filter usingthe lidar sensor as the primary sensor.

The results show that wakes behind the targets lead to many false tracks in close vicinityto the ReVolt model ship. The results also show that the detection probability of targets atrange is reduced due to the spread of the laser beams. The presence of wakes did not leadto track loss for the true targets. At ranges where the targets were steadily detected, thetargets were successfully tracked with few false tracks. It was also found that tracks can belost due to occlusions, where one of the targets block the other target from being detected.The implemented Faster R-CNN detector showed limited range, where the detections atranges over 20 meters are few and far between. At close ranges however, it shows thepotential to be used in mitigating false tracks due to wakes or clutter, and could be used toaid in track formation and confirmation.

v

SammendragUtviklingen av autonome overflatefartøy (ASV) har sett stor fremgang de siste arene, oger i ferd med bli en virkelighet. Sansing av omgivelsene pa en palitelig mate er et vik-tig element i a gjøre skip fullstendig autonome. Sensorene som skal til for a gjøre skipautonome fins pa markedet i dag, utfordringen som gjenstar er a kombinere disse pa enoptimal mate.

Et ASV som opererer i urbane omgivelser kan ha behov for andre eksteroseptive sensorerenn et skip som opererer til havs. Optiske kameraer og lidarer (light detection and rang-ing, lys-deteksjon og avstandsmaling) er lovende kandidater for sansing av omgivelsenepa nært hold. Forskjellige sensorer har forskjellige styrker og svakheter, og for a kunnegenerere et sammenhengende verdensbilde som sanse-og-unnga-avgjørelser kan baserespa ma informasjonen fra de forskjellige sensorene fusjoneres og inkluderes i tilstandses-timeringen til fartøy i omgivelsene. Dette gjøres ved a bruke et malfølgingssystem.

I denne oppgaven er et malfølgingssystem basert pa JIPDA-filteret implementert og testetpa reelle data samlet inn under en rekke eksperimenter gjennomført i et havnemiljø.Malemodeller for bade lidaren og kameraet er formulert, og sensorene er geometriskkalibrert og integrert i det eksisterende navigasjonssystemet pa ReVolt skalamodellen.Dataene fra lidaren klynges sammen ved hjelp av en lettere modifisert versjon av DBSCAN-algoritmen. Sensordata fra en rekke forskjellige scenarioer med to manøvrerende mal ertatt opp, og malene følges ved hjelp av JIPDA-filteret med lidaren som primær sensor.

Resultatene viser at bølger i kjølvannet bak malene fører til mange falske spor framalfølgingssystemet i nærheten til ReVolt. Resultatene viser ogsa at deteksjonssannsyn-ligheten pa større avstander reduseres pa grunn av spredingen av laserstralene til lidaren.Kjølvannsbølger fra malene førte ikke til at malfølgingssystemet mistet malene. Pa avs-tander hvor malene ble palitelig detektert greide systemet a spore malene, med fa falskespor. Det viste seg ogsa at spor kan mistes pa grunn av skygging, hvor det ene malet skyg-ger for det andre og med det hindrer det fra a bli sett. Det implementerte Faster R-CNNnettverket for a detektere bater i bilder viste seg a ha begrenset rekkevidde, med fa detek-sjoner pa avstander over 20 meter. Pa nært hold viser det derimot potensial til a kunnebrukes til a motvirke falske spor pa grunn av bølger fra kjølvann eller annen støy, og kanbidra i sporformasjon og sporbekreftelse.

vii

Table of Contents

Abstract i

Preface iii

Abstract v

Sammendrag vii

Table of Contents ix

Abbreviations x

Notation xi

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1.1 The ReVolt Project . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Review of Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

ix

2 Sensors and Detection Fundamentals 7

2.1 Camera Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1.1 The Pinhole Camera Model . . . . . . . . . . . . . . . . . . . . 8

2.2 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.1 Convolutional Neural Nets for Computer Vision . . . . . . . . . . 14

2.3 Spatial Data Acquisition using 3D Lidar . . . . . . . . . . . . . . . . . . 18

2.3.1 The Velodyne VLP-16 . . . . . . . . . . . . . . . . . . . . . . . 18

3 State Estimation and Object Tracking 23

3.1 Review of Basic Probability . . . . . . . . . . . . . . . . . . . . . . . . 23

3.1.1 Probability Density Functions . . . . . . . . . . . . . . . . . . . 24

3.1.2 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.2 Bayesian State Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.2.1 Optimal Bayesian Filter . . . . . . . . . . . . . . . . . . . . . . 26

3.2.2 State Space Models . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2.3 Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2.4 Extended Kalman Filter . . . . . . . . . . . . . . . . . . . . . . 30

3.3 Object Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.3.1 The Probabilistic Data Association Filter . . . . . . . . . . . . . 32

3.3.2 JPDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.3.3 JIPDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4 Implementation 43

4.1 Sensors, Hardware and Processing Pipeline . . . . . . . . . . . . . . . . 43

4.1.1 Sensor Integration in ReVolt . . . . . . . . . . . . . . . . . . . . 44

4.1.2 Software Drivers and Synchronization . . . . . . . . . . . . . . . 46

x

4.2 Camera Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.3 Lidar-Camera Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.3.1 Calibration Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.4 Transformation to a Common World Frame . . . . . . . . . . . . . . . . 52

4.5 Visual Detection based on Faster R-CNN . . . . . . . . . . . . . . . . . 54

4.5.1 Faster R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.5.2 Training and validation data . . . . . . . . . . . . . . . . . . . . 55

4.5.3 Implementation Aspects . . . . . . . . . . . . . . . . . . . . . . 56

4.5.4 Image Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . 56

4.6 Lidar Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.6.1 The DBSCAN Algorithm . . . . . . . . . . . . . . . . . . . . . 59

4.6.2 The Modified Distance Parameter . . . . . . . . . . . . . . . . . 61

4.7 Target Tracking Framework . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.7.1 Motion and Sensor Models . . . . . . . . . . . . . . . . . . . . . 63

4.7.2 Fusing the Measurements . . . . . . . . . . . . . . . . . . . . . 65

4.7.3 Implementing JIPDA . . . . . . . . . . . . . . . . . . . . . . . . 66

5 Experimental Results and Discussion 71

5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.2 Tracker Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.2.1 Cardinality Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.2.2 Time Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5.2.3 Statistical Filter Consistency Measures . . . . . . . . . . . . . . 76

5.3 Tracking Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5.3.1 Scenario 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5.3.2 Scenario 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

xi

5.3.3 Scenario 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5.3.4 Scenario 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.3.5 Scenario 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.3.6 Scenario 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

5.3.7 NIS and Average Innovation . . . . . . . . . . . . . . . . . . . . 90

5.4 The Camera as a Complimentary Sensor . . . . . . . . . . . . . . . . . . 90

5.5 Discussion of the Results . . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.5.1 The Wake Problem . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.5.2 Point Targets versus Extended Targets . . . . . . . . . . . . . . . 95

5.5.3 The Range Limitations of the Lidar . . . . . . . . . . . . . . . . 96

5.5.4 Modeling Occlusions . . . . . . . . . . . . . . . . . . . . . . . . 96

5.5.5 Overall Tracking Results . . . . . . . . . . . . . . . . . . . . . . 96

6 Conclusions and Future Work 97

6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

6.2 Suggestions for Future Work . . . . . . . . . . . . . . . . . . . . . . . . 98

xii

AbbreviationsASV = Autonomous Surface VesselCCD = Charge-Coupled DeviceCPU = Central Processing UnitCNN = Convolutional Neural NetworkVOC = Visual Object ChallengeEKF = Extended Kalman FilterGPU = Graphical Processing UnitRAM = Random Access MemoryGNSS = Global Navigation Satellite SystemRTK = Real-Time KinematicINS = Inertial Navigation SystemNED = North-East-DownWGS = World Geodetic SystemPDAF = Probabilistic Data Association FilterJPDA = Joint Probabilistic Data AssociationIPDA = Integrated Probabilistic Data AssociationJIPDA = Joint Integrated Probabilistic Data Associationpdf = Probability Density FunctionROS = Robot Operating SystemRPN = Region Proposal NetworkRANSAC = RANdom SAmple ConsensusDBSCAN = Density Based Spatial Clustering of Applications with NoiseTCP = Transmission Control ProtocolIP = Internet ProtocolI/O = Input/OutputPoE = Power over EthernetUDP = User Datagram ProtocolSDK = Software Development Kit

xiii

Notation

Frequently Used Notation

Rba = Rotation matrix from frame a to frame bT ba = Homogeneous transformation matrix from frame a to frame btba = Translation vector from frame a to frame bC = Bold capital letter represents a matrixx = Bold lowercase letter represents a vectorχ2 = Chi-square distributionK = Camera intrinsic calibration matrix

Estimation and Tracking Notation

p(x) = Probability density function (pdf) for the random variable xp(x|y) = Pdf for x conditioned on yp(x, y) = Joint pdf for x and yP{·} = ProbabilityN (x;µ,Σ) = Normal distribution for random variable x, with mean µ and covariance Σxk = Prior state estimatexk = Posterior state estimatezk = Measurement predictionx0:k = State sequence from time step 0 to time step kzk = Measurementzk,i = Measurement izk,l = Lidar measurementzk,c = Camera measurementZk = Measurement set at time step kZ1:k = Cumulative measurement set up until time step kPk = Prior state covariance matrixPk = Posterior state covariance matrixPck = Covariance of the state updated with the correct measurement

Pk = Spread of the innovationsSk = Innovation covariance matrixWk = Kalman gainνk = Combined innovationνi,k = Innovation for measurement iγ = Validation gate thresholdV(k, γ) = Validation gate, with gate parameter γ

xv

mk = Number of validated measurementsµF (·) = Probability mass functionϕ = Number of false measurements in validation regionλ = Poisson distribution intensityβi,k = Association probability for measurement iβti,k = Marginal association probability for measurement i to track tptk(xtk) = Posterior Gaussian approximation for track tV = VolumePD = Probability of detectionPG = Gating probabilityτ = Vector containing individual track-wise detectionslt,ak(t) = Likelihood ratioxt,ak(t)k = Posterior event-conditional state

Pt,ak(t)k = Posterior event-conditional covariance

εtk = Prior existence probability for target tεtk = Posterior existence probability for target tηtk = Prior visibility probability for target tηtk = Posterior visibility probability for target tpεij = Markov chain coefficient for target existencepηij = Markov chain coefficient for target visibilityak = Association hypothesis¬Zk = Set of unassociated measurementswk = Process noisevk = Measurement noisevk,l = Lidar measurement noisevk,c = Camera measurement noiseRk = Measurement noise covariance matrixRl = Lidar measurement noise covariance matrixRc = Camera measurement noise covariance matrixQk = Process noise covariance matrixFk = State transition matrixFcv = Constant velocity model state transition matrixHk = Measurement matrixHl = Lidar measurement matrixh(xk) = Nonlinear measurement functionhc(xk) = Camera measurement functionσcv = Constant velocity model process noise strengthσ2l = Lidar measurement noise variancevmax = Velocity threshold used in track formationdL = Distance threshold used in track mergingT = Sample time

xvi

Chapter 1Introduction

1.1 Motivation

The development of autonomous vehicles has seen great progress in the last decade. Withadvances in technologies enabling perception of the surrounding environments, path plan-ning and real-time vehicle control, in combination with sophisticated sensors and increas-ing data processing performance, full vehicular autonomy is within grasp in the immediatefuture [3]. The sensor technologies needed to make autonomous ships a reality exists to-day. The challenge remains to find the optimal way to combine them reliably and costeffectively.

Autonomous surface vehicles (ASVs) operating in urban environments, e.g. ferries, mightneed slightly different exteroceptive sensors than ASVs operating in the open sea. A lidarwith a range of 100 meters may be more appropriate than a maritime radar with range ofseveral kilometers. Furthermore, the complexity of the environment means that the richinformation from optical cameras will be more important. In order to build a coherentworld image, which, e.g., collision avoidance decisions can be based on, the data fromthese sensors must be fused, together with data from interoceptive sensor systems suchas an inertial navigation system (INS). Koch [4] describes sensor fusion as the processof combining incomplete and imperfect pieces of information in such a way that a betterunderstanding of a underlying real-world phenomenon is achieved.

1

Chapter 1. Introduction

1.1.1 The ReVolt Project

The ReVolt is a shipping concept developed by the classification society DNV GL. TheReVolt aim at solving the growing need for transport capacity, by moving freight from theincreasingly stressed land based logistic networks, to the sea. The ReVolt vessel itself isenvisioned as an 60 metres long autonomous, battery powered short-sea cargo freighterwith a range of 100 nautical miles and a cargo capacity of 100 twenty-foot containers,operating at a speed of 6 knots. For the purpose of testing the autonomous capabilitiesof ReVolt, a 1:20 scaled model has been built. During previous student work, the scalemodel has been outfitted with dynamic positioning (DP) capabilities [5], and the aim ofthis thesis is to develop a sensor fusion system for lidar and optical camera to be used forsense-and-avoid purposes on board the ReVolt scale model. Parallel to the work describedin this thesis, other students are working on a remote control station, guidance and pathfollowing, as well as development of a digital twin for the ReVolt in their thesis work. TheReVolt scale model is shown in figure 1.1.

Figure 1.1: The ReVolt model ship.

1.2 Review of Previous Work

Substantial research has been conducted into the field of remote sensing and data fusionwith applications for autonomous vehicles, with a particular drive from the auto industry.Some examples of different approaches are given below.

• Stiller et al. [6] in 2000 proposed a multisensor concept with a variety of differentsensor technologies with widely overlapping fields of view for an autonomous, un-supervised vehicle. They used stereo vision, laser scanners, radar, and short range

2

1.2 Review of Previous Work

radar, combined by sensor fusion into a joint obstacle map. The research was con-ducted as a part of the German project Autonomes Fahren.

• Malich et al. [7] used low-level fusion of multibeam lidar and vision sensor into adetection and tracking framework. They used a cascaded AdaBoost (adaptive boost-ing [8]) detector based on haar-wavelet like features [9] for the vision system. Lidarreturns were used to generate regions of interest in the images.

• Aufrere et al. [10] at the NavLab group at Carnegie Mellon University proposeda high level fusion approach for object tracking using cameras and lidars for au-tonomous vehicles in cluttered urban environments. Their approach used a map-based fusion system, with a probability-based predictive model.

• Cho et al. [11] developed a multi-sensor system for moving object detection andtracking, building on the work of Aufrere et al. for the feature extraction for thelidar. Their approach fuses vision, lidar and radar. Detection in the vision moduleis represented as bounding boxes, and the data from the sensors were fused in anextended Kalman filter. The system detects and tracks pedestrians, bicyclists, andvehicles.

• Premebida et al. [12] demonstrated a perception system for pedestrian detection inurban scenarios using information from lidar and a single camera. Two sensor fusionarchitectures are described in their paper. A centralized architecture, where the fu-sion is done at the feature level, i.e. features from lidar and vision space combinedin a single vector for posterior classification using a single classifier. The decen-tralized architecture employs two classifiers, one per sensor feature-space, fused bya trainable fusion method applied over the likelihoods provided by the componentclassifiers. They showed that the trainable fusion method lead to enhanced detectionperformance, and maintenance of false-alarms under tolerable values in comparisonwith single-based classifiers.

• Weigel et al. [13] demonstrated a vehicle tracking and lane detection multi-sensorsystem, using a lidar and a monocular camera. Detected vehicles are tracked andmanaged by a multi-object extended Kalman filter using the data from the lidar andthe camera. The lidar was used to create appropriate regions of interest in the imageplane, and subsequently the measurements in the image plane was incorporated intothe Kalman filter.

There has also been done substantial research towards making autonomous vessels possi-ble. Two examples which focus on the sensing and perception aspect of such systems aregiven below.

3


• Wolf et al. [14] describe the perception and planning systems of an autonomous sur-face vehicle with the goal to detect and track other vessels at medium to long ranges.They employ a NASA JPL developed tightly integrated system termed CARACaS(Control Architecture for Robotic Agent Command and Sensing) that blends thesensing, planning and behaviour autonomy necessary for such missions. In theirpaper, they presents an autonomy system that detects and tracks vessels of a definedclass while patrolling near fixed assets. The sensor suite includes a wide-baselinestereo vision system for close-up perception and navigation, and a 360 degree cam-era head for longer range detection, identification, and tracking. The perception sys-tem termed SAVAnT (Surface Autonomous Visual Analysis and Tracking) receivessensory input from 6 cameras, stabilized by INS pose, detects objects of interest andcalculates absolute bearings for each contact.

• Elkins et al. [15] published a paper on the Autonomous Maritime Navigation project,with the stated goal of creating a set of sensors, hardware, and software that enablesautonomy on unmanned surface vehicle (USV) platforms. The sensor suite includescameras, radar, lidar, compass and GPS, integrated into a sensor fusion engine. Fu-sion algorithms are used to compile and correlate these data into a common tacticalpicture for the USV. In their paper, they showed the benefits of using a lidar forclose-range detections over more long-range sensors such as radars. They also ar-gued that the wavelength of the lasers in the lidar is such that the radiated energydoes not reflect well of the water surface, i.e. most points returned are not water,making it well suited in the detection of obstacles.

Previous research into fusion of camera and lidar show that the two sensors have com-plimentary characteristics, where the rich information in optical images make the camerasuitable for object detection and classification, and the point cloud from the lidar gives a3D situational overview of the surroundings. This makes the lidar and the camera goodcandidates to be used in a perception system for an autonomous vessel. The results pre-sented by Elkins et al. show that a lidar is well suited for close-range sensing and detectionfor an ASV operating in urban environments.

1.3 Contributions

The main contributions of this thesis are listed below.

• Measurement models for the lidar as well as the camera are formulated, and the sen-sors are geometrically calibrated. The measurements are transformed to a commonworld frame by integration with the navigation system on board the ReVolt modelship.

4

1.4 Thesis Outline

• The installation and integration into the existing control system of a camera and alidar on the ReVolt model ship.

• Implementation of a tracking system based on the JIPDA filter using real data gath-ered at sea. Although being given, the JIPDA implementation was expanded toinclude track formation, confirmation and termination capabilities, as well as com-puting association probabilities for image detections in addition to lidar measure-ments.

• A series of comprehensive experiments using real targets has been planned, orga-nized and executed at sea using the ReVolt model ship as a sensor platform. Theresulting data set has potential to be used in future research projects both for theReVolt as well as in other projects at NTNU.

1.4 Thesis Outline

The structure of this thesis is as follows; In chapter two, a theoretical description of the twosensors used is given. The pinhole camera model is presented, and subsequently expandedupon to give the model of a CCD camera. A brief introduction to convolutional neural netsfor image detection and classification is given, and the lidar sensor is described at the endof the chapter.

Chapter three deals with state estimation and target tracking. First, a brief primer onprobability is given, and the workhorse of state estimation, the Kalman filter as well asthe extended Kalman filter is presented. The target tracking theory first introduces theProbabilistic Data Association Filter (PDAF), before a brief description of the extensionof the PDAF to multi-target tracking in the Joint Probabilistic Data Association (JPDA)filter, and finally the Joint Integrated Probabilistic Data Association (JIPDA) which is usedin this thesis is presented.

Chapter four describes the physical implementation of the sensors on the ReVolt, as wellas the calibration procedures performed. The different coordinate systems, and the trans-formations between them is derived. The practical implementation of the Faster R-CNNframework is explained as well as the segmentation and clustering of the lidar point cloud.At the end of chapter four the implemented target tracking framework is described, includ-ing the motion and sensor models, as well as some practical considerations with regard tothe implementation of the JIPDA filter.

Chapter five presents and discusses the results of the target tracking framework from chap-ter four on a set of real-world data gathered by the author in the Dora harbour basin.

Chapter six concludes this thesis, and gives suggestions for future research areas to followgiven the results presented.

5


6

Chapter 2Sensors and DetectionFundamentals

The aim of this chapter is to introduce the theory and principles behind the sensors used inthis thesis, as well as to give an overview of the object detection process for the camera.This thesis employs a CCD camera and a 360° lidar as exteroceptive sensors to be fusedwithin a object tracking system. Object tracking refers to the problem of determiningthe location, path and characteristics of objects of interest by using sensor measurements.This thesis uses a detection-based tracking approach, and thus the tracking process relieson detection results in order to be able to maintain an estimate of the state of a target.The detection problem can be summarized as the process of recognizing the presence of atarget object within the sensor data, and object detection and classification for images willbe discussed.

2.1 Camera Model

A camera can be viewed as a mapping between the 3D world and a 2D image. This map-ping can be represented by a camera model, which is a matrix with certain properties.This section first introduces the basic pinhole camera, before generalizing this model toCCD (Charge-coupled device) sensors found in digital cameras. A model for the radialdistortion typically found in cameras with real (non-ideal) lenses is presented, and a mea-surement model associating world points with its corresponding pixel coordinates is givenat the end. The theory presented in this section is largely based on [16].

7

Chapter 2. Sensors and Detection Fundamentals

2.1.1 The Pinhole Camera Model

The basic pinhole camera model is a central projection of points in space onto a plane. Letthe centre of projection be the origin of a Euclidean coordinate system, and let the imageplane be the plane z = f . The line from the camera centre (also called the optical centre)perpendicular to the image plane is called the principal axis or principal ray of the camera.This ray meets the image plane in the principal point. A point in space with coordinatesx = [x, y, z]

T is mapped to a point on the image plane where a line joining the centre ofprojection (the origin) and the point x meets the image plane. This is illustrated in figure2.1a. As figure 2.1b illustrates by similar triangles, the point x in 3D space maps to the

(a) The pinhole camera model geometry. (b) The mapping of points in space to the imageplane by similar triangles.

Figure 2.1: The pinhole camera model. The figures are adapted from [16].

point xc =[f xz , f

yz , f]T

in the image plane. The final image coordinate is a constant,and ignoring this coordinate we see that[

x y z]T 7→ [

f xz f yz]T

(2.1)

describes the central projection mapping of the 3D world coordinates in R3 to the imageplane coordinates in R2. Using homogeneous coordinates, this mapping can convenientlybe represented by matrix product as

xyz1

7→fxfyz

=

f 0 0 00 f 0 00 0 1 0

xyz1

= diag(f, f, 1)[I3×3 03×1

] xyz1

(2.2)

where diag(f, f, 1) is a diagonal matrix, I3×3 is the 3 × 3 identity matrix, and 03×1 is a3×1 zero vector. We use the notation x for the world point given by the homogeneous co-ordinate vector [x, y, z, 1]

T , xc for the corresponding point in the image plane representedby a homogeneous 3-vector, and M for the homogeneous 3× 4 camera projection matrix,(2.2) can be written as

xc = Mx (2.3)

whereM , diag(f, f, 1)

[I3×3 03×1

]. (2.4)

8

2.1 Camera Model

CCD Cameras

The mapping given in (2.2) assumes that the origin of the image plane coordinates coin-cides with the principal point. In practice it may not, so the mapping between world andimage plane coordinates is augmented as[

x y z]T 7→ [

f xz + px f yz + py]T

(2.5)

where [px, py]T are the coordinates of the principal point. In homogeneous coordinates,

the mapping may now be written as

fx+ zpxfy + zpy

z

=

f 0 px 00 f py 00 0 1 0

xyz1

= K[I3×3 03×1

] xyz1

(2.6)

where

K =

f 0 px0 f py0 0 1

. (2.7)

is the camera calibration matrix. Moreover, in a camera where the sensor is a CCD array,the pixels may not be square. When the image coordinates are measured in pixels, thenon-square pixels introduce unequal scale factors in both axial directions. Defining thenumber of pixels per unit distance in image coordinates as mx and my in the x and ydirections, the homogeneous mapping between world and image coordinates is obtainedby premultiplying (2.7) with the factor diag(mx,my, 1), such that the general form of acalibration matrix for a CCD camera is [16]

K =

αx 0 x0

0 αy y0

0 0 1

(2.8)

where αx = fmx and αy = fmy represent the focal length of the camera in pixel unitsin the x and y dimensions, respectively, and x0 = mxpx and y0 = mypy represent thecoordinates of the principal point in terms of pixel dimensions. The parameters of thematrix K are known as the intrinsic camera parameters, and are typically found through acalibration procedure as described in [17].

Camera Coordinates and Rigid Transformations

Homogeneous coordinates are convenient for representing geometric transformations by amatrix product. Consider a point x in some right-handed coordinate frame. The nonhomo-geneous coordinate vector x is the 3-vector [x, y, z]

T , while its homogeneous counterpartis the 4-vector [x, y, z, 1]

T . The change of coordinates between two (arbitrary) Euclidean

9


coordinate systems a and b can be represented by a rotation matrix Rab and a translationvector tab in R3 as

xa = Rabxb + tab (2.9)

Using homogeneous coordinates, the same transformation can be written as

xa = T ab xb, where T ab =

[Rab tab

01×3 1

](2.10)

where xa and xb are now 4-vectors. 01×3 denotes the 1 × 3 zero vector. Thus far, theworld coordinates discussed in the previous section have been assumed given in the cameracoordinate frame, where the origin is located in the optical centre and with the principalaxis of the camera pointing down the positive z-axis. Generally points in space will beexpressed in terms of a different Euclidean coordinate frame, the world coordinate frame.The camera coordinate frame and the world coordinate frame are related via a rotationmatrixR and translation vector t such that

xcam = Rxworld + t. (2.11)

Combining (2.11) and (2.6), the camera matrix relating a point represented in the worldcoordinates with pixel coordinates can be written as

M = K[R t

]. (2.12)

The parameters of R and t relating the camera orientation to the world coordinate frameare called the extrinsic parameters. The camera matrix in (2.12) has 10 degrees of free-dom; 4 from the camera calibration matrix K, 3 for R and 3 for the translation t. Thereare several ways to parameterize the rotation matrixR, such as by Euler angles or quater-nions. For details on parameterizations of rotation matrices, the reader is referred to [18],[19] and [20].

Distortion Models

The assumption thus far has been that a linear model accurately models the imaging pro-cess. For cameras with real lenses, this assumption does not generally hold. The mostimportant deviation is typically radial distortion, which is prevalent in lenses with shortfocal length and wide field of view [16]. Radial distortion occurs when light rays bendmore near the edges of a lens than they do at its optical center. The remedy is to removeradial distortion as a preprocessing step, which in effect makes the camera a linear sensoragain. Figure 2.2 illustrates this step. Let the image plane coordinates of a point in thecamera frame be denoted by

[x y

]T. For the ideal, non-distorted pinhole camera model,

a world point x =[x y z

]Tin the camera coordinate system projected to the image

plane in normalized coordinates is expressed asxy1

=1

z

xyz

. (2.13)

10

2.1 Camera Model

Figure 2.2: A radially distorted image, and the corrected image which would have been obtainedwith a linear lens.

The actual image coordinates are related to the ideal image coordinates by a radial dis-placement, which can be modelled as [16][

xdyd

]=

[x(1 + k1r

2 + k2r4 + . . . )

y(1 + k1r2 + k2r

4 + . . . )

](2.14)

where[xd yd

]Tare the distorted image coordinates, and (1 + k1r

2 + k2r4 + . . . ) is a

distortion factor, which is a function of the radius r2 = x2+y2. The parameters k1, k2, . . .are the radial distortion parameters, which are normally calculated as a part of the cameracalibration. Typically, one or two coefficients are enough to compensate for the radialdistortion [21]. A lens could also introduce tangential distortion, which typically occurs ifthe lens and the sensor plane are not parallel. Figure 2.3b shows a case where the lens andsensor plane are not parallel, causing tangential distortion. The tangential distortion can

(a) No tangential distortion. (b) Non-parallel lens and sensor leading to tan-gential distortion.

Figure 2.3: Illustration of tangential distortion.

be modeled by [21] [xdyd

]=

[x+ 2p1xy + p2(r2 + 2x2)y + p1(r2 + 2y2) + 2p2xy

](2.15)

where p1 and p2 are the tangential distortion parameters. Combining the two distortionmodels, the distorted image coordinates in the normalized image plane can be expressed

11


as [xdyd

]=

[x+ x(k1r

2 + k2r4 + . . . ) + 2p1xy + p2(r2 + 2x2)

y + y(k1r2 + k2r

4 + . . . )) + p1(r2 + 2y2) + 2p2xy

](2.16)

while the distorted pixel coordinates and the normalized distorted coordinates are relatedthrough the camera matrix by udvd

1

= K

xdyd1

. (2.17)

This model specifies the projection of 3D points to the normalized image plane, but it doesnot give a direct solution to the back-projection problem, where we want to recover theline of sight from image coordinates. If both radial and tangential distortion is considered,there is no analytic solution to the inverse mapping, so a nonlinear search is required torecover the undistorted image coordinates from the distorted ones [21]. Computer visionsoftware can provide such methods, for example MATLAB [22] and OpenCV [23] readilyprovides functions for undistorting an image given the distortion parameters.

Complete Camera Sensor Model

Given that the measurements from the camera is a 2D coordinate in pixel values, andassuming that the radial and tangential distortions described in the previous section hasbeen corrected, the camera model becomes[

uv

]= HK

[R t

] 1

zw

xwywzw

(2.18)

where

H =

[1 0 00 1 0

](2.19)

is a projection matrix used to remove the bottom row from the homogeneous represen-tation, and the factor 1

zwscales the point x to normalized image coordinates (where the

image plane is located at unit distance along the principal axis). The pixel coordinates aregiven by [u, v]

T As is evident from the model there is a loss of information with a singlecamera, as we are going from three parameters in x to two parameters in

[u v

]T. If the

distortions described in the previous sections has been corrected in the image coordinates,this mapping can be inverted to give the normalized image plane projection of the imagecoordinates.

2.2 Computer Vision

Computer vision is a field of engineering and science concerned with extracting usefulinformation from images. This has proved to be a challenging task, and it is still today an

12

2.2 Computer Vision

active field of research. Visual data is very complex, and the same object represented bytwo different images could be perceived very differently by a computer based on variationssuch as changes in illumination, partial occlusion of objects, changes in orientation, defor-mation and so on. Such variations are illustrated in figure 2.4. Despite these challenges,the recent advancements made within the field of deep learning, and deep convolutionalneural nets in particular, has significantly improved the ability of computers to recognizeobjects in images [24]. The ImageNet Large Scale Visual Recognition Challenge has beenrun annually since 2010, and is a benchmark in object category classification and detec-tion on hundreds of object categories and millions of images [25]. The results of the top-5classification errors from the challenges up until 2016 is shown in figure 2.5. The top-5error rate is the fraction of test images for which the correct label is not among the fivelabels considered most probable by the classifier.

Figure 2.4: Challenges related to object classification in images.1

Figure 2.5: ImageNet top 5 errors over time.2

As seen in figure 2.5, deep learning algorithms dominate the field of object category clas-sification and detection, even outperforming humans with state-of-the-art deep convolu-tional neural nets.

1Image courtesy: http://cs231n.github.io/classification/2Image courtesy: https://www.dsiac.org/resources/journals/dsiac/winter-2017-volume-4-number-1/real-time-

situ-intelligent-video-analytics

13


2.2.1 Convolutional Neural Nets for Computer Vision

A convolutional neural net, abbreviated CNN, is a neural network suited for object detec-tion in computer vision applications. Since their introduction by LeCun et al. [26] in thelate 1980’s, convolutional neural nets have shown excellent performance at tasks such ashand-written digit classification and face detection. In 2012, Krizhevsky et al. [27] wonthe ImageNet 2012 classification benchmark with their deep convolutional neural network,dubbed AlexNet, achieving an top-5 test error rate of 15.3%, compared with the 2nd placeresult of 26.2%. Since then, the ImageNet classification benchmark has been dominatedby deep convolutional nets. The dramatic improvement in performance can be attributedto several factors [28]:

1. The availability of very large datasets, with millions of labeled examples.

2. The adaptation of powerful GPU processing implementations, making training ofvery large models tractable.

3. General improvement of algorithms. Examples are better model regularization strate-gies, such as dropout, to prevent overfitting [29], batch normalization [30] and resid-ual nets [31] to improve the training of deep networks.

In a classifier using classical computer vision techniques, features are extracted from animage, for example using a histogram of oriented gradients, and the extracted features aresubsequently used to train a classifier such as a support-vector machine. In a convolu-tional neural network on the other hand, the features are hidden, and feature extractionand classification is performed in a single pipeline. Features in a CNN are generated viaone or several layers of filter convolutions, which generates a set of abstract sub-imagesfrom the input image. For the user, a CNN classifier appears as a ”black box” where theinternal workings of the feature extraction and classification are largely hidden. In orderto understand how a convolutional neural net works, some basic concepts need to be in-troduced. The following sections first introduce the neuron and the concept of a neuralnetwork, followed by a introduction to convolutional neural nets commonly employed incomputer vision.

Neural Networks

In linear regression, a model can be considered as a linear combination of fixed nonlinearfunctions of the input variables, on the form

y(x,w) = w0 +n−1∑j=1

wjφj(x) (2.20)

where φj(x) are known as the nonlinear basis functions. A neural network utilized forregression and classification on the other hand, can be thought of as a combination of

14

2.2 Computer Vision

basis functions in parametric form, where the parameters of the basis functions are adaptedduring training. The term neural network has its origins in attempts to find mathematicalrepresentations of information processing in biological systems. The basic building blockof a neural network is the neuron, or perceptron, illustrated in figure 2.6a.

(a) A neuron, with its inputs,bias and output. (b) A simple neural network.

Figure 2.6: A neuron, alongside a simple neural network with a single hidden layer.

The basic neural network can be described as a series of functional transformations. FirstM linear combinations of the input variables x1, . . . , xD are formed, as

aj = b1j +D∑i=1

w1jixi (2.21)

where j = 1, . . . ,M , b1j is a bias, the parameters w1ji are the weights, and the superscript 1

indicates that the corresponding parameters are the parameters of the first layer in the net-work. The result of this weighted sum of the inputs, aj is known as the activation. Each ajis then transformed via a nonlinear, differentiable activation function, zj = σ(aj), whichis then input to the next layer in the network. Figure 2.6a illustrates this process for a sin-gle neuron. The quantities zj are referred to as hidden units, and the corresponding layersare referred to as hidden layers. A popular choice for the nonlinear activation functionsσ(·) is the rectified linear unit (ReLU),

σ(a) = max{0, a}. (2.22)

The outputs zj are again linearly combined to give output unit activations as

ak = b2k +M∑j=1

w2kjzj (2.23)

where k = 1, . . . ,M and K is the total number of outputs. This corresponds to thesecond layer of the network. Lastly, the output unit activations are transformed using an

15


appropriate activation function to give a set of network outputs yk. A simple networkwith a single hidden layer is illustrated in figure 2.6b. The choice for the output activationfunction is largely determined by the nature of the data and the assumed distribution ofthe target variables. For binary classification problems, the logistic sigmoid function iscommonly used [8].

Convolutional Neural Nets

A convolutional neural net or ConvNet, abbreviated CNN, is similar to ordinary neuralnetworks in that they are made up of neurons that have learnable biases and weights. Whatdistincts it from ordinary neural nets is that it assumes that the input data has a grid-liketopology, such as an image being a 3D grid of pixels, where the depth is represented by thecolor channels in the image. As the name implies, the CNN employs convolution in one ormore of the layers in the network, and the layers have neurons arranged in 3 dimensions.A traditional neural network use matrix multiplication by a matrix of parameters, with aseparate parameter describing the interactions between each input unit and each outputunit, meaning that every input unit interacts with every output unit. CNNs however, havesparse interactions, accomplished by convolving the input with kernels (square matriceswith weights as entries) which are smaller than the input. This drastically reduces theparameters in the network compared to the traditional network using matrix multiplication.If there are m inputs and n outputs, the matrix multiplication approach requires m × nparameters. For an input such as an image, which can have thousands or millions ofpixels, the number of parameters needed become very large. In a CNN, the number ofconnections for each output is limited by the kernel size and number of kernels [32].

CNNs employ parameter sharing, where the same parameter is used for more than one out-put in the model. This is accomplised by the convolution operation, where each memberof the kernel is used at every position of the input. This means that rather than learninga separate set of parameters for every location, only one set is learned, further reducingthe number of parameters needed in the model. A simple representation of a layer in aconvolutional neural net is shown in figure 2.7. The neurons (indicated by white circles)arranged in depth in the output activation layer are connected to the same inputs throughdifferent kernels, while all neurons at the same depth share kernel parameters. As indicatedin the figure, a single layer may consist of several filter kernels.

Figure 2.7: A CNN layer. The 3D input volume is transformed into a 3D volume of neuron activa-tions.

16

2.2 Computer Vision

A typical layer of a CNN consists of three stages; convolution, detection and pooling,illustrated in figure 2.8. Some terminology considers each stage in 2.8 as a separate layerin its own right, so the definition of what constitutes a convolutional layer may vary [32].

Figure 2.8: The three stages of a convolution layer.

The convolution stage performs several convolutions in parallel to produce a set of linearactivations. At the next stage, the linear activations are run through a nonlinear activationfunction, in a similar fashion as for the traditional neural network. In the pooling stage,a pooling function is used to modify the output layer, replacing the output of the net bya summary statistic of the nearby outputs. As an example, the max pooling operationreports the maximum output within a rectangular region. Other examples are the averageof a rectangular region, or a weighted average based on the distance from the central pixel.

A convolutional layer can be described by four hyperparameters. The term hyperparame-ter is used to distinguish them from the model parameters (weights and biases), and theyare not subject to optimization during training. The four hyperparameters are

1. The number of kernels, K.

2. The spatial extent of the filter kernel, F .

17


3. The stride S, the number of pixel displacements for each calculation.

4. The amount of zero padding to the brim of the image, P .

For a given input image of size[W1 ×H1 ×D1

]the spatial size of the output is

W2 = 1 + (W1 − F + 2P )/S

H2 = 1 + (H1 − F + 2P )/S(2.24)

Convolving the input with K kernels thus leads to the size of the output volume as W2 ×H2 ×K. Each filter kernel extends through the depth of the input, with dimensions F ×F ×D1, making the total number of parameters for a single layer K · F · F ·D1.

2.3 Spatial Data Acquisition using 3D Lidar

A lidar is an active electro-optical sensor that sends out a laser pulse, and subsequentlymeasures the parameters of the return signal bounced off some object. The lidar is knownunder several names, some examples being ladar, lidar, LIDAR, LADAR or laser radar.The term lidar is the most common, and will be used throughout this thesis. In a lidar, awaveform generator generates a laser waveform. Depending on the type of lidar, the setupcan include a single laser or a master oscillator with multiple lasers or laser amplifiers.There are many types of lidars, however this thesis uses a particular type of 3D lidar witha rotating detector array, which measures azimuth angle, elevation angle and range. Thetheory presented in this section is therefore not general to all lidars, but rather focuses onthe particular type of lidar sensor used. The measurement is performed by the laser pulsebeing guided through transmit optics, traversing some medium, typically atmosphere, to atarget. The laser pulse bounces off the target, and traverses the media again until receiveoptics captures the reflected pulse, guiding it to a detector or a detector array [33]. Thelaser and detector arrays rotate, taking multiple measurements as it scans the full 360degree field of view. The range to a target can be determined based on the travel time ofthe laser pulse by

r =c

2(trx − ttx), (2.25)

where c is the speed of light in the intervening medium, and ttx and trx is the transmissionand reception time of the laser pulse, respectively.

2.3.1 The Velodyne VLP-16

The lidar used in this thesis is the Velodyne VLP-16 Puck, which is a small, real-timerotating lidar which continuously steams data over a TCP/IP connection when it is poweredup. The Velodyne VLP-16 is shown in figure 2.9. The lidar has multiple return modes,where it can report either the strongest return signal, the last return signal, or both. The

18


default return mode is to report the strongest return, and this mode is what is used inthis thesis. The Velodyne VLP-16 features 16 laser/detector pairs mounted in a rotatinghousing, rapidly spinning to produce a 360° 3D point cloud. The specifications of theVLP-16 lidar are [34]

• Horizontal field of view of 360°.

• Horizontal angular resolution of 0.1°-0.4°, depending on rotational speed.

• Weight of 830 grams.

• Adjustable rotational speed between 5-20 Hz.

• Three laser return modes; strongest, last and dual.

• Vertical field of view of 30° (±15 °)

• Vertical angular resolution of 2°.

• Range of up to 100 meters (depending on application).

• Typical accuracy of ±3 centimeters.

Figure 2.9: The Velodyne VLP-16 Puck.

The range returned by the Velodyne VLP-16 lidar is measured along a beam at knownazimuth angles α, and the 16 laser/detector pairs are mounted vertically at 2° intervalsfrom -15° to 15° relative to the horizon in the lidar frame of reference. When the lidaris rotating at a frequency of 10 Hz the lasers fire at every 0.2°, with a total of 1800 laserfirings for a full rotation. The theoretical maximum number of range measurements fora full 360° scan is therefore 360/0.2 × 16 = 28800 measurements, where it is assumedthat every single laser beam is reflected, and subsequently detected at the lidar. The lidarreports distances relative to itself in a spherical coordinate system (radius r, elevation ω,azimuth α). The relationship between the spherical and the cartesian coordinate frames is

19


illustrated in figures 2.10 and 2.11. The lidar returns given in the sperical frame can beconverted to a cartesian coordinate frame using basic trigonometry by

x = P ′ sinα = r cosω sinα

y = P ′ cosα = r cosω cosα

z = r sinβ.

(2.26)

A single reflection from the lidar can be represented in a cartesian coordinate frame cen-

Figure 2.10: Lidar spherical coordinate system, with equivalent cartesian coordinates.

(a) Side view. (b) Top view.

Figure 2.11: The lidar cartesian coordinate frame, with added elevation and azimuth angles.

tered in the lidar, shown in figure 2.11, using the transformation given in (2.26) as:

Plij =

rij cosωi sinαjrij cosωi cosαj

rij sinωi

(2.27)

Here, αj is the jth azimuth angle in the scan, and ωi is the angle of the ith vertical lasermeasured from the horizon. The superscript l denotes the cartesian frame centered in the

20


lidar. rij is the range measured at the given azimuth and elevation angle. The VelodyneVLP-16 comes pre-calibrated from the factory, and Glennie et al. [35] did a calibrationand stability analysis of the VLP-16 lidar, where they found that the factory suppliedcalibration is within the stated ±3 centimeter ranging accuracy.

21


22

Chapter 3State Estimation and ObjectTracking

An autonomous vessel operating in a dynamic and unpredictable environment such as anurban harbour environment will have to rely on its sensors in order to generate a coherentworld view. A sensor will have limits to what it can percieve. The range and resolutionof a sensor is limited, and moreover the sensors are subject to noise which perturbs themeasurements in unpredictable ways. These limitations gives rise to uncertainty aboutthe environment surrounding the vessel, and the process of inferring the value of a quan-tity of interest from indirect, inaccurate and uncertain observations is called estimation.This chapter first introduces some basic probability-theoretic concepts important in laterderivations, and carries on in introducing probabilistic state-estimation techniques, as wellas target tracking in systems where the origin of measurements is uncertain. For the stateestimation, the Kalman filter and the Extended Kalman filter is introduced, before movingon to the target tracking algorithms in section 3.3.

3.1 Review of Basic Probability

This section serves as a review of basic concepts from probability and statistics whichare important for later chapters. The theory presented in this section is largely based onchapter 2 in [36].

23

Chapter 3. State Estimation and Object Tracking

3.1.1 Probability Density Functions

A probability density function (pdf) p(x) describes how a random variable, x, is dis-tributed. Let p(x) be the PDF for the random variable x over the interval [a, b]. Thepdf is a non-negative function that satisfies the axiom of total probability, which can bestated as ∫ b

a

p(x)dx = 1. (3.1)

Probability is given by the area under the density function. The probability that x liesbetween c and d is given by

Pr(c ≤ x ≤ d) =

∫ d

c

p(x)dx. (3.2)

A conditioning variable can be introduced by letting p(x|y) be a PDF over x ∈ [a, b]conditioned on y ∈ [c, d] such that

∀y∫ b

a

p(x)dx = 1. (3.3)

This can be extended to the N-dimensional case as p(x), where x = (x1, x2, . . . , xN ) withxi ∈ [ai, bi]. The joint density of x and y can be written as p(x,y). For the N-dimensionalcase, the axiom of total probability requires that∫ b

a

p(x)dx =

∫ bN

aN

· · ·∫ b2

a2

∫ b1

a1

p(x1, x2, . . . xN )dx1dx2 . . . dxN = 1 (3.4)

where a = (a1, a2, . . . , aN ) and b = (b1, b2, . . . , bN )

An important and much encountered probability density function is the Normal distribu-tion, also known as a Gaussian distribution. In the multivariate case, this distribution isdefined as

N (x;µ,Σ) = |2πΣ|− 12 e−

12 (x−µ)T Σ−1(x−µ) (3.5)

where | · | denotes the determinant, µ is the mean vector, and Σ is the covariance matrix.For a normally distributed process, normality is preserved through linear transformations.The distribution is fully described by the two parameters, the mean value and the covari-ance.

3.1.2 Bayes’ Rule

The joint probability density p(x,y) can be factored into a conditional and non-conditionalfactor as

p(x,y) = p(x|y)p(y) = p(y|x)p(x). (3.6)

24

3.2 Bayesian State Estimation

In the case where x and y are statistically independent, the joint probability can be factoredas p(x,y) = p(x)p(y). Rearranging (3.6) gives rise to Bayes’ rule:

p(x|y) =p(y|x)p(x)

p(y). (3.7)

Letting x represent the state of a system, and y be the data (e.g. a measurement), Bayes’rule can be used to infer the likelihood of the state given the data, also known as theposterior, p(x|y). This requires that we have a prior PDF over the state, p(x), and a sensormodel, p(y|x). This is done by expanding the denominator of (3.7) by marginalizationsuch that

p(x|y) =p(y|x)p(x)∫p(y|x)p(x)dx

. (3.8)

where the limits of the denominator integral is over the entire allowable domain of theintegration variable x.


Estimation is the process of inferring the value of a quantity of interest from imperfect, in-accurate and uncertain observations, and tracking is the estimation of the state of a movingobject. State estimation uses statistical knowledge of the various uncertainties and priorinformation, the measurements given from sensors influenced by noise, and the system dy-namics to estimate the quantities of interest. A general outline of the components withina tracking system is displayed in figure 3.1. This section first introduces the theory ofBayesian state estimation, including the Kalman filter and its extensions to non-linear sys-tems (the Extended Kalman filter), before delving into the data association problem centralto object tracking. The Probabilistic Data Association filter (PDAF) is introduced, and itsextension to the multi-target case with the Joint Probabilistic Data Association (JPDA) fil-ter is presented. The Joint Integrated Probabilistic Data Association (JIPDA) filter, whichis an extension of the JPDA to include target existence and visibility probabilities is pre-sented at the end. This is also the target tracking filter used in this thesis.

Figure 3.1: Overview of the components of a tracking system. The internal states of the componentswithin the red box is generally not available to the system designer. The figure is adapted from [37].

25


3.2.1 Optimal Bayesian Filter

Bayesian inference in dynamic systems is concerned with the calculation of the proba-bility density function p(x0:k|Z1:k) of a hidden state sequence x0:k = {xi}ki=0, givenan observable sequence of measurements Z1:k = {Zi}ki=1. If a prior distribution of thestates, p(x0:k), is known, and the relationship between the states and the measurements inthe form of a sensor model or measurement equation is available, then Bayes’ rule givenin (3.8) can be used to calculate the posterior pdf as

p(x0:k|Z1:k) =p(Z1:k|x0:k)p(x0:k)

p(Z1:k). (3.9)

In an object tracking application, measurements are received one by one over time, thus arecursive formulation of (3.9) which can be applied each time a new measurement arrivesis needed. For this purpose, the elements of (3.9) can be factored as follows:

p(Z1:k|x0:k) = p(zk|Z1:k−1,x0:k)p(Z1:k−1|x0:k−1) (3.10a)p(x0:k) = p(xk|x0:k−1)p(x0:k−1) (3.10b)p(Z1:k) = p(zk|Z1:k−1)p(Z1:k−1) (3.10c)

where (3.10a) made use of the fact that measurements at time k − 1 only depend on thestates up to time k− 1. Substituting into (3.9), rearranging and using a second applicationof Bayes’ rule gives the general recursive form [38]

p(x0:k|Z1:k) =p(zk|Z1:k−1,x0:k)p(xk|x0:k−1)

p(zk|Z1:k−1)p(x0:k−1|Z1:k−1). (3.11)

The Markov Assumption

The general recursive form of the Bayes filter given in (3.11) requires that the completestate and measurement history up until time k− 1 is available, and is used in the computa-tions. In order to make the filter computationally feasible, the Markov assumption is com-monly employed. The Markov assumption, or the complete state assumption, postulatesthat past and future data are independent if one knows the current state xk. Moreover, themeasurements are also assumed to be conditionally independent of previous measurementsand state histories, and depend only on the current state. With these assumptions, the distri-bution p(zk|Z1:k−1,x0:k) simplifies to p(zk|xk), and p(xk|x0:k−1) becomes p(xk|xk−1).Such a temporal generative model is known as a Hidden Markov model (HMM) [39],illustrated in figure 3.2. Using the stated assumptions, (3.11) can be restated as

p(x0:k|Z1:k) =p(zk|xk)

p(zk|Z1:k−1)p(xk|xk−1)p(x0:k−1|Z1:k−1). (3.12)

In Bayesian filtering, the filtering distribution p(xk|Z1:k) is of interest, in stead of the fulljoint posterior density p(x0:k|Z1:k). The filtering distribution can be derived from (3.12)

26


zk−1

xk−1 xk xk+1

zk zk+1

p( | )zk−1 xk−1p( | )zk xk p( | )zk+1 xk+1

p( | )xk xk−1 p( | )xk+1 xk

Figure 3.2: Hidden Markov model. Observable nodes are shaded in blue.

by marginalization over all previous states at all previous time steps [40],

p(xk|Z1:k) =

∫xk−1

· · ·∫

x0

p(x0:k|Z1:k)dx0 . . . dxk−1

=p(zk|xk)

p(zk|Z1:k−1)

∫xk−1

· · ·∫

x0

p(xk|xk−1)p(x0:k−1|Z1:k−1)dx0 . . . dxk−1

=p(zk|xk)

p(zk|Z1:k−1)∫xk−1

(p(xk|xk−1)

∫xk−2

· · ·∫

x0

p(x0:k−1|Z1:k−1)dx0 . . . dxk−2︸︷︷︸p(xk−1|Z1:k−1)

)dxk−1

(3.13)

where the inner integrals over x0 through xk−2 simplify to p(xk−1|Z1:k−1) through marginal-ization. The remaining integral is known as the Chapman-Kolmogorov equation [40]

p(xk|Z1:k−1) =

∫xk−1

p(xk|xk−1)p(xk−1|Z1:k−1)dxk−1. (3.14)

The posterior pdf can finally be stated as [39], [40]

p(xk|Z1:k) =

Filtering︷︸︸︷p(zk|xk)

p(zk|Z1:k−1)︸︷︷︸Normalization

Prediction︷︸︸︷∫xk−1

p(xk|xk−1)p(xk−1|Z1:k−1)dxk−1︸︷︷︸p(xk|zk−1)

. (3.15)

This is the recursive formula that all Bayesian filtering and tracking algorithms tries tocompute. The prediction step in (3.15) takes the conditional density at the previousstage p(xk−1|Z1:k−1) through the transition density p(xk|xk−1) to form the predicteddensity p(xk|Z1:k−1). The filtering step uses the new data zk through the likelihoodfunction p(zk|xk) to form the filtering distribution p(xk|Z1:k). The normalization fac-tor p(z|Z1:k−1) in (3.15) ensures that the filtering distribution is a valid pdf [40].

27


3.2.2 State Space Models

Almost all manouvering target tracking methods are model based, in that they assumethat the target motions and observations can be mathematically modelled with sufficientaccuracy [41]. State-space models are most commonly used, with additive noise, such asthe discrete-time model

xk+1 = f(k,xk,uk) + wk, p(wk) = N (wk; 0,Qk) (3.16)zk = h(k,xk) + vk, p(vk) = N (vk; 0,Rk) (3.17)

where xk, zk and uk are the state vector, measurement vector and control input vector,respectively, at the discrete time step k. The process noise wk and the measurement noisevk are white noise sequences. In a general tracking problem, the control input uk isusually not known. fk is a vector-valued function moving the current state and controlinput to the next state. The vector-valued function hk maps the current state to the currentmeasurement. These functions may also be time-varying. The covariance of the processnoise is given by Qk, and the covariance of the measurement noise is given by Rk.

The linear counterpart to (3.16) and (3.17), where the input uk is omitted, are the equations

xk+1 = Fkxk + wk, p(wk) = N (wk; 0,Qk) (3.18)zk = Hkxk + vk, p(vk) = N (vk; 0,Rk) (3.19)

where Fk and Hk are now (possibly time-varying) matrices.

The sensor model is given by (3.17) (in the linear case by (3.19)), and depends on thecharacteristics of the sensor and its noise characteristics.

3.2.3 Kalman Filter

The Kalman filter [42] is a linear state estimator, first introduced by Rudolf Kalman in1960. It was quickly adopted by the engineering community, especially within the field ofnavigation [43]. The Kalman filter is derived based on some fundamental assumptions:

• The system’s state evolves according to a known linear plant equation driven by aknown input and an additive zero-mean white process noise with known covarianceQk, as in (3.18).

• The measurements are given by a known linear function of the state with an additivezero-mean white measurement noise with known covariance Rk, as in (3.19).

• The initial state is unknown, assumed to be a random variable with a known meanand covariance.

• The initial error and noises are assumed mutually uncorrelated.

28


If these assumptions are satisfied, and if the initial state error and all the noises enteringinto the system are Gaussian distributed, the Kalman filter is the minimum mean squareerror (MMSE) estimator [37], which provides an exact, closed-form solution to (3.15). Aderivation of the Kalman filter equations from (3.15) is provided in [44].

First, we will introduce some notation used in the Kalman filter algorithm, presented intable 3.1.

xk The prediction (a priori estimate) of the true state xk

xk The a posteriori estimate of the true state xk

Pk The a priori estimated error covariance matrix, E[(xk − xk)(xk − xk)T ]

Pk The a posteriori estimate of the error covariance matrix, E[(xk − xk)(xk − xk)T ]

Sk The innovation covariance matrix

Wk The Kalman gain matrix

Table 3.1: Notation used in the Kalman filter algorithm.

The Kalman filter is an iterative updating scheme with two main steps, the prediction andthe update (or filtering) steps. The Kalman filter algorithm can be summarized as follows:

Input

• The process and measurement noise covariance matrices, Qk and Rk provided bythe system designer.

• The initial conditions, x0 = x0 and P0 = E[(x0 − x0)(x0 − x0)T ].

Prediction

The first step of the Kalman filter algorithm is predicting the state and covariance for thenext time step given the state estimate of the previous time step, using the system motionmodel and noise covariance, given by

xk+1 = Fkxk (3.20)

Pk+1 = FkPkFk + Qk. (3.21)

29


Update

In the update step the state vector is updated with the new measurement. In this simpleexample, it is assumed that there is only one measurement, which is known to originatefrom the target, i.e. the data-to-object associations are unambiguous. The residual vectorνk+1 is the difference between the measurement and the predicted measurement, definedas

νk+1 = zk+1 − zk+1 (3.22)

wherezk+1 = Hk+1xk+1 (3.23)

is the predicted measurement. The Kalman filter does not fully rely on either the mea-surement or the predicted state, but in stead applies a covariate scaling to the residual, theKalman gain Wk+1, which can be viewed as a comparison between confidence in themeasurements versus confidence in the predicted state. The confidence of the measure-ment is described by the innovation covariance matrix resulting from the covariance of theprediction Pk+1 and the sensor model,

Sk+1 = Hk+1Pk+1HTk+1 + Rk+1. (3.24)

The Kalman gain is given by

Wk+1 = Pk+1HTk+1S

−1k+1. (3.25)

The Kalman gain is applied to the residual, giving the posterior state estimate as

xk+1 = xk+1 + Wk+1νk+1. (3.26)

The state covariance matrix is also updated, taking into account the blending of the pre-dicted state and the measurements as

Pk+1 = (I−Hk+1Wk+1)Pk+1. (3.27)

The discrete-time variable is incremented, and when a new measurement is ready the al-gorithm moves to the prediction step. The Kalman filter calculations for one cycle issummarized in figure 3.3.

3.2.4 Extended Kalman Filter

In the case where the system plant equation and/or measurement equation are given bynonlinear equations, the standard Kalman filter cannot be used. As the measurementmodel for the camera described in chapter 4 is a nonlinear one, the methods of the ex-tended Kalman filter is used in the calculation of the innovation covariance matrix S.The Extended Kalman Filter (EKF) linearizes the nonlinear dynamics and/or measurementequations about a trajectory that is continually updated with the state estimates resulting

30


Figure 3.3: One cycle of the Kalman filter. The true evolution of the state is shaded in red, thestate estimation is shaded in blue, and the covariance computation is shaded in green. The figure isadapted from [37].

from the measurements, and is a nonoptimal state estimation algorithm for nonlinear sys-tems [37]. The linearization is based on the first order Taylor expansion of the nonlinearfunctions. The noise is assumed to enter additively, as for the standard Kalman filter.

The main difference between the standard Kalman filter and the EKF is that the Jacobiansof the plant- and measurement equations are used in calculating the state prediction co-variance, the innovation covariance, and the filter gain. The Jacobians for the plant- andmeasurement equations are given by

Fk =∂f(k,x)

∂x

∣∣∣∣x=xk

=

∂f1∂x1

∂f1∂x2

. . .∂f2∂x1

∂f2∂x2

. . ....

.... . .

∣∣∣∣∣x=xk

(3.28)

Hk+1 =∂h(k + 1,x)

∂x

∣∣∣∣x=xk+1

=

∂h1

∂x1

∂h1

∂x2. . .

∂h2

∂x1

∂h2

∂x2. . .

......

. . .

∣∣∣∣∣x=xk+1

. (3.29)

The EKF loop is displayed in figure 3.4. Since the EKF uses a series expansion of thenonlinear equations in the Kalman filter equations, the extended Kalman filter can divergeif the reference about which the linearization takes place is poor. This situation commonlyoccurs at the initial starting point of the recursive estimation process, the EKF is verysensitive to the accuracy of the initial conditions. Inaccurate prior information about thetrue state causes a large error in x0, which forces P0 to be large. This can lead to twoproblems in getting the filter started [43]:

1. A large initial prior error covariance P0 combined with a low-noise measurement atthe first iteration of the EKF algorithm causes the error covariance to jump from avery large value to a small value in one step. This can lead to numerical problems, anon-positive-definite error covariance matrix Pk at any point in the recursive processusually leads to divergence. The filter designer must take precausions in preserving

31


Figure 3.4: One cycle of the Extended Kalman filter. The true evolution of the state is shaded inred, the state estimation is shaded in blue, and the covariance computation (including the evaluationof the Jacobians) is shaded in green. The figure is adapted from [37].

the symmetry and positive definiteness of the error covariance matrix in the first stepof the recursive filter loop.

2. The initial state x0 is the best estimate of the state at the time of initialization, and isused as the reference for linearization in the first step of the algorithm. If this initialstate is far from the true state (i.e. the error is large), the first-order approximationused in the linearization will be poor. This may cause the filter to diverge.

3.3 Object Tracking

This section presents object tracking methods which are based on the point-target assump-tion, meaning that any potential targets will give rise to a single point measurement, if anymeasurements at all. The object tracking methods considers the possibility that a measure-ment could originate from clutter or noise as well as from the actual target itself, and thebasic idea is to calculate the association probabilities between a set of validated measure-ments and the predicted state of a target before using the measurements in the state update.This section starts with the Probabilistic Data Association Filter (PDAF), before progres-sively generalizing it to the multi-target case with the Joint Integrated Probabilistic DataAssociation filter (JPDA), and its extension to include calculation of target existence andvisibility probabilities in the Joint Integrated Probabilistic Data Association filter (JIPDA).

3.3.1 The Probabilistic Data Association Filter

As described in section 3.2.3, the basic Kalman filter assumes that the measurements re-ceived are unambiguously originating from the actual target, i. e. the tracking is performed

32

3.3 Object Tracking

with an ideal sensor. In a realistic tracking situation, this is not necessarily true. A realisticsensor could give returns that originate from clutter or noise rather than the actual targetof interest. In other words, the origin of the measurements are uncertain. This leads tothe data association problem, which is the problem of determining which measurementsare believed to originate from the target, and which measurements are clutter or noise,and can be discarded. The naive solution is to select the nearest gated measurement (ormeasurements) as the target-originated one, whereas the probabilistic data association fil-ter (PDAF) uses the probabilistic distances to all gated measurements, scaling each onewith its probabilistic distance. The derivations presented in this section is largely based onchapter 3 in [45]. The basic assumptions of the PDAF are listed below.

• There is only one target of interest, modeled by (3.18) and (3.19).

• The track has been initialized.

• The past information about the target is summarized approximately by

p(xk|Z1:k−1) = N [xk; xk, Pk]

where, as before, Z1:k−1 = {zi}k−1i=1 is the sequence of measurements up to and

including time k − 1.

• At each time a validation region as in (3.31) is set up.

• At most one of the possibly many validated measurements can be target originated.

• The remaining measurements are assumed due to false alarm or clutter and are mod-eled as independent identically distributed with uniform spatial distribution.

• The target detections occur independently over time with known probability PD.

The latest set of validated measurements, i.e. measurements at time k that falls within thevalidation region, is denoted

Zk = {zi,k}mki=1 (3.30)

where zi,k is the i-th validated measurement, and mk is the number of measurementswithin the validation region at time k.

The Validation Region

A validated measurement is a measurement falling within a range gate set up for detectingthe signal from the target. This is done to avoid searching the entire measurement space forthe signal from the target of interest. A measurement that falls within the gate, while notguaranteed to have originated from the target of interest, is a valid association candidate,thus the name validation region. Consider a target that has a valid track, meaning its filteris initialized. Then, according to the assumptions described in the previous section, one has

33


the predicted mean value of the measurement zk = Hkxk and the associated covarianceSk. With the assumption that the true measurement conditioned on the past is normallydistributed, the true measurement will be within the region described by

V(k, γ) = {z : [z− zk]TS−1k [z− zk] ≤ γ} (3.31)

where γ is the gate threshold determined by the chosen gate probability PG [45]. Thisthreshold is obtained from tables of the χ2 distribution, since the quadratic form in (3.31)that defines the validation region is χ2 distributed [37]. The volume of the validationregion is

V (k) = cnz|γSk|

12 = cnz

γnz2 |Sk|

12 (3.32)

where cnzis the volume of the nz-dimensional hypersphere, and nz is the dimension

of the measurement. An example of a gating process is shown in figure 3.5. In thisexample, measurements z1,k and z2,k are within the gating region, and are consideredvalid association candidates. The remaining measurements, indicated by red dots, areoutside the gating region and are discarded.

Figure 3.5: A validation gate with two validated measurements.

State Estimation in the PDAF

The prediction of the state xk, measurement zk and the covariance Pk from k − 1 to kis done in the same way as for the Kalman filter, given by (3.20) and (3.21). With the

34

3.3 Object Tracking

previously stated assumptions, the association events

θi,k =

{{zi,k is the target originated measurement} i = 1, . . . ,mk

{none of the measurements is target originated} i = 0(3.33)

are mutually exclusive and exhaustive for mk ≥ 1. Defining the association probability as

βi,k , P (θi,k|Zk) (3.34)

the conditional mean of the state can be written as [45]

xk =

mk∑i=0

xi,kβi,k (3.35)

where xi,k is the estimated state conditioned on the event that the i-th measurement iscorrect given as

xi,k = xk + Wkνi,k i = 1, . . . ,mk (3.36)

and the corresponding innovation is

νi,k = zi,k − zk. (3.37)

The gain matrix Wk is the same as in the Kalman filter, given in (3.25). Combining (3.35)and (3.36), the state update can be written as

xk = xk + Wkνk (3.38)

where νk is the combined innovation

νk =

mk∑i=1

βi,kνi,k (3.39)

In the case where there is no validated measurements, or none of the validated measure-ments are correct, the filter propagates the previous state estimate as the updated state.

The Covariance Update

The covariance for the updated state in (3.38) is [45]

Pk = β0,kPk + [1− β0,k]Pck + Pk (3.40)

where Pck is the covariance of the state updated with the correct measurement, similarly

as in the Kalman filter:Pck = Pk −WkSkW

Tk . (3.41)

The term Pk represents the spread of the innovations, defined as

Pk , Wk

[ mk∑i=1

βi,kνi,kνTi,k − νkν

Tk

]WT

k . (3.42)

35


The two first terms of (3.40) is a weighing between the prediction covariance and thecovariance updated with the correct measurement, weighted by the probability of no cor-rect measurements (β0,k) and the probability that the correct measurement is available(1− β0,k), respectively. Since one does not know which of the mk gated measurements isthe correct one, the positive semidefinite term given by (3.42) increases the covariance toaccount for the measurement origin uncertainty [45].

The Association Probabilities

The form of the association probabilities depend on the probability mass function µF (ϕ)of the number of false measurements in the validation region. Two models can be used forµF (ϕ) in the volume of interest V , a parametric Poisson model with spatial density givenby λ,

µF (ϕ) = e−λV(λV )ϕ

ϕ!(3.43)

and a nonparametric diffuse prior model,

µF (ϕ) = µF (ϕ− 1) = δ. (3.44)

The two models yield slightly different expressions for the assiciation probabilities βi(k),which will simply be stated below. For a full derivation of the expressions, the reader isreffered to [45], section 3.4.3. Using the Poisson clutter model, the association probabili-ties are

βi,k =

ei

b+∑mk

j=1 eji = 1, . . . ,m(k)

bb+

∑mkj=1 ej

i = 0(3.45)

whereei , e−

12ν

Tik

S−1k νi,k (3.46)

and

b ,2π

γ

nz2

λV (k)c−1nz

1− PDPGPD

. (3.47)

The nonparametric diffuse prior model is the same as the above expressions except forreplacing (3.47) with

b ,2π

γ

nz2

mkc−1nz

1− PDPGPD

. (3.48)

3.3.2 JPDA

The Joint Probabilistic Data Association filter (JPDA) is a multi-target extension of thePDAF presented in section 3.3.1. In the PDAF, all incorrect measurements are modeled

36

3.3 Object Tracking

as random interference, with uniform spatial distribution, while the JPDA allows the mea-surements from a target to fall within the validation region of a neighboring target. Thiscan happen over several sampling intervals, acting as a persistent interference. Such a sit-uation is depicted in figure 3.6, where two targets have overlapping validation regions, andmeasurement z1,k falls within the validation region of both targets. The performance of

Figure 3.6: Two overlapping validation gates, sharing the measurement z1,k. The JPDA filter ac-counts for the fact that this measurement is more likely to originate from target 2 than target 1 in thissituation, while the PDAF would make no such discrimination.

the PDAF degrades significantly when a neighboring target acts as a persistent interference[37]. In addition to allowing measurements from one target to fall within the validationregion of a neighboring target, the JPDA builds on the assumptions that:

• There is a known number of established targets in clutter.

• The past is summarized by an approximate sufficient statistic - state estimates (ap-proximate conditional mean) and covariances for each target.

• The states are assumed Gaussian distributed with the aforementioned mean and co-variances.

• Each target has a motion and measurement model as in (3.18) and (3.19). Differenttargets may have different models.

37


• The number of false measurements is assumed Poisson distributed with intensity λ.

Moreover, it is assumed that a target may generate at most one measurement (point targetassumption), and that any measurement comes from at most one target (no merged mea-surement assumption) meaning that measurements from two or several targets may notmerge into one measurement. The approach of the JPDA is to compute the measurement-to-target association probabilities jointly across the targets. Its key feature is the evaluationof the conditional probabilities of the association hypotheses. Following along the lines of[2], an association hypothesis ak can be defined as a vector ak = [ak(1), ak(2), . . . , ak(n)]

T

such that

ak(t) =

{j if measurement j is claimed by target t0 if no measurement is claimed by target t

(3.49)

where j = 1, . . . ,mk is the number of measurements, and t = 1, . . . , n is the number ofknown targets. Due to the point target assumption and the no merged measurements as-sumption, the association hypotheses are mutually exclusive and exhaustive. The theoremof total probability therefore yields

p(x1,k,x2,k, . . . ,xn,k|Z1:k) =∑ak

p(x1,k,x2,k, . . . ,xn,k|ak,Z1:k)P{ak|Z1:k} (3.50)

where the sum goes over all feasible association hypotheses. The event-conditional den-sities p(x1

k,x2k, . . . ,x

nk |ak,Z1:k) can be easily obtained using a Kalman filter whenever

a measurement zak(t),k is assigned to a target t, and use the prediction otherwise. Theequations for obtaining the association hypotheses probabilities P{ak|Z1:k} will be givenin section 3.3.3, and thus will not be stated here. Once the association hypotheses proba-bilities are found, the goal is to merge together different hypotheses, such that one is leftwith a single Gaussian for each track. The marginal association probabilities βtj,k for eachtarget t is found by summing over all joint events in which the marginal event of interestoccurs,

βtj,k = P{zj,k is assigned to target t|Z1:k} =∑

ak s.t. ak(t)=j

P{ak|Z1:k}. (3.51)

Once the marginal association probabilities are found, the state update for each target isdone as for the PDAF, with (3.38) and (3.40).

3.3.3 JIPDA

The Joint Integrated Probabilistic Data Association filter (JIPDA) [46] is a multi-targettracking filter along the lines of the JPDA presented in section 3.3.2, where the main dif-ference is the computation of track existence probabilities, as for the IPDA filter [47]. Fora single target, the JIPDA filter reduces to the IPDA filter. The track existence probabilitiesallow for automatic track initiation, maintenance and termination. The derivations in thefollowing sections is based on work from [37], [46] and [2]. The notation used is the sameas in [2].

38

3.3 Object Tracking

Assumptions

The JIPDA filter builds on the same assumptions as the JPDA, repeated here for conve-nience:

• There is a known number of established tracks in clutter.

• The past is summarized by an approximate sufficient statistic - state estimates (ap-proximate conditional mean) and covariances for each target.

• The states are assumed Gaussian distributed with the aforementioned mean and co-variances.

• Each target has a motion and measurement model as in (3.18) and (3.19). Differenttargets may have different models.

• The number of false measurements is assumed Poisson distributed with intensity λ.

Track Existence and Visibility Models

In the original paper, Musicki and Evans [46] model target existence using two distinctMarkov chains, which they named Markov Chain One and Markov Chain Two. MarkovChain One recognizes two possibilities; the target either does not exist, or it exists and isvisible with a probability of detection. Markov Chain Two recognizes three possibilities;in addition to the two possibilities of Markov Chain One, the target may also exist, but notbe visible. In this thesis, a similar approach to Markov Chain Two is used. We assumethat track existence and track visibility are separated and decoupled. Let the predictedexistence probability at time k be denoted εtk, and the predicted visibility probability bedenoted ηtk. The prediction equations for track existence and visibility are given by[

εtk1− εtk

]=

[pε11 pε12

pε21 pε22

] [εtk−1

1− εtk−1

](3.52)

[ηtk

1− ηtk

]=

[pη11 pη12

pη21 pη22

] [ηtk−1

1− ηtk−1

]. (3.53)

The Markov chain coefficients must satisfy

pε11 + pε12 = pε21 + pε22

= pη11 + pη12

= pη21 + pη22+ = 1.

(3.54)

By assuming there is a known number of established targets, no target births can occur,and the Markov model is given by the survival of existing targets.

39


Association Events and Posterior Conditional Densities

As described in section 3.3.2, the association hypotheses probabilitiesP{ak|Z1:k}must befound in order to perform the state update for each target. Using Bayes rule, the associationhypotheses probabilities can be written as

P{ak|Z1:k} = P{ak|Zk,mk,Z1:k−1}∝ p(Zk|mk,ak,Z1:k−1

)P{ak|mk,Z1:k−1}.

(3.55)

Before we find concrete expressions for the terms in (3.55), let’s first express the posteriorevent-conditional expected states and covariances as

xt,ak(t)k =

{xtk if ak(t) = 0

xtk + Wtk

(zak(t)k −Hxk

)if ak(t) > 0

(3.56)

Pt,ak(t)k =

{Ptk if ak(t) = 0(

I−WtkH)Ptk if ak(t) > 0.

(3.57)

In words, this means that a KF update is used if measurements are associated to the tar-get, and the prediction is used if not. The joint event-conditional posterior can then beexpressed as

p(x1k,x

2k, . . . ,x

nk |ak,Z1:k) =

n∏t=1

N(xtk; x

t,ak(t)k ,P

t,ak(t)k

). (3.58)

Association Events and Posterior Probabilities

Next, define the likelihood ratio

lt,ak(t) =1

λN(zak(t)k ; Hx

t,ak(t)k ,HP

t,ak(t)k HT + R

). (3.59)

Let the number of clutter measurements hypothesized by the current association hypoth-esis be ϕ. The first term on the right-hand side of (3.55) can then be expressed as [2]

p(Zk|mk,ak,Z1:k−1

)=

1

V ϕ

∏t:ak(t)>0

[λlt,ak(t)

](3.60)

where V is the volume of the surveillance region. To find the second term of (3.55),first define the individual track-wise detection events τ (t) as τ (t) = 1 if target t is de-tected, τ (t) = 0 otherwise. Further define the track-oriented configuration as a vectorcontaining the individual track-wise detections, τ = [τ (1), . . . , τ (n)]

T . The vector τcontains information about which targets are detected, but contains no information about

40

3.3 Object Tracking

measurement-to-target associations. The unconditional probability of the event τ is givenentirely by the individual detection events,

P{τ} =∏

t:ak(t)=0

(1− εtkP tDηtk

) ∏t:ak(t)>0

εtkPtDη

tk. (3.61)

Conditioned on τ , we have the following probabilities:

P{mk|τ} = eλV(λV)ϕ

ϕ!(3.62)

P{ak|τ ,mk} =ϕ!

mk!. (3.63)

The first probability follows from the fact that the number of measurements not accountedfor by τ , the number of clutter measurementsϕ is Poisson distributed with rate λV , and thenumber of measurementsmk are uniquely determined by ϕ and τ . The second probabilityfollows because given τ , we have mk!/ϕ! equally likely permutations of the detectedmeasurements, each constituting a separate association hypothesis. The second term of(3.55) can be rewritten as

P{ak|mk} = P{ak, τ |mk}∝ P{ak|τ ,mk}P{mk|τ}P{τ}

(3.64)

The association hypotheses probabilities P{ak|Z1:k} can now be found by combining(3.61)-(3.63) and (3.64) into one equation, where after cancelling the V -terms and movingthe terms involving λ into the proportionality constant, we are left with

P{ak|Z1:k} ∝∏

t:ak(t)=0

(1− εtkP tDηtk

) ∏t:ak(t)>0

εtkPtDη

tklt,ak(t). (3.65)

The total power of λ is always mk, since one λ is contributed by every target in the likeli-hood and every false alarm in the prior probability contribute one λ, explaining the ratio-nale behind moving it into the proportionality constant.

Marginal Association Probabilities

The key principle of the JIPDA is to merge together different hypotheses, such that in theend one is left with a single Gaussian for each track. This requires the marginal probabili-ties

βtj,k ∝∑

ak s.t. ak(t)=j

P{ak|Z1:k}. (3.66)

Once the marginal probabilities are calculated, the posterior Gaussian approximation foreach track can be found according to

ptk(xtk) ∝mk∑j=0

βtj,kN(xtk; xt,jk ,Pt,j

k

)≈ N

(xtk; xtk,P

tk

). (3.67)

41


The state and covariance update equations are subsequently done in the same way as forthe PDAF, with (3.38) and (3.40).

Posterior Existence and Visibility Probabilities

The final step in the JIPDA tracking algorithm is to update the existence and visibilityprobabilities. First, calculate the quantities

εt0,k =

(1− PDηtk

)1− εtkPDηtk

εtk (3.68)

ηt0,k =

(1− PD

)1− PDηtk

ηtk (3.69)

representing the posterior existence and visibility probabilities conditioned on no mea-surement originating from target t. The marginal existence and visibility probabilities cansubsequently be found by

εtk = βt0,kεt0,k +

mk∑j=1

βtj,k (3.70)

ηtk =1

εtk

(βt0,kε

t0,kη

t0,k +

mk∑j=1

βtj,k

). (3.71)

42

Chapter 4Implementation

In order to implement the target tracking system on the ReVolt model ship, some practicalissues must be given some attention. This chapter describes the integration of the sensorsin the ReVolt model ship, the intrinsic calibration of the camera, as well as the extrinsiccalibration between the camera and the lidar. Furthermore, the coordinate systems used aswell as the transformation of measurements to a common world frame is described. Theconvolutional neural network used to detect ships in images is briefly explained, as wellas some details concerning the implementation and use of the detector in MATLAB. Theclustering method used on the lidar data is described, and at the end the complete targettracking framework used in this thesis is presented, including the motion model for thetargets as well as the sensor models used.

4.1 Sensors, Hardware and Processing Pipeline

The camera used on ReVolt is a single Point Grey Blackfly GigE color camera, with a 1/3”Sony ICX445 CCD sensor and a resolution of 1288×964 pixels. The camera is equippedwith a fixed focal length low-distortion wide-angle lens, with a 125° field of view whenpaired with a 1/3” sensor. The camera is powered by Power over Ethernet (PoE), and ex-poses 6 general purpose I/O pins which can be used to control the operation of the camera.The key specifications for the camera is listed in table 4.1. The specifications are pulledfrom the technical reference manual for the camera [48]. The lidar is connected to thesystem via its interface box, providing connections for power and Ethernet. The specifica-tions for the Velodyne VLP-16 were listed in section 2.3.1, and will not be repeated here.The existing control system on ReVolt runs on the Robot Operating System (ROS) [49],this makes ROS a natural choice for the sensor drivers and data acquisition in this thesis

43

Chapter 4. Implementation

Model BLFY-PGE-13S2C-CS

Resolution 1288× 964 pixels (1.3 Megapixels)

Framerate 30 frames per second

Sensor SONY ICX445, CCD, 1/3”

Readout Method Global shutter

Interface GigE PoE

Lens Mount CS-mount

Table 4.1: Point Grey Blackfly specifications.

as well.

4.1.1 Sensor Integration in ReVolt

The two sensors used in this thesis was integrated into the existing system on board theReVolt model ship. The ReVolt is equipped with an on-board computer running the exist-ing control system, as well as sensors used in the navigation system. The ships positionis provided by a Hemisphere VS330 GNSS Compass with real-time kinematic (RTK) dif-ferential GNSS, giving a typical position accuracy of < 0.1 meters [50]. The ship is alsoequipped with a Xsens MTi-G-710 GNSS/INS inertial navigation system, used to providethe orientation of the ship relative to a local North East Down (NED) reference framefollowing the ship.

The ReVolt model has two targa hoops mounted on the deck, one close to the bow and theother close to the stern of the ship. Originally these targa hoops were used as mounts forthe two antennas of the Hemisphere GNSS, but they also provide a convenient mountingplacement for the sensors. The mechanical workshop at the Department of EngineeringCybernetics (ITK) at NTNU designed and built a mounting bracket for the lidar and thecamera, and performed some modifications to the fore targa hoop to accommodate thesensor bracket. The sensor bracket was mounted as shown in figure 4.1. The lidar wasplaced on top, to give it a clear, unobstructed view of the surroundings. The camera wasmounted directly below, looking straight ahead. Both the camera and the lidar send theirdata over a network connection. The camera sends its data as UDP packets, while the lidarsends its data as TCP packets. The camera is also powered via the Ethernet cable throughPoE. For this reason a network switch with PoE capabilities was used in connecting thesensors to the existing system. The interconnections between the sensors and the existingcontrol system is displayed in figure 4.2. Due to the high bandwidth of the data streamsfrom both the lidar and the camera, and the limited processing power available on theon-board computer, the software drivers for the camera and lidar was run on a separatelaptop which was installed in ReVolt. The specifications of the laptop is listed in table 4.2.Both the GNSS receiver and the Xsens INS send their data to the on-board computer via a

44

4.1 Sensors, Hardware and Processing Pipeline

Figure 4.1: The sensor mounting bracket on the fore targa hoop.

On-boardcomputer

Network Switch

Wireless AP

Laptop

Vector VS330

IMU

Point Grey Blackfly

RS-232

Ethernet LAN

5V Pulse

VLP-16

Figure 4.2: Physical interconnections.

serial RS-232 connection. The on-board computer is connected to a wireless access point,providing a connection to the system from the outside. The new PoE network switch wasconnected to this access point, providing a connection between the laptop processing thelidar and camera data and the on-board computer. The Xsens INS also has a syncout pinproviding a means to send trigger signals to other devices at a set interval. This triggersignal was connected to the trigger input of the camera for synchronization, and set up totrigger at a frequency of 10 Hz.

45


Model HP zbook 15

CPU Intel Core i7-4800MQ 8×2.7 GHz

Memory 8 GB RAM

Graphics Quadro K2100M/PCIe/SSE2

Storage 250 GB Solid State HDD

Operating System Linux Ubuntu 16.04 LTS (Xenial)

Table 4.2: Laptop computer specifications.

4.1.2 Software Drivers and Synchronization

The software drivers used for the sensors are based on open-source ROS packages devel-oped by the ROS community. The driver for the camera is the pointgrey camera driverpackage [51], which depend on the FlyCapture2 SDK [52] for interfacing with the cam-era. If the FlyCapture2 SDK is not installed on the target computer, the ROS package willdownload and install it automatically during the build process. The camera driver allowsthe user to configure parameters such as the triggering mode, shutter duration, frame rateand so on through a launch file. Each time a new image is received, the driver publishthe image onto the ROS network as a sensor_msgs/Image topic. The driver for thelidar is the velodyne driver package [53], which continuously receive and decode the TCPpackets from the lidar. The driver converts the measurements received from polar to carte-sian coordinates in the lidar frame of reference, as in (2.27) in section 2.3.1. All the pointsreceived corresponding to a full 360° scan are subsequently published to the ROS networkas a sensor_msgs/PointCloud2 topic. Since the sensors in the navigation systemas well as the lidar sends their data continuously, outside the users control, the time-stampsupon reception in ROS are used to temporally align the data. The camera can be triggeredexternally by the user, and this was set up to trigger upon the reception of the first TCPpacket in the first scan received from the lidar, triggering at the same frequency as the lidar.Both the lidar and the camera have a sampling frequency of 10 Hz.

4.2 Camera Calibration

In order to associate world points with pixel values in an image, the intrinsic calibrationparameters and the radial distortion parameters described in section 2.1.1 is needed. ROShas a built-in camera calibration software package, which allows easy calibration of a cam-era using a checkerboard calibration target. The camera calibration package uses OpenCVcamera calibration [23], a description of the software and its functions can be found in[54]. The intrinsic camera model is inspired by Heikkila and Silven [21], while the maincalibration procedure is inspired by Zhang [17]. Using the calibration software is straight-forward; the user must provide a checkerboard pattern with squares of known dimensions,

46

4.3 Lidar-Camera Calibration

and while the camera publishes images as a ROS topic of type sensor_msgs/Image,the camera calibration can be started by running

$ rosrun camera_calibration cameracalibrator.py--size 10x7 --square 0.100image:=/my_camera/image camera:=/my_camera

The size parameter tells the calibration software how many inner corner points there are inthe checkerboard. In this example, a 11×8 checkerboard was used, where there are 10×7inner corners. The square parameter describes the side length of the checkerboard squaresin millimeters. The image parameter refers to the topic name where the camera images arebeing published, while the camera parameter refers to the namespace of the camera nodein ROS. The camera matrix was found as

K =

αx 0 x0

0 αy y0

0 0 1

=

344.033691 0 624.0137220 343.379790 482.9677920 0 1

, (4.1)

and the distortion parameters were[k1 k2 p1 p2

]=[0.007694 −0.004260 0.000716 −0.000181

], (4.2)

where k1, k2 are the radial distortion parameters, and p1, p2 are the tangential distortionparameters.

Figure 4.3: A screenshot of the camera calibration procedure in ROS.


In order to associate the coordinates of measurements from the lidar with pixels in imagescaptured with the camera, a transformation between the two coordinate systems must befound. The camera coordinate system is described in section 2.1.1 and illustrated in figure

47


Figure 4.4: The camera and lidar mounted together on the ReVolt scale model ship.

Figure 4.5: The lidar and camera coordinate systems mounted on ReVolt.

2.1, while the lidar coordinate system is described in section 2.3.1 and illustrated in figure2.11. The sensors are mounted together as illustrated in figure 4.4. The rigid-body trans-formation between the lidar and the camera frame of reference consists of a rotation anda translation, for a total of 6 extrinsic calibration parameters. Dhall et al. [55] proposedan accurate and repeatable method to estimate the extrinsic calibration parameters using3D-3D correspondences, which is used in this thesis. The method uses augmented realitymarkers to generate 3D points in the camera frame of reference. The proposed method usesArUco markers [56] attached to rectangular planar surfaces, which given a calibrated cam-era provides the rotation and translation between the camera and the center of the marker.

48


The ArUco markers are special encoded binary patterns that facilitate error detection andcorrection in the marker patterns themselves. Two examples of ArUco markers are shownin figure 4.6. Since the marker size is known, once a marker has been detected it is possibleto estimate its pose (with respect to the camera) by iteratively minimizing the reprojectionerror of the corners of the marker. The authors of [56] provides this functionality in a freeopen-source software library [57]. The authors of [55] provides their calibration softwareas a ROS package. A detailed description on how to use the package can be found in [58].

Figure 4.6: An example of two ArUco markers. The left marker has marker ID 10, the right markerhas marker ID 100.

4.3.1 Calibration Setup

In the calibration, two ArUco tags were printed and attached to two cardboard planes ofknown dimensions. To ensure that enough points from the lidar was hitting the cardboardplane such that a line could be fitted, the cardboard planes were hung at a distance ofapproximately 1.5 meters, and rotated approximately 45° relative to the floor. At thisdistance, the cardboard planes covered almost the entire vertical spread of the laser beamsfrom the lidar. The physical calibration setup is shown in figure 4.7. All distance measuresgiven in this section have units of meters.

Figure 4.7: The lidar-camera calibration setup.

49


3D Points in the Camera Frame

Given that the 3D locations of the center of the ArUco markers are known, as well as thedimensions of the marker plane, the location of the corners of each plane can be calculated.The ArUco tags give the rotation and translation, [R|t], between the camera and the centerof the ArUco marker. The location of the corners in the markers frame of reference (wherethe marker plane is the plane z = 0, and the origin is in the center of the marker) isknown, and the corner points can be transformed to the cameras frame of reference usingthe transformation found through the ArUco marker.

3D Points in the Lidar Frame

For the lidar, the corner points of the marker plane is found by first detecting the edgesof the cardboard. For each edge of the marker plane, the user draws a polygon aroundthe points corresponding to a single edge, for a total of 8 edges when using two ArUcomarkers. Random Sample Consensus [59] (RANSAC) is then used to fit lines to eachedge, using the points marked by the user. When the line equations for the four edges arefound, the corners of the marker planes can be found as the intersection of the edge lines.These lines may not intersect exactly, in this case the midpoint of the shortest line segmentbetween the two lines is used as the corner point.

Since the point correspondences are known, a closed-form solution to the transformationbetween the coordinate frames can be found. The calibration procedure uses the Kabschalgorithm [60] to find the rotation between the two point clouds, and the translation can befound once the rotation is known. The initial translation between the vector sets describingthe respective points is removed by translating the centroid of each vector set to the origin.Let Pi denote the vector of the ith point in the lidar frame, and Qi denote the vector of theith point in the camera frame. Let the centroid of each vector set be denoted by P and Q.The rotation from the lidar frame to the camera frame can then be found by solving

R = arg minR∈SO(3)

n∑i=1

∣∣∣∣(R(Pi − P )− (Qi − Q))∣∣∣∣2 (4.3)

where SO(3) denotes the 3D rotation group, ensuring that the solution is a valid rotationmatrix. Now let P and Q be two matrices, where the ith column is the vector Pi − P andQi − Q, respectively. Compute the cross-covariance matrix H as

H = PTQ. (4.4)

Next, calculate the singular value decomposition (SVD) of H ,

H = USV T . (4.5)

To decide whether or not we need to correct our rotation matrix to ensure a right-handedcoordinate system, compute the determinant

d = detV UT . (4.6)

50


The optimal rotation matrix can then be calculated as

R = V

1 0 00 1 00 0 d

UT . (4.7)

Once the rotation between the two coordinate frames is found, the translation between theframes can be found as the distance between matched points in the aligned frames.

Calibrated Values

The rotation matrix between the lidar frame and the camera frame was found using thecalibration routine as

Rcl =

0.764258 −0.644885 0.005757090.0236893 0.0191512 −0.9995360.644476 0.764039 0.0299133

(4.8)

and the translation from the camera frame to the lidar frame (in the camera frame of refer-ence) was

tcl =

−0.00117−0.10957−0.10682

(4.9)

Using homogeneous coordinates, this can be combined into a homogeneous transforma-tion matrix

T cl =

[Rcl tcl

01×3 1

](4.10)

A screenshot from ROS showing the camera view as well as the point cloud during thecalibration is shown in figure 4.8.

51


Figure 4.8: The camera view of the ArUco markers along with the 3D point cloud from the lidar,displayed in RViz (ROS).

4.4 Transformation to a Common World Frame

Section 4.3 described the method used to find the transformation between the lidar and thecamera frame of reference. The measurements received by the camera and the lidar aretransformed to a common North East Down (NED) world frame, defined relative to theEarth’s reference ellipsoid. The z-axis points downward normal to the tangent plane of theellipsoid, and the x-axis points towards true north. The y-axis points eastward, completingthe right-handed coordinate system.

The ReVolt model ship has a defined BODY coordinate system, which moves and rotateswith the ship. As described in the previous section, the transformation between the lidarand the camera was found with a calibration procedure. What remains is to find the trans-formation between the camera frame and the BODY frame, as well as the transformationbetween the BODY and the NED frame. As illustrated in figure 4.5, the BODY x-axis isaligned with the camera z-axis, and the body y-axis is aligned with the camera x-axis. Therotation from the camera frame to the BODY frame is therefore

Rbc =

0 0 11 0 00 1 0

. (4.11)

The translation between the origin of the camera frame and the origin of the BODY framewas hand measured to be

tbc =

0.870

−0.36

(4.12)

in the BODY frame of reference. The transformation matrix between the camera and the

52

4.4 Transformation to a Common World Frame

BODY frame is thus

T bc =

[Rbc tbc

01×3 1

]. (4.13)

The units of translation are in meters. The navigation system onboard ReVolt measuresthe roll (φ), pitch (θ) and yaw (ψ) angles of the BODY frame relative to the NED frame.The roll, pitch and yaw angles are illustrated in figure 4.9. The rotation matrix from the

Figure 4.9: Roll, pitch and yaw rotations of the BODY frame.

BODY frame to the NED frame is given as [61]

Rnb =

cψcθ −sψcφ + cψsθsφ sψsφ + cψsθcφsψcθ cψcφ + sψsθsφ −cψsφ + sψsθcφ−sθ cθsφ cθcφ

(4.14)

where cφ is cos(φ) and sφ is sin(φ). The position of the origin of the BODY frame is givenby the navigation system in geodetic coordinates on the World Geodetic System (WGS)84 reference ellipsoid as latitude, longitude and altitude. This must then be transformedto a position in the local NED frame. Let µ0, l0 and h0 be the latitude, longitude andheight of the origin of the local NED frame in geodetic coordinates, respectively. Further,let µb, lb and hb be the latitude, longitude and height of the origin of the BODY framein geodetic coordinates. The transformation to NED coordinates starts by calculating theEarth-centered Earth-fixed (ECEF) cartesian coordinates of both the origin of the localNED frame as well as the BODY frame by [61]xeye

ze

=

(N + h) cos(µ) cos(l)(N + h) cos(µ) sin(l)( r2p

r2eN + h

)sin(µ)

, (4.15)

where N is the prime vertical radius of curvature,

N =r2e√

r2e cos2(µ) + r2

p sin2(µ), (4.16)

where re = 6378137m and rp = 6356752m are the equatorial and polar Earth radii, whichare WGS-84 parameters. The position of a point in a local NED frame can now be found

53


by the rotation xnynzn

= Rne

xe,b − xe,0ye,b − ye,0ze,b − ze,0

(4.17)

where

Rne =

− sin(µ0) cos(l0) − sin(l0) − cos(µ0) cos(l0)− sin(µ0) sin(l0) cos(l0) − cos(µ0) sin(l0)

cos(µ0) 0 − sin(µ0)

(4.18)

and xe,0, ye,0 and ze,0 are the ECEF coordinates of the origin of the NED frame, and xe,b,ye,b and ze,b are the ECEF coordinates of the origin of the BODY frame. Now that theposition of the origin of the BODY frame in the NED frame is found, the transformationof points in the BODY frame to the NED frame can be expressed as

T nb =

[Rnb tnb

01×3 1

](4.19)

where tnb is the position of the BODY frame origin in the NED frame, found by equations(4.15)-(4.17). To summarize, the transformation from the lidar frame of reference to thelocal NED frame is found by the compound transformation

T nl = T nb T bc T cl (4.20)

and the transformation from the camera frame of reference to the local NED frame is foundby the transformation

T nc = T nb T bc . (4.21)

4.5 Visual Detection based on Faster R-CNN

The visual detection of boats in images used in this thesis is based on the frameworkof Faster R-CNN, developed by Ren et al. [62]. This framework was used due to theavailability of pre-trained networks developed during a Master thesis project at NTNU[63]. A brief introduction to the framework of Faster R-CNN is given in this section, somebackground on how the implemented model was trained, and some details concerningthe implementation of the network in MATLAB. For a detailed explanation of the fullframework of Faster R-CNN the reader is referred to [62].

4.5.1 Faster R-CNN

Faster R-CNN is a region-based CNN, employing Region Proposal Networks (RPN) toform a single, unified network for object detection. Faster R-CNN is composed of twomodules. The first module is a deep fully convolutional neural network which proposesregions, and the second module is the Fast R-CNN detector [64] that uses the proposed

54


Figure 4.10: Faster R-CNN as a single unified network for object detection. The figure is adaptedfrom [62].

regions in classification. Figure 4.10 illustrates the system of Faster R-CNN. The RegionProposal Network takes an image of any size as inputs, feeding it through convolutionallayers to generate feature maps. The area of activation is then fed into the RPN, whichoutputs object proposals in the form of rectangular regions, each with a objectness score.The objectness score is a measure of membership to a set of object classes versus back-ground. The rectangular regions are subsequently fed into the classification layers wherethe classification score is obtained. An important feature of Faster R-CNN is that the RPNshares convolutional layers with the object detection networks, reducing the cost for com-puting proposals, making it attractive for real-time applications, such as remote sensing foran ASV. In their paper, Ren et al. investigated two models for the classification network,the 5 convolutional layer deep Zeiler and Fergus model [65] dubbed ZF-net, and the 16convolutional layer deep Simonyan and Zisserman model [66], dubbed VGG-16. In thisproject, the VGG-16 model is used, as this network achieved the best results in the Masterthesis work of Tangstad [63].

4.5.2 Training and validation data

The network used in the thesis is the best-performing model developed in Tangstads thesiswork. This model is trained as a single-class classifier, trained to detect boats in images.The network was trained using the 2007 and 2012 VOC (Visual Object Challenge) [67]dataset and a selection of images pulled from the image database imagenet [68]. In theimages used from these image sets, the only ground-truth label and bounding boxes areof the boat-class. In addition, a custom dataset was generated during the thesis work,with images of boats from the Trondheim harbor and at open sea, which was included inthe training and test sets. For details on the training, and the performance of the trained

55


network, the reader is referred to [63].

4.5.3 Implementation Aspects

A complete MATLAB implementation of Faster R-CNN is available on Github. In order torun Faster R-CNN, there are some software requirements that need to be met. The stepsare listed in the Github repository for the framework [69], and will be repeated here forconvenience. The steps are setup-dependent, depending on whether or not the frameworkwill be executing on a GPU or a CPU, and also on the version of GPU used. The FasterR-CNN implementation available at the Github repository uses Caffe [70], which is adeep-learning framework utilizing Nvidia CUDA for parallel computing, developed byBerkeley AI Research. Caffe is well documented online, with installation instructions,tutorials and more [71].

In order to use Faster R-CNN, a Caffe MEX file is needed. A MEX file is a dynamicallylinked subroutine that the MATLAB interpreter loads and executes, enabling MATLAB toutilize Caffe routines. A pre-compiled Caffe version for Nvidia CUDA version 6.5 isavailable in the repository, but for newer Nvidia GPUs a newer version of CUDA willneed to be compiled along with Caffe. If a new compiling of Caffe is necessary, all thesteps needed are provided in the Faster R-CNN branch of Caffe on Github [72]. If a GPUis utilized for parallel computing, a GPU with minimum 8 GB of memory is needed to runthe VGG-16 version. In this project Caffe was recompiled with CUDA version 9.0, due toCUDA version 6.5 not supporting the newer Pascal architecture on the Nvidia GTX 1070graphics card used. MATLAB is also needed to run the Faster R-CNN implementation.Once all requirements are met, the steps necessary before testing are

1. Run faster_rcnn_build.m

2. Run startup.m

Once this is done, the user must provide a valid network (pre-trained networks are availableon the Github repository), and the model is ready for use.

4.5.4 Image Preprocessing

Before the images captured by the camera were fed into the Faster R-CNN classifier, somepreprocessing was necessary. The images captured by the camera suffered from vignetting[73], causing the image brightness to drop in the image periphery. Figure 4.12a shows anexample of an image captured, where the edges of the image are very dark compared tothe center of the image. To remedy this situation, the Hadamard product (also knownas the Schur product) between the image and a matrix C with dimensions matching the

56


image resolution was taken. The Hadamard product between two matrices is the element-wise multiplication of the entries of each matrix. The matrix C approximates a invertedGaussian function, where the ijth entry was given by

Cij = 3.5− 3.5 exp(− (i− 483)2 + (j − 624)2

2× 3502

)+ 1.5. (4.22)

The image consists of three color channels, and for each color channel, the new correctedimage Ic is given by

Ic = I ◦C (4.23)

where ◦ denotes the Hadamard product. The inverted Gaussian function used in the cor-rection is illustrated in figure 4.11. The center of the Gaussian (483, 624) coincides with

Figure 4.11: Inverse Gaussian function used to brighten edges of images.

the principal point in terms of pixel dimensions, found through the camera calibration. Theother values in (4.22) were found through manual tuning of the parameters and visualizingtheir effect on the image. The effect of multiplying the image with the inverted Gaussianmatrix can be seen in figure 4.12b. As seen in the figure, the vignetting effect is severelyreduced.

(a) Image suffering from vignetting. (b) Corrected image.

Figure 4.12: Darkened edges in the image caused by vignetting, and the corresponding correctedimage.

57


4.6 Lidar Segmentation

The tracking filter employed in this thesis is based on the point target assumption, meaningthat each target generate at most one measurement. The lidar generate a point cloud withpotentially thousands of points in each scan, where a single target could be the source ofhundreds of points. Figure 4.13 illustrates an example where a ship was in close proximityto the lidar, resulting in many point measurements. For this reason, clustering is performed

-2 -1 0 1 2 3 4 5 6 7 8N [m]

3

4

5

6

7

8

9

10

E [m

]

2D Point Cloud

Figure 4.13: Lidar returns from a ship in close proximity, projected down to the x-y plane.

on the raw point cloud, to maintain the at-most-one-measurement assumption in the track-ing framework. Clustering is the process of partitioning a finite unlabeled data set into afinite set of clusters, where the data in each cluster are similar in some sense [74]. For thetask of clustering points originating from a target, it makes sense to assume that points thatare spatially close to one another belong to the same object.

While there are many different clustering techniques, they can roughly be classified aseither hierarchical clustering or partitional clustering [75]. Hierarchical clustering meth-ods partition the data into a nested series of partitions based on a criterion for mergingor splitting clusters based on similarity. Partitional clustering methods on the other hand,identify the partition that optimizes a clustering criterion, with no hierarchical structure.In this thesis, a density-based partitional clustering approach is employed. Density-basedmeans that for each point in a cluster the neighborhood of a given radius must contain atleast a specified minimum number of points. This makes the clustering algorithm robustagainst random noise, where the hypothesis is that random noise will seldom appear asdense clusters, but rather randomly spread out. The algorithm used is Density Based Spa-tial Clustering of Applications with Noise (DBSCAN) by Martin Ester et al. [76]. The

58


DBSCAN algorithm is a well known and much used clustering algorithm, one examplebeing [77], where they use DBSCAN to cluster lidar data to identify road lane markings.

4.6.1 The DBSCAN Algorithm

Before we delve into the algorithm itself, some concepts central to the algorithm must beexplained. The concepts and algorithm presented in this section is based on the originalpaper by Martin Ester et al. [76] as well as [78]. The DBSCAN algorithm relies on twoparameters; ε, which determines how close points should be to be considered a part of thesame cluster, and minPts, which determine how many neighbors within a ε neighborhooda point should have to be considered a part of the same cluster. Furthermore, it uses theconcept of direct density reachability, density reachability, and density connectivity.

A point p is direct density reachable from a point q with respect to ε, minPts if p is in theε-neighborhood of q, and the ε-neighborhood of q contains at least minPts points.

A point p is density reachable from a point q with respect to ε, minPts if there is a chainof points p1, . . . , pn, p1 = q, pn = p such that pi+1 is direct density reachable from pi.

Two border points in the same cluster may not be density reachable from each other, sincethe core point condition might not hold for both. Two points p and q are considered densityconnected if there is a point o such that both p and q are density reachable from o withrespect to ε and minPts. DBSCAN also use the definition of core points and border points.A core point is a point inside the cluster with at least minPts points in the ε-neighborhood,while a border point is, as the name implies, a point on the border of the cluster where thiscondition may not hold. These concepts are illustrated in figure 4.14, where points labeledC are core points, points labeled B are border points, and the point N is considered asnoise. In this example, minPts=3, and the ε radius for each point is illustrated with circles.Direct density reachability is indicated with arrows.

The basic steps of extracting a single cluster in the DBSCAN algorithm can be summa-rized as follows: First select an arbitrary point from the dataset satisfying the definition ofa core point as a seed. Second, retrieve all points that are density reachable from the seedobtaining the cluster containing the seed. The DBSCAN algorithm is illustrated as pseu-docode in Algorithm 1. The algorithm is adapted from [78]. The RangeQuery-functionfinds all neighbors of the point p within the radius ε. The distance used in the range querydoes not necessarily have to be euclidean distance, however for the application of cluster-ing lidar points, euclidean distance makes the most sense, and is what is used in this thesis.

59


Figure 4.14: Illustration of the DBSCAN cluster model. The figure is adapted from [78].

input : Dataset P , Radius ε, Density threshold minPts, Distance function distoutput: Point labels label

1 foreach point p in P do // Iterate over all points2 if label(p) 6= undefined then // Skip prev. visited points3 continue4 end5 Neighbors N← RangeQuery (P, p, dist, ε); // Get neighbors6 if |N| < minPts then // Label as noise7 label(p)← Noise;8 continue9 end

10 c← next cluster label; // Start new cluster11 label(p)← c;12 Seed set S← N \{p}; // Expand neighborhood13 foreach q in S do // Iterate over all neighbors14 if label(q) = Noise then label(q)← c;15 if label(q) 6= undefined then continue;16 Neighbors N← RangeQuery (P, q, dist, ε);17 label(q)← c;18 if |N| < minPts then continue; // Core-point check19 S← S ∪ N20 end21 end

Algorithm 1: The DBSCAN Algorithm

60


4.6.2 The Modified Distance Parameter

Since the laser beams from the lidar get spread out as the range increases, the constantparameter ε was replaced with a function which calculates ε based on the range from thelidar to the point. Close targets will give many returns closely spaced together, whilereturns from targets further away will be more spread out. A similar idea was proposed by[77]. In this thesis, it was observed that a constant ε led to over-segmentation of targetswhen they were close, where the boats were clustered together with wakes. Increasing εlinearly with the range led to the same problem, where two separate targets were clusteredtogether despite being separated by several meters, due to the large clustering radius ε. Anon-linear function was therefore used. The distance function used in this thesis is

ε(r) =1

2ln(r) (4.24)

where r is the range from the lidar to the point in consideration. This function was foundthrough experimentation. The function for r ∈ [0, 100] meters is shown in figure 4.15. The

0 10 20 30 40 50 60 70 80 90 1000

0.5

1

1.5

2

2.5

Figure 4.15: ε(r) from 0 to 100 meters.

distance function doesn’t make sense for r ≤ 1, however the lidar does not give returnsbelow r = 2 meters, to avoid returns originating from the ReVolt ship itself. Figure 4.16shows the result of clustering the points in figure 4.13 using DBSCAN, with minPts=3.The convex hull of the cluster is drawn as the red polygon, with its center as the red dot.As can be seen in the figure, all points belonging to the boat are captured in the cluster. Infigure 4.17, a target at greater distance is clustered with DBSCAN. The returns from thelidar are more spread out in this case.

61


-2 -1 0 1 2 3 4 5 6 7 8N [m]

3

4

5

6

7

8

9

10

E [m

]

2D Point Cloud

Figure 4.16: Lidar returns from figure 4.13, clustered with DBSCAN.

-6 -5 -4 -3 -2 -1 0 1 2N [m]

26

27

28

29

30

31

32

E [m

]

2D Point Cloud

Figure 4.17: Lidar returns from a target at distance, clustered with DBSCAN.

4.7 Target Tracking Framework

This section aims to tie together the material presented thus far, describing the completeimplemented target tracking system. First, the motion and sensor models used are pre-

62


sented, before going into some detail on the JIPDA implementation. The whole system ata high level is illustrated in figure 4.21.

4.7.1 Motion and Sensor Models

Constant Velocity Motion Model

The motion model used in the target tracking framework is the constant velocity model.The state vector of a target in the constant velocity model is given as x = [N,E, VN , VE ]

T ,where N , E, VN and VE are the north and east positions and velocities of the target in astationary world-fixed reference frame, respectively. The constant velocity (white noiseacceleration) model in discrete time can be written as [41]

xk+1 = Fcvxk + wk p(wk) = N (wk; 0,Qcv) (4.25)

where the state transition matrix Fcv and process noise covariance matrix Qcv is given as

Fcv =

1 0 T 00 1 0 T0 0 1 00 0 0 1

(4.26)

Qcv = σ2cv

T 4/4 0 T 3/2 0

0 T 4/4 0 T 3/2T 3/2 0 T 2 0

0 T 3/2 0 T 2

(4.27)

where T = t(k + 1) − t(k) is the system sample time. The process noise strength σcv isselected according to the targets’ expected maneuverability. The name ”white noise accel-eration model” stems from the fact that accelerations along the north and east directionsare modeled as white noise.

Sensor Models

As described in section 4.1.2, the lidar software driver converts the measurements receivedto cartesian coordinates. The measurements received are given in the lidar frame of refer-ence, and transformed to the NED frame using (4.20). The object tracking system tracksobjects in the 2D North-East plane, so only the N and E coordinates of points are usedin tracking objects. The measurement model for the lidar is therefore a linear Cartesianmeasurement model of the form

zk,l = Hlxk + vk,l p(vk,l) = N (vk,l; 0,Rl) (4.28)

where vk is the measurement noise, and H is given by

Hl =

[1 0 0 00 1 0 0

](4.29)

63


The measurement noise covariance matrix Rl was determined based on the extent of thetargets used in the experiments described in the next chapter. Depending on the orientationof the target, the centroid of the cluster given by DBSCAN could wander along the lengthof the target in question. For example, if the aft of the target is directly facing the lidar, ascenario such as illustrated in figure 4.18 could occur. Here, the cluster centroid is centeredon the stern of the target. As the largest target has a length of approximately 7.1 meters,the measurements were assumed to be distributed along the length of the target with 5σprobability. The measurement noise covariance matrix was therefore chosen to be

Rl = σ2l

[1 00 1

](4.30)

with σ2l = 0.5. Experimental tuning also showed that this value for the measurement noise

covariance gave good tracking results in comparison with other values.

Figure 4.18: The cluster centroid centered on the stern of the target.

The camera measurement model is somewhat more complicated. As explained in section2.1.1, there is a loss of information when projecting 3D points into the 2D image plane,and without additional information, a measurement in the image plane can only give a lineof sight in the 3D world coordinate frame. Let xn = [N,E,D]

T be the predicted positionof a target in the NED frame. Furthermore, let the coordinates in the camera frame begiven by xc = [xc, yc, zc]

T . The line-of-sight angle relative to the z-axis in the cameraframe of reference is then given by (see figure 4.19)

ψc = tan−1

(xczc

). (4.31)

In order to update the innovation covariance, the Jacobian in terms of the state in NEDcoordinates is needed, as described in section 3.2.4. In terms of NED coordinates, themeasurement can be expressed using the inverse transformation of (4.21). If

T nc =

[Rnc tnc

01×3 1

](4.32)

is the transformation from the camera frame to the NED frame, then the inverse transfor-mation is given by

T cn =

[(Rnc )T −(Rnc )T tnc01×3 1

]=

[Rcn tcn

01×3 1

](4.33)

64


Multiplying (4.33) with the position in NED coordinates, where D is taken to be zero(since the plane D = 0 is the plane parallel to the sea level, where the targets are tracked),and substituting into (4.31), we get

ψc = tan−1

(r11N + r12E + tnn,1r31N + r32E + tcn,3

)(4.34)

where rij is the ijth entry ofRcn, tcn,i is the ith coordinate of tcn, andN and E are the northand east NED coordinates.

The measurement model for the camera, assuming the measurement noise enters addi-tively, can then be expressed as

zk,c = hc(xk) + vk,c p(vk,c) = N (vk,c; 0,Rc) (4.35)

where the nonlinear measurement function hc(xk) is given by (4.34). The Jacobian ofhc(xk) evaluated at the predicted state is used in computing the innovation covariancematrix, as described in section 3.2.4.

Figure 4.19: A point in the NED frame, and its relation to the line-of-sight angle in the cameraframe.

4.7.2 Fusing the Measurements

The images captured by the camera are run through the Faster R-CNN convnet, whichgives a set of bounding boxes with accompanying scores for each detected object in the

65


image. The center of each bounding box is projected to the normalized image plane usingthe inverse camera calibration matrix K−1, and the line-of-sight angle for the center ofthe bounding box for each detection can subsequently be found by ψdc = tan−1 xdc , wherexdc is the x-coordinate of the center of the dth bounding box in the normalized imageplane. In the normalized image plane we have that zc = 1. The association probabilitiesbetween the predicted measurement angle for each target and the line-of-sight angle(s)for each detection are then computed using JIPDA in the same manner as for the lidarmeasurements. Using these association probabilities to update the target state has not beenimplemented in this thesis, and remains as future work.

4.7.3 Implementing JIPDA

A key element of the JIPDA filter is the computation of the marginal association probabili-ties (3.66), as described in section 3.3.3. A complicating factor in this is that the number ofassociation hypotheses grows exponentially with respect to the number of measurementsand tracks. Based on the number of measurements and tracks, a complete calculation ofthe marginal association probabilities may be infeasible, and complexity-mitigating stepsshould be made.

Calculating the Marginal Association Probabilities

A complexity-mitigating step in the multi-target case is to cluster tracks that share mea-surements together, and approximate the marginal association probabilities within eachcluster. This avoids the computation of association probabilities that are unlikely, consid-ering only plausible associations. The concept of a cluster of tracks is illustrated in figure4.20. The number of tracks and measurements within each cluster may still make thecalculations infeasible, so approximations may be necessary. Several methods for approx-imative evaluation of the association probabilities exist, such as in [79], where they exploitthe fact that a lot of the structure is repeated among the different association hypothe-ses. This is however a patented method. The method used in the implemented trackingsystem is Murtys’ M-best method [80], which generates the M-most probable associationhypotheses for a given set of measurements and tracks.

Murtys’ M-Best Method

Murtys’ M-best method builds upon the classical assignment problem from combinatorialoptimization, by casting the problem of finding the most probable assigment hypothesisas a weighted bipartite matching problem. The nodes on one side are the measurements,while the nodes on the other side are the targets, and each arch gives the log-likelihoodthat a given measurement should be assigned to a given target. Finding the best hypothesisis then reduced to the problem of finding the assignment that maximises the sum of the

66


Figure 4.20: A set of tracks sharing measurements within their gate grouped into a cluster, indicatedby the shaded validation regions.

log-likelihoods of the given assignment, which can be found in polynomial time. Murtysmethod maintains a list of of problem/solution pairs. Each pair consists of an assignmentof measurements to tracks and a score matrix based on the association probabilities fromthe assignment hypothesis. The list is initialized with the initial problem to be solved, andin each iteration the best solution from the problem/solution pairs is found, which is thenremoved from the list. In the next iteration, the problem set contains the initial problemminus the previously found best solution, which will subsequently give the second-bestsolution. The method continues in this fashion until the M-best assignments are found, orthe list of problems is empty.

Track Management

The JIPDA filter builds on the assumption that tracks already exists, and does not considerthe problem of track formation (or termination). Note that the existence of tracks does notnecessarily imply that the targets exist. In this thesis, a simple, yet effective method fortrack formation is employed. The track formation procedure was inspired by [81], with amodification with regard to when tracks are considered confirmed. The formation of pre-

67


liminary tracks goes as follows: At each time step, measurements which are not associatedwith any tracks are stored to the next time step. At the next time step, for each measurementnot associated with any target, the distance to every previously non-associated measure-ment is calculated. If this distance is below a threshold, a new preliminary track is formedwith a set of predefined initial values.

Denote the set of unassociated measurements at the prevous time step as ¬Zk−1. Thecriterion for starting a new preliminary track can then be described as

|¬Zpk − ¬Zqk−1| < Tvmax p = 1, . . . , ϕk, q = 1, . . . , ϕk−1 (4.36)

where T = tk − tk−1 is the system sample time, and vmax is the maximum velocity ofthe target. If a set of measurements satisfy (4.36), the measurements are used to generatethe initial values of the new track. The initial existence probability is set to εtinit = 0.5,and the initial visibility probability is set to νtinit = 0.8. To avoid the formation of manyspurious tracks in cluttered situations, the fact that JIPDA calculates existence probabilitiesis exploited in track confirmation. A preliminary track is considered confirmed if εtk ≥ 0.8,and added to the list of confirmed tracks. In a similar fashion, tracks that have εtk ≤ 0.01are considered dead, and removed from the list of confirmed (or preliminary) tracks.

Moreover, tracks within a pre-specified distance threshold dT of one another are hypothe-sized to belong to the same target, and the track with the shortest lifetime is removed fromthe list of confirmed tracks.

Tracking Parameters

The JIPDA algorithm was run with the following set of parameters:

68


Parameter Value

σ2cv 0.5σ2l 0.5PG 0.99PD 0.90vmax 2.5εtinit 0.5νtinit 0.8dT 3 mpε11 0.99pε12 0.01pε21 0pε22 1pη11 0.9pη11 0.1pη11 0.48pη11 0.52λ 0.01

Table 4.3: Parameters used in the tracking system.

Raw 3D PointCloud

Transformationto NED frame

Land Removal& 2D

projection

NavigationSystem

DBSCANClustering

ClusteringParameters

MeasurementValidation andAssociation

Cluster Tracks

Measurement-to-trackAssociation and

Existence/VisibilityProbabilities

Track StateUpdate

TrackManagement

PreliminaryTracks

Unassociated Measurements

Dead Tracks

Track StatePrediction

Image

Rectification Faster R-CNN

Clusters

ImagePreprocessing

Transformationto Camera

Frame

MeasurementValidation andAssociation

Cluster Tracks

Measurement-to-track Association

andExistence/Visibility

Probabilities

NavigationSystem

List of ConfirmedTracks

Detections

Figure 4.21: The implemented tracking framework. Using camera measurements in the state updateremains as future work. This is indicated by the dashed red line.

69


70

Chapter 5Experimental Results andDiscussion

5.1 Experimental Setup

In order to evaluate the proposed target tracking and sensor fusion framework, a series ofexperiments with the purpose of gathering real sensor data from different scenarios wasperformed. All experiments were performed in the Dora harbor basin in Trondheim, seefigure 5.1. The satellite image was taken from Google maps.

Due to problems with the dynamic positioning system during the experiments, the ReVoltwas moored to the harbor during the maneuvers. The mooring spot is indicated by a circlein figure 5.1. In the following, the ReVolt will be referred to as the ownship, while thetargets will be referred to as target 1 and target 2. Two boats were used as targets, onePolarcirkel 560 work with a length of 5.6 meters (target 1), and one Nidelv 690 sport(target 2) with a length of 7.14 meters. An overview of the experiment area with the twotarget boats can be seen in figure 5.2.

As a step in the data post-processing, all returns originating from land or other mooredvessels in the area were manually removed from the lidar point cloud before they werefed into the tracking framework. In order to evaluate the performance of the trackingsystem, a position ground-truth (GT) for each target is helpful. This was generated bya mobile phone equipped with a GNSS receiver on each target, saving the position onceper second. The mobile phones do not have differential GNSS capabilities, so there issignificant uncertainty in the position data, however it will serve as an indication of theperformance of the tracker when visually comparing the GT tracks with the tracks output

71

Chapter 5. Experimental Results and Discussion

Figure 5.1: The experiment area in the Dora harbour basin. Maneuvers were performed inside themarked area.

Figure 5.2: An overview of the experiment area. Target 1 can be seen on the left, and target 2 is therightmost boat. Image courtesy of Tom Arne Pedersen, DNV GL.

72

5.1 Experimental Setup

(a) Polarcirkel 560 work. (b) Nidelv 690 sport.

Figure 5.3: The two boats used as targets. The image of the Polarcirkel 560 work was suppliedby Tom Arne Pedersen, DNV GL, while the image of the Nidelv 690 sport was supplied by RuneGreen, DNV GL.

by the tracking system. There was calm seas and no precipitation during the experiments.

73


5.2 Tracker Evaluation Metrics

In order to evaluate the tracking system in a consistent and systematic way, performancemeasures must be defined. Different classes of performance metrics can be defined basedon the availability of data. Gorji et al. [82] define two classes of metrics; sensor-relatedmeasures, and track-related measures, where the latter also takes the tracker into consid-eration. The track-related measures can further be divided into algorithm-based measuresand algorithm-free measures. The algorithm-based measures are developed for individualtypes of trackers, whereas algorithm-free measures can be applied to every tracker. Thismakes the algorithm-free metrics attractive to use, since they can be used to compare theperformance of different tracking algorithms in the future. Gorji et al. list a number ofdifferent metrics that can be used in the performance evaluation, while some rely on hav-ing an accurate ground truth and thus are mostly suited for simulations, the most relevantones for our experiments will used in evaluating the tracking results. The definitions ofeach individual metric is taken from [82]. The consistency of the tracking system are alsochecked using statistical measures.

5.2.1 Cardinality Metrics

The cardinality metrics measure numerical characteristics of the obtained results. Thecardinality metrics used in this thesis are listed in the following.

Number of Valid Tracks (NVT) A track is validated if it is associated with only onetarget, and also, the assigned target is not associated with any other track.

Number of Missed Targets (NMT) A target is missed if it is not associated with anytrack. The number of missed targets is incremented for every missed target at every timestep through the whole duration of the scenario. This metric is similar to the track lossrate as defined in [83].

Number of False Tracks (NFT) A track is counted as a false one if it is not associatedwith any target. This value is incremented by the number of false tracks at every time stepthroughout the duration of the scenario.

Number of Broken Tracks (NBT) The number of broken tracks is found by checkingeach track associated with a target, and if the last time k of the track is smaller than thelast appearance time of the target, the number of broken tracks is incremented by one.

74

5.2 Tracker Evaluation Metrics

Track Continuity (TC) The continuity for the lth target is defined as

Tcl =1

N tl

Ntl∑

n=1

∆knl∆kl

(5.1)

where N tl is the total number of tracks assigned to the target, ∆knl is the duration of the

nth track, and ∆kl corresponds to the appearance time of the lth target.

5.2.2 Time Metrics

The time metrics provide information about the persistency of a track. Three time metricsare used in this thesis, listed in the following.

Rate of False Alarm (RFA) The rate of false alarm is defined as the number of falsetracks per time step. Using the definition of the number of false tracks, this measure canbe stated as

RFA =NFT

∆t(5.2)

where ∆t is the time duration of the scenario in question.

Track Probability of Detection (TPD) The track probability of detection is calculatedby summing the number of time steps that a target is associated with a valid track, anddividing by the number of time steps that the target is present in the scenario. Mathemati-cally, this can be expressed as

P ld =Tl

∆kl(5.3)

where P ld is the probability of detection for the lth track, and Tl is the duration that the lth

target is assigned to a track. ∆kl is the appearance time of the lth target. The final trackprobability of detection is then found by taking the average over all individual probabilitiesas

Pd =1

L

L∑l=1

P ld (5.4)

where L is the total number of targets.

Rate of Track Fragmentation (RFT) The rate of track fragmentation is found by ex-amining the number of changes in the assigned track IDs for each target.

75


5.2.3 Statistical Filter Consistency Measures

A state estimator is consistent if the estimation error is unbiased, i.e. zero-mean, and thatthe actual mean square error of the filter matches the filter-calculated covariance. As ameasure of filter consistency, the time-average normalized innovation squared (NIS) [84]can be used,

εν =1

K

K∑k=1

νTk S−1k νk. (5.5)

Under the hypothesis that the filter is consistent, Kε has a χ2-distribution with Knz de-grees of freedom, where nz is the dimension of the measurement vector. The hypothesiscan be tested by using values from the χ2 distribution table, and defining an acceptanceregion

ε ∈ [r1, r2] (5.6)

where [r1, r2] is the two-sided 95% probability region for the χ2 distribution related to thecorresponding NIS. Filter bias can be estimated by calculating the Average Innovation (AI)for all states [85], which should ideally be zero. The average innovation can be calculatedby

ν =1

K

K∑k=1

νk. (5.7)

5.3 Tracking Results

This section presents the maneuvers that was performed, as well as the tracking resultsusing the lidar as the sensor. While the maneuvers performed are interesting from a strictlytarget tracking perspective, future applications, such as collision avoidance or extendedobject tracking was kept in mind when designing the scenarios. In the track plots, thestart of a track is marked with a circle, and track termination is marked by a cross. Tracksassociated with the targets are annotated with track ID as well as the duration of the track.The track-to-target association is done by hand in post-processing for performance analysispurposes.

5.3.1 Scenario 1

The first scenario involves a single target, target 1. In this scenario, the boat moved in azig-zag pattern, performing a full stop at each turn. This scenario involves a maneuveringtarget, while the motion model used in the tracking assumes (nearly) constant velocity. Thewhole scenario ran over 1046 time steps. The target started close to the ownship, passingon a crossing course relative to the ownship heading. In the straight line segments thetarget had a velocity between 1.5 and 2 meters/second. Figure 5.4 shows the GNSS data

76


Figure 5.4: Scenario 1 GNSS track.

recorded for this run. Figure 5.5 shows the tracks output by the tracker for this scenario.The track labeled track 1 is the track from the target, starting at the second time step. Ascan be seen in the figure, there are several short tracks in the vicinity of the ownship. Thesetracks originated from waves generated by the target. Track 1 died at k = 838 due to a

Figure 5.5: Tracks for scenario 1.

lack of detections from the target over 14 time steps, which lead to the track existence

77


probability dropping below the track death threshold. The target became visible againshortly after, and was subsequently correctly tracked (track 5) until the end. The trackingsystem does not seem to be affected by the fact that the target maneuvers in this scenario,suggesting that the constant velocity model is robust to (non-aggressive) maneuvers. The

Metric Value

Number of valid tracks, target 1 2

Number of missed targets 20

Number of false tracks 103

Number of broken tracks, target 1 1

Track continuity, target 1 0.49

Rate of false alarm 0.0985

Track probability of detection, target 1 0.9799

Rate of track fragmentation, target 1 1

Table 5.1: Evaluation of tracking performance, scenario 1.

number of false tracks in this scenario is high due to the many false wake-originated tracksat the start. This also leads to a high rate of false alarm. The number of missed targets isdue to the missing detections between tracks 1 and 5.

5.3.2 Scenario 2

In this scenario, the two targets started at opposite sides of the harbour basin, and movingacross the basin passing one another in the middle. This scenario was performed in orderto test the effect of occlusion, where the closer target blocks the farther one from the lidar’sperspective, as well as testing how the tracker responds to close tracks at distance wherethe detections might be spurious. The larger target, target 2, passed on the far side fromthe ownship’s point of view. Figure 5.6 shows the recorded GNSS tracks for the target inthis scenario. Both targets had an approximate velocity of 1.6 meters/second through thescenario. The resulting tracks are shown in figure 5.7. There are no false tracks recordedin this scenario, however the tracks for both targets are fragmented. The gap between track2 and track 3 is due to target 2 being detected for a short period, giving birth to track 2,before the target remained undetected for 11 time steps, causing the track to die. It wassubsequently detected again at k = 48, and was successfully tracked for the remainderof the experiment. A similar situation caused the gap between tracks 1 and 4, where thesmaller target 1 was spuriously detected at this distance causing the existence probabilityfor track 1 to drop below the death threshold before the target was aquired again. Dueto the relative size difference between the targets, target 2 was only partially occluded bytarget 1. This is illustrated in figure 5.8, where the partial occlusion lead to two clustersfrom target 2.

78




The number of missed targets in this scenario is, similarly to scenario 1, due to misseddetections of the targets at a range of approximately 55 meters for target 1, and approx-imately 60 meters for target 2. The number of false tracks, and thus the number of falsealarms is zero in this scenario, as no false tracks were observed. The tracks of both targetsare fragmented, due to the aforementioned missed detections.

79


Figure 5.8: Target 2 partially occluded by target 1. The light blue ellipse represents the validationgate for each target, while the red ellipse represents the state covariance. The red asterisks are thecluster centroids.

Metric Value












Total track probability of detection 0.9787




80


5.3.3 Scenario 3

In scenario 3, the two targets started at a distance on a collision path towards one another.Both targets veered off before collision, going separate ways. The main purpose withthis scenario is to test the tracking systems robustness with regard to track coalescence,as well as this being an interesting scenario from a collision avoidance standpoint forfuture projects. The GNSS track for each target is shown in figure 5.9. Target 1 had anapproximate average velocity of 1.8 meters/second, and target 2 had an average velocityof 1.7 meters/second throughout the scenario. Again we see track fragmentation at the


start, this time for target 1, due to spurious detections at distance. Target 2 was detectedand subsequently tracked from k = 93. The short track labeled 4 in figure 5.10 was dueto a wave in the wake of target 2, which died when the track came within the minimumtrack distance, keeping track 3 due to the longer lifetime of this track. This scenario alsodisplays a fairly high number of missed targets. Both targets started at a distance on thelimits of the range of the lidar, leading to delayed track aquisition time. A single falsetrack was observed over 15 time steps, giving the number of false tracks in this scenario.Due to a delayed track aquisition time, both targets display a lower track probability ofdetection.

81



Metric Value
















82


5.3.4 Scenario 4

In scenario 4, the two targets are moving on parallel paths crossing in front of the ownship.Target 1 then veers off its path, moving away from the ownship. In addition to examin-ing the effects of occlusion, this scenario also tests how the tracking system responds toa sudden maneuver with the constant velocity model. Both targets had an approximateaverage velocity of 2 meters/second for this scenario. In this scenario the smaller target


is on the far side of the ownship, causing it to be fully occluded for an extended period.This is evident in the gap between track 2 and track 3, as seen in figure 5.12. There islittle spatial gap between the tracks, this is due to the cluster centroid being close to thebow of the target before full occlusion, and the aft was the first part of the target becomingvisible again. The track fragmentation at the end is due to spurious visibility of the targetat distance. Target 2 is successfully tracked through the maneuver, the broken track at theend is due to the target disappearing behind a building in the vicinity. No false tracks wereobserved in this scenario, but the number of missed targets is fairly high. This is due to theaforementioned occlusion of target 1, causing the target to be lost over several time steps,as well as track loss as the target moved outside the lidar range. This also led to a lowtrack continuity for target 1.

83



Metric Value
















84


5.3.5 Scenario 5

In scenario 5, the two targets started close to the ownship, travelling on parallel paths awayfrom the ownship. This scenario is designed to test the range limitations of the lidar as atracking sensor, and in future applications how the camera can aid in maintaining trackswhen the targets are outside the lidar range. The targets moved with an average velocityof 2.3 meters/second away from the ownship. As seen in figure 5.14, there is a lot of false


tracks around the ownship. This is caused by wakes from the targets being detected by thelidar, which lead to several spourious, short-lived tracks as the targets passed. Figure 5.15is a snapshot of the lidar point cloud at time step k = 148 illustrating the wakes behind thetargets. Each red asterisk represent a cluster being sent to the tracker as a measurement,causing spourious tracks. It is also seen that target 1 is lost before target 2 in this scenario.This is not surprising, since target 1 has a lower profile in the sea than target 2, leading tofewer detections than from the larger target 2. Target 2 is lost at a range of approximately85 meters away from the ownship. This scenario has both a very high number of missedtargets, as well as a large number of false tracks. The high number of missed targets is dueto the scenario running until the targets are well outside the lidar range. This also lead to alow total track probability of detection. This scenario also illustrates the lidars sensitivityto target extent, as the larger target 2 was tracked much longer than the smaller target 1.The high number of false tracks is again due to the wakes, causing a high rate of falsealarm.

85



Figure 5.15: Wakes behind the targets in scenario 5.

86


Metric Value
















87


5.3.6 Scenario 6

In scenario 6 the targets approach the ownship from a distance, approaching on parallelpaths. Scenario 6 is the second half of scenario 5, where the two targets return along thesame path as they left. The targets approached with an average velocity of approximately2.4 meters/second. The recorded GNSS path for the scenario is illustrated in figure 5.16,and the tracks output by the tracking system is shown in figure 5.17.


Again it can be observed that target 2 is detected and tracked at a greater distance thantarget 1. There is also a large amount of false tracks around the ownship in this scenario,resulting from the wakes generated by the two targets passing the ownship. The tracklabeled track 6 actually originated from wakes from target 2, and track 1 was lost due toocclusion by target 1. Track 6 subsequently started tracking the target afterwards. Thisscenario is similar to scenario 5, with both a high number of missed targets and a highnumber of false tracks. The rate of false alarm is again high due to the false tracks.

88



Metric Value
















89


5.3.7 NIS and Average Innovation

Scenario 5 was used in the calculation of the NIS and the average innovation (AI) to eval-uate the covariance consistency of the filter. This scenario was used since it tracks thetargets all the way from up close until they are out of range. The NIS and AI were cal-culated for a variety of process noise values σ2

cv over 200 time steps, and the results areshown in table 5.7 and table 5.8. The 95% probability region for the NIS is found fromtables of the χ2-distribution to be [r1, r2] = [1.7324, 2.2865], and the NIS values closestto this region is emphasized in bolt for both targets. The process noise variance in the

Target σ2cv = 0.05 σ2

cv = 0.1NIS AI NIS AI

Target 1 4.11 (−0.02,−0.28)T 3.28 (−0.05,−0.23)T

Target 2 1.94 (0.04,−0.09)T 1.92 (0.00,−0.08)T

Table 5.7: NIS and AI for target 1 and target 2 in scenario 5.

Target σ2cv = 0.5 σ2

cv = 1.0NIS AI NIS AI

Target 1 2.82 (−0.23,−0.20)T 2.79 (−0.24,−0.18)T

Target 2 1.85 (0.02,−0.07)T 1.61 (0.00,−0.08)T

Table 5.8: NIS and AI for target 1 and target 2 in scenario 5, continued.

tracking framework was selected to be σ2cv = 0.5 based on these results, and is a com-

promise between the values closest to being covariance-consistent for both targets. Noticethat the AI for target 1 has a slight negative bias, this could be due to wake measurementsfalling within the validation gate of the target early on in the scenario.

5.4 The Camera as a Complimentary Sensor

As described in section 4.7, the predicted measurements for each target are transformedinto the camera frame using (4.33), and converted into an line-of-sight angle referenced tothe z-axis in the camera frame. The center of the bounding boxes given by Faster R-CNNare also converted to a line-of-sight angle in the camera frame, and the measurement-to-track association probabilities are subsequently calculated by the JIPDA filter. The fieldof view of the camera is 125°.

For the associations to make sense, the transformations between the coordinate frameshave to be accurate, which again relies on the accuracy of the calibration between the

90


sensors, as well as the camera calibration. Moreover, the measurements from each sensormust be synchronized in time. Images were captured with a sample frequency of 10 Hz. Infigure 5.18, a single image recorded from scenario 2 is displayed, with the track locationsgiven by the tracking framework marked as yellow pluses.

Figure 5.18: Image from scenario 2. The range to the track labeled T 3 is approximately 60 meters,and the range to the track labeled T 1 is approximately 50 meters, both measured by the lidar.

Figure 5.19 is a cropped version of figure 5.18, which shows that the predicted track loca-tion is aligned with the center of each target, giving a visual indication that the calibrationprocedure (and the transformations between the coordinate frames) is accurate.

The Faster R-CNN model was trained as a single-class detector to detect boats in the im-ages, and it was found that most of the detections originated from the surroundings, suchas from houses or structures in the vicinity, rather than from the actual targets. The detec-tions from scenario 5 was examined to get an estimate of the practical range of the cameraas a detector using the implemented Faster R-CNN model. The results are displayed intable 5.9. The range estimate to the targets was based on the measurements from the li-dar. As is evident from table 5.9, none of the detections originated from any of the targetsat ranges over 20 meters. The other scenarios have similar detection results, except forsome detections when the broadside of the targets are facing the camera in scenario 2.Moreover, at close ranges, most of the detections originated from the surroundings ratherthan from the targets. This is not necessarily a problem, as seen in figure 5.20, the as-sociation probabilities calculated by JIPDA (displayed as Beta: T 1:0.873 in the image)only associates the track with the actual target detection. Since a lot of the misdetectionsoriginate from structures such as houses in the background, adding a few ”background”classes such as house, building, car and so on to the image classifier could reduce suchmisdetections. Figures 5.20 through 5.23 shows some of the cases where good detections

91


Figure 5.19: Cropped version of 5.18, showing the targets and their position given by the trackingsystem projected into the image.

Range Target False Images with no Imagesdetections detections target detections

≤ 10 meters 44 62 2 43

10-20 meters 20 60 20 40

20-30 meters 0 93 40 40

30-40 meters 0 140 45 45

Table 5.9: Evaluation of image detections in scenario 5.

of the targets was achieved, showing that the computation of the association probabilitiescould be helpful in discriminating between false and true tracks. The measurement noiseinfluencing the gating distance has not been optimized in these examples, neither has theclutter intensity λ. Since the Faster R-CNN detector is a direct reimplementation of thework done by Tangstad in his thesis [63], and has not been the main focus of this thesis,improving the visual detection of boats remains as future work.

92


Figure 5.20: Computed association probabilities for image detection, scenario 1.


93




94

5.5 Discussion of the Results

5.5 Discussion of the Results

This section discusses the results found in the previous sections, and assesses the overallperformance of the tracking system.

5.5.1 The Wake Problem

Wakes behind the targets when they were close led to many false tracks output by thetracking system. This is evident from the number of false tracks in scenarios 1, 5 and 6(103, 135 and 242, respectively). Other than when the targets were close, very little clutterwas observed in the data. This suggests that the parametric Poisson clutter model mightnot be the best suited clutter model for the tracking system with the lidar as the sensor.The Poisson clutter model assumes that the clutter is independent of the target, which doesnot seem to be the case for boats at sea. Brekke et al. [86] developed a way to model thewakes as a probability density function behind the target for the nonparametric PDAF, anextension of the JIPDA filter to include such a model could aid in the tracking when wakesare present. The detections from the camera could also be used to discriminate betweenwakes and targets, by using the association probabilities between detections and tracks.Figure 5.20 shows an example where the detection of the target has a higher associationprobability with the actual target track (T 1) than the track originating from wake (T 2).Even though the wakes led to many false tracks, the tracks of the actual targets were neverlost in these cluttered scenarios.

5.5.2 Point Targets versus Extended Targets

The detection model for the lidar sensor is a very simple one, where points that are closeenough with a specified spatial density are clustered together and presented to the trackingsystem as a measurement. When the targets are close to the ownship, the point cloud isrich with information about the extent and shape of the target(s) (see e.g. figure 5.15),which could be utilized to discriminate the clusters between actual targets and wakes withmethods like feature extraction, using e.g. spin images [87], or other geometrical features[88], and subsequently classifying the cluster as clutter or target. Other methods includeutilizing extended object tracking methods such as using Kalman filter approaches, e.g.[89], or random matrix approaches, e.g. Koch [90], Feldman et al. [91] and Granstrom etal. [92]. Extended object tracking methods are attractive due to the fact that they estimatethe extent of a target in addition to its state, which is valuable information for close-rangemaneuvering.

95


5.5.3 The Range Limitations of the Lidar

The results also show that the detection probability of smaller targets, such as leisurecrafts used in the experiments, depends on the range from the lidar. The smaller of thetwo targets used in the experiments had a low probability of detection at ranges beyond50 meters. At 50 meters the laser beams from the lidar has a vertical spread of h =r sinωi ≈ 1.75 meters, and a smaller target could easily come between the vertical spreadof the beams. The larger target, target 2, was detected at longer ranges, as seen in scenario5. In this case, the track was maintained successfully up to a range of approximately 80meters. The lidar manufacturer states that each individual laser beam has a effective rangeof approximately 100 meters. A lidar sensor with a higher vertical angular resolution couldimprove the effective range. As the aim is to track and avoid targets in urban environments,the detection of small targets such as kayakers or even swimmers is important from a safetystandpoint, and could be difficult at range using a lidar with low vertical angular resolution.

5.5.4 Modeling Occlusions

As seen in scenario 4, occlusion of a target over several time steps may lead to trackloss for the occluded target. Granstrom et al. [93] model occlusions by modelling theprobability of detection as non-homogeneous. The basic idea is to lower the probabilityof detection if a point is located behind a target estimate from the sensors point of view.Such an approach depends on knowledge about the extent of the target blocking the otherpoints, and is thus most suited for extended target tracking methods, unless the extent oftargets is known through some other means.

5.5.5 Overall Tracking Results

The main limitation with regard to tracking the actual targets seems to be the range be-tween the lidar and the target, as well as the extent of the target itself, as described in sec-tion 5.5.3. Even in situations where wakes give birth to many false tracks, the targets aresuccessfully tracked while they are within range of detection. However, the birth of manyfalse tracks, especially close to the ownship, is undesirable in a situation where the track-ing system is used in making sense-and-avoid and navigation decisions for an autonomousvessel. Methods to mitigate the birth of false tracks from wakes using the camera is a pos-sibility to investigate further. The motion model used in the tracking framework assumesnear constant velocity, however this has not appeared to be a limitation when trackingmaneuvering targets.

96

Chapter 6Conclusions and Future Work

6.1 Conclusion

A ROS-based software achitecture for data reception from the camera and the lidar hasbeen implemented on the ReVolt model ship. The lidar and camera have been calibrated,giving an accurate transformation between the two coordinate systems, as well as trans-formation to a common world frame. Measurement models for both sensors have beenformulated, and a modified version of the DBSCAN algorithm has been utilized to clus-ter the lidar data to satisfy the point-target-assumption in the tracking system. A seriesof tracking scenarios using real targets at sea has been planned, organized and executed,giving a comprehensive data set which may be valuable also in future research.

The target tracking system has been built upon an existing JIPDA implementation, andtargets are tracked using the lidar as the primary sensor. The tracking results show thatthe lidar is sensitive to wakes from the targets at close range, leading to many false tracksoutput by the tracking system. The targets are not lost due to wakes. The target probabilityof detection with the lidar is reduced at range, depending on the extent of the target inquestion. At ranges between 10-50 meters, the targets are mostly successfully trackedwith few false tracks. Some track fragmentation is observed, mostly at ranges where thetarget detection probability is low, leading to track deaths and births, as well as some caseswhere the smaller target is fully occluded by the larger target. The clustering of lidardata using DBSCAN performed satisfactorily, no noticeable over- or undersegmentationof targets was observed. The dynamic radius function used in stead of the normal constantradius aided in separating the wakes from the targets at close ranges, while still being ableto cluster the more spread-out points at range.

The Faster R-CNN detector showed that it has a very limited range with the camera and

97

Chapter 6. Conclusions and Future Work

lens used, where the detections were few and far between at ranges over 20 meters. Themodel used is a direct reimplementation of a previously trained model [63], and has notbeen the main focus in this thesis. At closer ranges, where the targets are steadily detected,the measurement-to-track association probabilities computed by JIPDA show potential tobe used in the tracking framework. At close ranges wakes in the lidar data is a challenge,however this is not the case for the camera. The camera could be used at close rangeto discriminate between target-originated tracks and wake-originated tracks, as well asaiding in the track formation and confirmation. This remains as future work. Moreover,the camera used is a straight-ahead-looking single camera giving a reduced field of view,360° coverage would give a better situational overview.

6.2 Suggestions for Future Work

This thesis cover a broad specter of research areas as well as much practical work, andit has not been possible to go into great depth in all topics covered. The thesis workhas laid the basic foundations of a complete situational awareness system, and the dataset produced opens up many possible avenues of further research. Some suggestions forpossible future research topics are listed below.

• Implement the proposed target tracking framework in ROS on ReVolt, and examineits real-time capabilities.

• Examine the possibility of including more classes in the image detection framework,to reduce the amount of false detections from background objects.

• Perform experiments to determine the measurement noise covariance matrix Rl forthe lidar. The method used in this thesis is based on an assumption on the size ofthe targets, a more rigorous way of determining the measurement noise covariancecould be considered. A possible method is given in [85].

• Information about the extent of targets is useful at close range, for collision avoid-ance and path planning purposes, as well as occlusion modelling. Extended objecttracking methods using the lidar data could give such information. Possible ap-proaches include Kalman filter approaches, e.g. [89], or random matrix approaches,e.g. Koch [90], Feldman et al. [91] and Granstrom et al. [92]. A recent master the-sis by Kristian Ruud at NTNU has also showed promising results by combining theGaussian mixture extended Kalman filter (GMEKF) with the general probabilisticdata association filter (GPDA) [94]. This method was tested on real data gatheredwith the lidar used in this thesis, and the results show that it outperformed the ran-dom matrix approach.

• Investigate the possibility of using detections from the camera to mitigate falsetracks due to wakes, as well as aiding in track formation and confirmation. Thiscould for instance be based on the measurement-to-target association probabilitiescomputed by JIPDA.

98

6.2 Suggestions for Future Work

• In this thesis, lidar measurements originating from static objects in the vicinity wasmanually removed in post-processing. A way to filter out static surroundings auto-matically is needed in a real-time application.

99

Chapter 6. Conclusions and Future Work

100

Bibliography

[1] V. Kamsvag, “Fusion of lidar and camera for collision avoidance purposes”, 2017.

[2] E Brekke, Fundamentals of Sensor Fusion: Target tracking, navigation and SLAM.Teaching material for future course in sensor fusion, NTNU, 2018.

[3] E. Jokioinen, J. Poikonen, M. Hyvonen, A. Kolu, T. Jokela, J. Tissari, A. Paasio,H. Ringbom, F. Collin, M. Viljanen, R. Jalonen, R. Tuominen, and M. Wahlstrom,Remote and autonomous ships - the next steps, 2016. [Online]. Available: http://www.rolls- royce.com/˜/media/Files/R/Rolls- Royce/documents/customers/marine/ship-intel/aawa-whitepaper-210616.pdf.

[4] W. Koch, Tracking and Sensor Data Fusion: Methodological Framework and Se-lected Applications. Springer, 2014.

[5] H. Alfheim and K. Muggerud, “Development of a dynamic positioning system forthe revolt model ship”, Master Thesis, Norwegian University of Science and Tech-nology, 2017.

[6] C. Stiller, J. Hipp, C. Rossig, and A. Ewald, “Multisensor obstacle detection andtracking”, Image and Vision Computing, vol. 18, no. 5, pp. 389 –396, 2000.

[7] M. Mahlisch, R. Schweiger, W. Ritter, and K. Dietmayer, “Sensorfusion using spatio-temporal aligned video and lidar for improved vehicle detection”, in 2006 IEEEIntelligent Vehicles Symposium, 2006, pp. 424–429.

[8] C. M. Bishop, Pattern Recognition and Machine Learning, 1st ed. 2006. Corr. 2ndprinting, ser. Information science and statistics. Springer, 2006.

[9] R. Szeliski, Computer Vision, Algorithms and Applications. Springer, 2011.

[10] R. Aufrere, J. Gowdy, C. Mertz, C. Thorpe, C. Wang, and T. Yata, “Perception forcollision avoidance and autonomous driving”, Mechatronics, vol. 13, no. 5, pp. 1149–1161, 2003.

101

http://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/customers/marine/ship-intel/aawa-whitepaper-210616.pdf




BIBLIOGRAPHY

[11] H. Cho, Y. W. Seo, B. V.K. V. Kumar, and R. R. Rajkumar, “A multi-sensor fusionsystem for moving object detection and tracking in urban driving environments”,in 2014 IEEE International Conference on Robotics and Automation (ICRA), May2014, pp. 1836–1843.

[12] C. Premebida, O. Ludwig, and U. Nunes, “Lidar and vision-based pedestrian detec-tion system”, Journal of Field Robotics, vol. 26, no. 9, pp. 696–711, 2009.

[13] H. Weigel, P. Lindner, and G. Wanielik, “Vehicle tracking with lane assignment bycamera and lidar sensor fusion”, in 2009 IEEE Intelligent Vehicles Symposium, Jun.2009, pp. 513–520.

[14] M. T. Wolf, C. Assad, Y. Kuwata, A. Howard, H. Aghazarian, D. Zhu, T. Lu, A.Trebi-Ollennu, and T. Huntsberger, “360-degree visual detection and target track-ing on an autonomous surface vehicle”, Journal of Field Robotics, vol. 27, no. 6,pp. 819–833, 2010.

[15] L. Elkins, D. Sellers, and W. R. Monach, “The autonomous maritime navigation(amn) project: Field tests, autonomous and cooperative behaviors, data fusion, sen-sors, and vehicles”, Journal of Field Robotics, vol. 27, no. 6, pp. 790–818, 2010.

[16] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed.New York, NY, USA: Cambridge University Press, 2003.

[17] Z. Zhang, “A flexible new technique for camera calibration”, IEEE Transactions onPattern Analysis and Machine Intelligence, vol. 22, pp. 1330–1334, Dec. 2000.

[18] O. Egeland and J. T. Gravdahl, Modeling and Simulation for Automatic Control.Marine Cybernetics, 2002.

[19] J. Wittenburg, Kinematics: Theory and Applications. Springer, 2016.

[20] D. S. Bernstein, Geometry, Kinematics, Statics and Dynamics. Princeton UniversityPress, 2012.

[21] J. Heikkila and O. Silven, “A four-step camera calibration procedure with implicitimage correction”, in Proceedings of IEEE Computer Society Conference on Com-puter Vision and Pattern Recognition, Jun. 1997, pp. 1106–1112.

[22] Matlab version 9.2.0.538062 (r2017a), The Mathworks, Inc., Natick, Massachusetts,2017.

[23] G. Bradski, “The OpenCV Library”, Dr. Dobb’s Journal of Software Tools, 2000.

[24] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning”, Nature, vol. 521, 436 EP –,May 2015.

[25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpa-thy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large ScaleVisual Recognition Challenge”, International Journal of Computer Vision (IJCV),vol. 115, no. 3, pp. 211–252, 2015.

[26] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, andL. D. Jackel, “Backpropagation applied to handwritten zip code recognition”, Neu-ral Computation, vol. 1, no. 4, pp. 541–551, Jan. 1989.

102

BIBLIOGRAPHY

[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deepconvolutional neural networks”, in Advances in Neural Information Processing Sys-tems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., CurranAssociates, Inc., 2012, pp. 1097–1105.

[28] M. D Zeiler and R. Fergus, “Visualizing and Understanding Convolutional Net-works”, ArXiv e-prints, Nov. 2013. arXiv: 1311.2901 [cs.CV].

[29] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdi-nov, “Improving neural networks by preventing co-adaptation of feature detectors”,ArXiv e-prints, Jul. 2012. arXiv: 1207.0580.

[30] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Train-ing by Reducing Internal Covariate Shift”, ArXiv e-prints, Feb. 2015. arXiv: 1502.03167 [cs.LG].

[31] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recogni-tion”, ArXiv e-prints, Dec. 2015. arXiv: 1512.03385 [cs.CV].

[32] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016,http://www.deeplearningbook.org.

[33] P. McManamon, Field Guide to Lidar. Society of Photo-Optical InstrumentationEngineers (SPIE), 2015.

[34] VLP-16 User Manual, English, version 63-9243 Rev. D, Velodyne LiDAR, Inc.,138 pp., 2018.

[35] C. L. Glennie, A. Kusari, and A. Facchin, “Calibration and Stability Analysis of theVLP-16 Laser Scanner”, ISPRS - International Archives of the Photogrammetry,Remote Sensing and Spatial Information Sciences, pp. 55–60, Mar. 2016.

[36] T. D. Barfoot, State Estimation for Robotics. Cambridge University Press, 2017.

[37] Y Bar-Shalom and X.-R. Li, Multitarget-multisensor Tracking: Principles and Tech-niques. YBS Publishing, 1995.

[38] M. Schreier, Bayesian Environment Representation, Prediction, and Criticality As-sessment for Driver Assistance Systems. Darmstadt, 2016.

[39] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics (Intelligent Robotics andAutonomous Agents series). The MIT Press, 2005.

[40] S. Challa, M. R. Morelande, D. Musicki, and R. J. Evans, Fundamentals of ObjectTracking. Cambridge University Press, 2011.

[41] X. R. Li and V. P. Jilkov, “Survey of maneuvering target tracking. Part I. Dynamicmodels”, IEEE Transactions on Aerospace and Electronic Systems, vol. 39, no. 4,pp. 1333–1364, Oct. 2003.

[42] R. E. Kalman, “A new approach to linear filtering and prediction problems”, Trans-actions of the ASME–Journal of Basic Engineering, vol. 82, no. Series D, pp. 35–45, 1960.

[43] R. G. Brown and P. Y. C. Hwang, Introduction to Random Signals and AppliedKalman Filtering. Wiley, 2012.

103

http://arxiv.org/abs/1311.2901





http://www.deeplearningbook.org

BIBLIOGRAPHY

[44] S. Challa and D. Koks, “Bayesian and Dempster-Shafer fusion”, Sadhana, vol. 29,no. 2, pp. 145–174, Apr. 2004.

[45] Y Bar-Shalom, P. K. Willet, and X. Tian, Tracking and Data Fusion: A Handbookof Algorithms. YBS Publishing, 2011.

[46] D. Musicki and R. Evans, “Joint integrated probabilistic data association: JIPDA”,IEEE Transactions on Aerospace and Electronic Systems, vol. 40, no. 3, pp. 1093–1099, Jul. 2004.

[47] D. Musicki, R. Evans, and S. Stankovic, “Integrated probabilistic data association”,IEEE Transactions on Automatic Control, vol. 39, no. 6, pp. 1237–1241, Jun. 1994.

[48] Technical Reference Manual FLIR BLACKFLY GigE Vision, version 14.0, FLIRIntegrated Imaging Solutions Inc, 2017, 146 pp.

[49] M. Quigley, B. P. Gerkey, K. Conley, J. Faust, T. Foote, J. Leibs, E. Berger, R.Wheeler, and A. Ng, “ROS : An open-source robot operating system”, 2009.

[50] A. El-Rabbany, Introduction to GPS: The Global Positioning System. Artech House,Inc., 2002.

[51] Pointgrey camera driver - ROS Wiki, http://wiki.ros.org/pointgrey_camera_driver, Accessed: 2018-06-21, Open Source Robotics Foundation.

[52] FlyCapture SDK, https://www.ptgrey.com/flycapture-sdk, Ac-cessed: 2017-12-17, FLIR Integrated Imaging Solutions Inc.

[53] Velodyne driver - ROS Wiki, http://wiki.ros.org/velodyne?distro=kinetic, Accessed: 2018-06-21, Open Source Robotics Foundation.

[54] OpenCV, Camera calibration and 3d reconstruction, goo.gl/KyzQSo, 2016.

[55] A. Dhall, K. Chelani, V. Radhakrishnan, and K. M. Krishna, “LiDAR-Camera Cal-ibration using 3D-3D Point correspondences”, ArXiv e-prints, May 2017. arXiv:1705.09785 [cs.RO].

[56] S. Garrido-Jurado, R. Munoz-Salinas, F. Madrid-Cuevas, and M. Marin-Jimenez,“Automatic generation and detection of highly reliable fiducial markers under oc-clusion”, Pattern Recognition, vol. 47, no. 6, pp. 2280 –2292, 2014.

[57] R. Munoz-Salinas and S. Garrido-Jurado, ArUco Library, 2013. [Online]. Avail-able: https://sourceforge.net/projects/aruco/.

[58] A. Dhall, K. Chelani, V. Radhakrishnan, and K. M. Krishna, ROS package to cal-ibrate a camera and a LiDAR, https : / / github . com / ankitdhall /lidar_camera_calibration, Accessed: 2018-06-19.

[59] M. A. Fischler and R. C. Bolles, “Random Sample Consensus: A Paradigm forModel Fitting with Applications to Image Analysis and Automated Cartography”,Commun. ACM, vol. 24, no. 6, pp. 381–395, Jun. 1981.

[60] W. Kabsch, “A solution for the best rotation to relate two sets of vectors”, ActaCrystallographica Section A, vol. 32, no. 5, pp. 922–923,

[61] B. Vik, Integrated Satelite and Inertial Navigation Systems. Norwegian Universityof Science and Technology, 2014.

104

http://wiki.ros.org/pointgrey_camera_driver

http://wiki.ros.org/pointgrey_camera_driver

https://www.ptgrey.com/flycapture-sdk

http://wiki.ros.org/velodyne?distro=kinetic

http://wiki.ros.org/velodyne?distro=kinetic


https://sourceforge.net/projects/aruco/

https://github.com/ankitdhall/lidar_camera_calibration

https://github.com/ankitdhall/lidar_camera_calibration

BIBLIOGRAPHY

[62] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Ob-ject Detection with Region Proposal Networks”, ArXiv e-prints, Jun. 2015. arXiv:1506.01497 [cs.CV].

[63] E. J. Tangstad, “Visual detection of maritime vessels”, Master Thesis, NorwegianUniversity of Science and Technology, 2017.

[64] R. Girshick, “Fast R-CNN”, in The IEEE International Conference on ComputerVision (ICCV), Dec. 2015.

[65] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional net-works”, in Computer Vision – ECCV 2014: 13th European Conference, Zurich,Switzerland, September 6-12, 2014, Proceedings, Part I, D. Fleet, T. Pajdla, B.Schiele, and T. Tuytelaars, Eds. Springer International Publishing, 2014, pp. 818–833.

[66] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition”, ArXiv e-prints, Sep. 2014. arXiv: 1409.1556 [cs.CV].

[67] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “Vi-sual object classes challenge 2012 dataset (voc2012)”, 2012. [Online]. Available:http://host.robots.ox.ac.uk/pascal/VOC/voc2012/.

[68] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database”, in CVPR09, 2009.

[69] Faster R-CNN: Towards real-time object detection with region proposal networks,https://github.com/ShaoqingRen/faster_rcnn, Accessed: 2017-12-15.

[70] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama,and T. Darrell, “Caffe: Convolutional Architecture for Fast Feature Embedding”,ArXiv e-prints, Jun. 2014. arXiv: 1408.5093 [cs.CV].

[71] Caffe deep learning framework, http://caffe.berkeleyvision.org/,Accessed: 2017-12-17.

[72] Caffe for Faster R-CNN, https://github.com/ShaoqingRen/caffe/tree/faster-R-CNN, Accessed: 2017-12-17.

[73] M. Forsyth and J. Ponce, Computer Vision, A Modern Approach. Pearson, 2012.

[74] R. Xu and D. Wunsch, “Survey of clustering algorithms”, IEEE Transactions onNeural Networks, vol. 16, no. 3, pp. 645–678, May 2005.

[75] A. K. Jain, M. N. Murty, and P. J. Flynn, Data clustering: A review, 1999.

[76] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for dis-covering clusters in large spatial databases with noise”, in Proceedings of the Sec-ond International Conference on Knowledge Discovery and Data Mining, ser. KDD’96,Portland, Oregon: AAAI Press, 1996, pp. 226–231.

[77] D. C. Hernndez, V. D. Hoang, and K. H. Jo, “Lane surface identification based onreflectance using laser range finder”, in 2014 IEEE/SICE International Symposiumon System Integration, 2014, pp. 621–625.

105



http://host.robots.ox.ac.uk/pascal/VOC/voc2012/

https://github.com/ShaoqingRen/faster_rcnn


http://caffe.berkeleyvision.org/

https://github.com/ShaoqingRen/caffe/tree/faster-R-CNN

https://github.com/ShaoqingRen/caffe/tree/faster-R-CNN

BIBLIOGRAPHY

[78] E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu, “DBSCAN revisited,revisited: Why and how you should (still) use DBSCAN”, ACM Trans. DatabaseSyst., vol. 42, no. 3, 19:1–19:21, Jul. 2017.

[79] P. Horridge and S. Maskell, “Real-time tracking of hundreds of targets with efficientexact jpdaf implementation”, in 2006 9th International Conference on InformationFusion, 2006, pp. 1–8.

[80] I. J. Cox and M. L. Miller, “On finding ranked assignments with application tomultitarget tracking and motion correspondence”, IEEE Transactions on Aerospaceand Electronic Systems, vol. 31, no. 1, pp. 486–489, 1995.

[81] E. S. Henriksen, “Automatic testing of maritime collision avoidance methods withsensor fusion”, Master Thesis, Norwegian University of Science and Technology,2018.

[82] A. A. Gorji, R. Tharmarasa, and T. Kirubarajan, “Performance measures for mul-tiple target tracking problems”, in 14th International Conference on InformationFusion, 2011, pp. 1–8.

[83] E. Brekke, O. Hallingstad, and J. Glattetre, “Tracking small targets in heavy-tailedclutter using amplitude information”, IEEE Journal of Oceanic Engineering, vol. 35,no. 2, pp. 314–329, 2010.

[84] Y. Bar-Shalom, X. Rong Li, and T. Kirubarajan, Estimation with Applications ToTracking and Navigation: Theory Algorithms and Software, 1st ed. John Wiley &Sons, Inc, 2001.

[85] E. F. Wilthil, A. L. Flaten, and E. F. Brekke, “A target tracking system for asvcollision avoidance based on the pdaf”, in Sensing and Control for AutonomousVehicles: Applications to Land, Water and Air Vehicles, T. I. Fossen, K. Y. Pettersen,and H. Nijmeijer, Eds. Springer International Publishing, 2017, pp. 269–288.

[86] E. Brekke, O. Hallingstad, and J. Glattetre, “Improved target tracking in the pres-ence of wakes”, IEEE Transactions on Aerospace and Electronic Systems, vol. 48,no. 2, pp. 1005–1017, 2012.

[87] A. E. Johnson, “Spin-images : A representation for 3-D surface matching”, PhDThesis, Carnegie Mellon University, 1997.

[88] K. O. Arras, O. M. Mozos, and W. Burgard, “Using boosted features for the detec-tion of people in 2d range data”, in Proceedings 2007 IEEE International Confer-ence on Robotics and Automation, 2007, pp. 3402–3407.

[89] B. Ristic and D. J. Salmond, “A study of a nonlinear filtering problem for trackingan extended target”, in 7th International Conference on Information Fusion, 2004,pp. 503–509.

[90] J. W. Koch, “Bayesian approach to extended object and cluster tracking using ran-dom matrices”, IEEE Transactions on Aerospace and Electronic Systems, vol. 44,no. 3, pp. 1042–1059, 2008.

[91] M. Feldmann, D. Franken, and W. Koch, “Tracking of extended objects and grouptargets using random matrices”, IEEE Transactions on Signal Processing, vol. 59,no. 4, pp. 1409–1420, 2011.

106

BIBLIOGRAPHY

[92] K. Granstrom and U. Orguner, “New prediction for extended targets with randommatrices”, IEEE Transactions on Aerospace and Electronic Systems, vol. 50, no. 2,pp. 1577–1589, 2014.

[93] K. Granstrom, S. Reuter, D. Meissner, and A. Scheel, “A multiple model phd ap-proach to tracking of cars under an assumed rectangular shape”, in 17th Interna-tional Conference on Information Fusion, 2014, pp. 1–8.

[94] K. A. Ruud, “Lidar tracking of vessels at sea using an ellipsoidal contour model”,Master Thesis, Norwegian University of Science and Technology, 2018.

107

fusion between camera and lidar for autonomous surface...

Documents