
Neuroscientific System Theory
TUM Department of Electrical and Computer Engineering
Technical University of Munich

Event-based optical-flow for autonomous driving using synthetic sensory data from an open-source simulator

Scientific Report for Obtaining the Degree

Master of Science (M.Sc.)

from the TUM Department of Mechanical Engineering

Supervisors: Prof. Dr. sc.nat. Jörg Conradt
Florian Mirus, Dipl.-Math.
Christoph Richter, Dr.rer.nat.

Submitted by: Konstantin Ritt
Brunnerstraße 37
80804 München

Final Submission: Munich, 29 September 2017


Abstract

Event-based vision is a quickly developing field. Many novel event-based approaches to existing problems have been developed in recent years and have demonstrated their capability for improvements over conventional frame-based vision. Due to the novelty of the field, it is still difficult to apply some of these approaches to real-world scenarios. This new vision paradigm is interesting for applications in the autonomous driving domain, as the properties of sensors such as the Dynamic Vision Sensor (DVS) suggest solutions to some remaining challenges in this field.

In this work, the difficulty of assessing real-world data is reduced by formulating a synthetic data source. The DVS data is generated from frames that are rendered in a driving simulation. An event-based vision approach is implemented to process the resulting stream of events. In a first processing stage, the optical flow induced by the driving motion is estimated. These results are used to attempt an event-driven approach to collision avoidance.

Although the goal of estimating the time to contact to an imminent collision with an obstacle is not achieved, the work in this thesis gives insights into challenges both of optical flow estimation and inherent to the driving domain. The implemented emulator, which generates synthetic events, demonstrates that DVS data can be obtained synthetically to be used for development in the field of event-based vision.


Contents

1 Introduction
2 Overview and Related Work
  2.1 Event-Based Sensing
    2.1.1 DVS
    2.1.2 Event Emulation
  2.2 Optical Flow Estimation
  2.3 Event-Based Computer Vision
    2.3.1 Event-Based Optical Flow
    2.3.2 Event-Based Collision Avoidance
3 Concept
  3.1 Event Generation from Simulation
    3.1.1 Car Racing Simulation as Image Frame Source
    3.1.2 Address Events from Image Frames
    3.1.3 Address Event Interface
  3.2 Optical Flow Estimation
    3.2.1 Software Framework
    3.2.2 Lukas Kanade Approach
    3.2.3 Local Plane Fitting Approach
  3.3 Application: Time to Contact
    3.3.1 Focus of Expansion Estimation
    3.3.2 Time to Contact with Obstacle
  3.4 Ground Truth Computation
4 Results
  4.1 Emulation of DVS Events
  4.2 Datasets
  4.3 Optical Flow: Parameter Tuning
    4.3.1 Local Planes Approach
    4.3.2 Lukas Kanade Approach
  4.4 Focus of Expansion
    4.4.1 Reference Scenario
    4.4.2 Influence of Simulation Parameters
    4.4.3 Influence of Sensor Parameters
    4.4.4 Influence of Driving Scenario
  4.5 Time to Contact Estimation
5 Discussion
  5.1 Simulation
  5.2 Emulation
  5.3 Optical Flow Estimation
  5.4 Time to Contact Estimation for Autonomous Driving
6 Conclusion
7 Appendix
  7.1 Event Transmission with ROS-topics
  7.2 Ground Truth Computation
References


List of Figures

2.1 Surface of events in spatio-temporal domain.
2.2 Optical flow and corresponding Focus of Expansion.
3.1 Concept overview.
3.2 Event emulation architecture.
3.3 Rendering scheduling.
3.4 Functioning principle of emulator.
3.5 Regions of interest in driving scenario.
3.6 Pinhole camera model.
4.1 Setup for validation test case.
4.2 Comparison of events generated with DVS and emulator.
4.3 Average event rate over time depending on event threshold.
4.4 Amount of interpolated events for a change in simulation time step ∆t.
4.5 Dataset scenarios.
4.6 Tuning local planes optical flow.
4.7 Tuning results for Lukas Kanade approach.
4.8 Focus of expansion estimation error with local planes optical flow estimation.
4.9 Focus of expansion estimation error with Lukas Kanade optical flow estimation.
4.10 Pitch angle as additional degree of freedom.
4.11 Focus of expansion estimation error with Lukas Kanade optical flow estimation.
4.12 Focus of expansion estimation when turning.
4.13 Time to contact estimation using Lukas Kanade optical flow.
4.14 Time to contact estimation using local plane fitting optical flow.
5.1 FOE accuracy for events originating from additional vertical lines.
7.1 Logarithmic mapping.
7.2 Transmission duration for differing event packet sizes.
7.3 Focal length from field of view.


List of Tables

3.1 .aedat file format
4.1 Dataset configuration overview.
4.2 Local plane fitting parameter assessment.
4.3 Parameters selected for Lukas Kanade approach.
4.4 Influence of simulation time step on FOE estimation.
4.5 Influence of resolution on FOE estimation.
4.6 Influence of event threshold on FOE estimation.
4.7 Influence of event threshold mismatch on FOE estimation.
4.8 Influence of velocity on FOE estimation.


Glossary

AER Address Event Representation
API Application Programming Interface
cDVS colour-change DVS
CPU Central Processing Unit
DAVIS Dynamic and Active-pixel Vision Sensor
DVS Dynamic Vision Sensor
eps events per second
FOE Focus of Expansion
FOV Field of View
FPS frames per second
GPU Graphics Processing Unit
IMU Inertial Measurement Unit
jAER Java Address-Event Representation
OpenGL Open Graphics Library
PBO Pixel Buffer Object
ROI Region of Interest
ROS Robot Operating System
SNN Spiking Neural Network
TCP Transmission Control Protocol
TTC Time to Contact


Chapter 1

Introduction

With the Dynamic Vision Sensor (DVS), a novel family of cameras is being developed. The technology is motivated by the neuro-biological working principle of retinas and introduces a new way of perceiving visual information. Conventional cameras capture visual data in a synchronized manner to obtain a frame. The value at each pixel corresponds to the absolute brightness value at one point in time. Instead of capturing optical information in the form of frames at discrete times, the DVS observes relative illumination changes at each pixel. This happens in an asynchronous manner, i.e. each pixel operates on its own and generates events that encode the perceived information. One implication of this is best understood when pointing a conventional camera and an event-based asynchronous vision sensor at a static scene, such as a still life painting. The conventional camera will capture the scene in all its detail. The asynchronous camera will generate no data. Only when the still life comes alive, i.e. when the illumination changes or motion appears in the scene, will the sensor generate events.

There are a variety of advantages to the way sensors like the DVS capture visual information. The detection of relative brightness change greatly reduces the amount of redundant data being recorded. Additionally, a high temporal resolution is accomplished at the hardware level: the photoreceptors generate events just microseconds after the perceived illumination change. Due to the asynchronous operation of the pixels and a logarithmic response to brightness, a high dynamic range is achieved.

This way of representing visual data is still fairly new. Research in computer vision has focused on a frame-based approach for decades. In order to benefit from the advantages offered by event-based vision, it is necessary to rethink many existing approaches or even to start from scratch. The benefits of this new vision paradigm have been demonstrated in a variety of applications. To my knowledge there is no published research on employing the DVS in the context of autonomous driving. Yet, the properties that set the DVS apart from conventional cameras promote its use in machine vision for this domain. Despite fast-paced progress both in research and in industry, computer vision remains a key challenge for developing fully autonomous vehicles. Advancing algorithms and increasing computing power constantly improve machine perception of the driving environment in real time. Nevertheless, many tasks are not yet performed robustly in real time. Current research platforms still rely on a range of different sensors to achieve robust sensing. Autonomous vehicles employ active sensors like RADAR, LIDAR or ultrasound sensors for depth information as well as cameras to extract contextual information. However, even state-of-the-art sensors have a limited range of operation. Extreme lighting conditions can render camera information useless. Driving at high speeds can lead to motion blur, and insufficient frame rates reduce the accuracy of methods like optical flow computation.

Properties specific to event-based cameras offer the potential to address these challenges of vision systems in autonomous driving. The sparse data representation facilitates real-time processing, as redundancy is removed at the hardware level. The high temporal precision benefits solving problems during fast motion. High dynamic range can increase robustness in challenging lighting conditions in the road environment.


Currently, however, event-based vision cannot solve these problems in autonomous driving. Given the novelty of the field and the potential already demonstrated in a variety of applications, it is worth assessing possible applications in this field as well.

Applying event-based vision to real-world data can be challenging. Many existing approaches are verified in a controlled lab environment. The early stage of development makes it necessary to reduce disturbances. In the driving scenario this can mean avoiding vibrations or the full complexity of the road environment. Another factor is the need to collect and measure validation data and ground truth. These considerations make it a costly and laborious task to generate appropriate datasets.

In this thesis I create the controlled environment in the form of a simulation that provides the necessary DVS data. This way the driving scenario can be adapted and the number of variables affecting performance can be reduced. The synthetic asynchronous sensory data is generated by an emulator that flexibly interfaces with the rendering process of the simulation. An existing application of event-based vision is implemented and investigated in the context of autonomous driving.


Chapter 2

Overview and Related Work

Humans perform the complex task of navigating in an unknown environment predominantly based on visual input. Artificial systems like robots or autonomous cars, however, rely on a myriad of sensors. This thesis explores the capabilities of a neuromorphic asynchronous sensor for computer vision in the autonomous driving domain. Section 2.1 introduces this novel type of sensor. Optical flow, a concept prevalent both in biological and computer vision, is presented in Section 2.2. Section 2.3 investigates applications of event-based sensors for computer vision.

2.1 Event-Based Sensing

2.1.1 DVS

The Dynamic Vision Sensor (DVS) is an asynchronous visual sensor [16, 17, 26]. It is a so-called artificial retina sensor and differs from conventional cameras in many ways. Instead of synchronously capturing absolute luminance at all pixels represented by frames, each pixel operates independently and emits visual information in the form of address-events. The output in this form is called Address Event Representation (AER) and includes the x, y address of the respective pixel and a timestamp. Events are emitted whenever the intensity of incoming light changes beyond a predefined event threshold. Visual information is thus represented by relative changes in luminance. This dynamic behavior mimics that of biological retinas. In a static scene with no apparent motion, no events will be emitted. This characteristic yields a reduction of redundancy in the generated visual information. A high temporal precision is related to this advantage over conventional frame-based cameras: an increased sample rate can be achieved by suppressing the transmission of redundant information [25]. The DVS emits events with microsecond accuracy when a change in brightness occurs. Additionally, the dynamic range (∼120 dB) far exceeds that of conventional cameras (∼60 dB). This is achieved due to the logarithmic response of the photoreceptors to changes in illumination.
Aside from these clear advantages, there are challenges with asynchronous visual sensors. Due to the novelty of the hardware and the additional circuitry required per pixel, the resolution of available artificial retina sensors is far below the resolution common in today's CMOS sensors. The uniformity of the sensor response is limited due to pixel-to-pixel hardware mismatches. Consequently, event thresholds vary in the order of a few percent between pixels, resulting in noise. The bigger challenge, however, originates at the software level. A frame-based understanding of visual data is ingrained in computer vision. To adequately process this asynchronous and quasi-continuous form of visual data representation, it is necessary to adapt existing approaches or to design new ones.
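To fix notation for the following sections, this operating principle can be summarized compactly. This is a restatement of the standard DVS model described in [16, 17]; the symbol C for the event threshold and t_ref for the time of the last event at a pixel are my notation, not taken from the cited works:

\[
  \Delta L(x, y, t) = \log I(x, y, t) - \log I(x, y, t_{\mathrm{ref}}),
  \qquad
  |\Delta L(x, y, t)| \geq C \;\Rightarrow\; \text{emit } e = \big(x, y, t, \operatorname{sign}(\Delta L)\big)
\]

Each emitted event resets the pixel's reference, i.e. t_ref is updated to t.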


2.1.2 Event Emulation

Emulating event-based sensors can be a good way to advance event-based algorithms. Currently, available silicon retina hardware is both expensive and continually being improved. Implementing the behavior of these sensors in software makes them more accessible for developers. By adapting the emulation behavior, it is possible to explore ideal or enhanced sensory behavior.
A few publications explore emulation of event-based sensors [12, 14, 22]. Two works perform event emulation based on frames obtained with conventional cameras; the most recent study by Mueggler et al. obtains frames from simulation. Katz et al. [14] implement both the behavior of a DVS and of the so-called colour-change DVS (cDVS), which responds to relative color changes. The emulated events are generated based on recordings from an inexpensive CMOS image sensor which operates with a frame rate of up to 125 frames per second (FPS). From these conventional recordings, the emulation computes the absolute brightness of each pixel on a logarithmic scale. The DVS behavior of detecting relative changes in brightness is modeled by comparing these absolute values between consecutive frames. Events are generated when the relative change exceeds the preset threshold. To better approximate real DVS behavior, each pixel threshold can be drawn from a normal distribution of threshold mismatch with the mean and standard deviation of the real sensor. Unlike in the real sensor, the emulated time resolution is limited by the frame rate. A detected relative change in brightness between frames cannot be timed with microsecond precision. The timestamps can either be the frame time or obtained via interpolation between consecutive frame times. The authors conclude that the resulting quantized event times are one of the major limitations of the system. Only visual input with relatively slow motion results in meaningful DVS emulation.
The basic implementation presented by Garcia et al. [12] is very similar. They introduce extensions to model the DVS behavior more accurately and to change the output encoding of events. Local inhibition mechanisms are modeled by suppressing all but one event spike in a predefined area. Only the pixel with the maximum change in brightness is allowed to generate a spike, under the assumption that neighboring cells have similar values and transmit redundant information. Another extension is an adaptive threshold that mimics a photoreceptor that does not create sufficient charge to generate an event directly, but enough to trigger a spike after some time. In an attempt to reduce high spike-rate events, the authors implement spike-time encoding. The magnitude of the change in brightness is encoded in the timing of the event within a predefined time bin. An earlier spike then represents a greater change of brightness than one occurring towards the end of the time bin. This implementation is also aimed at utilizing image and video databases as a source for event-based processing. The accuracy of the emulated data is equally limited by the underlying frame rate of the conventional camera input.
For the recently published dataset by Mueggler et al. [22], an event-based sensor is emulated based on simulated visual data. By reducing the simulation time step, and with the resulting high frame rate, the problem of insufficient temporal accuracy of the events can be overcome. Additionally, the simulation allows accurate computation of ground truth values.

2.2 Optical Flow Estimation

Optical flow is the apparent relative motion between a scene and an observer, e.g. a camera. Estimation of optical flow is the attempt to recover this relative motion from its projection onto the image plane. Many applications in computer vision, such as object recognition and tracking or visual odometry, rely on knowledge of the optical flow. Thus, their performance depends on the accuracy of the optical flow estimation. Solving for optical flow, however, is an under-determined problem due to the aperture problem: the overall motion of an edge in 2D cannot be recovered from observing the edge only locally. Only the component normal to the edge can be obtained.
Optical flow estimation has been intensively studied for many years. Three categories of approaches can be distinguished: correlation methods, gradient methods and frequency methods. Orchard and Etienne-Cummings [23] provide an overview. In this work, I will only consider gradient methods. They are widely used and provide a good trade-off between required computational resources, complexity and accuracy.
The first assumption, the so-called brightness constancy, is common to all three categories. It states that the brightness E(p, t) at pixel p = [x, y]^T is constant over time t.

\[
  \frac{dE(p, t)}{dt} = 0 \tag{2.1}
\]

Applying the chain rule to (2.1) yields equation (2.2). Introducing E_x, E_y and E_t as the partial derivatives of image brightness and v_x = dx/dt and v_y = dy/dt as the flow velocity components yields equation (2.3).

\[
  \frac{dE(p, t)}{dt} = \frac{\partial E}{\partial x}\frac{dx}{dt} + \frac{\partial E}{\partial y}\frac{dy}{dt} + \frac{\partial E}{\partial t} = 0 \tag{2.2}
\]

\[
  E_x v_x + E_y v_y + E_t = 0 \tag{2.3}
\]

Here the two unknown optical flow components v = [v_x, v_y]^T are subject to only one constraint. This is a mathematical manifestation of the aperture problem. Horn and Schunck [13] resolve the under-determined nature of the problem by assuming smoothly varying optical flow in the image and iteratively solving the resulting minimization problem. This approach is considered the pioneer of global differential methods yielding dense optical flow fields; dense in the sense that an optical flow vector is estimated for every pixel. Another fundamental approach was proposed by Lucas and Kanade [18]. They assume a constant optical flow vector within a local neighborhood to solve the under-determined formulation of (2.3). This so-called local gradient based approach yields sparse flow fields; the velocity cannot be estimated at every pixel.
Today, many different implementations of frame-based optical flow estimation exist (e.g., currently 127 methods are listed on the Middlebury benchmark website 1). They vary widely in accuracy and computational effort. The most accurate methods are far from real-time capable. Usage in real-time applications on natural scenes is still considered difficult.

2.3 Event-Based Computer Vision

With increasing computational power and improving techniques, real-time capabilities constantly improve. The development of artificial retina cameras such as the DVS, however, allows this problem to be tackled with fewer computational resources. The dynamic response to relative changes in brightness reduces the amount of redundant visual information transmitted with respect to frame-based vision; unchanged parts of the image do not generate events. The sparse representation lowers memory requirements and the computational effort necessary to process the resulting stream of data. Thus, a potential for improved real-time application arises at the hardware level. By approaching vision tasks in an event-based manner it is possible to reach sample rates in the order of kHz, hardly achievable by conventional frame-based vision. Every incoming event is processed independently and improves the estimate.

1 http://vision.middlebury.edu/flow


Event-driven sensors facilitate bio-inspired ways of processing visual input in the form of AER. A number of neuromorphic systems have been presented that directly incorporate event-based sensor input to perform control tasks using spiking neural networks [11, 27]. These systems combine event-based sensing and processing on dedicated hardware to model the power-efficient neuronal data processing of biological systems. Despite their great potential, these approaches will not be investigated further as they exceed the scope of this thesis.
A number of event-based formulations of computer vision tasks have been proposed for general purpose computing. These include lower level tasks like optical flow estimation; Subsection 2.3.1 gives an overview. To date a great variety of high level event-based applications have been developed. Some of those with relevance to autonomous driving are egomotion estimation [7, 15], feature detection and tracking [9, 29] and collision avoidance [8, 20]. Event-based collision avoidance and time-to-contact estimation are introduced in Subsection 2.3.2.

2.3.1 Event-Based Optical Flow

The DVS’s advantage of providing events with sub-millisecond temporal precision can also bene-fit the computation of optical flow. At high velocities, the accuracy of conventional frame-basedapproaches suffers from both motion blur and large inter-frame displacements [28]. The hightemporal precision avoids both effects.This potential has motivated a number of approaches to event-driven optical flow estimation.Benosman et al. [3] reformulate the popular approach by Lucas and Kanade [18] for asynchronousvisual sensors. Another gradient based approach improves real-time capabilities by avoiding es-timation of temporal and spatial gradients [2]. Instead, the optical flow velocity is obtained fromlocally fitting parameters to a linear model. Barranco et al. [1] introduce a method combiningboth relative brightness changes from a DVS and absolute intensity from frames. While they areable to show an increase in accuracy, the real-time performance is not easy to assess.The energy efficiency and robustness of biological motion estimation serves as motivation forother approaches. Biologically inspired systems range from simple and computationally efficientimplementations of optic flow on embedded hardware [10] to spiking neural networks performingthe optical flow estimation [23]. Despite the great potential of bioinspired approaches I focus ondigital implementations of the gradient based methods presented in the following. They combinereal-time capability [25], ease of implementation and suit the scope of this thesis.

Lucas-Kanade Approach

Benosman et al. [3] first presented an event-based method that employs a local gradient based approach after Lucas and Kanade [18]. Every incoming event is treated independently and updates the optical flow field, to make use of the asynchronous behavior of the DVS. To solve the under-determined equation (2.3), it is assumed that the optical flow v = [v_x, v_y]^T is locally constant in an n×n neighborhood of the current event. The two unknowns v_x and v_y can be obtained from the resulting system of m = n² equations in (2.4) by employing least squares minimization techniques.

\[
  \begin{bmatrix} \nabla E(x_1, y_1)^T \\ \vdots \\ \nabla E(x_m, y_m)^T \end{bmatrix}
  \begin{bmatrix} v_x \\ v_y \end{bmatrix}
  =
  \begin{bmatrix} -E_{t_1} \\ \vdots \\ -E_{t_m} \end{bmatrix},
  \qquad \text{with } \nabla E(x, y)^T = \left[ \frac{\partial E}{\partial x},\; \frac{\partial E}{\partial y} \right]
  \tag{2.4}
\]

In conventional imaging, the spatial gradient ∇E can be obtained using finite differences of gray levels, while the temporal gradient E_t is obtained from changes observed between frames. The high temporal precision of the DVS allows the temporal gradient to be approximated by a summation of events at pixel (x, y) over a time period ∆t (2.5).

\[
  \frac{\partial E}{\partial t} \approx \frac{1}{\Delta t} \sum_{t - \Delta t}^{t} e(x, y, t) \tag{2.5}
\]

Lacking gray levels, Benosman et al. estimate the spatial gradient by summing local events over a temporal window ∆t and applying backward finite differences (2.6). The availability of absolute brightness values from frames together with relative brightness changes in the form of events, as offered by the Dynamic and Active-pixel Vision Sensor (DAVIS), could greatly increase accuracy.

\[
  \frac{\partial E}{\partial x_i} \approx \sum_{t - \Delta t}^{t} e(x_i, y_i, t) - e(x_{i-1}, y_i, t),
  \qquad
  \frac{\partial E}{\partial y_i} \approx \sum_{t - \Delta t}^{t} e(x_i, y_i, t) - e(x_i, y_{i-1}, t)
  \tag{2.6}
\]

As Brosch et al. [4] and Rueckauer and Delbruck [25] state, this is where the sparse nature of DVS data introduces difficulties. While the original formulation in (2.6) with backward finite differences uses inconsistent orders of derivatives, the correct higher order approximations show increased susceptibility to noise, especially given a small number of events in the neighborhood [4]. Gradient based approaches are generally known to be susceptible to noise due to their dependence on spatial gradients [23]. Rueckauer and Delbruck increase the signal-to-noise ratio with a Savitzky-Golay filter and obtain improved results [25]. The authors report real-time capability of their implementation for up to 3×10⁵ events per second (eps).
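A minimal sketch of one update step of this scheme is given below. It is not the implementation of [3] or of the jAER filter discussed later; the accumulation of per-pixel event counts over the window ∆t, the function name and the neighborhood handling are assumptions made purely for illustration.

```python
import numpy as np

def lucas_kanade_flow(event_counts, dt, x, y, r=2):
    """One event-based Lucas-Kanade update, cf. eq. (2.4)-(2.6).

    event_counts : 2-D array of per-pixel event sums over the window dt
                   (these sums stand in for gray levels).
    (x, y)       : pixel address of the current event.
    r            : neighborhood radius, i.e. an n x n patch with n = 2r + 1.
    Returns (vx, vy), or None if the local system is degenerate.
    """
    A, b = [], []
    for yi in range(y - r, y + r + 1):
        for xi in range(x - r, x + r + 1):
            # Spatial gradients via backward finite differences of event sums, eq. (2.6).
            Ex = event_counts[yi, xi] - event_counts[yi, xi - 1]
            Ey = event_counts[yi, xi] - event_counts[yi - 1, xi]
            # Temporal gradient as the local event rate, eq. (2.5).
            Et = event_counts[yi, xi] / dt
            A.append([Ex, Ey])
            b.append(-Et)
    A, b = np.asarray(A, float), np.asarray(b, float)
    v, _, rank, _ = np.linalg.lstsq(A, b, rcond=None)   # least squares for eq. (2.4)
    return tuple(v) if rank == 2 else None
```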

Local Plane Fit

Events that originate from the motion of an edge can be visualized as a surface in the spatio-temporal domain of pixel coordinates over time (Fig. 2.1). Picture the hands of a clock being observed by a DVS and the resulting events: the sweep of the seconds hand will result in a spiral in the spatio-temporal domain for each minute that passes.

Figure 2.1: Surface of events in spatio-temporal domain (from [2]).

From the parameters of this surface it is possible to estimate the orientation and amplitude of the underlying optical flow. Benosman et al. [2] introduced an approach that locally fits a plane into this surface of neighboring recent events Ω_e. Similarly to the Lucas-Kanade approach, they assume constant flow velocity in the vicinity of the current event. Thus, the resulting surface of events is assumed locally planar and can be described by four parameters a, b, c and d with equation (2.7).

\[
  ax + by + ct + d = 0 \tag{2.7}
\]

Instead of estimating spatial and temporal gradients, a plane is fitted to best approximate the nearby events. The parameters a, b, c and d are calculated by least squares regression on a homogeneous system of equations obtained from (2.8) for all events e_i in the neighborhood Ω_e.

\[
  \begin{bmatrix} a & b & c & d \end{bmatrix}
  \begin{bmatrix} x_i \\ y_i \\ t_i \\ 1 \end{bmatrix}
  = 0 \qquad \forall\, e_i(x_i, y_i, t_i) \in \Omega_e
  \tag{2.8}
\]

The original formulation involves an iterative approach to increase robustness to noise. Neighboring events that lie further away than a certain threshold are not considered in the next iteration and are removed from Ω_e. The resulting plane estimate is robust to noise and compensates for missing events in the local neighborhood.
The optical flow is estimated from the final plane parameters. The original formulation by Benosman et al. [2], which uses the inverse of the plane's gradient g = [−a/c, −b/c]^T, suffered from the denominators a and b approaching zero for edges moving along the principal axes [25]. Brosch et al. [4] and Rueckauer and Delbruck [25] derive a more robust formulation (2.9). This equation uses the normalized direction of the gradient multiplied by its inverse magnitude to compute the optical flow velocity.

\[
  \begin{bmatrix} v_x \\ v_y \end{bmatrix}
  = \frac{1}{|g|^2}\, g
  = \frac{-c}{a^2 + b^2} \begin{bmatrix} a \\ b \end{bmatrix}
  \tag{2.9}
\]

Brosch et al. [4] find this to be a suitable algorithm for motion that originates from a single object. The approach is unreliable, however, for smooth gradient edges in the observed scene [25]; the resulting cloud of events is then no longer well approximated by a plane. Rueckauer and Delbruck [25] achieve real-time optical flow estimation using a Savitzky-Golay filter for an event rate of more than 10⁶ eps.
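The following sketch illustrates the plane fit and the robust inversion of eq. (2.9). It omits the iterative outlier rejection of [2]; the function name and input format are assumptions, and the fit is performed via a singular value decomposition of the homogeneous system (2.8).

```python
import numpy as np

def plane_fit_flow(neighborhood):
    """Local plane fit on recent events, cf. eq. (2.7)-(2.9).

    neighborhood : array of shape (m, 3) with rows (x_i, y_i, t_i), the set Omega_e.
    Returns (vx, vy), or None if no spatial gradient is present.
    """
    pts = np.asarray(neighborhood, float)
    # Homogeneous system [a b c d][x y t 1]^T = 0, eq. (2.8): the least-squares
    # plane normal is the right singular vector of the smallest singular value.
    M = np.hstack([pts, np.ones((pts.shape[0], 1))])
    _, _, vt = np.linalg.svd(M)
    a, b, c, d = vt[-1]
    denom = a * a + b * b
    if denom < 1e-12:
        return None                      # no spatial gradient, flow undefined
    # Robust inversion of the plane gradient, eq. (2.9).
    return (-c / denom) * a, (-c / denom) * b
```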

General Remarks on Event-Based Optical Flow

To benefit from the paradigm shift of event-based visual data representation, it is necessary to adequately adapt frame-based algorithms or to develop new approaches. The current state of the art in event-based optical flow estimation cannot tap the full potential yet. Challenges arise from noisy and fissured contours [25]. Highly textured areas generate events that can be misinterpreted as motion [25]. The sparse data representation also causes problems, as most event-based algorithms rely on averaging techniques to reduce errors due to noise. If insufficient events are present locally, these methods cannot accurately estimate optical flow.
The sparsity of AER data implies a correspondingly sparse optical flow field for the aforementioned approaches. Optical flow is computed per event. This limits the use of these event-based methods, as some applications require knowledge of dense optical flow fields. Another drawback is the behavior in textured regions: events are fired quickly in a small neighborhood and mistakenly processed as events from the same edge, yielding wrong velocity estimates [25]. Both presented event-based formulations suffer from the aperture problem. They yield normal optical flow perpendicular to a moving edge. The tangential velocity component of the flow field cannot be recovered. More sophisticated approaches like feature matching would be necessary to recover the full optical flow, at the cost of more complex computations [9].
There is one possible limitation common to both presented gradient based algorithms when applied to the driving scenario. There, the generated optical flow stems largely from the ego motion of the camera attached to a car, yielding a radially diverging optical flow field (see Fig. 2.2). The performance in estimating such a flow field is not assessed, neither in the original publications [2, 3] nor in the papers evaluating their performance [4, 25]. The authors used mainly pure rotational and translational stimuli without camera motion, for which ground truth can be derived. I expect reduced accuracy of the presented algorithms operating on the less uniform flow fields. In translational ego motion, the flow amplitude depends on the scene depth and the flow angle changes radially around the Focus of Expansion (FOE) (see Fig. 2.2). The resulting variation complicates the removal of outliers by averaging techniques. Censi and Scaramuzza [7] even suggest this to be an "intrinsic limitation of the DVS", as translation produces small apparent motion in comparison to rotation and thus fewer events. Driving at higher speeds, on the other hand, might compromise real-time capabilities, as the number of events generated increases with the driving speed.

2.3.2 Event-Based Collision Avoidance

The characteristics of event-driven computer vision suggest advantages for collision avoidance as well. Both the high sample rate and the temporal precision of the DVS can benefit this task, in particular at increased speeds [8].
Milde et al. [20] describe event-based collision avoidance in a preliminary study. Based on optical flow estimates extracted from the stream of events, they compute a direction pointing away from obstacles. The results are obtained offline and offer no measure of the accuracy of the approach. In a very recent publication, Milde et al. [21] perform obstacle avoidance and target acquisition implemented in a spiking neural network on neuromorphic hardware. Clady et al. [8] introduce an event-based approach to collision avoidance by estimating Time to Contact (TTC). TTC estimation is a vision concept that exists both in biological and artificial systems. It serves to estimate the time it takes for the observer to collide with an object at their current relative motion. This can be achieved without knowledge of depth information or velocity, merely from visual input [6]. Different approaches to estimate TTC have been derived, but most are based on optical flow estimation [6, 8]. Clady et al. [8] accurately estimate the TTC in real time in experiments with a robotic platform moving at about 1 m/s. They suggest applicability of their approach also at higher speeds. Therefore, I consider TTC an interesting application for the evaluation of event-based vision in autonomous driving in this thesis. The approach is elaborated further in the following.

Time to Contact

The trivial way to derive the TTC τ with an obstacle at distance Z_c, while moving at a constant relative velocity Ż_c in camera coordinates, is equation (2.10). However, in the general case the depth Z_c is not known.

\[
  \tau = -\frac{Z_c}{\dot{Z}_c}, \qquad \text{with } \dot{Z}_c = \frac{dZ_c}{dt} \tag{2.10}
\]

Camus [6] derives how to compute τ using the pinhole camera model. The computation is performed without knowledge of depth or relative velocity. Equation (2.11) yields the TTC at pixel p. It requires knowledge of the optical flow v(p, t) and the location of the so-called Focus of Expansion (FOE) p_foe. Both unknowns can be estimated from data obtainable on the image plane.

\[
  \tau(p, t) = \frac{v(p, t)^T \, (p - p_{\mathrm{foe}})}{\lVert v(p, t) \rVert^2} \tag{2.11}
\]

During purely translational ego motion of an observer in a mostly static scene, the perceived optical flow field diverges around the FOE. Fig. 2.2 qualitatively visualizes such a flow field in a driving scenario.

Figure 2.2: Optical flow originating from translational ego motion and corresponding Focus of Expansion (FOE). (The figure distinguishes the full optical flow vectors from the normal flow components and marks the FOE.)

The FOE is the projection of the relative motion vector between the static scene and the observer onto the image plane. All flow vectors diverge from this point. With an ideal flow pattern, the FOE could be obtained directly by triangulating just two flow vectors. Noise and inaccuracies in the flow estimation render this straightforward approach infeasible.
Clady et al. [8] employ the local plane fit by Benosman et al. [2] to obtain the optical flow field. This estimate contains only the component normal to the edge, the normal flow v_n in Fig. 2.2. Thus, the FOE can lie at any position in its negative semi-half plane with equal probability. Combining these probabilities of each velocity estimate into a probability map over all pixels yields the FOE estimate p_foe = [x_foe, y_foe]^T at the pixel with maximum probability. Its accuracy relies only on the correctness of the orientation of the underlying velocity vectors.
Clady et al. adapt (2.11) to take into account that instead of the optical flow v only the normal flow v_n is known. Σ_e is introduced as the surface of events in the spatio-temporal domain. The tangential flow component v_t is orthogonal to the gradient ∇Σ_e of this surface. The property v_n^T ∇Σ_e = 0 is used to eliminate the tangential component and yields equation (2.12). This allows computing the TTC of an obstacle at pixel p when its normal flow and the FOE are known.


\[
  \tau(p, t) = \frac{1}{2}\,(p - p_{\mathrm{foe}})^T \nabla\Sigma_e(p, t),
  \qquad \text{with } \nabla\Sigma_e(p, t) = \left[ \frac{1}{v_{n_x}(x, y_0)},\; \frac{1}{v_{n_y}(x_0, y)} \right]^T
  \tag{2.12}
\]

Clady et al. provide experimental results for a robotic platform operating in a warehouse, with ground truth from odometry. The event-based TTC estimation performs well during constant forward motion at ≈ 1 m/s. A Region of Interest (ROI), where the obstacle is assumed, is preselected.
The TTC estimation in (2.12) depends both on the accuracy of the orientation of the optical flow (through the FOE estimation) and on its amplitude (through v_n). As a result, the TTC estimation greatly suffers from noise that affects the optical flow estimation as well. Clady et al. handle this problem by averaging TTC estimates over a large number of pixels corresponding to the obstacle (>5000 events/s). For smaller obstacles two challenges remain: the selection of events corresponding to the obstacle, and robust averaging over fewer available estimates.
In the case of turning maneuvers, the optical flow field no longer diverges around the FOE. The rotational component adds a flow velocity independent of depth and heading. Although separation of rotational and translational flow components is possible, it is computationally complex [20]. The presented collision avoidance strategies based on optical flow ignore the results during turning [8, 20].
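To make the two estimation steps concrete, the sketch below accumulates the half-plane votes described above into a probability map and evaluates the TTC per pixel via eq. (2.11). This is a brute-force illustration, not the implementation of Clady et al. [8]; the data layout and function names are assumptions.

```python
import numpy as np

def estimate_foe(flow_events, width, height):
    """FOE as the maximum of a voting map built from (normal) flow vectors."""
    votes = np.zeros((height, width))
    ys, xs = np.mgrid[0:height, 0:width]
    for x, y, vx, vy in flow_events:
        # Each event votes for its negative semi-half plane:
        # all pixels p with (p - p_event) . v <= 0.
        votes[(xs - x) * vx + (ys - y) * vy <= 0] += 1
    y_foe, x_foe = np.unravel_index(np.argmax(votes), votes.shape)
    return float(x_foe), float(y_foe)

def time_to_contact(x, y, vx, vy, foe):
    """TTC at pixel (x, y) from its flow vector and the FOE, cf. eq. (2.11)."""
    speed_sq = vx * vx + vy * vy
    if speed_sq < 1e-12:
        return np.inf
    return (vx * (x - foe[0]) + vy * (y - foe[1])) / speed_sq
```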


Chapter 3

Concept

One goal of this thesis is to generate synthetic sensory data of asynchronous vision sensors. Many event-based vision algorithms are not yet ready for real-world data. Providing synthetic data from simulation can be a powerful tool to enable advancements. This way it is possible to look into the many challenges that come with natural data systematically, one by one. Another goal is to investigate event-based vision in the autonomous driving domain. Novel sensors like the Dynamic Vision Sensor (DVS) have the potential to solve many problems in computer vision on the road. Because of the difficulty of assessing real-world scenarios, both goals are intertwined: synthetic data facilitates the investigation in the driving domain. Obtaining ground truth data to assess performance is generally complex and flawed by measurement errors. Using synthetic data from simulation solves this problem.
This work consists of three independent components that together assess an application of event-based computation in the autonomous driving domain. Fig. 3.1 gives an overview. In a racing simulation, frames are rendered for a virtual camera mounted on a car. The emulator, presented in Section 3.1, takes these frames and generates address events according to the working principle of asynchronous visual sensors like the DVS. This synthetic sensory data can then be processed in an event-driven manner. Due to the number of related applications, optical flow estimation is selected as an exemplary first processing step, as shown in Section 3.2. Time to Contact (TTC) is implemented as one of the related applications. Section 3.3 describes how to predict the time it takes for the car to collide with an obstacle based on the optical flow field.

Figure 3.1: Concept overview (TORCS open-source simulation → emulator → optical flow estimation → application: Time to Contact): The arrows symbolize the data transferred between the entities. The simulation provides rendered frames to the emulator (DVS schematic from [14]). The emulator generates address events based on the brightness changes in the frames (red: darker; blue: lighter). Optical flow in the form of vectors on the image plane is estimated based on the incoming events and provided to the TTC estimation.


3.1 Event Generation from Simulation

The basis of my work is the implementation of an emulator that creates synthetic DVS data based on the open-source car simulation TORCS [30]. Working with synthetic data reduces difficulties, such as noise, vibrations and challenging lighting conditions, that are inherent to natural scenes. Many state-of-the-art event-based algorithms fail when applied to natural data [25]. In simulation, the setup and complexity of the driving environment can be adapted within the limits of the underlying framework. In frame-based vision, it is common practice to use synthetic data for another reason: knowledge of ground truth for evaluation purposes [5].
The main requirement for the emulation module is to reach sufficient accuracy while finding a good trade-off between complexity and accuracy. The real DVS is a complex device that offers its users more than 20 parameters to configure the sensor 1. The emulator serves as a functional tool that does not need to model all of this behavior. The aim is to reach an accuracy sufficient for drawing conclusions about the use of real data. Real-time performance of the emulator is a requirement of lesser importance.

1 https://inilabs.com/support/hardware/biasing/

Figure 3.2: Event emulation architecture: Two processes run in parallel on the Central Processing Unit (CPU): the simulation and the emulation of events (DVS schematic from [14]). Rendering of frames is performed on the Graphics Processing Unit (GPU). The frames are shared via shared memory access. (The diagram shows TORCS with its simulation and rendering stages, the PBO-based transfer of frames from the GPU into shared memory, and the emulator publishing events to a ROS topic or a logfile.)

The overview in Fig. 3.2 shows how synthetic events are generated. Two processes run in parallel. The first process executes the simulation of the car racing game TORCS. Subsequently, the rendering process is delegated to the GPU. There, the dedicated hardware efficiently generates 2D frames from 3D scenes. To make these image frames available to the emulator, a so-called Pixel Buffer Object (PBO) is implemented. The PBO transfers the rendered images from GPU memory to shared memory. In the second process, the emulator accesses the frames and generates address events corresponding to the changes in brightness per pixel. These events are then published via a network connection or logged to a file. The following sections describe the independent processes in greater detail.
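The hand-off between the two processes can be sketched as follows. This is not the thesis's C++ implementation inside TORCS; the resolution, the semaphore-based signalling and all names are assumptions, and double buffering as well as back-pressure handling are omitted for brevity.

```python
import numpy as np
from multiprocessing import Process, Semaphore, shared_memory

WIDTH, HEIGHT = 640, 480
FRAME_BYTES = WIDTH * HEIGHT * 3  # 8-bit RGB

def renderer(shm_name, frame_ready):
    """Stands in for TORCS + PBO read-back: writes rendered frames to shared memory."""
    shm = shared_memory.SharedMemory(name=shm_name)
    frame = np.ndarray((HEIGHT, WIDTH, 3), dtype=np.uint8, buffer=shm.buf)
    for _ in range(100):
        frame[:] = np.random.randint(0, 256, frame.shape, dtype=np.uint8)  # fake frame
        frame_ready.release()            # signal: a new frame is available
    shm.close()

def emulator(shm_name, frame_ready):
    """Stands in for the event emulator: reads each frame and processes it."""
    shm = shared_memory.SharedMemory(name=shm_name)
    frame = np.ndarray((HEIGHT, WIDTH, 3), dtype=np.uint8, buffer=shm.buf)
    for _ in range(100):
        frame_ready.acquire()            # wait for the renderer
        _ = frame.mean(axis=2)           # event generation would run here
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=FRAME_BYTES)
    ready = Semaphore(0)
    procs = [Process(target=renderer, args=(shm.name, ready)),
             Process(target=emulator, args=(shm.name, ready))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    shm.close()
    shm.unlink()
```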

3.1.1 Car Racing Simulation as Image Frame Source

The events are generated by comparing successive image frames, similar to Garcia et al. [12] and Katz et al. [14]. These existing emulators use recorded images from conventional frame-based cameras as input. In the approach described here, the open-source 3D car racing simulation TORCS provides the input frames, rendered using the Open Graphics Library (OpenGL).
In order to make these rendered images available for further processing, modifications to the rendering pipeline are necessary. Each frame is computed in two steps. While each simulation step is performed on the CPU, the vector-graphics-based rendering step is done on the GPU. OpenGL is an Application Programming Interface (API) that allows client programs like TORCS to draw and render 2D or 3D graphics. Rendering is computationally expensive and is therefore performed in parallel on the GPU, which is specifically designed for this task. During game-play, the completed renderings are briefly buffered in the so-called framebuffer on the GPU, displayed on screen and then discarded, as visualized in Fig. 3.3(a).

Figure 3.3: Rendering scheduling: (a) ideal rendering schema, (b) CPU-blocking frame access with glReadPixels, (c) frame access using PBOs. (The three timelines show how simulation on the CPU, rendering on the GPU and the transfer of frames to RAM overlap in each variant.)

Instead, we want CPU access to the rendered frame in order to buffer it in system memory and make it available to the emulator running in a different process. OpenGL provides the function glReadPixels to access the pixels currently in the framebuffer of the GPU. As Fig. 3.3 (b) shows, the function call only returns after the rendering is completed. This leads to wasted computational resources and a reduced number of frames per second (FPS) in real-time operation. OpenGL offers a buffer object referred to as a PBO for this use case. Asynchronous access to resources on the GPU is possible by employing PBOs. Fig. 3.3 (c) shows that the CPU is not involved in the glReadPixels operation if PBOs are used. By delegating control over the transfer operation to OpenGL, synchronization is avoided. The CPU is free to compute, e.g., the next simulation time step. The actual CPU access to the requested frame managed by the PBO is delayed and ideally occurs without blocking. Each frame is then stored in shared memory to allow access by the emulator running in parallel. A latency of at least one frame time is introduced with respect to possible real-time use of the synthetic events. As the process running the TORCS simulation operates at full load, the emulation is performed in a parallel thread.

3.1.2 Address Events from Image Frames

The emulation of synthetic events is based on the functioning principle of DVS cameras [16, 17]. These sensors detect changes in the log intensity of incoming light, referred to as brightness here. If the change in brightness exceeds a predefined event threshold, an event is generated, as described in Section 2.1.
The emulator creates events for each newly rendered frame in which pixels change brightness, as visualized in Fig. 3.4. OpenGL internally represents color with 8-bit values per channel on a linear scale from 0 to 255, as TORCS is built on a legacy version of OpenGL; newer versions provide higher precision of up to 32 bit. A brightness value is obtained by equally weighting the color values red, green and blue. This linearly scaled grayscale value is mapped to a logarithmic scale. A reference frame holds, for each pixel, the log intensity value at which the last event was created. At each pixel, this reference value is compared to the newly obtained log intensity. Events e(x, y, t) are generated if the change in brightness exceeds the event threshold. The reference frame is updated according to the polarity and number of events per pixel.

Figure 3.4: Functioning principle of the emulator: the linearly scaled input frame ([0; 255]) is mapped to a logarithmic scale and compared per pixel to the reference frame. For a change in brightness greater than the threshold, ON or OFF events are generated and appended to the event stream (polarity, x, y, t). The reference frame is updated accordingly.

This procedure is performed for each pixel and each frame rendered in TORCS. While DVS cameras operate with microsecond time precision, the temporal accuracy of synthetic events is limited by the frame time between generated frames. It cannot be determined when exactly the change in brightness exceeded the threshold within this period of time. If the brightness jumps by multiples of the threshold from one frame to the next, multiple events are generated. In a DVS, a strong contrast edge creates multiple events at the same pixel with minimal time differences [1]. In the emulator, the timestamps for multiple created events are determined by interpolating between the frames' timestamps. This does not increase the temporal precision but linearly approximates the change in brightness occurring between the frames. The number of possible events between frames is limited to five events with interpolated timestamps. The limitation is introduced in order to reduce the effort of providing an event stream with monotonically increasing timestamps. If the brightness changes by more than five times the event threshold, this can in most cases be prevented by reducing the inter-frame displacement, i.e. by lowering the time step between frames.
Processing frames to provide Address Event Representation (AER) output is performed in parallel to the simulation and rendering tasks and keeps up with on average 300 FPS from TORCS in real time (CPU: AMD Phenom II X4 945, GPU: Nvidia GeForce 9500, RAM: 8 GB). Frame renderings are omitted in real-time mode when simulated time falls behind execution time. The resulting temporal precision of ∼3 ms is three orders of magnitude inferior to that of a real DVS. Better results are possible using a highly parallelized approach directly on the GPU. OpenGL allows such computations to be formulated efficiently as per-pixel operations in so-called shader programs. I attempted such an implementation; however, the outdated rendering pipeline of OpenGL 1.3 used in TORCS does not work with shader programs written for versions 3.0 or above.
In order to time the synthetic events with sub-millisecond precision, I implemented an offline mode, in which simulated time is no longer synchronized with execution time. Without this requirement, the simulation time step can be reduced to the order of microseconds and frames are rendered for each simulation step. Synchronization is only maintained for shared memory access to ensure every frame is processed by the emulator. With a simulated frame rate in the order of 50 kFPS and the interpolation of timestamps between frames, it is possible to time events with microsecond precision.
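A minimal sketch of this per-frame emulation step is given below, assuming 8-bit RGB input frames. The threshold value, the function names and the exact update rule are illustrative simplifications of the behavior described above, not the thesis code.

```python
import numpy as np

THRESHOLD = 0.15    # event threshold on the log-brightness scale (illustrative)
MAX_EVENTS = 5      # cap on interpolated events per pixel and frame pair

def to_log_brightness(rgb_frame):
    """Equally weighted grayscale on [0, 255], mapped to a logarithmic scale."""
    gray = rgb_frame.astype(np.float64).mean(axis=2)
    return np.log(gray + 1.0)

def emulate_events(frame, reference, t_prev, t_now):
    """Compare a new frame against the per-pixel reference and emit events."""
    events = []
    diff = to_log_brightness(frame) - reference
    n = np.minimum((np.abs(diff) // THRESHOLD).astype(int), MAX_EVENTS)
    polarity = np.sign(diff).astype(int)
    for y, x in zip(*np.nonzero(n)):
        for k in range(1, n[y, x] + 1):
            # Timestamps are interpolated linearly between the two frame times.
            t = t_prev + k * (t_now - t_prev) / (n[y, x] + 1)
            events.append((int(polarity[y, x]), int(x), int(y), t))
    # The reference moves by the emitted amount, not to the new value.
    reference += polarity * n * THRESHOLD
    events.sort(key=lambda e: e[3])
    return events
```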

3.1.3 Address Event Interface

A DVS event is fully described by a tuple of the polarity of the change in brightness (ON for a change to a brighter value), the pixel coordinates in x and y, and a timestamp. Depending on the emulation mode, the events are either broadcast using TCP in ROS messages² to allow real-time processing or saved to an ".aedat" file³ for later use. In both channels, timestamp monotonicity is ensured, i.e. timestamps of successive events are of greater or equal value. The TCP interface allows networked access to the event packages generated in the emulator. I achieved an event transmission rate of up to 6 Meps, sufficiently high for real-time broadcasting of data recorded in natural scenes. The transmission performance is described in more detail in the appendix in Section 7.1. The ".aedat" file type is a custom binary format that is used by the jAER and cAER frameworks from iniLabs, where neuromorphic technologies such as the DVS are commercially available. The values are stored according to Tab. 3.1. The pixel intensity values, however, remain unspecified for use with DVS data.

Table 3.1: .aedat file format

field   type    x address   y address   polarity   pixel intensity   timestamp [µs]
width   1 bit   10 bit      9 bit       2 bit      10 bit            32 bit

² https://github.com/uzh-rpg/rpg_dvs_ros
³ https://inilabs.com/support/software/fileformat/
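For illustration, the sketch below packs one event into a 32-bit address word plus a 32-bit timestamp using the field widths of Tab. 3.1. The ordering of the fields within the word and the byte order are assumptions made for this example only; the authoritative layout is the iniLabs file format specification linked above.

    import struct

    def pack_event(x, y, polarity, timestamp_us, event_type=0, intensity=0):
        """Pack one event with the field widths of Tab. 3.1 (bit ordering assumed).

        type: 1 bit, x: 10 bit, y: 9 bit, polarity: 2 bit, pixel intensity: 10 bit,
        followed by a 32 bit timestamp in microseconds.
        """
        address = ((event_type & 0x1) << 31) | ((x & 0x3FF) << 21) | \
                  ((y & 0x1FF) << 12) | ((polarity & 0x3) << 10) | (intensity & 0x3FF)
        # '>II': two big-endian unsigned 32-bit words (byte order assumed here).
        return struct.pack(">II", address, timestamp_us & 0xFFFFFFFF)

    # Example: an ON event at pixel (120, 80) with timestamp 1000 µs.
    record = pack_event(x=120, y=80, polarity=1, timestamp_us=1000)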


3.2 Optical Flow Estimation

Optical flow estimation is an integral part of many computer vision applications in autonomous driving. Apart from slow moving robotic platforms [8, 20], event-based optical flow is hardly ever used for estimating translational flow fields. The previously described synthetic source of DVS data opens up a new possibility to do so.

3.2.1 Software Framework

Instead of implementing existing optical flow estimation algorithms, I use the open-source implementations provided with the Java Address-Event Representation (jAER) project⁴. jAER is a framework for real-time processing and visualization of event-based data, supported by iniLabs⁵. It is used for rapid development of event-based applications by scientists around the world. The optical flow algorithms evaluated by Rueckauer and Delbruck [25] are included in the package ch.unizh.ini.jaer.projects.rbodo.opticalflow. These implementations include the approaches described in Subsection 2.3.1.

jAER handles incoming events in packets. They can either be received from a sensor directly or from previously recorded ".aedat" files. Event-based algorithms are implemented as filters that receive the packets of events. These events can then be processed and forwarded to subsequent filters. Results of one processing stage can easily be passed on to other algorithms by modifying the stream of events. After passing an optical flow filter, an event is modified by adding the flow estimate to its representation. These events are referred to as flow events. The final filter output is then rendered to provide a visualization of the results.

The following elaboration of the filters for optical flow estimation serves as a short insight into the working principle of the algorithms, with an emphasis on parameter settings affecting the estimation. Further implementation details are available in the publication by Rueckauer and Delbruck [25].
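The packet-and-filter pattern can be sketched schematically as follows. This is not the jAER Java API; all class and method names are invented for the illustration of how packets pass through a chain of filters, with an optical flow filter annotating each event before forwarding it.

    class EventFilter:
        """Schematic filter: receives a packet and returns a (possibly modified)
        packet that is handed to the next filter in the chain."""
        def filter_packet(self, packet):
            raise NotImplementedError

    class OpticalFlowFilter(EventFilter):
        def __init__(self, estimator):
            self.estimator = estimator          # e.g. Lukas Kanade or local plane fit

        def filter_packet(self, packet):
            flow_events = []
            for event in packet:                # event: dict with x, y, t, polarity
                vx, vy = self.estimator(event)  # per-event flow estimate
                flow_events.append({**event, "vx": vx, "vy": vy})
            return flow_events                  # downstream filters see flow events

    def run_chain(packets, filters):
        """Feed packets through the filter chain; the last output is rendered."""
        for packet in packets:
            for f in filters:
                packet = f.filter_packet(packet)
            yield packet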

3.2.2 Lukas Kanade Approach

The Lukas Kanade Approach is based on the assumption of locally constant flow in an n × n neighborhood of the current event. The size of the neighborhood is determined by the parameter searchDistance to each side of the current event. In order to relate to the neighboring events, each incoming event populates the histogram of recent events per pixel. The parameter maxDtThreshold specifies for how long event timestamps are kept in memory. From this histogram, both spatial and temporal gradients in equation (2.4) are computed. Written as Av = b, least squares estimation yields the solution for the flow velocity vector with equation (3.1).

v = (A^T A)^{-1} A^T b \qquad (3.1)

The eigenvalues λ1 and λ2 of the covariance matrix A^T A serve to determine whether it is invertible and to provide a confidence measure of the obtained estimate. The eigenvalues are compared to a confidence threshold th and the resulting procedures are summarized in equation (3.2). A higher threshold th increases the accuracy by discarding noisy estimates but also reduces the amount of optical flow events.

⁴ https://github.com/SensorsINI/jaer
⁵ https://inilabs.com/

v = \begin{cases}
      (A^T A)^{-1} A^T b, & \text{for } \lambda_1 \ge \lambda_2 > th \\
      -E_t\,\dfrac{\nabla E}{\|\nabla E\|^2}, & \text{for } \lambda_1 \ge th > \lambda_2 \\
      0, & \text{for } th > \lambda_1 \ge \lambda_2
    \end{cases}
\qquad (3.2)

Matrix A and vector b contain the spatial and temporal gradients of the image brightness E. The implementation in jAER offers different options to compute these gradients. A variety of finite difference methods is available, including second order derivatives. Additionally, a Savitzky-Golay filter is implemented. With it, in short, low-order polynomials are fitted to supporting points by linear least squares, yielding the gradients directly from the fitting coefficients.

A few other settings influence the estimation of the optical flow field. Most notable is the parameter refractoryPeriodUs. It defines a refractory period during which no new optical flow is estimated at the same pixel. This can both speed up processing time and avoid wrong estimates. Events fired at the back of a moving edge produce flow estimates that, by definition of the spatial gradient, point away from the region with higher event frequency and thus in the opposite direction of the moving edge that just passed. This can be avoided by suppressing the estimate for a certain time, chosen according to the dynamics of the observed scene. While the optical flow estimate is suppressed, the timestamp of the event still contributes to the histogram of recent events.
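A minimal sketch of the per-event least-squares step of equations (3.1) and (3.2), assuming the spatial and temporal gradients of the neighborhood have already been computed from the event histogram. The NumPy formulation, the function name and the fallback to the mean gradient in the normal-flow case are illustrative choices, not the jAER implementation.

    import numpy as np

    def lukas_kanade_flow(grad_x, grad_y, grad_t, th):
        """Flow estimate for one event from neighborhood gradients (eq. 3.1 / 3.2).

        grad_x, grad_y, grad_t : 1-D arrays of spatial/temporal brightness gradients
        th                     : confidence threshold on the eigenvalues
        """
        A = np.column_stack((grad_x, grad_y))       # rows: [E_x, E_y]
        b = -np.asarray(grad_t, dtype=float)        # brightness constancy: A v = b

        AtA = A.T @ A
        lam2, lam1 = np.linalg.eigvalsh(AtA)        # ascending: lam2 <= lam1

        if lam1 >= lam2 > th:
            return np.linalg.solve(AtA, A.T @ b)    # full least-squares solution
        if lam1 >= th > lam2:
            # Aperture problem: return normal flow along the (mean) gradient.
            grad = np.array([np.mean(grad_x), np.mean(grad_y)])
            E_t = np.mean(grad_t)
            return -E_t * grad / (grad @ grad + 1e-12)
        return np.zeros(2)                          # not enough structure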

3.2.3 Local Plane Fitting Approach

Instead of holding multiple recent events in memory, the local plane fitting operates on the so-called surface of active events. These events are referred to as a surface because of how they appear when visualized in spatio-temporal space. An array stores the timestamp of the most recent event per pixel. Different available approaches can be used to fit a locally planar surface to this cloud of events. The orientation of this surface in the spatio-temporal domain yields the underlying normal optical flow using equation (2.9).

The parameter refractoryPeriodUs takes a different meaning in this approach. Contours with a gradual change in brightness yield multiple events at the same pixel. These broader edges make the plane fitting approach fail. The refractory period prevents updates of the surface of active events for some time in order to reduce this effect.

Every incoming event populates its local neighborhood, with size searchDistance to each side, from the array of active events. Events older than the threshold maxDtThreshold are neglected. The central part of this approach is to obtain the plane parameters from these selected events. The original formulation uses least squares linear regression. Rueckauer and Delbruck [25] also provide a two-dimensional linear Savitzky-Golay filter that approximates the parameters of the smoothed surface. Common to these methods is the use of the threshold th3. A surface with vanishing gradients in the x or y pixel direction yields high motion estimates in these directions. Optical flow velocities for surface gradients smaller than th3 are discarded as unrealistically high.

In an iterative approach, the least squares fitting is improved by discarding outlier events. If the distance of an event to the fitted plane is greater than the threshold th2, the event is removed from the neighborhood. The iteration is terminated for a change in the fitting parameters below the threshold th1. The Savitzky-Golay filter and the single fit variant, for which no iterative improvement takes place, are one order of magnitude faster than the iterative approach.
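The core of the single-fit variant can be sketched as fitting a plane t = a·x + b·y + c to the most recent timestamps around the event and reading the normal flow from the plane gradient (cf. equation (2.9)). The iterative outlier rejection with th1 and th2 is omitted, the th3 check is simplified to the overall gradient magnitude, and all names are illustrative.

    import numpy as np

    def local_plane_flow(xs, ys, ts, th3=1e-3):
        """Single-fit local plane estimate of normal optical flow (sketch).

        xs, ys : pixel coordinates of the neighborhood events
        ts     : their most recent timestamps in seconds
        th3    : minimum surface gradient; flatter surfaces are discarded
        """
        xs = np.asarray(xs, dtype=float)
        ys = np.asarray(ys, dtype=float)
        ts = np.asarray(ts, dtype=float)

        # Least-squares fit of the plane t = a*x + b*y + c.
        A = np.column_stack((xs, ys, np.ones_like(xs)))
        (a, b, _), *_ = np.linalg.lstsq(A, ts, rcond=None)

        g = np.array([a, b])                 # gradient of the surface in s/px
        norm2 = g @ g
        if np.sqrt(norm2) < th3:
            return np.zeros(2)               # vanishing gradient: unrealistic velocity
        # Normal flow points along the gradient with speed 1/|g| (px/s).
        return g / norm2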


3.3 Application: Time to Contact

Optical flow serves different systems, both biological and artificial, as a tool to gain clues about motion from visual input. Ultimately, these clues are used to extract high level information about the environment. The estimation of Time to Contact (TTC) is one approach to do so. Based on optical flow events, this application estimates the time to an imminent collision with an obstacle at the current relative velocity.

3.3.1 Focus of Expansion Estimation

The TTC approach described by Clady et al. [8] serves as the basis of the formulation in this thesis. Their method is event-driven and based on the knowledge of the normal optical flow field. The first step is to estimate the position of the Focus of Expansion (FOE), the projection of the current relative motion onto the image plane. This is done by using the property that the optical flow field diverges from the FOE during translational ego motion. The effects of superposed translation and rotation when turning are shown in ??. Clady et al. [8] formulate an averaging technique to find the most probable location of the FOE from all optical flow estimates. Algorithm 1 shows the resulting procedure. The event-driven approach updates the estimate for every valid optical flow vector (step 2 in Algorithm 1). This vector contains the component in the normal direction of the edge, the normal flow vn. As a result, all positions in the negative half space of the vector qualify for the FOE position with equal probability (step 3). These probabilities of each velocity estimate are combined into a probability map Mprob (step 4). The camera's optical axis is aligned with the car's axis pointing in the forward driving direction. Therefore, the position of the FOE is assumed to stay within a circle A around the image center, as visualized in Fig. 3.5. The amount of probability updates can thus be greatly reduced. Only for strong skidding of the car does this assumption not hold true. As the FOE changes over time, an exponentially decaying function reduces the contributions of older events (step 5). The currently highest probability yields the FOE estimate pfoe = [xfoe, yfoe]^T (step 6). The error of this estimation stems from the angular error of the underlying velocity vectors only. The formulation depends on events and optical flow estimates on all sides of the FOE for a unique solution to exist.

Algorithm 1 | Computation of the Focus of Expansion, adapted from [8]
Require: Mprob ∈ R^{m×n} and Mtime ∈ R^{m×n} (Mprob is the probability map and holds the likelihood for each spatial location, Mtime the last time when its likelihood has been increased).

1: Initiate the matrices Mprob and Mtime to 0
2: for every incoming e(p, t) at velocity vn do
3:     Determine all spatial locations pi such that (p − pi)^T vn > 0
4:     for all pi ∈ A: Mprob(pi) = Mprob(pi) + 1 and Mtime(pi) = ti
5:     ∀ pi ∈ A, update the probability map Mprob(pi) = Mprob(pi) · e^{−(ti − Mtime(pi))/∆t}
6:     Find pfoe = [xfoe, yfoe]^T, the spatial location of the maximum value of Mprob, corresponding to the FOE location
7: end for
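Algorithm 1 translates into a compact event-driven routine; the sketch below restricts the updates to the circular region A and uses NumPy array operations. The grid handling, the decay constant delta_t and all names are illustrative, not the implementation used in this thesis.

    import numpy as np

    class FOEEstimator:
        """Event-driven Focus of Expansion estimate following Algorithm 1 (sketch)."""

        def __init__(self, width, height, radius, delta_t=0.1):
            self.M_prob = np.zeros((height, width))
            self.M_time = np.zeros((height, width))
            yy, xx = np.mgrid[0:height, 0:width]
            cx, cy = width / 2.0, height / 2.0
            # Region A: circle around the image center (steps 4 and 5 operate on A only).
            self.in_A = (xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2
            self.xx, self.yy = xx, yy
            self.delta_t = delta_t

        def update(self, p, v_n, t):
            """p: event position [x, y]; v_n: normal flow vector; t: timestamp [s]."""
            # Step 3: candidate locations lie in the negative half space of v_n.
            half_space = (p[0] - self.xx) * v_n[0] + (p[1] - self.yy) * v_n[1] > 0
            mask = half_space & self.in_A
            # Step 4: increase the likelihood and remember the update time.
            self.M_prob[mask] += 1.0
            self.M_time[mask] = t
            # Step 5: exponentially decay contributions of older updates.
            self.M_prob[self.in_A] *= np.exp(-(t - self.M_time[self.in_A]) / self.delta_t)
            # Step 6: the FOE estimate is the location of the maximum likelihood.
            idx = np.argmax(np.where(self.in_A, self.M_prob, -np.inf))
            y_foe, x_foe = np.unravel_index(idx, self.M_prob.shape)
            return x_foe, y_foe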


[Diagram: image plane with the FOE, the circular region A around the image center, and region B below the FOE.]

Figure 3.5: Regions of interest in driving scenario.

3.3.2 Time to Contact with Obstacle

With an estimate for the FOE, the TTC is calculated using equation (2.12). Every optical flow velocity can be used to calculate a TTC. As described in Subsection 2.3.2, the TTC estimate is susceptible to noise. Better robustness is achieved by averaging flow velocities originating from the same obstacle. Clady et al. [8] assume that all flow events in a fixed Region of Interest (ROI) directly below the FOE stem from an obstacle on the ground. The overall TTC average is formed considering only optical flow estimates from this region.

In the driving scenario, however, the assumption that all events from a particular ROI correspond to an obstacle is too crude. Both obstacles and other contours, like line markings, will cause TTC estimates plausible for a collision. In order to tell optical flow velocities originating from an obstacle apart from those originating from line markings, it is necessary to make further assumptions. Potential obstacles are assumed below the FOE with limited horizontal offset. Therefore, only flow events from region B in Fig. 3.5 are considered for the computation of TTC estimates. As the FOE corresponds to the heading of the vehicle, events originating in this region mainly lie on the road, ahead of the moving vehicle. Valid flow estimates during forward translational motion diverge from the FOE. As the flow events describe normal optical flow, this condition includes all events where the FOE is in their negative half space. As a precaution to avoid poor TTC estimates [19], optical flow velocities originating too close to the FOE are discarded.

To assign the remaining flow events to potential obstacles, I introduce a simple clustering method. Each quadratic cluster carries state variables for the center of mass, the cluster activity, the size and the respective TTC estimate. A new valid flow event is either added to an existing nearby cluster or it creates a new one. Each contributing event updates the cluster's state variables. The update is weighted relative to the cluster activity: a single event has less impact on a cluster with high activity. The size of a cluster, i.e. of an obstacle, can only increase with forward motion. Flow estimates near the boundary increase the size of a cluster. Every assigned event also shifts the center of mass. This way, obstacles are followed on the image plane. Each event updates the activity of every cluster, increasing the activity of the assigned cluster and decreasing all others. The decrease is higher for clusters that have not been updated recently. This way, the contribution of a single event to the cluster states is higher at a lower event rate and vice versa. Clusters are discarded when their activity falls below a threshold.


This basic method does not isolate events of different objects; it merely serves as a tool to break down region B into entities with a high event rate. The simple solution is accepted under the assumption that an obstacle generates a substantial amount of events, outweighing possible outliers.
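A stripped-down version of this clustering can be written as below. The activity update rule, the growth rule and all constants are simplified stand-ins for the behavior described above (activity-weighted updates, size that only grows, activity decay for clusters that are not hit), not the exact implementation.

    import numpy as np

    class Cluster:
        def __init__(self, x, y, size=5.0):
            self.center = np.array([x, y], dtype=float)
            self.size = size            # half side length of the quadratic cluster
            self.activity = 1.0
            self.ttc = None             # running TTC estimate [s]

        def contains(self, x, y, margin=2.0):
            return bool(np.all(np.abs([x - self.center[0], y - self.center[1]])
                               <= self.size + margin))

        def add(self, x, y, ttc):
            w = 1.0 / (1.0 + self.activity)   # high activity -> small per-event impact
            offset = np.array([x, y]) - self.center
            self.center += w * offset
            self.size = max(self.size, float(np.max(np.abs(offset))))  # grow only
            self.ttc = ttc if self.ttc is None else (1.0 - w) * self.ttc + w * ttc
            self.activity += 1.0

    def assign_flow_event(clusters, x, y, ttc, decay=0.99, min_activity=0.5):
        """Assign one valid flow event from region B (with its TTC) to a cluster."""
        target = next((c for c in clusters if c.contains(x, y)), None)
        if target is None:
            target = Cluster(x, y)
            clusters.append(target)
        target.add(x, y, ttc)
        for c in clusters:                    # all other clusters lose activity
            if c is not target:
                c.activity *= decay
        clusters[:] = [c for c in clusters if c.activity >= min_activity]
        return clusters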

3.4 Ground Truth Computation

One of the advantages of using synthetic data originating from simulation is that ground truth values can be calculated. Outside of simulation, it can be difficult to obtain ground truth for optical flow at all. Rueckauer and Delbruck [25] describe a method of how to use an Inertial Measurement Unit (IMU) to obtain ground truth data for simple cases. The camera ego motion is restricted to rotation in a static scene; non-rigid motion or moving objects cannot be represented. Additionally, the ground truth value is flawed by the IMU measurement error. The MPI Sintel Dataset [5] is based completely on synthetic scenes from an open-source movie and provides ground truth optical flow for the evaluation of frame-based approaches. Providing an adequately good data set for event-based optical flow estimation exceeds the scope of this thesis.

It is less complex to obtain ground truth for the FOE and TTC estimation. The FOE corresponds to the projection of the current relative velocity between observer and observed scene, vcar. State variables for all simulated vehicles are obtained from the simulation engine of TORCS, both in world and car coordinates. The camera, for which the scenes are rendered, is fixed to the car and its optical axis Zc is aligned with the car's forward pointing x-axis. Equation (3.3) gives the complete mapping of three dimensional space to image coordinates using the ideal pinhole camera model.

s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
  = C\,x
  = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
    \begin{bmatrix} x \\ y \\ z \end{bmatrix}
  = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
    \left[\, R \mid t \,\right]
    \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
\qquad (3.3)

It is best understood together with Fig. 3.6. A vector in three dimensional space, P = [X, Y, Z]^T, expressed in homogeneous coordinates, is transformed to image coordinates [u, v] and the scaling factor s, using the rotation R and translation t. Car to camera coordinates are transformed with a constant rotation and no translation, as the car is a rigid body.

v_{car} = \begin{bmatrix} -v_{car,y} \\ -v_{car,z} \\ v_{car,x} \end{bmatrix}_{c}
        = \begin{bmatrix} 0 & -1 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}
          \begin{bmatrix} v_{car,x} \\ v_{car,y} \\ v_{car,z} \\ 1 \end{bmatrix}_{car}
\qquad (3.4)

The last transformation uses the camera specific intrinsic matrix C. It consists of the principal point [cx, cy]^T, assumed at the center of the sensor, and the focal length f. This intrinsic camera value is internally represented by the Field of View (FOV) in TORCS. The simple conversion is provided in the appendix in Section 7.2. With these parameters, the FOE is calculated from the car's velocity vcar using equation (3.5).

x_{foe,gt} = f\,\frac{-v_{car,y}}{v_{car,x}} + c_x, \qquad
y_{foe,gt} = f\,\frac{-v_{car,z}}{v_{car,x}} + c_y
\qquad (3.5)

⁶ www.docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html



Figure 3.6: Pinhole camera model⁶.

TTC estimation is performed on a collision with another car. Ground truth for this scenario is obtained by simply projecting the relative velocity v onto the direction of the distance between the cars at positions x1 and x2 with equation (3.6).

\tau_{gt} = \frac{[x_2 - x_1]^T\,v}{\|v\|^2} \qquad (3.6)
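Both ground truth quantities reduce to a few lines once the car state is read from the simulation. The sketch below implements equations (3.5) and (3.6); the variable names are chosen for readability, and the conversion from the TORCS field of view to the focal length f is assumed to have been done beforehand (cf. Section 7.2).

    import numpy as np

    def foe_ground_truth(v_car, f, cx, cy):
        """Ground truth FOE from the car velocity in car coordinates, eq. (3.5)."""
        vx, vy, vz = v_car                    # x: forward, y/z as in eq. (3.4)
        x_foe = f * (-vy) / vx + cx
        y_foe = f * (-vz) / vx + cy
        return x_foe, y_foe

    def ttc_ground_truth(x1, x2, v_rel):
        """Ground truth TTC, eq. (3.6): distance projected onto the closing
        direction, divided by the closing speed."""
        x1, x2, v_rel = (np.asarray(a, dtype=float) for a in (x1, x2, v_rel))
        return float((x2 - x1) @ v_rel / (v_rel @ v_rel))

    # Example with illustrative numbers: driving straight at 20 m/s towards an
    # obstacle 30 m ahead yields a TTC of 1.5 s.
    tau = ttc_ground_truth([0, 0, 0], [30, 0, 0], [20, 0, 0])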


Chapter 4

Results

4.1 Emulation of DVS Events

The synthetic Dynamic Vision Sensor (DVS) events emulated from the simulation form the basis of this thesis, as the subsequent modules rely on this data. The principal requirement formulated for the emulation is to model the DVS with sufficient accuracy. This includes omitting behavior of the real sensor that is not regarded as substantial while key characteristics persist. The results of this trade-off are presented on the basis of a validation test case. Fig. 4.1 shows the setup used to compare real DVS data with emulated events. A DVS with 128×128 pixels is directed at a screen with 60Hz refresh rate. A simple animation of a moving white bar over a black background stimulates the generation of events. The emulation module grabs the frames at screen frequency and generates synthetic events. Both outcomes are compared qualitatively and quantitatively.

Figure 4.1: Setup for validation test case. DVS directed at LCD screen (60Hz) displaying moving white bar over black background.

Fig. 4.2 compares events collected during 20ms from the DVS and the emulator. From the pattern of generated events, a moving bar with a gradual gradient edge can be distinguished from one with a sharp edge. For a gradual gradient edge, events are generated over a wider range of pixels. These do not occur precisely along the vertical edge due to threshold mismatches between pixels. In the emulator, the pixel threshold variation is achieved by normally distributing the threshold values with the standard deviation of the real sensor. The sharp edge yields an accumulation of events directly at the edge (i.e. stronger contrast in this visualization), as the logarithmic intensity change is high between adjacent pixels and multiple events are triggered at the same pixels. This characteristic is preserved in the emulator by generating multiple events between two consecutive frames according to the multiple of the log intensity change per pixel.

Fig. 4.2 also shows differences of the emulator. Noise inherent to the DVS is not modeled. No events are generated ahead or rearward of the moving edge. This DVS characteristic is unwanted and usually limited or removed by filtering; hence, it was not modeled in the emulator.

Figure 4.2: Comparison of events generated with DVS and emulator: Events collected during 20ms for moving white bar over black background to the right. Top row: DVS; bottom row: emulator; left column: gradual gradient edge; right column: sharp edge.

Fig. 4.3 depicts the averaged event rate over time. The events are observed only at a vertical array with a width of one pixel. The initial rise of events occurs when the leading edge of the moving bar reaches the vertical pixel array. Both the real sensor and the emulator yield different event rates depending on the brightness threshold settings. For the DVS, two predefined configurations were used. The fast profile uses a lower brightness threshold to better capture changes in fast changing scenes; consequently, the resulting event rate is higher than that of the slow profile. The emulator behaves equivalently when the illumination threshold th is adapted. Fig. 4.3 also displays deviations of the emulator model. The DVS response includes a smoothed peak, while the emulator response yields an almost constant event rate after the onset. Katz et al. [14] explained this behavior with optical distortions by the lens. Again, the effects of noise are apparent in the DVS data: a perceivable event rate persists before the onset and after the trailing edge.

[Plot: event rate [keps] over time [s]; curves: DVS fast, DVS slow, Emulator th=30, Emulator th=50.]

Figure 4.3: Averaged event rate for moving bar with gradual gradient edge observed at one vertical pixel array over time. Varying thresholds for both DVS and emulator.

In order to provide a sufficiently good basis for the evaluation of optical flow algorithms at high driving speeds, a sufficient time precision of the emulated events needs to be established. This is achieved by reducing the simulation time step size. The effect of doing so is visualized in Fig. 4.4. It displays the amount of events corresponding to the multiplicity of the event threshold by which the brightness changes between frames. The greater the change of brightness, the more events with interpolated timestamps are generated between frames. These can originate from edges with a steep spatial gradient in brightness, or from large inter-frame displacements. The latter is unwanted behavior, as it indicates insufficient temporal precision. The decrease in the amount of events stemming from higher multiplicities shows that large inter-frame displacements can be avoided with decreasing time steps, while events stemming from a genuinely large spatial jump in brightness persist for smaller time steps.

4.2 Datasets

A number of datasets in four different scenes are recorded using the emulator in the car racing simulation TORCS. Fig. 4.5 provides an overview.


[Plot: amount of events (logarithmic scale) over the multiplicity of the event threshold jump between frames (th to 5×th); curves: ∆t = 50µs, 100µs, 200µs, 500µs.]

Figure 4.4: Amount of interpolated events for a change in simulation time step ∆t .


Figure 4.5: Dataset scenarios: (a) high-rise; (b) high-rise orange; (c) open field; (d) curves.

The first two datasets (a) and (b) only differ in the texture used for the buildings. Both datasets are recorded on a straight stretch of 280m passing high-rise buildings. Another car is placed at the end of the straight segment to serve as an obstacle for Time to Contact (TTC) estimation. Scenario (c) is set on another straight road segment of 70m length without nearby buildings. A car blocks the road at its end. The curves dataset features driving along a route that combines the above road segments with several curved segments without obstacles.

In addition to this selection of scenarios, the recorded datasets differ in the parameter settings used in the simulation, the configuration of the virtual camera, i.e. the emulator, and the trajectory of the simulated car. These parameters are summarized in Tab. 4.1.


Table 4.1: Dataset configuration overview.

parameter                                    used values (reference value in bold)
simulation time step ∆t [µs]                 20; 40
resolution [pixels]                          320×240; 320×240 super-sampled
event threshold DVS (log range: [0;255])     10; 20; 40
event threshold mismatch                     0%; 2%
car velocity vcar [m/s]                      5; 10; 20; 30
dynamics                                     constant translation; pitch rotation; curves
scenario                                     high-rise; high-rise orange; open field; curves

I identified one set of parameters for a reference dataset which is used to tune and assess event-based optical flow estimation in Section 4.3. The corresponding values are indicated in bold face in the table. Based on this set of parameters, only one setting is changed at a time to provide better comparability.

The value for the simulation time step of ∆t = 20µs corresponds to the lowest time step possible in the TORCS simulation. Smaller time steps lead to numerical errors in the update of state variables and result in unexpected motion of the car. A resolution of 320×240 pixels is the lowest available setting in TORCS and comes closest to available asynchronous sensors (DVS: 128×128, Dynamic and Active-pixel Vision Sensor (DAVIS): 240×180). The resolution of the reference dataset is obtained using super-sampling to reduce aliasing effects in the rendered frames. The event threshold is determined to yield an average event rate of 300k events per second (eps), in order to stay in the range of real-time optical flow estimation with both the Lukas Kanade and the Local Planes algorithms. The threshold range corresponds to the range of the logarithmic response to linear pixel brightness: the linear 8-bit range is mapped logarithmically to the same range between 0 and 255. In the reference dataset, the DVS is not modeled with a threshold mismatch between pixels, whereas the real sensor has a value of 2.1% [16]. The virtual camera Field of View (FOV) of 35° is determined heuristically to provide a better view of the road ahead, in view of the application to estimate the TTC with an obstacle on the road. With 72km/h, the reference velocity is set between average speeds in cities of around 30km/h and common speed limits of around 110km/h. Constant translation ideally yields a purely diverging flow field; this characteristic is used to better assess the performance of the optical flow estimation algorithms. For the same reason, the scenario high-rise is selected as reference. The straight and clear edges of the buildings serve as a good source of optical flow.

4.3 Optical Flow: Parameter Tuning

Both optical flow algorithms have a set of parameters that need to be set according to the underlying motion in the observed scene. The previously described reference dataset, with an event rate of 250keps, is used for obtaining these parameters. As no ground truth for the optical flow field is computed, I determine the parameters heuristically. Approximations of the apparent optical flow by hand are used as a starting point and the parameter choice is validated by qualitative assessment of the outcome. The constant translational motion helps to identify outliers in the estimation, as the underlying visual motion is a purely diverging flow field independent of depth.


4.3.1 Local Planes Approach

The most important parameters in this approach are maxDtThreshold, searchDistance and refractoryPeriodUs. The first two are connected in that searchDistance defines the size of the spatial neighborhood, and maxDtThreshold respectively the temporal neighborhood, out of which the so-called surface of active events is generated. To accurately estimate the flow velocity from fitting a locally planar surface to the surface of active events, these neighborhoods need to be appropriately sized. Rueckauer and Delbruck [25] suggest that searchDistance=3 generally provides the best results, while a smaller neighborhood speeds up processing. Both a spatial size of 2 and 3, yielding a neighborhood of 5×5 and 7×7 pixels respectively, are considered for obtaining results.

Fig. 4.6 summarizes the qualitative assessment for choosing the parameter maxDtThreshold. Flow velocities in the regions (a), (b) and (c) indicated in the image are calculated by hand from the corresponding event timings. A starting value for the temporal neighborhood is derived from these velocities. Motion of a slow moving edge is detected only with a sufficiently large temporal window maxDtThreshold. The lantern post in region (a) moves at a normal velocity vx = −80px/s, the left building in region (b) moves upwards with vy = 7px/s and the building in region (c) moves both to the right with vx = 30px/s and up with vy = 15px/s. These apparent velocities are used to obtain a preliminary value for maxDtThreshold. The edge of the left building in region (b) leaves a trail of events while moving upwards with a velocity of 7px/s. In a spatial neighborhood of 5×5 pixels, the event timestamps at the boundary are about 350ms old at the time the edge generates an event at the central pixel. The temporal neighborhood maxDtThreshold is set accordingly to include these events in the surface of active events.

The result of correctly setting maxDtThreshold is visualized by the two flow estimates of region (b) in Fig. 4.6. The temporal window of 80ms is too small to include sufficient events for a correct plane fitting. The calculated time of 350ms correctly estimates the normal flow, also on the neighboring building. The optical flow fields in spots (c) and (e) show what happens with an oversized temporal window. With 80ms, the up and right motion of the building on the image plane is accurately estimated. The larger temporal window of 160ms, however, also yields an increased amount of opposing flow estimates. Similar reasoning applies for region (e). The flow velocity of the lantern in region (a) does not show this effect. The clear background does not cause any other events that would populate the surface of active events. Consequently, the plane fitting succeeds without appropriately sizing the temporal neighborhood. The disordered flow field in spot (d) is another similarity for all analyzed settings. The texture behind the lantern post leads to varying events along the edge of the post, depending on the background brightness. The resulting events are not approximated well by the plane fitting approach.

The parameter refractoryPeriodUs prevents updates of the surface of active events for a certain time. In this test case, this increases accuracy, as some broad edges create multiple events; without filtering these events, the local plane fitting approach tends to fail. The flow estimation for filtered events is skipped. As a result, the overall processing time decreases but at the same time the flow field becomes more sparse. A refractory period of around double the temporal window size is observed as a good trade-off between the amount of filtered out events and improved estimation at broad edges. The tendency towards a lower amount of events with increasing refractory period can be seen in Tab. 4.2.

All variants of fitting a plane to the surface of events use the parameter th3 to determine excessive velocities. The default value of th3 yields good results in the reference dataset. With an increased value, the high velocities apparent shortly before the collision at the end of the dataset are filtered out. Lowering the value introduces noisy estimates of high velocity. Among the different variants of the local plane fitting approach implemented in the Java Address-Event Representation (jAER) framework, the linear Savitzky-Golay (SG) filter and the single-fit (SF) plane fitting provide the best processing times and highest accuracy [25].
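As a quick sanity check of the temporal window derived above (the effective distance from the neighborhood boundary to the central pixel is assumed here to be roughly 2 to 2.5 px), an edge moving at 7 px/s needs

t \approx \frac{2\ldots 2.5\ \text{px}}{7\ \text{px/s}} \approx 290\ldots 360\ \text{ms}

to cross the neighborhood, which is consistent with the 350ms window quoted above.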



Figure 4.6: Tuning local planes optical flow: regions (a) and (d) yield similar results for all settings. Regions (b), (c) and (e) show different optical flow fields depending on parameter maxDtThreshold. Flow directions are color coded. The parameter searchDistance=2 in all configurations.

It is difficult to determine qualitatively which variant performs better on this dataset. A simple quantitative measure is introduced to ease the selection. With ground truth for the Focus of Expansion (FOE), it is possible to determine whether a flow vector violates the ground truth of a purely diverging flow field. A relative angle of more than 90° between the flow vector vn and the vector from the FOE to the flow location, p − pfoe, is considered invalid, as the flow estimate yields normal flow. The amount of invalid flow events is used to select a set of parameters. Optical flow with varying configuration is estimated on the reference dataset for the same stream of events to ensure comparability.

Tab. 4.2 provides an overview of the results for different configurations. Overall, a longer refractory period yields a more sparse flow field; both valid and invalid flow events are omitted. The comparison between a spatial size of two and three pixels shows better results for the smaller neighborhood; fewer invalid estimates are counted for a denser flow field. The single fit approach (SF) performs less accurately than the Savitzky-Golay (SG) method for the predetermined temporal sizes. These measures only assist the selection of one set of parameters; the final parameters, indicated in bold face in the table, are chosen based on both qualitative and quantitative assessment.
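The validity check used for this comparison is a one-liner once the FOE ground truth is available: a normal flow vector is counted as invalid when it deviates from the diverging field by more than 90°, i.e. when its dot product with the vector from the FOE to the event location is negative. The sketch below uses illustrative names and NumPy.

    import numpy as np

    def invalid_flow_fraction(positions, flows, p_foe):
        """Fraction of flow events contradicting a purely diverging flow field.

        positions : (N, 2) array of event pixel coordinates
        flows     : (N, 2) array of normal flow vectors v_n
        p_foe     : ground truth focus of expansion [x, y]
        """
        radial = np.asarray(positions, dtype=float) - np.asarray(p_foe, dtype=float)
        # Angle greater than 90 degrees  <=>  negative dot product.
        invalid = np.einsum("ij,ij->i", radial, np.asarray(flows, dtype=float)) < 0
        return float(invalid.mean())

    # Example: two events to the right of the FOE, one flowing outward (valid),
    # one flowing inward (invalid) -> fraction 0.5.
    frac = invalid_flow_fraction([[200., 120.], [210., 120.]],
                                 [[1., 0.], [-1., 0.]],
                                 p_foe=[160., 120.])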

Table 4.2: Local plane fitting parameter assessment. Comparison of different variants on the basis of the amount of invalid flow estimates. Amount of flow events out of total events decreases with refractory period. Selected parameters in bold face.

variant   spatial size [px]   temporal size [ms]   refractory period [ms]   invalid estimates out of flow events [%]   flow events out of total events [%]
SG        2                   80                   40                       23.62                                      45.91
SG        2                   80                   80                       21.34                                      40.29
SG        2                   80                   150                      18.42                                      33.94
SG        2                   80                   300                      18.24                                      21.23
SG        3                   115                  60                       23.92                                      43.86
SG        3                   115                  115                      21.67                                      38.41
SG        3                   115                  200                      18.27                                      28.10
SG        3                   115                  300                      17.79                                      22.60
SF        2                   80                   150                      28.23                                      21.68
SF        2                   80                   200                      26.84                                      16.21
SF        3                   115                  200                      31.73                                      26.83
SF        3                   115                  300                      30.07                                      21.28

4.3.2 Lukas Kanade Approach

The sizing of the local neighborhood for the Lukas Kanade approach follows the same reasoning as above, as the underlying motion is the same. The parameters refractoryPeriodUs and th require tuning. The threshold th discards high velocity estimates based on the eigenvalues in the least squares estimation. An increased value filters out higher velocities as unrealistic; a value greater than the default filters out the fast velocities apparent when approaching the obstacle. A lower threshold introduces more noisy estimates and increases the count of invalid events in this dataset. I leave the value at the default level.

The refractory period has the same effect on the flow estimation as with the local plane fitting approach. For the duration of the refractory period, no new flow estimates are performed for events at the same pixel. This leads to a more sparse flow field for increased values of refractoryPeriodUs, while the processing time is reduced. Fig. 4.7 shows the effect of the refractory period for two differing configurations of the Lukas Kanade method with varying spatial and temporal sizes. The figure gives the portion of invalid events out of total events, as well as the portion of flow events out of total events. Both percentages decrease with the refractory period. The effect is also visible from the rendered flow field, as events at the back of a moving edge, which would otherwise produce flow estimates with wrong orientation, are suppressed.

[Plot: percentage over refractory period [ms] (40–320) for the two configurations (80ms, size 2 and 115ms, size 3); curves show invalid flow / flow events and flow events / total events.]

Figure 4.7: Tuning results for Lukas Kanade approach: two variants with differing spatial and temporal neighborhood size. The amount of flow events and invalid flow estimates decreases with the refractory period.

Again, the choice of the refractory period is a trade-off between a more sparse flow field and increased accuracy. The configuration with a spatial search distance of 2 pixels to each side performs better overall, as a lower percentage of invalid flow estimates is achieved with a higher amount of flow events. A refractory period of 150ms is selected, as the accuracy only slightly improves with higher values. The final selection of the parameters used for the optical flow estimation with the Lukas Kanade method is summarized in Tab. 4.3.

Table 4.3: Parameters selected for Lukas Kanade approach.

variant   spatial size [px]   temporal size [ms]   refractory period [ms]   confidence threshold th
SG        2                   80                   150                      1.0

4.4 Focus of Expansion

The FOE estimation is performed with the two different optical flow algorithms to compare their performance. As the FOE is estimated based on the assumption of a diverging flow field, only the accuracy of the flow orientation affects the computation. The results for the reference scenario dataset are presented first and in more detail. To identify influences on the accuracy of the optical flow and FOE estimation, the dataset configuration is varied, one parameter at a time. This ensures comparability between datasets.


4.4.1 Reference Scenario

This dataset is configured with the parameters in bold face in Tab. 4.1. The accuracy of the FOE estimation is computed with equation (4.1). The estimated values xfoe and yfoe are integer values, as they correspond to the pixel with maximum probability for the FOE.

\Delta x_{foe} = |x_{foe,gt} - x_{foe}|, \qquad \Delta y_{foe} = |y_{foe,gt} - y_{foe}| \qquad (4.1)

The result derived with local planes optical flow is displayed in Fig. 4.8. The first second of the scene is not displayed, as both optical flow estimation and FOE estimation need time to initialize, until the histogram of events is populated and the probability map converges. The figure shows a great difference in accuracy for the two dimensions. In y-direction the FOE position is off by a few pixels for the entire time. The error in x-direction is about four times as high. At about 6s, the error drops, as the FOE estimate moves from right to left, passing by the ground truth in the middle. Towards the end of the scene, an increased amount of events from the obstacle yields an improved estimate in x-direction. The results are quantized, as the ground truth is constantly at the center for the purely translational motion and the estimate is in whole numbers.

[Plot: FOE estimation error [pixels] over time [s]; ∆xfoe: µx ± σ = 21.79 ± 5.16, ∆yfoe: µy ± σ = 5.07 ± 1.63.]

Figure 4.8: Focus of expansion estimation error with local planes optical flow estimation.

The Lukas Kanade approach yields better results for both components. The results are visualized in Fig. 4.9. The error in x-direction is high initially and drops to about the same level as the error in y-direction. As with the local planes approach, the vertical error is low for the entire scene, with a standard deviation of less than one pixel. The persistence of this discrepancy between x- and y-direction across optical flow methods suggests that this is related to the underlying data. In Section 5.4, I elaborate further on why the FOE estimate in x-direction is in general less robust and susceptible to wrong flow estimates in the driving scenario.


[Plot: FOE estimation error [pixels] over time [s]; ∆xfoe: µx ± σ = 7.82 ± 8.05, ∆yfoe: µy ± σ = 2.29 ± 0.93.]

Figure 4.9: Focus of expansion estimation error with Lukas Kanade optical flow estimation.

4.4.2 Influence of Simulation Parameters

The underlying simulation in the car racing game TORCS is the source of the frames from which events are emulated. The simulation parameters therefore exert great influence on the event-based estimation of optical flow.

Simulation Time Step

The state variables of the dynamic equations governing the simulation of the cars are updated with the simulation step size ∆tsimulation. The original implementation in TORCS uses a step size of ∆tsimulation = 2ms. The DVS, however, operates in the order of microseconds, and a step size in this order of magnitude is necessary to accurately time events. With a time step smaller than 20µs, the simulation starts to behave unexpectedly, resulting in cars moving in incorrect directions. The reasons for this and its implications are discussed in Section 5.1. The reference dataset is recorded with the lowest feasible simulation step size of 20µs. The effects of using only double the step size are summarized in Tab. 4.4. The average error is far higher for all values but the error in x-direction for the local plane fitting optical flow.

Table 4.4: Influence of simulation time step on FOE estimation.

∆tsimulation   FOE error   µ±σ            reference: 20µs
40µs           ∆xfoe,LP    19.15 ± 3.77   21.79 ± 5.16
               ∆yfoe,LP    12.67 ± 4.48   5.07 ± 1.63
               ∆xfoe,LK    19.49 ± 3.70   7.82 ± 8.05
               ∆yfoe,LK    8.01 ± 4.96    2.29 ± 0.93


Resolution

The influence of the spatial resolution on the generated events is equally big. The rendering pipeline of the Open Graphics Library (OpenGL) determines the color of each pixel based on a three dimensional model. Whenever a pixel of a frame is rendered with a change in brightness greater than the event threshold, an event is generated. This change of brightness can occur due to an object moving into the area rendered to the pixel. The quantization due to the pixels leads to effects such as aliasing, where fine structures like edges are distorted because they are not sampled at a sufficiently high resolution.

TORCS provides an implementation to reduce aliasing effects; however, the effect persists. The reference dataset is therefore recorded using anti-aliasing by super-sampling the frames in TORCS in a resolution of 640×320 pixels, which I implemented additionally. The emulation is then performed by obtaining one pixel brightness from an average of four pixel values to obtain a DVS image with 320×240 pixels. Tab. 4.5 provides an overview of the results obtained without this further spatial refinement. The local plane fitting approach yields slightly reduced average errors, while the FOE estimation based on the Lukas Kanade approach suffers from the reduced spatial resolution.

Table 4.5: Influence of resolution on FOE estimation.

resolution [pixels]   FOE error   µ±σ            reference: 320×240 super-sampled
320×240               ∆xfoe,LP    20.46 ± 7.36   21.79 ± 5.16
                      ∆yfoe,LP    3.84 ± 1.87    5.07 ± 1.63
                      ∆xfoe,LK    9.46 ± 7.07    7.82 ± 8.05
                      ∆yfoe,LK    4.55 ± 3.03    2.29 ± 0.93

4.4.3 Influence of Sensor Parameters

The virtual sensor parameters adapt how the emulator generates events based on the rendered frames. While the real sensor has multiple adaptable biases, the emulator only models the event threshold. On the other hand, the synthetic DVS can simulate behavior that the real sensor is not capable of, such as the absence of event threshold mismatch between pixels.

Event Threshold

The event threshold defines the logarithmic change in brightness necessary to generate an event. The higher the threshold, the fewer events are generated in the same scene. A low threshold, on the other hand, can lead to multiple events being generated by one strong contrast edge passing and, in the real sensor, leads to increased background noise. The level of the event threshold also affects the optical flow and FOE estimation. Tab. 4.6 gives the results for a halved and a doubled event threshold. An event threshold of 40 leads to an event rate of 68.6keps, a value of 10 yields 824.4keps and the reference dataset 259.4keps. Therefore, the threshold level also influences the real-time capability.

The results show different behavior for the two optical flow estimation approaches. The smaller threshold of 10 benefits the accuracy of FOE estimation using the Lukas Kanade approach. Equivalently, the accuracy decays for an increased threshold of 40. Estimation results with the local plane fitting approach show reverse tendencies: for a threshold of 10, the average error in y-direction increases, whereas the increased threshold improves the estimation in x-direction. This effect is discussed further in Section 5.3.

Page 45: Event-based optical-flow for autonomous driving using ... · Event-based vision is a quickly developing field. Many novel event-based approaches to existing problems have been developed

4.4 Focus of Expansion 35

Table 4.6: Influence of event threshold on FOE estimation.

event threshold   FOE error   µ±σ            reference: 20
10                ∆xfoe,LP    20.47 ± 3.69   21.79 ± 5.16
                  ∆yfoe,LP    8.42 ± 3.70    5.07 ± 1.63
                  ∆xfoe,LK    4.15 ± 4.26    7.82 ± 8.05
                  ∆yfoe,LK    1.82 ± 1.88    2.29 ± 0.93
40                ∆xfoe,LP    17.98 ± 6.54   21.79 ± 5.16
                  ∆yfoe,LP    5.31 ± 3.39    5.07 ± 1.63
                  ∆xfoe,LK    15.66 ± 8.11   7.82 ± 8.05
                  ∆yfoe,LK    6.61 ± 4.78    2.29 ± 0.93

Event Threshold Mismatch

The real sensor has a hardware dependent mismatch of the event threshold between pixels. While one pixel might generate an event, the neighboring pixel might not, in spite of the same relative change in brightness being apparent at both. This can have a direct effect on the optical flow estimation, as it can result in missing events in the surface of active events. The reference dataset is recorded without any threshold mismatch. Tab. 4.7 gives the results of FOE estimation with a threshold mismatch of 2.1%, according to the values reported for the real sensor [16]. Both methods show a slight overall decrease in accuracy.

Table 4.7: Influence of event threshold mismatch on FOE estimation.

event threshold mismatch   FOE error   µ±σ            reference: 0%
2.1%                       ∆xfoe,LP    21.12 ± 4.32   21.79 ± 5.16
                           ∆yfoe,LP    6.53 ± 3.05    5.07 ± 1.63
                           ∆xfoe,LK    10.57 ± 7.77   7.82 ± 8.05
                           ∆yfoe,LK    2.29 ± 1.46    2.29 ± 0.93

4.4.4 Influence of Driving Scenario

The reference dataset is a very restricted scenario. It consists of purely translational motion with constant velocity and none of the other degrees of freedom that exist in the real world. Any additional degree of freedom also changes the motion observed by the sensor. The driving domain is restricted in comparison to flying vehicles, but still includes a wide range of velocities and dynamic behavior, putting the described approaches for event-based vision to the test.

Velocity

Cars operate over a range of velocities. Without introducing further dynamic behavior, the test cases presented here follow the same trajectory as the reference case at different speeds, ranging from 5m/s to 30m/s. This way, the robustness of the optical flow and FOE estimation over a range of velocities is put to the test.

The comparability of the results to the reference is diminished for these test cases due to the limitations concerning the simulation time step. For lower velocities, the time step had to be increased in the same ratio, e.g., the time step size was doubled for half the speed. For the test case with a velocity of 30m/s, in turn, the time step size could not be reduced further.

Tab. 4.8 shows the averaged errors in y-direction for the two different optical flow algorithms. For a better overview, the errors in x-direction are omitted here. The FOE error average increases for the reduced velocities of 5m/s and 10m/s. As the motion consists purely of translation, the apparent visual flow directly scales with the ego motion of the car. To equalize the effect of tuning the optical flow according to the underlying motion, the FOE estimation is also performed with the refractory period and the temporal size of the neighborhood scaled proportionally. These results are indicated as tuned.

Table 4.8: Influence of velocity mismatch on FOE estimation.

velocity [m/s]   FOE error   µ±σ            reference: 20m/s
5                ∆yfoe,LP    13.76 ± 6.49   5.07 ± 1.63
                 ∆yfoe,LK    8.22 ± 4.70    2.29 ± 0.93
5 (tuned)        ∆yfoe,LP    6.93 ± 3.39    5.07 ± 1.63
                 ∆yfoe,LK    2.97 ± 1.73    2.29 ± 0.93
10               ∆yfoe,LP    13.11 ± 7.09   5.07 ± 1.63
                 ∆yfoe,LK    7.11 ± 3.78    2.29 ± 0.93
10 (tuned)       ∆yfoe,LP    6.62 ± 3.26    5.07 ± 1.63
                 ∆yfoe,LK    2.49 ± 1.19    2.29 ± 0.93
30               ∆yfoe,LP    5.76 ± 4.16    5.07 ± 1.63
                 ∆yfoe,LK    3.40 ± 2.43    2.29 ± 0.93
30 (tuned)       ∆yfoe,LP    8.13 ± 3.18    5.07 ± 1.63
                 ∆yfoe,LK    6.84 ± 2.86    2.29 ± 0.93

For the velocities 5m/s and 10m/s the tuning improves the FOE estimation, with average errors in a similar range as the reference. At a velocity of 30m/s the adjustment of optical flow estimation increased the average FOE error. In the unadapted case the FOE error is in a similar range as the reference.

Pitch Rotation

[Plot: pitch angle ϕy over time [s] during the run.]

Figure 4.10: Pitch angle as additional degree of freedom.

This scenario still limits the trajectory of the car to a straight line but introduces the pitch angle ϕy as a degree of freedom. The car can rotate about its y-axis, perpendicular to the driving direction and parallel to the ground. Fig. 4.10 shows the pitch angle resulting from shifting gears. The rotation is sensed by the DVS as uniform visual flow across the image plane, as rotational optical flow is independent of depth. Due to the superposition of rotational and translational flow, the ground truth flow field is no longer purely diverging. Consequently, the FOE estimation accuracy suffers during the rotation about the pitch axis. The continued slight change in pitch angle ϕy, however, does not impair the estimation based on the Lukas Kanade method further. The average FOE error using local plane fitting is, at µx ± σ = 17.71 ± 4.62 and µy ± σ = 12.16 ± 3.68, higher than in the reference case.

[Plot: FOE estimation error [pixels] over time [s]; ∆xfoe: µx ± σ = 5.79 ± 5.24, ∆yfoe: µy ± σ = 3.47 ± 4.17.]

Figure 4.11: Focus of expansion estimation error with Lukas Kanade optical flow estimation.

Turning

Similarly to the pitch rotation, another degree of freedom is introduced to the driving dynamics. Turning introduces a rotation about the yaw axis. The flow field is composed of rotational and translational components and is no longer purely diverging. As the FOE estimation is based on this assumption, its accuracy degrades. Fig. 4.12 shows the estimated optical flow for a turning maneuver. The estimated position of the FOE, indicated with the yellow circle, is misplaced to the left of the ground truth, indicated with a red diamond. The superposed rotational component of motion yields optical flow vectors pointing to the right, which causes the inaccurate position estimate.


Figure 4.12: Focus of expansion estimation when turning. Ground truth FOE in red, estimated FOE in yellow.

4.5 Time to Contact Estimation

The Time to Contact (TTC) is estimated in this thesis on the basis of the optical flow field. Through projective geometry, it is possible to compute the time until a collision with an obstacle from nothing more than the perceived relative motion between the obstacle and the moving camera. For the approach implemented in this thesis, a restriction to underlying translational motion applies.

The estimation of the TTC is attempted in a test scenario with strictly constrained translational forward motion at a constant velocity. A car, positioned across the driving lane, serves as the obstacle for the application. Two different algorithms are used to estimate the optical flow field. With neither approach could I accurately estimate the TTC. Fig. 4.13 shows the estimation results for different clusters associated with the obstacle. The collision is imminent in only 0.3s; however, all estimated times indicate a collision in more than 3s. The estimated FOE, indicated with a yellow circle, accurately approximates the ground truth, indicated with a red diamond. The TTC computation based on the local plane fitting approach does not yield accurate results either. Fig. 4.14 illustrates the estimation state for a collision imminent in 1s. Two clusters of events with high activity, indicating the rate of events contributing to the estimate, yield an estimate that is off by more than 50%.


Figure 4.13: Time to contact estimation using Lukas Kanade optical flow. Ground truth FOE in red, estimated FOE in yellow. Collision imminent in 0.3s.

Figure 4.14: Time to contact estimation using local plane fitting optical flow. Ground truth FOE in red, estimated FOE in yellow. Collision imminent in 1s.


Chapter 5

Discussion

In this thesis, synthetic data is generated by emulating asynchronous events from a simulation corresponding to the functioning principle of asynchronous vision sensors. The data is used to perform optical flow estimation, and an application is implemented to attempt event-driven vision in the autonomous driving domain.

5.1 Simulation

The primary source of all synthetic data generated is the simulation in the car racing game TORCS. Despite its age of 20 years, development is still ongoing, and the open-source availability is the main reason for its use in this thesis. With access to the source code, modifications to include the event emulation and the ability to generate customized datasets are possible.

Compared to state of the art computer graphics, TORCS does lack realism in both physics simulation and rendering. These shortcomings manifest themselves most prominently in a time step size limited by the precision of the underlying simulation engine. TORCS is not meant to operate with microsecond precision and uses time variables in seconds instead. Rounding errors amount to unexpected behavior, such as rotation and sidewards motion, when the time step size is decreased too far. Therefore, the temporal precision of frames rendered with TORCS is limited to 20µs.

A comparison of the accuracy of Focus of Expansion (FOE) estimation with different levels of anti-aliasing shows the potential for improvement with increased rendering precision. Especially thin edges, prime sources for events, tend to be poorly represented without sufficient anti-aliasing methods. Many advanced approaches exist in modern computer graphics. TORCS, however, relies on a deprecated version of the Open Graphics Library (OpenGL), and updates concerning the graphics rely on unsupported features. A more computationally efficient approach for the emulation is infeasible with TORCS for the same reason. The legacy version of OpenGL also poses difficulties to properly investigate high dynamic range capabilities with synthetic data: the internal color representation is limited to 8 bit per color, while newer versions provide up to 32-bit capability.

5.2 Emulation

A central part of this work is the implementation of a flexible emulator that serves as a synthetic source of events from an asynchronous vision sensor such as the Dynamic Vision Sensor (DVS). The emulator is implemented as an independent module that interfaces with OpenGL. This way, it can be integrated with little effort into any application that renders frames with OpenGL. The synthetic data provided by the emulator can either be accessed via ROS topics over a networked


connection or saved to a binary file. The file format used is compatible for import with the Java Address-Event Representation (jAER) framework.

In order to validate the emulator output, a direct comparison with real DVS data is performed. The DVS is aimed at a screen to generate events for a motion stimulus of a white bar on a black background. The DVS behavior is mimicked within the bounds of the DVS model used by the emulator. Synthetic data generated with this emulator is used to assess the feasibility of Time to Contact (TTC) estimation based on optical flow for autonomous driving.
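To make the emulation principle concrete, the following C++ sketch generates DVS-style events by thresholding log-intensity changes between two consecutive rendered frames and spreading the event timestamps linearly across the inter-frame interval. The grayscale input, the fixed contrast threshold and the interpolation scheme are simplifying assumptions for illustration, not the exact model parameters of the emulator described above.

// Minimal sketch of DVS event emulation from two rendered frames.
// Assumptions: grayscale intensities in [0, 1], a fixed contrast threshold,
// and timestamps interpolated linearly between the frame times.
#include <cmath>
#include <cstdint>
#include <vector>

struct Event { uint16_t x, y; double t; bool polarity; };

std::vector<Event> emulateEvents(const std::vector<float>& prevFrame,
                                 const std::vector<float>& currFrame,
                                 int width, int height,
                                 double tPrev, double tCurr,
                                 float threshold = 0.15f) {
    std::vector<Event> events;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const int idx = y * width + x;
            // Work in log-intensity space, as the DVS responds to relative change.
            const float dLog = std::log(currFrame[idx] + 1e-3f) -
                               std::log(prevFrame[idx] + 1e-3f);
            // Emit one event per crossed threshold, spreading timestamps
            // linearly across the inter-frame interval.
            const int n = static_cast<int>(std::fabs(dLog) / threshold);
            for (int k = 1; k <= n; ++k) {
                const double t = tPrev + (tCurr - tPrev) * k / (n + 1.0);
                events.push_back({static_cast<uint16_t>(x),
                                  static_cast<uint16_t>(y), t, dLog > 0.0f});
            }
        }
    }
    return events;
}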

5.3 Optical Flow Estimation

Optical flow is only indirectly assessed in this thesis. Computation of ground truth for the optical flow field is a possible advantage of data originating from simulation. However, providing an accurate source of ground truth for optical flow would have exceeded the scope of this thesis, especially considering the limited access to the 3D representation with the outdated OpenGL version used in TORCS. Due to the lack of ground truth, the tuning procedure proved to be difficult. Qualitative measures, accompanied by a simple quantitative measure, are used to obtain suitable parameters for the underlying motion.

The performance of FOE estimation based on optical flow also allows the performance of the optical flow itself to be assessed indirectly. In purely translational flow, the FOE estimation depends only on the accuracy of the orientation of the underlying flow field. Across different datasets, the Lucas-Kanade approach achieves a more accurate FOE estimate than the local plane fitting approach. These differences are consistent with the results described by Rueckauer and Delbruck [25], who find that the Lucas-Kanade approach has a lower average angular error across multiple test cases.

In a set of test cases, the driving speed is varied from 5 m/s to 30 m/s. The accuracy of FOE estimation degrades for both investigated optical flow algorithms. Changing the tuning parameters according to the change of motion restores the accuracy almost to the level of the reference dataset. This shows that the accuracy of both optical flow algorithms is susceptible to changes in the perceived motion. Depending on the tuning parameters, only a range of velocities is accurately estimated. Values exceeding this range are filtered out as excessively high, whereas lower velocities are omitted because the neighborhood is not defined sufficiently large to capture the motion.

5.4 Time to Contact Estimation for Autonomous Driving

Time to Contact (TTC) estimation is a way of enabling collision avoidance using only visual data as input. Its feasibility was shown for a slow-moving robot based on event-based optical flow [8]. This approach is used in this thesis to assess bringing event-based vision to autonomous driving.

The very nature of the driving environment, however, poses a first problem for this approach. The road below the horizon and the open corridor above the horizon along the road are a poor source of events. Clear structures, such as buildings, posts or lanterns, are found only to the sides. This imbalance manifests itself in a reduced accuracy of the FOE estimate in x-direction, independent of the underlying flow method. The FOE is estimated by updating a probability map with each flow estimate. Due to the sparsity of events along the vertical column in driving direction, the estimate in x-direction is mostly determined by events coming from outside this column. The resulting probability is spread more evenly in x-direction, which makes the estimate susceptible to wrong flow estimates from inside the column that further restrict the location. The effect is demonstrated by introducing additional vertical lines in the texture of a building that lies directly in the driving direction for the scenario high rise orange. These vertical


lines introduce events in y-direction that help to further confine the probability of the FOE, as they originate close to its x-position. Fig. 5.1 compares the error of FOE estimation with and without these vertical lines. The improvement is visible, especially towards the end of the scene, as the building approaches and generates more flow estimates. In general, however, the driving environment does not offer any additional sources of events in the corridor along the driving direction.
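The probability-map update for the FOE can be illustrated with the following sketch, which accumulates votes on a coarse candidate grid: a flow event supports every candidate position that lies, within an angular tolerance, opposite to its flow direction, since in purely translational motion all flow vectors point away from the FOE. Grid size, cell size and the additive vote model are assumptions for the example, not the exact update rule used in the implementation.

// Sketch of FOE estimation by accumulating votes from flow orientations only.
// Assumptions: a coarse candidate grid, a simple angular tolerance, additive votes.
#include <cmath>
#include <vector>

struct FlowEvent { float x, y, vx, vy; };

void voteForFoe(const FlowEvent& e, std::vector<float>& grid,
                int gridW, int gridH, float cellSize,
                float angleTolerance = 0.1f) {
    for (int gy = 0; gy < gridH; ++gy) {
        for (int gx = 0; gx < gridW; ++gx) {
            const float cx = (gx + 0.5f) * cellSize;
            const float cy = (gy + 0.5f) * cellSize;
            // In purely translational motion, flow points away from the FOE,
            // so the vector from the candidate to the event must align with the flow.
            const float dx = e.x - cx, dy = e.y - cy;
            const float dot = dx * e.vx + dy * e.vy;
            const float norm = std::sqrt((dx * dx + dy * dy) *
                                         (e.vx * e.vx + e.vy * e.vy));
            if (norm > 0.0f && dot / norm > std::cos(angleTolerance))
                grid[gy * gridW + gx] += 1.0f;  // candidate consistent with this flow event
        }
    }
}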

Figure 5.1: Focus of expansion estimation error with local planes optical flow estimation (error in pixels over time in seconds). Δx with vertical lines: µ ± σ = 15.01 ± 5.09 px; Δx without vertical lines: µ ± σ = 21.37 ± 5.36 px.

Driving is for the largest part composed of translational motion. The respective flow field is inversely dependent on depth. This results in a wide range of velocities for any forward motion, as objects far away appear to move very slowly while objects close by or in the periphery move relatively fast. Additionally, cars have a wide operating range of velocities. The event-based optical flow algorithms used in this thesis, however, are found to have an operating range limited by their tuning parameters. The reduced accuracy of flow estimation in the driving scenario also directly affects the attempted TTC estimation.

The formulation of TTC estimation according to [8] is not achieved. The described difficulty of providing accurate optical flow estimation in the driving scenario is one contributing factor. TTC estimation is based on the optical flow field. First, the so-called FOE is approximated according to the orientation of the flow field. In a second estimation step, flow events associated with an obstacle and their distance to the FOE are used to compute the time until a collision with that obstacle. In this way, the position error of the FOE contributes to the final TTC estimate. While the FOE can be derived independent of the flow magnitude, the TTC computation requires accurate flow estimation in both direction and magnitude. As I was unable to obtain an accurate TTC estimate with either of the applied optical flow algorithms, I conclude that both methods provide limited flow magnitude accuracy. The Lucas-Kanade approach yielded results off by an order of magnitude, while the local plane fitting approach was off by about 50%. Corresponding relative inaccuracies are also reported by Rueckauer and Delbruck [25].

Another challenge lies in the opposing requirements of accurate FOE estimation and obtaining a good TTC estimate for an obstacle. Estimating the FOE relies on estimates from all over the perceived scene to obtain robust results by a kind of triangulation. The TTC estimate, on the


other hand, improves with more events originating from the obstacle. In the presented test cases, the obstacle does not provide sufficient flow estimates until shortly before the imminent collision. The driving scenario is therefore not well suited to the formulation of TTC estimation provided by [8].
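For reference, the per-event computation described above reduces to dividing the distance of a flow event from the FOE by its radial flow component. The following is a minimal sketch, assuming that each flow event carries a full velocity estimate (direction and magnitude) in pixels per second; it is not the exact implementation used in this thesis.

// Per-event TTC: distance to the FOE divided by the radial flow magnitude.
#include <cmath>

struct Foe { float x, y; };
struct FlowEvent { float x, y, vx, vy; };

// Returns the time to contact in seconds, or a negative value if the event
// carries no outward radial motion (e.g. noise or purely tangential flow).
float timeToContact(const FlowEvent& e, const Foe& foe) {
    const float rx = e.x - foe.x;
    const float ry = e.y - foe.y;
    const float r = std::sqrt(rx * rx + ry * ry);
    if (r == 0.0f) return -1.0f;
    // Radial flow: projection of the flow vector onto the direction away from the FOE.
    const float radialFlow = (rx * e.vx + ry * e.vy) / r;
    return (radialFlow > 0.0f) ? r / radialFlow : -1.0f;
}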


Chapter 6

Conclusion

This thesis shows how insights into event-based vision can be gained using purely synthetic data obtained from a simulation. The application of event-based vision approaches showed that the accuracy of the underlying simulation is limited. The quality of the simulated Dynamic Vision Sensor (DVS) data relies heavily on the accurate timing of events. A more up-to-date simulation could provide increased resolution in both space and time to generate more realistic synthetic events.

The emulated data was used to investigate an event-based vision approach under the circumstances of autonomous driving. Although the attempt to estimate Time to Contact (TTC) was not successful, the investigation provided insights into the performance of two event-based optical flow algorithms. Both methods examined showed a limited operating range in which they accurately estimate the underlying visual motion. It proved to be a challenging task to tune these methods without knowledge of ground truth. These limits are particularly relevant for translational motion, as a wide range of velocities appears.

In order to gain further insights for event-based optical flow estimation, the existing project could be extended to compute optical flow ground truth directly in the simulation. This would be an efficient way to provide benchmarks for optical flow estimation. As the emulator already has the capability to save frames, a direct comparison of event-based and frame-based approaches is possible by computing a baseline for optical flow from state-of-the-art frame-based algorithms.

The emulator used in this thesis for event-based vision is not limited to the driving scenario. The flexible implementation allows a range of different applications. Particularly interesting is the use of the data for asynchronous processing, e.g. in Spiking Neural Networks (SNNs). Instead of using the asynchronous events in a serial processor, the full potential of asynchronous vision can be explored in neuromorphic systems.

Extensions to the emulator itself are worth considering as well. Possible alterations are to include support for stereo vision or to emulate the behavior of the Dynamic and Active-pixel Vision Sensor (DAVIS), which captures both relative and absolute brightness values.


Chapter 7

Appendix

7.1 Event Transmission with ROS-topics

In a master's thesis by Randl [24], broadcasting events using Robot Operating System (ROS) topics proved to be a bottleneck. A maximum event transmission rate of 100 kHz was achieved using the Transmission Control Protocol (TCP). This rate is insufficient for many applications, especially for natural scenes. My program is implemented in C++ instead of Python, as Randl suggested in his thesis. The transmission performance can be greatly improved by disabling Nagle's algorithm on the TCP connection, i.e. by passing tcpNoDelay() as a transport hint to the subscriber. Fig. 7.1 shows that a maximum event frequency of 6 MHz can be reached with a packet size of 30000 events at a publishing rate of 200 Hz. At the same time, the transmission duration can be reduced by 90% to 1.3 ms while increasing the transmission package size from 1000 events to 30000, as can be seen in Fig. 7.2.
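A minimal roscpp subscriber illustrating the transport hint is shown below. The topic name and the message type (dvs_msgs::EventArray from the rpg_dvs_ros stack) are assumptions for the example and not necessarily those used by the emulator in this work.

// Minimal roscpp subscriber that disables Nagle's algorithm via a transport hint.
// Topic name and message type are placeholders for illustration only.
#include <ros/ros.h>
#include <dvs_msgs/EventArray.h>  // assumed message type, not necessarily the one used here

void eventCallback(const dvs_msgs::EventArray::ConstPtr& msg) {
    ROS_INFO("Received packet with %zu events", msg->events.size());
}

int main(int argc, char** argv) {
    ros::init(argc, argv, "event_listener");
    ros::NodeHandle nh;
    // tcpNoDelay() disables Nagle's algorithm on the TCP connection, so small,
    // frequent event packets are not held back and latency stays low.
    ros::Subscriber sub = nh.subscribe("/dvs/events", 10, eventCallback,
                                       ros::TransportHints().tcpNoDelay());
    ros::spin();
    return 0;
}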

Figure 7.1: Event frequency [kHz] over event transmission package size.


Figure 7.2: Transmission duration [ms] for differing event packet sizes (error bars: standard deviation).

7.2 Ground Truth Computation

In order to compute ground truth values for the Focus of Expansion (FOE), it is necessary to obtain the focal length f of the camera, an intrinsic value. In the car racing simulation TORCS, the camera settings are specified via the Field of View (FOV), which specifies how much of the scene can be observed, in degrees. The focal length f can be obtained using equation (7.1). Fig. 7.3 visualizes how it can be derived from the simple intercept theorem. The image dimension d, in this case the sensor height, is given in pixels. This yields a focal length f in pixels rather than in mm.

Figure 7.3: Focal length from field of view (camera axes Y_c and Z_c, vertical field of view FOV_y, image dimension d, focal length f).

f_{x,y} = (d / 2) · 1 / tan(FOV_{x,y} / 2)    (7.1)
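As a usage note, equation (7.1) translates into a few lines of code. The helper below converts the FOV from degrees (as specified in the TORCS settings) to radians and returns the focal length in pixels; the function name and interface are illustrative only.

// Focal length in pixels from a field of view in degrees and the corresponding
// image dimension in pixels, following equation (7.1).
#include <cmath>

double focalLengthPixels(double fovDegrees, double imageDimensionPixels) {
    const double pi = 3.14159265358979323846;
    const double fovRadians = fovDegrees * pi / 180.0;
    // Intercept theorem: tan(FOV / 2) = (d / 2) / f  =>  f = (d / 2) / tan(FOV / 2)
    return (imageDimensionPixels / 2.0) / std::tan(fovRadians / 2.0);
}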


Bibliography

[1] Barranco, F., Fermuller, C., and Aloimonos, Y. "Contour Motion Estimation for Asynchronous Event-Driven Cameras". In: Proceedings of the IEEE 102.10 (2014), pp. 1537–1556.

[2] Benosman, R., Clercq, C., Lagorce, X., Ieng, S.-H., and Bartolozzi, C. "Event-Based Visual Flow". In: IEEE Transactions on Neural Networks and Learning Systems 25.2 (2014), pp. 407–417.

[3] Benosman, R., Ieng, S.-H., Clercq, C., Bartolozzi, C., and Srinivasan, M. "Asynchronous frameless event-based optical flow". In: Neural Networks 27 (2012), pp. 32–37.

[4] Brosch, T., Tschechne, S., and Neumann, H. "On Event-Based Optical Flow Detection". In: Frontiers in Neuroscience 9 (2015), pp. 1–15.

[5] Butler, D. J., Wulff, J., Stanley, G. B., and Black, M. J. "A Naturalistic Open Source Movie for Optical Flow Evaluation". In: European Conf. on Computer Vision (ECCV). Ed. by Fitzgibbon, A. Part IV, LNCS 7577. Springer-Verlag, 2012, pp. 611–625.

[6] Camus, T. "Calculating Time-to-Contact Using Real-Time Quantized Optical Flow". In: NIST Interagency/Internal Report (NISTIR) 5609 (1995).

[7] Censi, A. and Scaramuzza, D. "Low-Latency Event-Based Visual Odometry". In: IEEE International Conference on Robotics and Automation (ICRA). 2014, pp. 703–710.

[8] Clady, X., Clercq, C., Ieng, S.-H., Houseini, F., Randazzo, M., Natale, L., Bartolozzi, C., and Benosman, R. "Asynchronous visual event-based time-to-contact". In: Frontiers in Neuroscience 8 (2014), pp. 1–10.

[9] Clady, X., Maro, J.-M., Barre, S., and Benosman, R. B. "A Motion-Based Feature for Event-Based Pattern Recognition". In: Frontiers in Neuroscience 10 (2016), pp. 1–20.

[10] Conradt, J. "On-Board Real-Time Optic-Flow for Miniature Event-Based Vision Sensors". In: IEEE International Conference on Robotics and Biomimetics. Piscataway, NJ: IEEE, 2015, pp. 1858–1863.

[11] Conradt, J., Galluppi, F., and Stewart, T. C. "Trainable sensorimotor mapping in a neuromorphic robot". In: Robotics and Autonomous Systems 71 (2015), pp. 60–68.

[12] Garcia, G. P., Camilleri, P., Liu, Q., and Furber, S. "pyDVS: An Extensible, Real-Time Dynamic Vision Sensor Emulator using Off-the-Shelf Hardware". In: IEEE Symposium Series on Computational Intelligence (SSCI). 2016, pp. 1–7.

[13] Horn, B. and Schunck, B. G. "Determining Optical Flow". In: Artificial Intelligence 17.1-3 (1981), pp. 185–203.

[14] Katz, M. L., Nikolic, K., and Delbruck, T. "Live Demonstration: Behavioural Emulation of Event-Based Vision Sensors". In: IEEE International Symposium on Circuits and Systems (ISCAS). Piscataway, NJ: IEEE, 2012, pp. 736–740.

[15] Kueng, B., Mueggler, E., Gallego, G., and Scaramuzza, D. "Low-Latency Visual Odometry using Event-Based Feature Tracks". In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 16–23.


[16] Lichtsteiner, P., Posch, C., and Delbruck, T. "A 128×128 120 dB 15 µs Latency Asynchronous Temporal Contrast Vision Sensor". In: IEEE Journal of Solid-State Circuits 43.2 (2008), pp. 566–576.

[17] Liu, S.-C. and Delbruck, T. "Neuromorphic sensory systems". In: Current Opinion in Neurobiology 20.3 (2010), pp. 288–295.

[18] Lucas, B. D. and Kanade, T. "An Iterative Image Registration Technique with an Application to Stereo Vision". In: Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2. IJCAI'81. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1981, pp. 674–679.

[19] Matthies, L., Brockers, R., Kuwata, Y., and Weiss, S. "Stereo vision-based obstacle avoidance for micro air vehicles using disparity space". In: IEEE International Conference on Robotics and Automation (ICRA). Piscataway, NJ: IEEE, 2014, pp. 3242–3249.

[20] Milde, M. B., Bertrand, O. J., Benosman, R., Egelhaaf, M., and Chicca, E. "Bioinspired event-driven collision avoidance algorithm based on optic flow". In: International Conference on Event-Based Control, Communication, and Signal Processing (EBCCSP 2015). Piscataway, NJ: IEEE, 2015, pp. 1–7.

[21] Milde, M. B., Blum, H., Dietmüller, A., Sumislawska, D., Conradt, J., Indiveri, G., and Sandamirskaya, Y. "Obstacle Avoidance and Target Acquisition for Robot Navigation Using a Mixed Signal Analog/Digital Neuromorphic Processing System". In: Frontiers in Neurorobotics 11 (2017), p. 28.

[22] Mueggler, E., Rebecq, H., Gallego, G., Delbruck, T., and Scaramuzza, D. "The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and SLAM". In: The International Journal of Robotics Research 36.2 (2017), pp. 142–149.

[23] Orchard, G. and Etienne-Cummings, R. "Bioinspired Visual Motion Estimation". In: Proceedings of the IEEE 102.10 (2014), pp. 1520–1536.

[24] Randl, K. R. "Active Dynamic Vision Based on Micro-Saccades". Master's Thesis. München: Technische Universität München, 2017.

[25] Rueckauer, B. and Delbruck, T. "Evaluation of Event-Based Algorithms for Optical Flow with Ground-Truth from Inertial Measurement Sensor". In: Frontiers in Neuroscience 10 (2016), pp. 1–17.

[26] Serrano-Gotarredona, T. and Linares-Barranco, B. "A 128×128 1.5% Contrast Sensitivity 0.9% FPN 3 µs Latency 4 mW Asynchronous Frame-Free Dynamic Vision Sensor Using Transimpedance Preamplifiers". In: IEEE Journal of Solid-State Circuits 48.3 (2013), pp. 827–838.

[27] Stewart, T. C., Kleinhans, A., Mundy, A., and Conradt, J. "Serendipitous Offline Learning in a Neuromorphic Robot". In: Frontiers in Neurorobotics 10 (2016), p. 1.

[28] Sun, D., Roth, S., and Black, M. J. "A Quantitative Analysis of Current Practices in Optical Flow Estimation and the Principles Behind Them". In: International Journal of Computer Vision 106.2 (2014), pp. 115–137.

[29] Tedaldi, D., Gallego, G., Mueggler, E., and Scaramuzza, D. "Feature Detection and Tracking with the Dynamic and Active-pixel Vision Sensor (DAVIS)". In: International Conference on Event-Based Control, Communication, and Signal Processing. Piscataway, NJ: IEEE, 2016, pp. 1–7.

[30] Wymann, B., Espié, E., Guionneau, C., Dimitrakakis, C., Coulom, R., and Sumner, A. TORCS, The Open Racing Car Simulator. 2014.


Disclaimer

I hereby declare that this thesis is entirely the result of my own work except where otherwise indicated. I have only used the resources given in the list of references.

Munich, 29. September 2017 (Signature)