
POMDP-based Decision Making for Cognitive Cars using an Adaptive State Space

Study Thesis of

Sebastian Klaas

At the Faculty of Computer Science
Humanoids and Intelligence Systems Laboratories

Reviewer: Prof. Dr.-Ing. Rüdiger Dillmann
Advisor: Dipl.-Inform. Sebastian Brechtel

January 01, 2011 – February 28, 2011

KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association – www.kit.edu


I declare that I have developed and written the enclosed thesis completely by myself, and have not used sources or means without declaration in the text.

Karlsruhe, February 28, 2011

Sebastian Klaas


Abstract

This thesis analyzes the use of Partially Observable Markov Decision Process (POMDP) based decision making for cognitive cars. To this end, the modeling of vehicles and of the vehicle environment is described, including state, action and observation spaces, transition and observation probabilities, as well as discounting in infinite-horizon (PO)MDPs.
The focus lies on the combination of continuous and discrete descriptions. The chances and problems of both static and adaptive state spaces (both have been implemented during this work) are analyzed and discussed.
Finally, the future perspective of this general approach is evaluated, and the work that remains to be done before POMDP-based decision making can be applied to traffic situations in real, complex and highly dynamic environments is outlined.


German Abstract

This thesis examines the possibilities of using partially observable Markov decision processes (POMDPs) for planning in cognitive automobiles. Among other things, a way of describing the vehicle and environment state as a POMDP is presented (with state, observation and action spaces, observation and transition probabilities, as well as discounting in infinite-horizon (PO)MDPs).
Particular attention is paid to the interplay of discrete and continuous modeling. In addition, the possibilities and problems of static and adaptive state spaces are examined and compared (both variants were implemented for this purpose).
Fundamental problems and perspectives of the POMDP-based approach are pointed out, and it is discussed which further work is still necessary in order to deploy POMDP-based planning systems in real traffic situations.


Contents

1 Introduction
    1.1 Motivation
    1.2 Related Work
    1.3 Overview of this thesis

2 Theoretical Background
    2.1 Formal Definition of POMDPs
    2.2 Policy Computation
    2.3 SARSOP Solver

3 (PO)MDP Model Design
    3.1 Static State Space Model
        3.1.1 State Space
        3.1.2 Action Space
        3.1.3 Discounting
        3.1.4 Reward Function
        3.1.5 Transition Model
        3.1.6 Observation Model
    3.2 Adaptive State Space Model
        3.2.1 State Space
        3.2.2 Action Space
        3.2.3 Discounting
        3.2.4 Reward Function
        3.2.5 Transition Model
        3.2.6 Observation Model

4 Results
    4.1 Behavior
    4.2 Performance
        4.2.1 Static State Space
        4.2.2 Adaptive State Space
    4.3 Influence of parameters
        4.3.1 Rewards
    4.4 Main Problems
        4.4.1 Static State Space
        4.4.2 Adaptive State Space

5 Summary
    5.1 Comparison between the static and adaptive approaches
    5.2 General Problems and Chances of a POMDP-based Decision Making
    5.3 Future Work

6 German Abstract

Bibliography


1 Introduction

1.1 Motivation

According to the German Federal Statistical Office [Fed10], about 2.3 million traffic accidents happened on German roads in 2009, with a total of about 400 thousand victims, of which 4152 were fatally injured. Although the number of victims has been decreasing continuously since the early 70s, probably due to the improvement of passive safety systems (e.g. airbags and seat belts), the overall number of traffic accidents (including not only car accidents) has stayed constant over the last decades.
In order to improve road safety, many research facilities and car manufacturers all over the world are currently developing active safety systems such as driver assistance systems (e.g. brake assist systems, lane change assist systems, adaptive cruise control, collision avoidance systems and many more). Others are working on autonomous or cognitive cars and intend to minimize the influence of human drivers. In addition to the safety improvement, autonomous cars could improve infrastructure capacity utilization, environmental protection (due to a more anticipatory driving style) and driving comfort (especially on long highway journeys).
A basic component of autonomous vehicles is the decision making module, which has to find reasonable behaviors for the specific traffic situation.
For this task it is useful to consider uncertainty. Not only is the observation provided by the sensor systems uncertain (e.g. due to noise), but the behavior of other traffic participants is also not known in advance and can only be estimated. Partially Observable Markov Decision Processes provide means for decision making under uncertainty and shall therefore be considered as an alternative to traditional approaches.

1.2 Related Work

The tasks a cognitive car or a human driver has to perform in everyday situations are highly complex and diverse. Perception of the environment (using a complex sensor system), interpretation of the situation, the execution of driving maneuvers and decision making may be the most important ones.
To achieve the goals described in the previous section, cognitive cars do in fact need a highly sophisticated decision making module.
Figure 1.1 shows the basic planning problem. There is a multitude of different


Figure 1.1: Planning in standard traffic situations [Hum11]

decisions in each traffic situation, some of which improve the situation and some of which make it worse. Unfortunately (as shown in the figure), decisions which look promising at first glance may lead to severe problems later on.
In 2007, most of the teams which participated in the DARPA Urban Challenge used planners based on more or less complex state machines. The Urban Challenge was a competition for autonomous cars taking place in an urban-like environment in which the cars had to interact completely autonomously. It was organized by the Defense Advanced Research Projects Agency, a research office of the American Department of Defense. Although the environment was still far from a real urban environment, this competition showed the possibilities as well as the limitations of a state machine-based approach. The state machine used by Team AnnieWAY (a project funded by the German Research Foundation, involving the KIT and other German research groups (CRC/TR 28)) modeled the car's behavior as hierarchical states, in order to model quite complex situations. A detailed description of the planner used back in 2007 can be found in [GJPD08]. Those hierarchical state machines were sufficient to model all the important situations for this challenge. But as has been pointed out before, the problem of autonomous driving in real environments is highly complex, and it is nearly impossible to handle all situations with the use of state machines.

Hence new and more sophisticated approaches that are able to handle all the factors which make decision making complex are to be investigated now and during the next years.
Those approaches have to handle a number of different problems. First of all, decision making in an on-road situation depends on the environment, which is highly dynamic. There are many cars and other vehicles as well as pedestrians and bicyclists participating in traffic situations. Their number, position and


Figure 1.2: Team AnnieWAY's autonomous car at the DARPA Urban Challenge 2007.

behavior change rapidly, and their actions may not be reasonable at all times. Moreover, the number of different possible traffic situations, each of them enforcing a different behavior, is simply too big to be handled by a more or less hand-coded system (like e.g. state machines). In addition, environment perception is typically noisy and incomplete. For example, the possibility that cars or other traffic participants are occluded behind houses (as shown in figure 1.3) or behind other vehicles and cannot be detected using the car's sensor system has

Figure 1.3: Occlusion of a traffic participant behind a house.

to be taken into account when deciding on an action to execute.


To solve those and many more problems, algorithms are needed that are able to handle different kinds of situations without explicit modeling of all those situations, and that are capable of handling uncertain observations. One possible solution is the use of Partially Observable Markov Decision Processes (POMDPs). POMDPs are a combination of Hidden Markov Models (HMMs, cp. [RJ86]) and Markov Decision Processes (MDPs, cp. [SB98]), providing a model for reinforcement learning approaches. The general idea behind a POMDP-based decision making system is to let the algorithm choose appropriate actions on its own, of course according to given goals and constraints.
In the last years, POMDPs have been successfully used in many research projects in the field of service robotics (cp. [SRKLD08], [SRJLD08], [SKP+08], [HBPM07]). For example, in [SRKLD08] a scenario has been modeled in which a service robot interacts with humans in order to fetch a cup and bring it to a given location. The results are very promising, especially when compared to MDPs and to numerous other approaches. Another promising robotic POMDP implementation is presented in [HBPM07]. Hoey et al. implemented a real-time system assisting persons with dementia during hand-washing; their system therefore has to adapt itself to the user (concerning awareness and other factors). Nevertheless, the complexity of all those systems is comparably low, and so far there are no experiments proving the performance and usability of POMDP-based decision making in highly dynamic and (because autonomous cars are driving) frequently changing environments like those in which autonomous vehicles typically act.

1.3 Overview of this thesis

This thesis aims to implement a working (PO)MDP-based planner for cognitive cars and to evaluate the opportunities POMDPs provide in this area, based on simple traffic situations with a comparably low number of participants. Section 2 describes the theoretical background of POMDPs and MDPs, the formal definition and the basic algorithms used to solve POMDPs. Section 2.3 describes the (PO)MDP solver which is adapted for this work. Section 3 describes a first solution using a static, predefined state space, as well as the adaptive state space which was implemented as a solution to some of the problems discovered with the static version. Section 4 gives an overview of the results obtained with the two approaches and of the main problems. Section 5 gives a short summary and tries to point out a couple of questions which are to be answered in future research and which may lead to further improvements on the way to a POMDP-based decision making system applicable to real traffic situations. Finally, section 6 gives a German abstract of this thesis, recalling the main implementation decisions and results.


2 Theoretical Background

2.1 Formal Definition of POMDPs

Basically, Partially Observable Markov Decision Processes are an extension of Markov Decision Processes (MDPs, cp. [SB98]). Formally, a POMDP is specified as a tuple (S, A, O, T, Z, R, γ, b_0), with S being a finite set of states the problem can adopt. POMDPs consist of a second set (A) of discrete actions and a third one, called O, of possible observations or measurements. T(s, a, s′) = P(s′|a, s) is the transition probability function, describing the probability of a transition from state s to state s′ when executing action a. This transition probability extends every single transition of the model shown in figure 1.1 by an uncertain transition, as figure 2.1 shows for one transition.

Figure 2.1: Uncertain transition in a POMDP.

Z(s, o) = P(o|s) is the observation probability function, describing the probability of an observation o being observed when the system is in state s. The immediate reward function R(s, a) defines the numeric reward the system receives for applying action a in state s, and γ specifies a time discount factor, discounting rewards for future state and action combinations.
Using this discount factor γ, equation 2.1 calculates the summed reward for a series of states s_t and actions a_t.

R_{total} = \sum_{t=0}^{n} \gamma^t R(s_t, a_t)    (2.1)


\lim_{n \to \infty} R_{total}    (2.2)

Extending equation 2.1, equation 2.2 defines an upper bound for the total expected reward assuming an infinite horizon.
Obviously, POMDPs extend MDPs by uncertain observation. Thus, at every time-step t the current state s_t is not known for sure, and hence a POMDP solver has to manage so-called belief states b_t ∈ B. Those belief states b_t are probability distributions over all states of the system, defining the current degree of belief for each of them at time-step t. They can be considered a description of the knowledge about the current state of the system. The initial belief state b_0 describes the belief of the system at the beginning of the execution. At each time-step the belief of the system has to be updated; this is typically done using Bayesian forward-filtering.
In order to make a decision about the best action at a time t with a given belief state b_t representing the state of the system at time t, POMDP policies are computed. Those policies encode the best action for each belief distribution, using so-called alpha vectors.
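To make the tuple definition concrete, the following sketch represents a finite POMDP as dense lookup tables (C++ is used here and for all later sketches; the representation and all names are illustrative assumptions, not the data structures of any particular solver):

#include <vector>

// A finite POMDP (S, A, O, T, Z, R, gamma, b0) as dense lookup tables.
struct Pomdp {
    int nStates, nActions, nObservations;
    std::vector<std::vector<std::vector<double>>> T; // T[s][a][s'] = P(s'|s,a)
    std::vector<std::vector<double>> Z;              // Z[s][o] = P(o|s)
    std::vector<std::vector<double>> R;              // R[s][a], immediate reward
    double gamma;                                    // discount factor
    std::vector<double> b0;                          // initial belief over S
};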

2.2 Policy Computation

A POMDP policy π : B → A maps every belief b to an action a. The goal of each POMDP solver is to find an optimal policy π*, which maximizes the expected total reward E[lim_{n→∞} R_{total}].
One of the basic algorithms for POMDP policy computation is value iteration. Value iteration iteratively computes a finite set Γ of α-vectors, representing a value function V : B → R, which maps the possible belief states to their expected total reward and represents the POMDP policy (equation 2.3).

V^{\pi}(b_0) = \sum_{t=0}^{\infty} \gamma^t R(b_t, \pi(b_t)) = \max_{\alpha \in \Gamma} (\alpha \cdot b_0)    (2.3)

R(b_t, \pi(b_t)) = \sum_{s \in S} R(s, \pi(b_t)) \, b_t(s)    (2.4)

Each of those computed α-vectors is connected to an action a, and when stored as a policy they can easily be used to evaluate the utility of action a for every given belief state b by evaluating the argmax of the second part of equation 2.3. Actually, the computation of optimal policies using value iteration or other optimal POMDP-solving algorithms is computationally complex. Hence approximate point-based algorithms like SARSOP, which is described in section 2.3, are the better choice for most realistic applications. Those have a far better performance with only slightly inferior solutions.
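To illustrate how such a policy is applied, the following sketch evaluates the max and argmax of equation 2.3 over a stored set of α-vectors; the pairing of each vector with an action follows the description above, and all names are again illustrative:

#include <cstddef>
#include <vector>

// One alpha-vector together with the action it was computed for.
struct AlphaVector {
    std::vector<double> coeffs; // one entry per state
    int action;
};

// V(b) = max_{alpha in Gamma} (alpha . b); the policy returns the action
// belonging to the maximizing alpha-vector (the argmax of equation 2.3).
int bestAction(const std::vector<AlphaVector>& gamma,
               const std::vector<double>& belief) {
    int best = -1;
    double bestValue = -1e300;
    for (std::size_t i = 0; i < gamma.size(); ++i) {
        double value = 0.0;
        for (std::size_t s = 0; s < belief.size(); ++s)
            value += gamma[i].coeffs[s] * belief[s]; // dot product alpha . b
        if (value > bestValue) {
            bestValue = value;
            best = gamma[i].action;
        }
    }
    return best;
}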


2.3 SARSOP Solver

The POMDP-based decision making framework presented in this thesis is based on the "Approximate POMDP Planning Toolkit" (APPL, http://bigbird.comp.nus.edu.sg/pmwiki/farm/appl). APPL is a POMDP solver toolkit implementing the SARSOP algorithm [KHL08], which tries to decrease the computational effort of POMDP solving in order to make POMDPs more affordable for many different tasks. SARSOP uses a point-based (discretized) approach, operating not in the whole belief space but in the "Optimally Reachable Belief Space". This subset of the belief space contains the beliefs which are reachable from the given belief point by applying a sequence of optimal actions. Actually, the "Optimally Reachable Belief Space" is itself some kind of approximate POMDP solution and cannot be known in advance. Hence a sophisticated iterative approach is used to keep the calculation as close to the optimal solution as possible.

SARSOP (Successive Approximation of the Reachable Space under Optimal Policies) consists of three main steps: sampling, backup and pruning. While the backup and pruning steps are very similar to those of other point-based algorithms (e.g. heuristic search value iteration [SS04] and point-based value iteration [PGT03]), the sampling step differs from those of the other algorithms, in order to sample only in the "Optimally Reachable Belief Space" described above.
The basic idea of sampling is to find an approximate representation of the belief space B, typically a set of points out of the belief space. In contrast to other point-based algorithms, which sample in the set of belief points reachable from b_0 by applying an arbitrary sequence of actions from A, SARSOP only samples in the set of points which is reachable from b_0 when applying optimal sequences of actions. This much smaller set of belief points is called the "Optimally Reachable Belief Space" and is shown in figure 2.2 as a subset of the space reachable from b_0.

Throughout the sampling step, SARSOP builds a tree T_R with nodes representing sampled points b and their respective beliefs, and with edges representing actions a and observations o. The sampling algorithm picks a node b, chooses an action a and an observation o, and performs sampling by applying equation 2.5 for each s_new in the state space (η is a normalization factor; Z and T are the probability functions defined in section 2.1).

b_{new}(s_{new}) = \eta \, Z(s_{new}, a, o) \sum_{s} T(s, a, s_{new}) \, b(s)    (2.5)

In order to prevent sampling in the whole reachable space (which is done by many other approximate point-based algorithms), the actions a and observations o are chosen by evaluating the upper and lower bounds of a value function similar to the one described in equation 2.3. These bounds are initialized before the first


Figure 2.2: The "Optimally Reachable Belief Space", based on [KHL08].

sampling step and continuously updated during backups.

For each node b in T_R, the backup step collects the information stored in the children of the node and propagates the value function (or its upper and lower bounds) back to the root b_0. This is basically done by computing α-vectors for each action and choosing the one with the highest expected reward.

Afterwards, the pruning step deletes α-vectors and belief points which are not considered useful for the further computations. First, nodes of T_R which are definitely not in the optimally reachable space are deleted. The resulting tree is then considered an approximation of the optimally reachable space. As a second step, α-vector pruning deletes all α ∈ Γ which are dominated by others over this whole space (α_1 dominates α_2 at a belief b when V_1 > V_2 holds for every belief point b′ near b, with V_i = α_i · b′).
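A sketch of the pointwise dominance test used in this pruning step, checked over the sampled belief points that approximate the optimally reachable space (the actual APPL implementation differs in its details):

#include <cstddef>
#include <vector>

// alpha1 dominates alpha2 if V1 = alpha1 . b' is at least V2 = alpha2 . b'
// at every sampled belief point b' of the (approximated) optimally
// reachable space; dominated vectors can be pruned from Gamma.
bool dominates(const std::vector<double>& alpha1,
               const std::vector<double>& alpha2,
               const std::vector<std::vector<double>>& beliefPoints) {
    for (const std::vector<double>& b : beliefPoints) {
        double v1 = 0.0, v2 = 0.0;
        for (std::size_t s = 0; s < b.size(); ++s) {
            v1 += alpha1[s] * b[s];
            v2 += alpha2[s] * b[s];
        }
        if (v1 < v2) return false; // alpha2 is better somewhere
    }
    return true;
}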

These three main steps are repeated until the gap between upper and lowerbounds of the value function in the root node of the belief tree is smaller than apredefined target precision.


3 (PO)MDP Model Design

As described before, the intuitive solution for a POMDP-based decision making is to use the SARSOP solver from APPL, described in the previous section, and to extend it by a framework generating and describing the problem at hand as a POMDP or MDP problem. The goal stated at the beginning of this work is to create an off-line planner capable of handling driving on straight roads as well as overtaking situations and "T" intersections (figure 3.1) with maybe two or three additional cars.
For this aim, the solver is extended by a problem definition for autonomous

Figure 3.1: Traffic situation at a "T" intersection.

cars and their environment.

3.1 Static State Space Model

3.1.1 State Space

In this first approach, a (PO)MDP model based on a static state space is used. The state space of this (PO)MDP model is made up of a vector for each car in the environment. Each of those vectors contains the Cartesian x and y coordinates of the respective car, a velocity v (in x direction) and two boolean values describing whether the car exists and whether it has crashed. Those state space variables are discrete and (except for the boolean ones) have explicitly defined bounds. Both bounds and discretization values are defined at initialization time. The whole state space is initialized at once.
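As a sketch, one car's share of such a state vector could look as follows (illustrative names, not the actual implementation):

// One car's part of the static state-space vector: discretized position
// and velocity plus the two boolean flags described above. The complete
// (PO)MDP state is one such entry per car in the environment.
struct CarState {
    int x;        // discretized Cartesian x coordinate (interval index)
    int y;        // discretized Cartesian y coordinate (interval index)
    int v;        // discretized velocity in x direction
    bool exists;  // whether the car exists
    bool crashed; // whether the car has crashed
};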


3.1.2 Action Space

The action space of the problem is built up similarly to the state space. Each action consists of two parts: an acceleration (in x direction) and a velocity in y direction. While acceleration actions are modeled to take one second each, lane changes take longer in reality, and in order to get more natural trajectories they should take longer in the vehicle model as well. Therefore they are modeled with an action duration of three seconds.

3.1.3 Discounting

To support different action durations, an action length is introduced and the discount calculation is adapted. The idea is that a system with different action durations should behave identically to a system with constant action length. To ensure this, every action a gets its own discount based on γ, the standard discount of the POMDP model (described in chapter 2 and normally applied to all actions), and on d(a), the action's duration. In this model, due to the different action durations, the standard discount γ only applies to actions with length one, whereas all other actions are discounted with an action-specific factor γ^{d(a)}.

Assuming constant action durations, the reward calculation is normally defined recursively as in equation 3.1.

R_t = R(a_t, s_t) + \gamma \, R(a_{t+1}, s_{t+1})    (3.1)

Changing the duration of action a_t to d(a_t) as described above, equation 3.2 ensures that all later actions are discounted as if a_t were a sequence of d(a_t) actions with length one.

R_t = R(a_t, s_t) + \gamma^{d(a_t)} \, R(a_{t+d(a_t)}, s_{t+d(a_t)})    (3.2)

In addition, the d(a) parts of action a (each part with length one) have to be discounted as if they were single actions of length one. Writing R̃(a, s) for the accumulated reward of the whole action, this yields equation 3.3.

\tilde{R}(a, s) = \sum_{i=0}^{d(a)} \gamma^i R(a, s) = R(a, s) \, \frac{\gamma^{d(a)+1} - 1}{\gamma - 1}    (3.3)

\tilde{R}(a, s) = \int_{0}^{d(a)} \gamma^i R(a, s) \, di = R(a, s) \, \frac{\gamma^{d(a)} - 1}{\ln \gamma}    (3.4)

Equation 3.4 transfers equation 3.3 to continuous space in order to allow non-integer durations. The factor (γ^{d(a)} − 1)/ln γ supplements the solver's discount γ^{d(a)}, and in our model it is used as a reward factor ensuring continuous discounting within each action duration d(a). Actually, the use of continuous discounting is not necessary


at the moment, but it allows future modifications to use action durations which are not a multiple of a second, in case this becomes useful at any time.
The action length, the reward factor and the timed discount are additional information stored together with each action.
In order to make the solver compatible with this redesign of actions, a couple of modifications are needed. First, it has to be ensured that the solver uses the timed discount instead of the standard discount throughout the whole computation. Second, the solver's belief tree, which is set up in the sampling step, has to be adapted: an action should be represented as an edge with depth "duration" instead of depth one.
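A small sketch of this additional per-action data, computed according to equations 3.2 and 3.4 (illustrative names; gamma is the standard discount and duration is d(a) in seconds):

#include <cmath>

// Per-action discounting data for actions with duration d(a).
struct TimedAction {
    double duration;      // d(a) in seconds
    double timedDiscount; // gamma^{d(a)}: applied to everything after the action (eq. 3.2)
    double rewardFactor;  // (gamma^{d(a)} - 1) / ln(gamma): continuous discounting within the action (eq. 3.4)
};

TimedAction makeTimedAction(double gamma, double duration) {
    TimedAction a;
    a.duration = duration;
    a.timedDiscount = std::pow(gamma, duration);
    a.rewardFactor = (a.timedDiscount - 1.0) / std::log(gamma);
    return a;
}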

3.1.4 Reward Function

The (PO)MDP model described in this thesis is based on a quite simple reward function. R(a, s) defines the reward which is awarded when action a is executed while the system is in state s. This function is constructed by applying different rewards and costs.
The most important and numerically biggest one is the crash cost, which is taken into account every time the combination (s, a) results in a crash, either with one of the other cars or because the autonomous car leaves the road.
Other costs apply to acceleration, to lane changes (both of them are considered uncomfortable) and to driving on the left lane. Those costs should tempt the car to drive more smoothly and to obey the rule of keeping to the right-hand side of the road. Additional costs for time consumption, fuel consumption and many other factors are also possible in order to enforce different behaviors (e.g. obeying additional basic traffic rules).
The current implementation only has one kind of reward: a speed reward, which rewards driving fast. Another possible reward is a reward for reaching a predefined destination.

3.1.5 Transition Model

For the calculation of state transition probabilities, the system makes use of a one-step Bayesian network. Based on a particle distribution for the current state and the positions of the cars, which is generated by equidistant sampling on the discrete space, and on the action that has been chosen by the solver, the system computes the start point and the end point of the trajectories of all cars. Afterwards, the Bayesian network is used to predict the next possible positions of the cars using Bezier interpolation, along with their respective probabilities. Those positions are then matched with the system's discrete states. The particle probabilities are used as state transition probabilities for our model.


This procedure is repeated for each prediction step (equivalent to 0.1 seconds), using adapted current car positions and trajectory start points, until the action duration is reached. Figure 3.2(a) shows the principle of the continuous transition computation, referring to the discrete behavior shown in figure 3.2(b). Assume the black dots represent the results of equidistant state sampling in the discrete states; each of them has equal probability. The red dots represent the continuous particles resulting from the forward prediction of each of those samples. Afterwards, those results are matched to the discrete states, and the probability of each discrete state is calculated as the sum of the probabilities of all continuous particles that have been matched to this state.

For further information regarding the "Dynamic Bayesian Network Library",

(a) Continuous description of the state transition.

(b) Discrete description of the state transition.

Figure 3.2: State transition example.

which is used for those computations, compare [KSGD09].
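The matching of the predicted continuous particles back onto the discrete states can be sketched as follows, assuming fixed discretization intervals and a flattened grid index (both the names and the index scheme are illustrative):

#include <map>
#include <vector>

// A continuous particle after forward prediction through the Bayesian network.
struct Particle {
    double x, y, v; // continuous position and velocity
    double weight;  // probability mass carried by this particle
};

// The transition probability of a discrete successor state is the summed
// weight of all particles whose continuous values fall into its cell
// (assuming non-negative coordinates within the state space bounds).
std::map<long, double> matchToDiscreteStates(
        const std::vector<Particle>& particles,
        double dx, double dy, double dv, long nx, long ny) {
    std::map<long, double> transition; // discrete state index -> probability
    for (const Particle& p : particles) {
        long ix = static_cast<long>(p.x / dx);
        long iy = static_cast<long>(p.y / dy);
        long iv = static_cast<long>(p.v / dv);
        long state = (iv * ny + iy) * nx + ix; // flattened grid cell
        transition[state] += p.weight;
    }
    return transition;
}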

3.1.6 Observation Model

For ease of use, the observation space is implicitly defined by the state space. Uncertainty in observations is actually missing in this approach: the observation model assumes full observability, and the observation probability is always one.
Altogether, the system is modeled as a "pseudo"-POMDP which lacks uncertain observations. Hence the results are basically equal to those an MDP model would produce. Nevertheless, this "pseudo"-POMDP model can easily be extended to a full POMDP at any time.
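In code, this degenerate observation model amounts to nothing more than an identity check (a sketch):

// Full observability as a degenerate observation model: the observation
// space mirrors the state space, and Z(s, o) is one exactly when the
// observation equals the state, so the "pseudo"-POMDP behaves like an MDP.
double observationProbability(int state, int observation) {
    return state == observation ? 1.0 : 0.0;
}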


3.2 Adaptive State Space Model

3.2.1 State Space

The static state space approach described before produces a couple of problems. In order to solve some of them, this second approach uses an adaptive state space. The number of states which have to be considered at each computation step has to be kept as small as possible, while the theoretically reachable state space should cover the whole vehicle environment or even the whole world. Hence, in the new approach, only a single initial state and a dedicated terminal state are initialized at the very beginning. All other states are initialized step by step whenever needed.

These changes require a redesign of the whole system. First of all, the state space has to be rewritten. Every time the system reaches a state, it has to check whether this specific state already exists. If not, a new state is created and added to the state space.
In contrast to the first approach described in section 3.1, the state space only contains one dedicated terminal state (instead of numerous crashed states and numerous states in which the car does not exist). This dedicated terminal state is defined by its reward value and transition probabilities: ∀a : R(s_terminal, a) = 0 and ∀a, s′ : T(s_terminal, a, s′) = 0. Every crash leads to this terminal state.
All other (non-terminal) states are organized as vectors containing an x, a y and a v value for every car in the system. Those values represent aligned intervals of a given width or discretization value.
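A minimal sketch of this lazy state creation, assuming one interval index per state dimension and car (illustrative names; the dedicated terminal state is created up front):

#include <map>
#include <vector>

// Adaptive state space: states are created lazily the first time a grid
// cell (one x/y/v interval index per car) is reached during transition
// computation. The dedicated terminal state always exists.
struct AdaptiveStateSpace {
    std::map<std::vector<int>, int> indexOf; // interval indices -> state id
    int terminalState = 0;                   // id of the terminal state
    int nextId = 1;

    // Returns the id of the state belonging to the given cell,
    // creating and registering the state on demand.
    int getOrCreate(const std::vector<int>& cell) {
        auto it = indexOf.find(cell);
        if (it != indexOf.end()) return it->second;
        int id = nextId++;
        indexOf.emplace(cell, id);
        return id;
    }
};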

3.2.2 Action Space

In contrast to the state space, the action space is predefined and remains similar to the one used in the previous approach. In total, nine actions are defined, each of them representing a combination of an acceleration/deceleration action and a lane change action (to the left lane, to the right lane, or no lane change). As in the previous chapter, those actions have different durations (one second for pure acceleration/deceleration actions and three seconds for actions containing a lane change).

3.2.3 Discounting

The action duration, a timed discount and a reward factor are stored in addition to the action values. All of these are computed as described in section 3.1 and follow the same mathematical and logical rules.


3.2.4 Reward Function

The reward function of this model is nearly equal to the one used in the previous approach. There are the same rewards and costs (which may of course be extended by others if necessary). The basic difference to the previous section is that the reward function has been moved into the transition calculation and is now calculated on the particle level. Hence the reward R(s, a) is a weighted sum of all single-particle rewards which are awarded during computation. Furthermore, the computation is done in continuous space, while it was done in discrete space before. Those changes allow an easier and more precise implementation of the reward calculation (e.g. the crash costs are computed on the particle level, and the speed reward, which is multiplied by the velocity, can be modeled more precisely in continuous space, using the real velocity of a particle, than in discrete space, using the mid-point of the interval which represents the velocity). Moreover, an easier integration of the reward representation into the Bayesian network, which is planned for the future, becomes possible.
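The weighted sum over single-particle rewards can be sketched as follows (illustrative: each entry carries the reward one particle collected during the continuous forward simulation together with its probability mass):

#include <vector>

struct ParticleReward {
    double reward; // reward collected by this particle in continuous space
    double weight; // probability mass of the particle
};

// R(s, a) as the weighted sum (normalized by the total mass) of all
// single-particle rewards awarded during the transition computation.
double stateActionReward(const std::vector<ParticleReward>& particles) {
    double total = 0.0, mass = 0.0;
    for (const ParticleReward& p : particles) {
        total += p.weight * p.reward;
        mass += p.weight;
    }
    return mass > 0.0 ? total / mass : 0.0;
}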

3.2.5 Transition Model

More changes have been applied to the transition model. While the first solution used one single Bayesian network with only one time step and directly called this network to predict every single time step (0.1 seconds) of the trajectory, the new system uses two (structurally identical) Bayesian networks: one with 10 time steps for actions with a duration of one second, and a second one with 30 time steps for those actions which have a duration of three seconds. Those networks predict the whole trajectory based on information about the current positions of all cars and on the chosen behavior for the autonomous car. The first two steps of this Bayesian network can be seen in figure 3.3. Although the computations are still based on the same library as in the first approach ([KSGD09]), the version of the DBNL used here is a completely restructured version of the one used before. The Bayesian network used in this approach consists of different nodes for each car: an "X"-node, which contains the vehicle model and the car's position or state, "B"-nodes, which represent the behaviors of the cars, "T"-nodes for their trajectories, and "C"-nodes representing the context of the respective car (i.e. interaction with other cars, the distance to other cars and the position on the road). When the particle inference is started, information about the behavior of the autonomous car and the current state of the system is provided to the DBNL (Dynamic Bayesian Network Library). All other information is predicted by the Bayesian network using predefined models. The edges which can be seen in figure 3.3 represent the data flow during inference.
The trajectories of the autonomous car are in every step (except the first one, where no previous trajectory information exists) based on the given behavior, the


current position and the previously computed trajectory. The trajectories of the other cars are based on the current position of the respective car and on its behavior (which is in this case generated by the DBNL based on context information). In contrast to the first approach, the other cars have a simple intelligence, which allows them to react to the autonomous car's behavior. Hence the autonomous car gets to know what kind of reaction can be expected from other traffic participants and is expected to adapt its own behavior to this knowledge. For more information on the different vehicle, trajectory and behavior models see [GBD10]. Although [GBD10] is based on an older version of the DBNL, most of the information is still valid. Gindele et al., for example, explain the used behavior model in more detail than this thesis is able to.

3.2.6 Observation Model

As in the first approach (based on a static state space), there is no observation model implemented yet. Actually, this approach does not even make use of a POMDP solver but of a very simple MDP solver. The SARSOP-based solver of the first approach will be migrated to adaptive state spaces in future work and will lead to a more sophisticated and more intelligent solution compared to this one. In the long run, the model will be extended by real observations.


Figure 3.3: Model of the Bayesian Network used in this approach.


4 Results

4.1 Behavior

Evaluations show that the behavior chosen by the solver for the autonomous car is reasonable. Although the following data has been recorded using the first approach, it basically applies to both the static and the adaptive state space versions. The general algorithm which is responsible for these results is basically the same in both systems.
Diagram 4.2(a) shows that the autonomous car ensures a safety margin when coming into an overtaking position. This behavior can be explained by the uncertainty concerning the position and behavior of the other car. The autonomous car has to expect a braking or acceleration action of the other car and has to ensure that there is no crash risk in either case. Moreover, the car decides to give way to another car which approaches from behind with a higher velocity. Diagrams 4.1(a) and 4.1(b) show that the car also obeys the rule of driving on the right lane, if possible. In case there is no crash risk because the car in front is faster, the autonomous car decides to stay on the right lane or make a lane change to the right lane, and vice versa. This behavior ensures that the autonomous car will be situated on the right lane again after an overtaking scenario. Further evaluations show that the car's safety distance increases with an increasing speed difference between the two participating cars.
Using the adaptive state space approach with the new transition model described in section 3.2.5 leads to minor changes in the car's behavior. As described before, the new transition computation simulates a more or less intelligent behavior of the other cars. Hence the autonomous car believes that those other cars act reasonably and will, for example, overtake instead of provoking a rear-end collision. Consequently, the autonomous car does not give way in case the other (faster) car approaches from behind, but expects the other car to overtake.
(Diagrams 4.1(a) and 4.1(b) show the x-distance between the two cars and the y-position of the autonomous car. A positive x-distance means the autonomous car is situated behind the other car. y-position = 0 indicates the middle of the right lane, y-position = 6 the middle of the left lane. The colors imply lane changes, where yellow (y-velocity = 2) indicates a lane change to the left and black (y-velocity = -2) a lane change to the right side. The other car is situated on the right lane. The velocities of the two cars are fixed according to the captions of the diagrams.)


(a) Lane change behavior when the other car is faster.

(b) Lane change behavior when the autonomous car is faster.

Figure 4.1: Lane change behavior of the autonomous car with different velocities. The x-axis shows the distance between the two cars and the y-axis shows the current position of the autonomous car (0 = middle of the right lane, 6 = middle of the left lane). The colors indicate a lane change. The second car is positioned on the right lane (y-axis value: 0).


4.2 Performance

4.2.1 Static State Space

This approach showed relatively good performance with two cars in following or overtaking situations. On the other hand, performance is comparably poor when adding a third or even more cars.
The following performance measurements (table 4.1) have been made on a quad-core Core i5 750 system (2.67 GHz) with 4 GB of RAM and a 32-bit operating system.
The number of "real states" is the number of real car states without the crashed and non-existing states. The number of those auxiliary states is given in the column "additional states". The given "time" is the time needed for the whole run of the solver. Those values suggest a nearly linear correlation between the

Table 4.1: Performance measurements for the static state space approach.

cars   real states   additional states   actions   time [h:mm:ss]
2      360           360                 3         0:06:53
2      180           180                 3         0:03:19
2      60            60                  3         0:01:01
2      360           360                 9         0:19:21
2      180           180                 9         0:09:13
2      60            60                  9         0:03:00
2      720           360                 3         0:11:20
2      360           180                 3         0:05:02
2      120           60                  3         0:01:37
2      720           360                 9         0:30:25
2      360           180                 9         0:14:06
2      120           60                  9         0:04:39
2      1080          360                 3         0:16:38
2      2160          720                 3         0:39:25
3      2160          720                 3         9:33:00

number of states and the time consumption, as well as between the number of actions and the time consumption. But this linear relation only holds as long as the number of cars stays constant. On the other hand, the time consumption increases rapidly with more than two cars. This seems to be not a problem of the total number of states but basically a problem of the state space dimension. High dimensions result in a higher overhead due to sampling and due to an increased number of particles which have to be predicted in the Bayesian network. Actually, this huge number of particles seems to account for the main performance loss with more than two cars in the system.
Another important piece of information that can be extracted from table 4.1 is the huge


amount of additional, auxiliary states. Those are actually not necessary and definitely slow the system down.

4.2.2 Adaptive State Space

As mentioned before, the introduction of an adaptive state space not only requires many changes to the (PO)MDP model but also a change in the solver technology used. This means that the performance analyses are hard to compare. However, the following section tries to outline the potential of this approach with a solver comparable to the one used in the previous chapter.
Although the solver algorithm used here is less optimal than the SARSOP algorithm used in the static state space approach, a smaller state space could be observed. First of all, this is a result of the lower dimensionality of the state space (due to the use of a dedicated terminal state). Second, the algorithm can only work in states which are reachable from the initial state, whereas the first approach initialized all states at the beginning. Using a more sophisticated solver (like a SARSOP-based one) would of course lead to a further decrease of the number of states (due to sampling in the "optimally reachable space", cp. section 2.3).
Table 4.2 gives an overview of the number of states in both the static and the adaptive approach for different (equal) situations. The number of "theoretical states" is the number of states calculated based on the state space limitations and discretization values (without considering the dedicated terminal state). As the given "savings"

Table 4.2: Differences in the size of state spaces.

theoretical states   adaptive states   static states   savings
112                  68                168             59.52%
252                  192               336             42.86%
688                  428               1032            58.53%
900                  732               1200            39.00%
1600                 1357              2000            32.15%
2240                 1917              2800            31.54%

indicate, the number of states in the adaptive approach is about 30% to 60% lower than in the static state space approach.
Table 4.3 shows the computation duration for these state spaces. All measurements have been made with nine actions, 5000 MDP iterations, a maximum of 3000 observable states (for vector initialization) and 64 particles (the number of particles is identical to the calculation for table 4.1). The measurements still show a more or less linear correlation between the number of states initialized and the time needed for the computations. The third measurement is an outlier, probably due to numerical issues: this calculation takes quite a long time to converge, although the residual looks good after a couple of iterations.


Table 4.3: Performance measurements for the adaptive state space approach.

theoretical states   adaptive states   time [h:mm:ss]
112                  68                0:00:58
252                  192               0:02:43
688                  428               0:13:01
900                  732               0:12:32
1600                 1357              0:21:38
2240                 1917              0:30:36

Compared to the results shown in table 4.1, those values suggest an improvement in time consumption where the number of theoretical states in table 4.3 equals the number of real states in table 4.1 and the number of actions is nine in both cases. Actually, those massive improvements are not only caused by the improved state space and other optimizations. They are rather a result of the changed solver. Although the algorithm used in the MDP solver is less optimal than the SARSOP algorithm of the static approach, the MDP solver may well be faster because it lacks a couple of initializations which are part of the SARSOP algorithm.

4.3 Influence of parameters

Another topic worth a look is the question of the optimal standard discount for the POMDP problem. The following diagrams (4.2(a), 4.2(c) and 4.2(d)) have been recorded with different discount values while all the other parameters stayed constant. Again, these recordings belong to the static state space approach but apply to both approaches presented in this thesis. The system had to choose between three possible actions (lane change to the left lane, lane change to the right lane, no lane change). The maximum distance between the two cars was 60 m in both directions. Both cars were situated on the right lane, and the velocity of the other car was fixed (4 m/s). The two axes of the diagrams show the distance between the two cars (x_other − x_autonomous) and the velocity of the autonomous car. The colors describe the lateral velocity of the ego car from −2 m/s (lane change to the right lane) to 2 m/s (lane change to the left lane), as can be seen in the legend (figure 4.2(b)). A standard discount (the discount applied to actions with a duration of one second) of γ = 0.95 (diagram 4.2(a)) results in a safety distance of more than 20 m in front of the autonomous car when the autonomous car is faster than the other car, and a safety distance of nearly 40 m behind the autonomous car in case the autonomous car is driving in front of the other car and its velocity is lower than that of the other car. If the distance between the two cars falls below this safety distance, the autonomous car starts a lane change in order to increase the safety margin and to avoid a crash.


(a) Measurements with discount γ = 0.95. (b) Legend.

(c) Measurements with discount γ = 0.75.

(d) Measurements with discount γ = 0.50.

Figure 4.2: Lane change behavior of the autonomous car with different discount values. The x-axis shows the distance between the two cars and the y-axis shows the velocity of the autonomous car. The colors indicate a lane change. The second car is positioned on the right lane. The velocity of the other car is fixed (4 m/s).


If the discount is chosen lower (γ = 0.75), a smaller safety margin can be observed. As can be seen in diagram 4.2(c), the safety margin in front of the autonomous car is lower than before, but still big enough considering the low speed of the cars. The safety distance behind the car has nearly vanished.
Decreasing the discount further results in a nearly vanished safety distance (diagram 4.2(d)).
Obviously, a lower discount value results in a shorter time horizon being considered by the system when choosing an action. Hence a crash which will or may happen in a couple of time steps appears less risky with a lower discount. These changes in the perceived crash risk lead to the behavior pointed out before.

Other parameters which should be observed are the discretization intervals, which are responsible for the discretization of the state space. There are three of them, one for each of the three state dimensions: x-position, y-position and velocity. While the y-interval is naturally chosen according to the width of the lanes (in this case 6 m) and the velocity interval is chosen with regard to the possible acceleration actions (in this case 4 m/s), the x-interval is nearly free to choose.

The first circumstance which catches one's eye when regarding diagrams 4.3(a), 4.3(c), 4.3(d), 4.3(e) and 4.3(f) is that the safety distance the autonomous car keeps increases with increasing interval sizes. Actually, a bigger interval size means less information about the concrete positions of both the autonomous car and the other cars. This is due to the equidistant sampling within those intervals: the system does not know the exact position but only a couple of possible positions (depending on the number of samples). The uncertainty concerning the position within the interval results in a decreasing information content of the system state when the interval size increases. With ∆x = 8 m and higher, the amount of information is so low that the safety margin even increases when both cars have equal velocities.
For the purposes of this thesis, an interval size of 6 m seems adequate and has been chosen for all of the other analyses.

4.3.1 Rewards

Rewards and costs are the most important instrument for controlling the car's behavior. The choice of the kind and number of rewards which are awarded for different combinations of states and actions (s, a) is very much related to the goal of driving like a human driver and, of course, in accordance with the traffic rules which the autonomous car is expected to obey. Nevertheless, there are multiple choices, and those have different influences on the car's reactions. Moreover, changing the values of rewards and costs influences the system's behavior heavily.
The standard reward function used for all of the above evaluations is defined using the following costs and rewards:


(a) Measurements with discretization value ∆x = 2 m. (b) Legend.

(c) Measurements with discretization value ∆x = 4 m.

(d) Measurements with discretization value ∆x = 6 m.

Figure 4.3: Lane change behavior of the autonomous car with different discretization values. The x-axis shows the distance between the two cars and the y-axis shows the current velocity of the autonomous car. The colors indicate a lane change. The second car is positioned on the right lane (y-axis value: 0) with velocity 4 m/s.


(e) Measurements with discretization value ∆x = 8 m.

(f) Measurements with discretization value ∆x = 12 m.

Figure 4.3: Lane change behavior of the autonomous car with different discretization values (continued). The x-axis shows the distance between the two cars and the y-axis shows the current velocity of the autonomous car. The colors indicate a lane change. The second car is positioned on the right lane (y-axis value: 0) with velocity 4 m/s.


cost_crash = -1000;
cost_comfort = -12;
reward_speed = 3;
cost_left_lane = -6;
cost_lane_change = -3;

The crash costs are important to prevent the car from taking high risks. The speed reward is expected to favor higher velocities because they let the car reach its destination earlier. The costs for driving on the left lane force the car to obey the rule of keeping to the right-hand side of the road, and the comfort and lane change costs punish uncomfortable and inefficient acceleration as well as a high number of lane changes.
Using the above rewards and costs, the reward function is basically built up as seen below (a_x and v_y being the action's acceleration and lateral velocity):

reward = 0;
reward += cost_comfort * a_x;
reward += reward_speed * velocity;
if (car_y_position == left_lane)
    reward += cost_left_lane;
reward += cost_lane_change * v_y;
if (crash_happened)
    reward = cost_crash;
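Written out as a self-contained C++ function, the same reward model might look as follows. This is a sketch only: the function signature, the boolean lane encoding, and the use of magnitudes for a_x and v_y are illustrative assumptions, not the exact interface of the implementation.

#include <cmath>

struct RewardParams {
    double cost_crash       = -1000.0;
    double cost_comfort     = -12.0;
    double reward_speed     = 3.0;
    double cost_left_lane   = -6.0;
    double cost_lane_change = -3.0;
};

// Reward for one (state, action) pair; the costs are negative values,
// so they are simply added. A crash overrides all other terms.
double reward(const RewardParams& p, double a_x, double v_y,
              double velocity, bool on_left_lane, bool crash_happened) {
    if (crash_happened)
        return p.cost_crash;
    double r = 0.0;
    r += p.cost_comfort * std::fabs(a_x);      // punish strong acceleration or braking
    r += p.reward_speed * velocity;            // reward higher velocities
    if (on_left_lane)
        r += p.cost_left_lane;                 // enforce the keep-right rule
    r += p.cost_lane_change * std::fabs(v_y);  // punish lane changes
    return r;
}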

Changing the above standard parameters to the following ones changes the car's behavior: the autonomous car no longer changes to the right lane when it is situated on the left one, except when driving on the left lane is more risky than driving on the right lane.

cost_crash       = -1000;
cost_comfort     = -12;
reward_speed     = 3;
cost_left_lane   = 0;
cost_lane_change = 0;

Comparing figure 4.1(b) on page 18 and figure 4.4 shows that the car is not forced to stay on the right lane with the changed reward function. In this case the car will of course not change back to the right lane after overtaking, whereas it would change back if the standard rewards were applied. In fact the standard rewards represent a good compromise for enforcing reasonable behavior of the car. To enforce additional behavior, this reward set should be extended and adapted carefully.


Figure 4.4: Lane change behavior with changed rewards. There are no costs for driving on the left lane. The x-axis shows the distance between the two cars and the y-axis shows the current position of the autonomous car (0 = middle of the right lane, 6 = middle of the left lane). The colors indicate a lane change. The second car is positioned on the right lane (y-axis value 0). The autonomous car is faster.

4.4 Main Problems

4.4.1 Static State Space

The biggest problem of this approach is its bad performance for more complex situations, i.e. situations in which more than two cars are involved. This problem could partly be solved by reducing the number of states needed to represent the problem. But a high-dimensional state space still results in a huge sampling overhead. A more flexible sampling solution with less hand-coded (but still deterministic) sampling might reduce this overhead by using a fixed number of samples instead of one that depends on the dimensionality of the state space. This would decrease the number of particles in the Bayesian network and improve its performance.

Another problem is the structure of the state space, with lots of crashed states and states in which the car does not exist. Those states are actually not needed and could be replaced by one dedicated terminal state, using a more handy state space representation.

The data structures used in this approach were mostly provided by APPL and are partly not optimally chosen for our specific problem and the POMDP model being used. Hence a redesign of those data structures is expected to reduce overhead and, moreover, to make programming easier due to a simpler structure.
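One way to realize the dedicated terminal state mentioned above, sketched with hypothetical C++ types (not the data structures actually used by APPL):

#include <variant>
#include <vector>

// A regular state: positions and velocities of the cars currently present.
struct DrivingState {
    std::vector<double> x, y, v_x;
};

// A single absorbing state that replaces all explicit "crashed" and
// "car does not exist" states of the static state space.
struct TerminalState {};

using State = std::variant<DrivingState, TerminalState>;

// Transitions from the terminal state always lead back to it, so no
// further bookkeeping is needed for crashed configurations.
bool isTerminal(const State& s) {
    return std::holds_alternative<TerminalState>(s);
}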


Finally it would be useful to adapt the state space over time. Normally it is not known in advance which situations are going to occur while driving, and considering everything would result in a huge state space that could not be handled with the resources at hand. To solve this problem it is desirable to be able to insert, and possibly delete, states at runtime when new situations arise or old states are no longer of interest. Moreover, the possibility to merge and split states would help to keep the state space small without losing important information: states which are logically the same would be merged; later on, when the situation changes, merged states could be split again into a number of logically different states.
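An interface for such an adaptive state space could be sketched as follows; the names are hypothetical, and the redesign described in section 3.2 may differ in its details:

#include <vector>

using StateId = int;

// Hypothetical interface for a state space that grows and shrinks at runtime.
class AdaptiveStateSpace {
public:
    virtual ~AdaptiveStateSpace() = default;

    // Insert a new state when a new situation arises; returns its id.
    virtual StateId insert() = 0;

    // Delete a state that is no longer of interest.
    virtual void remove(StateId s) = 0;

    // Merge states that are logically the same from the current point of view.
    virtual StateId merge(const std::vector<StateId>& group) = 0;

    // Split a merged state again once the situation makes it too general.
    virtual std::vector<StateId> split(StateId s) = 0;
};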

A couple of these problems are going to be solved with the introduction of an adaptive state space as described in section 3.2. These changes come along with a redesign of many data structures and of the whole system. The remaining problems will be part of future research.

4.4.2 Adaptive State Space

There are still a couple of problems with this approach. First of all, the adaptive state space itself is a problem for evaluation purposes. Although the MDP solver reaches nearly the whole space, there are a couple of states which are never sampled (for example states that depend on a special behavior of the other cars which is not simulated by the Bayesian network). This makes simulation and evaluation more complicated. Moreover, these problems will grow with the use of a more sophisticated solver, which samples only a small part of the reachable space and is expected to ignore many states. This is actually the strongest reason for a future change to an on-line solver.

Second, the computation of problems containing more than two cars will still show relatively bad performance due to the implemented sampling: the equidistant sampling depends heavily on the dimensionality of the state space. Thus a new sampling algorithm will be needed in the future.
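To see why, note that with n samples per dimension, equidistant sampling needs n^d joint samples for a d-dimensional state space. The short sketch below assumes, for illustration only, three continuous dimensions per car (x, y and the velocity in x-direction):

#include <cmath>
#include <cstdio>

int main() {
    const int samples_per_dim = 3;
    // Two cars span six continuous dimensions, so 3^6 = 729 joint samples;
    // three cars already need 3^9 = 19683.
    for (int cars = 1; cars <= 3; ++cars) {
        int d = 3 * cars;
        std::printf("%d car(s): %.0f joint samples\n",
                    cars, std::pow(samples_per_dim, d));
    }
}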


5 Summary

5.1 Comparison between the static and adaptive approaches

Both approaches presented in this thesis demonstrate the feasibility of POMDP-based decision making in cognitive cars. Considering simple traffic situations, both approaches are able to handle the given situations correctly.

The results described in the previous section imply that the approach based on an adaptive state space is superior in terms of performance (as expected). On the other hand, the first approach, based on a static state space, is easier to handle, especially as long as the system is used as an off-line solver.

However, with a couple of future optimizations (e.g. integration of the SARSOP algorithm) and with regard to more complex traffic situations, the second approach will prove superior. Moreover, the second approach offers the possibility of being used as an on-line solver, although much development is needed to reach this goal.

5.2 General Problems and Chances of a POMDP-based Decision Making

As described in the previous chapter, this work showed promising results and demonstrated that POMDP-based decision making can be applied to cognitive cars. Compared to an approach based on state machines, this approach is more flexible, and once all basic functions (e.g. observations, road networks) are integrated, extending it to handle complex situations should not be too complicated anymore, algorithmically speaking. The probability-based handling of traffic situations can definitely be considered more powerful than traditional approaches.

On the other hand it is computationally costly, and from today's point of view it is not certain whether on-line POMDP-based decision making will be affordable with the resources available in an autonomous car. In any case, there is a lot of work to do until this planner can really be applied to real-world situations.


5.3 Future Work

First of all, both approaches described in this thesis have so far not used real POMDPs but MDPs, or more precisely "pseudo"-POMDPs, which lack uncertain observations. Hence the adaptive solver needs to be extended to handle real POMDPs, and the POMDP problem has to be extended by uncertain observations. Moreover, the MDP solver used in the adaptive state space approach is comparably simple and does not make use of several improvements which are, for example, part of the SARSOP algorithm (see chapter 2). Hence the performance of the solver can still be improved.

In addition, the adaptive state space is still not as adaptive as intended. The current approach lacks an implementation of clustering on the states. Clustering might enable the algorithm to merge states which can be considered logically the same from the current point of view and to split them again at a later point in time, should the situation change and a state look too general. This improvement should decrease the size of the state space and hence increase the performance of the algorithm. Another performance improvement should be achieved by the possibility to delete states which are no longer important because they have proven to be part of beliefs only in suboptimal policies.

Another open question is whether the model of compound actions used in both approaches is optimal. Currently there is, for example, no possibility to decide to brake during a lane change; this decision has to be made at the beginning of the lane change (a toy illustration of this limitation follows at the end of this section). A more sophisticated action model might be helpful.

Furthermore, a lot of computations should be moved to the Bayesian network. To this end, the interpretation of situations should be delegated to the Bayesian network. Along with the interpretation, the whole reward model may be moved, and the state space could perhaps be changed to a more abstract one.

Finally, the goal is to develop an on-line solver capable of handling real-time situations, instead of the off-line calculation of policies performed by the solution described in chapter 3. Although this approach could theoretically be applied to all traffic situations, the off-line computation of policies able to handle the complex situations appearing in urban traffic is expected to exceed the resource capacities at hand.
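Returning to the compound-action limitation: the toy sketch below uses purely hypothetical types (not the actual action model) to show how a lane change binds three one-second steps at once, so braking cannot be inserted once the lane change has started.

enum class Primitive { KeepSpeed, Accelerate, Brake, SteerLeft, SteerRight };

// A compound action commits to a fixed sequence of primitives: the whole
// three-second lane change is decided at its beginning, and no Brake step
// can be substituted once it is underway.
struct CompoundAction {
    Primitive steps[3];
};

constexpr CompoundAction lane_change_left{
    {Primitive::SteerLeft, Primitive::SteerLeft, Primitive::SteerLeft}};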


6 German Abstract

This study thesis investigates the use of POMDPs (partially observable Markov decision processes) for decision making in cognitive cars. For this purpose, an existing POMDP toolkit (APPL, http://bigbird.comp.nus.edu.sg/pmwiki/farm/appl) is used in a first step. Around the solver contained therein, which is based on the SARSOP algorithm [KHL08], a corresponding (PO)MDP model describing the vehicle and its environment is then built. This first model consists of a static and discrete state space which is built up completely when the system is initialized. For each vehicle in the environment it holds Cartesian x and y coordinates, a velocity in x-direction, and two flags. These flags store information about the existence of the vehicle and about whether the vehicle has been involved in an accident or not. Nine actions allow the autonomous vehicle to accelerate or brake and to change lanes. These actions are modeled such that every action leading to a lane change takes three seconds (in contrast to a duration of only one second for "normal" actions). This requires corresponding adaptations of the system's discount and of the handling of the discount in the solver.

To compute the transition probabilities, a Bayesian network is used which, given the current vehicle position and the trajectory end points determined by the chosen action, predicts the target position of the cognitive vehicle. A real observation model does not yet exist in this system. The system therefore still works with a "pseudo"-POMDP, which assumes the states to be observable with certainty and thus in principle yields the same results as an MDP, but can later be extended to a real POMDP.

This model already shows first successes. The reaction of the vehicle largely corresponds to the desired behavior. Nevertheless, the problems of this approach also become apparent quite quickly. They lie above all in the static modeling of the state space, which, even for comparatively simple circumstances, no longer does justice to the highly dynamic environment of a cognitive automobile and leads to performance losses.

A further development of this system, incorporating a dynamic state space, solves part of these problems. This adaptive state space inserts new states as needed and thus avoids the insertion of unnecessary states. In addition, this further development contains several optimizations and simplifications. While the discount computation and the action space are in principle unchanged, the transition model was completely reworked. In the process, a larger part of the computations was delegated to the newly structured DBNL (Dynamic Bayesian Network Library). The Bayesian network now receives from the solver only the current system state, in the form of the vehicle positions and velocities, as well as the chosen action, and predicts the vehicle trajectories and the end positions of the vehicles. These are then transferred back into the (PO)MDP state space. Also new is that the other vehicles now partially behave during the computation of the transitions in the Bayesian network. A reaction of the other traffic participants to the behavior decisions of the cognitive automobile that is reasonable to a certain degree is thus presupposed, which enables an improvement of the behavior of the autonomous vehicle.

As the first experiments with this new model show, it is above all several optimizations of details that actually bring improvements. For instance, the size of the state space decreases significantly by omitting unnecessary auxiliary states, and the theoretical maximum state space size is not reached either (this is mainly because some of these states immediately imply an accident, so that initializing them would be unnecessary, but also because the state space is not fully exploited by the non-autonomous vehicle).

Nevertheless, there is still a lot of work for the future. On the one hand, the aim is to develop the state space further by adding the possibility of deleting states as well as functions for merging and splitting states. This more far-reaching dynamic approach should further reduce the number of required states and thus also enable more complex computations for real scenarios. On the other hand, strictly speaking, the current system constitutes only an MDP. It currently lacks both a full observation model and a version of the SARSOP solver adapted to the adaptive state space. Further work should therefore also concentrate on the development towards a real POMDP.


Bibliography

[Fed10] Federal Statistical Office (Statistisches Bundesamt). Verkehrsunfälle - Unfallentwicklung im Straßenverkehr 2009. Wiesbaden, Germany, 2010.

[GBD10] Tobias Gindele, Sebastian Brechtel, and Rüdiger Dillmann. A probabilistic model for estimating driver behaviors and vehicle trajectories in traffic environments. In 13th International IEEE Conference on Intelligent Transportation Systems, Madeira Island, Portugal, 2010.

[GJPD08] Tobias Gindele, Daniel Jagszent, Benjamin Pitzer, and Rüdiger Dillmann. Design of the planner of team AnnieWAY's autonomous vehicle used in the DARPA Urban Challenge 2007. In Intelligent Vehicles Symposium, Eindhoven, Netherlands, 2008.

[HBPM07] Jesse Hoey, Axel von Bertoldi, Pascal Poupart, and Alex Mihailidis. Assisting persons with dementia during handwashing using a partially observable Markov decision process. In The 5th International Conference on Computer Vision Systems, Bielefeld, Germany, 2007.

[Hum11] Humanoids and Intelligence Systems Laboratories. http://his.anthropomatik.kit.edu/, Feb 2011.

[KHL08] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Proceedings of Robotics: Science and Systems, Zurich, Switzerland, 2008.

[KSGD09] Ralf Kohlhaas, Ferdinand Szekeresch, Tobias Gindele, and Rüdiger Dillmann. Dynamic Bayesian Network Library - Ein C++ Framework für Berechnungen auf dynamischen Bayes'schen Netzen. In Proceedings of the 21st Fachgespräch Autonome Mobile Systeme, Karlsruhe, Germany, 2009.

[KZP+08] Sören Kammel, Julius Ziegler, Benjamin Pitzer, Moritz Werling, Tobias Gindele, Daniel Jagszent, Joachim Schröder, Michael Thuy, Matthias Goebl, Felix v. Hundelshausen, Oliver Pink, Christian Frese, and Christoph Stiller. Team AnnieWAY's Autonomous System for the DARPA Urban Challenge 2007. Journal of Field Robotics, 25(9), Sep 2008.

[OPHL09] Sylvie C. W. Ong, Shao Wei Png, David Hsu, and Wee Sun Lee. POMDPs for robotic tasks with mixed observability. In Proceedings of Robotics: Science and Systems, Seattle, USA, 2009.

[PGT03] Joelle Pineau, Geoff Gordon, and Sebastian Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Proceedings of the International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 2003.

[RJ86] L. Rabiner and B. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1), Jan 1986.

[SB98] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[SKP+08] Christoph Stiller, Sören Kammel, Benjamin Pitzer, Julius Ziegler, Moritz Werling, Tobias Gindele, and Daniel Jagszent. Team AnnieWAY's Autonomous System. In Gerald Sommer and Reinhard Klette, editors, Robot Vision, volume 4931 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2008.

[SRJLD08] Sven R. Schmidt-Rohr, Rainer Jäkel, Martin Lösch, and Rüdiger Dillmann. Compiling POMDP Models for a Multimodal Service Robot from Background Knowledge. In European Robotics Symposium, Prague, Czech Republic, 2008.

[SRKLD08] Sven Schmidt-Rohr, Steffen Knoop, Martin Lösch, and Rüdiger Dillmann. Bridging the Gap of Abstraction for Probabilistic Decision Making on a Multi-Modal Service Robot. In Proceedings of Robotics: Science and Systems, Zurich, Switzerland, 2008.

[SS04] Trey Smith and Reid Simmons. Heuristic search value iteration for POMDPs. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, Banff, Canada, 2004.
