arXiv:2103.14892v1 [eess.SY] 27 Mar 2021


Self-adaptive Torque Vectoring Controller Using Reinforcement Learning

Shayan Taherian, Sampo Kuutti, Marco Visca and Saber Fallah

Abstract— Continuous direct yaw moment control systems such as torque-vectoring controllers are an essential part of vehicle stabilization. These controllers have been extensively researched with the central objective of maintaining vehicle stability by providing a consistent, stable cornering response. Careful tuning of the parameters of a torque-vectoring controller can significantly enhance the vehicle's performance and stability. However, without re-tuning of the parameters, especially in extreme driving conditions, e.g. a low-friction surface or high velocity, the vehicle fails to maintain stability. In this paper, the utility of Reinforcement Learning (RL) based on the Deep Deterministic Policy Gradient (DDPG) algorithm as a parameter-tuning method for a torque-vectoring controller is presented. It is shown that the torque-vectoring controller with parameter tuning via reinforcement learning performs well over a range of driving environments, e.g. a wide range of friction conditions and different velocities, which highlights the advantages of reinforcement learning as an adaptive algorithm for parameter tuning. Moreover, the robustness of the DDPG algorithm is validated under scenarios beyond the training environment of the reinforcement learning algorithm. The simulation has been carried out using a four-wheel vehicle model with nonlinear tire characteristics. We compare our DDPG-based parameter tuning against a genetic algorithm and a conventional trial-and-error tuning of the torque-vectoring controller, and the results demonstrate that the reinforcement-learning-based parameter tuning significantly improves the stability of the vehicle.

I. INTRODUCTION

Active stability technologies play a major role in modern vehicle dynamic systems, enhancing lateral vehicle stability and reducing fatal accidents. These technologies, such as torque-vectoring controllers (i.e., the control of the traction and braking torque of each wheel), can effectively enhance the vehicle handling performance in extreme manoeuvres [1][2]. A torque-vectoring controller has the prominent advantage of stabilizing a vehicle through an appropriate choice of control parameters. However, if the parameters cannot be adapted to environmental changes, the control design becomes cumbersome and good robustness cannot be guaranteed. To address this problem, adaptive control design has received wide attention. Several parameter-tuning methods have been proposed, such as fuzzy logic [3][4], which poses a parameter optimization problem and requires more knowledge of the environment.

Shayan Taherian, Sampo Kuutti, Marco Visca and Saber Fallah are with CAV-Lab, Department of Mechanical Engineering, University of Surrey, Guildford, UK. [email protected]

Another approach is evolutionary algorithms [5][6], such as the genetic algorithm, which are strong tools for tuning control parameters with little prior knowledge required; however, depending on the complexity of the problem, the calculation speed can be slow. Others have used neural networks to learn the optimal parameters for each state [7][8], typically training the network with supervised learning. However, the downside of supervised learning methods is that, during training, the ground truth (i.e. the correct prediction for given inputs) is required for each prediction in order to measure the error in the network's predictions. Therefore, supervised learning is poorly suited to problems where the ground truth is not readily available or is difficult to estimate. On the other hand, reinforcement learning is a technique inspired by human and animal learning, where the neural network is trained through a trial-and-error mechanism, without requiring any ground truth labels. Therefore, reinforcement learning is better suited to this type of problem, where the correct predictions are not known, as it enables the network to learn the optimal policy through interaction with its environment. The success of reinforcement learning [9][10] was perceived as a breakthrough in the field of artificial intelligence. This class of algorithms attempts to learn the optimal control policy through interactions between the system and its environment without prior knowledge of the environment or the system model. An agent takes an action based on the environment state and consequently receives a reward. Reinforcement learning algorithms are generally divided into three groups [11][12]: value-based, policy-gradient and actor-critic methods. In this paper, the DDPG algorithm [13], which belongs to the actor-critic family, is used to tune the parameters of the torque-vectoring controller. A DDPG agent is an actor-critic reinforcement learning agent that computes an optimal policy maximizing the long-term reward. The actor is a network attempting to execute the best action given the current state. The critic estimates the value of the given state-action pair (i.e. it approximates the maximizer) and then uses the reward from the environment to determine the accuracy of its value prediction. In [14], the authors proposed DDPG as a reinforcement learning algorithm for tuning the sliding-mode controller parameters of an autonomous underwater vehicle. Results demonstrated that DDPG achieved good performance in terms of stability, fast



convergence and low chattering. In [15], reinforcement learning was used to tune PID controller parameters, where the auto-tuning strategy was based on an actor-critic reinforcement learning method, resulting in stable tracking performance of the system. However, the generalization of the trained models to scenarios beyond the training environment was not validated; therefore the performance of these algorithms in new environments cannot be guaranteed.

In this paper, a reinforcement learning algorithm is incorporated into the torque-vectoring controller to adjust the control input weights in order to maintain the stability of the vehicle. The contribution of this paper lies in the utilization of the DDPG algorithm as an intelligent tuning strategy for the torque-vectoring controller for a real application across different scenarios. It is shown that proper tuning of the control input weights improves the handling and stability of the vehicle. In order to validate the effectiveness of the DDPG algorithm, the simulation environment covers velocities ranging from 80 to 130 km h^-1 and friction coefficients ranging from 0.4 to 0.6. The performance is then compared with a conventional trial-and-error approach (referred to as manual tuning for the rest of the paper) and a self-tuning genetic algorithm, to show the enhanced effectiveness of DDPG as a tuning strategy for active yaw control systems. Moreover, to validate the generalization capability of the trained model, the simulation is also deployed in environments that DDPG was not trained in, to ensure the robustness of the algorithm under unknown operating conditions. The simulations have been conducted using a four-wheel vehicle model [16] with nonlinear tire characteristics in the MATLAB/Simulink environment, and the MATLAB Reinforcement Learning Toolbox is used for the online parameter tuning of the torque-vectoring controller. The remainder of the paper is structured as follows: Section II describes the vehicle model used in this paper. Section III introduces the torque-vectoring algorithm used for stabilization. Section IV presents the reinforcement learning algorithm used for tuning the parameters of the torque-vectoring controller. Section V demonstrates the simulation results of the trained algorithm, and finally concluding remarks are presented in Section VI.

II. VEHICLE MODEL

In this paper, the vehicle simulation model is the six-state nonlinear four-wheel vehicle model presented in [16]. The model states are [ẋ, ẏ, ψ, ψ̇, X, Y]^T, where ẋ and ẏ represent the vehicle longitudinal and lateral velocities, respectively, in the body frame, and ψ, ψ̇, X and Y denote the yaw angle, yaw rate, and longitudinal and lateral coordinates in the inertial frame. In the model formulation, the normal forces are assumed to be constant and the roll angle is neglected. The tire model used in this paper is the Pacejka model [17], an empirical nonlinear tire model which closely captures the tire dynamics.
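As an illustration of the tire model, a minimal sketch of the Pacejka "magic formula" for the lateral force of a single tire is given below. The coefficient values B, C and E are placeholders for illustration only; they are not the coefficients identified for the vehicle used in this paper.

```python
import numpy as np

def pacejka_lateral_force(alpha, Fz, mu, B=10.0, C=1.9, E=0.97):
    """Minimal 'magic formula' sketch for the lateral force of one tire.

    alpha : slip angle [rad]
    Fz    : vertical load [N] (assumed constant, as in the vehicle model)
    mu    : road friction coefficient
    B, C, E : illustrative stiffness/shape/curvature factors (placeholders)
    """
    D = mu * Fz  # peak factor scales with load and friction
    return D * np.sin(C * np.arctan(B * alpha - E * (B * alpha - np.arctan(B * alpha))))

# example: lateral force at 5 deg slip, 3.4 kN load, mu = 0.4
fy = pacejka_lateral_force(np.deg2rad(5.0), 3400.0, 0.4)
```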

III. TORQUE VECTORING CONTROL

A. System Architecture

The overall architecture of the proposed control framework is shown in Fig. 1. In this structure, the command interpreter module (CIM) generates the desired forces at the centre of gravity (C.G.) based on the inputs from the reference generation block (δ_f and T_ff) in order to interpret the driver's intention for the vehicle motion. The CIM continuously monitors the inputs as well as the states of the vehicle to provide accurate desired values to the torque-vectoring controller and keep the vehicle within the stable region. It is noted that the CIM generates the desired C.G. forces under the assumption of normal driving conditions (e.g. a high-friction surface). For brevity, the calculation of the CIM block is referred to [18]. The inputs to the reinforcement learning agent are the errors between the actual and desired forces together with the actual states of the vehicle. Moreover, the error between the actual and reference signals as well as the actions from reinforcement learning (W_df) are fed to the controller in order to generate the adjustment torque required to assist the driver in improving the handling and stability of the vehicle.

B. Controller

A torque-vectoring controller was developed to ensure the lateral/yaw stability of the vehicle. The primary objective of this controller is to provide a safe driving experience under unexpected driving conditions, e.g. a low-friction surface. The desired C.G. forces (F_des) in Fig. 1 are represented as:

F_des = [F_x^*, F_y^*, G_z^*]^T    (1)

where F_x^*, F_y^* and G_z^* are the desired C.G. longitudinal and lateral forces and yaw moment, respectively. The actual C.G. forces of the vehicle are:

F = [F_x, F_y, G_z]^T    (2)

where F_x, F_y and G_z are the actual C.G. longitudinal and lateral forces and yaw moment of the vehicle, respectively. Each C.G. force component is a function of the longitudinal and lateral tire forces (see Fig. 2):

F_x = F_x(f_x1, ..., f_x4, f_y1, ..., f_y4)    (3)

F_y = F_y(f_x1, ..., f_x4, f_y1, ..., f_y4)    (4)

G_z = G_z(f_x1, ..., f_x4, f_y1, ..., f_y4)    (5)

where f_xi and f_yi (i = 1, ..., 4) are the longitudinal and lateral tire forces on each wheel of the vehicle, respectively. Equations (3)-(5) reveal a possible way to obtain the desired C.G. forces by controlling the tire forces.
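The paper does not spell out the explicit mapping in (3)-(5); the sketch below shows one common way to assemble the C.G. forces from the individual tire forces, assuming a planar four-wheel layout with front-wheel steering, a wheel order of FL, FR, RL, RR, and an assumed track width (the axle distances are taken from Table II). The sign convention of Fig. 2 may differ, so this is only an illustration.

```python
import numpy as np

def cg_forces(fx, fy, delta_f, lf=1.43, lr=1.21, tw=1.6):
    """Sketch of (3)-(5): C.G. forces from the individual tire forces.

    fx, fy  : length-4 arrays of longitudinal/lateral tire forces,
              wheel order assumed FL, FR, RL, RR
    delta_f : front steering angle [rad] (rear wheels assumed unsteered)
    lf, lr  : C.G.-to-axle distances [m] (Table II)
    tw      : track width [m] -- assumed value, not given in the paper
    """
    cs, sn = np.cos(delta_f), np.sin(delta_f)
    # rotate the front-wheel forces into the body frame
    fx_b = np.array([fx[0]*cs - fy[0]*sn, fx[1]*cs - fy[1]*sn, fx[2], fx[3]])
    fy_b = np.array([fx[0]*sn + fy[0]*cs, fx[1]*sn + fy[1]*cs, fy[2], fy[3]])
    Fx = fx_b.sum()
    Fy = fy_b.sum()
    # yaw moment about the C.G.: Gz = sum(x_i * fy_i - y_i * fx_i), with
    # x_i = +lf (front) / -lr (rear) and y_i = +tw/2 (left) / -tw/2 (right)
    Gz = (lf*(fy_b[0] + fy_b[1]) - lr*(fy_b[2] + fy_b[3])
          + 0.5*tw*(fx_b[1] - fx_b[0] + fx_b[3] - fx_b[2]))
    return Fx, Fy, Gz
```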


Fig. 1: Adaptive torque vectoring controller based on reinforcement learning (the reference generator provides δ_f and T_ff to the CIM, which outputs F_des; the error e and the vehicle states feed the reinforcement learning agent, whose output W_df tunes the controller generating the adjustment torque δT applied to the vehicle)

The tire force vector can be represented as:

f = [f_x1, f_x2, f_x3, f_x4, f_y1, f_y2, f_y3, f_y4]^T    (6)

The corresponding adjusted C.G. forces that minimize the error between the actual forces F(f) and the desired forces F_des can be written as:

F(f + δ_F) ≈ F(f) + ∇F(f) δ_F    (7)

where ∇F(f) is the Jacobian matrix and δ_F is the vector containing the control actions needed to minimise the error between the actual and desired forces:

δ_F = [δ_fx1, δ_fx2, δ_fx3, δ_fx4, δ_fy1, δ_fy2, δ_fy3, δ_fy4]^T    (8)

It is noted that, in this work, only δ_fxi is considered in the controller design procedure in order to derive the torque-vectoring control actions, and the effect of the lateral control actions is neglected:

δ_fyi = 0,  i = 1, ..., 4    (9)

To formulate the required control action δ_F, the error between the desired C.G. forces at t + ∆t and the actual C.G. forces at t is defined as:

E = F_des − F(f) = [E_x, E_y, E_z]^T    (10)

Then the control actions δ_F for time t + ∆t can be calculated from F_des − F(f + δ_F), where F(f + δ_F) are the C.G. forces at t + ∆t. Using (7), (9) and (10), one has:

F_des − F(f + δ_F) ≈ F_des − F(f) − ∇F(f) δ_F = E − ∇F(f) δ_F    (11)

where

∇F(f) = [ ∂F_x(f)/∂δ_F ; ∂F_y(f)/∂δ_F ; ∂G_z(f)/∂δ_F ]

       = [ ∂F_x/∂f_x1   ∂F_x/∂f_x2   ∂F_x/∂f_x3   ∂F_x/∂f_x4
           ∂F_y/∂f_x1   ∂F_y/∂f_x2   ∂F_y/∂f_x3   ∂F_y/∂f_x4
           ∂G_z/∂f_x1   ∂G_z/∂f_x2   ∂G_z/∂f_x3   ∂G_z/∂f_x4 ]    (12)
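Since the closed-form entries of (12) depend on the tire model and the force convention, one simple way to obtain the Jacobian in practice is by finite differences around the current tire forces. The sketch below is illustrative only; cg_force_fn stands for any C.G.-force mapping, for example the hypothetical cg_forces() sketch above or a full vehicle model.

```python
import numpy as np

def jacobian_fx(cg_force_fn, fx, fy, delta_f, eps=1.0):
    """Finite-difference sketch of the 3x4 Jacobian in (12): sensitivity of
    [Fx, Fy, Gz] to the four longitudinal tire forces f_x1..f_x4.

    cg_force_fn : any callable (fx, fy, delta_f) -> (Fx, Fy, Gz)
    eps         : perturbation size [N]
    """
    F0 = np.asarray(cg_force_fn(fx, fy, delta_f), dtype=float)
    J = np.zeros((3, 4))
    for i in range(4):
        fx_p = np.asarray(fx, dtype=float).copy()
        fx_p[i] += eps  # perturb one longitudinal tire force
        J[:, i] = (np.asarray(cg_force_fn(fx_p, fy, delta_f), dtype=float) - F0) / eps
    return J
```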

Then an objective function consisting of a weighted combination of the error between the actual and desired vehicle C.G. forces and the control action must be defined.


Fig. 2: Force convention

The mathematical representation of the objective function is:

P = (1/2) [ (E − ∇F δ_F)^T W_E (E − ∇F δ_F) + δ_F^T W_df δ_F ]    (13)

where W_E ∈ R^{3×3} is the weight on the longitudinal, lateral and yaw-moment errors, and W_df ∈ R^{4×4} is the weight on the control action. Since (13) is a quadratic form with respect to the tire force adjustment δ_F, the necessary condition for the solution is given by solving the equation

∂P/∂δ_F = 0    (14)

As a result of the minimization of the objective function, the control action required to stabilize the vehicle is [18]:

δ_F = [W_df + ∇F(f)^T W_E ∇F(f)]^{-1} ∇F(f)^T W_E E    (15)

The applied corrective torque on each wheel is δT = R_eff × δ_F, where R_eff is the effective wheel radius. It is noted that in this paper there is no adjustment of the lateral tire forces and only longitudinal corrective actions are calculated by the controller. Consequently, only the weights on the longitudinal corrective actions are tuned using the reinforcement learning algorithm, while W_E is kept fixed.
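A minimal numerical sketch of (15) and the torque conversion δT = R_eff × δ_F is given below, assuming the Jacobian and the error vector are already available; the fixed W_E and R_eff values are taken from Table II, and the example inputs are placeholders.

```python
import numpy as np

def torque_vectoring_action(E, J, W_df, W_E=np.diag([0.4, 0.02, 1500.0]), R_eff=0.3):
    """Sketch of (15): longitudinal tire-force corrections and corrective wheel torques.

    E     : C.G. force error [E_x, E_y, E_z] from (10)
    J     : 3x4 Jacobian from (12)
    W_df  : 4x4 weight on the control action (the quantity tuned by the RL agent)
    W_E   : 3x3 error weight, kept fixed (values from Table II)
    R_eff : effective wheel radius [m] (Table II)
    """
    dF = np.linalg.solve(W_df + J.T @ W_E @ J, J.T @ W_E @ E)   # eq. (15)
    dT = R_eff * dF                                             # corrective torque per wheel
    return dF, dT

# example call with a placeholder error and Jacobian, and W_df built from
# four RL actions bounded in [40, 1e3]
dF, dT = torque_vectoring_action(E=np.array([500.0, 800.0, 200.0]),
                                 J=np.ones((3, 4)),
                                 W_df=np.diag([100.0, 100.0, 100.0, 100.0]))
```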


IV. REINFORCEMENT LEARNING ALGORITHM

In this section, an overview of the reinforcement learning framework is given, followed by the DDPG algorithm used for parameter tuning of the torque-vectoring controller.

A. Reinforcement learning overview

To implement a Reinforcement Learning (RL) based controller, we consider a standard RL setup for continuous control in which an agent interacts with the environment, aiming to learn from its own actions. The formulation of reinforcement learning is based on a Markov Decision Process. At each time step t, t = 0, 1, ..., T, the agent receives an observation s_t and takes an action a_t from a set of possible actions A. As a consequence of the action, the agent receives a reward r_t and observes a new state s_t+1 [19][13]. The goal in reinforcement learning is to learn a policy π, which maps states to actions, π: S → A, maximizing the expected cumulative discounted reward R = E[Σ_{t=0}^{T} γ^t r(s_t, a_t)], with discount factor γ ∈ [0, 1] and E denoting the expectation. The state-action value (Q-value) at time step t represents the expected cumulative discounted reward from time t. The reinforcement learning problem is solved using Bellman's principle of optimality: if the optimal state-action value Q^*(s_t+1, a_t+1) for the next time step is known, then the optimal state-action value for the current time step can be calculated by maximizing r(s_t, a_t) + Q^*(s_t+1, a_t+1).

In this paper, the actions of the reinforcement learning framework are the control action weighting parameters W_df. The observations for reinforcement learning are the errors between the actual and desired forces at the C.G. of the vehicle as well as the vehicle states, s_t = [e_x, e_y, e_Mz, ẋ, ẏ, ψ, ψ̇, X, Y].

To encourage the agent to tune the weighting parameters, a reward function must be appropriately defined by the user. In our model the objective is to find a selection of weighting parameters that enforces the vehicle to follow the reference signals while maintaining stability. Therefore, the reward function is designed such that the weighting matrix W_df generated by reinforcement learning makes the reward function and the objective function in (13) as similar as possible. Hence, the errors between the actual and reference C.G. forces used in the cost function (13) are chosen for the reward function in order to find a suitable selection of the weighting matrix for the torque-vectoring controller. It is noted that proper selection of the reward function determines the behaviour of the controller, and thus affects the stability of the vehicle. Therefore, in order to ensure the reliability of the torque-vectoring controller, a well-defined reward function r_t, provided at every time step t, is introduced:

r_t = −(10 e_x^2 + 5 e_y^2 + 5 e_Mz^2) × 0.01 + 6 M_t + N_t    (16)

TABLE I: DDPG parameter values

Reward discount factor: 0.99
Critic learning rate: 1e-3
Actor learning rate: 1e-4
Target network update: 1e-3
Mini-batch size: 70
Variance decay rate: 1e-3
Variance: 30
Sample time: 0.5
Number of neurons: 100
Experience buffer length: 1e6

The first term in the reward function encourages the agent to minimize the errors, while the logic conditions encourage the agent to keep the errors below a threshold. The two logic terms in (16) are: M_t = 0 if the simulation is terminated, otherwise M_t = 1; and N_t = 1 if the error components satisfy e_x, e_y < 200 N, otherwise N_t = 0. In this way a large positive reward is applied when the agent is close to its ideal condition, i.e. when the C.G. errors are small. On the other hand, a negative reward is applied when the vehicle fails to maintain stability, which discourages the agent from losing the directionality of the vehicle. It is noted that the simulation is terminated when the sideslip angle of the vehicle is greater than 8°. Moreover, the action signal takes values between 40 and 1e3 as the lower and upper bounds for all parameters in W_df.
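A minimal sketch of the reward (16), the termination condition and the action bounds described above is shown below; interpreting the N_t threshold as an absolute-value check on e_x and e_y is an assumption, as is mapping the actor's tanh output onto the stated bounds.

```python
import numpy as np

def step_reward(e_x, e_y, e_Mz, sideslip_deg):
    """Sketch of the reward (16) with the termination and threshold logic of the text."""
    terminated = abs(sideslip_deg) > 8.0                              # episode ends beyond 8 deg sideslip
    M_t = 0.0 if terminated else 1.0
    N_t = 1.0 if (abs(e_x) < 200.0 and abs(e_y) < 200.0) else 0.0     # assumed absolute-value check
    r_t = -(10.0*e_x**2 + 5.0*e_y**2 + 5.0*e_Mz**2) * 0.01 + 6.0*M_t + N_t
    return r_t, terminated

def scale_action(a_tanh):
    """Map a tanh output in [-1, 1] to the W_df bounds [40, 1e3] given in the text."""
    return 40.0 + (np.asarray(a_tanh) + 1.0) * 0.5 * (1e3 - 40.0)
```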

B. Deep Deterministic Policy Gradient

DDPG is an evolution of the Deterministic Policy Gradient algorithm [20]. It belongs to the family of actor-critic [21], model-free, off-policy algorithms which utilize deep neural networks as function approximators. DDPG allows policies to be learned in high-dimensional, continuous state and action spaces. The DDPG used in this paper is inspired by [13]; a brief description of the theory is given in this section, and the keen reader is encouraged to read the original paper. The DDPG algorithm uses two deep neural networks: an actor network and a critic network. The actor network is responsible for the state-action mapping through the deterministic policy π(s_t | θ^π), where θ^π represents the actor network weight parameters, while the critic network approximates the Q-value function Q(s_t, a_t):

Q^π(s_t, a_t) = E_{r_t, s_t+1 ∼ E}[ r(s_t, a_t) + γ Q^π(s_t+1, π(s_t+1)) ]    (17)

The action-value function Q is approximated using a deep neural network with weight parameters θ^Q. For learning the Q-value, Bellman's principle of optimality is used to minimize the mean squared loss

L(θ^Q) = E_{s_t ∼ ρ^β, a_t ∼ β, r_t ∼ E}[ (Q(s_t, a_t | θ^Q) − y_t)^2 ]    (18)

where

y_t = r(s_t, a_t) + γ Q(s_t+1, π(s_t+1 | θ^π) | θ^Q)    (19)


Fig. 3: Episode reward for training the torque vectoring parameters (episode reward and average reward vs. episode number)

Fig. 4: Top: applied torques on the front and rear wheels; bottom: input steering

The actor policy π(s_t | θ^π) is updated using:

∇_θ^π J ≈ E_{s_t ∼ ρ^β}[ ∇_a Q(s, a | θ^Q)|_{s=s_t, a=π(s_t)} ∇_θ^π π(s | θ^π)|_{s=s_t} ]    (20)

which is the policy gradient. For learning the policy π, gradient ascent is performed with respect to the policy parameters θ^π to maximize the Q-value. In DDPG, target networks are used to stabilize the training [22], Gaussian noise is added to the actions for exploration [23], experience replay is utilized for stability [22], and mini-batch gradient descent is used [13]. The network parameters were tuned empirically, and the final hyperparameters can be found in Table I. Both the actor and critic networks are neural networks with two hidden layers. The hidden neurons all use ReLU activation, the actor output uses tanh activation, whilst the critic output uses linear activation.
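The paper implements DDPG with the MATLAB Reinforcement Learning Toolbox; the PyTorch sketch below is only an illustration of one update step corresponding to (17)-(20), using two 100-neuron hidden layers and the learning rates, discount factor and target-update rate of Table I. Any structural detail beyond what the paper states (e.g. how the critic consumes the action) is an assumption.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, out_act=None):
    """Two hidden layers of 100 ReLU units, as stated in the text / Table I."""
    layers = [nn.Linear(in_dim, 100), nn.ReLU(),
              nn.Linear(100, 100), nn.ReLU(),
              nn.Linear(100, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

obs_dim, act_dim = 9, 4            # s_t has 9 components, a_t the 4 W_df weights
gamma, tau = 0.99, 1e-3            # discount factor and target update rate (Table I)
actor,  actor_t  = mlp(obs_dim, act_dim, nn.Tanh()), mlp(obs_dim, act_dim, nn.Tanh())
critic, critic_t = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)
actor_t.load_state_dict(actor.state_dict())
critic_t.load_state_dict(critic.state_dict())
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)    # actor learning rate (Table I)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)   # critic learning rate (Table I)

def ddpg_update(s, a, r, s2, done):
    """One mini-batch update; shapes: s (B,9), a (B,4), r and done (B,1)."""
    # critic: minimise the TD error of (18)-(19), using the target networks for y_t
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * critic_t(torch.cat([s2, actor_t(s2)], dim=-1))
    q = critic(torch.cat([s, a], dim=-1))
    critic_loss = nn.functional.mse_loss(q, y)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    # actor: ascend the deterministic policy gradient of (20)
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

    # soft (Polyak) update of the target networks
    for net, net_t in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```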

V. SIMULATION RESULTS

The DDPG agent was trained for 650 episodes with a 15 s episode length. At the start of each training episode, the environment parameters were varied by choosing the initial velocity randomly between 80 km h^-1 and 130 km h^-1 and the friction value between 0.4 and 0.6. As can be seen from the episode rewards (see Fig. 3), the agent quickly converges to an optimal policy, although slight improvements in the average reward can still be seen in the later episodes.
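For reference, the episode-randomization protocol described above can be outlined as follows. The actual training runs in Simulink with the MATLAB toolbox, so the env and agent objects below, with their reset/step and act/learn interfaces, are hypothetical placeholders.

```python
import numpy as np

def train(env, agent, episodes=650, dt=0.5, episode_length=15.0):
    """Outline of the training protocol of Section V: 650 episodes of 15 s,
    with initial velocity and road friction randomized every episode."""
    steps_per_episode = int(episode_length / dt)          # 15 s at a 0.5 s sample time
    for ep in range(episodes):
        v0 = np.random.uniform(80.0, 130.0)               # initial velocity [km/h]
        mu = np.random.uniform(0.4, 0.6)                  # road friction coefficient
        obs = env.reset(v0, mu)
        for _ in range(steps_per_episode):
            action = agent.act(obs)                       # four W_df weights in [40, 1e3]
            next_obs, reward, done = env.step(action)
            agent.learn((obs, action, reward, next_obs, done))
            obs = next_obs
            if done:                                      # e.g. sideslip angle > 8 deg
                break
```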

Fig. 5: (a) Generated weighting matrix W_df; (b) applied torque on each wheel

Once the training phase is completed, the performance of the DDPG network is validated in various driving scenarios. All simulation experiments presented in this paper were performed on the four-wheel vehicle model with the Pacejka tire model described in Section II. The driver's input torque and steering wheel angle from the reference generator used in the simulations are shown in Fig. 4, and the vehicle design parameters are tabulated in Table II. In this section, we first demonstrate the results of the algorithm in scenarios that DDPG was trained in, and compare the simulation results with the genetic algorithm and manual tuning of the weighting matrix W_df. It is noted that the error signals used in the reward function (16) are also used to design the cost function of the genetic algorithm in order to allow a fair comparison between the two strategies. For further validation, the effectiveness of the DDPG algorithm is investigated under conditions that the reinforcement learning agent was not trained in, with a different input steering angle. This analysis aims to investigate the generalization capability of the DDPG tuning strategy by testing its performance in scenarios beyond the training environment. The first scenario chosen to demonstrate the performance of the DDPG algorithm has a surface friction of 0.4 and an initial velocity of 100 km h^-1. It is noted that at every iteration the algorithm actively tries to maximise the reward function by reducing the error signals in (16), while at the same time maintaining a smooth and low control effort from the torque-vectoring controller by generating an optimal set of weighting parameters. This can be observed in Fig. 5, where the top figure shows the weighting matrix generated by the reinforcement learning algorithm while the bottom figure shows the control action from the torque-vectoring controller.


Fig. 6: Longitudinal and lateral forces and yaw moment around the C.G. The yellow lines represent the actual vehicle behaviour, and the blue lines represent the reference forces estimated from the CIM block

The generated adjustment torque δ_F on each wheel shows a symmetric and smooth evolution as a consequence of the weighting matrix W_df. It is noted that the course of W_df on the front left (FL) wheel is constant over the entire simulation compared to the other wheel weights. This behaviour indicates that reinforcement learning found a semi-trivial solution to the problem over the entire operating region, demonstrating the generalization capabilities of the trained model, which will become clear later in this section. Fig. 6 compares the actual C.G. forces obtained from the vehicle model with those estimated by the CIM block. As can be seen, the generated C.G. forces are able to track the reference values, stabilizing the vehicle on a friction surface of µ = 0.4. However, a small discrepancy can be seen in the second subplot. This is because, in the formulation presented in Section III-B, there is no direct control of the lateral tire forces, so the magnitude of the error in the lateral C.G. force is relatively large and it is difficult to fully minimize the lateral C.G. error under this condition.

To validate the effectiveness of the DDPG algorithm, the longitudinal C.G. force error is compared with that of the genetic algorithm and the trial-and-error approach (manual tuning).

TABLE II: Design parameters

Vehicle mass: 1360 kg
Yaw moment of inertia: 2050 kg m^2
Effective wheel radius: 0.3 m
Front distance from vehicle C.G.: 1.43 m
Rear distance from vehicle C.G.: 1.21 m
Front cornering stiffness: 16000 N rad^-1
Rear cornering stiffness: 16000 N rad^-1
W_E: diag(0.4, 0.02, 1500)

Fig. 7: (a) Longitudinal C.G. force errors under µ = 0.4; (b) performance index results. The blue line represents the genetic algorithm tuning results, the yellow line manual tuning, and the orange line the DDPG algorithm results

It is noted that the manual tuning of W_df was selected based on our previous work in [24], where W_df = diag(1e2, 1e2, 1e2, 1e2) maintained vehicle stability. Fig. 7 shows the comparison between the parameter tuning of the torque-vectoring controller using the DDPG algorithm, the genetic algorithm and manual tuning. Fig. 7(a) shows the evolution of the error in the longitudinal C.G. force for all tuning approaches. In this analysis the error values are such that the vehicle can successfully maintain stability. However, the error evolution for the DDPG algorithm outperforms the other approaches, with an absolute maximum error of 137 N, whereas for the genetic algorithm and manual tuning this value reaches 205 N and 366 N, respectively. Moreover, unlike manual tuning, the error for both DDPG and the genetic algorithm returns to near zero, and the error magnitude for DDPG is smaller than for the genetic algorithm, which shows the superiority of the DDPG strategy over the other approaches. This confirms that the DDPG algorithm is a reliable technique for obtaining a better understanding of the controller parameters and their tuning, particularly in the nonlinear operating region of the vehicle. Note that in this paper only the error in the longitudinal C.G. force is analysed, since only the longitudinal corrective action is controlled in order to stabilize the vehicle. Fig. 7(b) shows the outcome of the objective function (13), i.e. the optimal solution of the torque-vectoring controller as a result of the parameter tuning. As can be seen, DDPG is able to find the best solution by reducing the performance index to the minimum level, converging to a near-zero value around 8 s of the simulation.


Fig. 8: (a) Longitudinal C.G. force errors under µ = 0.3; (b) performance index results

This optimal value is better than that of the genetic algorithm and represents almost a twofold improvement compared to manual tuning of the weighting matrix.

A parametric analysis has been carried out to verify the effectiveness of the DDPG algorithm. In this analysis, the absolute value of the maximum longitudinal C.G. force error for the DDPG algorithm, the genetic algorithm and manual tuning of the torque-vectoring controller is presented (see Table III). First, the analysis compares the DDPG algorithm with the other approaches under different road friction coefficients and the same initial velocity of 100 km h^-1. The error value increases gradually for all cases as the road friction decreases. However, for all friction values, DDPG reduces the maximum absolute errors by more than 50%, which shows an improvement compared to the genetic algorithm and manual tuning. This analysis shows the effectiveness of the DDPG algorithm in adapting the control structure to changes in the environmental conditions. The second analysis evaluates the proposed framework under different velocities with a fixed friction surface condition (µ = 0.4). It can be seen that as the velocity of the vehicle increases, the absolute value of the maximum error increases for all cases. This analysis confirms the ability of the DDPG algorithm to improve the adaptive capability of the controller under different velocities and low-friction surface conditions.

Finally, to validate the generalization capability of the DDPG algorithm, simulations have been conducted in an environment that DDPG was not trained in. A friction surface of µ = 0.3 is chosen, which was not considered during training. The initial velocity of the vehicle is chosen as 80 km h^-1, a reasonable representation of high-speed driving on an icy/snowy surface. Additionally, a 50° step-steer manoeuvre is applied as the input steering angle (δ_f) to the system in order to investigate the generalization capability of the DDPG algorithm under unforeseen conditions. This can be seen in Fig. 8, where in Fig. 8(a) the longitudinal C.G. error shows a larger excursion during braking and acceleration between 8 and 11 s for all tuning approaches, due to the sudden input driving torque applied to the controller. However, the error for the DDPG algorithm converges to zero, which indicates reasonably good tracking even in an environment it was not trained in. On the other hand, the genetic algorithm and manual tuning show a larger error evolution and fail to converge to zero, with the genetic algorithm tracking slightly better. Fig. 8(b) shows the result of the cost function (13), i.e. the optimal solution of the torque-vectoring controller for all approaches. As can be seen, DDPG offers a better solution than the other methods despite a relatively large peak around 11 s. The reason for this peak is a large W_df (see Fig. 5(a)) applied to the controller to maintain vehicle stability, which ultimately minimised the performance index to a near-zero value and found an optimal solution. On the other hand, the performance index for the genetic algorithm and manual tuning shows an increasing trend and fails to converge to an optimal solution. The results show the superiority of the proposed DDPG approach to parameter tuning over the manual tuning and genetic algorithm baselines. More importantly, by carrying out tests with road conditions, vehicle velocities and steering inputs different from those seen during training, the results demonstrate that the DDPG has learned a general approach to tuning the weighting parameters of the torque-vectoring controller which generalizes to new scenarios and environments.

VI. CONCLUSIONS

This paper has investigated the use of the DDPG algorithm for automatically tuning a torque-vectoring controller under a wide range of vehicle velocities and friction surface conditions. The proposed control framework consists of (i) a reinforcement learning method as an auto-tuning algorithm, and (ii) a torque-vectoring controller to ensure the lateral-yaw stability of the vehicle. The closed-loop scheme was implemented on a four-wheel vehicle model with nonlinear tire characteristics, and the numerical results indicate the benefits of reinforcement learning for tuning the torque-vectoring parameters.


TABLE III: Longitudinal C.G. error for various driving scenarios

Road friction µ (at 100 km/h):   0.4     0.45    0.5     0.55    0.6     0.65    0.7
DDPG:                            132 N   115 N   77 N    54 N    43 N    32 N    18 N
Genetic algorithm:               205 N   180 N   163 N   105 N   86 N    65 N    32 N
Manual tuning:                   366 N   289 N   223 N   170 N   127 N   95 N    70 N

Velocity Vx (km/h, at µ = 0.4):  70      80      90      100     110     120     130
DDPG:                            17 N    33 N    77 N    132 N   182 N   202 N   214 N
Genetic algorithm:               40 N    73 N    161 N   205 N   287 N   322 N   390 N
Manual tuning:                   71 N    105 N   212 N   366 N   487 N   498 N   515 N

The DDPG has multiple advantages over the genetic algorithm and manual tuning, as it can find better weighting parameters for the controller, and it tunes the parameters online based on the current states of the vehicle. The results demonstrated the effectiveness of the DDPG algorithm as an auto-tuning method over a wide range of operating points, resulting in a significant reduction of the longitudinal C.G. errors compared to the other tuning approaches. The overall vehicle performance shows accurate tracking of the references as well as maintained stability when using DDPG, demonstrating the efficacy of this algorithm. While this work evaluated the adaptive tuning of the torque-vectoring parameters via W_df, the framework is quite general and could be adapted to different controllers and even different domains.

REFERENCES

[1] Q. Lu, A. Sorniotti, P. Gruber, J. Theunissen, and J. De Smet, "H∞ loop shaping for the torque-vectoring control of electric vehicles: Theoretical design and experimental assessment," Mechatronics, vol. 35, pp. 32–43, 2016.

[2] Z. Wang, U. Montanaro, S. Fallah, A. Sorniotti, and B. Lenzo, "A gain scheduled robust linear quadratic regulator for vehicle direct yaw moment control," Mechatronics, vol. 51, pp. 31–45, 2018.

[3] Y. Fu, Y. Liu, Y. Wen, S. Liu, and N. Wang, "Adaptive neuro-fuzzy tracking control of UUV using sliding-mode-control-theory-based online learning algorithm," Proceedings of the World Congress on Intelligent Control and Automation (WCICA), pp. 691–696, 2016.

[4] C. Van Kien, N. N. Son, and H. P. H. Anh, "Adaptive fuzzy sliding mode control for nonlinear uncertain SISO system optimized by differential evolution algorithm," International Journal of Fuzzy Systems, vol. 21, no. 3, pp. 755–768, 2019.

[5] F. Cao, "PID controller optimized by genetic algorithm for direct-drive servo system," Neural Computing and Applications, vol. 32, no. 1, pp. 23–30, 2020.

[6] V. Rajinikanth and K. Latha, "Tuning and retuning of PID controller for unstable systems using evolutionary algorithm," ISRN Chemical Engineering, vol. 2012, pp. 1–11, 2012.

[7] J. Fei and C. Lu, "Adaptive sliding mode control of dynamic systems using double loop recurrent neural network structure," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 4, pp. 1275–1286, 2018.

[8] J. Chen and T. C. Huang, "Applying neural networks to on-line updated PID controllers for nonlinear process control," Journal of Process Control, vol. 14, no. 2, pp. 211–230, 2004.

[9] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.

[10] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, "Mastering chess and shogi by self-play with a general reinforcement learning algorithm," pp. 1–19, 2017.

[11] V. R. Konda and J. N. Tsitsiklis, "On actor-critic algorithms," SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143–1166, 2003.

[12] Y. Hsu, H. Wu, K. You, and S. Song, "Review: A selected review on reinforcement learning based control," pp. 1–14, 2019.

[13] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," 4th International Conference on Learning Representations (ICLR), 2016.

[14] D. Wang, S. Yue, S. Qixin, L. Guangliang, C. Xiangrui Kong, Guanzhong, and H. Bo, "Adaptive DDPG design-based sliding-mode control for autonomous underwater vehicles at different speeds," IEEE Underwater Technology (UT), pp. 1–5, 2019.

[15] X. S. Wang, Y. H. Cheng, and W. Sun, "A proposal of adaptive PID controller based on reinforcement learning," Journal of China University of Mining and Technology, vol. 17, no. 1, pp. 40–44, 2007.

[16] P. Falcone, M. Tufo, F. Borrelli, J. Asgari, and H. E. Tseng, "A linear time varying model predictive control approach to the integrated vehicle dynamics control problem in autonomous systems," Proceedings of the IEEE Conference on Decision and Control, pp. 2980–2985, 2007.

[17] H. B. Pacejka, Tire and Vehicle Dynamics, 2012.

[18] S. Fallah, A. Khajepour, B. Fidan, S.-K. Chen, and B. Litkouhi, "Vehicle optimal torque vectoring using state-derivative feedback and linear matrix inequality," vol. 62, no. 4, pp. 1540–1552, 2013.

[19] S. Kuutti, R. Bowden, H. Joshi, R. D. Temple, and S. Fallah, "End-to-end reinforcement learning for autonomous longitudinal control using advantage actor critic with temporal context," 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 2456–2462, 2019.

[20] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," 31st International Conference on Machine Learning (ICML), vol. 1, pp. 605–619, 2014.

[21] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Cambridge, 2015.

[22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.

[23] G. E. Uhlenbeck and L. S. Ornstein, "On the theory of the Brownian motion," Physical Review, vol. 36, pp. 823–841, 1930.

[24] S. Taherian, U. Montanaro, S. Dixit, and S. Fallah, "Integrated trajectory planning and torque vectoring for autonomous emergency collision avoidance," IEEE Intelligent Transportation Systems Conference (ITSC), 2019.