a deep reinforcement learning approach for ramp metering

Research ArticleA Deep Reinforcement Learning Approach for Ramp MeteringBased on Traffic Video Data

Bing Liu 1 Yu Tang 2 Yuxiong Ji 1 Yu Shen 1 and Yuchuan Du 1

1Key Laboratory of Road and Traffic Engineering of the Ministry of Education Tongji University Shanghai 201804 China2Tandon School of Engineering New York University New York 11201 NY USA

Correspondence should be addressed to Yu Shen yshentongjieducn

Received 14 October 2020 Accepted 15 October 2021 Published 31 October 2021

Academic Editor Victor L Knoop

Copyright copy 2021 Bing Liu et alis is an open access article distributed under the Creative CommonsAttribution License whichpermits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Ramp metering that uses traffic signals to regulate vehicle flows from the on-ramps has been widely implemented to improvevehicle mobility of the freeway Previous studies generally update signal timings in real-time based on predefined trafficmeasurements collected by point detectors such as traffic volumes and occupancies Comparing with point detectors trafficcamerasmdashwhich have been increasingly deployed on road networksmdashcould cover larger areas and provide more detailed trafficinformation In this work we propose a deep reinforcement learning (DRL) method to explore the potential of traffic video data inimproving the efficiency of ramp metering Vehicle locations are extracted from the traffic video frames and are reformed asposition matrices e proposed method takes the preprocessed video data as inputs and learns the optimal control strategiesdirectly from the high-dimensional inputs A series of simulation experiments based on real-world traffic data are conducted toevaluate the proposed approach e results demonstrate that in comparison with a state-of-the-practice method the proposedDRLmethod results in (1) lower travel times in the mainline (2) shorter vehicle queues at the on-ramp and (3) higher traffic flowsdownstream of the merging areae results suggest that the proposedmethod is able to extract useful information from the videodata for better ramp metering controls

1 Introduction

Ramp metering uses traffic signals to regulate vehicle flowsfrom on-ramps to the mainline of the freeway It alleviatesthe negative impacts of ldquocapacity droprdquo resulting frommassive merging behaviors and reduces the total time spentin the traffic system [1 2] Several field tests have demon-strated the effectiveness of ramp metering in terms ofthroughput vehicle-miles-traveled vehicle-hours-traveledand travel time reliability [3ndash6] For instance a large-scaleexperiment conducted in Minneapolis-Saint Paul found thatwhen the ramp meters were turned off freeway capacitydecreased by 9 travel time increased by 22 and crashesincreased by 26 [7]

Most of the ramp metering methods are driven by trafficmeasurements such as volumes queue lengths and occu-pancies obtained from point detectors (eg inductive loopdetectors) [8 9] ese measurements are predefined and

only reflect partial information regarding traffic operationsRecently traffic cameras have been increasingly deployed onroad networks for monitoring traffic operations anddetecting illegal driving behaviors Compared with pointdetectors traffic cameras cover larger areas and providemore detailed traffic information such as vehicle locationsvehicle speeds and headways between vehicles

We propose a deep reinforcement learning (DRL) methodto explore the potential of traffic video data in improving theefficiency of ramp metering e proposed method does notrely on traditional traffic measurements from loop detectorsInstead it uses traffic video frames as inputs and learns theoptimal control strategies directly from high-dimensional vi-sual inputs by taking advantage of the special neutral structuresin deep learning which are capable of automatically extractinghigh-level features from raw data e effectiveness of theproposed method is demonstrated in comparison with a state-of-the-practice method in a real-world case study

HindawiJournal of Advanced TransportationVolume 2021 Article ID 6669028 13 pageshttpsdoiorg10115520216669028

is paper is organized as follows A brief review of therelated literature is presented first and then the method-ology is described followed by the demonstration in a real-world case studye final section summarizes the paper anddiscusses possible directions for future research

2 Related Works

e signal timing plans for ramp metering could be fixed ortraffic responsive Wattleworth [10] firstly formulated alinear programming model to optimize a fixed signal timingplan for ramp metering based on historical traffic volumedata Responsive approaches adapt to traffic flow fluctua-tions by updating signal timings in response to real-timetraffic measurements e responsive approaches can beclassified as rule-based optimization-based and RL-based

e rule-based approaches update signal timings in real-time according to specific rules Masher et al [11] proposed afeedforward ramp metering algorithm aiming at keepingthe downstream traffic flow below the capacity Papa-georgiou et al [8] proposed a feedback ramp metering al-gorithm which is named ALINEA based on the occupancyobtained by inductive loop detectors downstream of themerging area e ALINEA regulates the inflows from theon-ramp in the scheme of closed-loop control so that thedetected occupancy would approach a predefined targetvalue e ALINEA was further extended by consideringmore traffic measurements such as queue lengths at the on-ramp and traffic volumes in the mainline [12] Wang et al[9] proposed the proportional-integral-based ALINEA (iePI-ALINEA) considering the cases where a bottleneck witha capacity lower than that of the merging area is presentfurther downstream of the merging areaey demonstratednumerically that the PI-ALINEA performs better than theoriginal ALINEA e ALINEA has been incorporated inseveral coordinated ramp metering algorithms such asMETALINE and HERO [13 14]

e optimization-based approaches optimize signaltimings based on real-time traffic data e model predictivecontrol framework which considers the interactions be-tween ramp metering and future traffic states is oftenemployed to predict traffic evolution for proactive trafficcontrols Nevertheless the models used to describe trafficdynamics may lead to nonlinear control problems whichbrings difficulty in finding the optimal signal timings[15ndash17] In addition the effectiveness of the ramp meteringstrategy depends on the degree of the fitness of the models tothe actual traffic dynamics

e RL-based approaches search for a policy that de-termines the signal timings based on the current traffic stateExisting RL-based ramp metering approaches were mainlydeveloped based on value-based RL methods such as theQ-learning algorithm e Q-learning algorithm evaluatesthe action value of each state-action pair and improves thepolicy with a heuristic searche actions usually refer to themetering rate or signal phase selection e states are rep-resented by traffic measurements such as traffic density andvolume e downstream and upstream traffic flow rampqueue length and traffic density are often selected to define

the control reward [18ndash21] e RL has also been introducedfor coordinated ramp metering [22 23] Note that existingRL methods for ramp metering were driven by traditionaltraffic measurements To the best of our knowledge nostudies have attempted to automatically extract informationfrom traffic videos and learn the optimal control strategiesdirectly from the visual inputs for ramp metering Somestudies have considered traffic video data as inputs in the RLframework for intersection signal controls [24ndash26] Nev-ertheless the elements in the RL framework such as theaction reward and state for rampmetering and intersectionsignal controls are different

Compared with the methods based on traditionalmeasurement the proposed strategy shows great potential inimproving ramp metering performance due to the spatio-temporal information contained in the video data Howeverthe high-dimension data increase the computational com-plexity in the training phase erefore a well-designed RLframework and algorithm are necessary to guarantee theeffectiveness of the proposed method

3 Methodology

We propose a DRLmethod for local rampmetering based ontraffic video data e proposed method adopts a flexibletwo-phase control scheme which is illustrated in Figure 1 Apolicy trained from the DRL determines whether the phasein the next time step is green or red at a fixed time step L

based on the current traffic state e adopted controlscheme is conceivably more flexible than the conventionalldquostop-and-gordquo scheme with fixed red and green phasesflashing alternately To reduce the computation burdentraffic video frames are used as control inputs Vehicle lo-cations from raw images are extracted to reconstruct visualrepresentations in the DRL method [24]

Figure 2 illustrates the ramp metering problem in thegeneral scheme of RL e ramp meter acts as the agent tointeract with the environment namely the traffic systemincluding the mainline and the on-ramp e arrows rep-resent the direction of information interaction in the systeme ramp meter takes certain control action at based oncurrent traffic state st en the traffic system responds tothe control action with the state transition from st to st+1e rampmeter obtains reward rt that quantifies the effect ofthe control action at given state st e interaction process isassumed to be a Markov decision process [27] at is giventhe current state st and action at the following state st+1 andreward rt are independent of previous states and actions

e RL method aims at training the agent (Ramp meter)to find an optimal policy πlowast that maximizes the discountedcumulative reward Gt after time step t To achieve that goalone approach is to learn the optimal action-value functionQlowast(s a) which is defined by the following

Qlowast(s a) max

πEπ Gt|st s at a1113858 1113859 (1)

where π is a policy mapping traffic states to control actionsOnce Qlowast(s a) is found the action to be taken can be ob-tained by the following

2 Journal of Advanced Transportation

πlowast(s) argmaxaprime

Qlowast

s aprime( 1113857 (2)

e process of learning policy from the action-valuefunction Q is termed Q-learning e key step in Q-learningis to estimate Qlowast(s a) To better capture the traffic featuresin video data We adopt a deep neural network termedQ-network to approximate Qlowast(s a) e neural networkcontains several convolutional layers and fully connectedlayers e convolutional layers extract traffic features fromthe video frames and the fully connected layers estimate theQ value based on the output of convolutional layers

31 State Action and Reward Representation

311 Action e ramp meter takes two possible actions atfixed decision time step L to determine whether the phase inthe next time step is green (G) or red (R) namely action setA G R

312 State Figure 3 illustrates the process of obtainingvisual representations from the raw images to representtraffic states For simplification purpose the vehicle isrepresented by a location point (xi yi) extracted from rawimages Considering a control area of size X times Y as shown inFigure 3(a) traffic state at time step t is represented by aposition matrix mt isin RXtimesY

mt xi yi( 1113857 UV i isin It (3)

where UV is a constant and It refers to the set of vehicles inthe control area at time step t

e position matrix additionally contains the informa-tion of the signal states

mt xR yR( 1113857 UG if atminus 1 G

UR if atminus 1 R1113896 (4)

where (xR yR) is the position of signal light in images andUG and UR are constants to represent actions execute in timestep t minus 1

e other elements in matrix mt are set to zero

mt(x y) 0 (x y) notin xi yi( 1113857|i isin It1113864 1113865cup xR yR( 11138571113864 1113865

(5)

Since one matrix is insufficient to describe vehicle dy-namics we stack the matrices of consecutive N time steps torepresent the state st

st mtminus (Nminus 1) mt1113966 1113967 (6)

For computation efficiency down-sampling is intro-duced before the matrix stack to reduce the input dimen-sions and maintain sufficient information about vehiclepositions

313 Reward e reward is critical since it motivates theagent to approach the objective Vehicle travel times throughthe system are good measurements of vehicle mobilityHowever they are not suitable rewards since they are greatlyinfluenced by previous actions and cannot well quantify thevalue of the current action Alternatively the speed in themerging area and the queue length at the on-ramp areadopted to define the reward e reward rt after action at isdefined by the following

rt μvt + ωqt μgt 0ωlt 0 (7)

where vt and qt represent the average speed in the mergingarea and the average queue length at the on-ramp duringtime step t and t + 1 respectively Positive μ and negative ωdenote the reward weights for the speed and the queuelength respectivelye defined rewards balance the priorityneeds between improving vehicle mobility in the mainlineand reducing vehicle delays at the on-ramp

32 Algorithm e training problem can be viewed as theregression problem with the loss function defined by thefollowing

Stop Bar

Current phase green

Time step

Signal State

t-L t t+L t+2Lt-2L

Figure 1 Two-phase control scheme for ramp metering

Ramp meter(Agent)

Traffic system(Environment)

Control Action atState st Reward rt

t=t+1

Figure 2 Ramp metering in the general scheme of RL

Journal of Advanced Transportation 3

L wi( 1113857 Esa Qlowast(s a) minus Q s a wi( 1113857( 1113857

21113872 1113873 (8)

where wi denotes the weights in the deep neural networkafter the ith update and Qlowast(s a) is approximated by thefollowing

Qlowast(s a) asymp rt+1 + cmaxaprimeQ st+1 aprime wi( 1113857 st s at a

(9)

e gradient-based algorithm Adam [28] is used toupdate wi to minimize the loss function

Although Q-network could handle large state space itraises the problem of instability in training [29] Two ap-proaches have been proposed to resolve the problem il-lustrated in Figure 4 e experience replay [30] stores thedata prepared for training into the replay buffer and samplesthese data uniformly for training the agent It contributes tomaking the training data independently and identicallydistributede target network [29] is introduced for a stableestimation of target value Qlowast(s a) When Q-network isbeing trained the target network is freezing Its parameterswminus are updated by copying wi in Q-network with a certainfrequency e loss function is redefined as follows

L1 wi( 1113857 Esa rt+1 + cmaxaprimeQ0

st+1 aprime wminus

( 1113857 minus Q0

s a wi( 11138571113872 11138732

1113874 1113875

(10)

where Q0(s a wi) represents the prediction values of theaction-value function

A multitask learning strategy is implemented in thetraining process to improve learning efficiency Whiletraining the ramp metering Q-network the mean speed inthe merging area and the queue length at the on-ramp arepredicted simultaneously e fully connected layer in theQ-network in multitask learning is extended to two layers tocalculate the Q-value and predict the speed and queue

length respectively Another loss function to measure theprediction accuracy is defined as follows

L2 wi( 1113857 Esa Q1

s a wi( 1113857 minus v1113872 11138732

+ Q2

s a wi( 1113857 minus u1113872 11138732

1113874 1113875

(11)

where Q1(s a wi) andQ2(s a wi) represent the predictionvalues of the action-value function mean speed and queuelength respectively u and v refer to the ground truth of thespeed and queue length respectively e total loss functionis as follows

L wi( 1113857 L1 wi( 1113857 + λL2 wi( 1113857 (12)

e algorithm for training the ramp metering policy ispresented in Algorithm 1 e ramp meter takes new actionwith the ε-greedy strategy meaning that the agent does notalways adopt the best action derived from current Q(s a w)Doing so is helpful to explore unknown policies that may bebetter

4 Case Study

41 Case Set-Up e proposed method is evaluated inSUMO an open-source microscopic traffic simulator [31]e simulation is performed in the freeway connectingQingdao and Huangdao through a tunnel in ShandongChina as shown in Figure 5 e one-lane on-ramp and theupstream three-lane mainline merge And the freewaygradually reduces to three lanes in the downstreamsegments

e simulation considers the road network covering theon-ramp the upstream and downstream mainlines and themerging areae speed limits in the mainline and at the on-ramp are 80 and 40 kmh respectively e simulationparameters regarding driving behaviors are calibrated based

b) Position matrix

Down-sampling

c) Traffic state

a) Control area

Y

X mDtndash2

XD

YD

mDtndash1 mD

t

X

(x0 y0)

(x1 ndash x0 y1 ndash y0)

(x1 y1)

Y

Figure 3 State representation (a) control area (b) position matrix mtndash1 (c) traffic state st


on empirical data e simulation time is 750 am to 900am Figure 6 presents the 10-min traffic volumes from 750am to 900 am in the mainline and at the on-ramp

e agentrsquos observation in DRL covers the simulatedroad network ree consecutive frames of the traffic videoin the controlled area are needed as the observation input ofeach decision time step After zooming and down-samplingthe size of one observation is 4 times 512 in pixels reeconsecutive observations are stacked to represent the trafficstate To accommodate the input size for ramp metering theQ-network in Mnih et al [30] is revised as Figure 7 whichconsists of three convolutional neural network (CNN) layersand one full connection (FC) layer for each output layer

us there are about 2 million weights in the deep neuralnetwork that are stored permanently after training ehyperparameters in the deep Q-learning for ramp meteringare listed in Table 1

For comparative purposes two scenarios are also con-sidered in the evaluation e first scenario does not applyramp metering and the second scenario adopts the PI-ALINEA method e PI-ALINEA method fixes the lengthof the green phase and determines the length of the red phasein the current signal cycle based on the length of the redphase in the previous cycle and the downstream occupanciesin the previous and current cyclese rampmetering rate rk

for k-th cycle is given by the following

Agent

Environment

Action atState st Reward rt

t=t+1

SampleTraining data

Store ltst at rt st+1gt

Update weights

ltst at rt st+1gtltst+1 at+1 rt+1 st+2gtltst+2 at+2 rt+2 st+3gtltst+3 at+3 rt+3 st+4gt

hellip

Replay buffer

(a)

Q-networkTarget

network

L (wi) = Esa [(rt+1 + γ max Q (st+1 arsquo wndash) ndash Q (s a wi))2]

TrainedFreezing

Update target networkwith certain frequency

arsquo

(b)

Figure 4 Approaches for stabilizing the training process in deep Q-learning (a) relay buffer (b) target network

Input Batch size k learning rate η decision step size L exploration rate εFreezing interval F number of training episode NNumber of decision time steps T in one episode

Fill the replay buffer randomlyInitialize the action-value function Q with random weights wInitialize the target network Qndashwith weights wminus w

flarr0for episode 1 N

for t 1 T

Observe state st and choose action at by ε-greedy strategyExecute action at obtain reward rt state st+1Store transition ltst at rt vt ut st+1gt into replay bufferUniformly sample k transitions from replay bufferCompute the gradient nablaw(L1 + λL2)

Update weights w in Q-network by Adam with learning rate ηflarrf + 1

end ifif mod (f F) 0Update target network Qminus wminus larrw

end ifend for

end for

ALGORITHM 1 Deep Q-learning for ramp metering


7000

6500

6000

5500

5000

volu

me (

pcu

h)

750 800 810 820 830 840 850

MainlineRamp

Figure 6 Traffic volumes in the mainline and the on-ramp

CurbsideCenterline

Three-lanemainline

Three-lanemainline

Mergingarea

One-lane on-ramp

400 m

1000 m

150 m300 m

Road network modelled in the simulation

N

Figure 5 Road network considered in the case study

Input(3 times 4 times 512)

CNNFilters 32Size 1 times 8

Stride 1 times 4


Stride 1 times 2


Stride 1 times 1

FCUnits 256

FCUnits 256

ReLu ReLu

ReLu

ReLu ReLu

ReLuOutputUnits 2

(speed+queue)

OutputUnits 2

(Q function)

Figure 7 Structure of the deep neural network


rk rkminus 1 + KR ok minus 1113954o1113858 1113859 minus KP okminus 1 minus ok1113858 1113859 (13)

where ok and okminus 1 represent the mean mainline occupancyduring the k-th and (k + 1)-th cycle respectively 1113954o refers tothe desired value formainline occupancy which is set to 16in this study KR 90 and KP 50 are two regulator pa-rameters in PI-ALINEA which are fine-tuned to minimizemean travel times in the traffic system To accommodate theldquostop-and-gordquo scheme the green time for each cycle is fixedwith time step size L and the red time for each cycle is givenby rkL

42 Evaluation and Comparison e Q-network is trainedin 106 training frames and each scenario is evaluated in 20simulation experiments e training process presented inFigure 8 can be finished in less than 24 hours on a com-puting server with NVIDIA Quadro K620 GPU and IntelXeon CPU E5-1650 (36GHz 6 cores) e simulations forthe training phase and each experiment are initialized withdifferent random seeds e episode reward and mean traveltime are both converged in 400000 training frames and canmaintain a stable value until the end of the training process

Figure 9 presents the Q1 (25) Q2 (50) and Q3 (75)of the travel times in the mainline for three scenarios eresults demonstrate that ramp metering not only reducesvehicle travel times but also improves the stability of trafficflows in the mainline e resulting travel times are closewhen the demand is low at the beginning (800ndash810) Withthe increase in the demand the median travel time increasesfaster in the no-control scenario and reaches the maximumvalue of 425min at 8 28 e maximum values of themedian travel times resulting from the PI-ALINEA methodand the DRL method are 375min and 34min respectivelywhich are 117 and 20 lower than those in the no-controlscenario As the demand decreases the travel times of thethree scenarios decrease to similar levels In addition all theramp metering methods narrow the interquartile ranges(Q3-Q1) of the travel times Overall the proposed DRLmethod results in shorter travel times with narrower traveltime ranges than that of the fine-tuned PI-ALINEA method

Table 2 presents mean vehicle travel times in 20 simu-lation experiments It is found that the performance of thePI-ALINEA method is not always better than that of the no-control scenario For example compared with the no-

control scenario the PI-ALINEA method increases vehicletravel time by 10 in the 6th experiment In contrast theproposed DRLmethod consistently results in a shorter traveltime than that of the no-control scenario demonstrating thestable performance of the proposed method On average theDRL method reduces vehicle travel time by 13 whencompared with the no-control scenario

Figure 10 presents the temporal-spatial distribution ofthe mean speeds along the mainline e results furtherdemonstrate the efficiency of the ramp metering and thebetter performance of the proposed method In the no-control scenario the minimum speed in the mainline is12ms (432 kmh) e minimum speeds resulting from thePI-ALINEA method and the DRL method are 133 and20 higher than those in the no-control scenario respec-tively e area with a speed lower than 15ms (54 kmh) isreduced with the PI-ALINEA method which is furtherreduced when the DRL method is implemented

Figure 11 presents the Q1 Q2 and Q3 of the queuelengths at the on-ramp in three scenarios e results revealthat ramp metering improves vehicle mobility in themainline at the cost of delays of vehicles at the on-rampWithout rampmetering no vehicle queues are formed at theon-ramp Both ramp metering methods result in vehiclequeues at the on-ramp Nevertheless the queues producedby the proposed method are shorter than those produced bythe PI-ALINEAmethod When the PI-ALINEAmethod andthe proposed DRL method are adopted the maximumvalues of the Q3 of the queue lengths are 126m and 67mrespectively

Figure 12 presents the mean queue lengths at the on-ramp and the red ratios (ie the proportion of the red phasein the simulation time) in each simulation for the two rampmetering methods e queue length generally increaseswith the red ratio However compared with the PI-ALINEAmethod the queues produced by the DRL method areshorter suggesting that the proposed method could betterutilize the green time

Figure 13 presents the queue lengths at the on-ramp andthe occupancy downstream of the merging area for the tworamp metering methods Figure 13(a) demonstrates that thequeue length and the occupancy are highly correlated whenthe PI-ALINEA method is adopted e results are un-derstandable According to Figure 12 a higher red ratioleads to long queue length Given high occupancy the PI-

Table 1 Key parameters in deep Q-learning for ramp metering

Parameters ValueState matrix dimension (3 4 512)

Replay buffer size B 2 times 105Training steps T 4200Learning rate η 25 times 10minus 4

Batch size k 32Exploration rate ε 01Freeze interval F 104Decision step size L 4 sLoss weight λ 0001Reward weight μ 05Reward weight ω minus 01


ALINEA method tends to adopt a long red phase whichwould lead to long queues at the on-ramp In contrastFigure 13(b) shows that the correlation between the queue

length and the occupancy is relatively low when the DRLmethod is implemented e queue length over time isrelatively stable ranging from 25ndash50m Although the oc-cupancy is low after 8 45 the ramp meter is still active eresults indicate that the proposed DRL method not onlyrelies on downstream occupancy but other informationextracted from raw data

e better performance of the proposed DRL method isfurther revealed when considering traffic flows downstreamof the merge area which is demonstrated in Figure 14 In thebeginning traffic flows in three scenarios are close Duringthe peak period between 8 10ndash8 35 the flows resultingfrom the DRL method are higher and more stable sug-gesting that the proposed DRL method could better alleviatethe negative impacts of ldquocapacity droprdquo resulting frommassive merging behaviors When traffic demand decreasesduring 8 40ndash9 00 traffic flows in the two ramp meteringscenarios also decline Nevertheless traffic flows in the no-control scenario are still high during 8 45ndash8 55 since morevehicles are held in the system in the previous periods

A perturbation analysis is conducted to evaluate theimpact of information contained in video data As shown inFigure 15 the control area is divided into eight subareasesensors applied in PI-ALINEA are located in subarea 2 einformation in the subarea is removed from original data togenerate perturbed data for each subarea e impact of

1e6

6250

6500

6750

7000

7250

7500Ep

isode

rew

ard

1004 06 080200Frame

(a)

180

200

220

Mea

n tr

avel

tim

e (s)

1e61004 06 080200

Frame

(b)

Figure 8 Convergence curves for (a) episode reward and (b) mean travel time

800 810 820 830 840 850 900

a) No controlb) PI-ALINEAc) Deep RL

25

30

35

40

45

Trav

el ti

me (

min

)

Figure 9 Vehicle travel times in the mainline

Table 2 Mean travel times in 20 simulation experiments (s)

No No controls PI-ALINEA DRL1 186 179 [minus 4] 171 [minus 8]2 203 173 [minus 15] 171 [minus 16]3 217 175 [minus 19] 177 [minus 18]4 227 181 [minus 20] 181 [minus 20]5 202 173 [minus 14] 177 [minus 12]6 178 195 [+10] 173 [minus 3]7 185 178 [minus 4] 174 [minus 6]8 226 180 [minus 20] 173 [minus 23]9 175 214 [+22] 169 [minus 3]10 206 183 [minus 11] 175 [minus 15]11 188 173 [minus 8] 167 [minus 11]12 211 182 [minus 14] 170 [minus 19]13 219 175 [minus 20] 172 [minus 21]14 170 172 [+1] 170 [minus 0]15 187 182 [minus 3] 175 [minus 6]16 216 170 [minus 21] 179 [minus 17]17 177 192 [+8] 174 [minus 2]18 228 169 [minus 26] 172 [minus 25]19 190 179 [minus 6] 175 [minus 8]20 194 180 [minus 7] 170 [minus 12]


0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

12131415161718192021

Spee

d (m

s)

810 820 830 840 850 900800

(a)

0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

136144152160168176184192200208

Spee

d (m

s)

810 820 830 840 850 900800

(b)

0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

144

152

160

168

176

184

192

200

208

Spee

d (m

s)

810 820 830 840 850 900800

(c)

Figure 10 Temporal-spatial distribution of speed in (a) no-control (b) PI-ALINEA (c) DRL

800 810 820 830 840 850 900


0

50

100

150

Que

ue le

ngth

(m)

Figure 11 Queue length at the on-ramp


missing information for each subarea is evaluated in 20simulation experiments with perturbed data

Figure 16 presents the box plot of travel time when theinformation of different subareas is removed and the baselineis the result of experiments that take original data as input

Apart from subarea 2 the information from other subareassuch as subarea 3 and 4 also influences the control performanceobviously which proves that the DRL-based controller takesmore information from video data into consideration besidesthe measurements applied in PI-ALINEA

PI-ALINEADeep RL

30

40

50

60

70

Que

ue le

ngth

(m)

050 055 060045Red ratio ()

Figure 12 Relationship of queue length and red ratio for (a) PI-ALINEA and (b) DRL

ρ = 093

OccQueue

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120

Que

ue le

ngth

(m)

(a)

OccQueue

ρ = 050

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120Q

ueue

leng

th (m

)

(b)

Figure 13 Queue length and occupancy for (a) PI-ALINEA and (b) DRL


5 Conclusion and Future Research

is study proposes a DRL method for local ramp meteringbased on traffic video data e proposed method learnsoptimal strategies directly from high-dimensional visual

inputs overcoming the reliance on the traditional trafficmeasurements e better performance of the proposedmethod is demonstrated in a real-world case study Com-pared with the traditional PI-ALINEAmethod the proposedmethod results in lower travel times in the mainline and

800 805 810 815 820 825 830 835 840 845 850 855 900

No controlPI-ALINEADeep RL

5200

5400

5600

5800

6000

6200

6400

6600

Volu

me (

veh

h)

Figure 14 Downstream volume in three scenarios

VehicleRamp meter

1

2

3

4

6

5Direction

78

100 500 900 1300 1700 2100-300x

35

55

75y

95

115

135

Figure 15 Division subareas

Baseline 1 2 3 4 5 6 7 8

170

180

190

200

210

220

Trav

el ti

me (

s)

Figure 16 Travel time with perturbed input data


shorter vehicle queue lengths at the on-ramp and couldbetter alleviate the negative impacts of ldquocapacity droprdquoresulting from massive merging behaviors e resultssuggest that the proposed DRL method is able to extractuseful information from the video data for better rampmetering controls

It is convenient to deploy the proposed model in practicesince the well-trained model is lightweight and the inputimage data are accessible with the latest communicationtechnologies For future research additional studies aresuggested to conduct to investigate the stability of theproposed method under various traffic and weather cir-cumstances We assume that vehicle location informationfrom the videos is accurate Actually the data quality may bedisturbed by factors such as weather vehicle sizes andcolors It is valuable to design a ramp metering algorithmthat is robust to those disturbances In addition abnormalevents such as accidents are not considered in this studyEfficiently training the ramp metering to adapt to abnormalevents remains to be investigated Finally extending theproposed method for coordinated ramp metering controls isworth pursuing

Data Availability

edata used to support the results of this study are availablefrom the corresponding author upon request

Disclosure

An initial draft of this paper was submitted as a preprint inthe following link httpsarxivorgabs201212104 [32]eaim behind posting this version was to collect insightfulcomments from the research community is currentdocument is the last update of our study

Conflicts of Interest

e authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

is work was supported by the National Natural ScienceFoundation of China (71671124 and 71901164) the NaturalScience Foundation of Shanghai (19ZR1460700) theShanghai Science and Technology Committee(19DZ1202900) the Shanghai Municipal Science andTechnology Major Project (2021SHZDZX0100) and theFundamental Research Funds for the Central Universitiese corresponding author was sponsored by the ShanghaiPujiang Program (2019PJC107)

References

[1] A Srivastava and N Geroliminis ldquoEmpirical observations ofcapacity drop in freeway merges with ramp control and in-tegration in a first-order modelrdquo Transportation Research PartC Emerging Technologies vol 30 pp 161ndash177 2013

[2] M Papageorgiou C Kiakaki V Dinopoulou A Kotsialosand W Yibing ldquoReview of road traffic control strategiesrdquoProceedings of the IEEE vol 91 no 12 pp 2043ndash2067 2003

[3] D Levinson and L Zhang ldquoRamp meters on trial evidencefrom the Twin Cities metering holidayrdquo Transportation Re-search Part A Policy and Practice vol 40 no 10 pp 810ndash8282006

[4] N Bhouri H Haj-Salem and J Kauppila ldquoIsolated versuscoordinated ramp metering field evaluation results of traveltime reliability and traffic impactrdquo Transportation ResearchPart C Emerging Technologies vol 28 pp 155ndash167 2013

[5] L Faulkner F Dekker D Gyles I Papamichail andM Papageorgiou ldquoEvaluation of HERO-coordinated rampmetering installation at M1 and M3 freeways in queenslandAustraliardquo Transportation Research Record Journal of theTransportation Research Board vol 2470 no1 pp13ndash23 2014

[6] X Y Lu C J Wu and X Y Lu ldquoField test implementation ofcoordinated ramp metering control strategy a case study onSR-99Nrdquo in Proceedings of the Transportation Research BoardMeeting 96th Annual Meeting Washington DC WA USAJanuary 2018

[7] C Systematics Twin Cities Ramp Meter Evaluation Cam-bridge Systematics Medford MA USA 2001

[8] M Papageorgiou H Hadj-Salem and J-M BlossevilleldquoALINEA a local feedback control law for on-ramp meter-ingrdquo Transportation Research Record vol 1320 no 1pp 58ndash67 1991

[9] Y Wang E B Kosmatopoulos M Papageorgiou andI Papamichail ldquoLocal ramp metering in the presence of adistant downstream bottleneck theoretical analysis andsimulation studyrdquo IEEE Transactions on Intelligent Trans-portation Systems vol 15 no 5 pp 2024ndash2039 2014

[10] J A Wattleworth Peak-period Analysis and Control of aFreeway System Texas Transportation Institute San AntonioTX USA 1965

[11] D P Masher D Ross P Wong P Tuan H M Zeidler andS Petracek Guidelines for Design and Operation of RampControl Systems Stanford Research Institute Menlo Park CAUSA 1975

[12] E Smaragdis and M Papageorgiou ldquoSeries of new local rampmetering strategiesrdquo Transportation Research Record Journalof the Transportation Research Board vol 1856 pp 74ndash862004

[13] M Papageorgiou J-M Blosseville and H Haj-SalemldquoModelling and real-time control of traffic flow on thesouthern part of Boulevard Peripherique in Paris Part IIcoordinated on-ramp meteringrdquo Transportation ResearchPart A General vol 24 no 5 pp 361ndash370 1990

[14] I Papamichail and M Papageorgiou ldquoTraffic-responsivelinked ramp-metering controlrdquo IEEE Transactions on Intel-ligent Transportation Systems vol 9 no 1 pp 111ndash121 2008

[15] M Papageorgiou and R Mayr ldquoOptimal decompositionmethods applied to motorway traffic controlrdquo InternationalJournal of Control vol 35 no 2 pp 269ndash280 1982

[16] T Bellemans B De Schutter and B De Moor ldquoModelpredictive control for ramp metering of motorway traffic acase studyrdquo Control Engineering Practice vol 14 no 7pp 757ndash767 2006

[17] I Papamichail A Kotsialos I Margonis andM Papageorgiou ldquoCoordinated ramp metering for freewaynetworks-a model-predictive hierarchical control approachrdquoTransportation Research Part C Emerging Technologiesvol 18 no 3 pp 311ndash331 2010


[18] M Davarynejad A Hegyi J Vrancken and J van den BergldquoMotorway ramp-metering control with queuing consider-ation using Q-learningrdquo in Proceedings of the 14th Interna-tional IEEE Conference on Intelligent Transportation Systems(ITSC) pp 1652ndash1658 Washington DC USA October 2011

[19] K Rezaee B Abdulhai and H Abdelgawad ldquoApplication ofreinforcement learning with continuous state space to rampmetering in real-world conditionsrdquo in Proceedings of the15thInternational IEEE Conference on Intelligent TransportationSystems pp 1590ndash1595 Anchorage AK USA September2012

[20] A Fares and W Gomaa ldquoFreeway ramp-metering controlbased on reinforcement learningrdquo in Proceedings of the 11thIEEE International Conference on Control amp Automation(ICCA) pp 1226ndash1231 Taichung Taiwan June 2014

[21] H Yang and H Rakha ldquoReinforcement learning rampmetering control for weaving sections in a connected vehicleenvironmentrdquo in Proceedings of the Transportation ResearchBoard 96th Annual Meeting Washington DC WA USAJanuary 2017

[22] F Belletti D Haziza G Gomes and A M Bayen ldquoExpertlevel control of ramp metering based on multi-task deepreinforcement learningrdquo IEEE Transactions on IntelligentTransportation Systems vol 19 no 4 pp 1198ndash1207 2018

[23] F Deng J Jin Y Shen and Y Du ldquoAdvanced self-improvingramp metering algorithm based on multi-agent deep rein-forcement learningrdquo in Proceedings of the 2019 IEEE Intel-ligent Transportation Systems Conference (ITSC) AucklandNew Zealand Octom 2019

[24] J Gao Y Shen J Liu M Ito and N Shiratori ldquoAdaptivetraffic signal control deep reinforcement learning algorithmwith experience replay and target networkrdquo 2017 httpsarxivorgabs170502755

[25] H Wei G Zheng H Yao and Z Li ldquoIntelliLight a rein-forcement learning approach for intelligent traffic lightcontrolrdquo in Proceedings of the 24th ACM SIGKDD Interna-tional Conference on Knowledge Discovery amp Data MiningNew York NY USA July 2018

[26] X Liang X Du G Wang and Z Han ldquoA deep reinforcementlearning network for traffic light cycle controlrdquo IEEETransactions on Vehicular Technology vol 68 no 2pp 1243ndash1253 2019

[27] R Sutton and A Barto Reinforcement Learning An Intro-duction MIT Press Cambridge MA USA 1998

[28] D P Kingma and J Ba ldquoAdam a method for stochasticoptimizationrdquo 2014 httpsarxivorgabs14126980

[29] V Mnih K Kavukcuoglu D Silver et al ldquoHuman-levelcontrol through deep reinforcement learningrdquo Naturevol 518 no 7540 pp 529ndash533 2015

[30] V Mnih K Kavukcuoglu D Silver et al ldquoPlaying atari withdeep reinforcement learningrdquo 2013 httpsarxivorgabs13125602

[31] D Krajzewicz J Erdmann M Behrisch and L Bieker-WalzldquoRecent development and applications of SUMO-Simulationof Urban MObilityrdquo International Journal of Agile Systemsand Management vol 5 no 3amp4 2012

[32] B Liu Y Tang Y Ji Y Shen and Y Du ldquoA deep rein-forcement learning approach for ramp metering based ontraffic video datardquo 2020 httpsarxivorgabs201212104


is paper is organized as follows A brief review of therelated literature is presented first and then the method-ology is described followed by the demonstration in a real-world case studye final section summarizes the paper anddiscusses possible directions for future research

2 Related Works

e signal timing plans for ramp metering could be fixed ortraffic responsive Wattleworth [10] firstly formulated alinear programming model to optimize a fixed signal timingplan for ramp metering based on historical traffic volumedata Responsive approaches adapt to traffic flow fluctua-tions by updating signal timings in response to real-timetraffic measurements e responsive approaches can beclassified as rule-based optimization-based and RL-based

e rule-based approaches update signal timings in real-time according to specific rules Masher et al [11] proposed afeedforward ramp metering algorithm aiming at keepingthe downstream traffic flow below the capacity Papa-georgiou et al [8] proposed a feedback ramp metering al-gorithm which is named ALINEA based on the occupancyobtained by inductive loop detectors downstream of themerging area e ALINEA regulates the inflows from theon-ramp in the scheme of closed-loop control so that thedetected occupancy would approach a predefined targetvalue e ALINEA was further extended by consideringmore traffic measurements such as queue lengths at the on-ramp and traffic volumes in the mainline [12] Wang et al[9] proposed the proportional-integral-based ALINEA (iePI-ALINEA) considering the cases where a bottleneck witha capacity lower than that of the merging area is presentfurther downstream of the merging areaey demonstratednumerically that the PI-ALINEA performs better than theoriginal ALINEA e ALINEA has been incorporated inseveral coordinated ramp metering algorithms such asMETALINE and HERO [13 14]

e optimization-based approaches optimize signaltimings based on real-time traffic data e model predictivecontrol framework which considers the interactions be-tween ramp metering and future traffic states is oftenemployed to predict traffic evolution for proactive trafficcontrols Nevertheless the models used to describe trafficdynamics may lead to nonlinear control problems whichbrings difficulty in finding the optimal signal timings[15ndash17] In addition the effectiveness of the ramp meteringstrategy depends on the degree of the fitness of the models tothe actual traffic dynamics

e RL-based approaches search for a policy that de-termines the signal timings based on the current traffic stateExisting RL-based ramp metering approaches were mainlydeveloped based on value-based RL methods such as theQ-learning algorithm e Q-learning algorithm evaluatesthe action value of each state-action pair and improves thepolicy with a heuristic searche actions usually refer to themetering rate or signal phase selection e states are rep-resented by traffic measurements such as traffic density andvolume e downstream and upstream traffic flow rampqueue length and traffic density are often selected to define

the control reward [18ndash21] e RL has also been introducedfor coordinated ramp metering [22 23] Note that existingRL methods for ramp metering were driven by traditionaltraffic measurements To the best of our knowledge nostudies have attempted to automatically extract informationfrom traffic videos and learn the optimal control strategiesdirectly from the visual inputs for ramp metering Somestudies have considered traffic video data as inputs in the RLframework for intersection signal controls [24ndash26] Nev-ertheless the elements in the RL framework such as theaction reward and state for rampmetering and intersectionsignal controls are different

Compared with the methods based on traditionalmeasurement the proposed strategy shows great potential inimproving ramp metering performance due to the spatio-temporal information contained in the video data Howeverthe high-dimension data increase the computational com-plexity in the training phase erefore a well-designed RLframework and algorithm are necessary to guarantee theeffectiveness of the proposed method

3 Methodology

We propose a DRLmethod for local rampmetering based ontraffic video data e proposed method adopts a flexibletwo-phase control scheme which is illustrated in Figure 1 Apolicy trained from the DRL determines whether the phasein the next time step is green or red at a fixed time step L

based on the current traffic state e adopted controlscheme is conceivably more flexible than the conventionalldquostop-and-gordquo scheme with fixed red and green phasesflashing alternately To reduce the computation burdentraffic video frames are used as control inputs Vehicle lo-cations from raw images are extracted to reconstruct visualrepresentations in the DRL method [24]

Figure 2 illustrates the ramp metering problem in thegeneral scheme of RL e ramp meter acts as the agent tointeract with the environment namely the traffic systemincluding the mainline and the on-ramp e arrows rep-resent the direction of information interaction in the systeme ramp meter takes certain control action at based oncurrent traffic state st en the traffic system responds tothe control action with the state transition from st to st+1e rampmeter obtains reward rt that quantifies the effect ofthe control action at given state st e interaction process isassumed to be a Markov decision process [27] at is giventhe current state st and action at the following state st+1 andreward rt are independent of previous states and actions

e RL method aims at training the agent (Ramp meter)to find an optimal policy πlowast that maximizes the discountedcumulative reward Gt after time step t To achieve that goalone approach is to learn the optimal action-value functionQlowast(s a) which is defined by the following

Qlowast(s a) max

πEπ Gt|st s at a1113858 1113859 (1)

where π is a policy mapping traffic states to control actionsOnce Qlowast(s a) is found the action to be taken can be ob-tained by the following



Qlowast

s aprime( 1113857 (2)













(5)








Stop Bar

Current phase green

Time step

Signal State

t-L t t+L t+2Lt-2L


Ramp meter(Agent)



t=t+1




21113872 1113873 (8)



(9)




st+1 aprime wminus

( 1113857 minus Q0

s a wi( 11138571113872 11138732

1113874 1113875

(10)




L2 wi( 1113857 Esa Q1

s a wi( 1113857 minus v1113872 11138732

+ Q2

s a wi( 1113857 minus u1113872 11138732

1113874 1113875

(11)


L wi( 1113857 L1 wi( 1113857 + λL2 wi( 1113857 (12)


4 Case Study



b) Position matrix

Down-sampling

c) Traffic state

a) Control area

Y

X mDtndash2

XD

YD

mDtndash1 mD

t

X

(x0 y0)


(x1 y1)

Y








Agent

Environment


t=t+1

SampleTraining data


Update weights


hellip

Replay buffer

(a)

Q-networkTarget

network


TrainedFreezing


arsquo

(b)





for t 1 T




end ifend for

end for



7000

6500

6000

5500

5000

volu

me (

pcu

h)

750 800 810 820 830 840 850

MainlineRamp


CurbsideCenterline

Three-lanemainline

Three-lanemainline

Mergingarea

One-lane on-ramp

400 m

1000 m

150 m300 m


N




Stride 1 times 4


Stride 1 times 2


Stride 1 times 1

FCUnits 256

FCUnits 256

ReLu ReLu

ReLu

ReLu ReLu

ReLuOutputUnits 2

(speed+queue)

OutputUnits 2

(Q function)






















1e6

6250

6500

6750

7000

7250

7500Ep

isode

rew

ard

1004 06 080200Frame

(a)

180

200

220

Mea

n tr

avel

tim

e (s)

1e61004 06 080200

Frame

(b)


800 810 820 830 840 850 900


25

30

35

40

45

Trav

el ti

me (

min

)





0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

12131415161718192021

Spee

d (m

s)

810 820 830 840 850 900800

(a)

0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

136144152160168176184192200208

Spee

d (m

s)

810 820 830 840 850 900800

(b)

0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

144

152

160

168

176

184

192

200

208

Spee

d (m

s)

810 820 830 840 850 900800

(c)


800 810 820 830 840 850 900


0

50

100

150

Que

ue le

ngth

(m)






PI-ALINEADeep RL

30

40

50

60

70

Que

ue le

ngth

(m)

050 055 060045Red ratio ()


ρ = 093

OccQueue

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120

Que

ue le

ngth

(m)

(a)

OccQueue

ρ = 050

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120Q

ueue

leng

th (m

)

(b)






800 805 810 815 820 825 830 835 840 845 850 855 900


5200

5400

5600

5800

6000

6200

6400

6600

Volu

me (

veh

h)


VehicleRamp meter

1

2

3

4

6

5Direction

78

100 500 900 1300 1700 2100-300x

35

55

75y

95

115

135


Baseline 1 2 3 4 5 6 7 8

170

180

190

200

210

220

Trav

el ti

me (

s)





Data Availability


Disclosure




Acknowledgments


References




































Qlowast

s aprime( 1113857 (2)













(5)








Stop Bar

Current phase green

Time step

Signal State

t-L t t+L t+2Lt-2L


Ramp meter(Agent)



t=t+1




21113872 1113873 (8)



(9)




st+1 aprime wminus

( 1113857 minus Q0

s a wi( 11138571113872 11138732

1113874 1113875

(10)




L2 wi( 1113857 Esa Q1

s a wi( 1113857 minus v1113872 11138732

+ Q2

s a wi( 1113857 minus u1113872 11138732

1113874 1113875

(11)


L wi( 1113857 L1 wi( 1113857 + λL2 wi( 1113857 (12)


4 Case Study



b) Position matrix

Down-sampling

c) Traffic state

a) Control area

Y

X mDtndash2

XD

YD

mDtndash1 mD

t

X

(x0 y0)


(x1 y1)

Y








Agent

Environment


t=t+1

SampleTraining data


Update weights


hellip

Replay buffer

(a)

Q-networkTarget

network


TrainedFreezing


arsquo

(b)





for t 1 T




end ifend for

end for



7000

6500

6000

5500

5000

volu

me (

pcu

h)

750 800 810 820 830 840 850

MainlineRamp


CurbsideCenterline

Three-lanemainline

Three-lanemainline

Mergingarea

One-lane on-ramp

400 m

1000 m

150 m300 m


N




Stride 1 times 4


Stride 1 times 2


Stride 1 times 1

FCUnits 256

FCUnits 256

ReLu ReLu

ReLu

ReLu ReLu

ReLuOutputUnits 2

(speed+queue)

OutputUnits 2

(Q function)






















1e6

6250

6500

6750

7000

7250

7500Ep

isode

rew

ard

1004 06 080200Frame

(a)

180

200

220

Mea

n tr

avel

tim

e (s)

1e61004 06 080200

Frame

(b)


800 810 820 830 840 850 900


25

30

35

40

45

Trav

el ti

me (

min

)





0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

12131415161718192021

Spee

d (m

s)

810 820 830 840 850 900800

(a)

0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

136144152160168176184192200208

Spee

d (m

s)

810 820 830 840 850 900800

(b)

0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

144

152

160

168

176

184

192

200

208

Spee

d (m

s)

810 820 830 840 850 900800

(c)


800 810 820 830 840 850 900


0

50

100

150

Que

ue le

ngth

(m)






PI-ALINEADeep RL

30

40

50

60

70

Que

ue le

ngth

(m)

050 055 060045Red ratio ()


ρ = 093

OccQueue

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120

Que

ue le

ngth

(m)

(a)

OccQueue

ρ = 050

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120Q

ueue

leng

th (m

)

(b)






800 805 810 815 820 825 830 835 840 845 850 855 900


5200

5400

5600

5800

6000

6200

6400

6600

Volu

me (

veh

h)


VehicleRamp meter

1

2

3

4

6

5Direction

78

100 500 900 1300 1700 2100-300x

35

55

75y

95

115

135


Baseline 1 2 3 4 5 6 7 8

170

180

190

200

210

220

Trav

el ti

me (

s)





Data Availability


Disclosure




Acknowledgments


References




































21113872 1113873 (8)



(9)




st+1 aprime wminus

( 1113857 minus Q0

s a wi( 11138571113872 11138732

1113874 1113875

(10)




L2 wi( 1113857 Esa Q1

s a wi( 1113857 minus v1113872 11138732

+ Q2

s a wi( 1113857 minus u1113872 11138732

1113874 1113875

(11)


L wi( 1113857 L1 wi( 1113857 + λL2 wi( 1113857 (12)


4 Case Study



b) Position matrix

Down-sampling

c) Traffic state

a) Control area

Y

X mDtndash2

XD

YD

mDtndash1 mD

t

X

(x0 y0)


(x1 y1)

Y








Agent

Environment


t=t+1

SampleTraining data


Update weights


hellip

Replay buffer

(a)

Q-networkTarget

network


TrainedFreezing


arsquo

(b)





for t 1 T




end ifend for

end for



7000

6500

6000

5500

5000

volu

me (

pcu

h)

750 800 810 820 830 840 850

MainlineRamp


CurbsideCenterline

Three-lanemainline

Three-lanemainline

Mergingarea

One-lane on-ramp

400 m

1000 m

150 m300 m


N




Stride 1 times 4


Stride 1 times 2


Stride 1 times 1

FCUnits 256

FCUnits 256

ReLu ReLu

ReLu

ReLu ReLu

ReLuOutputUnits 2

(speed+queue)

OutputUnits 2

(Q function)






















1e6

6250

6500

6750

7000

7250

7500Ep

isode

rew

ard

1004 06 080200Frame

(a)

180

200

220

Mea

n tr

avel

tim

e (s)

1e61004 06 080200

Frame

(b)


800 810 820 830 840 850 900


25

30

35

40

45

Trav

el ti

me (

min

)





0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

12131415161718192021

Spee

d (m

s)

810 820 830 840 850 900800

(a)

0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

136144152160168176184192200208

Spee

d (m

s)

810 820 830 840 850 900800

(b)

0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

144

152

160

168

176

184

192

200

208

Spee

d (m

s)

810 820 830 840 850 900800

(c)


800 810 820 830 840 850 900


0

50

100

150

Que

ue le

ngth

(m)






PI-ALINEADeep RL

30

40

50

60

70

Que

ue le

ngth

(m)

050 055 060045Red ratio ()


ρ = 093

OccQueue

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120

Que

ue le

ngth

(m)

(a)

OccQueue

ρ = 050

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120Q

ueue

leng

th (m

)

(b)






800 805 810 815 820 825 830 835 840 845 850 855 900


5200

5400

5600

5800

6000

6200

6400

6600

Volu

me (

veh

h)


VehicleRamp meter

1

2

3

4

6

5Direction

78

100 500 900 1300 1700 2100-300x

35

55

75y

95

115

135


Baseline 1 2 3 4 5 6 7 8

170

180

190

200

210

220

Trav

el ti

me (

s)





Data Availability


Disclosure




Acknowledgments


References








































Agent

Environment


t=t+1

SampleTraining data


Update weights


hellip

Replay buffer

(a)

Q-networkTarget

network


TrainedFreezing


arsquo

(b)





for t 1 T




end ifend for

end for



7000

6500

6000

5500

5000

volu

me (

pcu

h)

750 800 810 820 830 840 850

MainlineRamp


CurbsideCenterline

Three-lanemainline

Three-lanemainline

Mergingarea

One-lane on-ramp

400 m

1000 m

150 m300 m


N




Stride 1 times 4


Stride 1 times 2


Stride 1 times 1

FCUnits 256

FCUnits 256

ReLu ReLu

ReLu

ReLu ReLu

ReLuOutputUnits 2

(speed+queue)

OutputUnits 2

(Q function)






















1e6

6250

6500

6750

7000

7250

7500Ep

isode

rew

ard

1004 06 080200Frame

(a)

180

200

220

Mea

n tr

avel

tim

e (s)

1e61004 06 080200

Frame

(b)


800 810 820 830 840 850 900


25

30

35

40

45

Trav

el ti

me (

min

)





0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

12131415161718192021

Spee

d (m

s)

810 820 830 840 850 900800

(a)

0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

136144152160168176184192200208

Spee

d (m

s)

810 820 830 840 850 900800

(b)

0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

144

152

160

168

176

184

192

200

208

Spee

d (m

s)

810 820 830 840 850 900800

(c)


800 810 820 830 840 850 900


0

50

100

150

Que

ue le

ngth

(m)






PI-ALINEADeep RL

30

40

50

60

70

Que

ue le

ngth

(m)

050 055 060045Red ratio ()


ρ = 093

OccQueue

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120

Que

ue le

ngth

(m)

(a)

OccQueue

ρ = 050

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120Q

ueue

leng

th (m

)

(b)






800 805 810 815 820 825 830 835 840 845 850 855 900


5200

5400

5600

5800

6000

6200

6400

6600

Volu

me (

veh

h)


VehicleRamp meter

1

2

3

4

6

5Direction

78

100 500 900 1300 1700 2100-300x

35

55

75y

95

115

135


Baseline 1 2 3 4 5 6 7 8

170

180

190

200

210

220

Trav

el ti

me (

s)





Data Availability


Disclosure




Acknowledgments


References



































7000

6500

6000

5500

5000

volu

me (

pcu

h)

750 800 810 820 830 840 850

MainlineRamp


CurbsideCenterline

Three-lanemainline

Three-lanemainline

Mergingarea

One-lane on-ramp

400 m

1000 m

150 m300 m


N




Stride 1 times 4


Stride 1 times 2


Stride 1 times 1

FCUnits 256

FCUnits 256

ReLu ReLu

ReLu

ReLu ReLu

ReLuOutputUnits 2

(speed+queue)

OutputUnits 2

(Q function)






















1e6

6250

6500

6750

7000

7250

7500Ep

isode

rew

ard

1004 06 080200Frame

(a)

180

200

220

Mea

n tr

avel

tim

e (s)

1e61004 06 080200

Frame

(b)


800 810 820 830 840 850 900


25

30

35

40

45

Trav

el ti

me (

min

)





0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

12131415161718192021

Spee

d (m

s)

810 820 830 840 850 900800

(a)

0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

136144152160168176184192200208

Spee

d (m

s)

810 820 830 840 850 900800

(b)

0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

144

152

160

168

176

184

192

200

208

Spee

d (m

s)

810 820 830 840 850 900800

(c)


800 810 820 830 840 850 900


0

50

100

150

Que

ue le

ngth

(m)






PI-ALINEADeep RL

30

40

50

60

70

Que

ue le

ngth

(m)

050 055 060045Red ratio ()


ρ = 093

OccQueue

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120

Que

ue le

ngth

(m)

(a)

OccQueue

ρ = 050

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120Q

ueue

leng

th (m

)

(b)






800 805 810 815 820 825 830 835 840 845 850 855 900


5200

5400

5600

5800

6000

6200

6400

6600

Volu

me (

veh

h)


VehicleRamp meter

1

2

3

4

6

5Direction

78

100 500 900 1300 1700 2100-300x

35

55

75y

95

115

135


Baseline 1 2 3 4 5 6 7 8

170

180

190

200

210

220

Trav

el ti

me (

s)





Data Availability


Disclosure




Acknowledgments


References






















































1e6

6250

6500

6750

7000

7250

7500Ep

isode

rew

ard

1004 06 080200Frame

(a)

180

200

220

Mea

n tr

avel

tim

e (s)

1e61004 06 080200

Frame

(b)


800 810 820 830 840 850 900


25

30

35

40

45

Trav

el ti

me (

min

)





0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

12131415161718192021

Spee

d (m

s)

810 820 830 840 850 900800

(a)

0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

136144152160168176184192200208

Spee

d (m

s)

810 820 830 840 850 900800

(b)

0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

144

152

160

168

176

184

192

200

208

Spee

d (m

s)

810 820 830 840 850 900800

(c)


800 810 820 830 840 850 900


0

50

100

150

Que

ue le

ngth

(m)






PI-ALINEADeep RL

30

40

50

60

70

Que

ue le

ngth

(m)

050 055 060045Red ratio ()


ρ = 093

OccQueue

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120

Que

ue le

ngth

(m)

(a)

OccQueue

ρ = 050

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120Q

ueue

leng

th (m

)

(b)






800 805 810 815 820 825 830 835 840 845 850 855 900


5200

5400

5600

5800

6000

6200

6400

6600

Volu

me (

veh

h)


VehicleRamp meter

1

2

3

4

6

5Direction

78

100 500 900 1300 1700 2100-300x

35

55

75y

95

115

135


Baseline 1 2 3 4 5 6 7 8

170

180

190

200

210

220

Trav

el ti

me (

s)





Data Availability


Disclosure




Acknowledgments


References



































0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

12131415161718192021

Spee

d (m

s)

810 820 830 840 850 900800

(a)

0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

136144152160168176184192200208

Spee

d (m

s)

810 820 830 840 850 900800

(b)

0

500

1000

1500

2000

2500

3000

3500

4000

Dist

ance

(m)

144

152

160

168

176

184

192

200

208

Spee

d (m

s)

810 820 830 840 850 900800

(c)


800 810 820 830 840 850 900


0

50

100

150

Que

ue le

ngth

(m)






PI-ALINEADeep RL

30

40

50

60

70

Que

ue le

ngth

(m)

050 055 060045Red ratio ()


ρ = 093

OccQueue

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120

Que

ue le

ngth

(m)

(a)

OccQueue

ρ = 050

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120Q

ueue

leng

th (m

)

(b)






800 805 810 815 820 825 830 835 840 845 850 855 900


5200

5400

5600

5800

6000

6200

6400

6600

Volu

me (

veh

h)


VehicleRamp meter

1

2

3

4

6

5Direction

78

100 500 900 1300 1700 2100-300x

35

55

75y

95

115

135


Baseline 1 2 3 4 5 6 7 8

170

180

190

200

210

220

Trav

el ti

me (

s)





Data Availability


Disclosure




Acknowledgments


References






































PI-ALINEADeep RL

30

40

50

60

70

Que

ue le

ngth

(m)

050 055 060045Red ratio ()


ρ = 093

OccQueue

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120

Que

ue le

ngth

(m)

(a)

OccQueue

ρ = 050

800 810 820 830 840 850 90010

16

22

28

Occ

upan

cy (

)

0

20

40

60

80

100

120Q

ueue

leng

th (m

)

(b)






800 805 810 815 820 825 830 835 840 845 850 855 900


5200

5400

5600

5800

6000

6200

6400

6600

Volu

me (

veh

h)


VehicleRamp meter

1

2

3

4

6

5Direction

78

100 500 900 1300 1700 2100-300x

35

55

75y

95

115

135


Baseline 1 2 3 4 5 6 7 8

170

180

190

200

210

220

Trav

el ti

me (

s)





Data Availability


Disclosure




Acknowledgments


References






































800 805 810 815 820 825 830 835 840 845 850 855 900


5200

5400

5600

5800

6000

6200

6400

6600

Volu

me (

veh

h)


VehicleRamp meter

1

2

3

4

6

5Direction

78

100 500 900 1300 1700 2100-300x

35

55

75y

95

115

135


Baseline 1 2 3 4 5 6 7 8

170

180

190

200

210

220

Trav

el ti

me (

s)





Data Availability


Disclosure




Acknowledgments


References





































Data Availability


Disclosure




Acknowledgments


References



































a deep reinforcement learning approach for ramp metering

Documents