
End-to-end system with deep reinforcement learning

Updated: May 26, 2021

The first Deep Reinforcement Learning application within the autonomous driving field.



At present, most studies in the autonomous driving field are based on a logic that strongly relies on external mapping infrastructure. As mentioned in the first part of the blog, such systems are made up of various components, such as Perception, Localisation, Planning and Control, that are usually engineered separately. Their scalability to complex driving scenarios therefore suffers from the significant interdependencies between these components.


In a more recent approach, known as imitation learning, knowledge is acquired by observing the behaviour of an expert agent. Although this is a promising technique, it is impossible to cover during the learning phase all the situations that may be encountered on the road.


In 2018, the London-based company Wayve presented the first Deep Reinforcement Learning application within the self-driving scope.

Specifically, in only 20 minutes they taught a vehicle, equipped with a single monocular camera, to follow a lane.


Reinforcement Learning


Reinforcement learning (RL) is a branch of machine learning that has recently shown impressive potential in a wide range of areas, including games, industry, healthcare and robotics. In RL an agent undertakes actions within an environment and learns in real time how to behave through a system of positive and negative feedback. The problem is often formalised as a Markov Decision Process (MDP) defined by:

  • a state-space S;

  • an action space A;

  • a state transition probability P;

  • an expected reward R;

  • a discount factor γ.

Whenever the agent undertakes an action, the environment transitions from the current state to another with a certain probability. As a result of its action, the agent also receives feedback, i.e. a reward, from the environment. The agent's aim is to find an optimal policy π* that maximises the discounted cumulative reward G_t = Σ_{k≥0} γ^k r_{t+k+1}.

The discount factor γ is required since future rewards are reasonably more difficult to predict, and should therefore weigh less than immediate ones.

A policy describes the agent's behaviour, i.e. how it decides which action to perform. Usually, it is learnt indirectly through a method based on the value function, which maps each state to the discounted cumulative reward obtained by starting from it. A further crucial aspect in RL is the trade-off between exploration and exploitation: in the former the environment is explored randomly to gather new, potentially useful information, while in the latter the acquired knowledge is exploited to maximise the final reward. Finally, Deep RL relies on deep neural networks to approximate the policy and/or the value function.
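To make these notions concrete, here is a minimal Python sketch, not taken from the paper, of the discounted cumulative reward and of a simple ε-greedy rule that balances exploration and exploitation; the function names and parameters are illustrative only.

```python
import random

def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative reward G_t for a list of future rewards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore (random action),
    otherwise exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.99 + 0.99**2 = 2.9701
print(epsilon_greedy([0.2, 0.8, 0.5]))     # returns 1 most of the time
```

Note that DDPG, used below, works with continuous actions and explores by adding noise to the actor's output rather than with an ε-greedy rule; the snippet only illustrates the general trade-off.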


System architecture


The system is based on a common model-free RL algorithm called Deep Deterministic Policy Gradients (DDPG). It comprises:

  • a critic Q(s,a) that estimates the expected cumulative discounted reward obtained by taking action a in state s;

  • an actor which tries to find the Q-optimal policy such that π(s) = argmax_a Q(s,a) (a minimal sketch of both components is given further below).

Illustration inspired by the cover image of the paper [1]
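As an illustration only, the following PyTorch sketch shows what the two DDPG components might look like; the layer sizes and the flat state vector are simplifying assumptions made here for readability, not the architecture described by Wayve.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q(s, a): expected discounted return of taking action a in state s."""
    def __init__(self, state_dim, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class Actor(nn.Module):
    """pi(s): deterministic policy returning a continuous action
    (here, a steering angle and a speed set-point scaled to [-1, 1])."""
    def __init__(self, state_dim, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)
```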

As shown in the illustration above, the actor receives as input the environment state, expressed as the combination of the captured camera image and the current steering and speed measurements. Although a representation of the RGB image can be obtained simply through a series of convolutional layers, the result can be further enhanced using a Variational AutoEncoder (VAE). The actor then returns an action, consisting of a steering angle and a speed value, selected from a continuous space. The Q-function that drives the action selection is updated as soon as the real reward is received from the environment. As usual, in the early episodes of the experiments, which are primarily performed in a simulator, exploration of the state space is strongly encouraged.
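To give a rough idea of how the Q-function and the policy are updated once the reward arrives, here is a schematic DDPG learning step reusing the Actor and Critic sketched above; the target networks and the (state, action, reward, next state, done) batch format are standard DDPG ingredients assumed here, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99):
    """One DDPG learning step on a batch of stored transitions."""
    state, action, reward, next_state, done = batch

    # Critic: regress Q(s, a) towards the Bellman target
    # r + gamma * Q'(s', pi'(s')) computed with the target networks.
    with torch.no_grad():
        next_q = target_critic(next_state, target_actor(next_state))
        target_q = reward + gamma * (1.0 - done) * next_q
    critic_loss = F.mse_loss(critic(state, action), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: move the policy towards actions the critic rates more highly.
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

In practice the target networks are updated slowly (soft updates) and noise is added to the actor's output during the early, exploration-heavy episodes.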


Results



The described study is the first to successfully apply RL to a full-size autonomous vehicle. The results are impressive, especially considering the simple reward adopted, i.e. the distance covered without violating road rules, and the small number of parameters employed (around 10k).




References

  1. Kendall, Alex, et al. "Learning to drive in a day." 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019.

  2. Wayve. "Learning to Drive in a Day." www.wayve.ai, 2018, wayve.ai/blog/learning-to-drive-in-a-day-with-reinforcement-learning.

  3. Qu, Jerry. "Training Self Driving Cars Using Reinforcement Learning." Medium, 2019, towardsdatascience.com/reinforcement-learning-towards-general-ai-1bd68256c72d.


The images in the blog are either copyright free or designed from scratch. The cover image is extracted from the YouTube video referenced within the post. Some figures presented in this article were also created using elements extracted from third-party vector images.



