
Computer vision tracking for behavioral planning

Updated: May 30, 2021

Recurrent YOLO: a lightweight and efficient state-of-the-art method to track objects in the environment.



Object tracking is essential in self-driving applications to understand how the entities that populate the environment move and interact. Calculating the probable trajectories that objects will follow allows an autonomous vehicle to react accordingly and prevent accidents. Vision tracking received a renewed impetus in the last years thanks to the significant advances made by using deep learning techniques.


In 2017, Ning et al. presented Recurrent YOLO (ROLO), a novel object tracking method based on recurrent convolutional neural networks. As the name suggests, ROLO extends the YOLO model by appending to its final part a Recurrent Neural Network (RNN) unit, more precisely a Long Short-Term Memory (LSTM) cell.



You Only Look Once: YOLO


Object detection algorithms, including YOLO, aim to identify the location and the category of one or more objects in a given image. Most of the methods proposed in the literature are classification-based: the prediction process is performed multiple times, once for each region of interest, where regions are typically selected by sliding a window over the input picture. The computing and time resources required by this approach are therefore far from negligible.



As the name suggests, YOLO is a relatively lightweight regression-based model in which detection is performed by processing the entire image at once. Specifically, after dividing the input picture into a grid of cells, the network simultaneously detects, for each cell, at most n objects defined by:

  • C, namely the class label associated with the object detected;

  • P, i.e. the probability that the object belongs to class C;

  • (Bx, By, Bw, Bh), the coordinates describing the bounding box.

Afterwards, predictions with low P values, as well as overlapping boxes that share a large area with a higher-scoring one, are discarded through a process called non-maximum suppression. Overall, YOLO is a valid object detection method for self-driving scenarios because it also accounts for execution speed. Nevertheless, YOLO is not capable of associating entities detected across consecutive frames of the same scene. To perform object tracking and visualize object trajectories, it needs to be extended with an additional memory component.
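The filtering step described above can be sketched in a few lines of numpy. This is a minimal greedy non-maximum suppression, not the exact implementation used by YOLO; the threshold values and the (x1, y1, x2, y2) box format are illustrative assumptions:

```python
import numpy as np

def iou(box, boxes):
    """Intersection over union between one box and an array of boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, p_thresh=0.5, iou_thresh=0.5):
    """Drop low-confidence predictions, then greedily keep the highest-scoring
    box and suppress the remaining boxes that overlap it too much."""
    mask = scores >= p_thresh                 # discard low P values
    boxes, scores = boxes[mask], scores[mask]
    order = np.argsort(scores)[::-1]          # highest confidence first
    kept = []
    while order.size > 0:
        best = order[0]
        kept.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_thresh]  # suppress boxes with large shared area
    return boxes[kept], scores[kept]
```

For example, two heavily overlapping boxes with scores 0.9 and 0.8 collapse into the single 0.9 box, while a distant 0.7 box survives.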


Recurrent Neural Networks (RNN)


In RNNs, unlike most deep learning models, the outputs also depend on a recurrent input, alternatively known as the hidden or internal state. The picture below displays the functioning of the basic building unit of RNNs, i.e. an RNN cell.



As the picture shows, at each time step the internal state of the cell is computed by applying a non-linear activation function to a weighted sum of the current input and the previous hidden state. The cell then returns as output its internal state multiplied by the weight matrix W_hy. The LSTM unit is an evolution of the RNN cell that overcomes the short-term memory problem to which the latter is susceptible. Specifically, an LSTM cell is characterised by the presence of several gates that regulate the information flow.
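The update rule above can be written out directly. The sketch below is a minimal numpy RNN cell under the common W_xh / W_hh / W_hy naming (only W_hy appears in the text; the other names and all dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2

# Illustrative random weights; in practice these are learned.
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1

def rnn_cell_step(x_t, h_prev):
    """One time step: non-linear activation of a weighted sum of the current
    input and the previous hidden state, then a read-out through W_hy."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)  # new internal state
    y_t = W_hy @ h_t                           # output = state times W_hy
    return h_t, y_t

# Unroll the cell over a short input sequence.
h = np.zeros(hidden_size)
for x in rng.standard_normal((5, input_size)):
    h, y = rnn_cell_step(x, h)
```

Because each h_t feeds into the next step, information from earlier inputs can influence later outputs, which is exactly the memory property tracking needs.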


ROLO System


In the simplest version of ROLO, the first part of the YOLO model (all the convolutional layers up to and including the first fully connected layer) converts an image into a one-dimensional vector of size 4096 that densely represents the mid-level visual features extracted from the input picture. The remaining layers of YOLO are then exploited to retrieve the information of each detected object, as described in the previous section. These outputs are concatenated with the previously extracted feature vector and fed to an LSTM cell which, also evaluating past information, returns the predicted location of the identified objects.
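The data flow above can be sketched with placeholder tensors. The 4096-dimensional feature vector comes from the paper; the hand-rolled LSTM cell, the 6-value detection vector and the hidden size are simplified assumptions, not ROLO's actual layer dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """Minimal LSTM cell: input, forget and output gates plus a candidate
    state, all computed from the concatenation [h_prev, x]."""
    z = W @ np.concatenate([h_prev, x]) + b
    H = h_prev.size
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c = f * c_prev + i * g        # cell state carries long-term memory
    h = o * np.tanh(c)            # hidden state is the cell's output
    return h, c

feat_size, det_size, hidden = 4096, 6, 8   # 6 = class, confidence, box (illustrative)
W = rng.standard_normal((4 * hidden, hidden + feat_size + det_size)) * 0.01
b = np.zeros(4 * hidden)

h = np.zeros(hidden)
c = np.zeros(hidden)
for _ in range(3):                               # three consecutive frames
    visual_feats = rng.standard_normal(feat_size)  # stand-in for YOLO's 4096-d features
    detection = rng.standard_normal(det_size)      # stand-in for YOLO's detection output
    lstm_in = np.concatenate([visual_feats, detection])
    h, c = lstm_step(lstm_in, h, c, W, b)
# h now summarises the track across frames; a final read-out layer
# would map it to the predicted box location.
```

The key design point is the concatenation: the LSTM sees both what the scene looks like and where YOLO believes the object is, frame after frame.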



ROLO Results


In conclusion, ROLO is a valuable tracking method that, in the authors' evaluation, outperforms most of the alternatives, achieving accurate results while requiring modest computational resources.

As the authors demonstrate in their experiments, the model is robust to partial occlusions, motion blur, scale changes and ambient light variation. Furthermore, since it leverages the YOLO network, it can operate at a high frame rate, making it particularly suitable for self-driving applications.




References

  1. Ning, Guanghan, et al. "Spatially supervised recurrent convolutional neural networks for visual object tracking." 2017 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2017.

  2. Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.


The images in the blog are either copyright free or designed from scratch. Some illustrations are inspired by and partially derived from the images presented in the paper in reference 1 (Ning et al.).


